05.08.2020

Simple classification method for recognized pages of business documents using Template Matching

The objective of classification is to assign any object from a given selection to one or more of a number of predetermined categories.


1. Problem statement

A classification algorithm can be built with machine learning, using a training set of objects that have been categorized beforehand.

Template matching, which relies on preparing templates for all possible categories in advance, is widely known in this regard. Whether a test object belongs to a certain category is decided by a minimum (or maximum) value of a comparison function between the object and the template. This method is one of the easiest to understand and implement: it only requires representing a category by a template and choosing the comparison function.

Let us consider the problem of classifying recognized pages of business documents. Business documents used in a typical workflow, such as the exchange of documents between organizations, are standardized to some degree and can be either structured or unstructured. Banks and insurance agencies regularly handle documents such as powers of attorney, agreements, specimen signature cards, charters, contracts, invoices, registration certificates, etc. When a digital archive is created and maintained, the paper copies of the documents are digitized, and the scanned pages can be recognized and analyzed. One objective of this analysis is to categorize a scanned page, i.e. to check whether it meets the criteria for belonging to a certain group. This becomes necessary when verifying scanned images that are attached as digital copies.

2. Using the OCR software

To analyze the data of a scanned document (in particular, to categorize it), Optical Character Recognition (OCR) software can be used. One of the most widely used solutions today is Tesseract OCR. For instance, this OCR engine was chosen for recognizing an archival document database in the thesis [2] for the following reasons: it is freely distributed, and it presents recognition results in the hOCR format (HTML OCR), which retains the coordinates of the recognized words.
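As an aside, the hOCR format stores each recognized word in a `span` of class `ocrx_word` whose `title` attribute carries the bounding box. A minimal Python sketch of extracting words and coordinates (the sample markup is illustrative, not from the article):

```python
# Minimal sketch: extract recognized words and their bounding boxes
# from Tesseract hOCR output (spans of class "ocrx_word" whose title
# attribute carries "bbox x0 y0 x1 y1").
import re
from html.parser import HTMLParser

class HocrWords(HTMLParser):
    def __init__(self):
        super().__init__()
        self.words = []          # list of (text, (x0, y0, x1, y1))
        self._bbox = None        # bbox of the word span currently open

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == "span" and "ocrx_word" in a.get("class", ""):
            m = re.search(r"bbox (\d+) (\d+) (\d+) (\d+)", a.get("title", ""))
            self._bbox = tuple(map(int, m.groups())) if m else None

    def handle_data(self, data):
        # The text inside an open ocrx_word span is the recognized word.
        if self._bbox is not None and data.strip():
            self.words.append((data.strip(), self._bbox))
            self._bbox = None

hocr = ('<span class="ocrx_word" title="bbox 10 20 90 40">Договор</span> '
        '<span class="ocrx_word" title="bbox 100 20 160 40">аренды</span>')
p = HocrWords()
p.feed(hocr)
print(p.words)   # [('Договор', (10, 20, 90, 40)), ('аренды', (100, 20, 160, 40))]
```

The normalized rectangles used later in the article are obtained by dividing these pixel coordinates by the image width and height.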

Tesseract OCR may either succeed or fail when recognizing scanned pages, depending on how noisy the text is and on its own internal capabilities. Systematic failures can be linked to the objective difficulty of recognizing particular types of documents, e.g. Russian national passports.

Its typical errors can be divided into a few groups:

  • E1: complete failure to recognize a page, when not a single word is recognized;
  • E2: so many errors that a person cannot identify the document from its text representation in hOCR format (for example, a failed recognition of a proof of taxpayer registration document by one of the earlier versions of Tesseract OCR);
  • E3: an error in the page structure recognition, for example when text fragments situated next to each other end up too far apart;
  • E4: an insignificant number of errors, when most incorrectly recognized words contain 1-2 mistakes.

This article addresses only the classification of document pages recognized by Tesseract OCR, assuming that errors of types E3 and E4 may be present.

Let us mention one more objective of scanned document classification. When working with multi-page image files, where each file contains the pages of one or more documents, it is necessary to determine the beginning and end of each document and to assign all of them to predetermined groups. The resulting page sets are saved in the digital archive as separate documents.

Below we describe a simple method for classifying document pages based on comparing them to pre-built category templates. The method can process single-page and multi-page documents and takes into account some of the OCR recognition errors.

3. Template description and template matching

We suggest building category templates from keywords and combinations of several keywords.

The model of an unstructured document as a structure over a set of keywords is fairly popular. See, for example, the descriptions of document models as:

  • the vector or bag-of-words model, i.e. the multiset of the document's words [4];
  • the language model, which takes the word order into account (allowing for possible word sequences) [5];
  • the topic model, which uses topics consisting of sets of words [5].

We outline an approach to template creation that uses the mechanisms mentioned above and takes into account the errors and specifics of document recognition.

A keyword, in the form of a symbol sequence and its features, acts as the basic characteristic:

W = (T(W), m1(W), m2(W), m3(W), m4(W), m5(W), m6(W)),

where T(W) is the word core, i.e. a sequence of symbols possibly containing the characters "?" and "*"; these are used to define a set of words: "?" stands for any single symbol and "*" for any number of symbols, so "*ab?c" denotes the set of words ending in "ab", any single symbol, and "c".

m1(W) – the distance threshold used when comparing the core T(W) with a recognized word Wr. To compare two words, we need to choose a metric (distance function). We used either the Levenshtein distance [8] or a simplified function that counts only symbol substitutions. In the latter case, the distance d(T(W), Wr) is calculated only for words of the same length; at positions where the core contains "?", the symbols are not compared; when the core contains "*", only the corresponding substrings are compared. If d(T(W), Wr) < m1(W), the words are considered identical, otherwise different. In the simplest case, m1(W) is the maximum number of substitutions allowed when transforming T(W) into Wr.
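A minimal Python sketch of the simplified comparison (function and parameter names are illustrative, not from the article's implementation):

```python
# Sketch of the simplified comparison: for words of equal length, count
# symbol substitutions, skipping positions where the template core T(W)
# contains the wildcard "?". Words match when the distance is below the
# threshold m1.
def masked_distance(core, word):
    """Substitution count between a core and a word of the same length;
    '?' in the core matches any single symbol. Returns None when the
    lengths differ (such pairs are never compared)."""
    if len(core) != len(word):
        return None
    return sum(1 for c, w in zip(core, word) if c != "?" and c != w)

def words_match(core, word, m1):
    d = masked_distance(core, word)
    return d is not None and d < m1

print(masked_distance("?ОГОВОР", "АОГОВОР"))  # 0: '?' absorbs the difference
print(words_match("ДОГОВОР", "ДОГО8ОР", 2))   # True: one substitution, 1 < 2
print(words_match("ДОГОВОР", "ДОГО8ОР", 1))   # False: exact match required
```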

m2(W) – a restriction on the length of Wr when comparing T(W) and Wr; applicable to cores containing the "*" character.

m3(W) – a case-sensitivity flag used when comparing symbols.

m4(W) – the word rectangle, given by coordinates in the range [0, 1]; it is used to verify that the recognized word falls inside this rectangle.

m5(W) – a word-rejection flag indicating that the word must not be present in a given text. With m5(W) we can check both the presence and the absence of the word W in a text.

m6(W) – will be described later.

When searching for T(W) in the text T of a recognized document, we check whether

∃ Wr ∈ T: d(T(W), Wr) < m1(W) ∧ (F(Wr) ∩ m4(W) = F(Wr)),

where F(Wr) is the rectangle of Wr, i.e. the word coordinates produced by recognition in hOCR format, with the x- and y-coordinates normalized to the width and height of the original image. The comparison uses the m3(W) parameter. In general, several words {Wr} identical to the keyword T(W) may be found.

Let us define the predicate P(W, T). If m5(W) = 0, then P(W, T) = 1 when the text of the recognized document T contains at least one word identical to W, and P(W, T) = 0 when it contains none. If m5(W) = 1, then P(W, T) = 0 when T contains at least one word identical to W, and P(W, T) = 1 when it contains none.

We will also need the distances d(T(W), Wr) of the words Wr that are identical to the core T(W); let d(W, T) denote the minimum of these distances.

Let us define a location as an ordered set of words R = W1, W2, …, where the presence of each word in the recognized text T is verified:

P(W1, T) ∧ P(W2, T) ∧ … (1)

and, additionally, each pair of adjacent words Wi and Wi+1 is checked against the condition

r(Wi+1) − r(Wi) < m6(Wi+1), (2)

where the function r(W) gives the position of the word W in the text T, as ordered by the OCR engine. The m6(W) parameter thus bounds the distance between adjacent words; when m6(Wi+1) = ∞, condition (2) is not verified and only the word order is checked.

To restrict where R can be found, we can define m7(R), the location rectangle, analogous to the word rectangle, and check that all the words found in the text T that are identical to W1, W2, … lie inside the rectangle m7(R).

When conditions (1) and (2) hold and the m7(R) rectangle is matched, the membership predicate of the location R in the text T is set: P(R, T) = 1. In the simplest case, a location consists of a single word; in general, computing P(R, T) requires searching through the sets of words identical to W1, W2, … .

Let us define the location estimate d(R, T) as the minimum of the word estimates: min(d(W1, T), d(W2, T), …).
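The location check of conditions (1) and (2) can be sketched as follows. This greedy version takes the first occurrence of each keyword, whereas the article's method searches through all sets of identical words; names are illustrative:

```python
# Sketch of the location check (conditions (1) and (2)): every keyword
# of the location must occur in the recognized word sequence, in order,
# and the gap between the positions of adjacent keywords must stay
# below the per-word limit m6 (float("inf") disables the gap check).
def find_location(keywords, text_words, m6_limits):
    """keywords: location cores; text_words: recognized words in OCR
    order; m6_limits[i]: max allowed gap between keywords i-1 and i
    (the first entry is unused)."""
    pos = -1
    for i, kw in enumerate(keywords):
        try:
            nxt = text_words.index(kw, pos + 1)   # keep the word order
        except ValueError:
            return False                          # keyword missing
        if i > 0 and not (nxt - pos < m6_limits[i]):
            return False                          # condition (2) fails
        pos = nxt
    return True

words = ["Договор", "аренды", "нежилого", "помещения", "г.", "Москва"]
print(find_location(["Договор", "аренды"], words, [0, 2]))                # True
print(find_location(["Договор", "помещения"], words, [0, 2]))             # False
print(find_location(["Договор", "помещения"], words, [0, float("inf")]))  # True
```

The last call illustrates the m6 = ∞ case: only the order of the words is checked, not their distance.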

Now we can define a combination as a set of locations S = R1, R2, …, where the presence of each location is verified in the recognized document T:

P(R1, T) ∧ P(R2, T) ∧ … (3)

The order of the locations is of no importance. In addition to condition (3), all the word rectangles of the combination can be compared against the combination rectangle m8(S). These conditions determine the predicate of S belonging to the text T: P(S, T) = 1.

Let us define the combination estimate d(S, T) as the minimum of the estimates of its locations: min(d(R1, T), d(R2, T), …).

Finally, let us define the template M as a set of combinations S1, S2, …; whether the template matches the recognized text is determined by verifying the equation

P(M, T) = P(S1, T) ∨ P(S2, T) ∨ … (4)

In addition to condition (4), all the word rectangles of the template can be compared against the template rectangle m8(M).

Let us define the template estimate d(M, T) as the maximum of the estimates of its matched combinations: max(d(S1, T), d(S2, T), …).
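Formulas (3) and (4) together with the estimates can be sketched as follows (names are illustrative, and the per-location results are assumed to be precomputed by the word- and location-level checks above):

```python
# Sketch of formulas (3) and (4) with the distance estimates: a
# combination holds when all of its locations are found (its estimate
# is the minimum of the location estimates), and a template holds when
# any of its combinations holds (its estimate is the maximum over the
# matched combinations).
def eval_combination(locations):
    """locations: list of (found, distance) pairs for one combination."""
    if not all(found for found, _ in locations):
        return (False, None)
    return (True, min(d for _, d in locations))

def eval_template(combinations):
    """combinations: list of location-result lists, one per combination."""
    results = [eval_combination(c) for c in combinations]
    matched = [d for ok, d in results if ok]
    return (True, max(matched)) if matched else (False, None)

# Two combinations: S1 fails (one location missing), S2 matches.
S1 = [(True, 0.0), (False, None)]
S2 = [(True, 1.0), (True, 2.0)]
print(eval_template([S1, S2]))   # (True, 1.0)
```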

In addition to matching the template against the recognized text, we check a few text features (the number of symbols on the page m9(T) and the number of text columns m10(T)) against the corresponding template features m9(M), m10(M).

If we have a set of templates M1, …, Mn for n categories, then matching the recognized page T against the category Mi consists of calculating the distance d(Mi, T) according to the scheme above and comparing it with a known threshold d1. If d(Mi, T) < d1, the document T matches category i.

Let us explain the need to use the elements of the suggested templates.

Simple sporadic E4-type errors that often appear within a word can be absorbed using the characters "?" and "*" in the word core. The length of a word matched by a core with "*" is controlled by the m2 parameter: for example, the keywords "ДОГОВОР", "Договора" (Russian for "agreement", "of the agreement") can be described by the core "ДОГОВОР*" with the restriction m2 = 8, which excludes words like "договороспособность" (Russian for "the ability to enter into an agreement"). The core "?ОГОВОР" makes the words "ДОГОВОР" and "АОГОВОР" indistinguishable.
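A sketch of matching a core with a trailing "*" under the m2 length restriction, assuming (per m3) a case-insensitive comparison; the function name is illustrative:

```python
# Sketch of matching a core with a trailing "*": the fixed prefix must
# match, and the m2 parameter caps the total word length, so
# "ДОГОВОР*" with m2 = 8 accepts "Договора" but rejects
# "договороспособность".
def star_core_match(core, word, m2):
    prefix = core.rstrip("*")
    w = word.upper()            # case-insensitive comparison, as m3 allows
    return w.startswith(prefix.upper()) and len(w) <= m2

print(star_core_match("ДОГОВОР*", "Договора", 8))              # True
print(star_core_match("ДОГОВОР*", "договороспособность", 8))   # False
```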

The m1 threshold can be used when the keyword and the word from the text must match exactly. For example, with m1 = 1, the core "ДОГОВОР" will not match the words "АОГОВОР" or "ДОГО8ОР" in the text, even though each differs by only a single symbol substitution.

The m3 parameter also allows letter-case recognition errors to be ignored.

The rectangles of a word (m4), a location (m7), and a combination or template (m8) provide a partial description of the document structure in the template, by extracting keywords from specific parts of the document image. Obviously, the rectangle borders cannot be determined precisely, because some parts of the documents vary in position.

E3-type errors, i.e. an incorrectly recognized page structure, can be neutralized by configuring the locations. Naturally, a location corresponds to a phrase or a few adjacent words. If a text column is split in two by recognition errors, such a location will still be found, provided we do not restrict the distance m6 between its words.

The ability to check both the presence and the absence of words, set by the m5 parameter, allows us to create templates for similar documents that differ only in keywords present in one category but not in the other.

All these observations show the usefulness of the described template method for classifying texts that contain recognition errors. The method is simple in the sense that creating a template is easy in many cases, and the template itself is intuitively understandable. For example, the contract "Non-residential property rent agreement" (in Russian: "Договор аренды нежилого помещения") can be described by the following template:

  • keywords: the cores W1, …, W10 (listed in a figure);
  • locations: R1 = W1, R2 = W2∧W3∧W4, R3 = W2∧W5∧W6, R4 = W7∧W8∧W9∧W10; no rectangle restrictions are set for the locations;
  • combinations: S1 = R1∧R2, S2 = R1∧R3, S3 = R3; no rectangle restrictions are set for the combinations;
  • template: M = S1∨S2∨S3; the template has the rectangle m8(M) = {(0.0, 0.0), (0.3, 0.9)}.

To choose the best category, we calculate the distances d(M1, T), …, d(Mn, T), discard the categories Mj for which d(Mj, T) > d1, sort the remaining set in ascending order (essentially picking one or several categories with the minimal penalty for the mismatch between the text and the template), and keep one or several options d(Mi1, T), d(Mi2, T), …. This means that category i1 matches the text T best, and the other categories i2, … follow in order of distance. In the simplest case, a tie d(Mi1, T) = d(Mi2, T) is left unresolved (no classification); in some cases the conflict is resolved with the help of the additional features m9, m10.
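The selection step can be sketched as follows (category names and the distance values are illustrative):

```python
# Sketch of the category-selection step: drop templates whose distance
# exceeds the threshold d1 and sort the rest in ascending order, so the
# first surviving category is the best match (or the page is rejected
# when nothing survives).
def pick_categories(distances, d1):
    """distances: {category: d(Mi, T)}; returns the categories sorted
    by distance, with anything above d1 discarded."""
    kept = [(d, c) for c, d in distances.items() if d <= d1]
    return [c for d, c in sorted(kept)]

d = {"contract": 1.0, "statute": 3.5, "invoice": 0.5}
print(pick_categories(d, 2.0))   # ['invoice', 'contract']
```

An empty result corresponds to a classification failure (reject).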

4. Creating template for OCR document scanning

It is obvious that the described template-matching approach is simple. However, there are certain pitfalls in creating the templates, i.e. in the training process.

We will review the template-creation process on a real example: a database of documents of 45 different categories. We created the templates in several stages, focusing on the keywords from the first pages of the multi-page documents.

Stage 1. First, we reviewed a set of reference documents: a few samples of perfect documents of each category, free of recognition errors, were prepared. Their text representations were transformed into bags of words, i.e. multisets of words, and all stop-words were removed (short words, first and last names, numerical data, dates, etc.). Second, single words and phrases specific to one of the categories were picked out of these multisets with simple exhaustive-search algorithms. The focus was on words specific to the headings and section titles of the documents. In this way, several lists that clearly separated one category from another were created. We looked only for the cores of the keywords; the other parameters were left at their defaults. A few categories proved tricky right away: for example, the first page of a document of the category "Statute" may be represented by the single word "Statute", which is often found in other documents, e.g. those of the category "Contract". To avoid this problem we used two approaches:

  • using words that must not appear on the first page of statute documents;
  • using the feature m9(T), the number of symbols on the page, which distinguishes the pages of statute documents well.
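The stage-1 keyword selection can be sketched as a bag-of-words set difference, a toy substitute for the exhaustive search mentioned above; all names and sample texts are illustrative:

```python
# Sketch of the stage-1 keyword selection: turn reference texts into
# bags of words, drop stop-words, and keep the words that occur in
# exactly one category.
from collections import Counter

def category_keywords(texts_by_category, stop_words):
    bags = {c: Counter(w.lower() for t in texts for w in t.split()
                       if w.lower() not in stop_words)
            for c, texts in texts_by_category.items()}
    keywords = {}
    for c, bag in bags.items():
        # Words seen in any other category are not discriminative.
        others = set().union(*(b for o, b in bags.items() if o != c))
        keywords[c] = sorted(set(bag) - others)
    return keywords

docs = {"contract": ["договор аренды нежилого помещения"],
        "statute":  ["устав общества"]}
print(category_keywords(docs, {"общества"}))
# {'contract': ['аренды', 'договор', 'нежилого', 'помещения'], 'statute': ['устав']}
```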

Stage 2. Although the resulting template classifier has its imperfections, as a next step we applied it to a selection of 50-100 actual documents (single-page and multi-page) whose recognition exhibited the full range of E1-E4 errors. This error analysis identified pages that required modifying the templates, as well as pages that could not be classified with our approach at all (mostly due to a high number of recognition errors). Modifying a template consisted of searching for new locations and combinations, searching for new words that must not occur in the document, and tuning the keyword parameters, in particular the m4 rectangles and the m6 distances between words. The classification results were assessed according to the following criteria:


The share of correctly classified pages is assessed by the formula (n1 + m2)/(n1 + n2 + n3 + m1 + m2), the false-classification rate by (n2 + m1)/(n1 + n2 + n3 + m1 + m2), and the classification failures (rejects) by n3/(n1 + n2 + n3 + m1 + m2).
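Assuming n1, n2, n3 and m1, m2 are the per-outcome page counts from the criteria above, the three rates can be computed as follows; the false-classification term is written as (n2 + m1) so that the three fractions sum to one:

```python
# Sketch of the evaluation step: the three rates are fractions of the
# total number of pages, so correct + false + reject = 1; n1..n3 and
# m1, m2 are the per-outcome page counts (the values below are made up).
def rates(n1, n2, n3, m1, m2):
    total = n1 + n2 + n3 + m1 + m2
    correct = (n1 + m2) / total
    false = (n2 + m1) / total
    reject = n3 / total
    return correct, false, reject

c, f, r = rates(80, 1, 10, 1, 8)
print(round(c, 2), round(f, 2), round(r, 2))   # 0.88 0.02 0.1
```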

The worst classification error is the false classification of a non-first page (any page after the first one, including the last) of a multi-page document. It is especially inconvenient because the multi-page document is then saved as several separate parts, which requires searching for the missing part of the document when a correction is needed. To categorize non-first pages, we use the approach described earlier, which requires templates for the intermediate and last pages. The remaining classification errors (an incorrect classification of the first page of a multi-page document or of the only page of a single-page document) can be fixed by simple substitution using a reference book.

Stage 3. The first two stages rely on applying rules (templates) for classification. When a large sample is available (3,000-30,000 samples per category), machine learning can be used, for example the well-known CART method of building binary decision trees (Classification and Regression Trees [9]). The input data for training are the keyword descriptions from the previous stages, consisting of the core and the features. Using these features, several trees are built for each category following the one-against-all principle. If during classification a document is not accepted by any of the trees, a classification failure (reject) is declared. One known peculiarity of the CART method is that stable learning requires a large training set, which is why we do not report classification results obtained with CART, although overall they are superior to the results of the rule-based first and second stages.

5. Experiments

Two document sets were used for the experimental evaluation:

  • a set of scanned document images of medium and poor quality that were selected at the training stage (884 pages);
  • a set of scanned document images of medium quality collected independently of the training stage (3014 pages).

The classification results are shown in the two following tables:


These tables show that the described classification method achieves an accuracy of 0.86-0.95; the false-classification rate does not exceed 0.01, and the remaining errors are classification failures (rejects). Therefore, the proposed method does not always succeed, but it rarely produces a wrong classification.

The classification speed (a Release build under Microsoft Visual Studio 2013) is fairly high: about 3000 recognized pages per minute on an Intel® Core™ i7-4790 CPU at 3.60 GHz with 16.0 GB of RAM under Windows 7 Professional 64-bit. However, the Tesseract OCR recognition needed to produce the original hOCR file takes a few seconds.

To decrease the share of pages that remain uncategorized, the following steps can be taken:

  • applying binarization methods that neutralize a noisy background;
  • using different OCR software;
  • using more effective keyword-based classification methods, for example the CART method briefly mentioned above.

Conclusion

The classification technique described in this article was implemented in the Smart Engines projects for the input and analysis of scanned document streams.

The method is simple to understand and to implement when addressing similar tasks.
