September 14, 2020

Automation of the modification detection in the images of contract documents using an N-gram model

Any reasonable person today knows not to sign a document before reading it. Those who fail to follow this simple rule may face unexpected consequences that could have been avoided if only they had paid attention to the content of the document, including the fine print. The tricks and gimmicks that service providers use in their contracts inspire anecdotes and movie storylines. For example, the main character in “Bedazzled” backed out of his deal with the devil without knowing the consequences of a breach described in paragraph 3 of article 147 in part 3 of the contract. A similar situation can sometimes happen to service providers in real life: there are funny stories online about a bank customer changing the contract terms to their advantage without the bank having a clue. In this article, we’ll talk about an algorithm that can be very useful to banks and other credit organizations: it identifies changes made in the images of contract documents.


These days many companies that serve a large number of clients offer to download a contract template from their website so customers can fill it out in the comfort of their homes. After the contract has been printed, filled out, and signed, it is submitted to the second party for signing. Of course, the second party checks the contracts prepared by potential clients, for example, manually. Since there may be a fairly high number of such contracts, a few attentive employees are needed to go through them. However, when checking a few dozen similar contracts a day, even a careful employee can miss something. Such small undetected modifications are where cases of fraud come from. We’ll cover the process of automating the routine check of a large number of contract documents using optical recognition technologies.
A contract document is a kind of business document created to circulate in certain record-keeping and document management systems. A distinctive feature of contract documents is their limited vocabulary and fairly well-defined structure. This is explained by the tendency to standardize document layouts, which makes business documents easier to understand, first of all, for a person.
The template, or form, of a document is set up beforehand and consists of static text and lines to be filled out. Let’s look at two common kinds of templates: fixed and flexible. A fixed template does not allow modification of the static text, for example, when the PDF format is used. Flexible templates, on the other hand, make changes to the static text possible, for example, templates in the Microsoft Office format. Accordingly, we’ll speak of fixed and flexible documents.
There are methods for automatically comparing an image (a scan or a photo) of a signed document to its prototype [1]. They check for possible content modifications:

  • character substitutions;
  • word substitutions;
  • addition of a character, a word, or a group of words;
  • removal of a character, a word, or a group of words.

Modifications in document formatting are also possible:

  • changes to word styles (font size, typeface, style);
  • changes to text alignment;
  • changes to the number of paragraphs;
  • changes to margins.

When a fixed template is altered, intentional tampering is obvious: there is no other possible explanation for changing a protected text. Modification of a flexible template, on the other hand, can be an attempt at tampering, an accidental typo, or even the result of improved formatting.
Next, we describe models and methods for detecting tampering in copies of business documents made from fixed and flexible templates.
The reference units for comparing a text image (the copy) and a prototype (the original) are images of words produced by a certain method. An image of a word is represented by some description (a descriptor); the most demonstrative descriptors are the recognized characters of a word. A word W is defined as a text keypoint W = (T(W), B(W), F(W)), where T(W) is the kernel of the text keypoint, i.e. a sequence of characters from a certain alphabet, or a sequence of character locations with estimates matching these locations to the alphabet characters; B(W) is the bounding box of the text keypoint, consisting of the coordinates of its boundaries Bx1(W), By1(W), Bx2(W), By2(W), which can be normalized to a certain range; and F(W) is the set of features of the text keypoint (for example, font family and font style).
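To make the definition concrete, here is a minimal sketch of such a structure in Python; the class and field names are our own illustration, not part of any real library:

```python
from dataclasses import dataclass, field
from typing import Tuple

@dataclass
class TextKeypoint:
    """A text keypoint W = (T(W), B(W), F(W)) as defined above."""
    kernel: str                               # T(W): the recognized character sequence
    bbox: Tuple[float, float, float, float]   # B(W): (x1, y1, x2, y2), possibly normalized
    features: dict = field(default_factory=dict)  # F(W): e.g. font family and style

    def width(self) -> float:
        x1, _, x2, _ = self.bbox
        return x2 - x1

w = TextKeypoint(kernel="Agreement", bbox=(0.10, 0.05, 0.25, 0.07))
```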

A text keypoint is similar to a “graphic” keypoint, which is defined as a point that satisfies a few conditions:

  • its location is distinguishable from that of other points next to it;
  • robustness against noise;
  • stability under some transformations (for example, affine transformations and image scaling) [2].

The characteristics of a keypoint are:

  • repeatability – a keypoint should have the same location in the image regardless of the vantage point and lighting;
  • distinctiveness/informativeness – the surroundings of keypoints should vary considerably from one another;
  • locality – a keypoint and its surroundings should occupy a small area of the image;
  • quantity – the number of detected keypoints should be large enough to detect objects;
  • accuracy – detected keypoints should be localized precisely, both in the original image and in an image scaled up or down;
  • efficiency – the time required to detect keypoints in an image should be acceptable for time-sensitive applications.

It is assumed that a text keypoint differs from the other text keypoints in its surroundings. If the surroundings are taken to be a text line, then most adjacent words in business documents will differ; a few identical words within the same line will not be text keypoints. However, if the surroundings are taken to be one or two adjacent words, then two identical words in one line that have different neighbors are considered text keypoints. Keypoints are compared using a resemblance degree d, which takes values close to zero when comparing two points matching the same location in the image, and higher values when comparing two points from different locations. In this article, comparison of two kernels of text keypoints is based on the Levenshtein distance ρLev [3] and its modification. The comparison threshold d(W) of the word T(W) against other words is calculated beforehand. If ρLev(W, Wr) < d(W), then the word Wr and the text keypoint W are considered identical; otherwise, they are different.
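As an illustration, here is a minimal sketch of this thresholded comparison in Python; the threshold value in the example is arbitrary, whereas in practice d(W) is precomputed per word:

```python
def levenshtein(a: str, b: str) -> int:
    # Classic dynamic-programming edit distance (insertions, deletions, substitutions).
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

def same_keypoint(w: str, w_r: str, d: int) -> bool:
    # Words are identified when the distance is below the precomputed threshold d(W).
    return levenshtein(w, w_r) < d

# with a threshold of 2, a single OCR error is tolerated:
# same_keypoint("guarantee", "guarantea", 2) -> True
```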

A keypoint descriptor is an identifier that is used when comparing keypoints. The descriptor is expected to be invariant when comparing keypoints to image transformations.

The method used to extract keypoints from an image is called a detector. The detector of text keypoints implements a detection procedure employing an OCR method that extracts keypoint descriptors from document images. The keypoint characteristics hold for text keypoints as long as state-of-the-art OCR engines compensate for various types of image distortion. The uniqueness of text keypoint descriptors is determined by the document structure (the clear division of a document into constellations of layout objects – chapters, paragraphs, and lines) and by the traits of natural language (two adjacent words rarely coincide in documents). Since text keypoints are situated in defined positions relative to one another (higher – lower, right – left, or at a geometric distance between their frames), these points can be grouped into constellations with the help of clustering algorithms.
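As a rough sketch of this kind of grouping, the following snippet clusters words into line constellations by the vertical overlap of their bounding boxes; it is a simplified stand-in for a real clustering algorithm:

```python
def group_into_lines(keypoints):
    """Cluster text keypoints into lines: two boxes belong to the same line
    when their vertical extents overlap. keypoints: list of (word, (x1, y1, x2, y2))."""
    lines = []
    for word, (x1, y1, x2, y2) in sorted(keypoints, key=lambda kp: (kp[1][1], kp[1][0])):
        for line in lines:
            _, (_, ly1, _, ly2) = line[-1]
            if min(y2, ly2) - max(y1, ly1) > 0:   # vertical overlap with the line's last word
                line.append((word, (x1, y1, x2, y2)))
                break
        else:
            lines.append([(word, (x1, y1, x2, y2))])
    # within each line, order words left to right
    return [sorted(line, key=lambda kp: kp[1][0]) for line in lines]

kps = [("Agreement", (10, 10, 90, 20)), ("No.", (95, 11, 115, 19)),
       ("Terms", (10, 30, 60, 40))]
# group_into_lines(kps) -> two lines: ["Agreement", "No."] and ["Terms"]
```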

In the ideal case, OCR software detects all the text keypoints in the copy image and the prototype without any errors, which makes it possible to form the constellations, particularly the lines. Comparing a copy image with the original consists of determining an unambiguous match between all or some of the text keypoints of the original image and those of the copy. We’ll refer to the process of determining whether points, or constellations of points, match as coordination.

Coordination of fixed documents includes the following:

  • search for a correspondence between any point of the original image and a point in the copy;
  • search for a correspondence between any point in the copy and a point of the original image;
  • search for a correspondence between any static point of the original image and a point in the copy;
  • search for a correspondence between any static point in the copy and a point of the original image;
  • verification of image identity for each pair of coordinated word images.

Any detected inconsistency is a possible modification. Of course, these inconsistencies could also result from detector (OCR) errors or from distortions of the document image. The problem statement, then, is to detect all the modifications in a copy of a document.

Coordination of flexible documents requires match determination for all the words of the static text. However, unlike fixed documents, it does not require searching for a correspondence between the lines of static text. Flexible documents may contain legitimate changes that don’t alter the meaning of the text, such as font changes, margin changes, and word wrapping changes. Such modifications may push a line onto the next page, which means that multi-page flexible documents have to be compared across the whole page sequence.

Generally, when the document structure is not clear, coordination of all the words of the tested document and the original is required. A definite flaw of total word coordination is that the inevitable recognition errors, especially when working with photos (see an example of a distorted text image fragment in the picture below), will be interpreted as modifications. The employee who checks them will have to spend extra time reviewing the false modifications.

[Figure: a fragment of a text image with distortions]

When using total coordination of the words of the copy and the original, other insignificant inconsistencies may appear in addition to false recognition errors. From the point of view of the user of the final program, not all words are of the same value when comparing the copy to the original. In fact, the words that matter are the subset of words on a document page that define the significant terms of the contract. We can assume that a scammer’s goal is to make modifications that would damage the organization that signed the contract, should the matter go to trial or pre-trial proceedings. Defining these significant words formally would be problematic, if not impossible; only specialists are competent to do so. Furthermore, some words become significant only in combination with certain other words. For example, the negative particle “not” used with the word “guarantee” would be considered significant, while modification of the word “contract” into the word “noncontract” is not significant, since it can’t benefit a con artist in legal proceedings.

Using knowledge about the document structure and the locations of the significant words to be checked, the problem statement can be rephrased accordingly. In this statement, the document model consists of paragraphs and lines, and each line and each paragraph is represented by a set of text keypoints whose sequence is unique for that paragraph or line. There may be non-unique words among them, i.e. words that repeat and may even be adjacent to one another. In some cases, we may also know the distance between unique words, measured as the number of intermediate symbols or as the geometric distance between the word images.

A simple N-gram model proved to be effective here. The N-gram model is applied to various tasks such as text compression and text encryption; when processing natural-language texts, N-gram models can be used to detect and correct errors.

The following formats of word N-grams are used to search for the keywords:

n2(wi) = (wi, r1(wi)),
n3(wi) = (wi, r1(wi), r2(wi)),
n2(wi) = (l1(wi), wi),
n3(wi) = (l2(wi), l1(wi), wi),

where rk(wi) and lq(wi) are the words located to the right and to the left of the center word wi, and the acceptable distances ρBT(wi, r1(wi)), ρBT(r1(wi), r2(wi)), ρBT(l1(wi), wi), ρBT(l2(wi), l1(wi)) between adjacent words are available. Let’s call the index k in the notation nk(wi) the length of the N-gram.
The paragraph model consists of an ordered N-gram sequence n1(w1), n2(w2), …, nm(wm) with predetermined tuples of words ni(wi) and predetermined distances between the pairs {nj−1(wj−1), nj(wj)}. We should note that some N-grams are unique within a paragraph while others repeat. To make sure the N-grams are unique, N-grams of various lengths can be used: bigrams, trigrams, tetragrams, and pentagrams.
When creating the paragraph model, N-grams are formed so that the number of unique N-grams is as high as possible. Compared to selected keywords alone, N-grams provide uniqueness for most paragraphs of contract documents, first of all, because of the scarcity of significant words in static text mentioned earlier.
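The idea of growing an N-gram until it becomes unique within its paragraph can be sketched as follows; this is a simplified illustration (growing only to the right), whereas the real model also accounts for left contexts and inter-word distances:

```python
def unique_ngram(words, i, max_len=5):
    """Grow an N-gram starting at words[i] until it occurs only once
    in the paragraph, up to pentagrams (max_len=5)."""
    for n in range(2, max_len + 1):
        gram = tuple(words[i:i + n])
        if len(gram) < n:          # ran off the end of the paragraph
            return None
        occurrences = sum(1 for j in range(len(words) - n + 1)
                          if tuple(words[j:j + n]) == gram)
        if occurrences == 1:
            return gram
    return None

paragraph = "deliver the goods and accept the goods delivered".split()
# the bigram ("the", "goods") repeats, so the window grows to a trigram:
# unique_ngram(paragraph, 1) -> ("the", "goods", "and")
```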
It makes sense to perform training and parameter optimization on real datasets. It should be mentioned that even real datasets will not contain actual modifications, first of all because their owners keep such information confidential. That’s why we had to introduce such modifications manually.
The trigram detection algorithm comes down to choosing a few consecutive words. Of course, a set of text keypoints has to be created first. To do that, we take the following steps:

  • greyscale image processing (MinImage library);
  • image deskewing using methods based on the Fast Hough Transform [4] (Smart IDReader API);
  • detection of word edges using the “erosion” and “dilation” methods (MinImage library);
  • character recognition within the detected word edges (Smart IDReader API).
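The MinImage and Smart IDReader calls are not reproduced here, but the word edge detection idea (merging character runs separated by small gaps, which is what erosion and dilation effectively achieve) can be sketched on a one-dimensional ink profile:

```python
def word_edges(profile, gap=2):
    """Merge ink runs separated by small inter-character gaps (a morphological
    'closing'); runs that remain separated by wider gaps are distinct words.
    profile: list of 0/1 column flags (1 = column contains ink)."""
    # collect raw ink runs [start, end) from the profile
    runs, start = [], None
    for i, ink in enumerate(profile + [0]):
        if ink and start is None:
            start = i
        elif not ink and start is not None:
            runs.append([start, i])
            start = None
    # merge runs whose separating gap does not exceed `gap` pixels
    words = []
    for run in runs:
        if words and run[0] - words[-1][1] <= gap:
            words[-1][1] = run[1]
        else:
            words.append(run)
    return [tuple(w) for w in words]

# two words: narrow gaps inside a word are merged, the wide gap is kept
profile = [1, 1, 0, 1, 1, 0, 0, 0, 0, 0, 1, 1, 1]
# word_edges(profile) -> [(0, 5), (10, 13)]
```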

The paragraph is represented as a single long line.

The original words were compared to the recognized words using a modified Levenshtein distance. Algorithms for calculating the Levenshtein distance are widely known; they compute not only the number of editing operations but also the editing sequence itself.
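A minimal sketch of such an algorithm, returning both the distance and the editing sequence recovered by backtracing the dynamic-programming table:

```python
def edit_script(a: str, b: str):
    """Levenshtein distance together with the editing sequence
    (the list of operations turning a into b)."""
    m, n = len(a), len(b)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            d[i][j] = min(d[i - 1][j] + 1,                              # delete a[i-1]
                          d[i][j - 1] + 1,                              # insert b[j-1]
                          d[i - 1][j - 1] + (a[i - 1] != b[j - 1]))     # substitute
    # backtrace from the bottom-right corner to recover the operations
    ops, i, j = [], m, n
    while i > 0 or j > 0:
        if i > 0 and j > 0 and d[i][j] == d[i - 1][j - 1] + (a[i - 1] != b[j - 1]):
            if a[i - 1] != b[j - 1]:
                ops.append(("substitute", a[i - 1], b[j - 1]))
            i, j = i - 1, j - 1
        elif i > 0 and d[i][j] == d[i - 1][j] + 1:
            ops.append(("delete", a[i - 1]))
            i -= 1
        else:
            ops.append(("insert", b[j - 1]))
            j -= 1
    return d[m][n], ops[::-1]

# edit_script("МОРЕ", "ГОРА") -> distance 2 with two substitutions
```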

We used a modified Levenshtein distance. First, we set an individual threshold for comparing each specific word to other words. We also applied different rules for rejecting the identification of word pairs such as “МОРЕ” = “ГОРА” (“SEA” = “MOUNTAIN”) and for identifiers such as “ИДЕНТИФИКАТОР196”, “ИДЕНТИФИКАТОР296”, “ИДЕНТИФИКАТОР199”. For such words, the segments that are supposed to stay unaltered were specified: there could be many errors at the beginning of a word of the form “ИДЕНТИФИКАТОРddd” (“IDENTIFIERddd”), but identification was prohibited whenever editing operations were found in the last three characters of the word.

Another modification compensated for OCR substitutions of characters with similar lettering. Technically, substitutions within the pairs of Latin characters B/8, D/O, and 1/I are errors, but by decreasing the penalty for such substitutions we can increase the accuracy of word identification. The substitution price for characters with similar lettering was set during training.
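This modification can be sketched as a weighted edit distance; the substitution prices below are illustrative, whereas in our case they were set during training:

```python
# illustrative reduced prices for OCR confusions of similarly drawn glyphs
CONFUSION_COST = {("B", "8"): 0.25, ("D", "O"): 0.25, ("1", "I"): 0.25}

def sub_cost(x: str, y: str) -> float:
    if x == y:
        return 0.0
    return CONFUSION_COST.get((x, y), CONFUSION_COST.get((y, x), 1.0))

def weighted_levenshtein(a: str, b: str) -> float:
    """Levenshtein distance with a reduced penalty for confusions of
    similarly drawn characters (B/8, D/O, 1/I)."""
    prev = [float(j) for j in range(len(b) + 1)]
    for i, ca in enumerate(a, 1):
        cur = [float(i)]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1.0,               # deletion
                           cur[j - 1] + 1.0,            # insertion
                           prev[j - 1] + sub_cost(ca, cb)))
        prev = cur
    return prev[-1]

# an O/D confusion is cheap, an unrelated substitution is not:
# weighted_levenshtein("DOC", "DDC") -> 0.25
# weighted_levenshtein("DOC", "DXC") -> 1.0
```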

The heuristic estimate for linking an N-gram as a whole is based on several distances from the center word and from the N-gram’s adjacent words to the chosen analogs.

The model parameters (thresholds, N-gram lengths) were chosen during training so that the number of N-gram linkage errors is minimized and the number of correctly linked N-grams is maximized.

After the N-grams are linked to the words of the paragraphs, the following checks can be performed:

  • presence of all expected N-grams;
  • presence of all unique N-grams in a single copy;
  • the order of N-grams;
  • the distance between adjacent N-grams.

Failure of any of these checks means that a modification of a significant keyword has to be searched for.
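The four checks can be sketched as a single verification function; the structure of its inputs is our own illustration, not the actual interface:

```python
def check_ngrams(expected, found):
    """Run the four checks listed above on linked N-grams.
    expected: ordered list of (ngram, max_distance_to_next);
    found: list of (ngram, position) pairs produced by the linkage step.
    Returns the list of failed checks; an empty list means no suspected
    modification of the significant words."""
    failures = []
    found_grams = [g for g, _ in found]
    # 1. presence of all expected N-grams
    for gram, _ in expected:
        if gram not in found_grams:
            failures.append(("missing", gram))
    # 2. unique N-grams must occur exactly once
    for gram in set(found_grams):
        if found_grams.count(gram) > 1:
            failures.append(("duplicated", gram))
    # 3. the order of the N-grams must match the model
    expected_order = [g for g, _ in expected if g in found_grams]
    if [g for g in found_grams if g in expected_order] != expected_order:
        failures.append(("order", None))
    # 4. distances between adjacent N-grams must stay within the model bounds
    pos = dict(found)
    for (g1, max_dist), (g2, _) in zip(expected, expected[1:]):
        if g1 in pos and g2 in pos and pos[g2] - pos[g1] > max_dist:
            failures.append(("distance", (g1, g2)))
    return failures

expected = [("shall not guarantee", 40), ("terms of payment", 80)]
found = [("shall not guarantee", 0), ("terms of payment", 35)]
# check_ngrams(expected, found) -> []
```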

The described method was tested on a dataset of 161 “Agreement” document images scanned at resolutions from 100 to 300 dpi. A model made up of 33 keywords was used. Some of the keywords were deliberately deleted or modified: there were 740 word removals and 140 word modifications in total. The Smart IDReader OCR software was used for recognition [5].

The algorithm quality was estimated using precision and recall, determined from the following numbers:

  • the number of detected modified words, TP;
  • the number of correct words classified as modified, FP;
  • the number of undetected modified words, FN;
  • the number of correct words classified as correct, TN.
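These numbers combine into the quality metrics in the usual way; the counts in the example below are illustrative:

```python
def precision_recall(tp: int, fp: int, fn: int):
    """Precision = TP / (TP + FP); recall = TP / (TP + FN)."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# e.g. 216 true detections against 54 false alarms and no misses:
# precision_recall(216, 54, 0) -> (0.8, 1.0)
```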

The results are presented in the table below. The characteristics were calculated for several thresholds d(wi) of the word correctness estimate against a reference word.

d(wi)      | TP  | FP | FN | TN  | Precision | Recall
3 and more | 216 | 54 | 10 | 980 | 0.80      | 1.00

It’s worth mentioning that the Smart IDReader OCR detected all the modified words during recognition. The method’s errors are linked to recognition errors, which occurred mostly because of scanning flaws (overexposed areas).

It’s not hard to see that this method has a limitation related to the accuracy of word edge detection. The scanning flaws mentioned earlier led to a small number of errors in the search for word edges (approximately 1-1.5% for some keywords). To overcome this limitation, we suggest an additional approach to word detection. For an undetected N-gram, we choose a subset of words from the recognized paragraph where we expect to find this N-gram. Next, we remove all space symbols from the chosen subset and form a single line of characters. The N-gram words are concatenated to create a substring to search for. Then we search for the substring with a modified bitap algorithm that uses the modified Levenshtein distance. This approach decreases the number of N-gram check errors (those related to word edge detection errors) two- to three-fold.
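Our modified bitap implementation is not reproduced here; as an equivalent illustration of approximate substring search, the dynamic-programming sketch below (Sellers’ algorithm, where a match may start at any position in the text) finds the best approximate occurrence of a concatenated N-gram in a space-stripped line:

```python
def fuzzy_find(pattern: str, text: str, max_errors: int):
    """Approximate substring search: edit-distance DP whose first row is all
    zeros, so a match may start anywhere in `text`.  Returns (end_position,
    errors) of the best occurrence within max_errors, or None."""
    m = len(pattern)
    prev = list(range(m + 1))   # DP column for the empty text prefix
    best = None
    for pos, c in enumerate(text, 1):
        cur = [0]               # skipping text before the match is free
        for i in range(1, m + 1):
            cur.append(min(cur[i - 1] + 1,                        # delete pattern char
                           prev[i] + 1,                           # insert text char
                           prev[i - 1] + (pattern[i - 1] != c)))  # substitute
        if cur[m] <= max_errors and (best is None or cur[m] < best[1]):
            best = (pos, cur[m])
        prev = cur
    return best

# one OCR error inside the concatenated line is tolerated:
line = "sellershallnotguarantzethequality"
# fuzzy_find("notguarantee", line, 2) finds the occurrence with 1 error
```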

Brief conclusion: how OCR software reduces routine work

We have described one of the OCR-based instruments for detecting fraudulent contract documents. It goes without saying that this instrument does not solve the problem entirely, and manual checking of alleged modifications is still required. However, the method makes it possible to automate the search for modifications and significantly reduce the amount of routine manual checking. The main challenge we faced when developing this method was obtaining real datasets with fraudulent documents.

  1. Sidere N. et al. A dataset for forgery detection and spotting in document images // 2017 Seventh International Conference on Emerging Security Technologies (EST). – IEEE, 2017. – P. 26-31.
  2. Bertrand R. et al. A conditional random field model for font forgery detection // 2015 13th International Conference on Document Analysis and Recognition (ICDAR). – IEEE, 2015. – P. 576-580.
  3. Levenshtein V. I. Binary codes capable of correcting deletions, insertions, and reversals // Doklady Akademii Nauk. – Russian Academy of Sciences, 1965. – Vol. 163. – No. 4. – P. 845-848.
  4. Bezmaternykh P. V., Nikolaev D. P. A document skew detection method using fast Hough transform // Twelfth International Conference on Machine Vision (ICMV 2019). – International Society for Optics and Photonics, 2020. – Vol. 11433. – P. 114330J.
  5. Bulatov K. et al. Smart IDReader: Document recognition in video stream // 2017 14th IAPR International Conference on Document Analysis and Recognition (ICDAR). – IEEE, 2017. – Vol. 6. – P. 39-44.
