In this article, we give a brief description of the design of our identity document optical character recognition system and the algorithms that are used at different stages of the process. If you’d like to know how the Smart IDReader system works, please keep reading.
The idea of using a mobile device for document scanning, digitization, and recognition has been worked on since the appearance of the first camera phone. But for quite a long time, the poor camera quality and low processor performance of mobile devices made it impossible to build optical character recognition systems precise enough for practical use. Today smartphones and tablets are considered to be among the best data entry options, both in terms of camera quality and processing power. The only thing left to do is to design a decent recognition system.
The core principles of optical document recognition software
First of all, let’s outline the requirements we had in mind when we were designing the system. Four main principles behind Smart IDReader are:
- Accuracy (thanks, Cap’n). Even when a document is recognized in the photos or video-frames taken using a small digital camera, recognition accuracy has to be high enough to make automated data entry possible.
- Speed. The whole point of automated data entry on a mobile device is to make the process faster, not to slow it down. Therefore, the fundamental quality of a mobile recognition system is the speed at which the recognition result is produced.
- Autonomy. Data security is paramount. The safest way to store data is to not store anything at all, and the safest way to transmit data is to not transmit it. In our opinion, a mobile recognition system has to be designed so that no data goes beyond the RAM of a device during the recognition process. This means that image processing and image recognition have to happen on the device itself, and that imposes additional restrictions on the algorithms used.
- Scalability. Our first recognition system for mobile phones supported only the European national passports. Meanwhile, there are thousands of different identification documents in the world, and the system has to be designed so that support for a new document could be added quickly. Moreover, it shouldn’t require a large number of samples to add a new type of document because these samples are often limited.
The challenges of ID optical character recognition
When using a mobile device as a tool for taking pictures, we face a few problems that are not common for the “traditional” optical character recognition systems that work with flat document scans. First of all, scans usually have a uniform monochromatic background, whereas a picture or a video frame can be taken against any random background. This can make the search for a document and its identification far more complicated. Secondly, when working with photos and videos, we have to deal with different lighting conditions: there might not be enough light, there might be too much light, there could be glare on the document, etc. Finally, the images produced by a mobile device can be out of focus and blurred.
Perhaps the key difference of a mobile recognition system is the range of possible geometrical positions of a document. On a flat scan, a small document can only be shifted or rotated in the plane (geometric transformations of a document are thus limited to a motion group). But when taking a picture with a mobile phone camera, a document can be rotated relative to the optical system by any of the three Euler angles. Sometimes the user tilts the document in such a way on purpose, for example to avoid glare. And that expands the group of expected geometrical distortions to the projective one.
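To make the difference between the two groups concrete, here is a minimal Python sketch. It is not part of Smart IDReader: the card size, the two matrices, and the function names are illustrative assumptions. A rigid motion keeps the side lengths of the card intact, while a projective map (a 3x3 matrix with a non-trivial last row) turns the rectangle into a general quadrilateral.

```python
import math

def apply_homography(h, point):
    """Map a 2D point through a 3x3 matrix h (list of rows) in homogeneous coordinates."""
    x, y = point
    u = h[0][0] * x + h[0][1] * y + h[0][2]
    v = h[1][0] * x + h[1][1] * y + h[1][2]
    w = h[2][0] * x + h[2][1] * y + h[2][2]
    return (u / w, v / w)

def side_lengths(quad):
    """Lengths of the four sides of a quadrilateral given as a list of corners."""
    return [math.dist(quad[i], quad[(i + 1) % 4]) for i in range(4)]

# Illustrative card, 86 x 54 units (an assumption, not a measured document).
card = [(0.0, 0.0), (86.0, 0.0), (86.0, 54.0), (0.0, 54.0)]

angle = math.radians(10)
c, s = math.cos(angle), math.sin(angle)

# Scanner case: rotation plus shift, last row is (0, 0, 1).
motion = [[c, -s, 5.0], [s, c, 3.0], [0.0, 0.0, 1.0]]
# Camera case: the same matrix with a non-trivial last row (perspective terms).
projective = [[c, -s, 5.0], [s, c, 3.0], [0.002, 0.001, 1.0]]

scanned = [apply_homography(motion, p) for p in card]
photographed = [apply_homography(projective, p) for p in card]

print(side_lengths(scanned))       # opposite sides stay equal
print(side_lengths(photographed))  # opposite sides differ
```

This is why a detector for photographed documents has to search for a quadrilateral with four free corners rather than for a rotated rectangle.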
One of the ways to improve the ID document recognition quality on mobile devices (taking into account the above-mentioned problems) is to use not a single image as an input for the system, but a sequence of images: if we have algorithms that are fast enough to process one image, then after we accumulate the data from a set of images, we can not only filter some “bad” input images but improve the final OCR accuracy as well.
So let’s keep in mind the following: everything has to be performed accurately and fast at the same time, using a mobile device only, while taking into account random unpredictable circumstances when taking a picture. But we can use more than one image.
Today the Smart IDReader system supports a large number of various documents, but when we were designing it, we mostly worked with documents with a fixed structure. Roughly speaking, a document with a fixed structure (or, to be exact, a page of one) is a document where all the static elements are always in the same places – if we were to remove all the personal details from two different copies of the document, the produced blanks would be identical.
During the first stage of image processing in Smart IDReader, we detect the pages of a document, which we call ‘templates’. The document templates we are looking for are defined beforehand. Originally they are flat rectangles with known physical dimensions, but they can be projectively distorted in the image. Besides, one document can have several templates: the front and the back side of an ID card, or the two pages of a passport.
We use three key approaches when searching for templates in an image:
- A quick search for a projectively distorted rectangular object with a known aspect ratio. This method is based on a preliminary search for straight lines using the Fast Hough Transform, followed by a careful combinatorial search. We have already talked about this method, and you can read about it in the paper by Smart Engines scientists.
- A method based on the Viola-Jones object detection framework, or rather its generalization as a decision tree of classifiers. To compensate for projective distortion, we first analyze the straight lines in the image, calculate the vanishing points, and correct the perspective. That’s how we detect pages of a Russian national passport (its blank pages contain too few static elements for the keypoint-based method described in the third item below).
- The most general approach, based on the combination of keypoints, descriptors, and RANSAC. First, we search for keypoints in the input image (we use a modification of the YAPE keypoint detection algorithm for images with a wide local contrast range; you can read about this modification in this paper) and compute local descriptors (in our case with the RFD method, modified for speed). Next, we find template candidates in the index of “known” descriptors by a series of nearest-neighbor searches. Then these candidates are verified with the RANSAC method, taking into account the geometrical layout of the keypoints. This method is used by Smart IDReader for search and classification for the vast majority of identity document types. You can read more about it in this paper.
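The verification step of the third approach can be illustrated with a small RANSAC sketch in plain Python. This is an assumption-laden toy, not the Smart IDReader code. A similarity transform (rotation, scale, shift) stands in for the projective model to keep the example short, points are stored as complex numbers, and the matches are synthetic. A real detector would fit a projective transform from four matches instead of two.

```python
import random

def fit_similarity(src_pair, dst_pair):
    """Solve dst = a * src + b for complex a, b from two point correspondences."""
    (s0, s1), (d0, d1) = src_pair, dst_pair
    if s0 == s1:
        return None  # degenerate sample: both template points coincide
    a = (d1 - d0) / (s1 - s0)
    b = d0 - a * s0
    return a, b

def ransac_similarity(matches, iterations=200, tolerance=2.0, seed=0):
    """matches: list of (template_point, image_point) pairs as complex numbers.
    Returns (model, inlier_indices) for the largest consensus set found."""
    rng = random.Random(seed)
    best_model, best_inliers = None, []
    for _ in range(iterations):
        (s0, d0), (s1, d1) = rng.sample(matches, 2)
        model = fit_similarity((s0, s1), (d0, d1))
        if model is None:
            continue
        a, b = model
        # A match is an inlier if the model maps its template point close to its image point.
        inliers = [i for i, (s, d) in enumerate(matches) if abs(a * s + b - d) <= tolerance]
        if len(inliers) > len(best_inliers):
            best_model, best_inliers = model, inliers
    return best_model, best_inliers

# Synthetic data: nine correct matches under a known transform plus four wrong ones.
template_points = [complex(x, y) for x in (0, 40, 80) for y in (0, 25, 50)]
true_a, true_b = complex(0.9, 0.2), complex(30, 40)
good = [(p, true_a * p + true_b) for p in template_points]
bad = [
    (complex(10, 10), complex(200, 5)),
    (complex(70, 20), complex(3, 150)),
    (complex(50, 50), complex(120, 90)),
    (complex(5, 45), complex(60, 0)),
]
matches = good + bad

model, inliers = ransac_similarity(matches)
print(len(inliers), "of", len(matches), "matches agree with the best model")
```

A template candidate whose consensus set is too small would be rejected, which is how geometric verification filters out accidental descriptor matches.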
Searching for text fields in an ID document
Now that we know the coordinates of the template in the image and the precise type of this template (which means we know its aspect ratio, its actual physical dimensions, etc.), one would think that we can simply cut out each text field by its coordinates. That would be possible if text fields in documents with a fixed structure were always located at exactly the same spots. Unfortunately, the presence of static text (indicating where a certain text field is supposed to be) or even of guiding underlines doesn’t mean that the text fields are always printed in the same place. Both in Russia and in other countries the printing quality of documents with a fixed structure is far from perfect. Text fields can be printed at an angle, over the static text, over the guiding underlines, or even in places where they are not supposed to be. This means that we have to use special algorithms designed to search for text fields in a given document.
During the text field search stage, we refer to the concept of a “document zone” (an ill-defined concept) – a region of the template that contains a set of the text fields we need, and these text fields have a very similar structure. For example, the fields are rotated at a certain angle due to the printing defects, but all the fields within a document zone are rotated at the same angle. In order to detect the fields, a document zone image is projectively cut out from the original document image with a minimum resolution that is adequate for the performance of the subsequent processing algorithms.
The text field search is carried out in two stages. First, noise is removed by morphological transformations of the image, and the text fields are turned into practically monochrome rectangles.
Second, the template of the text field geometrical layout is “pulled over” the document zone image using dynamic programming (remember that we know which document we are looking at, which text fields are supposed to be there, and approximately where they are located).
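The dynamic programming stage can be sketched as follows. This is a simplified one-dimensional illustration under our own assumptions, not the published algorithm: the zone is reduced to a per-row "text evidence" score, the layout template is a list of expected row positions, and the `stiffness` penalty and all names are hypothetical. The fit rewards rows with strong evidence and penalizes stretching the gaps between neighboring fields.

```python
def fit_layout(row_scores, expected_rows, stiffness=0.5):
    """Place len(expected_rows) fields on increasing rows so that the total text
    evidence is high and the gaps between fields stay close to the template.
    Assumes len(row_scores) >= len(expected_rows)."""
    height, n = len(row_scores), len(expected_rows)
    neg_inf = float("-inf")
    best = [[neg_inf] * height for _ in range(n)]
    back = [[-1] * height for _ in range(n)]

    # First field: evidence minus a penalty for drifting from its expected row.
    for y in range(height):
        best[0][y] = row_scores[y] - stiffness * abs(y - expected_rows[0])

    # Later fields: extend the best placement of the previous field.
    for k in range(1, n):
        gap = expected_rows[k] - expected_rows[k - 1]
        for y in range(height):
            for prev in range(y):
                if best[k - 1][prev] == neg_inf:
                    continue
                value = best[k - 1][prev] + row_scores[y] - stiffness * abs((y - prev) - gap)
                if value > best[k][y]:
                    best[k][y] = value
                    back[k][y] = prev

    # Backtrack from the best row of the last field.
    y = max(range(height), key=lambda r: best[n - 1][r])
    rows = [y]
    for k in range(n - 1, 0, -1):
        y = back[k][y]
        rows.append(y)
    return rows[::-1]

# Fields are expected on rows 5, 15, 25 but were printed lower; row 12 is noise.
row_scores = [0.0] * 30
for row, score in ((8, 10.0), (18, 10.0), (27, 10.0), (12, 3.0)):
    row_scores[row] = score

print(fit_layout(row_scores, [5, 15, 25]))  # → [8, 18, 27]
```

The same idea extends to two dimensions and to field widths; the point is that knowing the approximate layout turns field search into a constrained optimization rather than free-form text detection.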
You can read more about locating text fields in a zone with an approximately known field layout in this article.
Text field processing and recognition
Now that we have found all the text fields, it’s time to recognize them, right? Not quite yet. First, we can make the text recognizer’s task easier by pre-processing the images of these text fields based on a priori knowledge about the document and the structure of the recognized text fields. For example, if we know that a specific text field is printed over a grid of cells, or that it is underlined, we can remove the straight lines using morphological operations. Or if we know that the text fields are printed in italics or are handwritten, we can correct the slant of the lettering by estimating it with the Fast Hough Transform (you can read about it here). During pre-processing, a group of several text fields is usually analyzed at once, since text fields within one zone are usually slanted at the same angle or written in the same handwriting.
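The slant correction idea can be shown with a toy example. Note the substitution: instead of the Fast Hough Transform, this sketch simply tries a handful of candidate shears by brute force and keeps the one that makes the column projection sharpest (the sum of squared column counts). The image, the candidate list, and the function names are assumptions made for illustration.

```python
from collections import Counter

def row_shift(shear, y, height):
    """Horizontal shift in pixels of row y for a given shear; the bottom row is not shifted."""
    return round(shear * (height - 1 - y))

def projection_sharpness(ink_pixels, height, shear):
    """Sum of squared column counts after undoing the candidate shear.
    Upright strokes pile up in few columns, which maximizes this value."""
    columns = Counter(x - row_shift(shear, y, height) for x, y in ink_pixels)
    return sum(count * count for count in columns.values())

def estimate_slant(ink_pixels, height, candidates):
    """Pick the candidate shear that gives the sharpest column projection."""
    return max(candidates, key=lambda shear: projection_sharpness(ink_pixels, height, shear))

def deslant(ink_pixels, height, shear):
    """Undo the shear so that slanted strokes become vertical."""
    return {(x - row_shift(shear, y, height), y) for x, y in ink_pixels}

# Synthetic field: three strokes, 8 pixels high, leaning right with shear 0.5.
height = 8
ink = {(x0 + row_shift(0.5, y, height), y) for x0 in (2, 8, 14) for y in range(height)}

candidates = [-1.0, -0.5, 0.0, 0.5, 1.0]
shear = estimate_slant(ink, height, candidates)
upright = deslant(ink, height, shear)
print(shear, sorted({x for x, _ in upright}))
```

Because all fields in a zone tend to share the same slant, the sharpness values of several fields can be summed before picking the best shear, which makes the estimate more stable on short fields.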
By the time we are ready to perform the text field recognition, we already know what kind of text field it is, and which document it belongs to, which alphabet or non-alphabet script is used in the text field, and we possibly know some font characteristics. We use the two-step framework for text line recognition, and you can find a detailed explanation of it in this article.
First, we search for the baselines in the text field image using vertical projection analysis. After that, we build a horizontal projection, not just from the image pixels, but with the help of a special network trained to search for the cuts between adjacent characters. Then, taking into consideration all the information about the acceptable alphabet (if we have this kind of information), we identify all the character candidates, and all of these candidates are recognized with the help of the second, classifying network. Finally, using dynamic programming, we determine the optimal path through the cuts, which corresponds to the text field recognition result.
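The last step, choosing the optimal path through the cuts, can be sketched like this. The two networks are replaced by a hard-coded lookup table, and every name and number below is a made-up assumption. Each pair of cuts is a character candidate, the alphabet filter drops impossible characters, and dynamic programming picks the chain of candidates with the highest total log-probability.

```python
import math

def make_scorer(classifier_output, alphabet):
    """Wrap a (left, right) -> {char: probability} table into a segment scorer
    that only returns characters from the allowed alphabet."""
    def score_segment(left, right):
        distribution = classifier_output.get((left, right))
        if not distribution:
            return None
        allowed = {c: p for c, p in distribution.items() if c in alphabet}
        if not allowed:
            return None
        char = max(allowed, key=allowed.get)
        return char, math.log(allowed[char])
    return score_segment

def best_path(cuts, score_segment, max_span=3):
    """cuts: sorted x positions of candidate cuts, including both ends of the field.
    Returns (text, total_log_prob) for the best chain of segments, or None."""
    n = len(cuts)
    neg_inf = float("-inf")
    best = [neg_inf] * n
    back = [None] * n
    best[0] = 0.0
    for j in range(1, n):
        for i in range(max(0, j - max_span), j):
            if best[i] == neg_inf:
                continue
            candidate = score_segment(cuts[i], cuts[j])
            if candidate is None:
                continue
            char, log_prob = candidate
            if best[i] + log_prob > best[j]:
                best[j] = best[i] + log_prob
                back[j] = (i, char)
    if best[-1] == neg_inf:
        return None
    chars, j = [], n - 1
    while j > 0:
        j, char = back[j]
        chars.append(char)
    return "".join(reversed(chars)), best[-1]

# Stand-in for the classifying network: probabilities per candidate segment.
classifier_output = {
    (0, 10): {"O": 0.6, "0": 0.4},
    (10, 20): {"l": 0.5, "1": 0.45},
    (20, 30): {"7": 0.9},
    (0, 20): {"M": 0.8},
    (10, 30): {"4": 0.05},
}
cuts = [0, 10, 20, 30]

text, log_prob = best_path(cuts, make_scorer(classifier_output, "0123456789"))
print(text)  # → 017
```

Without the digits-only alphabet the same table would favor letters such as "O" and "l", which shows why knowing the field type before recognition matters.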
The networks themselves are trained on artificially synthesized data (you can read about this in the article about the two-step framework). This means that if we need to add support for a new document and we don’t yet have suitable networks, we only need to specify a set of fonts, similar backgrounds, and the type of script.
We can apply the algorithms for statistical post-processing of the recognition results to some text fields using a priori knowledge about their syntax and semantics. These algorithms allow us to increase the expected result accuracy. We’ve already talked about post-processing algorithms in this article.
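As a hedged illustration of such post-processing (our own toy example, not the algorithm from the article referenced above), suppose the recognizer returns several alternatives with probabilities for each character of a date field. We can then pick the most probable combination that is also a valid calendar date. The `DD.MM.YYYY` format and all probabilities are assumptions.

```python
import datetime
import itertools
import math

def is_valid_date(text):
    """Syntax and semantics check: DD.MM.YYYY that exists in the calendar."""
    try:
        datetime.datetime.strptime(text, "%d.%m.%Y")
        return True
    except ValueError:
        return False

def best_valid_reading(alternatives, is_valid):
    """alternatives: one list of (char, probability) pairs per character position.
    Returns the most probable combination that passes the validity check."""
    best_text, best_prob = None, 0.0
    for combo in itertools.product(*alternatives):
        text = "".join(char for char, _ in combo)
        prob = math.prod(p for _, p in combo)
        if prob > best_prob and is_valid(text):
            best_text, best_prob = text, prob
    return best_text, best_prob

def certain(chars):
    """Helper for positions where the recognizer has a single alternative."""
    return [[(c, 1.0)] for c in chars]

# The top-1 reading is "31.02.1985", which is not a real date.
alternatives = (
    [[("3", 0.55), ("2", 0.45)]]
    + certain("1.0")
    + [[("2", 0.7), ("7", 0.3)]]
    + certain(".1985")
)

text, prob = best_valid_reading(alternatives, is_valid_date)
print(text)  # → 21.02.1985
```

Exhaustive enumeration is only practical for short fields with few alternatives; longer fields would need a smarter search, but the principle of re-ranking by validity stays the same.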
Now that we have the recognition results of the text fields in the document image, we can make up the result structure that we need to produce for the user. However, it will be just a recognition result for one image frame.
Using several pictures for optical character recognition
When using a sequence of images, Smart IDReader assumes that it works with the same document in the different images within one recognition session (however, it doesn’t mean that it’s the same template – for example, when taking pictures of an ID-card within one session, the document might be flipped to show its backside). The accumulation of information from a few different images could be helpful at virtually any stage of the process of document analysis – from the search of a document (for additional filtering of the set of possible candidates and for the adjustment of geometric positioning) to the specific text field recognition.
The Smart IDReader system uses a method that combines text field recognition results from multiple pictures. It considerably improves recognition accuracy over a sequence of images and makes use of even partial recognition data (if, for example, there was glare over the text field in some image). You can read about the methods for combining results in this article.
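A deliberately simplified sketch of such a combination is shown below. It assumes that every frame yields a reading of the same length, so character positions can be matched directly; real readings may differ in length and would need to be aligned first. The data layout and names are hypothetical. Confidences are summed per position, and positions hidden by glare in a frame simply contribute nothing.

```python
from collections import defaultdict

def combine_frames(frame_results):
    """frame_results: list of per-frame readings. Each reading has one entry per
    character position: a {char: confidence} dict, or None if the frame gave
    nothing usable there (for example because of glare)."""
    length = len(frame_results[0])
    totals = [defaultdict(float) for _ in range(length)]
    for reading in frame_results:
        for position, distribution in enumerate(reading):
            if distribution is None:
                continue
            for char, confidence in distribution.items():
                totals[position][char] += confidence
    # "?" marks positions that no frame could read.
    return "".join(max(t, key=t.get) if t else "?" for t in totals)

frames = [
    [{"A": 0.9}, {"8": 0.6, "B": 0.4}, {"1": 0.8}],            # reads "A81" on its own
    [{"A": 0.8}, None, None],                                  # glare over two characters
    [{"4": 0.5, "A": 0.5}, {"B": 0.9}, {"1": 0.7, "I": 0.3}],
]

print(combine_frames(frames))  # → AB1
```

Even the frame with glare contributes to the first character, which is the sense in which partial recognition data remains useful.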
Another major aspect of working with a sequence of images is deciding when to stop. Generally speaking, there is no natural stopping point in video stream recognition, so a stopping criterion has to be designed and implemented. Besides its obvious purpose of ending the image capture and providing the user with the final result, the stopping rule also serves as an important mechanism for reducing the time needed for document recognition. If one text field has “stopped”, we don’t need to spend time recognizing it in the next image, and if all the text fields of the same zone have “stopped”, we don’t need to process the zone at all (we don’t need to spend time searching for its text fields).
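One very simple stopping rule, given here purely as an assumption for illustration and not as the criterion used in the product, is to stop a field once its combined result has stayed the same for several consecutive frames. The class name and the `stable_frames` threshold are hypothetical.

```python
class FieldStopper:
    """Stops a field once its combined value is unchanged for `stable_frames` frames."""

    def __init__(self, stable_frames=3):
        self.stable_frames = stable_frames
        self.last_value = None
        self.streak = 0

    def update(self, combined_value):
        """Feed the combined result after a new frame; returns True once stopped."""
        if combined_value == self.last_value:
            self.streak += 1
        else:
            self.last_value = combined_value
            self.streak = 1
        return self.stopped

    @property
    def stopped(self):
        return self.streak >= self.stable_frames

def zone_stopped(stoppers):
    """A zone can be skipped entirely once all of its fields have stopped."""
    return all(stopper.stopped for stopper in stoppers)

# Combined results for one field after each successive frame.
combined_after_each_frame = ["IVAN0V", "IVANOV", "IVANOV", "IVANOV", "IVANOV"]

stopper = FieldStopper(stable_frames=3)
frames_used = 0
for value in combined_after_each_frame:
    frames_used += 1
    if stopper.update(value):
        break  # no need to recognize this field in later frames

print(frames_used, stopper.last_value)  # → 4 IVANOV
```

A stricter threshold trades extra frames for more confidence in the final value, which is the accuracy versus speed balance the stopping rule controls.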
Last but not least, working with a video stream makes it possible to analyze holograms and other OVDs (OVD stands for Optically Variable Device, a security feature of a document that changes its appearance depending on the lighting and the viewing angle). Detecting such elements and verifying their presence requires a sequence of images by definition: it is impossible to verify the presence of such an element using only one picture.
In this article, we attempted to describe the core elements and the key principles of our Optical Character Recognition system that we focused on during its development. It goes without saying that we are not able to cover all the components and subtleties of the identity document recognition system in this format. For this reason, the explanation of such components as the copy detector, key object detection (crests, photos, signatures, stamps), fast detection of a machine-readable zone (MRZ) and its usage for helping document location and classification, text field formatting before inter-frame combination, determining the level of certainty of recognition and many others, weren’t included in this article. Even the rough description of the system required dozens of links to the scientific papers, and the solution to a problem, which might seem easy to some people, is achieved only through serious scientific advances of a large group of people. We publish about 40 scientific papers on the document recognition subject every year and file patent applications regularly.
We hope that the reasons for our choice of recognition approaches became clear after reading this article. In order to achieve maximal accuracy and optical character recognition speed, we use not only machine learning methods but classic computer vision as well. We start the recognition process only when we have ample information about the recognized object, and we do our best not to recognize any unnecessary information. We are constantly making progress and improving our methods and algorithms, thereby increasing document processing accuracy and speed, as well as adding new features for our esteemed clients.