Many articles are devoted to image recognition with machine learning methods such as neural networks, support vector machines, and random trees. All of these methods require a significant number of examples for training and parameter tuning. Creating training and test datasets of adequate size is far from trivial. The difficulty is not the technical side of collecting and storing millions of images, but the familiar situation in which you start with only a handful of images. Moreover, keep in mind that the content of the training set can affect the quality of the recognition system more than any other factor. Despite this, most articles on the topic skip this critical stage entirely.
It is important to define the technical task before building the sample database and training a neural network. Recognizing handwritten text, recognizing human facial expressions, and determining a location from a photograph are clearly very different tasks. It is equally evident that the architecture of the neural network is tied to the target platform: in the ‘cloud’, on a PC, or on a mobile device, the available computational resources differ by orders of magnitude.
And then the plot thickens. Recognizing images from high-resolution cameras and recognizing blurred images from a webcam without autofocus require completely different data for training, validation, and testing. This follows from the ‘no free lunch’ theorem. It is exactly why openly available training databases (for example [1, 2, 3]) suit academic research perfectly but are rarely applicable to real tasks: they are too general.
The more precisely the training set approximates the full statistical population of images that will arrive at the input of your system, the higher the maximum achievable quality. Hence, building a proper training set is itself an extremely specific technical task! For instance, if we want to recognize printed characters in a photograph taken with a mobile device, the sample database must contain photographs of documents from different sources, under different lighting, taken with different phone models and cameras. All of this complicates collecting the number of samples needed to train the recognizer.
Let us consider several possible ways to prepare image sets to create a recognition system.
The training examples from natural images are created based on real data. The process consists of the following stages:
These operations are quite time-consuming, so building a training database this way is rather expensive. Besides, the data has to be collected under different conditions: lighting, the phone and camera models involved, different document sources (printing houses), and so on.
All of this complicates collecting the number of samples needed to train the recognizer. On the other hand, such data is of paramount importance for how the system performs in real conditions.
Another approach to creating training data is artificial generation. One can take several templates or ‘ideal’ examples (e.g. sets of fonts) and create the required number of training samples by applying various distortions. The following distortions can be used:
The image distortion examples for characters recognition task:
Extra lines in the images:
Compression and stretching along the axes:
Such an approach does not demand much human effort and is relatively cheap, as it needs neither annotation nor data collection: the whole process of creating the image database is defined by algorithms and their parameters.
The drawback of this method is the weak connection between the generated data and the data that occurs in real conditions. Moreover, generating enough samples requires considerable computational resources.
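As an illustration of how such generation can be driven purely by algorithms and parameters, here is a minimal sketch of three distortions: a shift, an extra line, and compression or stretching along the axes. It assumes grayscale images stored as 2-D NumPy arrays with a white (255) background. The function names and parameter ranges are illustrative assumptions, not the values used in the experiment.

```python
import numpy as np

def shift(img, dx, dy, fill=255):
    """Shift the image by (dx, dy) pixels, padding the exposed area with `fill`."""
    h, w = img.shape
    out = np.full_like(img, fill)
    # Part of the source image that remains visible after the shift.
    y0, y1 = max(0, -dy), max(0, min(h, h - dy))
    x0, x1 = max(0, -dx), max(0, min(w, w - dx))
    src = img[y0:y1, x0:x1]
    ty, tx = max(0, dy), max(0, dx)
    out[ty:ty + src.shape[0], tx:tx + src.shape[1]] = src
    return out

def add_line(img, row, thickness=1, value=0):
    """Draw an extra horizontal line across the image."""
    out = img.copy()
    out[row:row + thickness, :] = value
    return out

def scale_axes(img, sx, sy, fill=255):
    """Compress or stretch around the image centre (nearest-neighbour sampling)."""
    h, w = img.shape
    ys, xs = np.indices((h, w))
    cy, cx = (h - 1) / 2, (w - 1) / 2
    # For every output pixel, find the source pixel it is sampled from.
    src_y = np.rint((ys - cy) / sy + cy).astype(int)
    src_x = np.rint((xs - cx) / sx + cx).astype(int)
    inside = (src_y >= 0) & (src_y < h) & (src_x >= 0) & (src_x < w)
    out = np.full_like(img, fill)
    out[inside] = img[src_y[inside], src_x[inside]]
    return out

def distort(img, rng):
    """Apply a random combination of the distortions (assumed parameter ranges)."""
    out = shift(img, int(rng.integers(-2, 3)), int(rng.integers(-2, 3)))
    out = scale_axes(out, rng.uniform(0.8, 1.2), rng.uniform(0.8, 1.2))
    if rng.random() < 0.5:
        out = add_line(out, int(rng.integers(0, img.shape[0])))
    return out
```

Calling `distort(template, np.random.default_rng(0))` repeatedly on a font template yields as many distorted variants as needed. Only the parameter ranges decide how close they are to real data.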
Here is an example of creating a fully artificial database.
An initial set of font character images:
Image examples with no distortion:
Adding small distortions:
A natural extension of the previous method is to generate artificial images from real data instead of templates and initial ‘ideal’ examples. Adding distortions to real samples can substantially improve a recognition system. To decide which distortions to apply, part of the real data must be set aside for validation. It can be used to assess the most common errors and to add images with the corresponding distortions to the training database.
This way of creating training samples combines the advantages of both methods above: it requires no significant extra expense and makes it possible to create the large number of samples needed to train the recognizer.
The most difficult part of the method is the accurate selection of the augmentation parameters used to generate the training dataset from the initial samples. On the one hand, there must be enough distorted samples for the neural network to learn to handle even noisy images; on the other hand, the network's response to other types of non-trivial input images must not degrade.
To try this out, we train a neural network on images of MRZ characters. The MRZ (Machine-Readable Zone) is a field of an identity document laid out in accordance with the international recommendations of Doc 9303, Machine Readable Travel Documents, issued by the International Civil Aviation Organization (ICAO).
The MRZ contains 88 characters. Two quality characteristics of the system will be assessed: the per-character error rate and the percentage of MRZ fields recognized without a single error.
The neural network is intended for mobile devices, where computational capabilities are limited; therefore, we use a relatively small number of layers and weights.
For the experiment, 800,000 character samples were collected and divided into three groups: 200,000 samples for training, 300,000 for validation, and 300,000 for testing. Such a division is unnatural, as most of the samples are ‘wasted’ on validation and testing; however, it is the best way to show the advantages and disadvantages of the different methods.
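A split of this kind can be sketched in a few lines. The article does not say how samples were assigned to the groups, so the random shuffle, the fixed seed, and the name `split_samples` below are assumptions made for illustration.

```python
import random

def split_samples(samples, n_train, n_val, n_test, seed=0):
    """Shuffle the samples and cut them into three disjoint groups."""
    if n_train + n_val + n_test > len(samples):
        raise ValueError("not enough samples for the requested split")
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split reproducible
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:n_train + n_val + n_test]
    return train, val, test

# Sizes from the experiment: 200,000 / 300,000 / 300,000.
train, val, test = split_samples(range(800_000), 200_000, 300_000, 300_000)
```

Here the samples are represented by their indices. In a real pipeline the same indices would select image files and their labels.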
For the test set, the class distribution is close to the real one and looks as follows:
Class name (character): number of samples
|0: 22416||1: 17602||2: 13746||3: 8115||4: 8587|
|5: 9383||6: 8697||7: 8082||8: 9734||9: 8847|
|<: 110438||A: 12022||B: 1834||C: 3891||D: 2952|
|E: 7349||F: 3282||G: 2169||H: 3309||I: 6737|
|J: 934||K: 2702||L: 4989||M: 6244||N: 7897|
|O: 4515||P: 4944||Q: 109||R: 7717||S: 5499|
|T: 3730||U: 4224||V: 3117||W: 744||X: 331|
|Y: 1834||Z: 1246|
When the network is trained only on natural samples, the per-character error averaged over 25 experiments is 0.25%, which means that 750 of the 300,000 images were recognized incorrectly. Such quality is not sufficient for practical use, as the share of correctly recognized MRZ fields is then only about 80%.
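The link between the per-character error and the share of fully correct fields can be checked with a short calculation. The sketch below assumes that errors on individual characters are independent, which is a simplification the article does not state; `field_accuracy` is an illustrative name.

```python
def field_accuracy(char_error_rate, field_length=88):
    """Probability that all characters of a field are recognized correctly,
    assuming independent per-character errors."""
    return (1.0 - char_error_rate) ** field_length

# Per-character error of 0.25% on an 88-character MRZ.
print(f"{field_accuracy(0.0025):.1%}")
```

With an error rate of 0.25% and 88 characters, the result is close to the 80% figure quoted above. This is why even a seemingly small per-character error is unacceptable for whole-field recognition.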
Let us consider the most common types of errors made by such a neural network.
Examples of incorrectly recognized images:
The following types of errors can be highlighted:
The table of the most common errors:
|Original character||Number of errors||Characters most often confused with it: number of times|
|'0'||437||'O': 419, 'U': 5, 'J': 4, '2': 2, '1': 1|
|'<'||71||'2': 29, 'K': 6, 'P': 6, '4': 4, '6': 4|
|'8'||35||'B': 10, '6': 10, 'D': 4, 'E': 2, 'M': 2|
|'O'||20||'0': 19, 'Q': 1|
|'4'||19||'6': 5, 'N': 3, '¡': 2, 'A': 1, 'D': 1|
|'6'||18||'G': 4, 'S': 4, 'D': 3, 'O': 2, '4': 2|
|'1'||17||'T': 6, 'Y': 5, '7': 2, '3': 1, '6': 1|
|'L'||14||'I': 9, '4': 4, 'C': 1|
|'M'||14||'H': 7, 'P': 5, '3': 1, 'N': 1|
|'E'||14||'C': 5, 'I': 3, 'B': 2, 'F': 2, 'A': 1|
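A table like this can be built directly from the validation results. The sketch below assumes the true and predicted labels are available as two sequences of equal length; `confusion_table` is a hypothetical helper, not part of any particular library.

```python
from collections import Counter, defaultdict

def confusion_table(true_labels, predicted_labels, top=5):
    """For each character, count its errors and the characters it is confused with."""
    confusions = defaultdict(Counter)
    for truth, guess in zip(true_labels, predicted_labels):
        if truth != guess:
            confusions[truth][guess] += 1
    rows = [(truth, sum(counts.values()), counts.most_common(top))
            for truth, counts in confusions.items()]
    # Characters with the largest number of errors come first.
    rows.sort(key=lambda row: row[1], reverse=True)
    return rows

# Toy example: five characters, four of them misrecognized.
for char, total, confused in confusion_table("00O08", "OO00B"):
    print(char, total, confused)
```

Sorting by the total number of errors puts the characters that deserve attention first, which is the order used when choosing the distortions to add.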
Different types of distortion, corresponding to the most common types of errors, will be added to the training set gradually. The number of ‘distorted’ images added must be varied and chosen based on feedback from the validation set.
The following must be done:
As an example, the following actions were taken to solve the given task:
Such ‘iterations’ can be repeated multiple times, until the required quality is reached or the quality stops improving.
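One possible reading of this iterative procedure is a greedy loop: try adding a batch of samples with a given distortion, keep the batch only if the validation score improves, and stop when a full pass brings no gain. This is a sketch under assumptions. `make_samples` and `fit_and_score` are hypothetical callables that the caller supplies (the second one trains a network and returns its validation quality), and the candidate counts are arbitrary.

```python
def augment_iteratively(train_set, distortions, make_samples, fit_and_score,
                        counts=(1000, 5000, 20000), min_gain=1e-4):
    """Greedily extend the training set while the validation score keeps improving."""
    best_set = list(train_set)
    best_score = fit_and_score(best_set)
    improved = True
    while improved:  # one 'iteration' = one pass over all distortions
        improved = False
        for distortion in distortions:
            for count in counts:  # vary the number of distorted images
                candidate = best_set + make_samples(distortion, count)
                score = fit_and_score(candidate)
                if score > best_score + min_gain:
                    best_set, best_score = candidate, score
                    improved = True
    return best_set, best_score
```

Because every accepted step must raise the score by at least `min_gain`, the loop stops once the quality no longer improves. Each call to `fit_and_score` means a full training run, so in practice the number of candidates has to stay small.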
Thus, a recognition quality of 94.5% of accurately recognized MRZ fields was achieved. Post-processing (Markov models, finite automata, N-gram-based or dictionary-based methods, and so on) can increase the quality further.
Training on artificial data alone yielded only 81.72% of accurately recognized MRZ fields for this task. The main issue here is the difficulty of selecting the distortion parameters.
|Training set||Percentage of accurately recognized MRZ fields||Per-character error|
|Natural images + images with a shift||89.68%||0.13%|
|+ images with extra lines||93.19%||0.1%|
To sum up, the algorithm for obtaining training data must be chosen for each specific case. If no initial data is available, an artificially generated set must be used. If real data can be obtained, a training dataset built solely from it can be used. But if there is not enough real data, or if some errors occur only rarely, the most appropriate way is to augment the set of natural images. In our experience, the latter case occurs most frequently.