It is important to define the technical task before building a sample database and training a neural network. Recognizing handwritten text, recognizing human facial expressions, and determining a location from a photograph are clearly completely different tasks. It is equally evident that the target platform affects the choice of neural network architecture: in the cloud, on a PC, or on a mobile device, the available computational resources differ by orders of magnitude.
Then the plot thickens. Recognizing images from high-resolution cameras and recognizing blurred images from a webcam without autofocus require entirely different data for training, validation, and testing. This is what the 'no free lunch' theorem implies. It is exactly why openly available training databases (for example [1, 2, 3]) are perfectly suitable for academic research but are rarely applicable to real tasks because of their 'generality'.
The more precisely the training set approximates the full statistical population of images that will arrive at the input of your system, the higher the maximum achievable quality will be. Hence, a properly constructed training set presupposes an extremely specific technical task! For instance, if we want to recognize printed characters in a photograph taken with a mobile device, the sample database must contain photographs of documents from different sources, under different lighting, taken with different phone models and cameras. All of this complicates collecting the number of samples needed to train a recognizer.
Let us consider several possible ways to prepare image sets to create a recognition system.
The creation of training samples from natural images
The training examples from natural images are created based on real data. The process consists of the following stages:
- collection of graphic data (photographing objects of interest, capturing a video stream from a camera, selecting part of an image on a web page)
- filtering: checking images against requirements such as adequate lighting of the objects, presence of the required object, etc.
- preparing annotation tools (custom development or adaptation of existing tools)
- annotation (marking quadrangles, character positions, regions of interest)
- labelling every image (the label is a letter or the name of the object in the image)
These operations are quite time-consuming, so creating a training database this way is rather expensive. Besides, data has to be collected under varying conditions: lighting, the phone and camera models used, different document sources (printing houses), and so on.
All of this complicates collecting the number of samples needed for recognizer training. On the other hand, such data is paramount for the system's performance in real conditions.
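The filtering stage listed above can be partly automated. Below is a minimal sketch of a lighting check, assuming 8-bit grayscale images stored as lists of rows. The threshold values and the function name `passes_lighting_filter` are illustrative assumptions, not part of the original pipeline.

```python
from statistics import mean

# Hypothetical thresholds for 8-bit grayscale images; tune them on your own data.
MIN_MEAN_BRIGHTNESS = 40
MAX_MEAN_BRIGHTNESS = 220
MIN_CONTRAST = 30  # difference between the brightest and the darkest pixel


def passes_lighting_filter(image):
    """Return True if the image is neither too dark, too bright, nor too flat.

    `image` is a list of rows, each row a list of ints in 0..255.
    """
    pixels = [p for row in image for p in row]
    brightness = mean(pixels)
    contrast = max(pixels) - min(pixels)
    return (MIN_MEAN_BRIGHTNESS <= brightness <= MAX_MEAN_BRIGHTNESS
            and contrast >= MIN_CONTRAST)


# Tiny stand-in images: one underexposed, one acceptable.
dark = [[5, 8], [3, 10]]
normal = [[30, 200], [180, 60]]
kept = [img for img in (dark, normal) if passes_lighting_filter(img)]
```

A check like this only removes obviously unusable frames. Verifying that the required object is present would still need a detector or manual review.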
The creation of training samples from artificial images
Another approach to creating training data is to generate it artificially. Several templates or 'ideal' examples (e.g. sets of fonts) can be taken, and the necessary number of training samples can be created by applying various distortions. The following distortions can be used:
- geometric (affine, projective …).
- brightness and color modification
- background replacement
- distortions typical for the task to be solved: light specks, noise, blur, etc.
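As an illustration of the list above, here is a minimal dependency-free sketch that applies a random shift (the simplest geometric distortion), a brightness change, and Gaussian noise to a template. Images are plain lists of rows to keep the example self-contained; in practice an image library would normally be used. All function names and parameter ranges are assumptions made for this example.

```python
import random


def shift(image, dx, dy, fill=0):
    """Translate the image by (dx, dy) pixels, padding uncovered areas with `fill`."""
    h, w = len(image), len(image[0])
    out = [[fill] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            sx, sy = x - dx, y - dy
            if 0 <= sx < w and 0 <= sy < h:
                out[y][x] = image[sy][sx]
    return out


def adjust_brightness(image, gain=1.0, bias=0):
    """Apply p * gain + bias to every pixel and clip the result to 0..255."""
    return [[min(255, max(0, round(p * gain + bias))) for p in row]
            for row in image]


def add_noise(image, sigma, rng):
    """Add zero-mean Gaussian noise with standard deviation `sigma`."""
    return [[min(255, max(0, round(p + rng.gauss(0, sigma)))) for p in row]
            for row in image]


def distort(image, rng):
    """Produce one distorted sample; the ranges below are arbitrary examples."""
    img = shift(image, rng.randint(-1, 1), rng.randint(-1, 1))
    img = adjust_brightness(img, gain=rng.uniform(0.7, 1.3),
                            bias=rng.randint(-20, 20))
    return add_noise(img, sigma=5.0, rng=rng)


# A 3x3 "template" with a vertical stroke, and five distorted samples from it.
template = [[0, 255, 0], [0, 255, 0], [0, 255, 0]]
rng = random.Random(0)  # fixed seed so the run is reproducible
samples = [distort(template, rng) for _ in range(5)]
```

Passing the random generator in explicitly keeps the generated dataset reproducible, which helps when comparing networks trained with different distortion parameters.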
Examples of image distortions for a character recognition task:
Extra lines in the images:
Compression and stretching along the axes:
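The two distortions named above could be sketched as follows. This is an assumption-laden illustration on list-of-rows grayscale images: `draw_line` and `scale_axes` are hypothetical helper names, and nearest-neighbour sampling is chosen only for brevity.

```python
import random


def draw_line(image, y0, x0, y1, x1, value=255):
    """Return a copy of the image with a straight line from (y0, x0) to (y1, x1)."""
    out = [row[:] for row in image]
    h, w = len(out), len(out[0])
    steps = max(abs(y1 - y0), abs(x1 - x0), 1)
    for i in range(steps + 1):
        y = round(y0 + (y1 - y0) * i / steps)
        x = round(x0 + (x1 - x0) * i / steps)
        if 0 <= y < h and 0 <= x < w:
            out[y][x] = value
    return out


def scale_axes(image, sx, sy, fill=0):
    """Squeeze (factor < 1) or stretch (factor > 1) the image around its centre.

    The output keeps the input size; pixels are taken from the nearest source pixel.
    """
    h, w = len(image), len(image[0])
    cy, cx = (h - 1) / 2, (w - 1) / 2
    out = [[fill] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            src_y = round(cy + (y - cy) / sy)
            src_x = round(cx + (x - cx) / sx)
            if 0 <= src_y < h and 0 <= src_x < w:
                out[y][x] = image[src_y][src_x]
    return out


# A 5x5 glyph with a single vertical stroke in the middle column.
glyph = [[0, 0, 255, 0, 0] for _ in range(5)]
rng = random.Random(1)
with_line = draw_line(glyph, rng.randrange(5), 0, rng.randrange(5), 4)
stretched = scale_axes(glyph, sx=2.0, sy=1.0)  # stroke becomes wider
```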
Such an approach does not demand a huge amount of human resources and is relatively cheap, as it needs no annotation or data collection: the whole process of creating the image database is defined by algorithms and their parameters.
The drawback of this method is the weak connection between the generated data and the data encountered in real conditions. Moreover, creating enough samples requires substantial computational power.
Here is an example of creating a fully artificial database.
An initial set of font character images:
Examples of images with no distortion:
Addition of small distortions:
Creation of artificial training examples based on natural images
A natural extension of the previous method is to generate artificial images from real data instead of templates and 'ideal' initial examples. Adding distortions can improve the recognition system substantially. To determine which distortions should be applied, part of the real data must be set aside for validation. It can be used to assess the most common errors and to add images with the corresponding distortions to the training database.
This way of creating training samples combines the advantages of both methods above: it requires no significant extra expense and makes it possible to create the large number of samples needed for recognizer training.
The most complicated part of the method is the accurate selection of augmentation parameters for generating the training dataset from the initial samples. On the one hand, there must be enough distorted samples for the neural network to learn to recognize noisy images; on the other hand, its response to the other types of non-trivial input images must remain intact.
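One simple way to approach this selection is an exhaustive search over a small grid of augmentation parameters, scored on the validation set. The sketch below assumes a user-supplied callable `train_and_validate` that trains a network with the given parameters and returns its validation error. That callable, the parameter names, and the toy stand-in used here are all hypothetical.

```python
from itertools import product


def select_augmentation(grid, train_and_validate):
    """Try every parameter combination and keep the one with the lowest validation error."""
    best_params, best_error = None, float("inf")
    names = list(grid)
    for values in product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        error = train_and_validate(params)
        if error < best_error:
            best_params, best_error = params, error
    return best_params, best_error


def fake_train_and_validate(params):
    """Toy stand-in for real training: returns an invented validation error."""
    return (abs(params["max_shift_px"] - 2) * 0.01
            + abs(params["augmented_share"] - 0.3) * 0.1
            + 0.001)


# Hypothetical search space: maximum shift in pixels and share of augmented samples.
grid = {"max_shift_px": [0, 1, 2, 4], "augmented_share": [0.1, 0.3, 0.5]}
best, err = select_augmentation(grid, fake_train_and_validate)
```

Since every grid point means training at least one network, the grid has to stay small. Scoring must use validation data only, so that the test set remains an unbiased estimate.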
Comparison of neural network training quality on natural samples, fully artificial samples, and samples generated from natural ones
As a trial, a neural network was trained on images of MRZ characters. The MRZ (Machine-Readable Zone) is a field of an identity document formatted in accordance with the international recommendations of Doc 9303, Machine Readable Travel Documents, issued by the International Civil Aviation Organization (ICAO). For further details on MRZ recognition, see our article here.
The MRZ contains 88 characters. Two quality characteristics of the system will be assessed:
- percentage of incorrectly recognized characters
- percentage of fully and correctly recognized fields (an MRZ is considered fully recognized if all its characters are recognized correctly).
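The two characteristics above can be computed as in the following sketch. It assumes the ground truth and the recognition results are available as lists of equal-length strings. The function names are illustrative, and the sample strings are shortened stand-ins, not real 88-character MRZs.

```python
def char_error_rate(truth, predicted):
    """Share of character positions recognized incorrectly, over all fields."""
    total = errors = 0
    for t, p in zip(truth, predicted):
        if len(t) != len(p):
            raise ValueError("fields must have equal length")
        total += len(t)
        errors += sum(a != b for a, b in zip(t, p))
    return errors / total


def field_accuracy(truth, predicted):
    """Share of fields in which every character is recognized correctly."""
    return sum(t == p for t, p in zip(truth, predicted)) / len(truth)


# Shortened stand-in "fields"; one '0' is misread as 'O' in the first field.
truth = ["P<UT0", "12345"]
predicted = ["P<UTO", "12345"]
cer = char_error_rate(truth, predicted)   # 1 error out of 10 characters
acc = field_accuracy(truth, predicted)    # 1 of 2 fields fully correct
```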
The neural network is intended for mobile devices, where computational capabilities are limited; therefore, we will use a relatively small number of layers and weights.
800,000 character samples were collected for the experiment and divided into 3 groups: 200,000 samples for training, 300,000 for validation, and 300,000 for testing. Such a division is unnatural, as most of the samples are 'wasted' (on validation and testing); however, it is the best way to show the advantages and disadvantages of the different methods.
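A three-way split like this can be sketched as below. The text does not say how the samples were assigned to groups, so the seeded random shuffle is an assumption, and the sizes are scaled down by a factor of 1,000 to keep the example light.

```python
import random


def split_samples(samples, n_train, n_val, n_test, seed=0):
    """Shuffle the samples and cut them into disjoint train/validation/test lists."""
    if n_train + n_val + n_test > len(samples):
        raise ValueError("not enough samples for the requested split")
    shuffled = samples[:]
    random.Random(seed).shuffle(shuffled)
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:n_train + n_val + n_test]
    return train, val, test


# Integers stand in for character images (800 instead of 800,000).
samples = list(range(800))
train, val, test = split_samples(samples, n_train=200, n_val=300, n_test=300)
```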
For the test set, the distribution of classes is close to the real one and is as follows:
The class name (character): the number of samples
|0: 22416||1: 17602||2: 13746||3: 8115||4: 8587|
|5: 9383||6: 8697||7: 8082||8: 9734||9: 8847|
|<: 110438||A: 12022||B: 1834||C: 3891||D: 2952|
|E: 7349||F: 3282||G: 2169||H: 3309||I: 6737|
|J: 934||K: 2702||L: 4989||M: 6244||N: 7897|
|O: 4515||P: 4944||Q: 109||R: 7717||S: 5499|
|T: 3730||U: 4224||V: 3117||W: 744||X: 331|
|Y: 1834||Z: 1246|
When the network is trained only on natural samples, the per-character error averaged over 25 experiments is 0.25%, which means that 750 of the 300,000 images were recognized incorrectly. Such quality is not acceptable for practical purposes, as the share of correctly recognized fields is consequently only 80%.
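The link between the two figures can be checked with a short calculation. The sketch assumes that character errors are independent of each other, which is a simplification introduced here and not a claim made in the text. Under that assumption, an 88-character field is correct with probability (1 - p) to the power of 88.

```python
MRZ_LENGTH = 88  # number of characters in the MRZ, as stated above


def expected_field_accuracy(char_error_rate, length=MRZ_LENGTH):
    """Probability that all `length` characters are correct, assuming independent errors."""
    return (1.0 - char_error_rate) ** length


acc = expected_field_accuracy(0.0025)  # 0.25% per-character error
print(f"{acc:.1%}")  # prints 80.2%
```

Real errors tend to cluster in difficult images, so measured field accuracy can differ from this estimate.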
Let us consider the most common types of errors made by such a neural network.
Examples of incorrectly recognized images:
The following types of errors can be highlighted:
- errors in acentric images
- errors in rotated images
- errors in images with lines
- errors in images with highlights
- errors in other complicated cases
Table of the most common errors:
|Original character||Number of errors||Characters most often confused with it, and the number of times|
|'0'||437||'O': 419, 'U': 5, 'J': 4, '2': 2, '1': 1|
|'<'||71||'2': 29, 'K': 6, 'P': 6, '4': 4, '6': 4|
|'8'||35||'B': 10, '6': 10, 'D': 4, 'E': 2, 'M': 2|
|'O'||20||'0': 19, 'Q': 1|
|'4'||19||'6': 5, 'N': 3, '¡': 2, 'A': 1, 'D': 1|
|'6'||18||'G': 4, 'S': 4, 'D': 3, 'O': 2, '4': 2|
|'1'||17||'T': 6, 'Y': 5, '7': 2, '3': 1, '6': 1|
|'L'||14||'I': 9, '4': 4, 'C': 1|
|'M'||14||'H': 7, 'P': 5, '3': 1, 'N': 1|
|'E'||14||'C': 5, 'I': 3, 'B': 2, 'F': 2, 'A': 1|
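A table of this kind can be built from pairs of true and predicted characters. The following is a minimal sketch; the function name `confusion_table` and the tiny input are made up for illustration and are not the data behind the table above.

```python
from collections import Counter, defaultdict


def confusion_table(pairs, top=5):
    """Build rows (true_char, error_count, most common confusions), sorted by error count.

    `pairs` is an iterable of (true_char, predicted_char) tuples.
    """
    confusions = defaultdict(Counter)
    for true_char, predicted_char in pairs:
        if true_char != predicted_char:
            confusions[true_char][predicted_char] += 1
    rows = [(char, sum(counts.values()), counts.most_common(top))
            for char, counts in confusions.items()]
    rows.sort(key=lambda row: row[1], reverse=True)
    return rows


# Invented recognition results: (true character, predicted character).
pairs = [("0", "O"), ("0", "O"), ("0", "U"), ("O", "0"), ("8", "B"), ("A", "A")]
rows = confusion_table(pairs)
for char, count, confused_with in rows:
    print(char, count, confused_with)
```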
Different types of distortion, corresponding to the most common types of errors, will be added to the training set gradually. The number of 'distorted' images added must be varied and selected based on feedback from the validation set.
To solve the given task, the following steps were taken:
- Adding a 'shift' distortion, which corresponds to the errors in acentric images.
- A series of experiments: training several neural networks.
- Quality assessment on the test set. MRZ recognition quality increased by 9%.
- Analysis of the most frequent recognition errors on the validation set.
- Adding images with extra lines to the training database.
- Another series of experiments.
- Testing. MRZ recognition quality on the test set increased by 3.5%.
Such 'iterations' can be repeated multiple times, until the required quality is achieved or the quality stops increasing.
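The stopping rule can be sketched as a simple loop. In the sketch, the order of candidate distortions is given in advance, whereas in the procedure described above the next distortion is chosen after analysing the errors. The callable `train_and_evaluate` and the toy quality numbers are hypothetical.

```python
def iterate_augmentation(distortions, train_and_evaluate, target=None, min_gain=0.0):
    """Add distortion types one by one until the target is met or quality stops improving."""
    active = []
    quality = train_and_evaluate(active)
    history = [(tuple(active), quality)]
    for distortion in distortions:
        new_quality = train_and_evaluate(active + [distortion])
        if new_quality - quality <= min_gain:
            break  # quality stopped increasing
        active.append(distortion)
        quality = new_quality
        history.append((tuple(active), quality))
        if target is not None and quality >= target:
            break  # required quality reached
    return active, quality, history


# Toy stand-in for "train several networks and measure field accuracy".
TOY_GAINS = {"shift": 0.09, "lines": 0.03, "blur": 0.0}


def fake_train_and_evaluate(active):
    return 0.80 + sum(TOY_GAINS[d] for d in active)


active, quality, history = iterate_augmentation(
    ["shift", "lines", "blur"], fake_train_and_evaluate)
```

With the toy numbers, the loop keeps the first two distortions and stops at the third, which brings no gain.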
Thus, a recognition quality of 94.5% of accurately recognized MRZ fields was achieved. Post-processing (Markov models, finite automata, N-gram-based or dictionary-based methods, and so on) can increase the quality further.
When training on artificial data only, a quality of just 81.72% of accurately recognized MRZ fields was achieved for this task. The main issue here is the difficulty of selecting the distortion parameters.
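One more post-processing option, not named in the list above and shown here only as an illustration, is to use the check digits that Doc 9303 defines for some MRZ fields. The sketch computes the check digit (weights 7, 3, 1; digits keep their value, letters A to Z map to 10 to 35, and '<' counts as 0) and tries swapping the frequently confused '0' and 'O' until the check digit matches. The `repair` helper and the restriction to this single confusable pair are assumptions made for this example.

```python
from itertools import product

WEIGHTS = (7, 3, 1)

# Only the 0/O pair from the error table is considered here (illustrative choice).
CONFUSABLE = {"0": "0O", "O": "O0"}


def char_value(c):
    """Numeric value of an MRZ character for check digit calculation."""
    if "0" <= c <= "9":
        return int(c)
    if "A" <= c <= "Z":
        return ord(c) - ord("A") + 10
    if c == "<":
        return 0
    raise ValueError(f"unexpected MRZ character: {c!r}")


def check_digit(field):
    """Weighted sum of character values modulo 10."""
    return sum(char_value(c) * WEIGHTS[i % 3] for i, c in enumerate(field)) % 10


def repair(field, digit):
    """Return the 0/O variants of `field` whose check digit equals `digit`."""
    options = [CONFUSABLE.get(c, c) for c in field]
    return ["".join(v) for v in product(*options)
            if check_digit("".join(v)) == digit]


# A document number whose '0' was misread as 'O'; its check digit is 6.
candidates = repair("L8989O2C3", 6)
```

A check digit cannot resolve every ambiguity: with more confusable pairs, several variants may satisfy it, so it is best combined with the other methods.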
|Training set||Percentage of accurately recognized MRZ fields||Percentage of character errors|
|Natural images + images with a shift||89.68%||0.13%|
|+ images with extra lines||93.19%||0.1%|
To sum up, a specific method of obtaining training data must be chosen for each particular case. If no initial data is available, an artificial sample set must be used. If real data can be obtained, a training dataset built solely on it can be used. But if there is not enough real data or there are rarely occurring errors, the most appropriate way is to augment the set of natural images. In our experience, the latter case occurs most frequently.