July 1, 2020 | Sci-Dev

Text line segmentation in documents using convolutional and recurrent neural networks

Text line segmentation is a crucial step in optical character recognition (OCR), particularly when working with document images. It is defined as the decomposition of an image containing a sequence of characters into fragments that each contain a single character.

The importance of segmentation stems from the fact that most advanced optical text recognition systems are based on character classifiers (such as neural networks) rather than on classifiers of words or text fragments. In such systems, incorrect cuts between characters typically account for the bulk of the final recognition errors.

Detecting character boundaries is complicated by printing and digitizing (scanning) artifacts, which cause characters to be “cut” apart or “glued” together. When stationary or compact mobile video cameras are used, the range of digitizing artifacts increases significantly: defocus and motion blur, projective distortion, and warping or bending of the document. There are often parasitic brightness fluctuations (shadows, reflections), as well as colour distortions and digital noise caused by low-light conditions when shooting in natural lighting. The figure below demonstrates complex cases of text line segmentation in Russian passports.


In this article, we describe the method for segmenting text lines into characters that we developed at Smart Engines, based on training convolutional and recurrent neural networks. The main document considered in our work is the Russian passport.

“End-to-end” segmentation using machine learning methods

Machine learning methods are widely used in modern segmentation algorithms. However, they are typically combined with additional algorithms such as the initial cut generation for a trained character recognition model, or dynamic programming based on the output solutions of this model.

Developing a segmentation algorithm that applies machine learning to image analysis with virtually no additional pre- or post-processing (end-to-end) is especially interesting. This approach stands out because it does not require manual fine-tuning for each font, margin, or document type. It does require a sufficiently large, representative, annotated training set. Given such a set, it simplifies and accelerates the creation of segmentation algorithms for new types of document lines, and it improves accuracy and resilience to the various distortions that occur while shooting.

Quality assessment of segmentation methods

As with any other algorithm, developing a segmentation method requires fixing how its quality is assessed. Preferably, the assessment should allow the method under development to be compared with existing algorithms. Let’s go over the quality metrics that we used in this research to assess text line segmentation for the Russian passport.

The purpose of segmenting text into characters is its subsequent recognition, which is why the final recognition quality is a popular measure of segmentation quality. Both the recognition accuracy of individual characters and words and the average Levenshtein distance can be used for this purpose. In this research, the recognition quality indicator for the Russian passport is the share of passport lines (first name, last name, place of birth, etc.) recognized completely correctly, down to every character, since even a single error in one of the lines that identify a person would be unacceptable.
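As an illustration of the two recognition-based measures mentioned above, here is a minimal sketch in plain Python. The function names and the sample strings are our own illustrative choices, not part of the recognition system described in this article.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance between two strings (insertions, deletions, substitutions)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # substitute, or keep if equal
            ))
        prev = curr
    return prev[-1]


def line_accuracy(recognized, truth):
    """Share of lines recognized exactly, character for character."""
    return sum(r == t for r, t in zip(recognized, truth)) / len(truth)


def mean_levenshtein(recognized, truth):
    """Average edit distance between recognized and ground-truth lines."""
    return sum(levenshtein(r, t) for r, t in zip(recognized, truth)) / len(truth)


# Hypothetical recognition results for two lines.
truth = ["IVANOV", "PETROV"]
recognized = ["IVANOV", "PETROW"]
print(line_accuracy(recognized, truth))     # 0.5
print(mean_levenshtein(recognized, truth))  # 0.5
```

Whole-line accuracy is the stricter of the two: a line with a single wrong character counts as a complete failure, which matches the requirement stated above for identity data.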

However, when recognition quality is used to assess segmentation quality, the assessment depends on the specific recognition model. This complicates the development of an integrated system, since different segmentation and recognition algorithms are not freely interchangeable. For this reason we also used quality metrics based only on the inter-character cuts produced by the segmentation algorithm: precision, recall, and F-measure. This approach works only when a “perfect” (ground-truth) markup of the characters and the cuts between them, annotated by people, is available. It was available here, but that will not always be the case.
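The cut-based metrics can be sketched as follows. The article does not state how a predicted cut is matched to an annotated one, so the pixel tolerance `tol` and the greedy one-to-one matching below are assumptions made for illustration only.

```python
def cut_metrics(predicted, truth, tol=2):
    """Precision, recall and F-measure for cut positions (pixel columns).

    Assumption: a predicted cut is correct when it lies within `tol` pixels
    of a ground-truth cut that has not been matched yet (greedy matching).
    """
    unmatched = sorted(truth)
    true_pos = 0
    for p in sorted(predicted):
        # Closest ground-truth cut that is still unmatched, if any.
        best = min(unmatched, key=lambda t: abs(t - p), default=None)
        if best is not None and abs(best - p) <= tol:
            unmatched.remove(best)
            true_pos += 1
    precision = true_pos / len(predicted) if predicted else 0.0
    recall = true_pos / len(truth) if truth else 0.0
    denom = precision + recall
    f_measure = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f_measure


# Three of four predicted cuts fall near an annotated cut.
print(cut_metrics([10, 21, 40, 55], [10, 20, 30, 40]))  # (0.75, 0.75, 0.75)
```

Because these metrics use only cut positions, they can be computed without running any recognizer, which is what makes them independent of the recognition model.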

Input dataset preparation and artificial expansion 

The input dataset contains selected images of Russian passport lines, cropped along the text baselines and annotated with precise cuts between the characters. The annotation can be performed manually or semi-automatically: an existing segmentation algorithm marks the cuts, which are then checked and corrected manually if needed.

Preparing and annotating a Russian passport dataset is an expensive procedure, and the large amount of manual labour is not the only reason. Identity documents contain personal information whose handling is regulated by legislation, so there are no openly accessible databases with a considerable number of Russian passport images. Creating such a dataset artificially is not easy either, because there are no open specifications for the background security features of Russian passports, such as guilloche and holograms.

Clearly, it is challenging to collect a dataset large enough to train a high-precision approximator that copes with imperfect shooting quality and line detection errors. Artificial expansion (augmentation) of the training dataset via data transformation can be used to increase robustness. Each sample is synthesized by applying a random set of transformations that mimic the distortions of real images.

The following transformations were implemented to expand the input dataset of Russian passport text lines: Gaussian noise; projective distortions, which model imperfect detection of the document edges; Gaussian smoothing, which models defocusing; lengthwise and widthwise character stretching (these character parameters can vary greatly depending on the region); and vertical and horizontal shifts, which model the margin of error in real systems. Two additional transformations were used as well: letter shuffling and mirroring. These do not occur in natural settings, but they proved to increase accuracy when included in the expansion. The figures of the described transformations are listed below:


Transformations illustrated by the figures: original image, Gaussian noise, projective distortions, Gaussian smoothing, letter shuffling, and a combination of transformations.
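A simplified sketch of how some of these transformations can be applied while keeping the cut annotation consistent is given below. It covers only Gaussian noise, horizontal shifts and mirroring; projective distortion, smoothing, stretching and letter shuffling are omitted. The image representation (a list of pixel rows with values in [0, 1]), the function names and all parameter values are illustrative assumptions, not the settings used in our experiments.

```python
import random


def add_gaussian_noise(image, sigma, rng):
    """Add zero-mean Gaussian noise to every pixel and clip to [0, 1]."""
    return [[min(1.0, max(0.0, px + rng.gauss(0.0, sigma))) for px in row]
            for row in image]


def shift_horizontally(image, cuts, dx, fill=1.0):
    """Shift the line by dx columns (dx > 0 moves right); cuts move with it."""
    width = len(image[0])
    shifted = []
    for row in image:
        if dx >= 0:
            shifted.append([fill] * dx + row[:width - dx])
        else:
            shifted.append(row[-dx:] + [fill] * (-dx))
    # Cuts that leave the image are dropped from the annotation.
    new_cuts = [c + dx for c in cuts if 0 <= c + dx < width]
    return shifted, new_cuts


def mirror(image, cuts):
    """Flip the line left to right and remap the cut columns accordingly."""
    width = len(image[0])
    flipped = [row[::-1] for row in image]
    return flipped, sorted(width - 1 - c for c in cuts)


def augment(image, cuts, rng):
    """Apply a random combination of the transformations above."""
    if rng.random() < 0.5:
        image, cuts = mirror(image, cuts)
    image, cuts = shift_horizontally(image, cuts, rng.randint(-2, 2))
    image = add_gaussian_noise(image, sigma=0.05, rng=rng)
    return image, cuts


rng = random.Random(0)                      # fixed seed for reproducibility
line = [[1.0] * 200 for _ in range(20)]     # blank 200x20 "line image"
new_line, new_cuts = augment(line, [50, 100, 150], rng)
```

The important detail is that every geometric transformation is applied to the annotation as well as to the pixels; otherwise the synthesized samples would carry wrong ground truth.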


The format of the markup and the output vectors of the network 

Regardless of the type of universal approximator we are working with, we need to define the loss function (cost function) that will be minimized over the training dataset when the model parameters are trained.

The mean squared error (MSE), also called the mean squared deviation, is a loss function that is continuous and differentiable. The final output of the segmentation algorithm is the set of cut locations. In theory, therefore, the error could be measured directly between the predicted cuts and the “perfect” cuts, as the average distance between them. Unfortunately, this approach leads to difficulties, especially with a neural network.

Neither the number of characters nor the number of cuts between them is fixed, and the segmentation algorithm does not know in advance how many cuts it will have to produce. When the number of cuts in the ground truth matches the number produced by the algorithm, a distance-based error function can be used. If they do not match, however, missing and surplus cuts must be penalized separately, and it is hard to do so in a way that preserves the properties reasonably expected of a loss function. Besides, the number of outputs of a neural network is fixed for a given architecture, and supporting a variable number of outputs would complicate the model unnecessarily.

Therefore, we need formats for the ground truth and the network output that support any admissible number of cuts. At the same time, the corresponding loss function must be suitable for training a neural network with gradient methods. We suggest the following model: instead of a list of cut coordinates, we consider probabilistic estimates of a cut being present in each pixel column of the image. The ground-truth markup then takes the following form: zeros in all entries except those of the cuts, which are set to one. The mean squared error (the squared-deviation loss discussed above) can be used as the loss function here, but it is calculated between the vectors of probabilistic estimates. The final cut locations are obtained by transforming the output estimates of the algorithm, as elaborated later in the text.

Small shifts of the cuts relative to the ground truth normally do not affect the recognition quality much, especially when the segmentation algorithm places the cuts not on the character edges (two cuts between two adjacent characters) but midway between the characters (one cut between them). Yet the suggested loss function penalizes every output estimate that does not coincide with a perfect cut, regardless of the distance. To decrease the penalty for small shifts of the network output, we propose applying Gaussian smoothing to the ground-truth markup, with a radius proportional to the average character width in the image.
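The effect of smoothing the markup can be shown with a small sketch. The per-column target and the squared-error loss follow the description above; the peak-normalized kernel, the truncation at three sigma and the concrete value of `sigma` are assumptions chosen for the example.

```python
import math


def gaussian_kernel(sigma):
    """Discrete Gaussian with peak value 1, truncated at 3 sigma (assumed)."""
    radius = max(1, int(3 * sigma))
    return [math.exp(-(d * d) / (2 * sigma * sigma))
            for d in range(-radius, radius + 1)]


def make_target(width, cuts, sigma=0.0):
    """Per-column target: 1 at each cut, 0 elsewhere, optionally smoothed."""
    target = [0.0] * width
    for c in cuts:
        target[c] = 1.0
    if sigma <= 0:
        return target
    kernel = gaussian_kernel(sigma)
    radius = len(kernel) // 2
    smoothed = [0.0] * width
    for c in cuts:
        for k, weight in enumerate(kernel):
            col = c + k - radius
            if 0 <= col < width:
                # Where bumps of neighbouring cuts overlap, keep the larger value.
                smoothed[col] = max(smoothed[col], weight)
    return smoothed


def mse(output, target):
    """Mean squared error between two per-column vectors."""
    return sum((o - t) ** 2 for o, t in zip(output, target)) / len(target)


width, true_cut = 20, 10
near = [0.0] * width; near[11] = 1.0   # prediction one pixel off
far = [0.0] * width; far[15] = 1.0     # prediction five pixels off

hard = make_target(width, [true_cut])
soft = make_target(width, [true_cut], sigma=2.0)

print(mse(near, hard) == mse(far, hard))  # True: distance is ignored
print(mse(near, soft) < mse(far, soft))   # True: a near miss costs less
```

With the unsmoothed markup a one-pixel miss is penalized exactly like a five-pixel miss, whereas the smoothed markup gives the gradient a direction towards the true cut.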

Segmentation using convolutional neural networks

One of the most popular neural network architectures for image analysis is the deep convolutional network, and we used it for the first substantial experiments when creating our trainable segmentation method. The classic convolutional network consists of several convolutional layers, which form feature maps by convolving with trained kernels, alternated with max pooling layers, which reduce the dimensions of the feature maps. The last layers (ending with the output layer) are fully connected.

The input data (feature vectors) of the neural network in this research are greyscale bitmap images of the information lines of the Russian passport, scaled to a fixed width and height, for example 200×20 pixels. The output layer, which gives the probabilistic estimates of a cut being present in each image column, has a fixed size of 200 units accordingly. The figure below shows a model of the convolutional neural network’s operation.

The hidden part of the neural network consists of convolutional layers followed by two fully connected layers. Each convolutional layer is followed by a pooling layer. The hyperbolic tangent was used as the activation function. Dropout was applied to the hidden layers during training.
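To make the data flow through such a network more concrete, the sketch below traces the feature-map sizes for a 200×20 input. The number of convolutional layers, the filter counts, the kernel size, the padding and the pooling size are all hypothetical; the article does not specify them. Only the input size and the 200 output units come from the text.

```python
def conv_output_size(size, kernel, stride=1, padding=0):
    """Size along one axis after a convolution or pooling operation."""
    return (size + 2 * padding - kernel) // stride + 1


def trace_shapes(width, height, conv_layers, pool=2):
    """Return (channels, height, width) after each conv + max-pooling stage."""
    shapes = [(1, height, width)]  # one greyscale input channel
    for filters, kernel in conv_layers:
        # Assumed 'same' padding: the convolution keeps the size,
        # so only the pooling step shrinks the feature map.
        pad = kernel // 2
        height = conv_output_size(height, kernel, padding=pad)
        width = conv_output_size(width, kernel, padding=pad)
        height = conv_output_size(height, pool, stride=pool)
        width = conv_output_size(width, pool, stride=pool)
        shapes.append((filters, height, width))
    return shapes


# Hypothetical configuration: two conv layers with 3x3 kernels.
shapes = trace_shapes(width=200, height=20, conv_layers=[(16, 3), (32, 3)])
channels, h, w = shapes[-1]
flattened = channels * h * w   # input size of the first fully connected layer
output_units = 200             # one estimate per image column, as in the text

print(shapes)     # [(1, 20, 200), (16, 10, 100), (32, 5, 50)]
print(flattened)  # 8000
```

Whatever the hidden configuration, the fully connected part has to map the flattened feature maps back to one estimate per input column.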

Additional segmentation using recurrent neural LSTM-networks

To increase the accuracy of the trained segmentation network, we post-processed the output of the convolutional network with an additional bidirectional recurrent network.

Recurrent neural networks are designed specifically for working with sequences: besides the next element of the sequence, they take a hidden (latent) state as input. In this work we used recurrent neural networks with the Long Short-Term Memory (LSTM) architecture, which have proved successful in sequence analysis tasks such as printed and handwritten text recognition, speech recognition and others. LSTM networks are able to “memorize” the structure of a sequence; here the structure is the average character width and the inter-character distance.

The bidirectional recurrent LSTM network takes as input a sequence built by applying a sliding window of a fixed size (10, for example) to the output vector of probabilistic estimates of the convolutional network. Thus, position i of the recurrent network’s input sequence contains the last 10 outputs of the convolutional network. Bidirectionality means that two unidirectional networks are created: one traverses the sequence from left to right, the other from right to left. Next, the output vectors of both networks that correspond to the same position of the initial sequence are concatenated and passed to the input of a fully connected layer. The fully connected layer is followed by the final layer, which returns the final probabilistic estimates in the same format. It is important to note that the convolutional network outputs used to train the recurrent network are calculated beforehand, so the convolutional network does not change, which accelerates the training of the recurrent network significantly. The figure below shows the architecture of the recurrent network applied to the convolutional network outputs.
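The construction of the recurrent network’s input sequence can be sketched as follows. The zero padding at the start of the line is our assumption for the example; the text above only states that each position holds the last 10 convolutional outputs.

```python
def sliding_windows(estimates, window=10, pad=0.0):
    """Build the input sequence for the recurrent network.

    Element i holds the last `window` convolutional outputs ending at
    column i; positions before the start of the line are filled with `pad`.
    """
    padded = [pad] * (window - 1) + list(estimates)
    return [padded[i:i + window] for i in range(len(estimates))]


conv_out = [0.1, 0.2, 0.3, 0.4]
print(sliding_windows(conv_out, window=3))
# [[0.0, 0.0, 0.1], [0.0, 0.1, 0.2], [0.1, 0.2, 0.3], [0.2, 0.3, 0.4]]

# The right-to-left network reads the same sequence in reverse order.
backward_input = sliding_windows(conv_out, window=3)[::-1]
```

For a 200-column line and a window of 10, this produces a sequence of 200 vectors with 10 values each, so the sequence length matches the number of per-column estimates to be refined.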

After the recurrent network was added at the convolutional network outputs, the final recognition quality increased significantly, as the table at the end of the article demonstrates. The hyperbolic tangent was used as the activation function in the LSTM network as well.

Converting probabilistic estimations into final cuts

To obtain the final cut positions, the probabilistic estimates at the algorithm output have to be converted. Simple threshold filtering will not work here, because the output estimates form clusters of high values around each estimated cut rather than single peaks. Since the neural network approximates a markup that was Gaussian-smoothed around the cuts, local maximum filtering is a simple and robust post-processing method: we keep the local maxima that remain after a cut-off with a low threshold. To remove noise activations, additional Gaussian smoothing can be applied to the estimates; it does not change the positions of strong maxima.
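A minimal sketch of this post-processing step is shown below. The threshold value, the smoothing parameters and the handling of flat-topped peaks are assumptions for illustration, not the settings of the production system.

```python
import math


def smooth(values, sigma):
    """Gaussian smoothing of a 1-D sequence (kernel normalized to sum 1)."""
    radius = max(1, int(3 * sigma))
    kernel = [math.exp(-(d * d) / (2 * sigma * sigma))
              for d in range(-radius, radius + 1)]
    total = sum(kernel)
    out = []
    for i in range(len(values)):
        acc = 0.0
        for k, w in enumerate(kernel):
            j = i + k - radius
            if 0 <= j < len(values):
                acc += values[j] * w
        out.append(acc / total)
    return out


def find_cuts(estimates, threshold=0.1, sigma=0.0):
    """Columns that are local maxima of the estimates and reach `threshold`."""
    values = smooth(estimates, sigma) if sigma > 0 else list(estimates)
    cuts = []
    for i, v in enumerate(values):
        left = values[i - 1] if i > 0 else float("-inf")
        right = values[i + 1] if i + 1 < len(values) else float("-inf")
        # Strict on the left, non-strict on the right,
        # so a flat-topped peak yields a single cut.
        if v >= threshold and v > left and v >= right:
            cuts.append(i)
    return cuts


estimates = [0.0, 0.2, 0.9, 0.3, 0.0, 0.05, 0.0, 0.4, 0.7, 0.7, 0.2]
print(find_cuts(estimates))                               # [2, 8]
print([i for i, v in enumerate(estimates) if v >= 0.1])   # [1, 2, 3, 7, 8, 9, 10]
```

The second line of output shows why a plain threshold is not enough: it returns whole runs of neighbouring columns for each cut, while the local maximum filter returns one position per cluster.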

The local maximum filtering method is simple and gives good results, but we wanted to find out whether the “engineering” approach to converting the estimates could be eliminated entirely. For this purpose we trained one more network, with a fully connected architecture and a modest number of trainable weights, which takes the final probabilistic outputs of the previous network as input and returns estimates of the same kind at its output. What distinguishes this network is that it is trained on the original markup of the cuts, not changed by Gaussian smoothing. The probabilistic estimates at the output of this last network go through simple threshold filtering, with no additional processing, to obtain the final cut positions. The following figure shows examples of the trained segmentation algorithm’s operation with intermediate results.

Red background shows the probabilistic estimates at the convolutional network output, yellow those at the output of the recurrent network that follows. Green areas show the recurrent network estimates that pass the fixed threshold, and blue areas mark the resulting cuts: the filtered values that are local maxima.

Trials and results

The main lines of the Russian passport in the trials were the last name, first name, and patronymic lines. The baseline training dataset consisted of 6000 images and grew to 150000 images after expansion with data synthesis. The test dataset for assessing segmentation with the additional metrics, without recognition, consisted of 630 images. The test dataset for line recognition consisted of 1300 Russian passport images, one image of each line of the document.

The lines produced by segmentation were recognized with the same Russian passport recognition system, which was not adjusted in any way. It should be noted that the neural network used for recognition was trained on characters cropped with the classic “engineering” segmentation tools, and that the cuts produced by the trained segmentation algorithm were not additionally processed or re-cropped in the recognition system. The following table shows the trial results: the final line recognition accuracy, that is, the share of lines recognized correctly in their entirety.

Segmentation algorithm                                    Last name, %    First name, %    Patronymic, %
Convolutional network                                     68.53           76.00            78.30
Recurrent network at the convolutional network outputs    86.23           90.69            91.38


The table shows that adding the recurrent network at the convolutional network outputs in the segmentation subsystem increases the line recognition accuracy significantly. The test images were obtained by locating the passport in frames shot with a mobile device in natural lighting. This explains the imperfect recognition accuracy: such uncontrolled conditions lead to various distortions.

For the experiments with trainable segmentation methods, we used the neural network implementation from the Lasagne package for Python. The trained models were then converted into the internal format of our C++ neural network inference library.


This article covered a method for text line segmentation in printed documents and an exploratory study of it within a recognition system, using the passport of a citizen of the Russian Federation as an example. The method uses machine learning approaches (artificial neural networks) at virtually every stage, which makes the setup for new types of lines and documents fully automatic as long as a training sample is available. This automated setup is what makes the method promising.

As further study of the segmentation method based on machine learning, we plan to conduct experiments on other types of lines and documents, to analyze and classify errors in order to discover new ways of expanding the training datasets, and to profile and optimize the performance of the method on mobile devices, for example by decreasing the number of trainable parameters. Furthermore, we are interested in researching recognition methods for entire words or unsegmented lines using recurrent neural networks.

