10.08.2020 г.

While the global object recognition community trains tens and even hundreds of successful artificial neural network (ANN) architectures that heating up the planet with the powerful video cards and developing a “panacea” for any computer vision problem, we, the Smart Engines team, don’t veer off our research path and keep coming up with new effective ANN architectures to solve the problems. Today we’ll talk about HoughNet – a new approach to finding vanishing points in the images, *which is the part of document recognition*.

Technologies were constantly being developed and improved, the list of necessary basic algorithmic tools was growing as well. Such tools as morphological filtering, efficient calculating of convolution functions, edge detection, color adaptation are considered to be essential for any framework for image processing and image recognition. The Hough transform technique (the discrete Radon transform) is increasingly being utilized as one of those essential tools. The Hough transform (HT) is used for detection of straight objects and their various configurations in the image: road markings detection, finding the document borders, color segmentation, computed tomography, etc. This algorithm proved its high rate of performance at detection and is much less susceptible to coexisting additive coordinate and outlier-born noises than the method of least squares.

Let’s take a look at an arbitrary point (x_{i},y_{i}). There is an infinite number of straight lines that pass through this point that satisfy the equation y_{i}=x_{i}a+b where a and b can have various parameter values. If we rewrite the equation into b=-x_{i}a+y_{i} and review the plane ab, called the parameter space, we’ll get the equation for a unique line in the parameter space for the point (x_{i},y_{i}) in the original image. The most important feature of the equation is as following: if three points lie on the same line in the image, it means that the respective three lines intersect at a single point in the parameter space, and the coordinates of this point clearly define the parameters of the line on which the points lie in the original image. There is one flaw in the parametrization described above: if the desired line is close to vertical, its angular coefficient approaches infinity. For this reason, other ways of line parameterization are used in practice.

As a result of the Hough transform we get a new image, the size of which is defined by the valid values of the line parameters, and the pixel value is a sum of pixel intensity values on a respective line in the original image.

If there were lines in the original image, there will be distinct local maxima in the new image that is the result of the Hough transform. The coordinates of these maxima define the parameters of the corresponding lines in the original image. The following image illustrates the results of the Hough transform for the image with two intersecting lines.

Despite its high practical value, the Hough transform has one serious flaw that often slows down its application: its classic implementation has a high computational complexity – O(*n ^{3}*), where n is a linear dimension of the input image.

However, the Fast Hough Transform (FHT) is often used in practice. FHT is an algorithm that calculates an approximate value of the Hough Transform using the dyadic pattern (DP) for the approximation of the lines in the image. FHT complexity is O(n^{2} log(n)), which is considerably easier from the computational point of view. The description of the FHT algorithm would go way beyond this article’s objectives, but it can be found in the following research work [5], for example. We would like to mention an important characteristic of the FHT algorithm here since we are going to need to be aware of it further when describing the proposed neural network architecture: the FHT algorithm sets four-line quadrants: “predominantly vertical with a negative shift” (we’ll call it H1), “predominantly vertical with a positive shift” (H2), “predominantly horizontal with a negative shift” (H3), “predominantly horizontal with a positive shift” (H4). Also, we’ll be using H12 and H34 for predominantly vertical and predominantly horizontal ones that are a result of joining respective quadrants.

Now we are ready to dive into the principle of vanishing point detection in the image using the Hough transform (to be exact, using the FHT algorithm). A vanishing point is a point in a perspective image where mutually parallel lines appear to converge. A multitude of such lines is called a “pencil” or a “bundle”.

Depending on the specifics of the scene, there might be a different number of vanishing points. For example, when we take a picture of some document, there are normally two vanishing points in the image: the first one is a result of horizontal lines (both real and imaginary lines, for example, top and bottom edges of a document), the second one appears due to vertical lines. Consistent double application of FHT makes it possible to find these vanishing points which can be useful for fixing the perspective errors in the future. Here’s an example. Let’s review a bundle of four lines that converge somewhere far beyond the image border. If we apply FHT to this image, we will get four local maxima in the image H12, each one of those will correspond to a line in the original image. If we apply HFT to the same image again, the image H34 will have only one local maximum, the coordinates of which will clearly determine the coordinates of an original vanishing point. Another point worth mentioning is that the sum for all the lines in the image H12 would be the same which means that it doesn’t make sense to calculate the Hough transform for the horizontal lines. That’s why it’s important to bring down the noise (in this case, we squared the intensity of H12 image).

The second vanishing point (formed by the horizontal lines) can be found using the same procedure. Once we have both vanishing points, we can fix the perspective of an image and as a result of that, we’ll have a rectified document (which is a prerequisite for a high-quality document recognition).

If you made it this far in our article, you are probably wondering why there was no mention of neural networks yet. The algorithm described here seems to be pretty straightforward and does not require any machine learning… You will get an answer to your burning question in the next part!

The whole point of the algorithm for vanishing point detection described earlier is that it can be successfully used only for the images where relevant lines are segmented the same way (for example, as described earlier – lines are represented by white pixels, and everything else is blacked out). But you would come across such a “perfect” image extremely rarely (honestly speaking, you’d never come across such a case). That’s why we need to apply the ”right” filtering before we challenge the image to the Hough transform. The question is how to determine the “right” filtering if we know practically nothing about the image content.

In order to address this issue, we developed a neural network architecture (we named it HoughNet) that consists of two layers which calculate Hough-image, and three groups of convolutional layers. The convolutional layers are responsible for initial filtering, finding local maxima and calculating additional statistics, the Hough transform transforms the coordinates which allows for the neural network to work not with the local pixel intensities, but with integral statistics along with straight objects. We must add that the approach required to train such a neural network is not so trivial (you can find more information about it in [1]).

Layer number | Layer type | Parameters | Activation function |

1 | conv | 12 filters 5×5, stride 1×1, no padding | relu |

2 | conv | 12 filters 5×5, stride 2×2, no padding | relu |

3 | conv | 12 filters 5×5, stride 1×1, no padding | relu |

4 | FHT | H12 for vertical, H34 for horizontal | – |

5 | conv | 12 filters 3×9, stride 1×1, no padding | relu |

6 | conv | 12 filters 3×5, stride 1×1, no padding | relu |

7 | conv | 12 filters 3×9, stride 1×1, no padding | relu |

8 | conv | 12 filters 3×5, stride 1×1, no padding | relu |

9 | FHT | H34 for both branchesg | – |

10 | conv | 16 filters 5×5, stride 3×3, no padding | relu |

11 | conv | 16 filters 5×5, stride 3×3, no padding | relu |

12 | conv | 1 filter 5×5, stride 3×3, no padding | 1 – rbf |

The presented neural network generates two images: one of them is for “vertical” vanishing point detection, and the second one is for the “horizontal” vanishing point detection. The point with the highest brightness value in each image will be what we are looking for.

Let’s proceed to the trials. We trained our network on an MIDV-500. This dataset contains images of various documents that were photographed using mobile devices. There are 50 types of documents in this dataset. A little more than half of these documents were used for training (to be exact, the first 30 types), the rest of them were used to assess quality of work performed by the trained network.

Although we were training on the photographs of documents, we applied the trained network as a preliminary recognition stage on a dataset ICDAR 2011 dewarping contest dataset (which contains 100 black-and-white images of the documents that were produced in different ways and that have various geometric distortions) and assessed the full-text document recognition quality.

It surpassed not only the “raw” recognition (the original image recognition that wasn’t corrected), but the current state-of-the-art as well [7] and [8].

Original image | Algorithm [7] | Algorithm [8] | HoughNet | |

Recognition quality | 31.3% | 49.6% | 50.1% | 59.7% |

[1] Sheshkus A. et al. HoughNet: neural network architecture for vanishing points detection // 2019 International Conference on Document Analysis and Recognition (ICDAR). – 2020. doi: 10.1109/ICDAR.2019.00140.

[2] Алиев М. А., Николаев Д. П., Сараев А. А. Построение быстрых вычислительных схем настройки алгоритма бинаризации Ниблэка // Труды Института системного анализа Российской академии наук. – 2014. – Т. 64. – №. 3. – С. 25-34.

[3] Ершов Е.И. Быстрое преобразование Хафа как инструмент анализа двумерных и трехмерных изображений в задачах поиска прямых и линейной кластеризации: Автореф. … дис. канд. физ.-мат. наук. – М., 2019. – 24 с.

[4] Преобразование Хафа [Электронный ресурс]: Википедия. Свободная энциклопедия. – Режим доступа: https://ru.wikipedia.org/wiki/Преобразование_Хафа/ (дата обращения: 13.03.2020).

[5] Nikolaev D. P., Karpenko S. M., Nikolaev I. P., Nikolayev P. P. Hough Transform: Underestimated Tool in the Computer Vision Field // 22st European Conference on Modelling and Simulation, ECMS 2008. – Nicosia, Cyprys, 2008. – P. 238–243.

[6] Arlazarov V. V. et al. MIDV-500: a dataset for identity document analysis and recognition on mobile devices in video stream // Компьютерная оптика. – 2019. – Т. 43. – №. 5.

[7] Y. Takezawa, M. Hasegawa, and S. Tabbone, “Cameracaptured document image perspective distortion correction using vanishing point detection based on radon transform,” in Pattern Recognition (ICPR), 2016 23rd International Conference on. IEEE, 2016, pp. 3968–3974.

[8] Y. Takezawa, M. Hasegawa, and S. Tabbone, “Robust perspective rectification of camera-captured document images,” in Document Analysis and Recognition (ICDAR), 2017 14th IAPR International Conference on, vol. 6. IEEE, 2017, pp. 27–3

**Send Request**

Please fill out the form to get more information about the products,

pricing and trial SDK for Android, iOS, Linux, Windows.