Some of the document recognition modules developed by our company treat detecting the location of an object in an image or video stream as a priority task. This article is dedicated to one of the solutions we chose for this problem.
First and foremost, we need to decide what data we can rely on. Applications usually specify the types of target documents rather strictly: nobody is going to recognize their passport with an application designed for bank card recognition, or vice versa. Therefore, at least the proportions of the target object are known. Moreover, let’s keep in mind that the vast majority of mobile devices have cameras with a fixed focal distance.
It is not worth detecting just any image of the document, since the detected area must also be recognised. To avoid receiving an input photograph with a small piece of a document skewed somewhere in a corner, some limitations have to be imposed and enforced with UX visualization that helps the user obtain a proper frame configuration at the capturing stage.
In particular, it would be nice to do the following:
1 — the document must be fully visible in the frame
2 — the document should take a fairly large area of the frame
Such conditions can be implemented in different ways. If we directly require that the document occupy no less than X% of the frame area, we risk annoying users: with the naked eye it is difficult to tell whether the document takes up 80% of the frame or, say, only 75%.
However, there are easier alternatives. For instance, it is possible to define an acceptable zone for each side of the document and ask the user to position the camera so that every side of the document lies fully within its designated zone.
For driver’s license scanning, that could look, for example, like this:
Apart from aiding the UX visualization, this approach adds a useful extra constraint on the document position: given the zone sizes, it is possible to calculate the maximum possible inclination of each side relative to the coordinate axes and, as a result, the range of possible deviations of the quadrangle’s corner angles from 90 degrees. Or, vice versa, the zone sizes can be chosen so that these deviations stay within some threshold values.
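To illustrate how zone sizes bound the geometry, here is a minimal sketch; the zone dimensions and the helper name are assumptions for the example, not values from the article:

```python
import math

def max_inclination_deg(zone_length, zone_height):
    """Largest angle (in degrees) a document side can make with the
    coordinate axis while still fitting entirely inside a zone of the
    given length and height."""
    return math.degrees(math.atan2(zone_height, zone_length))

# Example: a zone 300 px long and 40 px high for the top side.
tilt = max_inclination_deg(300, 40)

# Two adjacent sides can each tilt by up to `tilt` degrees, so a corner
# angle can deviate from 90 degrees by at most the sum of the two tilts.
corner_dev = 2 * tilt
```

Shrinking the zones tightens `corner_dev`, which is exactly the trade-off between user convenience and the strength of the geometric constraint.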
From here on we can observe any image processing algorithm maneuvering between Scylla and Charybdis: the speed and the precision/quality of the product.
To start with, we downscale the image to make processing less time-consuming. We found 320 px on the longer side of the image to be the optimum for our use cases.
For the search we only need the image areas corresponding to the four sides of the document. Within them we look for borders that could become the edges of the document.
Further actions will be described for the upper area; for the rest it is essentially the same (apart from a preliminary 90-degree rotation).
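The cutting of the four side areas can be sketched as follows; the band fraction and the function name are assumptions for illustration, not values from the article:

```python
import numpy as np

def side_regions(image, band=0.25):
    """Cut four border bands out of the image, one per document side.
    `band` is the fraction of the image height/width kept per band
    (an assumed value)."""
    h, w = image.shape[:2]
    top    = image[: int(h * band), :]
    bottom = image[int(h * (1 - band)) :, :]
    # Rotate the vertical bands by 90 degrees so that all four regions
    # can be processed by the same horizontal-border search.
    left   = np.rot90(image[:, : int(w * band)])
    right  = np.rot90(image[:, int(w * (1 - band)) :])
    return top, bottom, left, right
```

Since the four regions are processed independently, this is also the natural point at which to parallelize the pipeline.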
In each area we compute the image derivative in every colour channel and then choose the highest value among the channels, thus obtaining a greyscale derivative image as a result.
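This derivative step can be sketched as follows; a simple central-difference operator stands in here for whichever edge operator the production code actually uses, and all names are assumptions:

```python
import numpy as np

def border_map(region):
    """Greyscale derivative of an RGB region: the derivative across the
    expected border direction is taken in every colour channel, and the
    highest absolute value among the channels is kept per pixel."""
    region = region.astype(np.float32)
    # Central difference along the vertical axis (borders in this
    # region are expected to run roughly horizontally).
    dy = np.zeros_like(region)
    dy[1:-1] = (region[2:] - region[:-2]) * 0.5
    # Choose the strongest response among the colour channels.
    return np.abs(dy).max(axis=2)
```

Taking the per-pixel maximum over channels keeps borders that are visible in only one channel, which a plain greyscale conversion would wash out.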
The next step is to extract straight lines using the Fast Hough Transform (FHT)*. The Hough transform converts a border map into an accumulator matrix in which each cell (i, j) corresponds to a parametrically defined straight line; local maxima in this accumulator correspond to the most significant straight lines in the map.
A single best straight line is not enough in many cases: the limitations will not prevent a random table, keyboard, or anything else longer and more contrasting than our document from appearing in the frame. However, simply taking a number of the largest values from the accumulator is of no avail: each straight line in fact creates a certain locality in the accumulator (see the figure below), so several of the highest values will correspond to the same straight line.
Since our main goal is to obtain a number of different straight lines, not to locate a single one with high precision, such an approach is useless.
Thus, it is better to use the following iterative procedure:
Each extracted line is associated with a weight equal to the value of the corresponding Hough accumulator cell.
(*) By the Fast Hough Transform we imply precise and fast implementation, not an approximate transformation.
a) borders map
b) FHT accumulator
Example of FHT operation. (Can you tell which line each peak in the accumulator corresponds to?)
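The iterative extraction of lines described above can be sketched like this; a plain dense accumulator and a fixed square suppression window are assumptions for the example, not details of the article’s FHT implementation:

```python
import numpy as np

def top_lines(accumulator, count, radius=2):
    """Iteratively pick the `count` strongest cells of a Hough
    accumulator.  After each pick the cell's neighbourhood is zeroed,
    so subsequent picks correspond to genuinely different lines rather
    than to the locality of the same maximum."""
    acc = accumulator.astype(np.float64).copy()
    lines = []
    for _ in range(count):
        i, j = np.unravel_index(np.argmax(acc), acc.shape)
        weight = acc[i, j]
        if weight <= 0:
            break
        lines.append(((int(i), int(j)), float(weight)))  # cell index = line parameters
        # Suppress the locality of this maximum.
        acc[max(0, i - radius): i + radius + 1,
            max(0, j - radius): j + radius + 1] = 0
    return lines
```

The suppression radius controls how different two extracted lines must be; too small a radius brings back the duplicate-peak problem, too large a radius may discard a genuinely distinct nearby line.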
After the previous stage we get four sets of lines, one per side of the document. Out of these lines we compose all possible quadrangles by combinatorial enumeration and search for the most probable one. As the initial estimation C of a quadrangle we take the sum of the weights of its lines. All further steps are dedicated to correcting this estimation based on emergent characteristics of the quadrangle.
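The combinatorial enumeration with the initial estimation C can be sketched as follows (the function and variable names are assumptions; in the real algorithm the score is further corrected by the penalties described below):

```python
import itertools

def best_quadrangle(top, bottom, left, right):
    """Enumerate every combination of one candidate line per side and
    keep the combination with the highest initial estimation C: the sum
    of the Hough weights of its four lines.  Each candidate is a
    (line_params, weight) pair as produced by the line search."""
    best, best_score = None, float('-inf')
    for quad in itertools.product(top, bottom, left, right):
        score = sum(weight for _, weight in quad)
        if score > best_score:
            best, best_score = quad, score
    return best, best_score
```

With a handful of lines per side, the number of combinations stays tiny (e.g. 5⁴ = 625), which is why this stage takes about a millisecond in the measurements below.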
If the lines in the bordering areas do not end at their intersection point but continue past it, it is likely not a corner of the sought object but something else. Admittedly, a few pixels of a real corner may ‘crawl’ past the intersection during processing, when the border of the document happens to lie along some line of the background, which is generally a bad sign anyway. Such ‘crawling’ of the lines out of the corner is penalized as:
where p_i is the penalty for the i-th corner, equal to the sum of the intensities of the first n pixels of the border map beyond the intersection point along each of the lines forming the corner.
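A sketch of this corner penalty follows; walking the line with rounded unit steps and the argument names are simplifications assumed for the example:

```python
import numpy as np

def corner_penalty(border, corner, dir_a, dir_b, n=5):
    """Penalty for one corner: the summed border-map intensity of the
    first `n` pixels lying *behind* the intersection point along the
    two lines forming the corner.  `dir_a`/`dir_b` are unit step
    vectors (dy, dx) pointing away from the document along each line."""
    y0, x0 = corner
    penalty = 0.0
    for dy, dx in (dir_a, dir_b):
        for k in range(1, n + 1):
            y, x = int(round(y0 + k * dy)), int(round(x0 + k * dx))
            # Pixels outside the border map contribute nothing.
            if 0 <= y < border.shape[0] and 0 <= x < border.shape[1]:
                penalty += border[y, x]
    return penalty
```

A contrasting background line running straight through the corner keeps producing intensity past the intersection and is penalized, while a genuine document corner yields near-zero values there.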
Example for the upper left corner. The yellow colour marks the penalized areas for the horizontal lines.
Next, for each quadrangle we restore the parallelogram (unique up to a homothety) of which the quadrangle is a projective image, using the known focal distance. We then compare the side ratio of the obtained parallelogram with the known aspect ratio of the target document, and its angles with the 90-degree angles the document should have. The deviation of these parameters is also penalized:
where A, B, α, β are trainable coefficients and T_a, T_r are threshold values.
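The exact penalty formula is not reproduced here; one plausible shape consistent with the listed parameters — thresholded deviations scaled by trainable coefficients, which is purely an illustrative assumption and not the article’s formula — looks like this:

```python
def geometry_penalty(ratio, angle_dev, doc_ratio,
                     A=1.0, B=1.0, alpha=1.0, beta=1.0,
                     Ta=5.0, Tr=0.1):
    """Illustrative (not the article's) penalty: the deviation of the
    restored parallelogram's aspect ratio from the document's known
    ratio, and of its angles from 90 degrees (angle_dev, in degrees),
    are penalized only beyond the thresholds Tr and Ta, scaled by the
    trainable coefficients A, B, alpha, beta."""
    r_dev = max(0.0, abs(ratio - doc_ratio) - Tr)
    a_dev = max(0.0, angle_dev - Ta)
    return A * r_dev ** alpha + B * a_dev ** beta
```

The thresholds make the penalty vanish for small, harmless deviations, so only quadrangles whose restored geometry clearly contradicts the known document shape lose weight.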
That is all. The quadrangle with the highest weight is considered to be the location of the document in the image.
The test set for quality assessment consisted of 6000 photographs with different backgrounds and, correspondingly, 6000 ground-truth quadrangles. The initial resolution of the images varied from 640×480 to 800×600. The test device was an iPhone 4S (the photographs were taken with it and compressed to the given resolutions outside of the algorithm).
To assess the precision of the document location, the following error function was used:
where ∆d_max is the maximum of the distances between corresponding corners of the found quadrangle and the ground-truth quadrangle,
min(s) is the length of the shortest side of the ground-truth quadrangle.
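Under this definition the error can be computed as follows (a sketch; the corner correspondence between the two quadrangles is assumed to be given):

```python
import math

def err(found, truth):
    """Location error: the largest corner-to-corner distance between the
    found quadrangle and the ground truth, normalised by the length of
    the ground truth's shortest side.  Both quadrangles are lists of
    four (x, y) corners in matching order."""
    d_max = max(math.dist(p, q) for p, q in zip(found, truth))
    sides = [math.dist(truth[i], truth[(i + 1) % 4]) for i in range(4)]
    return d_max / min(sides)
```

Normalising by the shortest side makes the metric independent of the image resolution and of how large the document appears in the frame.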
Experimentally, we found that for good character recognition in the detected document, the err value must be below 0.06 when the document sides cover at least 30% of the corresponding frame dimension.
The general quality of algorithm performance was calculated as
With this definition, on the test set of 6000 images the quadrangles were detected with a quality of 98.5%.
The average run time of the algorithm on the iPhone 4S was measured at 0.043 sec (23.2 FPS), of which:
scaling — 0.014 sec,
edge search — 0.023 sec,
line search — 0.005 sec,
combinatorial search — 0.001 sec.
 S. Di Zenzo. A note on the gradient of a multi-image.
 J. Canny. A computational approach to edge detection.
 D.P. Nikolaev et al. Hough transform: underestimated tool in the computer vision field.
 Z. Zhang. Whiteboard scanning and image enhancement.
* The search for borders (and then for straight lines) occurs independently in each region, which means it can be executed in parallel.