Today we’d like to talk about optical recognition again. A simple recognizer model of a cascade classifier will be the main subject of our article. The cascade classifier is used in the popular Viola-Jones framework. The sad part is that despite the vast amount of articles, no one studied cascade classifiers thoroughly enough. Although the cascade classifier seems to be simple, it has many pitfalls that need to be avoided and some interesting features that are worth mentioning. We can’t wait to share our knowledge with you. If you’d like to learn more, keep reading!
The cascade classifier has a very simple model. It consists of a few progressive stages, each of which is presented as a binary classifier. The object is entered at the first stage and then keeps “travelling” through all the consecutive stages. If at some stage the classifier discards the object (classifies it as a negative example), it’s not going to be analyzed by the remaining classifiers of this cascade. This simple model became widely popular after the Viola-Jones object detection framework was published , where, as it was claimed, it was used to provide high performance. Was it the only reason for its use though? Is the cascade classifier really that simple? Let’s figure it out!
On the first page of the original article  there is the following line:
«The third major contribution of this paper is a method for combining successively more complex classifiers in a cascade structure which dramatically increases the speed of the detector by focussing attention on promising regions of the image.»
Indeed, the original Viola-Jones framework was designed for face detection in an image. And the detection problem is solved using the sliding window technique with the help of a binary classifier. The classifier is applied to each point of an image at different scales. The imbalanced nature of data during the detection stage (in each reviewed image there will be millions and even billions times more of the “empty” areas without the object in question than the areas containing the object) led to the use of a cascade — an approach that allows to quickly discard the “empty” areas.
That’s what happens if we have an already trained classifier. Let’s talk about how the classifier is trained. It turns out there is the same problem of imbalanced data here: the number of negative images is much higher (millions and even billions times higher) than the number of positive samples. Due to its architecture, each new stage of the cascade is trained using the AdaBoost method not with all the negative samples, but only with a limited set of errors from the previous stages of the cascade! This allows to run the AdaBoost algorithm on a balanced and limited selection of data!
As you can see, the use of a cascade classifier in the Viola-Jones method has two major advantages:
1. It makes it possible to train the classifier easily without having to deal with the problem of an “infinite” training dataset.
2. It ensures rapid discarding of the “empty” areas during the detection process, which leads to a higher performance rate, on average.
Let’s continue reviewing the classic cascade and address the question of the performance.
Let’s return to the subject of the cascade purpose, but we’ll review it from a different angle this time. If we look at the cascading classifier from the mathematical point of view, we can see that a cascade is a conjunctive form of strong classifiers (each of which is presented as a linear combination of features):
where is an indicator function.
When we have a limited number of available features (which is a normal occurrence in practice when we are high performance-oriented), the conjunctive form of strong classifiers possesses greater expressive ability than a single linear classifier. It’s easy to see if we give a simple example of a feature space that consists of two features, and the positive and negative objects, represented in a coordinate space of features, are located the way it’s shown in the image below ( the green ones indicate the positive objects, the red ones — the negative ones). It is obvious that there is no single linear classifier that would be able to divide this sample set correctly. However, it is guaranteed that a four-stage cascade classifier would solve this problem.
So by using cascade classifiers, we not only get a higher performance, but we also get a greater expressive power than when using a single linear classifier with a limited number of features.
As we’ve mentioned earlier, the cascade framework delivers high performance through the fast process of elimination of “empty” regions since their number is way higher than the number of the regions containing the object. The processing time of an “empty” region is several times less than the processing time necessary for a region containing an object (it’s proportional to the cascade length — the number of the stages in the cascade).
Since the number of the regions containing an object varies from image to image, the processing time for each image varies as well. Due to the fact that there are significantly fewer regions containing an object than the regions without an object in any given image, the processing time difference is measured not by dozens of times, but by a double-digit percentage, which is still sizable in the industrial optical recognition systems.
So now we know that the cascade classifier operating time can vary significantly for different images. It means that when we assess the classifier performance, we have to assess the operating time on average and in the worst case. And we have to always be prepared for these “time inconsistencies” when using the cascade classifiers.
In our practice we’ve already dealt with serious issues due to the significant difference in the operating time of the cascade classifier in the average case and in the worst case. As a part of the toll road automation project, we were working on a car type recognition problem, and one of our main subtasks was to count the car’s wheel sets. Of course, in some images we could apply the Viola-Jones method to detect wheels. Due to the huge variety of wheels (see figure below), the trained cascade had to be quite long (20 stages). We have seen the problems happening because of different processing times for each image first-hand, and it definitely interfered with our attempts to build a real-time optical recognition system.
That’s when we developed the concept of a classic cascade classifier into a full decision tree concept, creating the unique method to train such decision trees (we remembered that we needed to propose an algorithm that would allow us to avoid the issue of the “infinite” training dataset). The details of this algorithm can be found in the paper . We’ll just mention a few facts here:
1. The longest path of the trained decision tree consists of 6 strong classifiers (the scheme of a trained decision tree-like classifier is demonstrated in the image below).
2. The trained decision tree-like classifier provided better performance than a trained classifier mentioned earlier. It’s logical and can be explained by the fact that the expressive power of a decision tree-like classifier (which is presented as a conjunctive-disjunctive form) is higher than the expressive power of a cascade classifier (which is presented as a conjunctive form).
3. The trained decision tree-like classifier significantly outperformed the cascade classifier in the worst case, and barely ever lost in the average cases.
There is a quantitative comparison of a cascade classifier and a decision tree-like classifier presented in the table below.
|Sensitivity||Specificity||Time in the worst case, microseconds||Time in an average case, microseconds|
|Decision tree-like classifier||94,02%||99,99%||58717||63552|
If you want to use the cascade classifiers in real-time recognition systems, you have to keep in mind the specificities of their performance in the average case and in the worst case.
It’s probably one of the most complex issues related to the cascade classifiers. The thing is, in all the articles that we came across, the training process of the cascade is described so poorly and briefly and without a proper justification of the training algorithm efficiency. As a rule, the algorithm for training a cascade is presented like this:
1. Determine the values of the false recognition rate () for the entire cascade.
2. Determine the values of the correct recognition rate and the false recognition rate for each recognition stage.
3. Determine the validation set for a fair evaluation of the qualitative indicators of the final classifier.
4. Train each new stage of the cascade (let’s remind that it is trained on all existing positive examples and on the false-positive errors of the current cascade) that way so its indicators and are not lower than the preset thresholds, i.e. and . By the way, the process of securing these indicators provokes some interest as well.
5. Add a newly trained stage to the cascade and evaluate its qualitative indicators using the validation set. If the false recognition rate is lower than the target indicator , we can stop the training process. Otherwise, we return to step 4 to train a new stage of the cascade.
If we trained stages of the cascade using the described algorithm, we can evaluate the average correct recognition rate of the cascade like this:
where is the -th stage complexity, is the probability of the -th stage computation, and is the correct recognition rate at the -th stage of the cascade.
As you can see the cascade complexity doesn’t matter for the training algorithm we have been describing. That’s why we can’t consider it to be efficient performance-wise. At the same time we know that the scientists all around the world strongly believe that the algorithms for effective cascade training are extremely important. In order to demonstrate it, we’ll quote from the article written by Paul Viola and Michael Jones :
The overall training process involves two types of tradeoffs. In most cases classifiers with more features will achieve higher detection rates and lower false positive rates. At the same time classifiers with more features require more time to compute. In principle one could define an optimization framework in which
– the number of classifier stages,
– the number of features, , of each stage,
– the threshold of each stage
are traded off in order to minimize the expected number of features given a target for and . Unfortunately finding this optimum is a tremendously difficult problem.
It’s worth mentioning that the review of the up-to-date literature we made in 2016, showed that no effective algorithms for cascade classifier training were created yet. By the way, that was when we were trying to solve this problem at Smart Engines. We studied the following functional that depends on Type I detection errors , on Type II detection errors , and on the errors of average complexity :
The choice of parameters , , and sets the relative cost of Type I and Type II errors and the produced detector complexity. As the next step of the training process of the cascade classifier, we used a greedy brute force search algorithm at each stage in order to get the cascades that maximize the selected functional. An in-depth description of this algorithm would be beyond the scope of this article, but we are always ready to share a helpful bibliographic reference  so you could learn more, our dear reader.
To sum up everything we talked about in this article, we want to draw the following conclusions about the cascade classifier model:
1. There are very few scientific research papers that thoroughly cover all the necessary details.
2. It proves to be very useful to re-read scientific papers and analyze the way they present the model performance, its characteristics and its limitations, even when everything seems to be broken down at first glance.
3. There is always room for a quality scientific research, even if the model was proposed 20 years ago.
 Paul Viola Michael J. Jones. Robust real-time object detection // International journal of computer vision. – 2001.
 Bourdev L., Brandt J. Robust object detection via soft cascade //2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’05). – IEEE, 2005. – Т. 2. – С. 236-243.
 Minkina A. et al. Generalization of the Viola-Jones method as a decision tree of strong classifiers for real-time object recognition in video stream //Seventh International Conference on Machine Vision (ICMV 2014). – International Society for Optics and Photonics, 2015. – Vol. 9445. – P. 944517.
 Paul Viola Michael J. Jones. Robust real-time face detection // International journal of computer vision. – 2004. – Vol. 57. – No. 2. – P. 137-154.
 Polyakov I. V., Kuznetsova E. G., Usilin S. A., Nikolaev D. P. Postroenie optimalnykh kaskadov violy–dzhonsa pri pomoshchi “zhadnykh” algoritmov perebora upravlyayushchikh parametrov s promezhutochnym kontrolem po validatsionnoi vyborke [Training optimal viola–jones detectors using greedy algorithms for selecting control parameters with intermediate validation on each level]. Sensornye sistemy [Sensory systems]. 2016. V. 30(3). P. 241-248 (in Russian).
Recognition of ID cards, passports, driver’s licenses, residence permits, visas, and more. Works on a mobile phone or server, on photos and scans, regardless of their quality, as well as in the video stream from a smartphone or web camera, robust to capturing conditions. No data transfer - scanning is performed on-device and on-premise.
Recognition of data from codified objects. Captures machine-readable zones (MRZ), embossed, indent-printed, and free-template bank cards, PDF417, QR code, AZTEC and other linear and 2D barcodes using a smartphone’s camera, on the fly. Works in mobile applications (on-device) and scans photographs, regardless of lighting conditions.
Automatic extraction of data from documents (KYC questionnaires, applications, tests, etc), administrative papers (accounting documents, corporate reports, business forms), and government forms (financial statements, insurance policies, etc). Recognizes scans and photographs taken in natural conditions. Total security: only on-premise installation.
Green AI for Tomographic reconstruction and visualization. Algorithmization of the image reconstruction process directly during the of X-ray tomographic scanning process. We aspire to reduce the radiation dose received during the exposure by finding the optimal termination point of scanning.
Please fill out the form to get more information about the products,
pricing and trial SDK for Android, iOS, Linux, Windows.