While other companies are busy discussing staff management problems when working from home, we at Smart Engines keep sharing our technology stack with our readers. Today we’ll talk about the optimization of neural networks. It is quite challenging to build a neural-network-based recognition system that runs efficiently on smartphones and other mobile devices, and it is an even greater challenge to make that system deliver high-quality recognition results. In this article we describe a simple regularization method for neural networks that we use to improve the quality of “mobile” networks with a small number of parameters. The method is based on gradually reducing the linear dependence between the filters of the convolutional layers during training. As a result, each neuron works more effectively and the generalization performance of the model improves. To do that, we represent the filters as one-dimensional vectors and orthogonalize the pair with the largest projection length onto each other.
Most present-day neural networks are designed on the assumption that they will run remotely, on a server, and that the data to be processed will arrive from a PC or a mobile device. However, this approach is not acceptable when we deal with personal data that we don’t want to send anywhere (for example, pictures of passports or bank cards submitted for recognition). Fortunately, modern mobile devices are powerful enough to run neural networks locally, which avoids passing this information to any third party. The catch is that such networks have to be compact and consist of only a small number of operations so as not to try users’ patience. These requirements limit the quality the networks can reach, and the question of how to improve these “simple” networks without affecting their speed remains open. When we ran into this problem, we designed a new regularization method for neural networks that is aimed at compact networks and consists in orthogonalizing the convolution filters.
This article is a short version of the paper «Convolutional neural network weights regularization via orthogonalization» which was presented in November, 2019 at the international conference ICMV 2019 in Amsterdam, the Netherlands.
Since the suggested method is a form of regularization, let us first recall what regularization is. It consists in imposing certain constraints on a model, based on our understanding of how the problem should be solved, and it improves the generalization performance of the network. For instance, L1 regularization pushes weight values towards zero and makes the network sparser, L2 regularization keeps the weight values small, dropout removes dependencies between individual neurons, and so on. These methods have become an integral part of the training process of many state-of-the-art networks, especially those with a large number of parameters, since regularization is quite effective at combating overfitting.
Now let’s go back to our method. We’d like to make it clear right away that we are considering, first and foremost, the image classification problem solved with a convolutional neural network. The assumption that led us to orthogonalization is the following: if the network is extremely limited in the resources it has for recognizing patterns in the data, we need each neuron to work as efficiently as possible and to perform only its strictly designated function. To put it another way, each neuron should “notice” features that other neurons cannot detect. We solve this problem by reducing the linear dependence between the weight vectors of the neurons during training. To do that, we modified the classic orthogonalization algorithm, adapting it to the realities of the training process.
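As a quick reminder of what the L1 and L2 penalties look like in practice, here is a minimal numpy sketch. The function names and the `strength` parameter are illustrative assumptions and are not taken from any particular framework.

```python
import numpy as np


def l1_penalty(weights, strength):
    """L1 term added to the loss: strength * sum(|w|). Pushes weights to zero."""
    return strength * np.sum(np.abs(weights))


def l2_penalty(weights, strength):
    """L2 term added to the loss: strength * sum(w^2). Keeps weights small."""
    return strength * np.sum(np.square(weights))


# Hypothetical weights, chosen only to illustrate the arithmetic.
weights = np.array([0.5, -2.0, 0.0])
l1_value = l1_penalty(weights, strength=0.1)  # 0.1 * (0.5 + 2.0 + 0.0)
l2_value = l2_penalty(weights, strength=0.1)  # 0.1 * (0.25 + 4.0 + 0.0)
```

Note that the L2 term punishes the single large weight far more strongly than the L1 term does, which is why it keeps individual values small rather than exactly zero.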
Let’s define the filters of a convolutional layer as a set of vectors $\{f_i^c\}_{i=1}^{N}$, where $c$ is the convolutional layer index and $N$ is the number of filters in it. After the weights have been updated during backpropagation, we find, in each individual convolutional layer, the pair of vectors with the maximum length of projection onto each other:
$$(a, b) = \arg\max_{i \neq j} \left\| \mathrm{proj}_{f_j^c} f_i^c \right\|$$
The projection of the vector $f_a^c$ onto the vector $f_b^c$ is calculated with the following formula: $\mathrm{proj}_{f_b^c} f_a^c = \frac{\langle f_a^c, f_b^c \rangle}{\langle f_b^c, f_b^c \rangle} f_b^c$. Now, in order to orthogonalize the filters $f_a^c$ and $f_b^c$, we replace the first step of the Gram-Schmidt algorithm:
$$f_a^c = f_a^c - \mathrm{proj}_{f_b^c} f_a^c$$
with the following formula:
$$f_a^c = f_a^c - \eta \, w_{ort} \, \mathrm{proj}_{f_b^c} f_a^c$$
where $\eta$ is the learning rate and $w_{ort}$ is the orthogonalization coefficient with values in the range $[0, 1]$. The orthogonalization coefficient is introduced because “instant” orthogonalization of the filters harms the training process by wiping out the systematic changes in the weights accumulated over the previous iterations. Small values of $w_{ort}$ preserve the training dynamics and provide a smooth, gradual reduction of the linear dependence between the filters in each individual layer. We’d like to emphasize the critical point of this method once more: only one vector is modified per iteration, so that we don’t disrupt the optimization algorithm.
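One step of the procedure described above can be sketched in numpy roughly as follows. This is an illustration and not the code from the paper: the function name `orthogonalization_step`, the assumption that the filters of one layer arrive as an `(N, D)` matrix of flattened vectors, and the small epsilon guarding against division by zero are all ours.

```python
import numpy as np


def orthogonalization_step(filters, lr, w_ort):
    """Apply one softened Gram-Schmidt step to a single layer.

    filters: array of shape (N, D), one flattened filter per row (assumed layout).
    lr:      learning rate of the optimizer.
    w_ort:   orthogonalization coefficient.
    Returns a copy of `filters` in which exactly one row has been changed.
    """
    filters = np.asarray(filters, dtype=float)
    eps = 1e-12  # assumption: guards against zero-length filters

    # gram[i, j] = <f_i, f_j>
    gram = filters @ filters.T
    norms = np.sqrt(np.diag(gram))

    # proj_len[i, j] = length of the projection of f_i onto f_j
    proj_len = np.abs(gram) / (norms[None, :] + eps)
    np.fill_diagonal(proj_len, -np.inf)  # never pair a filter with itself

    # Pair (a, b) with the longest projection of f_a onto f_b.
    a, b = np.unravel_index(np.argmax(proj_len), proj_len.shape)

    # Projection of f_a onto f_b, scaled down by lr * w_ort.
    projection = gram[a, b] / (gram[b, b] + eps) * filters[b]
    updated = filters.copy()
    updated[a] -= lr * w_ort * projection
    return updated


# Usage on two toy 2-D "filters": only the second row is moved,
# and its dot product with the first row shrinks from 1.0 to 0.95.
toy = np.array([[1.0, 0.0], [1.0, 1.0]])
toy_after = orthogonalization_step(toy, lr=0.1, w_ort=0.5)
```

With `lr * w_ort` equal to 1 the step reduces to the ordinary first Gram-Schmidt step and the chosen pair becomes orthogonal at once, which is exactly the “instant” behavior the coefficient is meant to avoid.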
Figure. Visualization of an iteration.
We consider the orthogonalization of convolutional filters only, since convolutional layers make up the major part of present-day neural network architectures. However, the algorithm generalizes easily to the neuron weights of fully connected layers.
Let’s jump from theory to practice now. For our experiments we chose the two most popular datasets used to evaluate neural networks in computer vision: MNIST (classification of images of handwritten digits) and CIFAR10 (photographs of 10 classes: boats, trucks, horses, etc.).
Since we believe that orthogonalization is useful primarily for compact networks, we chose a LeNet-like architecture in three modifications that differ in the number of filters in the convolutional layers. The architecture of our network (we’ll call it LeNet 1.0 for easier reference) is shown in Table 1. The LeNet 2.0 and LeNet 3.5 architectures are derived from LeNet 1.0 and have 2 and 3.5 times more filters in the convolutional layers, respectively.
When choosing the activation function, we settled on ReLU, and not only because it is the most popular and computationally efficient function (we’d like to remind you that we are still discussing fast networks). The point is that activation functions which are not piecewise linear nullify the orthogonalization effect: for instance, the hyperbolic tangent heavily distorts the input vectors because it is strongly non-linear in the near-saturation regions.
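Treating filters or neurons as vectors comes down to a reshape. The sketch below assumes that the first axis of the weight tensor indexes the filters (for a convolution) or the output neurons (for a fully connected layer); with a different layout the axes would have to be transposed first. The helper names are hypothetical.

```python
import numpy as np


def weights_as_vectors(weights):
    """Flatten every filter or neuron into one row: (N, ...) -> (N, D)."""
    weights = np.asarray(weights, dtype=float)
    return weights.reshape(weights.shape[0], -1)


def vectors_as_weights(vectors, shape):
    """Restore the original tensor shape after the vectors were modified."""
    return np.asarray(vectors).reshape(shape)


# A convolutional layer with 4 filters over 3 input channels and 5x5 kernels.
conv_kernel = np.random.default_rng(0).normal(size=(4, 3, 5, 5))
conv_vectors = weights_as_vectors(conv_kernel)    # shape (4, 75)

# A fully connected layer with 10 output neurons and 32 inputs.
dense_weight = np.random.default_rng(1).normal(size=(10, 32))
dense_vectors = weights_as_vectors(dense_weight)  # shape (10, 32), unchanged

restored = vectors_as_weights(conv_vectors, conv_kernel.shape)
```

Once the weights are in this `(N, D)` form, the same pair search and projection update apply to both layer types without change.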
Table 1. The LeNet 1.0 network architecture used in the experiments.
We tried 3 different values of the orthogonalization coefficient wort: 0.01, 0.05 and 0.1. Every experiment was run 10 times and the results were averaged (the standard deviation (std) of the error rate is shown in the results tables). We also calculated the percentage by which the number of errors dropped (the benefit).
The results of the experiments confirmed that the fewer parameters there are in the network, the larger the improvement from orthogonalization (Tables 2 and 3). Another interesting result is that applying orthogonalization to “complicated” networks leads to a decline in quality.
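The bookkeeping behind such a comparison is straightforward. The sketch below assumes that the benefit is the relative reduction of the mean error rate and that the standard deviation is the population one (numpy’s default); the article does not spell out either detail. The error rates in the example are made-up placeholders, not results from our experiments.

```python
import numpy as np


def summarize_runs(error_rates):
    """Return (mean, std) of the error rates collected over repeated runs."""
    errors = np.asarray(error_rates, dtype=float)
    return errors.mean(), errors.std()


def benefit_percent(baseline_error, regularized_error):
    """Relative reduction of the error rate, in percent of the baseline."""
    return 100.0 * (baseline_error - regularized_error) / baseline_error


# Placeholder numbers for illustration only.
baseline_runs = [2.1, 1.9, 2.0]
regularized_runs = [1.6, 1.4, 1.5]

baseline_mean, baseline_std = summarize_runs(baseline_runs)
regularized_mean, regularized_std = summarize_runs(regularized_runs)
benefit = benefit_percent(baseline_mean, regularized_mean)
```

A negative benefit means that the regularized network made more errors than the baseline.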
Table 2. The results of the experiments with MNIST
Table 3. The results of the experiments with CIFAR10
But LeNet networks are rarely used nowadays and, as a rule, more modern models are preferred. That’s why we also ran our experiments on a ResNet model with a reduced number of filters, consisting of 25 convolutional layers. Each of the first 7 layers contained 4 filters, the next 12 layers had 8 filters each, and the last 6 had 16 filters each. The total number of trainable parameters was 21 thousand. The result was the same: orthogonalization improves the quality of the network.
Figure. Comparison of the training dynamics of ResNet on the MNIST database with orthogonalization and without it.
Despite the achieved improvement in quality, to make sure that the suggested method works as intended we still need to check what happened to the filters. For this purpose we recorded the maximum projection length of the filters in the 2nd, 12th and 25th layers of ResNet throughout training. The main takeaway here is that the linear dependence between the filters decreases in all of the layers.
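The quantity tracked here can be computed for a single layer along these lines. This is a hedged sketch: the function name is hypothetical, the filters are assumed to be flattened into an `(N, D)` matrix, and the epsilon is our own safeguard against zero-length filters.

```python
import numpy as np


def max_projection_length(filters):
    """Largest length of the projection of one filter onto another.

    filters: array of shape (N, D) with N >= 2, one flattened filter per row.
    """
    filters = np.asarray(filters, dtype=float)
    gram = filters @ filters.T                # gram[i, j] = <f_i, f_j>
    norms = np.sqrt(np.diag(gram))
    proj_len = np.abs(gram) / (norms[None, :] + 1e-12)
    np.fill_diagonal(proj_len, 0.0)           # skip self-projections
    return proj_len.max()


# Mutually orthogonal filters give zero, overlapping ones do not.
orthogonal_value = max_projection_length(np.eye(3))
overlapping_value = max_projection_length([[1.0, 0.0], [1.0, 1.0]])
```

Logging this value for a few layers after every epoch gives curves of the kind shown in the figure below.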
Figure. The dynamics of changes of the maximum projected length of the filters in a convolutional layer using an example of ResNet.
Regularization via orthogonalization is incredibly simple to implement: it takes fewer than 10 lines of Python code using the numpy module. At the same time, it doesn’t slow down the training process and is compatible with other regularization methods.
Despite its simplicity, orthogonalization improves the quality of “lightweight” networks that are subject to restrictions on size and speed. With the spread of mobile technologies, such restrictions are becoming more and more common: the neural network is supposed to run not somewhere in the cloud, but directly on a device with a slow processor and little available memory. Training such networks goes against the latest trends in neural network research, where ensembles of models with millions of trainable parameters are used that no existing smartphone could handle. That is why, for industrial applications, it is extremely important to create and develop new methods for improving the quality of simple and fast networks.
Alexander V. Gayer, Alexander V. Sheshkus, «Convolutional neural network weights regularization via orthogonalization,» Proc. SPIE 11433, Twelfth International Conference on Machine Vision (ICMV 2019), 1143326 (31 January 2020); https://doi.org/10.1117/12.2559346