04.03.2021

Development and testing of the Smart Tomo Engine program for tomographic reconstruction using the Elbrus platform

This article is about two of our favorite subjects: computed tomography (CT) and the Russian-made Elbrus processor. We’ll explain how a CT scan differs from an ordinary X-ray image and why a machine as large and serious as a tomograph benefits from a specialized computer. Although tomographs have been in use for almost 50 years (the invention of the first tomograph was announced in 1972 [1]), the field of computed tomography still has plenty of open problems. There is strong demand for reconstruction algorithms that are faster and more accurate than the existing ones and that make it possible to reduce the radiation dose an object receives. That, in turn, would significantly widen the range of applications of CT. With that in mind, we developed the Smart Tomo Engine software, which we describe in more detail below. We’ve already written about suppressing orthotropic artifacts and estimating the cupping effect. In this article we describe tests performed on a synthetic dataset and on real tomographic data collected with a Russian tomograph, and we demonstrate how our software runs on the new-generation Elbrus processor (a video is included below). Thanks to the program, we’ll also catch a glimpse of the inner world of a maybug, and “inner world” should be taken literally here.

Roentgenography (radiography) is a widely used non-invasive diagnostic technique in which an image of an object is formed using x-ray radiation. To get an image, the object is placed between an x-ray source and a detector (see Figure 1, on the left). The detector can be either an x-ray sensitive film or a position-sensitive detector. The image is formed by the radiation that has passed through the object and been attenuated along the way. Different materials attenuate x-ray radiation differently, which produces contrast in the image. By registering the radiation that has passed through the object, we can draw conclusions about its local composition. An example of a chest x-ray is shown in Figure 1, on the right. The light areas mark regions of higher absorption. Looking at the ribs (light curved slats) that bound the ribcage (the dark cavity with the lighter areas of the bronchial tree), we can see a small light nodule to the right of the upper part of the sternum (the light column in the center).

Figure 1. Roentgenography: the basic scheme (on the left); the roentgenography result – a radiogram (on the right).

Roentgenography doesn’t make it clear how deep the problem area is located – right on the sternum, in front of it, or behind it. It’s hard to analyze the fine spatial structure of the problem area and to determine its overall shape based on just one projection. Figure 2 illustrates this concern.

Figure 2.

The CT method helps to determine the shape and the internal structure. Just like with roentgenography, in order to collect the data, the object is placed between an x-ray generator and a detector, but in this case the machine registers a set of x-rays taken at different angles. Rotation angles are usually evenly distributed at a certain interval. The basic scheme of its operation is illustrated in Figure 3.

Figure 3. The basic scheme of tomograph operation

The images at different angles are gathered by a special device called a tomograph. Since tomograms can be obtained for a wide range of animate and inanimate objects, and studies are conducted at both the micro- and macro-levels, there are many kinds of tomographs. They differ in their scanning patterns (layer-by-layer circular scanning, helical scanning, etc.), in the types of x-ray generators used, and in the geometry of the probing beam (cone, parallel, microfocus). In very general terms, a tomograph is a machine that consists of a radiation generator, an object holder, and a detector. Any of these parts can be movable, which makes it possible to change the angle in a controlled manner. An integral part of a modern tomograph is a computer that not only manages the acquisition of x-ray images, but also processes the collected data using specialized software.

A wide range of technical solutions exists for analyzing objects of different kinds. For instance, in medical examinations a gantry (a moving structure that contains the detector and the x-ray source) (Figure 4) rotates around a patient lying in a fixed position. The spatial resolution of these tomographs reaches 0.2 – 0.5 mm. The CT results are stored in the DICOM file format, a medical industry standard developed for creating, storing, and transmitting digital medical images and related patient documents.

Figure 4. The scheme of a medical tomograph

For in vitro research in the laboratory, a different experimental scheme is appropriate. In this case the source and the detector are stationary, and the set of x-ray images is generated by rotating the sample. A whole family of laboratory x-ray microtomographs has been built and is operating at the Laboratory of Reflectometry and Low-angle Scattering of the Federal Scientific Research Center “Crystallography and Photonics” of the Russian Academy of Sciences. One of these devices is shown in Figure 5. In this tomograph the sample is placed on a goniometer whose axis is perpendicular to the direction of probing. The device is equipped with a two-dimensional detector. The pixel size is 9 µm, and the field of view of the detector is 24 by 36 mm. The machine can use both polychromatic and monochromatic radiation for probing. That allows not only for higher quality of the reconstructed images, but also for gathering additional data about the elemental composition of the objects under study. Building their own tomographs gives the researchers access to the experimental data (x-ray images) and to the operation of every unit of the device, which, in turn, allows the measurement protocols to be optimized for the objectives at hand.

Figure 5. The image of a laboratory tomograph at the Federal Scientific Research Center “Crystallography and Photonics” of the Russian Academy of Sciences.

After the x-ray images have been registered at different angles, i.e. once a full set of projections has been gathered, they have to be processed. The ultimate goal of processing is to reconstruct the internal morphological structure of the object. Since the contrast in a registered image arises because each material attenuates x-ray radiation in its own way, the reconstruction result is the spatial distribution of the attenuation coefficient of the probing radiation. The morphological structure of the objects examined with a tomograph is characterized on the basis of this spatial distribution.

If probing is conducted using a parallel beam, the three-dimensional reconstruction problem can be solved by recovering a set of two-dimensional cross-sections of the object. Reconstructing a single cross-section does not require the whole set of projections. All we need is one line with a fixed row number from each angular view. All these lines correspond to one horizontal cross-section of the 3D distribution being reconstructed, and we can assign the same number to that cross-section. The image built out of such lines, called a sinogram, is shown in Figure 6, on the left. The horizontal axis corresponds to the detector column number, the vertical axis to the rotation angle number. The result of reconstructing the cross-section is shown in Figure 6, on the right.
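The way a sinogram is assembled from the projections can be sketched in a few lines of NumPy. This is only an illustration: the function name `extract_sinogram` and the array layout (angle, detector row, detector column) are our assumptions and are not taken from the Smart Tomo Engine API.

```python
import numpy as np

def extract_sinogram(projections: np.ndarray, row: int) -> np.ndarray:
    """Collect detector row `row` from every projection.

    projections: array of shape (n_angles, n_rows, n_columns) (assumed layout).
    Returns an array of shape (n_angles, n_columns): the vertical axis is the
    rotation-angle index, the horizontal axis is the detector column.
    """
    return projections[:, row, :]

# Tiny synthetic stack: 4 angles, 3 detector rows, 5 detector columns.
projections = np.arange(4 * 3 * 5, dtype=np.float32).reshape(4, 3, 5)
sinogram = extract_sinogram(projections, row=1)
print(sinogram.shape)
```

Each detector row yields its own sinogram, so a stack of projections with N rows produces N independent two-dimensional reconstruction problems.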

Figure 6. The chest sinogram (on the left); the CT result – the cross-section of a 3D image (on the right).

If monochromatic x-ray radiation is used for tomographic probing, then, according to the Beer-Lambert-Bouguer law, the reconstruction problem can be reduced to inverting the Radon transform.
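As a reminder of why this reduction works, here is the standard form of the law for a single ray. The symbols $I_0$ (intensity before the object), $I$ (registered intensity), $\mu(x, y)$ (linear attenuation coefficient) and $L$ (the path of the ray) are notation introduced only for this illustration:

$$
I = I_0 \exp\left(-\int_L \mu(x, y)\, dl\right)
\quad\Longrightarrow\quad
-\ln\frac{I}{I_0} = \int_L \mu(x, y)\, dl
$$

Taking the negative logarithm of the normalized measurement turns every detector reading into a line integral of $\mu$, i.e. into one sample of the Radon transform of the attenuation coefficient.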

The Radon transform is an integral transform that maps a function to the values of its integrals along every possible straight line. Inverting it means reconstructing an unknown function from the known values of its integrals along straight lines. The integrand to be reconstructed is the distribution of the linear attenuation coefficient of the monochromatic x-ray radiation throughout the sample. The invertibility of the Radon transform guarantees accurate reconstruction of an unknown band-limited function, provided there is a sufficient number of integrals along regularly positioned straight lines. This property underlies the convolution back projection, or filtered back projection, algorithm, which is implemented in most present-day mass-produced tomographs. It consists of two steps. The first step is linear filtration of the registered images. The second step is back projection, i.e. uniformly “smearing” each one-dimensional function obtained at the previous step over the whole two-dimensional image along the corresponding direction, followed by summation. The result of the algorithm is the reconstructed spatial distribution of the linear attenuation coefficient for x-ray radiation of a given energy. If probing is performed with a cone beam rather than a parallel beam, layer-by-layer reconstruction is not possible, and more complex algorithms are required. We’ll review three-dimensional reconstruction algorithms, such as the Feldkamp algorithm, some other time. Now, let’s move on to our software.
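The two steps described above can be sketched as follows. This is a textbook NumPy formulation, not the Smart Tomo Engine implementation. It assumes a parallel beam, angles that uniformly cover 180 degrees, unit detector pitch, a rotation axis at the detector center, and a plain ramp filter without windowing or zero padding. The function name and the test object (a uniform disk whose projection is known analytically) are ours.

```python
import numpy as np

def filtered_back_projection(sinogram, angles_deg):
    """Reconstruct one square cross-section from a parallel-beam sinogram.

    sinogram: array of shape (n_angles, n_detector_columns)
    angles_deg: rotation angles in degrees (assumed to cover 180 degrees uniformly)
    """
    n_angles, n_det = sinogram.shape

    # Step 1: linear filtration. Multiply the spectrum of every row by |f|.
    ramp = np.abs(np.fft.fftfreq(n_det))
    filtered = np.real(np.fft.ifft(np.fft.fft(sinogram, axis=1) * ramp, axis=1))

    # Step 2: back projection. Smear every filtered row over the image
    # along its own direction and sum the contributions.
    center = (n_det - 1) / 2.0
    coords = np.arange(n_det) - center
    x, y = np.meshgrid(coords, coords)
    detector_positions = np.arange(n_det)
    image = np.zeros((n_det, n_det))
    for row, theta in zip(filtered, np.deg2rad(angles_deg)):
        # Detector coordinate hit by the ray passing through each pixel.
        t = x * np.cos(theta) + y * np.sin(theta) + center
        image += np.interp(t, detector_positions, row, left=0.0, right=0.0)
    return image * np.pi / n_angles

# Usage: a uniform disk of radius 15 and density 1 has the same
# projection 2 * sqrt(R^2 - t^2) at every angle.
n_det, radius = 65, 15.0
t = np.arange(n_det) - (n_det - 1) / 2.0
disk_projection = 2.0 * np.sqrt(np.clip(radius**2 - t**2, 0.0, None))
angles = np.arange(0.0, 180.0, 2.0)  # 90 angles
sinogram = np.tile(disk_projection, (angles.size, 1))
image = filtered_back_projection(sinogram, angles)
```

The loop over angles touches every pixel once per projection, which is where the cubic cost of the classic algorithm comes from. The unpadded ramp filter also introduces a small constant offset, so the values inside the disk come out somewhat below 1.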

Smart Tomo Engine

The core of the Smart Tomo Engine is a tomographic reconstruction library that exposes the following functions through its API: reading tomographic images (projections), the tomographic reconstruction itself (with three algorithms to choose from), and storing the results (in the supported file formats: DICOM, PNG). The product additionally includes a graphical user interface for two-dimensional visualization of the tomographic images and of the reconstruction results. The primary function of the product is to reconstruct a three-dimensional digital image of an object from a set of its transmission tomographic images in the x-ray band.

The following algorithms are implemented for layer-by-layer two-dimensional reconstruction:

FBP — Filtered Back Projection. The classic method of tomographic reconstruction that combines inverse projection and linear filtration. Computational complexity is $O(n^3)$. You can learn more about this method here [2].
DFR — Direct Fourier Reconstruction. This algorithm operates in the frequency domain and uses the Fast Fourier Transform (FFT) for filtration and inverse projection. The computational complexity in terms of multiplication operations is $O(n^2 \log n)$ [3].
HFBP — Hough FBP. It’s a reconstruction algorithm that was developed by our scientists. The Brady algorithm for the Fast Hough Transform is used for inverse projection, and the Deriche method is used for linear filtration acceleration [4,5].
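To get a feeling for what the two asymptotic estimates quoted above mean at the image sizes used later in this article, one can compare the leading terms directly. The sketch below ignores all constant factors and memory effects, so the ratios are only indicative and say nothing about measured running times; the function names are ours.

```python
import math

def fbp_leading_term(n: int) -> float:
    """Leading term of the FBP estimate, O(n^3), constants ignored."""
    return float(n) ** 3

def dfr_leading_term(n: int) -> float:
    """Leading term of the DFR estimate, O(n^2 log n), constants ignored."""
    return float(n) ** 2 * math.log2(n)

ratios = {}
for n in (511, 1175):  # layer sizes of the two datasets used in the tests
    ratios[n] = fbp_leading_term(n) / dfr_leading_term(n)
    print(f"n = {n}: n^3 / (n^2 log2 n) = {ratios[n]:.1f}")
```

The ratio grows as $n / \log_2 n$, which is why asymptotically faster algorithms matter more for high-resolution laboratory data than for smaller layers.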

Testing on the Elbrus platform

We tested our software on the Russian-made platform. The testing was performed on the Elbrus-401, Elbrus-804 and Elbrus-801CB computers. The Elbrus-401 is a workstation that uses the Elbrus-4C microprocessor; the Elbrus-804 is a server with 4 Elbrus-8C processors. (We have already tested other software of ours on these computers.) The Elbrus-801CB is MCST’s latest development: a workstation that uses the Elbrus-8CB processor. Our colleagues from the Moscow Center of SPARC Technologies (MCST) told us about the main differences between the generations of Elbrus processors: “The Elbrus-4C is the first processor which was mass-produced for the market. It’s a 4-core processor with a clock rate of 750…800 MHz and 3 DDR3-1600 channels for interprocessor communications. The Elbrus-8C is an 8-core microprocessor with a clock rate of 1.2…1.3 GHz and 4 DDR3-1600 memory channels, and each core has 1.5 times more arithmetic logic units (ALU) for higher floating point performance. The Elbrus-8CB is an even further improvement: it’s an 8-core processor with a clock rate of 1.5 GHz, DDR4-2400 memory channels and 2 times more ALUs. The Elbrus-8CB performs better with unaligned data, and it has a ton of other small improvements compared to the Elbrus-8C.”

The characteristics of the processors are listed in Table 1.

| | Elbrus-4C, 800 MHz | Elbrus-8C, 1200 MHz | Elbrus-8CB, 1500 MHz | AMD Ryzen 7 2700 | AMD Ryzen Threadripper 3970X |
|---|---|---|---|---|---|
| Clock rate, MHz | 800 | 1200…1300 | 1500 | 3200 | 3700 |
| Number of cores | 4 | 8 | 8 | 8 | 32 |
| Operations per clock cycle (per core) | Up to 23 | Up to 25 | Up to 50 | | |
| L1 cache, per core (data) | 64 KB | 64 KB | 64 KB | 32 KB | 32 KB |
| L1 cache, per core (instructions) | 128 KB | 128 KB | 128 KB | 64 KB | 32 KB |
| L2 cache, per core | 2 MB | 512 KB | 512 KB | 512 KB | 512 KB |
| L3 cache, shared | | 16 MB | 16 MB | 16 MB | 128 MB |
| RAM organization | Up to 3 channels DDR3-1600 ECC | Up to 4 channels DDR3-1600 ECC | Up to 4 channels DDR4-2400 ECC | Up to 2 channels DDR4-2933 ECC | Up to 4 channels DDR4-3200 ECC |
| Process technology | 65 nm | 28 nm | 28 nm | 12 nm | 7 nm |
| Number of transistors | 986 million | 2.73 billion | 3.5 billion | 4.8 billion | 23.54 billion |
| Maximum SIMD instruction width | 64 bits | 64 bits | 128 bits | 256 bits | 256 bits |
| Multiprocessor system support | Up to 4 processors | Up to 4 processors | Up to 4 processors | ? | ? |
| Production year | 2014 | 2016 | 2019 | 2018 | 2019 |
| Operating system | OS “Elbrus” 5.0-rc2 | OS “Elbrus” 6.0-rc1 | OS “Elbrus” 6.0-rc1 | Ubuntu 18.04 | Archlinux |
| Compiler version | lcc 1.24.09 | lcc 1.25.07 | lcc 1.25.05 | gcc 7.5.0 | gcc 10.1.0 |

We have already written about the optimization for the Elbrus computing platform, so we won’t elaborate on this subject now. We haven’t done anything extraordinary here:

—We used the optimized EML library (geometric transformations of an image (for example, affine transformations), arithmetic operations, etc);
—We used intrinsics where the EML library could not be used. SIMD on the Elbrus-8CB has been widened to 128 bits, but we have not fully adopted it yet, so our intrinsics still work with 64-bit vectors.

To test our Smart Tomo Engine software we collected two datasets: one with synthetic data and one with real data. The synthetic dataset “Shepp-Logan 3D” was created by mathematical modelling. The projections are calculated layer-by-layer on the 3D Shepp-Logan phantom using the push-broom approach. A cross-section of the phantom is shown in Figure 7, on the left. The size of the phantom image is 511×511×511. The projections are calculated for 420 different angles, evenly distributed between 0.5 and 210 degrees. The input to the Smart Tomo Engine consisted of 511 sinograms of size 511×420 (one of them is shown in Figure 7, on the right), and the output consisted of 511 reconstructed layers of size 511×511. The size of the phantom image is close to the size of the images generated by present-day dental tomographs: the maximum size of the scanning area in the mouth is usually 16 cm, and the spatial resolution claimed by the manufacturers is 0.3 – 0.4 mm. In this case the size of a registered projection is approximately 500×500 pixels.
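The numbers in this paragraph are easy to check. The short sketch below rebuilds the angular grid of the synthetic dataset and estimates the projection size of a dental scan from the quoted field of view and resolution; the variable names are ours.

```python
import numpy as np

# 420 angles evenly distributed between 0.5 and 210 degrees.
angles = np.linspace(0.5, 210.0, 420)
angular_step = angles[1] - angles[0]
print("angular step, degrees:", angular_step)

# Dental scan estimate: 16 cm field of view at 0.3 - 0.4 mm resolution.
field_of_view_mm = 160.0
pixels_per_side = [field_of_view_mm / resolution_mm for resolution_mm in (0.3, 0.4)]
print("pixels per side:", [round(p) for p in pixels_per_side])
```

The estimate gives between 400 and roughly 533 pixels per side, which brackets the “approximately 500×500” figure and the 511×511 layers of the phantom.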

Figure 7. Left – the cross-section of the 3D Shepp-Logan phantom, right – the sinogram of the central layer.

The real tomographic data (the dataset “Maybug”) was collected with the microtomograph at the Federal Scientific Research Center “Crystallography and Photonics” of the Russian Academy of Sciences and is used for scientific research purposes. The pixel size of the detector was 9 µm. The experimental sample is a dried maybug. 400 projections were taken in the parallel-beam scheme. The sample, fixed in the holder, was rotated through angles from 0.5 to 200 degrees in 0.5 degree increments. The size of each projection is 1261×1175. The input to the Smart Tomo Engine program is 1261 sinograms of size 1175×400, and the output is 1261 reconstructed layers of size 1175×1175.

And here’s the best part – test results and conclusions

We measured the execution speed of the reconstruction algorithms we used: FBP, DFR and HFBP. The running times of the algorithms are given in Table 2. The measurements were conducted on 5 computers: Elbrus-401, Elbrus-804, Elbrus-801CB, AMD Ryzen 7 2700 and AMD Ryzen Threadripper 3970X. For each computer we list the number of processors, the number of physical cores, and the maximum number of running threads (in parentheses). The reconstruction speed was measured in two modes: single-threaded (SM) and multithreaded (MM). Multithreading was implemented with the TBB library, version “2017 update 7”.
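Layer-by-layer reconstruction parallelizes naturally because every sinogram is processed independently. The sketch below illustrates the two measurement modes with Python’s standard thread pool instead of TBB, which the article’s C++ code uses; `reconstruct_volume` and the stand-in function `dummy_reconstruct` are hypothetical and only show the structure of the loop, not a real reconstruction.

```python
from concurrent.futures import ThreadPoolExecutor
import numpy as np

def reconstruct_volume(sinograms, reconstruct_layer, workers=1):
    """Reconstruct every layer, sequentially or with a pool of threads."""
    if workers == 1:
        # Single-threaded mode: plain loop over the layers.
        layers = [reconstruct_layer(s) for s in sinograms]
    else:
        # Multithreaded mode: layers are independent, so they can be mapped
        # onto a thread pool; map() keeps the original layer order.
        with ThreadPoolExecutor(max_workers=workers) as pool:
            layers = list(pool.map(reconstruct_layer, sinograms))
    return np.stack(layers)

def dummy_reconstruct(sinogram):
    """Stand-in for FBP, DFR or HFBP: returns an n x n layer of the right shape."""
    n = sinogram.shape[1]
    return np.full((n, n), sinogram.sum())

# 6 layers, 10 angles, 8 detector columns of synthetic data.
rng = np.random.default_rng(0)
sinograms = rng.random((6, 10, 8))
single = reconstruct_volume(sinograms, dummy_reconstruct, workers=1)
multi = reconstruct_volume(sinograms, dummy_reconstruct, workers=4)
print(single.shape, np.array_equal(single, multi))
```

Because the layers share no state, the result does not depend on the number of workers; only the wall-clock time changes.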

Table 2. The measurements of the program operating time, sec.

Analysing the test results, we’d first like to point out that the Elbrus-804 server with 4 processors took 19 seconds to reconstruct the 511 layers of the phantom with the HFBP algorithm. This means that each layer was reconstructed in 0.037 seconds, i.e. at a rate of 26.8 layers per second. Is that fast or slow? As a reference, the gantry of a 16-slice cardiac tomograph makes almost two full rotations per second and registers around 30 sinograms in that time. We reconstruct 26.8 layers per second, so this is practically real-time reconstruction. We can conclude that on the Russian platform the reconstruction meets the timing requirements of cardiology, where the main reference parameter is the period of the heartbeat, which is about one second on average.

The real-time reconstruction is also used for implementing the new scanning protocol that was recently proposed by our scientists – the monitored reconstruction [6]. When this protocol is used, it becomes possible to reduce the radiation exposure due to the fact that gathering of x-ray images stops as soon as there is an adequate set for the reconstruction.

Scientific research imposes no severe time restrictions, but it does impose requirements on spatial resolution, so the reconstructed cross-sections are larger. For the dataset collected with the laboratory microtomograph, reconstructing 1261 layers in multithreaded mode took 189 seconds (6.7 layers per second). Measuring the input data on the laboratory tomograph took 2000 seconds, while the Smart Tomo Engine running on the Elbrus-804 needed just over 3 minutes to reconstruct all the layers, which is about 10% of the acquisition time. A 4-processor server based on the Elbrus-8CB microprocessor will work even faster. It has already been developed at MCST, and its serial production is currently being planned.
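The throughput figures quoted in this section follow directly from the measured times. A quick check, using only numbers given in the text (the helper name is ours):

```python
def layers_per_second(n_layers: int, seconds: float) -> float:
    """Average reconstruction rate over a whole volume."""
    return n_layers / seconds

phantom_rate = layers_per_second(511, 19.0)    # Shepp-Logan 3D, HFBP, Elbrus-804
maybug_rate = layers_per_second(1261, 189.0)   # Maybug, multithreaded mode
time_per_phantom_layer = 19.0 / 511            # seconds per layer
reconstruction_share = 189.0 / 2000.0          # reconstruction vs. acquisition time

print(f"phantom: {phantom_rate:.2f} layers/s, {time_per_phantom_layer:.3f} s per layer")
print(f"maybug:  {maybug_rate:.2f} layers/s")
print(f"reconstruction takes {reconstruction_share:.1%} of the acquisition time")
```

With the rounded 19-second figure the phantom rate comes out slightly under 26.9 layers per second, which agrees with the quoted 26.8 to within the rounding of the measured time.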

The relative performance of the different platforms on each of the algorithms is interesting as well. With FBP, the Elbrus overhead is moderate, and once the clock rate is normalized, the results are pretty close. With DFR and HFBP, however, the Elbrus overhead compared to the x86 platform is much higher. Why is that? The reason is that our software is not yet sufficiently optimized for the Elbrus platform. We’ve spent 5 years working on optimization for the x86-64 platform, and we have yet to optimize most of our programs and algorithms for the Elbrus platform, the Elbrus-8CB in particular.

In the foreseeable future, we are planning improvements in three directions. The first is to optimize the calculations that use intrinsics. Right now our code targets 64-bit SIMD, while the Elbrus-8CB has 128-bit SIMD. The second improvement is going to be made by the MCST team. Work is already underway on support for the one-dimensional and two-dimensional discrete Fourier transform of input vectors whose length is not a power of two. Since it’s not ready yet, we used the ffts library with a little fine-tuning for both the Elbrus platform and the x86 platform.

To assess the potential acceleration of our program, we measured the running time of the discrete Fourier transform on the Elbrus-8CB processor for a complex input matrix of size 512 by 512. The ffts library, which is not optimized for Elbrus, performed this operation in 27 milliseconds, while the EML library performed the same operation in just 5.5 milliseconds. We sped up the ffts library by calling the EML library from it. The measurements in Table 2 were made after this optimization. We can conclude that if the optimization is carried out as thoroughly as in the EML library, the DFR algorithm on the Elbrus platform can still be accelerated 2.5 times.

And last but not least is the improvement related to the HFBP algorithm, which is based on the Hough transform. This transform is not available in the EML library yet, and our version is optimized only with vector operations. Since this algorithm is more computationally efficient than the DFT (our theoretical conclusions and the version optimized for the x86-64 platform prove that), it can be accelerated a few more times as well. We’ll definitely talk about the results of these optimizations next time.

Here’s the promised video of the program performance on the Elbrus-8CB.

See here what the inner world of a maybug looks like.


In this article we presented our new product – the software for tomographic reconstruction called Smart Tomo Engine that:

—Includes the innovative algorithm HFBP that consistently outperforms the DFR algorithm, the past leader;
—Supports the operating systems: OS Elbrus, MS Windows, macOS, various Linux distributions;
—Supports the following processor architectures: Elbrus, x86, x86_64;
—Is an exclusively Russian development;
—As a part of a software and hardware complex of the platform Elbrus, it can be used by any medical or industrial scanners of any generation, by the latest nano-tomographs (devices which reconstruct objects with submicron resolution), and by synchrotron facilities as well.

But the main outcome of this article is that the combination of the Russian-made processor Elbrus and the Smart Tomo Engine program is sufficient for the real-time tomography, even without additional improvements that are already being developed!

P.S. We couldn’t resist and measured the UNet performance on the Elbrus platform. UNet is a well-known neural network architecture used for solving segmentation problems. Initially, UNet was designed to solve segmentation problems in the medical field, and now the tomographic images that were processed by this neural network method are used to identify pathologies and tumors. The computationally complex parts of the neural networks are realized through the EML library, and the EML library is optimized for different generations of the Elbrus platform. That’s why it’s easier to evaluate the actual performance of different processors using these measurements. The measurements are made for one core, without parallelization. That way there is no need to be concerned about the number of cores.

Take a look at the last two numbers. How neat is that? And our research continues…


[2] A. C. Kak, M. Slaney, G. Wang. “Principles of computerized tomographic imaging”, Medical Physics, 2002, vol. 29, no. 1, pp. 107-107.
[3] F. Natterer. “Fourier reconstruction in tomography”, Numerische Mathematik, 1985, vol. 47, no. 3, pp. 343-353.
[4] A. Dolmatova, M. Chukalina, D. Nikolaev. “Accelerated FBP for computed tomography image reconstruction”, IEEE ICIP 2020, Washington, DC, United States, IEEE Computer Society, 2020, to be published.
[5] A. V. Dolmatova, D. P. Nikolaev. “Acceleration of convolution and back projection in the reconstruction of tomographic images” (in Russian), Sensornye Sistemy, 2020, vol. 34, no. 1, pp. 64-71, doi: 10.31857/S0235009220010072.
[6] K. Bulatov, M. Chukalina, A. Buzmakov, D. Nikolaev, V. V. Arlazarov. “Monitored Reconstruction: Computed Tomography as an Anytime Algorithm”, IEEE Access, 2020, vol. 8, pp. 110759-110774, doi: 10.1109/ACCESS.2020.3002019.
