The Expo lasted from June 19-21, and each day had it’s own focus and results, so this article is organized chronologically.

While we are standing nervously at our booth, thinking: “Is there going to be any interest? Will people come, will they ask questions?”, the first poster session starts and a wave of visitors floods the exhibition floor. Our first guest at the booth spends 30 minutes, knowledgeably inquiring about Elphel’s long-range 3D technology and leaves his business card, saying that he is very impressed. This was a good start of a very busy day full of technical discussions. CVPR is the first exhibition we have participated in where we did not have any problems explaining our projects.

Q: Why does your camera have 4 lenses?

A: With 4 lenses we have 4 stereo-pairs: horizontal stereo-pair is responsible for vertical features (it is the same as in regular stereo-camera); the vertical pair detects horizontal features, while the 2 diagonal pairs ensure that almost any edge is detected. (Elphel Presentation, p.8).

Q: Why don’t you use lidars?

A: Lidars are not capable of long range distance measurements. The practical maximum range of ~~ a lidar is 150-200 meters, while we target 500-2000 meters with a 150 mm stereo-base.(Elphel Presentation, p.7)~~

Q: How do you deal with thermal expansion?

A: there are 2 parts to this question: 1) thermal expansion of the calibrated lens is taken care of by the design of the sensor-lens module; 2) Thermal expansion of the whole camera is also an issue, especially with the larger stereo-base. The titanium tubes that form the cross of the X-Camera expand more on the “sunny side”, and the whole camera bends inward, like a flower. When taking images under the hot Utah sun, we had to wrap the tubes with insulation, covered in aluminum foil.

After some consideration we decided to remove the insulation wrapping for the CVPR show, because it makes the camera more presentable, and were surprised that this question came up so many times – we almost put the insulation cover back on.

Q: Do you build custom lenses?

A: We pre-select standard M12 lenses, because the quality of the lens can vary even within the same model. We have developed a full calibration process and post-processing software to compensate for optical aberrations, allowing to preserve the full sensor resolution over the camera FOV. Read more about aberration correction on Elphel blog.

Q: What is your part of the technology?

A: Everything: we have all parts: Hardware – cameras with aberration corrected lenses and thermally compensated sensor-lens module; a state of the art calibration facility and process; 3D-reconstruction software (Free Software, GNU GPL).

Q: What is the accuracy of your photogrammetry:

A: We comfortably work with 1/10 of a pixel.

Q: How dense is your 3D scene?

A: Currently: 324 x 242 overlapping tiles, with no dead zones.

Q: Does it build a 3D model in real-time?

A: Currently the 3D model is reconstructed in software in the post-processing of the acquired images. This software is tested for FPGA, with which we can achieve 12fps 3D-reconstruction on the camera. Our final goal is to design and manufacture custom ASIC with 3D processing on the chip. (Elphel Presentation, p.17)

The last visitor of the day from a large European company stopped at our booth around 6pm, 30 minutes before the end for the day, and we spend another hour talking and showing Elphel’s camera and high-resolution accurate images it produces. The lights on the floor went down, but we stayed and explain the technology behind our results and future plans. It was a nice day, well ~~spent.~~

- 500 meters, high resolution, 3D reconstruction with 5% accuracy with 150mm stereo-base (quad-stereo camera), passive (image-based)
- 2000 meters with 5% accuracy with 2 quad-stereo cameras based at 1256 mm from each other
- 3D-Reconstruction is done in post-processing with software simulated for FPGA porting. With FPGA – 12fps in 3D

Our near future goals:

- Train the neural network with the current 3D-image sets to achieve same results as in (2) – 2000 meter 3D reconstruction, with just one quad-sensor camera
- Port software to FPGA to achieve real-time 3D reconstruction (12 fps)

Our long-term goals:

- Manufacture custom ASIC for 3D-reconstruction to reconstruct 3D scenes on the fly with video frame rate. The small size of the custom chip will allow us to mass produce a long-range 3D reconstruction camera for industrial and commercial purposes. Smaller form-factor, inexpensive, and fast.
- Scalable design: a smaller stereo-base with 4 sensors – phone application (200 meter accurate 3D reconstruction for consumer applications); larger stereo-base: extremely long range passive 3D reconstruction – military applications, drone applications, etc.

Elphel’s approach to 3D reconstruction with the help of neural network is different from the mainstream one. This is partially due to our expertise is building calibrated hardware, which already produces robust 3D measurements. Therefore we see the value in pre-processing the images before feeding them to the NN similar to the eye pre-procesing visual information before sending it to the brain. The human eye receives 150 MPix of data, processes it on the retina, and sends only 1Mpix to the brain – 150 times reduction.

One question came up twice, both times from the conference attendees, about the difficulty the NN has to process high-resolution images. The current approach to NN training recommends to use low resolution raw images (640×480), because the NN has to be trained on a large amount of images (thousands). The last 6 slides of Elphel’s presentation talk about augmenting the NN with Elphel high-performance Tile Processor, that significantly reduces the amount of data, while leaving the “decision making” to the network.

We argue that the End-to-End approach is less efficient for high-resolution image processing then NN augmented with training-tree, linear image processing.

Elphel offers unique image sets for neural network training. These are high-resolution, space-invariant data sets.

Elphel is seeking collaboration with machine learning research teams working in the area of ML applications for 3D object classification, localization and tracking.

The exhibition is slowing down. Some booths start packing at midday. Elphel, however is busier ~~than ever.~~

We have talked to about every self-driving car company that had it’s presence at CVPR. We also talked to other large corporations doing research in perception and 3D-reconstruction.

We start by visiting our neighbors – other exhibitors at the Expo, we are very curious what they develop, for which applications, can it be combined with our technology, can we learn something from them, and can they learn from us? It’s a collaborat~~ive spirit, with which we start conversation. One of our closest neighbors is Mapbox – an open source mapping platform for custom designed maps, so we start with them. We use Open street Map as well as other maps with our Scene Viewer for 3D-reconstructed scenes for locating scenes on the map as well as for the ground truth data. During lunch break at we add Mapbox tiles to our scene viewer. Mapbox exhibitors are pleased~~~~ that it worked ~~~~so well for us. Now they are also very interested in our technology!~~

Then we walk to other booths, learn about them and present Elphel.

Some of the responses we’ve got when we introduced our technology and our current results:

- Researcher from Apple: “You have the most unique technology on this Expo!”
- Autonomous Vehicle company A: “I can not believe you can accurately measure distances at 500 meters with just a small base of 150mm? With just 5-10% error? It is simply impossible! I have to come see it!” After visiting our booth seeing the demo he asks: “Can you demonstrate your technology on the car? It has to be in Singapore!”
- Autonomous Vehicle company B: ” We are happy with the use of the Lidars on the city streets, with close range perception and 30 mph driving speed. But on the highway, where the speed if much higher and the distance much farther the Lidar’s output is too sparse. Passive, long-range 3D reconstruction would be a perfect solution.”
- Autonomous Vehicle company C: “How dense is your 3D scene? Oh, it is dense, then we can use it.”
- Autonomous Vehicle company D: “We are building the perception solution for self-driving trucks. We need long-range distance measurement.”
- Autonomous Vehicle company E: “We have 16 different types of sensors on the car right now. I don’t think we have room to put another camera. Besides the area behind the wind shield is very valuable, we already have 4 cameras there – there is no room. By the way, we re-evaluate our approach every 6 months, and we just did that, so let’s, maybe, talk in 6 months again?”

Interesting thought – why would anyone need unobstructed windshield in a self-driving car? They did not explain that.

One last thing to do before the show is over is to get the 3D-X-Cam up high and build the 3D model of the CVPR 2018 EXPO in “real-time”!

]]>We uploaded an image set with 2D correlation data together with the import Python code for experiments with the neural networks and are now looking for collaboration with those who would love to apply their DL experience to the new kind of input data. More data will follow and we welcome feedback to make this data set more useful.

The application area we are interested in is an extremely long distance 3D scene reconstruction and ranging with the distance to baseline ratio of 1000:1 to 10,000:1 while preserving wide field of view. Earlier post describes aircraft distance and velocity measurements with up to 3,000:1 distance-to-baseline ratio and 60°(H)×45°(V) field of view.

The data set contains 2D phase correlation output calculated from the 2592×1936 Bayer mosaic source images captured by the quad stereo camera, and Disparity Space Image (DSI) calculated from a pair of such cameras. Longer baseline provides higher range resolution and this DSI is serving as the ground truth for a single quad camera. The source images as well as all the used software is also provided under the GNU/GPLv3 license. DSI is organized as a 324×242 array – each sample is calculated from the corresponding 16×16 tile. Tiles are overlapping (as shown in Figure 3) with stride 8.

These data sets are provided together with the fully reconstructed X3D scene models, viewable in the browser (Wavefront OBJ files are also generated). The scene models are different from the raw DSI as the next software stages generate meshes, and that frequently leads to over-simplification of the original DSI (so most fronto parallel objects in the scene provide better range accuracy when probed near the bottom). Each scene is accompanied with the multi-layer TIFF file (*-DSI_COMBO.tiff) that allows to see the difference between the measured DSI and the one used in the rendered X3D model. File format and Python import software is documented in Oleg’s post “Reading quad stereo TIFF image stacks in Python and formatting data for TensorFlow”, the data files are directly viewable with ImageJ.

The first goal of applying neural network to this data is to improve DSI generation from the raw 2D correlation compared to what we can do now with the traditional code by using data from all correlation pairs and the neighbor tiles to propagate information along the objects edges and amplify correlation in low textured areas by pooling the correlation outputs from multiple tiles.

While there are no moving parts in the system, the overall process of measuring the DSI resembles human/animal stereo vision, where the eyes are converged for the expected object distance, and the visual processing deals with only a small residual mismatch, constantly adjusting the convergence. The difference here is that each tile can be individually “converged” for the requested disparity – this is done by combining integer pixel shift of the tile input window with the accurate fractional pixel shift by a phase rotator in the frequency domain to avoid any re-sampling errors. Currently prediction of the “convergence” (expected disparity) for the next tiles to measure is performed by a mixture of several algorithms (to avoid an expensive full disparity scan over the whole image). This process of predicting disparity is likely to be improved by a properly designed and trained network too, the one that will use the tile context.

As the current processing is recurrent (repeated “eye convergence” / mismatch processing), there are several options how to organize the training data. We decided to focus on sub-pixel matching, so for each scene with calculated ground truth DSI (from the dual camera rig) the correlation output is measured for the target disparity (convergence) in the range of minus 1 pixel to plus 1 pixel with the step of 0.1 pixel in the currently provided data sets. It makes total of 21 files of the correlation outputs. This partial disparity sweep around the ground truth is an alternative to a full disparity sweep that would require too much storage.

In addition to the raw Bayer images scene models have links to the JPEG images, corresponding to each sub-camera view. These images have aberrations corrected and they are partially rectified for the disparity = 0 (infinity). They are not rectilinear, and still have small radial distortion common to all of them, it is compensated during generation of the x3d models. The images have visible modulation caused by the Bayer mosaic – they were not subject to any (destructive) demosaic processing – only to the linear one. Each color channel is separately de-convolved with a set of space-variant kernels.

Initial data contains scenes captured with the car mounted dual quad camera rig (Figure 2) in two series:

- Salt Lake City State Street ↗
- Salt Lake City overlook (northern part of the city in the direction of the airport) ↗
- Airliner approaching SLC International airport ↗

The first series (80 scenes) was captured while driving north at a rate of 5 frames per second. Only the first four seconds (20 scenes) of each 20 second interval are posted to reduce the amount of uploaded data. The second series (59 scenes) was captured while standing and then slowly turning and starting to move. Third series was made from the stationary camera pointed in the direction of the airport.

More data will be added, we are open to suggestions – what data should be included to make the whole data set more useful. All available series combined are available at this link: ↗. Number of the scenes is definitely too small to extract the semantic information, but it may be enough to train the network to build DSI using small samples of the correlation tiles. At least to start with.

We do provide raw Bayer images captured by the camera in JP4 format, but due to significant for the high-resolution images optical aberrations, complete end-to-end processing would require **space-variant** coefficients that to our understanding would require enormous resources and a very large number of images for training. Instead we propose to separate the space-variant pre-processing and provide **space-invariant** data to the network. And the 2D correlation data is the earliest space-invariant data available with this approach.

In the case when the source image contains nicely textured in all directions areas, the correlation output contains a sharp round spot and subpixel argmax calculation is not very difficult – polynomial interpolation provides good results. Unfortunately many real-life objects do not have convenient texture, and poor texture is Achilles’ heel of many vision-based 3D reconstruction systems. On the other hand discontinuity between foreground and background objects usually results in sharp edges that can be used for matching, and it is usually safe to assume that lack of visible features means that the surface is smooth and depth can be reconstructed by averaging over larger area. The edges may have any direction, but binocular systems with horizontal baselines are only sensitive to the vertical features. Figure 3. shows correlation output of a distant airplane where the correlation spots are stretched in the direction of the edges. The four-sensor camera in Figure 1 contains 6 pairs, and after combining horizontal and vertical there are 4 directions left, so for any edge direction there is a pair with a almost perpendicular to it.

Most available LIDARs have a rather short range of 150-250 meters, and we are interested in measuring larger scenes. So instead we decided to use a pair of similar quad cameras positioned at ~5 times larger distance than the quad camera baseline, so the measurements it provides are also 5 times more accurate and can be used for training as ground truth. Binocular system does have limitations as described above, so it will not provide accurate data for the horizontal features (like electrical wires so common over Salt Lake City).

**Individual quad cameras** use precisely thermally compensated sensor front ends with the image plane movement relative to the sensor of less than ±0.02μ/°C, but the overall mount made of titanium tubes does bend when heated by the sun radiation from one side. Just few tenths of a pixel, but it is too much for the application, so until we have a new design we have to use field calibration and bundle-adjust individual sub-camera attitudes using the image data itself. It allows to keep misalignment to ~0.05 pix.

**Dual camera rig** (Figure 2) as currently implemented has its limitations that we temporarily mitigate with the field calibration. We are working on the rig improvements too – one of the mechanical issues is that when driving at freeway speeds we noticed 55 Hz torsion oscillations (two cameras rotating in opposite directions at the ends of the 1.5 m aluminum tube) of 0.4 pix amplitude. So until we will strengthen the rig we had to limit data set to the images captured at lower speeds to keep vibration-caused mismatch under 0.1 pix.

Figure 4. shows results of the field calibration for the image series captured during driving along the State Street. Each quad camera (main with the 258 mm baseline and auxiliary with the 150 mm one) had adjusted its sensor front end attitude angles using the image data, same with the attitude of the auxiliary camera as a whole with respect to the main one (baseline of 1256 mm). The results of the distance measurements obtained from this long baseline pair were then used as ground truth for the individual quad cameras. Relatively strong correction required for the first ~20 samples was caused by the changing temperature in the camera connection tubes after slowing down from the freeway speed to the city one.

The residual convergence standard deviation is 4.2 arcseconds for the 258 mm baseline quad camera and 2.5 arcseconds for the 150 one – it is close to the astronomical seeing that can be 1-4 arcseconds vertically and 5-20 arcseconds with the view close to horizon (for the full thickness of the atmosphere). That means that getting higher depth resolution would require compensating atmospheric turbulence. Luckily we do not need long exposures so active compensation can be similar to the “lucky shot” method by comparing measurements of the same features in the consecutive shots.

**Electronic rolling shutter (ERS) limitations**. We really love small format high-resolution image sensors perfected by the cellphone industry for the image quality and sensitivity, but that performance currently requires Electronic Rolling Shutter (ERS) that leads to image distortions of the moving objects (or caused by the moving/rotating camera itself). In the presence of movement of the objects (or camera egomotion) the measured disparity depends on the time difference between the moments the same feature is captured by the sub-cameras. The sensors are mechanically aligned to be parallel within 5 pixels that corresponds to 0.2ms of the readout time. That means that horizontal pairs capture all objects almost simultaneously regardless of disparity, for the vertical and diagonal pairs that time difference increases proportionally to the disparity, but for the far objects (where disparity accuracy is most important) that difference is still small (0.03ms for each disparity pixel). Currently posted image sets do not have any ERS correction, and that somewhat deteriorates performance for the near objects, but when considering just the egomotion (both translation and rotation) this effect can be easily compensated, at least for the uniform rotation on movement.

In the posted image set ERS is not an issue for small (<5 pix) disparities, we will add software compensation over the full disparity range of the scenes for uniform movements, and later will use IMU to compensate higher frequency movements also.

]]>- convert a TIFF image stack into a NumPy array
- extract information from the TIFF header tags
- reshape/perform a few array manipulations based on the information from the tags.

`~$ python3 imagej_tiff.py <filename>.tiff`

It will print header info in the terminal and display the layers (and decoded values) using Matplotlib.
- image data shape: (242, 324, 9, 9, 4) with the layers along the last dimension
- values data shape: (242, 324, 3)

Elphel is presenting at CVPR 2018 EXPO: Long Range Passive 3D Reconstruction System, providing NN with custom training sets, enabling realtime 3D-aware machine learning systems.

Please visit us at CVPR, Booth 132.

You can also find our presentation and related articles on wiki page https://wiki.elphel.com/wiki/CVPR2018 ]]>While we continue to work on the multi-sensor stereo camera hardware (we plan to double the number of sensors to capture single-exposure HDR image sets) and develop code to get the ground truth data for the CNN training, we had some fun testing the camera for capturing aircraft position in 3-D space. Completely passive method, of course.

We found a suitable spot about 2.5 km from the beginning of the runway 34L of the Salt Lake City international airport (exact location is shown in the model viewer) so approaching aircraft would pass almost over our heads. With the 60°(H)×45°(V) field of view of the camera aircraft are visible when they are 270 m away horizontally.

Camera was set to capture synchronized image sets at 5 frames per second (same as in Fig. 1 animation), each set was processed independently without any relation to the other ones (no optical flow or motion vectors). Our existing (not airplane-aware) prototype software generated the scene 3-D models, they in turn were used to measure the aircraft positions. Result accuracy may be improved by using directly the intermediate Disparity Space Image (DSI) data – current reconstruction code tries to fit some flat surfaces to approximate the DSI data. At long distances there is insufficient data about the shape of the objects, so at ~500 m range (Fig. 2) the software used a fronto-parallel surface (distances to the aircraft nose, center, and tail are the same) and at ~600m it selected a tilted surface in the wrong direction, so the measured distance to the tail is larger than the one measured to the nose and center.

The measurements accuracy is still good, considering that the disparity of 1 image pixel corresponds for this short baseline camera to 527 meters range, at 800 m 1/10 pixel disparity precision corresponds to 120 m uncertainty. Ground speed and the descend rate (indicated in the animation) are calculated between the first captured set and the current one, so the last value averages over the full 6 seconds.

This method of capturing aircraft position and speed does not require that it travels directly to/from the camera, it can fly in any direction. There could also be multiple aircraft (such as UAVs) in the same camera view. Another 3-D scene (Fig. 3) captured from the same spot perpendicular to the flight direction shows an aircraft at 1600 m from the camera at 110 m altitude. According to the map the center line of the 34R runway is 14% farther (at 1820 m) from the camera position. That corresponds to 0.045 pix disparity error in the distance measurement, so calculated ground speed and altitude would also be 14% lower than actual.

Each of the provided 3-D scene models has the corresponding “set of 4 images” link on the settings page that can be viewed as an animation, manually iterated over or downloaded – the one for Fig. 3 scene shows how a 0.33 pixel wiggle of the FedEx MD-11 image looks like (images are zoomable and pannable). Here are the direct links for image sets of the first (1.75 pix disparity) and the last (0.7 pix disparity) of the scene models of the approaching B737 shown in Fig. 1.

These four images are not used in the actual reconstruction process, they are rendered from the source raw (JP4 format) ones for the specified disparity (D0.0 in the file names show that disparity was 0.0 pixels) for viewing or evaluating with alternative software. These images are not rectilinear and have residual common (for all four) radial distortion. They have optical aberrations corrected by the space-variant deconvolution, sub-camera residual misalignment is field-corrected (such misalignment may be caused by an uneven heating of the camera tubular structure by the direct sunlight). The images are partially rectified to the common radial distortion model that bests fits simultaneously all four sub-cameras. The remaining radial distortion is compensated during final generation of the 3-D model. These images have visible grid artifacts caused by the Bayer pattern – no de-mosaic processing is involved because it hurts linear operations over the images, and it is not adequate for the peripheral image areas (see earlier post). Each color component is processed separately for the phase correlation, when combining texture images these artifacts are reduced as they are not coherent between the channels. Any additional non-linear improvements of the Bayer mosaic suppression can be applied to the texture generation step, not to the data used for correlation.

]]>- window size
- horizontal & vertical mirror
- color gains
- exposure
- fps and trigger-synced ports
- frame-based commands sequence allowing to change settings of any image up to 16 frames ahead (didn’t need to be implemented as it’s the common part of the driver for all sensors)
- auto cable phase adjustment during init for cables of various lengths

- Get the register reference document for MT9F002
- Setup the Eclipse IDE for linux kernel development – see instructions

- common for all sensors – sets up working with fpga frame-based command sequencers (see NC393 development progress and the future plans).
- sensor specific – commands that do actual communication with a sensor. This is the part needs to be implemented.

*mt9f002_pages*– a ‘page’ is the upper byte of a 16-bit register address of the sensor (for 8-bit reg addresses the page number is 0x00), the table should list all available pages in the sensor – or at least the most important ones – they are used to program fpga’s 256-max records per port tables used to write commands and at least one record is used to read from port – so in general for a setup w/o a MUX board the limit of allowed pages per sensor is 256-1 = 255 (with a MUX + 3x sensors on a single port it is 256-2(mux write & read)-4(3x sensor read + 1x broadcast read))/4 = 62)*mt9f002_par2addr*– used convert 256 8-bit parameters available as P_SENSOR_REGSxxx to 16-bit register address*mt9f002_ahead_tab*– sequencer commands latencies – leave default, they are needed on the final stages of development

*mt9f002_pgm_detectsensor*– remove reset read sensor id*mt9f002_pgm_initsensor*– init registers that select interface, sensor’s PLL multipliers/dividers and other required settings*mt9f002_pgm_exposure*– program exposure*mt9f002_pgm_window*– program image width and height*mt9f002_pgm_limitfps*– limit fps based on compressor capabilities and trigger period settings*mt9f002_pgm_gains*– program color gains*mt9f002_pgm_triggermode*– set sensor in triggered mode*mt9f002_pgm_sensorregs*– write sensor registers directly

Following the plan laid out in the earlier post we’ve built a camera rig for capturing training/testing image sets. The rig consists of the two quad cameras as shown in Figure 1. Four identical Sensor Front Ends (SFE) 10338E of each camera use 5 MPix MT9P006 image sensors, we will upgrade the cameras to 18 MPix SFE later this year, the circuit boards 103981 are in production now.

We decided to use cameras with different baselines: 150 mm and 258 mm, the distance between the quad-camera centers is 1256 mm. The distance map calculated from the larger of the two cameras will have higher precision, than the distance map from the smaller one. For the same processing algorithms and the same object the distance resolution is reverse proportional to the baseline. Even higher resolution will be achieved by fusing depth maps from the two cameras together. The higher precision results may be used for the cost functions of the lower distance resolution camera instead of the unavailable real ground truth data. To make the rig sufficiently rigid the cameras are mounted on the aluminum tube (OD=57 mm), and that tube is attached to the rest of the mount through the vibration isolators.

Each of the two cameras uses matched set of the lenses with focal length variation of less than 1% to simplify frequency domain mutual image rectification. Cameras still need fine alignment and calibration, after that we will start image sets capturing.

This year Computer Vision and Pattern Recognition conference (CVPR 2018) will be held in Salt Lake City, Utah. We plan to participate in the CVPR exhibition and we will be happy to see visitors at Elphel office that is less than 10 km from the conference venue.

]]>*Featured on Image Sensors World*

This article describes our next steps that will continue the year-long research on high resolution multi-view stereo for long distance ranging and 3-D reconstruction. We plan to fuse the methods of high resolution images calibration and processing, already emulated functionality of the Tile Processor (TP), RTL code developed for its implementation and the Convolutional Neural Network (CNN). Compared to the CNN alone this approach promises over a hundred times reduction in the number of input features without sacrificing universality of the end-to-end processing. The TP part of the system is responsible for the high resolution aspects of the image acquisition (such as optical aberrations correction and image rectification), preserves deep sub-pixel super-resolution using efficient implementation of the 2-D linear transforms. Tile processor is free of any training, only a few hyperparameters define its operation, all the application-specific processing and “decision making” is delegated to the CNN.

Machine learning is an active development area, and its applications to the 3-D scene reconstruction stimulated by the development of the autonomous vehicles including self-driving cars is no exception. Use of the CNNs to extract surfaces from the random-dot stereograms was published as early as 1992^{[1]}. Most of the modern researches use standard image sets: Middlebury stereo data set^{[2]} for high resolution near objects and KITTI^{[3]}
for the longer range applications. KITTI images are acquired from a moving car, they have attached ground truth data captured by the LIDAR. This image set uses binocular pairs and has relatively low resolution (1.4 MPix) compared to the modern image sensors, and still most of the CNN architectures require from seconds to thousands of seconds even when implemented with GPU devices and so are not yet suitable for the real-time applications.

Most of the stereo image processing CNNs^{[4]}
input raw pixel data and perform unary feature extraction in the parallel subnets (one for each image in a stereo set), merge features and perform additional processing of the resulting 3-D data. This is a so-called “siamese” network architecture that benefits from sharing parameters between the identical subnetworks. It is common to put most resources to the unary part of the processing resulting in truncating of the common stage of the processing that can consist of just a single layer (Fast Architecture in ^{[4]}). Efficient implementation in ^{[5]} limits CNN processing to just the DSI generation and then uses traditional methods of DSI enhancement such as semi-global matching^{[6]}, other architectures split network after exchanging features and generate depth maps for each of the stereo images individually^{[7]}.

Early layers of the various CNNs (and the eye retina too) are very general and even remind the basis functions (Figure 2) of the two-dimensional Fourier (DFT), cosine/sine (DCT and DST) and wavelet (DWT) transforms so it is no surprise that there are works that explore combinations of such transforms and the neural networks. Some of them ^{[8, 9]} exploit the energy concentration property of these transforms that makes possible popular image compression such as JPEG.
Others ^{[10, 11]} evaluate efficiency of the available Fast Fourier Transform implementations to speed-up convolutions by converting image data to the frequency domain and then applying the pointwise multiplication according to the convolution-multiplication property. Improvement is modest, as the frequency domain calculations are most efficient for the large windows, while most modern CNNs use small ones, such as 3×3, where Winograd algorithm is more efficient.

Multi-view high resolution cameras present a special case where frequency domain processing results may lead to the reduction of the CNN input features by two orders of magnitude compared to raw pixel input, the data flow diagram is presented in Figure 3. Four identical subnets process individual channels, each providing 4912×3684 pixel Bayer mosaic color images. As described in the earlier post Tile Processor is using efficient Modulated Complex Lapped Transform (MCLT) conversion of the Bayer mosaic (color) high resolution image data to the frequency domain, and of 4x8x8 coefficients representing full 16×16 input tiles for each of the 3 colors, red and blue result in 1x8x8 each, and green produces 2x8x8 coefficients, to the total of 256. Residual fractional pixel shifts needed for the image rectification are implemented as cosine/sine phase rotators, they are performed in parallel for each color and result in 768 coefficients for each tile. Frequency domain processing includes space-variant optical aberration correction (required for the high resolution small format image sensors), phase correlation for image pairs and textures processing if it is required in addition to the distance (disparity) measurement. Aberration correction is performed in each channel subnet, correlation and texture processing combine data from all four of them. After channels/pairs merging the frequency domain data is converted back to the pixel domain by the IMCLT modules as 16×16 tiles representing 2-D correlation. In most cases the 2-D correlation data is reduced to a 16-element array by summing perpendicular to disparity direction (orthogonal pairs are transposed in the frequency domain before they are combined and fed to the IMCLT), full 2-D correlation may still be used for the minor field calibration. The 16-element array is then processed to calculate sub-pixel argmax (residual disparity) and the corresponding correlation value (confidence) – this is where the number of features is dramatically reduced. It may be useful to increase the number of features and supplement average disparity for all 4 pairs of quad camera with separate horizontal and vertical pairs to improve foreground-background separation. These additional features are calculated by the identical TP phase correlation subnets as shown in Figure 3. With all 3 correlations each tile results in 6 values that are fed to the CNN input as a 614x460x6 tensor, in that case the feature reduction would still be over 40 (128 for a single correlation pair).

Conversion of the raw pixels to the Disparity Space Image (DSI) by the TP involves significant reduction of the (X,Y) resolution. When using four of the 18 MPix (4912×3684) imagers the result DSI resolution is just 614×460, this may seem like a waste of the sensor resolution. Actually, it is not:

- a deep sub-pixel resolution for disparity measurement needed for long-distance ranging requires matching of the large image areas anyway
- most of the image area for the most real-world images corresponds to smooth 3-D surfaces where assumption of a common disparity value for a tile is reasonable.
- the initial image resolution is preserved by the TP when source images are converted to the textures (simultaneously improving quality as the data from 4 rectified images is averaged)
- pixel-accurate distance map may be restored by extra processing the pixel data for selected tiles where depth discontinuity is detected, then assigning each pixel to one of the available surfaces.

Significant (42..128) reduction of the input features is not the only advantage of the TP+CNN combination over the CNN alone. Being “convolutional” CNNs depend on translation symmetry, the groups of related pixels are treated the same way regardless of their localization in the image. That is only an approximation, especially when dealing with the high resolution images and extracting subpixel disparity values. This divergence from the strict convolution model is caused by the optical aberrations and distortions and requires use of the space-variant convolution instead, or performing complete aberration correction and image rectification before the images are fed to the network. Otherwise both the complexity of the network and amount of training data would increase dramatically. Image rectification with pixel (or slightly better) precision is a common task in stereo processing. It involves interpolation and re-sampling of the pixel data, the process that leads to the phase noise introduction, especially harmful when deep super-resolution of the matched images is required. Tile processor implementation combines multiple operations (fractional pixel image shifts, optical aberrations correction, phase correlation of the matched pairs), TP avoids image re-sampling from the sensor pixel grid by replacing it with the phase rotation in the frequency domain.

Final step that reduces the number of features that are sent from the TP to the CNN is extraction of the disparity value by calculation of the argmax of the phase correlation data. This function has to be calculated with subpixel resolution for the data defined on the integer pixel grid. Certain biases are possible, and the TP implementation offers trade-off between speed and accuracy. The result disparity value is a sum of the pre-applied disparity (implemented as a phase rotation in the frequency domain on top of the integer pixel shift) and the argmax value (correlation maximum offset from zero). When higher accuracy is required, a second iteration may be performed by applying the full disparity from the first iteration, then the residual argmax offset will be close to zero and less subject to bias.

Optimal system for the real-time high resolution 3-D scene reconstruction and ranging would require development of the application-specific SoC. If used with a set of four 18 MPix image sensors (such as ON Semiconductor AR1820HS) and a single ×16 1600 MHz DDR4 memory device, the 16 nm technology process, the TP subsystem will be capable of 10 Hz operation covering the full 4912×3684 frames reserving half of the memory bandwidth for other then TP operations.

We plan to emulate such system using available NC393 camera electronic and optical-mechanical components, including multiple 10393 system boards based on Xilinx Zynq 7030 SoC. Each such board has a GigE port and four identical sensor ports routed directly to the FPGA I/O pads allowing flexible assignment of pin functions. Typical applications include up to 8 differential LVDS pairs, clock pair, I²C and clock input. The same connectors can be used for high-speed communication between the 10393 boards. Partitioning the system into multiple boards will allow to fit the required TP functionality into smaller FPGAs, then send the result features (614×460×6) over the GigE to a workstation with GPU for the experiments with different CNN implementations. The system bandwidth will be lower than that of the application-specific SoC, the 10 Hz operation will be possible with 5 MPix sensors (2.5 Hz with 18 MPix).

Inter-board connections are shown in Figure 1 (just the connections, the actual prototype camera will look more like in Figure 4, but with a wider body). Five to seven of the 10393 boards are arranged in 2 layers. Four layer 1 boards use one of the sensor ports to receive image data from the attached sensor, perform image conditioning, flat-field correction and store data in the dedicated DDR3 memory. They later read the data as 16×16 pixels overlapping tiles, calculate the tile centers using calibration data and requested location and nominal disparity from the data received over the GigE. Each tile is transformed to the frequency domain, the data is subject to the space-variant aberration correction. The result frequency domain tiles are output through three remaining sensor ports that are reconfigured to be LVDS transmitters. The layer 2 boards simultaneously receive frequency domain data through all 4 of their sensor ports from the layer 1 and perform phase correlation (pointwise multiplication followed by normalization) on the image pairs. There could be just a single layer 2 board, or up to 3 (limited by the available layer 1 ports) to perform different types of correlations in parallel (all 4 pairs combined, two vertical pairs and separately 2 horizontal pairs for better foreground/background separation. The results of the frequency domain calculations are then transformed to the pixel domain and the argmax is calculated. Then argmax value is used to calculate the full tile disparity, and the corresponding correlation value – as a disparity confidence. The pair (disparity, confidence) for each tile is then sent over GigE to the CNN implemented on a workstation computer.

While the TP functionality is already tested with the software emulation, and the efficient implementation is developed, more research is needed for the CNN part of the system. Available image sets, such as KITTI^{[3]} have insufficient resolution (1.4 MPix) and they use different spatial arrangement of the cameras. We plan to capture high resolution quad camera image sets using available NC393-based cameras that will be upgraded from 5 MPix to 18 MPix sensors of the same 1/2.3″ format so the optical-mechanical design will remain the same. As we are primarily interested in long distance ranging (few hundreds to thousands meters), use of the LIDARs to capture ground truth data is not practical. Instead we plan to mount a pair of identical quad cameras (with the baseline of 150mm) on a car 1500 mm apart, pointed in the same direction, so when the 3-D measurements from these quad cameras are fused, the accuracy of the composite distance data would be ten times better, because the effective baseline will be 1500mm. Of course that method has some limitations (it will not help to improve data from the poorly textured objects) but it will provide higher absolute distance resolution that can be used for the loss function during CNN training. Data from the individual quad cameras will be used for training and testing of the network.

**Update:** here is the actual rig.

All acquired images, related calibration data and software will be available online under GNU GPL.

[1] Becker, Suzanna, and Geoffrey E. Hinton. “Self-organizing neural network that discovers surfaces in random-dot stereograms.” Nature 355.6356 (1992): 161.

[2] Scharstein, Daniel, et al. “High-resolution stereo datasets with subpixel-accurate ground truth.” German Conference on Pattern Recognition. Springer, Cham, 2014.

[3] Menze, Moritz, and Andreas Geiger. “Object scene flow for autonomous vehicles.” Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2015.

[4] J. Zbontar and Y. LeCun, “Stereo matching by training a convolutional neural network to compare image patches,” Journal of Machine Learning Research, vol. 17, no. 1-32, p. 2, 2016.

[5] W. Luo, A. G. Schwing, and R. Urtasun, “Efficient deep learning for stereo matching,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5695–5703, 2016.

[6] H. Hirschmuller, “Accurate and efficient stereo processing by semi-global matching and mutual information,” in Computer Vision and Pattern Recognition, 2005. CVPR 2005. IEEE Computer Society Conference on, vol. 2, pp. 807–814, IEEE, 2005.

[7] J A. Kendall, H. Martirosyan, S. Dasgupta, P. Henry, R. Kennedy, A. Bachrach, and A. Bry, “End-to-end learning of geometry and context for deep stereo regression,” arXiv preprint arXiv:1703.04309, 2017.

[8] Sihag, Saurabh, and Pranab Kumar Dutta. “Faster method for Deep Belief Network based Object classification using DWT.” arXiv preprint arXiv:1511.06276 (2015).

[9] Ulicny, Matej, and Rozenn Dahyot. “On using CNN with DCT based Image Data.” Proceedings of the 19th Irish Machine Vision and Image Processing conference IMVIP 2017

[10] Vasilache, Nicolas, et al. “Fast convolutional nets with fbfft: A GPU performance evaluation.” arXiv preprint arXiv:1412.7580 (2014).

[11] Lavin, Andrew, and Scott Gray. “Fast algorithms for convolutional neural networks.” Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2016.

]]>- works in JP4 format (COLOR=5). Because in this format demosaicing is not done it does not require extra scan lines, which simplified fpga’s logic.
- fps is controlled:
- by exposure for the sensor in the freerun mode (TRIG=0, delivers max fps possible)
- by external or internal trigger period for the sensor in the snapshot mode (TRIG=4, a bit lower fps than in freerun)

Why “*small tile*“? Most camera images have short (up to few pixels) correlation/mutual information span related to the acquisition system properties – optical aberrations cause a single scene object point influence a small area of the sensor pixels. When matching multiple images increase of the window size reduces the lateral (x,y) resolution, so many of the 3d reconstruction algorithms do not use any windows at all, and process every pixel individually. Other limitation on the window size comes from the fact that FD conversions (Fourier and similar) in Cartesian coordinates are shift-invariant, but are sensitive to scale and rotation mismatch. So targeting say 0.1 pixel disparity accuracy the scale mismatch should not cause error accumulation over window width exceeding that value. With 8×8 tiles (16×16 overlapped) acceptable scale mismatch (such as focal length variations) should be under 1%. That tolerance is reasonable, but it can not get much tighter.

What is “*space variant*“? One of the most universal operations performed in the FD is convolution (also related to correlation) that exploits convolution-multiplication property. Mathematically convolution applies the same operation to each of the points of the source data, so shifted object of the source image produces just a shifted result after convolution. In the physical world it is a close approximation, but not an exact one. Stars imaged by a telescope may have sharper images in the center, but more blurred in the peripheral areas. While close (angularly) stars produce almost the same shape images, the far ones do not. This does not invalidate convolution approach completely, but requires kernel to (smoothly) vary over the input images ^{[1, 2]}, makes it a space-variant kernel.

There is another issue related to the space-variant kernels. Fractional pixel shifts are required for multiple steps of the processing: aberration correction (obvious in the case of the lateral chromatic aberration), image rectification before matching that accounts for lens optical distortion, camera orientation mismatch and epipolar geometry transformations. Traditionally it is handled by the image rectification that involves re-sampling of the pixel values for a new grid using some type of the interpolation. This process distorts the signal data and introduces non-linear errors that reduce accuracy of the correlation, that is important for subpixel disparity measurements. Our approach completely eliminates resampling and combines integer pixel shift in the pixel domain and delegates the residual fractional pixel shift (±0.5 pix) to the FD, where it is implemented as a cosine/sine phase rotator. Multiple sources of the required pixel shift are combined for each tile, and then a single phase rotation is performed as a last step of pixel domain to FD conversion.

Modulated Complex Lapped Transform (MCLT)^{[3]} can be used to split input sequence into overlapping fractions, processed separately and then recombined without block artifacts. Popular application is the signal compression where “processed separately” means compressed by the encoder (may be lossy) and then reconstructed by the decoder. MCLT is similar to the MDCT that is implemented with DCT-IV, but it additionally preserves and allows frequency domain modification of the signal phase. This feature is required for our application (fractional pixel shifts and asymmetrical lens aberrations modify phase), and MCLT includes both MDCT and MDST (that use DCT-IV and DST-IV respectively). For the image processing (2d conversion) four sub-transforms are needed:

- horizontal DCT-IV followed by vertical DCT-IV
- horizontal DST-IV followed by vertical DCT-IV
- horizontal DCT-IV followed by vertical DST-IV
- horizontal DST-IV followed by vertical DST-IV

Figure 1 illustrates the principle of TDAC (time-domain aliasing cancellation) that restores initial data from the series of individually converted subsections after seemingly lossy transformations. Step a) shows the 2*N long (in our case N=8) data subsets extraction, each subset is multiplied by a window function. It is a half period of the sine function, but other windows are possible as long as they satisfy the Princen-Bradley condition. Each of the sections $(a\dots h)$ corresponds to N/2 of the input samples, color gradients indicate input order. Figure 2b has seemingly lossy result of MDCT (left) and MDST (right) performed on the 2*N long input $(a,b,c,d)$ resulting in N-long $(\u2013\stackrel{~}{c}\u2013d,a\u2013\stackrel{~}{b})$ – tilde indicates time-reversal of the subsequence (image has the gradient direction reversed too). Upside-down pieces indicate subtraction, up-right – addition. Each of the 2*N → N transforms is irreversible by itself, TDAC depends on the neighbor sections.

Figure 1c shows the first step of original sequence restoration, it extends N-long sequence using DCT-IV boundary conditions – it continues symmetrically around left boundary, and anti-symmetrically around the right one (both around half-sample from the first/last one), behaving like a first quadrant (0 to π/2) of the cosine function. Conversions for DST branch are not shown, they are similar, just extended like a first quadrant of the sine function.

The result sequence of step c) (now 2*N long again) is multiplied by the window function for the second time in step d), the two added terms of the last column ($d$ and $\stackrel{~}{c}$) are swapped for clarity.

The last image e) places result of the step d) and those of similarly processed subsequences $(c,d,e,f)$ and $(e,f,g,h)$ to the same time line. Fraction $(\stackrel{~}{d},\stackrel{~}{c})$ of the first block compensates $(\u2013\stackrel{~}{d},\u2013\stackrel{~}{c})$ of the second, and $(\stackrel{~}{f},\stackrel{~}{e})$ of the second – $(\u2013\stackrel{~}{f},\u2013\stackrel{~}{e})$ of the third. As the window satisfies Princen-Bradley condition, and it is applied twice, the $c$ columns of the first and second block when added result in $c$ segment of the original sequence. The same is true for the $d$, $e$, and $f$ columns. First and last N samples are not restored as there are no respective left and right neighbors to be processed.

Modified discrete cosine and sine transforms exhibit a perfect reconstruction property when input sequence is split into regular overlapping intervals and this is the case for many applications, such as audio and video compression. But what happens in the case of a space-variant shift? As the total shift is split into integer and symmetrical fractional part even smooth variation of the required shift for the neighbor sample sequences there will be places where the required shift crosses ±0.5 pix boundaries. This will cause the overlap to be N±1 instead of exactly N and the remaining fractional pixel shift will jump from +0.5 pix to -0.5pix or vice versa.

Such condition is illustrated in Figure 2, where fractional pixel shift is increased to ±1.0 instead of ±0.5 to avoid signal shape distortion caused by a fractional shift. The sine window is also modified accordingly to have zeros on both ends and so to have a flat top, but it still satisfies Princen-Bradley condition. The sawtooth waveform represents input signal that has large pedestal = 10 and a full amplitude of 3. The two left input intervals (1 to 16 and 9 to 24) are shown in red, the two right ones (15 to 30 and 23 to 38) are blue. These intervals have extra overlap (N+2) between red and blue ones. Dotted waveforms show window functions centered around input intervals, dashed – inputs multiplied by the windows. Solid lines on the top plot show result of the FD rotation resulting in 1 shift to the right for the red subsequences, to the left – by the blue ones.

The bottom plot of Figure 2 is for the “rectified image”, where the overlapping intervals are spread evenly (0-15, 8-23, 16-31, 24-39), the restored data is multiplied by the windows again (red and blue solid lines) and then added together (dark green waveform) in an attempt to restore the original sawtooth.

There is an obvious problem in the center where the peak is twice higher than the center sawtooth (dotted waveform). It is actually a “wrinkle” caused by the input signal pedestal that was pushed to the center by the increased overlap of the input, not by the sawtooth shape. The FD phase shift moved *windowed* input signal, not just the input signal. So even if the input signal was constant, left two windows would be shifted right by 1, and right ones – left by one sample, distorting their sum. Figure 3 illustrates the opposite case where the input subsequences have reduced rather than increased overlap and FD phase rotation moves windowed data away from the center – the restored signal has a dip in the middle.

This problem can be corrected if the first window function used for input signal multiplication takes the FD shift into account, so that the input windows *after* phase rotation match the output ones. Signals similar to those in Figure 2, but with appropriately offset input windows (dotted waveforms are by 1 pixel asymmetrical) are presented in Figure 4. It shows perfect reconstruction (dark green line) of the offset input sawtooth signal.

The MCLT converter illustrated in Figure 5 takes an array of 256 pixel samples and processes them at the rate of 1 pixel per clock, resulting in 4 of the 64 (8×8) arrays representing FD transformation of this tile. Converter incorporates phase rotator that is equivalent to the fractional pixel shifter with 1/128 pixel resolution. Multiple pixel tiles tiles can be processed immediately after each other or with a minimal gap of 16 pixels needed to restore the pipeline state. *Fold sequencer* receives start signal and simultaneously two 7-bit fractional pixel X and Y shifts in 1/128 increments (-64 to +63). Pixel data is received from the read-only port of the dual-port memory (*input tile buffer*) filled from the external DDR memory. *Fold sequencer* generates buffer addresses (including memory page number) and can be configured for various buffer latency.

*Fold sequencer* simultaneously generates X and Y addresses for the *2-port window ROM* that generates window function (it is a half-sine as discussed above) values that are combined by a multiplier as the 2-d window function is separable. Each of the dimensions calculate window values with appropriate subpixel shift to allow space-variant FD processing. Mapping from the 16×16 to 8×8 tiles is performed according to Figure 1b for each direction, resulting in four 8×8 tiles for DCT/DCT (horizontal/vertical), DST/DCT, DCT/DST and DST/DST future processing, Each of the source pixels contribute to all 4 of the 8×8 arrays, and each corresponding elements of the 4 output arrays share the same 4 contributing pixels – just with different signs. That allows to iterate through the source pixels in groups of 4 once, multiply them by appropriate window values and in 4 cycles add/subtract them in 4 accumulators. These values are registered and multiplexed during the next 4 cycles feeding *512×25 DTT input buffer* (DTT stands for Discrete Trigonometric Transform – a collective name for both DCT and DST).

“Folded” data stored in the *512×25 DTT input buffer* is fed to the 2-dimensional 8×8 pixel DTT module. It is similar to the one described in the DCT-IV implementation blog post, it was just modified to allow all 4 DTT variants, not just the DCT/DCT described there. This was done using the property of DCT-IV/DST-IV that DST-IV can be calculated as DCT-IV if the input sequence is reversed (x0 ↔ x7, x1 ↔ x6, x2 ↔ x5, x3 ↔ x4) and sign of the odd output samples (y1, y3, y5, y7) is inverted. This property can be seen by comparing plots of the basis functions in Figure 6, the proof will be later in the text.

Another memory buffer is needed after the *2d DTT* module as it processes one of four 64-pixel transforms at a time, and the phase rotator (Figure 7) needs simultaneous access to all 4 components of the same FD sample. These 4 components are shown on the left of the diagram, they are multiplied by 4 sine/cosine values (shown as CH, SH, CV and SV) and then combined by the adders/subtracters. Total number of multiplications is 16, and they have to be performed it 4 clock cycles to maintain the same data rate through all the MCLT converter, so 4 multiplier-accumulator modules are needed. One FD point calculation uses 4 different sine/cosine coefficients, so a single ROM is sufficient. The phase rotator uses horizontal and vertical fractional pixel shift values for the second time (first was for the window function) and combines them with the multiplexed horizontal and vertical indexes (3 bits each) as address inputs of the coefficient ROM. Rotator provides rotated coefficients that correspond to the MCLT transform of the pixel data shifted in both horizontal and vertical directions by up to ±0.5 pix at a rate of 1 coefficient per clock cycle, providing both output data and address to the external multi-page memory buffer.

Most camera applications use color image sensors that provide Bayer mosaic color data: one red, one blue and two diagonal green pixels in each 2×2 pixel group. In old times image senor pixel density was below the lenses optical resolution and all the cameras performed color interpolation (usually bilinear) of the “missing” colors. This process implied that each red pixel, for example, had four green neighbors (up, down, left and right) at equal distance, and 4 blue pixels located diagonally. With the modern high-resolution sensors it is not the case, possible distortions are shown on Figure 8 (copied from the earlier post). More elaborate “de-mosaic” processing involves non-linear operations that would influence sub-pixel correlation results.

As the simple de-mosaic procedures can not be applied to the high resolution sensors without degrading the images, we treat each color subchannel individually, merging results after performing the optical aberration correction in the FD, or at least compensating the lateral chromatic aberration that causes pixels x/y shift.

Channel separation means that when converting data to the FD the input data arrays are decimated: for green color only half of the values (located in a checkerboard pattern) are non-zero, and for red and blue subchannels – only quarter of all pixels have non-zero values. The relative phase of the remaining pixels depends on the required integer pixel offset (only ±0.5 are delegated to the FD) and may have 2 values (black vs. white checkerboard cells) for green, and 4 different values (odd/even for each of the horizontal/vertical directions independently) for red and blue. MCLT can be performed on the sparse input arrays the same way as on the complete 16×16 ones, low-pass filters (LPF) may be applied on the later processing stages (LPF may be applied to the deconvolution kernels used for aberration correction during camera calibration). It is convenient to multiply red and blue values by 2 to compensate for the smaller number of participating pixels compared to the green sub-channel.

Direct approach to calculate FD transformation of the color mosaic input image would be to either run the monochrome converter (Figure 5) three times (in 256*3=768 clock cycles) masking out different pixel pattern each time, or implement the same module (including the input pixel buffer) three times to achieve 256 clock cycle operation. And actually the 16×16 pixel buffer used in monochrome converter is not sufficient – even the small lateral chromatic aberration leads to mismatch of the 16×16 tiles for different colors. That would require using larger – 18×18 (for minimal aberration) pixel buffer or larger if that aberration may exceed a full pixel.

Luckily MCLT that consists of MDCT and MDST, they it turn use DCT-IV and DST-IV that have a very convenient property when processing the Bayer mosaic data. Almost the same implementation as used for the monochrome converter can transform color data at the same 256 clock cycles. The only part that requires more resources is the final phase rotator – it has to output 3*4*64 values in 256 clock cycles so three instances of the same rotator are required to provide the full load of the rest of the circuitry. This almost (not including rotators) triple reduction of the MCLT calculation resources is based on “folding” of the input data for the DTT inputs and the following DCT-IV/DST-IV relation.

*DST-IV is equivalent to DCT-IV of the input sequence, where all odd input samples are multiplied by -1, and the result values are read in the reversed order.* In the matrix form it is shown in (1) below:

$${\mathrm{DST}}^{\mathrm{IV}}=\left[\begin{array}{ccccc}0& \mathrm{}& \mathrm{}& \mathrm{}& 1\\ \mathrm{}& \mathrm{}& \mathrm{}& 1& \mathrm{}\\ \mathrm{}& \mathrm{}& 1& \mathrm{}& \mathrm{}\\ \mathrm{}& \mathrm{\cdot \cdot \cdot}& \mathrm{}& \mathrm{}& \mathrm{}\\ 1& \mathrm{}& \mathrm{}& \mathrm{}& 0\end{array}\right]\cdot {\mathrm{DCT}}^{\mathrm{IV}}\cdot \left[\begin{array}{ccccc}1& \mathrm{}& \mathrm{}& \mathrm{}& 0\\ \mathrm{}& -1& \mathrm{}& \mathrm{}& \mathrm{}\\ \mathrm{}& \mathrm{}& 1& \mathrm{}& \mathrm{}\\ \mathrm{}& \mathrm{}& \mathrm{}& \mathrm{\cdot \cdot \cdot}& \mathrm{}\\ 0& \mathrm{}& \mathrm{}& \mathrm{}& -1\end{array}\right]$$ | (1) |

Equations (2) and (3) show definitions of DCT-IV and DST-IV^{[4]}:

$${\mathrm{DCT}}^{\mathrm{IV}}\left(k\right)=\sqrt{\frac{2}{N}}\cdot \sum _{l=0}^{N\u20131}\mathrm{cos}(\frac{\pi}{N}\cdot (l+\frac{1}{2})\cdot (k+\frac{1}{2}\left)\right)$$ | (2) |

$${\mathrm{DST}}^{\mathrm{IV}}\left(k\right)=\sqrt{\frac{2}{N}}\cdot \sum _{l=0}^{N\u20131}\mathrm{sin}(\frac{\pi}{N}\cdot (l+\frac{1}{2})\cdot (k+\frac{1}{2}\left)\right)$$ | (3) |

The modified by (1) DST-IV can be re-writted as the two separate sums for even (*l=2*m*) and odd (*l=2*m+1*) input samples and by replacing output samples *k* with reversed (*N-1-k*). Then after removing full periods (n*2*π) and applying trigonometric identities it can be converted to the same value as DST-IV:

$${\mathrm{DCT}}_{\mathrm{mod}}^{\mathrm{IV}}\left(k\right)=\sqrt{\frac{2}{N}}\cdot (\sum _{m=0}^{N/2\u20131}\mathrm{cos}(\frac{\pi}{N}\cdot (2\cdot m+\frac{1}{2})\cdot ((N\u20131\u2013k)+\frac{1}{2}\left)\right)\u2013\sum _{m=0}^{N/2\u20131}\mathrm{cos}(\frac{\pi}{N}\cdot (2\cdot m+\frac{3}{2})\cdot ((N\u20131\u2013k)+\frac{1}{2}\left)\right))=$$ $$\sqrt{\frac{2}{N}}\cdot (\sum _{m=0}^{N/2\u20131}\mathrm{cos}(\frac{\pi}{N}\cdot (\frac{1}{2}\cdot N\u2013(2\cdot m+\frac{1}{2})\cdot (k+\frac{1}{2})\left)\right)\u2013\sum _{m=0}^{N/2\u20131}\mathrm{cos}(\frac{\pi}{N}\cdot (\frac{3}{2}\cdot N\u2013(2\cdot m+\frac{3}{2})\cdot (k+\frac{1}{2})\left)\right))=$$ $$\sqrt{\frac{2}{N}}\cdot (\sum _{m=0}^{N/2\u20131}\mathrm{cos}(\u2013\frac{\pi}{2}+\frac{\pi}{N}\cdot (2\cdot m+\frac{1}{2})\cdot (k+\frac{1}{2}\left)\right)\u2013\sum _{m=0}^{N/2\u20131}\mathrm{cos}(\frac{\pi}{2}+\frac{\pi}{N}\cdot (2\cdot m+\frac{3}{2})\cdot (k+\frac{1}{2}\left)\right))=$$ $$\sqrt{\frac{2}{N}}\cdot (\sum _{m=0}^{N/2\u20131}\mathrm{sin}(\frac{\pi}{N}\cdot (2\cdot m+\frac{1}{2})\cdot (k+\frac{1}{2}\left)\right)+\sum _{m=0}^{N/2\u20131}\mathrm{sin}(\frac{\pi}{N}\cdot (2\cdot m+\frac{3}{2})\cdot (k+\frac{1}{2}\left)\right))={\mathrm{DST}}^{\mathrm{IV}}\left(k\right)$$ | (4) |

As shown in Figure 1b, the 2*N-long input sequence of four fragments (a,b,c,d) is folded in N-long sequence (5) for DCT-IV input:

$$(a,b,c,d)\u27f6(\u2013\stackrel{~}{c}\u2013d,a\u2013\stackrel{~}{b})$$ | (5) |

and (6) – for DST-IV:

$$(a,b,c,d)\u27f6(\stackrel{~}{c}\u2013d,a+\stackrel{~}{b})$$ | (6) |

where tilde “~” over the name indicates reversal of the segment. Such direction reversal for the sequences of even length (N/2=4 in our case) swaps odd- and even-numbered samples. Each of the halves of each of (5) and (6) show that both have the same $(\u2013d,a)$ for direct and differ in sign for reversed: $(\u2013\stackrel{~}{c},\u2013\stackrel{~}{b})$ and $(\stackrel{~}{c},\stackrel{~}{b})$ terms. Each of the input samples appears exactly once in each of the DCT-IV and DST inputs. Even samples of the input sequence contribute identically to the even samples of both output sequences (through a an d) and contribute with the opposite signs to the odd output samples (through b and c). And similarly the odd-numbered input samples contribute identically to the odd-numbered output samples and with the opposite sign to the even ones.

So, *sparse input sequence with only non-zero values at even positions result in both DCT-IV and DST-IV having the same values at even positions and multiplied by -1 – in the odd. Odd-only input sequences result in same values in odd positions and negatives – in the even ones.*

Now this property can be combined to the previously discussed DST-IV to DCT-IV relation and extended to the two-dimensional Bayer mosaic case. We can start with the red and blue channels that have 1-in-4 non-zero pixels. If the non-zero pixels are in the even columns, then after first horizontal pass the DST-IV output will differ from the DCT-IV only by the output samples order according to (1), as even input values are the same, and odd are negated. Reversed output coefficients order for the horizontal pass means that the DST-IV will be just horizontally flipped with respect to the DCT-IV one. If the non-zero input was for the odd columns, then the DST-IV output will be reversed and negated compared to the DCT-IV one.

The second (vertical) pass is applied similarly. If original pattern had non-zero even rows, the result of vertical DST-IV would be the same as those of the DCT-IV after a vertical flip. If the odd rows were non-zero instead – the result will be vertically flipped and negated with respect of the DCT-IV one.

This result means that for the 1-in-4 Bayer mosaic array only one DCT-IV/DCT-IV transform is need. The three other DTT combinations may be obtained by flipping the DCT-IV/DCT-IV output horizontally and/or vertically and possibly negating all the data. Reversal of the readout order does not require additional hardware resources (just a few gates for a little fancier memory address counter). Data negation is also “free” as it can easily be absorbed by the phase rotator that already has adders/subtracters and just needs an extra inversion sign control. Flips do not depend on the odd/even columns/rows, only negation depends on them.

Green channel has non-zero values in either (0,0) and (1,1) or (0,1) and (1,0) positions. So it can be considered as a sum of two 1-in-4 channels described above. In the (0,0) case neither horizontal, no vertical DST-IV inverts sign, and (1,1) inverts both, so the DST-IV/DST-IV does not have any inversion, and it will stay true for the green color – combination of (0,0) and (1,1). We can not compare DCT-IV/DCT-IV with ether DST-IV/DCT-IV or DCT-IV/DST-IV, but these two will have the same sign compared to each other (zero or double negation). So the DST-IV/DST-IV will be double-flipped (both horizontally and vertically) version of the DCT-IV/DCT-IV output, and DCT-IV/DST-IV – double flipped version of DST-IV/DCT-IV, and calculation for this green pattern requires just two of the 4 DTT operations. Similarly, for the (0,1)/(1,0) green pattern DST-IV/DST-IV will be double-flipped and negated version of the DCT-IV/DCT-IV output, and DCT-IV/DST-IV – double-flipped and negated version of DST-IV/DCT-IV.

Combining results for all three color components: red, blue and green, we need total of four DTT operations on 8×8 tiles: one for red, one for blue and 2 for green component instead of twelve if each pixel had full RGB value.

MCLT converter for the Bayer Mosaic data shown in Figure 9 is similar to the already described converter for the monochrome data. The first difference is that now the *Fold sequencer* addresses larger source tiles – it is run-time configurable to be 16×16, 18×18, 20×20 or 22×22 pixels. Each tile may have different size – this functionality can be used to optimize external memory access: use smaller tiles in the center areas of the sensor with lower lateral chromatic aberrations, read larger tiles in the peripheral areas with higher aberrations. The extended tile should accommodate all three color shifted 16×16 blocks as illustrated in Figure 10.

The *start* input triggers calculation of all 3 colors, it can appear either each 256-th clock cycle or it needs to wait for at least 16 cycles after the end of the previous tile input if started asynchronously. X/Y offsets are provided individually for each color channel and they are stored in a register file inside the module, the integer top-left corner offset is provided separately for each color to simplify internal calculations.

Dual-port window ROM is the same as in the monochrome module, both horizontal and vertical window components are combined by a multiplier and then the result is multiplied by the pixel data received from the external memory buffer with configurable latency.

There are only two (instead the four for monochrome) accumulators required because each of the “folded” elements of the 8×8 DTT inputs has two contributor source pixels at most (for green color), while red and blue use just a single accumulator. Red color is processed first, then blue, and finally green. For the first ones only the DCT-IV/DCT-IV input data is prepared (in 64 clock cycles each), green color requires DST-IV/DCT-IV block additionally (128 clock cycles).

DTT output data goes to three parallel dual-port memory buffers. Red and blue channels each require just single 64-element array that later provides 4 different values in 4 clock cycles for horizontally and vertically flipped data. Green channel requires 2*64 element array. These buffers are filled one at a time (red, blue, green) and each of them feed corresponding phase rotator. Rotator outputs need 256 cycles to provide 4*64=256 FD values, the symmetry exploited during DTT conversion is lost at this stage. The three phase rotators can provide output addresses/data asynchronously or the green and blue outputs can be delayed and output simultaneously with the green channel, sharing the external memory addresses.

Results of the simulation of the MCLT converter are shown in Figure 11. Simulation ran with the clock frequency set to 100 MHz, the synthesized code should be able to run at at least 250 MHz in the Zynq SoC of the NC393 camera – at the same rate as is currently used for the data compression – all the memories and DSPs are fully buffered. Simulation show processing of the 3 tiles: two consecutive and the third one after a minimal pause. The data input was provided by the Java code that was written for the post-processing of the camera images, the same code generated intermediate results and the output of the MCLT conversion. Java used code involved double precision floating point calculations, while the RTL code is based on fixed-point calculations with the precision of the architecture DSP primitives, so there is no bit-to-bit match of the data, instead there is a difference that corresponds to the width of the used data words.

The RTL code (licensed under GNU GPLv3+) developed for the MCLT-based frequency domain conversion of the Bayer color mosaic images uses almost the same resources as the monochrome image transformation for the tiles of the same size – three times less than a full RGB image would require. Input window function modification that accounts for the two-dimensional fractional pixel shift and the post-transform phase rotators allow to avoid re-sampling for the image rectification that degrades sub-pixel resolution.

This module is simulated with Icarus Verilog (a free software simulator), results are compared to those of the software post-processing used for the 3-d scene reconstruction of multiple image sets, described in the “Long range multi-view stereo camera with 4 sensors” post.

The MCLT module for the Bayer color mosaic images is not yet tested in the FPGA of the camera – it needs more companion code to implement a full processing chain that will generate disparity space images (DSI) required for the real-time 3d scene reconstruction and/or output of the aberration-corrected image textures. But it constitutes a critical part of the Tile Processor project and brings the overall system completion closer.

[1] Thiébaut, Éric, et al. “Spatially variant PSF modeling and image deblurring.” SPIE Astronomical Telescopes+ Instrumentation. International Society for Optics and Photonics, 2016. pdf

[2] Řeřábek, M., and P. Pata. “The space variant PSF for deconvolution of wide-field astronomical images.” SPIE Astronomical Telescopes+ Instrumentation. International Society for Optics and Photonics, 2008.pdf

[3] Malvar, Henrique. “A modulated complex lapped transform and its applications to audio processing.” Acoustics, Speech, and Signal Processing, 1999. Proceedings., 1999 IEEE International Conference on. Vol. 3. IEEE, 1999.pdf

[4] Britanak, Vladimir, Patrick C. Yip, and Kamisetty Ramamohan Rao. Discrete cosine and sine transforms: general properties, fast algorithms and integer approximations. Academic Press, 2010.

]]>