October 11, 2018

GPU Implementation of the Tile Processor

by Andrey Filippov

After we coupled the Tile Processor (TP) that performs quad camera image conditioning and produces 2D phase correlation in space-invariant form with the neural network[1], the TP remained the bottleneck of the tandem. While the inferred network uses GPU and produces disparity output in 0.5 sec (more than 80% of this time is used for the data transfer), the TP required tens of seconds to run on CPU as a multithreaded Java application. When converted to run on the GPU, similar operation takes just 0.087 seconds for four 5 MPix images, and it is likely possible to optimize the code farther — this is our first experience with Nvidia® CUDA™.


Starting with CUDA™ and JCUDA

Before starting development of the GPU code we verified that the GPU acceleration is possible for the main program written in Java by evaluating demo ImageJ plugin[2]. Next was to get into development with Nvidia® CUDA™ using Nsight[3] plugin for Eclipse IDE. Nsight offered an option to import sample projects and I started to learn CUDA™ with one of them – dct8x8. That project uses “Runtime API” and consists of a mixture of C/C++ code for the CPU and for the GPU. This mixed mode is not directly compatible with JCUDA[4], but if the “kernel” (code executed by the GPU) is kept separate from the CPU one, it is possible to develop/debug/test the program in Nsight IDE and then use the same file containing just the GPU kernel(s) with JCUDA. What needs to be changed is the portion of the non-GPU C/C++ code that transfers data between the computer system (CPU) memory and the GPU dedicated memory physically located on the graphic card.

Guided by the Nvidia® dct8x8 sample project I first implemented similar DCT-IV[5] and DST-IV needed for the Complex Lapped Transform (CLT) used by the TP for conversion to the frequncy domain and back (sample project contained code only for the DCT-II (direct) and DCT-II (inverse) used in JPEG and related applications. After several iterations I made those programs to run almost as fast as the highly optimized code in the Nvidia® sample, and the next step was to implement complete CLT and aberration correction for the Bayer mosaic images, following the approach we used for the RTL[6]. Debugging was simplified by the fact that the same algorithm was already tested in Java, and the intermediate data arrays could be compared between the Java and CUDA™ outputs.

Kernel for the Complex Lapped Transform of the Bayer Mosaic Images

The first implemented kernel is taking the following inputs:

  • Four of the 5 MPix Bayer mosaic images.
  • Four of the per-camera arrays of the space variant (stride 16) CLT deconvolution kernels already converted to the frequency domain (this data is reused for multiple image sets).
  • Additional kernel fractional pixel x,y offsets and their derivatives to interpolate required image shifts between the grid nodes of the space-variant kernels
  • List of the tiles to process that contains tiles location on the tile grid and the fractional pixel X,Y shift calculated externally from the required disparity for the known radial lens distortions. This list is used because not all the tiles need to be processed on each pass, when building the depth map only some tiles need to be processed, and some tiles require multiple iterations with different disparity values.

Output from the first kernel is a set of per-camera (4), per tile (324×242 for the 2592×1936 images), per color component (3) of 4×8×8 arrays representing CLT frequency domain transformation of each of the 16×16 (stride 8) pixels tiles of the source images. The transformation sequence is described in the earlier post[6], it assumes the following stages:

  • Finding the closest de-convolution kernel for the specified tile index and offsets.
  • Calculation of the full horizontal and vertical offset that combines requested image tile center and the interpolated kernel data.
  • Splitting X,Y offsets into rounded integer and fractional pixel shifts. Integer part is used to select position of the 16×16 window in the source image, fractional one is applied later in the form of the phase shift in the frequency domain. It is also applied to the window function.
  • Folding 16×16 image tiles into the 8×8 pixel ones for the 2D DCT-IV/DST-IV conversions, using the shifted 2D half-sine window function.
  • Calculating the CLT layers: DCT-IV/DCT-IV, DST-IV/DCT-IV, DCT-IV/DST-IV and DST-IV/DST-IV (horizontal pass/vertical pass). For the Bayer mosaic input, of the 12 (3 colors by 4 CLT layers) transforms only 4 are actually needed, other 8 are restored using the symmetry of transforms of sparse (1 in 4 non-zero for red and blue, 1 in 2 for green color components) inputs.
  • Element-wise multiplication of the converted image tile by the kernel tile, equivalent to the pixel-domain convolution.
  • Applying the residual fractional pixel shift (in the range of +/-0.5 pix in each direction) implemented as a phase rotation. Such shift does not involve re-sampling as the space domain shift would, and so it does not introduce any related quantization noise.
  • Optional 2D low pass filter that can be combined with the convolution kernels.

Kernels Performance Evaluation

When I went through all the processing pipeline, and made sure the results match the Java code output. I measured the execution time and was disappointed – ~5 seconds per set of 4 images wasn’t what I expected. Using profiler I soon realized that my understanding of CUDA™ was wrong – all the participating threads have to execute exactly the same code, it is not like in multi-threaded CPU code or multiple simultaneously operating modules in RTL. Re-writing the code to eliminate divergence reduced execution time to 0.9 s. Not too exciting, but still significant gain over the CPU alone. Another kernel for inverse transform to convert from the frequency domain back to the images added 0.6 seconds. The inverse MCLT produces overlapping 16×16 (stride 8) tiles that need to be added together to result in a full picture, implementation uses 4 (stride 2 in each direction) passes over the image, with each pass free of any overlaps between tiles allowing asynchronous parallel execution.

After spending about two weeks troubleshooting the code kernel code as a part of the C++ application I was expecting to encounter more difficult to pinpoint problems when adding these GPU kernels to the Java application. There are multiple data arrays to be fed to the GPU with no convenience of having the debugger at hand.

In reality it was much easier – JCUDA[4] does a wonderful job and it took me just a few hours to convert the code to run as a GPU accelerator for the Java program. I already had the Java code to convert images and convolution kernels to the data arrays that were passed to the C++ test application via binary files, only what remained to be done was to flatten multi-dimensional arrays and to replace an array of struct – it had a mixture of integer (tile indices) and float (offsets) members. With that done, there were a couple bugs left, but they were clearly reported in the Java stack trace output.

I do not know if it is possible to use #include directives in the code compiled from JCUDA, but as the source code is anyway first read from the file to a Java String before it is sent to the compiler, I just concatenated all the needed (currently just 2) files as strings. I also added “#define JCUDA” to the concatenated string before the files content, as well as some other numeric defines (like image dimensions) shared between the Java and the GPU code. I enclosed all the includes and the duplicate parameter defines within #ifndef JCUDA in the GPU kernel source files, that made the same source code files to work in both Nsight IDE and in Java application with nvrtcCreateProgram(). Soon I got the complete output images and verified they are the same as generated by the Java code.

And in the end I’ve got an unexpected and wonderful “reward”. When stepping over the critical GPU-related Java code lines I was expecting 1.5 second delay for the execution of the two kernels, but could not notice any. I enclosed each GPU kernel call with extra cuCtxSynchronize() thinking it did not wait for completion – still no visible delay. Then I put a loop to run kernels 100 times and got ~6 seconds for the first kernel. Something was obviously wrong – but the output images were correct. I went back to the IDE and made sure I have “Release” (not “Debug”) in seemingly every relevant place, but still it was running slow. Then I launched built binaries from the command line and discovered, that while “Debug” version is as slow as when run from the IDE, the “Release” version is 20.1 times faster, same performance as with JCUDA.

Kernel description Debug mode execution time Release mode execution time
Convert 4 images to FD and deconvolve with kernels 1073.4 ms 57.0 ms
Convert 4 images from FD to RGB slices 665.3 ms 29.5 ms
Total 1738.7 ms 86.5 ms

These run times were measured for the “GeForce GTX 1050 Ti” with compute capability 6.1, 4GB memory.


The GPU code implemented and tested so far does not include the 2D phase correlation kernel need for the depth map generation, I started with just the rectified images as the pictures usually show if something is wrong in the processing. Phase correlation kernel will be easy to implement (the first kernel will stay the same) and the execution time will be approximately the same. Then we will work on coupling this code with the Tensorflow inferred network and feed the 2D correlation data directly from the GPU device memory. That will result in the near real-time performance of the whole system.

Links to the source code are provided below[7],[8],[9] (the code is also mirrored at Github). It needs to be cleaned up and may eventually be released as a library after more functionality will be added. We believe this code will be useful for variety of imaging applications both coupled with the ML systems or traditional. The phase correlation can be used for multiple view camera systems (as in our case) or for the optical flow processing when matching images acquired by the same camera in subsequent frames.


[1] “Neural network doubled effective baseline of the stereo camera”

[2] “How to create an ImageJ Plugin using JCuda”

[3] Nvidia®Nsight™ Eclipse Edition

[4] Java bindings for Nvidia®CUDA™

[5] Wikipedia article: “Discrete Cosine Transform”

[6] “Efficient Complex Lapped Transform Implementation for the Space-Variant Frequency Domain Calculations of the Bayer Mosaic Color Images”

[7] Source code of the 8×8 DCT-II,DST-II,DCT-IV and DST-IV for GPU

[8] Source code of the Tile Processor GPU implementation

[9] Source code of the Java class to integrate GPU acceleration with JCUDA

Leave a Reply

Your email address will not be published. Required fields are marked *

+ nine = 18