Elphel Development Blog

Calibration of the LWIR16 camera prototype

Andrey Filippov — Fri, 16 Jul 2021 20:30:53 +0000

Advance Steel 2024: https://abcoemstore.com/product/autodesk-advance-steel-2024/ System Requirements Learn about the system requirements for configuring Advance Steel.
Download links for: video and captions.

This research “Long-Range Thermal 3D Perception in Low Contrast Environments” is funded by NASA contract 80NSSC21C0175

Convert ip cam to web cam in Linux for Jitsi, Zoom, etc.

Oleg Dzhimiev — Mon, 01 Jun 2020 18:39:11 +0000

Basically, one need to direct the ip cam stream (mjpeg or rtsp) to a virtual v4l2 device which acts like a web cam and is automatically picked up by a web browser or a web cam application. This can be easily done by gstreamer or ffmpeg.

Quick setup

Install and create a virtual webcam

~$ sudo apt install v4l2loopback-dkms ~$ sudo modprobe v4l2loopback devices=1

Direct the stream

The options below are for gstreamer/ffmpeg and rtsp/mjpeg streams:

gstreamer:
# rtsp stream ~$ sudo gst-launch-1.0 rtspsrc location=rtsp://192.168.0.9:554 ! rtpjpegdepay ! jpegdec ! videoconvert ! tee ! v4l2sink device=/dev/video0 sync=false # mjpeg stream ~$ sudo gst-launch-1.0 souphttpsrc is-live=true location=http://192.168.0.9:2323/mimg ! jpegdec ! videoconvert ! tee ! v4l2sink device=/dev/video0

ffmpeg:
# rtsp stream ~$ sudo ffmpeg -i rtsp://192.168.0.9:554 -fflags nobuffer -pix_fmt yuv420p -f v4l2 /dev/video0 # mjpeg stream ~$ sudo ffmpeg -i http://192.168.0.9:2323/mimg -fflags nobuffer -pix_fmt yuv420p -r 30 -f v4l2 /dev/video0

Finally

Start a video conference application, select the webcam from the menu (/dev/video0).
Test link: meet.jit.si

Details

Requirements

ffmpeg (2.8.15) or gstreamer (1.8.3)
IP camera capable of streaming mjpeg or rtsp – in this case it’s Elphel NC393

IP camera setup

rtsp stream ports for Elphel NC393 cameras: 554, 556, 558, 560
mjpeg stream ports for Elphel NC393 cameras: 2323, 2324, 2325, 2326
stream resolution/fps: 1920×1080, 30fps
ffmpeg for mjpeg stream – “-r 30” sets fps – the default ffmpeg setting is 25 fps

Useful commands

Print web cam settings:
v4l2-ctl -d 0 --all
Gstreamer play mjpeg:
gst-launch-1.0 souphttpsrc is-live=true location=http://192.168.0.9:2323/mimg ! jpegdec ! autovideosink
Gstreamer play rtsp:
gst-launch-1.0 -v playbin uri=rtsp://192.168.0.9:554 uridecodebin0::source::latency=0
Test stream for web cam:
sudo gst-launch-1.0 videotestsrc ! tee ! v4l2sink device=/dev/video0

Comments

Gstreamer 1.8.3 didn’t work without tees in the pipeline.
modprobe v4l2loopback can create multiple devices

TPNET with LWIR

Andrey Filippov — Sun, 04 Aug 2019 06:41:27 +0000

Figure 1. Talon (“instructor/student”) test camera.

Update: arXiv:1911.06975 paper about this project.

Summary

This post concludes the series of 3 publications dedicated to the progress of Elphel five-month project funded by a SBIR contract.

After developing and building the prototype camera shown in Figure 1, constructing the pattern for photogrammetric calibration of the thermal cameras (post1), updating the calibration software and calibrating the camera (post2) we recorded camera image sets and processed them offline to evaluate the result depth maps.

The four of the 5MPix visible range camera modules have over 14 times higher resolution than the Long Wavelength Infrared (LWIR) modules and we used the high resolution depth map as a ground truth for the LWIR modules.

Without machine learning (ML) we received average disparity error of 0.15 pix, trained Deep Neural Network (DNN) reduced the error to 0.077 pix (in both cases errors were calculated after removing 10% outliers, primarily caused by ambiguity on the borders between the foreground and background objects), Table 1 lists this data and provides links to the individual scene results.

For the 160×120 LWIR sensor resolution, 56° horizontal field of view (HFOV) and 150 mm baseline, disparity of one pixel corresponds to 21.4 meters. That means that at 27.8 meters this prototype camera distance error is 10%, proportionally lower for closer ranges. Use of the higher resolution sensors will scale these results – 640×480 and longer baseline of 200 mm (instead of the current 150 mm) will yield 10% accuracy at 150 meters, 56°HFOV.

Table1: LWIR 3D disparity measurement accuracy enhanced by TPNET deep neural network (DNN).

Scene timestamp and comments [Expand all]

Non-TPNET disparity error (pix)

TPNET disparity error (pix)

TPNET accuracy gain

1562390202_933097: Single person, trees.

EO(m)	LWIR(m)	Err(%)
16.2	16.5	1.64
10.2	10.1	-1.13
17.7	18.4	3.93
51.2	46.5	-9.17
20.6	19.7	-4.47

Detailed results page (pdf)

0.136

0.060

2.26

1562390225_269784: Single person, trees.

EO(m)	LWIR(m)	Err(%)
16.5	16.4	-0.04
26.0	27.5	5.98
9.2	9.8	7.41
6.3	6.2	-0.62

Detailed results page (pdf)

0.147

0.065

2.25

1562390225_839538: Single person. Horizontal motion blur.

EO(m)	LWIR(m)	Err(%)
9.1	8.3	-8.65
5.5	5.7	2.54
35.8	37.6	4.88
14.9	16.1	7.95

Detailed results page (pdf)

0.196

0.105

1.86

1562390243_047919: No people, trees.

EO(m)	LWIR(m)	Err(%)
7.5	7.2	-3.31
9.9	10.2	2.53
27.2	26.6	-2.41
108.7	95.2	-12.42
121.3	143.3	18.16
6.3	6.3	-0.68
9.8	10.1	3.52

Detailed results page (pdf)

0.136

0.060

2.26

1562390251_025390: Open space. Horizontal motion blur.

EO(m)	LWIR(m)	Err(%)
8.1	7.9	-2.80
19.7	21.0	6.86
25.6	27.1	5.90
8.2	8.2	-0.06
47.6	48.9	2.70
107.3	96.9	-9.71
23.0	21.5	-6.40

Detailed results page (pdf)

0.152

0.074

2.06

1562390257_977146: Three persons, trees.

EO(m)	LWIR(m)	Err(%)
10.0	9.6	-3.73
9.1	8.8	-3.11
6.9	7.0	1.38
11.4	11.3	-1.00
8.7	8.6	-1.07

Detailed results page (pdf)

0.146

0.074

1.96

1562390260_370347: Three persons, near and far trees, minivan.

EO(m)	LWIR(m)	Err(%)
11.1	10.7	-3.34
10.4	10.4	0.12
8.7	8.8	1.14
11.3	11.4	0.56
8.5	8.6	1.69
11.3	11.3	-0.83
32.8	34.2	4.29
91.6	84.9	-7.23
49.8	47.9	-3.80
130.0	140.7	8.26
51.9	54.6	5.35

Detailed results page (pdf)

0.122

0.058

2.12

1562390260_940102: Three persons, trees.

EO(m)	LWIR(m)	Err(%)
11.8	11.0	-6.45
10.7	10.3	-3.76
8.7	9.1	4.98
11.3	11.4	0.18
8.6	8.5	-1.87
11.0	11.2	2.27
31.4	35.3	12.25
27.9	29.6	6.31

Detailed results page (pdf)

0.135

0.064

2.12

1562390317_693673: Two persons, far trees.

EO(m)	LWIR(m)	Err(%)
9.5	9.4	-0.65
11.0	11.3	2.28
66.8	76.6	14.66
95.0	79.1	-16.75
81.9	93.0	13.45
69.4	72.8	4.83
9.8	9.4	-3.67

Detailed results page (pdf)

0.157

0.078

2.02

1562390318_833313: Two persons, trees.

EO(m)	LWIR(m)	Err(%)
8.3	8.3	-0.67
10.6	10.7	1.00
81.2	78.7	-3.17
63.6	48.4	-23.98
86.2	78.9	-8.46
54.0	53.4	-1.10
12.0	11.8	-1.61
76.9	80.0	3.94
23.3	28.4	21.99

Detailed results page (pdf)

0.136

0.065

2.10

1562390326_354823: Two persons, trees.

EO(m)	LWIR(m)	Err(%)
6.5	6.2	-4.85
9.2	9.4	2.86
21.3	21.0	-1.36
6.9	7.1	2.15
19.7	19.1	-3.08

Detailed results page (pdf)

0.144

0.090

1.60

1562390331_483132: Two persons, trees.

EO(m)	LWIR(m)	Err(%)
5.4	5.1	-5.19
8.1	8.1	0.03
6.9	6.8	-1.09
5.9	6.0	1.29
16.4	15.9	-2.72

Detailed results page (pdf)

0.209

0.100

2.08

1562390333_192523: Single person, sun-heated highway fill.

EO(m)	LWIR(m)	Err(%)
4.3	4.2	-2.32
10.6	10.5	-0.75
6.7	6.8	0.51
20.2	19.1	-5.67
14.7	14.3	-2.81
18.1	19.0	4.60

Detailed results page (pdf)

0.153

0.067

2.30

1562390402_254007: Car on a highway, background trees.

EO(m)	LWIR(m)	Err(%)
11.9	11.9	0.30
31.1	29.6	-5.05
27.6	28.1	1.66
30.6	35.1	14.51

Detailed results page (pdf)

0.140

0.077

1.83

1562390407_382326: Car on a highway, background trees.

EO(m)	LWIR(m)	Err(%)
12.0	12.3	2.53
30.7	29.3	-4.47
27.3	25.6	-6.30
29.9	30.2	1.14
27.4	34.5	25.73

Detailed results page (pdf)

0.130

0.065

2.01

1562390409_661607: Single person, a car on a highway.

EO(m)	LWIR(m)	Err(%)
9.5	9.5	-0.35
28.3	32.1	13.44
124.5	88.7	-28.74
29.4	31.4	6.72
14.4	14.9	3.24

Detailed results page (pdf)

0.113

0.063

1.79

1562390435_873048: Single person, two parked cars.

EO(m)	LWIR(m)	Err(%)
7.8	7.9	1.18
22.2	22.4	0.85
56.9	65.3	14.91
121.4	113.0	-6.96

Detailed results page (pdf)

0.153

0.057

2.68

1562390456_842237: Close trees.

EO(m)	LWIR(m)	Err(%)
9.6	9.5	-1.69
6.7	6.5	-2.49
114.6	96.0	-16.25
7.4	7.4	0.26
8.1	8.0	-1.49

Detailed results page (pdf)

0.211

0.102

2.08

1562390460_261151: Trees closer than the near clipping plane.

EO(m)	LWIR(m)	Err(%)
5.4	5.4	0.76
3.9	3.8	-0.53
4.2	4.3	3.01
4.9	5.0	3.48
100.1	82.5	-17.59

Detailed results page (pdf)

0.201

0.140

1.44

Average

0.154

0.077

2.04

Notes: click on the scene title to show details. Click on the image to start/stop GIF animation.

Detailed results

Figure 2 illustrates results at the output and at the intermediate processing steps. Similar data is availble for each of the 19 evaluated scenes listed in the Table 1 (Figure 2 corresponds to scene 7, highlighted in the table).

Figure 2. LWIR depth map generation and comparison to the ground truth data.

Processing steps to generate and evaluate LWIR depth maps

Four visible range 5 Mpix (2592×1936) images (Figure 2b) are corrected from the optical aberrations and are used to generate disparity map c) that has 1/8 resolution (324×242 where each pixel corresponds to an overlapping 16×16 tile in the source images) with the Tile Processor. Red frames in b) and c) indicate field of view of the LWIR camera modules. Each of the high resolution disparity tiles in c) is mapped to the corresponding tiles of the 20×15 LWIR disparity map f). This data is used as ground truth for both neural network training and subsequent testing and inference (20% of the captured 1100 image sets are reserved for testing, 80% are used for training).

One of the 4 simultaneously acquired LWIR images is shown in Figure 2a. Table 1 contains animated GIFs for each set of 4 scene images, animation (click image to start) consequently shows all 4 views of the quad camera setup. These 4 LWIR images are partially rectified and then subject to 2D correlation between each of the 6 image pairs, resulting in the disparity map d) similarly to how it is done for the visible range (aka electro-optical or EO) ones. Sub-plot g) shows the difference between LWIR disparity map calculated without machine learning d) and the ground truth data f). Subplot e) contains disparity map enhanced by the ML that uses raw 2D-correlation data (Figures 3, 4) as its input features. And the last subplot h) shows difference between e) and f) as the resudual errors after full TPNET processing. Manually placed markers listed in a) and b) correspond to the scene features, black markers in a) show linear distances calculated from the output e), while red markers legend in b) lists the corresponding ground truth distances. Table 1 contain the same data and additionally shows relative errors of the LWIR measurements.

Gray “×”-es in g) and h) indicate tiles where data may be less reliable. These tiles include 2-tile margins on each side of the map, caused by the fact that convolution part of the neural network analyses 5×5 tile clusters. Bottom edge tiles use pixels that are not available in top pair of the four images because of the large parallax for the near objects, so bottom tiles use only a single binocular pair instead of the 6 available for the inner tiles.

Another case of the potential mismatch between the disparity output and the ground truth data is when different range objects in high-resolition map c) correspond to the same tile in LWIR maps d) and e). The network is trained to avoid averaging of the far and near objects in the same tile, but it still can wrongly identify which of the planes (foreground or background) was selected for the ground truth map. For this reason all LWIR tiles that have both foreground and background match in c) are also marked.

Most of the tiles that have both foreground and background objects are assigned corectly so we do not remove the marked tiles from the average error calculation. Instead we handle gross outliers by discarding a fixed fraction of all tiles (we used 10%) that have the largest disparity errors and average errors among the remaining 90% of tiles.

Depth maps X/Y resolution

Disparity maps generated from the low resolution 160×120 LWIR sensors of the prototype camera have even lower resolution of just 20×15 tiles. There are other algorithms that generate depth maps with the same pixel resolution as the original images, but they have lower ranging accuracy. Instead we plan to fuse high-reolution (pixel-accurate) rectified source images and low X/Y (but high Z) resolution depth maps with another DNN. That was not a part of this project, but it is doable — similar approach is already used for fusion of the high-resolution regular images with the low pixel resolution time-of-flight (ToF) cameras or LIDARs. Deep subpixel resolution needed for the ranging accuracy requires many participating pixels for each object no be measured, so the fusion will not significantly increase the number of different objects in each tile, but it can restore the pixel accuracy of the object contours.

Comments on the individual scenes

Among the evaluated scenes there are two with strong horizontal motion blur caused by the operator turning around the vertical axis – scenes 3 and 5. Such blur can not be completely eliminated by the nature of the microbolometer sensors, so the 3D perception has to be able to mitigate it. If it was just a traditional binocular system, horizontal blur would drown useful data in the noise and make it impossible to achieve a deep subpixl resolution. Quad non-colinear stereo layout is able to mitigate this problem and the acuracy drop is less prominent.

The last two scenes (18 and especially 19) demonstrate high disparity errors (0.1 and 0.14 pixels respectively). These large errors correspond to high absolute disparity value (over 8 pix) and low distances (under 3 meters) where disparity variations within the same tile are high and so the disparity of a tile is ambiguous. On the other hand, relative disparity error for even 0.4 pixels is only 5% or just 15 cm at 3 m range. Scene 19 has details in the right side that are closer than the near clipping plane preset for of the ground truth camera.

TPNET – Tile Processor with the deep neural network

Image rectification and 2D correlation

Earlier post “Neural network doubled effective baseline of the stereo camera” (and also arXiv:1811.08032 preprint) explains the process of aberration correction, partial rectification of the source images and calcualtion of the 2D phase correlation using Complex Lapped Transform (CLT) for the 16×16 overlapping tiles (stride 8) of the quad camera high resolution visible range images. We used the same algorithms in the current project too. Depth perception with TPNET is a 2-step process (actually iterative as these steps may need be repeated) similar to the human stereo vision – first the eyes are converged at the objects at certain distance producing zero disparity between the two images there, then data from the pair of images is processed (locally in retinas and then in the visual cortex) to extract residual mismatch and convert it to depth. There are no moving parts in TPNET-based system, instead the Tile Processor implements target disparity (“eye convergence”) by image tile shifts in the frequency domain with CLT, so it is as if each of the image tiles had a set of 4 small “eyes” that simultaneously can converge at different distances.

Figure 3. Set of four images of the scene 7. (click to start/stop)

Figure 3 shows partially rectified set of 4 LWIR images of the scene 7. “Partially” means that they are not rectified to the complete rectilinear (pinhole) projection, but are rather subject to subtle shifts to a common to all four of them radial distortion model. CLT is similar to 2D Fourier transform and is also invariant to shifts, but not to stretches and rotations. We use subpixel-accurate tile shifts in the frequency domain, benefitting from the fact that individual lenses while different have similar focal length and distortion parameters. Complete distortion correction would require image resampling increasing computation load and memory footprint or reducing subpixel resolution.

Next two figures contain 2D phase correlation data that is fed to the DNN part of TPNET. Figure 4 is calculated with each tile target disparity of zero (converged at infinity). The four frames of the GIF animation correspond to 4 stereo pairs directions – horizontal, vertical and 2 diagonals, as indicated in the top left image corner.

Figure 4. 2D phase correlations with all tiles target disparity set to 0. (click to start/stop)

Figure 5. 2D phase correlations, target disparity set by the polynomial approximaton. (click to start/stop)

Similarly to the source images (Figure 3) all the correlation tiles for infinity objects (as clouds in the sky) do not move when animation is running, while the near objects do – the residual disparity (diffenece beween the full disparity and the “target disparity”) value corresponds to the offest of the spot center from the grid center in the dirrection of the red arrows. The correlation result is most accurate when residual disparity is near zero, the farther the spot center is from zero, the less source pixels participate in matching and so the correlation S/N ratio and resolution drop.

Figure 5 is similar, but this time each tile target disparity is set by the iterative process of finding residual correlation argmax by the polynomial approximation and re-calculating new 2D correlation. This time most of the tiles do not move, except those that are on the foreground/backgroud boreders.

Neural network layout

Figure 6. Network diagram

The 2D phase correlation tiles shown above and the per-tile target disparity value make the input features that are fed to the neural network during training and later inference. We tried the same network architecture and cost finctions we developed for the high resolution cameras (blog post, arXiv:1811.08032 preprint) and it behaved really well, allowing us to finish this project in time.

The network consists of two stages, where the first one gets input features from a single tile 2D phase correlation as [4,9,9] tensor (first dimension, 4 is the number of correlation pairs, two others (9×9) specify the center window of the full 15×15 2D correlation – 4 pixels in each direction around the center). Stage 1 output for each tile is a 16-element vector fed to the Stage 2 inputs.

Stage 2 subnet for each tile gets input features from the corresponding stage 1 output and its 24 neighbors in a form of a [5,5,16] tensor. The output of the stage 2 is a single scalar for each tile or [15,20,1] tensor for the whole disparity map, shown in Figure 2e. Such two-stage layout is selected to allow the network to use neighbor tiles when generating its disparity output (such as to continue edges, to fill low-textured areas, and to improve foreground/background separation), but simultaneously limit dimensions of the fully connected layers.

Training and the cost function

Instead of the full images the training sets consist of the 5×5 tile fragments harvested from the 880 image sets. Randomly constructed batches use 2D histogram normalization of the image tiles, so each batch includes similar repesenation of near and far, high contrast and low contrast tiles. Cost function design followed the same approach as described in the earlier post, we clipped the L2 cost of the difference with ground truth (essentially giving up on gross outliers caused by incorrect foreground/background tiles assignment) and included additional terms.

Figure 7. Screenshot of the Tensorboard after TPNET training. a) test image average disparity error; b) network accuracy gain over heuristic (non-NN); c) total cost function; d) cost of “cutting corners” between the foreground and background; e) cost of mixture of L2 for 5×5, 3×3 and single tile Stage2; f) cost for 5×5 only (used for inference); g) cost for 3×3 only; h) cost for single tile only.

Additional cost function terms included regularization to reduce overfitting, and the one to avoid averaging of the distances to the foreground and background objects for the same tile, forcing the network to select either fore- or background, but not their mixture.

Regularization included “shortcuts” requiring that if just a single (center) tile input from Stage 1 is fed to stage 2, it should provide less accurate but still reasonably good disparity output (Figure 7h), same for the 3×3 sub-window (Figure 7g). Only full 5×5 Stage 2 output (Figure 7f) is used for test image evaluation and inference, but training uses a weighted mixture of all three (Figure 7e).

Cost function term to reduce averaging of foreground and background disparities in the border tiles (Figure 7d) is made in the following way: for each tile the average of the ground truth disparities of its 8 neighbors is calculated, and additional cost is imposed if the network output falls between the ground truth and this average. Figure 7c shows the total cost function that combines all individual terms.

Conclusion

Experiments with the prototype LWIR 3D camera provided disparity accuracy consistent with our earlier results obtained with the high resolution visible range cameras.
The new photogrammetric calibration pattern design is adequate for the LWIR imaging and can be used for factory calibration of the LWIR cameras and camera cores.
Machine learning with TPNET provided 2× ranging accuracy improvement over the one obtained with the same data and traditional only (w/o machine learning) processing.
Combined results proved the feasibility of completely passive LWIR 3D perception even with relatively low resolution thermal sensors.

Photogrammetric Calibration of the Quad Thermal Camera

Andrey Filippov — Wed, 19 Jun 2019 01:46:55 +0000

Figure 1. Calibration of the quad visible range + quad LWIR camera.

We’ve got the first results of the photogrammetric calibration of the composite (visible+LWIR) Talon camera described int the previous post and tested the new pattern. In the first experiments with the new software code we’ve got average reprojection error for LWIR of 0.067 pix, while the visible quad camera subsystem was calibrated down to 0.036 pix. In the process of calibration we acquired 3 sequences of 8-frame (4 visible + 4 LWIR) image sets, 60 sets from each of the 3 camera positions: 7.4m from the target on the target center line, and 2 side views from 4.5m from the target and 2.2 m right and left from the center line. From each position camera axis was scanned in the range of ±40° horizontally and ±25° vertically.

In the first stage of calibration the acquired images (Figure 2 shows both LWIR and visible range images) were processed to extract the detected pattern grid node. Each node’s pixel coordinates (x,y), a pair of the pattern UV indexes and the contrast are stored in the form of multi-layer TIFF files – processed images from Figure 2 are shown on animated GIFs, Figure 3 is for LWIR and Figure 4 – for the visible range images. Each of the images has a link to a respective TIFF file that has metadata visible with ImageJ program.

Figure 2. LWIR and visible range images of the pattern.

The software for the calibration had to be modified, as it did not have provisions for handling different resolution sensors in the same system. Additionally it was designed for the smaller pattern cell size so each image had larger number of detected nodes. We also had to use other method of absolute pattern matching (finding out which of the multiple almost identical measure nodes corresponds to the particular (e.g. center) node of the physical pattern. Earlier, with the fine pitch patterns, we used software-controlled laser pointers, with the current approach we used the fact that the visible subsystem was already calibrated – that provided sufficient data for absolute matching. Only one image for each camera position had manually marked center pattern cell – that mark simplified the processing.

Before LWIR data was used the visible rage images were processed with Levenberg-Marquardt algorithm (LMA) and the system knew precise position and orientation of the camera during capturing of each image set. After preliminary adjustment with idealized (uniform and flat) pattern grid it was possible to simultaneously refine each physical pattern node location in 3d, similarly as we did earlier (Figures 4 and 5 in this post). Using large cell pattern, the visible cameras bundle adjustment resulted in 0.11 pixels when using just radial distortion model. This was improved to 0.036 pix when using non-radial correction to compensate other irregularities of the lenses.

When LWIR data was added and each image set had known extrinsic parameters, the absolute grid matching was performed in the following way. Starting with the image with the smallest distance from the target (so the wrong matching of the grid cell would cause attitude error larger than initial precision of the system assembly) the attitude angles of each of the 4 LWIR subcameras was found. Then for each additional image LMA was run with all but 2 parameters known and fixed – camera base position X and Y coordinates (in the plane parallel to the target). As the position was already known from the visible range subsystem, the X and Y difference was matched to the physical pattern dimensions to determine how many cells (if any) the assumed grid match is offset. That gave absolute grid match for all of the LWIR images, and bundle adjustment of the LWIR modules intrinsic (focal length, offset from the center pixel, radial distortion coefficient) as well as the modules location in the camera coordinate system and their attitudes was possible. It resulted in 0.097 pixels reprojection error with just the radial distortion model. We did not expect much from the non-radial correction as the sensor resolution is low, but it still gave some improvement – the residual error dropped down to 0.067 pix. It is likely this value can be improved more with better tweaking of the processing parameters – the current result was achieved in the first run.

Figure 3. Processed pattern grid from LWIR data (click image to download ImageJ Tiff stack).

Figure 4. Processed pattern grid from visible range camera data (click image to download ImageJ Tiff stack).

These calibration results are not that exciting compared to what we will get with the LWIR 3D model reconstruction, but it gives us confidence that we are on the right track, and our goal is possible. Having a much higher resolution sub-system conveniently provides ground truth data for training of the neural network – something that was much more difficult to achieve with the high resolution visible range only systems.

3D Perception with Thermal Cameras

Andrey Filippov — Thu, 06 Jun 2019 23:33:53 +0000

Figure 1. Oleg carrying combined LWIR and visible range camera.

While working on extremely long range 3D visible range cameras we realized that the passive nature of such 3D reconstruction that does not involve flashing lasers or LEDs like LIDARs and Time-of-Flight (ToF) cameras can be especially useful for thermal (a.k.a LWIR – long wave infrared) vision applications. SBIR contract granted at the AF Pitch Day provided initial funding for our research in this area and made it possible.

We are now in the middle of the project and there is a lot of the development ahead, but we have already tested all the hardware and modified our earlier code to capture and detect calibration pattern in the acquired images.

Printed Calibration Patterns for LWIR

Figure 2. Initial LWIR pattern detection test with a small (100×100 pix) FLIR handheld camera.

For the LWIR imaging we planned to use approach similar to what we have developed for the visible range cameras, and the first step is the precise photogrammetric camera calibration. Before engaging in the project we made sure that we can capture images of the pattern in LWIR wavelength range, the same pattern image should also be registered by the visible range camera to be used as the ground truth data. For the regular cameras there are many ways and materials to create high quality test patterns, our current setup uses 7.0 meters by 3.0 meters pattern printed on five sheets of self-adhesive film attached to the wall. This pattern illuminated by the regular bright fluorescent lamps did not register any image by a LWIR camera (we borrowed a small, handheld FLIR camera, used for HVAC applications).

In the next (successful) test we used a larger scale pattern printed on paper that was not bonded to the wall. The pattern was illuminated by a pair of 500W halogen lamps (see Figure 2). The lamp radiation — both visible and infrared (mostly infrared) was heating the image and so the black painted areas on paper became hotter than the white ones as shown in the FLIR camera image (it is upside down as the camera was held in a vise in that position). The measured temperature difference between white and black areas, illuminated from the distance of 1.5–2 meters was in the range of 2–4°K.

Figure 3. LWIR calibration pattern design.

This quick test proved that with reasonable amount of high power halogen lamps we can achieve sufficient temperature contrast for the photogrammetric calibration. It still has some limitations — while the amount of heat absorbed by the pattern paint can be calculated from the data registered by the visible range cameras, the pattern surface temperature also depends on less controlled effects – air convection and heat transfer to the wall where the paper touches it. These factors reduce accuracy of the calibration, to mitigate their influence we designed a pattern where the ground truth relative surface temperature can be calculated from the measured visible range images and a few parameters that describe thermal properties of the materials.

Calibration Pattern Design

We used 5mm thick presentation type foam boards that consist of the polystyrene foam layer between two paper surface sheets. As we found out later, it is important to use both black and white paints during the printing process – unpainted paper surface that looks matte in visible light behaves like a mirror for the LWIR radiation and reflects the lamps used for illumination. You may find such reflection in Figure 5 – the leftmost panel that was built first does not have white paint. A thin (t=0.8 mm) aluminum sheets are glued to the back surfaces of the foam boards, and the forced airflow equalizes their temperatures. Figure 4 reveals the back side of the pattern module – thicker (1/2″ and 1″) machined foam boards make ribs that provide structural integrity of the pattern modules. These ribs are attached to the aluminum side of the pattern sheets with epoxy resin, their other side is reinforced with the fiberglass tape. Soft foam along the perimeter of the modules allows another use of the fans that provide the forced airflow – the developed vacuum is used to firmly press the modules to the smooth (covered with the visible range pattern film) wall. Timelapse video below illustrates the process of building and installation of the LWIR calibration pattern.

Figure 4. A stack of two face-down LWIR pattern modules.

Figure 3 illustrates temperature calculation for the uniformly illuminated pattern, this method will be used to derive ground truth thermal data from the registered visible range images. For a straight line on a surface of the pattern the absorbed light power produces a step function. The temperature of the underlying aluminum is a blurred version of that function, with the sigma depending on the thicknesses and thermal conductivity ratios of the foam and aluminum. With the used parameters amplitude of the aluminum temperature variations is below 1%. For the very thin foam the surface temperature would be just a sum of the aluminum temperature and a step function, the actual surface temperature distribution is blurred by the amount proportional to the foam thickness as the heat flow under the black/white pattern borders is not exactly orthogonal to the surface.

Calibration pattern period is 357mm, approximately four times larger than that of the pattern we use for the visible range cameras calibration. This size is selected to work with the low resolution prototype system (160×120 pix in each of the four channels) and still is suitable for the higher resolution LWIR imagers of up to 1280×960. Figure 5 shows the view of the pattern by the visible range (left) and LWIR (right) cameras from the distance of 3 meters (top) and 7 meters (bottom) from the target. The field of view of the already calibrated visible range cameras is slightly larger that that of the LWIR, so ground truth data is available over all LWIR sensors area.

Figure 5. LWIR calibration pattern captured by visible range (2592×1936) and LWIR (160×120) cameras from the distance of 3 meters (above) and 7 meters (below).

Visible + LWIR combo camera

The main goal of our current project is to determine to what extent methods that we currently use for visible range camera photogrammetric calibration, distortion and aberration correction, deep subpixel disparity calculation and neural network training are applicable to the LWIR systems. Generally LWIR sensors have lower resolution than the regular visible and near infrared (VNIR) ones, are expensive and subject to export control, so with the limited time and resources available we selected small (160×120) FLIR Lepton-3.5 sensor modules available at Digi-Key. These modules have integrated germanium lenses and mechanical shutters that are essential for the photometric calibration. The sensors have limited frame rate (9 FPS) that puts them outside of the scope of export regulations. We plan use higher resolution sensors after evaluating the prototype system and achieving the goal of 0.1 pix (or better) disparity resolution.

The overall prototype camera design (Figure 6) is rather straightforward: we used already calibrated X-shaped quad camera and added another one with four LWIR sensors. FLIR Leton-3.5 modules do not support externally triggered operation, so I hoped that if all 4 sensors are fed with the same clock, and the reset signal is de-asserted simultaneously, then all channels will stay in sync and output frames simultaneously. Later this assumption turned out to be true, the difference between the channels after startup was in the range of ±10μs and it did not change over time. These small offset values are likely related to the analog PLL lock, they do not require any additional correction as they are more than a thousand times shorter than the effective pixel integration time. The visible range quad camera is triggered by the LWIR one. We measured internal LWIR module latency by capturing image of the rotating object and used this data to program the cameras triggering parameters to achieve simultaneous image capturing by all 8 of the visible range and thermal sensors.

Figure 6. Prototype combo camera: 4×VNIR+4×LWIR

Lepton-3.5 sensor module is small enough so we were able to fit it together with the power conditioning circuitry and the configuration EEPROM on the same dimension PCB as our regular sensor front ends (SFE) of 28mm×15mm and then mount in the standard SFE body. The board electrical and mechanical design is documented in Elphel Wiki.

Lepton module is controlled over the I2C bus that was already supported in Elphel NC393 series cameras, the video output is provided over SPI-based VoSPI interface, so some FPGA development along with the related drivers and applications code was required. Due to the digitized microbolometers output instead of that of a photodiodes in regular CMOS sensors, the rationale behind the use of the gamma compression that allows virtually lossless 8-bit representation of 12-bit pixel ADC output does not hold. That in turn required full 16-bit output rather than the JP4 image format that we were using for over a decade. We implemented Tiff format (both 8 and 16 bits per pixel) that is now available for all other supported image sensors, not only for the LWIR ones.

Photogrammetric calibration software

We are now working on adapting our existing code base to work with the LWIR low resolution sensors, and the photogrammetric calibration is the first step before we can perform 3D scene reconstruction and train a DNN to get higher disparity measurement accuracy. Our existing calibration workflow consists of the following steps:

Capturing of a few hundred image sets of the calibration pattern from a 3 different camera locations, using switchable laser pointers for absolute matching of the periodic pattern cells in the camera images.
Extracting pattern grid key points positions in the individual images with deep subpixel accuracy (0.03-0.05 pix) combining both pixel domain and frequency domain image processing.
Bundle adjustment of the camera intrinsic and extrinsic parameters assuming lenses radial distortion models, physical pattern grid points correction with Levenberg-Marquardt algorithm (LMA).
Continuing bundle adjusting allowing non-radial lens distortions and perform photometric calibration of the sensors (flat field, lens vignetting correction).
Applying calculated distortions to the synthesized pattern grid and calculating Point Spread Function (PSF) to be used for space-variant deconvolution for aberration correction.

Figure 7. Image set of 4 color and 4 LWIR images from the frame sequence captured by Talon camera (raw images – not rectified/matched).

As of now we just finished adaptation/re-implementation of the first two steps to work with LWIR sensors. Compared to the above workflow we do not use laser pointers for LWIR, but the visible range quad camera is already calibrated and can provide absolute matching of the pattern images. The major challenge in step 2 was to make the software work when there are very few (or only one) pattern cell available in each processing window – our previously developed software could comfortably use many pattern cells.

3D perception with LWIR

When the remaining steps of the photogrammetric calibration will be finished, we will use the quad camera 3D scene reconstruction software to build the 3D scenes from the captured LWIR images. The paired visible range quad camera has significantly higher resolution and its disparity data will be used as ground truth for the network training. Here we will benefit from the thermal imaging difference from the traditional image intensifier-based night vision cameras that can only operate in very low light conditions. LWIR sensors can operate simultaneously with the visible range ones (and many systems fuse their images to effectively enhance LWIR resolution).

Impatient to get real life scene imagery even before the software is ready, we took the first prototype camera to Southern Utah. We call this prototype “Talon” for the T-38 training jet where the visible range quad camera plays the role of the instructor that sits slightly behind the LWIR “the student”. We captured synchronized visible/LWIR images in a slot canyon as well as the LWIR-only sets in the pitch black darkness — it was a New Moon that night.

The raw LWIR images themselves do not look that exciting – there are many much higher resolution ones used by the military, law enforcement and even hunters. The things will change when we’ll be able to generate the LWIR 3D scenes.

Timelapse video: Building the LWIR Calibration Pattern

GPU Implementation of the Tile Processor

Andrey Filippov — Thu, 11 Oct 2018 21:10:09 +0000

After we coupled the Tile Processor (TP) that performs quad camera image conditioning and produces 2D phase correlation in space-invariant form with the neural network^[1], the TP remained the bottleneck of the tandem. While the inferred network uses GPU and produces disparity output in 0.5 sec (more than 80% of this time is used for the data transfer), the TP required tens of seconds to run on CPU as a multithreaded Java application. When converted to run on the GPU, similar operation takes just 0.087 seconds for four 5 MPix images, and it is likely possible to optimize the code farther — this is our first experience with Nvidia® CUDA™.

Implementation

Starting with CUDA™ and JCUDA

Before starting development of the GPU code we verified that the GPU acceleration is possible for the main program written in Java by evaluating demo ImageJ plugin^[2]. Next was to get into development with Nvidia® CUDA™ using Nsight^[3] plugin for Eclipse IDE. Nsight offered an option to import sample projects and I started to learn CUDA™ with one of them – dct8x8. That project uses “Runtime API” and consists of a mixture of C/C++ code for the CPU and for the GPU. This mixed mode is not directly compatible with JCUDA^[4], but if the “kernel” (code executed by the GPU) is kept separate from the CPU one, it is possible to develop/debug/test the program in Nsight IDE and then use the same file containing just the GPU kernel(s) with JCUDA. What needs to be changed is the portion of the non-GPU C/C++ code that transfers data between the computer system (CPU) memory and the GPU dedicated memory physically located on the graphic card.

Guided by the Nvidia® dct8x8 sample project I first implemented similar DCT-IV^[5] and DST-IV needed for the Complex Lapped Transform (CLT) used by the TP for conversion to the frequncy domain and back (sample project contained code only for the DCT-II (direct) and DCT-II (inverse) used in JPEG and related applications. After several iterations I made those programs to run almost as fast as the highly optimized code in the Nvidia® sample, and the next step was to implement complete CLT and aberration correction for the Bayer mosaic images, following the approach we used for the RTL^[6]. Debugging was simplified by the fact that the same algorithm was already tested in Java, and the intermediate data arrays could be compared between the Java and CUDA™ outputs.

Kernel for the Complex Lapped Transform of the Bayer Mosaic Images

The first implemented kernel is taking the following inputs:

Four of the 5 MPix Bayer mosaic images.
Four of the per-camera arrays of the space variant (stride 16) CLT deconvolution kernels already converted to the frequency domain (this data is reused for multiple image sets).
Additional kernel fractional pixel x,y offsets and their derivatives to interpolate required image shifts between the grid nodes of the space-variant kernels
List of the tiles to process that contains tiles location on the tile grid and the fractional pixel X,Y shift calculated externally from the required disparity for the known radial lens distortions. This list is used because not all the tiles need to be processed on each pass, when building the depth map only some tiles need to be processed, and some tiles require multiple iterations with different disparity values.

Output from the first kernel is a set of per-camera (4), per tile (324×242 for the 2592×1936 images), per color component (3) of 4×8×8 arrays representing CLT frequency domain transformation of each of the 16×16 (stride 8) pixels tiles of the source images. The transformation sequence is described in the earlier post^[6], it assumes the following stages:

Finding the closest de-convolution kernel for the specified tile index and offsets.
Calculation of the full horizontal and vertical offset that combines requested image tile center and the interpolated kernel data.
Splitting X,Y offsets into rounded integer and fractional pixel shifts. Integer part is used to select position of the 16×16 window in the source image, fractional one is applied later in the form of the phase shift in the frequency domain. It is also applied to the window function.
Folding 16×16 image tiles into the 8×8 pixel ones for the 2D DCT-IV/DST-IV conversions, using the shifted 2D half-sine window function.
Calculating the CLT layers: DCT-IV/DCT-IV, DST-IV/DCT-IV, DCT-IV/DST-IV and DST-IV/DST-IV (horizontal pass/vertical pass). For the Bayer mosaic input, of the 12 (3 colors by 4 CLT layers) transforms only 4 are actually needed, other 8 are restored using the symmetry of transforms of sparse (1 in 4 non-zero for red and blue, 1 in 2 for green color components) inputs.
Element-wise multiplication of the converted image tile by the kernel tile, equivalent to the pixel-domain convolution.
Applying the residual fractional pixel shift (in the range of +/-0.5 pix in each direction) implemented as a phase rotation. Such shift does not involve re-sampling as the space domain shift would, and so it does not introduce any related quantization noise.
Optional 2D low pass filter that can be combined with the convolution kernels.

Kernels Performance Evaluation

When I went through all the processing pipeline, and made sure the results match the Java code output. I measured the execution time and was disappointed – ~5 seconds per set of 4 images wasn’t what I expected. Using profiler I soon realized that my understanding of CUDA™ was wrong – all the participating threads have to execute exactly the same code, it is not like in multi-threaded CPU code or multiple simultaneously operating modules in RTL. Re-writing the code to eliminate divergence reduced execution time to 0.9 s. Not too exciting, but still significant gain over the CPU alone. Another kernel for inverse transform to convert from the frequency domain back to the images added 0.6 seconds. The inverse MCLT produces overlapping 16×16 (stride 8) tiles that need to be added together to result in a full picture, implementation uses 4 (stride 2 in each direction) passes over the image, with each pass free of any overlaps between tiles allowing asynchronous parallel execution.

After spending about two weeks troubleshooting the code kernel code as a part of the C++ application I was expecting to encounter more difficult to pinpoint problems when adding these GPU kernels to the Java application. There are multiple data arrays to be fed to the GPU with no convenience of having the debugger at hand.

In reality it was much easier – JCUDA^[4] does a wonderful job and it took me just a few hours to convert the code to run as a GPU accelerator for the Java program. I already had the Java code to convert images and convolution kernels to the data arrays that were passed to the C++ test application via binary files, only what remained to be done was to flatten multi-dimensional arrays and to replace an array of struct – it had a mixture of integer (tile indices) and float (offsets) members. With that done, there were a couple bugs left, but they were clearly reported in the Java stack trace output.

I do not know if it is possible to use #include directives in the code compiled from JCUDA, but as the source code is anyway first read from the file to a Java String before it is sent to the compiler, I just concatenated all the needed (currently just 2) files as strings. I also added “#define JCUDA” to the concatenated string before the files content, as well as some other numeric defines (like image dimensions) shared between the Java and the GPU code. I enclosed all the includes and the duplicate parameter defines within #ifndef JCUDA in the GPU kernel source files, that made the same source code files to work in both Nsight IDE and in Java application with nvrtcCreateProgram(). Soon I got the complete output images and verified they are the same as generated by the Java code.

And in the end I’ve got an unexpected and wonderful “reward”. When stepping over the critical GPU-related Java code lines I was expecting 1.5 second delay for the execution of the two kernels, but could not notice any. I enclosed each GPU kernel call with extra cuCtxSynchronize() thinking it did not wait for completion – still no visible delay. Then I put a loop to run kernels 100 times and got ~6 seconds for the first kernel. Something was obviously wrong – but the output images were correct. I went back to the IDE and made sure I have “Release” (not “Debug”) in seemingly every relevant place, but still it was running slow. Then I launched built binaries from the command line and discovered, that while “Debug” version is as slow as when run from the IDE, the “Release” version is 20.1 times faster, same performance as with JCUDA.

Kernel description	Debug mode execution time	Release mode execution time
Convert 4 images to FD and deconvolve with kernels	1073.4 ms	57.0 ms
Convert 4 images from FD to RGB slices	665.3 ms	29.5 ms
Total	1738.7 ms	86.5 ms

These run times were measured for the “GeForce GTX 1050 Ti” with compute capability 6.1, 4GB memory.

Results

The GPU code implemented and tested so far does not include the 2D phase correlation kernel need for the depth map generation, I started with just the rectified images as the pictures usually show if something is wrong in the processing. Phase correlation kernel will be easy to implement (the first kernel will stay the same) and the execution time will be approximately the same. Then we will work on coupling this code with the Tensorflow inferred network and feed the 2D correlation data directly from the GPU device memory. That will result in the near real-time performance of the whole system.

Links to the source code are provided below^[7],[8],[9] (the code is also mirrored at Github). It needs to be cleaned up and may eventually be released as a library after more functionality will be added. We believe this code will be useful for variety of imaging applications both coupled with the ML systems or traditional. The phase correlation can be used for multiple view camera systems (as in our case) or for the optical flow processing when matching images acquired by the same camera in subsequent frames.

Links

[1] “Neural network doubled effective baseline of the stereo camera”↗

[2] “How to create an ImageJ Plugin using JCuda”↗

[3] Nvidia®Nsight™ Eclipse Edition↗

[4] Java bindings for Nvidia®CUDA™↗

[5] Wikipedia article: “Discrete Cosine Transform”↗

[6] “Efficient Complex Lapped Transform Implementation for the Space-Variant Frequency Domain Calculations of the Bayer Mosaic Color Images”↗

[7] Source code of the 8×8 DCT-II,DST-II,DCT-IV and DST-IV for GPU↗

[8] Source code of the Tile Processor GPU implementation↗

[9] Source code of the Java class to integrate GPU acceleration with JCUDA ↗

Neural network doubled effective baseline of the stereo camera

Andrey Filippov — Wed, 05 Sep 2018 20:47:33 +0000

Figure 1. Network diagram. One of the tested configurations is shown.

Neural network connected to the output of the Tile Processor (TP) reduced the disparity error twice from the previously used heuristic algorithms. The TP corrects optical aberrations of the high resolution stereo images, rectifies images, and provides 2D correlation outputs that are space-invariant and so can be efficiently processed with the neural network.

What is unique in this project compared to other ML applications for image-based 3D reconstruction is that we deal with extremely long ranges (and still wide field of view), the disparity error reduction means 0.075 pix standard deviation down from 0.15 pix for 5 MPix images.
See also: arXiv:1811.08032

Tile Processor and the 2D correlation output

TP receives raw Bayer mosaic data from four (or more) camera sensors and combines several chained operations in the frequency domain, eliminating re-sampling errors. TP operates with the fixed windows of 16×16 Bayer mosaic pixels (stride 8 in each direction). While the camera does not have any moving parts, TP operates similarly to the human vision – it calculates correlations for the requested “target disparity” that is analogous to the eye convergence. The difference is that each image tile may have independently set target disparity.

At this stage of the project we evaluated generation of the depth map using 2D correlation outputs for the tiles and the tile target disparity. Of the six possible correlation pairs of the four camera sensors (top, bottom, left, right and two diagonals) we used four, combining (by averaging) horizontal and vertical pairs together. Of the full 2D correlation results we preserved the center 9×9 pixels, so each tile provides 4×9×9 tensor as shown in Figure 1. Five megapixel sensors output 2592(H)×1936(V) pixels making it 324(H)×242(V) tiles. Target disparity for each tile is calculated using existing heuristic-based program, and the network is trained to improve that value using the 2D correlation data.

Neural network and the data sets

Suggested network architecture consists of the two stages – first stage processes each tile output without any interaction with the neighbors, the second will be convolutional enhancing disparity prediction for each tile by using information from the neighbors. We now use a Siamese type network with samples combining correlation outputs (and corresponding target disparities) from 5×5 tile regions, with network trained to predict disparity of the center tile only. While it is less efficient when processing continuous depth maps, it gives us higher training flexibility by allowing to adjust frequency of the “smooth” samples and the samples with discontinuities of different types.

We had 266 processed scenes to use for training and testing. Each scene provides up 324×242=78408 tiles, of them about half are usable (not the featureless sky and not the near objects that we did not used in this project). When capturing images camera was running at 5 frames per second, so we put aside first image from each second for testing (20%) and used the rest for training. Trying to equalize batches we calculated 2D histogram (disparity-confidence) over all images, divided disparity-confidence area into equal percentile regions and then created batches from the data files selecting random tiles – one from each of those regions.

For the ground truth we used disparity/confidence pairs captured by a dual camera rig that has a baseline almost 5 times longer than that of a single camera. Initial cost function was just a weighted (by ground truth confidence) squared disparity difference, later we clipped it by a certain value to reduce influence of a rather common case when the dual camera rig measurement (ground truth) and that of a single camera (used as target disparity) matched different (foreground or background) objects for the same tile resulting in multi-pixel disparity differences.

The training results were evaluated in 2 ways. First we compared costs (total one minimized and partials) for the training data, test data independently generated but from the same source files already used for training, and from the fresh images that were laid aside for testing. Difference between the tests was used as an indication of overfitting. In addition we compared the heuristic output for the whole test images (same data was also used as a target disparity input for the network) and the network output, calculating accuracy gain for different criteria: tiles with certain disparity range and correlation strength (confidence). And it is this gain for far tiles (less than 5 pix disparity or farther than 100 meters) that let us claim that the network doubled disparity resolution that is equivalent of using twice longer baseline of the stereo camera.

Noticing that overfitting starts to develop (by comparing costs corresponding to the test data) we performed the following actions:

Applied right/left mirroring of the source data. There may be insufficient up/down and transposition symmetry so we did not rely on replicating source data that way
First layer of the stage 1 subnets receives image-like 2D correlation data that has certain properties, so for regularization we added costs for the Laplacians of the first layer relative weights with the zero boundary conditions
Processing in the stage 2 should provide reasonable results even when only the data from the same (center) tile is available. Adding 8 (for 3×3 kernel) neighbors and then 16 (to the total of 5×5) should improve result, so we added 2 more instances (sharing all weights) of the stage 2 – one with all zeros but the center (of 25 stage 1 subnets) input, and the other with the center 3×3 non-zero inputs. And then calculated cost for the error of each of these outputs and mixed them in the final cost used for optimization.

These methods combined allowed to postpone the beginning of overfitting and improved convergence – the results did not go worse (at least significantly) when the training ran for too long.

Figure 2. Results for far flat objects (> 1000 m). X3D ↗, images↗

Results

Figures 2 and 3 illustrate results of the depth map processing with the neural network. Each has 6 sub plots following the common pattern:

Top right “Ground truth confidence” contains the full scene correlation strength and the red frame that identifies what part of the scene is analyzed in the other subplots
Top left “Ground truth disparity map” shows tile disparity values as measured by a wide baseline dual camera rig. The disparity units as shown on the vertical bar to the right are pixels of the main camera (not the long baseline rig), same units are used on all the remaining subplots
Middle left “Heuristic disparity map” shows result of the existing processing of the 2D tile correlation output using heuristic algorithms. These values are used as “target disparity” to calculate 2D correlation for each tile and the correlation results (as 4×9×9 tensors) are fed to the network.
Bottom left “Network disparity output” shows the network output for the tiles that do have ground truth data, tiles output that can not be verified is blanked (shown in white)
Middle right shows mismatch between the heuristic disparity map (middle left) and the ground truth disparity (top left)
Bottom right shows similar errors of the network disparity prediction

Image captions provide the links to the x3d models of the same scene (models are built with the existing software that uses what is considered here as “ground truth” and do not rely on the network processing). Another link (Images↗) in the captions leads to the camera image viewer for the scene. That viewer shows all 8 camera images, the first 4 correspond to the processed data, the last 4 are captured by a second camera used for the ground truth measurements. These images are not raw (raw Bayer mosaic images are also provided as explained in the wiki page), they are calculated for the target disparity set to 0 for all tiles. Images are the result of the space-variant deconvolution for aberration correction, they are being subject to the linear operations only, so at certain zoom levels they have visible modulation caused by the Bayer mosaic, and color de-mosaic artifacts in the periodic grid areas as there are no (nonlinear) demosaic operations in the processing pipeline.

Figure 3. Buildings in the range of 680-2200m. X3D ↗, images↗

Evaluating disparity maps of the flat horizontal surface

The scene fragment in Figure 2 shows rather flat surface from 1000 meters (~0.5 pix disparity) to infinity (mountain ridge is over 20,000 meters – beyond the resolution of the camera). The disparity noise of the middle row (heuristic output) is greatly reduced by the network (bottom row) and it is not just the low-pass filtering – the transition to the near objects (yellow) is not blurred. The network output reveals some low frequency disparity fluctuations that are not present in the ground truth data (top subplot). These fluctuations are caused by the correlations between the Bayer patterns of the individual sensors, each pattern being distorted by the optical aberration correction. These errors may be reduced by feeding all 6 individual correlation pairs to the network – currently two horizontal (top and bottom) and two vertical (left and right) pairs are reduced by averaging. Combining them non-linearly in the network may increase S/N ratio. Additionally this noise may be reduced by adding more sensors without increasing the overall camera dimensions.

Disparity maps in the urban environment

Figure 3 contains the scene fragment captured while driving northbound along the State Street in Salt Lake City. The buildings visible there (pseudo-colors other than white and yellow) are from 680 meters to 2200 meters ahead (2200 m is the Utah State Capitol around tile at [170,113] – it is visible on the ground truth (top left) and network output (bottom left) subplots.

The challenge here was to prevent the network from blurring the depth map between objects at different distances. While smooth disparity gradients are possible (like the street pavement) most detectable disparity variations at long distances are due to the discontinuities caused by the overlapping objects. When disparity difference is small (1 pixel or less) the correlation argmax for the tile containing both foreground and background pixels would be (incorrectly) somewhere between the foreground and the background values. For the purpose of the next stage of the scene 3d reconstruction – fitting planes – it would be better if such edge tile would be assigned to either foreground or background. And this “cutting corners” on the depth map did happen with the initial cost function based on L2 (cost proportional to squared difference to ground truth disparity weighted with the ground truth confidence) alone.

Cost function for preserving edges in the depth map

Tweaking the cost function significantly improved performance – the result is visible by comparing middle right and bottom right subplots along the vertical building edge at horizontal tile 163 and vertical between 100 and 111 of the Figure 3. The difference between the heuristic disparity and the ground truth shows a visible tile column that is more negative than the tiles around it (the disparity of the foreground object was reduced by correlation). That column completely disappears in the bottom right subplot as the edge sharpness is restored by the trained network.

The cost function was modified in the following way:

First the disparity difference was leaky-clipped to 0.3 pixels. Larger errors are usually caused by matching different objects – ground truth may be measured for the distant background while the target disparity matched closer foreground (or vice versa).
The second modification specifically added extra cost for “cutting corners” (blurring edges). The average value of the 8 neighbors’ ground truth disparity is calculated for each tile, and outputs falling between the ground truth disparity and the average disparity generate additional cost.

Next steps

Increase of the depth resolution twice was just a low hanging fruit – this project is our first hands-on experience with the neural networks. When visitors of our booth at CVPR-2018 were amazed by just a few percents range error at 2000 meters in the interactive X3D model, we had to explain that the model uses data captured by a pair of such cameras and we plan to use it as a ground truth for the neural network training. And that we hope that eventually a single 258 mm quad camera using the neural network will provide the data as accurate as the existing dual camera rig with 1256 mm baseline. We are not there yet, just half-way, but believe that our original estimate was correct and that goal is reachable.

Add more data

The next immediate step will be adding more training data and possibly reimplementing the TP code to use GPU (this code exist now in Java for CPU and in Verilog for FPGA/VLSI). CPU TP implementation is a bottleneck in the current pipeline, so far we processed only few percents of the captured imagery. Larger dataset will allow to try deeper/wider networks

We will evaluate results of feeding all 6 pairs to the network and experiment with cost functions to reduce noise caused by the correlation of the Bayer patterns of the individual sensors and maintain space-invariance of the TP output.

Generate output for field calibration

Currently the network outputs only the single scalar per each tile – disparity value. It is important to train the network to generate the confidence value of the disparity it outputs. Additionally the heuristic program we use now can provide misalignment data for each tile (“lazy eye”) – such data is used for field calibration of the camera system by bundle adjustment of the individual sensors attitude. Such output can also be received from the network.

Then we will work on switching other parts of the 3d scene reconstruction to use the neural networks and there are at least two areas that rely not just on straightforward math but use a lot of heuristics.

Target disparity calculation with NN

One such area is selection of the target disparity for each tile. With the small TP window the correlation output provides valid result only within just a few disparity pixels of the center – preprogrammed shift. Full disparity sweep would be expensive so current software uses a combination of methods – scans all tiles at infinity, then uses that data to create low resolution images and correlates them to identify potentially occupied 3D volumes, then grows measured tiles by predicting disparity for the new tiles and measuring correlation. This prediction depends heavily on heuristics and seems to be a good application area to use the network.

Building 3D surfaces

Another area is building 3D surfaces from tile disparities, it is highly heuristic-infested too. Current software uses “supertiles” (overlapping areas of 16×16 with stride 8 tiles) to build multiple “plates” using eigenvalues/eigenvectors of the covariance matrices build of the tile data. Then such plates in the neighboring supertiles are matched to each other and merged if they are likely to belong to the same 3D surface. These plates simultaneously use both the Disparity Space Image (DSI) and the world 3D coordinates. The DSI coordinates are native to the camera and the measurement accuracy can be expressed in the pixels of DSI. The real world coordinates, on the other hand characterize likelihood of such 3d object to exist. The plate merging considers both DSI distance (to be withing measurement accuracy) and world 3D (comparing angles and linear distances between planes to merge). After merging the supertile plates into 3D surfaces, each tile is re-evaluated and assigned to one of them.

Pixel-accurate texture edges

The last step of building realistic 3D scene models was not implemented in the current software – just the provisions for it in the tile processor. It is the restoration of the pixel-accurate edges in the output textures from the tile-accurate depth map and the texture tiles output from the TP.

Hardware improvements

In parallel we plan to improve the hardware and the image capturing process. We will make a light enclosure and a more rigid frame to eliminate the need for additional field calibration, caused by minor variations (mostly thermal) of the camera sensors attitudes. We will try to use fusion of multiple scene models to calculate 3D ground truth data instead of the dual camera rig that we use now. This rig relies on the really far objects (tens of thousands meters) to be able to calibrate itself. We are lucky to have mountain ridges visible around in Salt Lake City, but even they are often too close to be considered infinity.

A fun project

And a fun project – we will try to capture a flock of birds in the air and see how well we can measure and track the 3D coordinates of the multitude of the small flying objects – both without and with the neural network.

CVPR 2018 – from Elphel’s perspective

Olga Filippova — Sat, 21 Jul 2018 19:32:20 +0000

In this blog article we will recall the most interesting results of Elphel participation at CVPR 2018 Expo, the conversations we had with visitor’s at the booth, FAQs as well as unusual questions, and what we learned from it. In addition we will explain our current state of development as well as our near and far goals, and how the exhibition helps to achieve them. The Expo lasted from June 19-21, and each day had it’s own focus and results, so this article is organized chronologically.

Day One: The best show ever!

June 19, CVPR 2018, booth 132

While we are standing nervously at our booth, thinking: “Is there going to be any interest? Will people come, will they ask questions?”, the first poster session starts and a wave of visitors floods the exhibition floor. Our first guest at the booth spends 30 minutes, knowledgeably inquiring about Elphel’s long-range 3D technology and leaves his business card, saying that he is very impressed. This was a good start of a very busy day full of technical discussions. CVPR is the first exhibition we have participated in where we did not have any problems explaining our projects.

The most common questions that were asked:

Q: Why does your camera have 4 lenses? A: With 4 lenses we have 4 stereo-pairs: horizontal stereo-pair is responsible for vertical features (it is the same as in regular stereo-camera); the vertical pair detects horizontal features, while the 2 diagonal pairs ensure that almost any edge is detected. (Elphel Presentation, p.8). Q: Why don’t you use lidars? A: Lidars are not capable of long range distance measurements. The practical maximum range of a lidar is 150-200 meters, while we target 500-2000 meters with a 150 mm stereo-base.(Elphel Presentation, p.7) Q: How do you deal with thermal expansion? A: there are 2 parts to this question: 1) thermal expansion of the calibrated lens is taken care of by the design of the sensor-lens module; 2) Thermal expansion of the whole camera is also an issue, especially with the larger stereo-base. The titanium tubes that form the cross of the X-Camera expand more on the “sunny side”, and the whole camera bends inward, like a flower. When taking images under the hot Utah sun, we had to wrap the tubes with insulation, covered in aluminum foil. After some consideration we decided to remove the insulation wrapping for the CVPR show, because it makes the camera more presentable, and were surprised that this question came up so many times – we almost put the insulation cover back on. Q: Do you build custom lenses? A: We pre-select standard M12 lenses, because the quality of the lens can vary even within the same model. We have developed a full calibration process and post-processing software to compensate for optical aberrations, allowing to preserve the full sensor resolution over the camera FOV. Read more about aberration correction on Elphel blog. Q: What is your part of the technology? A: Everything: we have all parts: Hardware – cameras with aberration corrected lenses and thermally compensated sensor-lens module; a state of the art calibration facility and process; 3D-reconstruction software (Free Software, GNU GPL). Q: What is the accuracy of your photogrammetry: A: We comfortably work with 1/10 of a pixel. Q: How dense is your 3D scene? A: Currently: 324 x 242 overlapping tiles, with no dead zones. Q: Does it build a 3D model in real-time? A: Currently the 3D model is reconstructed in software in the post-processing of the acquired images. This software is tested for FPGA, with which we can achieve 12fps 3D-reconstruction on the camera. Our final goal is to design and manufacture custom ASIC with 3D processing on the chip. (Elphel Presentation, p.17) The last visitor of the day from a large European company stopped at our booth around 6pm, 30 minutes before the end for the day, and we spend another hour talking and showing Elphel’s camera and high-resolution accurate images it produces. The lights on the floor went down, but we stayed and explain the technology behind our results and future plans. It was a nice day, well spent.

Day 2: Neural network focus.

End-to-End vs. Augmented

Our current results:

500 meters, high resolution, 3D reconstruction with 5% accuracy with 150mm stereo-base (quad-stereo camera), passive (image-based)
2000 meters with 5% accuracy with 2 quad-stereo cameras based at 1256 mm from each other
3D-Reconstruction is done in post-processing with software simulated for FPGA porting. With FPGA – 12fps in 3D

Our near future goals:

Train the neural network with the current 3D-image sets to achieve same results as in (2) – 2000 meter 3D reconstruction, with just one quad-sensor camera
Port software to FPGA to achieve real-time 3D reconstruction (12 fps)

Our long-term goals:

Manufacture custom ASIC for 3D-reconstruction to reconstruct 3D scenes on the fly with video frame rate. The small size of the custom chip will allow us to mass produce a long-range 3D reconstruction camera for industrial and commercial purposes. Smaller form-factor, inexpensive, and fast.
Scalable design: a smaller stereo-base with 4 sensors – phone application (200 meter accurate 3D reconstruction for consumer applications); larger stereo-base: extremely long range passive 3D reconstruction – military applications, drone applications, etc.

CNNs are not efficient for the real time high-resolution 3-D image processing.

Elphel’s approach to 3D reconstruction with the help of neural network is different from the mainstream one. This is partially due to our expertise is building calibrated hardware, which already produces robust 3D measurements. Therefore we see the value in pre-processing the images before feeding them to the NN similar to the eye pre-procesing visual information before sending it to the brain. The human eye receives 150 MPix of data, processes it on the retina, and sends only 1Mpix to the brain – 150 times reduction. One question came up twice, both times from the conference attendees, about the difficulty the NN has to process high-resolution images. The current approach to NN training recommends to use low resolution raw images (640×480), because the NN has to be trained on a large amount of images (thousands). The last 6 slides of Elphel’s presentation talk about augmenting the NN with Elphel high-performance Tile Processor, that significantly reduces the amount of data, while leaving the “decision making” to the network. We argue that the End-to-End approach is less efficient for high-resolution image processing then NN augmented with training-tree, linear image processing.

Elphel training sets for Neural Network:

Elphel offers unique image sets for neural network training. These are high-resolution, space-invariant data sets. Elphel is seeking collaboration with machine learning research teams working in the area of ML applications for 3D object classification, localization and tracking.

Day 3: Autonomous Vehicle Focus:

The exhibition is slowing down. Some booths start packing at midday. Elphel, however is busier than ever. We have talked to about every self-driving car company that had it’s presence at CVPR. We also talked to other large corporations doing research in perception and 3D-reconstruction. We start by visiting our neighbors – other exhibitors at the Expo, we are very curious what they develop, for which applications, can it be combined with our technology, can we learn something from them, and can they learn from us? It’s a collaborative spirit, with which we start conversation. One of our closest neighbors is Mapbox – an open source mapping platform for custom designed maps, so we start with them. We use Open street Map as well as other maps with our Scene Viewer for 3D-reconstructed scenes for locating scenes on the map as well as for the ground truth data. During lunch break at we add Mapbox tiles to our scene viewer. Mapbox exhibitors are pleased that it worked so well for us. Now they are also very interested in our technology! Then we walk to other booths, learn about them and present Elphel. Some of the responses we’ve got when we introduced our technology and our current results:

Researcher from Apple: “You have the most unique technology on this Expo!”
Autonomous Vehicle company A: “I can not believe you can accurately measure distances at 500 meters with just a small base of 150mm? With just 5-10% error? It is simply impossible! I have to come see it!” After visiting our booth seeing the demo he asks: “Can you demonstrate your technology on the car? It has to be in Singapore!”
Autonomous Vehicle company B: ” We are happy with the use of the Lidars on the city streets, with close range perception and 30 mph driving speed. But on the highway, where the speed if much higher and the distance much farther the Lidar’s output is too sparse. Passive, long-range 3D reconstruction would be a perfect solution.”
Autonomous Vehicle company C: “How dense is your 3D scene? Oh, it is dense, then we can use it.”
Autonomous Vehicle company D: “We are building the perception solution for self-driving trucks. We need long-range distance measurement.”
Autonomous Vehicle company E: “We have 16 different types of sensors on the car right now. I don’t think we have room to put another camera. Besides the area behind the wind shield is very valuable, we already have 4 cameras there – there is no room. By the way, we re-evaluate our approach every 6 months, and we just did that, so let’s, maybe, talk in 6 months again?” Interesting thought – why would anyone need unobstructed windshield in a self-driving car? They did not explain that.

View of the CVPR 2018 Expo “From Elphel Perspective”:

creating 3-D model of the CVPR 2018 Expo

3D model view of CVPR 2018, made by Elphel 3D X-Camera

One last thing to do before the show is over is to get the 3D-X-Cam up high and build the 3D model of the CVPR 2018 EXPO in “real-time”!

Two Dimensional Phase Correlation as Neural Network Input for 3D Imaging

Andrey Filippov — Sat, 21 Jul 2018 01:10:18 +0000

We uploaded an image set with 2D correlation data together with the import Python code for experiments with the neural networks and are now looking for collaboration with those who would love to apply their DL experience to the new kind of input data. More data will follow and we welcome feedback to make this data set more useful.

The application area we are interested in is an extremely long distance 3D scene reconstruction and ranging with the distance to baseline ratio of 1000:1 to 10,000:1 while preserving wide field of view. Earlier post describes aircraft distance and velocity measurements with up to 3,000:1 distance-to-baseline ratio and 60°(H)×45°(V) field of view.

Figure 1. Space-invariant 2D phase correlation data

Data set: source images, 2D correlation tiles, and X3D scene models

The data set contains 2D phase correlation output calculated from the 2592×1936 Bayer mosaic source images captured by the quad stereo camera, and Disparity Space Image (DSI) calculated from a pair of such cameras. Longer baseline provides higher range resolution and this DSI is serving as the ground truth for a single quad camera. The source images as well as all the used software is also provided under the GNU/GPLv3 license. DSI is organized as a 324×242 array – each sample is calculated from the corresponding 16×16 tile. Tiles are overlapping (as shown in Figure 3) with stride 8.

These data sets are provided together with the fully reconstructed X3D scene models, viewable in the browser (Wavefront OBJ files are also generated). The scene models are different from the raw DSI as the next software stages generate meshes, and that frequently leads to over-simplification of the original DSI (so most fronto parallel objects in the scene provide better range accuracy when probed near the bottom). Each scene is accompanied with the multi-layer TIFF file (*-DSI_COMBO.tiff) that allows to see the difference between the measured DSI and the one used in the rendered X3D model. File format and Python import software is documented in Oleg’s post “Reading quad stereo TIFF image stacks in Python and formatting data for TensorFlow”, the data files are directly viewable with ImageJ.

The first goal of applying neural network to this data is to improve DSI generation from the raw 2D correlation compared to what we can do now with the traditional code by using data from all correlation pairs and the neighbor tiles to propagate information along the objects edges and amplify correlation in low textured areas by pooling the correlation outputs from multiple tiles.

While there are no moving parts in the system, the overall process of measuring the DSI resembles human/animal stereo vision, where the eyes are converged for the expected object distance, and the visual processing deals with only a small residual mismatch, constantly adjusting the convergence. The difference here is that each tile can be individually “converged” for the requested disparity – this is done by combining integer pixel shift of the tile input window with the accurate fractional pixel shift by a phase rotator in the frequency domain to avoid any re-sampling errors. Currently prediction of the “convergence” (expected disparity) for the next tiles to measure is performed by a mixture of several algorithms (to avoid an expensive full disparity scan over the whole image). This process of predicting disparity is likely to be improved by a properly designed and trained network too, the one that will use the tile context.

As the current processing is recurrent (repeated “eye convergence” / mismatch processing), there are several options how to organize the training data. We decided to focus on sub-pixel matching, so for each scene with calculated ground truth DSI (from the dual camera rig) the correlation output is measured for the target disparity (convergence) in the range of minus 1 pixel to plus 1 pixel with the step of 0.1 pixel in the currently provided data sets. It makes total of 21 files of the correlation outputs. This partial disparity sweep around the ground truth is an alternative to a full disparity sweep that would require too much storage.

In addition to the raw Bayer images scene models have links to the JPEG images, corresponding to each sub-camera view. These images have aberrations corrected and they are partially rectified for the disparity = 0 (infinity). They are not rectilinear, and still have small radial distortion common to all of them, it is compensated during generation of the x3d models. The images have visible modulation caused by the Bayer mosaic – they were not subject to any (destructive) demosaic processing – only to the linear one. Each color channel is separately de-convolved with a set of space-variant kernels.

Figure 2. Dual quad camera rig

Available data series

Initial data contains scenes captured with the car mounted dual quad camera rig (Figure 2) in two series:

Salt Lake City State Street ↗
Salt Lake City overlook (northern part of the city in the direction of the airport) ↗
Airliner approaching SLC International airport ↗

The first series (80 scenes) was captured while driving north at a rate of 5 frames per second. Only the first four seconds (20 scenes) of each 20 second interval are posted to reduce the amount of uploaded data. The second series (59 scenes) was captured while standing and then slowly turning and starting to move. Third series was made from the stationary camera pointed in the direction of the airport.

More data will be added, we are open to suggestions – what data should be included to make the whole data set more useful. All available series combined are available at this link: ↗. Number of the scenes is definitely too small to extract the semantic information, but it may be enough to train the network to build DSI using small samples of the correlation tiles. At least to start with.

The frequently asked questions about this dataset

Why is there a special format of the data set, why not just provide the raw images?

We do provide raw Bayer images captured by the camera in JP4 format, but due to significant for the high-resolution images optical aberrations, complete end-to-end processing would require space-variant coefficients that to our understanding would require enormous resources and a very large number of images for training. Instead we propose to separate the space-variant pre-processing and provide space-invariant data to the network. And the 2D correlation data is the earliest space-invariant data available with this approach.

Figure 3. Anisotropy of the 2D correlation. Red and blue squares show two overlapping tiles in the source images and in the calculated 2D correlation.

Why do you use four cameras for stereo when most systems have just two?

In the case when the source image contains nicely textured in all directions areas, the correlation output contains a sharp round spot and subpixel argmax calculation is not very difficult – polynomial interpolation provides good results. Unfortunately many real-life objects do not have convenient texture, and poor texture is Achilles’ heel of many vision-based 3D reconstruction systems. On the other hand discontinuity between foreground and background objects usually results in sharp edges that can be used for matching, and it is usually safe to assume that lack of visible features means that the surface is smooth and depth can be reconstructed by averaging over larger area. The edges may have any direction, but binocular systems with horizontal baselines are only sensitive to the vertical features. Figure 3. shows correlation output of a distant airplane where the correlation spots are stretched in the direction of the edges. The four-sensor camera in Figure 1 contains 6 pairs, and after combining horizontal and vertical there are 4 directions left, so for any edge direction there is a pair with a almost perpendicular to it.

Why don’t you use LIDARs for the ground truth data?

Most available LIDARs have a rather short range of 150-250 meters, and we are interested in measuring larger scenes. So instead we decided to use a pair of similar quad cameras positioned at ~5 times larger distance than the quad camera baseline, so the measurements it provides are also 5 times more accurate and can be used for training as ground truth. Binocular system does have limitations as described above, so it will not provide accurate data for the horizontal features (like electrical wires so common over Salt Lake City).

Figure 4. Field calibration results

What are the current limitations of the hardware?

Individual quad cameras use precisely thermally compensated sensor front ends with the image plane movement relative to the sensor of less than ±0.02μ/°C, but the overall mount made of titanium tubes does bend when heated by the sun radiation from one side. Just few tenths of a pixel, but it is too much for the application, so until we have a new design we have to use field calibration and bundle-adjust individual sub-camera attitudes using the image data itself. It allows to keep misalignment to ~0.05 pix.

Dual camera rig (Figure 2) as currently implemented has its limitations that we temporarily mitigate with the field calibration. We are working on the rig improvements too – one of the mechanical issues is that when driving at freeway speeds we noticed 55 Hz torsion oscillations (two cameras rotating in opposite directions at the ends of the 1.5 m aluminum tube) of 0.4 pix amplitude. So until we will strengthen the rig we had to limit data set to the images captured at lower speeds to keep vibration-caused mismatch under 0.1 pix.

Figure 4. shows results of the field calibration for the image series captured during driving along the State Street. Each quad camera (main with the 258 mm baseline and auxiliary with the 150 mm one) had adjusted its sensor front end attitude angles using the image data, same with the attitude of the auxiliary camera as a whole with respect to the main one (baseline of 1256 mm). The results of the distance measurements obtained from this long baseline pair were then used as ground truth for the individual quad cameras. Relatively strong correction required for the first ~20 samples was caused by the changing temperature in the camera connection tubes after slowing down from the freeway speed to the city one.

The residual convergence standard deviation is 4.2 arcseconds for the 258 mm baseline quad camera and 2.5 arcseconds for the 150 one – it is close to the astronomical seeing that can be 1-4 arcseconds vertically and 5-20 arcseconds with the view close to horizon (for the full thickness of the atmosphere). That means that getting higher depth resolution would require compensating atmospheric turbulence. Luckily we do not need long exposures so active compensation can be similar to the “lucky shot” method by comparing measurements of the same features in the consecutive shots.

Electronic rolling shutter (ERS) limitations. We really love small format high-resolution image sensors perfected by the cellphone industry for the image quality and sensitivity, but that performance currently requires Electronic Rolling Shutter (ERS) that leads to image distortions of the moving objects (or caused by the moving/rotating camera itself). In the presence of movement of the objects (or camera egomotion) the measured disparity depends on the time difference between the moments the same feature is captured by the sub-cameras. The sensors are mechanically aligned to be parallel within 5 pixels that corresponds to 0.2ms of the readout time. That means that horizontal pairs capture all objects almost simultaneously regardless of disparity, for the vertical and diagonal pairs that time difference increases proportionally to the disparity, but for the far objects (where disparity accuracy is most important) that difference is still small (0.03ms for each disparity pixel). Currently posted image sets do not have any ERS correction, and that somewhat deteriorates performance for the near objects, but when considering just the egomotion (both translation and rotation) this effect can be easily compensated, at least for the uniform rotation on movement.

In the posted image set ERS is not an issue for small (<5 pix) disparities, we will add software compensation over the full disparity range of the scenes for uniform movements, and later will use IMU to compensate higher frequency movements also.

Reading quad stereo TIFF image stacks in Python and formatting data for TensorFlow

Oleg Dzhimiev — Fri, 20 Jul 2018 20:10:47 +0000

Fig.1 TIFF image stack

The input is a .tiff – a TIFF image stack generated by ImageJ Java plugin (using bioformats) with Elphel-specific information in ImageJ written TIFF tags. Reading and formatting image data for the Tensorflow can be split into the following subtasks:

convert a TIFF image stack into a NumPy array
extract information from the TIFF header tags
reshape/perform a few array manipulations based on the information from the tags.

To do this we have created a few Python scripts (see python3-imagej-tiff: imagej_tiff.py) that use Pillow, Numpy, Matplotlib, etc.. Usage: ~$ python3 imagej_tiff.py .tiff It will print header info in the terminal and display the layers (and decoded values) using Matplotlib.

Details

Subtasks 1. and 2. are simply done by Pillow. As for 3., if it was a simple TIFF image stack there would be no need to have extra scripts. If the layer were just stacked along depth the shape would be (2178, 2916, 5). The script reshapes the data into tiles and puts the values into a separate array, so the output is 2 Numpy arrays:

image data shape: (242, 324, 9, 9, 4) with the layers along the last dimension
values data shape: (242, 324, 3)

And they are ready to be used in the Tensorflow. For more details and example output, see this wiki page or inspect the source. The most detailed general info is probably in the slides from our presentation for CVPR2018 – format description starts from p.19.

File samples

Actual files can be found here or go to 3d+biquad, open individual models and hit the light green button to ‘Download source files for ml’. Without using Python displaying layers and decoding tags is natively supported by ImageJ or Fiji. To read the tags exiftool or tiffinfo linux programs can be used.

Other libraries

Other Python libraries that could have worked (but examples for Pillow were easier to find):