*Featured on Image Sensors World*

This article describes the next steps in our year-long research on high-resolution multi-view stereo for long-distance ranging and 3-D reconstruction. We plan to fuse the methods of high-resolution image calibration and processing, the already emulated functionality of the Tile Processor (TP), the RTL code developed for its implementation, and a Convolutional Neural Network (CNN). Compared to a CNN alone, this approach promises over a hundredfold reduction in the number of input features without sacrificing the universality of end-to-end processing. The TP part of the system is responsible for the high-resolution aspects of image acquisition (such as optical aberration correction and image rectification) and preserves deep sub-pixel super-resolution using an efficient implementation of 2-D linear transforms. The Tile Processor requires no training: only a few hyperparameters define its operation, and all the application-specific processing and “decision making” is delegated to the CNN.

Machine learning is an active development area, and its application to 3-D scene reconstruction, stimulated by the development of autonomous vehicles including self-driving cars, is no exception. The use of CNNs to extract surfaces from random-dot stereograms was published as early as 1992^{[1]}. Most modern research uses standard image sets: the Middlebury stereo data set^{[2]} for high-resolution near objects and KITTI^{[3]} for longer-range applications. KITTI images are acquired from a moving car and come with ground truth data captured by a LIDAR. This image set uses binocular pairs and has relatively low resolution (1.4 MPix) compared to modern image sensors, yet most CNN architectures still require from seconds to thousands of seconds of processing even when implemented on GPU devices, and so are not yet suitable for real-time applications.

Most stereo image processing CNNs^{[4]} take raw pixel data as input, perform unary feature extraction in parallel subnets (one for each image in a stereo set), merge the features and perform additional processing of the resulting 3-D data. This is the so-called “siamese” network architecture, which benefits from sharing parameters between the identical subnetworks. It is common to devote most resources to the unary part of the processing, truncating the common stage, which may consist of just a single layer (Fast Architecture in ^{[4]}). The efficient implementation in ^{[5]} limits CNN processing to just the generation of the Disparity Space Image (DSI) and then uses traditional DSI enhancement methods such as semi-global matching^{[6]}; other architectures split the network after exchanging features and generate depth maps for each of the stereo images individually^{[7]}.

Early layers of various CNNs (and of the eye retina too) are very general and even resemble the basis functions (Figure 2) of the two-dimensional Fourier (DFT), cosine/sine (DCT and DST) and wavelet (DWT) transforms, so it is no surprise that some works explore combinations of such transforms with neural networks. Some of them^{[8, 9]} exploit the energy concentration property of these transforms, the property that makes popular image compression such as JPEG possible.
Others^{[10, 11]} evaluate the efficiency of available Fast Fourier Transform implementations for speeding up convolutions by converting image data to the frequency domain and then applying pointwise multiplication according to the convolution-multiplication property. The improvement is modest, as frequency domain calculations are most efficient for large windows, while most modern CNNs use small ones, such as 3×3, where the Winograd algorithm is more efficient.

Multi-view high-resolution cameras present a special case where frequency domain processing may reduce the number of CNN input features by two orders of magnitude compared to raw pixel input; the data flow diagram is presented in Figure 3. Four identical subnets process the individual channels, each providing 4912×3684 pixel Bayer mosaic color images. As described in the earlier post, the Tile Processor uses an efficient Modulated Complex Lapped Transform (MCLT) to convert the Bayer mosaic (color) high-resolution image data to the frequency domain. Of the 4×8×8 coefficients that represent a full 16×16 input tile for each of the 3 colors, red and blue yield 1×8×8 each and green yields 2×8×8, for a total of 256. Residual fractional pixel shifts needed for image rectification are implemented as cosine/sine phase rotators; they are performed in parallel for each color and result in 768 coefficients for each tile. Frequency domain processing includes space-variant optical aberration correction (required for high-resolution small-format image sensors), phase correlation for image pairs and, if required in addition to the distance (disparity) measurement, texture processing. Aberration correction is performed in each channel subnet, while correlation and texture processing combine data from all four of them. After the channels/pairs are merged, the frequency domain data is converted back to the pixel domain by the IMCLT modules as 16×16 tiles representing 2-D correlation. In most cases the 2-D correlation data is reduced to a 16-element array by summing perpendicular to the disparity direction (orthogonal pairs are transposed in the frequency domain before they are combined and fed to the IMCLT); the full 2-D correlation may still be used for minor field calibration.
The 16-element array is then processed to calculate the sub-pixel argmax (residual disparity) and the corresponding correlation value (confidence); this is where the number of features is dramatically reduced. It may be useful to increase the number of features and supplement the average disparity for all 4 pairs of the quad camera with separate horizontal and vertical pairs to improve foreground-background separation. These additional features are calculated by identical TP phase correlation subnets, as shown in Figure 3. With all 3 correlations each tile produces 6 values that are fed to the CNN input as a 614×460×6 tensor; in that case the feature reduction would still be over 40 times (128 times for a single correlation pair).
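The reduction figures quoted above can be checked with a few lines of arithmetic. This is only a sketch: it assumes that every Bayer mosaic pixel of the four sensors counts as one raw input feature and that tiles are placed on an 8-pixel grid; the variable names are illustrative.

```python
# Feature-count arithmetic for a quad camera with 4912x3684 sensors.
SENSOR_W, SENSOR_H = 4912, 3684
NUM_CAMERAS = 4
TILE_STEP = 8  # assumed tile pitch: 16x16 tiles overlapping by half

tiles_x = SENSOR_W // TILE_STEP
tiles_y = SENSOR_H // TILE_STEP
raw_inputs = NUM_CAMERAS * SENSOR_W * SENSOR_H  # one value per mosaic pixel


def feature_reduction(values_per_tile):
    """Ratio of raw pixel inputs to TP output features."""
    return raw_inputs / (tiles_x * tiles_y * values_per_tile)


# Three correlations give (disparity, confidence) x 3 = 6 values per tile.
reduction_three = feature_reduction(6)
# A single combined correlation gives 2 values per tile.
reduction_one = feature_reduction(2)

print(tiles_x, tiles_y)
print(round(reduction_three, 1), round(reduction_one, 1))
```

Under these assumptions the tile grid comes out as 614×460 and the two ratios land slightly above 42 and 128, matching the figures in the text.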

Conversion of the raw pixels to the Disparity Space Image (DSI) by the TP involves a significant reduction of the (X,Y) resolution. When using four of the 18 MPix (4912×3684) imagers the resulting DSI resolution is just 614×460, which may seem like a waste of the sensor resolution. Actually, it is not:

- the deep sub-pixel disparity resolution needed for long-distance ranging requires matching of large image areas anyway;
- most of the image area in most real-world images corresponds to smooth 3-D surfaces, where the assumption of a common disparity value for a tile is reasonable;
- the initial image resolution is preserved by the TP when source images are converted to textures (simultaneously improving quality, as the data from 4 rectified images is averaged);
- a pixel-accurate distance map may be restored by extra processing of the pixel data for selected tiles where a depth discontinuity is detected, then assigning each pixel to one of the available surfaces.

A significant (42 to 128 times) reduction of the input features is not the only advantage of the TP+CNN combination over the CNN alone. Being “convolutional”, CNNs depend on translation symmetry: groups of related pixels are treated the same way regardless of their location in the image. That is only an approximation, especially when dealing with high-resolution images and extracting subpixel disparity values. The divergence from the strict convolution model is caused by optical aberrations and distortions; it requires either the use of space-variant convolution or complete aberration correction and image rectification before the images are fed to the network. Otherwise both the complexity of the network and the amount of training data would increase dramatically. Image rectification with pixel (or slightly better) precision is a common task in stereo processing. It involves interpolation and re-sampling of the pixel data, a process that introduces phase noise, which is especially harmful when deep super-resolution of the matched images is required. The Tile Processor implementation combines multiple operations (fractional pixel image shifts, optical aberration correction, phase correlation of the matched pairs) and avoids re-sampling the image from the sensor pixel grid by replacing it with phase rotation in the frequency domain.

The final step that reduces the number of features sent from the TP to the CNN is the extraction of the disparity value by calculating the argmax of the phase correlation data. This function has to be calculated with subpixel resolution from data defined on the integer pixel grid. Certain biases are possible, and the TP implementation offers a trade-off between speed and accuracy. The resulting disparity value is the sum of the pre-applied disparity (implemented as a phase rotation in the frequency domain on top of the integer pixel shift) and the argmax value (the offset of the correlation maximum from zero). When higher accuracy is required, a second iteration may be performed by applying the full disparity from the first iteration; the residual argmax offset will then be close to zero and less subject to bias.
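As an illustration of the sub-pixel argmax step, here is a minimal sketch that fits a parabola through the maximum and its two neighbors. The three-point parabolic fit is a common generic technique chosen for this example; the text does not say which interpolation the TP actually uses, and the function names and the sign convention of the disparity sum are assumptions.

```python
def subpixel_argmax(correlation):
    """Return (offset from the array center, interpolated peak value)."""
    n = len(correlation)
    center = n // 2
    i = max(range(n), key=lambda k: correlation[k])
    if i == 0 or i == n - 1:
        return float(i - center), correlation[i]  # no neighbors to fit
    left, mid, right = correlation[i - 1], correlation[i], correlation[i + 1]
    curvature = left - 2.0 * mid + right
    if curvature == 0.0:
        return float(i - center), mid
    fraction = 0.5 * (left - right) / curvature   # vertex of the parabola
    peak = mid - 0.25 * (left - right) * fraction
    return (i - center) + fraction, peak


def total_disparity(pre_applied, correlation):
    """Sum of the pre-applied disparity and the residual argmax offset."""
    residual, confidence = subpixel_argmax(correlation)
    return pre_applied + residual, confidence


# Synthetic 16-element correlation: a parabola peaking 0.3 pix off center.
correlation = [1.0 - 0.1 * (k - 8.3) ** 2 for k in range(16)]
disparity, confidence = total_disparity(2.5, correlation)
print(disparity, confidence)
```

For this synthetic parabola the fit is exact, so the result is a disparity of about 2.8 with a confidence of about 1.0. Real correlation peaks are not parabolic, which is one source of the bias mentioned above and the reason a second iteration with a near-zero residual helps.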

An optimal system for real-time high-resolution 3-D scene reconstruction and ranging would require development of an application-specific SoC. With a set of four 18 MPix image sensors (such as the ON Semiconductor AR1820HS), a single ×16 1600 MHz DDR4 memory device and a 16 nm technology process, the TP subsystem will be capable of 10 Hz operation covering the full 4912×3684 frames while reserving half of the memory bandwidth for operations other than the TP.

We plan to emulate such a system using available NC393 camera electronic and optical-mechanical components, including multiple 10393 system boards based on the Xilinx Zynq 7030 SoC. Each such board has a GigE port and four identical sensor ports routed directly to the FPGA I/O pads, allowing flexible assignment of pin functions. Typical applications include up to 8 differential LVDS pairs, a clock pair, I²C and a clock input. The same connectors can be used for high-speed communication between the 10393 boards. Partitioning the system into multiple boards will make it possible to fit the required TP functionality into smaller FPGAs and then send the resulting features (614×460×6) over GigE to a workstation with a GPU for experiments with different CNN implementations. The system bandwidth will be lower than that of the application-specific SoC: 10 Hz operation will be possible with 5 MPix sensors (2.5 Hz with 18 MPix).

Inter-board connections are shown in Figure 1 (just the connections; the actual prototype camera will look more like the one in Figure 4, but with a wider body). Five to seven of the 10393 boards are arranged in 2 layers. The four layer 1 boards each use one of the sensor ports to receive image data from the attached sensor, perform image conditioning and flat-field correction, and store the data in the dedicated DDR3 memory. They later read the data as overlapping 16×16 pixel tiles and calculate the tile centers using calibration data together with the requested location and nominal disparity received over GigE. Each tile is transformed to the frequency domain, where the data is subject to space-variant aberration correction. The resulting frequency domain tiles are output through the three remaining sensor ports, which are reconfigured as LVDS transmitters. The layer 2 boards simultaneously receive frequency domain data from layer 1 through all 4 of their sensor ports and perform phase correlation (pointwise multiplication followed by normalization) on the image pairs. There could be just a single layer 2 board, or up to 3 (limited by the available layer 1 ports) to perform different types of correlation in parallel (all 4 pairs combined, the two vertical pairs and, separately, the two horizontal pairs for better foreground/background separation). The results of the frequency domain calculations are then transformed to the pixel domain and the argmax is calculated. The argmax value is then used to calculate the full tile disparity, and the corresponding correlation value serves as the disparity confidence. The pair (disparity, confidence) for each tile is then sent over GigE to the CNN implemented on a workstation computer.
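The phase correlation step (pointwise multiplication followed by normalization) can be sketched in one dimension as follows. Note the assumptions: this example uses a plain DFT written out naively for a 16-sample signal, whereas the TP operates on MCLT coefficients of 2-D tiles; the function names and the `eps` guard are illustrative.

```python
import cmath


def dft(x, inverse=False):
    """Naive O(N^2) DFT; adequate for a 16-sample illustration."""
    n = len(x)
    sign = 1.0 if inverse else -1.0
    scale = 1.0 / n if inverse else 1.0
    return [scale * sum(x[m] * cmath.exp(sign * 2j * cmath.pi * k * m / n)
                        for m in range(n))
            for k in range(n)]


def phase_correlate(a, b, eps=1e-9):
    """Correlate two equal-length signals, keeping only the phase."""
    spec_a, spec_b = dft(a), dft(b)
    # Pointwise multiplication of one spectrum by the conjugate of the other.
    cross = [p.conjugate() * q for p, q in zip(spec_a, spec_b)]
    # Normalization discards the amplitude; eps guards against empty bins.
    normalized = [c / (abs(c) + eps) for c in cross]
    return [v.real for v in dft(normalized, inverse=True)]


reference = [3.0, 1.0, 4.0, 1.0, 5.0, 9.0, 2.0, 6.0,
             5.0, 3.0, 5.0, 8.0, 9.0, 7.0, 9.0, 3.0]
# Second "image": the same data circularly shifted by 3 samples.
shifted = [reference[(i - 3) % len(reference)] for i in range(len(reference))]

correlation = phase_correlate(reference, shifted)
peak_index = max(range(len(correlation)), key=lambda k: correlation[k])
print(peak_index, round(correlation[peak_index], 3))
```

Because only the phase is kept, the result is a sharp peak located at the shift between the two inputs, largely independent of the image contrast.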

While the TP functionality has already been tested with software emulation and an efficient implementation has been developed, more research is needed for the CNN part of the system. Available image sets, such as KITTI^{[3]}, have insufficient resolution (1.4 MPix) and use a different spatial arrangement of the cameras. We plan to capture high-resolution quad camera image sets using available NC393-based cameras that will be upgraded from 5 MPix to 18 MPix sensors of the same 1/2.3″ format, so the optical-mechanical design will remain the same. As we are primarily interested in long-distance ranging (a few hundred to thousands of meters), using LIDARs to capture ground truth data is not practical. Instead we plan to mount a pair of identical quad cameras (each with a baseline of 150 mm) on a car 1500 mm apart, pointed in the same direction, so that when the 3-D measurements from these quad cameras are fused, the accuracy of the composite distance data will be ten times better, because the effective baseline will be 1500 mm. Of course this method has some limitations (it will not help to improve data from poorly textured objects), but it will provide higher absolute distance resolution that can be used for the loss function during CNN training. Data from the individual quad cameras will be used for training and testing of the network.
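The tenfold figure follows from the usual pinhole stereo relation. The symbols below ($Z$ for distance, $f$ for focal length in pixels, $B$ for baseline, $d$ for disparity in pixels) are notation assumed for this sketch, not taken from the camera software:

$$Z=\frac{f\cdot B}{d},\qquad \delta Z\approx \frac{Z^{2}}{f\cdot B}\cdot \delta d$$

For the same disparity error $\delta d$ and the same lens, increasing $B$ from 150 mm to 1500 mm reduces the range error $\delta Z$ at a given distance by a factor of 10.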

All acquired images, related calibration data and software will be available online under GNU GPL.

[1] Becker, Suzanna, and Geoffrey E. Hinton. “Self-organizing neural network that discovers surfaces in random-dot stereograms.” Nature 355.6356 (1992): 161.

[2] Scharstein, Daniel, et al. “High-resolution stereo datasets with subpixel-accurate ground truth.” German Conference on Pattern Recognition. Springer, Cham, 2014.

[3] Menze, Moritz, and Andreas Geiger. “Object scene flow for autonomous vehicles.” Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2015.

[4] J. Zbontar and Y. LeCun, “Stereo matching by training a convolutional neural network to compare image patches,” Journal of Machine Learning Research, vol. 17, no. 1-32, p. 2, 2016.

[5] W. Luo, A. G. Schwing, and R. Urtasun, “Efficient deep learning for stereo matching,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5695–5703, 2016.

[6] H. Hirschmuller, “Accurate and efficient stereo processing by semi-global matching and mutual information,” in Computer Vision and Pattern Recognition, 2005. CVPR 2005. IEEE Computer Society Conference on, vol. 2, pp. 807–814, IEEE, 2005.

[7] A. Kendall, H. Martirosyan, S. Dasgupta, P. Henry, R. Kennedy, A. Bachrach, and A. Bry, “End-to-end learning of geometry and context for deep stereo regression,” arXiv preprint arXiv:1703.04309, 2017.

[8] Sihag, Saurabh, and Pranab Kumar Dutta. “Faster method for Deep Belief Network based Object classification using DWT.” arXiv preprint arXiv:1511.06276 (2015).

[9] Ulicny, Matej, and Rozenn Dahyot. “On using CNN with DCT based Image Data.” Proceedings of the 19th Irish Machine Vision and Image Processing conference, IMVIP 2017.

[10] Vasilache, Nicolas, et al. “Fast convolutional nets with fbfft: A GPU performance evaluation.” arXiv preprint arXiv:1412.7580 (2014).

[11] Lavin, Andrew, and Scott Gray. “Fast algorithms for convolutional neural networks.” Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2016.

- works in JP4 format (COLOR=5). Because demosaicing is not done in this format, it does not require extra scan lines, which simplified the FPGA logic.
- fps is controlled:
  - by exposure for the sensor in the freerun mode (TRIG=0, delivers the maximum fps possible)
  - by the external or internal trigger period for the sensor in the snapshot mode (TRIG=4, a bit lower fps than in freerun)

Why “*small tile*”? Most camera images have a short (up to a few pixels) correlation/mutual information span related to the acquisition system properties: optical aberrations cause a single scene object point to influence a small area of the sensor pixels. When matching multiple images, increasing the window size reduces the lateral (x,y) resolution, so many 3d reconstruction algorithms do not use windows at all and process every pixel individually. Another limitation on the window size comes from the fact that FD conversions (Fourier and similar) in Cartesian coordinates are shift-invariant but sensitive to scale and rotation mismatch. So when targeting, say, 0.1 pixel disparity accuracy, the scale mismatch should not cause the error accumulated over the window width to exceed that value. With 8×8 tiles (16×16 overlapped) the acceptable scale mismatch (such as focal length variations) should be under 1%. That tolerance is reasonable, but it cannot get much tighter.

What is “*space variant*”? One of the most universal operations performed in the FD is convolution (also related to correlation), which exploits the convolution-multiplication property. Mathematically, convolution applies the same operation to each point of the source data, so a shifted object in the source image produces just a shifted result after convolution. In the physical world this is a close approximation, but not an exact one. Stars imaged by a telescope may have sharper images in the center and more blurred ones in the peripheral areas. While (angularly) close stars produce images of almost the same shape, distant ones do not. This does not invalidate the convolution approach completely, but it requires the kernel to vary (smoothly) over the input images^{[1, 2]}, making it a space-variant kernel.

There is another issue related to space-variant kernels. Fractional pixel shifts are required for multiple steps of the processing: aberration correction (obvious in the case of lateral chromatic aberration) and image rectification before matching, which accounts for lens optical distortion, camera orientation mismatch and epipolar geometry transformations. Traditionally this is handled by image rectification that involves re-sampling the pixel values onto a new grid using some type of interpolation. This process distorts the signal data and introduces non-linear errors that reduce the accuracy of the correlation, which is important for subpixel disparity measurements. Our approach completely eliminates re-sampling: it applies the integer pixel shift in the pixel domain and delegates the residual fractional pixel shift (±0.5 pix) to the FD, where it is implemented as a cosine/sine phase rotator. Multiple sources of the required pixel shift are combined for each tile, and then a single phase rotation is performed as the last step of the pixel domain to FD conversion.
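The idea of replacing re-sampling with a phase rotation can be sketched with an ordinary DFT. This is a stand-in for illustration only: the converter described in this post rotates MCLT (DCT-IV/DST-IV) coefficients, not DFT bins, and the helper names here are hypothetical.

```python
import cmath
import math


def dft(x, inverse=False):
    """Naive O(N^2) DFT; adequate for a 16-sample illustration."""
    n = len(x)
    sign = 1.0 if inverse else -1.0
    scale = 1.0 / n if inverse else 1.0
    return [scale * sum(x[m] * cmath.exp(sign * 2j * cmath.pi * k * m / n)
                        for m in range(n))
            for k in range(n)]


def shift_by_phase_rotation(x, shift):
    """Delay a periodic signal by `shift` samples without interpolation."""
    n = len(x)
    rotated = []
    for k, value in enumerate(dft(x)):
        freq = k if k <= n // 2 else k - n  # signed frequency index
        rotated.append(value * cmath.exp(-2j * cmath.pi * freq * shift / n))
    return [v.real for v in dft(rotated, inverse=True)]


def sample(position):
    """Band-limited test signal that can be evaluated at any position."""
    return math.cos(2.0 * math.pi * 2.0 * position / 16.0 + 0.4)


original = [sample(i) for i in range(16)]
shifted = shift_by_phase_rotation(original, 0.25)   # within +/-0.5 pix
expected = [sample(i - 0.25) for i in range(16)]
max_error = max(abs(a - b) for a, b in zip(shifted, expected))
print(max_error)
```

For a band-limited signal the rotated result agrees with the analytically shifted samples to within floating-point rounding, with no interpolation kernel involved.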

Modulated Complex Lapped Transform (MCLT)^{[3]} can be used to split an input sequence into overlapping fractions that are processed separately and then recombined without block artifacts. A popular application is signal compression, where “processed separately” means compressed by the encoder (possibly lossily) and then reconstructed by the decoder. MCLT is similar to the MDCT, which is implemented with DCT-IV, but it additionally preserves the signal phase and allows its modification in the frequency domain. This feature is required for our application (fractional pixel shifts and asymmetrical lens aberrations modify the phase), and MCLT includes both MDCT and MDST (which use DCT-IV and DST-IV respectively). For image processing (2d conversion) four sub-transforms are needed:

- horizontal DCT-IV followed by vertical DCT-IV
- horizontal DST-IV followed by vertical DCT-IV
- horizontal DCT-IV followed by vertical DST-IV
- horizontal DST-IV followed by vertical DST-IV
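The four sub-transforms listed above are separable, so each can be sketched as a row pass followed by a column pass. The code below is a plain software illustration using orthonormal DCT-IV/DST-IV on an already folded 8×8 tile; the function and dictionary names are hypothetical and do not correspond to the RTL modules.

```python
import math


def dct4(x):
    """Orthonormal DCT-IV of a sequence."""
    n = len(x)
    return [math.sqrt(2.0 / n)
            * sum(x[l] * math.cos(math.pi / n * (l + 0.5) * (k + 0.5))
                  for l in range(n))
            for k in range(n)]


def dst4(x):
    """Orthonormal DST-IV of a sequence."""
    n = len(x)
    return [math.sqrt(2.0 / n)
            * sum(x[l] * math.sin(math.pi / n * (l + 0.5) * (k + 0.5))
                  for l in range(n))
            for k in range(n)]


def dtt_2d(tile, horizontal, vertical):
    """Apply `horizontal` to every row, then `vertical` to every column."""
    rows = [horizontal(list(row)) for row in tile]
    columns = [vertical(list(col)) for col in zip(*rows)]
    return [list(row) for row in zip(*columns)]


SUB_TRANSFORMS = {
    "DCT/DCT": (dct4, dct4),
    "DST/DCT": (dst4, dct4),
    "DCT/DST": (dct4, dst4),
    "DST/DST": (dst4, dst4),
}


def max_abs_diff(a, b):
    return max(abs(p - q) for ra, rb in zip(a, b) for p, q in zip(ra, rb))


tile = [[float((3 * r + 5 * c) % 11) for c in range(8)] for r in range(8)]
coefficients = {name: dtt_2d(tile, h, v)
                for name, (h, v) in SUB_TRANSFORMS.items()}

# Orthonormal DCT-IV and DST-IV are their own inverses, so applying the
# same 2-D transform twice should return the original tile.
round_trip_error = max(max_abs_diff(tile, dtt_2d(coefficients[name], h, v))
                       for name, (h, v) in SUB_TRANSFORMS.items())
print(round_trip_error)
```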

Figure 1 illustrates the principle of TDAC (time-domain aliasing cancellation), which restores the initial data from a series of individually converted subsections after seemingly lossy transformations. Step a) shows the extraction of 2*N-long (in our case N=8) data subsets; each subset is multiplied by a window function. Here it is a half period of the sine function, but other windows are possible as long as they satisfy the Princen-Bradley condition. Each of the sections $(a\dots h)$ corresponds to N/2 of the input samples, and color gradients indicate input order. Figure 1b shows the seemingly lossy result of MDCT (left) and MDST (right) performed on the 2*N-long input $(a,b,c,d)$, resulting in the N-long $(-\tilde{c}-d,\ a-\tilde{b})$; the tilde indicates time-reversal of the subsequence (the image has the gradient direction reversed too). Upside-down pieces indicate subtraction, upright ones addition. Each of the 2*N → N transforms is irreversible by itself; TDAC depends on the neighbor sections.

Figure 1c shows the first step of the original sequence restoration: it extends the N-long sequence using the DCT-IV boundary conditions. The sequence continues symmetrically around the left boundary and anti-symmetrically around the right one (both around a half-sample from the first/last one), behaving like the first quadrant (0 to π/2) of the cosine function. Conversions for the DST branch are not shown; they are similar, just extended like the first quadrant of the sine function.

The result sequence of step c) (now 2*N long again) is multiplied by the window function for the second time in step d); the two added terms of the last column ($d$ and $\tilde{c}$) are swapped for clarity.

The last image e) places the result of step d) and those of the similarly processed subsequences $(c,d,e,f)$ and $(e,f,g,h)$ on the same time line. Fraction $(\tilde{d},\tilde{c})$ of the first block compensates $(-\tilde{d},-\tilde{c})$ of the second, and $(\tilde{f},\tilde{e})$ of the second compensates $(-\tilde{f},-\tilde{e})$ of the third. As the window satisfies the Princen-Bradley condition and is applied twice, the $c$ columns of the first and second blocks, when added, result in the $c$ segment of the original sequence. The same is true for the $d$, $e$, and $f$ columns. The first and last N samples are not restored, as there are no respective left and right neighbors to be processed.
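The TDAC steps a) through e) can be reproduced numerically for the MDCT branch. The sketch below is a software illustration with N=8 and the half-sine window; it folds each 2*N block as $(-\tilde{c}-d,\ a-\tilde{b})$, uses an orthonormal DCT-IV (which is its own inverse), unfolds with the DCT-IV boundary conditions and overlap-adds. The function names and the test signal are assumptions made for this example.

```python
import math

N = 8  # samples per transform; each input block is 2*N long


def dct4(u):
    """Orthonormal DCT-IV (its own inverse)."""
    n = len(u)
    return [math.sqrt(2.0 / n)
            * sum(u[l] * math.cos(math.pi / n * (l + 0.5) * (k + 0.5))
                  for l in range(n))
            for k in range(n)]


# Half-sine window; satisfies the Princen-Bradley condition.
WINDOW = [math.sin(math.pi * (i + 0.5) / (2 * N)) for i in range(2 * N)]


def fold(block):
    """2N -> N: (a, b, c, d) -> (-c_rev - d, a - b_rev)."""
    h = N // 2
    a, b, c, d = (block[i * h:(i + 1) * h] for i in range(4))
    first = [-c[h - 1 - i] - d[i] for i in range(h)]
    second = [a[i] - b[h - 1 - i] for i in range(h)]
    return first + second


def unfold(u):
    """N -> 2N: (u1, u2) -> (u2, -u2_rev, -u1_rev, -u1)."""
    h = N // 2
    u1, u2 = u[:h], u[h:]
    return (u2 + [-v for v in reversed(u2)]
            + [-v for v in reversed(u1)] + [-v for v in u1])


def mdct(block):
    """Window (step a), fold (step b) and transform one block."""
    return dct4(fold([w * s for w, s in zip(WINDOW, block)]))


def imdct(coeffs):
    """Inverse transform, unfold (step c) and window again (step d)."""
    return [w * s for w, s in zip(WINDOW, unfold(dct4(coeffs)))]


# Sawtooth on a pedestal, 5*N samples long.
signal = [10.0 + 3.0 * ((i % 7) / 6.0) for i in range(5 * N)]
restored = [0.0] * len(signal)
for start in range(0, len(signal) - 2 * N + 1, N):
    block_out = imdct(mdct(signal[start:start + 2 * N]))
    for i, value in enumerate(block_out):       # overlap-add (step e)
        restored[start + i] += value

interior_error = max(abs(signal[i] - restored[i])
                     for i in range(N, len(signal) - N))
print(interior_error)
```

All samples except the first and last N are restored to within rounding error, while the edge samples are not, in line with the remark about the missing neighbors.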

Modified discrete cosine and sine transforms exhibit the perfect reconstruction property when the input sequence is split into regular overlapping intervals, and this is the case for many applications, such as audio and video compression. But what happens in the case of a space-variant shift? As the total shift is split into an integer part and a symmetrical fractional part, even with a smooth variation of the required shift between neighbor sample sequences there will be places where the required shift crosses the ±0.5 pix boundaries. This will cause the overlap to be N±1 instead of exactly N, and the remaining fractional pixel shift will jump from +0.5 pix to -0.5 pix or vice versa.
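A small numeric illustration of this effect, assuming the integer part is obtained by rounding to the nearest pixel and that a tile's start position is its nominal position plus the integer shift (the per-tile shift values and names are made up for the example):

```python
import math

N = 8  # nominal step between overlapping 2*N-long input intervals


def split_shift(total_shift):
    """Split a shift into an integer part and a fraction in [-0.5, 0.5)."""
    integer_part = math.floor(total_shift + 0.5)
    return integer_part, total_shift - integer_part


# Hypothetical required shift, varying smoothly from tile to tile.
required = [0.30, 0.40, 0.55, 0.65]

starts, fractions = [], []
for tile, shift in enumerate(required):
    whole, fraction = split_shift(shift)
    starts.append(tile * N + whole)   # first input sample of the interval
    fractions.append(fraction)        # left for the FD phase rotator

overlaps = [2 * N - (starts[i + 1] - starts[i])
            for i in range(len(starts) - 1)]
print(starts, overlaps)
print([round(f, 2) for f in fractions])
```

Between the second and third tiles the required shift crosses 0.5 pix: the overlap drops from N to N-1 and the fractional part jumps from positive to negative, even though the required shift itself changes by only 0.15 pix.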

Such a condition is illustrated in Figure 2, where the fractional pixel shift is increased to ±1.0 instead of ±0.5 to avoid the signal shape distortion caused by a fractional shift. The sine window is modified accordingly to have zeros on both ends and thus a flat top, but it still satisfies the Princen-Bradley condition. The sawtooth waveform represents an input signal that has a large pedestal of 10 and a full amplitude of 3. The two left input intervals (1 to 16 and 9 to 24) are shown in red, the two right ones (15 to 30 and 23 to 38) in blue. There is extra overlap (N+2) between the red and blue intervals. Dotted waveforms show the window functions centered around the input intervals, dashed ones the inputs multiplied by the windows. Solid lines on the top plot show the result of the FD rotation, which shifts the red subsequences by 1 sample to the right and the blue ones by 1 sample to the left.

The bottom plot of Figure 2 is for the “rectified image”, where the overlapping intervals are spread evenly (0-15, 8-23, 16-31, 24-39), the restored data is multiplied by the windows again (red and blue solid lines) and then added together (dark green waveform) in an attempt to restore the original sawtooth.

There is an obvious problem in the center, where the peak is twice as high as the center sawtooth (dotted waveform). It is actually a “wrinkle” caused by the input signal pedestal that was pushed to the center by the increased overlap of the input, not by the sawtooth shape. The FD phase shift moved the *windowed* input signal, not just the input signal. So even if the input signal were constant, the two left windows would be shifted right by 1 sample and the two right ones left by 1 sample, distorting their sum. Figure 3 illustrates the opposite case, where the input subsequences have reduced rather than increased overlap and the FD phase rotation moves the windowed data away from the center: the restored signal has a dip in the middle.

This problem can be corrected if the first window function, used for input signal multiplication, takes the FD shift into account, so that the input windows *after* phase rotation match the output ones. Signals similar to those in Figure 2, but with appropriately offset input windows (the dotted waveforms are asymmetrical by 1 pixel), are presented in Figure 4. It shows perfect reconstruction (dark green line) of the offset input sawtooth signal.

The MCLT converter illustrated in Figure 5 takes an array of 256 pixel samples and processes them at the rate of 1 pixel per clock, producing four 64-element (8×8) arrays that represent the FD transformation of this tile. The converter incorporates a phase rotator that is equivalent to a fractional pixel shifter with 1/128 pixel resolution. Multiple pixel tiles can be processed immediately after each other or with a minimal gap of 16 pixels needed to restore the pipeline state. The *fold sequencer* receives a start signal and, simultaneously, two 7-bit fractional pixel X and Y shifts in 1/128 increments (-64 to +63). Pixel data is received from the read-only port of the dual-port memory (*input tile buffer*) filled from the external DDR memory. The *fold sequencer* generates buffer addresses (including the memory page number) and can be configured for various buffer latencies.

The *fold sequencer* simultaneously generates X and Y addresses for the *2-port window ROM*, which provides the window function values (a half-sine, as discussed above); these are combined by a multiplier, as the 2-d window function is separable. Each dimension calculates window values with the appropriate subpixel shift to allow space-variant FD processing. Mapping from the 16×16 to 8×8 tiles is performed according to Figure 1b for each direction, resulting in four 8×8 tiles for DCT/DCT (horizontal/vertical), DST/DCT, DCT/DST and DST/DST further processing. Each of the source pixels contributes to all 4 of the 8×8 arrays, and corresponding elements of the 4 output arrays share the same 4 contributing pixels, just with different signs. This makes it possible to iterate through the source pixels once in groups of 4, multiply them by the appropriate window values and, in 4 cycles, add/subtract them in 4 accumulators. These values are registered and multiplexed during the next 4 cycles, feeding the *512×25 DTT input buffer* (DTT stands for Discrete Trigonometric Transform, a collective name for both DCT and DST).

“Folded” data stored in the *512×25 DTT input buffer* is fed to the 2-dimensional 8×8 pixel DTT module. It is similar to the one described in the DCT-IV implementation blog post, modified to allow all 4 DTT variants, not just the DCT/DCT described there. This was done using the property that DST-IV can be calculated as DCT-IV if the input sequence is reversed (x0 ↔ x7, x1 ↔ x6, x2 ↔ x5, x3 ↔ x4) and the sign of the odd output samples (y1, y3, y5, y7) is inverted. This property can be seen by comparing the plots of the basis functions in Figure 6; the proof is given later in the text.
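The reversed-input property is easy to check numerically before getting to the proof. The sketch below compares a directly computed DST-IV with one obtained from DCT-IV; the orthonormal scaling and the function names are choices made for this example, not details of the RTL module.

```python
import math


def dct4(x):
    """Orthonormal DCT-IV of a sequence."""
    n = len(x)
    return [math.sqrt(2.0 / n)
            * sum(x[l] * math.cos(math.pi / n * (l + 0.5) * (k + 0.5))
                  for l in range(n))
            for k in range(n)]


def dst4(x):
    """Orthonormal DST-IV of a sequence (reference implementation)."""
    n = len(x)
    return [math.sqrt(2.0 / n)
            * sum(x[l] * math.sin(math.pi / n * (l + 0.5) * (k + 0.5))
                  for l in range(n))
            for k in range(n)]


def dst4_via_dct4(x):
    """DST-IV from DCT-IV: reverse the input, negate the odd outputs."""
    y = dct4(list(reversed(x)))                      # x0 <-> x7, x1 <-> x6, ...
    return [-v if k % 2 else v for k, v in enumerate(y)]  # flip y1, y3, y5, y7


samples = [0.5, -1.0, 2.0, 3.5, -0.25, 1.0, 4.0, -2.0]
max_difference = max(abs(a - b)
                     for a, b in zip(dst4(samples), dst4_via_dct4(samples)))
print(max_difference)
```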

Another memory buffer is needed after the *2d DTT* module, as it processes one of four 64-pixel transforms at a time, while the phase rotator (Figure 7) needs simultaneous access to all 4 components of the same FD sample. These 4 components are shown on the left of the diagram; they are multiplied by 4 sine/cosine values (shown as CH, SH, CV and SV) and then combined by the adders/subtracters. The total number of multiplications is 16, and they have to be performed in 4 clock cycles to maintain the same data rate throughout the MCLT converter, so 4 multiplier-accumulator modules are needed. One FD point calculation uses 4 different sine/cosine coefficients, so a single ROM is sufficient. The phase rotator uses the horizontal and vertical fractional pixel shift values for the second time (the first was for the window function) and combines them with the multiplexed horizontal and vertical indexes (3 bits each) as address inputs of the coefficient ROM. The rotator provides rotated coefficients that correspond to the MCLT transform of the pixel data shifted in both horizontal and vertical directions by up to ±0.5 pix, at a rate of 1 coefficient per clock cycle, providing both output data and address to the external multi-page memory buffer.

Most camera applications use color image sensors that provide Bayer mosaic color data: one red, one blue and two diagonal green pixels in each 2×2 pixel group. In earlier times image sensor pixel density was below the optical resolution of the lenses, and all cameras performed color interpolation (usually bilinear) of the “missing” colors. This process implied that each red pixel, for example, had four green neighbors (up, down, left and right) at equal distance, and 4 blue pixels located diagonally. With modern high-resolution sensors this is not the case; possible distortions are shown in Figure 8 (copied from the earlier post). More elaborate “de-mosaic” processing involves non-linear operations that would influence sub-pixel correlation results.

As simple de-mosaic procedures cannot be applied to high-resolution sensors without degrading the images, we treat each color subchannel individually and merge the results after performing the optical aberration correction in the FD, or at least after compensating the lateral chromatic aberration that causes an x/y pixel shift.

Channel separation means that when converting data to the FD the input data arrays are decimated: for green only half of the values (located in a checkerboard pattern) are non-zero, and for the red and blue subchannels only a quarter of all pixels have non-zero values. The relative phase of the remaining pixels depends on the required integer pixel offset (only ±0.5 pix is delegated to the FD) and may have 2 values (black vs. white checkerboard cells) for green, and 4 different values (odd/even for each of the horizontal/vertical directions independently) for red and blue. MCLT can be performed on the sparse input arrays the same way as on the complete 16×16 ones; low-pass filters (LPF) may be applied at later processing stages (LPF may be applied to the deconvolution kernels used for aberration correction during camera calibration). It is convenient to multiply red and blue values by 2 to compensate for the smaller number of participating pixels compared to the green sub-channel.
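The decimation described above can be sketched as splitting one 16×16 mosaic tile into three sparse arrays. The RGGB layout (red at even row/even column) and all names below are assumptions for illustration; the actual phase of the pattern depends on the sensor and on the integer pixel offset of the tile.

```python
TILE = 16
GAIN = {"red": 2.0, "green": 1.0, "blue": 2.0}  # red/blue doubled, see text


def bayer_color(row, col):
    """Assumed RGGB layout: red at even/even, blue at odd/odd, else green."""
    if row % 2 == 0 and col % 2 == 0:
        return "red"
    if row % 2 == 1 and col % 2 == 1:
        return "blue"
    return "green"


def split_channels(tile):
    """Return three sparse TILE x TILE arrays, zero where a color is absent."""
    channels = {name: [[0.0] * TILE for _ in range(TILE)] for name in GAIN}
    for r in range(TILE):
        for c in range(TILE):
            name = bayer_color(r, c)
            channels[name][r][c] = GAIN[name] * tile[r][c]
    return channels


flat_tile = [[1.0] * TILE for _ in range(TILE)]  # uniform gray test input
channels = split_channels(flat_tile)

nonzero = {name: sum(v != 0.0 for row in plane for v in row)
           for name, plane in channels.items()}
totals = {name: sum(sum(row) for row in plane)
          for name, plane in channels.items()}
print(nonzero)
print(totals)
```

For a uniform input, green keeps half of the 256 samples and red and blue a quarter each, while the doubled red and blue values bring all three channel sums to the same level.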

A direct approach to calculating the FD transformation of the color mosaic input image would be either to run the monochrome converter (Figure 5) three times (in 256*3=768 clock cycles), masking out a different pixel pattern each time, or to implement the same module (including the input pixel buffer) three times to achieve 256 clock cycle operation. Moreover, the 16×16 pixel buffer used in the monochrome converter is not sufficient: even a small lateral chromatic aberration leads to a mismatch of the 16×16 tiles for different colors. That would require a larger pixel buffer, 18×18 for minimal aberration, or larger still if the aberration may exceed a full pixel.

Luckily, the MCLT consists of MDCT and MDST, which in turn use DCT-IV and DST-IV, and these transforms have a very convenient property when processing the Bayer mosaic data. Almost the same implementation as used for the monochrome converter can transform color data in the same 256 clock cycles. The only part that requires more resources is the final phase rotator: it has to output 3*4*64 values in 256 clock cycles, so three instances of the same rotator are required to provide the full load of the rest of the circuitry. This almost triple (not counting the rotators) reduction of the MCLT calculation resources is based on “folding” of the input data for the DTT inputs and the following DCT-IV/DST-IV relation.

*DST-IV is equivalent to DCT-IV of the input sequence, where all odd input samples are multiplied by -1, and the result values are read in the reversed order.* In the matrix form it is shown in (1) below:

$$\mathrm{DST}^{\mathrm{IV}}=\left[\begin{array}{ccccc}0&&&&1\\&&&1&\\&&1&&\\&\cdots&&&\\1&&&&0\end{array}\right]\cdot\mathrm{DCT}^{\mathrm{IV}}\cdot\left[\begin{array}{ccccc}1&&&&0\\&-1&&&\\&&1&&\\&&&\ddots&\\0&&&&-1\end{array}\right]$$ | (1) |

Equations (2) and (3) show definitions of DCT-IV and DST-IV^{[4]}:

$$\mathrm{DCT}^{\mathrm{IV}}(k)=\sqrt{\frac{2}{N}}\cdot\sum_{l=0}^{N-1}\cos\left(\frac{\pi}{N}\cdot\left(l+\frac{1}{2}\right)\cdot\left(k+\frac{1}{2}\right)\right)$$ | (2) |

$$\mathrm{DST}^{\mathrm{IV}}(k)=\sqrt{\frac{2}{N}}\cdot\sum_{l=0}^{N-1}\sin\left(\frac{\pi}{N}\cdot\left(l+\frac{1}{2}\right)\cdot\left(k+\frac{1}{2}\right)\right)$$ | (3) |

The DCT-IV modified according to (1) can be rewritten as two separate sums for even (*l = 2m*) and odd (*l = 2m+1*) input samples, with the output samples *k* replaced by the reversed ones (*N-1-k*). Then, after removing full periods (2πn) and applying trigonometric identities, it can be converted to the same value as DST-IV:

$${\mathrm{DCT}}_{\mathrm{mod}}^{\mathrm{IV}}\left(k\right)=\sqrt{\frac{2}{N}}\cdot (\sum _{m=0}^{N/2\u20131}\mathrm{cos}(\frac{\pi}{N}\cdot (2\cdot m+\frac{1}{2})\cdot ((N\u20131\u2013k)+\frac{1}{2}\left)\right)\u2013\sum _{m=0}^{N/2\u20131}\mathrm{cos}(\frac{\pi}{N}\cdot (2\cdot m+\frac{3}{2})\cdot ((N\u20131\u2013k)+\frac{1}{2}\left)\right))=$$ $$\sqrt{\frac{2}{N}}\cdot (\sum _{m=0}^{N/2\u20131}\mathrm{cos}(\frac{\pi}{N}\cdot (\frac{1}{2}\cdot N\u2013(2\cdot m+\frac{1}{2})\cdot (k+\frac{1}{2})\left)\right)\u2013\sum _{m=0}^{N/2\u20131}\mathrm{cos}(\frac{\pi}{N}\cdot (\frac{3}{2}\cdot N\u2013(2\cdot m+\frac{3}{2})\cdot (k+\frac{1}{2})\left)\right))=$$ $$\sqrt{\frac{2}{N}}\cdot (\sum _{m=0}^{N/2\u20131}\mathrm{cos}(\u2013\frac{\pi}{2}+\frac{\pi}{N}\cdot (2\cdot m+\frac{1}{2})\cdot (k+\frac{1}{2}\left)\right)\u2013\sum _{m=0}^{N/2\u20131}\mathrm{cos}(\frac{\pi}{2}+\frac{\pi}{N}\cdot (2\cdot m+\frac{3}{2})\cdot (k+\frac{1}{2}\left)\right))=$$ $$\sqrt{\frac{2}{N}}\cdot (\sum _{m=0}^{N/2\u20131}\mathrm{sin}(\frac{\pi}{N}\cdot (2\cdot m+\frac{1}{2})\cdot (k+\frac{1}{2}\left)\right)+\sum _{m=0}^{N/2\u20131}\mathrm{sin}(\frac{\pi}{N}\cdot (2\cdot m+\frac{3}{2})\cdot (k+\frac{1}{2}\left)\right))={\mathrm{DST}}^{\mathrm{IV}}\left(k\right)$$ | (4) |
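The relation (1) is easy to check numerically. The sketch below is a plain Python illustration, not the RTL or the Java post-processing code: `dct4` and `dst4` follow definitions (2) and (3) with each term multiplied by the input sample `x[l]`, and `dst4_via_dct4` is a hypothetical helper name for the "negate odd inputs, reverse outputs" recipe.

```python
import math


def dct4(x):
    """DCT-IV per (2), each cosine term weighted by the input sample x[l]."""
    n = len(x)
    s = math.sqrt(2.0 / n)
    return [s * sum(x[l] * math.cos(math.pi / n * (l + 0.5) * (k + 0.5))
                    for l in range(n)) for k in range(n)]


def dst4(x):
    """DST-IV per (3), each sine term weighted by the input sample x[l]."""
    n = len(x)
    s = math.sqrt(2.0 / n)
    return [s * sum(x[l] * math.sin(math.pi / n * (l + 0.5) * (k + 0.5))
                    for l in range(n)) for k in range(n)]


def dst4_via_dct4(x):
    """DST-IV computed as in (1): negate odd inputs, run DCT-IV, reverse the outputs."""
    signed = [v if l % 2 == 0 else -v for l, v in enumerate(x)]
    return dct4(signed)[::-1]


x = [3.0, -1.0, 4.0, 1.0, -5.0, 9.0, 2.0, -6.0]  # arbitrary test vector, N = 8
max_err = max(abs(p - q) for p, q in zip(dst4(x), dst4_via_dct4(x)))
print(max_err)  # rounding-level difference only
```

The two results differ only by floating-point rounding, which is what allows a single DCT-IV engine to stand in for the DST-IV one.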

As shown in Figure 1b, the 2*N-long input sequence of four fragments (a,b,c,d) is folded into an N-long sequence (5) for the DCT-IV input:

$$(a,b,c,d)\longrightarrow(-\tilde{c}-d,\;a-\tilde{b})$$ | (5) |

and (6) – for DST-IV:

$$(a,b,c,d)\longrightarrow(\tilde{c}-d,\;a+\tilde{b})$$ | (6) |

where the tilde “~” over a name indicates reversal of the segment. For segments of even length (N/2=4 in our case) such reversal swaps odd- and even-numbered samples. Comparing the halves of (5) and (6) shows that the direct terms $(-d,a)$ are the same in both, while the reversed terms differ in sign: $(-\tilde{c},-\tilde{b})$ in (5) and $(\tilde{c},\tilde{b})$ in (6). Each of the input samples appears exactly once in each of the DCT-IV and DST-IV inputs. Even samples of the input sequence contribute identically to the even samples of both folded sequences (through a and d) and with opposite signs to the odd samples (through b and c). Similarly, the odd-numbered input samples contribute identically to the odd-numbered folded samples and with the opposite sign to the even ones.

So, *a sparse input sequence with non-zero values only at even positions results in DCT-IV and DST-IV inputs that have the same values at even positions and values multiplied by -1 at the odd ones. Odd-only input sequences result in the same values at odd positions and negated values at the even ones.*
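A minimal Python sketch of the folding (5) and (6) makes this sign pattern visible. It assumes N=8 (a 16-sample input split into four 4-sample fragments) and leaves out the window multiplication, which does not change which samples are zero; the function name `fold` is chosen here for illustration only.

```python
def fold(x):
    """Fold a 2N-long list (a,b,c,d) into the DCT-IV and DST-IV inputs per (5) and (6)."""
    q = len(x) // 4
    a, b, c, d = x[:q], x[q:2 * q], x[2 * q:3 * q], x[3 * q:]
    br, cr = b[::-1], c[::-1]  # reversed segments (tilde in the equations)
    fold_c = [-cr[i] - d[i] for i in range(q)] + [a[i] - br[i] for i in range(q)]
    fold_s = [cr[i] - d[i] for i in range(q)] + [a[i] + br[i] for i in range(q)]
    return fold_c, fold_s


samples = [float(i + 1) for i in range(16)]
even_only = [v if i % 2 == 0 else 0.0 for i, v in enumerate(samples)]
odd_only = [v if i % 2 == 1 else 0.0 for i, v in enumerate(samples)]

fc, fs = fold(even_only)
# even-only input: same values at even positions, negated at odd positions
even_ok = all(fs[i] == (fc[i] if i % 2 == 0 else -fc[i]) for i in range(8))

fc, fs = fold(odd_only)
# odd-only input: same values at odd positions, negated at even positions
odd_ok = all(fs[i] == (fc[i] if i % 2 == 1 else -fc[i]) for i in range(8))

print(even_ok, odd_ok)
```

In both cases the DST-IV input is the DCT-IV input with alternating signs (possibly negated as a whole), which is exactly the input modification that appears in (1).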

Now this property can be combined with the previously discussed DST-IV to DCT-IV relation and extended to the two-dimensional Bayer mosaic case. We can start with the red and blue channels that have 1-in-4 non-zero pixels. If the non-zero pixels are in the even columns, then after the first (horizontal) pass the DST-IV output will differ from the DCT-IV one only by the order of the output samples according to (1), as the even input values are the same and the odd ones are negated. The reversed order of the output coefficients for the horizontal pass means that the DST-IV output will be just horizontally flipped with respect to the DCT-IV one. If the non-zero input was in the odd columns, then the DST-IV output will be reversed and negated compared to the DCT-IV one.

The second (vertical) pass is applied similarly. If the original pattern had non-zero even rows, the result of the vertical DST-IV would be the same as that of the DCT-IV after a vertical flip. If the odd rows were non-zero instead, the result will be vertically flipped and negated with respect to the DCT-IV one.

This result means that for the 1-in-4 Bayer mosaic array only one DCT-IV/DCT-IV transform is needed. The three other DTT combinations may be obtained by flipping the DCT-IV/DCT-IV output horizontally and/or vertically and possibly negating all the data. Reversal of the readout order does not require additional hardware resources (just a few gates for a slightly fancier memory address counter). Data negation is also “free” as it can easily be absorbed by the phase rotator that already has adders/subtracters and just needs an extra sign inversion control. Flips do not depend on the odd/even columns/rows, only negation depends on them.

The green channel has non-zero values in either the (0,0) and (1,1) or the (0,1) and (1,0) positions, so it can be considered as a sum of two 1-in-4 channels described above. In the (0,0) case neither the horizontal nor the vertical DST-IV inverts the sign, and (1,1) inverts both, so the DST-IV/DST-IV does not have any inversion, and this stays true for the green color as a combination of (0,0) and (1,1). Neither DST-IV/DCT-IV nor DCT-IV/DST-IV can be obtained from DCT-IV/DCT-IV this way, but these two will have the same sign compared to each other (zero or double negation). So the DST-IV/DST-IV will be a double-flipped (both horizontally and vertically) version of the DCT-IV/DCT-IV output, and DCT-IV/DST-IV a double-flipped version of DST-IV/DCT-IV, and calculation for this green pattern requires just two of the 4 DTT operations. Similarly, for the (0,1)/(1,0) green pattern DST-IV/DST-IV will be a double-flipped and negated version of the DCT-IV/DCT-IV output, and DCT-IV/DST-IV a double-flipped and negated version of DST-IV/DCT-IV.

Combining the results for all three color components (red, blue and green), we need a total of four DTT operations on 8×8 tiles: one for red, one for blue and two for the green component, instead of the twelve needed if each pixel had a full RGB value.
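The two-dimensional flip and negation rules can be verified with a small floating-point model. The code below is a sketch under stated assumptions: the window is omitted, the pixel values are an arbitrary test pattern, the red phase (even rows, odd columns) is assumed, and helper names such as `mdtt2d` and `mosaic` are made up for this example. The first kind argument applies to the horizontal pass, the second to the vertical one.

```python
import math

N = 8  # DTT size; source tiles are 2N x 2N = 16x16


def dtt4(x, kind):
    """DCT-IV (kind 'c') or DST-IV (kind 's') of an N-long list."""
    f = math.cos if kind == "c" else math.sin
    s = math.sqrt(2.0 / N)
    return [s * sum(x[l] * f(math.pi / N * (l + 0.5) * (k + 0.5)) for l in range(N))
            for k in range(N)]


def fold(x, kind):
    """Fold a 2N-long list (a,b,c,d) per (5) for 'c' or per (6) for 's'."""
    q = N // 2
    a, b, c, d = x[:q], x[q:2 * q], x[2 * q:3 * q], x[3 * q:]
    br, cr = b[::-1], c[::-1]
    sign = -1.0 if kind == "c" else 1.0
    return ([sign * cr[i] - d[i] for i in range(q)] +
            [a[i] + sign * br[i] for i in range(q)])


def transpose(m):
    return [list(col) for col in zip(*m)]


def mdtt2d(tile, hkind, vkind):
    """Fold and transform rows with hkind, then columns with vkind; returns 8x8."""
    rows = [dtt4(fold(r, hkind), hkind) for r in tile]
    cols = [dtt4(fold(c, vkind), vkind) for c in transpose(rows)]
    return transpose(cols)


def flip_h(m):
    return [r[::-1] for r in m]


def flip_v(m):
    return m[::-1]


def neg(m):
    return [[-v for v in r] for r in m]


def same(p, q):
    return all(abs(a - b) < 1e-9 for rp, rq in zip(p, q) for a, b in zip(rp, rq))


def mosaic(cells):
    """16x16 test tile, non-zero only where (row % 2, col % 2) is in cells."""
    return [[float((7 * y + 3 * x) % 11 + 1) if (y % 2, x % 2) in cells else 0.0
             for x in range(16)] for y in range(16)]


red = mosaic({(0, 1)})  # assumed phase: even rows, odd columns
cc = mdtt2d(red, "c", "c")
red_ok = (same(mdtt2d(red, "s", "c"), neg(flip_h(cc))) and      # odd columns: flip + negate
          same(mdtt2d(red, "c", "s"), flip_v(cc)) and           # even rows: flip only
          same(mdtt2d(red, "s", "s"), neg(flip_v(flip_h(cc)))))

green = mosaic({(0, 0), (1, 1)})
gcc, gsc = mdtt2d(green, "c", "c"), mdtt2d(green, "s", "c")
green_ok = (same(mdtt2d(green, "s", "s"), flip_v(flip_h(gcc))) and
            same(mdtt2d(green, "c", "s"), flip_v(flip_h(gsc))))
print(red_ok, green_ok)
```

Only `cc` for red (and likewise for blue) plus `gcc` and `gsc` for green have to be computed with actual transforms, which gives the 1 + 1 + 2 = 4 DTT operations counted above.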

MCLT converter for the Bayer Mosaic data shown in Figure 9 is similar to the already described converter for the monochrome data. The first difference is that now the *Fold sequencer* addresses larger source tiles – it is run-time configurable to be 16×16, 18×18, 20×20 or 22×22 pixels. Each tile may have different size – this functionality can be used to optimize external memory access: use smaller tiles in the center areas of the sensor with lower lateral chromatic aberrations, read larger tiles in the peripheral areas with higher aberrations. The extended tile should accommodate all three color shifted 16×16 blocks as illustrated in Figure 10.
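As a rough illustration of how the tile size could be chosen, the sketch below picks the smallest of the four supported sizes that covers all three color-shifted 16×16 blocks. The selection rule, the function name `extended_tile_size` and the sample offsets are assumptions made for this example; the actual software may use a different criterion (for instance one that also accounts for the Bayer phase of the tile origin).

```python
ALLOWED_SIZES = (16, 18, 20, 22)  # run-time configurable tile sizes listed above


def extended_tile_size(offsets):
    """Smallest allowed tile that covers all color blocks.

    offsets: {color: (dx, dy)} integer top-left shifts of each 16x16 block, in pixels.
    """
    xs = [dx for dx, _ in offsets.values()]
    ys = [dy for _, dy in offsets.values()]
    span = 16 + max(max(xs) - min(xs), max(ys) - min(ys))
    for size in ALLOWED_SIZES:
        if size >= span:
            return size
    raise ValueError("chromatic shift too large for a 22x22 tile")


center = {"R": (0, 0), "G": (0, 0), "B": (0, 0)}     # hypothetical: no aberration
edge = {"R": (-1, 0), "G": (0, 0), "B": (1, 1)}      # hypothetical: +/-1 pixel shifts
corner = {"R": (-2, -1), "G": (0, 0), "B": (1, 2)}   # hypothetical: larger shifts
print(extended_tile_size(center), extended_tile_size(edge), extended_tile_size(corner))
```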

The *start* input triggers calculation of all 3 colors, it can appear either each 256-th clock cycle or it needs to wait for at least 16 cycles after the end of the previous tile input if started asynchronously. X/Y offsets are provided individually for each color channel and they are stored in a register file inside the module, the integer top-left corner offset is provided separately for each color to simplify internal calculations.

Dual-port window ROM is the same as in the monochrome module, both horizontal and vertical window components are combined by a multiplier and then the result is multiplied by the pixel data received from the external memory buffer with configurable latency.

Only two accumulators are required (instead of the four for monochrome) because each of the “folded” elements of the 8×8 DTT inputs has at most two contributing source pixels (for the green color), while red and blue use just a single accumulator. The red color is processed first, then blue, and finally green. For the first two only the DCT-IV/DCT-IV input data is prepared (in 64 clock cycles each); the green color additionally requires the DST-IV/DCT-IV block (128 clock cycles).

DTT output data goes to three parallel dual-port memory buffers. The red and blue channels each require just a single 64-element array that later provides 4 different values in 4 clock cycles for horizontally and vertically flipped data. The green channel requires a 2*64-element array. These buffers are filled one at a time (red, blue, green) and each of them feeds the corresponding phase rotator. Each rotator needs 256 cycles to provide 4*64=256 FD values, since the symmetry exploited during the DTT conversion is lost at this stage. The three phase rotators can provide output addresses/data asynchronously, or the red and blue outputs can be delayed and output simultaneously with the green channel, sharing the external memory addresses.

Results of the simulation of the MCLT converter are shown in Figure 11. The simulation ran with the clock frequency set to 100 MHz; the synthesized code should be able to run at 250 MHz or more in the Zynq SoC of the NC393 camera, the same rate as is currently used for the data compression, as all the memories and DSPs are fully buffered. The simulation shows processing of 3 tiles: two consecutive ones and a third one after a minimal pause. The input data was provided by the Java code that was written for the post-processing of the camera images; the same code generated the intermediate results and the output of the MCLT conversion. The Java code used double precision floating point calculations, while the RTL code is based on fixed-point calculations with the precision of the DSP primitives of the architecture, so there is no bit-to-bit match of the data; instead there is a difference that corresponds to the width of the data words used.

The RTL code (licensed under GNU GPLv3+) developed for the MCLT-based frequency domain conversion of the Bayer color mosaic images uses almost the same resources as the monochrome image transformation for tiles of the same size, three times less than a full RGB image would require. The input window function modification that accounts for the two-dimensional fractional pixel shift and the post-transform phase rotators make it possible to avoid the re-sampling for image rectification that degrades sub-pixel resolution.

This module is simulated with Icarus Verilog (a free software simulator), results are compared to those of the software post-processing used for the 3-d scene reconstruction of multiple image sets, described in the “Long range multi-view stereo camera with 4 sensors” post.

The MCLT module for the Bayer color mosaic images is not yet tested in the FPGA of the camera – it needs more companion code to implement a full processing chain that will generate disparity space images (DSI) required for the real-time 3d scene reconstruction and/or output of the aberration-corrected image textures. But it constitutes a critical part of the Tile Processor project and brings the overall system completion closer.

[1] Thiébaut, Éric, et al. “Spatially variant PSF modeling and image deblurring.” SPIE Astronomical Telescopes+ Instrumentation. International Society for Optics and Photonics, 2016.

[2] Řeřábek, M., and P. Pata. “The space variant PSF for deconvolution of wide-field astronomical images.” SPIE Astronomical Telescopes+ Instrumentation. International Society for Optics and Photonics, 2008.

[3] Malvar, Henrique. “A modulated complex lapped transform and its applications to audio processing.” Acoustics, Speech, and Signal Processing, 1999. Proceedings., 1999 IEEE International Conference on. Vol. 3. IEEE, 1999.

[4] Britanak, Vladimir, Patrick C. Yip, and Kamisetty Ramamohan Rao. Discrete cosine and sine transforms: general properties, fast algorithms and integer approximations. Academic Press, 2010.

- apache2-2.4.18 => apache2-2.4.29
- php-5.6.16 => php-5.6.31
- udev-182 changed to eudev-3.2.2, etc.

```
sysroot_stage_all_append() {
sysroot_stage_dir ${WORKDIR}/headers/include ${SYSROOT_DESTDIR}/usr/include-uapi
}
```

We had this task in Jethro but there was another variable used

```
chosen {
bootargs = "cma=128M console=ttyPS0,115200 root=/dev/mmcblk0p2 rw earlyprintk rootwait rootfstype=ext4";
linux,stdout-path = "/amba@0/serial@e0000000";
};
```

```
chosen {
bootargs = "earlycon cma=128M root=/dev/mmcblk0p2 rw rootwait rootfstype=ext4";
stdout-path = "serial0:115200n8";
};
```

```
phy3: phy@3 {
compatible = "atheros,8035";
device_type = "ethernet-phy";
reg = <0x3>;
};
```

```
phy3: phy@3 {
/* Atheros 8035 */
compatible = "ethernet-phy-id004d.d072";
/* compatible = "ethernet-phy-ieee802.3-c22";*/
device_type = "ethernet-phy";
reg = <0x3>;
};
```

```
# This option is for FPGA part
CONFIG_XILINX_DEVCFG=y
# prints time before messages
CONFIG_PRINTK_TIME=y
# dependency for DYNAMIC_DEBUG=y
CONFIG_DEBUG_FS=y
# turned off because old:
CONFIG_XILINX_PS_EMAC=n
CONFIG_XILINX_EMACLITE=n
CONFIG_XILINX_AXI_EMAC=n
```

`ACTION=="add", RUN+="/usr/bin/rsync -a /lib/udev/devices/ /dev/"`

This rule adds ~6 secs to the boot time for some reason, vs. almost nothing if run from the camera init script

- Reminder: the system boots into initramfs first and runs *init* built by initramfs-live-boot. The script runs *switch_root* in the end. The udev daemon gets killed and restarted

- Drupal as a general purpose CMS for the main site
- WordPress for the development blogs
- Mediawiki for the wiki-based documentation.
- Mailman (self hosted) and Mail Archive (external site) for the mailing list that is our main channel of the user technical support
- Gitlab CE for the code and other version-controlled content. We used Github but switched to self-hosted Gitlab CE following FSF recommendations
- Other customized versions of FLOSS web applications, such as OSTicket for support tickets and FrontAccounting for inventory and production
- In-house developed free software web applications, such as x3dom-based 3D scene and map viewer, 3D mechanical assembly viewer available for the assemblies and mechanical components on the wiki, WebGL Panorama Viewer/Editor.

- to have a common search over all subdomains always available (looking glass icon in the top right corner)
- as we can not cross-link properly all the information, then at least we have to communicate the idea that there are multiple subdomains to our visitors.

```
#Tell search engines we do not cloak
Header always add Vary: Referer
RewriteCond %{REQUEST_FILENAME} !-f
RewriteCond "%{HTTP_REFERER}" "!^((.*)elphel\.com(.*)|)$"
RewriteCond %{REQUEST_URI} !category.*feed
RewriteCond %{REQUEST_URI} !^/[0-9]+\..+\.cpaneldcv$
RewriteCond %{REQUEST_URI} !^/\.well-known/pki-validation/[A-F0-9]{32}\.txt(?:\ Comodo\ DCV)?$
RewriteRule ^(.*)$ https://www.elphel.com/blog%{REQUEST_URI} [L,R=302]
```

Such redirection is obviously impossible for external sites such as Mail Archive, and we did not use it for https://git.elphel.com – in both cases, if visitors followed those links from the search engine results, they would not expect such pages to be parts of a well cross-linked company web site.

<script src="https://www.elphel.com/js/elphel_messenger.js"></script> <script> ElphelMessenger.init(); </script>

- the top (almost like a “frameset” in older HTML days) page itself has very little content – most is provided in the included iframe elements
- the served content depends on the referrer address – that might be considered as “cloaking.”

- Were we able to communicate the idea that the site consists of multiple loosely-connected subdomains with different navigation?
- Is the framed site navigation intuitive or annoying?
- Does the combined search over multiple subdomains do its job and does it behave as users expect?

- Initialize source/headers directories with bitbake, so it “knows” that everything needs to be rebuilt for the project
- Create a list of the source files (resolving symlinks when needed) and “touch” them, setting modification timestamps. This action prepares the files so the next (first after modification) file access will be recorded as access timestamp. Record the current time.
- Wait a few seconds to reliably distinguish if each file was accessed after modification
- Run bitbake build (“bitbake <target> -c compile -f”)
- Scan all the files from the previously created source list and generate ”include_list” of those that were accessed during the build process.
- As CDT accepts only “exclude” filters in this context, recursively combine full source list and include_list to generate “exclude_list” pruning all the branches that have nothing to include and replacing them with the full branch reference
- Apply the generated exclusion list to the CDT project file “.cproject”
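The access-time part of these steps can be sketched in Python as follows. This is not the actual setup script: the function names are made up for this example, the build step is passed in as a callable (for instance a wrapper around the `bitbake <target> -c compile -f` command quoted above), generation of the exclusion list for `.cproject` is left out, and the file system is assumed to record access times (not mounted with `noatime`).

```python
import os
import time


def list_sources(root):
    """Collect real paths of all regular files under root, resolving symlinks."""
    paths = set()
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            real = os.path.realpath(os.path.join(dirpath, name))
            if os.path.isfile(real):
                paths.add(real)
    return sorted(paths)


def touch_all(paths, stamp):
    """Set both access and modification times of every file to stamp."""
    for p in paths:
        os.utime(p, (stamp, stamp))


def accessed_since(paths, stamp):
    """Files whose access time is later than stamp."""
    return [p for p in paths if os.stat(p).st_atime > stamp]


def include_list(root, build, settle_s=3.0):
    """Touch the sources, run build() and return the files it read."""
    paths = list_sources(root)
    stamp = time.time()      # record the current time
    touch_all(paths, stamp)
    time.sleep(settle_s)     # wait so later accesses are clearly distinguishable
    build()                  # the bitbake compile step goes here
    return accessed_since(paths, stamp)
```

Files that are not returned by `include_list()` were never opened during the build and are candidates for the exclusion filter.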

- common (average) distortion of all four lenses approximated by analytical radial distortion model, and
- small residual deviation of each lens image transformation from the common distortion model

- tile center X,Y (for the virtual “center” image),
- center disparity, so that each of the 4 image tiles will be shifted accordingly, and
- the code of operation(s) to be performed on that tile.

- Reads the tile tasks from the shared system memory.
- Calculates locations and loads image and calibration data from the external image buffer memory (using on-chip memory to cache data, as the overlapping nature of the tiles makes each pixel participate on average in 4 neighbor tiles).
- Converts tiles to frequency domain using CLT based on 2d DCT-IV and DST-IV.
- Performs aberration correction in the frequency domain by pointwise multiplication by the calibration kernels.
- Calculates correlation-related data (Figure 4) for the tile pairs, resulting in tile disparity and disparity confidence values for all pairs combined, and/or more specific correlation types by pointwise multiplication, inverse CLT to the pixel domain, filtering and local maximums extraction by quadratic interpolation or windowed center of mass calculation.
- Calculates combined texture for the tile (Figure 5), using alpha channel to mask out pixels that do not match – this is how single-pixel lateral resolution is effectively restored after aggregating individual pixels into tiles. Textures can be combined after only the programmed shifts according to the specified disparity, or use the additional shift calculated in the correlation module.
- Calculates other integral values for the tiles (Figure 5), such as per-channel number of mismatched pixels – such data can be used for quick second-level (using tiles instead of pixels) correlation runs to determine which 3d volumes potentially have objects and so need regular (pixel-level) matching.
- Finally tile processor saves results: correlation values and/or texture tile to the shared system memory, so software can access this data.
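The "local maximums extraction by quadratic interpolation" step mentioned in the list can be illustrated with a one-dimensional sketch: a parabola is fitted through the largest correlation sample and its two neighbors, and the vertex gives the sub-pixel position. The tile processor works on two-dimensional correlation data, so this is a simplified illustration, and the function name `subpixel_peak` is an assumption.

```python
def subpixel_peak(corr):
    """Position of the maximum of a 1-D correlation, refined by a 3-point parabola fit."""
    i = max(range(len(corr)), key=corr.__getitem__)
    if i == 0 or i == len(corr) - 1:
        return float(i)  # no neighbor on one side, keep the integer position
    left, mid, right = corr[i - 1], corr[i], corr[i + 1]
    denom = left - 2.0 * mid + right
    if denom == 0.0:
        return float(i)  # flat top, nothing to refine
    return i + 0.5 * (left - right) / denom


# synthetic correlation with a parabolic peak at 3.3 samples
corr = [-(k - 3.3) ** 2 for k in range(8)]
peak = subpixel_peak(corr)
print(peak)
```

For real correlation data the peak is only approximately parabolic, which is one reason to consider the windowed center of mass calculation as an alternative.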

- drag the 3d view to rotate virtual camera without moving;
- move cross-hair ⌖ icon in the map view to rotate camera around vertical axis;
- toggle ⇅ button and adjust camera view elevation;
- use scroll wheel over the 3d area to change camera zoom (field of view is indicated on the map);
- drag with middle button pressed in the 3d view to move camera perpendicular to the view direction;
- drag the camera icon (green circle) on the map to move camera horizontally;
- toggle ⇅ button and move the camera vertically;
- press a hotkey **t** over the 3d area to reset to the initial view: set azimuth and elevation same as captured;
- press a hotkey **r** over the 3d area to set view azimuth as captured, elevation equal to zero (horizontal view).

- Obviously, self-driving cars – increased number of cameras located in a 2d pattern (square) results in significantly more robust matching even with low-contrast textures. It does not depend on sequential scanning and provides simultaneous data over wide field of view. Calculated confidence of distance measurements tells when alternative (active) ranging methods are needed – that would help to avoid infamous accident with a self-driving car that went under a truck.
- Visual odometry for the drones would also benefit from the higher robustness of image matching.
- Rovers on Mars or other planets using low-power passive (visual based) scene reconstruction.
- Maybe self-flying passenger multicopters in the heavy 3d traffic? Sure they will all be equipped with some transponders, but what about aerial roadkills? Like a flock of geese that forced water landing.
- High speed boating or sailing over uneven seas with active hydrofoils that can look ahead and adjust to the future waves.
- Landing on the asteroids for physical (not just Bitcoin) mining? With 150 mm baseline such camera can comfortably operate within several hundred meters from the object, with 1.5 m that will scale to kilometers.
- Cinematography: post-production depth of field control that would easily beat even the widest format optics, HDR with a pair of 4-sensor cameras, some new VFX?
- Multi-spectral imaging where more spatially separate cameras with different bandpass filters can be combined to the same texture in the 3d scene.
- Capturing underwater scenes and measuring how far the sea creatures are above the bottom.
- …

- Camera: **NC393-F-CS**
- Resolution@fps: *1080p@30fps, 720p@60fps*
- Compression quality: *90%*
- Exposure time: *1.7 ms*
- Stream formats: *mjpeg, rtsp*
- Sensor: MT9P001, 5MPx, 1/2.5″
- Lens: Computar f=5mm, f/1.4, 1/2″

- PC: *Shuttle box, i7, 16GB RAM, GeForce GTX 560 Ti*
- Display: *ASUS VS24A, 60Hz (=16.7ms), 5ms gtg*
- OS: *Kubuntu 16.04*
- Network connection: *1Gbps, direct camera-PC via cable*
- Applications: *gstreamer*, *chrome, firefox*, *mplayer*, *vlc*

- Stopwatch: basic javascript

Resolution/fps | Image size^{1}, KB | Transfer time^{2}, ms | Data rate^{3}, Mbps |
---|---|---|---|
720p/60 | 250 | 2 | 120 |
1080p/30 | 500 | 4 | 120 |
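The transfer time and data rate columns follow from the image size by simple arithmetic. The sketch below reproduces them, assuming 1 KB = 1000 bytes and the 1 Gbps link listed in the setup; the function names are chosen for this example.

```python
LINK_BPS = 1e9  # assumed: the 1 Gbps direct camera-PC connection


def transfer_ms(image_kb):
    """Time to send one compressed frame over the link, in milliseconds."""
    return image_kb * 1000 * 8 * 1000 / LINK_BPS


def data_rate_mbps(image_kb, fps):
    """Average stream data rate in Mbps."""
    return image_kb * 1000 * 8 * fps / 1e6


for name, kb, fps in (("720p/60", 250, 60), ("1080p/30", 500, 30)):
    print(name, transfer_ms(kb), "ms", data_rate_mbps(kb, fps), "Mbps")
```

Both modes give the same 120 Mbps because the doubled frame size at 1080p is offset by the halved frame rate.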

Resolution | t_{ROW}^{1}, µs | t_{TR}^{2}, µs |
---|---|---|
720p | 22.75 | 13.33 |
1080p | 29.42 | 20 |
full res (2592×1936) | 36.38 | 27 |

Resolution | t_{ERS} avg^{1}, ms | t_{ERS} whole range^{2}, ms |
---|---|---|
720p | 8 | 0.01-16 |
1080p | 16 | 0.02-32 |

Resolution | t_{CAM}, ms |
---|---|
720p | 9.9 |
1080p | 17.9 |

- 30 fps => 33.3 ms
- 60 fps => 16.7 ms

Resolution/fps | Total Latency, ms | Network+PC+SW latency, ms |
---|---|---|
720p@60fps | 33.3-50 | 23.4-40.1 |
1080p@30fps | 33.3-66.7 | 15.4-48.8 |
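The last column is the total latency with the camera part t_{CAM} subtracted. A short Python check of that arithmetic, using only the numbers from the tables (the dictionary and function names are chosen for this example):

```python
T_CAM_MS = {"720p": 9.9, "1080p": 17.9}                    # camera latency from the t_CAM table
TOTAL_MS = {"720p": (33.3, 50.0), "1080p": (33.3, 66.7)}   # measured total latency range


def rest_of_chain_ms(resolution):
    """Network + PC + software latency range: total minus the camera part."""
    low, high = TOTAL_MS[resolution]
    cam = T_CAM_MS[resolution]
    return round(low - cam, 1), round(high - cam, 1)


for res in ("720p", "1080p"):
    print(res, rest_of_chain_ms(res))
```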

- For wifi: use 5GHz over 2.4GHz – smaller jitter, non-overlapping channels
- Lower latency software: for mjpeg use **gstreamer** or vlc (takes an extra effort to set up) over chrome or firefox because they do extra buffering

- Latency in live network video surveillance
- Wifi latencies, 2.4GHz & 5GHz
- This video compares different displays.
- About ERS

Port | mjpeg | rtsp |
---|---|---|
port 0 | 2323 | 554 |
port 1 | 2324 | 556 |
port 2 | 2325 | 558 |
port 3 | 2326 | 560 |
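The port numbers in the table follow a simple pattern (mjpeg: 2323 + port, rtsp: 554 + 2 × port), so the stream address for any port can be built with a small helper. This is an illustrative sketch: the function name `stream_url` is made up, the default address 192.168.0.9 is the one used in the command examples, and using the `/mimg` path on ports other than 0 is an assumption.

```python
def stream_url(kind, port, host="192.168.0.9"):
    """Build the stream URL for port 0..3 according to the table above."""
    if port not in range(4):
        raise ValueError("port must be 0..3")
    if kind == "mjpeg":
        return f"http://{host}:{2323 + port}/mimg"   # /mimg path assumed for all ports
    if kind == "rtsp":
        return f"rtsp://{host}:{554 + 2 * port}"
    raise ValueError("kind must be 'mjpeg' or 'rtsp'")


print(stream_url("mjpeg", 0))
print(stream_url("rtsp", 3))
```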

- For mjpeg:

`~$ gst-launch-1.0 souphttpsrc is-live=true location=http://192.168.0.9:2323/mimg ! jpegdec ! xvimagesink`

- For rtsp:

`~$ gst-launch-1.0 rtspsrc is-live=true location=rtsp://192.168.0.9:554 ! rtpjpegdepay ! jpegdec ! xvimagesink`

`~$ vlc rtsp://192.168.0.9:554`