June 22, 2009
by Andrey Filippov
After finishing the PCB layout of the 10373 board I noticed that it is very easy to add a second sensor board connector.
First of all – there is room for that connector – space wasted between the two tabs with mounting holes that we added at the request of our customer. And the connector nicely fits there.
Then – the bank 0 of the Spartan 6 FPGA has enough pins for 2 sensor ports (each port uses 22 I/O lines).
The 10373 board is expected to have 3-4 times higher processing speed than 10353 so using it with the current sensor boards would keep it under-loaded. For the newer (and faster) sensors I plan to use high speed serial transceivers of the Spartan chips as I wrote earlier, but the first revision will be compatible with the older sensors.
That second connector would cost just the price of the connector as I use free so far space on the board and unused pins on the FPGA. It can be used in multiple ways, such as:
1 – stereo cameras. It will be possible to incorporate code Oleg is working on now for automatic processing of the 3-d information – i.e. for autonomous vehicles. And having higher bandwidth of the system and network it will be possible to stream or record both of the stereo channels at the full resolution/frame rate sensor is capable of – it could be used for stereo cinema applications.
2 – panoramic applications with fewer number of complete camera modules, fewer network ports than needed currently when the 353 cameras are used.
3 – Day/night cameras with both color (with IR cutoff filter) and monochrome (without filter) sensors connected at the same time.
Andrey
Now it looks like this:
June 16, 2009
by Oleg Dzhimiev
1. [Done] Finish with the “full_pages_in_buffer“. Made more convenient to me and corrected the error I had done before (double write request from channel).
2. [In Progress] Verification of everything.
Currently instead of black frames I get different ones (apparently, something’s wrong: black vertical lines after each 1024 pixels) – checking the simulation:
Fig. 1 Test image, size 2592×1940
TODO:
1. Verification.
2. I also didn’t implemented multiplication on the Hamming function before performing the FFTs.
June 15, 2009
by Andrey Filippov
I do not know why search engines love this 8-year old page – current camera model is 393. It’s PCB layout is also available in both Gerber and PDF form on this wiki page.
I’m working now on the 10373 PCB layout – there will be some minor cleanup and double-checking all the new components used, their mechanical dimensions and pinouts, but so far I was able to fit all the components to the same size PCB and route all the traces. Sure I had to use more layers – there are 8 now, including 4 dedicated to ground and power planes, there are now blind laser microvias between the the outer layers and the next ones, some buried vias and reduced traces widths.
And there is even some room left on the board – something to be added later in the next revision.
Andrey
June 1, 2009
by Oleg Dzhimiev
1. [Done] Simulation of data path of Cross-Correlation (C-C) module.
2. [Done] Integration of the C-C block to x359.
3. [In Progress] Simulation of the C-C and x359 as a whole.
3.a. [Done] Set up memory controller.
- Write length is set to 64(8×8) words.
- Fixed image size to 2592×1944.
- C-C block is set to work with 512 words, reading out 256 word at a time with overlapping – 3 256-word reads till another command to the C-C block.
3.b. [Done] Simulate for 1 write and 1 read channels. Here there are no competetive requests from 2 write channels.
3.c. [In Progress] Simulate for 2 write and 2 read channels.
- Fixed reg “full_pages_in_buffer” correct work. It was commented earlier by me as I used only one channel at a time to access SDRAM.
- For write channels – changed reg “full_pages_in_buffer” size to 32 (page size is 64). And if it reaches the maximum then a aditional BRAM is neaded.
TODO:
1. Finish with the “full_pages_in_buffer“.
2. Make an x359.bit and check the work with the whole frame.
3. I also didn’t implemented multiplication on the Hamming function before performing the FFTs.
May 18, 2009
by Andrey Filippov
It seems I finished the first pass of the 10373 circuit design – “first pass” as the design is likely to change somewhat during PCB layout stage. The challenging part turned out to be power management – TI DaVinci (at least in it’s current version as TMS320DM6467) is rather power-hungry, needs significantly more of it than the Axis ETRAX FS that was used in the NC353L camera, even as it it smaller in size – and I’m really anxious to fit more stuff on the same size PCB as in the older cameras.
Sure enough that extra electrical power gives more badly needed computational power too and I wanted to have universal system board that can use the extra power when available but be able to survive (scaling down performance) when used in mobile applications. And I believe I did balance it properly and hope to make it through the PCB layout and fit increased number of components, but four layer PCB of the 303-313-323-333-353 cameras will have to go – number of layers will go up to (most likely) eight.
It may be a tradition or superstition – I am not going to post the circuit diagram until the hardware will be built and tested. So far I can only list some of the features of the 10373 system board:
- TMS320DM6467 as the CPU
- Xilinx Spartan 6 FPGA
- 256 MiB NAND flash
- 256 MiB DDR2 system memory
- 256 MiB DDR3 dedicated FPGA memory
- 48V DC
- GigE
- USB 2.0 host
- ATA port (to be used with the SATA bridge on the 10369 or similar I/O board)
- USB 1.1 device with micro USB connector. It is connected to the CPU serial port to be used as a system console and alternative boot source
April 24, 2009
by Oleg Dzhimiev
1. [Done] Wrote code for the FFT256 DIF.
FFT is performed in 8 conveyor organized stages. Each stage is similar to as described here. So far each stage uses 2 BRAM ports (A for 16-bit Re-part and B for 16-bit Im-part) and the 1 MULT18X18 (for the “butterfly”) – 4 BRAMs + 6 MULT18X18s (MULTs are not used in the last 2 stages). Plus other logic – 1 FFT256 uses 18% of FPGA resourses.
Each BRAM is shared by 2 stages for write and by 2 stages for read (e.g., write – stages 2,3 and read – stages 3,4) – and because the address bus is used for 4 double accesses then each channel double writes/reads every 8 tacts – this results in:
Load time + Computation time + Readout time @10ns Clk ~ 2.5us + 8×10us + 2.5us = 85us =(
1a. [Done] Removed 2 MULTs from the stages 7 & 8 – because the sine and cosine are +/-1 or 0 – no need in multiplication.
2. [In Progress] FFT256 verification – calculated coefficients in OOo Spreadsheet – for tested sequence the results are almost equal – will be better to move it to the testbench.
TODO: (almost the same because initially there was a DIT algorithm and I was writing DIF)
- Write a correlation computation block.
- Make FFT run faster – get away from full buffering between stages.
- Set up memory controller for frames read/write.
- Integrate the correlation block to 10359’s firmware.
April 14, 2009
by Oleg Dzhimiev
1. [Done] Add headers and commit to CVS.
2. [Done] Optimize BRAM usage.
4 BRAMs & 8 MULTs, ~150MHz after Synthesis (will be less of course after implementation)
Fig.1 FFT256 diagram
3. [In Progress] FFT256 Verification.
NOTES:
- Do wires to BRAM36X36 and to MULT18X18 are shared? Can I use all 28 BRAMs and 28 MULTs in one design?
TODO:
- Check if the FFT256 results are correct.
- Write a correlation computation block.
- Write complex multiplication of 2 FFT256s (in order to get a cross-correlation spectrum)
- Add IFFT256 run for the multiplication result.
April 10, 2009
by Andrey Filippov
I’m considering using Xilinx high speed Serial I/O – 4 of the GTP in XC6SLX45T in CSG324 package (15mm x 15mm) to interface the sensor boards – it can be very attractive to be able to migrate to higher performance sensors from the current Aptina 5MPix ones. It would also enable higher bandwidth to the next version of the 10359 board – up to four of 3Gb/s channels, reduce EMI and enable longer cables.
The problem is that normally you need a large chip on the other (sensor) end capable of encoding data in a way compatible with the GTP – it would be an overkill for the simple CMOS sensor module. But it seems there is a way to use a tiny (5mm x 5mm) 3 – channel (plus clock – total 4) Texas Instruments SN65LVDS301 FlatLink(tm)3G that encodes 24 bit parallel data plus 3 sync signals into serial data. Maximal “pixel” rate – 65 MHz (not enough for MT9P031) but with additional register or better CPLD (like XC2C64A in CP56 6mm x 6mm package) it would be possible to double the pixel rate and add some scrambling to reduce disparity – not that critical as the data rate per pair would be only 480Mb/s. It seems that it would be possible to make GTP to accept that slow data rate, bond 4 channels (each still has to be able to recover it’s own clock) and use SN65LVDS301’s clock as a master in this bonding with a comma mask ox 10’b1111111111 and comma pattern of 10’b1111100000.
Most likely I’ll still use smaller XC6SLX25 (w/o “T”) first and use the bank that has GTP-s in 45T to connect the 30-pin parallel flex connector (compatible with the 10353 and the existent sensor boards) first, upgrading to high speed serial I/O in the next revision/model. that will be rather simple task as it will not require major PCB redesign – FPGA is side-by-side with the sensor connector, so those traces do not intersect with other ones.
by Oleg Dzhimiev
1. [Done] Add 1/256 after the last stage and check the simulation.
- Added a 1-bit right shift (1/2) after each “butterfly” – in the end it is 1/256 accumulated. This can be an early rounding. With 12 bits of color after all the sums in 8 stages the final value can be up to 20 bits wide. Only 16 bits are stored so the 1-bit right shift keeps the value within 12. Alternatively, a 4-bit right shift can be applied after every 4 stages.
2. [Done] Make a simple simulation with FFT256 and IFFT256.
- Made a testbench FFT256-IFFT256. The input and the output differ. Need to check more properly overflows in calculations and cosine tables.
- Corrected a mistake concerning two’s complement 8 bits expanding to 18 (just filling the upper part with the 7th bit(Hi)). Same for 16 bit values before 18×18 multiplication.
3. [In Progress] Optimize BRAM usage, the goal is 5 or 6 BRAMs & 8 MULTs (not sure for the input and output buffer)
- The plan is to use 4 BRAM32X32s and the output will be not buffered – the “butterfly” output needs to be delayed by 2 tacts then. And a specific writes and write order to input buffer will be needed.
TODO:
- Optimize BRAM usage.
- Add headers and commit to CVS.
- Write a correlation computation block.
- Set up memory controller for frames read/write.
- Integrate the correlation block to 10359’s firmware.
April 9, 2009
by Oleg Dzhimiev
1. Wrote a short manual on how to connect 10359 with 10353 and sensors: link
2. Almost finished coding FFT256 main module with no integration to 10359 code.
FFT is performed in 8 conveyor organized stages. Each stage is similar to as described here. So far each stage uses 1 BRAM and the 1 MULT18X18 (for the “butterfly”) + 1 BRAM for the input buffer – 9 BRAMs + 8 MULT18X18s. Plus other logic – 1 FFT256 uses 16% of FPGA resourses.
But, each ‘stage’ uses only a half of the BRAM (256 real and 256 imaginary). So it’s possible to optimize the module, the ways are:
- Currently the “butterfly” writes out Re & Im parts simultaneously using both (A and B) BRAM ports and only lower 512 cells. It is possible to use only one BRAM port doing writing and reading ‘Re’s and ‘Im’s sequentially and using lower 512 cells (as Re parts are calculated earlier) while another port can be used for other stage and access higher 512 cells.
- There are pauses between accesses to a BRAM – so stage-modules can be made working in turns.
Load time + Computation time + Readout time @10ns Clk ~ 2.5us + 8x5us + 2.5us = 45us.
But here it takes them 15us@25MHz to compute FFT256. They probably use only 1 tact for the “butterfly”. As I use only one MULT18X18 it takes 4 tacts to get the result:
4 x 128 x 10 ns ~ 5.12us per “butterfly”. With 1 tact & 4 MULTs it could be 1.28us and the overall FFT256 time 15us. But their clock is 4 times slower. How did they do that? Is it possibe to make a 4us FFT256 @96MHz?
With the 8-stage conveyor structure it speeds up by 8 times against non-conveyored computation. For a full resolution frame 2592×1944 with 128 bit overlap in a line it will take ~ (45us x (2592/128) x 1944) / 8 ~ 50us x 20 x 2000 / 8 = 0.25s or 4 fps
- Add 1/N after the last stage and check the simulation.
- Make a simple simulation with FFT256 and IFFT256.
- Optimize BRAM usage, the goal is 5 or 6 BRAMs & 8 MULTs (not sure for the input and output buffer)
- Can I get out of the 16-bit value border? Not cool if yes.
« Previous Page —
Next Page »