by Oleg Dzhimiev
1. [Done] Wrote code for the FFT256 DIF.
FFT is performed in 8 conveyor organized stages. Each stage is similar to as described here. So far each stage uses 2 BRAM ports (A for 16-bit Re-part and B for 16-bit Im-part) and the 1 MULT18X18 (for the “butterfly”) – 4 BRAMs + 6 MULT18X18s (MULTs are not used in the last 2 stages). Plus other logic – 1 FFT256 uses 18% of FPGA resourses.
Each BRAM is shared by 2 stages for write and by 2 stages for read (e.g., write – stages 2,3 and read – stages 3,4) – and because the address bus is used for 4 double accesses then each channel double writes/reads every 8 tacts – this results in:
Load time + Computation time + Readout time @10ns Clk ~ 2.5us + 8×10us + 2.5us = 85us =(
1a. [Done] Removed 2 MULTs from the stages 7 & 8 – because the sine and cosine are +/-1 or 0 – no need in multiplication.
2. [In Progress] FFT256 verification – calculated coefficients in OOo Spreadsheet – for tested sequence the results are almost equal – will be better to move it to the testbench.
TODO: (almost the same because initially there was a DIT algorithm and I was writing DIF)
- Write a correlation computation block.
- Make FFT run faster – get away from full buffering between stages.
- Set up memory controller for frames read/write.
- Integrate the correlation block to 10359’s firmware.
by Oleg Dzhimiev
1. [Done] Add headers and commit to CVS.
2. [Done] Optimize BRAM usage.
4 BRAMs & 8 MULTs, ~150MHz after Synthesis (will be less of course after implementation)
Fig.1 FFT256 diagram
3. [In Progress] FFT256 Verification.
NOTES:
- Do wires to BRAM36X36 and to MULT18X18 are shared? Can I use all 28 BRAMs and 28 MULTs in one design?
TODO:
- Check if the FFT256 results are correct.
- Write a correlation computation block.
- Write complex multiplication of 2 FFT256s (in order to get a cross-correlation spectrum)
- Add IFFT256 run for the multiplication result.
by Oleg Dzhimiev
1. [Done] Add 1/256 after the last stage and check the simulation.
- Added a 1-bit right shift (1/2) after each “butterfly” – in the end it is 1/256 accumulated. This can be an early rounding. With 12 bits of color after all the sums in 8 stages the final value can be up to 20 bits wide. Only 16 bits are stored so the 1-bit right shift keeps the value within 12. Alternatively, a 4-bit right shift can be applied after every 4 stages.
2. [Done] Make a simple simulation with FFT256 and IFFT256.
- Made a testbench FFT256-IFFT256. The input and the output differ. Need to check more properly overflows in calculations and cosine tables.
- Corrected a mistake concerning two’s complement 8 bits expanding to 18 (just filling the upper part with the 7th bit(Hi)). Same for 16 bit values before 18×18 multiplication.
3. [In Progress] Optimize BRAM usage, the goal is 5 or 6 BRAMs & 8 MULTs (not sure for the input and output buffer)
- The plan is to use 4 BRAM32X32s and the output will be not buffered – the “butterfly” output needs to be delayed by 2 tacts then. And a specific writes and write order to input buffer will be needed.
TODO:
- Optimize BRAM usage.
- Add headers and commit to CVS.
- Write a correlation computation block.
- Set up memory controller for frames read/write.
- Integrate the correlation block to 10359’s firmware.
by Oleg Dzhimiev
1. Wrote a short manual on how to connect 10359 with 10353 and sensors: link
2. Almost finished coding FFT256 main module with no integration to 10359 code.
FFT is performed in 8 conveyor organized stages. Each stage is similar to as described here. So far each stage uses 1 BRAM and the 1 MULT18X18 (for the “butterfly”) + 1 BRAM for the input buffer – 9 BRAMs + 8 MULT18X18s. Plus other logic – 1 FFT256 uses 16% of FPGA resourses.
But, each ‘stage’ uses only a half of the BRAM (256 real and 256 imaginary). So it’s possible to optimize the module, the ways are:
- Currently the “butterfly” writes out Re & Im parts simultaneously using both (A and B) BRAM ports and only lower 512 cells. It is possible to use only one BRAM port doing writing and reading ‘Re’s and ‘Im’s sequentially and using lower 512 cells (as Re parts are calculated earlier) while another port can be used for other stage and access higher 512 cells.
- There are pauses between accesses to a BRAM – so stage-modules can be made working in turns.
Load time + Computation time + Readout time @10ns Clk ~ 2.5us + 8x5us + 2.5us = 45us.
But here it takes them 15us@25MHz to compute FFT256. They probably use only 1 tact for the “butterfly”. As I use only one MULT18X18 it takes 4 tacts to get the result:
4 x 128 x 10 ns ~ 5.12us per “butterfly”. With 1 tact & 4 MULTs it could be 1.28us and the overall FFT256 time 15us. But their clock is 4 times slower. How did they do that? Is it possibe to make a 4us FFT256 @96MHz?
With the 8-stage conveyor structure it speeds up by 8 times against non-conveyored computation. For a full resolution frame 2592×1944 with 128 bit overlap in a line it will take ~ (45us x (2592/128) x 1944) / 8 ~ 50us x 20 x 2000 / 8 = 0.25s or 4 fps
- Add 1/N after the last stage and check the simulation.
- Make a simple simulation with FFT256 and IFFT256.
- Optimize BRAM usage, the goal is 5 or 6 BRAMs & 8 MULTs (not sure for the input and output buffer)
- Can I get out of the 16-bit value border? Not cool if yes.