April 9, 2009

FFT256

by Oleg Dzhimiev

1. Wrote a short manual on how to connect 10359 with 10353 and sensors: link
2. Almost finished coding the FFT256 main module; it is not yet integrated into the 10359 code.

  • Resources

The FFT is performed in 8 pipelined (“conveyor”) stages. Each stage is similar to the one described here. So far each stage uses 1 BRAM and 1 MULT18X18 (for the “butterfly”), plus 1 BRAM for the input buffer: 9 BRAMs + 8 MULT18X18s in total. Together with the other logic, one FFT256 uses 16% of the FPGA resources.

However, each stage uses only half of its BRAM (256 real and 256 imaginary values), so the module can be optimized. The options are:

  1. Currently the “butterfly” writes out the Re and Im parts simultaneously, using both BRAM ports (A and B) and only the lower 512 cells. It is possible to use a single BRAM port instead, writing and reading the Re and Im parts sequentially (the Re parts are calculated earlier) and still using the lower 512 cells, while the other port serves another stage and accesses the upper 512 cells.
  2. There are pauses between accesses to a BRAM, so the stage modules can be made to work in turns.
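
To make the 8-stage structure concrete, here is a minimal software sketch of a 256-point FFT with 8 butterfly stages. It assumes a radix-2 decimation-in-time layout with bit-reversed input ordering, which the post does not specify; the names `fft256` and `bit_reverse` are made up for this illustration, and the code uses floating point rather than the fixed-point arithmetic of the FPGA module.

```python
import cmath

N = 256
STAGES = 8  # log2(256): one butterfly stage per pipeline stage


def bit_reverse(i, bits=STAGES):
    """Reverse the lowest `bits` bits of i (input reordering for DIT)."""
    r = 0
    for _ in range(bits):
        r = (r << 1) | (i & 1)
        i >>= 1
    return r


def fft256(x):
    """Radix-2 DIT FFT sketch. Returns (spectrum, butterflies per stage)."""
    assert len(x) == N
    data = [complex(x[bit_reverse(i)]) for i in range(N)]
    counts = []
    for stage in range(STAGES):
        half = 1 << stage      # distance between the two butterfly inputs
        span = 2 * half        # size of the sub-transform at this stage
        count = 0
        for start in range(0, N, span):
            for k in range(half):
                w = cmath.exp(-2j * cmath.pi * k / span)  # twiddle factor
                a = data[start + k]
                b = data[start + k + half] * w            # the complex multiply
                data[start + k] = a + b
                data[start + k + half] = a - b
                count += 1
        counts.append(count)
    return data, counts
```

Each of the 8 stages performs 128 butterflies and works on 256 complex values, i.e. 256 Re + 256 Im cells, which matches the half-used BRAM mentioned above.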
  • Performance Time

Load time + Computation time + Readout time @10ns Clk ~ 2.5us + 8x5us + 2.5us = 45us.
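
The estimate can be reproduced with a few lines of Python. This is only a rough model under stated assumptions: one sample is loaded or read out per clock (inferred from 2.5 us being about 256 x 10 ns), and each stage runs 128 butterflies at 4 clocks each. The function name and parameters are made up for the illustration.

```python
N = 256                    # FFT size
STAGES = 8                 # butterfly stages
BUTTERFLIES_PER_STAGE = N // 2


def fft256_time_us(cycles_per_butterfly=4, clk_ns=10):
    """Load + 8 stages + readout, in microseconds (no overlap between phases)."""
    load = N * clk_ns                                   # assumed: 1 sample per clock
    stage = BUTTERFLIES_PER_STAGE * cycles_per_butterfly * clk_ns
    readout = N * clk_ns                                # assumed: 1 sample per clock
    return (load + STAGES * stage + readout) / 1000.0


print(fft256_time_us())                        # 4 clocks per butterfly
print(fft256_time_us(cycles_per_butterfly=1))  # 1 clock per butterfly
```

Without rounding, the model gives 46.08 us for 4 clocks per butterfly and 15.36 us for 1 clock per butterfly.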

But here it takes them 15us@25MHz to compute FFT256. They probably use only 1 clock cycle for the “butterfly”. As I use only one MULT18X18, it takes 4 clock cycles to get the result:

4 x 128 x 10 ns ~ 5.12us per stage (128 “butterflies”). With 1 clock cycle and 4 MULTs per “butterfly” it could be 1.28us per stage, and the overall FFT256 time 15us. But their clock is 4 times slower. How did they do that? Is it possible to make a 4us FFT256 @96MHz?
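
The 4 clock cycles come from the complex multiplication inside the butterfly, which needs 4 real products. A small sketch of that trade-off follows; `complex_mul` is a hypothetical helper, and it ignores pipeline latency and the cycles spent in the adders.

```python
def complex_mul(ar, ai, br, bi, multipliers=1):
    """(ar + j*ai) * (br + j*bi) built from 4 real products.

    Returns (re, im, cycles), where cycles is the number of clocks needed
    if each multiplier delivers one product per clock.
    """
    pairs = [(ar, br), (ai, bi), (ar, bi), (ai, br)]
    products = [x * y for x, y in pairs]     # the 4 real multiplications
    cycles = -(-len(pairs) // multipliers)   # ceiling division
    re = products[0] - products[1]           # ar*br - ai*bi
    im = products[2] + products[3]           # ar*bi + ai*br
    return re, im, cycles


print(complex_mul(3, 4, 1, -2))                 # one shared multiplier
print(complex_mul(3, 4, 1, -2, multipliers=4))  # four multipliers in parallel
```

With one shared multiplier the product takes 4 clocks; with four multipliers it takes 1 clock, at the cost of 32 MULT18X18s for the 8 stages instead of 8.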

With the 8-stage pipelined structure the throughput is 8 times higher than with non-pipelined computation. For a full-resolution 2592×1944 frame with a 128-pixel overlap within a line it will take ~ (45us x (2592/128) x 1944) / 8 ~ 50us x 20 x 2000 / 8 = 0.25s, or 4 fps.
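
The same frame-rate arithmetic as a small sketch, for trying other frame sizes or FFT times. It assumes that consecutive 256-point windows advance by 128 samples along a line, so there are 2592/128 windows per line; the function name and parameters are invented for the illustration.

```python
def frame_time_s(width=2592, height=1944, step=128, fft_us=45.0, pipeline_gain=8):
    """Time to run FFT256 over every line of a frame, in seconds."""
    windows_per_line = width / step          # assumed: window advances by `step`
    total_us = fft_us * windows_per_line * height
    return total_us * 1e-6 / pipeline_gain   # 8 stages work concurrently


t = frame_time_s()
print(t, 1.0 / t)   # seconds per frame, frames per second
```

Without rounding the intermediate values this gives about 0.22 s per frame, or roughly 4.5 fps, consistent with the rounded estimate of 0.25 s.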

  • TODO:
  1. Add 1/N after the last stage and check the simulation.
  2. Make a simple simulation with FFT256 and IFFT256.
  3. Optimize BRAM usage; the goal is 5 or 6 BRAMs & 8 MULTs (not sure about the input and output buffers).
  4. Can the values exceed the 16-bit range? Not cool if yes.
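
On the last question, a quick worst-case check is possible without a simulation: for real input, a constant full-scale signal puts everything into bin 0, which then equals N times the input amplitude. The sketch below computes how many signed bits that bin needs before any 1/N scaling. The input widths are assumptions, since the post does not state the input width, and the function name is made up.

```python
def worst_case_output_bits(input_bits, n=256):
    """Signed bits needed for bin 0 of an unscaled n-point FFT
    when the input is a constant at the largest positive value."""
    peak = (1 << (input_bits - 1)) - 1   # largest positive signed input
    dc = n * peak                        # bin 0 of a constant input
    return dc.bit_length() + 1           # +1 for the sign bit


for bits in (8, 12, 16):
    print(bits, worst_case_output_bits(bits))
```

Under these assumptions an unscaled FFT256 adds 8 bits, so only inputs of up to 8 signed bits stay within 16 bits, and wider inputs need the 1/N scaling (or part of it) applied before the final stage.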
