Synchronizing Verilog, Python and C
Elphel NC393 as all the previous camera models relies on the intimate cooperation of the FPGA programmed in Verilog HDL and the software that runs on a general purpose CPU. Just as the FPGA manufacturers increase the speed and density of their devices, so do the Elphel cameras. FPGA code consists of the hundreds of files, tens of thousand lines of code and is constantly modified during the lifetime of the product both by us and by our users to accommodate the cameras for their applications. In most cases, if it is not just a bug fix or minor improvement of the previously implemented functionality, the software (and multiple layers of it) needs to be aware of the changes. This is both the power and the challenge of such hybrid systems, and the synchronization of the changes is an important issue.
Contents
Verilog parameters
Verilog code of the camera consists of the parameterized modules, we try to use parameters and generate Verilog operators in most cases, but `define macros and `ifdef conditional directives are still used to switch some global options (like synthesis vs. compilation, various debug levels). Eclipse-based VDT that we use for the FPGA development is aware of the parameters, and when the code instantiates a parametrized module that has parameter-dependent widths of the ports, VDT verifies that the instance ports match the signals connected to them, and warns the developer if it is not the case. Many parameters are routed through the levels of the hierarchy so the deeper instances can be controlled from a single header file, making it obvious which parameters influence which modules operations. Some parameters are specified directly, while some have to be calculated – it is the case for the register address decoders of the same module instances for different channels. Such channels have the same relative address maps, but different base addresses. Most of the camera parameters (not counting the trivial ones where the module instance parameters are defined by the nature of the code) are contained in a single x393_parameters.vh header file. There are more than six hundred of them there and most influence the software API.
Development cycle
When implementing some new camera FPGA functionality, we start with the simulation – always. Sometimes very small changes can be applied to the code, synthesized and tested in the actual hardware, but it almost never works this way – bypassing the simulation step. So far all the simulation we use consit of the plain old Verilog test benches (such as this or that) – not even System Verilog. Most likely for simulating CPU+FPGA devices ideal would be the use the software programming language to model the CPU side of the SoC and keep Verilog (or VHDL who prefers it) to the FPGA. Something like cocotb may work, especially we are already manually translating Verilog into Python, but we are not there yet.
Translaing Verilog to Python
So the next step is as I just mentioned – manual translation of the Verilog tasks and functions used in simulation to Python that code that can run on the actual hardware. The result does not look extremely pythonian as I try to follow already tested Verilog code, but it is OK. Not all the translation is manual – we use a import_verilog_parameters.py module to “understand” the parameters defined in Verilog files (including simple arithmetic and logical operations used to generate derivative parameters/localparams in the Verilog code), get the values from the same source and so reduce the possibility to accidentally use old software with the modified FPGA implementation. As the parameters are known to the program at a run time and PyDev (running, btw, in the same Eclipse IDE as the VDT – just as a different “perspective”) can not catch the misspelled parameter names. So the program has an option to modify itself and generate pre-defines for each of the parameter. Only the top part of the vrlg module is human-generated, everything under line 120 is automatically generated (and has to be re-generated only after adding new parameters to the Verilog source).
Hardware testing with Python programs
When the Verilog code is manually translated (or while new parts of the code are being translated or developed from scratch) it is possible to operate the actual camera. The top module is still called test_mcntrl as it started with DDR3 memory calibration using Levenberg-Marquardt algorithm (luckily it needs to run just once – it takes camera 10 minutes to do the full calibration this way).
This program keeps track of the Verilog parameters and macros, exposes all the functions (with the names not beginning with the underscore character), extracts docstrings from the code and combines it with the generated list of the function parameters and their default values, provides search/help for the functions with regexp (a must when there are hundreds of such functions). Next code ran in the camera:
1 2 3 4 5 6 7 8 9 |
x393 +0.043s--> help w.*_sensor_r === write_sensor_reg16 === defined in x393_sensor.X393Sensor, /usr/local/bin/x393_sensor.py: 496) Write i2c register in immediate mode @param num_sensor - sensor port number (0..3), or "all" - same to all sensors @param reg_addr16 - 16-bit register address (page+low byte, for MT9P006 high byte is an 8-bit slave address = 0x90) @param reg_data16 - 16-bit data to write to sensor register Usage: write_sensor_reg16 <num_sensor> <reg_addr16> <reg_data16> x393 +0.010s--> |
And the same one in PyDev console window of Eclipse IDE – “simulated” means that the program could not detect the FPGA and so it is not the target hardware:
1 2 3 4 5 6 7 8 9 |
x393(simulated) +0.121s--> help w.*_sensor_r === write_sensor_reg16 === defined in x393_sensor.X393Sensor, /home/andrey/git/x393/py393/x393_sensor.py: 496) Write i2c register in immediate mode @param num_sensor - sensor port number (0..3), or "all" - same to all sensors @param reg_addr16 - 16-bit register address (page+low byte, for MT9P006 high byte is an 8-bit slave address = 0x90) @param reg_data16 - 16-bit data to write to sensor register Usage: write_sensor_reg16 <num_sensor> <reg_addr16> <reg_data16> x393(simulated) +0.001s--> |
Python program was also used for the AHCI SATA controller initial development (before adding it was possible to add is as Linux kernel platform driver, but number of parameters there is much smaller, and most of the addresses are defined by the AHCI standard.
Synchronizing parameters with the kernel drivers
Next step is to update/redesign/develop the Linux kernel drivers to support camera functionality. Learning the lessons from the previous camera models (software was growing with the hardware incrementally) we are trying to minimize manual intervention into the process of synchronizing different layers of code (including the “hardware” one). Previous camera interface to the FPGA consisted of the hand-crafted files such as x353.h. It started from the x313.h (for NC313 – our first camera based on Axis CPU and Xilinx FPGA – same was used in NC323 that scanned many billions of book pages), was modified for the NC333 and later for our previous NC353 used in car-mounted panoramic cameras that captured most of the world’s roads.
Each time the files were modified to accommodate the new hardware, it was always a challenge to add extra bits to the memory controller addresses, image frame widths and heights (they are now all 16-bit wide – enough for the multi-gigapixel sensors). With Python modules already knowing all the current values of the Verilog parameters that define software interface it was natural to generate the C files needed to interface the hardware in the same environment.
Implementation of the register access in the FPGA
The memory-mapped registers in the camera share the same access mechanism – they use MAXIGP0 (CPU master, general purpose, channel 0) AXI port available in SoC, generously mapped there to 1/4 of the whole 32-bit address range (0x40000000.0x7fffffff). While logically all the locations are 32-bit wide, some use just 1 byte or even no data at all – any write to such address causes defined action.
Internally the commands are distributed to the target modules over a tree of byte-parallel buses that tolerate register insertion, at the endpoints they are converted to the parallel format by cmd_deser.v instances. The status data from the modules (sent by status_generate.v) is routed as messages (also in byte-parallel format to reduce the required FPGA routing resources) to a single block memory that can be read over the AXI by the CPU with zero delay. The status generation by the subsystems is individually programmed to be either on demand (in response to the write operation by the CPU) or automatically when the register data changes. While this write and read mechanism is common, the nature of the registers and data may be very different as the project combines many modules designed at different time for different purposes. All the memory mapped locations in the design fall into 3 categories:
- Read only registers that allow to read status from the various modules, DMA pointers and other small data items.
- Read/write registers – the ones where result of writing does not depend on any context. The full write register address range has a shadow memory block in parallel, so reading from that address will return the data that was last written there.
- Write-only registers – all other registers where write action depends on the context. Some modules include large tables exposed through a pair of address/data locations in the address map, many other have independent bit fields with the corresponding “set” bit, so internal values are modified for only the selected field.
Register access as C11 anonymous members
All the registers in the design are 32-bit wide and are aligned to 4-byte ranges, even as not all of them use all the bits. Another common feature of the used register model is that some modules exist in multiple instances, each having evenly spaced base addresses, some of them have 2-level hierarchy (channel and sub-channel), where the address is a sum of the category base address, relative register address and a linear combination of the two indices.
Individual C typedef is generated for each set of registers that have different meanings of the bit fields – this way it is possible to benefit from the compiler type checking. All the types used fit into the 32 bits, and as in many cases the same hardware register can accept alternative values for individual bit fields, we use unions of anonymous (to make access expressions shorter) bit-field structures.
Here is a generated example of such typedef code (full source):
383 384 385 386 387 388 389 390 391 392 393 394 395 396 397 398 399 400 401 402 403 404 405 406 407 408 409 410 411 412 413 414 415 416 417 418 419 420 421 422 423 424 425 426 |
// I2C contol/table data typedef union { struct { u32 tbl_addr: 8; // [ 7: 0] (0) Address/length in 64-bit words (<<3 to get byte address) u32 :20; u32 tbl_mode: 2; // [29:28] (3) Should be 3 to select table address write mode u32 : 2; }; struct { u32 rah: 8; // [ 7: 0] (0) High byte of the i2c register address u32 rnw: 1; // [ 8] (0) Read/not write i2c register, should be 0 here u32 sa: 7; // [15: 9] (0) Slave address in write mode u32 nbwr: 4; // [19:16] (0) Number of bytes to write (1..10) u32 dly: 8; // [27:20] (0) Bit delay - number of mclk periods in 1/4 of the SCL period u32 /*tbl_mode*/: 2; // [29:28] (2) Should be 2 to select table data write mode u32 : 2; }; struct { u32 /*rah*/: 8; // [ 7: 0] (0) High byte of the i2c register address u32 /*rnw*/: 1; // [ 8] (0) Read/not write i2c register, should be 1 here u32 : 7; u32 nbrd: 3; // [18:16] (0) Number of bytes to read (1..18, 0 means '8') u32 nabrd: 1; // [ 19] (0) Number of address bytes for read (0 - one byte, 1 - two bytes) u32 /*dly*/: 8; // [27:20] (0) Bit delay - number of mclk periods in 1/4 of the SCL period u32 /*tbl_mode*/: 2; // [29:28] (2) Should be 2 to select table data write mode u32 : 2; }; struct { u32 sda_drive_high: 1; // [ 0] (0) Actively drive SDA high during second half of SCL==1 (valid with drive_ctl) u32 sda_release: 1; // [ 1] (0) Release SDA early if next bit ==1 (valid with drive_ctl) u32 drive_ctl: 1; // [ 2] (0) 0 - nop, 1 - set sda_release and sda_drive_high u32 next_fifo_rd: 1; // [ 3] (0) Advance I2C read FIFO pointer u32 : 8; u32 cmd_run: 2; // [13:12] (0) Sequencer run/stop control: 0,1 - nop, 2 - stop, 3 - run u32 reset: 1; // [ 14] (0) Sequencer reset all FIFO (takes 16 clock pulses), also - stops i2c until run command u32 :13; u32 /*tbl_mode*/: 2; // [29:28] (0) Should be 0 to select controls u32 : 2; }; struct { u32 d32:32; // [31: 0] (0) cast to u32 }; } x393_i2c_ctltbl_t; |
Some member names in the example above are commented out (like /*tbl_mode*/ in lines 398, 408 and 420). This is done so because some bit fields (in this case bits [29:28]) have the same meaning in all alternative structures, and auto-generating complex union/structure combinations to create a valid C code with each member having unique name would produce rather clumsy code. Instead this script makes sure that same named members really designate the same bit fields, and then makes them anonymous while preserving names for a human reader. The last member (u32 d32:32;) is added to each union making it possible to address each of them as an unsigned long variable without casting.
And this is a snippet of the part of the generator code that produced it:
1725 1726 1727 1728 1729 1730 1731 1732 1733 |
def _enc_i2c_tbl_wmode(self): dw=[] dw.append(("rah", vrlg.SENSI2C_TBL_RAH, vrlg.SENSI2C_TBL_RAH_BITS, 0, "High byte of the i2c register address")) dw.append(("rnw", vrlg.SENSI2C_TBL_RNWREG, 1, 0, "Read/not write i2c register, should be 0 here")) dw.append(("sa", vrlg.SENSI2C_TBL_SA, vrlg.SENSI2C_TBL_SA_BITS, 0, "Slave address in write mode")) dw.append(("nbwr", vrlg.SENSI2C_TBL_NBWR, vrlg.SENSI2C_TBL_NBWR_BITS,0, "Number of bytes to write (1..10)")) dw.append(("dly", vrlg.SENSI2C_TBL_DLY, vrlg.SENSI2C_TBL_DLY_BITS, 0, "Bit delay - number of mclk periods in 1/4 of the SCL period")) dw.append(("tbl_mode", vrlg.SENSI2C_CMD_TAND, 2, 2, "Should be 2 to select table data write mode")) return dw |
The vrlg.* values used above are in turn read from the x393_parameters.vh Verilog file:
374 375 376 377 378 379 380 381 382 383 384 385 386 |
//i2c page table bit fields parameter SENSI2C_TBL_RAH = 0, // high byte of the register address parameter SENSI2C_TBL_RAH_BITS = 8, parameter SENSI2C_TBL_RNWREG = 8, // read register (when 0 - write register parameter SENSI2C_TBL_SA = 9, // Slave address in write mode parameter SENSI2C_TBL_SA_BITS = 7, parameter SENSI2C_TBL_NBWR = 16, // number of bytes to write (1..10) parameter SENSI2C_TBL_NBWR_BITS = 4, parameter SENSI2C_TBL_NBRD = 16, // number of bytes to read (1 - 8) "0" means "8" parameter SENSI2C_TBL_NBRD_BITS = 3, parameter SENSI2C_TBL_NABRD = 19, // number of address bytes for read (0 - 1 byte, 1 - 2 bytes) parameter SENSI2C_TBL_DLY = 20, // bit delay (number of mclk periods in 1/4 of SCL period) parameter SENSI2C_TBL_DLY_BITS= 8, |
Auto-generated files also include x393.h, it provides other constant definitions (like valid values for the bit fields) – lines 301..303, and function declarations to access registers. Names of the functions for read-only and write-only are derived from the address symbolic names by converting them to the lower case, the ones which deal with read/write registers have set_ and get_ prefixes attached.
301 302 303 304 305 306 307 308 309 |
#define X393_CMPRS_CBIT_CMODE_JPEG18 0x00000000 // Color 4:2:0 #define X393_CMPRS_CBIT_FRAMES_SINGLE 0x00000000 // Use single-frame buffer #define X393_CMPRS_CBIT_FRAMES_MULTI 0x00000001 // Use multi-frame buffer // Compressor control void x393_cmprs_control_reg (x393_cmprs_mode_t d, int cmprs_chn); // Program compressor channel operation mode void set_x393_cmprs_status (x393_status_ctrl_t d, int cmprs_chn); // Setup compressor status report mode x393_status_ctrl_t get_x393_cmprs_status (int cmprs_chn); |
Register access functions are implemented with readl() and writel(), this is a corresponding section of the x393.c file:
264 265 266 267 268 |
// Compressor control void x393_cmprs_control_reg (x393_cmprs_mode_t d, int cmprs_chn) {writel(d.d32, mmio_ptr + (0x1800 + 0x40 * cmprs_chn));} // Program compressor channel operation mode void set_x393_cmprs_status (x393_status_ctrl_t d, int cmprs_chn) {writel(d.d32, mmio_ptr + (0x1804 + 0x40 * cmprs_chn));} // Setup compressor status report mode x393_status_ctrl_t get_x393_cmprs_status (int cmprs_chn) { x393_status_ctrl_t d; d.d32 = readl(mmio_ptr + (0x1804 + 0x40 * cmprs_chn)); return d; } |
There are two other header files generated from the same data, one (x393_defs.h) is just an alternative way to represent register addresses – instead of the getter and setter functions it defines the preprocessor macros:
247 248 249 250 |
// Compressor control #define X393_CMPRS_CONTROL_REG(cmprs_chn) (0x40001800 + 0x40 * (cmprs_chn)) // Program compressor channel operation mode, cmprs_chn = 0..3, data type: x393_cmprs_mode_t (wo) #define X393_CMPRS_STATUS(cmprs_chn) (0x40001804 + 0x40 * (cmprs_chn)) // Setup compressor status report mode, cmprs_chn = 0..3, data type: x393_status_ctrl_t (rw) |
The last generated file – x393_map.h uses the preprocessor macro format to provide a full ordered address map of all the available registers for all channels and sub-channels. It is intended to be used just as a reference for developers, not as an actual include file.
Conclusions
The generated code for Elphel NC393 camera is definitely very hardware-specific, its main purpose is to encapsulate as much as possible of the hardware interface details and so to reduce dependence of the higher layers of software on the modifications of the HDL code. Such tasks are common to other projects that involve CPU/FPGA tandems, and similar approach to organizing software/hardware interface may be useful there too.
https://news.ycombinator.com/item?id=11395603