Implementing Fast Algorithms on the TMS320F2812 Fixed-Point DSP Chip

　　1 Introduction to TMS320F2812

　　The TMS320F2812 is a high-performance, multi-function, cost-effective 32-bit fixed-point DSP from TI aimed at control applications. It is code-compatible with the TMS320LF2407 instruction set and runs at up to 150MHz. It has 18k×16-bit zero-wait-state on-chip SRAM and 128k×16-bit on-chip FLASH (access time 36ns). Its on-chip peripherals include a 2×8-channel 12-bit ADC (shortest conversion time 80ns), two SCI channels, one SPI, one McBSP, and one eCAN. It also has two event manager modules (EVA and EVB), each providing 6 PWM/CMP channels, 2 QEP channels, 3 CAP channels, and 2 16-bit timers (or TxPWM/TxCMP). In addition, the device has three independent 32-bit CPU timers and up to 56 individually programmable GPIO pins, and its external interface allows expansion with up to 1M×16-bit of program and data memory. The TMS320F2812 uses a Harvard bus architecture, has a password-based code protection mechanism, and can perform dual 16×16 and single 32×32 multiply-accumulate operations, so it serves both control and fast computation.

　　This article focuses on how fast computation can be achieved on the TMS320F2812 fixed-point DSP through suitable system configuration and programming.

　　2 TMS320F2812 basic system configuration

　　2.1 TMS320F2812 clock

　　The on-chip peripherals of TMS320F2812 can be divided into the following four groups according to the input clock:

　　(1) SYSCLKOUT group: the CPU timers and the eCAN bus; the frequency can be changed dynamically through the PLLCR register;

　　(2) OSCCLK group: mainly the watchdog circuit; the division factor is set by the WDCR register;

　　(3) Low-speed group: SCI, SPI, and McBSP; the division factor is set by the LOSPCP register;

　　(4) High-speed group: EVA/B and the ADC; the division factor is set by the HISPCP register.

　　To make the system run as fast as possible, the peripherals are clocked from the 150MHz system clock, except for the few (such as the timers and SCI) that need a lower-speed clock.
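　　The relationship between the clock groups can be sketched in C as follows. This is only an illustration that runs on a host PC: it assumes the usual F281x rules that the CPU clock is OSCCLK/2 when the PLL multiplier field is 0 and OSCCLK×DIV/2 otherwise, and that a prescaler value n in HISPCP or LOSPCP divides SYSCLKOUT by 2n (or by 1 when n is 0). The function names and the 30MHz oscillator in the usage note are assumptions, not taken from the original design.

```c
#include <assert.h>
#include <stdint.h>

/* SYSCLKOUT in Hz for a given oscillator frequency and PLLCR DIV field.
 * Assumed rule: DIV = 0 bypasses the PLL (OSCCLK / 2),
 * DIV = 1..10 gives OSCCLK * DIV / 2. */
uint32_t sysclkout_hz(uint32_t oscclk_hz, unsigned pll_div)
{
    assert(pll_div <= 10u);
    if (pll_div == 0u) {
        return oscclk_hz / 2u;
    }
    /* Divide first so the product cannot overflow 32 bits. */
    return (oscclk_hz / 2u) * pll_div;
}

/* Clock of the high-speed or low-speed peripheral group.
 * Assumed rule: prescale = 0 passes SYSCLKOUT through,
 * prescale = 1..7 divides it by 2 * prescale. */
uint32_t prescaled_clock_hz(uint32_t sysclkout, unsigned prescale)
{
    assert(prescale <= 7u);
    return (prescale == 0u) ? sysclkout : sysclkout / (2u * prescale);
}
```

　　For example, with an assumed 30MHz oscillator, sysclkout_hz(30000000, 10) returns 150MHz, and prescaled_clock_hz(150000000, 2) returns 37.5MHz for a low-speed group that uses a prescaler value of 2.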

　　2.2 Storage Space

　　Figure 1 shows the internal memory map of the TMS320F2812. The TMS320F2812 is a Harvard-architecture DSP, so it can fetch an instruction, read data, and write data in the same clock cycle. Logically there are a 4M×16-bit program space and a 4M×16-bit data space, but physically the two are unified into a single 4M×16-bit memory space. The bus priority, from highest to lowest, is: data write, program write, data read, program read. The 256k×16-bit external SRAM, implemented with a CY7C1041, is mapped to Zone 6 (0x100000~0x13FFFF) and has an access time of at least 12ns; the 128k×16-bit FLASH (0x3D8000~0x3F7FFF) has an instruction fetch time of at least 36ns. To get the highest speed from the device, the FLASH control registers should be programmed for faster operation, and in addition the time-critical routines (such as the delay-calculation and FIR filter subroutines) and variables (such as the FIR filter coefficients and the weight vectors of the adaptive algorithm) can be moved into the H0, L0, L1, M0, and M1 on-chip RAM blocks.

　　Figure 1 Internal memory map of the TMS320F2812

　　2.3 Interrupts

　　The TMS320F28x series DSPs have a rich set of on-chip peripherals, and each peripheral can generate one or more interrupt requests. Interrupts are handled at two levels: the PIE level and the CPU level. The CPU supports 32 interrupt sources: RESET, NMI, EMUINT, ILLEGAL, 12 user-defined software interrupts (USER1 to USER12), and 16 maskable interrupts (INT1 to INT14, RTOSINT, and DLOGINT). All software interrupts are non-maskable. Because the CPU does not have enough interrupt lines to serve every on-chip peripheral request directly, the TMS320F28x series DSPs include a peripheral interrupt expansion (PIE) controller that manages the interrupt requests from on-chip peripherals and external pins.

　　There are 96 PIE interrupts, divided into 12 groups of 8 peripheral interrupt requests each. The 96 request signals are denoted INTx.y (x=1,2,…,12; y=1,2,…,8). Each group outputs one interrupt request to the CPU, that is, the PIE outputs INTx (x=1,2,…,12) drive the CPU interrupt inputs INT1~INT12. Of the 96 possible PIE interrupt sources in the TMS320F28x series, 45 are used by the TMS320F2812, and the rest are reserved for future devices.

　　The ADC, timers, and SCI are all programmed in interrupt mode, which improves CPU utilization.
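　　As an illustration of the INTx.y numbering, the sketch below computes where the vector for a given PIE interrupt sits in the PIE vector table. It assumes the F281x layout in which the table starts at 0x000D00, each vector occupies two 16-bit words, and the first 32 entries mirror the CPU vectors so that INT1.1 is entry 32. The function name is hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define PIE_VECT_BASE 0x000D00u  /* assumed start of the PIE vector table */

/* Address (in 16-bit words) of the vector for PIE interrupt INTx.y,
 * where group = x (1..12) and index = y (1..8). */
uint32_t pie_vector_address(unsigned group, unsigned index)
{
    unsigned vector_id;

    assert(group >= 1u && group <= 12u);
    assert(index >= 1u && index <= 8u);

    /* Entries 0..31 are the CPU-level vectors; PIE entries follow. */
    vector_id = 32u + (group - 1u) * 8u + (index - 1u);

    /* Each vector is a 32-bit address, i.e. two 16-bit words. */
    return PIE_VECT_BASE + 2u * vector_id;
}
```

　　Under these assumptions pie_vector_address(1, 1) gives 0x0D40 and pie_vector_address(12, 8) gives 0x0DFE, the first and last of the 96 PIE entries.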

　　2.4 Reset Boot

　　Figure 2 shows the on-chip boot ROM memory map of the TMS320F2812. The boot program occupies 0x3FFC00~0x3FFFBF in Figure 2. With VMAP=1, MP/MC=0, and ENPIE=0 (see Figure 1), the reset vector is fetched from on-chip address 0x3FFFC0. The content of 0x3FFFC0 is 0x3FFC00, which points to the boot program in Figure 2. With GPIOF4 (SCITXDA)=1 as listed in Table 2, the boot program jumps to 0x3F7FF6 in FLASH, where a jump instruction to the start of the user program is placed, and the user program begins to run. Because PIE interrupts are used in the actual application, the user program should first initialize the PIE interrupt vector table and then enable the PIE.

　　Figure 2 TMS320F2812 on-chip boot ROM memory map

　　3 Programming Design

　　Programming is key to both correct system operation and fast computation. Given a reasonable system configuration, the key to fast computation on a fixed-point chip is to use integer rather than floating-point arithmetic. When using the C compiler, the following principles help it generate the best code:

　　(1) Convert divisions into multiplications and try to get the compiler to generate MAC instructions, so that the DSP's hardware multiplier is fully used. The MAC operands should be local variables that can be allocated to registers (or to the accumulator).

　　(2) Use static inline functions wherever possible to avoid function-call overhead.

　　(3) Use a constant, or a const-qualified variable, as the upper bound of a for loop so that the compiler can generate the repeat instruction RPT.
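　　The three principles can be illustrated with a short C sketch. It is portable C meant only to show the coding style; whether a particular compiler version actually emits MAC or RPT for it is not guaranteed here. The names N_TAPS, mac16, dot_product, and scale_by_one_tenth are hypothetical, and the divide-by-10 case is just an example of replacing a division with a multiplication by a Q15 reciprocal.

```c
#include <stdint.h>

#define N_TAPS 8  /* (3) compile-time constant loop bound */

/* (2) Small helper declared static inline to avoid call overhead. */
static inline int32_t mac16(int32_t acc, int16_t a, int16_t b)
{
    return acc + (int32_t)a * (int32_t)b;
}

/* (1) Multiply-accumulate over local variables with a 32-bit accumulator. */
int32_t dot_product(const int16_t *x, const int16_t *h)
{
    int32_t acc = 0;  /* local, so it can live in a register/accumulator */
    int i;

    for (i = 0; i < N_TAPS; i++) {
        acc = mac16(acc, x[i], h[i]);
    }
    return acc;
}

/* (1) Division replaced by multiplication: v / 10 is approximated by
 * v * (1/10 in Q15) >> 15, where 1/10 in Q15 is 3277.
 * Intended for non-negative v; right-shifting a negative value is
 * implementation-defined in C. */
int16_t scale_by_one_tenth(int16_t v)
{
    return (int16_t)(((int32_t)v * 3277) >> 15);
}
```

　　For example, scale_by_one_tenth(1000) evaluates to 100 without executing a division.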

　　3.1 ADC Programming

　　The TMS320F2812 has a 12-bit ADC with two 8-to-1 multiplexers and dual sample-and-hold circuits. The analog input range is 0-3V and the shortest conversion time is 80ns; a sampling rate of 10kSPS is used here. Conversions are triggered automatically by the EVA timer (0.1ms period), four channels are sampled per trigger, and an interrupt at the end of each conversion stores the results (each result is shifted right by 4 bits).

　　$\text{Conversion result} = (2^{12} - 1) \times \dfrac{\text{input analog signal} - \text{ADCLO}}{3}$

　　To run the ADC, first initialize the DSP system, then set up the PIE interrupt vector table and initialize the ADC module. Next, load the entry address of the ADC interrupt service routine into the interrupt vector table and enable the interrupt. Then start the 0.1ms timer and wait for the ADC interrupt. Finally, read the conversion results in the ADC interrupt service routine and re-arm the next interrupt in software.
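　　The arithmetic behind the conversion formula and the 4-bit right shift can be sketched as follows. The code runs on a host PC and does not touch any ADC register: the raw register value is passed in as a parameter, voltages are expressed in millivolts to stay in integer arithmetic, and all names are hypothetical.

```c
#include <stdint.h>

#define ADC_FULL_SCALE_CODE 4095  /* 2^12 - 1 */
#define ADC_FULL_SCALE_MV   3000  /* 3V input range, in millivolts */

/* The 12-bit result is left-justified in the 16-bit result register,
 * so the sample is recovered by shifting right 4 bits. */
uint16_t adc_code_from_register(uint16_t result_reg)
{
    return (uint16_t)(result_reg >> 4);
}

/* Ideal conversion result for a given input, following
 * code = (2^12 - 1) * (Vin - ADCLO) / 3V, clamped to 0..4095. */
uint16_t adc_ideal_code(int32_t vin_mv, int32_t adclo_mv)
{
    int32_t code = (ADC_FULL_SCALE_CODE * (vin_mv - adclo_mv))
                   / ADC_FULL_SCALE_MV;

    if (code < 0) {
        code = 0;
    }
    if (code > ADC_FULL_SCALE_CODE) {
        code = ADC_FULL_SCALE_CODE;
    }
    return (uint16_t)code;
}
```

　　With ADCLO at 0V, a 1.5V input gives an ideal code of 2047, which is the Q12 representation of roughly half of full scale.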

　　3.2 FIR filter programming

　　The target signal is very sensitive to low-frequency interference, which directly affects the validity of the positioning results and data. So that filtering does not distort the subsequent time-delay calculation, a linear-phase FIR filter is used. The filter coefficients h(i) are generated in MATLAB, converted to integers, and hard-coded into the program. Precomputing them (rather than calculating them at run time) gives fast filtering without adding much to the positioning calculation time of the whole measurement system.

　　3.3 Porting the positioning algorithm
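　　A minimal fixed-point FIR sketch is given below. The five coefficients are placeholders chosen only to be symmetric (linear phase) and to sum to zero so that a constant input is rejected; they are not the MATLAB-generated coefficients of the original design. The Q15 coefficient format, the tap count, and all names are likewise assumptions.

```c
#include <stdint.h>

#define FIR_TAPS 5

/* Placeholder symmetric coefficients in Q15 (sum = 0, so DC is rejected). */
static const int16_t fir_coeff_q15[FIR_TAPS] = {
    -3277, -6554, 19662, -6554, -3277
};

/* Delay line; must be zero-initialized before the first sample. */
typedef struct {
    int16_t delay[FIR_TAPS];
} fir_state;

/* Process one input sample (assumed to be in the 12-bit ADC range)
 * and return one filtered output sample. */
int16_t fir_step(fir_state *s, int16_t x)
{
    int32_t acc = 1L << 14;  /* rounding offset for the final >> 15 */
    int i;

    /* Shift the delay line and insert the new sample. */
    for (i = FIR_TAPS - 1; i > 0; i--) {
        s->delay[i] = s->delay[i - 1];
    }
    s->delay[0] = x;

    /* Multiply-accumulate with a constant loop bound. */
    for (i = 0; i < FIR_TAPS; i++) {
        acc += (int32_t)fir_coeff_q15[i] * s->delay[i];
    }

    /* Back from Q15 product scale; assumes an arithmetic right shift
     * for negative values, as on common compilers. */
    return (int16_t)(acc >> 15);
}
```

　　Because the coefficients are fixed integers, each output costs only FIR_TAPS multiply-accumulates and one shift, with no floating-point work at run time.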

　　Because the positioning algorithm uses adaptive time-delay estimation, the computational load is very large and places high demands on the DSP. The TMS320F2812 has a 32-bit hardware multiplier and accumulator, its RPT instruction is well suited to loop calculations, and its throughput can reach 150MIPS. However, it is a fixed-point device, so the heavy computation must be implemented with fixed-point algorithms. The initial data and the weight vector are therefore stored as 16-bit integers (Q=12, determined by the ADC resolution), and the intermediate results of the loop calculations as 32-bit integers (Q=20, chosen to give as much precision as possible without overflow). Trigonometric functions are evaluated by table lookup, using the table shown in Figure 2.

　　The C compiler includes a floating-point library, so floating-point and fixed-point results can be compared. Processing 4 channels of 1024-point data takes about 3.6 seconds in floating-point arithmetic but only 1.3 seconds in fixed-point arithmetic.

　　The algorithm can be optimized further. First, place frequently used intermediate variables in zero-wait-state memory. Second, use FLASH acceleration: setting the ENPIPE bit of the FOPT register enables the FLASH pipeline mode with its prefetch mechanism, which gives 100 to 120MIPS, far more than the raw 36ns access time of the FLASH would allow. Note that, because of the protection mechanism of the TMS320F2812, the code that accesses the FLASH registers must be moved to L0 and L1 and executed there. Going one step further, moving this time-critical algorithm into H0 memory gives a processing speed of up to 150MIPS; the function memcpy() can be used to copy the program.
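　　The Q12 and Q20 formats mentioned above can be illustrated with a short sketch. Multiplying two Q12 values gives a Q24 product, which is shifted right by 4 bits before being accumulated in Q20. This is only one plausible way to combine the two formats; the original article does not give its exact scaling steps, and the function names are hypothetical.

```c
#include <stdint.h>

#define Q12_ONE 4096      /* 1.0 in Q12 */
#define Q20_ONE 1048576L  /* 1.0 in Q20 */

/* Sum of products of two Q12 vectors, accumulated in 32-bit Q20.
 * Assumes an arithmetic right shift for negative products. */
int32_t q12_mac_q20(const int16_t *x, const int16_t *w, int n)
{
    int32_t acc = 0;
    int i;

    for (i = 0; i < n; i++) {
        int32_t p = (int32_t)x[i] * (int32_t)w[i];  /* Q12 * Q12 = Q24 */
        acc += p >> 4;                               /* Q24 -> Q20 */
    }
    return acc;
}

/* Convert a Q20 intermediate result back to 16-bit Q12. */
int16_t q20_to_q12(int32_t v)
{
    return (int16_t)(v >> 8);
}
```

　　For example, with x = {1.0, 0.5} and w = {0.5, 0.5} in Q12, the accumulated result is 0.75 in Q20 (786432), which converts back to 3072 in Q12.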
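　　The memcpy()-based migration can be sketched as below. On the target, the load and run addresses would normally come from symbols defined in the linker command file; those symbols are not shown in the original article, so the function here simply takes pointers, and its name is hypothetical. The sketch can be tried on a host PC with ordinary arrays standing in for FLASH and RAM.

```c
#include <stdint.h>
#include <string.h>

/* Copy a block of 16-bit words from its load address (e.g. in FLASH)
 * to its run address (e.g. in H0, L0 or L1 RAM).
 * load_end points one word past the last word to copy. */
void copy_section_to_ram(uint16_t *run_start,
                         const uint16_t *load_start,
                         const uint16_t *load_end)
{
    size_t n_words = (size_t)(load_end - load_start);

    /* The length is given in units of sizeof(uint16_t), so the same
     * call works whether a char is 8 bits or 16 bits wide. */
    memcpy(run_start, load_start, n_words * sizeof(uint16_t));
}
```

　　The copy has to be done early in the startup code, before the first call to any function that is linked to run from RAM.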

　　4 Conclusion

　　When the computational load is large, a floating-point DSP chip is usually chosen. However, by making full use of the on-chip resources of a fixed-point DSP as described in this article, a fixed-point chip can also reach high computation speeds, which saves hardware design cost and time and reduces power consumption.
