Abstract: This article describes the system architecture of the TMS320C62x and its software design methodology, and presents the implementation of a wideband millimeter-wave radar target recognition algorithm based on a time-delay neural network on the TMS320C6201 EVM board. Program verification shows that good processing results were achieved.
Keywords: TMS320C6201; DSP; software design; real-time processing; target recognition
Digital signal processing (DSP) technology has developed rapidly in recent years, and DSP chips are now widely used in communications, image processing, speech processing, radar, and other fields. TI is one of the world's major DSP chip suppliers. Its TMS320C6000 family is a new generation of high-performance DSP chips in the TMS320 series; the fixed-point devices are the TMS320C62x and the floating-point devices are the TMS320C67x. The TMS320C6201 is a representative fixed-point device with a processing capacity of up to 1600 MIPS. This article focuses on the software design method for the TMS320C62x: it explains the TMS320C62x system architecture and applies the software design method to implement the wideband millimeter-wave radar target time-delay neural network recognition algorithm on the TMS320C6201.
1 System structure of TMS320C62x
The system structure of the TMS320C62x is shown in Figure 1. The TMS320C62x processor consists of three main parts: the CPU core, peripherals, and memory. The 8 functional units in the CPU core can operate in parallel, performing logic, shift, multiply, add, and data-addressing operations. The TMS320C6000 architecture adopts the Very Long Instruction Word (VLIW) approach: each instruction is 32 bits wide and occupies one functional unit. The fetch, dispatch, and decode units can transfer 8 instructions from program memory to the functional units per cycle; these 8 instructions form an instruction packet with a total width of 8×32 = 256 bits. A dedicated dispatch module inside the chip distributes each 256-bit packet to the 8 functional units, which run in parallel. The maximum clock frequency of the TMS320C62x is 200 MHz; with all 8 functional units running simultaneously, the chip's processing power reaches 1600 MIPS.
The total on-chip memory of the TMS320C62x is 1 Mbit: 2K×256 bits serve as program memory and program cache, with a 256-bit width; 64 KB serve as data memory and data cache, which users can access as 8-bit, 16-bit, or 32-bit data. The peripheral modules of the TMS320C62x include multi-channel buffered serial ports, timers, the external memory interface (EMIF), the DMA controller, the host port, and power-down logic. The DMA controller can transfer data between different regions of the memory space. The EMIF can address up to 64 MB of off-chip memory over a 32-bit data bus, and also supports reads and writes to 8-bit and 16-bit memories. The 16-bit host port (HPI) gives a host access to the TMS320C62x memory space and devices. This variety of peripheral modules makes the TMS320C62x chip very powerful.
2 Software design method of TMS320C62x
When developing application software, users should first clarify the software's functional and performance requirements, then design the software according to the three stages of the code development flow: the first stage is developing C code; the second is optimizing the C code; the third is writing linear assembly code. Not all three stages are necessary: if the functional and performance requirements are met at some stage, there is no need to proceed to the next. The code development flow is shown in Figure 2.
2.1 Develop C code
Key points to consider when developing C code include: ① data structures; ② analyzing C code performance; ③ using lookup tables; ④ using integers (int) to represent floating-point numbers.
2.1.1 Data structure
The TMS320C62x compiler defines a size for each data type: 8 bits for character type (char), 16 bits for short integer type (short), 32 bits for integer type (int), 40 bits for long integer type (long), 32 bits for floating-point type (float), and 64 bits for double-precision floating-point type (double). The rules to follow when writing C code are: avoid treating int and long as the same size, because the compiler uses 40-bit operations for long data; for fixed-point multiplication, use short operands wherever possible, since this data type makes the most efficient use of the TMS320C62x multiplier; and use int or unsigned int rather than short or unsigned short for loop counters, to avoid unnecessary sign extension.
2.1.2 Analyze C code performance
Using the debugger's Profile tool, you can obtain statistics on the execution of each code segment in the C program, including the number of CPU clock cycles a specific segment consumes. This makes it possible to identify the C code segments that dominate the program's overall performance (usually loops) and target them for improvement.
2.1.3 Using lookup tables
In C code, statements or functions that compute their results directly can often be replaced with lookup tables or precomputed constants, improving execution speed.
2.1.4 Use integers (int) to represent floating point numbers
Since the TMS320C62x is a fixed-point chip, it does not support floating-point operations in hardware. For floating-point addition, subtraction, multiplication, and division, the TMS320C62x compiler converts each floating-point operation into a series of fixed-point operations, which are then executed by the chip's functional units. Floating-point operations are therefore relatively time-consuming, and fixed-point data types should be used as much as possible when writing programs; in C, this means preferring the integer (int) type. In C on the TMS320C62x, an int occupies 4 bytes and can represent values in the range -2147483648 to 2147483647. Because the data actually being computed is usually floating point, it must be scaled and converted to integers to improve the application's processing speed. The choice of the number of fractional bits is critical: it must both ensure that the converted data meets the accuracy requirements and prevent data overflow during processing.
2.2 Optimize C code
Optimizing C code includes cooperating with compiler options, using intrinsics, using word accesses for short integer data, and using software pipelining. Compiler options control the operation of the compiler, and some of them enable optimization of the C code.
2.2.1 Indicate independent instructions to the compiler
For instructions to execute in parallel, the compiler must determine the dependence between them, that is, whether one instruction must occur after another; only independent instructions can be executed in parallel. If the compiler cannot prove that two instructions are independent, it assumes they are dependent and schedules them serially. Users can indicate independence through the following methods:
·The keyword const indicates that a variable, or the storage it refers to, is never modified. Using const can improve both the performance and the portability of the code.
·Use the -pm option together with the -o3 option to enable program-level optimization. In program-level optimization, all source files are compiled into a single module that the compiler optimizes as a whole, allowing it to eliminate dependencies more effectively.
·Use the -mt option to tell the compiler that the code contains no memory aliasing, which allows it to optimize under the assumption that memory references are independent.
2.2.2 Using intrinsics
The intrinsics provided by the TMS320C62x compiler are special functions that map directly to inlined C6000 instructions. Users can use intrinsics to optimize C code quickly.
2.2.3 Using words to access short integer data
Some intrinsics operate on the upper and lower 16-bit fields of a 32-bit register. When a large number of short (16-bit) values must be processed, a single word (int) access can fetch two shorts at a time, and intrinsics can then operate on the packed pair, reducing the number of memory accesses.
2.2.4 Using software pipeline
Software pipelining is a technique that schedules loop instructions so that multiple iterations of the loop execute in parallel. When compiling with the -o2 or -o3 option, the compiler can software-pipeline loop code. The minimum number of iterations a software-pipelined loop must execute in order to fill the pipeline is called the minimum trip count. When the compiler cannot determine the total trip count or relate it to the minimum trip count, it generates two versions of the loop: the unpipelined version executes when the total trip count is less than the minimum, and the software-pipelined version executes otherwise. The -ms option causes the compiler to generate only one version of the loop, based on the trip count. Users can pass trip-count information to the compiler in several ways: use the -o3 and -pm options so the optimizer can see the entire program and deduce trip counts, or use the _nassert intrinsic to supply trip-count guarantees and prevent the redundant loop from being generated. Users can also enable speculative execution (the -mh option) to eliminate the epilog of software-pipelined loops, thereby reducing code size.
Since the compiler software-pipelines only the innermost loop of a nest, unrolling a short inner loop by hand and letting the compiler software-pipeline the outer loop can improve the performance of the C code.
Points to note when using software pipelining: although a software-pipelined loop may contain intrinsics, it cannot contain function calls; conditional break instructions are not allowed in the loop; the loop control variable cannot be modified inside the loop body; and a loop cannot be software-pipelined if complex conditional code in its body requires more than 5 condition registers, or if its code requires more than 32 registers.
2.3 Writing linear assembly code
Writing linear assembly code is the third stage of the code development flow. To improve performance, critical C code that limits application speed can be rewritten in linear assembly. Linear assembly files are the input to the assembly optimizer. Linear assembly resembles ordinary C6000 assembly, except that the writer need not specify the registers used, whether instructions run in parallel, instruction delay slots, or the functional units the instructions use; the assembly optimizer determines all of this. Linear assembly files use assembly optimizer directives to distinguish linear assembly from ordinary assembly code. The .cproc and .endproc directives delimit the code section the assembly optimizer will optimize: .cproc is placed at the beginning of the section and .endproc at the end. The .reg directive lets the assembly optimizer choose, for a value, a register consistent with the functional unit selected by the instructions that operate on that value. The .trip directive states the number of iterations of a loop. Methods for optimizing linear assembly code include: specifying functional units for linear assembly instructions so that the final assembly instructions execute in parallel; using word accesses for short integer data; and using software pipelining to optimize loops.
Writing linear assembly code is labor-intensive and requires a long development cycle, and unlike C code, the resulting assembly cannot be ported to other DSP platforms. It is therefore recommended to complete the software design in the earlier stages as far as possible, and to write linear assembly only for the critical C code segments when the performance requirements still cannot be met.
3 Implementation of broadband millimeter wave radar target delay neural network recognition algorithm on TMS320C6201
The wideband millimeter-wave radar target recognition algorithm has strict real-time requirements. For example, the target recognition processing time of a certain wideband radar seeker must be under 1.5 ms, meaning the algorithm must process one set of data and correctly identify the target within 1.5 ms. Artificial neural networks (ANNs) offer parallel processing and distributed information storage, and can satisfy both the parallel-computation requirements of a wideband millimeter-wave radar target recognition system and the radar's limited data-storage space. ANN technology therefore has great potential in radar target recognition. Introducing delay units into a multi-layer feedforward perceptron model gives the network a memory function, and the resulting time-delay neural network model is well suited to processing sequential data. The basic principle of the wideband millimeter-wave radar target time-delay neural network recognition algorithm is: the one-dimensional range profile is preprocessed with incoherent averaging, adaptive thresholding, and peak sampling at equal range intervals, yielding relatively stable low-dimensional samples that are fed as feature vectors to the time-delay neural network classifier for automatic classification and recognition. The time-delay neural network is a three-layer network with 17 input nodes, 10 hidden nodes, and 3 output nodes. A large training sample set is used to train the network and obtain its weights; the TMS320C62x software design method is then applied to implement the algorithm on the TMS320C6201 EVM board.
During program implementation, the lookup-table method is used to speed up evaluation of the two frequently used functions 1.0/(1.0+exp(-x)) and tanh(x). Based on program testing, integers (int) are used to represent floating-point numbers, with the lower 13 bits of each integer representing the fractional part. Code Composer Studio (CCS), the development software supporting the TMS320C6201 EVM board, is an integrated package for compiling, linking, real-time debugging, tracing, and analyzing applications. CCS speeds up development and lets users create and debug digital signal processing applications in real time. In the CCS integrated development environment, the C code of the time-delay neural network recognition algorithm was written and optimized by comprehensively applying the TMS320C62x software design methods.
In actual testing, the execution time of the wideband millimeter-wave radar target time-delay neural network recognition algorithm on the TMS320C6201 was 0.850 ms, which meets the algorithm's real-time requirement and achieves good processing results.