TI C6000 Software Development Process and Program Optimization Technology

灞波儿奔 · Published on 2018-12-25 15:59


This article describes the TI C6000 software development process and focuses on C/C++ program optimization for the C6000 series, covering the optimization process, C/C++ code optimization methods, and linear assembly code optimization methods. Together these give DSP C/C++ developers a comprehensive set of optimization techniques that are of practical value when building real systems.

At present, DSP platforms are programmed mostly in assembly language and C. In the past, assembly language was generally used in pursuit of efficient code. DSP assembly is concise and efficient and can directly operate on the DSP's internal registers, memory space, and peripherals, but its readability, modifiability, and portability are poor. As DSP applications grow broader and more complex, these shortcomings, along with poor reusability, become increasingly prominent, and the gap between software requirements and software productivity widens. Introducing high-level languages (such as C, C++, and Java) can close this gap. Among them, C is a comparatively efficient high-level language that is superior to assembly in readability and portability, and DSP vendors have successively released C compilers for their chips. However, because of the particular architecture of DSPs, C compilers on these platforms cannot fully exploit the performance of the devices: a C program is often only a fraction as efficient as a hand-written assembly program with the same function. Programs written in C therefore need to be further optimized according to the characteristics of the DSP.
1 Introduction to the TMS320C6000 processor

The TMS320C6000 is a new generation of high-performance DSP chips in the TMS320 family and is divided into two major series: the fixed-point series, TMS320C62xx and TMS320C64xx, and the floating-point series, TMS320C67xx. Because the TMS320C6000 was developed mainly for data-intensive algorithms, it has rich on-chip resources and powerful computing capability, and it is widely used in digital communications and image processing.

The eight functional units in the C6000 CPU can operate in parallel, and two of them are hardware multipliers, which greatly improves multiplication speed. The DSP adopts a Harvard bus structure with independent program and data buses. The on-chip program bus alone is 256 bits wide, that is, up to eight 32-bit instructions can be executed in parallel per cycle; each of the two on-chip data buses is 32 bits wide; and the DSP also has a dedicated 32-bit DMA bus for transfers. This flexible bus structure greatly relieves the data bottleneck on system performance. The general-purpose register file of the C6000 supports 32-bit and 40-bit fixed-point data. In addition, the C67xx supports 64-bit double-precision data and the C64xx supports 64-bit double-word fixed-point data.

Besides multiple functional units, pipelining is the other major means of improving DSP execution efficiency. Thanks to the special structure of the TMS320C6000, the operations performed simultaneously by the functional units are dispatched together by the VLIW (very long instruction word) dispatch logic, so that eight parallel instructions can pass through each stage of the pipeline at the same time, greatly improving throughput.

2 C6000 software development process

C/C++ source files are first converted into C6000 assembly source code by the C/C++ compiler.
The C/C++ compiler consists of the compiler shell program, the optimizer, and the interlist utility. The shell program lets users compile, assemble, and link in one step; the optimizer modifies the code to improve the efficiency of the C program; the interlist utility interleaves the C/C++ statements with the corresponding assembly statements. The assembly source code is then translated by the assembler into machine-language object files based on the Common Object File Format (COFF). The linker combines the object files into an executable file, performing address relocation and resolving external references along the way. Once the executable is available, it can be debugged: the software simulator accurately simulates the instructions and their execution time on a PC, and the XDS hardware emulator is used for debugging on the target board. After debugging, the program can be downloaded to the target board to run standalone.

3 Program optimization process and methods

3.1 Program optimization stages

Because DSP applications are complex, when DSP software is developed in C the algorithm is generally first simulated on a PC or workstation based on a general-purpose microprocessor, and the C program is ported to the DSP platform only after the simulation passes. The DSP software development and optimization process is therefore divided into three stages: C code development, C code optimization, and hand rewriting in assembly.

Stage 1: C code development. Users without C6000 knowledge can develop their C code and then use the profiling tools in CCS to locate sections that may be inefficient, in preparation for further optimization.

Stage 2: C code optimization. In this stage, intrinsics and compiler options are the main means of improving code performance.
After optimization, the efficiency of the code is checked with the software simulator. If the expected efficiency is still not reached, the third stage begins.

Stage 3: Linear assembly optimization. In this stage, the user extracts the most time-critical code, rewrites it in linear assembly, and then optimizes it with the assembly optimizer. When first writing linear assembly, pipelining and register allocation can be ignored; performance is then improved by adding more detail to the code, such as register allocation. Because this stage takes more time than the second, as much of the optimization as possible should be completed in stage 2, and linear assembly should be used sparingly.

3.2 C/C++ code optimization methods

To get the best performance from C/C++ code, compiler options, software pipelining, intrinsics, and loop unrolling can be used to increase execution speed and reduce code size.

3.2.1 Compiler option optimization

The C/C++ compiler can optimize code at several levels. High-level optimization is performed by a dedicated optimizer, and low-level, target-specific optimization is performed by the code generator. Figure 3 shows how the compiler, optimizer, and code generator work together; this is the flow that runs when the optimizer is enabled. The C/C++ source first passes through a parser, which performs preprocessing and generates an intermediate file (.if) that serves as input to the optimizer. The optimizer generates an optimized file (.opt), which the code generator takes as input for further optimization, finally producing an assembly file (.asm). The simplest way to enable optimization is to invoke the cl6x compiler with the -on option on the command line.
Here n is the optimization level (0, 1, 2, or 3), which controls the type and degree of optimization.

3.2.2 Software pipelining optimization

Software pipelining is a technique for scheduling the instructions of a loop so that multiple iterations execute in parallel. When a C/C++ program is compiled with the -o2 or -o3 option, the compiler collects information from the program and attempts to software-pipeline its loops. Figure 4 shows a software-pipelined loop. In Figure 4, A, B, C, D, and E represent the instructions in one iteration, and A1, A2, A3, A4, and A5 represent the stages of execution of an instruction. Within the loop, up to five instructions can execute in parallel in one cycle; this is the loop kernel, shown as the shaded part of the figure. The part before the kernel is called the pipelined loop prolog, and the part after it is called the pipelined loop epilog.

3.2.3 Intrinsic function optimization

The performance of compiled code can be improved significantly by refining the C program in the following ways:

(1) Use intrinsics to replace complex C/C++ code;
(2) Use word accesses to operate on 16-bit data stored in the high and low halves of a 32-bit register;
(3) Use double-word accesses to operate on 32-bit data stored in a 64-bit register pair (C64xx/C67xx only).

The C6000 compiler provides many intrinsics that map directly to C62x/C64x/C67x instructions and can quickly optimize C code. The operations they perform are not easy to express in C/C++. Intrinsics are marked with a leading underscore "_" and are used in the same way as function calls. For example, saturated addition in plain C can only be written as a function that takes multiple cycles. That complex code can be replaced by the _sadd() intrinsic, which corresponds to a single-cycle C6x instruction: result = _sadd(a, b);

To raise the data processing rate of the C6000, a single load/store instruction should access multiple data items.
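The multi-cycle plain C form of saturated addition referred to in this section is not reproduced in the post. A minimal portable sketch of what such a function might look like is given here; the name sat_add32 is hypothetical and chosen only for illustration, and the assumption is that _sadd() saturates a 32-bit signed sum to the largest or smallest representable value.

```c
#include <stdint.h>

/* Portable sketch of 32-bit saturated addition.
 * sat_add32 is a hypothetical helper name, not a TI API.
 * On a C6000 target, the assumption is that _sadd(a, b)
 * produces the same result with a single instruction. */
int32_t sat_add32(int32_t a, int32_t b)
{
    /* Positive overflow: the true sum would exceed INT32_MAX. */
    if (b > 0 && a > INT32_MAX - b) {
        return INT32_MAX;
    }
    /* Negative overflow: the true sum would fall below INT32_MIN. */
    if (b < 0 && a < INT32_MIN - b) {
        return INT32_MIN;
    }
    /* No overflow is possible here, so the plain addition is safe. */
    return a + b;
}
```

The two range checks and the branches are what make the plain C version cost several cycles, which is the overhead the intrinsic is meant to remove.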
The C6000 has instructions, with corresponding intrinsics such as _add2(), _mpyhl(), and _mpylh(), whose operands are 16-bit values stored in the high and low halves of a 32-bit register. When a program operates on a series of short data, it can use a word access to read two short values at a time and then process them with the corresponding C6000 instructions. Similarly, on the C64x or C67x it is sometimes useful to issue a 64-bit LDDW to access two 32-bit values, four 16-bit values, or even eight 8-bit values at once.

3.2.4 Loop unrolling

Loop unrolling is another way to improve performance: the iterations of a small loop are expanded so that each iteration appears explicitly in the code. This increases the number of instructions that can execute in parallel. Loop unrolling helps when a single iteration does not make full use of the resources of the C6000 architecture. There are three ways to unroll a loop:

(1) The compiler unrolls the loop automatically;
(2) The UNROLL pragma is used in the program to suggest that the compiler unroll the loop;
(3) The user unrolls the loop by hand in the C/C++ code.

3.3 Assembly optimization

If the performance of the C/C++ code is still unsatisfactory after all the C/C++ optimization methods have been applied, a linear assembly program can be written and then optimized with the assembly optimizer to produce high-performance code.

3.3.1 Writing linear assembly

The C6000 profiling tools identify the most time-consuming part of the code, which is the part that should be rewritten in linear assembly. Linear assembly code is similar to assembly source code, but it carries no instruction latency or register usage information, so that the assembly optimizer can determine this information itself.
When writing linear assembly code, you need to know the assembly optimizer directives, the options that affect the behavior of the assembly optimizer, the TMS320C6000 instructions, the syntax of linear assembly source statements, how to specify registers or register groups, how to specify functional units, how to write source comments, and so on.

3.3.2 Assembly optimizer optimization

The main tasks of the assembly optimizer are:

(1) Schedule instructions to maximize the parallelism of the C6000;
(2) Ensure that instructions meet the latency requirements of the C6000;
(3) Allocate registers for the source code.

Optimizing C/C++ code for the C6000 series is much more convenient than traditional code optimization, but real experience and skill are still needed to get the full efficiency out of the chip. Developers need to be familiar with the hardware and also to have some understanding of how the compiler works. In addition, it is difficult at the C language level to reach the peak of the DSP chip, that is, eight instructions in parallel; in most cases only six or seven parallel instructions can be achieved. In actual development, if the optimization result has already reached six or seven parallel instructions but is still far from the real-time requirement, it is not economical to spend a great deal of effort striving for eight. At that point, other technical improvements or a change of strategy should be considered to reach the goal.
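As a supplement to section 3.2.4, the third unrolling method (unrolling by hand in C) can be illustrated with a 16-bit dot product. This is only a sketch under stated assumptions: the function names dot_simple and dot_unrolled are hypothetical, the element count is assumed to be even, and the code is plain portable C rather than anything specific to the TI toolchain.

```c
#include <stdint.h>

/* Straightforward loop: one 16-bit multiply-accumulate per iteration. */
int32_t dot_simple(const int16_t *x, const int16_t *y, int n)
{
    int32_t sum = 0;
    int i;
    for (i = 0; i < n; i++) {
        sum += (int32_t)x[i] * y[i];
    }
    return sum;
}

/* Hand-unrolled by two (hypothetical example; assumes n is even).
 * Each pass handles two adjacent elements with two independent
 * accumulators, so the two multiplies do not depend on each other
 * and a compiler is free to schedule them in parallel. */
int32_t dot_unrolled(const int16_t *x, const int16_t *y, int n)
{
    int32_t sum_even = 0;
    int32_t sum_odd = 0;
    int i;
    for (i = 0; i < n; i += 2) {
        sum_even += (int32_t)x[i] * y[i];
        sum_odd  += (int32_t)x[i + 1] * y[i + 1];
    }
    return sum_even + sum_odd;
}
```

Both functions return the same value for an even n. Because each pass of the unrolled loop touches two adjacent 16-bit elements, it is also the natural shape for the word-wide accesses described in section 3.2.3. Whether the unrolled form actually runs faster depends on the compiler and optimization level, so it should be checked with the profiling tools.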