Although the C6000 series DSP launched by Texas Instrument has significantly improved signal processing capabilities, the continuous improvement in information processing capability requirements has made the optimization of DSP programs increasingly a very important link in DSP development work. This article discusses the transplantation and optimization strategies and techniques of the Viterbi algorithm for 2Mbps video data streams.
1 Introduction to the principle of Viterbi algorithm
The Viterbi decoding algorithm is a maximum likelihood decoding method proposed by Viterbi in 1967. The decoder tries to find the correct original code sequence according to the maximum likelihood criterion based on the received sequence R. With the development of large-scale integrated circuit technology, convolutional coding technology using the Viterbi algorithm has become a widely used error correction scheme. The Viterbi decoding process can be represented by a state diagram. Figure 1 shows the state transition diagram of two states. Sj,t and Sj+N/2,t represent the two states at time t. At time t+1, these two state values are transferred to states S2j,t+1 and S2j+1,t+1 according to the path being 0 or 1. For each possible state transition, the path metric is calculated based on the received noisy sequence R, and then the minimum metric path (survivor path) of each state is selected. The Viterbi algorithm is to look back L steps forward by finding the minimum path in the state diagram, and the final result is the decoding output. In the convolutional code (n, k, m) notation, the parameter k represents the number of input information codes for each time, n represents the number of output convolutional codes for encoding, and m is called the constraint length (k=m is used in some books +1 is the constraint length, which can also be called (2, 1, 2) code grid diagram. r=k/n is called the information rate, that is, coding efficiency. This article uses (2, 1, 3) code, which is about the speed The length is 2 and the number of states is 2 2=-4.
2 Introduction to target processors
The TMS320C6000 series of DSPs (digital signal processors) is a parallel processing digital signal processor launched by TI, which is based on TI's VLIW technology. This article uses TMS320C6211. The operating frequency of the processor can reach 150MHz after being multiplied, and each clock cycle can execute up to 8 instructions in parallel, thus achieving 1200MIPS fixed-point computing capability. The C6000 series CPU adopts Harvard architecture, with its program bus and data bus separated, and instruction fetching and execution can run in parallel. The program bus width is 256 bits. Each instruction fetch operation fetches 8 instructions, which is called an instruction fetch packet. Each instruction occupies 1 functional unit during execution. The instruction fetch, instruction distribution and instruction decoding units all have the ability to read and transfer eight 32-bit instructions per cycle. The C6000 series CPU has 2 similar data channels A and B for data processing. Each channel has 4 functional units (.L, .S, .M, .D) and 1 group includes 16 (C64 has 32 A) general-purpose register set of 32-bit registers, each functional unit completes certain arithmetic or logical operations.
The special structure of C6000 enables multiple instructions to be processed in different functional units overlappingly, greatly improving the processing capability of the microprocessor. In addition, in terms of its CPU hardware structure, the C6000 pipeline is divided into three stages: fetching, decoding, and executing. Each stage contains several beats. Pipeline processing allows different execution stages of several instructions to be executed in parallel, thereby greatly improving program running speed.
3 Algorithm programming implementation and optimization
According to the C6000 software programming process, the programming and optimization of the Viterbi algorithm can be divided into three stages. These three stages are: developing C code, optimizing C code, and writing linear assembly code. In the process of code writing and optimization, these three stages do not have to be passed. As long as the function and performance requirements of the algorithm code have been met at a certain stage, there is no need to continue to the following stages. ①Develop C code. This stage is completely based on the task requirements to complete the algorithm code writing work. Compile and functionally verify the code under CCS (Code Composer Studio), the integrated development environment of C6000, and then use CCS debugging tools (such as Profiler) to find out the most time-consuming and most time-consuming parts of the program by setting breakpoints in the program. The code segment that most affects overall performance. To improve code performance, proceed to the next stage. The following is the core loop that completes the algorithm function in the Viterbi algorithm code for (2, 1, 3) codes. It is also the most time-consuming and inefficient section that affects the overall performance of the code. for(c=0;c
Therefore, while considering the influencing factors, the loop code of the Viterbi algorithm is adjusted as follows;
*Use inline functions (intrinsics) to replace complex C language programs. The C6000 compiler provides many intrinsics to quickly optimize C code. Intrinsics are inline functions that directly participate in C6000 assembly instruction mapping. _extu(x,y,z) is used here to simplify the hamm code part.
*Although software pipeline loops can contain intrinsics, they cannot contain function calls. Therefore, the calling function hamm needs to be expanded and implemented in a loop.
*Since the compiler only performs pipelining on the innermost loop, in order to improve performance, you should create a larger inner loop as much as possible. As you can see in the code, the innermost loop is the two loops of i. Only pipelining it will not greatly improve the performance of the entire code. So one idea is to expand all the i and j loops so that the compiler directly faces the largest C loop to maximize the role of software pipelining.
*In addition, if the variables in the code after unrolling the loop can determine their running values, try to substitute them with real values. This reduces the number of variables, which also reduces the number of registers that need to be allocated (there are 32 in the C62xxCPU register). After making the above adjustments, I ran the code and tested and developed it, but the performance did not improve much. I used the compiler feedback table (feedback) to observe and found that the loop did not flow. Why is this? It turns out that after expanding the inner loop, the code size in the C loop is too large, and the number of required registers is larger than the 32 registers of C62XX, so software pipeline cannot be carried out. In order to solve this problem, the loop needs to be simplified or split into several smaller loops. Here, first expand the small loop inside the C loop, and then split it into two loops that complete metric calculation and cumulative metric comparison respectively, thus reducing the code size in each loop. Due to space limitations, only the loop code for cumulative metric comparison is written here. /*Complete the loop for cumulative metric comparison*/ accum00=accum_err_metric[0];accum10=accum_err_metric[1]; accum20=accum_err_metric[2];accum30=accum_err_metric[3]; for(c=0;c
*Selection of compiler optimization options. The C6000 C/C++ compiler provides a large number of compilation options for users to choose from during compilation. Some of these options directly affect or control the compiler optimization process, thus affecting the code optimization performance of the compiled output. Choosing the appropriate options can greatly improve optimization performance. The optimization options used here are: -03 - Indicates that the highest degree of optimization can be obtained. The compiler will perform various methods to optimize loops, such as software pipelining, loop unrolling, etc. -pm - Try to use the -pm option in conjunction with the -o3 option for optimization. -pm is a program-level optimization that allows the optimizer to access the entire program and understand the number of loops. -op1—External variables are used, but no external function calls are used. -g - Enables symbolic debugging and assembly source statement debugging. In addition, there are many considerations and optimization debugging methods, such as eliminating memory dependencies and using wide-length memory access for short word length data. Due to space limitations, we cannot list them all here. For detailed information, please refer to the TMS320C6000 Optimizing C Compiler User's Guide in the TMS320C6000 Code Composer Studio Manuals. Test results: After the above optimization, the running time (clock cycle) has been reduced to 406, the performance of the code has been greatly improved, and it has met the system requirements. ③ From the above, it can be seen that the main code that affects performance in the program is usually a loop. A better way to optimize a loop is to extract the loop and make it a separate file , rewrite, recompile and run it separately. In order to improve code performance, the key C code segments that affect speed can be rewritten in linear assembly. The efficiency after optimization using the assembly optimizer is very high. If the code performance is still If the requirements are not met, you can proceed to the third stage, extract it, write it all in linear assembly, and call the rewritten part in the form of a function in the code. The format of rewriting the loop code segment into linear assembly to call the function is as follows: . global_KernelLoop; Add _ before the function name definition _KernelLoop:.cproc channel_data, branch_metric_array, depunc; Define the entry parameter variable.reg c,q0,q1,y1,y2,x1,x2,cc,temp,temp0,temp1,temp2,temp3 .reg counter, valuel, value2, value3 defines the intermediate variable no_mdep; indicates that the memory address is not relevant zero c; initializes the variable zero cc loop: .trip 24; declares a loop 24 times; runs the statement [counter]sub counter, 23, counter; loop Count[counter]b loop; loop jump.return; complete return.endproc;
Finish
Writing linear assembly requires a lot of work, the development cycle is long, and it cannot be transplanted to other types of DSPs like C language programs, so try to complete the work in the first and second stages. If the performance requirements still cannot be met, the key code segments will be rewritten in linear assembly. Conclusion This article has implemented the optimization of the Viterbi decoding algorithm for (2, 1, 3) convolutional codes on TI's TMS320C6211 hardware platform, which meets the system's requirements for real-time processing of 2Mb/s video data streams. When processing 1Kb data, the entire code takes about 2100 clock cycles to run, and the DSP resource occupancy rate is less than 40%. At present, with the continuous breakthroughs in theoretical technology, especially the introduction of new generation technical standards such as real-time image compression technology such as H.264, how to use high-speed DSP to develop and implement complex algorithms has become the focus of research. Therefore, this article uses the Viterbi algorithm as an example to introduce the programming optimization of TMS320C6000, which is highly practical.
Previous article:DSP-based power line carrier OFDM modem
Next article:Design of IDE interface emulator for TMS320F240
- Popular Resources
- Popular amplifiers
- Huawei's Strategic Department Director Gai Gang: The cumulative installed base of open source Euler operating system exceeds 10 million sets
- Analysis of the application of several common contact parts in high-voltage connectors of new energy vehicles
- Wiring harness durability test and contact voltage drop test method
- Sn-doped CuO nanostructure-based ethanol gas sensor for real-time drunk driving detection in vehicles
- Design considerations for automotive battery wiring harness
- Do you know all the various motors commonly used in automotive electronics?
- What are the functions of the Internet of Vehicles? What are the uses and benefits of the Internet of Vehicles?
- Power Inverter - A critical safety system for electric vehicles
- Analysis of the information security mechanism of AUTOSAR, the automotive embedded software framework
Professor at Beihang University, dedicated to promoting microcontrollers and embedded systems for over 20 years.
- Innolux's intelligent steer-by-wire solution makes cars smarter and safer
- 8051 MCU - Parity Check
- How to efficiently balance the sensitivity of tactile sensing interfaces
- What should I do if the servo motor shakes? What causes the servo motor to shake quickly?
- 【Brushless Motor】Analysis of three-phase BLDC motor and sharing of two popular development boards
- Midea Industrial Technology's subsidiaries Clou Electronics and Hekang New Energy jointly appeared at the Munich Battery Energy Storage Exhibition and Solar Energy Exhibition
- Guoxin Sichen | Application of ferroelectric memory PB85RS2MC in power battery management, with a capacity of 2M
- Analysis of common faults of frequency converter
- In a head-on competition with Qualcomm, what kind of cockpit products has Intel come up with?
- Dalian Rongke's all-vanadium liquid flow battery energy storage equipment industrialization project has entered the sprint stage before production
- Allegro MicroSystems Introduces Advanced Magnetic and Inductive Position Sensing Solutions at Electronica 2024
- Car key in the left hand, liveness detection radar in the right hand, UWB is imperative for cars!
- After a decade of rapid development, domestic CIS has entered the market
- Aegis Dagger Battery + Thor EM-i Super Hybrid, Geely New Energy has thrown out two "king bombs"
- A brief discussion on functional safety - fault, error, and failure
- In the smart car 2.0 cycle, these core industry chains are facing major opportunities!
- Rambus Launches Industry's First HBM 4 Controller IP: What Are the Technical Details Behind It?
- The United States and Japan are developing new batteries. CATL faces challenges? How should China's new energy battery industry respond?
- Murata launches high-precision 6-axis inertial sensor for automobiles
- Ford patents pre-charge alarm to help save costs and respond to emergencies
- Why Qorvo is leading the new era of Wi-Fi connectivity? You will know after reading this.
- Software defines IoT chips, and technology integration promotes LPWAN2.0 ubiquitous IoT (Part 1)
- About the power factor of switching power supply and its correction method
- EEWORLD University ---- RFID Training Series
- How to use jumper caps in circuits
- Calculate engine speed using the charging signal from the battery on a diesel engine
- Vicor engineers take you through power solutions for tethered, aerial/underwater drones
- 14. [Learning LPC1768 library functions] DAC experiment
- [Linux Touch Call] If you give up the market, why shouldn’t I grab the territory?
- Sensor Basics