DSP-based channel decoding algorithm optimization-EEWORLD

Collect

　　Although the C6000 series DSP launched by Texas Instrument has significantly improved signal processing capabilities, the continuous improvement in information processing capability requirements has made the optimization of DSP programs increasingly a very important link in DSP development work. This article discusses the transplantation and optimization strategies and techniques of the Viterbi algorithm for 2Mbps video data streams.

　　1 Introduction to the principle of Viterbi algorithm

　　The Viterbi decoding algorithm is a maximum likelihood decoding method proposed by Viterbi in 1967. The decoder tries to find the correct original code sequence according to the maximum likelihood criterion based on the received sequence R. With the development of large-scale integrated circuit technology, convolutional coding technology using the Viterbi algorithm has become a widely used error correction scheme. The Viterbi decoding process can be represented by a state diagram. Figure 1 shows the state transition diagram of two states. Sj,t and Sj+N/2,t represent the two states at time t. At time t+1, these two state values are transferred to states S2j,t+1 and S2j+1,t+1 according to the path being 0 or 1. For each possible state transition, the path metric is calculated based on the received noisy sequence R, and then the minimum metric path (survivor path) of each state is selected. The Viterbi algorithm is to look back L steps forward by finding the minimum path in the state diagram, and the final result is the decoding output. In the convolutional code (n, k, m) notation, the parameter k represents the number of input information codes for each time, n represents the number of output convolutional codes for encoding, and m is called the constraint length (k=m is used in some books +1 is the constraint length, which can also be called (2, 1, 2) code grid diagram. r=k/n is called the information rate, that is, coding efficiency. This article uses (2, 1, 3) code, which is about the speed The length is 2 and the number of states is 2 2=-4.

　　2 Introduction to target processors

　　The TMS320C6000 series of DSPs (digital signal processors) is a parallel processing digital signal processor launched by TI, which is based on TI's VLIW technology. This article uses TMS320C6211. The operating frequency of the processor can reach 150MHz after being multiplied, and each clock cycle can execute up to 8 instructions in parallel, thus achieving 1200MIPS fixed-point computing capability. The C6000 series CPU adopts Harvard architecture, with its program bus and data bus separated, and instruction fetching and execution can run in parallel. The program bus width is 256 bits. Each instruction fetch operation fetches 8 instructions, which is called an instruction fetch packet. Each instruction occupies 1 functional unit during execution. The instruction fetch, instruction distribution and instruction decoding units all have the ability to read and transfer eight 32-bit instructions per cycle. The C6000 series CPU has 2 similar data channels A and B for data processing. Each channel has 4 functional units (.L, .S, .M, .D) and 1 group includes 16 (C64 has 32 A) general-purpose register set of 32-bit registers, each functional unit completes certain arithmetic or logical operations.

　　The special structure of C6000 enables multiple instructions to be processed in different functional units overlappingly, greatly improving the processing capability of the microprocessor. In addition, in terms of its CPU hardware structure, the C6000 pipeline is divided into three stages: fetching, decoding, and executing. Each stage contains several beats. Pipeline processing allows different execution stages of several instructions to be executed in parallel, thereby greatly improving program running speed.

　　3 Algorithm programming implementation and optimization

　　According to the C6000 software programming process, the programming and optimization of the Viterbi algorithm can be divided into three stages. These three stages are: developing C code, optimizing C code, and writing linear assembly code. In the process of code writing and optimization, these three stages do not have to be passed. As long as the function and performance requirements of the algorithm code have been met at a certain stage, there is no need to continue to the following stages. ①Develop C code. This stage is completely based on the task requirements to complete the algorithm code writing work. Compile and functionally verify the code under CCS (Code Composer Studio), the integrated development environment of C6000, and then use CCS debugging tools (such as Profiler) to find out the most time-consuming and most time-consuming parts of the program by setting breakpoints in the program. The code segment that most affects overall performance. To improve code performance, proceed to the next stage. The following is the core loop that completes the algorithm function in the Viterbi algorithm code for (2, 1, 3) codes. It is also the most time-consuming and inefficient section that affects the overall performance of the code. for(c=0;caccum_err_metric[j][0]+branch_metric] {accum_err_metric[nextstate[j][i][1]=accum_err_metric[j][0]+branch_metri; state_history[nextstate[j][i]][sh_ptr]=j; } }*/end of i<2*/ }/*end of j=0;i--) hamm+=(target_vector>>i)%26;amp;0x01; return hamm; } After verifying the function of the algorithm code and testing the performance of the code by setting breakpoints, this loop consumes Hours (clock cycles) are 1790. Obviously, if the performance cannot meet the requirements, it is necessary to enter the second stage of code optimization. ②Generally, in code debugging, the loop code segment has the greatest impact on performance. Software pipelining is a technology used to arrange the execution of instructions within a loop, making full use of resources such as CPU functional units as much as possible, so that multiple iterations of the loop can be executed in parallel. In the C/C++ compiler of C6000, the use of software pipeline to optimize the compiled program code is a core technology. Therefore, before further optimization, it is necessary to adjust and simplify the structure of the code as much as possible and remove factors that affect software pipeline so that it can be fully pipelined by the compiler. This is very important to greatly improve the performance of the entire code.

　　Therefore, while considering the influencing factors, the loop code of the Viterbi algorithm is adjusted as follows;

　　 *Use inline functions (intrinsics) to replace complex C language programs. The C6000 compiler provides many intrinsics to quickly optimize C code. Intrinsics are inline functions that directly participate in C6000 assembly instruction mapping. _extu(x,y,z) is used here to simplify the hamm code part.

　　*Although software pipeline loops can contain intrinsics, they cannot contain function calls. Therefore, the calling function hamm needs to be expanded and implemented in a loop.

　　*Since the compiler only performs pipelining on the innermost loop, in order to improve performance, you should create a larger inner loop as much as possible. As you can see in the code, the innermost loop is the two loops of i. Only pipelining it will not greatly improve the performance of the entire code. So one idea is to expand all the i and j loops so that the compiler directly faces the largest C loop to maximize the role of software pipelining.

　　*In addition, if the variables in the code after unrolling the loop can determine their running values, try to substitute them with real values. This reduces the number of variables, which also reduces the number of registers that need to be allocated (there are 32 in the C62xxCPU register). After making the above adjustments, I ran the code and tested and developed it, but the performance did not improve much. I used the compiler feedback table (feedback) to observe and found that the loop did not flow. Why is this? It turns out that after expanding the inner loop, the code size in the C loop is too large, and the number of required registers is larger than the 32 registers of C62XX, so software pipeline cannot be carried out. In order to solve this problem, the loop needs to be simplified or split into several smaller loops. Here, first expand the small loop inside the C loop, and then split it into two loops that complete metric calculation and cumulative metric comparison respectively, thus reducing the code size in each loop. Due to space limitations, only the loop code for cumulative metric comparison is written here. /*Complete the loop for cumulative metric comparison*/ accum00=accum_err_metric[0];accum10=accum_err_metric[1]; accum20=accum_err_metric[2];accum30=accum_err_metric[3]; for(c=0;cadd2){accum00=add2;state_history[0][sh_ptr]=0;} else{accum00=add1;state_history[0][sh_ptr]=1;} if(add3>add4){accum20=add4;state_history[2] [sh_ptr]=0;} else{accum20=add3;state_history[2][sh_ptr]=1;} if(add5>add6){accum30=add6;state_history[3][sh_ptr]=2;} else{accum30= add5;state_history[3][sh_ptr]=3;} if(add7>add8){accum10=add8;state_history[1][sh_ptr]=2;} else{accum10=add7;state_history[1][sh_ptr]=3 ;} } accum_err_metric[0]=accum00;accum_err_metic[1]=accum10; accum_err_metric[2]=accum20;accum_err_metric[3]=accum30; where accum_err_metric[i] is the cumulative metric value of state i, branch_metric_array[][] is For the calculated magnitude at each moment, the two-dimensional digital mextstate[j][i] in the original code is substituted with a real value. In addition, you should pay attention to one thing when considering programming: the command to fetch data (load) in the program is very time-consuming, so you should consider minimizing the operations on the data array as much as possible. In the improvement of the above program, first remove the cumulative measurement value to be processed by the loop from the array, then use accumXX and addX as the intermediate variables of each iteration, and put the final result into the data after the loop. This greatly reduces the number of operations on the array, further improving optimization.

　　*Selection of compiler optimization options. The C6000 C/C++ compiler provides a large number of compilation options for users to choose from during compilation. Some of these options directly affect or control the compiler optimization process, thus affecting the code optimization performance of the compiled output. Choosing the appropriate options can greatly improve optimization performance. The optimization options used here are: -03 - Indicates that the highest degree of optimization can be obtained. The compiler will perform various methods to optimize loops, such as software pipelining, loop unrolling, etc. -pm - Try to use the -pm option in conjunction with the -o3 option for optimization. -pm is a program-level optimization that allows the optimizer to access the entire program and understand the number of loops. -op1—External variables are used, but no external function calls are used. -g - Enables symbolic debugging and assembly source statement debugging. In addition, there are many considerations and optimization debugging methods, such as eliminating memory dependencies and using wide-length memory access for short word length data. Due to space limitations, we cannot list them all here. For detailed information, please refer to the TMS320C6000 Optimizing C Compiler User's Guide in the TMS320C6000 Code Composer Studio Manuals. Test results: After the above optimization, the running time (clock cycle) has been reduced to 406, the performance of the code has been greatly improved, and it has met the system requirements. ③ From the above, it can be seen that the main code that affects performance in the program is usually a loop. A better way to optimize a loop is to extract the loop and make it a separate file , rewrite, recompile and run it separately. In order to improve code performance, the key C code segments that affect speed can be rewritten in linear assembly. The efficiency after optimization using the assembly optimizer is very high. If the code performance is still If the requirements are not met, you can proceed to the third stage, extract it, write it all in linear assembly, and call the rewritten part in the form of a function in the code. The format of rewriting the loop code segment into linear assembly to call the function is as follows: . global_KernelLoop; Add _ before the function name definition _KernelLoop:.cproc channel_data, branch_metric_array, depunc; Define the entry parameter variable.reg c,q0,q1,y1,y2,x1,x2,cc,temp,temp0,temp1,temp2,temp3 .reg counter, valuel, value2, value3 defines the intermediate variable no_mdep; indicates that the memory address is not relevant zero c; initializes the variable zero cc loop: .trip 24; declares a loop 24 times; runs the statement [counter]sub counter, 23, counter; loop Count[counter]b loop; loop jump.return; complete return.endproc;

　　Finish

　　Writing linear assembly requires a lot of work, the development cycle is long, and it cannot be transplanted to other types of DSPs like C language programs, so try to complete the work in the first and second stages. If the performance requirements still cannot be met, the key code segments will be rewritten in linear assembly. Conclusion This article has implemented the optimization of the Viterbi decoding algorithm for (2, 1, 3) convolutional codes on TI's TMS320C6211 hardware platform, which meets the system's requirements for real-time processing of 2Mb/s video data streams. When processing 1Kb data, the entire code takes about 2100 clock cycles to run, and the DSP resource occupancy rate is less than 40%. At present, with the continuous breakthroughs in theoretical technology, especially the introduction of new generation technical standards such as real-time image compression technology such as H.264, how to use high-speed DSP to develop and implement complex algorithms has become the focus of research. Therefore, this article uses the Viterbi algorithm as an example to introduce the programming optimization of TMS320C6000, which is highly practical.

Reference address：DSP-based channel decoding algorithm optimization

Previous article：DSP-based power line carrier OFDM modem
Next article：Design of IDE interface emulator for TMS320F240