C6000 DSP software optimization experience summary and sharing

Jacktang

1. Common options for c6x compilation

(I) The c6x compiler is "cl6x.exe"

Usage: cl6x [options] [filenames]

cl6x: the compiler
options: compiler options
filenames: C or assembly source files

Description: Compiler options are one or two letters long and are not case sensitive. Each option must be preceded by a "-". One-letter options can be combined: for example, "-sgq" is the same as "-s -g -q". Two-letter options can also be combined if they share the same first letter: for example, "-mgt" is the same as "-mg -mt".

(II) Optimization options

-mt: Tells the compiler that the program does not use aliasing techniques, which lets the compiler optimize better.
-o3: The strongest file-level optimization. This option should generally be used when compiling. In some cases, however, a program optimized with this option produces errors (the same happens with -o2, but not with -o0 or -o1), possibly because something goes wrong when the compiler optimizes loops and builds the software pipeline. If this happens, add the -g option as well: the errors disappear, but the optimization becomes less effective. Alternatively, rewrite the affected expressions to avoid the compiler error.
-pm: Program-level optimization. All files are combined and optimized together, which mainly removes functions that are never called, variables that are always constant, and function return values that are never used. It is recommended that programmers do this optimization work themselves. Compiling with this option under win98 may fail with the compiler not being found.
-ms0: Do not generate redundant loops, which reduces program size. In general this option has no obvious effect on program size.
-mh[n]: Remove the epilog of the software pipeline to reduce program size. This option is more effective.
However, it can cause reads from addresses outside the valid range, so you need to add some padding at the beginning and end of the data section, or make sure when allocating memory that the addresses just before and after the array are valid. The optional parameter n gives the length of this padding in bytes.

(III) Options for retaining compilation and optimization information

-k: Keep the assembly language file generated after optimization.
-s: Add optimizer comments to the assembly language file. If there are none, the C source statements are added as comments instead.
-mw: Add software pipeline information to the assembly language file.

(IV) Options for debugging and profiling

-g: Allow symbolic debugging. The "out" file includes symbol information and line number information, so debugging and profiling can be done at the C language level. Using -g, -mt and -o3 together gives the strongest optimization that still allows symbolic debugging.
-mg: Allow profiling of optimized programs. The "out" file includes symbol information and very little line number information, which allows profiling at the C function level.
If -g and -mg are used together, the -g option may be ignored and the result is the same as using -mg alone.

(V) Other options

-mln: Generate a program for the large memory model, where n selects the variant:
-ml0: By default, aggregate variables (arrays and structures) are treated as far.
-ml1: By default, all functions are treated as far.
-ml2: Equal to -ml0 plus -ml1.
-ml3: By default, all data and all functions are treated as far.

(VI) Recommended compilation methods

cl6x -gk -mt -o3 -mw -ss "filename"
Method 1 is used for program debugging. It optimizes strongly and supports symbolic debugging, and no errors are introduced during compilation. Because the generated "out" file contains symbol information and line number information, it is relatively large.

cl6x -k -mgt -o3 -mw -ss "filename"
Method 2 is used for program profiling.
This method has the strongest optimization capabilities (the same as method 3 in most cases) and supports profiling. The "out" file contains only symbol information and a small amount of line number information, so it is relatively small.

cl6x -k -mt -o3 -mw -ss "filename"
Method 3 is used for the final release version of the program. It gives the strongest optimization and leaves out all symbol and line number information, so the "out" file is relatively small.

A program made up of several files should have a makefile that holds the compilation parameters and records the version number of the compiler used.

(VII) Linker options

-heap: specify the size of the heap
-stack: specify the size of the stack
Linker options should be placed in the "cmd" file.

2. Summary of the optimization of double loops and multiple loops

Double and multiple loops look complicated, but the way to optimize them is relatively simple and comes down to one word: "split". Once the loops have been taken apart, the nest becomes a single-level loop and can be optimized like any ordinary single loop.

The characteristic of nested loops is that the optimizer builds a software pipeline only for the innermost loop, so the loop nest as a whole cannot make full use of the C6x software pipeline. When the inner loop has only a few iterations, the cycles spent in the prolog and epilog are no longer negligible. In this case, consider splitting the nested loops into a single loop. Either the outer loop or the inner loop can be split, depending on the situation. This makes full use of the pipeline built by the optimizer.
For example:

    void fir2(const short input[], const short coefs[], short out[])
    {
        int i, j;
        int sum;

        for (i = 0; i < 40; i++)
        {
            sum = 0;    /* restart the accumulator for each output sample */
            for (j = 0; j < 16; j++)
                sum += coefs[j] * input[i + 15 - j];
            out[i] = (sum >> 15);
        }
    }

The inner loop has few iterations and little computation. In terms of resources it occupies only one multiplier, and that multiplier is used only once per cycle, although two multipliers could be used in every cycle. Unrolling the inner loop completely gives:

    void fir2_u(const short input[], const short coefs[], short out[])
    {
        int i;
        int sum;

        for (i = 0; i < 40; i++)
        {
            sum  = coefs[0]  * input[i + 15];
            sum += coefs[1]  * input[i + 14];
            sum += coefs[2]  * input[i + 13];
            sum += coefs[3]  * input[i + 12];
            sum += coefs[4]  * input[i + 11];
            sum += coefs[5]  * input[i + 10];
            sum += coefs[6]  * input[i + 9];
            sum += coefs[7]  * input[i + 8];
            sum += coefs[8]  * input[i + 7];
            sum += coefs[9]  * input[i + 6];
            sum += coefs[10] * input[i + 5];
            sum += coefs[11] * input[i + 4];
            sum += coefs[12] * input[i + 3];
            sum += coefs[13] * input[i + 2];
            sum += coefs[14] * input[i + 1];
            sum += coefs[15] * input[i + 0];
            out[i] = (sum >> 15);
        }
    }

Although the code is longer, it is now a single loop and all of the operations are inside the software pipeline. The generated piped loop kernel uses two multipliers in every cycle, which makes full use of the DSP's internal resources and improves efficiency.

Another example:

    tot = 4;
    for (k = 0; k < 4; k++)
    {
        max = 0;
        for (i = k; i < 44; i += STEP)
        {
            s = 0;
            for (j = i; j < 44; j++)
                s = L_mac(s, x[j], h[j - i]);
            y32[i] = s;
            s = L_abs(s);
            if (L_sub(s, max) > (Word32) 0)
                max = s;
        }
        tot = L_add(tot, L_shr(max, 1));
    }

This loop nest has three levels. The innermost loop does very little work: a single multiply-accumulate. As we know, the C6x can perform two multiply-accumulate operations in one execute packet.
Therefore, to increase the amount of work in the inner loop and reduce the number of loop levels, we can take the first (outermost) loop apart and move the work it is responsible for into the inner loop, that is, perform four multiply-accumulate operations per inner-loop iteration. With several operations in the pipeline at once, efficiency improves. The optimized C code is as follows:

    tot = 4;
    max0 = 0;
    max1 = 0;
    max2 = 0;
    max3 = 0;
    for (i = 0; i < 44; i += STEP)    // STEP = 4, 11 iterations
    {
        // code
        for (j = 0; j <= 40 - i; j++)
        {
            s0 = (Word32)(_sadd(s0, _smpy(hh[j], xx[j + i])));
            s1 = (Word32)(_sadd(s1, _smpy(hh[j], xx[j + i + 1])));
            s2 = (Word32)(_sadd(s2, _smpy(hh[j], xx[j + i + 2])));
            s3 = (Word32)(_sadd(s3, _smpy(hh[j], xx[j + i + 3])));
        }
    }
    // code

3. CCS optimization: converting 16-bit operations to 32-bit operations, using intrinsic functions, using const, etc.

(I) Source code:

    Word32 L_mpy_ll(Word32 L_var1, Word32 L_var2)
    {
        double aReg;
        Word32 lvar;

        /* (unsigned)low1 * (unsigned)low2 */
        aReg = (double)(0xffff & L_var1) * (double)(0xffff & L_var2) * 2.0;
        /* >> 16 */
        aReg = (aReg / 65536);
        aReg = floor(aReg);
        /* (unsigned)low1 * (signed)high2 */
        aReg += (double)(0xffff & L_var1) * ((double)L_shr(L_var2, 16)) * 2.0;
        /* (unsigned)low2 * (signed)high1 */
        aReg += (double)(0xffff & L_var2) * ((double)L_shr(L_var1, 16)) * 2.0;
        /* >> 16 */
        aReg = (aReg / 65536);
        aReg = floor(aReg);
        /* (signed)high1 * (signed)high2 */
        aReg += (double)(L_shr(L_var1, 16)) * (double)(L_shr(L_var2, 16)) * 2.0;
        /* saturate result.. */
        lvar = L_saturate(aReg);
        return(lvar);
    }

(II) Adapted code:

    static inline Word32 L_mpy_ll(Word32 L_var1, Word32 L_var2)
    {
        Word32 aReg_hh;
        Word40 aReg, aReg_ll, aReg_lh, aReg_hl;

        aReg_ll = (Word40)_mpyu(L_var1, L_var2) >> 16;
        aReg_lh = (Word40)_mpyluhs(L_var1, L_var2);
        aReg_hl = (Word40)_mpyhslu(L_var1, L_var2);
        aReg_hh = _smpyh(L_var1, L_var2);
        aReg = _lsadd(aReg_ll, _lsadd(aReg_lh, aReg_hl));
        aReg = _lsadd(aReg >> 15, aReg_hh);
        return(_sat(aReg));
    }

(III) Optimization method description:

The intrinsics provided by the C6000 compiler are a quick way to optimize C code. An intrinsic is written with a leading underscore and is called like an ordinary function, but the compiler maps it directly to inline C6000 instructions. In the source code of the example above no intrinsics are used, and each line of C code takes many instruction cycles. In the adapted code, each line takes roughly one instruction cycle. For example, in "aReg_ll = (Word40)_mpyu(L_var1, L_var2)>>16", "_mpyu" is an intrinsic that multiplies the low 16 bits of its two operands as unsigned numbers and returns the result. For the full list of intrinsics supported by the C6000 and what they do, see pages 265 and 266 of the book "Principles and Applications of TMS320C6000 Series DSP", which also gives other examples. The intrinsics are declared in the C6X.h file in the C6000\CGTOOLS\Include directory of the CCS installation.

The following example of using intrinsics to optimize C code is taken from the C6000 "Programmer's Guide".

Source code:

    int dotprod(const short *a, const short *b, unsigned int N)
    {
        int i, sum = 0;

        for (i = 0; i < N; i++)
            sum += a[i] * b[i];
        return sum;
    }

Adapted code:

    int dotprod(const int *a, const int *b, unsigned int N)
    {
        int i, sum1 = 0, sum2 = 0;

        for (i = 0; i < (N >> 1); i++)
        {
            sum1 += _mpy (a[i], b[i]);    /* product of the low 16-bit halves */
            sum2 += _mpyh(a[i], b[i]);    /* product of the high 16-bit halves */
        }
        return sum1 + sum2;
    }

Tips: Once the C code has been fully debugged, try rewriting as many statements as possible with intrinsics, especially inside loop bodies. This kind of adaptation can greatly reduce execution time.
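The adapted dotprod above can be sanity-checked on a PC before it is moved to the DSP. The following is only a minimal sketch in portable C. It assumes that _mpy multiplies the signed low 16-bit halves of its operands and that _mpyh multiplies the signed high halves, and that N is even. The names sext16, pack2, emu_mpy, emu_mpyh, dotprod_ref, dotprod_packed and dotprod_selfcheck are made up for this illustration and are not part of the TI toolchain.

```c
#include <assert.h>
#include <stdint.h>

/* Sign-extend the low 16 bits of v (portable, no implementation-defined casts). */
int32_t sext16(uint32_t v)
{
    v &= 0xFFFFu;
    return (v & 0x8000u) ? (int32_t)v - 0x10000 : (int32_t)v;
}

/* Pack two 16-bit samples into one 32-bit word: lo in bits 0..15, hi in bits 16..31. */
uint32_t pack2(int16_t lo, int16_t hi)
{
    return (uint32_t)(uint16_t)lo | ((uint32_t)(uint16_t)hi << 16);
}

/* Host-side stand-in for _mpy (assumed: signed low half times signed low half). */
int32_t emu_mpy(uint32_t a, uint32_t b)
{
    return sext16(a) * sext16(b);
}

/* Host-side stand-in for _mpyh (assumed: signed high half times signed high half). */
int32_t emu_mpyh(uint32_t a, uint32_t b)
{
    return sext16(a >> 16) * sext16(b >> 16);
}

/* Straightforward 16-bit dot product, one multiply per iteration. */
int32_t dotprod_ref(const int16_t *a, const int16_t *b, unsigned n)
{
    int32_t sum = 0;
    for (unsigned i = 0; i < n; i++)
        sum += (int32_t)a[i] * b[i];
    return sum;
}

/* Packed version: each iteration consumes two samples from each array. */
int32_t dotprod_packed(const uint32_t *a, const uint32_t *b, unsigned n)
{
    int32_t sum1 = 0, sum2 = 0;
    assert((n & 1u) == 0);              /* this sketch handles even n only */
    for (unsigned i = 0; i < (n >> 1); i++) {
        sum1 += emu_mpy(a[i], b[i]);    /* even-indexed samples */
        sum2 += emu_mpyh(a[i], b[i]);   /* odd-indexed samples */
    }
    return sum1 + sum2;
}

/* Returns 1 when the packed and reference versions agree on a small test vector. */
int dotprod_selfcheck(void)
{
    const int16_t a[4] = { 1, -2, 3, 4 };
    const int16_t b[4] = { 5, 6, -7, 8 };
    uint32_t pa[2], pb[2];
    for (unsigned i = 0; i < 2; i++) {
        pa[i] = pack2(a[2 * i], a[2 * i + 1]);
        pb[i] = pack2(b[2 * i], b[2 * i + 1]);
    }
    return dotprod_ref(a, b, 4) == dotprod_packed(pa, pb, 4);
}
```

Calling dotprod_selfcheck() from a small test program is enough to confirm that the two forms agree on the sample data. On the target, the emu_ functions would be replaced by the real intrinsics, and the short arrays would need to be word-aligned so that they can be read as 32-bit values.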