DSP Program Structure Programming Notes

Aguilera

If a program is written to suit the structure of the DSP, the C compiler's optimizer can greatly improve the performance of the algorithm at the source-code level [1-5]. The main points are as follows:

(1) Try to use int type intermediate variables

In image processing programs the data is usually 8 bits wide, while the internal registers and data paths of the C6x DSP are 32 bits wide. Filtering, convolution and similar image processing steps need many intermediate variables. If these are stored as 8-bit variables, the compiler is forced to insert extra instructions to adjust the data width. Using 32-bit (int) intermediate variables is therefore the most efficient choice.
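As a minimal sketch of this point, the following portable C function filters 8-bit pixels but keeps the running sum in an int, narrowing to 8 bits only when the result is stored. The function name smooth3 and the (1, 2, 1) weights are assumptions made for illustration, not part of the original notes.

```c
#include <stddef.h>

/* Hypothetical 3-tap smoothing filter with weights (1, 2, 1) over 8-bit
 * pixels. The first and last output pixels are left untouched. */
void smooth3(const unsigned char *in, unsigned char *out, size_t n)
{
    size_t i;

    if (n < 3)
        return;

    for (i = 1; i + 1 < n; i++) {
        /* 32-bit accumulator: no narrowing to 8 bits between steps. */
        int acc = in[i - 1] + 2 * in[i] + in[i + 1];

        /* Narrow once, on the store; +2 rounds before dividing by 4. */
        out[i] = (unsigned char)((acc + 2) >> 2);
    }
}
```

Declaring acc as unsigned char instead would force a truncation after every addition, which is exactly the extra data adjustment described above.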

(2) Use shift operations instead of division operations

① A shift is supported directly by the DSP hardware and completes in one instruction, whereas division is implemented in software, which is more complex and time-consuming.

② On the DSP, floating-point operations are usually carried out by calling sub-functions. This is inefficient, and the compiler cannot apply software pipelining to a loop that contains such calls. Shift operations can replace some floating-point operations on fixed values, for example:
Floating-point version:
for (I = 0; I < 1000; I++)
{
    b += 35 * 0.325;
}

Fixed-point version:
for (I = 0; I < 1000; I++)
{
    b += ((35 * 333) >> 10);
}

The above two programs perform the same function (333 is 0.325 × 1024 rounded to an integer, and >>10 divides by 1024). The first contains a floating-point multiplication and takes 222,775 time units to run; the second contains only a fixed-point multiplication and a shift and takes only 881. The figures are given in the source as seconds, but are most plausibly clock cycles. Either way, the difference in efficiency is very large.
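The constant 333 in the second loop can be derived rather than guessed. The sketch below, with the hypothetical helper names to_q10 and mul_q10, scales a fractional constant by 2^10 once, outside the loop, and then multiplies with integer operations only. It assumes non-negative operands and a product that fits in an int.

```c
/* Scale factor 2^10 = 1024, matching the >>10 in the example above. */
#define Q_SHIFT 10

/* Hypothetical helper: convert a non-negative fractional constant to a
 * Q10 integer, rounding to nearest. Evaluate it once, outside hot loops. */
int to_q10(double c)
{
    return (int)(c * (1 << Q_SHIFT) + 0.5);
}

/* Multiply x by a Q10 constant using only an integer multiply and a shift.
 * Assumes x >= 0 and that x * c_q10 does not overflow an int. */
int mul_q10(int x, int c_q10)
{
    return (x * c_q10) >> Q_SHIFT;
}
```

Here to_q10(0.325) gives 333, and mul_q10(35, 333) gives 11, the integer part of 35 × 0.325 = 11.375. For negative operands a right shift and a division round in different directions, so the substitution needs extra care there.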

(3) Use C6x intrinsic functions

The C6x compiler provides many intrinsic functions that map directly onto DSP instructions, each of which completes a fairly complex operation in a single cycle and so speeds up the code. For example, "saturated addition" written in plain C looks like this:
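int sadd(int a, int b)
{
    int result;
    result = a + b;
    /* Overflow is possible only when a and b have the same sign. */
    if (((a ^ b) & 0x80000000) == 0)
    {
        /* It occurred if the sign of the result differs from that of a. */
        if ((result ^ a) & 0x80000000)
        {
            /* Clamp to the most negative or most positive 32-bit value. */
            result = (a < 0) ? 0x80000000 : 0x7fffffff;
        }
    }
    return (result);
}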

All of this can be replaced by a single call to the intrinsic _sadd(a, b).
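One way to keep such code testable off the target is to hide the intrinsic behind a macro. The sketch below assumes the TI compiler's predefined macro _TMS320C6X to detect the target; the names SAT_ADD and sat_add_ref are hypothetical, and the fallback assumes a 32-bit int.

```c
#include <limits.h>

#ifdef _TMS320C6X
/* On the TI C6000 compiler, _sadd is a built-in intrinsic. */
#define SAT_ADD(a, b) _sadd((a), (b))
#else
/* Portable fallback for host builds (hypothetical helper name). */
static int sat_add_ref(int a, int b)
{
    /* The 64-bit sum of two 32-bit values cannot overflow. */
    long long wide = (long long)a + (long long)b;

    if (wide > INT_MAX)
        return INT_MAX;
    if (wide < INT_MIN)
        return INT_MIN;
    return (int)wide;
}
#define SAT_ADD(a, b) sat_add_ref((a), (b))
#endif
```

With this in place, SAT_ADD(INT_MAX, 1) stays at INT_MAX instead of wrapping around, on the host as well as on the DSP.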

(4) Use 32-bit data types to access 16-bit data

Because the C6x series DSPs have 32-bit registers and internal data paths, half of the register space and bandwidth is often wasted when operating on 16-bit data such as the short type. To make full use of these resources, the compiler provides intrinsics that operate on two 16-bit values at once, such as _add2 and _sub2.
Let short in1[] and short in2[] be two arrays of type short with N elements each, where N is even. The following loop adds their corresponding elements, two at a time:
for (i = 0; i < N; i += 2)
    _amem4(&sum[i]) = _add2(_amem4_const(&in1[i]), _amem4_const(&in2[i]));
Here _amem4_const and _amem4 perform aligned 32-bit loads and stores, so in1, in2 and sum must be aligned on 32-bit boundaries. Each iteration reads, adds and stores two short values at once, which improves efficiency.
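To make clear what a packed 16-bit add computes, here is a host-side model in portable C. The name add2_model is hypothetical. It is a sketch of the behavior, not TI's implementation: the two halves are added independently, and a carry out of the low half is discarded instead of spilling into the high half.

```c
#include <stdint.h>

/* Hypothetical host-side model of a packed 16-bit add. Each 32-bit word
 * holds two 16-bit values; the halves are added independently. */
uint32_t add2_model(uint32_t a, uint32_t b)
{
    /* Low half: keep only the low 16 bits of the sum, dropping the carry. */
    uint32_t lo = (a + b) & 0x0000FFFFu;

    /* High half: add the upper halves, then shift back into place.
     * Any carry out of bit 31 is lost by unsigned wraparound. */
    uint32_t hi = ((a >> 16) + (b >> 16)) << 16;

    return hi | lo;
}
```

For instance, add2_model(0x0001FFFF, 0x00010001) yields 0x00020000, whereas an ordinary 32-bit addition of the same words yields 0x00030000 because the low-half carry leaks upward.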

(5) Use the restrict keyword to eliminate memory dependencies

To improve code efficiency, the C6x compiler always tries to schedule as many instructions in parallel as possible. Whether instructions can run in parallel depends on the dependencies between them, and it is hard for the compiler to determine whether memory reads and writes are independent. Consider the following program:
void vecsum(short *sum, short *in1, short *in2, unsigned int N)
{
    int i;
    for (i = 0; i < N; i++)
        sum[i] = in1[i] + in2[i];
}

In this program, a store to sum may change the data that is later read through in1 and in2, so the compiler must assume that each store to sum has to complete before the next reads of in1 and in2 can be issued. This is called the "alias problem": because sum may point to the same memory as in1 or in2, reading the source data and writing the results cannot be done in parallel.
To let the compiler safely run the reads of the source data in parallel with the writes of the results, the restrict keyword can be used to declare that the array name (or pointer) is the only one that refers to that memory, as in the following program:
void vecsum(short *sum, short * restrict in1, short * restrict in2, unsigned int N)
{
    int i;
    for (i = 0; i < N; i++)
        sum[i] = in1[i] + in2[i];
}
This removes the memory dependency described above and improves pipeline efficiency.
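Note that restrict is a promise made by the programmer; the compiler does not check it. The sketch below pairs a restrict-qualified variant of the function with a debug helper that tests whether two buffers overlap. The names vecsum_r and shorts_overlap are hypothetical, and comparing addresses through uintptr_t assumes a flat address space.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical debug helper: returns 1 if the n-element short buffers
 * starting at p and q share any bytes, 0 otherwise. */
int shorts_overlap(const short *p, const short *q, size_t n)
{
    uintptr_t a = (uintptr_t)p;
    uintptr_t b = (uintptr_t)q;
    uintptr_t bytes = (uintptr_t)(n * sizeof(short));

    return a < b + bytes && b < a + bytes;
}

/* Variant with every pointer restrict-qualified: the caller promises
 * that sum, in1 and in2 refer to non-overlapping memory. */
void vecsum_r(short *restrict sum, const short *restrict in1,
              const short *restrict in2, unsigned int n)
{
    unsigned int i;

    for (i = 0; i < n; i++)
        sum[i] = (short)(in1[i] + in2[i]);
}
```

In a debug build, a caller could assert that shorts_overlap returns 0 for each pair of buffers before calling vecsum_r, since passing overlapping buffers to a restrict-qualified function is undefined behavior.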

(6) Software Pipeline Optimization

In program optimization, loops are usually the most time-consuming part. Software pipelining is a technique for optimizing loops so that the instructions inside the loop body execute in parallel as far as possible. Selecting the -o2 or -o3 switch of the C6201 compiler turns on software pipelining.

How to form an effective software pipeline: with the -o2 or -o3 option selected, the compiler pipelines loops automatically. To help it build an efficient pipeline, use the MUST_ITERATE pragma to give it information about the loop trip count, and unroll loops where appropriate. Keep the following limits in mind:
① Software pipelining is performed only on the innermost loop.
② A loop body that contains too much code may not be pipelined.
③ A loop body whose code is too complicated may not be pipelined.
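A minimal sketch of the MUST_ITERATE usage mentioned above follows. The function dotp and the particular trip-count values are assumptions chosen for illustration. The pragma is guarded by the TI compiler's predefined macro _TMS320C6X so that the same file still compiles on a host compiler.

```c
/* Hypothetical dot product. The caller is assumed to guarantee that n is
 * between 8 and 1024 and a multiple of 4, as the pragma states. */
int dotp(const short *a, const short *b, int n)
{
    int i;
    int sum = 0;

#ifdef _TMS320C6X
    /* TI-specific: minimum trip count 8, maximum 1024, multiple of 4. */
    #pragma MUST_ITERATE(8, 1024, 4)
#endif
    for (i = 0; i < n; i++)
        sum += a[i] * b[i];

    return sum;
}
```

The loop is a single innermost loop with a short, simple body and no function calls, which is the shape that the three points above favor. If a caller breaks the stated trip-count guarantee, the generated code may be incorrect, so the values must reflect what the program really does.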

In a fixed-point DSP, floating-point operations are implemented in software as calls to sub-functions. A loop that performs floating-point operations on a fixed-point DSP therefore cannot be software pipelined. The solution is to scale the floating-point numbers to fixed point by hand. For a floating-point operation with known operands whose result may be an integer, the multiplication or division can be implemented as a multiplication followed by a shift. For example, 35×0.25 can be converted into (35×1)>>2.
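The same idea extends to whole loops. The sketch below stores a gain in Q15 format (the stored integer divided by 32768 is the real value) so that the loop body contains only an integer multiply and a shift. The function name apply_gain_q15 and the choice of Q15 are assumptions for illustration, and the shift is assumed to act on non-negative products.

```c
#include <stdint.h>

/* Hypothetical example: scale each sample by a gain held in Q15, so the
 * loop contains no floating-point library calls. */
void apply_gain_q15(const int16_t *in, int16_t *out, int n, int16_t gain_q15)
{
    int i;

    for (i = 0; i < n; i++) {
        /* Q0 sample times Q15 gain gives a Q15 product in 32 bits. */
        int32_t prod = (int32_t)in[i] * gain_q15;

        /* Shift right by 15 to return to Q0. */
        out[i] = (int16_t)(prod >> 15);
    }
}
```

A gain of 0.25 is stored as 8192 (0.25 × 32768). Applied to the sample 35 this gives 8, matching (35×1)>>2 from the example above.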