1. Compilation Feedback and Optimization
The TI C6000 compiler can optimize code in many ways to increase execution speed and reduce code size. These optimizations include loop simplification, software pipelining, reordering of statements and expressions, and allocating variables to registers.
(1) -O3: the highest level of optimization. The compiler applies every optimization technique available to it, such as software pipelining, loop optimization, loop unrolling, and reordering of function declarations.
(2) -ms: When you want to reduce code size, you do not need to lower the optimization level; the -ms option controls code size instead. With -O2 or -O3 the compiler optimizes mainly for performance; adding the code-size flag -msn makes it pay more attention to code size. Different levels n trade code performance against size, and combining a high -O level with a high -ms level yields the smallest code.
(3) -pm: Program-level optimization. All source files are compiled into one intermediate file, called a module, and the compiler optimizes and generates code for that module as a whole. Because the compiler can "see" the entire program, it can perform optimizations that file-level optimization cannot. For example, if a function parameter always receives the same value, the compiler substitutes that value directly instead of passing the parameter.
(4) -mt: Asserts that variables are not accessed through aliases. It tells the compiler that it may assume pointers to distinct objects in the code are unique, which allows more aggressive optimization.
(5) -mv6400+: Generate code optimized for the C64x+ CPU architecture (see the example invocation below).
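As an illustration, these options can be combined on one compiler command line. The following is a sketch only: cl6x is TI's C6000 compiler driver, ehmm.c is a hypothetical source file, and the option spellings follow the TI compiler manual (where the optimization level is written in lowercase):

    cl6x -mv6400+ -o3 -ms1 -pm -mt ehmm.c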
2. Loop Optimization
Generally speaking, a program spends most of its execution time in loops, so improving loop performance is decisive for the speed of the whole program. Software pipelining, built on the parallel architecture of the C6000, is a technique that schedules loop instructions so that multiple iterations of the loop execute in parallel. It is normally applied only to the innermost loop. If the inner loop is too short and its trip count too small, the software pipeline begins draining before it ever fills, and the parallelism of the CPU cannot be fully exploited. In the second step of the EHMM algorithm, the calculation of observation probability (the EstimateObsProb function), there are 5 nested loops and the amount of computation is very large. The innermost loop iterates over substates, of which an EHMM has only 3 or 6 — a very small trip count — while the next loop out iterates over the observations of each row. For a 100×100 image with a scanning-window step of 2, each row yields about 50 observations, far more than the number of substates. We therefore interchanged these loops so that the innermost loop runs over the observations of each row, which yielded a speedup of about 7%.
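A minimal sketch of the interchange (the names, dimensions and per-element computation are simplified stand-ins for the real five-level loop nest):

    #define NUM_SUBSTATES 6
    #define NUM_OBS       50

    /* Before: the short substate loop (3 or 6 iterations) is innermost,
       so the software pipeline never fills. */
    void obs_prob_before(const int feat[NUM_OBS], const int weight[NUM_SUBSTATES],
                         int prob[NUM_OBS][NUM_SUBSTATES])
    {
        int obs, s;
        for (obs = 0; obs < NUM_OBS; obs++)
            for (s = 0; s < NUM_SUBSTATES; s++)
                prob[obs][s] = feat[obs] * weight[s];
    }

    /* After: interchanged so the ~50-iteration observation loop is
       innermost and keeps the software pipeline full. */
    void obs_prob_after(const int feat[NUM_OBS], const int weight[NUM_SUBSTATES],
                        int prob[NUM_OBS][NUM_SUBSTATES])
    {
        int obs, s;
        for (s = 0; s < NUM_SUBSTATES; s++)
            for (obs = 0; obs < NUM_OBS; obs++)
                prob[obs][s] = feat[obs] * weight[s];
    }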
Judgment statements, jump statements, and early-exit statements (break) in a loop body seriously damage the software pipeline. For example, the EstimateObsProb function contains a test on the number of mixture components m; since we set the GMMs of all substates to the same number of mixture components, this test can be removed. Finally, the OpenCV source allocates the dynamic memory used by the innermost loop every time the loop is entered. We moved this operation outside the loop body so that memory allocated once is reused throughout the loop, as sketched below. These two modifications further reduced the CPU cycle count.
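A sketch of the allocation hoisting (process() and BUF_SIZE are hypothetical placeholders for the real inner-loop work):

    #include <stdlib.h>

    #define BUF_SIZE 256
    extern void process(float *buf, int i);

    /* Before: the original pattern allocates and frees on every iteration. */
    void inner_loop_before(int n)
    {
        int i;
        for (i = 0; i < n; i++) {
            float *buf = (float *)malloc(BUF_SIZE * sizeof(float));
            process(buf, i);
            free(buf);
        }
    }

    /* After: one allocation outside the loop is reused by every iteration. */
    void inner_loop_after(int n)
    {
        int i;
        float *buf = (float *)malloc(BUF_SIZE * sizeof(float));
        for (i = 0; i < n; i++)
            process(buf, i);
        free(buf);
    }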
Each loop iterates a number of times before it ends; this iteration count is the loop count. The compiler uses the loop count to decide whether a loop can be pipelined: to fill the pipeline, the software-pipelined loop structure requires that the loop iterate at least a certain number of times. When the compiler cannot determine the loop count, it generates two versions of the loop code, a pipelined version and a non-pipelined one, and selects between them at run time by testing the loop count — so one version is never executed. That unexecuted version is a redundant loop. The MUST_ITERATE pragma tells the compiler how many times a loop executes, which suppresses the generation of the redundant loop; and because the compiler then knows the loop-count information, it can schedule the pipeline more effectively and improve execution speed. For example, the observation-probability loop of the EHMM uses the following pragma, whose parameters tell the compiler in turn that the loop executes at least 3 times, at most 6 times, and always a multiple of 3 times:
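A sketch of the pragma in place (the loop body is a simplified placeholder; the syntax is #pragma MUST_ITERATE(min, max, multiple) per the TI compiler manual, and the pragma must immediately precede the loop it describes):

    void substate_probs(int numSubStates, const int *feat, int *prob)
    {
        int s;
        /* Trip count is 3 or 6, always a multiple of 3: no redundant
           non-pipelined version needs to be generated. */
    #pragma MUST_ITERATE(3, 6, 3);
        for (s = 0; s < numSubStates; s++)
            prob[s] = feat[s] * feat[s];   /* placeholder computation */
    }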
3. Intrinsics Optimization
Intrinsics are a feature of the C6000 compiler: a set of special functions with C call interfaces that map directly to C6000 CPU instructions. Operations that are hard to express efficiently in C/C++ can be implemented through intrinsics. For example, fixed-point DSP addition usually has to consider overflow, i.e. the sum exceeding the largest (smallest) value representable in the word length. Ordinary addition wraps around on overflow, whereas saturated addition clamps to the largest (smallest) value of the word length. Expressed in C/C++, saturated addition looks like this (one typical formulation):
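    #include <limits.h>

    /* Saturated 32-bit addition in plain C: on overflow the result is
       clamped to INT_MAX or INT_MIN instead of wrapping around. */
    int saturated_add(int a, int b)
    {
        int sum = a + b;
        /* Overflow occurred if the operands have the same sign but the
           sum's sign differs. */
        if (((a ^ b) >= 0) && ((sum ^ a) < 0))
            sum = (a < 0) ? INT_MIN : INT_MAX;
        return sum;
    }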
The complex function above can be replaced by a single intrinsic, _sadd(a, b), which maps to a single-cycle C6000 instruction and saves a great deal of computation. Our EHMM program mainly uses four intrinsics: _mpy32ll(), _smpy(), _sadd() and _ssub(). The first two are C64x+-specific instructions: _mpy32ll() is a 32-bit multiplication with a 64-bit result, and _smpy() is a saturated fixed-point multiplication producing a Q31 result. The latter two are saturated addition and saturated subtraction, used as sketched below.
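A brief sketch of how the four intrinsics are called (TI toolchain assumed; the intrinsics are built into the compiler, and each call maps to one C64x+ instruction):

    void intrinsics_demo(int a, int b)
    {
        long long p = _mpy32ll(a, b); /* 32-bit x 32-bit -> 64-bit product     */
        int q = _smpy(a, b);          /* saturated 16-bit multiply, Q15 -> Q31 */
        int s = _sadd(a, b);          /* saturated addition                    */
        int d = _ssub(a, b);          /* saturated subtraction                 */
        (void)p; (void)q; (void)s; (void)d;   /* silence unused-variable warnings */
    }

The performance improvement of the main functions after using intrinsics is shown in the following table: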
4. Memory, Cache and DMA Optimization
The DSP core of the DM6446 is a C64x+ core, which integrates 32KB of L1 program memory (L1P), 80KB of L1 data memory (L1D) and 64KB of L2 program/data memory. Both levels can be configured as on-chip RAM, as cache, or as a mixture of the two. The CPU accesses L1 without stalling, while an L2 access stalls the CPU for 2 to 8 cycles. Using these on-chip memory resources sensibly is decisive for program performance. After analysis, we partitioned the on-chip space of the C64x+ core as follows:
L1D RAM runs at CPU speed, but its capacity is limited, and because of the complexity of the algorithm its data cannot all fit into L1D SRAM. Therefore, during the calculation we must progressively transfer the data about to be used into on-chip memory and transfer the used data out. These transfers between on-chip and off-chip memory can be carried out by DMA, so that data movement proceeds concurrently with computation and forms a pipeline. We used this technique in the Adaboost detection algorithm. Since the strong classifiers of Adaboost form a cascade of stages with no data dependence between two stages, we set aside two buffers, bufA and bufB, in L1D RAM for ping-pong operation. This process is shown in Figure 7-5.
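In outline, the ping-pong scheme can be sketched as follows (dma_submit() and dma_wait() stand in for the actual EDMA driver calls, and classify(), BLOCK and the buffer placement are likewise hypothetical):

    #define BLOCK 2048
    extern short bufA[BLOCK], bufB[BLOCK];  /* placed in L1D RAM via the linker command file */
    extern void dma_submit(short *dst, const short *src, int n); /* start an EDMA copy  */
    extern void dma_wait(void);                                  /* wait for completion */
    extern void classify(const short *data, int n);

    void detect_all(const short *extData, int nBlocks)
    {
        short *work = bufA, *load = bufB;
        int i;
        dma_submit(work, extData, BLOCK);   /* prefetch the first block */
        dma_wait();
        for (i = 0; i < nBlocks; i++) {
            if (i + 1 < nBlocks)            /* start fetching the next block... */
                dma_submit(load, extData + (i + 1) * BLOCK, BLOCK);
            classify(work, BLOCK);          /* ...while the CPU works on this one */
            dma_wait();
            { short *t = work; work = load; load = t; }  /* swap ping and pong */
        }
    }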
The performance comparison before and after the improvement is as follows:
A cache analysis tool can be used to analyze cache hits; wherever cache utilization is poor, the code or memory layout can be adjusted to improve it. The compiled map file shows that the code size of the EstimateObsProb function, which accounts for more than 70% of the computation, is about 3.7KB, so the entire function fits into the L1P cache. The cache analysis tool also shows an L1P cache hit rate as high as 99%, so the key is the data cache: the L1D cache hit rate must be raised and its miss rate reduced. The types of misses, and how to avoid them, are as follows:
(1) Conflict miss: caused by more than one piece of referenced data or code being mapped to the same cache line, not by the cache being too small. Conflict misses can be eliminated by changing the relative positions of the data and code in memory;
(2) Compulsory miss: also known as a first-reference miss. When data or code is accessed for the first time, its contents cannot yet be in the cache, so a miss inevitably occurs. This kind of miss is unavoidable;
(3) Capacity miss: caused by the cache being smaller than the referenced data set. To eliminate capacity misses, divide the accessed data into small blocks, or split the loop, so that each block of data is used as fully as possible before it is evicted from the cache (see the blocking sketch after this list).
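A minimal blocking sketch (TILE, NUM_PASSES and transform() are hypothetical; TILE would be chosen so that one tile fits in L1D):

    #define TILE      1024
    #define NUM_PASSES   4
    extern float transform(float x, int pass);   /* hypothetical per-element kernel */

    /* Unblocked, each pass would stream over all n elements, evicting
       data before the next pass could reuse it. Blocked, every pass over
       a tile hits data that is still resident in L1D. */
    void process_blocked(float *data, int n)
    {
        int t, pass, i;
        for (t = 0; t < n; t += TILE)
            for (pass = 0; pass < NUM_PASSES; pass++)
                for (i = t; i < t + TILE && i < n; i++)
                    data[i] = transform(data[i], pass);
    }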
After cache optimization, the L1D cache hit rate also reached over 90%, and the recognition speed increased by about 7%.