TI C6000 Optimization Advanced: Loops are the most important!
Software Pipelining
1. C6000 Pipeline
An instruction is not processed in a single step; its processing is divided into three stages: fetch, decode, and execute. Giving each stage its own independent processing unit and chaining the units into a pipeline lets several instructions be in flight at once, which greatly increases instruction throughput.
As shown in Figure 1, three pipelined instructions need only five cycles, compared with nine cycles for sequential execution. The more instructions there are, the more pronounced the pipeline's advantage becomes.
Figure 1: Simple pipeline arrangement
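The cycle counts in Figure 1 follow from a simple rule for an idealized pipeline with no stalls: with S one-cycle stages, N instructions take S x N cycles when executed sequentially and S + N - 1 cycles when pipelined. The following is a minimal sketch in C; the function names are hypothetical and chosen only for illustration.

```c
/* Cycle counts for an idealized pipeline: every stage takes one cycle
 * and no instruction ever stalls. Function names are illustrative only. */

/* Sequential execution: each instruction passes through all stages
 * before the next one starts. */
unsigned sequential_cycles(unsigned stages, unsigned instructions)
{
    return stages * instructions;
}

/* Pipelined execution: the first instruction needs `stages` cycles,
 * and each later instruction completes one cycle after the previous one. */
unsigned pipelined_cycles(unsigned stages, unsigned instructions)
{
    if (instructions == 0) {
        return 0;
    }
    return stages + instructions - 1;
}
```

With 3 stages and 3 instructions these return 9 and 5 cycles, matching Figure 1; with 100 instructions they return 300 and 102.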
In fact, the C6000 architecture further divides each stage into multiple sub-stages, each of which consumes 1 CPU cycle.
Instruction fetch (4 sub-stages):
- PG: Program address generate (update program counter register)
- PS: Program address send (to memory)
- PW: Program (memory) access ready wait
- PR: Program fetch packet receive (fetch packet = eight 32-bit instructions)
Decoding (2 sub-stages):
Execution (1-10 sub-stages, different between instructions):
Figure 2 High-performance C6000 pipeline
2. Pipeline blocking
The pipeline is blocked (stalls) in the following two situations:
- When the current instruction is a load, a complex multiply, or another instruction with multiple delay slots, its result is not available for several cycles, and a following instruction that needs that result can continue only after it is returned.
- When a branch instruction appears, the CPU cannot predict which instruction will execute next, so the branch target instruction cannot enter the pipeline until the branch instruction reaches the E1 stage.
In order to make full use of pipeline resources and avoid blocking caused by delay slots, the C6000 architecture adds a new processing mechanism in software and hardware:
3. Software Pipelining
Software pipelining ≠ the hardware pipeline!
Software pipelining means that the compiler reorders instructions so that a pipeline that would otherwise be blocked is kept fully utilized. The emphasis is on the word "software".
For example, processing the following loop:
for (i = 0; i < 15; i++)
{
    sum += tab[i];
}
The traditional instruction flow (Solution 1) and the instruction flow after software pipelining (Solution 2) are shown in Figure 3:
Figure 3 Traditional orchestration vs software pipeline orchestration
As the figure shows, instructions that have been reordered by software no longer block the pipeline, which improves efficiency without changing what the code does.
Three terms related to software pipelining:
- Pipeline kernel (Kernel): the section of code in which the pipeline is fully utilized
- Pipeline fill (Prolog): the code that fills the pipeline before the kernel
- Pipeline drain (Epilog): the code that drains the pipeline after the kernel
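The prolog, kernel, and epilog can be imitated by hand in C for the summation loop above. This is only a sketch of the idea, not the compiler's actual output: the compiler works at the instruction level and usually overlaps more than two iterations, because a load has several delay slots. The function name sum_pipelined is hypothetical.

```c
/* Hand-written imitation of a two-stage software pipeline for
 *     for (i = 0; i < n; i++) sum += tab[i];
 * The load for iteration i is issued one step before its add,
 * so in the kernel a load and an add from different iterations overlap. */
int sum_pipelined(const int *tab, int n)
{
    int sum = 0;
    int loaded;
    int i;

    if (n <= 0) {
        return 0;
    }

    /* Prolog: fill the pipeline with the first load. */
    loaded = tab[0];

    /* Kernel: load element i while adding element i - 1. */
    for (i = 1; i < n; i++) {
        int next = tab[i];
        sum += loaded;
        loaded = next;
    }

    /* Epilog: drain the pipeline by adding the last loaded element. */
    sum += loaded;

    return sum;
}
```

The result is the same as the plain loop; only the order in which loads and adds are issued changes, which is the point of software pipelining.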
4. SPLOOP Buffer
Software pipelining has two major drawbacks: it increases the code size of the generated assembly, and it affects the interruptibility of the code.
The SPLOOP buffer is designed to solve these problems. It is a storage area inside the C6000 that holds the instructions of an SPLOOP. The first time an SPLOOP is executed, the loop's instructions are copied into the SPLOOP buffer, and instructions are fetched from there for the rest of the loop until it ends.
The C6000 also provides dedicated registers and instructions for working with the SPLOOP buffer. Programmers who write assembly or linear assembly need to be familiar with these instructions and registers and understand the SPLOOP buffer's particular execution mechanism (reference [1]). With C/C++, the compiler generates the corresponding instructions automatically.
Taking a memory block copy function as an example, the code before and after using the SPLOOP buffer is shown below:
Figure 4: Memory copy before using SPLOOP buffer
Figure 5: Memory copy after using SPLOOP buffer
*Note: The SPLOOP buffer can hold at most 14 execute packets (each execute packet can contain up to 8 instructions, sequential/parallel), so it cannot be used if the loop body is too complex.
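For orientation, a memory block copy in C might look like the sketch below. The function copy_words is a hypothetical illustration, not the code from Figures 4 and 5. Its loop body is a single load and a single store, so one would expect it to fit easily within the buffer limit; whether the compiler actually emits an SPLOOP for it depends on the target device and the options used, which is an assumption here.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: copies n 32-bit words from src to dst.
 * `restrict` tells the compiler that the two buffers do not overlap,
 * so loads and stores from different iterations may be reordered. */
void copy_words(uint32_t *restrict dst, const uint32_t *restrict src, size_t n)
{
    size_t i;

    for (i = 0; i < n; i++) {
        dst[i] = src[i];
    }
}
```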
5. Factors that lead to software pipelining failure
In the CCS development environment, when the -O2/-O3 optimization option is turned on, the compiler automatically software-pipelines suitable code. Programmers therefore need to make sure that the loops they write meet the conditions for software pipelining.
The following factors may cause software pipelining to fail:
- Assembly statements are embedded in the C/C++ code
- Complex control-flow statements such as goto or break appear in the loop
- The loop contains a function call (except for inlined functions)
- The loop contains too many instructions to be software pipelined
- The loop counter is not initialized
- The loop variable is modified inside the loop body
- Software pipelining is disabled: the -O2 or -O3 option is not used; the -ms2 or -ms3 option is used; or -mu is used to disable software pipelining
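As an illustration of the break factor in the list above, the sketch below shows the same search written twice: once with an early exit and once with a fixed trip count and no break. Both function names are hypothetical, and whether a particular compiler version pipelines either form is an assumption that should be checked against the compiler's own feedback.

```c
/* Early exit: control flow leaves the loop through break,
 * one of the factors listed above. */
int first_negative_break(const int *x, int n)
{
    int idx = -1;
    int i;

    for (i = 0; i < n; i++) {
        if (x[i] < 0) {
            idx = i;
            break;
        }
    }
    return idx;
}

/* Fixed trip count: the loop always runs n times and has no break.
 * Walking backwards means the last assignment to idx is the
 * lowest index that holds a negative value. */
int first_negative_flat(const int *x, int n)
{
    int idx = -1;
    int i;

    for (i = n - 1; i >= 0; i--) {
        if (x[i] < 0) {
            idx = i;
        }
    }
    return idx;
}
```

Both functions return the index of the first negative element, or -1 if there is none. The second form always visits every element, so it trades a possible early exit for a simpler loop structure.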