TI C6000 Optimization Advanced: Loops are the most important!

Aguilera

Software Pipelining

1. C6000 Pipeline

Instruction processing is not completed in a single step; it is divided into three stages: fetch, decode, and execute. Giving each stage its own processing unit, like workstations on an assembly line, forms a pipeline in which different instructions occupy different stages at the same time, which greatly increases instruction throughput.

As shown in Figure 1, three pipelined instructions need only five cycles, compared with nine cycles for sequential execution. The more instructions there are, the more obvious the advantage of the pipeline becomes.

Figure 1: Simple pipeline arrangement
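
The cycle counts quoted for Figure 1 follow from a simple model: with k one-cycle stages, n instructions take n × k cycles when run one after another, and k + (n − 1) cycles once pipelined. A minimal sketch in C, assuming an ideal pipeline that never blocks (the function names are illustrative and do not come from the original post):

```c
/* Cycles to run n instructions one after another, each passing
 * through all `stages` one-cycle stages before the next starts. */
int sequential_cycles(int n, int stages)
{
    return n * stages;
}

/* Cycles in an ideal pipeline: the first instruction needs `stages`
 * cycles, and each later one finishes one cycle after its predecessor. */
int pipelined_cycles(int n, int stages)
{
    if (n <= 0)
        return 0;
    return stages + (n - 1);
}
```

For n = 3 and three stages this gives 9 and 5 cycles, matching the numbers above.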

In fact, the C6000 architecture further divides each stage into multiple sub-stages, each of which consumes 1 CPU cycle.

Instruction fetch (4 sub-stages):

PG: Program address generate (update program counter register)
PS: Program address send (to memory)
PW: Program (memory) access ready wait
PR: Program fetch packet receive (fetch packet = eight 32-bit instructions)

Decoding (2 sub-stages):

DP: Instruction dispatch (or assign, to the functional units)
DC: Instruction decode

Execution (1-10 sub-stages, different between instructions):

E1 – E10, where E1 is the first sub stage in the execute stage

Figure 2: High-performance C6000 pipeline

2. Pipeline blocking

The pipeline is blocked in the following two situations:

When the current instruction is a load, a complex multiply, or another instruction with multiple delay slots, its result is not available for several cycles, so a following instruction that needs that result cannot continue until the result is returned.
When a branch (jump) instruction appears, the CPU cannot predict which instruction to execute next, so the branch target instruction cannot enter the pipeline until the branch instruction has reached the E1 stage.
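
To make the first case concrete, here is a small C sketch of the delay-slot arithmetic. The delay-slot counts in the enum are assumptions based on the values TI commonly documents for C6000 devices (4 for a load, 5 for a branch, 1 for a 16x16 multiply); they are not stated in this post, so check the CPU reference guide for your device. All names are illustrative only.

```c
/* Assumed delay-slot counts (not from the original post). */
enum delay_slots {
    DELAY_SINGLE_CYCLE = 0,  /* result usable in the next cycle */
    DELAY_MPY16        = 1,  /* 16x16 multiply */
    DELAY_LOAD         = 4,  /* memory load */
    DELAY_BRANCH       = 5   /* branch */
};

/* Earliest cycle in which a dependent instruction can use the result
 * of an instruction issued in `issue_cycle`. */
int earliest_use_cycle(int issue_cycle, int delay_slots)
{
    return issue_cycle + delay_slots + 1;
}

/* Cycles left idle when only `independent_cycles` cycles of unrelated
 * work are available to fill the delay slots. */
int wasted_cycles(int delay_slots, int independent_cycles)
{
    int idle = delay_slots - independent_cycles;
    return idle > 0 ? idle : 0;
}
```

Under these assumptions, a value loaded in cycle 0 can first be used in cycle 5, and a loop with no independent work to schedule wastes all four delay slots on every load.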

To make full use of pipeline resources and avoid the blocking caused by delay slots, the C6000 architecture provides one mechanism in software and one in hardware:

Software: the compiler schedules instructions using software pipelining
Hardware: the SPLOOP buffer (software pipelining loop buffer)

3. Software Pipelining

Software pipelining ≠ hardware pipeline!

Software pipelining means that the compiler rearranges instructions, overlapping successive loop iterations, so that a pipeline that would otherwise be blocked is fully used. The emphasis is on the word "software".

For example, processing the following loop:

for (i = 0; i < 15; i++)
{
    sum += tab[i];
}

The traditional instruction flow (Solution 1) and the instruction flow after software pipelining (Solution 2) are shown in Figure 3:

Figure 3: Traditional scheduling vs. software-pipelined scheduling

As the figure shows, instructions that have been rearranged by software no longer block the pipeline, which improves efficiency without changing what the code does.

Three terms related to software pipelining:

Kernel (pipeline kernel): the section of code in which the pipeline is fully used

Prolog (pipeline fill): the section of code before the kernel that fills the pipeline

Epilog (pipeline drain): the section of code after the kernel that drains the pipeline
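
The three terms can be illustrated in plain C by pipelining the summation loop by hand. This is only a sketch of the idea: on a real C6000 the compiler does this at the instruction level, and the two-stage split (load, then add) and the function names are assumptions made for illustration.

```c
/* Plain version: each iteration loads tab[i], then adds it. */
int sum_plain(const int *tab, int n)
{
    int sum = 0;
    int i;
    for (i = 0; i < n; i++)
        sum += tab[i];
    return sum;
}

/* Hand-written two-stage software pipeline of the same loop. */
int sum_pipelined(const int *tab, int n)
{
    int sum = 0;
    int loaded;   /* value loaded by the previous iteration */
    int i;

    if (n <= 0)
        return 0;

    /* Prolog: issue the first load; there is nothing to add yet. */
    loaded = tab[0];

    /* Kernel: load for iteration i while adding the value from i-1. */
    for (i = 1; i < n; i++) {
        int next = tab[i];
        sum += loaded;
        loaded = next;
    }

    /* Epilog: no loads remain; add the last loaded value. */
    sum += loaded;
    return sum;
}
```

Both functions return the same result; the second one only changes when each load and add happens relative to the loop iterations.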

4. SPLOOP Buffer

Software pipelining has two major disadvantages: it increases the size of the generated assembly code, and it affects the interruptibility of the code.

The SPLOOP buffer is designed to solve these problems. It is a storage area inside the C6000 that holds the instructions of an SPLOOP. When an SPLOOP is executed for the first time, the instructions of the loop are copied into the SPLOOP buffer, and instructions are then fetched from there until the loop ends.

The C6000 also provides dedicated registers and instructions for using the SPLOOP buffer. Programmers who write assembly or linear assembly need to be familiar with these instructions and registers and understand the execution mechanism of the SPLOOP buffer (reference [1]). With C/C++, the compiler generates the corresponding instructions automatically.

Taking a memory block copy function as an example, Figures 4 and 5 show the code before and after using the SPLOOP buffer:

Figure 4: Memory copy before using SPLOOP buffer

Figure 5: Memory copy after using SPLOOP buffer

*Note: The SPLOOP buffer can hold at most 14 execute packets (each execute packet can contain up to 8 instructions issued in parallel), so it cannot be used if the loop body is too complex.
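
Figures 4 and 5 show generated assembly. As a rough idea of the kind of C source such a copy loop could come from, here is a minimal sketch; the function name and signature are hypothetical, and whether the compiler actually places the loop in the SPLOOP buffer depends on the device, the compiler version, and the options used.

```c
/* Copy n 32-bit words from src to dst.
 * `restrict` states that the two buffers do not overlap, so the
 * compiler is free to reorder the loads and stores. */
void copy_words(int *restrict dst, const int *restrict src, int n)
{
    int i;
    for (i = 0; i < n; i++)
        dst[i] = src[i];
}
```

The loop body is a single load and a single store, which is small enough that the 14-packet limit in the note above should not be the limiting factor.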

5. Factors that lead to software pipelining failure

In the CCS development environment, once the -O2 or -O3 optimization option is turned on, the compiler automatically software-pipelines suitable loops. Programmers therefore need to make sure that the loop bodies they write meet the conditions for software pipelining.

The following factors may cause software pipelining to fail:

Assembly statements are embedded in the C/C++ code

Complex flow-control statements such as goto or break appear in the loop

The loop contains a function call (other than a call to an inlined function)

The loop contains too many instructions to be software pipelined

The loop counter is not initialized

The loop variable is modified inside the loop body

Software pipelining is disabled: the -O2 or -O3 option is not used; the -ms2 or -ms3 option is used; or -mu is used to turn software pipelining off
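
As an illustration of the flow-control item in the list above, the sketch below shows one loop with an early break and an equivalent loop whose trip count is fixed before the loop starts. The function names are hypothetical, and this only demonstrates the restructuring; whether a given compiler version pipelines either loop still has to be checked in its own output.

```c
/* Early exit inside the loop body: the `break` is the kind of
 * flow control that the list above says can prevent pipelining. */
int sum_with_break(const int *tab, int n, int limit)
{
    int sum = 0;
    int i;
    for (i = 0; i < n; i++) {
        if (i >= limit)
            break;
        sum += tab[i];
    }
    return sum;
}

/* Same result, but the exit condition is folded into the trip count,
 * which is computed once before the loop and never modified in it. */
int sum_fixed_trip(const int *restrict tab, int n, int limit)
{
    int count = (n < limit) ? n : limit;
    int sum = 0;
    int i;
    for (i = 0; i < count; i++)
        sum += tab[i];
    return sum;
}
```

The second version has an initialized counter, a loop variable that is changed only by the loop header, and no branch in the body.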

alan000345

Very good sharing.
