TI C6000 Optimization Manual - Make Your Code Look Like a Nail

DSP chips were created to handle large volumes of numerical computation. By integrating dedicated adders, multipliers, address generators, and other specialized hardware units, a DSP performs arithmetic faster than an ordinary microcontroller, which makes it better suited to demanding real-time workloads. For that reason, a central part of DSP programming is making the code run as efficiently as possible.

This article is based on the TI C6000 hardware architecture and introduces the main code optimization methods for C language programming.

Guidelines

Before optimizing, keep the following principles in mind:

Loops matter most. Almost all time-consuming work happens inside loops, so code optimization is, to a first approximation, loop optimization.

The worst-case principle. The compiler in TI CCS includes an optimizer for both C/C++ and assembly code. An optimization based on a wrong assumption would break the program, so when information is missing the compiler assumes the worst case and puts correctness first.

If the hardware architecture and the compiler optimization tools provided by TI are the hammer, then the programmer's job is to make the code look like a nail.

Highly optimized C/C++ code can perform very close to handwritten assembly. Because assembly is much harder to write and maintain, C/C++ should be the first choice. Figure 1 is TI's schematic comparison of the performance of each programming language before and after optimization.

Figure 1 Performance comparison of each language before and after optimization

Optimization is not always necessary. In general, the four progressive development stages shown in Figure 2 determine how far a program needs to be optimized.


Figure 2 General process of DSP program development

Tools

As the hammer-and-nail metaphor suggests, the programmer's job is to write code that the "hammers", namely the processor's functional units and the compiler, can exploit as fully as possible. That requires a basic understanding of them first.

1. High-performance C6000 DSP architecture

The core has 8 parallel functional units, 2 register files, and separate program and data memory. It fetches a 256-bit instruction packet, that is, eight 32-bit instructions, at a time, and has two 64-bit data load/store paths. All of this serves one goal: parallel processing.

Figure 3 C6000 DSP architecture

2. C6000 pipeline

Executing an instruction takes several phases grouped into three stages: fetch, decode, and execute. On top of this hardware pipeline, TI's compiler applies software pipelining, which schedules a loop so that several iterations are in flight at the same time, greatly improving throughput.

The C64x+, C674x, and C66x cores add a software pipeline loop buffer (SPLOOP buffer), which holds the instructions of a software-pipelined loop so that they can be issued efficiently and allows the loop to be interrupted.

However, the SPLOOP buffer cannot handle loops that exceed 14 execute packets.

For complex loops, such as nested loops, loops with conditional branches, and loops containing function calls, the compiler's optimization is less effective.

3. SIMD (Single Instruction, Multiple Data)

The C6000 supports single-instruction, multiple-data operation: one instruction can operate on 64 bits of data at a time, and those 64 bits can be packed with several smaller elements (words, halfwords, or bytes).

4. C6000 C Compiler

The compiler is what ultimately carries out the optimization. It analyzes the code and makes the optimization decisions. Sometimes, however, it cannot obtain important information from the code alone, and the programmer then needs to supply that information explicitly.

Compiler options, keywords, and pragma directives all convey optimization-related information to the compiler. In the other direction, the compiler's optimization feedback can be used to refine the optimization strategy.

5. Others

Intrinsics
TI provides a set of intrinsics that the programmer can call. Each maps onto specific instructions, so together with the DSP's hardware functional units they efficiently perform operations that are cumbersome to express in C.

The intrinsic operations are not function calls (though they have the appearance of function calls), so no branching is needed. Instead, the use of an intrinsic is a way to tell the compiler to issue one or more particular instructions of the C6000 instruction set.

Optimized library functions
TI also packages common computational kernels as library functions. These have been heavily optimized and perform the corresponding calculations very efficiently.

Optimization strategy

1. Choose appropriate compiler options

-o0/1/2/3: The most important optimization options. With -o3 the compiler applies the full range of optimizations, which can expose errors in code that violates the compiler's assumptions. -o0 and -o1 are much more conservative, but the resulting code is considerably slower.
-g: Makes the compiler emit symbolic debugging information. It is very useful during development and debugging, but should be avoided when building the final product because it can reduce instruction parallelism and takes extra space, which hurts performance.
-mt: Tells the compiler that, throughout the application, the pointer parameters of a function never point to the same memory. It applies only to pointers passed as function parameters, not to local pointers inside a function.
-k and -mw: These two options do not affect the performance of the final code, but they produce optimization feedback from the compiler that helps the programmer adjust the optimization strategy. -k keeps the compiler's assembly output; -mw outputs detailed software pipelining information.

2. The “restrict” keyword

restrict has an effect similar to the -mt option: it tells the compiler that the memory accessed through a pointer is not accessed through any other pointer in the function.

The restrict keyword can be applied to any pointer, regardless of whether it is a parameter, a local variable, or an element of a referenced data object. Moreover, because the restrict keyword is only effective in the function it is in, it can be used whenever needed.

Note: restrict must not be added indiscriminately. It is a promise made by the programmer, not something the compiler checks, and it is only valid when the memory regions accessed through the two pointers do not overlap. If they do overlap, the optimized code may produce wrong results.
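A minimal sketch of how restrict might be applied to function parameters. The function name vec_add and its signature are hypothetical, chosen only for illustration; the assert is a debug-build aid that catches the most obvious violation (in-place use), not a full overlap check.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: element-wise sum of two 16-bit vectors.
 * restrict promises the compiler that, inside this function, the memory
 * written through dst is never accessed through a or b, so loads for
 * later iterations may be scheduled ahead of earlier stores. */
void vec_add(int16_t *restrict dst,
             const int16_t *restrict a,
             const int16_t *restrict b,
             int n)
{
    int i;

    /* Debug-build check: the output buffer must not be one of the inputs. */
    assert(dst != a && dst != b);

    for (i = 0; i < n; i++) {
        dst[i] = (int16_t)(a[i] + b[i]);
    }
}
```

Calling vec_add(buf, buf, other, n) would break the promise; in that case the qualifier on dst has to be dropped rather than left in place.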

3. Providing information through pragmas

Pragma directives inserted into the code give the compiler specific information about it. Because the compiler assumes the worst case when information is missing, supplying a few key facts greatly helps it make good decisions. The most commonly used pragmas are MUST_ITERATE and UNROLL.

MUST_ITERATE: States exact facts about a loop's trip count: the minimum number of iterations, the maximum number of iterations, and a factor that the number of iterations is always a multiple of. It is placed immediately before the loop and takes the form #pragma MUST_ITERATE(min, max, multiple);

Arguments that cannot be determined may be omitted.
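As a sketch of MUST_ITERATE in use, assuming the form given in TI's compiler documentation: the function sum_samples and the bounds 8 and 1024 are hypothetical. The pragma is guarded by the TI compiler's predefined macro so the same file also builds on a host compiler, and the assert checks in debug builds that the promise made to the compiler actually holds.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical example: sum of n 16-bit samples, where the caller
 * guarantees 8 <= n <= 1024 and that n is a multiple of 8. */
int32_t sum_samples(const int16_t *x, int n)
{
    int32_t acc = 0;
    int i;

    /* Debug-build check that the trip-count promise is true. */
    assert(n >= 8 && n <= 1024 && n % 8 == 0);

#ifdef __TI_COMPILER_VERSION__
    /* min = 8, max = 1024, multiple = 8.
     * If the maximum were unknown, it could be left out: (8, , 8). */
    #pragma MUST_ITERATE(8, 1024, 8);
#endif
    for (i = 0; i < n; i++) {
        acc += x[i];
    }
    return acc;
}
```

With the minimum known to be at least 8, the compiler no longer needs to guard against a loop that runs zero times or too few times to fill the pipeline.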

UNROLL: Asks the compiler to unroll the loop by a given factor. It is also placed immediately before the loop and takes the form #pragma UNROLL(n);

It is good practice to pair UNROLL with a MUST_ITERATE directive stating that the trip count is a multiple of the unroll factor. Otherwise the compiler has to generate extra code to handle the leftover iterations.
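A sketch of the two pragmas used together, under the assumption that the caller always passes a count that is a multiple of 4. The function scale_q15 is hypothetical, and the code assumes an arithmetic right shift for negative products, as on the C6000 and common host compilers.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical example: scale n samples by a Q15 gain.
 * The caller guarantees n >= 4 and n % 4 == 0. */
void scale_q15(int16_t *restrict out, const int16_t *restrict in,
               int16_t gain, int n)
{
    int i;

    /* Debug-build check of the trip-count promise below. */
    assert(n >= 4 && n % 4 == 0);

#ifdef __TI_COMPILER_VERSION__
    /* Trip count is a multiple of 4, so unrolling by 4 leaves no remainder. */
    #pragma MUST_ITERATE(4, , 4);
    #pragma UNROLL(4);
#endif
    for (i = 0; i < n; i++) {
        /* Q15 multiply: 16x16 -> 32-bit product, then drop 15 fraction bits. */
        out[i] = (int16_t)(((int32_t)in[i] * gain) >> 15);
    }
}
```

Here a gain of 16384 represents 0.5 in Q15, so each output is half the corresponding input.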

Unrolling has two benefits: first, it lets the compiler balance the work across the functional units more evenly; second, it gives the compiler more opportunities to use SIMD instructions. Note, however, that unrolling increases the size of the loop body.

4. Notes on loop body optimization

The key to loop optimization is to write the loop so that the compiler can software-pipeline it.

The more complex a loop is (nested loops, conditional branches, or function calls inside the body), the less effective the compiler becomes. When the situation becomes too complex, the compiler might not be able to pipeline at all.

The compiler performs software pipelining only on the inner loop.

A software-pipelined loop can contain intrinsics but not function calls.

The loop must not contain break or goto statements, conditional termination, or any other construct that makes it exit early.

Conditional code should be as simple as possible. On the C64x, a loop whose conditional code needs more than 6 condition registers cannot be software-pipelined.

Avoid making the loop body too complicated, or the loop may run out of registers.

If a value has to stay live in a register for too long, the loop cannot be software-pipelined.

Do not modify the loop counter inside the loop body.
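The notes above can be illustrated with a small restructuring, offered as a sketch rather than a rule. Both functions are hypothetical: the first exits early with break, the second always runs a fixed number of iterations and keeps only a simple conditional assignment in the body, which is the shape the notes ask for.

```c
#include <assert.h>
#include <stdint.h>

/* Early-exit version: the break makes the trip count data-dependent. */
int first_above_break(const int16_t *x, int n, int16_t limit)
{
    int i;
    for (i = 0; i < n; i++) {
        if (x[i] > limit) {
            break;
        }
    }
    return (i < n) ? i : -1;
}

/* Restructured version: exactly n iterations, no break, and the loop
 * counter is changed only by the loop header. Scanning from the end
 * means the last assignment made is the smallest matching index. */
int first_above(const int16_t *x, int n, int16_t limit)
{
    int idx = -1;
    int i;

    assert(n >= 0);

    for (i = n - 1; i >= 0; i--) {
        if (x[i] > limit) {
            idx = i;   /* simple conditional assignment */
        }
    }
    return idx;
}
```

The trade-off is that the restructured loop always visits every element, so it pays off only when the gain from pipelining outweighs the lost early exit.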

5. Use SIMD to process multiple data in parallel

TI provides a number of SIMD-capable instructions, such as LDDW and STDW, which load and store 64 bits of data at once.

Assuming all the data used are actually 16-bit data, it would be ideal if the LDDW and STDW instructions can operate on 4 elements every time. However, if the data is declared as 32-bit type, LDDW and STDW can only operate on 2 elements every time. Therefore, to fully utilize SIMD instructions, it is important to choose the smallest data size that works for the data.

Although the C64x+, C674x, and C66x cores support SIMD access to non-aligned data, for large data sets you should still try to align the data to its access boundary so that it can be loaded and stored with full parallelism.
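A sketch of both points, the small element type and the alignment promise. The function dot_q15 is hypothetical. _nassert is the TI compiler's way of stating a fact such as alignment; on a host compiler it is replaced here by a no-op macro, which is an assumption made only so the example builds off-target.

```c
#include <assert.h>
#include <stdint.h>

#ifndef _TMS320C6X
/* Host fallback (assumption for illustration): no-op off-target. */
#define _nassert(cond) ((void)0)
#endif

/* Hypothetical dot product of 16-bit samples. Declaring the data as
 * int16_t rather than int32_t lets one 64-bit load carry 4 elements. */
int32_t dot_q15(const int16_t *restrict a, const int16_t *restrict b, int n)
{
    int32_t acc = 0;
    int i;

    /* Debug-build checks of the promises made to the compiler below. */
    assert(((uintptr_t)a % 8) == 0 && ((uintptr_t)b % 8) == 0);
    assert(n >= 4 && n % 4 == 0);

    /* Tell the compiler both arrays start on a 64-bit boundary. */
    _nassert(((uintptr_t)a % 8) == 0);
    _nassert(((uintptr_t)b % 8) == 0);

#ifdef __TI_COMPILER_VERSION__
    #pragma MUST_ITERATE(4, , 4);
#endif
    for (i = 0; i < n; i++) {
        acc += (int32_t)a[i] * b[i];
    }
    return acc;
}
```

On the target, the arrays themselves would also need to be placed on an 8-byte boundary, for example with the compiler's DATA_ALIGN pragma.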

6. Use intrinsics whenever possible

Intrinsics invoke C6000 assembly operations directly, for operations that are often complex to express in C. As noted earlier, they look like function calls but involve no branch: each one tells the compiler to issue one or more particular C6000 instructions, so they can be used inside software-pipelined loops.
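As an illustration, the sketch below uses the _add2 intrinsic, which adds the upper and lower 16-bit halves of two 32-bit values independently. The wrapper add2_packed, the host emulation, and add_packed16 are hypothetical and exist only so the example also builds off-target; on the C6000 the wrapper reduces to the single intrinsic.

```c
#include <assert.h>
#include <stdint.h>

/* Adds two pairs of packed 16-bit values; a carry out of the lower
 * half never reaches the upper half. */
static int add2_packed(int a, int b)
{
#ifdef _TMS320C6X
    return _add2(a, b);              /* one ADD2 instruction on the target */
#else
    /* Host emulation (assumption for illustration only). */
    uint32_t ua = (uint32_t)a;
    uint32_t ub = (uint32_t)b;
    uint32_t lo = (ua + ub) & 0xFFFFu;
    uint32_t hi = ((ua >> 16) + (ub >> 16)) & 0xFFFFu;
    return (int)((hi << 16) | lo);
#endif
}

/* Hypothetical loop: each int holds two 16-bit samples, so every
 * iteration adds two samples with one operation. */
void add_packed16(int *restrict dst, const int *restrict a,
                  const int *restrict b, int n_words)
{
    int i;

    assert(n_words >= 0);

    for (i = 0; i < n_words; i++) {
        dst[i] = add2_packed(a[i], b[i]);
    }
}
```

Writing the same operation in plain C requires masking and shifting each half separately, which is exactly the work the intrinsic hands to a single instruction.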

7. TI library functions

For common operations, TI provides efficiently implemented library functions, including the basic general-purpose modules shown in Figure 4. There are also libraries for specific fields such as imaging, video, and communications. These library functions have been heavily optimized and run very efficiently.


Figure 4 TI basic library function categories

8. Use inline functions instead of function calls

An inline function is defined like an ordinary function, but the compiler does not generate a call to it. Instead, it inserts the function body directly at each call site.

Since a function call in a loop body prevents software pipelining and adds call overhead, replacing it with an inline function is a good choice.

Note, however, that inlining can increase code size, so avoid inlining functions whose bodies are large or that are called from many places.
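A minimal sketch of a small helper kept inline so that the loop calling it contains no real function call. The names clamp32 and clip_to_q15 are hypothetical, and whether the compiler actually inlines the helper depends on the language mode and optimization level in use.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: small enough that inlining costs little code. */
static inline int32_t clamp32(int32_t v, int32_t lo, int32_t hi)
{
    return (v < lo) ? lo : ((v > hi) ? hi : v);
}

/* The loop body contains only the inlined comparison logic, so the
 * helper does not stand in the way of software pipelining. */
void clip_to_q15(int16_t *restrict out, const int32_t *restrict in, int n)
{
    int i;

    assert(n >= 0);

    for (i = 0; i < n; i++) {
        out[i] = (int16_t)clamp32(in[i], -32768, 32767);
    }
}
```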

9. Use shift operations instead of multiplication and division whenever possible

Multiplication, and especially division, takes much longer to execute than a logical shift. Where the design allows it, for example when a factor is a power of two, replace multiplication and division with shift operations to shorten execution time.
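A sketch of the substitution for unsigned values and power-of-two factors. The helper names are hypothetical; mod_pow2 is included because the remainder of a power-of-two division reduces to a mask in the same way.

```c
#include <assert.h>
#include <stdint.h>

/* x * 2^k as a left shift (unsigned, so wraparound is well defined). */
static inline uint32_t mul_pow2(uint32_t x, unsigned k)
{
    assert(k < 32);
    return x << k;
}

/* x / 2^k as a logical right shift. */
static inline uint32_t div_pow2(uint32_t x, unsigned k)
{
    assert(k < 32);
    return x >> k;
}

/* x % 2^k as a bit mask. */
static inline uint32_t mod_pow2(uint32_t x, unsigned k)
{
    assert(k < 32);
    return x & (((uint32_t)1 << k) - 1u);
}
```

The equivalence holds for unsigned operands only. For negative signed values, division truncates toward zero while an arithmetic right shift rounds toward negative infinity, so -7 / 2 is -3 but -7 >> 1 is typically -4.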

This post is from DSP and ARM Processors
 
