Highly efficient embedded program development

Introduction
In multimedia, communication, and other computationally intensive applications, embedded programs often need special design to satisfy constraints on manufacturing cost, power consumption, performance, and real-time behavior. Designers therefore need a set of practical programming guidelines when writing embedded software for a specific application. In practice, two areas deserve particular attention: how variables are used and how loops are written.

Variable usage
In real programs, how variables are used is crucial. Using global variables can be more efficient than passing parameters to functions, because it avoids pushing and popping parameters on every call. Global variables do have side effects on the program, however. One of them is that the order in which variables are defined produces different data layouts in the final image, as shown in Figure 1.

Figure 1 Variable definition order and the resulting layout in the image

When declaring variables, it is therefore worth considering how to control the memory layout. The simplest approach is to define all variables of the same type together.
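The same effect can be observed inside a struct, where it is easy to measure with sizeof. The sketch below is only an analogy to the global-variable case in Figure 1: the struct names and members are hypothetical, and the exact sizes depend on the compiler and ABI.

```c
#include <stddef.h>

/* Hypothetical record with member types interleaved:
 * the compiler may insert padding before each wider member. */
struct mixed {
    char  flag;
    int   count;
    char  mode;
    short level;
};

/* The same members grouped by type, widest first,
 * which leaves little or no room for padding. */
struct grouped {
    int   count;
    short level;
    char  flag;
    char  mode;
};

/* Helpers that report the storage each layout needs. */
size_t mixed_size(void)   { return sizeof(struct mixed); }
size_t grouped_size(void) { return sizeof(struct grouped); }
```

On a typical ABI with a 4-byte int and a 2-byte short, mixed_size() is likely to be 12 and grouped_size() 8, although neither value is guaranteed by the C standard.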

Engineers often declare variables as short or char to save memory. However, when a function has only a few local variables, the compiler assigns them to internal registers, one register per variable. In that case, short and char variables save no space at all and introduce side effects of their own. Figure 2 illustrates this. Assume that a is a local variable held in a register. For the same operation of adding 1, the 32-bit int variable is the fastest and needs only one addition instruction. For 8-bit and 16-bit variables, the result must also be narrowed back to its declared width inside the 32-bit register after the addition. A signed variable needs two instructions, a logical shift left followed by an arithmetic shift right, to perform the sign extension; an unsigned variable needs a logical AND instruction to clear the upper bits. It is therefore most efficient to use 32-bit int or unsigned int for local variables.

In some cases a function reads its operands from external memory before computing with them. Values that are not 32 bits wide then have to be converted to 32 bits first. Note also that once an 8-bit or 16-bit variable has been widened to 32 bits, an overflow that the narrow type would have exposed may be hidden, and this needs careful consideration.
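As a rough illustration of the two points above, the sketch below compares a narrow loop counter with a word-sized one, and shows how widening changes the overflow behavior. The function names are hypothetical, and whether the narrow version really costs extra instructions depends on the compiler and target.

```c
#include <stdint.h>

/* Narrow counter: after each i++ the compiler may have to
 * truncate i back to 8 bits inside a 32-bit register. */
uint32_t sum_narrow(const uint8_t *data, uint8_t count)
{
    uint32_t sum = 0;
    for (uint8_t i = 0; i < count; i++)
        sum += data[i];
    return sum;
}

/* Word-sized counter: the increment maps to a single add. */
uint32_t sum_wide(const uint8_t *data, unsigned int count)
{
    uint32_t sum = 0;
    for (unsigned int i = 0; i < count; i++)
        sum += data[i];
    return sum;
}

/* An 8-bit increment wraps from 255 to 0 ... */
uint8_t inc_u8(uint8_t a) { return (uint8_t)(a + 1); }

/* ... while the widened increment keeps counting, so code that
 * relied on the 8-bit wrap no longer sees it. */
unsigned int inc_wide(unsigned int a) { return a + 1; }
```

Both sum functions return the same result for the same input; the difference is only in the instructions the compiler may need for the counter.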

Figure 2 Addition program for different types of local variables

Programs often use switch case statements. At the machine level, each test and jump only decides what to do next, so every extra test costs processor time without doing useful work. To increase speed, order the cases by their relative frequency of occurrence: put the most likely case first and the least likely case last. This reduces the average execution time of the code.
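The ordering rule can be sketched with an explicit chain of tests, which is what a switch becomes when the compiler emits sequential comparisons. The message types and the assumption that MSG_DATA is the most frequent are hypothetical. Note that for dense case values a compiler may generate a jump table instead, in which case the order of the cases makes no difference.

```c
/* Hypothetical message types; MSG_DATA is assumed to be
 * by far the most frequent, MSG_RESET the rarest. */
enum msg_type { MSG_DATA, MSG_ACK, MSG_ERROR, MSG_RESET };

/* Returns how many tests were executed before the type matched,
 * with the tests ordered from most to least frequent. */
int tests_until_match(enum msg_type type)
{
    int tests = 0;

    tests++;
    if (type == MSG_DATA)
        return tests;   /* common case: one test */
    tests++;
    if (type == MSG_ACK)
        return tests;
    tests++;
    if (type == MSG_ERROR)
        return tests;
    tests++;
    return tests;       /* MSG_RESET: rare case pays for all tests */
}
```

With this ordering the frequent MSG_DATA path executes one test, while only the rare MSG_RESET path pays for all four.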

Engineers usually try to avoid redundant variables in order to keep a program simple. This is generally correct, but there are exceptions, as shown below:
int f(void);
int g(void);
// f() and g() do not access the global variable errs
int errs; // global variable

void test1(void)
{
    errs += f();
    errs += g();
}

void test2(void)
{
    int localerrs = errs; // redundant local variable
    localerrs += f();
    localerrs += g();
    errs = localerrs;
}
In the first case, test1(), every access to the global variable errs must load it from memory into a register and store it back to memory after the call to f() or g(), because the compiler cannot assume that an external function leaves errs untouched. This example therefore needs two load/store pairs. In the second case, test2(), the local variable localerrs is allocated to a register, so the whole function loads and stores the global variable only once. Reducing the number of memory accesses in this way is very effective for improving system performance.

Processing of loop programs
Counting loops are commonly used flow control structures in programs. In C, for loops like the following are everywhere:
for(loop=1; loop<=limit; loop++)

This incrementing style matches natural habits of thought, so it is used more often than the following decrementing style:
for(loop=limit; loop!=0; loop--)
Logically the two are equally efficient, but once they are mapped to a specific architecture there is a big difference.

On ARM, the incrementing form uses one more instruction per iteration than the decrementing form, and when the iteration count is large the two versions differ noticeably in performance. The underlying reason is that comparing the counter with a nonzero limit requires an explicit CMP instruction, whereas a decrementing counter can set the condition flags as part of the subtraction itself (SUBS), so the branch can test the NE condition directly.

Loop unrolling reduces the per-iteration overhead further. In many cases the compiler unrolls loops automatically, but when a loop modifies intermediate variables or results, the compiler often refuses to unroll it, and the engineer has to do the unrolling by hand.
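The two counting styles can be compared with a minimal sketch such as the following. The function names are hypothetical, and the instruction counts described above apply to ARM code generation; an optimizing compiler may transform one form into the other on its own.

```c
/* Incrementing loop: each iteration compares i with n,
 * which on ARM typically needs an explicit CMP. */
int sum_up(const int *p, unsigned int n)
{
    int sum = 0;
    for (unsigned int i = 0; i < n; i++)
        sum += p[i];
    return sum;
}

/* Decrementing loop: the counter runs down to zero, so the
 * decrement can set the flags that the branch tests (SUBS + BNE). */
int sum_down(const int *p, unsigned int n)
{
    int sum = 0;
    for (; n != 0; n--)
        sum += *p++;    /* walk the data forward with a pointer */
    return sum;
}
```

Because sum_down advances a pointer rather than indexing with the counter, the data is still visited in forward order even though the counter runs backwards.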

Particular care is needed on CPUs with an internal instruction cache (such as the ARM946E-S). Unrolled loops produce much larger code, which may no longer fit in the cache. The unrolled code is then repeatedly evicted and refetched from memory, and because the cache is so much faster than memory, unrolling ends up making the loop slower rather than faster. Loop unrolling can also interfere with vector operation optimization.
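A modest unroll factor is one way to limit code growth. The sketch below unrolls a summation by four by hand; the function name and the factor of four are assumptions for illustration, and the right factor for a given cache has to be found by measurement.

```c
/* Manual unrolling by four: one loop test per four additions.
 * A separate tail loop handles the remaining 0 to 3 elements. */
int sum_unrolled4(const int *p, unsigned int n)
{
    int sum = 0;
    unsigned int blocks = n / 4;   /* number of full groups of four */
    unsigned int rest = n % 4;     /* leftover elements */

    for (; blocks != 0; blocks--) {
        sum += p[0];
        sum += p[1];
        sum += p[2];
        sum += p[3];
        p += 4;
    }
    for (; rest != 0; rest--)
        sum += *p++;
    return sum;
}
```

Both loops count down to zero, so the unrolled version keeps the cheap loop test as well.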

The ARM processor core can test the N and Z condition flags and branch on them directly, so comparing with zero and branching is very fast. If your loop is not sensitive to direction, you can count from large to small. Note, however, that if pointer or array operations use the value of i, this approach can cause a serious out-of-bounds index error (i = MAX+1). You can correct the index by adding to or subtracting from i, but if you do, the efficiency gain is lost.
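The indexing pitfall can be sketched as follows. The function name is hypothetical. The counter runs from n down to 1, so it must be adjusted by one before it is used as an index; whether that adjustment costs an extra instruction, and so cancels the gain as the text warns, depends on the compiler and the addressing mode.

```c
/* Multiplies every element of buf by k, counting down.
 * The counter i takes the values n, n-1, ..., 1, so the valid
 * index is i - 1; using buf[i] would touch buf[n], one past the end. */
void scale_down(int *buf, unsigned int n, int k)
{
    for (unsigned int i = n; i != 0; i--)
        buf[i - 1] *= k;
}
```

Using an unsigned counter with the test i != 0 also avoids the classic mistake of writing i >= 0, which is always true for unsigned types.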

Conclusion
This article has summarized several programming techniques for developing efficient embedded ARM programs. Applied in real embedded system development, they can substantially improve system performance, especially in computationally demanding applications such as multimedia and communication, and they can serve as guidance for program design.
