ARM Programming Optimization-EEWORLD


Program optimization refers to the process of adjusting and improving the program using software development tools after software programming is completed, so that the program can make full use of resources, improve operating efficiency, and reduce code size. According to the different optimization focuses, program optimization can be divided into running speed optimization and code size optimization. Running speed optimization refers to reducing the number of instructions required to complete a specified task by adjusting the application structure and other means on the basis of fully understanding the characteristics of software and hardware. On the same processor, a speed-optimized program takes less time to complete a specified task than an unoptimized program, that is, the former has higher operating efficiency than the latter. Code size optimization refers to taking measures to reduce the amount of program code as much as possible while enabling the application to correctly complete the required functions.

However, in the actual programming process, the two goals of program optimization (running speed and code size) are usually contradictory. In order to improve the running efficiency of the program, it is often necessary to sacrifice storage space and increase the amount of code. For example, the methods often used in programming, such as replacing calculations with table lookup and loop unrolling, are likely to increase the amount of program code. In order to reduce the amount of program code and compress the memory space, it may be necessary to reduce the running efficiency of the program. Therefore, before optimizing the program, the corresponding strategy should be determined according to the actual needs. When the processor resources are tight, the optimization of running speed should be considered; when the use of memory resources is limited, the optimization of code size should be given priority.

1. Optimize program running speed

Methods for optimizing program running speed can be divided into the following categories.

1.1 General Optimization Methods

(1) Reduce computing intensity

Use left/right shift operations to replace multiplication/division by powers of 2: Multiplication or division by 2^n can usually be done with a left or right shift by n bits (for division this holds for unsigned or non-negative values). In fact, multiplication by any integer constant can be replaced by shifts and additions. On the ARM7, an addition combined with a shift can be completed in one instruction, and its execution time is less than that of a multiplication instruction. For example: i = i × 5 can be replaced by i = (i << 2) + i.
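The shift-and-add replacement can be sketched in C as follows. The function names mul5 and div8 are hypothetical, and unsigned operands are assumed because right-shifting a negative signed value does not round the same way as division.

```c
#include <stdint.h>

/* Hypothetical helper: multiply by 5 using one shift and one add.
   On ARM this can map to a single ADD with a shifted operand. */
uint32_t mul5(uint32_t i)
{
    return (i << 2) + i;   /* i*4 + i */
}

/* Hypothetical helper: divide an unsigned value by 8 (2^3). */
uint32_t div8(uint32_t i)
{
    return i >> 3;
}
```

Most optimizing compilers perform this rewrite on their own when the multiplier is a constant, so the manual form mainly matters when optimization is disabled or the compiler is weak.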

Use multiplication instead of exponentiation: The ARM7 core has a built-in 32 × 8 multiplier, so small integer powers can be computed with multiplications, which saves the cost of calling the exponentiation function. For example: i = pow(i, 3.0) can be replaced by i = i × i × i.
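A minimal sketch of this replacement, with a hypothetical function name cube. It assumes the result fits in an int; unlike pow(), no floating-point conversion or library call is involved.

```c
/* Hypothetical helper: cube an integer with two multiplications
   instead of calling the floating-point pow() library function. */
int cube(int i)
{
    return i * i * i;
}
```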

Use AND operation instead of remainder operation: When the divisor is a power of 2 and the operand is unsigned (or known to be non-negative), an AND instruction can replace the remainder operation (%) and improve efficiency. For example: i = i % 8 can be replaced by i = i & 0x07.
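The masking trick is sketched below together with a common use, wrapping a ring-buffer index. The names mod8, ring_next and RING_SIZE are hypothetical, and the buffer size is assumed to be a power of 2.

```c
#include <stdint.h>

/* Hypothetical helper: remainder modulo 8 via a bit mask.
   Valid for unsigned values only; for negative signed values
   the C % operator yields a different (non-positive) result. */
uint32_t mod8(uint32_t i)
{
    return i & 0x07u;
}

/* Assumed buffer size; the mask only works for powers of 2. */
#define RING_SIZE 16u

/* Advance an index and wrap it back to 0 at RING_SIZE. */
uint32_t ring_next(uint32_t index)
{
    return (index + 1u) & (RING_SIZE - 1u);
}
```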

(2) Optimize loop termination conditions

In a loop structure, the termination condition has a significant effect on the efficiency of the loop. Because ARM data-processing instructions can set the condition flags as a side effect, the count-down-to-zero structure should be used wherever possible when writing the termination condition. The compiler can then close the loop with a flag-setting decrement followed by a single BNE (branch if not equal to zero) instruction, instead of a CMP (compare) followed by a BLE (branch if less than or equal) instruction, which both reduces code size and increases running speed.
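The two loop shapes can be compared in a small sketch. The function names sum_up and sum_down are hypothetical; both return the same result, and the instruction sequences mentioned in the comments are what a compiler can typically produce, not a guarantee.

```c
/* Count-up form: the compiler usually needs a compare against n each pass. */
int sum_up(const int *data, unsigned int n)
{
    int sum = 0;
    unsigned int i;

    for (i = 0; i < n; i++)
        sum += data[i];
    return sum;
}

/* Count-down-to-zero form: the decrement itself can set the flags
   (SUBS), so a single conditional branch can close the loop. */
int sum_down(const int *data, unsigned int n)
{
    int sum = 0;

    for (; n != 0; n--)
        sum += *data++;
    return sum;
}
```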

(3) Using inline functions

ARM C supports the inline keyword. If a function is declared inline, its body replaces the call at each call site, which eliminates the function call overhead. The main drawback of inlining is that code size increases when the function is called from many places.
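A minimal sketch of an inlined helper used inside a hot loop. The names clamp_u8 and brighten are hypothetical; static inline is the C99 spelling, and whether a given ARM toolchain expects a different keyword is an assumption to check against its documentation.

```c
/* Small helper that is a good inlining candidate: the call overhead
   would otherwise be comparable to the work it does. */
static inline int clamp_u8(int v)
{
    if (v < 0)
        return 0;
    if (v > 255)
        return 255;
    return v;
}

/* Hypothetical caller: adds delta to every pixel, saturating at 0..255. */
void brighten(unsigned char *pixels, unsigned int n, int delta)
{
    while (n--) {
        *pixels = (unsigned char)clamp_u8(*pixels + delta);
        pixels++;
    }
}
```

The inline keyword is a request rather than a command, so the compiler may still emit an ordinary call at low optimization levels.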

1.2 Processor-related optimization methods

(1) Keep the pipeline flowing

As we can see from the previous introduction, pipeline delay or blocking will affect the performance of the processor, so the pipeline should be kept unblocked as much as possible. Pipeline delay is difficult to avoid, but the delay cycle can be used to perform other operations.

The auto-indexing feature of the load/store instructions is designed to take advantage of pipeline delay cycles. When the pipeline is in a delay cycle, the processor's execution units are occupied, but the arithmetic logic unit (ALU) and barrel shifter may be idle. They can then be used to add an offset to the base register for use by subsequent instructions. For example, the instruction LDR R1, [R2], #4 performs two operations: R1 = *R2 and R2 += 4, which is an example of post-indexing; while the instruction LDR R1, [R2, #4]! performs two operations: R1 = *(R2 + 4) and R2 += 4, which is an example of pre-indexing.

Pipeline stalls can be reduced by loop unrolling and similar methods. Unrolling a loop reduces the proportion of branch instructions among the instructions executed, thereby improving code efficiency. The following memory copy function serves as an example.

void memcopy(char *to, char *from, unsigned int nbytes)
{
    while (nbytes--)
        *to++ = *from++;
}

For simplicity, it is assumed here that nbytes is a multiple of 16 (the remainder processing is omitted). The function above performs a test and a branch for every byte it copies. The loop body can be unrolled as follows:

void memcopy(char *to, char *from, unsigned int nbytes)
{
    while (nbytes) {
        /* four byte copies per pass */
        *to++ = *from++;
        *to++ = *from++;
        *to++ = *from++;
        *to++ = *from++;
        nbytes -= 4;
    }
}

In this way, the loop body contains more instructions but executes fewer times, so the negative impact of the branch instructions is reduced. Taking advantage of the 32-bit word length of the ARM7 processor, the code can be adjusted further as follows:

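void memcopy(char *to, char *from, unsigned int nbytes)
{
    /* assumes to and from are word-aligned */
    int *p_to = (int *)to;
    int *p_from = (int *)from;

    while (nbytes) {
        /* four word copies (16 bytes) per pass */
        *p_to++ = *p_from++;
        *p_to++ = *p_from++;
        *p_to++ = *p_from++;
        *p_to++ = *p_from++;
        nbytes -= 16;
    }
}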

After optimization, each pass of the loop processes 16 bytes, which further reduces the impact of the branch instructions. However, the amount of code has increased after the adjustment.
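The examples above omit remainder and alignment handling. One possible way to add both is sketched below; the name memcopy_unrolled is hypothetical, the buffers are assumed not to overlap, and word copies are only attempted when both pointers are 4-byte aligned.

```c
#include <stdint.h>

/* Hypothetical variant of memcopy: unrolled word copy plus a byte loop
   for the remainder. Assumes the two buffers do not overlap. */
void memcopy_unrolled(char *to, const char *from, unsigned int nbytes)
{
    /* Word copies are only attempted when both pointers are 4-byte aligned. */
    if ((((uintptr_t)to | (uintptr_t)from) & 3u) == 0) {
        uint32_t *p_to = (uint32_t *)to;
        const uint32_t *p_from = (const uint32_t *)from;

        while (nbytes >= 16) {          /* 4 words = 16 bytes per pass */
            *p_to++ = *p_from++;
            *p_to++ = *p_from++;
            *p_to++ = *p_from++;
            *p_to++ = *p_from++;
            nbytes -= 16;
        }
        to = (char *)p_to;
        from = (const char *)p_from;
    }

    while (nbytes--)                    /* leftover or unaligned bytes */
        *to++ = *from++;
}
```

In production code the library memcpy is usually the better choice, since it is typically already tuned for the target.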

(2) Using register variables

The CPU accesses registers much faster than memory, so keeping a variable in a register helps optimize the code and improve running efficiency. Integer, pointer, floating-point and other types of variables can be assigned to registers; a structure can also be assigned to registers in part or in whole. Assigning registers to variables that are accessed frequently inside a loop body can also improve program efficiency to a certain extent.
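One common way to encourage register allocation in C is to work on a local copy of a global inside the loop, as in the sketch below. The names checksum_total and checksum_add are hypothetical; the C register keyword is only a hint, and modern compilers usually make this decision themselves.

```c
unsigned int checksum_total;   /* hypothetical global accumulator */

void checksum_add(const unsigned char *buf, unsigned int n)
{
    /* Working on a local copy lets the compiler keep the running sum
       in a register instead of reloading and storing the global
       on every pass of the loop. */
    unsigned int sum = checksum_total;

    while (n--)
        sum += *buf++;

    checksum_total = sum;      /* write back once, after the loop */
}
```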


1.3 Instruction set related optimization methods

Sometimes the characteristics of the ARM7 instruction set can be used to optimize the program.

(1) Avoid division

The ARM7 instruction set has no division instruction; division is implemented by calling C library functions. A 32-bit division usually takes 20 to 140 clock cycles, so division easily becomes a bottleneck of program efficiency and should be avoided wherever possible. Some divisions can be replaced by multiplications. For example, when y is positive and the product cannot overflow, if ((x / y) > z) can be changed to if (x > (y × z)) for real-valued operands; for truncating integer division of non-negative values, the exact equivalent is if (x >= y × (z + 1)). If the required accuracy can be met and spare memory is available, a lookup table can also be considered in place of division. When the divisor is a power of 2, use a shift operation instead of division.
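The comparison rewrite can be sketched as follows for unsigned integers. With truncating integer division, (x / y) > z holds exactly when x >= y × (z + 1), so that is the form used here. The name quotient_exceeds is hypothetical, y is assumed to be non-zero, and a 64-bit product is used so the multiplication cannot overflow.

```c
#include <stdint.h>

/* Hypothetical helper: tests (x / y) > z without dividing.
   Assumes y > 0. With truncating unsigned division,
   x / y > z  <=>  x >= y * (z + 1). */
int quotient_exceeds(uint32_t x, uint32_t y, uint32_t z)
{
    return (uint64_t)x >= (uint64_t)y * ((uint64_t)z + 1u);
}
```

When the operand ranges are known to be small, a plain 32-bit product is enough and cheaper than the widened multiply.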

(2) Using conditional execution

An important feature of the ARM instruction set is that almost every instruction can carry an optional condition code. The instruction is executed only when the condition flags in the program status register (PSR) satisfy the specified condition. Conditional execution usually removes the need for separate test-and-branch instructions, thereby reducing code size and improving program efficiency.
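A classic illustration is the subtraction-based greatest common divisor loop, sketched below. Both inputs are assumed to be positive, and the instruction sequence in the comment is what a compiler targeting ARM state can typically produce, not guaranteed output.

```c
/* Greatest common divisor by repeated subtraction; a and b must be > 0.
   The if/else body is short enough that a compiler targeting ARM state
   can typically reduce the whole loop to:

       loop  CMP   a, b
             SUBGT a, a, b
             SUBLT b, b, a
             BNE   loop

   with no branch needed around either subtraction. */
int gcd(int a, int b)
{
    while (a != b) {
        if (a > b)
            a -= b;
        else
            b -= a;
    }
    return a;
}
```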

(3) Use appropriate variable types

ARM C compilers support signed/unsigned 8-bit, 16-bit and 32-bit integer types as well as floating-point types. Choosing variable types properly can both save code and improve running efficiency. Avoid char and short local variables wherever possible, because operating on 8-bit/16-bit local variables often requires more instructions than operating on 32-bit variables. Compare the following 3 functions and their assembly code.

int wordinc(int a)              wordinc
{                                   ADD a1,a1,#1
    return a + 1;                   MOV pc,lr
}

short shortinc(short a)         shortinc
{                                   ADD a1,a1,#1
    return a + 1;                   MOV a1,a1,LSL #16
}                                   MOV a1,a1,ASR #16
                                    MOV pc,lr

char charinc(char a)            charinc
{                                   ADD a1,a1,#1
    return a + 1;                   AND a1,a1,#&ff
}                                   MOV pc,lr

As shown, operating on 32-bit variables requires fewer instructions than operating on 8-bit and 16-bit variables.

1.4 Memory-related optimization methods

(1) Using table lookup instead of calculation

When processor resources are limited but memory resources are relatively abundant, we can sacrifice storage space for running speed. For example, when we need to calculate the sine or cosine function values frequently, we can calculate the function values in advance and put them in memory for later retrieval.
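A minimal sketch of a precomputed sine table is shown below. The table resolution (10-degree steps), the Q15 fixed-point scaling (value × 32767) and the names sin_table and sin_q15 are all assumptions chosen for illustration; only the first quadrant is stored and the other three are derived by symmetry.

```c
#include <stdint.h>

/* sin(x) * 32767 for x = 0, 10, ..., 90 degrees (Q15 fixed point).
   Declared const so it can be placed in ROM/flash. */
static const int16_t sin_table[10] = {
    0, 5690, 11207, 16384, 21062, 25101, 28377, 30791, 32269, 32767
};

/* Hypothetical lookup: step is the angle in units of 10 degrees (0..35). */
int16_t sin_q15(unsigned int step)
{
    if (step <= 9)
        return sin_table[step];                 /*   0..90  degrees */
    if (step <= 18)
        return sin_table[18 - step];            /* 100..180 degrees */
    if (step <= 27)
        return (int16_t)-sin_table[step - 18];  /* 190..270 degrees */
    return (int16_t)-sin_table[36 - step];      /* 280..350 degrees */
}
```

A finer table costs more memory but gives better accuracy, which is the space-for-speed trade described above.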

(2) Make full use of on-chip RAM

Some ARM chips have a certain amount of RAM integrated on chip, such as Atmel's AT91R40807 with 128KB RAM and Sharp's LH75400/LH75401 with 32KB RAM. The processor accesses on-chip RAM faster than external RAM, so the program should be placed in on-chip RAM wherever possible. If the program is too large to fit entirely, consider placing the most frequently used data or program segments in on-chip RAM to improve running efficiency.

1.5 Compiler-related optimization methods

Most compilers support optimization of program speed and size. Some compilers also allow users to select the content and degree of optimization. Compared with the previous optimization methods, optimizing programs by setting compiler options is a simple and effective way.

2. Optimize code size

An important feature of RISC computers is that the length of instructions is fixed. This simplifies the process of instruction decoding, but it easily leads to an increase in code size. To avoid this problem, you can consider taking the following measures to reduce the amount of program code.

2.1 Using multiple register operation instructions

The multi-register instructions LDM/STM in the ARM instruction set load or store several registers at once, which is very effective for saving/restoring a group of registers and for copying large blocks of data. For example, saving the contents of registers R4~R12 and R14 to the stack takes a total of 10 STR instructions, whereas a single STMEA R13!, {R4-R12, R14} instruction achieves the same result and saves considerable instruction storage space. Note, however, that replacing several LDR/STR instructions with one LDM/STM instruction is primarily a code size saving and does not by itself imply a proportional gain in running speed: when the processor executes an LDM/STM instruction, it still transfers the registers one at a time.

2.2 Arrange the order of variables reasonably

The ARM7 processor requires 32-bit/16-bit variables to be word/halfword aligned, so a poorly chosen variable order can waste storage space. For example, if the four 32-bit int variables i1 ~ i4 and the four 8-bit char variables c1 ~ c4 in a structure are stored in the order i1, c1, i2, c2, i3, c3, i4, c4, the alignment of the int variables causes each 8-bit char variable between two int variables to occupy 32 bits of memory. To avoid this waste, the int variables and char variables should be stored contiguously in the order i1, i2, i3, i4, c1, c2, c3, c4.
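The effect of field order can be checked with sizeof, as in the sketch below. The structure and function names are hypothetical. With 4-byte int alignment the two sizes are typically 32 and 20 bytes, but the exact figures depend on the compiler and ABI.

```c
#include <stddef.h>

/* Fields interleaved: each char is followed by padding up to the next int. */
struct interleaved {
    int i1; char c1;
    int i2; char c2;
    int i3; char c3;
    int i4; char c4;
};

/* Fields grouped by type: the four chars pack into consecutive bytes. */
struct grouped {
    int  i1, i2, i3, i4;
    char c1, c2, c3, c4;
};

/* Sizes as reported by the compiler for the current target. */
size_t interleaved_size(void) { return sizeof(struct interleaved); }
size_t grouped_size(void)     { return sizeof(struct grouped); }
```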

2.3 Using Thumb Instructions

To reduce code size at a more fundamental level, ARM developed the 16-bit Thumb instruction set. Thumb is an extension of the ARM architecture: the Thumb instruction set consists of the most commonly used 32-bit ARM instructions compressed into 16-bit-wide encodings. During execution, the 16-bit instructions are transparently decompressed into 32-bit ARM instructions in real time without performance loss. Moreover, there is zero overhead when the program switches between Thumb state and ARM state. Compared with equivalent 32-bit ARM code, Thumb code can save up to 35% of memory space.

Conclusion

In summary, the optimization process is to make full use of hardware resources and constantly adjust the program structure to make it reasonable, based on a thorough understanding of the software/hardware structure and characteristics. Its purpose is to maximize the processor performance, maximize the use of resources, and maximize the performance of the program on a specific hardware platform. With the increasing application of ARM processors in industries such as communications and consumer electronics, optimization technology will play an increasingly important role in the program design process based on ARM processors.

It is worth noting that program optimization is usually only one of the many goals that software design needs to achieve. Optimization should be carried out without affecting the correctness, robustness, portability and maintainability of the program. One-sided pursuit of program optimization often affects important goals such as robustness and portability.
