Practical skills | Ten optimization solutions for embedded C code

Latest update time：2022-03-10

Code optimization has its own focus. Optimization is an art of balance, and it often comes at the expense of program readability or increased code length. In embedded development, the requirements for execution speed are relatively high, so learning to optimize well can make your code run more efficiently.

1. Choose the right algorithm and data structure

It is important to choose a suitable data structure. If a large number of insert and delete operations are performed on a collection of unordered data, a linked list is usually much faster than an array. Arrays and pointers are closely related. Generally speaking, pointers are more flexible and concise, while arrays are more intuitive and easier to understand. With many embedded compilers, code that uses pointers is shorter and more efficient than equivalent code that uses array indexing.

In many cases, you can use pointer arithmetic instead of array indexing, which often results in faster and shorter code. The difference is more pronounced with multidimensional arrays. The two fragments below do the same thing, but with different efficiency.

Array indexing:

for (;;) {
    a = array[t++];
    ...
}

Pointer arithmetic:

p = array;
for (;;) {
    a = *(p++);
    ...
}

The advantage of the pointer method is that the address of the array is loaded into p only once, and each pass through the loop merely increments p. With array indexing, the element address must be computed from the value of t on every pass.

2. Use the smallest possible data type

If a variable can be defined as a character type (char), don't define it as an integer (int); if it can be defined as an int, don't use long int; and avoid floating-point types (float) whenever you can. Of course, the values assigned must stay within the range of the chosen type. If you assign a value outside that range, the C compiler will usually not report an error, but the program will produce wrong results, and such errors are difficult to find.
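The range problem described above can be made visible with the fixed-width types from <stdint.h>. The following is a minimal sketch, assuming a C99 compiler that provides uint8_t; the function names next_count and add_saturating_u8 are hypothetical and exist only for illustration.

```c
#include <stdint.h>

/* Increments an 8-bit counter. The addition is carried out in a wider
   type and the result is converted back to uint8_t, so 255 + 1 silently
   wraps around to 0 without any compiler diagnostic. */
uint8_t next_count(uint8_t count)
{
    return (uint8_t)(count + 1u);
}

/* Adds two 8-bit values but clamps the result at 255 instead of wrapping. */
uint8_t add_saturating_u8(uint8_t a, uint8_t b)
{
    uint16_t sum = (uint16_t)((uint16_t)a + (uint16_t)b); /* holds up to 510 */
    return (sum > UINT8_MAX) ? (uint8_t)UINT8_MAX : (uint8_t)sum;
}
```

Whether wrapping or clamping is the right behavior depends on the application; the point is only that the small type saves memory as long as the value range has been checked.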

In ICCAVR, the printf variant can be selected in Options. Prefer the basic format specifiers (%c, %d, %x, %X, %u and %s), use the long integer specifiers (%ld, %lu, %lx and %lX) sparingly, and try not to use the floating-point specifier (%f). The same is true for other C compilers: all else being equal, using %f increases the amount of generated code and reduces execution speed.

3. Reduce the strength of calculations

1. Table Lookup (Required Course for Game Programmers)

A smart game developer will not do any calculations in his main loop. He will do the calculations first and then look up the table in the loop. See the following example:

Old code:

long factorial(int i)
{
    if (i == 0)
      return 1;
    else
      return i * factorial(i - 1);
}

New code:

static long factorial_table[] = {1, 1, 2, 6, 24, 120, 720  /* etc */ };
long factorial(int i)
{
    return factorial_table[i];
}

If the table is large and tedious to write by hand, write an init function that generates the table once, before the loop starts.
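Such an init function might look like the sketch below. The names FACT_TABLE_SIZE, fact_table_init and fact_lookup are hypothetical, and the table size of 13 is an assumption chosen so that every entry fits in a 32-bit unsigned long.

```c
#define FACT_TABLE_SIZE 13   /* 12! = 479001600 still fits in 32 bits */

static unsigned long fact_table[FACT_TABLE_SIZE];

/* Fills the table once; call this before the main loop starts. */
void fact_table_init(void)
{
    int i;
    fact_table[0] = 1UL;
    for (i = 1; i < FACT_TABLE_SIZE; i++) {
        fact_table[i] = fact_table[i - 1] * (unsigned long)i;
    }
}

/* Table lookup; returns 0 for an index outside the table. */
unsigned long fact_lookup(int i)
{
    if (i < 0 || i >= FACT_TABLE_SIZE) {
        return 0UL;
    }
    return fact_table[i];
}
```

The bounds check costs one comparison per lookup; on a tight inner loop with a trusted index it could be dropped, at the price of undefined behavior for a bad index.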

2. Remainder operation

a=a%8;

Can be changed to:

a=a&7;

Note: A bit operation can be completed in a single instruction cycle, while many C compilers for small microcontrollers call a subroutine to perform the "%" operation, which results in longer code and slower execution. In general, whenever you need the remainder of division by a power of two (2^n), you can use a bit operation instead. The two forms are equivalent only for unsigned or non-negative values.
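The remainder trick generalizes to any divisor that is a power of two. Here is a small sketch using unsigned operands; the function names mod8_mask and mod_pow2 are hypothetical, and mod_pow2 assumes the caller really passes a power of two.

```c
/* Remainder of division by 8 for an unsigned value: a % 8 == a & 7. */
unsigned int mod8_mask(unsigned int a)
{
    return a & 7u;
}

/* General form for a divisor that is a power of two (1, 2, 4, 8, ...).
   The result is meaningless if pow2 is not a power of two. */
unsigned int mod_pow2(unsigned int a, unsigned int pow2)
{
    return a & (pow2 - 1u);
}
```

For negative signed values the two forms differ: in C99, -3 % 8 is -3, while -3 & 7 is 5 on a two's-complement machine, so the replacement should be limited to unsigned data.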

3. Square operation

a=pow(a, 2.0);

Can be changed to:

a=a*a;

Note: On microcontrollers with a built-in hardware multiplier (such as the 51 series), multiplication is much faster than calling pow(), because pow() is a floating-point library subroutine. On AVR microcontrollers with a built-in hardware multiplier, such as the ATMega163, a multiplication completes in just 2 clock cycles. Even on AVR microcontrollers without a hardware multiplier, the multiplication subroutine is shorter and faster than the pow() subroutine.

If you want to find the cube, such as:

a=pow(a, 3.0);

change to:

a=a*a*a;

The improvement in efficiency is more obvious.

4. Use shifting to implement multiplication and division operations

a=a*4;
b=b/4;

Can be changed to:

a=a<<2;
b=b>>2;

Usually, if you need to multiply or divide by 2^n, you can use a shift instead. In ICCAVR, multiplying by 2^n generates left-shift code, while multiplying by any other integer or dividing by any number calls the multiplication and division subroutines. Shift code is more efficient than a call to those subroutines. (Right-shifting a negative signed value is implementation-defined in C, so b>>2 is a safe replacement for b/4 only when b is unsigned or non-negative.) In fact, multiplication by any integer constant can be built from shifts and additions, for example:

a=a*9;

Can be changed to:

a=(a<<3)+a;
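The same decomposition works for other constants by splitting them into powers of two. A minimal sketch with unsigned operands; times9 and times10 are hypothetical helper names.

```c
/* a * 9 written as a * 8 + a. */
unsigned int times9(unsigned int a)
{
    return (a << 3) + a;
}

/* a * 10 written as a * 8 + a * 2. */
unsigned int times10(unsigned int a)
{
    return (a << 3) + (a << 1);
}
```

Many current compilers perform this rewrite on their own when it pays off for the target, so it is worth checking the generated code before doing it by hand.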

Use an expression with less computational effort to replace the original expression. Here is a classic example:

Old code:

x = w % 8;
y = pow(x, 2.0);
z = y * 33;
for (i = 0; i < MAX; i++)
{
    h = 14 * i;
    printf("%d", h);
}

New code:

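x = w & 7;                 /* bitwise AND is faster than remainder */
y = x * x;                 /* multiplication is faster than pow() */
z = (y << 5) + y;          /* shift and add is faster than multiplication */
for (i = h = 0; i < MAX; i++)
{
    printf("%d", h);       /* prints 14 * i, as in the old code */
    h += 14;               /* addition is faster than multiplication */
}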

5. Avoid unnecessary integer division

Integer division is the slowest of all integer operations, so it should be avoided whenever possible. One opportunity to reduce integer division is chained division, where one of the divisions can be replaced by a multiplication. The side effect of this replacement is that the product may overflow, so it can only be used when the divisors stay within a limited range.

Bad code:

int i, j, k, m;
m = i / j / k;

Recommended code:

int i, j, k, m;
m = i / (j * k);
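The equivalence of the two forms can be checked with a small sketch. It uses unsigned operands to keep the arithmetic well defined; div_chained and div_combined are hypothetical names, and the combined form is only valid while j * k does not wrap around.

```c
/* Two divisions: i / j / k. */
unsigned int div_chained(unsigned int i, unsigned int j, unsigned int k)
{
    return i / j / k;
}

/* One multiplication and one division.
   Valid only while j * k fits in an unsigned int. */
unsigned int div_combined(unsigned int i, unsigned int j, unsigned int k)
{
    return i / (j * k);
}
```

On a target with a 16-bit unsigned int, for example, j = 300 and k = 300 already make the product wrap, which is the overflow risk mentioned above.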

6. Using the increment and decrement operators

Prefer the increment and decrement operators when adding or subtracting one, because with simple compilers an increment statement can be faster than an assignment statement. The reason is that on many CPUs, incrementing or decrementing a memory word does not need explicit load and store instructions. Consider the following statement:

x=x+1;

Taking typical microcomputer assembly language as an example, the generated code is similar to:

move A,x       ;load x from memory into accumulator A
add A,1        ;add 1 to accumulator A
store x        ;store the new value back to x

If the increment operator is used, the generated code is as follows:

incr x           ;add 1 to x

Obviously, without the load and store instructions, the increment and decrement operations execute faster and take less code.

7. Using compound assignment expressions

Compound assignment expressions (such as a-=1 and a+=1) name the operand only once, which helps simpler compilers generate compact, efficient code.

8. Extract common sub-expressions

In some cases, the C++ compiler cannot extract common sub-expressions from floating-point expressions, because doing so would mean reordering the expressions. In particular, the compiler may not rearrange expressions according to algebraic equivalence before extracting common sub-expressions. In such cases the programmer has to extract them manually (VC.NET has a "global optimization" option that can do this, but how effective it is is unknown).

Bad code:

float a, b, c, d, e, f;
...
e = b * c / d;
f = b / d * a;

Recommended code:

float a, b, c, d, e, f;
...
const float t = b / d;
e = c * t;
f = a * t;

Bad code:

float a, b, c, e, f;
...
e = a / c;
f = b / c;

Recommended code:

float a, b, c, e, f;
...
const float t = 1.0f / c;
e = a * t;
f = b * t;

4. Layout of structure members

Many compilers have options to "align structures to words, double words, or quad words". Even so, it is still worth improving the alignment of structure members yourself. Some compilers may allocate space for structure members in an order different from the declaration order, but others do not offer this feature or do it poorly. Therefore, to get the best alignment of structures and structure members at the least cost, the following methods are recommended:

1. Sort by the length of the data type

Sort the members of the structure by type size, declaring the longer types before the shorter ones. Compilers typically require multi-byte data types to be stored on even address boundaries. When declaring a structure that mixes multi-byte and single-byte data, place the multi-byte members first and the single-byte members after them, so as to avoid holes in memory. The compiler automatically aligns instances of structures on even memory boundaries.

2. Fill the structure to an integer multiple of the longest type length

Pad the structure to an integer multiple of the size of its longest type. That way, if the first member of a structure is aligned, the entire structure is aligned. The following example shows how to reorder the structure members:

Bad code, normal order:

struct
{
  char a[5];
  long k;
  double x;
} baz;

The recommended code, with the new order and a few bytes manually padded:

struct
{
  double x;
  long k;
  char a[5];
  char pad[7];
} baz;

This rule also applies to the layout of class members.
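The effect of member order can be measured on a given target with sizeof and offsetof. The sketch below uses two hypothetical structures, unsorted and sorted, that hold the same members in different orders; the actual sizes depend on the compiler and ABI.

```c
#include <stddef.h>

/* Small and large members are mixed, so the compiler usually inserts
   padding before d and after e. */
struct unsorted {
    char   c;
    double d;
    char   e;
};

/* Largest member first; the two chars share the tail of the structure. */
struct sorted {
    double d;
    char   c;
    char   e;
};

/* Bytes saved by the reordered declaration on the current target
   (0 if the target needs no alignment padding). */
size_t layout_savings(void)
{
    return sizeof(struct unsorted) - sizeof(struct sorted);
}
```

On a common x86-64 ABI, for example, the two structures occupy 24 and 16 bytes; on an 8-bit MCU with no alignment requirement the sizes may be identical.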

3. Sort local variables by length of data type

When the compiler allocates space for local variables, it often uses the same order in which they are declared in the source code. As with the previous rule, long variables should be declared before short ones. If the first variable is aligned, the others are stored contiguously and are naturally aligned without padding bytes. Some compilers do not reorder variables automatically when allocating them, and some cannot generate a 4-byte-aligned stack, so 4-byte alignment is not guaranteed. The following example demonstrates the reordering of local variable declarations:

Bad code, normal order:

short ga, gu, gi;
long foo, bar;
double x, y, z[3];
char a, b;
float baz;

Recommended code, improved order:

double z[3];
double x, y;
long foo, bar;
float baz;
short ga, gu, gi;
char a, b;

4. Copy frequently used pointer parameters to local variables

Avoid heavy use of pointer-type parameters inside a function. Accesses through pointer parameters often cannot be optimized, because the compiler does not know whether two pointers refer to the same memory (aliasing). This prevents data from being kept in registers and consumes significant memory bandwidth. Note that many compilers have an "assume no aliasing" optimization switch (in VC, /Oa or /Ow must be added manually to the compiler command line), which lets the compiler assume that two different pointers always refer to different objects, so there is no need to copy pointer parameters into local variables. Otherwise, copy the data pointed to into a local variable at the beginning of the function and, if necessary, copy it back before the function returns.

Bad code:

// assumes q != r
void isqrt(unsigned long a, unsigned long* q, unsigned long* r)
{
  *q = a;
  if (a > 0)
  {
    while (*q > (*r = a / *q))
    {
      *q = (*q + *r) >> 1;
    }
  }
  *r = a - *q * *q;
}

Recommended code:

// assumes q != r
void isqrt(unsigned long a, unsigned long* q, unsigned long* r)
{
  unsigned long qq, rr;
  qq = a;
  if (a > 0)
  {
    while (qq > (rr = a / qq))
    {
      qq = (qq + rr) >> 1;
    }
  }
  rr = a - qq * qq;
  *q = qq;
  *r = rr;
}

5. Loop Optimization

1. Fully unroll small loops

To make full use of the CPU's instruction cache, fully unroll small loops. Unrolling improves performance especially when the loop body itself is very small. Note: Many compilers cannot unroll loops automatically. Bad code:

// 3D transform: multiply vector V by 4x4 matrix M
for (i = 0; i < 4; i++)
{
  r[i] = 0;
  for (j = 0; j < 4; j++)
  {
    r[i] += M[j][i]*V[j];
  }
}

Recommended code:

r[0] = M[0][0]*V[0] + M[1][0]*V[1] + M[2][0]*V[2] + M[3][0]*V[3];
r[1] = M[0][1]*V[0] + M[1][1]*V[1] + M[2][1]*V[2] + M[3][1]*V[3];
r[2] = M[0][2]*V[0] + M[1][2]*V[1] + M[2][2]*V[2] + M[3][2]*V[3];
r[3] = M[0][3]*V[0] + M[1][3]*V[1] + M[2][3]*V[2] + M[3][3]*V[3];

2. Extract the common part

For some tasks that do not require loop variables to participate in calculations, they can be placed outside the loop. The tasks here include expressions, function calls, pointer operations, array accesses, etc. All operations that do not need to be performed multiple times should be grouped together and placed in an init initialization program.
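A common instance of this rule is a function call in the loop condition whose result never changes. The sketch below is illustrative only; count_char_slow and count_char_fast are hypothetical names, and strlen stands in for any loop-invariant computation.

```c
#include <stddef.h>
#include <string.h>

/* Calls strlen(s) on every iteration although s never changes. */
size_t count_char_slow(const char *s, char ch)
{
    size_t i, n = 0;
    for (i = 0; i < strlen(s); i++) {
        if (s[i] == ch) {
            n++;
        }
    }
    return n;
}

/* The invariant strlen(s) is computed once, before the loop. */
size_t count_char_fast(const char *s, char ch)
{
    size_t i, n = 0;
    const size_t len = strlen(s);
    for (i = 0; i < len; i++) {
        if (s[i] == ch) {
            n++;
        }
    }
    return n;
}
```

Both functions return the same count; the second one simply does the length calculation a single time instead of once per character.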

3. Delay function

The commonly used delay functions are in the form of self-increment:

void delay (void)
{
  unsigned int i;
  for (i=0;i<1000;i++) ;
}

Change it to a self-decrementing delay function:

void delay (void)
{
  unsigned int i;
  for (i=1000;i>0;i--) ;
}

The two functions produce similar delays, but almost all C compilers generate 1 to 3 bytes less code for the latter than for the former, because almost all MCUs have a branch-if-zero instruction, and the count-down form lets the compiler use it. The same is true of while loops: controlling the loop with a decrement generates 1 to 3 bytes less code than controlling it with an increment. However, when the loop body reads or writes an array through the loop variable "i", a pre-decrement loop may index outside the array bounds, so be careful. Also note that an optimizing compiler may remove an empty loop altogether unless the counter is declared volatile.

4. While loop and do…while loop

There are two loop forms when using the while loop:

unsigned int i;
i=0;
while (i<1000)
{
   i++;
   //user code
}

or:

unsigned int i;
i=1000;
do
{
   i--;
   //user code
}
while (i>0);

Of the two loops, the length of the code generated after compilation using the do...while loop is shorter than that of the while loop.

5. Loop Unrolling

This is a classic speed optimization, but many compilers can now do it automatically (gcc, for example, with -funroll-loops), so unrolling by hand pays off less than it used to.

Old code:

for (i = 0; i < 100; i++)
{
  do_stuff(i);
}

New code:

for (i = 0; i < 100; )
{
  do_stuff(i); i++;
  do_stuff(i); i++;
  do_stuff(i); i++;
  do_stuff(i); i++;
  do_stuff(i); i++;
  do_stuff(i); i++;
  do_stuff(i); i++;
  do_stuff(i); i++;
  do_stuff(i); i++;
  do_stuff(i); i++;
}

As you can see, the new code reduces the number of comparison instructions from 100 to 10, saving 90% of the loop-control overhead. Note, however, that compilers often refuse to unroll loops whose intermediate variables or results are modified inside the loop (they play it safe), so you may need to do the unrolling yourself.

Another point to note is that on CPUs with an internal instruction cache (such as MMX chips), unrolled code is large and may no longer fit in the cache. The unrolled code is then repeatedly transferred between memory and the CPU cache. Because memory is much slower than the cache, the unrolled loop can actually run slower. In addition, loop unrolling can interfere with vectorization.
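The unrolled example above works because 100 is an exact multiple of 10. When the iteration count is not known to be a multiple of the unroll factor, a short tail loop is needed. The following sketch assumes an unroll factor of 4; the function name sum_unrolled4 is hypothetical.

```c
#include <stddef.h>

/* Sums n integers with the loop body unrolled four times.
   A short tail loop handles the 0 to 3 leftover elements. */
long sum_unrolled4(const int *data, size_t n)
{
    long total = 0;
    size_t i = 0;

    for (; i + 4 <= n; i += 4) {
        total += data[i];
        total += data[i + 1];
        total += data[i + 2];
        total += data[i + 3];
    }
    for (; i < n; i++) {        /* remainder */
        total += data[i];
    }
    return total;
}
```

The unroll factor is a trade-off between fewer loop tests and larger code, so it is best chosen by measuring on the actual target.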

6. Nested loops

Putting related loops into one loop will also speed things up.

Old code:

for (i = 0; i < MAX; i++)         /* initialize 2d array to 0's */
    for (j = 0; j < MAX; j++)
        a[i][j] = 0.0;
for (i = 0; i < MAX; i++)         /* put 1's along the diagonal */
    a[i][i] = 1.0;

New code:

for (i = 0; i < MAX; i++)         /* initialize 2d array to 0's */
{
    for (j = 0; j < MAX; j++)
        a[i][j] = 0.0;
    a[i][i] = 1.0;                            /* put 1's along the diagonal */
}

7. Sort cases in the Switch statement by frequency of occurrence

A switch statement may be compiled in several different ways. The most common are a jump table and a comparison chain or tree. When a switch is compiled into a comparison chain, the compiler generates nested if-else-if code and compares the cases in order; on a match, it jumps to the matching statement. The case values can therefore be ordered by likelihood, with the most likely ones first, to improve performance. In addition, it is advisable to use small consecutive integers as case values, because most compilers can then convert the switch into a jump table.

Bad code:

int days_in_month, short_months, normal_months, long_months;
...
switch (days_in_month)
{
  case 28:
  case 29:
    short_months++;
    break;
  case 30:
    normal_months++;
    break;
  case 31:
    long_months++;
    break;
  default:
    cout << "month has fewer than 28 or more than 31 days" << endl;
    break;
}

Recommended code:

int days_in_month, short_months, normal_months, long_months;
...
switch (days_in_month)
{
  case 31:
    long_months++;
    break;
  case 30:
    normal_months++;
    break;
  case 28:
  case 29:
    short_months++;
    break;
  default:
    cout << "month has fewer than 28 or more than 31 days" << endl;
    break;
}

8. Convert large switch statements into nested switch statements

When a switch statement has many case labels, it is wise to convert it into nested switch statements to reduce the number of comparisons. Put the frequently occurring case labels in the outermost switch statement, and put the relatively infrequent ones in a second switch statement nested inside it. For example, the following program fragment handles the relatively infrequent cases under the default label.

pMsg=ReceiveMessage();
switch (pMsg->type)
{
      case FREQUENT_MSG1:
        handleFrequentMsg1();
        break;
      case FREQUENT_MSG2:
        handleFrequentMsg2();
        break;
        ...
      case FREQUENT_MSGn:
        handleFrequentMsgn();
        break;
      default:                     //the nested part handles infrequent messages
        switch (pMsg->type)
        {
          case INFREQUENT_MSG1:
               handleInfrequentMsg1();
               break;
          case INFREQUENT_MSG2:
               handleInfrequentMsg2();
               break;
               ...
          case INFREQUENT_MSGm:
               handleInfrequentMsgm();
               break;
        }
}

If there is a lot of work to do in each case of the switch, it may be more efficient to replace the entire switch statement with a table of function pointers, such as the following switch statement, which has three cases:

enum MsgType {Msg1, Msg2, Msg3};
switch (ReceiveMessage())
{
    case Msg1:
        ...
    case Msg2:
        ...
    case Msg3:
        ...
}

To speed up execution, replace the switch statement above with the following code.

/* preparation */
int handleMsg1(void);
int handleMsg2(void);
int handleMsg3(void);
/* create an array of function pointers */
int (*MsgFunction[])(void) = {handleMsg1, handleMsg2, handleMsg3};
/* replace the switch statement with this more efficient line */
status = MsgFunction[ReceiveMessage()]();
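Unlike a switch with a default label, a raw table lookup has no protection against an unexpected message value. The sketch below adds a bounds check; the MsgCount enumerator, the dispatch function and the placeholder handler bodies are assumptions made for illustration and are not part of the original code.

```c
enum MsgType { Msg1, Msg2, Msg3, MsgCount };

/* Placeholder handlers; real handlers would do the per-message work. */
static int handleMsg1(void) { return 10; }
static int handleMsg2(void) { return 20; }
static int handleMsg3(void) { return 30; }

/* One entry per message type, in enum order. */
static int (*const MsgFunction[MsgCount])(void) = {
    handleMsg1, handleMsg2, handleMsg3
};

/* Calls the handler for `type`; returns -1 for an unknown type. */
int dispatch(int type)
{
    if (type < 0 || type >= MsgCount) {
        return -1;
    }
    return MsgFunction[type]();
}
```

Declaring the table const also allows the compiler to place it in ROM on many embedded targets, which saves RAM.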

9. Loop Transpose

Some machines have a special instruction such as JNZ (jump if not zero) that is very fast. If your loop is not sensitive to direction, you can count from large to small.

Old code:

for (i = 1; i <= MAX; i++)
{
   ...
}

New code:

i = MAX+1;
while (--i)
{
  ...
}

Note, however, that if pointer operations use the value of i, this method may cause a serious out-of-bounds pointer error (i starts at MAX+1). You can of course correct this by adding to or subtracting from i, but then nothing is gained, unless the situation is similar to the following:

Old code:

char a[MAX+5];
for (i = 1; i <= MAX; i++)
{
  *(a+i+4)=0;
}

New code:

i = MAX+1;
while (--i)
{
    *(a+i+4)=0;
}

10. Common code blocks

Common processing modules often contain many if-then-else structures in order to serve every kind of caller. This is not good: if the conditions are complicated, evaluating them consumes a lot of time. Use such general-purpose code blocks as little as possible. (In any case, space optimization and time optimization pull in opposite directions. --Donglou) Of course, a simple test such as (3==x) is still acceptable in moderation. Remember, optimization is always about finding a balance, not going to extremes.

11. Improve loop performance

To improve the performance of a loop, it is useful to reduce redundant constant calculations (i.e., calculations that do not vary across the loop).

Bad code (contains an unchanging if() inside a for()):

for (i ...)
{
  if (CONSTANT0)
  {
    DoWork0(i); // assumes CONSTANT0 is not changed here
  }
  else
  {
    DoWork1(i); // assumes CONSTANT0 is not changed here
  }
}

Recommended code:

if (CONSTANT0)
{
  for (i ...)
  {
    DoWork0(i);
  }
}
else
{
  for (i ...)
  {
    DoWork1(i);
  }
}

This avoids repeated calculations if the value of if() is already known. Although the branch in the bad code can be easily predicted, the recommended code reduces the reliance on branch prediction because the branch is determined before entering the loop.

12. Choose a good infinite loop

In programming, we often need infinite loops. The two most common forms are while (1) and for (;;). They have exactly the same effect, but which one is better? Let's look at the code one compiler generated for each:

Before compilation:

while (1);

After compilation:

mov eax,1
test eax,eax
je foo+23h
jmp foo+18h

Before compilation:

for (;;);

After compilation:

jmp foo+23h

Obviously, for (;;) produces fewer instructions, uses no register, and needs no test or conditional jump, so it is better than while (1). With optimization enabled, most modern compilers emit the same single jump for both forms.

6. Improve CPU parallelism

1. Use parallel code

Whenever possible, break long dependent code chains into several independent chains that can execute in parallel in the pipelined execution units. Many high-level languages, including C++, do not reorder floating-point expressions on their own, because the rules for doing so safely are quite complex. Note that code which is algebraically identical to the original after reordering does not necessarily produce the same results, because floating-point arithmetic has limited precision. In some cases these optimizations may lead to unexpected results. Fortunately, in most cases only the least significant bit of the final result is likely to differ.

Bad code:

double a[100], sum;
int i;
sum = 0.0;
for (i = 0; i < 100; i++)
  sum += a[i];

Recommended code:

double a[100], sum1, sum2, sum3, sum4, sum;
int i;
sum1 = sum2 = sum3 = sum4 = 0.0;
for (i = 0; i < 100; i += 4)
{
  sum1 += a[i];
  sum2 += a[i+1];
  sum3 += a[i+2];
  sum4 += a[i+3];
}
sum = (sum4+sum3)+(sum1+sum2);

Note that a 4-way split is used because the floating-point adder in this example has a 4-stage pipeline. Each stage takes one clock cycle, so four independent sums keep the adder fully utilized.

2. Avoid unnecessary read and write dependencies

When data is saved to memory, there is a read-write dependency, that is, the data must be correctly written before it can be read again. Although CPUs such as AMD Athlon have hardware to accelerate read-write dependency delays, allowing the data to be read before it is written to memory, it will be faster if the read-write dependency is avoided and the data is stored in internal registers. Avoiding read-write dependencies is especially important in a long and interdependent chain of code. If the read-write dependency occurs when operating an array, many compilers cannot automatically optimize the code to avoid read-write dependencies. Therefore, it is recommended that programmers manually eliminate read-write dependencies, for example, by introducing a temporary variable that can be stored in a register. This can greatly improve performance. The following code is an example:

Bad code:

float x[VECLEN], y[VECLEN], z[VECLEN];
...
for (unsigned int k = 1; k < VECLEN; k++)
{
  x[k] = x[k-1] + y[k];
}

for (unsigned int k = 1; k < VECLEN; k++)
{
  x[k] = z[k] * (y[k] - x[k-1]);
}

Recommended code:

float x[VECLEN], y[VECLEN], z[VECLEN];
...
float t = x[0];
for (unsigned int k = 1; k < VECLEN; k++)
{
  t = t + y[k];
  x[k] = t;
}
t = x[0];
for (unsigned int k = 1; k < VECLEN; k++)
{
  t = z[k] * (y[k] - t);
  x[k] = t;
}

7. Loop-invariant calculation

For calculations that do not depend on the loop variable, move them outside the loop. Many compilers can do this on their own, but they will not touch expressions that involve variables which might change, so in many cases you still have to do it yourself. For functions called inside the loop, pull out every operation that does not need to run each time and put it into an init function that is called before the loop. Also try to reduce the amount of data passed in on each call, and do not pass parameters that are not needed. If the function needs a loop counter, let it keep its own static counter and increment it itself, which will be faster.

Structure access is another case. In Donglou's experience, whenever two or more members of a structure are accessed in a loop, it is worth creating an intermediate variable (that is how it is for structures; what about C++ objects? Think about it). See the following example:

Old code:

total = a->b->c[4]->aardvark + a->b->c[4]->baboon + a->b->c[4]->cheetah + a->b->c[4]->dog;

New code:

struct animals * temp = a->b->c[4];
total = temp->aardvark + temp->baboon + temp->cheetah + temp->dog;

Some old C compilers do not perform this kind of common sub-expression optimization, but newer ANSI-compliant compilers can do it automatically. See the example:

float a, b, c, d, f, g;
...
a = b / c * d;
f = b * g / c;

This form is valid, of course, but the compiler cannot optimize it.

float a, b, c, d, f, g;
...
a = b / c * d;
f = b / c * g;

If written this way, a new compiler that complies with the ANSI specification can calculate b/c only once and then substitute the result into the second formula, saving a division operation.

8. Function Optimization

1. Inline function

In C++, the keyword inline can be added to any function declaration. It asks the compiler to replace every call to the function with the code of the function body. This is faster than a function call in two ways: first, the time taken by the call instruction itself is saved; second, the time taken to pass arguments and to enter and leave the function is saved. However, while this optimizes speed, the program becomes larger, so more ROM is required. The optimization is most effective when the inline function is called frequently and contains only a few lines of code.
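In C, the same idea is available since C99, where small helpers are commonly written as static inline functions. The sketch below assumes a C99 compiler; the helper names high_nibble, low_nibble and swap_nibbles are hypothetical. The keyword is only a request, and the compiler remains free to ignore it.

```c
#include <stdint.h>

/* Small, frequently called helpers: good candidates for inlining. */
static inline uint8_t high_nibble(uint8_t value)
{
    return (uint8_t)(value >> 4);
}

static inline uint8_t low_nibble(uint8_t value)
{
    return (uint8_t)(value & 0x0Fu);
}

/* Swaps the two 4-bit halves of a byte using the helpers above. */
uint8_t swap_nibbles(uint8_t value)
{
    return (uint8_t)((low_nibble(value) << 4) | high_nibble(value));
}
```

Compared with function-like macros, inline functions keep type checking and evaluate their argument only once.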

2. Do not define unused return values

The function definition does not know whether the function return value will be used. If the return value will never be used, void should be used to explicitly declare that the function does not return any value.

3. Reduce function call parameters

Using global variables is more efficient than passing parameters to functions. This eliminates the time required to push parameters to the stack when the function is called and to pop them off the stack when the function is completed. However, the decision to use global variables can affect the modularity and reentrancy of the program, so use them with caution.

4. All functions should have prototype definitions

Generally speaking, all functions should have prototype definitions. Prototype definitions can convey more information to the compiler that may be used for optimization.

5. Use constants whenever possible

Use constants (const) whenever possible. The C++ standard states that if the address of a const-declared object is not obtained, the compiler is allowed not to allocate storage space for it. This can make the code more efficient and generate better code.

6. Declare local functions as static

If a function is used only in the file that implements it, declare it as static to force internal linkage. Otherwise, the function is defined as external linkage by default. This may affect some compiler optimizations - for example, automatic inlining.

9. Use recursion

Unlike languages like LISP, C is pathologically fond of using repetitive code loops from the beginning. Many C programmers are determined not to use recursion unless the algorithm requires it. In fact, C compilers are not averse to optimizing recursive calls at all. On the contrary, they like to do this. Only when the recursive function needs to pass a large number of parameters and may cause a bottleneck, should loop code be used. Otherwise, it is better to use recursion.

10. Variables

1. Register variables

You can use the register keyword when declaring local variables. This asks the compiler to keep the variable in a general-purpose register instead of on the stack. Used appropriately, it can increase execution speed. The more often the function is called, the more likely it is to speed up the code.

Avoid using global and static variables in the innermost loop unless you are sure they do not change during the loop. Most compilers have only one way to optimize a variable, which is to keep it in a register; for variables that may change behind their back, they simply give up optimizing the whole expression. Try to avoid passing the address of a variable to another function, common as this practice is. A C compiler assumes by default that a function's local variables are private to it; that is how the language works, and it is in this situation that optimization works best. But once a variable may be modified by another function, the compiler no longer dares to keep it in a register, which seriously affects speed. See the example:

a = b();
c(&d);

Because the address of d is passed to function c and d may be changed there, the compiler does not dare keep it in a register for long. When execution reaches c(&d), the compiler writes d back to memory. Inside a loop, this causes d to be read and written between memory and a register on every pass. As we all know, CPU reads and writes over the system bus are very slow. For example, a Celeron 300 runs its core at 300 MHz but its bus at only 66 MHz, so the CPU may have to wait 4 to 5 cycles for a single bus read. I shudder just thinking about it.

2. Declaring multiple variables at the same time is better than declaring variables individually

3. Short variable names are better than long variable names, and variable names should be kept as short as possible

4. Declare variables before the start of the loop

11. Use nested if structures

If there are many parallel conditions to be judged in the if structure, it is best to split them into multiple if structures and then nest them together to avoid unnecessary judgments.
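As a rough illustration of the idea, the sketch below compares a flat chain of conditions with a nested version. The functions classify_flat and classify_nested, their parameters, and the thresholds 10 and 100 are all hypothetical.

```c
/* Flat version: every test repeats the check on `enabled`. */
int classify_flat(int enabled, int value)
{
    if (enabled && value < 10)  return 1;
    if (enabled && value < 100) return 2;
    if (enabled)                return 3;
    return 0;
}

/* Nested version: `enabled` is tested once, then only value is compared. */
int classify_nested(int enabled, int value)
{
    if (enabled) {
        if (value < 10)  return 1;
        if (value < 100) return 2;
        return 3;
    }
    return 0;
}
```

Both functions return the same result for every input; the nested form simply avoids re-evaluating the shared condition, and a disabled input exits after a single test.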

Finally

The optimization techniques above were collected and compiled by Wang Quanming. Much of the material comes from the Internet and its sources are unknown. Thanks to all the authors! Because embedded development places high demands on execution speed, these techniques mainly aim at making programs run faster. Note: Optimization has its own focus. It is an art of balance, and it often comes at the expense of program readability or increased code length.
