Modern processors and optimized code performance
Some features of modern processors
One of the great achievements of modern processors is that they use complex and exotic microarchitectures in which multiple instructions can be executed in parallel, while presenting the appearance of simple sequential instruction execution. - "In-depth Understanding of Computer Systems"
In our intuitive understanding, the processor is a steady, dependable machine that simply repeats fetch, decode, and execute according to the compiled instructions. In fact, the way modern processors handle instructions is no longer as regular as it appears on the surface; they may adopt different execution strategies for different forms of code.
Regarding the features of modern processors, this article only briefly introduces a few that are relevant to the code optimization techniques that follow. For more detailed descriptions, please refer to [1], which covers them in depth.
1.1 Superscalar
A processor that can perform multiple operations per clock cycle is called a "superscalar processor". Modern processors achieve superscalar processing in two main ways:
- Multiple parallel functional units. These units can execute the same or different instructions at the same time. For example, TI's C64X+ architecture provides 8 parallel functional units responsible for multiplication, addition, logic, memory access, and other operations; the VLIW (Very Long Instruction Word) architecture is designed to dispatch instructions to these multiple functional units;
- SIMD (Single Instruction, Multiple Data) extensions such as SSE (Streaming SIMD Extensions). By adding extra vector functional units and vector registers, a single instruction can control multiple identical computations, such as accessing 8 bytes of data at a time or performing 8 16x16 multiplications at a time (a small intrinsics sketch follows this list). The NEON coprocessor in ARM processors is a SIMD extension of the ARM architecture.
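As a concrete illustration of SIMD, here is a minimal sketch using SSE2 intrinsics (the function name mul8_epi16 is ours, and the loop simply ignores any leftover elements when length is not a multiple of 8); a single _mm_mullo_epi16 instruction performs 8 16x16 multiplications at once:
#include <emmintrin.h>  /* SSE2 intrinsics */

/* Sketch only: multiply two arrays of shorts, 8 elements per iteration. */
void mul8_epi16(const short *a, const short *b, short *out, unsigned int length)
{
    unsigned int i;
    for (i = 0; i + 8 <= length; i += 8)
    {
        __m128i va = _mm_loadu_si128((const __m128i *)(a + i));  /* load 8 shorts */
        __m128i vb = _mm_loadu_si128((const __m128i *)(b + i));
        __m128i vp = _mm_mullo_epi16(va, vb);   /* 8 16x16 multiplies, low 16 bits kept */
        _mm_storeu_si128((__m128i *)(out + i), vp);   /* store 8 results */
    }
}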
1.2 Cache
Cache is a small and fast storage device. Generally speaking, the CPU accesses the cache at a speed second only to registers. Cache capacity is limited not only because fast memory is expensive, but more importantly because access latency grows as storage capacity increases. Therefore, many processors adopt a multi-level cache structure, and the closer a level is to the CPU, the smaller its capacity.
Modern processors include separate I-caches (instruction caches) and D-caches (data caches), which are managed by dedicated hardware logic. Simply put, when the CPU first accesses a lower-level storage location (a cache miss), the cache controller fetches a whole block of instructions/data starting at that address from the next level of the hierarchy, so that when the CPU accesses the following contiguous instructions/data it only needs to access the cache (a cache hit).
There are several important indicators to measure cache performance: miss rate, hit rate, hit time, and miss penalty. From the miss penalty indicator, we can see how important cache is to speeding up the CPU's computing performance.
The miss penalty is the extra time required when a cache miss occurs. For L1 caches, the hit time is on the order of a few clock cycles; an L1 miss serviced from L2 usually costs tens of cycles, one serviced from L3 around 50 cycles, and one serviced from main memory around 200 cycles!
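To see why the miss penalty matters, consider a rough illustrative calculation (the numbers here are assumed, not measured): with a 4-cycle L1 hit time, a 5% miss rate, and a 100-cycle average miss penalty, the average memory access time is roughly 4 + 0.05 x 100 = 9 cycles, more than twice the hit time, even though 95% of accesses hit.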
1.3 Branch prediction, speculative execution, and conditional moves
Branch code is an obstacle to pipelining because neither the compiler nor the hardware can always know in advance which instruction will be executed after the branch; the pipeline would have to wait for the branch outcome before it can continue to be filled, creating "bubbles" in the pipeline.
Modern processors use a technique called branch prediction, which guesses whether a branch will be taken, and also predicts the target address of the branch. Then, using a technique called speculative execution, the processor starts fetching instructions located at the branch it predicted to jump to, decoding the instructions, and executing them even before it determines whether the branch prediction was correct. Once the actual branch path is determined, the processor "commits" the results of the speculatively executed instructions if the prediction was correct.
Conditional branches can incur a large "misprediction penalty" when the branch prediction logic mispredicts. At this point, the processor must discard all speculative execution results and restart the instruction fetch process at the correct location, which must refill the instruction pipeline before any useful results can be produced.
Another way to handle branches is to use "conditional move instructions". The compiler can generate code that uses these instructions to select or discard a result depending on whether a condition holds, rather than the traditional conditional transfer of control. The basic idea is to compute both outcomes of a conditional expression or statement and then use a conditional move to select the desired one. Conditional move instructions can be handled as part of the normal instruction pipeline; there is no need to guess whether the condition holds, and therefore no misprediction penalty.
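As a minimal sketch of this idea (the example is ours, not from the original text), an absolute-value function written with the ternary operator computes both candidate values and lets the compiler select one, typically via a conditional move rather than a branch:
/* Both candidate values are computed; a conditional move selects one.
   A branchy "if (x < 0) x = -x;" version would require a jump instead. */
long abs_cmov(long x)
{
    long neg = -x;             /* value used if the condition holds */
    return (x < 0) ? neg : x;  /* commonly compiled to a cmov-style instruction */
}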
1.4 Out-of-order execution
For a single thread, if instructions are executed strictly in order, a later instruction sometimes has to wait for the result of an earlier one, leaving functional units or the pipeline idle and reducing efficiency.
Out-of-order execution means that logically later instructions can be executed before earlier instructions, which improves the execution efficiency of the hardware and achieves a higher level of instruction-level parallelism. The processor uses a "register renaming" method to implement out-of-order execution of instructions while ensuring that the final result of the program is not affected.
The necessity of code optimization
Modern processors have considerable computing power, but we may need to write programs in a very stylized way to coax these capabilities out of them. - "In-depth Understanding of Computer Systems"
Let's set aside code architecture and readability for the moment and talk only about the core code segments that have the greatest impact on the performance of the whole program. Most programmers care about the algorithm that implements a function, and rarely pay attention to writing compiler- and processor-friendly code. For example, when sorting an array, we wonder whether to use bubble sort, insertion sort, or heap sort... and then tirelessly compare which algorithm consumes the least computation. In engineering practice, I have found that properly adjusting the code implementation can often bring larger efficiency gains than the choice of algorithm itself, sometimes by a factor of several or even dozens of times! This is certainly not to say that algorithm selection is unimportant, but to show that code optimization is also very important.
Compilers usually include an optimizer that can automatically optimize user code. But even when the optimizer does its best, human intervention is still necessary.
On the one hand, adjusting code structure is risky; to avoid introducing errors through optimization, the compiler always makes conservative assumptions.
On the other hand, since ordinary programs are not inherently parallel, the instruction-level parallelism achievable through superscalar execution is severely limited. Even the smartest out-of-order superscalar processor, paired with a capable compiler, is still affected by the combined effects of load delays, cache misses, branches, and dependencies between instructions, so the processor runs at full capacity in only a small fraction of cycles.
In view of the above reasons, users can optimize their code from two aspects:
- Compiler-friendly. Understand the capabilities and limitations of the optimizing compiler, and try to "tell" the compiler the user's true intent through the code itself and through compiler directives;
- Processor-friendly. Adjust the code implementation to make the best possible use of the processor's hardware units.
Although the same optimization strategy may not have the same effect on different processors, the general principles of operation and optimization apply to a variety of processors.
Simple but effective optimization techniques
3.1 Eliminate unnecessary memory references
Code Snippet 1
void array_sum(short *a, short *sum, unsigned int length)
{
    unsigned int i;
    for (i = 0; i < length; i++)
    {
        *sum = *sum + a[i];
    }
}
For the above code, each iteration requires two memory reads, one memory write, and one addition. However, only in the last iteration do we actually need to write the result to the address pointed to by sum; the intermediate results can be kept in a register. Therefore, the code can be changed as follows:
Code Snippet 2
void array_sum(short *a, short *sum, unsigned int length)
{
    unsigned int i;
    short sum_temp = 0;
    for (i = 0; i < length; i++)
    {
        sum_temp = sum_temp + a[i];
    }
    *sum = sum_temp;
}
This reduces the memory operations per iteration from two reads and one write to just one read.
You might wonder: wouldn't the compiler perform such an obvious optimization automatically? If we analyze the code carefully, we find that if a and sum point to the same memory when the function is called, the two versions above may produce different results. When the compiler cannot rule out such memory aliasing, it must take the conservative approach! A small illustration of this aliasing case follows.
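The following hypothetical call (the array contents are chosen only for demonstration) shows how aliasing changes the result when sum points into the array itself:
/* Aliasing example: sum points at a[0], so a and sum overlap.
   Code Snippet 1 keeps re-reading the element it has already modified:
     i=0: a[0] = 1 + 1 = 2;  i=1: a[0] = 2 + 2 = 4;  i=2: a[0] = 4 + 3 = 7
   Code Snippet 2 sums the original elements in a register and stores once:
     a[0] = 1 + 2 + 3 = 6
   Since the compiler cannot prove this never happens, it keeps the memory accesses. */
short a[3] = {1, 2, 3};
array_sum(a, &a[0], 3);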
3.2 Multiple accumulation variables
Reconsider Code Snippet 2. Since each sum_temp update depends on the result of the previous one, at most one element can be accumulated per cycle. If the processor has two parallel addition units, one of them will always be idle. Taking this into account, rewrite the code as follows:
Code Snippet 3
void array_sum(short *a, short *sum, unsigned int length)
{
    unsigned int i;
    short sum_temp1 = 0;
    short sum_temp2 = 0;
    for (i = 0; i + 1 < length; i += 2)
    {
        sum_temp1 = sum_temp1 + a[i];
        sum_temp2 = sum_temp2 + a[i+1];
    }
    for (; i < length; i++)
    {
        sum_temp1 = sum_temp1 + a[i];
    }
    *sum = sum_temp1 + sum_temp2;
}
Accumulating into two temporary variables allows the processor's two addition units to work in the same cycle, improving instruction-level parallelism.
3.3 Writing code suitable for conditional moves
Code Snippet 4
for (i = 0; i < CORDIC_level; i++)
{
    if (y_coord < 0)
    {
        x_coord = x_coord - (y_coord >> i);
        y_coord = y_coord + (x_coord >> i);
        angle_accumulate = angle_accumulate - angleLUT[i];
    }
    else
    {
        x_coord = x_coord + (y_coord >> i);
        y_coord = y_coord - (x_coord >> i);
        angle_accumulate = angle_accumulate + angleLUT[i];
    }
}
As shown in Code Snippet 4 above, the loop body contains branch statements, making it difficult for the compiler to pipeline the loop. In addition, since branch prediction works well only for regular patterns, the y_coord < 0 condition above is essentially unpredictable, so branch prediction handles it poorly.
If the compiler can generate code that uses conditional data transfers instead of conditional control transfers, it can greatly improve program performance. Some methods of expressing conditional behavior can be directly translated into conditional transfers, avoiding the need for the processor to perform branch prediction.
Change Code Snippet 4 to the following style and check the generated assembly code to confirm that it does use conditional moves:
Code Snippet 5
int x_temp, y_temp, y_neg;
for (i = 0; i < CORDIC_level; i++)
{
    x_temp = x_coord >> i;
    y_temp = y_coord >> i;
    y_neg  = (y_coord < 0);   /* evaluate the condition once, before y_coord is updated */
    x_coord = y_neg ? (x_coord - y_temp) : (x_coord + y_temp);
    y_coord = y_neg ? (y_coord + x_temp) : (y_coord - x_temp);
    angle_accumulate = y_neg ? (angle_accumulate - angleLUT[i]) : (angle_accumulate + angleLUT[i]);
}
3.4 Cache-friendly code
Another article on this public account, "Things related to storage in computer systems", has already introduced the temporal and spatial locality of memory accesses and given examples of writing code with good locality.
Here is another problem that is easy to overlook; it came up during debugging in one of my actual projects. A simulation that did not model memory access showed that a certain function should take about 100 us to run, but in the actual run it took about 700 us. Since the simulation ignores memory access latency, we expect the actual result to be somewhat longer than the simulated one, but 700 us is seven times the simulated 100 us, which is clearly abnormal.
After some investigation, we finally found that the problem was caused by a variable initialization statement. A global array short a[1920*8] was initialized at the beginning of the function:
memset(a, 0, sizeof(a));
However, this single statement took more than 500 us! After analyzing what the code does, we found that this initialization could be avoided entirely with some adjustments. For such a large array, no matter how good the locality is, the number of cache misses will be large and large portions of the cache will be evicted.
More generally, some good programming habits can hurt performance, and block initialization is one of them. In code you often see a memset immediately after a malloc, followed by assignments into the block. If the memory block is large, performance suffers noticeably. Initializing variables is a good habit, but when it conflicts with performance, try to avoid such operations, or initialize only the key data and avoid touching large blocks of memory. A small sketch of this idea follows.
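Here is a minimal sketch of that idea, assuming (hypothetically) that each call only needs the first used_len elements of the buffer; the names process and used_len are ours for illustration:
#include <string.h>

short a[1920 * 8];   /* roughly 30 KB of shorts */

void process(unsigned int used_len)
{
    /* memset(a, 0, sizeof(a));  -- would clear the whole ~30 KB buffer and
       evict large portions of the cache on every call */
    memset(a, 0, used_len * sizeof(short));   /* clear only the portion actually used */
    /* ... fill and read a[0 .. used_len-1] ... */
}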
Program performance analysis
4.1 Identify performance bottlenecks
When dealing with large programs, it is difficult to know exactly where to optimize. In this case, you can use a code profiler to collect statistics such as the number of calls to and time spent in each function during program execution, and print a profile report showing how the run time is distributed across functions.
Amdahl's law can be used to analyze how much impact the improvement of a certain part of the program will ultimately have on the overall performance of the program.
Amdahl's law states that if the original execution time of the program is Told, the fraction of that time spent in a certain part of the code is a, and that part is sped up by a factor of b, then the speedup of the whole program is:
Told/Tnew = 1/[(1-a) + a/b]
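For example, with assumed numbers: if a part that accounts for 60% of the run time (a = 0.6) is made 3 times faster (b = 3), the overall speedup is only 1/[(1 - 0.6) + 0.6/3] = 1/0.6 ≈ 1.67, even though the optimized part itself became three times faster.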
4.2 Maximum performance of the program
After optimizing the code, the effect of the optimization can be measured through simulation or an actual run. However, that is only a relative comparison. If we could establish an evaluation method that first sets a bound on a performance metric (much like the Cramer-Rao bound in parameter estimation) and then measures that metric for the written code, we could judge in absolute terms how well the code is optimized and how much room for optimization remains.
In the book "In-depth Understanding of Computer Systems", the author provides exactly such an evaluation method: the number of cycles per element (CPE) is used as the statistical metric, and the latency bound and throughput bound describe the maximum achievable performance of the program.
The CPE metric applies only to loop code (it can almost be said that code performance optimization is loop optimization) and refers to the number of cycles consumed per element of data processed. Cycles per element is used instead of cycles per loop iteration because the number of iterations varies with the degree of loop unrolling, and what we ultimately care about is how fast the program runs for a given vector length.
The latency bound describes the number of cycles taken along the critical path (longest path) to process each element when a series of operations must be performed in strict order. It constrains program performance when data dependencies prevent instructions from being parallelized.
The throughput bound describes the raw computational capacity of the processor's functional units when they are fully utilized. For example, if a processor has two units that can perform multiplications simultaneously, then for a loop that performs one multiplication per element the throughput bound is 0.5 cycles per element. The throughput bound is the ultimate limit on program performance.
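As an illustration with assumed numbers: if such a loop measures 1.0 CPE against that 0.5 CPE throughput bound, there is in principle still up to a factor of two to be gained; if it already measures close to 0.5 CPE, further low-level tuning of that loop will yield little.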