Interviewer: How to write code that makes the CPU run faster?
Preface
Code is ultimately executed by the CPU, so the quality of our code determines how efficiently the CPU runs. This matters especially when writing computationally intensive programs; otherwise, inefficient code can seriously hurt system performance.
The CPU has a CPU Cache embedded inside it. Its storage capacity is very small, but it is very close to the CPU core, so the cache read and write speeds are extremely fast. If the CPU reads data directly from the CPU Cache instead of from the memory during calculation, the calculation speed will be very fast.
However, most people do not know the operating mechanism of CPU Cache, so they don’t know how to write code that can cooperate with the working mechanism of CPU Cache. Once you master it, you will have new optimization ideas when writing code.
So, let’s take a look at what CPU Cache is like, how it works, and how to write code to make the CPU execute faster.
How fast is CPU Cache?
You may be curious why we need CPU Cache when we have memory? According to Moore's Law, the access speed of CPU will double every 18 months, which is equivalent to an annual growth of about 60%. The speed of memory will also continue to grow, but the growth rate is much lower than that of CPU, with an average annual growth of only about 7%. As a result, the gap between CPU and memory access performance continues to widen.
Even today, a single memory access takes 200~300 clock cycles, which means the access speeds of the CPU and memory now differ by a factor of 200~300 or more.
In order to make up for the performance difference between CPU and memory, CPU Cache, also known as cache, was introduced inside the CPU.
CPU Cache is usually divided into three levels of cache of varying sizes, namely L1 Cache, L2 Cache and L3 Cache .
Since the material used for CPU Cache is SRAM, its price is much higher than DRAM used for memory. Today, it costs US$7 to produce 1 MB of CPU Cache, while memory only costs US$0.015, a cost difference of 466 times. Therefore, CPU Cache is not calculated in GB like memory. Its size is calculated in KB or MB.
On a Linux system, the size of each cache level can be read from the files under /sys/devices/system/cpu/cpu0/cache/ (each indexN directory has a size file). For example, on this server, the L1 Cache closest to the CPU core is 32KB, the L2 Cache is 256KB, and the largest, the L3 Cache, is 3MB.
Among them, L1 Cache is usually divided into a "data cache" and an "instruction cache", which means that data and instructions are cached separately at the L1 level. In the sysfs listing, index0 is the data cache and index1 is the instruction cache; the two are usually the same size.
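For reference, here is a minimal C sketch that reads these sysfs files. The paths follow the standard Linux sysfs layout; the number of indexN directories and the reported values vary from machine to machine:

```c
// Minimal sketch: print the level, type and size of each cache reported by
// Linux sysfs for cpu0. Assumes the standard layout under
// /sys/devices/system/cpu/cpu0/cache/indexN/ (values differ per machine).
#include <stdio.h>

static void print_cache_file(const char *index, const char *file) {
    char path[256], buf[64];
    snprintf(path, sizeof(path),
             "/sys/devices/system/cpu/cpu0/cache/%s/%s", index, file);
    FILE *fp = fopen(path, "r");
    if (fp && fgets(buf, sizeof(buf), fp)) {
        // sysfs values end with '\n', so print as "  file = value"
        printf("  %s = %s", file, buf);
    }
    if (fp) fclose(fp);
}

int main(void) {
    const char *indexes[] = { "index0", "index1", "index2", "index3" };
    for (int i = 0; i < 4; i++) {
        printf("%s:\n", indexes[i]);
        print_cache_file(indexes[i], "level"); // 1, 2 or 3
        print_cache_file(indexes[i], "type");  // Data / Instruction / Unified
        print_cache_file(indexes[i], "size");  // e.g. 32K, 256K, 3072K
    }
    return 0;
}
```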
In addition, you will also notice that L3 Cache is much larger than L1 Cache and L2 Cache. This is because L1 Cache and L2 Cache are unique to each CPU core, while L3 Cache is shared by multiple CPU cores.
When a program is executed, data in memory is first loaded into the shared L3 Cache, then into each core's private L2 Cache, and finally into the fastest L1 Cache, from which the CPU reads it.
The closer a cache is to the CPU core, the faster it is to access. Roughly:
- L1 Cache: 2~4 clock cycles
- L2 Cache: 10~20 clock cycles
- L3 Cache: 20~60 clock cycles
- Memory: 200~300 clock cycles
Therefore, the CPU reads data from L1 Cache more than 100 times faster than it reads data from memory.
What is the data structure and reading process of CPU Cache?
The data in the CPU Cache is read from the memory in small blocks rather than as individual array elements. Such small blocks of data in the CPU Cache are called Cache Lines .
On Linux, the Cache Line size can be read from /sys/devices/system/cpu/cpu0/cache/index0/coherency_line_size. On my server, the L1 Cache Line size is 64 bytes, which means that L1 Cache loads data 64 bytes at a time.
For example, if there is an int array[100], then when array[0] is loaded, since this element occupies only 4 bytes in memory, far less than 64 bytes, the CPU will sequentially load the elements up to array[15]. In other words, array[0]~array[15] are all cached in the CPU Cache in one go. The next time these array elements are accessed, they are read directly from the CPU Cache instead of from memory, which greatly improves the performance of CPU reads.
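As a small illustration, the sketch below queries the L1 data cache line size and works out how many int elements fit into one line. Note that sysconf(_SC_LEVEL1_DCACHE_LINESIZE) is a glibc extension and may report 0 on some systems, so the code falls back to the common value of 64 bytes:

```c
// Minimal sketch: query the L1 data cache line size and work out how many
// int elements of int array[100] fit into one cache line.
#include <stdio.h>
#include <unistd.h>

int main(void) {
    long line = sysconf(_SC_LEVEL1_DCACHE_LINESIZE); // glibc extension
    if (line <= 0)
        line = 64; // assumed typical cache line size

    long ints_per_line = line / (long)sizeof(int);
    printf("L1 dcache line size: %ld bytes\n", line);
    printf("ints per cache line: %ld\n", ints_per_line); // usually 64 / 4 = 16
    // So loading array[0] also pulls array[1]..array[15] into the cache.
    return 0;
}
```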
In fact, when the CPU reads data, regardless of whether the data is stored in the cache, the CPU first accesses the cache. Only when the data is not found in the cache will it access the memory and read the data in the memory into the cache. The CPU then reads the data from the CPU cache.
This access mechanism is the same as the logic of using "memory as a cache for the hard disk". If there is cached data in the memory, it will be returned directly, otherwise it will have to access the slow hard disk.
How does the CPU know whether the memory data to be accessed is in the cache? If so, how to find the corresponding data in the cache? Let's start with the simplest and most basic Direct Mapped Cache and look at the data structure and access logic of the entire CPU Cache.
Earlier, we mentioned that the CPU reads memory data in small blocks. The size of such a block is determined by coherency_line_size, which is generally 64 bytes. On the memory side, this block of data is called a memory block (Block); to read it, we need the address of the memory block where the data is located.
The strategy adopted by a direct-mapped cache is to always "map" a memory block's address onto the address of one CPU Line (cache block). The mapping is implemented with a modulo operation: the memory block address modulo the number of CPU Lines gives the address of the CPU Line (cache block) that the memory block maps to.
For example, suppose memory is divided into 32 memory blocks and the CPU Cache has 8 CPU Lines. If the CPU wants to access memory block 15 and that block has already been cached, it must be in CPU Line 7, because 15 % 8 = 7.
You must have discovered that if you use the modulo mapping method, multiple memory blocks will correspond to the same CPU Line. For example, in the example above, in addition to memory block 15 being mapped to CPU Line 7, memory blocks 7, 23, and 31 are also mapped to CPU Line 7.
Therefore, in order to distinguish between these memory blocks, each CPU Line also stores a group tag (Tag), which records which memory block the data currently stored in that CPU Line came from.
In addition to the group tag, a CPU Line holds two other pieces of information:
- One is the actual data (Data) loaded from memory.
- The other is the valid bit, which marks whether the data in the CPU Line is valid. If the valid bit is 0, the CPU ignores whatever is in the CPU Line and goes straight to memory to reload the data.
When the CPU reads from the CPU Cache, it does not read the entire data block in the CPU Line; it reads only the specific piece of data it needs, which is called a word. So how is the required word located within the data block of the corresponding CPU Line? The answer is that an offset is needed.
Therefore, a memory access address includes three types of information: group tag, CPU Line index, and offset. The CPU can use this information to find cached data in the CPU Cache. The data structure in the CPU Cache is composed of index + valid bit + group tag + data block .
If the data in memory is already in the CPU Cache, then when the CPU accesses a memory address it goes through these four steps (a simplified sketch in C follows the list):
- Use the index information in the memory address to compute the index into the CPU Cache, that is, to find the address of the corresponding CPU Line;
- Check the valid bit of that CPU Line to confirm that the data in it is valid. If it is invalid, the CPU accesses memory directly and reloads the data; if it is valid, continue;
- Compare the group tag in the memory address with the group tag stored in the CPU Line to confirm that the data in the CPU Line is the memory data we want to access. If not, the CPU accesses memory directly and reloads the data; if so, continue;
- Use the offset information in the memory address to read the required word from the data block of the CPU Line.
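The sketch below is a simplified software simulation of this four-step lookup for a direct-mapped cache, using the earlier example's 8 CPU Lines and 64-byte blocks. Real hardware performs these steps in parallel with dedicated circuitry; the sketch is only meant to make the index / group tag / offset arithmetic concrete:

```c
// Simplified simulation of a direct-mapped cache lookup, following the four
// steps above. 8 cache lines and 64-byte blocks match the example in the text.
#include <stdbool.h>
#include <stdint.h>

#define NUM_LINES  8
#define BLOCK_SIZE 64

struct cache_line {
    bool     valid;              // valid bit
    uint64_t tag;                // group tag: which memory block is cached here
    uint8_t  data[BLOCK_SIZE];   // the cached data block
};

static struct cache_line cache[NUM_LINES];

// Returns true on a cache hit and copies the requested byte into *out.
// On a miss, a real cache would load the 64-byte block from memory,
// set valid and tag, and then retry.
bool cache_read_byte(uint64_t addr, uint8_t *out) {
    uint64_t offset = addr % BLOCK_SIZE;   // offset inside the data block
    uint64_t block  = addr / BLOCK_SIZE;   // memory block number
    uint64_t index  = block % NUM_LINES;   // step 1: CPU Line index (modulo)
    uint64_t tag    = block / NUM_LINES;   // group tag

    struct cache_line *line = &cache[index];
    if (!line->valid)                      // step 2: valid bit check
        return false;
    if (line->tag != tag)                  // step 3: group tag comparison
        return false;
    *out = line->data[offset];             // step 4: pick the word via offset
    return true;
}
```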
At this point, I believe you have a good understanding of direct-mapped caches. In fact, besides direct mapping, there are other strategies for locating data in the CPU Cache by memory address, such as the fully associative cache (Fully Associative Cache) and the set-associative cache (Set Associative Cache). The data structures of these strategies are quite similar, and now that you understand how a direct-mapped cache works, you should be able to pick up the other strategies quickly if you are interested in reading about them.
How to write code that makes the CPU run faster?
We know that the speed at which the CPU accesses memory is more than 100 times slower than the speed at which it accesses the CPU Cache. Therefore, if the data that the CPU wants to operate is in the CPU Cache, this will bring about a significant performance improvement. If the accessed data is in the CPU Cache, it means that the cache hits . The higher the cache hit rate, the better the code performance will be, and the faster the CPU will run.
Therefore, the question "How to write code that makes the CPU run faster?" can be changed to "How to write code with a high CPU cache hit rate?"
As I mentioned earlier, L1 Cache is usually divided into a "data cache" and an "instruction cache" because the CPU handles data and instructions separately. For example, in the operation 1+1=2, the + operation is an instruction and is placed in the "instruction cache", while the operand 1 is data and is placed in the "data cache".
Therefore, we need to look at the cache hit rates of "data cache" and "instruction cache" separately .
How to improve the hit rate of data cache?
Suppose you want to traverse a two-dimensional array, and there are the two forms sketched below. Although both produce the same result, which form do you think is more efficient, and why?
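The original code listing is not reproduced here, so the following is a minimal sketch of the two forms, assuming an N x N int array:

```c
// Minimal sketch of the two traversal forms, assuming an N x N int array.
#define N 1024
static int array[N][N];

// Form 1: iterate in row-major order, accessing array[i][j]
void form1(void) {
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            array[i][j] = 0;
}

// Form 2: iterate column by column, accessing array[j][i]
void form2(void) {
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            array[j][i] = 0;
}
```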
After testing, form 1, array[i][j], runs several times faster than form 2, array[j][i].
The reason for such a big difference is that the memory occupied by the two-dimensional array array is contiguous. For example, if the length N is 2, the elements are laid out in memory in the order array[0][0], array[0][1], array[1][0], array[1][1].
With form 1, array[i][j], the order in which the array elements are accessed is exactly the order in which they are stored in memory. When the CPU accesses array[0][0] and the data is not yet in the cache, it "sequentially" loads the following 3 elements from memory into the CPU Cache as well, so when the CPU goes on to access those 3 array elements it finds them in the CPU Cache. The cache hit rate is therefore very high, and cache hits avoid memory accesses entirely, which greatly improves the performance of the code.
With form 2, array[j][i], the access order becomes array[0][0], array[1][0], array[0][1], array[1][1], which jumps around in memory rather than proceeding sequentially. If N is large, then when array[j][i] is being operated on, array[j+1][i] will not yet have been read into the CPU Cache, so it has to be fetched from memory. Clearly, this discontinuous, jumping way of accessing the data elements cannot take full advantage of the CPU Cache, so the performance of the code is lower.
How many elements does the CPU load from memory into the CPU Cache when it accesses array[0][0]? We touched on this question earlier: it is determined by the CPU Cache Line, which is the amount of data the CPU Cache loads at a time. On Linux its size can be checked via coherency_line_size, and it is usually 64 bytes.
In other words, when the CPU accesses memory data that is not in the CPU Cache, it loads 64 bytes of data into the cache in one go. So when array[0][0] is accessed, since the element is far smaller than 64 bytes, array[0][0]~array[0][15] are read into the CPU Cache in sequence. Sequential access with array[i][j] takes advantage of this property, which is why it is faster than the jumping access of array[j][i].
Therefore, when encountering this kind of array traversal, accessing in order of memory layout will effectively utilize the benefits of CPU Cache, so that the performance of our code will be greatly improved.
How to improve the hit rate of instruction cache?
The way to improve the data cache hit rate is to access data in the order of its memory layout. So how can we improve the instruction cache hit rate?
Let's take an example. Suppose there is a one-dimensional array whose elements are random numbers between 0 and 100. We then perform two operations on this array (a sketch in C follows the list):
- The first operation: loop through the array and set the elements that are less than 50 to 0;
- The second operation: sort the array.
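The original listing is not shown here; the following is a minimal sketch of the example, assuming an array of 10,000 elements and using the standard library's qsort for the sorting step:

```c
// Minimal sketch of the example, assuming an array of 10000 random values.
#include <stdlib.h>

#define LEN 10000
static int data[LEN];

void init(void) {
    for (int i = 0; i < LEN; i++)
        data[i] = rand() % 100;   // random numbers in the range 0..99
}

// Operation 1: loop through the array and zero the elements less than 50.
void zero_small(void) {
    for (int i = 0; i < LEN; i++)
        if (data[i] < 50)
            data[i] = 0;
}

static int cmp_int(const void *a, const void *b) {
    return *(const int *)a - *(const int *)b;
}

// Operation 2: sort the array.
void sort_array(void) {
    qsort(data, LEN, sizeof(int), cmp_int);
}
```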
So the question is, do you think it is faster to traverse first and then sort, or to sort first and then traverse?
Before answering this question, let's first look at the CPU's branch predictor. An if statement means that execution can jump to at least two different sequences of instructions: the instructions inside the if, or those in the else. If the branch predictor can predict whether the if instructions or the else instructions will execute next, those instructions can be placed in the instruction cache "in advance", so the CPU can read them directly from the cache and execute them very quickly.
When the elements in the array are random, branch prediction cannot work effectively. But when the array elements are ordered, the branch predictor dynamically predicts upcoming branches based on the history of previous outcomes, so the prediction hit rate is very high.
Therefore, sorting first and then traversing is faster. After sorting, the numbers go from small to large, so the if < 50 condition is true for a long run of early iterations; the branch predictor therefore keeps the if branch's instruction, array[i] = 0, in the instruction cache, and subsequent iterations only need to read that instruction from the cache to execute it.
If you are certain that the expression inside an if in your code is highly likely to be true, you can use explicit branch prediction hints. In C/C++, for example, the compiler provides the two macros likely and unlikely: if the if condition is very likely to be true, wrap the expression with the likely macro; otherwise, use the unlikely macro.
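likely and unlikely are not part of standard C; they are conventionally defined on top of GCC/Clang's __builtin_expect, which is how the Linux kernel defines them. A minimal sketch:

```c
// likely/unlikely defined on top of GCC/Clang's __builtin_expect,
// following the convention used in the Linux kernel headers.
#define likely(x)   __builtin_expect(!!(x), 1)
#define unlikely(x) __builtin_expect(!!(x), 0)

void zero_small(int *data, int len) {
    for (int i = 0; i < len; i++) {
        // Tell the compiler the condition is expected to be true, so it can
        // lay out the "hot" path (data[i] = 0) on the fall-through branch.
        if (likely(data[i] < 50))
            data[i] = 0;
    }
}
```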
In fact, the CPU's own dynamic branch prediction is already quite accurate, so it is recommended to use these two macros only when you are absolutely sure that the CPU prediction is inaccurate and you know the actual probability.
How to improve the cache hit rate of multi-core CPU?
In a single-core CPU, although only one process can be executed, the operating system allocates a time slice to each process. When the time slice is used up, the next process is scheduled. Therefore, each process occupies the CPU alternately according to the time slice. From a macro perspective, it appears that each process is executing at the same time.
Modern CPUs are all multi-core, and processes may switch back and forth between different CPU cores for execution, which is not good for CPU Cache. Although L3 Cache is shared among multiple cores, L1 and L2 Cache are unique to each core. If a process switches back and forth between different cores, the cache hit rate of each core will be affected . On the contrary, if all processes are executed on the same core, the cache hit rate of L1 and L2 Cache of their data can be effectively improved. A high cache hit rate means that the CPU can reduce the frequency of memory access.
When there are multiple "computationally intensive" threads executing at the same time, in order to prevent the cache hit rate from decreasing due to switching to different cores, we can bind the threads to a certain CPU core , which can significantly improve performance.
On Linux, the sched_setaffinity system call is provided to bind a thread to a specific CPU core.
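Here is a minimal sketch of binding the calling thread to core 0 with sched_setaffinity. Passing 0 as the pid applies the affinity mask to the calling thread, and _GNU_SOURCE is needed for the cpu_set_t macros:

```c
// Minimal sketch: pin the calling thread to CPU core 0 on Linux.
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

int bind_to_core(int core) {
    cpu_set_t set;
    CPU_ZERO(&set);          // start with an empty CPU set
    CPU_SET(core, &set);     // add the target core to the set

    // pid 0 means "the calling thread"
    if (sched_setaffinity(0, sizeof(set), &set) == -1) {
        perror("sched_setaffinity");
        return -1;
    }
    return 0;
}

int main(void) {
    if (bind_to_core(0) == 0)
        printf("bound to core 0\n");
    return 0;
}
```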
Summary
With the development of computer technology, the access speeds of the CPU and of memory have drifted further and further apart; the gap is now in the hundreds of times. To bridge it, a CPU Cache is embedded in the CPU as a caching layer between the CPU and memory. Because the CPU Cache sits very close to the CPU core, it is very fast to access; but because the material it requires (SRAM) is expensive, it is not sized in GB like memory, only in tens of KB to a few MB.
When the CPU accesses data, it first accesses the CPU Cache. If the cache hits, the data is returned directly without having to read the data from the memory every time. Therefore, the higher the cache hit rate, the better the code performance.
However, it should be noted that when the CPU accesses data that is not yet cached in the CPU Cache, it reads the data from memory one block (cache line) at a time rather than a single piece at a time, stores the block in the CPU Cache, and then reads the data from the cache.
There are many strategies for mapping memory addresses to CPU Cache addresses. The simplest one is direct cache mapping, which cleverly splits the memory address into "index + group tag + offset", allowing us to map a large memory address to a very small CPU Cache address.
To write code that makes the CPU run faster, you need to write code with a high cache hit rate. The CPU L1 Cache is divided into data cache and instruction cache, so you need to improve their cache hit rates separately:
- For the data cache, when traversing data we should operate in the order of its memory layout. Because the CPU Cache loads data in batches of one CPU Cache Line, operating sequentially on contiguous memory effectively improves performance.
- For the instruction cache, regular, predictable conditional branch statements let the CPU's branch predictor do its work, further improving execution efficiency.
In addition, for multi-core CPU systems, threads may switch back and forth between different CPU cores, which will affect the cache hit rate of each core. Therefore, if you want to improve the cache hit rate of the process, you can consider binding the thread to a certain CPU core.