What is the future development direction of multi-core CPUs?

Publisher: MysticMoon | Last updated: 2021-01-19 | Source: Zhihu Daily | Author: Bao Yungang (包云岗)

At the end of 2020, I gave a talk at a large company. It had two parts: the first was about computer architecture, especially the evolution of CPU microarchitecture; the second was about processor chip design methodology. I am posting the first part here as an answer to this Zhihu question.

First, let's review three laws in the field of computer architecture: Moore's Law, Makimoto's Law, and Bell's Law. Moore's Law needs little introduction, but I do want to make one point: Moore's Law is not dead, it has merely slowed down.

2. Moore's Law keeps increasing the number of transistors on a chip, but are those transistors actually being fully utilized? The MIT team recently published an article in Science, "There's plenty of room at the Top: What will drive computer performance after Moore's law?", and gave their answer: clearly not!

Consider a small experiment from the MIT team (see the PPT below). Take a matrix multiplication implemented in Python as a performance baseline of 1: rewriting it in C improves performance by about 50x, and fully exploiting architectural features (loop parallelization, memory access optimization, SIMD, and so on) improves it by as much as 63,000x. But programmers who understand the architecture deeply enough to write code at that extreme level of performance are exceedingly rare.
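To make that concrete, here is a minimal C sketch of the first rung of that optimization ladder (my own illustration, not the MIT team's code): a naive triple-loop matrix multiply next to a loop-tiled version whose per-tile working set stays in cache. The tile size T is an assumption to be tuned per machine.

```c
// Naive triple loop: B is walked column-wise, so for large n
// almost every access to B misses in the cache.
void matmul_naive(int n, const double *A, const double *B, double *C) {
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++) {
            double s = 0.0;
            for (int k = 0; k < n; k++)
                s += A[i * n + k] * B[k * n + j];
            C[i * n + j] = s;
        }
}

// Loop-tiled version: each TxT tile of A, B, and C fits in cache,
// so data loaded once is reused many times before eviction.
#define T 64  // tile size -- an assumption, tune per machine
void matmul_tiled(int n, const double *A, const double *B, double *C) {
    for (int i = 0; i < n * n; i++) C[i] = 0.0;
    for (int ii = 0; ii < n; ii += T)
        for (int kk = 0; kk < n; kk += T)
            for (int jj = 0; jj < n; jj += T)
                for (int i = ii; i < ii + T && i < n; i++)
                    for (int k = kk; k < kk + T && k < n; k++) {
                        double aik = A[i * n + k];
                        for (int j = jj; j < jj + T && j < n; j++)
                            C[i * n + j] += aik * B[k * n + j];
                    }
}
```

Tiling alone recovers only part of the gap; the 63,000x figure in the paper additionally requires parallelization and SIMD.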

Is such a large performance gap a good thing or a bad thing? From the perspective of software development, it clearly is not: it means most programmers cannot extract the CPU's full performance, and the transistors go underused. That is not the programmers' fault; the CPU microarchitecture has become so complex that it is genuinely hard for software to exploit the hardware fully.

How can this be solved? Domain-Specific Architecture (DSA) is one effective approach. A DSA customizes the microarchitecture for applications in a specific domain, improving performance per watt by orders of magnitude. In effect, it bakes the knowledge of top programmers directly into the hardware.

3. The second law is Makimoto's Law (also known as "Makimoto's Wave"). In 1987, Tsugio Makimoto, then chief engineer of Hitachi, observed that semiconductor products always alternate between "standardization" and "customization", swinging roughly once every ten years. Behind Makimoto's Law lies the trade-off among performance, power consumption, and development efficiency.

For processors, this is the balance between domain-specific and general-purpose architectures. Recently the pendulum has swung toward performance and power efficiency, so domain-specific architectures have begun to attract more attention.

4. The third law is Bell's Law, an observation made by Gordon Bell in 1972; the details are described in the PPT below. It is worth mentioning that the highest award for supercomputing applications, the Gordon Bell Prize, is named after him.

5. Bell's Law points to a new trend: the arrival of the AIoT era. It will be an era in which demand for processors explodes again, but also an era of fragmented demand: different fields and industries will want different chips, integrating different sensors, different accelerators, and so on. How to cope with this fragmentation will be another challenge.

6. All three laws push computer architecture in the same direction: DSA. Implementing DSA involves two aspects:

To pursue performance and energy efficiency, there are three main design principles (see the PPT below);

To cope with fragmented demand, a new agile processor design methodology is needed. (This answer does not cover agile design methods.)

7. Before discussing specific techniques, let's take a broad look at how CPU performance has improved over the past few decades. The PPT below lists the architectural evolution of Intel processors from 1995 to 2015: a process of continuous iterative refinement that accumulated hundreds of architectural optimization techniques.

These techniques are heavily coupled with one another, which creates great design complexity. For example, when large-page support was introduced on Sandy Bridge in 2011, implementing it required modifying a whole series of CPU modules and features, including superscalar execution, out-of-order execution, large memory support, SSE instructions, multi-core, hardware virtualization, and uop fusion, as well as software layers such as the operating system, compiler, and libraries. (I often see people claim that chip design is easy; perhaps they have never been exposed to CPU design and do not appreciate its complexity.)
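As a small illustration of the operating-system half of that work, here is a hedged Linux sketch (assuming a kernel built with transparent huge page support): a program asks the kernel to back a large anonymous buffer with huge pages, the OS-visible counterpart of the hardware's larger TLB entries.

```c
#include <stddef.h>
#include <stdio.h>
#include <sys/mman.h>

int main(void) {
    size_t len = 64UL << 20;  // 64 MiB working set
    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED) { perror("mmap"); return 1; }
    // Advisory request for transparent huge pages: fewer, larger
    // TLB entries then cover the same buffer. The kernel may decline.
    if (madvise(p, len, MADV_HUGEPAGE) != 0)
        perror("madvise");
    /* ... use the buffer ... */
    munmap(p, len);
    return 0;
}
```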

8. A processor maintains very complex internal state, and that state is driven by the running program; in other words, processor state depends on program behavior (see the PPT below). The optimization idea at the CPU architecture level is to discover common patterns in program behavior and accelerate them.

Discovering those common patterns is the key to processor optimization. It requires a good understanding of program behavior, operating systems, programming and compilation, and architecture, which is also a basic requirement for a Ph.D. in computer architecture. This is why computer architecture at many universities abroad belongs to the Computer Science department.

A side note: I recently saw that integrated circuits have been established as a first-level discipline in China, which is good news. But to cultivate CPU design talent, curriculum design must not neglect traditional computer science courses such as operating systems, programming, and compilers.

9. Here are two examples of discovering hot applications and hot code and optimizing them at the architecture level. In one, many operations in the protocols at layer 5 and above of the TCP/IP stack (L5Ps), such as encryption and decryption, turned out to be common across many domains, so an L5P accelerator was implemented directly on the network card, greatly speeding up packet processing. In the other, the pandemic pushed a large share of cloud data center compute into video transcoding, so a dedicated hardware accelerator was designed for transcoding, greatly improving data center efficiency.

10. Discovering and identifying such hot applications and hot code is not easy; it requires very powerful infrastructure and profiling tools. Google, for example, runs the GWP (Google-Wide Profiling) tool in its data centers, monitoring and sampling applications across the fleet at very low overhead to find which hot programs and code consume the compute, and which parts of the current CPUs are bottlenecks. GWP shows, for instance, that 5% of the compute in Google's data centers goes to compression.

Thanks to this basic tooling, Google noticed early on that AI workloads were taking a growing share of its data centers, and so began designing TPUs specifically to accelerate them.
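GWP itself is proprietary, but its core mechanism, low-overhead statistical sampling, can be sketched in a few lines of POSIX C (a toy illustration, not Google's tool): a CPU-time timer fires at roughly 100 Hz and the handler counts ticks, where a real profiler would record the program counter and call stack instead.

```c
#include <signal.h>
#include <stdio.h>
#include <string.h>
#include <sys/time.h>

static volatile sig_atomic_t samples;  // ticks observed so far

static void on_prof(int sig) { (void)sig; samples++; }

int main(void) {
    struct sigaction sa;
    memset(&sa, 0, sizeof sa);
    sa.sa_handler = on_prof;
    sigaction(SIGPROF, &sa, NULL);

    // Fire SIGPROF every 10 ms of consumed CPU time (~100 Hz).
    struct itimerval it = { {0, 10000}, {0, 10000} };
    setitimer(ITIMER_PROF, &it, NULL);

    volatile double x = 0.0;           // some work to profile
    for (long i = 0; i < 200000000L; i++)
        x += i * 0.5;

    printf("CPU-time samples: %ld\n", (long)samples);
    return 0;
}
```

Because the cost is one interrupt per sampling period, this style of profiling can run continuously across a whole fleet, which is what makes fleet-wide hotspot data like the 5%-on-compression figure obtainable.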

11. The following introduces common architecture-level optimization ideas from three angles: reducing data movement, reducing data precision, and increasing parallelism.

First, reducing data movement. The first entry point is the instruction set: an instruction set is a way of expressing program semantics. The same algorithm can be expressed with instruction sets of different granularity, with very different execution efficiency. In general, the coarser the granularity, the weaker the expressiveness but the higher the execution efficiency.
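To illustrate granularity, here is a hedged C sketch (assuming an x86-64 CPU with AVX2 and FMA; compile with -mavx2 -mfma): the same dot product written with fine-grained scalar instructions and with coarser-grained SIMD instructions, where one _mm256_fmadd_ps performs eight multiply-adds.

```c
#include <immintrin.h>
#include <stddef.h>

// Fine-grained: one multiply and one add per element.
float dot_scalar(const float *a, const float *b, size_t n) {
    float s = 0.0f;
    for (size_t i = 0; i < n; i++)
        s += a[i] * b[i];
    return s;
}

// Coarser-grained: a single fused multiply-add instruction covers
// eight elements, so far fewer instructions pass through the
// fetch/decode front end for the same work.
float dot_avx2(const float *a, const float *b, size_t n) {
    __m256 acc = _mm256_setzero_ps();
    size_t i = 0;
    for (; i + 8 <= n; i += 8)
        acc = _mm256_fmadd_ps(_mm256_loadu_ps(a + i),
                              _mm256_loadu_ps(b + i), acc);
    float buf[8];
    _mm256_storeu_ps(buf, acc);
    float s = buf[0] + buf[1] + buf[2] + buf[3]
            + buf[4] + buf[5] + buf[6] + buf[7];
    for (; i < n; i++)   // scalar tail when n is not a multiple of 8
        s += a[i] * b[i];
    return s;
}
```

A domain-specific "magic instruction" takes the same idea further: an entire domain-level operation becomes one instruction backed by a dedicated circuit.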

12. To cover as many applications as possible, a general-purpose instruction set often has to support thousands of instructions, which makes the pipeline front end (instruction fetch, decode, branch prediction, and so on) very complicated and hurts both performance and power.

13. Designing a dedicated instruction set for a specific domain can greatly reduce the number of instructions, coarsen the operation granularity, and fold in memory access optimizations, achieving an order-of-magnitude improvement in performance per watt. The data in the PPT below comes from a Stanford University study: after introducing the "magic instruction", performance per watt improves by several dozen times. The magic instruction is in fact a very specialized expression together with its corresponding circuit implementation (see the lower right corner of the PPT).

14. The second common way to reduce data movement is to make full use of the cache. The memory subsystem is actually one of the most important parts of a processor and involves many techniques (see the PPT below). Many people focus on how wide and deep a processor's pipeline is, but most of the time it is memory access that has the greatest impact on processor performance.

There is a whole family of memory access optimizations, including replacement policies, prefetching, and more; these remain a focus of architecture research today, so I won't go into detail here. A small illustration of why memory access dominates follows below.
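The two functions below do identical arithmetic over the same array, but the row-major traversal walks memory sequentially while the column-major one strides across cache lines. A hedged sketch of my own; the actual slowdown depends on the cache hierarchy.

```c
#include <stddef.h>

#define N 2048
static float a[N][N];

// Row-major traversal: consecutive iterations touch adjacent
// addresses, so most accesses hit an already-loaded cache line.
float sum_rows(void) {
    float s = 0.0f;
    for (size_t i = 0; i < N; i++)
        for (size_t j = 0; j < N; j++)
            s += a[i][j];
    return s;
}

// Column-major traversal: each step jumps N*sizeof(float) bytes,
// touching a new cache line almost every access. Same arithmetic,
// typically several times slower.
float sum_cols(void) {
    float s = 0.0f;
    for (size_t j = 0; j < N; j++)
        for (size_t i = 0; i < N; i++)
            s += a[i][j];
    return s;
}
```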

15. Rather than elaborating further on memory access optimization techniques, I will just introduce the recently popular topic of memory compression.

IBM added a memory compression accelerator module to its latest z15 processor; it is 388 times more efficient than software compression, an outstanding result.

16. NVIDIA is likewise studying how memory compression can increase the effective capacity of on-chip storage in GPUs and thereby improve application performance.
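For context, this is roughly the software path such accelerators take over: a single zlib compress() call over an in-memory buffer (a hedged sketch; link with -lz). With an on-chip compression unit, the CPU no longer burns cycles in code like this.

```c
#include <stdio.h>
#include <string.h>
#include <zlib.h>

int main(void) {
    unsigned char src[4096];
    unsigned char dst[8192];        // comfortably above compressBound(4096)
    memset(src, 'A', sizeof src);   // highly compressible input
    uLongf dst_len = sizeof dst;

    // One-shot DEFLATE-family compression: the CPU work that a
    // hardware compression accelerator offloads.
    if (compress(dst, &dst_len, src, sizeof src) != Z_OK)
        return 1;
    printf("compressed %zu bytes down to %lu bytes\n",
           sizeof src, (unsigned long)dst_len);
    return 0;
}
```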

17. Intel has invested heavily in memory access optimization, as a comparison of two of its CPUs shows. The Core 2 Duo T9600 and the Pentium G850 are built on different processes but run at similar frequencies, 2.8GHz and 2.9GHz respectively, yet their performance differs by 77%: the G850 scores 31.7 on SPEC CPU while the T9600 scores only 17.9.

If the frequencies are similar, why is the performance so different? Counterintuitively, the G850 actually has less cache than the T9600: 256KB L2 + 3MB L3 versus the T9600's 6MB L2.

Compare them more carefully and the biggest difference emerges: the memory controller paired with the G850 introduced FMA (Fast Memory Access, not the fused multiply-add of the same acronym) optimization technology, which greatly improves memory access performance.

18. The second class of architecture-level optimization reduces data precision. This has been a hot topic in recent years, especially in deep learning, where many studies have found that 64-bit floating point is unnecessary: 16-bit or even 8-bit fixed-point arithmetic suffices, with little or no accuracy loss and a severalfold performance gain.

Many AI processors optimize along these lines, including the CPU of Japan's recently crowned world's-fastest supercomputer "Fugaku", which supports multiple computing precisions. Its low-precision AI performance reaches 1.4 EOPS, 3.4 times its 64-bit floating-point performance (416 PFLOPS).
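As one concrete instance of precision reduction, here is a hedged C sketch of symmetric linear quantization, a common scheme behind 8-bit inference (my own illustration, not Fugaku's mechanism): 32-bit floats are mapped to 8-bit integers through a single scale factor.

```c
#include <math.h>
#include <stdint.h>

// Symmetric linear quantization: map floats in [-max_abs, +max_abs]
// onto int8 values in [-127, 127] using one scale factor.
void quantize_int8(const float *w, int8_t *q, int n, float *scale) {
    float max_abs = 0.0f;
    for (int i = 0; i < n; i++) {
        float v = fabsf(w[i]);
        if (v > max_abs) max_abs = v;
    }
    *scale = (max_abs > 0.0f) ? max_abs / 127.0f : 1.0f;
    for (int i = 0; i < n; i++)
        q[i] = (int8_t)lrintf(w[i] / *scale);
}

// Recover an approximation of the original value.
static inline float dequantize_int8(int8_t q, float scale) {
    return (float)q * scale;
}
```

An int8 multiply-accumulate moves a quarter of the bytes of its fp32 counterpart and needs a much smaller datapath, which is where the severalfold speedups come from.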

19. One drawback of the IEEE 754 floating-point format is that converting between different precisions is not easy. In recent years, academia has proposed a new floating-point format, posit, which makes different precisions easier to achieve; some scholars have even called for posit to replace IEEE 754 ("Posit: A Potential Replacement for IEEE 754").

The RISC-V community has been following posit closely, and some teams have implemented posit-based floating-point units (FPUs), though some controversy remains (there was a lively debate between David Patterson and posit inventor John L. Gustafson, which I will cover on another occasion).

20. The third architecture-level optimization idea is parallelism. The "multi-core" in this question's title is one specific technique within it. Besides multi-core there are other levels of parallelism, such as instruction-level parallelism, thread-level parallelism, and request-level parallelism; and alongside instruction-level parallelism (ILP) there is also memory-level parallelism (MLP). In short, increasing parallelism is a very effective optimization approach.
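Closing with the most familiar of these levels, here is a minimal sketch of thread-level parallelism using OpenMP (assuming a compiler that accepts -fopenmp; without the flag it compiles as serial code): the reduction clause splits the loop across cores and merges the per-thread partial sums.

```c
#include <stdio.h>
#include <stdlib.h>

// Thread-level parallelism: OpenMP divides the iteration space
// among cores; reduction(+:sum) gives each thread a private
// partial sum and combines them when the loop ends.
double parallel_sum(const double *x, long n) {
    double sum = 0.0;
    #pragma omp parallel for reduction(+:sum)
    for (long i = 0; i < n; i++)
        sum += x[i];
    return sum;
}

int main(void) {
    long n = 1L << 24;
    double *x = malloc(n * sizeof *x);
    if (!x) return 1;
    for (long i = 0; i < n; i++) x[i] = 1.0;
    printf("sum = %.0f\n", parallel_sum(x, n));  // expect 16777216
    free(x);
    return 0;
}
```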

The above is a broad overview of computer architecture, and of CPU optimization ideas in particular, for your reference. In summary, there are two conclusions:

Domain-specific architecture (DSA) is the future direction of architecture development;

There are three optimization routes at the architecture level: reducing data movement, reducing data precision, and increasing parallelism.
