Nine elements of multi-core processor design-EEWORLD

Like simultaneous multithreading (SMT), the chip multiprocessor (CMP) aims to exploit coarse-grained parallelism in computation. CMP can be seen as a product of advances in large-scale integrated circuit technology: once chip capacity is large enough, the SMP (symmetric multiprocessor) or DSM (distributed shared memory) nodes of a large-scale parallel machine can be integrated onto a single chip, with each processor executing a different thread or process in parallel. In a single-chip multiprocessor based on the SMP structure, the processors communicate through an off-chip cache or off-chip shared memory. In one based on the DSM structure, the processors communicate through an on-chip high-speed crossbar switch network that connects the distributed memories.

Because SMP and DSM are already mature technologies, designing the CMP structure is relatively easy, although it places higher demands on back-end design and the chip manufacturing process. For this reason, CMP became the first of the "future" high-performance processor structures to be used in commercial CPUs.

Multi-core designs can exploit the benefits of higher integration and greatly increase chip performance, but they also bring problems that used to belong to the system level into the processor itself.

1 Core structure research: homogeneous or heterogeneous

CMP structures fall into two categories: homogeneous and heterogeneous. In a homogeneous design all on-chip cores have the same structure; in a heterogeneous design the cores differ. Studying which core structure suits which applications is therefore crucial to the performance of future microprocessors. The structure of the core itself determines the area, power consumption and performance of the whole chip, and how well a design inherits and builds on the achievements of traditional processors directly affects the performance and development cycle of the multi-core chip. In addition, according to Amdahl's law, the speedup of a program is limited by its serial part, so in theory a heterogeneous structure, which can run the serial part on a more powerful core, appears to offer better performance.
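The Amdahl's law argument can be made concrete with a small calculation. The sketch below is only illustrative: the function name, the 90% parallel fraction, the 16 cores and the assumption that a heterogeneous chip runs the serial part on a core twice as fast are all hypothetical choices, and the model ignores the extra area and power such a core would cost.

```python
def amdahl_speedup(parallel_fraction, cores, serial_core_speed=1.0):
    """Speedup over a single baseline core under Amdahl's law.

    parallel_fraction: share of the run time that can be parallelized (0..1).
    cores: number of baseline cores that execute the parallel part.
    serial_core_speed: assumed speed of the core running the serial part,
        relative to a baseline core (1.0 models a homogeneous chip).
    """
    serial_time = (1.0 - parallel_fraction) / serial_core_speed
    parallel_time = parallel_fraction / cores
    return 1.0 / (serial_time + parallel_time)


# Hypothetical workload: 90% parallel code on a 16-core chip.
homogeneous = amdahl_speedup(0.9, 16)
heterogeneous = amdahl_speedup(0.9, 16, serial_core_speed=2.0)
print(f"homogeneous:   {homogeneous:.2f}x")
print(f"heterogeneous: {heterogeneous:.2f}x")
```

Under these assumed numbers the serial 10% dominates the run time, which is why speeding up only that part raises the overall speedup noticeably.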

The instruction set used by the cores also matters for system implementation. Whether the cores share one instruction set or use different ones, and whether every core can run the operating system, are further research questions.

2 Program Execution Model

The first issue in multi-core processor design is to choose a program execution model. The applicability of the program execution model determines whether the multi-core processor can provide the highest performance at the lowest cost. The program execution model is the interface between compiler designers and system implementers. Compiler designers decide how to convert a high-level language program into a target machine language program according to a program execution model; system implementers decide how to effectively implement the program execution model on a specific target machine. When the target machine is a multi-core architecture, the questions that arise are: How does the multi-core architecture support important program execution models? Are there other program execution models that are more suitable for multi-core architectures? To what extent can these program execution models meet the needs of applications and be accepted by users?

3 Cache Design: Multi-level Cache Design and Consistency Issues

The speed gap between the processor and main memory is a prominent problem for CMP, so a multi-level cache hierarchy is needed to narrow it. Existing CMPs share either the level-1 (L1) cache, the level-2 (L2) cache, or main memory. The most common choice is the shared-L2 structure: each processor core has a private L1 cache, and all cores share one L2 cache.
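One common way to reason about how a private L1 plus a shared L2 narrows the processor-memory gap is the average memory access time (AMAT). The sketch below uses the textbook two-level AMAT formula; the latencies and miss rates are hypothetical and are not taken from any particular CMP.

```python
def amat_two_level(l1_hit, l1_miss_rate, l2_hit, l2_miss_rate, memory_latency):
    """Average memory access time in cycles for an L1 + L2 hierarchy.

    l2_miss_rate is the local miss rate: misses in L2 divided by
    accesses that reach L2.
    """
    l2_penalty = l2_hit + l2_miss_rate * memory_latency
    return l1_hit + l1_miss_rate * l2_penalty


def amat_one_level(l1_hit, l1_miss_rate, memory_latency):
    """Average memory access time in cycles with only an L1 cache."""
    return l1_hit + l1_miss_rate * memory_latency


# Assumed values: 1-cycle L1, 5% L1 misses, 10-cycle shared L2,
# 20% local L2 misses, 200-cycle main memory.
with_l2 = amat_two_level(1, 0.05, 10, 0.20, 200)
without_l2 = amat_one_level(1, 0.05, 200)
print(f"AMAT with shared L2:    {with_l2:.1f} cycles")
print(f"AMAT without shared L2: {without_l2:.1f} cycles")
```

The formula says nothing about contention between cores for the shared L2, which is one reason the shared-versus-private question needs simulation rather than arithmetic alone.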

Cache architecture also bears directly on overall system performance. In a CMP, whether shared or private caches are better, whether to build a multi-level cache on chip, and how many levels to build all strongly affect the size, power consumption, layout, performance and operating efficiency of the whole chip, so these questions need careful study.

Multi-level caches also raise coherence and consistency issues. The consistency model and coherence mechanism adopted have a significant impact on overall CMP performance. The consistency models widely used in traditional multiprocessor architectures include sequential consistency, weak consistency and release consistency. The main cache coherence mechanisms are bus snooping protocols and directory-based protocols. Most current CMP systems use a bus-based snooping protocol.
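To illustrate what a bus snooping protocol does, the following toy model tracks a single cache line across several private caches using the three-state MSI scheme (Modified, Shared, Invalid). MSI is chosen here only because it is the simplest snooping protocol; the article does not say which protocol real CMPs use, and the class and method names are hypothetical.

```python
from enum import Enum


class State(Enum):
    MODIFIED = "M"
    SHARED = "S"
    INVALID = "I"


class SnoopingBus:
    """Toy MSI model: one cache line, one private cache per core."""

    def __init__(self, num_cores):
        self.states = [State.INVALID] * num_cores
        self.values = [None] * num_cores  # copy held in each private cache
        self.memory = 0                   # copy held in main memory

    def _write_back_owner(self):
        # A cache holding the line MODIFIED snoops the bus request,
        # flushes its data to memory and downgrades to SHARED.
        for core, state in enumerate(self.states):
            if state is State.MODIFIED:
                self.memory = self.values[core]
                self.states[core] = State.SHARED

    def read(self, core):
        if self.states[core] is State.INVALID:
            # Read miss: the request is broadcast on the bus.
            self._write_back_owner()
            self.values[core] = self.memory
            self.states[core] = State.SHARED
        return self.values[core]

    def write(self, core, value):
        if self.states[core] is not State.MODIFIED:
            # Request ownership: every other copy is invalidated.
            self._write_back_owner()
            for other in range(len(self.states)):
                if other != core:
                    self.states[other] = State.INVALID
                    self.values[other] = None
        self.values[core] = value
        self.states[core] = State.MODIFIED


bus = SnoopingBus(num_cores=2)
bus.write(0, 42)          # core 0 now owns the line
seen_by_1 = bus.read(1)   # core 0 writes back, both cores share the line
bus.write(1, 7)           # core 0's copy is invalidated
seen_by_0 = bus.read(0)   # core 0 misses and fetches the new value
print(seen_by_1, seen_by_0, [s.value for s in bus.states])
```

Every miss and every ownership request in this model is a broadcast, which hints at why snooping suits a small number of cores on a shared bus while directory protocols are preferred as the core count grows.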

4 Inter-core Communication Technology

Programs running on the cores of a CMP sometimes need to share and synchronize data, so the hardware must support inter-core communication. An efficient communication mechanism is essential to CMP performance. There are currently two mainstream on-chip communication mechanisms: a bus-based shared-cache structure and an on-chip interconnect structure.

In the bus-based shared-cache structure, the cores share an L2 or L3 cache that holds frequently used data, and they communicate over the bus that connects them. Its advantages are a simple structure and high communication speed; its disadvantage is that a bus scales poorly as the number of cores grows.

In the on-chip interconnect structure, each CPU core has its own processing unit and cache, and the cores are connected through a crossbar switch or an on-chip network and communicate by passing messages. Its advantages are good scalability and guaranteed data bandwidth; its disadvantages are a complex hardware structure and the extensive software changes it requires.

The competition between the two may well end in cooperation rather than replacement, for example by using an on-chip network globally and buses locally to balance performance against complexity.

5 Bus Design

In traditional microprocessors, cache misses and memory accesses reduce the CPU's execution efficiency, and the efficiency of the bus interface unit (BIU) determines how large that penalty is. When several CPU cores request memory at the same time, or when misses occur simultaneously in the private caches of several cores, overall CMP performance depends on how efficiently the BIU arbitrates among these requests and converts them into external memory accesses. It is therefore important to find an efficient multi-port BIU structure that converts the single-word accesses of multiple cores into more efficient burst accesses, and to determine both the burst length (the number of words per burst) that gives the best overall efficiency and an efficient arbitration mechanism for multi-port BIU access.
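The two BIU tasks described above, merging single-word accesses into bursts and arbitrating among ports, can be sketched in a few lines. This is a simplified software model, not a description of any real BIU: the function names, the 4-word burst length, the 4-byte word size and the round-robin policy are all assumptions chosen for illustration.

```python
def coalesce_into_bursts(word_addresses, burst_words=4, word_bytes=4):
    """Return the aligned burst base addresses that cover the given
    single-word byte addresses, in ascending order."""
    burst_bytes = burst_words * word_bytes
    bases = {addr - addr % burst_bytes for addr in word_addresses}
    return sorted(bases)


def round_robin_grant(requests, last_granted):
    """Pick the next requesting port after last_granted, wrapping around.

    requests: one boolean per port, True if that port has a pending access.
    Returns the granted port index, or None if nobody is requesting.
    """
    n = len(requests)
    for offset in range(1, n + 1):
        port = (last_granted + offset) % n
        if requests[port]:
            return port
    return None


# Six single-word misses from several cores collapse into two bursts.
pending = [0x100, 0x104, 0x108, 0x10C, 0x200, 0x204]
bursts = coalesce_into_bursts(pending)
print([hex(base) for base in bursts])

# Ports 0, 2 and 3 request; port 0 was served last, so port 2 goes next.
granted = round_robin_grant([True, False, True, True], last_granted=0)
print(granted)
```

A longer burst amortizes more bus overhead per word but fetches more data that may never be used, which is the trade-off behind the search for an optimal burst length.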

6 Operating system design: task scheduling, interrupt handling, synchronization and mutual exclusion

For multi-core CPUs, optimizing the operating system's task scheduling algorithm is the key to efficiency. Common task scheduling algorithms are global queue scheduling and local queue scheduling. In global queue scheduling, the operating system maintains a single global queue of waiting tasks; when a CPU core becomes idle, the operating system selects a ready task from this queue and starts executing it on that core. The advantage of this method is high CPU core utilization. In local queue scheduling, the operating system maintains a separate task queue for each CPU core; when a core becomes idle, it selects a suitable task from its own queue. The advantage of this method is that tasks rarely have to migrate between cores, which improves the hit rate of each core's local cache. At present, most multi-core operating systems use a task scheduling algorithm based on a global queue.
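The trade-off between the two scheduling schemes can be seen in a small simulation. The sketch below is a deliberately simplified model under stated assumptions: tasks are hypothetical (name, duration) pairs, they never block, and cache effects are not modeled, so it shows only the utilization side of the comparison.

```python
import heapq
from collections import deque


def run_global_queue(tasks, num_cores):
    """Whichever core becomes idle first takes the next task from one
    shared queue. Returns (makespan, {task_name: core})."""
    queue = deque(tasks)
    idle_at = [(0, core) for core in range(num_cores)]  # (idle time, core)
    heapq.heapify(idle_at)
    placement, makespan = {}, 0
    while queue:
        name, duration = queue.popleft()
        now, core = heapq.heappop(idle_at)
        placement[name] = core
        finish = now + duration
        makespan = max(makespan, finish)
        heapq.heappush(idle_at, (finish, core))
    return makespan, placement


def run_local_queues(local_queues):
    """Each core runs only the tasks in its own queue, in order.
    Returns (makespan, {task_name: core})."""
    placement, makespan = {}, 0
    for core, tasks in enumerate(local_queues):
        finish = 0
        for name, duration in tasks:
            placement[name] = core
            finish += duration
        makespan = max(makespan, finish)
    return makespan, placement


tasks = [("A", 4), ("B", 1), ("C", 1), ("D", 2)]
global_span, global_place = run_global_queue(tasks, num_cores=2)
local_span, local_place = run_local_queues([tasks[:2], tasks[2:]])
print("global queue:", global_span, global_place)
print("local queues:", local_span, local_place)
```

In this example the global queue finishes sooner because the idle core keeps pulling work, while the local queues leave one core idle; the benefit of local queues, better cache locality from keeping each task on its home core, is exactly what this model leaves out.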

Interrupt handling on a multi-core chip differs considerably from that on a single core. The cores need to communicate with each other through interrupts, so the local interrupt controller of each core and the global interrupt controller that arbitrates the distribution of interrupts among the cores must also be integrated on the chip.

In addition, a multi-core CPU is a multi-tasking system. Because different tasks compete for shared resources, the system must provide synchronization and mutual exclusion mechanisms. The traditional single-core solutions (for example, disabling interrupts) no longer suffice on a multi-core chip; mutual exclusion has to rely on atomic "read-modify-write" operations or other synchronization mechanisms provided by the hardware.
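A spinlock built on test-and-set is the classic example of mutual exclusion based on an atomic read-modify-write operation. Python does not expose hardware atomics, so in the sketch below the `AtomicFlag` class is a hypothetical stand-in: its internal `threading.Lock` only models the atomicity that a real test-and-set instruction would provide in hardware. The thread and iteration counts are arbitrary.

```python
import threading
import time


class AtomicFlag:
    """Stand-in for a hardware test-and-set flag (assumption: the
    internal lock models the atomicity the hardware would guarantee)."""

    def __init__(self):
        self._value = False
        self._guard = threading.Lock()

    def test_and_set(self):
        # Atomically read the old value and set the flag to True.
        with self._guard:
            old = self._value
            self._value = True
            return old

    def clear(self):
        with self._guard:
            self._value = False


class SpinLock:
    def __init__(self):
        self._flag = AtomicFlag()

    def acquire(self):
        # Spin until test_and_set reports that the flag was previously clear.
        while self._flag.test_and_set():
            time.sleep(0)  # yield so the current holder can make progress

    def release(self):
        self._flag.clear()


def run_workers(num_threads, increments):
    """Increment a shared counter from several threads under the spinlock."""
    lock = SpinLock()
    counter = [0]

    def worker():
        for _ in range(increments):
            lock.acquire()
            counter[0] += 1  # critical section
            lock.release()

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return counter[0]


total = run_workers(num_threads=4, increments=500)
print(total)
```

Only the thread whose `test_and_set` call observes the flag as clear enters the critical section, which is the property that disabling interrupts on one core cannot give when other cores keep running.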

7 Low power design

The rapid development of semiconductor technology keeps raising the integration level of microprocessors. At the same time, processor surface temperature keeps climbing, rising exponentially, and processor power density can double every three years. Low-power and thermally optimized design has therefore become a core issue in microprocessor research, and the multi-core structure of CMP makes power consumption a crucial research topic.

Low-power design is a multi-level problem that must be studied at the operating system level, the algorithm level, the architecture level, the circuit level and so on. Methods at different levels achieve different results: the higher the level of abstraction, the greater the reduction in power consumption and temperature.
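As one architecture-level illustration, the standard first-order approximation for CMOS dynamic power, P = a * C * V^2 * f, shows why several slower cores can be more power-efficient than one fast core. The formula is a general textbook approximation rather than something stated in this article, and the capacitance, voltage and frequency values below are hypothetical; the sketch also ignores static (leakage) power and assumes the workload parallelizes perfectly.

```python
def dynamic_power(capacitance, voltage, frequency, activity=1.0):
    """First-order CMOS dynamic power: activity * C * V^2 * f (watts)."""
    return activity * capacitance * voltage ** 2 * frequency


# One core at full frequency and voltage (assumed values).
single = dynamic_power(capacitance=1e-9, voltage=1.0, frequency=2e9)

# Two cores at half the frequency; the lower frequency is assumed to
# allow a reduced supply voltage of 0.8 V for the same total throughput.
dual = 2 * dynamic_power(capacitance=1e-9, voltage=0.8, frequency=1e9)

ratio = dual / single
print(f"single core: {single:.2f} W")
print(f"dual core:   {dual:.2f} W")
print(f"ratio:       {ratio:.2f}")
```

Because voltage enters the formula squared, the voltage reduction enabled by the lower clock outweighs the cost of doubling the number of cores in this idealized case.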

8 Memory Wall

To keep the cores fully busy, the chip must at minimum be supplied with memory bandwidth that matches its performance. Larger on-chip caches solve part of the problem, but as performance rises further, other means are needed to increase the bandwidth of the memory interface, such as raising the bandwidth per pin with DDR, DDR2, QDR, XDR and similar technologies. The system must likewise have memory that can deliver this bandwidth. As a result, the demands on chip packaging keep growing. Although the number of package pins increases by 20% each year, this still cannot fully solve the problem, and it also raises cost. How to provide a high-bandwidth, low-latency memory interface is therefore an important problem that must be solved.
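A back-of-the-envelope calculation shows how quickly the bandwidth demand of several cores approaches what the pins can supply. The sketch below is a rough estimate under stated assumptions: the core count, clock, instructions per cycle, memory references per instruction, miss rate and line size are hypothetical, and the pin-side figure corresponds to a 64-bit interface at 800 Mbit/s per pin.

```python
def required_bandwidth_gbs(cores, clock_ghz, ipc, mem_refs_per_instr,
                           miss_rate, line_bytes):
    """Off-chip traffic in GB/s generated by last-level cache misses.

    Every miss is assumed to transfer one full cache line.
    """
    misses_per_ns = cores * clock_ghz * ipc * mem_refs_per_instr * miss_rate
    return misses_per_ns * line_bytes


def pin_bandwidth_gbs(data_pins, gbit_per_pin):
    """Peak interface bandwidth in GB/s for the given data pins."""
    return data_pins * gbit_per_pin / 8


demand = required_bandwidth_gbs(cores=8, clock_ghz=2.0, ipc=1.0,
                                mem_refs_per_instr=0.3, miss_rate=0.02,
                                line_bytes=64)
supply = pin_bandwidth_gbs(data_pins=64, gbit_per_pin=0.8)
print(f"demand: {demand:.2f} GB/s, supply: {supply:.2f} GB/s")
```

With these assumed numbers the demand is already close to the peak supply, and doubling the core count would exceed it unless the miss rate falls or the pin bandwidth rises.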

9 Reliability and security design

As technology advances, processors have penetrated every aspect of modern society, yet serious hidden dangers remain. On the one hand, the reliability of the processor itself is declining: with ever finer process geometries, high-speed clocks and low supply voltages, design safety margins are increasingly hard to guarantee, and failure rates are gradually rising. On the other hand, malicious attacks by third parties are growing in number and sophistication and have become a widespread social problem. Improving reliability and security now attracts much attention in computer architecture research.

In the future, structures such as CMP, in which multiple processes execute simultaneously within one processor chip, will become mainstream. Together with growing hardware complexity and the design errors that come with it, this means the inside of the processor chip may itself be unsafe. Reliability and security design therefore still has a long way to go.
