Like SMT (simultaneous multithreading), CMP (chip multiprocessing) aims to exploit coarse-grained parallelism. CMP can be seen as a product of advances in large-scale integrated circuit technology: once chip capacity is large enough, the SMP (symmetric multiprocessor) or DSM (distributed shared memory) nodes of a large-scale parallel machine can be integrated onto a single chip, with each processor executing a different thread or process in parallel. In a single-chip multiprocessor based on the SMP structure, the processors communicate through an off-chip cache or off-chip shared memory. In a single-chip multiprocessor based on the DSM structure, the processors communicate through an on-chip high-speed crossbar switch network that connects the distributed memories.
Because SMP and DSM are already mature technologies, the architectural design of a CMP is relatively easy, although it places higher demands on back-end design and on the chip manufacturing process. For this reason, CMP became the first "future" high-performance processor structure to be used in commercial CPUs.
Multi-core designs can exploit the benefits of higher integration and raise chip performance several times over, but they also bring some problems that used to belong to the system level into the processor itself.
1 Core structure: homogeneous or heterogeneous
CMP structures fall into two categories: homogeneous and heterogeneous. In a homogeneous CMP all internal cores have the same structure; in a heterogeneous CMP the cores differ. Studying which core structure suits which application is therefore crucial to the performance of future microprocessors. The structure of the core itself affects the area, power consumption, and performance of the entire chip, and how well a design inherits and builds on the achievements of traditional processors directly affects both the performance and the implementation schedule of a multi-core chip. In addition, according to Amdahl's law, the speedup of a program is limited by its serial part, so in theory a heterogeneous structure, which can devote a stronger core to the serial part, appears to offer better performance.
The instruction set used by the cores is also very important to the implementation of the system. Whether the cores use the same instruction set or different ones, and whether each of them can run the operating system, are also open research questions.
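The Amdahl's law argument can be made concrete with a small calculation. The sketch below is illustrative only: the function names, the assumption that one "big" core runs serial code twice as fast as a "small" core, and the assumption that it costs the area of four small cores are all hypothetical and do not come from any particular chip.

```python
def amdahl_speedup(serial_fraction: float, n_cores: int) -> float:
    """Speedup over one core when only the parallel part scales with n_cores."""
    if not 0.0 <= serial_fraction <= 1.0:
        raise ValueError("serial_fraction must be in [0, 1]")
    if n_cores < 1:
        raise ValueError("n_cores must be at least 1")
    parallel_fraction = 1.0 - serial_fraction
    return 1.0 / (serial_fraction + parallel_fraction / n_cores)


def heterogeneous_speedup(serial_fraction: float,
                          n_small_cores: int,
                          big_core_perf: float) -> float:
    """Speedup over one small core for a hypothetical heterogeneous chip.

    Assumption: the serial part runs on one big core that is
    big_core_perf times faster than a small core, and the parallel
    part runs on the small cores only.
    """
    parallel_fraction = 1.0 - serial_fraction
    return 1.0 / (serial_fraction / big_core_perf
                  + parallel_fraction / n_small_cores)


# Same (assumed) area budget: 16 small cores, or 12 small cores + 1 big core.
homogeneous = amdahl_speedup(0.1, 16)
heterogeneous = heterogeneous_speedup(0.1, 12, 2.0)
```

With a 10% serial fraction, the homogeneous chip reaches a speedup of about 6.4 and the heterogeneous one about 8.0 under these assumptions, which is the effect the paragraph above describes.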
2 Program Execution Model
The first issue in multi-core processor design is the choice of a program execution model. How well that model fits determines whether the multi-core processor can deliver the highest performance at the lowest cost. The program execution model is the interface between compiler designers and system implementers: compiler designers use it to decide how to translate a high-level language program into a target machine program, and system implementers decide how to implement the model efficiently on a specific target machine. When the target machine is a multi-core architecture, the questions that arise are: How does the multi-core architecture support the important program execution models? Are there other program execution models better suited to multi-core architectures? To what extent can these models meet the needs of applications and be accepted by users?
3 Cache Design: Multi-level Cache Design and Coherence Issues
The speed gap between the processor and main memory is a prominent problem for CMP, so a multi-level cache hierarchy must be used to alleviate it. Existing CMPs share the L1 cache, the L2 cache, or main memory. The usual choice is the shared-L2 structure: each processor core has a private L1 cache, and all cores share one L2 cache.
The cache architecture is also directly related to overall system performance. In a CMP, whether a shared or a private cache is better, whether to build a multi-level cache on the chip, and how many levels to build all have a large effect on the size, power consumption, layout, performance, and operating efficiency of the entire chip, so these questions need careful study.
On the other hand, multi-level caches raise coherence issues. The consistency model and coherence mechanism chosen have a significant impact on the overall performance of the CMP. The memory consistency models widely used in traditional multiprocessor architectures include sequential consistency, weak consistency, and release consistency. The corresponding cache coherence mechanisms are mainly bus snooping protocols and directory-based protocols. Most current CMP systems use bus-based snooping protocols.
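To illustrate how a bus snooping protocol keeps private caches in agreement, the following is a minimal sketch of the textbook MSI (Modified, Shared, Invalid) scheme for a single cache line. The source does not say which protocol any given CMP uses, so MSI is chosen here only as the simplest example; the class and method names are hypothetical.

```python
from enum import Enum


class State(Enum):
    MODIFIED = "M"
    SHARED = "S"
    INVALID = "I"


class SnoopingBus:
    """Toy MSI protocol for ONE cache line held in n private caches."""

    def __init__(self, n_cores: int, initial_value: int = 0):
        self.memory = initial_value
        self.states = [State.INVALID] * n_cores
        self.values = [None] * n_cores

    def _flush_owner(self, requester: int) -> None:
        """If another core holds the line Modified, write it back to memory."""
        for core, state in enumerate(self.states):
            if core != requester and state is State.MODIFIED:
                self.memory = self.values[core]
                self.states[core] = State.SHARED

    def read(self, core: int) -> int:
        if self.states[core] is State.INVALID:
            # Read miss: the request is broadcast and every cache snoops it.
            self._flush_owner(core)
            self.values[core] = self.memory
            self.states[core] = State.SHARED
        return self.values[core]

    def write(self, core: int, value: int) -> None:
        if self.states[core] is not State.MODIFIED:
            # Gain exclusive ownership: all other copies are invalidated.
            self._flush_owner(core)
            for other in range(len(self.states)):
                if other != core:
                    self.states[other] = State.INVALID
                    self.values[other] = None
        self.values[core] = value
        self.states[core] = State.MODIFIED


bus = SnoopingBus(n_cores=2, initial_value=7)
first = bus.read(0)      # core 0 loads the line in Shared state
bus.write(1, 9)          # core 1 takes ownership, core 0 is invalidated
second = bus.read(0)     # core 0 misses and receives the written-back value
```

Every miss and every ownership change is broadcast on the bus in this scheme, so bus traffic grows as cores are added.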
4 Inter-core Communication Technology
Programs running on the CPU cores of a CMP sometimes need to share and synchronize data, so the hardware must support inter-core communication. An efficient communication mechanism is essential to the high performance of a CMP processor. There are currently two mainstream on-chip communication mechanisms: a shared cache connected by a bus, and an on-chip interconnect.
In the bus-based shared cache structure, the CPU cores share an L2 or L3 cache that holds frequently used data, and they communicate over the bus that connects the cores. The advantages of this structure are its simplicity and high communication speed; the disadvantage is that a bus-based structure scales poorly.
In the on-chip interconnect structure, each CPU core has its own processing unit and cache, and the cores are connected through a crossbar switch or an on-chip network. The cores communicate by passing messages. The advantages of this structure are good scalability and guaranteed data bandwidth; the disadvantages are complex hardware and the large software changes it requires.
The two approaches may well end up cooperating rather than replacing each other, for example by using an on-chip network globally and buses locally to balance performance against complexity.
5 Bus Design
In traditional microprocessors, cache misses and memory accesses reduce the CPU's execution efficiency, and the efficiency of the bus interface unit (BIU) determines how large that penalty is. When several CPU cores request memory at the same time, or when misses occur in the private caches of several cores at once, the overall performance of the CMP system depends on how efficiently the BIU arbitrates among these requests and converts them into external memory accesses. It is therefore important to find an efficient multi-port BIU structure that converts the single-word main memory accesses of multiple cores into more efficient burst accesses. It is equally important to find the burst length (the number of words per burst) that gives the best overall efficiency for the CMP processor, and an efficient arbitration mechanism for multi-port BIU access.
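The idea of turning single-word requests into burst accesses can be sketched as follows. This is a simplified software model, not a hardware design: the function name, the request format `(core_id, word_address)`, the default burst length of four words, and the first-come arbitration order are all assumptions made for illustration.

```python
def coalesce_into_bursts(requests, burst_words=4):
    """Group single-word requests into aligned burst accesses.

    requests: list of (core_id, word_address) in arrival order.
    Returns a list of (burst_base_address, sorted core ids served),
    ordered by the arrival of the first request in each burst.
    """
    if burst_words < 1:
        raise ValueError("burst_words must be at least 1")
    bursts = {}  # dicts keep insertion order, giving first-come arbitration
    for core_id, address in requests:
        base = address - (address % burst_words)  # align to burst boundary
        bursts.setdefault(base, set()).add(core_id)
    return [(base, sorted(cores)) for base, cores in bursts.items()]


pending = [(0, 16), (1, 17), (2, 35), (3, 19), (0, 32)]
bursts = coalesce_into_bursts(pending, burst_words=4)
```

Here five single-word requests collapse into two bursts, one starting at word 16 and one at word 32. A longer burst merges more requests but may also fetch words nobody asked for, which is why the text treats the burst length as a parameter to be optimized.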
6 Operating system design: task scheduling, interrupt handling, synchronization and mutual exclusion
For multi-core CPUs, optimizing the operating system's task scheduling algorithm is key to efficiency. Common task scheduling algorithms are global queue scheduling and local queue scheduling. With global queue scheduling, the operating system maintains one global queue of waiting tasks; when a CPU core becomes idle, the operating system picks a ready task from the global queue and runs it on that core. The advantage of this method is high CPU core utilization. With local queue scheduling, the operating system maintains a separate queue of waiting tasks for each CPU core; when a core becomes idle, it picks a suitable task from its own queue. The advantage of this method is that tasks rarely need to migrate between cores, which improves the hit rate of each core's local cache. At present, most operating systems for multi-core CPUs use a scheduling algorithm based on a global queue.
Interrupt handling on a multi-core processor is very different from that on a single core. The cores need to communicate with one another through interrupts, so the local interrupt controllers of the individual cores and the global interrupt controller that arbitrates the distribution of interrupts among them must also be integrated on the chip.
In addition, a multi-core CPU is a multitasking system. Because different tasks compete for shared resources, the system must provide synchronization and mutual exclusion mechanisms. The traditional single-core solutions cannot meet the needs of multi-core systems, so mutual exclusion has to rely on atomic "read-modify-write" operations or other synchronization and mutual exclusion mechanisms provided by the hardware.
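The difference between the two scheduling policies can be shown with a small model. The class and method names below are hypothetical, and a real scheduler would add priorities, locking, and load balancing that this sketch leaves out.

```python
from collections import deque


class GlobalQueueScheduler:
    """One ready queue shared by all cores."""

    def __init__(self, tasks):
        self.ready = deque(tasks)

    def next_task(self, core_id):
        # Any idle core takes the task at the head of the shared queue.
        return self.ready.popleft() if self.ready else None


class LocalQueueScheduler:
    """One ready queue per core."""

    def __init__(self, n_cores):
        self.ready = [deque() for _ in range(n_cores)]

    def submit(self, task, core_id):
        self.ready[core_id].append(task)

    def next_task(self, core_id):
        # An idle core only looks at its own queue, so tasks stay put.
        queue = self.ready[core_id]
        return queue.popleft() if queue else None


global_sched = GlobalQueueScheduler(["a", "b", "c"])
picked_by_core1 = global_sched.next_task(core_id=1)

local_sched = LocalQueueScheduler(n_cores=2)
local_sched.submit("a", core_id=0)
local_sched.submit("b", core_id=0)
idle_result = local_sched.next_task(core_id=1)
```

In the global case core 1 immediately gets task "a". In the local case core 1 gets nothing even though core 0 has two tasks waiting, which shows the utilization cost that is traded for better cache locality.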
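A spinlock built on an atomic test-and-set is one common way such a hardware read-modify-write operation is used. Python has no direct access to these instructions, so in the sketch below the class `AtomicFlag` is a hypothetical stand-in whose internal `threading.Lock` merely models the atomicity that the hardware would provide; all names are illustrative.

```python
import threading
import time


class AtomicFlag:
    """Stand-in for a hardware test-and-set instruction (assumption:
    the internal Lock models the atomicity the hardware guarantees)."""

    def __init__(self):
        self._value = False
        self._hw = threading.Lock()

    def test_and_set(self) -> bool:
        """Atomically read the old value and set the flag to True."""
        with self._hw:
            old = self._value
            self._value = True
            return old

    def clear(self) -> None:
        with self._hw:
            self._value = False


class SpinLock:
    def __init__(self):
        self._flag = AtomicFlag()

    def acquire(self) -> None:
        # Spin until test_and_set reports that the flag was previously clear.
        while self._flag.test_and_set():
            time.sleep(0)  # yield so the current holder can make progress

    def release(self) -> None:
        self._flag.clear()


def run_counter(n_threads: int = 4, increments: int = 500) -> int:
    """Increment a shared counter from several threads under the spinlock."""
    lock = SpinLock()
    total = 0

    def worker():
        nonlocal total
        for _ in range(increments):
            lock.acquire()
            total += 1  # critical section
            lock.release()

    threads = [threading.Thread(target=worker) for _ in range(n_threads)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return total
```

Calling `run_counter(4, 500)` returns 2000 because only one thread at a time can observe the flag as clear and enter the critical section.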
7 Low-power design
The rapid development of semiconductor technology has steadily raised the integration level of microprocessors. At the same time, processor surface temperatures have kept climbing, growing exponentially, and processor power density can double every three years. Low-power and thermally optimized design has therefore become a core issue in microprocessor research, and the multi-core structure of CMP makes power consumption a crucial research topic for it.
Low-power design is a multi-level problem that has to be studied at the operating system level, the algorithm level, the architecture level, the circuit level, and so on. The methods at each level achieve different results: the higher the level of abstraction, the more pronounced the reduction in power consumption and temperature.
8 Memory Wall
To keep the cores on the chip fully busy, the chip must at least be supplied with memory bandwidth that matches its performance. A larger on-chip cache solves part of the problem, but as performance rises further, other means are needed to increase the bandwidth of the memory interface, such as raising the per-pin data rate with DDR, DDR2, QDR, or XDR. The system must likewise have memory that can deliver this bandwidth. As a result, the chip places ever higher demands on packaging. Although the number of package pins grows by about 20% a year, this still cannot fully solve the problem, and it also raises cost. How to provide a high-bandwidth, low-latency memory interface is therefore an important problem that must be solved.
9 Reliability and security design
Through continued technological innovation, processors have penetrated every aspect of modern society, but they carry serious hidden risks. On the one hand, the reliability of the processor structure itself is declining: with ever-finer process geometries, high clock speeds, and low supply voltages, design margins are increasingly difficult to guarantee, and the failure rate is gradually rising. On the other hand, malicious attacks by third parties are growing in number and sophistication and have become a widespread social problem. Improving reliability and security has consequently attracted much attention in computer architecture research.
In the future, structures such as CMP, in which multiple processes execute simultaneously inside one processor chip, will become mainstream. Given the added hardware complexity and the design errors that come with it, the inside of the processor chip itself may not be secure. There is therefore still a long way to go in security and reliability design.