1. Cache
CPU specifications often list the first-, second-, and third-level cache capacities, and sometimes you will also see the label L1$. What is this?
Figure: TLB and two-level cache (source: Internet)
Cache is high-speed memory with faster access than ordinary random access memory (RAM). It is usually built not with the DRAM technology used for system main memory, but with SRAM technology, which is more expensive but faster.
The cache works as follows: when the CPU wants to read a piece of data, it first looks in the cache. If the data is found there, it is read immediately and sent to the CPU for processing. If not, it is read from the (relatively slower) memory and sent to the CPU, and at the same time the block containing that data is loaded into the cache, so that later reads of the same block can be served from the cache without going back to memory.
This mechanism gives CPU cache reads a very high hit rate (around 90% on most CPUs): roughly 90% of the data the CPU needs next is already in the cache, and only about 10% has to be fetched from memory. This greatly reduces the time the CPU spends reading memory directly and means the CPU rarely has to wait for data.
Generally speaking, the CPU reads data from the cache first and then from the memory.
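As a rough illustration of why the hit rate matters so much, the sketch below computes an average memory access time from the ~90% hit rate mentioned above; the cycle counts are assumed placeholder latencies, not figures from the article:

```c
#include <stdio.h>

/* Average memory access time (AMAT) for a single cache level.
 * hit_rate, t_cache and t_mem are illustrative assumptions. */
int main(void)
{
    double hit_rate = 0.90;   /* assumed cache hit rate            */
    double t_cache  = 4.0;    /* assumed cache access time, cycles */
    double t_mem    = 100.0;  /* assumed DRAM access time, cycles  */

    /* On a miss we pay the cache lookup plus the memory access. */
    double amat = hit_rate * t_cache + (1.0 - hit_rate) * (t_cache + t_mem);

    printf("AMAT = %.1f cycles\n", amat);  /* 0.9*4 + 0.1*104 = 14.0 */
    return 0;
}
```

With these assumed numbers the average access costs about 14 cycles instead of the 100-plus cycles of a direct memory read, which is the saving the text describes.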
The first-level cache is the L1 Cache. Because "cache" is pronounced like the English word "cash", it is sometimes written as L1$. The L1 cache is split into an instruction cache and a data cache. The second-level cache, the L2 Cache, does not distinguish between instructions and data. The L1 cache is private to each core. In a minority of many-core designs (for example, more than 12 cores) the L2 cache is shared by a group of four or more cores, but in most designs the L2 cache is also private to each core. The third-level cache, the L3 Cache, is connected to the cores over a bus and is shared by multiple cores.
Because the cache uses SRAM, each bit needs more transistors, so its storage density is low and it occupies a large area, which means high cost. Simply put, the more cache is used, the higher the cost and the better the performance. The cache stores data in fixed-size units: each unit is a cache entry, and the unit is called a cache line or cache block. Given the cache capacity and the cache line size, the number of entries it can hold is fixed. Because the cache line is a fixed size, the data it fetches from DRAM is also a fixed size. On x86, the cache line size matches the amount of data returned by a single DDR3/DDR4 memory access, namely 64 bytes.
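To make the "fixed number of entries" point concrete, here is a small sketch that derives the line count and address-field widths for a hypothetical 32 KB, 8-way cache with 64-byte lines; the capacity and associativity are assumptions chosen only for illustration:

```c
#include <stdio.h>

#define CACHE_SIZE  (32 * 1024)  /* assumed capacity: 32 KB   */
#define LINE_SIZE   64           /* cache line size: 64 bytes */
#define WAYS        8            /* assumed associativity     */

int main(void)
{
    int lines = CACHE_SIZE / LINE_SIZE;   /* 512 lines */
    int sets  = lines / WAYS;             /* 64 sets   */

    /* A 64-byte line needs 6 offset bits; 64 sets need 6 index bits.
     * The remaining upper address bits form the tag. */
    int offset_bits = 6;                  /* log2(LINE_SIZE) */
    int index_bits  = 6;                  /* log2(sets)      */

    printf("lines=%d sets=%d offset_bits=%d index_bits=%d\n",
           lines, sets, offset_bits, index_bits);
    return 0;
}
```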
Usually the L1 cache sits closest to where the CPU core needs the data, while the L2 cache sits at the edge of the core. Accessing the L2 cache means signals travel over longer copper wires and through more circuits, which increases the delay.
L1 Cache is divided into ICache (instruction cache) and DCache (data cache). Instruction cache ICache is usually placed near the instruction prefetch unit of the CPU core, and data cache DCache is usually placed near the load/store unit of the CPU core. L2 Cache is placed outside the CPU pipeline. There is also L3 cache, which is usually a multi-core shared cache with a larger capacity.
The cache is searched level by level: if the data is not found in the L1 cache, the L2 cache is searched, then the L3 cache, and finally the off-chip memory. Because there are so many kinds of memory, a memory management unit, or MMU, is needed. The MMU is a hardware unit controlled by page tables stored in main memory. Its main job is to map the virtual addresses issued by the CPU core to physical addresses and to provide hardware checks of memory access permissions. The MMU gives every user process its own address space, and through these permission checks it protects the memory used by each process from being corrupted by other processes.
Once the MMU is introduced, the processor must access memory twice to read an instruction or a piece of data: first it queries the page table to obtain the physical address, and then it accesses that physical address to read the instruction or data. To reduce the resulting performance loss, the TLB was introduced. TLB stands for Translation Lookaside Buffer, sometimes translated as "address translation buffer" and also called the "fast table". Simply put, the TLB is a cache of the page table: it holds the page table entries most likely to be accessed right now, its contents being copies of some page table entries. Only when the TLB cannot complete an address translation does the hardware fall back to the page table in memory, which reduces the performance penalty of page-table lookups.
What is a Page Table?
This is an operating-system term. One of an operating system's main tasks is to isolate programs from one another, so each program needs its own memory space, that is, its own addresses. The page table layout is determined by the operating system. For a 32-bit operating system supporting a 4 GB virtual address space per process, and assuming a page size of 4 KB, there are 2 to the power of 20 pages in total.
With a simple single-level page table, 2 to the power of 20 page table entries are needed. If each entry is 4 bytes, every process needs 1,048,576 × 4 bytes = 4 MB of memory just to hold its page table. That is too large and too costly, so the page table is split into levels. With a two-level page table, only one page directory is needed when a process is created, occupying 1024 × 4 bytes = 4 KB; the second-level tables are allocated only when they are actually used. On 64-bit systems a four-level page table is required. Since Linux v2.6.11, the adopted scheme is a four-level page table (see the bit-field sketch after the list below), whose levels are:
PGD: Page Global Directory (bits 47–39)
PUD: Page Upper Directory (bits 38–30)
PMD: Page Middle Directory (bits 29–21)
PTE: Page Table Entry (bits 20–12)
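The bit ranges above can be read as index fields of a 48-bit virtual address. The sketch below extracts them the way a simplified four-level walk would; it only illustrates the field layout and is not kernel code, and the example address is made up:

```c
#include <stdio.h>
#include <stdint.h>

/* Split a 48-bit virtual address into the four 9-bit table indices
 * and the 12-bit page offset used by a 4-level page table. */
int main(void)
{
    uint64_t va = 0x00007f1234567abcULL;   /* arbitrary example address */

    unsigned pgd = (va >> 39) & 0x1ff;     /* bits 47..39, PGD index  */
    unsigned pud = (va >> 30) & 0x1ff;     /* bits 38..30, PUD index  */
    unsigned pmd = (va >> 21) & 0x1ff;     /* bits 29..21, PMD index  */
    unsigned pte = (va >> 12) & 0x1ff;     /* bits 20..12, PTE index  */
    unsigned off =  va        & 0xfff;     /* bits 11..0, page offset */

    printf("pgd=%u pud=%u pmd=%u pte=%u offset=0x%x\n",
           pgd, pud, pmd, pte, off);
    return 0;
}
```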
For any instruction that contains an address, that address should be considered a virtual address rather than a physical address. Suppose register a0 holds the address 0x1000; this is a virtual address. The virtual address is passed to the memory management unit (MMU) and translated into a physical address, and that physical address is then used to index physical memory and load or store data. From the CPU's point of view, once the MMU is enabled, every address in every instruction it executes is a virtual address. To translate virtual addresses into physical addresses, the MMU maintains a table with virtual addresses on one side and physical addresses on the other.
The core idea of paging is to divide both the virtual address space and the physical address space into small blocks of fixed size. Blocks of virtual address space are called pages, and blocks of physical address space are called frames; each page can be mapped to a frame, and each page gets one page table entry, so address translation is done per page. In RISC-V a page is 4 KB, that is, 4096 bytes. A virtual address is split into two parts, an index and an offset: the index locates the page, and the offset selects the byte within that page. During translation, the MMU reads the index from the virtual address and uses it to find the page number in physical memory; that frame corresponds to 4096 bytes of physical memory, and the offset then points to one of those 4096 bytes. For example, if the offset is 12, byte 12 of the page is used. Adding the offset to the starting address of the frame gives the physical memory address.
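Here is a minimal single-level sketch of that index/offset split for 4 KB pages; the tiny page-to-frame array is a made-up mapping, purely for illustration:

```c
#include <stdio.h>
#include <stdint.h>

#define PAGE_SHIFT 12                    /* 4 KB pages: 12 offset bits */
#define PAGE_SIZE  (1u << PAGE_SHIFT)

/* Toy mapping: virtual page number -> physical frame number. */
static const uint32_t frame_of_page[4] = { 7, 3, 12, 5 };

/* Translate a virtual address using the toy single-level table. */
static uint64_t translate(uint64_t va)
{
    uint64_t vpn    = va >> PAGE_SHIFT;      /* index: which page      */
    uint64_t offset = va & (PAGE_SIZE - 1);  /* which byte in the page */
    uint64_t frame  = frame_of_page[vpn];    /* look up the frame      */
    return (frame << PAGE_SHIFT) + offset;   /* frame base + offset    */
}

int main(void)
{
    uint64_t va = 0x100C;                    /* page 1, offset 12 */
    printf("va=0x%llx -> pa=0x%llx\n",
           (unsigned long long)va, (unsigned long long)translate(va));
    return 0;
}
```

In this toy example page 1 maps to frame 3, so virtual address 0x100C translates to physical address 0x300C: the offset 0xC is simply added to the frame's base address.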
The page table lives in memory, which means that a single memory access may require four page-table lookups in memory just to translate the virtual address into a physical address. Adding the actual data access, in the worst case it takes five memory accesses to fetch one piece of data. That is too slow and also increases power consumption, so the TLB was born: the TLB is the page table's cache.
When the CPU receives a virtual address from an application, it first looks in the TLB for the corresponding page table entry. The MMU takes the entry from the TLB and translates the address into a physical address. If the required entry is in the TLB, this is a TLB hit. The CPU then checks whether the data at that physical address is already in the L1 or L2 cache; if not, it fetches the data from memory. If the required entry is not in the TLB, this is a TLB miss: the page table in physical memory must be accessed and the entry in the TLB updated. Like the L1 cache, the TLB is also split into an instruction TLB and a data TLB.
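The following sketch mirrors that hit/miss flow with a tiny fully associative TLB; the TLB size, the replacement policy, the `walk_page_table()` stand-in, and the returned frame numbers are all assumptions made only for illustration:

```c
#include <stdio.h>
#include <stdint.h>
#include <stdbool.h>

#define TLB_ENTRIES 8
#define PAGE_SHIFT  12

struct tlb_entry { bool valid; uint64_t vpn; uint64_t pfn; };
static struct tlb_entry tlb[TLB_ENTRIES];
static unsigned next_victim;                 /* trivial replacement policy */

/* Stand-in for a walk of the in-memory page table (hypothetical mapping). */
static uint64_t walk_page_table(uint64_t vpn) { return vpn + 100; }

static uint64_t translate(uint64_t va)
{
    uint64_t vpn = va >> PAGE_SHIFT;

    /* TLB hit: the translation is already cached. */
    for (int i = 0; i < TLB_ENTRIES; i++)
        if (tlb[i].valid && tlb[i].vpn == vpn)
            return (tlb[i].pfn << PAGE_SHIFT) | (va & 0xfff);

    /* TLB miss: walk the page table in memory, then refill the TLB. */
    uint64_t pfn = walk_page_table(vpn);
    tlb[next_victim] = (struct tlb_entry){ true, vpn, pfn };
    next_victim = (next_victim + 1) % TLB_ENTRIES;
    return (pfn << PAGE_SHIFT) | (va & 0xfff);
}

int main(void)
{
    printf("first access : pa=0x%llx (miss, walks page table)\n",
           (unsigned long long)translate(0x2004));
    printf("second access: pa=0x%llx (hit, served from TLB)\n",
           (unsigned long long)translate(0x2004));
    return 0;
}
```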
Figure: Typical architecture, Cortex-A78 cache and instruction flow (image source: ARM)
The MOP (macro-op) cache holds pre-decoded, fused operations.
2. Superscalar
Early computers computed serially. As the demand for throughput grew, parallel computing emerged. Parallelism commonly comes in three forms: instruction parallelism, data parallelism, and task parallelism. Task parallelism can only be achieved with the help of software; for hardware there are three kinds: instruction-level parallelism (ILP), thread-level parallelism, and data parallelism.