A Deep Dive into Automotive SoCs: Cache, Superscalar, Out-of-Order Execution

Publisher: 电子思维 | Last update: 2022-03-04 | Source: 佐思产研

1. Cache


CPU specifications often list the capacities of the 1st, 2nd, and 3rd level caches, and sometimes you will also see the label L1$. What does this mean?


TLB and two-level cache

Source: Internet

A cache is a high-speed memory with faster access than ordinary random access memory (RAM). Unlike system main memory, a cache usually does not use DRAM technology; instead it uses SRAM, which is more expensive but faster.

The cache works as follows: when the CPU wants to read a piece of data, it first looks for it in the cache. If the data is found there, it is returned to the CPU immediately. If not, the data is read from the slower main memory and handed to the CPU, and at the same time the block containing that data is loaded into the cache, so that later accesses to the same block can be served from the cache without going back to memory.

This mechanism gives the CPU a very high cache hit rate (around 90% on most CPUs): roughly 90% of the data the CPU needs next is already in the cache, and only about 10% has to be fetched from main memory. This greatly reduces the time spent on direct memory reads and means the CPU rarely has to wait for data.

Generally speaking, the CPU always tries the cache first and only falls back to main memory on a miss.
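To get a feel for how much a ~90% hit rate helps, here is a minimal sketch that computes the average memory access time. The 4-cycle cache latency and 100-cycle DRAM latency are illustrative assumptions, not figures from this article.

#include <stdio.h>

int main(void)
{
    /* Illustrative latencies, not measured values. */
    const double cache_latency  = 4.0;    /* cycles for a cache hit   */
    const double memory_latency = 100.0;  /* cycles for a DRAM access */
    const double hit_rate       = 0.90;   /* ~90% as cited in the text */

    /* Average access time = hit_rate * t_cache
       + miss_rate * (t_cache + t_memory): a miss still pays
       the cache lookup before going to DRAM.                  */
    double amat = hit_rate * cache_latency
                + (1.0 - hit_rate) * (cache_latency + memory_latency);

    printf("average access time: %.1f cycles vs. %.1f cycles without a cache\n",
           amat, memory_latency);
    return 0;
}

With these assumed numbers the average access time drops from 100 cycles to about 14 cycles, which is why even a small cache pays for itself.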

The first-level cache is the L1 cache; because "cache" is pronounced like the English word "cash", it is sometimes written as L1$. The L1 cache is split into an instruction part and a data part and is private to each core. The second-level cache, the L2 cache, does not distinguish between instructions and data; in most designs it is also private to each core, although in a few many-core designs (say, more than 12 cores) an L2 cache may be shared by four or more cores. The third-level cache, the L3 cache, is connected to the cores through a bus and is shared by multiple cores.

Because the cache uses SRAM, its transistor density is low and it occupies a large die area, which means high cost. Simply put, the more cache, the higher the cost and the better the performance. The cache stores data in fixed-size units: each unit is a cache entry, and the unit of storage is called a cache line or cache block. Given the cache capacity and cache line size, the number of entries it can hold is fixed. Because the cache line is a fixed size, the data fetched from DRAM is also a fixed size. On x86, the cache line size matches the amount of data delivered by a single DDR3/DDR4 memory access: 64 bytes.
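As a rough sketch of how a cache locates a line, an address can be split into an offset within the 64-byte line, an index that selects a set, and a tag that identifies the block. The 32 KB, 8-way geometry below is assumed purely for illustration.

#include <stdint.h>
#include <stdio.h>

/* Assumed cache geometry for illustration: 32 KB, 8-way, 64-byte lines. */
#define LINE_SIZE   64u
#define CACHE_SIZE  (32u * 1024u)
#define WAYS        8u
#define NUM_SETS    (CACHE_SIZE / LINE_SIZE / WAYS)   /* 64 sets */

int main(void)
{
    uint64_t addr = 0x7ffe12345678ull;               /* arbitrary example address */

    uint64_t offset = addr % LINE_SIZE;              /* byte within the 64 B line  */
    uint64_t index  = (addr / LINE_SIZE) % NUM_SETS; /* which set to probe         */
    uint64_t tag    = addr / LINE_SIZE / NUM_SETS;   /* identifies the block       */

    printf("tag=%#llx index=%llu offset=%llu\n",
           (unsigned long long)tag,
           (unsigned long long)index,
           (unsigned long long)offset);
    return 0;
}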

The L1 cache usually sits closest to where the CPU core needs the data, while the L2 cache sits further out. Accessing L2 therefore means going through longer copper wires and more circuitry, which increases latency.

The L1 cache is divided into an ICache (instruction cache) and a DCache (data cache). The ICache is usually placed near the CPU core's instruction prefetch unit, and the DCache near its load/store unit. The L2 cache sits outside the CPU pipeline. There is also the L3 cache, usually a larger cache shared by multiple cores.

The caches are searched level by level: if the data is not found in the L1 cache, the L2 cache is checked, then the L3 cache, and finally the off-chip memory. Because there are so many kinds of memory to manage, a memory management unit (MMU) is introduced. The MMU is a hardware unit controlled by page tables stored in main memory (a two-level page table in the simple case). Its main job is to map the virtual addresses issued by the CPU core to physical addresses and to perform hardware access-permission checks. The MMU gives each user process its own address space and, through these permission checks, protects the memory used by each process from being corrupted by other processes.

Once the MMU is introduced, the processor needs two memory accesses to read an instruction or a piece of data: first it queries the page table to obtain the physical address, then it accesses that physical address. To reduce the resulting performance loss, the TLB was introduced. TLB stands for Translation Lookaside Buffer, sometimes translated as "address translation buffer" and also known as the "fast table". Simply put, the TLB is a cache of the page table: it stores the page table entries most likely to be accessed, and its contents are copies of a subset of page table entries. Only when the TLB cannot complete an address translation is the page table in memory consulted, which reduces the performance penalty of page table walks.


What is a Page Table?

This is an operating-system term. One of the main tasks of an operating system is to isolate programs from one another, which requires giving each program its own memory space, i.e., its own set of addresses. The page table is normally managed by the operating system. On a 32-bit operating system, to support a 4 GB per-process virtual address space with a 4 KB page size, there are 2 to the power of 20 pages in total.

With a flat, single-level page table, 2 to the power of 20 page table entries are required. If each entry is 4 bytes, every process needs 1,048,576 × 4 bytes = 4 MB of memory just to hold its page table. That is too large and too costly, so the page table is split into levels. With a two-level page table, only the page directory needs to be allocated when a process is created, occupying 1024 × 4 bytes = 4 KB; the second-level page tables are allocated only when they are actually used. On 64-bit systems a four-level page table is required. Since Linux v2.6.11, the adopted scheme has been a four-level page table consisting of:

PGD: Page Global Directory (bits 47-39)

PUD: Page Upper Directory (bits 38-30)

PMD: Page Middle Directory (bits 29-21)

PTE: Page Table Entry (bits 20-12)
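A minimal sketch of how these four index fields can be extracted from a 48-bit virtual address, using the bit ranges listed above (9 bits per level plus a 12-bit page offset); this is an illustration, not actual kernel code.

#include <stdint.h>
#include <stdio.h>

int main(void)
{
    uint64_t vaddr = 0x00007f1234567000ull;      /* example virtual address */

    uint64_t pgd_index = (vaddr >> 39) & 0x1ff;  /* bits 47..39 */
    uint64_t pud_index = (vaddr >> 30) & 0x1ff;  /* bits 38..30 */
    uint64_t pmd_index = (vaddr >> 21) & 0x1ff;  /* bits 29..21 */
    uint64_t pte_index = (vaddr >> 12) & 0x1ff;  /* bits 20..12 */
    uint64_t offset    =  vaddr        & 0xfff;  /* bits 11..0  */

    printf("PGD=%llu PUD=%llu PMD=%llu PTE=%llu offset=%llu\n",
           (unsigned long long)pgd_index, (unsigned long long)pud_index,
           (unsigned long long)pmd_index, (unsigned long long)pte_index,
           (unsigned long long)offset);
    return 0;
}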

For any instruction that contains an address, that address should be regarded as a virtual memory address, not a physical one. If register a0 holds the address 0x1000, that is a virtual address. The virtual address is passed to the memory management unit (MMU) and translated into a physical address, which is then used to index physical memory for the load or store. From the CPU's perspective, once the MMU is turned on, every address in every instruction it executes is a virtual address. To perform the translation from virtual to physical addresses, the MMU maintains a table with virtual addresses on one side and physical addresses on the other.

The core idea of paging is to treat the virtual address space and the physical address space as collections of small, fixed-size blocks. A block of virtual address space is called a page, and a block of physical address space is called a frame; each page can be mapped to a frame, and each page gets its own table entry, so address translation is done per page. In RISC-V, a page is 4 KB, i.e., 4096 bytes. A virtual address is split into two parts: an index and an offset. The index selects the page, and the offset selects a byte within that page. When the MMU translates an address, it uses the index from the virtual address to look up the page table and find the corresponding physical page (frame) number; that frame covers 4096 bytes of physical memory, and the offset then points to one of those 4096 bytes. If the offset is 12, the 12th byte of the page is used. Adding the offset to the starting address of the frame gives the physical memory address.
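The index/offset split can be made concrete with a toy single-level translation; the page-table array and frame numbers below are made-up values, and a 4 KB page size is assumed as in the text.

#include <stdint.h>
#include <stdio.h>

#define PAGE_SIZE 4096u

/* Toy page table: index = virtual page number, value = physical frame number.
   The frame numbers here are arbitrary values chosen for illustration.       */
static const uint32_t page_table[4] = { 7, 3, 12, 5 };

int main(void)
{
    uint32_t vaddr  = 0x1000 + 12;                 /* virtual page 1, offset 12 */
    uint32_t vpn    = vaddr / PAGE_SIZE;           /* index: selects the page   */
    uint32_t offset = vaddr % PAGE_SIZE;           /* byte within the 4 KB page */

    uint32_t frame  = page_table[vpn];             /* look up the physical frame */
    uint32_t paddr  = frame * PAGE_SIZE + offset;  /* frame base + offset        */

    printf("virtual %#x -> physical %#x\n", (unsigned)vaddr, (unsigned)paddr);
    return 0;
}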

The page table itself is stored in memory. With a four-level page table, that means a single memory access requires four page-table lookups in memory just to translate the virtual address into a physical one; counting the actual data access, the worst case takes five memory accesses to fetch one piece of data. This is far too slow and also increases power consumption, so the TLB, a cache for the page table, was born.

When the CPU receives a virtual address from an application, it first searches the TLB for the corresponding page table entry. If the required entry is in the TLB (a TLB hit), the MMU obtains it from the TLB and translates the address. The CPU then checks whether the data at that physical address is already in the L1 or L2 cache; if not, it fetches the data from memory. If the required entry is not in the TLB (a TLB miss), the page table in physical memory must be accessed and the TLB updated with the new entry. Like the L1 cache, the TLB is split into an instruction TLB and a data TLB.
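The hit/miss flow described above can be sketched as a tiny direct-mapped TLB; the 16-entry size and the walk_page_table stand-in are assumptions made purely for illustration.

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define PAGE_SIZE   4096u
#define TLB_ENTRIES 16u          /* assumed size; real TLBs vary */

struct tlb_entry { bool valid; uint64_t vpn; uint64_t pfn; };
static struct tlb_entry tlb[TLB_ENTRIES];

/* Stand-in for a real page-table walk; here it just fabricates a frame number. */
static uint64_t walk_page_table(uint64_t vpn) { return vpn + 100; }

static uint64_t translate(uint64_t vaddr)
{
    uint64_t vpn  = vaddr / PAGE_SIZE;
    uint64_t slot = vpn % TLB_ENTRIES;

    if (tlb[slot].valid && tlb[slot].vpn == vpn) {
        /* TLB hit: translation completes without touching memory. */
        return tlb[slot].pfn * PAGE_SIZE + vaddr % PAGE_SIZE;
    }

    /* TLB miss: walk the in-memory page table, then refill the TLB. */
    uint64_t pfn = walk_page_table(vpn);
    tlb[slot] = (struct tlb_entry){ .valid = true, .vpn = vpn, .pfn = pfn };
    return pfn * PAGE_SIZE + vaddr % PAGE_SIZE;
}

int main(void)
{
    printf("%#llx\n", (unsigned long long)translate(0x1000));
    printf("%#llx\n", (unsigned long long)translate(0x1008)); /* same page: now a hit */
    return 0;
}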


Typical architecture Cortex-A78 cache instruction flow

Image source: ARM

The MOP cache is a macro-operation cache that holds pre-decoded (and possibly fused) instructions.


2. Superscalar

Early computers executed everything serially. As demand for throughput grew, parallel computing emerged. Parallelism is commonly divided into three kinds: instruction-level parallelism, data parallelism, and task parallelism. Task parallelism can only be realized with software support; on the hardware side, the three forms are instruction-level parallelism (ILP), thread-level parallelism, and data parallelism.
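As a small illustration of instruction-level parallelism: in the snippet below the first two statements are independent, so a superscalar core can issue them in the same cycle, while the third depends on both and must wait. This is a conceptual sketch; the compiler and hardware decide the actual scheduling.

#include <stdio.h>

int main(void)
{
    int x = 3, y = 5;

    int a = x + 1;   /* independent: can issue in parallel with the next line */
    int b = y * 2;   /* independent: no data dependence on the line above     */
    int c = a + b;   /* depends on both results, so it must issue afterwards  */

    printf("%d\n", c);
    return 0;
}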
