A Deep Dive into Automotive SoCs: Cache, Superscalar, Out-of-Order Execution

Publisher: 电子思维 · Last updated: 2022-03-04 · Source: 佐思产研 (ResearchInChina)

Commit is the step in which an instruction delivers its result to (writes) the destination register or memory location. To support it, a hardware buffer is added at the Write Result stage: results are stored there first, from where they can be supplied to other instructions that need them. Only when an instruction reaches the Commit stage is its result copied from the buffer to the destination register or memory location. This hardware buffer is called the reorder buffer (ROB). The ROB ensures that, although instructions may execute out of order, they are issued (issue/dispatch) in program order and committed (commit) in program order. Through the ROB, precise exceptions and hardware speculation can be implemented. And because the ROB commits instructions in order, WAR and WAW hazards are also eliminated. A precise exception means that when an instruction raises an exception (division by zero, page fault, etc.), all earlier instructions have already completed and no later instruction has modified any register or memory, so the visible effect is identical to sequential execution.
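The in-order-commit behavior described above can be sketched in a few lines of Python. This is a toy model, not any real CPU; `ReorderBuffer`, `issue`, `write_result`, and `commit` are invented names for the sketch. The point it illustrates: a younger instruction may finish executing first, but its result stays buffered in the ROB until every older instruction has committed.

```python
from collections import deque

class ROBEntry:
    def __init__(self, dest):
        self.dest = dest        # destination register name
        self.value = None       # result buffered here until commit
        self.done = False       # has the instruction finished executing?

class ReorderBuffer:
    """Toy ROB: results may complete out of order, but commit strictly in order."""
    def __init__(self):
        self.entries = deque()
        self.regs = {}          # committed architectural state

    def issue(self, dest):
        e = ROBEntry(dest)
        self.entries.append(e)  # entries are allocated in program order
        return e

    def write_result(self, entry, value):
        entry.value, entry.done = value, True  # buffered, not yet visible

    def commit(self):
        # Only the oldest entry may commit; younger completed entries wait.
        while self.entries and self.entries[0].done:
            e = self.entries.popleft()
            self.regs[e.dest] = e.value   # result becomes architectural

rob = ReorderBuffer()
a = rob.issue("r1")
b = rob.issue("r2")
rob.write_result(b, 20)            # the younger instruction finishes first
rob.commit()
assert rob.regs == {}              # r2 cannot commit ahead of r1
rob.write_result(a, 10)
rob.commit()
assert rob.regs == {"r1": 10, "r2": 20}
```

Discarding a mispredicted or faulting instruction is then just a matter of dropping its not-yet-committed entries from the tail of the queue, which is why the ROB gives precise exceptions for free.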


Out-of-order execution microarchitecture

Image source: Internet

In the ROB diagram, FP denotes floating-point operations. Another benefit of the reorder buffer is speculative execution: the processor can execute instructions based on a prediction, because executed instructions are not committed immediately but held in the ROB. Even if a misprediction causes an instruction that should never have run to execute, that instruction is never committed and is simply discarded.


Out-of-order execution, in-order commit

Image source: Internet

There are two variants of register renaming, with slight differences. One is called explicit renaming: the ROB does not record instruction results; both the data awaiting commit and the data in the speculative state are stored in physical registers, so the number of physical registers exceeds the number of logical registers. In the implicit renaming scheme, the ROB (Reorder Buffer) holds the results of instructions that have executed but not yet committed, while the ARF (Architectural Register File) holds the values written by instructions that have already committed. Since in the implicit scheme the ARF stores only committed values and speculative values live in the ROB, the number of physical registers required equals the number of logical registers. The implicit scheme also needs a mapping table recording where in the ROB each operand currently resides.

Compared with explicit renaming, implicit renaming needs fewer physical registers and therefore costs less, but each operand lives in two places, the ROB and the ARF, during its lifetime, which makes reading data more complex and consumes more power. In the Intel architecture, out-of-order execution is extremely complex, and an Allocator is introduced. The Allocator manages the RAT (Register Alias Table), the ROB (Re-Order Buffer), and the RRF (Retirement Register File).
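The implicit scheme's operand lookup can be sketched minimally, with invented names (`RAT`, `ARF`, `ROB` here are plain dicts standing in for the real structures): if the mapping table points a logical register at a ROB slot, the speculative value is read from the ROB; otherwise the committed value comes from the ARF.

```python
# Implicit-renaming operand lookup (toy sketch; all names hypothetical).
ARF = {"r1": 5, "r2": 7}       # committed architectural values
ROB = {3: 42}                  # ROB slot 3 holds an in-flight result
RAT = {"r1": ("rob", 3)}       # mapping table: r1 renamed to ROB slot 3

def read_operand(reg):
    # Unmapped registers default to their committed ARF value.
    kind, loc = RAT.get(reg, ("arf", reg))
    return ROB[loc] if kind == "rob" else ARF[loc]

assert read_operand("r1") == 42   # speculative value, read from the ROB
assert read_operand("r2") == 7    # committed value, read from the ARF
```

This two-location lifetime is exactly the extra read complexity and power cost the paragraph above refers to.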

The RAT maps renamed virtual registers (called architectural registers or logical registers) to entries in the ROB or the RRF. There are two RATs, one independent copy per thread, each containing 128 renamed registers. The RAT points either to the most recent execution state of a register in the ROB or to the final committed state held in the RRF.

The ROB is a queue that restores instructions executed out of order to the original program order, ensuring that all instructions logically preserve the correct cause-and-effect relationships. Instructions whose order was disturbed (by branch prediction, hardware prefetching, etc.) are inserted into this queue one by one. When an instruction is sent through the RAT to the next stage for execution, it (together with its register state) is added at one end of the ROB queue, and fully executed instructions (with their register state) are removed from the other end (in the meantime their data may be updated with intermediate calculation results). Because the scheduler is in-order, this queue is sequential as well. Removing an instruction from the ROB means the instruction has completed; this stage is called Retirement. Accordingly, the ROB is often also called the Retirement Unit (in effect, the commit unit) and is counted as the last part of the pipeline.

In some superscalar designs, the Retire stage writes ROB data to the L1D cache (this is the case when the MOB is integrated into the ROB), while in other designs the write to the L1D cache is handled by a separate queue. In Nehalem, launched by Intel in 2008, for example, this operation is performed by the MOB (Memory Order Buffer).


Intel Nehalem Microarchitecture Front-End

Image source: Internet

Nehalem's 128-entry ROB acts as a buffer for intermediate results: it stores the instructions and data of speculative execution, which allows branch instructions whose direction is not yet resolved to execute in advance. In most cases speculation works out: the branch is guessed correctly, so the result produced in the ROB is marked as completed and can be used immediately by subsequent instructions without an L1 Data Cache load. (This is another important use of the ROB. Load operations are so frequent in typical x86 applications that they account for almost one third of instructions, so avoiding a large number of cache loads is significant.) In the remaining unlucky cases, the branch does not go as predicted. The speculated branch instruction sequence is then cleared, the pipeline stages holding the corresponding instructions are flushed, and the corresponding register states are all invalidated. These invalid register states will not, and cannot, appear in the RRF.


Branch Prediction

A pipelined architecture divides instruction execution into multiple stages. Each unit is responsible for only one stage of the process, and intermediate results are held in dedicated pipeline registers. If instruction execution is divided into 5 stages, then once all 5 units are running concurrently, in theory 5 instructions can be in flight at the same time. Of course, this is only the simplest case; reality is much more complicated.

For conditional branch instructions (conditional jumps such as Jcc at the assembly level, corresponding to an if at the source level), the processor must wait until the condition is evaluated to know whether it is true or false. In other words, the CPU must wait several clock cycles before it can determine which branch the next instruction to execute belongs to.

Modern CPUs have pipelines more than ten stages deep, so this wait is far too long. Do we simply have to wait? Of course not. The CPU uses "branch prediction" to guess which branch will be executed next and runs it in advance. If, once the condition resolves, the if matches the predicted branch, we win: the execution stages never pause at all. But if the prediction turns out to be wrong, execution must restart from instruction fetch and run the instructions of the other branch.
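The cost of guessing wrong is easy to estimate with a back-of-the-envelope model (the numbers below are illustrative assumptions, not measurements): extra cycles per instruction equals branch frequency × misprediction rate × flush penalty.

```python
def misprediction_cpi(branch_frac, mispredict_rate, flush_penalty):
    # Average extra cycles per instruction lost to pipeline flushes.
    return branch_frac * mispredict_rate * flush_penalty

# Illustrative figures: 20% of instructions are branches, 5% of them
# mispredict, and a refill of a >10-stage pipeline costs 15 cycles.
extra = misprediction_cpi(0.20, 0.05, 15)
assert abs(extra - 0.15) < 1e-9   # 0.15 extra cycles per instruction
```

Even a 5% miss rate adds measurable CPI on a deep pipeline, which is why the predictor structures below are worth their silicon.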

Two structures are used in branch prediction: the BHT (branch history table) and the BTB (branch target buffer). A plain BHT is rarely used on its own anymore. The BHT records the outcome (taken or not taken) of the most recent execution or executions of each branch instruction and predicts based on that history. With a BHT alone, the branch target address still has to be computed, so the scheme is only useful when resolving the branch condition takes longer than computing the target address. The BHT records taken/not-taken information, which at its simplest needs only 1 bit per entry, for example 1 for taken and 0 for not taken, and the table is indexed by the instruction's PC. On a 32-bit system, indexing by the full 32-bit address would require a 4 Gbit table, far beyond what the hardware provides, so typically the low 12 bits of the instruction address are used as the BHT index, allowing a 4 Kbit table to record the branch history.
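The 1-bit, PC-indexed table just described can be sketched directly. The class and method names are invented for this sketch; real designs also usually drop the low PC bits that are always zero for aligned instructions, which is omitted here for simplicity.

```python
class OneBitBHT:
    """Toy 1-bit branch history table indexed by the low 12 bits of the PC."""
    def __init__(self, index_bits=12):
        self.mask = (1 << index_bits) - 1
        self.table = [0] * (1 << index_bits)   # 4 Kbit: 0 = not taken

    def predict(self, pc):
        return self.table[pc & self.mask] == 1  # True = predict taken

    def update(self, pc, taken):
        # The 1-bit scheme simply records the most recent outcome.
        self.table[pc & self.mask] = 1 if taken else 0

bht = OneBitBHT()
pc = 0x80001234
assert bht.predict(pc) is False   # cold entry: predict not taken
bht.update(pc, taken=True)
assert bht.predict(pc) is True    # last outcome was taken
```

Note that two branches whose addresses share the same low 12 bits alias to one entry and interfere with each other, a known weakness of small history tables.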

The BTB stores the addresses of branch instructions that were taken, together with their target addresses, in a buffer indexed by the branch instruction's address; this buffer is the branch target buffer. The BTB thus records the jump target of each branch. Because it stores full instruction addresses, for example 32-bit addresses, it cannot hold as many entries as a BHT: supporting the same 4K instructions would take 128 Kbit of storage, almost the capacity of an L1 cache. BTBs are therefore generally very small, with only 32 or 64 entries.
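A BTB of the size mentioned above amounts to a tiny map from branch address to target address. The sketch below is a hypothetical 32-entry toy with FIFO eviction, not a model of any specific design (real BTBs are set-associative with tag matching).

```python
class TinyBTB:
    """Toy 32-entry branch target buffer with FIFO eviction."""
    def __init__(self, capacity=32):
        self.capacity = capacity
        self.entries = {}            # branch PC -> predicted target PC

    def lookup(self, pc):
        # None means "no prediction": fetch falls through sequentially.
        return self.entries.get(pc)

    def record_taken(self, pc, target):
        if pc not in self.entries and len(self.entries) >= self.capacity:
            # Python dicts preserve insertion order, so this evicts the oldest.
            self.entries.pop(next(iter(self.entries)))
        self.entries[pc] = target

btb = TinyBTB()
assert btb.lookup(0x1000) is None       # miss: no target known yet
btb.record_taken(0x1000, 0x2000)
assert btb.lookup(0x1000) == 0x2000     # hit: redirect fetch immediately
```

Because the BTB supplies the target at fetch time, a hit lets the front end redirect immediately instead of waiting for the target to be computed.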

To improve branch-prediction accuracy, the BTB is subdivided into NANO, MICRO, and main levels. The NANO and MICRO levels are very small, only 16 or 32 entries (server-class parts have 96), while the main BTB can hold up to 8K entries. In applications such as databases and ERP, jumps span large regions of code and there are many branches, which is what the large main level serves.

