In-depth understanding of automotive system-on-chip (SoC) series: CPU microarchitecture

Publisher: VelvetWhisper | Last updated: 2022-03-14

1. Micro-op and macro-op


Intel Sunny Cove core front end

 

Image source: Intel

 

The picture above shows the front end of Intel's Sunny Cove core. Because CISC instructions are variable in length and relatively complex, Intel decodes them into simpler operations of fixed length. These microinstructions first appeared in the IA architecture with the Pentium Pro. The processor accepts x86 instructions (CISC, complex instruction set), but the execution engine does not execute x86 instructions directly; it executes RISC-like operations one by one. Intel calls them Micro Operations, i.e. micro-ops or μ-ops, usually written more conveniently without the Greek letter as u-ops or uops. In contrast, an operation that combines multiple instructions is called a Macro Operation, or macro-op.
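
As a concrete illustration (a typical textbook example, not a claim about any particular core): a memory-destination x86 instruction such as add [rdi], eax is generally split into three micro-ops, a load, an ALU add, and a store, while macro-fusion works in the other direction, combining an adjacent cmp + jne pair into a single fused compare-and-branch operation.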

 

Intel gradually refined the microinstruction scheme and later added a microinstruction cache, the uOP cache, which in some places is also called the L0 cache. Structurally it sits alongside the level-1 instruction cache as a second instruction cache unit, but its uniqueness is that it stores instructions after decoding (uOPs). The early micro-op cache was organized as 32 sets of 8 cache lines, each line holding at most 6 uOPs, for a total of 1,536 uOPs; Sunny Cove has 48 sets, or 2,304 uOPs. The cache is shared between two threads and keeps pointers into the microcode sequencer ROM. It is virtually addressed and is a strict subset of the L1 instruction cache. Each cache line also records the number of uOPs it contains and their lengths.
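
A quick sanity check of those capacity figures (a toy calculation in C, with the set, way, and line sizes taken directly from the numbers above):

    #include <stdio.h>

    int main(void) {
        int ways = 8, uops_per_line = 6;                /* 8-way, 6 uOPs per line */
        printf("%d uOPs\n", 32 * ways * uops_per_line); /* 32 sets -> 1536 uOPs   */
        printf("%d uOPs\n", 48 * ways * uops_per_line); /* 48 sets -> 2304 uOPs   */
        return 0;
    }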

 

The core always processes the instruction stream in contiguous 32-byte chunks, and the uOP cache is likewise organized around 32-byte windows, so it stores or evicts an entire window at a time under an LRU (Least Recently Used) policy. Intel calls the traditional pipeline the "legacy decode pipeline". On the first iteration, all instructions go through the legacy decode pipeline. Once an entire instruction-stream window has been decoded and sent to the allocation queue, a copy of the window is sent to the uOP cache. This happens in parallel with all other operations, so the feature adds no extra pipeline stages. On subsequent iterations, the cached, already-decoded instruction stream can be sent directly to the allocation queue, skipping the instruction fetch, pre-decode, and decode stages, which saves power and increases throughput. The resulting pipeline is also shorter, with lower latency.
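
A minimal sketch of LRU victim selection among the ways of one set (purely illustrative; the actual replacement logic is not documented at this level of detail):

    #define WAYS 8

    /* Return the way whose last use is oldest; last_used[] holds
       a per-way timestamp updated on every access. */
    static int lru_victim(const unsigned last_used[WAYS]) {
        int victim = 0;
        for (int w = 1; w < WAYS; w++)
            if (last_used[w] < last_used[victim])
                victim = w;
        return victim;
    }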

 

The uOP cache has a hit rate above 80%. During instruction fetch, the branch predictor reads the uOP cache tags. On a hit, the cache can deliver up to 4 uOPs per cycle (possibly including fused macro-ops) to the Instruction Decode Queue (IDQ), bypassing all the pre-decoding and decoding that would otherwise be done.

 

ARM Cortex-A77 microarchitecture

 

Image source: ARM

 

The picture above is the architecture diagram of the ARM Cortex-A77. Since the A77, ARM has also included a decoded-instruction cache, highly similar to Intel's uOP cache, but ARM uses it in the opposite direction. Intel splits CISC instructions into micro-ops, while ARM fuses: fixed-length, fixed-format RISC instructions are fused into macro operations (MOPs), which are later split back into uOPs. This is likely because RISC instructions are fine-grained, and fusing some of them is more efficient. In addition, the decoder usually has higher throughput than the back end, and a sufficiently large MOP cache keeps the back end better saturated and utilized.

 

The same is true for the A78, which decodes 4 instructions and dispatches 6 MOPs per cycle; the X1 differs, with 5 instructions and 8 MOPs per cycle.

 

A78, decode to dispatch

 

Image source: Internet


2. Issue and Execution

 

ARM's and AMD's execution engines generally separate integer units (ALUs) from floating-point units (FPUs), while Intel combines the two on shared ports. The key parameter on the execution side is the issue width. Currently the widest is ARM's Neoverse V1, which can issue up to 15 micro-ops per cycle. The wider the issue width, the more ALUs and FPUs can be kept fed. Four ALUs are typical, and four integer units are generally enough. The V1 has 4 ALUs and 4 NEON floating-point pipes, two of which also support SVE, along with 3 load and 2 store pipes and two branch units.

 

Image source: Internet

 

The essence of every computer program is the execution of instructions, which at bottom is data movement between processor registers and memory. This movement has two directions: reading memory, the load operation, which copies data from memory into a register; and writing memory, the store operation, which copies data from a register into memory. The LSU is the Load/Store Unit.
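
At the C level the two directions look like this (a trivial sketch; in reality the compiler and register allocator decide what actually stays in a register):

    int mem_word;              /* a location in memory             */

    void bump(void) {
        int reg = mem_word;    /* load:  memory -> register        */
        reg += 1;              /* ALU work happens on the register */
        mem_word = reg;        /* store: register -> memory        */
    }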

 

The LSU is an execution unit of the instruction pipeline: it receives the load/store operations issued by the core and connects to the memory subsystem, namely the data cache. Its main function is to send the CPU's memory requests down to the memory subsystem and to handle the response data and messages coming back from it. Many microarchitectures introduce an AGU (Address Generation Unit) to compute effective addresses and so speed up the execution of memory instructions, and the data-side work of the LSU reuses part of the pipeline of the ALU (usually a scalar arithmetic unit). From a logical point of view, the work done by the AGU and the ALU here still belongs to the LSU.
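
The address arithmetic an AGU performs is straightforward; for an x86-style addressing mode it is essentially the following (a sketch of the standard base + index * scale + displacement formula, not any vendor's implementation):

    #include <stdio.h>
    #include <stdint.h>

    static uint64_t effective_address(uint64_t base, uint64_t index,
                                      unsigned scale, int64_t disp) {
        return base + index * scale + (uint64_t)disp;
    }

    int main(void) {
        /* e.g. mov eax, [rbx + rsi*8 - 16] with rbx=0x1000, rsi=2 */
        printf("0x%llx\n",
               (unsigned long long)effective_address(0x1000, 2, 8, -16));
        return 0;   /* prints 0x1000 */
    }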


3. Fixed point and floating point

 

The so-called fixed-point format fixes, by convention, the position of the radix point for all data in the machine. Fixed-point data is usually represented as a pure fraction or a pure integer. To represent a number as a pure fraction, the point is fixed in front of the most significant bit of the value field; to represent a pure integer, the point is fixed after the least significant bit.

 

For example, to compute 3.75 in binary you can add 011 and 0.11, but the addition must first be aligned to a common bit width so that the units agree: extend 011 to 011.00 and 0.11 to 000.11, and the sum is 011.11, which is 3.75 in decimal. Inside the computer there is no radix point; the value is simply stored as 01111. In a fixed-point format, the position of the point is fixed and cannot change.

Since the bit width must be aligned during calculation, the range of data the computer can handle is limited; values cannot be too large or too small. Suppose one byte represents a fixed-point number with the point fixed at the 5.3 position: the upper 5 bits are the integer part and the lower 3 bits the fraction. Then 11001_001 means 11001.001: the integer part 11001 = 25, and the fraction part 001 = 1 over a denominator of 1000 (binary) = 8, i.e. 1/8 (binary has only 0 and 1). The value is therefore 25 + 1/8 = 25.125. The fraction can take the eight levels 0/8, 1/8, ..., 7/8, so the precision is 1/8. A worked sketch follows below.

Fixed-point numbers thus face a trade-off between numerical range and precision. If a variable must cover a large range, precision has to be sacrificed; if precision is raised, the representable range shrinks accordingly. In practical fixed-point algorithms, achieving the best performance requires weighing dynamic range against precision.
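
A small sketch of the 5.3 layout described above (the function name is my own):

    #include <stdio.h>
    #include <stdint.h>

    /* Q5.3: upper 5 bits are the integer part, lower 3 bits the fraction. */
    static double q53_to_double(uint8_t q) {
        return (q >> 3) + (q & 0x7) / 8.0;
    }

    int main(void) {
        uint8_t q = 0xC9;                    /* bit pattern 11001_001 */
        printf("%.3f\n", q53_to_double(q));  /* 25.125 = 25 + 1/8     */
        return 0;
    }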

 

This is fine for logical operations or simple integer arithmetic, but to represent a much wider range of values the point must be able to move freely, which is floating point.

 

Image source: Internet

 

For example, 123.45 can be written in decimal scientific notation as 1.2345 × 10², where 1.2345 is the mantissa, 10 is the base, and 2 is the exponent. Floating-point numbers use the exponent to make the point float, so a much wider range of real numbers can be expressed flexibly. IEEE 754 specifies two basic floating-point formats: single precision and double precision. The single-precision format has 24 bits of significand (mantissa) precision and occupies 32 bits in total; the double-precision format has 53 bits of significand precision and occupies 64 bits in total.
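
The single-precision layout can be inspected directly (a minimal C sketch assuming the standard 1 + 8 + 23 bit split; the 24th significand bit is the implicit leading 1):

    #include <stdio.h>
    #include <stdint.h>
    #include <string.h>

    int main(void) {
        float f = 123.45f;
        uint32_t bits;
        memcpy(&bits, &f, sizeof bits);   /* reinterpret the bits, no conversion */
        printf("sign=%u exponent=%u fraction=0x%06X\n",
               bits >> 31, (bits >> 23) & 0xFF, bits & 0x7FFFFF);
        return 0;
    }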

 

IEEE 754 also defines two extended floating-point formats: single-extended and double-extended. The standard does not fix their exact precision and size, but it does specify minimums: for example, the IEEE double-extended format must provide at least 64 bits of significand precision and occupy at least 79 bits in total.

 

Floating-point operations differ from fixed-point (integer) operations and usually involve six steps:

1. Exception detection: mainly detecting NaN (not a number).
2. Exponent alignment and mantissa right shift: mantissas with different exponents cannot be added or subtracted directly. For example, the mantissas in 1.1 × 2¹ + 1.1 × 2² cannot be combined as-is; shifting one mantissa right makes the exponents equal.
3. Mantissa addition or subtraction: the aligned mantissas are combined according to the fixed-point add/subtract rules.
4. Normalization: a non-normalized result is converted back into normalized form.
5. Rounding: during alignment and right shifting, low-order mantissa bits may be lost, introducing error and reducing accuracy. Rounding recovers some of that accuracy; the IEEE 754 standard lists four selectable rounding modes: round up, round down, round to nearest, and truncate.
6. Overflow check: unlike fixed-point operations, overflow of a floating-point result is determined by the exponent of the result. If the exponent exceeds the largest value it can represent, the result overflows: a positive overflow if the number is positive, a negative overflow if it is negative. If the exponent falls below the smallest value it can represent, the result underflows: a positive underflow if the number is positive, a negative underflow if it is negative. Both positive and negative underflow are treated as machine zero, i.e. every mantissa bit is forced to zero.
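
Steps 2 and 5 can both be observed from ordinary C (a small demo assuming IEEE 754 doubles and the C99 <fenv.h> rounding controls; volatile only stops the compiler from folding the sums at build time):

    #include <stdio.h>
    #include <fenv.h>

    int main(void) {
        volatile double big = 1e16, small = 1.0;

        /* Aligning 1.0 to 1e16's exponent shifts it below the last mantissa
           bit, so round-to-nearest(-even) discards it entirely. */
        fesetround(FE_TONEAREST);
        printf("%.1f\n", big + small);   /* 10000000000000000.0 */

        /* Rounding upward pushes the result to the next representable
           double instead, which is 1e16 + 2. */
        fesetround(FE_UPWARD);
        printf("%.1f\n", big + small);   /* 10000000000000002.0 */
        return 0;
    }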
