1. Micro-op and macro-op
Intel Sunny Cove Core Front End
Image source: Intel
The picture above shows the front end of Intel's Sunny Cove core. Because x86 (CISC) instructions are variable in length and relatively complex, Intel cuts them into fixed-length pieces. These microinstructions first appeared in the IA architecture with the Pentium Pro. The processor accepts x86 instructions (CISC, complex instruction set), but the execution engine does not execute x86 instructions directly; instead it executes RISC-like instructions one by one. Intel calls them Micro Operations, i.e. micro-ops or μ-ops, and a more convenient spelling is generally substituted for the Greek letter: u-op or uop. In contrast, an operation that fuses multiple instructions together is called a Macro Operation, or macro-op.
Intel gradually improved its microinstruction handling and later added a microinstruction cache, the uOP cache, sometimes also called the L0 cache. On the surface, the uOP cache is positioned as a second-level instruction cache unit that is a subset of the first-level instruction cache; its uniqueness lies in storing instructions after decoding (uOPs). The early micro-op cache consisted of 32 sets of 8 cache lines, with each line holding at most 6 uOPs, for 1536 uOPs in total; Sunny Cove has 48 sets, i.e. 2304 uOPs. This cache is shared between the two threads and keeps pointers into the microcode sequencer ROM. It is virtually addressed and is a strict subset of the L1 instruction cache. Each cache line also records the number of uOPs it contains and their lengths.
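The capacity figures above are pure geometry (sets × ways × uOPs per line); a quick sketch to check the arithmetic:

```python
def uop_cache_capacity(sets: int, ways: int, uops_per_line: int) -> int:
    """Total uOPs a micro-op cache can hold, given its geometry."""
    return sets * ways * uops_per_line

print(uop_cache_capacity(32, 8, 6))   # early designs: 1536 uOPs
print(uop_cache_capacity(48, 8, 6))   # Sunny Cove: 2304 uOPs
```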
The core always processes the instruction stream in contiguous 32-byte chunks, and the uOP cache is likewise organized around 32-byte windows; it stores or evicts an entire window according to an LRU (Least Recently Used) policy. Intel calls the traditional pipeline the "legacy decode pipeline". On the first iteration, all instructions go through the legacy decode pipeline. Once an entire instruction-stream window has been decoded and sent to the allocation queue, a copy of the window is written into the uOP cache. This happens in parallel with everything else, so the feature adds no extra pipeline stages. On subsequent iterations, the cached, already-decoded instruction stream can be sent directly to the allocation queue, skipping the instruction fetch, pre-decode, and decode stages, which saves power and increases throughput. The pipeline is also effectively shorter, reducing latency.
The uOP cache achieves a hit rate above 80%. During instruction fetch, the branch predictor reads the uOP cache tags. On a hit, the cache can send up to 4 uOPs (possibly including fused macro-ops) per cycle to the Instruction Decode Queue (IDQ), bypassing all the pre-decoding and decoding that would otherwise be needed.
ARM Cortex-A77 microarchitecture
Image source: ARM
The picture above is the architecture diagram of the ARM Cortex-A77. Starting with the A77, ARM also added a decoded-instruction cache, highly similar to Intel's uOP cache, but ARM uses it in the opposite direction. Intel cuts CISC instructions into microinstructions, whereas ARM fuses: fixed-length, fixed-format RISC instructions are fused into macro operations (MOPs, Macro Operations) and then split into uOPs. This may be because RISC instructions are too fine-grained, and fusing some of them is more efficient. In addition, the decoder usually has higher throughput than the back end, and a sufficiently large MOP cache keeps the back end better saturated and utilized.
The A78 is the same, with 4 instructions and 6 MOPs per cycle; the X1 differs, with 5 instructions and 8 MOPs per cycle.
A78 decode to dispatch
Image source: Internet
2. Issue and Execution
ARM's and AMD's execution units generally separate integer units (ALUs) from floating-point units (FPUs), while Intel combines the two. The key parameter on the execution side is the issue width. Currently the widest is ARM's V1, at up to 15. The wider the issue width, the more ALUs and FPUs can be fed. There are usually 4 ALUs, and 4 integer arithmetic units are generally enough. The V1 has 4 ALUs and 4 NEON pipes for floating point, two of which also support SVE, plus 3 load and 2 store ports. There are also two branch units for those operation types.
Image source: Internet
The essence of all computer programs is instruction execution, which is fundamentally data interaction between processor registers and memory. This interaction consists of two steps: reading memory (the load operation, which reads from memory into a register) and writing memory (the store operation, which writes from a register to memory). The LSU is the Load Store Unit.
The LSU is an execution unit in the instruction pipeline. It receives load/store operations issued by the core and connects to the memory subsystem, i.e. the data cache. Its main function is to send memory requests from the CPU to the memory subsystem and to process the response data and messages coming back from it. Many microarchitectures introduce an AGU (Address Generation Unit) to compute effective addresses and speed up the execution of memory instructions, and data inside the LSU is processed using part of the ALU (the scalar computation unit) pipeline. Logically, however, the work done by the AGU and this part of the ALU still belongs to the LSU.
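The AGU's job can be sketched with a minimal example. This is an illustration using the x86-style addressing form base + index × scale + displacement, not the internals of any specific core:

```python
def effective_address(base: int, index: int, scale: int, disp: int) -> int:
    """x86-style effective address: base + index*scale + disp.

    The scale factor is restricted to 1, 2, 4, or 8, matching the
    encodings an x86 AGU actually supports.
    """
    assert scale in (1, 2, 4, 8)
    return base + index * scale + disp

# e.g. loading element 3 of an array of 8-byte values at 0x1000,
# with a 0x10-byte structure offset:
print(hex(effective_address(0x1000, 3, 8, 0x10)))  # → 0x1028
```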
3. Fixed point and floating point
The so-called fixed-point format fixes the position of the binary point for all data in the machine. Fixed-point data is usually represented as a pure fraction or a pure integer. To represent a number as a pure fraction, the point is fixed before the most significant digit of the numeric part; to represent it as a pure integer, the point is fixed after the least significant digit.
For example, to get 3.75 in decimal you can compute 011 + 0.11 in binary, but the addition must first be aligned to a common bit width so that the place values match: extend 011 to 011.00 and 0.11 to 000.11. Adding them gives 011.11, which is 3.75 in decimal. When stored in a computer, there is no binary point; the value is simply stored as 01111. In a fixed-point format the position of the point is fixed and cannot change, and since the bit widths must be aligned during calculation, the range of data the computer can handle is limited; it cannot be too large or too small.

Assume one byte is used in a 5.3 format: the point is fixed so that the upper 5 bits represent the integer part and the lower 3 bits the fraction. Then 11001_001 means 11001.001: the integer part 11001 = 25, and the fraction 001 = 1 (as a numerator) over a denominator of 1000 (8), so the fractional part is 1/8 (binary has only 0 and 1). The final value is 25 + 1/8. The fraction can take the values 0/8, 1/8, 2/8, ..., 7/8, eight levels in all, i.e. a precision of 1/8. Fixed-point numbers therefore face a trade-off between numeric range and precision: the two are in tension. If a variable must represent a large range, precision is sacrificed; improving precision shrinks the representable range. In practical fixed-point algorithms, dynamic range and precision must be weighed against each other to achieve the best performance.
This is fine for logical operations or simple integer arithmetic, but to represent values with a much larger range, the point must be able to move freely, which is what floating point provides.
Image source: Internet
For example, 123.45 can be expressed in decimal scientific notation as 1.2345 × 10², where 1.2345 is the mantissa, 10 is the base, and 2 is the exponent. Floating-point numbers use the exponent to let the point float, so a much wider range of real numbers can be expressed flexibly. IEEE 754 specifies two basic floating-point formats: single precision and double precision. The single-precision format has 24 bits of significand (mantissa) precision and occupies 32 bits in total; the double-precision format has 53 bits of significand precision and occupies 64 bits in total.
There are also two extended floating-point formats: single-extended and double-extended. The standard does not fix their exact precision and size, but it does specify minimums; for example, the IEEE double-extended format must have at least 64 bits of significand precision and occupy at least 79 bits in total.
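The single-precision layout described above (1 sign bit, 8 exponent bits with bias 127, 23 stored mantissa bits plus an implicit leading 1) can be inspected directly. A minimal sketch using only the standard library:

```python
import struct

def decompose_float32(x: float):
    """Split an IEEE 754 single-precision value into its
    sign, biased exponent, and stored mantissa fields."""
    bits = struct.unpack('>I', struct.pack('>f', x))[0]
    sign = bits >> 31
    exponent = (bits >> 23) & 0xFF   # 8-bit biased exponent (bias 127)
    mantissa = bits & 0x7FFFFF       # 23 stored bits; the leading 1 is implicit
    return sign, exponent, mantissa

# 123.45 ≈ 1.929... × 2^6, so the biased exponent is 6 + 127 = 133
sign, exp, man = decompose_float32(123.45)
print(sign, exp - 127, bin(man))
```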
Floating-point operations differ from fixed-point (integer) operations and usually take 6 steps:
1. Exception detection: mainly detecting NaN (Not a Number).
2. Alignment by right-shifting the mantissa: mantissas with different exponents cannot be added or subtracted directly. For example, in 1.1 × 2¹ + 1.1 × 2² the mantissas cannot be combined as-is; right-shifting the mantissa of the smaller operand makes the exponents match.
3. Mantissa addition/subtraction: the aligned mantissas are combined according to fixed-point add/subtract rules.
4. Normalization: a non-normalized result is converted back into normalized form.
5. Rounding: during alignment and right-shifting, low-order mantissa bits may be lost, introducing error and affecting accuracy. Rounding improves the accuracy of the mantissa; the IEEE 754 standard lists four selectable rounding modes: round up, round down, round to nearest, and truncate.
6. Overflow check: unlike fixed-point operations, floating-point overflow is determined by the exponent of the result. If the exponent exceeds the largest value it can represent, the result overflows: a positive overflow if the number is positive, a negative overflow if it is negative. If the exponent falls below the smallest representable value, the result underflows: a positive underflow for a positive number, a negative underflow for a negative one. Both kinds of underflow are treated as machine zero, i.e. every mantissa bit is forced to zero.
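Steps 2–4 can be sketched with a toy model where a number is an integer mantissa times a power of two (no sign, no rounding, no IEEE encoding; the truncating right shift in the alignment step is exactly the precision loss step 5 warns about):

```python
def toy_fp_add(m1: int, e1: int, m2: int, e2: int):
    """Add two toy floating-point numbers (value = m * 2**e),
    following the textbook steps: align, add mantissas, normalize."""
    # Step 2: alignment — right-shift the mantissa with the smaller
    # exponent (truncation here models the loss of low-order bits)
    if e1 < e2:
        m1 >>= (e2 - e1)
        e = e2
    else:
        m2 >>= (e1 - e2)
        e = e1
    # Step 3: fixed-point addition of the aligned mantissas
    m = m1 + m2
    # Step 4: normalization — in this toy rule, strip trailing zero bits
    while m and m % 2 == 0:
        m //= 2
        e += 1
    return m, e

# 2*2^1 + 3*2^2 = 4 + 12 = 16, i.e. mantissa 1, exponent 4
print(toy_fp_add(2, 1, 3, 2))  # → (1, 4)
```

With operands that do not align exactly (e.g. an odd mantissa shifted right), the truncation visibly changes the result, which is why real hardware keeps guard bits and rounds.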