Now let's return to the autonomous driving chip. To cope with the increasing resolution of LiDAR, CPU computing power must be increased. A CPU's die area is generally much smaller than a GPU's, which makes the CPU the more cost-effective place to add performance. Given that the chip design cycle and the application's mass-production cycle are 2-3 years, the performance should be as high as possible from the outset. At the same time, to remain sufficiently open, the chip must use the ARM architecture.
Image source: Internet
ARM launched the Neoverse platform for the server and HPC market in March 2019. According to the roadmap, Ares (named after the Greek god of war) was to launch in 2020, followed by Zeus (named after the king of the gods in Greek mythology), and then Poseidon in 2022. V1 is ARM's most powerful CPU architecture.
Image source: Internet
Consider TSMC's tight capacity and high prices: even a well-known customer like Tesla was turned away because its volume was too low. TSMC's resources go to its large customers, leaving little capacity for small ones. From a cost-effectiveness and supply-chain perspective, Samsung is the only choice. Samsung has sufficient capacity and its prices are much lower than TSMC's. The drawback is that its 5nm process is not yet mature enough, so the only option is V1 on 7nm.
Why is V1 the strongest?
The most critical parameter affecting CPU computing power is decode width, which can be loosely equated with instructions per cycle (IPC), that is, how many instructions are completed per cycle.
Image source: Internet
Increasing the decode width is very difficult and cannot be done at will. Put simply, each additional decode slot raises system complexity by about 15% and die area, and therefore cost, by about 15-20%. If the decode width were simply increased, cost would rise with it and manufacturers would have little incentive to upgrade. ARM's approach is therefore to pair each widening with the advanced processes of TSMC and Samsung, using the gain in transistor density to shrink the die and bring the cost back down. Every increase in ARM's decode width thus depends on a more advanced manufacturing process; without one, the cost would rise considerably. ARM also has a commercial motive: a small upgrade every year leaves room for improvement the following year. 8-wide is the current limit. Apple went to 8-wide in a single step, at the price of having to use TSMC's most advanced process.
RISC and CISC also differ. Widening a CISC decoder is harder, but one decode slot of CISC is roughly worth 1.2-1.5 slots of RISC. Intel has the design strength to beat Apple, but its manufacturing process is not as good as TSMC's. CISC instructions vary in length, whereas RISC instructions are fixed-length. Because the length is fixed, a block of code can be split into 8 instructions and fed to 8 decoders in parallel; CISC cannot do this directly, because the decoder does not know in advance where each instruction ends. As a result, the branch predictor of CISC is much more complicated than that of RISC. Of course, RISC also has some variable-length instructions. For certain long operations, CISC can finish in a single instruction, whereas RISC, with its fixed length, is like a bus that must stop at every station and is certainly not as fast. In other words, RISC has to be optimized together with the instruction set and the operating system. RISC is software-centric, with hardware built for specific software; CISC is the opposite, hardware-centric and designed to run all kinds of software.
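The point about fixed versus variable instruction length can be illustrated with a toy sketch. Both encodings here are hypothetical and are not real ARM or x86 formats: the fixed case assumes 4-byte instructions, and the variable case assumes the first byte of each instruction stores its total length. The sketch only shows why instruction boundaries are known up front in one case and must be discovered one after another in the other.

```python
# Toy illustration only: both encodings are hypothetical, not real ARM or x86 formats.

def fixed_length_starts(num_instructions, instr_bytes=4):
    """With a fixed instruction size, every start offset follows from
    arithmetic alone, so each decoder can be handed its instruction
    independently of the others."""
    return [i * instr_bytes for i in range(num_instructions)]


def variable_length_starts(code):
    """Toy variable-length encoding: the first byte of each instruction
    holds that instruction's total length in bytes. The start of
    instruction k is unknown until instructions 0..k-1 have been sized."""
    starts = []
    offset = 0
    while offset < len(code):
        starts.append(offset)
        length = code[offset]
        if length == 0:
            raise ValueError("toy encoding requires a non-zero length byte")
        offset += length
    return starts


# Eight fixed 4-byte instructions: all offsets are available immediately.
print(fixed_length_starts(8))

# Four variable-length instructions of 2, 5, 1 and 3 bytes.
toy_stream = bytes([2, 0xAA, 5, 0xBB, 0xBB, 0xBB, 0xBB, 1, 3, 0xCC, 0xCC])
print(variable_length_starts(toy_stream))
```

In the fixed-length case the list of offsets could be computed by 8 decoders at once; in the variable-length case the loop is inherently sequential unless extra hardware works out the boundaries first.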
The decode width of V1 is variable, up to 8-wide
Image source: Internet
ARM Neoverse V1 microarchitecture
Image source: Internet
Besides decode width, the back end's dispatch and issue widths and its ALUs must keep up; otherwise the front end is busy while the back end sits idle, and the extra decode width is wasted. The usual remedy is to widen issue. For dispatch, ARM Cortex-A77 is 10-wide, V1 is 10-wide, Apple M1's big core Firestorm is reportedly 13-wide, and Samsung Mongoose M4/M5 is 12-wide. V1, however, has more issue ports: 15. The back end also has more ALUs, 8 integer and 4 floating-point or SIMD.
Image source: Internet
V1 was developed for HPC, where more than 32 cores are recommended. That is not feasible for an in-vehicle system, because the power consumption would be too high. A 12-core V1 is sufficient; its computing power is roughly equal to, or even higher than, that of a 16-core ARM Cortex-A78AE. Its floating-point performance is also strong.
Image source: Internet
Neoverse V1 also has strong machine learning and floating-point computing performance.
Image source: Internet
In machine learning performance, V1 is 4 times as fast as N1, and in floating-point performance it is 2 times as fast. A typical N1 implementation is Huawei's Kunpeng series of server CPUs, whose core Huawei calls the Taishan V110.
ARM recently introduced the second generation of SVE. The NEON instruction set is the standard Single Instruction Multiple Data (SIMD) implementation for the ARM64 architecture. SVE (Scalable Vector Extension) is a new vector instruction set developed for high-performance computing (HPC) and machine learning. It is the next generation of SIMD instruction set, not a simple extension of NEON.
Many SVE concepts resemble those of NEON, such as vectors, lanes, and data elements. SVE also introduces a new concept: the vector-length-agnostic (VLA) programming model.
Traditional SIMD instruction sets use fixed-size vector registers. NEON, for example, uses fixed 128-bit vector registers. SVE, with its VLA programming model, supports vector registers of variable length, which lets chip designers choose a vector length that suits their workload and cost targets. SVE vector registers range from a minimum of 128 bits to a maximum of 2048 bits, in increments of 128 bits. The SVE design ensures that the same application can run on SVE machines with different vector lengths without being recompiled, which is the essence of the VLA programming model.
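The VLA idea can be sketched in plain Python. This is a simulation under stated assumptions, not SVE code: `vla_add` and its `vector_bits` parameter are hypothetical names standing in for a loop that learns the hardware vector length at run time, and the per-lane `active` list loosely models the role that predication plays in masking off the tail of an array.

```python
def vla_add(a, b, vector_bits, element_bits=32):
    """Add two equal-length lists in vector-sized chunks.

    vector_bits stands in for the hardware vector length. The loop body
    never hard-codes it, which is the point of the VLA model.
    """
    if vector_bits % 128 != 0 or not 128 <= vector_bits <= 2048:
        raise ValueError("vector length must be a multiple of 128 in [128, 2048]")
    if len(a) != len(b):
        raise ValueError("inputs must have the same length")

    lanes = vector_bits // element_bits  # elements processed per iteration
    out = [0] * len(a)
    i = 0
    while i < len(a):
        # Predicate: a lane is active only if it still maps to a valid element.
        active = [i + lane < len(a) for lane in range(lanes)]
        for lane, on in enumerate(active):
            if on:
                out[i + lane] = a[i + lane] + b[i + lane]
        i += lanes
    return out


a = list(range(10))
b = list(range(10, 20))
# The same function gives the same answer for every simulated vector length.
for bits in (128, 512, 2048):
    print(bits, vla_add(a, b, bits))
```

Only the number of loop iterations changes with the simulated vector length; the source of the loop stays the same, which mirrors the "no recompilation" property described above.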
The SVE instruction set is a new instruction set built on top of the A64 instruction set. SVE2, released with the ARMv9 architecture, is a superset and extension of SVE. SVE contains hundreds of instructions, which fall into the following categories: load, store, and prefetch instructions; vector move instructions; integer operation instructions; bit operation instructions; floating-point operation instructions; predicate operation instructions; and data element operation instructions.
Put simply, at 8-bit precision an ideal 2048-bit SVE vector holds 256 lanes, which is loosely equivalent to 256 cores computing in parallel; at 16-bit precision it is 128. If the processor has 12 cores, deep learning workloads can be approximated as 256*12=3072 cores running at once, which is close to a GPU. In practice, compilers struggle to auto-vectorize such wide operations, and writing the assembly by hand is difficult for developers. Vectors this wide also need enough cache and registers to feed them, and the cost would skyrocket. However, 256 or 512 bits is not a big problem, as Intel's AVX512 shows. Inference generally uses 8-bit integer precision, and 512 bits is then roughly 64 cores.
Let's do a simple calculation. Assuming 512-bit vectors and 12 cores, there are 64*12=768 lanes (treated here as cores) running at 2GHz. Under ideal conditions, the computing power is 768*2=1536 GOPS.
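The lane arithmetic can be reproduced with a short script. This is a back-of-the-envelope sketch: the function names are made up for illustration, and the model assumes one operation per lane per cycle on every core, which is an idealization rather than a measured figure.

```python
def simd_lanes(vector_bits, element_bits):
    """Number of elements that fit in one vector register."""
    return vector_bits // element_bits


def peak_gops(vector_bits, element_bits, cores, clock_ghz):
    """Idealized peak: one operation per lane per cycle on every core."""
    return simd_lanes(vector_bits, element_bits) * cores * clock_ghz


print(simd_lanes(2048, 8))        # lanes at 8-bit precision
print(simd_lanes(2048, 16))       # lanes at 16-bit precision
print(simd_lanes(2048, 8) * 12)   # lanes across 12 cores
print(peak_gops(512, 8, 12, 2))   # 512-bit, INT8, 12 cores, 2 GHz
```

With these inputs the script prints 256, 128, 3072 and 1536, matching the lane counts in the text and giving 1536 GOPS for the 512-bit, 12-core, 2 GHz case.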
After choosing the CPU, the next step is to choose the GPU. The GPU is in effect a floating-point unit for data-parallel work, which here means computer vision. Extensive use of LiDAR reduces the reliance on vision, and computer vision is what consumes the most floating-point computing power. The widespread adoption of LiDAR can be taken as a certainty going forward. In addition, the GPU is the most expensive block and it is difficult to outperform NVIDIA, so there is no need to aim too high here.
The choice is ARM's MALI G710, which delivers 1174 GFLOPS at 650 MHz with 16 cores.
G710 may be ARM's most successful GPU architecture. ARM's architecture differs considerably from Nvidia's desktop GPUs. ARM uses a large-core design, usually written as MALI G710 MPX or MCX, where X is the number of cores. The heart of MALI is the rendering core, or Shader Core, which is roughly comparable to Nvidia's SM (streaming multiprocessor). The rendering core contains execution engines, which can be thought of as the counterpart of the ALU in a CPU.
MALI G710 rendering core
Image source: Internet
ARM's GPUs were originally designed around SIMD, but have recently moved to SIMT, which is the common approach in GPUs. G710 doubles the execution engines of G77: each core has two execution engines, each containing two clusters that execute 16-wide threads, which is equivalent to 64 ALUs per core. G710 supports designs with 7-16 cores, which means a maximum of 1024 ALUs.
MALI G710 Execution Engine
Image source: Internet
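The ALU count quoted for the G710 follows from multiplying out the execution-engine layout. The sketch below uses only the figures stated in the text (2 engines per core, 2 clusters per engine, 16 lanes per cluster, 7-16 cores); the constant and function names are illustrative.

```python
ENGINES_PER_CORE = 2      # execution engines per shader core (from the text)
CLUSTERS_PER_ENGINE = 2   # clusters per execution engine (from the text)
LANES_PER_CLUSTER = 16    # each cluster executes 16-wide threads (from the text)


def alus_per_core():
    """ALU lanes in a single shader core."""
    return ENGINES_PER_CORE * CLUSTERS_PER_ENGINE * LANES_PER_CLUSTER


def total_alus(cores):
    """ALU lanes across a whole G710 configuration of 7 to 16 cores."""
    if not 7 <= cores <= 16:
        raise ValueError("the text gives a supported range of 7-16 cores")
    return alus_per_core() * cores


print(alus_per_core())   # per-core ALU count
print(total_alus(7))     # smallest configuration
print(total_alus(16))    # largest configuration
```

This gives 64 ALUs per core, 448 for a 7-core part and 1024 for the 16-core maximum.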
ARM has not disclosed front-end details for the G710 execution core. It is presumably the same as G77, with 64 warps, or 1024 threads. Each processing unit has three kinds of ALU: the FMA (fused multiply-add) and CVT (convert) units are 16-wide, and the SFU (special function unit) is 4-wide. Each FMA unit can perform 16 operations per cycle at FP32 precision; at FP16 that becomes 32, and at 8-bit integer (INT8) it becomes 64. On desktop GPUs such as NVIDIA's, FP16 and FP32 are handled by separate units and can therefore be computed at the same time, but a mobile-class MALI does not need such a design. The convert unit handles basic integer operations and type conversions, and also acts as a branch port.
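The per-precision throughput figures scale inversely with element width, which a few lines of code make explicit. This sketch assumes, as the text implies, that a 16-wide FP32 FMA unit packs two FP16 or four INT8 values into each 32-bit lane; the names are illustrative, and the thread count assumes 16 threads per warp.

```python
FMA_LANES_FP32 = 16     # FP32 operations per cycle per FMA unit (from the text)
WARPS = 64              # warps assumed for the execution core (from the text)
THREADS_PER_WARP = 16   # assumed to match the 16-wide execution


def fma_ops_per_cycle(precision_bits):
    """Operations per cycle for one FMA unit at the given precision."""
    if precision_bits not in (8, 16, 32):
        raise ValueError("only 8-, 16- and 32-bit precisions are modelled")
    return FMA_LANES_FP32 * 32 // precision_bits


for bits in (32, 16, 8):
    print(bits, fma_ops_per_cycle(bits))

print(WARPS * THREADS_PER_WARP)  # resident threads
```

The loop reproduces the 16, 32 and 64 operations per cycle given for FP32, FP16 and INT8, and 64 warps of 16 threads give the 1024 threads mentioned above.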