Designing an autonomous driving system chip to challenge Mobileye and Nvidia - EEWORLD

Now let's return to the autonomous driving chip. To cope with the increasing resolution of LiDAR, CPU computing power must be increased. The die area of a CPU is generally much smaller than that of a GPU, which makes the CPU more cost-effective. Since the chip design cycle and the application mass-production cycle each run 2-3 years, the performance must be as high as possible from the outset. At the same time, to remain sufficiently open, the ARM architecture must be used.

ARM launched the Neoverse platform for the server and HPC market in March 2019. According to the roadmap, Ares (named after the Greek god of war) was planned for 2020, followed by Zeus (named after the king of the gods in Greek mythology), and then Poseidon in 2022. V1 is ARM's most powerful CPU architecture.

TSMC's production capacity is tight and its prices are high. Even a well-known customer like Tesla was turned away because its volume was too low: TSMC's resources go to large customers, leaving no slots for small ones. From the perspective of cost-effectiveness and supply chain, Samsung is the only choice. Samsung has sufficient capacity and its prices are much lower than TSMC's. The drawback is that its 5nm process is not mature enough, so the only option is the 7nm V1.

Why is V1 the strongest?

The most critical parameter affecting CPU computing power is decode width, which can be roughly equated with instructions per cycle (IPC), that is, how many instructions are completed per cycle.

Increasing the decode width is very difficult and cannot be done at will. Roughly speaking, each additional decode slot increases system complexity by about 15% and die area, and therefore cost, by about 15-20%. If the decode width were simply increased, cost would rise with it, and manufacturers would have little incentive to upgrade. ARM's approach is therefore to work with the advanced processes of TSMC and Samsung, using the gain in transistor density to shrink the die and hold costs down. As a result, every increase in ARM's decode width has to be paired with a more advanced manufacturing process; otherwise the cost rises sharply. ARM also has a commercial motive: a small upgrade each year leaves room for improvement the following year. A width of 8 is the current limit. Apple went to 8-wide in one step, at the price of having to use TSMC's most advanced process.

RISC and CISC also differ here. Widening the decoder is harder for CISC, but one unit of CISC decode width can roughly replace 1.2-1.5 units of RISC decode width. Intel has the design strength to beat Apple, but its manufacturing process lags behind TSMC's. CISC instructions vary in length, while RISC instructions are fixed-length. Because the length is fixed, a RISC instruction stream can be split into 8 instructions and fed to 8 decoders in parallel. CISC cannot do this, because the decoder does not know in advance where each instruction ends. For the same reason, the branch predictor of a CISC design is much more complicated than that of a RISC design. RISC does have some variable-length instructions as well. When a long operation is needed, CISC can complete it in a single instruction, whereas fixed-length RISC is like a bus that must stop at every station and is therefore slower in that case. In other words, RISC must be optimized together with the instruction set and the operating system. RISC is software-centric, with hardware built for specific software, while CISC is the opposite: hardware-centric and developed for all types of software.
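The decoding difference can be illustrated with a toy model. The sketch below uses two invented encodings (not real ARM or x86 formats), and the function names are hypothetical. It only shows that fixed-length instruction boundaries can be computed independently, while variable-length boundaries have to be found one after another.

```python
# Toy illustration of instruction-boundary detection.
# The encodings are invented for illustration; they are not real ARM or x86 formats.

FIXED_LEN = 4  # bytes per instruction in the toy fixed-length ISA


def fixed_length_boundaries(code: bytes) -> list[int]:
    """Start offsets for a fixed-length stream.

    Each offset depends only on its index, so several decoders can each
    pick their own slice without waiting for one another.
    """
    return list(range(0, len(code), FIXED_LEN))


def variable_length_boundaries(code: bytes) -> list[int]:
    """Start offsets for a toy variable-length stream.

    Assumption: the first byte of every instruction holds its total length
    in bytes. The start of instruction k is unknown until the lengths of
    instructions 0..k-1 have been read, so this is a serial walk.
    """
    offsets = []
    pos = 0
    while pos < len(code):
        offsets.append(pos)
        length = code[pos]
        if length == 0:
            raise ValueError("invalid zero-length instruction")
        pos += length
    return offsets


fixed_stream = bytes(32)  # eight 4-byte instructions
variable_stream = bytes([1, 3, 0, 0, 2, 0, 5, 0, 0, 0, 0])  # lengths 1, 3, 2, 5

print(fixed_length_boundaries(fixed_stream))
print(variable_length_boundaries(variable_stream))
```

Real variable-length decoders use additional techniques to recover parallelism; the serial walk here is only the simplest possible model.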

The decode width of V1 is variable, up to 8-wide

ARM Neoverse V1 microarchitecture

Beyond decode width, the back-end dispatch and issue widths and the ALUs must keep up. Otherwise the front end is busy while the back end sits idle, and the extra width is wasted. The usual way to improve this is to increase the issue width. In terms of dispatch, ARM Cortex-A77 is 10-wide, V1 is 10-wide, Apple M1's big core Firestorm is said to be 13-wide, and Samsung Mongoose M4/M5 is 12-wide. V1, however, has more issue ports: 15. The back end also has more ALUs, 8 for integer and 4 for floating point or SIMD.

V1 is developed for HPC, and it is recommended to have more than 32 cores. However, this is definitely not possible for vehicle-mounted systems, as the power consumption is too high. A 12-core V1 is sufficient, and its computing power is basically equivalent to that of a 16-core ARM Cortex-A78AE, or even higher. Floating-point operations are also not weak.

Neoverse V1 also has strong machine learning and floating-point computing performance.

In terms of machine learning performance, V1 is 4 times that of N1, and its floating-point computing performance is 2 times that of N1. A typical N1 is Huawei's Kunpeng series server CPU, which Huawei calls the Taishan V110 core.

ARM recently introduced the second generation of SVE. NEON is the standard Single Instruction, Multiple Data (SIMD) instruction set of the ARM64 architecture. SVE (Scalable Vector Extension) is a new vector instruction set developed for high-performance computing (HPC) and machine learning. It is the next generation of SIMD instruction set, not a simple extension of NEON.

Many concepts in the SVE instruction set are similar to those in NEON, such as vectors, lanes, and data elements. SVE also introduces a new concept: the Vector Length Agnostic (VLA) programming model.

Traditional SIMD instruction sets use fixed-size vector registers. For example, the NEON instruction set uses fixed 128-bit vector registers. The SVE instruction set that supports the VLA programming model supports variable-length vector registers. This allows chip designers to choose an appropriate vector length based on load and cost. The length of the vector registers of the SVE instruction set supports a minimum of 128 bits and a maximum of 2048 bits, in increments of 128 bits. The SVE design ensures that the same application can run on SVE instruction machines that support different vector lengths without recompiling the code, which is the essence of the VLA programming model.
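The VLA idea can be modeled in plain Python. The sketch below is not SVE code and uses hypothetical function names. It only mimics the structure of a predicated loop, where the lane count is a run-time property of the hardware and lanes past the end of the data are switched off instead of being handled by a separate tail loop.

```python
# Plain-Python model of a vector-length-agnostic (VLA) loop.
# This is not real SVE code; it only mimics the predicated-loop structure.


def valid_sve_lengths() -> list[int]:
    """Vector lengths described in the text: 128 to 2048 bits in 128-bit steps."""
    return list(range(128, 2048 + 1, 128))


def vla_add(a: list[int], b: list[int], vector_bits: int, element_bits: int = 32) -> list[int]:
    """Element-wise add written once, for any supported vector length.

    Assumes a and b have the same length.
    """
    if vector_bits not in valid_sve_lengths():
        raise ValueError("unsupported vector length")
    lanes = vector_bits // element_bits  # queried at run time on real hardware
    out = [0] * len(a)
    i = 0
    while i < len(a):
        # Predicate: only lanes that still map to valid elements are active.
        active = min(lanes, len(a) - i)
        for lane in range(active):
            out[i + lane] = a[i + lane] + b[i + lane]
        i += lanes
    return out


a = list(range(10))
b = [10] * 10
# The same "program" gives the same result on 128-, 512- and 2048-bit machines.
print(vla_add(a, b, 128) == vla_add(a, b, 512) == vla_add(a, b, 2048))
```

The loop body never mentions a specific vector width, which is what allows one binary to run on implementations with different vector lengths.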

SVE is a new instruction set built on top of the A64 instruction set. SVE2, released with the ARMv9 architecture, is a superset and extension of SVE. SVE contains hundreds of instructions, which fall into the following categories: load, store, and prefetch instructions; vector move instructions; integer operation instructions; bit operation instructions; floating-point operation instructions; predicate operation instructions; and data element operation instructions.

Simply put, at 8-bit precision a 2048-bit SVE vector is, in the ideal case, equivalent to 256 cores (lanes) calculating in parallel; at 16-bit precision it is 128. If the processor has 12 cores, then for deep learning it can be approximated as 256*12=3072 cores, which approaches a GPU. In practice, compilers struggle to auto-vectorize such wide data operations, and writing the assembly by hand is difficult for developers. Such widths also need enough cache and registers to keep the units fed, and the cost rises sharply. However, 256 or 512 bits is not a big problem, as Intel's AVX512 instructions show. Inference generally uses 8-bit integer precision, and 512 bits then corresponds to roughly 64 cores.

Let's do a simple calculation. Assuming a 512-bit vector width and 12 cores, there are 64*12=768 lanes running at 2GHz. Under ideal conditions, the computing power is 768*2=1536 GOPS.
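The lane counts and the ideal throughput can be reproduced with a few lines of Python. This is a back-of-the-envelope sketch with hypothetical function names, and it assumes one operation per lane per cycle on every core, which real workloads will not sustain.

```python
def simd_lanes(vector_bits: int, element_bits: int) -> int:
    """Number of elements processed by one vector instruction."""
    return vector_bits // element_bits


def peak_gops(vector_bits: int, element_bits: int, cores: int, clock_ghz: float) -> float:
    """Ideal throughput in GOPS, assuming one operation per lane per cycle."""
    return simd_lanes(vector_bits, element_bits) * cores * clock_ghz


print(simd_lanes(2048, 8))         # 256 lanes at 8-bit precision
print(simd_lanes(2048, 16))        # 128 lanes at 16-bit precision
print(simd_lanes(2048, 8) * 12)    # 3072 lanes across 12 cores
print(peak_gops(512, 8, 12, 2.0))  # 1536.0 GOPS for 512-bit, 12 cores, 2 GHz
```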

After choosing the CPU, the next step is to choose the GPU. The GPU is essentially a floating-point unit for data-parallel work, which here means computer vision. Extensive use of LiDAR reduces the reliance on vision, and computer vision is what consumes the most floating-point computing power. Extensive use of LiDAR in the future can be taken as a given. In addition, the GPU is the most expensive block and it is difficult to outperform NVIDIA, so there is no need to spend heavily on it.

The choice here is ARM's MALI G710, which can achieve 1174 GFLOPS of computing power at 650MHz with 16 cores.

G710 may be ARM's most successful GPU architecture. ARM's architecture is quite different from Nvidia's desktop GPUs. ARM uses a large-core design, generally written as MALI G710 MPX or MCX, where X is the number of cores. The core of MALI is the shader core, which is comparable to Nvidia's streaming multiprocessor (SM). The shader core contains execution engines, which can be regarded as the counterpart of the ALU in a CPU.

MALI G710 shader core

ARM's GPUs were originally SIMD designs but have recently moved to SIMT, which is the common approach in GPUs. G710 doubles the execution engines of G77: each core has two execution engines, each containing two clusters that execute 16-wide threads, which is equivalent to 64 ALUs per core. G710 supports designs with 7 to 16 cores, which means a maximum of 1024 ALUs.
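The ALU counts follow directly from the structure described above. The short sketch below simply multiplies out those figures; the function names are hypothetical and the defaults are the numbers quoted in the text.

```python
def alus_per_core(engines: int = 2, clusters_per_engine: int = 2, lanes_per_cluster: int = 16) -> int:
    """ALU lanes in one shader core, using the figures quoted in the text."""
    return engines * clusters_per_engine * lanes_per_cluster


def total_alus(cores: int) -> int:
    """ALU lanes for a whole GPU; the text gives a 7-16 core range for G710."""
    if not 7 <= cores <= 16:
        raise ValueError("G710 is described as supporting 7 to 16 cores")
    return cores * alus_per_core()


print(alus_per_core())  # 64
print(total_alus(7))    # 448
print(total_alus(16))   # 1024
```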

MALI G710 execution engine

ARM has not disclosed details of the front end of the G710 execution core. It is presumably the same as G77, with 64 warps or 1024 threads. Each processing unit has three ALUs: the FMA (fused multiply-add) and CVT (convert) units are 16-wide, and the SFU (special function unit) is 4-wide. Each FMA can perform 16 operations per cycle at FP32 precision. At FP16 this becomes 32 operations, and at 8-bit integer (INT8) it becomes 64. In desktop GPUs such as NVIDIA's, FP16 and FP32 are handled by separate units and can be computed at the same time, but a mobile-class MALI does not need such a design. The convert unit handles basic integer operations and type conversions, and also acts as the branch port.
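The precision scaling quoted for the FMA unit amounts to packing more elements into the same 32-bit lane. The sketch below, with a hypothetical function name, reproduces the three figures from the text and nothing more.

```python
FMA_FP32_OPS_PER_CYCLE = 16  # per FMA unit, as quoted in the text


def fma_ops_per_cycle(element_bits: int) -> int:
    """Operations per cycle for one FMA unit at a given element width."""
    if element_bits not in (8, 16, 32):
        raise ValueError("only the 8-, 16- and 32-bit cases are described in the text")
    return FMA_FP32_OPS_PER_CYCLE * 32 // element_bits


print(fma_ops_per_cycle(32))  # 16 (FP32)
print(fma_ops_per_cycle(16))  # 32 (FP16)
print(fma_ops_per_cycle(8))   # 64 (INT8)
```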
