In-depth analysis of Huawei’s intelligent driving chip

Publisher: ArtisticSoul | Last updated: 2023-10-23 | Source: 佐思产研

Huawei's smart car department, Intelligent Automotive Solutions (IAS), consists of the Autonomous Driving Solution (ADS) department, which provides application algorithms; the Mobile Data Center (MDC) department, which provides domain controllers; and the Integrated Sensing division, which provides sensor systems. ADS is responsible for algorithm research and is subdivided into many specialized groups, such as the Obstacle Detection team and the Prediction and Decision team. MDC functions like a Tier 1 supplier; it was formerly the Central Computing department, which mainly provided hardware for Huawei's ARM server business. The chips used in Huawei's smart driving are supplied by HiSilicon, as are Huawei's ARM server chips, and the two chip lines share most of their R&D results.


Huawei HiSilicon AI product line planning roadmap

Image source: https://ggim.un.org/meetings/2019/Deqing/documents/1-3%20Huawei%20slides.pdf


HiSilicon has four planned AI product lines: Kunpeng, Ascend (Shengteng), Kirin (Qilin), and Honghu. The Kunpeng series mainly covers CPUs, Ascend covers AI accelerators, Kirin mainly targets mobile phones, and Honghu targets TVs. Intelligent driving is an extension of the Ascend product line. In addition, the Kirin 990A, based on the Kirin 990, is Huawei's car cockpit chip.


Huawei's smart driving chips mainly include the Ascend 310, Ascend 610, and Ascend 620; these chips can also be cascaded to increase performance. A detailed explanation of Huawei's Ascend series chips can be found at https://www-file.huawei.com/-/media/corp2020/pdf/publications/huawei-research/2022/huawei-research-issue1-en.pdf, which is the main source of information for this article.


Internal block diagram of the Ascend 610

Image source: Huawei


Internal block diagram of the Ascend 910

Image source: Huawei


Huawei designs its chips in modular form and tries to reuse R&D results. The CPU and AI cores of the Ascend series chips are basically the same across the series; only the number of cores differs.


List of core features of Huawei Ascend

Image source: Huawei


The Ascend core is the AI core, offered in base, Max, Mini, Lite and Tiny versions; different core types and counts are configured for different applications. For example, the Kirin 990 for mobile phones has two Lite cores and one Tiny core, together delivering 6.88 TOPS@INT8. The Ascend 310 has two Mini cores, the Ascend 610 has ten base cores, and the Ascend 910 has 32 Max cores; the Ascend 620 may have ten Max cores. The cores themselves are basically identical, differing mainly in cache and frequency configuration.


Different cores correspond to different algorithm networks

Image source: Huawei


Ascend Max core internal framework

Image source: Huawei


The figure above shows the internal architecture of the Max core, which mainly comprises three computing units: scalar, vector, and tensor. The scalar unit is responsible for task scheduling, the vector unit handles the final activation stage of deep learning, and the tensor unit handles convolution as matrix multiplication.
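As an illustration, the three compute patterns map onto familiar NumPy operations. This is a conceptual sketch only; the tile shapes and the ReLU example are assumptions chosen for illustration, not Huawei's actual hardware parameters.

```python
import numpy as np

# Illustrative only: the three compute patterns the Max core's units handle.
x = np.random.rand(64, 64).astype(np.float16)   # example activation tile
w = np.random.rand(64, 64).astype(np.float16)   # example weight tile

# Scalar unit: one value at a time (control flow, scheduling, addressing).
acc = np.float16(0)
acc = acc + x[0, 0] * w[0, 0]

# Vector unit: elementwise 1D operations, e.g. a ReLU activation stage.
v = np.maximum(x.reshape(-1), 0)

# Tensor (CUBE) unit: dense matrix-matrix multiply, the core of convolution.
y = x @ w
```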


Calculation modes of three computing units

Image source: Huawei


The scalar unit is basically a CPU: it has the highest flexibility but the lowest AI computing power. The 1D vector unit is similar to a GPU, with intermediate flexibility and medium AI computing power. The CUBE unit targets 2D matrices, i.e., tensors in the general sense.

Strictly speaking, in mathematical terms a vector is a first-order tensor and a matrix is a second-order tensor. The CUBE core is essentially the same as NVIDIA's so-called tensor core.


The tensor core architecture NVIDIA has used since its Turing generation is basically the same as Huawei's CUBE; both are three-dimensional compute architectures.


Comparison of three computing cores

Image source: Huawei


A CUBE core has a computing power of 8 TOPS@FP16. Note that this is FP16, not the INT8 commonly quoted in the automotive field. A CUBE contains 4096 FP16 MACs (or 8192 INT8 MACs), and each MAC counts as two ops (a multiply and an add). Therefore, at an operating frequency of 1 GHz, the FP16 computing power is 1G × 2 × 4096 = 8T.


Similarly, Google's first-generation TPU has about 65,000 MACs (a 256×256 array) running at 0.7 GHz, so its computing power is 65,000 × 0.7G × 2 ≈ 91T. Tesla's first-generation FSD has two NPUs; each NPU has 9,216 INT8 MACs running at 2 GHz, giving 2 × 2 × 2G × 9216 ≈ 73 TOPS. Quoted computing power is basically a pile of MACs: the more MACs, the higher the computing power, but also the larger the die area and the higher the cost.
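The quoted figures all follow the same back-of-the-envelope formula, TOPS = MACs × frequency (GHz) × 2 ops ÷ 1000, which can be checked directly. The numbers below are the ones quoted in the text, not measured values:

```python
# TOPS = MACs * frequency(GHz) * 2 ops / 1000 (GHz*ops -> Gops, /1000 -> Tops)
def tops(macs: int, freq_ghz: float) -> float:
    """Peak throughput in TOPS; each MAC counts as 2 ops (multiply + add)."""
    return macs * freq_ghz * 2 / 1000

ascend_cube = tops(4096, 1.0)      # one Ascend CUBE core -> 8.192
tpu_v1      = tops(65_000, 0.7)    # Google TPU v1 -> 91.0
fsd         = 2 * tops(9216, 2.0)  # Tesla FSD, two NPUs -> 73.728

print(ascend_cube, tpu_v1, fsd)
```

Running the sketch reproduces the article's 8T, 91T and 73T figures up to rounding.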


Headline computing-power figures should therefore not be taken too seriously.


Comparison of AI computing power of several mobile phone chips

Source: Huawei


The Qualcomm Snapdragon 865 has the highest nominal rating at 8 TOPS, but its measured AI score is very low, far below the MediaTek Dimensity 1000 rated at 4.5 TOPS, and even below Huawei's Kirin 990. Qualcomm's figure is clearly inflated, while MediaTek is, if anything, too honest: its rating is said to be at least 1 TOPS below the actual value.


In 2019, Huawei published the paper "Kunpeng 920: The First 7-nm Chiplet-Based 64-Core ARM SoC for Cloud Services" with IEEE (https://ieeexplore.ieee.org/document/9444893). This is a paid-access, peer-reviewed paper, not the kind of preprint that arXiv publishes on submission; IEEE papers undergo strict review.


Huawei's paper mainly discusses the LLC, the last-level cache. In the Kunpeng 920 design, the SoC's global LLC is sliced and distributed to each CPU cluster, so the LLC and the CPU clusters form a NUMA relationship. The cluster size therefore needs careful consideration to maximize its benefit; weighing multiple factors, four CPU cores per cluster were chosen to obtain the best PPA score for the current process node.


The LLC operates in either private mode or shared mode: private mode is typically used when each CPU core works on relatively independent task data; shared mode is typically used when tasks within the SoC share a large amount of data.


In private mode, each CPU cluster forms a private group with its corresponding LLC slice, which prevents the cluster from accessing high-latency remote cache slices.


In shared mode, all LLC slices are combined together to act as one block to improve the reuse of data within the SoC.
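The private/shared trade-off above can be sketched with a toy average-latency model. The latency values and slice count below are illustrative assumptions, not Kunpeng 920 measurements:

```python
# Toy model of the private vs shared LLC trade-off (illustrative numbers only).
LOCAL_NS, REMOTE_NS = 5.0, 15.0   # assumed hit latency: local vs remote slice
N_SLICES = 16                     # assumed: one LLC slice per 4-core cluster

def avg_hit_latency(mode: str) -> float:
    if mode == "private":
        # Each cluster only ever hits its own slice: always local latency,
        # at the cost of seeing only 1/N_SLICES of total LLC capacity.
        return LOCAL_NS
    if mode == "shared":
        # Addresses interleave over all slices: 1 local + N-1 remote hits on
        # average, but every cluster can reuse data cached anywhere in the LLC.
        return (LOCAL_NS + (N_SLICES - 1) * REMOTE_NS) / N_SLICES
    raise ValueError(mode)

print(avg_hit_latency("private"), avg_hit_latency("shared"))
```

The model shows why the choice depends on the workload: private mode wins on latency, shared mode wins on effective capacity and data reuse.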


Looking at the CPU part, the Ascend 610 has a 16-core CPU. By convention, this CPU core is probably the same core as in Kunpeng, i.e., the TAISHAN V110 mentioned in "Kunpeng 920: The First 7-nm Chiplet-Based 64-Core ARM SoC for Cloud Services" (Taishan is also the name of Huawei's server product line). TAISHAN V110 is a modified ARM design. The TAISHAN V120 core is a modification of the ARM Cortex-A76: https://www.huaweicentral.com/kirin-990a-huaweis-first-auto-chipset-installed-in-arcfox-alpha-s-smart-car/ mentions that the CPU of the Kirin 990A is a lite version of TAISHAN V120, and https://www.hisilicon.com/en/products/Kirin/Kirin-flagship-chips/Kirin-990-5G directly states that the CPU of the Kirin 990 is the ARM Cortex-A76. TAISHAN V110 is therefore likely the ARM Cortex-A75 or A73, or the N1 from ARM's server series. There is a big gap compared with the ARM Cortex-A78AE used by NVIDIA's Orin, but Huawei makes up for it with core count and is basically on par with NVIDIA.


The NoC is a 2D 4×6 mesh with an inter-node operating frequency of 2 GHz and 1024-bit links, i.e., 256 GB/s of bandwidth. This was a relatively high-end configuration in 2019, but by 2023 it counts only as mid-range.
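The 256 GB/s figure follows directly from the link width and clock, assuming one 1024-bit transfer per cycle:

```python
# NoC link bandwidth sanity check: 1024-bit links at 2 GHz.
link_width_bits = 1024
freq_ghz = 2.0

bytes_per_cycle = link_width_bits / 8         # 128 bytes per cycle
bandwidth_gbs = bytes_per_cycle * freq_ghz    # 256 GB/s per link
print(bandwidth_gbs)
```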


Comparison between Huawei and other smart driving chips

Image source: Huawei


Huawei finally compares its chip with other smart driving chips, from which we can also see that the die size of the Ascend 610 is very large: 401 mm². According to TechanaLye's analysis, the die size of NVIDIA Orin is 455 mm², but NVIDIA uses Samsung's 8 nm process; on the same TSMC 7 nm process as the Ascend, its area would be similar to the Ascend 610's. In other words, the hardware cost of the Ascend 610 is basically the same as that of the NVIDIA Orin. Given the power consumption of the Ascend 610, water cooling is indispensable.


In fact, computing power is difficult to compare directly. NVIDIA's quoted figures are generally for sparse computation, while Huawei's are said to be dense, and the difference between the two is usually a factor of two. NVIDIA Orin comes in multiple versions; the top version delivers 275 TOPS@sparse INT8, which comes from two parts: 2048 CUDA cores at up to 1.3 GHz contribute 170 TOPS@sparse INT8, and 64 tensor cores at up to 1.6 GHz contribute 105 TOPS@sparse INT8. In dense FP32 format the computing power is only 5.3 TFLOPS (only the CUDA cores can process FP32 data), and it is difficult for the CUDA cores and tensor cores to reach maximum performance at the same time. The tensor cores mainly do matrix-matrix multiplication, while the CUDA cores handle matrix-vector and vector-vector multiplication; the CPU schedules which unit does what based on the data and the task.
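As a quick cross-check of the quoted Orin numbers: the two contributions sum to the headline figure, and under the usual 2:1 sparse-to-dense ratio the dense INT8 figure would be roughly half (the 2:1 ratio is the rule of thumb stated above, not an NVIDIA specification):

```python
# Cross-check of the quoted Orin figures (TOPS, sparse INT8).
cuda_sparse_int8 = 170    # 2048 CUDA cores @ up to 1.3 GHz (quoted)
tensor_sparse_int8 = 105  # 64 tensor cores @ up to 1.6 GHz (quoted)

total_sparse = cuda_sparse_int8 + tensor_sparse_int8  # headline 275 TOPS
approx_dense = total_sparse / 2                       # rule-of-thumb dense INT8
print(total_sparse, approx_dense)
```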


In addition, "sparse" is used in three different senses. One is sparse computation, where sparse refers to low compute density; Google's fourth-generation TPU has a sparse core designed specifically for sparse workloads such as the embedding part of transformers. Another is input data that is itself a sparse matrix. The third is a sparse model obtained by pruning a dense weight model. A naturally sparse matrix is one whose raw data contains many zeros: a lidar's measurement matrix is a typical sparse matrix, while RGB camera images are generally dense.
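A minimal sketch of "naturally sparse" versus dense input data, using synthetic arrays; the grid sizes and the 5% hit rate are illustrative assumptions, not real sensor parameters:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic lidar range image: returns only where a beam hits something.
lidar = np.zeros((128, 1024), dtype=np.float32)
hits = rng.random((128, 1024)) < 0.05                # assume ~5% of beams return
lidar[hits] = rng.random(int(hits.sum()), dtype=np.float32)

# Synthetic camera image: a value at every pixel -> dense.
camera = rng.integers(0, 256, size=(128, 1024, 3))

def density(a: np.ndarray) -> float:
    """Fraction of nonzero entries in the array."""
    return float(np.count_nonzero(a)) / a.size

print(density(lidar), density(camera))  # sparse (~0.05) vs dense (~1.0)
```

Exploiting this kind of sparsity (skipping the zeros) is what sparse-computation hardware is designed for.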
