Autonomous driving chip analysis

Latest update time:2021-02-07


For the past few years, major car manufacturers and autonomous-driving operators seemed to regard 2020 as a watershed in the development of autonomous vehicles; mass production of L3 autonomous vehicles was once set as the goal for 2020.


Although reality always falls some distance short of the ideal, in 2020 the self-driving cars that once lived only in PowerPoint decks undoubtedly moved one step closer to everyday life.


Today, the key to autonomous driving lies in single-point technologies: only when each single-point technology — recognizing lane lines and traffic signs, for example — is pushed to its limit and surpasses human performance can the capability be delivered, and that requires powerful computing to support it. Hardware and software algorithms have always been body and soul, inseparable. The market has therefore placed new demands on the computing power and performance of autonomous-driving chips, and these chips have become a new point of contention.


Uncovering the mystery of autonomous-driving chips and understanding their functions and compute capabilities may be something every automotive engineer needs at work; diligent practitioners must constantly learn new knowledge across fields.


This article summarizes part of that chip analysis. It covers the background of high-compute chips, the units used for chip computing power, how the Tesla FSD chip's computing power is calculated, why 30 TOPS is used as the headline figure for Xavier, and an analysis of Infineon TriCore™ computing power...


1

Behind the high computing power chip: the development of smart car E/E architecture


To quote a line everyone is familiar with: current E/E architecture design faces four major challenges — functional safety, real-time performance, bandwidth bottlenecks, and the computing-power black hole.


More specifically: meeting functional-safety requirements (including ISO 26262, SOTIF and RSS) as functional complexity keeps increasing; guaranteeing real-time behavior under complex architectures and functional frameworks; the bandwidth bottleneck caused by exponentially growing sensor data and exploding connectivity data; and the computing-power black hole required to sustain continuous software upgrades.


Therefore, the smart car E/E architecture is moving from distributed to centralized, and its ultimate form is a supercomputer.


Bosch's progressive route is a typical path for the current development of E/E architecture. As can be seen from the figure, the overall development trend is computing centralization .


Along with computing centralization comes a new concept. As the figure shows, the next stage of domain fusion is "on-board computers and zonal architecture." The key to the zonal architecture is to work with the on-board computer: it aggregates the connections of actuators, sensors, diagnostics and legacy I/O while the on-board computer carries out high-level decision-making — similar to the north and south bridges in a PC.


A military analogy helps here. The domain concept is like dividing forces into navy, army and air force (body domain, chassis domain, infotainment domain, safety domain) by function, each with independent authority to act. The concept of on-board computers plus zonal architecture, by contrast, organizes by theater: together with the central computer it forms a joint-operations command plus theaters. The central computer then does the overall planning and makes the major decisions, which significantly raises the computing-power requirements on the controller.


On the other hand, the car an OEM delivers in the future will not be a product with fixed functions but a robot that keeps evolving. Over the vehicle's whole life cycle, the hardware platform must keep supporting iterative software upgrades, which means building an open computing platform with a complete tool chain and strong compute guarantees — up to 1000 TOPS — providing a sufficient compute reserve for all kinds of software functions.


The development of the smart-car E/E architecture inevitably creates demand for high-compute chips. We have always emphasized that software defines the car, and AI chips are no exception: in essence, the chip and its architecture are the means and the carrier, while software is the purpose and the soul. Co-designing software and hardware achieves a high degree of unity between means and ends.


Only when the hardware bends down to adapt to the software can the full performance of the transistors be unleashed. Innovation in processor architecture is a very high barrier and requires a deep understanding of software. Such an overall solution determines the efficiency and quality of converting data into decisions and services; it is the hard technology the times are truly calling for, meeting automotive requirements for high compute and low power.



2

How to calculate chip computing power?

TOPS vs FLOPS vs MACS vs DMIPS


Let’s first learn the basic concepts of computing power units:


OPS (Operations Per Second): the number of operations completed per second. A multiplication counts as one OP and an addition counts as one OP; 1 TOPS means one trillion (10¹²) operations per second. OPS is the unit mainly used for deep-learning compute.


MACS: the number of fixed-point multiply-accumulate operations that can be performed per second, used to measure the fixed-point processing capability of an autonomous-driving computing platform. 1 GMACS equals one billion fixed-point multiply-accumulates per second. Because the multiply and the accumulate each count as one operation, one MAC is 2 OPs — that is, 1 MAC = 2 OPS.


FLOPS (Floating-Point Operations Per Second): the number of floating-point operations that can be performed per second, used to measure a computer's floating-point processing capability. Floating-point operations cover all arithmetic involving fractional numbers; they are more complex, more precise, and more time-consuming than integer operations.


DMIPS (Dhrystone MIPS): one of the most common benchmarks for measuring processor performance, typically used to measure a processor's integer performance. MIPS: millions of instructions executed per second — a measure of how many million instructions the system executes in one second.
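As a quick sanity check on the definitions above, the conversions can be expressed in a few lines of Python (illustrative values only, not tied to any particular chip):

```python
def macs_to_ops(macs_per_s: float) -> float:
    """A multiply-accumulate counts as 2 operations: 1 multiply + 1 add."""
    return 2 * macs_per_s

# 1 GMACS = 1e9 multiply-accumulates per second = 2 GOPS
assert macs_to_ops(1e9) == 2e9

# 1 TOPS = 1e12 operations per second, i.e. 0.5e12 MACs per second
tops = 1e12
print(f"1 TOPS = {tops:.0e} ops/s = {tops / 2:.0e} MACs/s")
```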


3

Tesla FSD chip

How the 144 TOPS figure is calculated


Tesla's Full Self-Driving (FSD) chip


After abandoning Mobileye and NVIDIA, Tesla began developing its own AI chip and released its first autonomous-driving chip (FSD) in 2019. The FSD chip is fabricated in a 14 nm FinFET CMOS process, has a die size of 260 mm², and contains 6 billion transistors and 250 million logic gates. The basic components of this SoC include the CPU (12 A72 cores at 2.2 GHz), a GPU supporting 32-bit and 64-bit floating point, various interfaces, and an on-chip network. The most important part of the chip is the in-house Neural Network Processor (NNP).


Each chip has two NNPs; each NNP contains a 96×96 MAC array and 32 MB of SRAM and runs at 2 GHz. The processing capability of one NNP is therefore

96 × 96 × 2 (OPs) × 2 (GHz) = 36.864 TOPS; a single chip delivers 72 TOPS, and the board (with two chips) 144 TOPS.
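The arithmetic above can be written out as a short script (a sketch that simply re-derives the quoted figures; Tesla rounds the per-NNP 36.864 TOPS down, quoting 72 TOPS per chip and 144 TOPS per board):

```python
def mac_array_tops(rows: int, cols: int, freq_ghz: float, ops_per_mac: int = 2) -> float:
    """Peak TOPS of a rows x cols MAC array: each MAC does 1 multiply + 1 add
    (2 OPs) per clock; GHz x OPs gives 1e9 ops/s, /1000 converts to TOPS."""
    return rows * cols * ops_per_mac * freq_ghz / 1000

nnp = mac_array_tops(96, 96, 2.0)  # one NNP: 96x96 MAC array at 2 GHz
chip = 2 * nnp                     # two NNPs per FSD chip
board = 2 * chip                   # two FSD chips per board

print(nnp, chip, board)  # 36.864 73.728 147.456 (quoted as ~36 / 72 / 144 TOPS)
```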


4

NVIDIA GPU chips

How TFLOPS is calculated


NVIDIA P100 performance parameters


The theoretical peak floating-point capability of a GPU is calculated essentially the same way as for a CPU:


Theoretical peak = number of GPU chips × GPU boost clock × number of cores × floating-point operations per core per clock cycle.


For floating point, a CPU can support operations of different precisions on the same units, but a GPU needs separate execution units for single and double precision. In a GPU, the single-precision ALU is called an FP32 core (or simply a "core"), while the double-precision ALU is called a DP unit or FP64 core. The ratio between the two varies greatly across NVIDIA architectures and models.


In the fifth-generation Pascal GPU architecture, FP64 cores : FP32 cores = 1 : 2.


So for the P100:

Double-precision theoretical peak = FP64 cores × GPU boost clock × 2 = 1792 × 1.48 GHz × 2 = 5.3 TFLOPS

Single-precision theoretical peak = FP32 cores × GPU boost clock × 2 = 3584 × 1.48 GHz × 2 = 10.6 TFLOPS

Because the P100 can also perform two FP16 half-precision operations inside one FP32 unit at the same time, the half-precision theoretical peak is twice the single-precision figure, reaching 21.2 TFLOPS.
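The same peak-throughput formula, applied to the P100 numbers above (a minimal sketch; the factor of 2 is the fused multiply-add, one multiply plus one add per core per cycle):

```python
def gpu_peak_tflops(cores: int, boost_ghz: float, flops_per_cycle: int = 2) -> float:
    """Theoretical peak = cores x boost clock x FLOPs per core per cycle."""
    return cores * boost_ghz * flops_per_cycle / 1000  # GHz x FLOPs -> GFLOPS; /1000 -> TFLOPS

fp64 = gpu_peak_tflops(1792, 1.48)  # P100 FP64 cores
fp32 = gpu_peak_tflops(3584, 1.48)  # P100 FP32 cores (2x the FP64 count)
fp16 = 2 * fp32                     # two packed FP16 ops per FP32 unit

print(round(fp64, 1), round(fp32, 1), round(fp16, 1))  # 5.3 10.6 21.2
```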


5

Analyzing the NVIDIA Xavier

The most celebrated autonomous-driving controller



The Xavier chip consists mainly of two large blocks: an 8-core 64-bit CPU based on NVIDIA's own Carmel architecture, and a Volta-architecture GPU with 512 CUDA cores. These two circuits occupy most of the die.


The 8 CPU cores are divided evenly into 4 clusters. Each cluster has an independent clock plane and shares 2 MB of L2 cache between its 2 cores; above that, the 4 clusters share 4 MB of L3 cache. Carmel is the successor to the earlier Denver architecture, and its defining feature is powerful dynamic code optimization. NVIDIA describes Carmel as a 10-wide superscalar architecture (10 execution ports, not 10-wide decode) supporting the ARMv8.2+RAS instruction set.



Xavier's GPU derives from the Volta architecture. Internally it is divided into 4 TPCs (texture processing clusters), each with 2 SMs (streaming multiprocessors); each SM integrates 64 CUDA cores (stream processors), for a total of 512 CUDA cores. Its single-precision floating-point performance is 2.8 TFLOPS and its double-precision performance 1.4 TFLOPS. Xavier also inherits the Tensor Core from Volta, delivering 22.6 TOPS at 8-bit and 11.3 TFLOPS at 16-bit.
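The CUDA-core count quoted above follows directly from the cluster layout (a trivial sanity check):

```python
# Xavier GPU layout: 4 TPCs, 2 SMs per TPC, 64 CUDA cores per SM.
tpcs, sms_per_tpc, cores_per_sm = 4, 2, 64
cuda_cores = tpcs * sms_per_tpc * cores_per_sm
print(cuda_cores)  # 512
```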



Summary: Xavier contains six different kinds of processor: the Volta Tensor Core GPU, an 8-core ARM64 CPU, dual NVDLA deep-learning accelerators, an image processor, a vision processor and a video processor.


Performance parameters of each processor:


Deep Learning Accelerator (DLA): 5 TOPS (FP16) | 10 TOPS (INT8)

Volta GPU: 512 CUDA cores | 20 TOPS (INT8) | 1.3 TFLOPS (FP32)

Vision processor: 1.6 TOPS

Stereo and Optical Flow Engine (SOFE): 6 TOPS

Image Signal Processor (ISP): 1.5 GPix/s

Video encoder: 1.2 GPix/s

Video decoder: 1.8 GPix/s


At this point a question may arise: why has 30 TOPS not appeared anywhere?


In fact, 30 TOPS is merely the deep-learning compute of the GPU inside Xavier.


Which raises the next question: why do the major suppliers, when promoting their autonomous-driving controllers or selecting Xavier, use 30 TOPS as the headline reference figure?


That practice is not wrong, either: among autonomous-driving algorithms, the compute-hungry perception algorithms really do rely on machine learning and deep learning. If a chip can meet the compute requirements of perception — sensor processing, ranging, localization and mapping, visual and lidar perception — then from a compute standpoint it can be used and developed as an autonomous-driving chip.


6

Why do the processors all have different names?


Seeing all these different processors inside Xavier for the first time can be a little dizzying. Drawing on material found online, here is a simple explanation that makes them easy to understand.


These various processors are all special-purpose processors: processors targeted at a particular application or field, similar to the familiar concept of a Domain Specific Architecture.


The most general-purpose processor is of course the CPU (Intel's desktop CPUs, ARM's embedded CPUs), which can run any program and process any kind of data. The problem is that for some applications the CPU is too inefficient — insufficient throughput for real-time processing, or excessive energy consumption. It is poor at image processing, hence the GPU; poor at signal processing, hence the DSP. A GPU can do image processing as well as DNN training and inference, but it is inefficient for certain DNN applications — hence processors dedicated to those applications, such as the DLA mentioned above.


7

Infineon TriCore™ computing-power analysis


Infineon's TriCore™ has won numerous industry awards; built on a unified RISC/MCU/DSP processor core, it offers strong computing capability.


TriCore™ features:

Bit and bit-field addressing and operations; fast context switching (4 cycles) and low interrupt latency; 16-bit and 32-bit instructions; dual 16-bit multiplier-accumulators with a sustained throughput of two 16×16 MACs per clock.


Take the TC275 as an example: a 200 MHz clock and three TriCore™ CPU cores.

The peak processing capability of the AURIX-family TC275 is then

= 16 × 16 × 2 (OPs) × 2 × 200 (MHz) × 3 = 614,400 MOPS ≈ 0.61 TOPS.
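The estimate above can be re-run as follows (a minimal sketch that simply reproduces the article's formula; note that counting the dual 16×16 multiplier-accumulator as 16 × 16 parallel operations is the article's accounting, not a figure from a hardware datasheet):

```python
# Reproducing the TC275 peak-compute estimate from the text:
# dual 16x16 multiplier-accumulators, 2 MACs sustained per clock,
# 2 OPs per MAC, 200 MHz clock, 3 TriCore cores.
peak_mops = 16 * 16 * 2 * 2 * 200 * 3  # frequency in MHz -> result in MOPS
peak_tops = peak_mops / 1e6            # 1 TOPS = 1e6 MOPS

print(peak_mops, peak_tops)  # 614400 0.6144
```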


8

A collection of smart-car AI chips


Company / Product / Key parameters / Application

Baidu — cloud full-function AI chip
Parameters: memory bandwidth 512 GB/s; peak compute 260 TOPS; power 150 W; inference 3× faster than traditional GPU/FPGA-accelerated models.
Application: large-scale AI workloads including search ranking, speech recognition, image processing, natural language processing, autonomous driving, and deep-learning platforms such as PaddlePaddle.

Horizon Robotics — Journey 2
Parameters: in-house BPU architecture; 4 TOPS; 2 W.
Application: perception of vehicles, pedestrians and the road environment in autonomous driving, comparable to Mobileye's Q-series chips. The Matrix2 platform, based on the Journey 2 chip, reaches 16 TOPS; another product line targets smart cameras.

Huawei — Ascend 310
Parameters: 16 TOPS; 8 W; energy efficiency 2 TOPS/W.
Application: combines the strengths of FPGA and ASIC — the low power of an ASIC with the programmability and flexibility of an FPGA. The MDC300 platform combines Huawei Ascend 310 chips, a Huawei Kunpeng chip and an Infineon TC397, for 64 TOPS; the MDC600 is based on 8 Ascend 310 AI chips plus integrated CPUs and ISP modules, reaching 352 TOPS.

Cambricon — Cambricon-1M
Parameters: INT8 (8-bit) energy efficiency 5 TOPS/W; processor cores offered in three sizes: 2 TOPS, 4 TOPS and 8 TOPS.
Application: accelerates deep-learning models and machine-learning algorithms such as CNN, RNN, SVM and k-NN for vision, speech and natural-language tasks.

Cambricon — MLU100 cloud intelligence chip
Parameters: balanced mode (1 GHz clock): 128 TOPS fixed-point at 80 W; high-performance mode (1.3 GHz): 166.4 TOPS fixed-point at 110 W.

Black Sesame — Huashan No. 2 A1000
Parameters: 8 CPU cores; NN compute 40-70 TOPS; power 8-10 W.
Application: suitable from low-level ADAS upward — a single A1000 suits L2+ autonomous driving; a domain controller built from two interconnected A1000s supports L3; a stack of four A1000s targets future L4. By variant: A1000L for ADAS (16 TOPS, 5 W); A1000 for L2+ (70 TOPS, 10 W); dual A1000 for L3 (140 TOPS, 25 W); quad A1000 for L3/L4 (280 TOPS, 60 W).

Xilinx — MPSoC series
Parameters: dual/quad-core ARM Cortex-A53 at up to 1.5 GHz; quad-core ARM Cortex-R5 MPCore at up to 600 MHz; ARM GPU at up to 667 MHz; H.264/H.265 video codecs.
Application: used by 29 car brands including Daimler-Benz and by top parts suppliers such as Aptiv, Autoliv, Bosch and Continental.

Tesla — FSD
Parameters: two Neural Network Processors (NNPs); 144 TOPS; 72 W; energy efficiency 2 TOPS/W.

NVIDIA — Xavier
Parameters: 8-core ARM64 CPU; Volta GPU with 512 CUDA cores; supports FP32/FP16/INT8; single-precision floating point of 1.3 TFLOPS at 20 W; Tensor Core performance 20 TOPS, rising to 30 TOPS when unlocked to 30 W.

NVIDIA — Orin
Parameters: 17 billion transistors; NVIDIA's next-generation (Ampere-architecture) GPU with Arm Hercules CPU cores; 200 TOPS, 7× the previous-generation Xavier SoC; 45 W; delivery in 2022.

Mobileye — EyeQ series
Parameters: EyeQ4 peaks at 2.5 TOPS at 3 W (efficiency 0.83 TOPS/W). EyeQ5: 24 TOPS at 10 W, with 2.4× the energy efficiency of Xavier; it carries 8 multi-threaded CPU cores and 18 of Mobileye's next-generation vision processors.
Application: full-vision solutions.
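One way to read these figures is by energy efficiency. The sketch below computes TOPS per watt for a few entries, using the peak numbers quoted above — peak-mode figures, so treat them as rough comparisons only:

```python
# (peak TOPS, power in watts) for a few chips, as quoted in this section
chips = {
    "Tesla FSD": (144, 72),
    "Huawei Ascend 310": (16, 8),
    "Mobileye EyeQ5": (24, 10),
    "NVIDIA Xavier (30 W mode)": (30, 30),
}
for name, (tops, watts) in chips.items():
    print(f"{name}: {tops / watts:.2f} TOPS/W")
```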



9

Written at the end


Later I may write a series of articles to consolidate this knowledge further, and I hope to explore it together with all of you.


If you found it useful, a follow from everyone passing by would be much appreciated — it is not easy to meet in the vast sea of people~



END
