Waymo's astonishing 29-camera sensor configuration has become a hot topic in autonomous driving circles.
Recently, Waymo posted a video on YouTube introducing the "Waymo Driver," the autonomous-driving giant's fifth-generation driverless platform. The main speaker is Waymo design director YooJung Ahn, the designer who created the "Firefly" autonomous car, and she walks through the basic design concept of the new platform. Ahn was born in Seoul, South Korea. Notably, she is an industrial designer of consumer products rather than an automotive engineer: before joining Google she designed mobile phones at Motorola and LG and had no experience in car exterior design, which is perhaps why she could produce something as unconventional as the Firefly.

In the video she mentions 29 cameras and some remarkable performance claims, such as a camera that can recognize stop signs 500 meters away. Effective detection distance is tied most closely to pixel count. At present the highest-resolution automotive image sensor is Sony's IMX324, at 7.42 megapixels, and even Sony only claims it can recognize traffic signs about 160 meters away. Waymo has the ability to build its own cameras, but not its own image sensors. So Waymo is either exaggerating, or it is using a 20- or 30-megapixel mobile-phone-class sensor, which would push it even further from automotive mass production.
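A rough way to sanity-check the 500-meter claim is to estimate how many pixels a stop sign would subtend at that distance. The sketch below is purely illustrative: the field of view, horizontal resolution, and sign size are assumptions, not Waymo's published specifications.

```python
import math

# Hypothetical long-range camera: FOV and resolution are assumptions.
h_fov_deg = 30.0      # narrow telephoto-style horizontal FOV (assumed)
h_pixels = 5120       # a 20-megapixel-class sensor's horizontal resolution (assumed)
sign_width_m = 0.75   # a typical stop sign is roughly 0.75 m across
distance_m = 500.0

# Angle subtended by the sign, converted to pixels on the sensor.
sign_angle_deg = math.degrees(2 * math.atan(sign_width_m / (2 * distance_m)))
pixels_on_sign = sign_angle_deg / h_fov_deg * h_pixels
print(f"{pixels_on_sign:.1f} pixels across the sign")  # ~14.7 pixels
```

Roughly 15 pixels across is plausibly enough for a detector tuned for small objects, but only with a very high-resolution sensor behind a narrow lens, which is consistent with the suspicion that these are not ordinary automotive-grade parts.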
I doubt there will be an automotive-grade chip able to handle such high data traffic within five years. Waymo (or Google) has been working on autonomous driving for nearly 11 years and has burned through an estimated 20 billion US dollars, yet still has no validated business model. Waymo's frequent public appearances suggest it is anxious to commercialize, but it may be drifting further and further from commercialization. The more complex a system, the higher its probability of error; evolution should trend toward simplicity. Early mobile phones and computers contained at least a dozen large chips, while today's have basically only two or three. Waymo's sensor suite, however, keeps getting more complex.
Let’s get back to the topic.
Yole's report points out that the first-generation Waymo driverless car (which should refer to the Chrysler Pacifica Hybrid minivan) used eight 5-megapixel cameras. To this day there is no automotive-grade 5-megapixel sensor, so these were certainly non-automotive-grade image sensors. The frame rate is only 21 fps, yet the aggregate bandwidth reaches 8.7 Gbps.
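That 8.7 Gbps figure is easy to approximate from first principles. The one assumption in this sketch is the bit depth; raw automotive sensors typically output 10 to 16 bits per pixel:

```python
def raw_bandwidth_gbps(cameras, megapixels, fps, bits_per_pixel):
    """Raw (uncompressed) aggregate video bandwidth in Gbps."""
    return cameras * megapixels * 1e6 * fps * bits_per_pixel / 1e9

# Yole's first-generation figures: 8 cameras, 5 MP each, 21 fps.
# At an assumed 10 bits/pixel the raw total lands near the quoted 8.7 Gbps;
# framing/blanking and protocol overhead plausibly account for the remainder.
print(raw_bandwidth_gbps(cameras=8, megapixels=5, fps=21, bits_per_pixel=10))  # ~8.4
```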
The second-generation Waymo driverless car should refer to the Jaguar I-Pace.
According to Yole, the second-generation Waymo driverless car uses 14 5-megapixel cameras with a striking aggregate bandwidth of 15.3 Gbps. It seems more plausible that the second-generation car carries 29 sensors (not 29 cameras), though it is also possible that Waymo has taken a genuinely unique approach and really does use 29 cameras. A 5-megapixel camera's data bandwidth exceeds 1 Gbps, and there are at least eight of them, which calls for very expensive Ethernet switch chips. Most automotive-grade Ethernet switch chips can handle only one or two 2.5 Gbps inputs. The one with the highest input bandwidth at present is Broadcom's BCM53162, which supports four 2.5GbE channels and costs about $650 (Mouser price, 100-piece quantity). Fourteen 5-megapixel cameras would require at least three such chips, as the sketch below suggests.
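A quick port-budget sketch, assuming each camera gets its own 2.5GbE channel (pairing two ~1.1 Gbps cameras onto one channel would halve the count, so the "at least three" above sits between the two cases):

```python
import math

# Each 5-MP camera needs >1 Gbps, too much for 1GbE, so assume one
# dedicated 2.5GbE channel per camera (pairing cameras would halve this).
cameras = 14
ports_per_chip = 4                       # BCM53162: four 2.5GbE channels
chips = math.ceil(cameras / ports_per_chip)
print(chips)                             # 4 chips under this assumption
print(chips * 650)                       # ~$2,600 in switch silicon alone
```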
[Figure: BCM53162 application diagram]
[Figure: BCM53162 internal block diagram]
Broadcom does not seem to have put much effort into promoting the BCM53162. In March 2019 it launched the BCM8956X and BCM8988X, but their detailed parameters have not been disclosed. If a 20- or 30-megapixel sensor is used, the bandwidth must be at least 10 Gbps, and no automotive-grade Ethernet switch currently handles that. Waymo is unlikely to cascade multiple switch chips; the more common industry practice is to use an FPGA as the Ethernet switch. An FPGA's interfaces can be customized, which suits driverless vehicles, a field where mature ASICs have not yet appeared. The drawback is cost: an FPGA able to take this many high-bandwidth inputs is estimated at upwards of $2,000.

On the visual-computing side, even with only 14 5-megapixel cameras, running deep-learning computation on all 14 streams would require at least 1,000 TOPS. Tesla's FSD delivers close to 37 TOPS per chip, which is already very high. It should be pointed out that any comparison of computing power must account for numerical precision: floating point versus fixed point and integer. FP64 is double precision, FP32 single precision, FP16 half precision, and bfloat16 sits between FP32 and FP16.
These are all floating-point formats; below them sits INT8, 8-bit integer precision. FP64, FP32, FP16, and bfloat16 are mostly used for training, while INT8 is generally used for inference. Tesla's FSD delivers 36.864 TOPS at INT8, while Google's TPU V3 delivers 420 TOPS at bfloat16; converted to INT8 it could reach at least 600 TOPS, but the TPU is not designed for INT8, so no such figure is published. As a rule of thumb, going from double precision to half precision gains about 4x, but NVIDIA is an exception, because it can provision separate cores for different precisions. For example, each GK104 GPU contains 1,536 FP32 CUDA cores and 64 FP64 units. On a dual-GK104 card, the theoretical single-precision peak = 2 GPUs × 1,536 FP32 cores × 2 ops/cycle (fused multiply-add) × 745 MHz = 4.58 TFLOPS, while the theoretical double-precision peak = 2 GPUs × 64 FP64 units × 2 × 745 MHz = 0.19 TFLOPS. NVIDIA's Tesla T4 has 2,560 CUDA cores handling FP32 and 320 Turing Tensor cores handling FP16: 8.1 TFLOPS at FP32, 65.13 TFLOPS at FP16, 130 TOPS at INT8, and 260 TOPS at INT4.
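The peak-throughput formula above is simple enough to reproduce; a minimal sketch using the GK104 figures from the text (the factor of 2 is the fused multiply-add counting as two operations per cycle):

```python
def peak_tflops(num_gpus, cores, clock_mhz, ops_per_cycle=2):
    """Theoretical peak = chips x cores x ops/cycle (FMA = 2) x clock."""
    return num_gpus * cores * ops_per_cycle * clock_mhz * 1e6 / 1e12

# Dual-GK104 card, figures from the text
print(peak_tflops(2, 1536, 745))  # FP32 peak: ~4.58 TFLOPS
print(peak_tflops(2, 64, 745))    # FP64 peak: ~0.19 TFLOPS
```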
So, could Waymo be using its unique in-house weapon, the TPU V3?
It should be pointed out that the TPU V3 is not a single chip but a board of four chips; Google also builds a far larger configuration, the TPU V3 Pod, out of 1,024 TPU V3s. The TPU V3 is liquid-cooled, each chip delivers 105 TOPS, and the TPU operates on bfloat16 data. Because the bottleneck of deep-learning matrix operations is memory bandwidth, the TPU V3 uses HBM regardless of cost, reaching 3,516 GB/s, more than ten times Tesla's bandwidth. Tesla's FSD, by contrast, cannot sustain its nominal 36.8 TOPS and may at times drop to half that. High-compute chips from Intel, Nvidia, and Huawei likewise adopt expensive HBM at any cost, and even some of AMD's consumer-grade products splurge on it.
They do so because they know the bottleneck lies in memory, not in the compute units themselves. The TPU V3's power consumption is undisclosed; most estimates put it between 200 and 350 watts. Reaching more than 1,000 TOPS would take three TPU V3s, with power consumption approaching a kilowatt, obviously far beyond automotive limits, and the cost would be staggering: at an estimated price above $5,000 per TPU V3, three would cost $15,000, nowhere near mass-production territory. Most importantly, the TPU V3 is designed for training, whereas what a vehicle needs is inference.
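The arithmetic behind that estimate, using the article's figures (the per-board price and power numbers are the author's estimates, not published specs):

```python
import math

target_tops = 1000
board_tops = 4 * 105                        # TPU V3 board: 4 chips x 105 TOPS
boards = math.ceil(target_tops / board_tops)
print(boards)                               # 3 boards -> 1,260 TOPS delivered
print(boards * 350)                         # worst-case power: ~1,050 W
print(boards * 5000)                        # estimated cost: ~$15,000
```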
So Waymo is unlikely to use the TPU V3; the cost and power consumption are simply too high. The only public information on Waymo's computing platform is an Intel announcement from September 2017. Intel claimed it had been working on driverless cars with Google since 2009, and later with Waymo, supplying Xeon processors, Arria FPGAs (for machine vision), and Gigabit Ethernet solutions to help Waymo's vehicles process information in real time. Intel's other automotive option, Mobileye's EyeQ5, offers only 12 TOPS and only entered mass production this year, so Waymo is unlikely to use it. The most likely setup remains Xeon processors plus Arria FPGAs.
[Figure: Intel's driverless-car computing platform]

Although Intel listed the Arria 10 series FPGA, the part most commonly used for deep-learning acceleration is the Stratix 10. The best-known example of FPGA-based deep-learning acceleration is Microsoft: since the end of 2015, Microsoft has fitted a Catapult FPGA board to almost every new server it purchases, serving Bing search, Azure cloud services, and other applications, which makes Microsoft one of the world's largest FPGA customers. Using Microsoft's proprietary ms-fp8 format (8-bit floating point), a Stratix 10 FPGA can reach a peak of 90 TFLOPS. The Stratix 10 is a 2015-era Intel product; at the end of 2019 it was succeeded by Agilex. Agilex FPGAs are built on the second-generation HyperFlex architecture and, compared with the previous-generation Stratix 10, deliver 40% higher performance at 40% lower power, with DSP FP16 half-precision performance up to 40 TFLOPS (40 trillion operations per second), INT8 integer performance up to 92 TOPS, and transceiver data rates up to 112 Gbps. Xilinx's ACAP-series devices go higher still, up to 147 TOPS (INT8).
However, FPGA power consumption is not as low as commonly assumed. Take Intel's D5005 programmable acceleration card, based on a Stratix 10 SX FPGA (2.8 million logic elements) and deployed in the HPE ProLiant DL380 Gen10 server: its TDP is up to 215 watts, and given the specified airflow the card can dissipate up to 189 W, of which up to 137 W may come from the FPGA itself. Nvidia's T4 draws only 75 watts for the card and 70 watts for the chip. The D5005 is also expensive, up to $10,000, although it does include two quad small form-factor pluggable (QSFP) interfaces running at up to 100G, and above all it is programmable, so the price is not unjustified. Still, the Stratix 10 SX FPGA chip alone costs at least $3,000, while an Nvidia T4 chip should be under $1,000. The Nvidia Tesla T4 is therefore the most likely choice, and Waymo may have cut its deep-learning models down to INT2 or even INT1 precision: the T4 can reach 1,040 TOPS at INT1.
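As a sanity check on those numbers, a minimal sketch assuming Tensor-core throughput scales inversely with operand width, anchored to the T4's 65 TFLOPS FP16 figure:

```python
def t4_tops(bits, fp16_tflops=65.0):
    """Rule-of-thumb throughput at a given operand width (bits)."""
    return fp16_tflops * 16 / bits

print(t4_tops(8))   # ~130  -> matches the INT8 figure
print(t4_tops(4))   # ~260  -> matches the INT4 figure
print(t4_tops(1))   # ~1040 -> matches the INT1 figure quoted above
```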