Waymo began investing in autonomous driving as early as 2008, when it was still part of Google's X division. Yet 14 years later, Waymo has little to show for it, and its influence has steadily shrunk. The fundamental reason is that Waymo focused too heavily on software algorithms while neglecting the hardware platform. The rise of NVIDIA's and Qualcomm's autonomous-driving chips in recent years stands in sharp contrast to Waymo's decline. Autonomous-driving software and hardware are two sides of the same coin and cannot be separated: a complete solution must include both, and offering only one of them is a dead end, because the two must be co-designed very tightly and the result is hard to port. This is mainly because deep-learning models are closely bound to the hardware they run on; a mismatch between the two quickly leads to inefficiency, and hardware utilization rates of 10% or less are common.
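As a rough illustration of what that utilization figure means, here is a minimal sketch with made-up numbers, not measurements from any real platform:

```python
# Hypothetical illustration of the utilization problem described above.
# The numbers are invented for the example, not measured on real hardware.

def hardware_utilization(delivered_tops: float, peak_tops: float) -> float:
    """Fraction of a chip's peak throughput that the workload actually uses."""
    return delivered_tops / peak_tops

# A 100 TOPS accelerator that a poorly matched model drives at only 8 TOPS
# is running at 8% utilization -- squarely in the "10% or less" range.
print(f"{hardware_utilization(8, 100):.0%}")  # -> 8%
```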
Waymo initially used Intel chips as its computing platform, based mainly on Xeon server CPUs and FPGA accelerator cards.
Waymo's computing platform. The Ethernet switch is probably implemented with an FPGA: an Ethernet switch with this much bandwidth is still not in mass production today, let alone a few years ago, and only an FPGA can reach such bandwidth, though at a steep price of well over $1,000. There should also be a PCIe switch between the two Xeon CPUs.
The FPGA may be an Arria 10 1150GX, which currently sells for about $2,000 and may have cost more than $4,000 in 2013. Altera's FPGAs span four major families: the top-of-the-line Stratix series, the Arria series that balances cost and performance, the low-cost Cyclone series, and the MAX series with non-volatile memory. Stratix parts are mostly priced near $10,000, Arria parts around $2,000-5,000, and Cyclone parts mostly between $10 and $20. The Arria family itself has four generations: GX, II, V, and 10. The GX is the first generation, launched in 2007 on a 90 nm process; the Arria II is a 2009 product on a 40 nm process; the Arria V is a 2011 product on a 28 nm process; and the Arria 10, the newest, launched in 2013 on a 20 nm process. The Arria 10 comes in two variants, with and without ARM cores; the ARM option is a dual Cortex-A9.
In addition to the 1,150K logic elements of a standard FPGA, the 1150GX has 1,518 hardened single-precision floating-point multiplier/adder blocks and 3,036 18x19 multipliers. It can ultimately deliver 3,340 GMACS (billions of fixed-point multiply-accumulate operations per second) and 1,366 GFLOPS of floating-point throughput, with a peak AI compute of 26 TOPS @ INT8. That was very impressive for 2013, and so, of course, was the price.
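A quick arithmetic sketch of the floating-point figure: the ~450 MHz clock below is not quoted in the text, it is simply the clock implied by the numbers above, assuming each hard FP block performs one fused multiply-add (2 FLOPs) per cycle.

```python
# Back-of-the-envelope check of the Arria 10 1150GX floating-point figure.
# Assumption: each of the 1,518 hardened single-precision FP blocks does one
# fused multiply-add (2 FLOPs) per clock cycle; ~450 MHz is the clock implied
# by the 1,366 GFLOPS figure, not an official specification quoted here.

fp_blocks = 1518        # hardened single-precision FP multiplier/adder blocks
flops_per_cycle = 2     # one multiply + one add per block per cycle
clock_ghz = 0.45        # ~450 MHz (implied)

gflops = fp_blocks * flops_per_cycle * clock_ghz
print(f"{gflops:.0f} GFLOPS")  # ~1366 GFLOPS, matching the quoted figure
```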
The FPGA is the most efficient kind of computing unit. The reason an FPGA is more energy-efficient than a CPU or even a GPU is fundamentally the benefit of an architecture with no instructions and no shared memory. In a von Neumann architecture, because an execution unit (such as a CPU core) may execute any instruction, it needs an instruction memory, a decoder, arithmetic units for the various instructions, and branch-handling logic. Because the control logic for instruction streams is complex, there cannot be too many independent instruction streams; GPUs therefore use SIMD (single instruction, multiple data) so that many execution units process different data in lockstep, and CPUs also support SIMD instructions. In an FPGA, the function of each logic element is fixed when the device is programmed (configured), so no instructions are needed at all.
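As a rough software analogy for the SIMD point (this is ordinary NumPy, not FPGA or GPU code): a scalar loop pays instruction overhead per element, while a vectorized operation applies the same operation across many elements in lockstep.

```python
# Software analogy for SIMD: same operation, many data elements in lockstep.
import numpy as np

data = np.arange(100_000, dtype=np.float32)

# Scalar, instruction-driven style: per-element fetch/decode/dispatch overhead.
scalar = [x * 2.0 + 1.0 for x in data]

# Data-parallel (SIMD-style): one vectorized operation over the whole array.
vectorized = data * 2.0 + 1.0

assert np.allclose(scalar, vectorized)
```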
The registers and on-chip memory (BRAM) in an FPGA are controlled by each block's own logic, with no need for unnecessary arbitration or caching. Where communication is needed, the connections between a logic element and its neighbors are also fixed at configuration time, so there is no need to communicate through shared memory. An FPGA is in effect a large piece of SRAM; it does not suffer from the memory wall that AI chips struggle to overcome, and it resembles in-memory computing, only on a much larger scale. FPGA hardware utilization can easily exceed 80%, even though FPGA clock frequencies are relatively low.
However, FPGA routing is not optimized, and a large portion of the silicon sits idle and wasted, which drives the cost up rapidly. Small FPGAs are very cheap, but once a device exceeds roughly 300,000 to 500,000 logic elements, the price soars.
A single chip in Waymo's computing platform cost more than $4,000 and was not automotive grade, so mass production was obviously out of the question. After 2019 Waymo gradually went quiet, while Qualcomm and NVIDIA, which integrate hardware and software, went from strength to strength.
Seeing NVIDIA and Qualcomm take off, Waymo changed course and in 2021 began planning an autonomous-driving chip with Samsung, which may officially launch at the end of 2023. Compared with NVIDIA and Qualcomm, Waymo's disadvantage is that both rivals are chip giants with huge shipment volumes, Qualcomm especially, which drives chip costs down significantly. Waymo's custom chip will inevitably ship in very low volumes. Google's TPU is used in data centers, which are not cost-sensitive, but cars are; and even so, TPU shipments are still not small compared with the number of self-driving cars.
To commercialize, cost must be considered, which is why Waymo chose Samsung as a partner: Samsung produces hundreds of millions of mobile-phone SoCs every year, enough to rival Qualcomm and to spread the cost. Samsung's cooperation with Google began with Google's first-generation phone chip, the Tensor. Qualcomm's Snapdragon Ride platform currently pairs the SA8540P with the SA9000. The SA8540P is similar to Qualcomm's 5 nm phone chip, the Snapdragon 888, but it may adopt a design of four big and four smaller cores, that is, four Cortex-X1 plus four Cortex-A78, with the A55 little cores dropped. Qualcomm has also derived the 8cx Gen 3 for laptops, which is very similar to the SA8540P but without the 5G modem.
If you can make mobile phone SoCs, you can also make autonomous driving chips. Samsung, Apple, and MediaTek can all do it.
Google's first-generation Tensor chip, used in Google's Pixel 6 series phones, is actually a modified version of Samsung's Exynos 2100.
Comparison between the first generation Tensor and Samsung Exynos 2100
On paper the Exynos 2100's NPU has an overwhelming advantage, 26 TOPS versus Google Tensor's 5.7 TOPS, yet in actual tests the Exynos 2100's edge is not obvious.
In NNAPI (Android's neural-network API) benchmark scores across the Snapdragon 888, Google Tensor, and Exynos 2100, the Google Tensor has a clear lead.
In NLP (natural-language processing) tasks, the Google Tensor also holds an obvious advantage.
In the offline image classification benchmark test, the difference between Tensor and NVIDIA is not that big.
Waymo's self-driving chip is unlikely to be based on the first-generation Tensor: the second-generation Tensor entered mass production at the end of July 2022, and Waymo's chip is more likely based on it. There is no public information about the second-generation Tensor yet, but Samsung is clearly not going to do much custom work for Google, so it should be a modified Exynos 2200. After all, the first-generation Tensor's full model number is Samsung Exynos Tensor GS101, which shows it to be a modified Exynos 2100.
Waymo's self-driving chip should be built on Samsung's 4 nm process. The prime cores should still be two Cortex-X2, rather than the single one in the Exynos, plus two Cortex-A710 middle cores and four Cortex-A510 little cores. The GPU is likewise expected to be based on AMD's RDNA2 architecture, which is enough to compete with Qualcomm's Adreno 730.
The CPU and GPU leave little room for differentiation; the NPU should be where Google plays to its strengths.
Comparison of Google's TPU generations
Google launched the first-generation TPU in 2016 and the fourth generation in 2021. There is no public figure for the fourth generation's compute; Google only says it is twice that of the third generation. The third-generation TPU delivers 360 TOPS @ INT8, so the fourth generation should be about 720 TOPS @ INT8. The TPU, however, targets data centers; for edge computing Google also offers the Edge TPU, which is very cheap, probably no more than $10.
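Under that doubling claim, the estimate is a one-line calculation (the 360 TOPS starting point is the article's figure, not an official Google number):

```python
# TPU INT8 estimates as quoted above -- not official Google specifications.
tpu_v3_int8_tops = 360
tpu_v4_int8_tops = 2 * tpu_v3_int8_tops  # Google says v4 is ~2x v3
print(tpu_v4_int8_tops)  # -> 720
```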
Google did not disclose TPU v4's compute, but the comparison it published shows that its training times on various algorithm models beat those of Nvidia's top systems across the board.
Note: This is data from testing in mid-2021.
Waymo's strategy should mirror Qualcomm's: an SoC plus an accelerator. The SoC would be based on the second-generation Tensor, i.e. the Samsung Exynos 2200, whose internal NPU should deliver at least 30 TOPS. The accelerator should be a modified fourth- or fifth-generation TPU, with an estimated 360 TOPS. This would cut costs sharply, probably to no more than the cost of an NVIDIA-based system. In addition, the fourth- or fifth-generation TPU should be fabbed by Samsung rather than TSMC. TSMC's process is of course better, but its prices are much higher than Samsung's, and Google's order volume is too small: it would have to wait in line at TSMC, whose capacity is tight and whose large customers are many. That is why Google has consistently chosen the weaker Samsung as its partner.
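A minimal sketch of the compute budget this guess implies; every figure here is the article's estimate (or a commonly quoted reference point), not an announced specification:

```python
# Hypothetical compute budget for the rumored Waymo platform, built only from
# the estimates above -- nothing here is a confirmed specification.

soc_npu_tops = 30        # modified Exynos 2200-class SoC, NPU estimate
accelerator_tops = 360   # TPU v4/v5-derived accelerator, article's estimate

total_tops = soc_npu_tops + accelerator_tops
print(f"Estimated platform total: ~{total_tops} TOPS")  # -> ~390 TOPS

# For reference, a single NVIDIA Drive Orin is commonly quoted at 254 TOPS,
# so the proposed combination would be competitive on raw compute.
```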