Comparison of mainstream architecture solutions: the three mainstream AI chip architectures
Current mainstream AI chips fall into three categories: GPU, FPGA, and ASIC. The GPU and FPGA are relatively mature, general-purpose chip architectures that came first; the ASIC is a chip customized for specific AI scenarios. The industry broadly agrees that CPUs are not well suited to AI computation, yet they remain essential in AI applications.
01
GPU solution

Architecture comparison between GPU and CPU

The CPU follows the von Neumann architecture, whose core ideas are stored programs/data and serial, sequential execution. The CPU architecture therefore devotes a large amount of die area to storage units (cache) and control units, while the computing units (ALUs) occupy only a small fraction. As a result, the CPU is limited in large-scale parallel computing, but is comparatively better at handling logical control.
The GPU (Graphics Processing Unit), also known as the graphics processor, is a large-scale parallel computing architecture composed of a large number of computing units. Originally split off from the CPU specifically to process image data in parallel, it is designed to handle many parallel computing tasks at the same time. The GPU also contains basic computing, control, and storage units, but its architecture differs greatly from the CPU's. Its architecture diagram is shown below. Whereas ALUs occupy less than 20% of a CPU's chip area, more than 80% of a GPU's chip area is ALUs; that is, the GPU dedicates far more ALUs to parallel data processing.
The difference between GPU and CPU is that the CPU consists of a few cores optimized for sequential serial processing, while the GPU has a massively parallel architecture composed of thousands of smaller, more efficient cores designed to handle many tasks simultaneously. The two differ so much because of their different design goals: they target two different application scenarios. The CPU needs strong versatility to handle various data types, and must also perform logical judgments, which introduces large numbers of branches, jumps, and interrupts; all of this makes the CPU's internal structure extremely complex. The GPU, by contrast, faces large-scale, highly uniform data with no interdependencies, in a pure computing environment that does not need to be interrupted.
Brief description of GPU acceleration technology

For deep learning, current hardware acceleration relies mainly on graphics processing units. Compared with traditional CPUs, GPUs offer orders of magnitude more computing cores and are far better suited to parallel computation.
The GPU's many-core architecture contains thousands of stream processors that operate in parallel, significantly shortening model computation time. As companies such as NVIDIA and AMD continue to enhance large-scale parallelism in their GPUs, general-purpose GPU computing has become an important means of accelerating parallel applications. GPU technology has now reached a relatively mature stage. Using GPUs to train deep neural networks exploits the efficient parallelism of their thousands of computing cores; with massive training data, training time shrinks dramatically and far fewer servers are needed. Properly optimized for a suitable deep neural network, a single GPU card can match the computing power of dozens or even hundreds of CPU servers. The GPU has therefore become the industry's preferred solution for training deep learning models.
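The independence that makes those thousands of cores useful is visible in an ordinary matrix multiplication, the core operation of a neural network layer: every output element is its own dot product and can be computed without any coordination with the others. A minimal illustrative sketch in numpy (running on the CPU; the per-element loop mimics what independent GPU threads would each do):

```python
import numpy as np

# A dense layer's forward pass is a matrix multiplication: every output
# element is an independent dot product, so thousands of GPU cores can
# each compute one element with no coordination between them.
batch, n_in, n_out = 64, 256, 128
x = np.random.rand(batch, n_in).astype(np.float32)
w = np.random.rand(n_in, n_out).astype(np.float32)

y_fast = x @ w  # one bulk operation (what a GPU kernel parallelizes)

# Reference: compute each output element independently, as one GPU
# thread per (i, j) would.
y_ref = np.empty((batch, n_out), dtype=np.float32)
for i in range(batch):
    for j in range(n_out):
        y_ref[i, j] = np.dot(x[i, :], w[:, j])
```

Both paths produce the same result; the only difference is that the bulk operation exposes all the independent work at once, which is exactly what a parallel accelerator needs.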
When the model being trained is relatively large, training can be accelerated through data parallelism. Data parallelism partitions the training data and uses multiple model instances to train on the different chunks simultaneously. Because the instances share the same model but see different data, the performance bottleneck lies in parameter exchange among the CPUs or GPUs: following the parameter-update rule, the gradients computed by all instances must be submitted to a parameter server and applied to the shared parameters. The way the data is sliced and the bandwidth of the parameter server can therefore limit the efficiency of data parallelism. Besides data parallelism, model parallelism can also speed up training: a large model is split into several shards, each held by a different training unit, and the units cooperate to train the full model.
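The data-parallel scheme described above can be sketched in a few lines. This is an illustrative toy only: a linear model on synthetic data stands in for a deep network, and a simple gradient average stands in for a real parameter server:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, n_workers = 1000, 5, 4
X = rng.normal(size=(n, d))
w_true = rng.normal(size=d)
y = X @ w_true
w = np.zeros(d)  # the single shared parameter vector

def grad(Xs, ys, w):
    # Mean-squared-error gradient for a linear model on one data shard.
    return 2 * Xs.T @ (Xs @ w - ys) / len(ys)

# Data parallelism: each worker holds the same model but a different
# slice of the training data.
shards = list(zip(np.array_split(X, n_workers), np.array_split(y, n_workers)))

for step in range(200):
    # Each worker computes a local gradient; the "parameter server"
    # averages them and updates the shared parameters. In practice this
    # exchange is the bandwidth bottleneck the text describes.
    g = np.mean([grad(Xs, ys, w) for Xs, ys in shards], axis=0)
    w -= 0.05 * g
```

After training, `w` converges to `w_true`; the averaged gradient is mathematically equivalent to the full-batch gradient, which is why data parallelism preserves the training result while splitting the work.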
GPU-accelerated computing

GPU-accelerated computing uses the graphics processing unit together with the CPU to speed up scientific, analytical, engineering, consumer, and enterprise applications. First introduced by NVIDIA in 2007, GPU accelerators now power energy-efficient data centers in government laboratories, universities, corporations, and small and medium-sized enterprises around the world, and accelerate applications on platforms ranging from cars, phones, and tablets to drones and robots. GPU-accelerated computing delivers extraordinary application performance by offloading the compute-intensive portions of an application to the GPU while the CPU continues to run the rest of the program code; from the user's perspective, applications simply run much faster. At present the GPU performs only the parallel matrix multiply-and-add operations; constructing the neural network model and managing the data streams still happen on the CPU. The CPU-GPU interaction proceeds as follows: obtain the GPU information, configure the GPU id, load the neuron parameters onto the GPU, let the GPU accelerate the neural network computation, and receive the GPU's results.
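The five interaction steps above can be sketched from the host's point of view. Since the real device API depends on the vendor (CUDA, ROCm, etc.), a hypothetical `FakeGPU` class stands in for the accelerator here; only the sequence of steps is meant to be meaningful, not the API names:

```python
import numpy as np

class FakeGPU:
    """Illustrative stand-in for a real accelerator; it mirrors only the
    host-device flow, not any vendor API."""
    def __init__(self, device_id):
        self.device_id = device_id      # step 2: configure the GPU id
        self.params = None
    def upload(self, weights, bias):    # step 3: load neuron parameters
        self.params = (weights.copy(), bias.copy())
    def forward(self, x):               # step 4: GPU-side multiply-add
        w, b = self.params
        return x @ w + b

devices = [FakeGPU(i) for i in range(2)]  # step 1: obtain GPU information
gpu = devices[0]

rng = np.random.default_rng(1)
w, b = rng.normal(size=(4, 3)), rng.normal(size=3)
x = rng.normal(size=(8, 4))

gpu.upload(w, b)
y_gpu = gpu.forward(x)                  # step 5: receive the GPU's result
```

Note that the model definition (`w`, `b`) and the input data live on the CPU side, matching the article's point that only the bulk multiply-add is offloaded.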
Why the GPU is so important in autonomous driving

One of the most important technologies in autonomous driving is deep learning. Artificial intelligence based on deep learning architectures is now widely used in computer vision, natural language processing, sensor fusion, object recognition, autonomous driving, and other areas of the automotive industry. From self-driving start-ups and Internet companies to major OEMs, teams are actively exploring the use of GPUs to build the neural networks behind full autonomous driving. The advent of GPU-accelerated computing gave enterprises a many-core parallel architecture capable of handling data sources that earlier CPU architectures could not. By one comparison, completing the same deep learning training task on a GPU cluster costs only about 1/200 as much as on a CPU cluster.
The GPU is key to autonomous driving and deep learning. Whether perceiving the surrounding environment in real time or quickly planning driving routes and maneuvers, the car's "brain" must respond rapidly, which poses a huge challenge for hardware manufacturers. Throughout the development of autonomous driving, deep learning and other AI algorithms are needed to cope with an effectively infinite range of situations, and the boom in AI, deep learning, and driverless vehicles has ushered in a golden age of GPU computing. Another important GPU parameter is its floating-point performance. A floating-point number represents a value in binary with a movable radix point and a variable-length significand, in contrast to a fixed-point number, whose radix point stays in one place. Iterating autonomous-driving algorithms demands high precision, so floating-point support is required.
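The fixed-point versus floating-point trade-off can be seen in a tiny sketch. With 8 fractional bits, a fixed-point format has a uniform resolution of 1/256, so any value smaller than that quantizes to zero, while a 32-bit float keeps the relative error near machine epsilon across a huge dynamic range (the format and values below are purely illustrative):

```python
import numpy as np

SCALE = 2 ** 8  # fixed point with 8 fractional bits: step size 1/256

def fixed_round_trip(x):
    """Quantize a value to the fixed-point grid and convert back."""
    return round(x * SCALE) / SCALE

tiny = 0.001
# Below the fixed-point step size, the value disappears entirely.
print(fixed_round_trip(tiny))   # quantizes to 0.0
# A 32-bit float instead keeps a small *relative* error.
print(abs(float(np.float32(tiny)) - tiny) / tiny)  # ~1e-8
```

This uniform absolute error is why fixed-point formats struggle when an algorithm mixes very large and very small magnitudes, and why precision-sensitive iterative algorithms favor floating point.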
02
FPGA solution

FPGA chip definition and structure

The FPGA (Field-Programmable Gate Array) is a further development of programmable devices such as the PAL, GAL, and CPLD. It appears as a semi-custom circuit in the application-specific integrated circuit field: it avoids the drawbacks of fully custom circuits while overcoming the limited gate counts of the earlier programmable devices. An FPGA chip is built from the following main parts: programmable input/output units, basic programmable logic units, complete clock management, embedded block RAM, rich routing resources, embedded low-level functional units, and embedded dedicated hardware modules. Mainstream FPGAs are still based on look-up-table technology, far exceed the basic performance of earlier generations, and integrate hard (ASIC-style) modules for common functions such as RAM, clock management, and DSP.
Working principle of the FPGA

Because an FPGA must be reprogrammable again and again, its basic combinational-logic structure cannot be built from fixed NAND gates as in an ASIC; it must use a structure that is easy to reconfigure repeatedly, and look-up tables meet this requirement well. Mainstream FPGAs today use look-up-table structures based on SRAM technology, while some military- and aerospace-grade FPGAs use look-up tables based on Flash or fuse/antifuse technology; reconfiguring the FPGA means burning a new configuration file that changes the look-up tables' contents. A look-up table, LUT for short, is essentially a small RAM. Today's FPGAs mostly use 4-input LUTs, so each LUT can be regarded as a RAM with a 4-bit address bus. When the user describes a logic circuit with a schematic or an HDL, the PLD/FPGA development software automatically computes all possible outputs of that circuit and writes the truth table (i.e., the results) into the RAM in advance. Thereafter, performing the logic operation on a set of input signals amounts to supplying an address, looking up the table, and outputting the content stored at that address.
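This table-lookup principle is easy to simulate. The sketch below (illustrative Python with a hypothetical example function) "synthesizes" an arbitrary 4-input Boolean function into a 16-entry table, then evaluates it purely by addressing, mirroring how the development software fills LUT SRAM and how the fabric later reads it:

```python
from itertools import product

def build_lut(logic_fn):
    """'Synthesis': precompute the full truth table of a 4-input function,
    as FPGA tools write a circuit's results into LUT SRAM in advance."""
    return [logic_fn(a, b, c, d) for a, b, c, d in product((0, 1), repeat=4)]

def lut_read(lut, a, b, c, d):
    """'Execution': the four inputs form a 4-bit RAM address; the stored
    bit at that address is the output. No gates are evaluated at runtime."""
    addr = (a << 3) | (b << 2) | (c << 1) | d
    return lut[addr]

# Hypothetical example circuit: y = (a AND b) XOR (c OR d).
lut = build_lut(lambda a, b, c, d: (a & b) ^ (c | d))

print(lut_read(lut, 1, 1, 0, 0))  # -> 1
print(lut_read(lut, 1, 1, 1, 0))  # -> 0
```

Reconfiguring the "FPGA" is just a matter of calling `build_lut` with a different function, which is exactly why SRAM-based LUTs make the device endlessly reprogrammable.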