BEV will be launched within this year. Are high-compute chips ready?

Published: 2023-06-01 | Source: HiEV大蒜粒车研所

The emergence of large models brings new challenges to BEVs


Currently, the most mainstream sensor for autonomous driving is the camera, and some leading OEMs have begun to use cameras as their primary sensors. The camera's advantages are:

High pixel counts and rich information;

Wide availability and relatively low cost.


Over the past 50 years, computer vision has generally followed a step-by-step development model. This brings to mind Marr's computational theory, a theory of object recognition in computer vision. It describes a pipeline that extracts basic elements from the image into an intermediate representation called the 2.5-D sketch, and then computes a three-dimensional model representation from that 2.5-D representation. This logic still holds today, but it is no longer expressed as a step-by-step modular pipeline; it has largely been replaced by neural networks. In the past, computer vision relied on separate pedestrian-detection, face-detection, and vehicle-detection algorithms; these have now been unified into neural network algorithms.


Deep learning becomes the main force; neural networks replace hand-written code


Taking the 2012 publication of the convolutional neural network paper at NIPS as the starting point, deep learning became the dominant approach in computer vision. This approach has a very typical feature: once the inputs and outputs are standardized, how intermediate features are extracted inside the network, and how those features are turned into semantic outputs, is learned through forward propagation and back-propagation rather than designed by hand. This greatly reduces the design workload and yields better results on larger and more complex tasks. Today's autonomous driving involves not only visual perception but also localization, long- and short-horizon prediction of target behavior, and ego-vehicle planning and control. All of these can be handled by neural networks, which we define as Software 2.0.
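As a minimal, hypothetical sketch of this idea (the model, shapes, and data below are invented for illustration, not the production stack discussed in this article), a small network learns its features through forward and backward passes instead of hand-written detection rules:

```python
# Minimal sketch: features are learned via forward/back-propagation
# instead of hand-coded pedestrian/face/vehicle detectors.
# All shapes and class counts are hypothetical.
import torch
import torch.nn as nn

class TinyDetector(nn.Module):
    def __init__(self, num_classes: int = 3):  # e.g. pedestrian / face / vehicle
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.head = nn.Linear(32, num_classes)

    def forward(self, x):
        feats = self.backbone(x).flatten(1)  # learned features, no manual rules
        return self.head(feats)

model = TinyDetector()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2)
criterion = nn.CrossEntropyLoss()

# One toy training step on random data: forward pass, loss, back-propagation.
images = torch.randn(8, 3, 64, 64)
labels = torch.randint(0, 3, (8,))
optimizer.zero_grad()
loss = criterion(model(images), labels)
loss.backward()
optimizer.step()
```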

Compared with the previous generation, the biggest difference of Software 2.0 is that neural-network design replaces the hand-written code of the past. The amount of hand-written code, and the corresponding demand on software engineers, therefore shrinks proportionally, while the scale of the networks keeps growing. For autonomous driving, most of today's perception, sometimes called "big perception" or "broad perception", is data-driven. Beyond perception, localization fusion, map fusion, and planning and control are also gradually shifting from rule-based, hand-coded Software 1.0 solutions toward data-driven ones.

End-to-end model training inspired by GPT

Since last year, a new general-purpose AI revolution has taken place: the various GPT-style models built on large-scale training. The GPT recipe differs somewhat from earlier deep-learning practice in that it is completed in three stages: pre-training on massive data, supervised learning on a small amount of labeled data, and then reinforcement learning. Mapping this onto the autonomous driving system, we can see:


First, train the backbone of the whole network through large-scale pre-training;

Then complete supervised training of each subtask module;

Learn human driving behavior through imitation;

Add reinforcement learning to correct the driving policy's decision-making;

Finally, this yields an end-to-end training method and, with it, an end-to-end autonomous driving model.

Training, however, is conducted in stages rather than at full scale right from the start, roughly as sketched below. Driven by Software 2.0, the architecture of the entire autonomous driving algorithm has also changed substantially, spanning perception, localization fusion, planning and control, and the modular design itself.
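As a rough, hypothetical sketch of this staged recipe (the toy model, data, and losses are placeholders, not Horizon's actual pipeline), the four stages could be chained like this:

```python
# Hypothetical staged-training sketch; the model, data and losses are
# stand-ins used only to show the ordering of the stages.
import torch
import torch.nn as nn

def run_stage(model, batches, loss_fn, lr):
    """Generic loop reused by every stage: forward, loss, back-propagation."""
    opt = torch.optim.AdamW(model.parameters(), lr=lr)
    for batch in batches:
        loss = loss_fn(model, batch)
        opt.zero_grad()
        loss.backward()
        opt.step()

policy = nn.Sequential(nn.Linear(64, 64), nn.ReLU(), nn.Linear(64, 2))
toy_batches = [torch.randn(8, 64) for _ in range(4)]   # stand-in data
toy_loss = lambda m, x: m(x).pow(2).mean()             # stand-in loss

# Stage 1: large-scale pre-training of the shared backbone.
run_stage(policy, toy_batches, toy_loss, lr=1e-3)
# Stage 2: supervised training of each subtask module.
run_stage(policy, toy_batches, toy_loss, lr=1e-4)
# Stage 3: imitation learning on human driving trajectories.
run_stage(policy, toy_batches, toy_loss, lr=1e-4)
# Stage 4: reinforcement learning to correct the policy's decisions.
run_stage(policy, toy_batches, toy_loss, lr=1e-5)
```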

At present, the industry has reached a consensus on using deep learning to form an end-to-end pipeline: whether the input is a camera, a radar, a map, or other signals such as navigation, it can be tokenized through an encoding step. A convolutional neural network, for example, can be regarded as one kind of encoder; each sensor encodes its data into the desired representation. Various control commands and signals can be encoded as well, for example through map-format conversion, and the information ultimately forms a complete set of tokens that is passed to the cognition and decision-making layer. The main network of the model can be a Transformer or something similar, and the final signal is generated directly through a decoding layer and handed to the vehicle's actuators. In the past year, colleagues from Horizon published a first-authored CVPR paper on implementing an end-to-end deep-learning algorithm for autonomous driving based on a Transformer framework; the architecture it describes is the one outlined above. This kind of architecture offers both interpretability and the benefits of a fully end-to-end design, and in public experiments it has shown good potential and performance.
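A highly simplified sketch of this "encode every modality into tokens, fuse with a Transformer, decode to control" idea is shown below. The module names, token shapes, and two-layer fusion depth are assumptions made for illustration; they are not the published architecture.

```python
# Illustrative tokenize-and-fuse sketch; every shape and module is assumed.
import torch
import torch.nn as nn

class TokenFusionDriver(nn.Module):
    def __init__(self, d_model: int = 128):
        super().__init__()
        # Per-modality encoders produce tokens in a shared embedding space.
        self.cam_encoder = nn.Sequential(
            nn.Conv2d(3, d_model, kernel_size=8, stride=8), nn.Flatten(2))
        self.radar_encoder = nn.Linear(4, d_model)   # (x, y, vx, vy) points
        self.nav_encoder = nn.Linear(2, d_model)     # e.g. route command
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.fusion = nn.TransformerEncoder(layer, num_layers=2)
        self.decoder = nn.Linear(d_model, 2)         # steering, acceleration

    def forward(self, image, radar_pts, nav_cmd):
        cam_tokens = self.cam_encoder(image).transpose(1, 2)  # (B, N, d)
        radar_tokens = self.radar_encoder(radar_pts)          # (B, M, d)
        nav_token = self.nav_encoder(nav_cmd).unsqueeze(1)    # (B, 1, d)
        tokens = torch.cat([cam_tokens, radar_tokens, nav_token], dim=1)
        fused = self.fusion(tokens)
        # Decode the fused navigation token into a control signal.
        return self.decoder(fused[:, -1])

model = TokenFusionDriver()
ctrl = model(torch.randn(1, 3, 128, 128),
             torch.randn(1, 20, 4),
             torch.randn(1, 2))
print(ctrl.shape)  # torch.Size([1, 2])
```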

The paper includes an interesting example: although traffic lights and other traffic rules were never explicitly given during training, after large-scale training the car can start and stop according to the traffic-light state. Interestingly, this information is not explicitly present in the training inputs; it is carried implicitly in the data annotations.

In fact, the large model can automatically acquire common-sense knowledge of the scene through pre-training and inference. You may ask how large such an architecture actually is. Today's autonomous driving models are still much smaller than common large language models: for a GPT-class language model to achieve good results, training data on the order of several terabytes is needed. As computing power grows and computing efficiency improves, results will keep improving. Currently, Transformer-based training for autonomous driving starts at the terabyte level, roughly 10 T to 20 T, and may ultimately require hundreds of T.

In the future, networks will become larger and larger, and all of this relies on hardware infrastructure. In the cloud, large-scale compute requirements can be met with parallel computing clusters. On the vehicle side, however, constraints such as board area, heat dissipation, and power consumption mean this compute may have to come from a single chip or a dual-chip setup, so the compute and efficiency requirements on a single on-board chip are very high. As the demand for large compute grows, it becomes clear that the biggest architectural difference between convolutional neural networks and Transformers lies in how bandwidth is allocated.

For a convolutional neural network, the typical ratio of bandwidth to compute is roughly 1:100 to 1:1000; for a Transformer-style architecture, the ratio of bandwidth demand to compute demand is usually around 1:1 to 1:10. In future architectures, chip bandwidth may therefore become the new core bottleneck. From Journey 5 to Journey 6, on-chip bandwidth and the bandwidth-to-compute ratio have been greatly improved, which better supports larger models such as BEV plus Transformer. As for BEV perception, compared with the full end-to-end approach just described, it is the most important perception algorithm that can actually be deployed on a mass-produced computing platform. In the past, we first detected targets in the 2D image and then projected them into 3D through the camera model. The advantage of that approach is that the computation is very intuitive; its drawback is that the projection step is hand-written software, so the pipeline cannot be made end to end. The biggest difference between BEV and this traditional solution is that BEV sees the whole scene from a bird's-eye, God's-eye perspective, giving better perception and prediction of the global state and more global awareness.
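As a back-of-the-envelope illustration of why the ratios differ (the layer sizes below are made up for the example, and real figures depend heavily on tiling, caching, and batch size), one can compare bytes moved against multiply-accumulates for a convolution layer versus a Transformer linear layer that processes only a few tokens:

```python
# Rough arithmetic-intensity comparison; all layer dimensions are invented
# examples, used only to show why Transformers are more bandwidth-hungry.
BYTES = 2  # fp16

def conv_ratio(h, w, cin, cout, k):
    """MACs per byte for one conv layer (weights + in/out activations)."""
    macs = h * w * cin * cout * k * k
    traffic = BYTES * (cin * cout * k * k + h * w * (cin + cout))
    return macs / traffic

def transformer_linear_ratio(tokens, d_in, d_out):
    """MACs per byte for a Transformer linear layer on a short token batch."""
    macs = tokens * d_in * d_out
    traffic = BYTES * (d_in * d_out + tokens * (d_in + d_out))
    return macs / traffic

# A conv layer reuses each weight across the whole feature map ...
print(f"conv        bandwidth:compute ~ 1:{conv_ratio(64, 64, 128, 128, 3):.0f}")
# ... while a Transformer layer on few tokens streams huge weight matrices
# with little reuse, so bandwidth demand sits much closer to compute demand.
print(f"transformer bandwidth:compute ~ 1:{transformer_linear_ratio(16, 1024, 4096):.0f}")
```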

With BEV-based multi-modal early and mid fusion, it becomes easier to fuse multi-modal sensors. Cameras at different angles can be encoded by a network and then projected into the BEV view after encoding. LiDAR naturally lives in a 3D space, so LiDAR features can be formed in 3D and then aligned with the camera features at the feature level and concatenated there, forming the multi-modal representation.

A similar technique can also be used for ultrasonic and millimeter-wave radar: their signals are encoded into the BEV space and then processed to produce the final perception result, as sketched below. This mid-fusion approach makes multi-modal sensor fusion straightforward; compared with late fusion, the overall architecture is simpler and easier to train.
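A minimal sketch of this feature-level mid fusion is shown below, assuming the camera and LiDAR branches have already been projected onto the same BEV grid; the channel counts, grid size, and occupancy head are illustrative assumptions.

```python
# Illustrative mid-fusion sketch: per-modality BEV feature maps aligned on a
# common grid are concatenated channel-wise; all shapes are assumptions.
import torch
import torch.nn as nn

class BEVMidFusion(nn.Module):
    def __init__(self, cam_ch=64, lidar_ch=64, out_ch=128):
        super().__init__()
        # Fuse concatenated camera + lidar BEV features into one representation.
        self.fuse = nn.Sequential(
            nn.Conv2d(cam_ch + lidar_ch, out_ch, 3, padding=1), nn.ReLU(),
            nn.Conv2d(out_ch, out_ch, 3, padding=1), nn.ReLU())
        self.head = nn.Conv2d(out_ch, 1, 1)  # e.g. BEV occupancy logits

    def forward(self, cam_bev, lidar_bev):
        # Both inputs are assumed to already live on the same BEV grid (H, W),
        # so fusion reduces to channel concatenation plus a few conv layers.
        fused = self.fuse(torch.cat([cam_bev, lidar_bev], dim=1))
        return self.head(fused)

model = BEVMidFusion()
occ = model(torch.randn(1, 64, 128, 128), torch.randn(1, 64, 128, 128))
print(occ.shape)  # torch.Size([1, 1, 128, 128])
```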


BEV perception based on Journey 5


On Journey 5, we have implemented a BEV-based spatio-temporal fusion framework. Besides fusing across space and across modalities, it also fuses across time, bringing multiple cameras, multiple sensor types, and the temporal dimension into one framework. The input layer covers different sensors, such as forward-view cameras, surround-view cameras, fisheye cameras, and lidar.

Images are encoded by the BEV model and projected into the BEV space, and the radar branch works in the same way. These features are then expressed jointly through spatio-temporal conversion and finally synthesized by a neural-network and Transformer architecture that feeds the output layer directly. The outputs include 3D detections, object tracking states and trajectories, lane-line targets, static obstacles and parking spaces, the occupancy network, and full 3D objects. The whole end-to-end system can output everything from perceived target detection to prediction and trajectories. Many of these are results from our actual scenarios, all tested on real vehicles.
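To make the multi-task output stage concrete, here is a hypothetical sketch (not Horizon's actual Journey 5 stack) in which one fused spatio-temporal BEV feature map feeds several lightweight heads for detection, lanes, occupancy, and trajectory prediction:

```python
# Hypothetical multi-task BEV output sketch; channel counts, head designs and
# the six-waypoint trajectory are illustrative assumptions.
import torch
import torch.nn as nn

class BEVMultiTaskHeads(nn.Module):
    def __init__(self, in_ch=128):
        super().__init__()
        self.detection  = nn.Conv2d(in_ch, 7, 1)   # per-cell box params + score
        self.lane       = nn.Conv2d(in_ch, 1, 1)   # lane-line heatmap
        self.occupancy  = nn.Conv2d(in_ch, 1, 1)   # occupancy-grid logits
        self.trajectory = nn.Linear(in_ch, 2 * 6)  # 6 future (x, y) waypoints

    def forward(self, bev_feat):                   # bev_feat: (B, C, H, W)
        pooled = bev_feat.mean(dim=(2, 3))         # global context for prediction
        return {
            "detection": self.detection(bev_feat),
            "lanes": self.lane(bev_feat),
            "occupancy": self.occupancy(bev_feat),
            "trajectory": self.trajectory(pooled).view(-1, 6, 2),
        }

heads = BEVMultiTaskHeads()
out = heads(torch.randn(1, 128, 128, 128))
print({k: tuple(v.shape) for k, v in out.items()})
```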
