What are the key technologies of end-to-end autonomous driving systems?

Publisher: PositiveVibes · Last updated: 2024-06-03 · Source: 智驾最前沿
With the development of generative artificial intelligence, represented by ChatGPT, end-to-end autonomous driving systems have received widespread attention and are expected to bring revolutionary breakthroughs to driving intelligence in general scenarios. End-to-end systems, in which all modules are implemented as neural networks, depend little on hand-crafted expert rules, offer high functional integration and strong real-time performance, and show emergent intelligence and cross-scenario generalization potential. They are an important path toward data-driven, self-evolving driving capability.


Recently, a paper by Li Shengbo and other scholars from Tsinghua University discussed the key technologies and development trends of end-to-end autonomous driving systems for automobiles. The paper reviews the technical status of generative artificial intelligence, surveys the key technologies of end-to-end autonomous driving and the current state of such systems, and outlines the technical challenges in integrating generative artificial intelligence with autonomous driving.



The current state of the art in generative AI


Data, computing power, and algorithms are the three pillars of large-model development, and algorithms are the core technical embodiment of large models. Most existing large models are based on the Transformer architecture, use a "pre-training + fine-tuning" paradigm to learn parameters adapted to specific tasks in different fields, and are pruned and compressed before final deployment. This section introduces the key technologies of large models in four aspects: network architecture, pre-training, fine-tuning, and pruning and compression.

1.1 Architecture design of neural network


Large models emerged from the development of deep neural networks during the deep-learning wave. Deeper networks have stronger learning and modeling capabilities, which improves model performance.

In 2017, Google proposed the Transformer network architecture (Figure 1), which greatly improved the expressiveness of neural networks and excelled in fields such as CV and NLP. The Transformer has since become one of the basic building blocks of large models. It is an encoder-decoder architecture with the attention mechanism at its core; its main components are attention, positional encoding, residual connections, and layer normalization. Transformers are widely used in large models across NLP, CV, RL, and other fields.
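As an illustration of the attention mechanism at the heart of the Transformer (a minimal pure-Python sketch, not code from the paper), the following computes scaled dot-product attention, softmax(QK^T / sqrt(d_k))V, for small lists of vectors:

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of floats.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V.
    Q, K, V are lists of d_k-dimensional vectors (lists of floats)."""
    d_k = len(K[0])
    out = []
    for q in Q:
        # Similarity of this query to every key, scaled by sqrt(d_k).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in K]
        weights = softmax(scores)
        # Weighted sum of the value vectors.
        out.append([sum(w * v[j] for w, v in zip(weights, V)) for j in range(len(V[0]))])
    return out

# One query attending over two key/value pairs: the query is closer to the
# first key, so the output is pulled toward the first value vector.
Q = [[1.0, 0.0]]
K = [[1.0, 0.0], [0.0, 1.0]]
V = [[10.0, 0.0], [0.0, 10.0]]
print(attention(Q, K, V))
```

A real Transformer layer applies this over learned linear projections of the input, with multiple heads in parallel, followed by residual connection and layer normalization.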


Figure 1 Transformer network structure


1.2 Pre-training and fine-tuning techniques


Pre-training is the key step by which large models acquire general knowledge and converge faster in the fine-tuning stage. By sequence-modeling approach, language models can be divided into autoregressive and autoencoder language models (Figure 2). An autoregressive language model uses the Transformer's decoder to predict the next word from the preceding text, modeling the joint probability of the sequence in a unidirectional manner. An autoencoder language model uses the Transformer's encoder to model the joint probability of the sequence bidirectionally by predicting masked words within the sequence.
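The practical difference between the two modeling styles comes down to the attention mask. A minimal sketch (an illustration, not from the paper): an autoregressive decoder uses a causal, lower-triangular mask so each token attends only to earlier positions, while an autoencoder-style encoder uses a full bidirectional mask:

```python
def causal_mask(n):
    # Autoregressive (decoder-style): token i may attend only to positions j <= i.
    return [[1 if j <= i else 0 for j in range(n)] for i in range(n)]

def bidirectional_mask(n):
    # Autoencoder (encoder-style): every token attends to every position.
    return [[1] * n for _ in range(n)]

for row in causal_mask(4):
    print(row)
# The causal mask is lower-triangular; the bidirectional mask is all ones.
```

During attention, masked-out positions (0 entries) have their scores set to negative infinity before the softmax, so they receive zero weight.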


Figure 2 Schematic diagram of two types of language models


Fine-tuning refers to adjusting a pre-trained large model on downstream tasks to make it better suited to a specific task; the fine-tuned model's downstream performance is usually much better than the pre-trained model's. As model sizes keep growing, fine-tuning all parameters becomes very difficult, so a variety of efficient fine-tuning methods have emerged in recent years, including vanilla fine-tuning, prompt tuning, and Reinforcement Learning from Human Feedback (RLHF) (Figure 3).
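To give a feel for why parameter-efficient methods such as prompt tuning matter, the back-of-the-envelope sketch below (with hypothetical model sizes, not figures from the paper) compares the number of trainable parameters in full fine-tuning against tuning only a short soft prompt:

```python
# Hypothetical sizes, for illustration only.
d_model, n_layers, vocab = 4096, 32, 50000

# Rough Transformer parameter estimate: ~12 * d_model^2 per layer
# (attention + feed-forward) plus the token-embedding table.
backbone_params = n_layers * 12 * d_model ** 2 + vocab * d_model

# In prompt tuning, the soft-prompt embeddings are the ONLY trainable weights;
# the backbone stays frozen.
prompt_tokens = 20
prompt_params = prompt_tokens * d_model

print(f"full fine-tuning updates ~{backbone_params:,} parameters")
print(f"prompt tuning updates only {prompt_params:,} "
      f"({prompt_params / backbone_params:.6%} of the backbone)")
```

Even with generous rounding, tuning a 20-token soft prompt touches orders of magnitude fewer parameters than updating the full backbone, which is what makes it feasible at scale.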


Figure 3 Schematic diagram of three fine-tuning methods


1.3 Model Pruning and Compression


A trained large model must be deployed on systems with limited computing power and memory. It therefore needs to be pruned and compressed to remove redundant structure and information, so that it can run fast inference on limited computing resources with minimal loss of accuracy. The main compression methods for large models are model pruning, knowledge distillation, and quantization.
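As a concrete example of one of these compression methods, the sketch below (an illustration, not code from the paper) performs symmetric per-tensor int8 quantization, mapping float weights onto integers in [-127, 127] with a single scale factor:

```python
def quantize_int8(weights):
    """Symmetric per-tensor int8 quantization: w ~= scale * q, q in [-127, 127]."""
    scale = max(abs(w) for w in weights) / 127.0
    q = [round(w / scale) for w in weights]
    return q, scale

def dequantize(q, scale):
    # Recover approximate float weights from the integer codes.
    return [scale * qi for qi in q]

w = [0.5, -1.27, 0.03, 1.0]
q, s = quantize_int8(w)
w_hat = dequantize(q, s)
err = max(abs(a - b) for a, b in zip(w, w_hat))
print(q, round(err, 6))
```

Each weight is now stored in one byte instead of four, at the cost of a rounding error bounded by half the scale; pruning and distillation reduce size along different axes (structure and model capacity, respectively).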


Key technologies for end-to-end autonomous driving


The key to integrating artificial intelligence technology with autonomous driving is to establish an online closed-loop iteration path, with vehicle-cloud collaboration at its core, linking edge-scenario data collection to autonomous driving model training. Figure 4 shows the development plan for a vehicle-cloud collaborative autonomous driving large model: a fleet of connected vehicles performs crowdsourced data collection, and the data are uploaded to the cloud control computing platform after cleaning and screening; the ample computing power of the cloud platform is used to generate massive amounts of simulated driving data; virtual and real data are fused for scene construction, and the autonomous driving large model is optimized online through self-supervised learning, reinforcement learning, adversarial learning, and other methods; finally, the learned large model is pruned and compressed into an automotive-grade real-time model and delivered to the on-board chip via OTA updates, completing self-evolutionary learning of the vehicle-side driving policy.
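The closed loop described above can be sketched as a pipeline of stages. All function names and bodies below are purely illustrative placeholders (the real platform components are not public):

```python
# Purely illustrative stage names and toy bodies; each stands in for a
# substantial subsystem of the vehicle-cloud collaboration loop.

def collect_fleet_data(raw):
    # Crowdsourced collection followed by cleaning/screening (drop bad records).
    return [x for x in raw if x is not None]

def simulate(seed_scenes):
    # Cloud-side generation of synthetic scenes to augment the real ones.
    return seed_scenes + [f"sim_{s}" for s in seed_scenes]

def train_large_model(scenes):
    # Stand-in for self-supervised / reinforcement / adversarial learning.
    return {"weights": len(scenes)}

def prune_and_compress(model):
    # Shrink to an automotive-grade real-time model.
    return {"weights": model["weights"] // 2}

def ota_deploy(model):
    # Push the compressed model to the on-board chip via OTA.
    return f"deployed model with {model['weights']} units"

raw = ["cut_in", None, "jaywalk"]
print(ota_deploy(prune_and_compress(train_large_model(simulate(collect_fleet_data(raw))))))
```

The point of the sketch is the data flow: real fleet data seeds simulation, the combined data trains the cloud model, and only a compressed derivative ever reaches the vehicle.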


Figure 4: Development of large-scale autonomous driving models with vehicle-cloud collaboration


Specific research contents include:

(1) Basic theory of large models for autonomous driving;

(2) General basic model of autonomous driving perception and cognition;

(3) General basic model for autonomous driving decision-making and control;

(4) Autonomous driving big data collection, generation and automatic labeling;

(5) Continual evolution of the vehicle-cloud collaborative foundation model;

(6) An independently controllable toolchain and platform for integrated vehicle-side deployment.


Technical development trends of end-to-end autonomous driving


With the continued development of large-model technology, models represented by ChatGPT have shown striking results. Large models have seen initial adoption in many industrial applications and are expected to become a new growth engine for the real economy.

3.1 Large models for perception


The perception module of an autonomous driving system uses sensor data to generate perception results of the driving environment in real time. Large perception models are one of the core drivers of improved autonomous driving capability: they can recognize and understand roads, traffic signs, pedestrians, vehicles, and other information, providing the environmental awareness that downstream decision-making depends on.

At present, there are relevant applications in autonomous driving perception, such as Baidu Wenxin UFO 2.0 visual large model, Huawei Pangu CV and SenseTime's INTERN large model.

Bird's Eye View (BEV) is one of the current mainstream perception solutions. It converts the perception information of multiple sensors such as cameras and radars into a bird's-eye view, and completes multiple perception tasks such as target detection, image segmentation, tracking and prediction in parallel, as shown in Figure 5. Typical work includes Tesla's BEV perception, Baidu's UniBEV and SenseTime's FastBEV.
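A core operation behind any BEV representation is mapping points from the ego vehicle's frame into a discretized top-down grid. The sketch below is illustrative only (real BEV networks learn this projection from multi-camera features); it shows a simple geometric mapping for a grid centred on the ego vehicle:

```python
def to_bev_cell(x, y, grid_size=200, resolution=0.5):
    """Map an ego-frame point (x forward, y left, in metres) to a BEV grid cell.
    The ego vehicle sits at the centre of a grid_size x grid_size grid,
    each cell covering resolution x resolution metres."""
    row = int(grid_size / 2 - x / resolution)  # forward -> toward smaller row index
    col = int(grid_size / 2 - y / resolution)  # left   -> toward smaller col index
    if 0 <= row < grid_size and 0 <= col < grid_size:
        return row, col
    return None  # outside the mapped area

print(to_bev_cell(10.0, 0.0))   # a car 10 m straight ahead
print(to_bev_cell(0.0, -4.0))   # an object 4 m to the right
```

Once detections from all sensors land in this shared grid, tasks such as detection, segmentation, tracking, and prediction can operate on one unified representation, which is the appeal of the BEV formulation.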


Figure 5 Bird’s-eye view perception process


3.2 Large models for prediction


Prediction is a key component of autonomous driving. It mainly means predicting the future motion of surrounding traffic participants, also known as trajectory prediction. Trajectory prediction combines road structure, historical trajectories, and interactions with other traffic participants, and outputs one or more possible future trajectories for downstream decision-making and control. Data-driven trajectory prediction methods usually adopt an encoder-decoder architecture comprising information representation, scene encoding, and multimodal decoding. Representative works include Google's Wayformer, Tsinghua's SEPT, and Haomo.ai's DriveGPT.
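Before reaching for a learned encoder-decoder model, the standard baseline for trajectory prediction is constant-velocity extrapolation, implemented in the sketch below (an illustration, not a method from the paper):

```python
def constant_velocity_predict(history, horizon, dt=0.1):
    """Naive trajectory-prediction baseline: extrapolate the last observed velocity.
    history: list of (x, y) positions sampled every dt seconds.
    Returns `horizon` future (x, y) positions."""
    (x0, y0), (x1, y1) = history[-2], history[-1]
    vx, vy = (x1 - x0) / dt, (y1 - y0) / dt  # finite-difference velocity
    return [(x1 + vx * dt * k, y1 + vy * dt * k) for k in range(1, horizon + 1)]

hist = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]  # moving 1 m per step along x
print(constant_velocity_predict(hist, horizon=3))
```

Learned models earn their keep precisely where this baseline fails: turns, interactions with other agents, and multimodal futures (e.g. a vehicle that may go straight or turn), which is why decoders output several candidate trajectories with probabilities.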

3.3 Large models for decision-making and control


Autonomous decision-making and motion control are core functions of autonomous driving, and their quality largely determines how intelligent an autonomous vehicle is. Technical solutions for decision-making and control have passed through three stages of development: expert rules, imitation learning, and brain-inspired learning. The goal of a large decision-making and control model is to build a general foundation-model training algorithm that combines data-driven methods, represented by deep learning and reinforcement learning, with knowledge guidance, providing a path to breakthroughs in driving intelligence.

At present, the industry still lacks a large integrated decision-making and control model for autonomous driving. The integrated decision and control (IDC) architecture proposed by Tsinghua University formulates decision-making and control as a unified constrained optimal control problem and uses data-driven algorithms to solve for the value model and policy model. It takes environmental perception results as input and directly outputs control commands such as throttle, braking, and steering. IDC offers high online computing efficiency, strong interpretability, and no need for manual data annotation, and it can predict the next action in an autoregressive manner, laying the foundation for applying large models to autonomous driving decision-making and control. Figure 6 contrasts the traditional expert-designed hierarchical architecture with the integrated decision-making and control architecture.
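To make the "unified constrained optimal control" idea concrete, the toy sketch below runs one step of a receding-horizon controller on a 1-D point mass, exhaustively searching short acceleration sequences and applying only the first action of the best one. This is a minimal stand-in under simplifying assumptions, not the IDC algorithm itself:

```python
from itertools import product

def receding_horizon_step(pos, vel, target, dt=0.1, horizon=3, actions=(-2.0, 0.0, 2.0)):
    """One step of a toy receding-horizon controller for a 1-D point mass.
    Enumerate all short acceleration sequences, score each by tracking error
    plus a small control-effort penalty, and return the FIRST action of the
    best sequence (the receding-horizon principle)."""
    best_cost, best_first = float("inf"), 0.0
    for seq in product(actions, repeat=horizon):
        p, v, cost = pos, vel, 0.0
        for a in seq:
            v += a * dt            # simple Euler-integrated dynamics
            p += v * dt
            cost += (p - target) ** 2 + 0.01 * a ** 2
        if cost < best_cost:
            best_cost, best_first = cost, seq[0]
    return best_first

print(receding_horizon_step(pos=0.0, vel=0.0, target=5.0))  # accelerates toward target
```

Real systems replace the exhaustive search with optimization or, as in IDC, with learned value and policy models, but the structure is the same: optimize over a horizon, act on the first command, and repeat at the next time step.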

