The application of World Models to autonomous driving is surveyed in the paper "World Models for Autonomous Driving: An Initial Survey". A World Model serves two main functions: first, to generate large volumes of near-realistic, diverse training video data, including corner cases, at low cost; second, to use reinforcement learning to achieve end-to-end driving, outputting driving decisions directly from video.
Research on World Models began around 2015, the framework was largely settled by 2018, and the concept became widely popular online at the end of 2023. The underlying theory comes mainly from Google, with a smaller share from Meta.
Tesla and Wayve both unveiled World Models last year; the World Model is a prerequisite for end-to-end driving, which requires massive data covering as many corner cases as possible. At present, the value of intelligent-driving datasets is extremely low, and publicly available data falls into two types:
One is normal driving under simple conditions: monotonous and lacking diversity. This type accounts for about 90% of public data, and its effective yield is only about one in ten thousand to one in a hundred thousand. Tesla's shadow mode is a case in point; Musk has admitted that the value of such data is very low, roughly one in ten thousand, and in practice even lower.
The other type is accident data, i.e., demonstrations of error. End-to-end training on such data either adapts to only very limited conditions or learns the mistakes. End-to-end is a complete black box: it cannot be explained and offers no certainty, only correlation. That is why the data must be as diverse and high-quality as possible, so that the training results have a chance of being better.
End-to-end must first solve the data problem. Collecting everything from the real world is impractical: the cost is extremely high, the efficiency extremely low, and diversity is lacking. Interaction is also missing, such as the interaction between the ego vehicle, other vehicles, and the environment, which would require expensive manual annotation to reconstruct. The World Model is therefore introduced to synthesize massive, diverse data artificially, without manual annotation and at very low cost.
ChatGPT has greatly inspired the autonomous driving industry: it is trained on massive unlabeled data at low cost and interacts with humans by answering questions. Autonomous driving imitates this interaction: the input "question" is the environment, and the output "answer" is a driving decision. This model is the World Model.
The world model is divided into three parts: perception, memory and action.
When the word "world" or "environment" appears in AI, it is usually to distinguish it from the agent. The fields that study agents most intensively are reinforcement learning and robotics, so "world models" and "world modeling" appeared first, and most often, in robotics papers. Today the term owes most of its influence to the paper David Ha and Jürgen Schmidhuber posted on arXiv in 2018 under the title "World Models"; it was eventually published at NeurIPS 2018 as "Recurrent World Models Facilitate Policy Evolution".
The paper does not define what a world model is; instead, citing a 1971 paper, it draws an analogy with the mental model of the human brain in cognitive science. Humans perceive and understand the world through limited senses, and the decisions and behaviors we make are actually based on the models we have built internally. Learning to drive, and driving itself, are behaviors continuously corrected by this model. The model lets the brain make future time-series decisions from the information coming in through the eyes and ears, which the hands and feet then execute. Whether you are an experienced driver or a novice, the brain's driving model can predict what scene will appear after the hands and feet execute a decision, or what scene the decision will bring about, and the model has already responded to it.

This is consistent with autonomous driving, because autonomous driving is a sequence-to-sequence mapping. The input is a sensor signal sequence, which may include multiple camera videos, Lidar point clouds, and other information such as GPS and IMU. The output is a driving decision sequence over a future time window, such as a sequence of driving actions, or a trajectory sequence that is converted into operating actions. This is essentially the same as most AI tasks, and the mapping is equivalent to a function y = f(x). Traditional autonomous driving decomposes this function into many sub-functions, while end-to-end has only one.
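To make the mapping concrete, here is a minimal sketch in Python; the container types and field names are purely illustrative assumptions, not taken from any real driving stack:

```python
from dataclasses import dataclass
from typing import List, Tuple

# Hypothetical input/output types for the mapping y = f(x); the fields
# mirror the sensors listed above (cameras, Lidar, GPS, IMU).
@dataclass
class SensorFrame:
    camera_images: List[bytes]                      # one frame per camera
    lidar_points: List[Tuple[float, float, float]]  # point cloud (x, y, z)
    gps: Tuple[float, float]                        # (latitude, longitude)
    imu: Tuple[float, ...]                          # accelerations, angular rates

@dataclass
class DrivingAction:
    steering: float                                 # steering angle, rad
    throttle: float                                 # 0..1
    brake: float                                    # 0..1

def f(history: List[SensorFrame]) -> List[DrivingAction]:
    """End-to-end driving as one learned sequence-to-sequence function.
    A traditional stack decomposes this into perception, prediction, and
    planning sub-functions; end-to-end learns it as a single model."""
    raise NotImplementedError
```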
The framework diagram consists of three main modules: the Vision Model (V), the Memory RNN (M), and the Controller (C). The Vision Model (V) learns representations of visual observations. The method used is a VAE (Variational Autoencoder), which converts the input video (still images in the early work) into features. After the rise of the Transformer, the conversion target became tokens, and this module became a tokenizer.
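As a rough illustration of the V module, below is a minimal convolutional VAE encoder in PyTorch, using the 64x64 input and 32-dimensional latent of the original 2018 paper; the code itself is a sketch, not the authors' implementation:

```python
import torch
import torch.nn as nn

class ConvVAEEncoder(nn.Module):
    """Compresses a 64x64 RGB observation into a latent vector z."""
    def __init__(self, z_dim: int = 32):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(3, 32, 4, stride=2), nn.ReLU(),    # 64x64 -> 31x31
            nn.Conv2d(32, 64, 4, stride=2), nn.ReLU(),   # 31x31 -> 14x14
            nn.Conv2d(64, 128, 4, stride=2), nn.ReLU(),  # 14x14 -> 6x6
            nn.Conv2d(128, 256, 4, stride=2), nn.ReLU(), # 6x6   -> 2x2
        )
        self.mu = nn.Linear(256 * 2 * 2, z_dim)
        self.log_var = nn.Linear(256 * 2 * 2, z_dim)

    def forward(self, x: torch.Tensor):
        h = self.conv(x).flatten(1)
        mu, log_var = self.mu(h), self.log_var(h)
        # Reparameterization trick: sample z while keeping gradients
        z = mu + (0.5 * log_var).exp() * torch.randn_like(mu)
        return z, mu, log_var
```

A tokenizer plays the same role at larger scale: instead of one continuous vector, the frame is quantized into a sequence of discrete tokens that a Transformer can consume.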
Image source: Paper "World Models for Autonomous Driving: An Initial Survey"
The picture above shows the core of the world model, which uses an MDN (Mixture Density Network). The MDN is very old; indeed, the basic mathematics underlying AI was essentially complete by the 1940s, and today's AI is just the application of that mathematics, so in this sense humanity has made little fundamental progress in the past hundred years. As early as 1994, Christopher M. Bishop proposed the Mixture Density Network, which combines a conventional deep neural network with a Gaussian mixture model (GMM). Neural networks can fit any continuous function: by increasing the number and size of hidden layers, you obtain a powerful learner that can approximate quadratic or cubic functions, sines or cosines, to arbitrary precision. When we want the fitted function to produce multiple output values, we need the MDN, whose output is a continuous probability distribution, i.e., a Gaussian; by combining multiple Gaussians, we can in theory approximate any probability distribution.
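In Bishop's standard formulation (written here from the textbook definition, not quoted from the survey), the network predicts the mixture weights, means, and variances as functions of the input:

```latex
p(y \mid x) \;=\; \sum_{k=1}^{K} \pi_k(x)\,
\mathcal{N}\!\bigl(y \mid \mu_k(x), \sigma_k^2(x)\bigr),
\qquad \sum_{k=1}^{K} \pi_k(x) = 1, \quad \pi_k(x) \ge 0 .
```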
In the output part of the network, the MDN no longer uses a linear layer or a softmax to produce a point prediction. To introduce the uncertainty of the Gaussian model, each output is a Gaussian mixture distribution rather than a single value or a single Gaussian; the mixture solves the multi-valued mapping problem that a lone Gaussian cannot easily handle. Taking regression as an example, both the input and the output may be multi-dimensional vectors, and the probability density of the target value is expressed as a linear combination of kernel functions. This is actually quite close in spirit to reinforcement learning: the supervised learning we normally use outputs a single definite value and therefore requires labeled training data, whereas reinforcement learning works with probabilistic outputs and does not require labeled training data, greatly reducing data costs. The core capability of the World Model is counterfactual reasoning: even for decisions never seen in the data, the model can infer what their consequences would be. The RNN here is implemented in practice as an LSTM (Long Short-Term Memory network).
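A minimal sketch of such an MDN output head in PyTorch, trained by negative log-likelihood; the layer names and sizes are illustrative assumptions:

```python
import torch
import torch.nn as nn

class MDNHead(nn.Module):
    """Maps a feature vector to a K-component Gaussian mixture
    over a one-dimensional target (after Bishop, 1994)."""
    def __init__(self, in_dim: int, n_components: int = 5):
        super().__init__()
        self.pi = nn.Linear(in_dim, n_components)         # mixture logits
        self.mu = nn.Linear(in_dim, n_components)         # component means
        self.log_sigma = nn.Linear(in_dim, n_components)  # log std-devs

    def forward(self, h: torch.Tensor):
        return self.pi(h), self.mu(h), self.log_sigma(h)

def mdn_nll(pi_logits, mu, log_sigma, y):
    """Negative log-likelihood of targets y under the predicted mixture."""
    log_pi = torch.log_softmax(pi_logits, dim=-1)
    comp = torch.distributions.Normal(mu, log_sigma.exp())
    log_prob = comp.log_prob(y.unsqueeze(-1))   # per-component log-density
    return -torch.logsumexp(log_pi + log_prob, dim=-1).mean()
```

Because the head outputs a full distribution rather than a point estimate, sampling from it yields different plausible futures for the same input, which is what makes counterfactual rollouts possible.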
The last link is the Controller (C), which predicts the next action. Its design is deliberately simple; the intent is to shift the learning burden onto the earlier modules, which can be trained from data.
Features are extracted from the observation by V, passed through M to obtain the hidden state h, and then the features and the historical information are fed into C to produce an action. The action, through interaction with the environment, yields a new observation, and the loop continues. This maps directly onto GAIA-1 from Wayve, which is backed by Microsoft, SoftBank, and Nvidia: the VAE in the figure becomes a tokenizer with 0.3 billion parameters, the MDN-RNN evolves into a Transformer world model with 6.5 billion parameters, and C corresponds to GAIA-1's third stage, a video decoder with 2.6 billion parameters, for 9.4 billion parameters in total.
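The V -> M -> C loop described above reads naturally as code. Below is a self-contained sketch in Python; the four classes are placeholder stand-ins for the paper's VAE (V), MDN-RNN (M), controller (C), and environment, with dummy internals:

```python
import numpy as np

class V:                                   # vision model (stand-in for a VAE)
    def encode(self, obs):
        return obs[:32]                    # placeholder 32-d "latent"

class M:                                   # memory model (stand-in for MDN-RNN)
    def initial_state(self):
        return np.zeros(256)
    def step(self, z, action, h):
        return np.tanh(0.9 * h + z.sum() + action)  # placeholder dynamics

class C:                                   # controller
    def act(self, z, h):
        return float(np.sign(z.sum() + h.sum()))    # placeholder policy

class Env:                                 # environment
    def reset(self):
        return np.random.rand(64)
    def step(self, action):
        return np.random.rand(64), False   # (next observation, done)

env, v, m, c = Env(), V(), M(), C()
obs, h = env.reset(), m.initial_state()
for _ in range(10):
    z = v.encode(obs)                      # V: compress observation to latent
    a = c.act(z, h)                        # C: decide from latent + memory
    h = m.step(z, a, h)                    # M: update the memory state
    obs, done = env.step(a)                # act; environment returns new obs
    if done:
        break
```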
The framework of the world model has been settled since 2018; in 2019 it evolved further into the RSSM (Recurrent State-Space Model).
The RSSM combines determinism with randomness: a deterministic part keeps the model from behaving arbitrarily, and a stochastic part improves fault tolerance.
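A minimal sketch of one RSSM transition in PyTorch, following the general PlaNet/Dreamer recipe; the sizes and the single-layer prior are simplifying assumptions:

```python
import torch
import torch.nn as nn

class RSSMStep(nn.Module):
    """One step of a Recurrent State-Space Model: a deterministic GRU path
    plus a stochastic latent sampled from a predicted Gaussian prior."""
    def __init__(self, stoch: int = 30, deter: int = 200, action_dim: int = 2):
        super().__init__()
        self.cell = nn.GRUCell(stoch + action_dim, deter)  # deterministic path
        self.prior = nn.Linear(deter, 2 * stoch)           # -> (mean, log-std)

    def forward(self, prev_stoch, prev_action, prev_deter):
        x = torch.cat([prev_stoch, prev_action], dim=-1)
        deter = self.cell(x, prev_deter)                   # deterministic h_t
        mean, log_std = self.prior(deter).chunk(2, dim=-1)
        stoch = mean + log_std.exp() * torch.randn_like(mean)  # stochastic z_t
        return stoch, deter
```

The deterministic path carries information reliably across many steps, while the sampled latent absorbs the parts of the dynamics the model cannot predict exactly.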
In addition, JEPA has emerged. RSSM and JEPA are the core architectures of today's mainstream world models. JEPA was proposed by Meta in 2023 and currently exists in multiple versions. Its authors include Yann LeCun: the father of the CNN, a tenured professor at New York University, and, together with Geoffrey Hinton and Yoshua Bengio, one of the "three giants of deep learning". He formerly headed Facebook AI Research, has reviewed for IJCV, PAMI, and IEEE Transactions, and co-founded ICLR (the International Conference on Learning Representations), serving as co-chair with Yoshua Bengio.
The figure above shows the distribution of key World Model technology types from 2022 onward. Tesla's approach is probably closest to OccWorld, because Tesla's Occupancy Network is relatively strong; with Occupancy Network input, compute and storage requirements drop sharply, and HW4.0 can run it.
Of course, a World Model is not strictly required for end-to-end; other approaches exist. But the World Model is the only way to generate, at low cost, large amounts of near-realistic, diverse training video including corner cases. That is revolutionary on the data side, though not on the model-training side, where the black-box character is even stronger. Compared with the classic modular approach, it is still impossible to say which performs better; it is simply a different way of thinking. Autonomous driving still has a long way to go.