Tesla sets off an "end-to-end" storm as autonomous driving continues to evolve

Publisher: 平安幸福 · Last updated: 2024-02-22 · Source: HiEV大蒜粒车研所

Understand "end-to-end"


Recently, Tesla pushed the end-to-end FSD V12.1.2 Beta to users in the United States. After the release, Tesla owners and video bloggers overseas uploaded test videos. There is not much to say about the evaluation videos themselves; what really deserves attention is "end-to-end."


Since Musk first showed off end-to-end FSD, practitioners in the autonomous driving industry and consumers alike have been discussing end-to-end autonomous driving solutions with great enthusiasm, and companies such as Xiaopeng and Xiaomi Auto have also been developing "end-to-end" technology of their own.


So how should we understand Tesla FSD's end-to-end approach?


Understanding FSD's end-to-end model


We can understand Tesla FSD's end-to-end large model from several angles: structure, form, principle, and development paradigm.


Structurally, mainstream autonomous driving systems adopt a modular solution, dividing the AD system into perception, planning, and control. The system first perceives the surrounding dynamic and static traffic participants and the road network structure, then plans the ego vehicle's driving trajectory, and finally controls the vehicle in a closed loop through the actuators.


In the modular solution, clear interfaces are designed between modules, following human cognitive steps.


Tesla FSD's end-to-end large model removes the boundaries between perception and localization, decision-making and planning, and control and execution, merging the three major modules into one large neural network.


In terms of form, the modular solution's software combines hand-written code and neural networks, with hand-written code accounting for a large share, especially in planning and control, where most car companies still rely on rule-driven traditional algorithms and manual coding.


In contrast, Tesla FSD's end-to-end solution is implemented as a full-stack neural network: sensor data goes in, and steering, braking, and acceleration signals come out, with no hand-written rules in between.
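As a rough, hypothetical sketch (in PyTorch, with made-up layer sizes and camera resolution, since Tesla has not published its architecture), a sensor-in, control-out network might look like this:

```python
import torch
import torch.nn as nn

class EndToEndPolicy(nn.Module):
    """Toy sensor-in, control-out network: camera frames -> steering/brake/accel.
    All layer sizes are illustrative only; the real FSD architecture is not public."""
    def __init__(self):
        super().__init__()
        # "Perception": convolutional encoder over camera frames
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=5, stride=2), nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=5, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        # "Planning/control": fully connected head producing three control signals
        self.head = nn.Sequential(
            nn.Linear(32, 64), nn.ReLU(),
            nn.Linear(64, 3),   # [steering, brake, acceleration]
        )

    def forward(self, frames):
        return self.head(self.encoder(frames))

policy = EndToEndPolicy()
frames = torch.randn(1, 3, 224, 224)   # one fake RGB camera frame
controls = policy(frames)              # tensor of shape (1, 3)
print(controls)
```

The point is not the specific layers but the interface: raw sensor tensors in, actuator commands out, with every step in between learned rather than coded.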


Of course, many secrets remain hidden beneath the surface. FSD's "end-to-end full-stack neural network" may be partly a marketing statement; it does not necessarily mean there is no hand-written code anywhere in the autonomous driving software.


After all, Musk has always been prone to big claims about autonomous driving. When he first demonstrated end-to-end FSD last year, he claimed to have eliminated all of the more than 300,000 lines of code, but Ashok Elluswamy, head of the autonomous driving department sitting next to him, pointed out that more than 3,000 lines of C++ code still remain in FSD.


In terms of principle, the end-to-end large model is a compression of massive amounts of driving video clips.


Recently, Andrej Karpathy, former head of Tesla's autonomous driving department, made a popular-science video on LLMs. Karpathy said that, in essence, a generative GPT built on a large language model compresses internet-scale TB- or PB-level data into a GB-level parameter file.


By analogy, Tesla's end-to-end FSD can also be thought of as compressing the human driving knowledge contained in tens of millions of video clips into the parameters of the end-to-end neural network.


Perhaps we can get a closer analogy from humans themselves.


Think about our own lives: we have been blown by so much wind, drenched by so much rain, have tasted laughter, tears, joy, and pain again and again, and have lived through one sleepless night after another. Hasn't all of that been distilled through repeated experience and finally engraved into the neurons and synapses of our brains?


In terms of development paradigm, full-stack neural network-based FSD is a product of the software 2.0 era and is completely data-driven.


That is, once the neural network's number of layers, structure, activation functions, and loss function are fixed, the training data (its quality and scale) becomes the decisive factor determining the performance of the end-to-end network.
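A minimal sketch of what this data-driven setup means in practice, using toy tensors in place of real driving clips and a deliberately tiny network: once the architecture, loss, and optimizer are fixed, the only lever left is the data fed into the loop.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

# Toy stand-in for (camera frame, expert control) pairs mined from driving clips.
frames  = torch.randn(256, 3, 64, 64)   # illustrative shapes only
actions = torch.randn(256, 3)           # fake expert [steering, brake, accel]
loader  = DataLoader(TensorDataset(frames, actions), batch_size=32, shuffle=True)

# Fixed architecture and loss: from here on, only the data changes performance.
policy = nn.Sequential(
    nn.Conv2d(3, 8, kernel_size=5, stride=2), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(),
    nn.Linear(8, 3),
)
loss_fn   = nn.MSELoss()
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)

for epoch in range(3):                   # behavior cloning on the "clips"
    for x, y in loader:
        optimizer.zero_grad()
        loss = loss_fn(policy(x), y)
        loss.backward()                  # gradients flow end to end
        optimizer.step()
```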


The modular solution sits between Software 1.0 and 2.0: apart from the parts that use neural networks, the hand-coded parts still depend on the quality of the designed rules and the performance of traditional algorithms.


By this point, everyone should have a general idea of what end-to-end means. Next, still looking at structure, form, principle, and development paradigm, let's talk about its advantages and disadvantages.


End-to-end pros and cons


Tesla overturned the development, simulation, testing, and iteration methods used under the modular solution, rebuilt its toolchain, and collected and curated a huge number of training video clips, absorbing large sunk costs and investing substantial new resources. So what advantages does Musk, a profit-driven capitalist if ever there was one, see in end-to-end?


We can borrow a slide from GAC Research Institute that summarizes well the advantages and disadvantages of the end-to-end large model relative to the modular solution.


There are three advantages:


    • A higher technical ceiling;


    • Data-driven solution to complex long-tail problems;


    • Elimination of serious cumulative errors between modules;


There are two disadvantages:


    • Lack of interpretability;


    • Massive amounts of high-quality data are required.


"Having a higher technical upper bound" is because overall optimization can be carried out. The end-to-end integrated structure facilitates joint optimization and seeking the overall optimal solution.


The end-to-end large model's ability to serve the overall goal and achieve global optimization is closely tied to its full-stack neural network form. A unified perception-prediction-planning-control network can use the chain rule to backpropagate errors layer by layer, unobstructed, from the output layer (lateral and longitudinal control) back to the input layer (sensors), updating the parameters of every layer with the single goal of minimizing the overall loss function.


This is impossible to achieve with the modular autonomous driving scheme, where a "gradient disconnection" exists between modules.


For layer-by-layer backpropagation to work, the chain in the middle must remain unbroken; once any intermediate link is interrupted, backpropagation simply fails.
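A small PyTorch illustration of this point, using three toy linear layers as stand-ins for perception, planning, and control: in the unbroken chain, gradients from the control loss reach the perception weights; cutting the chain at the module boundary (simulated here with `.detach()`, standing in for a hand-written interface) leaves perception with no gradient at all.

```python
import torch
import torch.nn as nn

perception = nn.Linear(10, 8)   # toy "perception" module
planner    = nn.Linear(8, 4)    # toy "planning" module
controller = nn.Linear(4, 2)    # toy "control" module: [lateral, longitudinal]

x       = torch.randn(16, 10)   # fake sensor features
target  = torch.randn(16, 2)    # fake expert control labels
loss_fn = nn.MSELoss()

# End-to-end: one unbroken chain, so gradients reach the perception layer.
loss = loss_fn(controller(planner(perception(x))), target)
loss.backward()
print(perception.weight.grad.abs().sum() > 0)   # tensor(True)

# Modular with a broken chain: detaching at the interface stops the gradient.
for m in (perception, planner, controller):
    m.zero_grad(set_to_none=True)
loss = loss_fn(controller(planner(perception(x).detach())), target)
loss.backward()
print(perception.weight.grad)                   # None: nothing flows back
```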


"Eliminating serious module cumulative errors" also comes from the contribution of the full-stack neural network.


You can think of the forward pass of a multi-layer neural network as a chain of function evaluations. Whether the full information can be passed from one layer to the next is key to the accuracy of the final result.


In the modular solution, the full information cannot be passed between modules, which produces "cumulative error." In the full-stack neural network, by contrast, the full information flows from layer to layer, eliminating the error that accumulates across modules.
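A toy illustration of that information bottleneck (the object-list hand-off below is a simplification of my own, not any company's real interface): projecting a rich perception feature down to a few hand-defined fields discards most of what a downstream planner could have used, whereas a learned layer-to-layer hand-off keeps the full tensor.

```python
import torch

# Rich "perception" output: a 64-dim learned feature per detected object,
# carrying confidence, appearance, uncertainty, context, and so on.
features = torch.randn(5, 64)              # 5 detected objects

# Modular hand-off: a designer-defined record per object, e.g. (x, y, speed, class).
# Here a 4-value projection stands in for it; everything else is discarded.
object_list = features[:, :4]

# End-to-end hand-off: the next layer simply consumes the full tensor.
next_layer_input = features

print(object_list.numel(), "values cross the hand-designed interface")   # 20
print(next_layer_input.numel(), "values cross the learned interface")    # 320
```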


The phrase "data-driven solution to complex long-tail problems" may confuse some people; after all, building a closed data loop and using data to cover more corner cases has been the focus of Chinese car companies' marketing over the past year or two. There is actually no contradiction: the BEV, Transformer, and occupancy networks promoted by domestic car companies are data-driven on the perception side, but in planning and control most car companies are still rule-based.


Like perception, planning and control also face a long-tail problem.


Rule-based and data-driven approaches are both ways to tackle complex long-tail problems. Algorithms, computing power, and data are the three elements driving the development of artificial intelligence; within this framework, the rule-based approach can be seen as "algorithm-driven," while end-to-end large models are "data-driven."


Rather than hand-coding control strategies to deal with an endless stream of complex long-tail cases, it is better to design a control neural network and update its parameters with training data from long-tail scenarios. In theory, this is the more sustainable approach.
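A schematic contrast of the two approaches, with entirely invented rules, thresholds, and scenarios:

```python
import torch
import torch.nn as nn

# Rule-based: every new long-tail case needs another hand-written branch.
def rule_based_control(obstacle_distance: float, obstacle_speed: float) -> float:
    if obstacle_distance < 5.0:
        return 1.0      # hard brake
    elif obstacle_distance < 20.0 and obstacle_speed < 0.5:
        return 0.5      # moderate brake for a slow obstacle
    # ...one more elif for every corner case the engineers think of...
    return 0.0

# Data-driven: the same structure covers new cases once it is trained on them.
control_net = nn.Sequential(nn.Linear(2, 16), nn.ReLU(), nn.Linear(16, 1))
long_tail_batch = torch.tensor([[3.0, 0.2], [18.0, 0.4]])   # fabricated scenarios
brake_commands = control_net(long_tail_batch)               # improves with more data
```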


End-to-end "lack of explainability" is indeed an objective shortcoming. However, not only FSD is end-to-end, but the interpretability of GPT and generative AI being developed by Internet giants is also very poor. Scientists have not yet understood where the emergent behavior and emergent capabilities of large models come from.


Both GPT and end-to-end FSD follow the brute-force aesthetic of huge computing power plus massive data, and the source and mechanism of their capabilities are, for now, hard to pin down.


Yet despite the poor interpretability, the internet giants keep increasing their investment in large models, and consumers put them to good use. For now we know that they work without fully knowing why; perhaps scientists will eventually explain the mechanisms behind end-to-end and generative large models.


"The need for massive amounts of high-quality data" is not so much a shortcoming as it is a threshold.


In autonomous driving, training compute, data, AI talent, and funding are all barriers to entry, and among these elements, data matters most.


Andrej Karpathy once said in an interview that Tesla's autonomous driving team spends three quarters of its effort collecting, cleaning, classifying, and labeling high-quality data, and only one quarter on algorithm exploration and model building. That allocation alone shows the status of data in Tesla's autonomous driving technology stack.


For a completely data-driven large model like the end-to-end one, the scale and quality of the data determine the model's performance even more than the parameter count does.
