Technical Analysis of Tesla's Autopilot System

Publisher: 创新火花 | Last updated: 2022-11-27 | Source: elecfans


The three core inputs required by the Transformer module are Query, Key and Value. Key and Value are produced by passing the multi-scale feature space generated by the HydraNet backbone through an MLP (multi-layer perceptron). A global description vector (context summary) is obtained by pooling the feature space, and a positional encoding is applied to each grid cell of the output BEV space. Concatenating the description vector with the positional encoding and passing the result through another MLP yields the Query.
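As a rough illustration of this cross-attention step, the sketch below builds Keys and Values from flattened image features and Queries from a pooled context vector concatenated with per-cell BEV positional encodings. All shapes, layer sizes and names here are illustrative assumptions, not Tesla's actual implementation.

```python
import torch
import torch.nn as nn

class BEVCrossAttention(nn.Module):
    """Minimal sketch of the camera-to-BEV cross-attention described above.

    Keys/Values come from the (flattened) multi-scale image features;
    Queries come from a per-cell BEV positional encoding combined with a
    pooled global context vector. All sizes are illustrative."""

    def __init__(self, feat_dim=256, bev_h=32, bev_w=32):
        super().__init__()
        self.kv_mlp = nn.Linear(feat_dim, 2 * feat_dim)   # K and V from image features
        self.q_mlp = nn.Linear(2 * feat_dim, feat_dim)    # Q from [context, pos-enc]
        self.pos_enc = nn.Parameter(torch.randn(bev_h * bev_w, feat_dim))
        self.attn = nn.MultiheadAttention(feat_dim, num_heads=8, batch_first=True)

    def forward(self, img_feats):                         # (B, N_pixels, feat_dim)
        b = img_feats.shape[0]
        k, v = self.kv_mlp(img_feats).chunk(2, dim=-1)
        context = img_feats.mean(dim=1, keepdim=True)     # pooled global summary
        q_in = torch.cat([context.expand(-1, self.pos_enc.shape[0], -1),
                          self.pos_enc.unsqueeze(0).expand(b, -1, -1)], dim=-1)
        q = self.q_mlp(q_in)                              # one query per BEV cell
        bev, _ = self.attn(q, k, v)                       # (B, bev_h*bev_w, feat_dim)
        return bev

bev = BEVCrossAttention()(torch.randn(1, 1000, 256))      # -> (1, 1024, 256)
```

Each BEV grid cell thus "asks" the image features (via its Query) which pixels are relevant to that ground location, which is what lets the network translate camera-plane features into a top-down vector space.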


Through this method, Tesla internalizes geometric variations such as ground slope and curvature into the trained parameters of the neural network, achieving accurate perception and prediction of object depth. This is also why Tesla was confident enough to abandon the radar-fusion route in favor of a pure-vision approach.


Short-term memory layer: video spatiotemporal sequence feature extraction

After the spatial understanding layer is introduced, the perception network can describe the three-dimensional vector space of the real world, but it still perceives only instantaneous image frames and lacks spatiotemporal memory. In other words, the car can make judgments only from the information perceived at the current moment, so some features of the world space remain imperceptible.


For example, suppose a pedestrian crossing the road becomes occluded by a stopped vehicle. A car with only instantaneous perception cannot identify the pedestrian, because the pedestrian is hidden at the very moment of perception, creating a serious safety risk. A human driver facing the same scene remembers having seen the pedestrian crossing a moment earlier, infers that the pedestrian is most likely still behind the vehicle and intends to keep crossing, and therefore slows down or brakes to avoid them.


Therefore, the autonomous-driving perception network also needs a similar memory capability: it should retain data features from a preceding time window and use them to infer the most likely state of the current scene, rather than judging only from what is visible at the current instant.


To solve this problem, Tesla's perception network architecture introduced a spatiotemporal sequence feature layer, which adds short-term memory capabilities to autonomous driving by using video clips with a time dimension instead of static images to train neural networks.

[Figure] Introducing a spatiotemporal sequence feature extraction layer to achieve short-term memory capability

Tesla also brings in vehicle motion information, including speed and acceleration from the IMU sensor, and combines it with the three-dimensional vector-space features to build feature queues along both the time dimension and the space dimension. The time-dimension queue provides temporal continuity of perception, while the space-dimension queue prevents features from being lost during long waits in some scenes (a long stop at a red light, for instance, would eventually push everything out of a purely time-based queue). Methods such as 3D convolution, Transformer and RNN are then used to fuse the temporal information and obtain the spatiotemporal feature space of the multi-sensor-fused video stream.
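A minimal sketch of such dual queues is shown below, assuming simple push rules (one entry per fixed time step, one per fixed distance travelled); the class and parameter names are illustrative, not Tesla's.

```python
from collections import deque

class FeatureQueues:
    """Sketch of the dual feature queues described above (illustrative only).

    The time queue pushes a BEV feature every fixed time step, giving
    temporal continuity; the space queue pushes every fixed distance
    travelled, so features survive long stationary waits (e.g. a red light)."""

    def __init__(self, maxlen=20, time_step=0.1, dist_step=1.0):
        self.time_q = deque(maxlen=maxlen)
        self.space_q = deque(maxlen=maxlen)
        self.time_step, self.dist_step = time_step, dist_step
        self.last_t, self.last_odo = 0.0, 0.0

    def push(self, t, odometer, bev_feat, ego_kinematics):
        entry = (bev_feat, ego_kinematics)   # pair features with IMU speed/accel
        if t - self.last_t >= self.time_step:
            self.time_q.append(entry)
            self.last_t = t
        if odometer - self.last_odo >= self.dist_step:
            self.space_q.append(entry)
            self.last_odo = odometer
```

While the car is stationary the odometer does not advance, so the space queue keeps its old entries intact; that is exactly the failure mode of a time-only queue that the spatial queue is meant to cover.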


In addition, Tesla has experimented with a new temporal-fusion method, the Spatial RNN, which skips the positional encoding of the BEV layer and feeds visual features directly into an RNN. Its hidden state retains encodings from multiple past moments, guiding which memory fragments should be selected to deal with the current environment.
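One way to picture such a spatial RNN is a recurrent cell whose hidden state is itself a 2D BEV grid, so memory is tied to world locations rather than to a flat vector. The convolutional GRU below is a generic stand-in for that idea, not Tesla's implementation.

```python
import torch
import torch.nn as nn

class SpatialGRUCell(nn.Module):
    """Generic convolutional GRU over a BEV grid: the hidden state is a
    spatial map, so each cell remembers what was last seen at that world
    location. A stand-in for the 'Spatial RNN' idea, not Tesla's code."""

    def __init__(self, ch):
        super().__init__()
        self.gates = nn.Conv2d(2 * ch, 2 * ch, 3, padding=1)  # update & reset gates
        self.cand = nn.Conv2d(2 * ch, ch, 3, padding=1)       # candidate state

    def forward(self, x, h):                 # x, h: (B, ch, H, W)
        z, r = torch.sigmoid(self.gates(torch.cat([x, h], 1))).chunk(2, 1)
        h_tilde = torch.tanh(self.cand(torch.cat([x, r * h], 1)))
        return (1 - z) * h + z * h_tilde

cell = SpatialGRUCell(32)
h = torch.zeros(1, 32, 64, 64)               # persistent BEV memory
for frame_feat in torch.randn(5, 1, 32, 64, 64):
    h = cell(frame_feat, h)                   # update memory frame by frame
```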


The short-term memory layer clearly improves the robustness of Tesla's perception network, allowing it to maintain good perception in severe weather, sudden events, occlusion scenarios and the like.


Together, the above layers constitute Tesla's perception network architecture, which is trained end to end from video input to vector-space output.


According to Andrej Karpathy, who led Tesla's AI team, the visual perception system built on this architecture can perceive depth even better than radar. At the same time, thanks to its short-term memory, Tesla can construct local maps in real time; by fusing many local maps, a high-precision map of any area could in theory be obtained. This is why Tesla currently does not use high-precision maps as an input.


02 Planning and Control

After perceiving the surrounding world, a human makes judgments based on that information, plans how the body should react, and issues control commands. A car works the same way: once the perception task is complete, the next step is to make a plan based on the perceived information and guide the car through the corresponding actions. This is the planning and control part of autonomous driving.


The core goal of Tesla's autonomous driving control is to plan the car's behavior and driving path based on the three-dimensional vector space output by the perception network to enable the car to reach the designated destination, while maximizing driving safety, efficiency and comfort.


Planning and control is a very complex problem. On the one hand, a car's behavior space is typically non-convex: the same target task may admit many solutions, and the global optimum is hard to obtain; concretely, the planner can get trapped in a local optimum and fail to reach a good decision quickly. On the other hand, the behavior space is high-dimensional: formulating a plan for the target task requires rapidly generating parameters across multiple dimensions, such as speed and acceleration, within a short time.


The solution Tesla adopted is to combine traditional planning-and-control methods with neural-network algorithms in a hybrid planning system, decomposing the task so that the two problems above are solved separately. Its planning and control logic is shown in the figure below.

[Figure] Hybrid planning system solution

Within the three-dimensional vector space obtained from perception, and given the target position, a coarse search first finds a preliminary path. The path is then optimized around that preliminary result against indicators such as safety and comfort, continuously fine-tuning parameters such as obstacle clearance and acceleration, until an optimal spatio-temporal trajectory is obtained.
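The fine-tuning stage can be pictured as gradient descent on a weighted cost over the waypoints of the coarse path. The sketch below uses two made-up terms, smoothness (comfort) and obstacle clearance (safety); the weights and cost form are illustrative assumptions, not Tesla's actual optimizer.

```python
import numpy as np

def refine_path(coarse_xy, obstacles, iters=200, lr=0.1,
                w_smooth=1.0, w_clear=5.0, safe_dist=2.0):
    """Gradient-descent fine-tuning of a coarse path, as described above.
    A smoothness (comfort) term pulls each waypoint toward the midpoint of
    its neighbours; a clearance (safety) term pushes it away from nearby
    obstacles. Weights and the cost form are illustrative, not Tesla's."""
    path = np.asarray(coarse_xy, dtype=float).copy()
    for _ in range(iters):
        grad = np.zeros_like(path)
        # smoothness: penalize deviation from the midpoint of the neighbours
        grad[1:-1] += w_smooth * (2 * path[1:-1] - path[:-2] - path[2:])
        # clearance: repel waypoints that come within safe_dist of an obstacle
        for obs in obstacles:
            d = path - obs                               # (N, 2) offsets
            dist = np.linalg.norm(d, axis=1, keepdims=True)
            close = (dist < safe_dist) & (dist > 1e-6)
            grad -= w_clear * close * d / (dist ** 2 + 1e-6)
        grad[0] = grad[-1] = 0                           # keep start and goal fixed
        path -= lr * grad
    return path
```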


In most structured scenarios, such as highways, the coarse search uses the classic A* algorithm (a heuristic search method). But in complex scenarios such as downtown streets and parking lots, where there are many unstructured elements and the search space is large, traditional A* expands far too many nodes, making decisions slow.
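For reference, a textbook A* over an occupancy grid looks like the sketch below; the grid, unit step costs and 4-connected moves are illustrative simplifications of the structured-road coarse search the text describes.

```python
import heapq

def a_star(grid, start, goal):
    """Classic A* over an occupancy grid (0 = free, 1 = blocked) with a
    Manhattan-distance heuristic."""
    h = lambda p: abs(p[0] - goal[0]) + abs(p[1] - goal[1])
    open_set = [(h(start), start)]
    came_from, g_cost, closed = {start: None}, {start: 0}, set()
    while open_set:
        _, cur = heapq.heappop(open_set)
        if cur == goal:                          # reconstruct the path
            path = []
            while cur is not None:
                path.append(cur)
                cur = came_from[cur]
            return path[::-1]
        if cur in closed:
            continue
        closed.add(cur)
        for dx, dy in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nxt = (cur[0] + dx, cur[1] + dy)
            if (0 <= nxt[0] < len(grid) and 0 <= nxt[1] < len(grid[0])
                    and grid[nxt[0]][nxt[1]] == 0
                    and g_cost[cur] + 1 < g_cost.get(nxt, float("inf"))):
                g_cost[nxt] = g_cost[cur] + 1
                came_from[nxt] = cur
                heapq.heappush(open_set, (g_cost[nxt] + h(nxt), nxt))
    return None                                  # goal unreachable

grid = [[0, 0, 0], [1, 1, 0], [0, 0, 0]]
print(a_star(grid, (0, 0), (2, 0)))              # routes around the wall
```

In an open, unstructured space the heuristic gives little guidance, so the frontier balloons; that node explosion is exactly the weakness motivating the learned approach below.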


Tesla therefore introduced reinforcement learning. Its mechanism resembles how humans learn: correct behaviors are reinforced with rewards until a capability is acquired. First, a neural network learns the characteristics of the whole scene to obtain a value function; then the MCTS algorithm (Monte Carlo Tree Search) guides the search along paths that keep approaching that value function. This greatly reduces the search space and effectively improves the real-time performance of decision-making.
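A generic value-guided MCTS skeleton is sketched below: the learned value function replaces random rollouts, which is the node-saving idea just described. The hooks `actions`, `step` and `value_fn` are assumed problem interfaces for illustration, not a real planner API.

```python
import math, random

class Node:
    def __init__(self, state, parent=None):
        self.state, self.parent = state, parent
        self.children, self.visits, self.value_sum = {}, 0, 0.0

def mcts(root_state, actions, step, value_fn, n_sims=500, c=1.4):
    """Value-guided MCTS sketch: a learned value function (any callable
    state -> float) evaluates leaves instead of random rollouts."""
    root = Node(root_state)
    for _ in range(n_sims):
        node = root
        # 1. Selection: descend via UCB1 while the node is fully expanded
        while node.children and len(node.children) == len(actions(node.state)):
            node = max(node.children.values(),
                       key=lambda ch: ch.value_sum / (ch.visits + 1e-9)
                       + c * math.sqrt(math.log(node.visits + 1) / (ch.visits + 1e-9)))
        # 2. Expansion: try one untried action
        untried = [a for a in actions(node.state) if a not in node.children]
        if untried:
            a = random.choice(untried)
            node.children[a] = Node(step(node.state, a), parent=node)
            node = node.children[a]
        # 3. Evaluation: learned value function instead of a rollout
        v = value_fn(node.state)
        # 4. Backpropagation: update statistics up to the root
        while node is not None:
            node.visits += 1
            node.value_sum += v
            node = node.parent
    return max(root.children, key=lambda a: root.children[a].visits)
```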

[Figure] MCTS algorithm planning driving routes in a parking lot

During driving there are game-theoretic interactions with other vehicles, for example when changing lanes or passing another vehicle at a narrow intersection. In such scenarios the vehicle's plan generally has to be adjusted continuously as the other party reacts.


Therefore, in addition to single-vehicle planning, Tesla also performs joint trajectory planning for traffic participants. It plans the paths of other vehicles from their state parameters (speed, acceleration, angular velocity, etc.), then selects a suitable plan for the ego vehicle; whenever the state of the other vehicles changes, the ego plan is adjusted accordingly. This avoids situations where the ego vehicle freezes and fails to react, making its behavior more intelligent.
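The sketch below is a deliberately crude version of this loop: other agents are rolled forward with a constant-velocity model, and the first ego candidate that keeps a safety gap from every prediction is chosen. The prediction model, candidate generation and gap threshold are all illustrative assumptions.

```python
import numpy as np

def predict(agent, horizon=30, dt=0.1):
    """Constant-velocity rollout of another agent's (x, y, vx, vy) state --
    a crude stand-in for planning the other vehicle's path."""
    t = np.arange(1, horizon + 1)[:, None] * dt
    return agent[:2] + t * agent[2:]                 # (horizon, 2) positions

def pick_ego_plan(candidates, agents, min_gap=2.5):
    """Choose the first ego trajectory keeping min_gap metres from every
    predicted agent at every step; re-run whenever agent states change."""
    preds = [predict(a) for a in agents]
    for traj in candidates:                          # (horizon, 2) each
        if all(np.min(np.linalg.norm(traj - p, axis=1)) >= min_gap for p in preds):
            return traj
    return None                                      # no safe plan: fall back to braking

ego_opts = [predict(np.array([0.0, 0, 5, 0])),       # keep lane, head-on conflict
            predict(np.array([0.0, 0, 5, 3]))]       # swerve aside
others = [np.array([10.0, 0, -5, 0])]                # oncoming vehicle
plan = pick_ego_plan(ego_opts, others)               # the swerving option is chosen
```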

[Figure] Joint trajectory planning for narrow intersections

At this point, the overall architecture of Tesla FSD has come into view. First, the visual perception network generates a three-dimensional vector space. For problems with a unique solution, an explicit control plan can be produced directly. For complex problems with multiple candidate solutions, the vector space and the intermediate-layer features extracted by the perception network are used to train a neural-network planner, which outputs a trajectory distribution; a cost function, human-intervention data and other simulation data are then combined to select the optimal control plan. Finally, control commands such as steering, acceleration and braking are generated, and the car's actuation module executes them to realize automated driving.
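The final selection step can be pictured as scoring each candidate trajectory with a weighted cost and keeping the minimum. The terms and weights below (safety as inverse clearance, comfort as squared acceleration, progress as distance covered) are illustrative assumptions, not Tesla's actual cost function.

```python
import numpy as np

def trajectory_cost(traj, obstacles, w_safe=10.0, w_comfort=1.0,
                    w_progress=0.5, dt=0.1):
    """Illustrative weighted cost of one candidate trajectory (N, 2):
    safety = inverse clearance to obstacles, comfort = mean squared
    acceleration, progress = negative distance covered."""
    clearance = min(np.min(np.linalg.norm(traj - o, axis=1)) for o in obstacles)
    vel = np.diff(traj, axis=0) / dt
    acc = np.diff(vel, axis=0) / dt
    return (w_safe / (clearance + 1e-3)
            + w_comfort * np.mean(np.sum(acc ** 2, axis=1))
            - w_progress * np.linalg.norm(traj[-1] - traj[0]))

def pick_best(candidates, obstacles):
    """Select the lowest-cost plan among candidates (e.g. the trajectories
    proposed by a neural planner, per the text)."""
    return min(candidates, key=lambda t: trajectory_cost(t, obstacles))
```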

[Figure] Tesla FSD perception-planning-control overall architecture

03 Data Annotation and Simulation

As can be seen, in Tesla's autonomous-driving solution the core algorithms at both the perception level and the planning-and-control level are essentially data-driven: the quantity and quality of data determine the performance of the algorithms. It is therefore crucial to build a closed loop for efficiently acquiring, labeling and simulating training data.
