Technical Analysis of Tesla's Autopilot System


The first half of the automobile revolution is electrification; the second half is intelligence. Electrification only changes how the car is powered; it does not change what a car is. Intelligence is the main course of this revolution and will bring disruptive change, turning cars from traditional mechanical machines into intelligent machines with powerful computing capabilities.


On the road to automotive intelligence there is one leader with absolute strength: Tesla, under the leadership of Elon Musk. The autonomous driving system it has built is the focus of global attention, and Musk has publicly claimed that the artificial intelligence Tesla has created is the most advanced in the world.


Tesla is the only technology company in the world to have developed and produced, fully in house, the core components of autonomous driving. Across data, algorithms, and computing power, it has built a full-stack software and hardware architecture spanning perception, planning, control, and execution.


Overall, Tesla's autonomous driving architecture perceives the world with a pure-vision solution, using neural networks to construct a three-dimensional vector space of the real world from raw video data. Within that vector space, a hybrid planner that combines traditional planning-and-control methods with neural networks performs behavior and path planning for the vehicle and generates control signals for the actuators, while a complete closed-loop data system and simulation platform drive the continuous iteration of the autonomous driving capability.


The following sections analyze the core systems behind Tesla's FSD (Full Self-Driving) in four parts: perception, planning and control, data and simulation, and computing power.


01 Perception

According to the demonstration at Tesla AI Day in August 2021, Tesla's latest perception solution is purely vision-based: it completely abandons non-camera sensors such as lidar and millimeter-wave radar and relies on cameras alone, an approach that is unique in the field of autonomous driving.


The principle by which humans perceive the world through their eyes is as follows: light enters the eyes and information is gathered by the retina. After transmission and preprocessing, the information reaches the visual cortex of the brain, where neurons extract features such as color, orientation, and edges from the retinal signal and pass them on to the inferior temporal cortex. After complex processing by the cognitive neural network, the perception result is finally produced.


Principles of human visual perception

The autonomous driving visual perception solution imitates the human visual system, with the cameras acting as the "eyes of the car". Tesla vehicles use a total of eight cameras distributed around the body: three at the front (the main forward camera, a wide-angle forward camera with a fisheye lens, and a narrow-field forward camera with a telephoto lens), two on each side (a forward-looking side camera and a rearward-looking side camera), and one rear-view camera at the back. Together they provide 360-degree surround coverage, with a maximum monitoring distance of up to 250 meters.
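For concreteness, the camera layout described above can be summarized in a simple structure like the following. The names and fields are illustrative assumptions, not Tesla's actual naming; only the overall layout and the 250-meter maximum monitoring distance come from the description above.

```python
# Illustrative summary of the eight-camera layout (hypothetical names/fields).
CAMERAS = {
    "front_main":       {"position": "front", "lens": "standard"},
    "front_wide":       {"position": "front", "lens": "fisheye"},
    "front_narrow":     {"position": "front", "lens": "telephoto"},
    "left_side_front":  {"position": "left side",  "facing": "forward"},
    "left_side_rear":   {"position": "left side",  "facing": "rearward"},
    "right_side_front": {"position": "right side", "facing": "forward"},
    "right_side_rear":  {"position": "right side", "facing": "rearward"},
    "rear":             {"position": "rear", "facing": "rearward"},
}

MAX_MONITORING_DISTANCE_M = 250  # farthest range across the camera set
assert len(CAMERAS) == 8
```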


Tesla body camera surround view

The real-world image data collected by these "eyes of the car" is processed by a complex perception neural network architecture to construct a three-dimensional vector space of the real world. This space contains dynamic traffic participants such as vehicles and pedestrians, static environmental objects such as lane lines, traffic signs, traffic lights, and buildings, and attribute parameters for each element such as position, heading angle, distance, velocity, and acceleration. The vector space does not need to look exactly like the real world; it is a mathematical representation intended for machine understanding.
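To make the idea of a "vector space" more concrete, the sketch below shows one way a single element of it could be represented in code. The type and field names are hypothetical and simply mirror the attributes listed above.

```python
from dataclasses import dataclass
from enum import Enum


class ElementType(Enum):
    VEHICLE = "vehicle"
    PEDESTRIAN = "pedestrian"
    LANE_LINE = "lane_line"
    TRAFFIC_SIGN = "traffic_sign"
    TRAFFIC_LIGHT = "traffic_light"
    BUILDING = "building"


@dataclass
class VectorSpaceElement:
    """One element of the 3D vector space: an object type plus its attribute parameters."""
    kind: ElementType
    position_xyz: tuple[float, float, float]  # coordinates in the ego frame (m)
    heading_rad: float                        # direction angle
    distance_m: float                         # distance from the ego vehicle
    velocity_mps: float                       # speed
    acceleration_mps2: float                  # acceleration
```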


Cameras collect the data and the neural network outputs the three-dimensional vector space

According to Tesla's public information from AI Day, after multiple rounds of upgrades and iterations the visual perception framework Tesla currently uses is shown in the figure below. It is a multi-task neural network architecture with shared features, built on video-stream data, capable of perceiving object depth and retaining short-term memory.


Tesla Visual Perception Network Architecture

Network infrastructure: HydraNet multi-head network

The basic structure of Tesla's visual perception network consists of a backbone, a neck, and multiple branch heads. Tesla named it "HydraNet" after the Hydra, the nine-headed serpent of ancient Greek mythology.


The backbone layer (a RegNet, a convolutional network built from residual blocks) and the BiFPN multi-scale feature-fusion neck are trained end-to-end on the raw video data to extract a multi-scale visual feature space (feature maps); the head layer then completes task-specific sub-network training and outputs the perception results, supporting more than 1,000 tasks in total, including object detection, traffic-light recognition, and lane-line recognition.


HydraNet multi-task network structure
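To make the shared-backbone, multi-head structure concrete, here is a minimal PyTorch sketch. The layer sizes, the set of heads, and the tiny convolutional stack standing in for RegNet and BiFPN are all illustrative assumptions, not Tesla's actual implementation.

```python
import torch
import torch.nn as nn


class TinyHydraNet(nn.Module):
    """Illustrative multi-head network: one shared backbone + neck, many task heads."""

    def __init__(self, num_object_classes: int = 10, num_light_states: int = 4):
        super().__init__()
        # Stand-in backbone (Tesla uses a RegNet); extracts image features.
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
        )
        # Stand-in neck (Tesla uses BiFPN); fuses features into one shared map.
        self.neck = nn.Conv2d(64, 128, 1)
        # Independent task heads that all read the same shared feature map.
        self.heads = nn.ModuleDict({
            "object_detection": nn.Conv2d(128, num_object_classes + 4, 1),
            "traffic_lights": nn.Conv2d(128, num_light_states, 1),
            "lane_lines": nn.Conv2d(128, 1, 1),
        })

    def forward(self, images: torch.Tensor) -> dict[str, torch.Tensor]:
        features = self.neck(self.backbone(images))  # computed once, shared by all heads
        return {name: head(features) for name, head in self.heads.items()}


# Usage: a single backbone pass feeds every head, so per-task cost is marginal.
outputs = TinyHydraNet()(torch.randn(1, 3, 128, 128))
print({name: tuple(out.shape) for name, out in outputs.items()})
```

Because the heads only read the shared feature map, adding or retraining one head does not disturb the others, which is the decoupling and caching benefit described below.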

The core feature of the HydraNet network is that multiple sub-task branches share the same feature space. Compared with running an independent neural network for each task, this has the following advantages:

1) A single backbone extracts features once and shares them with every task head, avoiding repeated computation across tasks and improving the overall efficiency of the network;

2) The sub-tasks are decoupled from one another: each head runs independently, so upgrading a single task does not require re-verifying all the other tasks, which lowers the upgrade cost;

3) The shared feature space can be cached and called on demand by the various task heads, which gives the architecture strong scalability.

Data calibration layer: virtual camera builds standardized data

Tesla uses data collected from different cars to train a common perception network. Because the extrinsic parameters of the camera installation differ slightly from car to car, the collected data can deviate slightly between vehicles. Tesla therefore adds a "virtual standard camera" layer to the perception framework: using the per-camera calibration extrinsics, each car's image data is de-distorted, rotated, and mapped onto a single set of virtual standard camera coordinates. This "rectification" of each camera's raw data removes extrinsic errors and ensures data consistency, and the rectified data is then fed to the backbone network for training.


Insert a virtual camera layer before the raw data enters the neural network
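As a rough illustration of what such rectification involves, the sketch below uses OpenCV to undistort an image and rotate it into a common "virtual standard camera". The calibration values are placeholders and this is not Tesla's actual pipeline; it only shows the kind of per-camera remapping the text describes.

```python
import cv2
import numpy as np


def rectify_to_virtual_camera(image, K_real, dist_real, R_to_virtual, K_virtual):
    """Map one physical camera's image into the shared virtual-camera frame.

    K_real / dist_real: this car's calibrated intrinsics and distortion.
    R_to_virtual: rotation from this camera's frame to the virtual camera frame.
    K_virtual: intrinsics of the common virtual standard camera.
    """
    h, w = image.shape[:2]
    map1, map2 = cv2.initUndistortRectifyMap(
        K_real, dist_real, R_to_virtual, K_virtual, (w, h), cv2.CV_32FC1)
    return cv2.remap(image, map1, map2, interpolation=cv2.INTER_LINEAR)


# Placeholder calibration values (illustrative only).
K_real = np.array([[1000.0, 0, 640], [0, 1000.0, 360], [0, 0, 1]])
dist_real = np.array([-0.05, 0.01, 0.0, 0.0, 0.0])   # slight lens distortion
R_to_virtual = np.eye(3)                              # small mounting correction
K_virtual = np.array([[950.0, 0, 640], [0, 950.0, 360], [0, 0, 1]])

frame = np.zeros((720, 1280, 3), dtype=np.uint8)      # stand-in camera frame
rectified = rectify_to_virtual_camera(frame, K_real, dist_real, R_to_virtual, K_virtual)
```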

Spatial understanding layer: Transformer realizes three-dimensional transformation

Camera data lives in the two-dimensional image plane, not in the three-dimensional space of the real world, so to achieve full self-driving capability the two-dimensional data must be transformed into three-dimensional space.


To construct a three-dimensional vector space, the network must be able to output object depth information. Most autonomous driving companies obtain depth from sensors such as lidar and millimeter-wave radar and fuse it with the visual perception results. Tesla insists on computing depth purely from the video data of its vision-only solution. The idea is to introduce a BEV space-conversion layer into the network structure to give the network spatial understanding. The BEV (bird's-eye-view) coordinate system is an ego-vehicle coordinate system that ignores elevation information.


Tesla's early solution was to perform perception in the two-dimensional image space, map the results into three-dimensional vector space, and then fuse the results from all cameras. But image-level perception relies on the flat-ground assumption, treating the ground as an infinitely large plane, while the real world contains slopes, which makes the predicted depth inaccurate; this is the biggest difficulty faced by camera-only vision solutions. In addition, a single camera may not see a complete target, which makes this kind of "post fusion" hard to do well.
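A small worked example makes the slope problem concrete: with a pinhole camera mounted at height h, the flat-ground assumption turns an image row v into a depth Z = f·h/(v − c_y); if the road actually rises at an angle θ, the same pixel corresponds to a much closer point. The numbers below are illustrative.

```python
import math

f = 1000.0      # focal length in pixels (illustrative)
h = 1.4         # camera height above the road in metres (illustrative)
cy = 360.0      # principal point row
v = 380.0       # image row of the observed ground point (below the horizon)
theta = math.radians(3.0)  # actual uphill slope of the road

# Depth implied by the flat-ground assumption.
z_flat = f * h / (v - cy)

# Depth if the road actually rises at angle theta:
# the ground point sits (h - Z*tan(theta)) below the camera, so
# (v - cy) = f * (h - Z*tan(theta)) / Z  =>  Z = f*h / ((v - cy) + f*tan(theta))
z_true = f * h / ((v - cy) + f * math.tan(theta))

print(f"flat-ground depth: {z_flat:.1f} m, true depth on slope: {z_true:.1f} m")
# With these numbers the flat-ground estimate (70 m) is roughly 3.6x the true
# depth (about 19 m): even a small slope causes large depth errors far from the car.
```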


To address this problem and make the perception results more accurate, Tesla adopts an "early fusion" approach: the video streams from the multiple cameras around the body are fused first and then trained in a single neural network, which transforms the features from two-dimensional image space into three-dimensional vector space.


Introducing BEV three-dimensional space conversion layer

The core module that realizes the transformation into three dimensions is the Transformer, a deep-learning model based on the attention mechanism. Attention is inspired by how the human brain processes information: when faced with a large amount of external input, the brain filters out unimportant information and focuses only on what is key, which greatly improves processing efficiency. Transformers perform exceptionally well on large-scale data learning tasks.
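The sketch below illustrates the cross-attention idea in this setting: a grid of learned BEV queries attends to flattened multi-camera image features and produces a BEV feature map. It is a toy example of the mechanism, not Tesla's actual network, and all sizes are arbitrary.

```python
import torch
import torch.nn as nn


class ImageToBEVAttention(nn.Module):
    """Toy cross-attention: BEV grid queries pull information from camera features."""

    def __init__(self, dim: int = 128, bev_size: int = 32, num_heads: int = 4):
        super().__init__()
        # One learned query per BEV grid cell (bev_size x bev_size cells).
        self.bev_queries = nn.Parameter(torch.randn(bev_size * bev_size, dim))
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.bev_size = bev_size

    def forward(self, camera_features: torch.Tensor) -> torch.Tensor:
        # camera_features: (batch, tokens, dim), image features from all
        # cameras flattened into a single token sequence.
        batch = camera_features.shape[0]
        queries = self.bev_queries.unsqueeze(0).expand(batch, -1, -1)
        bev, _ = self.attn(queries, camera_features, camera_features)
        # Reshape the attended tokens back into a 2D bird's-eye-view grid.
        return bev.view(batch, self.bev_size, self.bev_size, -1)


# Usage: 8 cameras x 100 feature tokens each -> a 32x32 BEV feature grid.
features = torch.randn(1, 8 * 100, 128)
bev_grid = ImageToBEVAttention()(features)
print(bev_grid.shape)  # torch.Size([1, 32, 32, 128])
```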
