Why is it more accurate than the human eye? Tesla's vision solution explained

Publisher: NanoScribe | Latest update: 2021-12-17 | Source: 第一电动

On December 10, Tesla held an offline "T-talk" sharing and discussion session in Beijing themed "The 'Bionic Brain' of Autonomous Driving". The event covered the latest progress of Tesla's AI technology: how a pure-vision approach achieves precise autonomous-driving capability and delivers a safer, more reliable experience than a radar-plus-vision fusion approach, along with Tesla's own intelligent algorithms and other exclusive content, giving participants a deeper understanding of Tesla's exploration of autonomous driving.


Adhering to visual perception: using AI neural networks to improve assisted-driving capability


As shown in Figure 1, Andrej Karpathy said: "We hope to create neural-network connections similar to an animal's visual cortex, simulating how the brain processes incoming information. Just as light enters the retina, we want to simulate that process with cameras."

[Figure 1: Schematic of a camera simulating the human image-processing pipeline]


The multi-task learning neural-network architecture HydraNets feeds the raw data from the eight cameras through a shared backbone, using a RegNet backbone and a BiFPN feature-fusion network, to produce image features at several resolutions that are then consumed by neural-network tasks with different requirements.

[Figure 2: HydraNets multi-task learning neural-network architecture]
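
As a rough illustration of the shared-backbone, multi-head idea behind HydraNets, the sketch below builds a tiny PyTorch network with one trunk and several task heads. The layer sizes, the number of heads, and the use of a plain convolutional trunk in place of RegNet + BiFPN are illustrative assumptions, not Tesla's actual architecture.

```python
import torch
import torch.nn as nn

class TinyHydraNet(nn.Module):
    """Illustrative multi-task net: one shared trunk, several light task heads."""
    def __init__(self, num_head_classes=(10, 4, 2)):
        super().__init__()
        # Shared trunk standing in for the RegNet + BiFPN feature extraction.
        self.trunk = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        # One head per task (e.g. objects, lane attributes, traffic lights).
        self.heads = nn.ModuleList(
            nn.Linear(64, n_cls) for n_cls in num_head_classes
        )

    def forward(self, images):
        shared = self.trunk(images)                    # features computed once
        return [head(shared) for head in self.heads]   # reused by every head

frames = torch.randn(8, 3, 128, 128)    # one frame from each of 8 cameras
outputs = TinyHydraNet()(frames)
print([o.shape for o in outputs])
```

The point of the structure is that the expensive feature extraction runs once and all downstream tasks share it.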


However, because this structure processes single frames from individual cameras, it runs into many bottlenecks in practice. A Transformer module was therefore added so that the extracted two-dimensional image features are projected into a three-dimensional vector space spanning all the cameras, greatly improving recognition rate and accuracy.
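
A minimal sketch of how a Transformer can lift per-camera 2D features into a shared top-down (bird's-eye-view) space is shown below: learned BEV queries cross-attend over the flattened image features of all cameras. The feature dimension, grid size, and single attention layer are assumptions chosen for illustration; Tesla's actual module is far more elaborate.

```python
import torch
import torch.nn as nn

class BEVCrossAttention(nn.Module):
    """Learned bird's-eye-view (BEV) queries attend over image features
    from all cameras, producing one fused top-down feature grid."""
    def __init__(self, dim=64, bev_cells=50 * 50):
        super().__init__()
        self.bev_queries = nn.Parameter(torch.randn(bev_cells, dim))
        self.attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)

    def forward(self, cam_feats):
        # cam_feats: (batch, n_cams * h * w, dim) -- flattened 2D features
        q = self.bev_queries.unsqueeze(0).expand(cam_feats.size(0), -1, -1)
        fused, _ = self.attn(q, cam_feats, cam_feats)
        return fused                          # (batch, bev_cells, dim)

cam_feats = torch.randn(1, 8 * 16 * 16, 64)   # 8 cameras, 16x16 feature maps each
bev = BEVCrossAttention()(cam_feats)
print(bev.shape)                              # torch.Size([1, 2500, 64])
```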


That is still not enough. Because the result is still a single frame, time and space dimensions are also needed so the vehicle has a feature "memory" to handle scenarios such as occlusion and road signs seen earlier. Ultimately, features of the driving environment are extracted from the video stream to form a vector space, letting the vehicle judge its surroundings accurately and with low latency and yielding a 4D vector space. These video feature sets are then used to train the autonomous-driving networks.

[Figure 3: Neural-network architecture for the video-based 4D vector space]
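
One simple way to picture the feature "memory" is a queue of recent per-frame features fused by a recurrent layer, as in the sketch below. The queue length, the GRU fusion, and the feature size are illustrative choices, not the published design.

```python
import collections
import torch
import torch.nn as nn

class TemporalFeatureQueue(nn.Module):
    """Keeps the last N per-frame BEV features in a queue and fuses them
    with a recurrent layer, giving the network a short-term 'memory'
    (useful when an object is temporarily occluded)."""
    def __init__(self, dim=64, history=12):
        super().__init__()
        self.queue = collections.deque(maxlen=history)
        self.rnn = nn.GRU(input_size=dim, hidden_size=dim, batch_first=True)

    def forward(self, frame_feat):            # frame_feat: (batch, dim)
        self.queue.append(frame_feat)
        seq = torch.stack(list(self.queue), dim=1)   # (batch, t, dim)
        _, fused = self.rnn(seq)              # last hidden state summarizes history
        return fused.squeeze(0)               # (batch, dim)

memory = TemporalFeatureQueue()
for _ in range(5):                            # feed five consecutive frames
    out = memory(torch.randn(1, 64))
print(out.shape)                              # torch.Size([1, 64])
```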


However, urban autonomous driving differs from highway driving, which leaves the planning module with two major challenges. The first is that a driving plan may have no single optimal solution: there can be many local optima, meaning that in the same driving environment the system can choose among several plans that are all good solutions. The second is that the problem is high-dimensional: the vehicle must not only react in the moment but also plan over the next stretch of time, estimating a great deal of information such as position, speed, and acceleration.


Tesla therefore tackles the planning module's two problems in two ways. One is discrete search, which finds the "answer" among the local optima with very high efficiency, about 2,500 searches every 1.5 milliseconds; the other is continuous function optimization, which handles the high-dimensional part. Discrete search first produces a global optimum, and continuous optimization then refines it, balancing demands across several dimensions, such as comfort and smoothness, to obtain the final planned path.
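
The two-stage idea, coarse discrete search followed by continuous refinement for comfort and smoothness, can be sketched on a toy one-dimensional lane-change problem as below; the cost terms, candidate set, and optimizer are assumptions chosen only to show the structure, not Tesla's planner.

```python
import numpy as np
from scipy.optimize import minimize

# Toy 1-D lane-change planner: pick a coarse candidate by discrete search,
# then refine the whole profile with continuous optimization.
target_offset = 3.5                      # desired lateral offset, metres
steps = 20

# 1) Discrete search: evaluate a handful of candidate end offsets.
candidates = np.linspace(0.0, 4.0, 9)
best = min(candidates, key=lambda c: abs(c - target_offset))

# 2) Continuous refinement: optimize the path for tracking + comfort.
def cost(path):
    track = (path[-1] - best) ** 2                 # reach the chosen offset
    jerk = np.sum(np.diff(path, n=2) ** 2)         # penalize harsh changes
    return track + 10.0 * jerk

x0 = np.linspace(0.0, best, steps)
result = minimize(cost, x0, method="L-BFGS-B")
print(result.x.round(2))                            # smooth lateral profile
```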


Besides planning its own path, the vehicle must also "estimate" the plans of other road users: using the same method, it plans plausible paths for other vehicles from their recognized type and basic parameters such as speed and acceleration, and reacts accordingly.


However, road conditions around the world are ever-changing and highly complex. Relying on discrete search alone would consume enormous resources and make decisions take too long, so Tesla combines a deep neural network with Monte Carlo tree search, improving decision-making efficiency by nearly an order of magnitude.

[Figure 4: Efficiency of the different search approaches]
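
The sketch below is not MCTS itself but a simplified best-first search that makes the same point: when a stand-in "policy score" (playing the role of a learned neural prior) orders which branches to expand, a full-depth plan is found after a handful of node expansions instead of enumerating every action sequence. The action set, plan depth, and scoring rule are invented for the demo.

```python
import heapq

# Toy illustration of why a learned prior speeds up planning search:
# instead of enumerating every action sequence (5**4 = 625 leaves here),
# branches are expanded in the order a stand-in "policy score" ranks them.
ACTIONS = ["keep", "left", "right", "slow", "fast"]
TARGET = ("keep", "left", "left", "keep")     # hidden "good" plan for the demo

def policy_score(path):
    """Stand-in for a neural-network prior: counts how many actions of the
    partial plan agree with the hidden good plan. A real system would use
    a learned model here instead."""
    return sum(a == b for a, b in zip(path, TARGET))

def guided_search(depth=4, budget=1000):
    """Best-first expansion ordered by the prior; returns plan + node count."""
    frontier = [(0, ())]
    expanded = 0
    while frontier and expanded < budget:
        _, path = heapq.heappop(frontier)     # most promising node first
        expanded += 1
        if len(path) == depth:
            return path, expanded
        for a in ACTIONS:
            child = path + (a,)
            heapq.heappush(frontier, (-policy_score(child), child))
    return None, expanded

plan, n = guided_search()
print(plan, "nodes expanded:", n)             # 5 expansions instead of ~625 leaves
```

With the hand-crafted prior above, the search expands only 5 nodes versus 625 leaf sequences for exhaustive enumeration, which is the kind of order-of-magnitude gap described in the text.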


The overall architecture of the final planning module is shown in Figure 5. Data is first processed into the 4D vector space by the pure-vision architecture; then, using the object recognition and shared features obtained earlier, a deep neural network searches for the global optimum, and the final plan is handed to the vehicle's control and actuation systems for execution.

[Figure 5: Overall architecture of visual recognition plus planning and execution]


Of course, no matter how good the neural-network architecture and processing methods are, they depend on a large, effective dataset. As the data moved from 2D to 3D and 4D, Tesla's manual-labeling team of roughly 1,000 people moved with it, labeling directly in the 4D vector space. A label drawn once in vector space is automatically projected into the individual images of each camera, greatly multiplying the amount of labeled data. Even so, manual labeling alone falls far short of what autonomous-driving training requires.

[Figure 6: Manual annotation in the 4D vector space]
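
The "label once in vector space, project into every camera" step boils down to standard pinhole-camera projection. The sketch below uses made-up intrinsics and camera poses to show how a single 3D label lands at different pixel coordinates in each camera image.

```python
import numpy as np

# A label placed once in the 3D vector space can be re-projected into every
# camera image with that camera's pose and intrinsics (pinhole model).
def project(point_world, R, t, K):
    """Project a 3D world point into pixel coordinates for one camera."""
    p_cam = R @ point_world + t              # world -> camera frame
    p_img = K @ p_cam                        # camera frame -> image plane
    return p_img[:2] / p_img[2]              # perspective divide -> (u, v)

K = np.array([[800.0, 0, 640], [0, 800.0, 360], [0, 0, 1]])   # assumed intrinsics
label_3d = np.array([2.0, 0.5, 10.0])        # one labeled point, metres

# Two made-up camera poses: a front camera and a slightly rotated one.
cams = {
    "front":  (np.eye(3),                    np.zeros(3)),
    "pillar": (np.array([[0.98, 0, 0.2],
                         [0.0,  1, 0.0],
                         [-0.2, 0, 0.98]]),  np.array([0.3, 0.0, 0.0])),
}
for name, (R, t) in cams.items():
    print(name, project(label_3d, R, t, K).round(1))
```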


Since humans are better at semantic recognition, while computers excel at geometry, triangulation, tracking, and reconstruction, Tesla wants a model in which humans and computers "divide the work harmoniously" and label data together.


Tesla has built a massive automatic-labeling pipeline: video clips of 45 seconds to one minute, together with large amounts of accompanying sensor data, are fed to neural networks for offline processing, and a large fleet of machines and AI algorithms then generates labeled datasets that can be used to train the network.

[Figure 7: Automatic annotation pipeline for video clips]


To identify drivable areas such as roads, lanes, and intersections, Tesla uses NeRF ("neural radiance fields"), an image-processing technique that lifts 2D images into 3D. Given known XY coordinates, a neural network predicts the ground height, generating countless XYZ points along with semantics such as road edges, lane lines, and road surface. These information points are projected back into the camera images, the projections are compared with the image-segmentation results the network produced earlier, and the images from all cameras are optimized jointly; combining the time and space dimensions then yields a more complete reconstructed scene.

[Figure 8: Road reconstruction]
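
A stripped-down version of the "query (x, y), predict ground height" idea is sketched below: a small MLP is fitted to synthetic road-surface samples and can then be queried at arbitrary coordinates. The network size, the synthetic slope, and the omission of semantics and of the reprojection check against segmentation are all simplifications.

```python
import torch
import torch.nn as nn

# Minimal "implicit ground" model: for any road-plane coordinate (x, y),
# a small MLP returns the ground height z. Predicted 3D points could then
# be re-projected into the cameras and checked against segmentation.
ground_net = nn.Sequential(nn.Linear(2, 64), nn.ReLU(),
                           nn.Linear(64, 64), nn.ReLU(),
                           nn.Linear(64, 1))

xy = torch.rand(512, 2) * 50.0                # sampled road-plane coords, metres
z = 0.02 * xy[:, :1] + 0.01 * xy[:, 1:2]      # synthetic gently sloping ground

def predict_height(points_xy):
    return ground_net(points_xy / 50.0)       # normalize inputs for the MLP

opt = torch.optim.Adam(ground_net.parameters(), lr=1e-2)
for _ in range(500):
    opt.zero_grad()
    loss = nn.functional.mse_loss(predict_height(xy), z)
    loss.backward()
    opt.step()

query = torch.tensor([[10.0, 20.0]])
print(predict_height(query).item())           # roughly 0.02*10 + 0.01*20 = 0.4
```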


This technique also cross-checks the road information reconstructed by different vehicles passing the same location: the prediction is considered correct only when the information agrees at every location point. Combined, this yields an effective way of annotating the road surface.

[Figure 9: Annotations from multiple video clips overlap and cross-check one another]
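
A toy version of this cross-checking is sketched below: several trips report their reconstructed ground height at the same locations, and a label is accepted only where all trips agree within a tolerance. The numbers and the agreement rule are invented for illustration.

```python
import numpy as np

# Consistency check: each row is one map location, each column one trip's
# reconstructed ground height there. Accept a label only where all trips
# agree within a tolerance around the median.
reports = np.array([
    [0.41, 0.40, 0.43, 0.39],   # location A: four trips agree
    [0.10, 0.52, 0.48, 0.51],   # location B: one outlier trip
])
median = np.median(reports, axis=1, keepdims=True)
agree = np.abs(reports - median) < 0.05
accepted = agree.all(axis=1)
print(accepted)                  # [ True False ] -- location B is rejected
```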


This is fundamentally different from high-definition maps: as long as the annotations generated from the video clips keep getting more accurate and stay consistent with the actual road conditions seen in the video, the data does not need to be maintained like a map.


At the same time, these techniques can also be used to identify and reconstruct static objects; both textured and texture-less objects can be labeled from these 3D information points, and the labeled points are very useful for recognizing arbitrary obstacles from the cameras.

[Figure 10: Reconstruction of static objects from 3D information points]


Another benefit of processing and labeling this data offline is that the in-car network can only predict other moving objects moment by moment, whereas the offline network knows both the past and the future because the clip is fixed. It can therefore estimate, calibrate, and optimize the speed and acceleration of all objects from the recorded data, even through occlusions, and label them accordingly. A network trained on these labels then judges other moving objects far more accurately, which in turn makes the planning module's job easier.

[Figure 11: Offline calibration and labeling of vehicle and pedestrian speed and acceleration]
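
The benefit of knowing past and future can be shown with ordinary non-causal smoothing: the sketch below estimates velocity and acceleration from a fixed, noisy position track using a symmetric Savitzky-Golay window around every sample, something an online, frame-by-frame network cannot do. The motion model, frame rate, and noise level are made up.

```python
import numpy as np
from scipy.signal import savgol_filter

# Offline, the whole clip is fixed, so every sample's velocity and
# acceleration can be estimated from both its past AND its future.
dt = 0.1                                        # assumed frame interval, seconds
t = np.arange(0.0, 5.0, dt)
rng = np.random.default_rng(0)
position = 2.0 * t + 0.5 * t**2 + rng.normal(0, 0.05, t.size)   # noisy track

# Savitzky-Golay derivatives use a symmetric window around each sample.
velocity = savgol_filter(position, window_length=15, polyorder=2,
                         deriv=1, delta=dt)
acceleration = savgol_filter(position, window_length=15, polyorder=2,
                             deriv=2, delta=dt)

print(velocity[25].round(2))       # close to the true 2 + t = 4.5 m/s
print(acceleration[25].round(2))   # close to the true 1.0 m/s^2, up to noise
```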


Putting these pieces together, all road-related, static, and dynamic objects in the video data can be recognized, predicted, and reconstructed, and their dynamics annotated.

[Figure 12: Reconstruction and annotation of the surroundings from video clips]


This kind of video annotation has become the core of training the autonomous-driving networks. In one project, three months of training the network on such data reproduced all the functions of the millimeter-wave radar with better accuracy, which is why the radar was removed.

[Figure 13: Camera-only estimates of speed and distance]


This approach is highly effective, but it needs large amounts of video data for training. Tesla therefore also developed "simulation scene technology" to generate the "edge cases" that are rare in the real world for autonomous-driving training. As shown in the figure below, Tesla engineers can vary the environment and other parameters (obstacles, collisions, comfort, and so on) in the simulated scene, which greatly improves training efficiency.

[Figure 14: Simulated scenario]
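
One hypothetical way to picture "simulated edge cases from parameters" is a simple sweep over scenario parameters, as sketched below; the parameter names and values are invented and bear no relation to Tesla's simulator.

```python
import itertools
from dataclasses import dataclass

# Hypothetical parameterized scenario generation: rare "edge cases" are
# produced by sweeping environment and actor parameters instead of
# waiting for them to occur on real roads.
@dataclass
class Scenario:
    weather: str
    time_of_day: str
    obstacle: str
    obstacle_speed_mps: float

weathers = ["clear", "rain", "dense_fog"]
times = ["noon", "night"]
obstacles = [("jaywalker", 1.5), ("stalled_truck", 0.0), ("cyclist", 6.0)]

scenarios = [Scenario(w, t, name, v)
             for w, t, (name, v) in itertools.product(weathers, times, obstacles)]
print(len(scenarios), scenarios[0])      # 18 combinations from three short lists
```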


Tesla uses this simulation mode to train the network and has already used 300 million images with 5 billion annotations; it will continue using simulation to tackle harder problems in the future.

[Figure 15: Improvements from simulation and plans for the coming months]


In summary, improving the autonomous-driving networks faster requires processing and computing over massive numbers of video clips. Removing the millimeter-wave radar alone, for example, required processing 2.5 million video clips and generating more than 10 billion annotations; as a result, hardware is increasingly the bottleneck on development speed.
