Why is it more accurate than the human eye? Tesla's vision solution explained

Publisher: NanoScribe | Latest update: 2021-12-17 | Source: 第一电动

On December 10, Tesla held an offline "T-talk" sharing and discussion session in Beijing themed "The 'Bionic Brain' of Autonomous Driving". The event covered the latest progress of Tesla's AI technology: how a pure-vision approach achieves precise autonomous-driving capability and delivers a safer, more reliable experience than a radar-plus-vision fusion approach, along with Tesla's own intelligent algorithms and other exclusive content, giving participants a deeper understanding of Tesla's exploration of autonomous driving.


Adhering to visual perception: using AI neural networks to improve assisted-driving capability


As shown in Figure 1, Andrej Karpathy said: "We hope to create neural-network connections similar to an animal's visual cortex, simulating how the brain processes incoming information. Just as light enters the retina, we want to simulate that process with cameras."

[Figure 1: Schematic of a camera simulating the human image-processing pipeline]


The multi-task learning neural-network architecture HydraNets feeds the raw data from the eight cameras through a shared backbone, using a RegNet backbone and a BiFPN feature-fusion network, to produce image features at several resolutions that are then consumed by neural-network tasks with different requirements.

[Figure 2: HydraNets multi-task learning neural-network architecture]
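
As a rough illustration of the shared-backbone, multi-head idea behind HydraNets, the sketch below builds a tiny PyTorch network with one trunk and several task heads. The layer sizes, the number of heads, and the use of a plain convolutional trunk in place of RegNet + BiFPN are illustrative assumptions, not Tesla's actual architecture.

```python
import torch
import torch.nn as nn

class TinyHydraNet(nn.Module):
    """Illustrative multi-task net: one shared trunk, several light task heads."""
    def __init__(self, num_head_classes=(10, 4, 2)):
        super().__init__()
        # Shared trunk standing in for the RegNet + BiFPN feature extraction.
        self.trunk = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        # One head per task (e.g. objects, lane attributes, traffic lights).
        self.heads = nn.ModuleList(
            nn.Linear(64, n_cls) for n_cls in num_head_classes
        )

    def forward(self, images):
        shared = self.trunk(images)                    # features computed once
        return [head(shared) for head in self.heads]   # reused by every head

frames = torch.randn(8, 3, 128, 128)    # one frame from each of 8 cameras
outputs = TinyHydraNet()(frames)
print([o.shape for o in outputs])
```

The point of the structure is that the expensive feature extraction runs once and all downstream tasks share it.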


However, because this structure processes single frames from individual cameras, it runs into many bottlenecks in practice. A Transformer module was therefore added so that the extracted two-dimensional image features are projected into a three-dimensional vector space spanning all the cameras, greatly improving recognition rate and accuracy.
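
A minimal sketch of how a Transformer can lift per-camera 2D features into a shared top-down (bird's-eye-view) space is shown below: learned BEV queries cross-attend over the flattened image features of all cameras. The feature dimension, grid size, and single attention layer are assumptions chosen for illustration; Tesla's actual module is far more elaborate.

```python
import torch
import torch.nn as nn

class BEVCrossAttention(nn.Module):
    """Learned bird's-eye-view (BEV) queries attend over image features
    from all cameras, producing one fused top-down feature grid."""
    def __init__(self, dim=64, bev_cells=50 * 50):
        super().__init__()
        self.bev_queries = nn.Parameter(torch.randn(bev_cells, dim))
        self.attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)

    def forward(self, cam_feats):
        # cam_feats: (batch, n_cams * h * w, dim) -- flattened 2D features
        q = self.bev_queries.unsqueeze(0).expand(cam_feats.size(0), -1, -1)
        fused, _ = self.attn(q, cam_feats, cam_feats)
        return fused                          # (batch, bev_cells, dim)

cam_feats = torch.randn(1, 8 * 16 * 16, 64)   # 8 cameras, 16x16 feature maps each
bev = BEVCrossAttention()(cam_feats)
print(bev.shape)                              # torch.Size([1, 2500, 64])
```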


That is still not enough. Because the result is still a single frame, time and space dimensions are also needed so the vehicle has a feature "memory" to handle scenarios such as occlusion and road signs seen earlier. Ultimately, features of the driving environment are extracted from the video stream to form a vector space, letting the vehicle judge its surroundings accurately and with low latency and yielding a 4D vector space. These video feature sets are then used to train the autonomous-driving networks.

[Figure 3: Neural-network architecture for the video-based 4D vector space]
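
One simple way to picture the feature "memory" is a queue of recent per-frame features fused by a recurrent layer, as in the sketch below. The queue length, the GRU fusion, and the feature size are illustrative choices, not the published design.

```python
import collections
import torch
import torch.nn as nn

class TemporalFeatureQueue(nn.Module):
    """Keeps the last N per-frame BEV features in a queue and fuses them
    with a recurrent layer, giving the network a short-term 'memory'
    (useful when an object is temporarily occluded)."""
    def __init__(self, dim=64, history=12):
        super().__init__()
        self.queue = collections.deque(maxlen=history)
        self.rnn = nn.GRU(input_size=dim, hidden_size=dim, batch_first=True)

    def forward(self, frame_feat):            # frame_feat: (batch, dim)
        self.queue.append(frame_feat)
        seq = torch.stack(list(self.queue), dim=1)   # (batch, t, dim)
        _, fused = self.rnn(seq)              # last hidden state summarizes history
        return fused.squeeze(0)               # (batch, dim)

memory = TemporalFeatureQueue()
for _ in range(5):                            # feed five consecutive frames
    out = memory(torch.randn(1, 64))
print(out.shape)                              # torch.Size([1, 64])
```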


However, urban autonomous driving differs from highway driving, which leaves the planning module with two major challenges. The first is that a driving plan may have no single optimal solution: there can be many local optima, meaning that in the same driving environment the system can choose among several plans that are all good solutions. The second is that the problem is high-dimensional: the vehicle must not only react in the moment but also plan over the next stretch of time, estimating a great deal of information such as position, speed, and acceleration.


Tesla therefore tackles the planning module's two problems in two ways. One is discrete search, which finds the "answer" among the local optima with very high efficiency, about 2,500 searches every 1.5 milliseconds; the other is continuous function optimization, which handles the high-dimensional part. Discrete search first produces a global optimum, and continuous optimization then refines it, balancing demands across several dimensions, such as comfort and smoothness, to obtain the final planned path.
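
The two-stage idea, coarse discrete search followed by continuous refinement for comfort and smoothness, can be sketched on a toy one-dimensional lane-change problem as below; the cost terms, candidate set, and optimizer are assumptions chosen only to show the structure, not Tesla's planner.

```python
import numpy as np
from scipy.optimize import minimize

# Toy 1-D lane-change planner: pick a coarse candidate by discrete search,
# then refine the whole profile with continuous optimization.
target_offset = 3.5                      # desired lateral offset, metres
steps = 20

# 1) Discrete search: evaluate a handful of candidate end offsets.
candidates = np.linspace(0.0, 4.0, 9)
best = min(candidates, key=lambda c: abs(c - target_offset))

# 2) Continuous refinement: optimize the path for tracking + comfort.
def cost(path):
    track = (path[-1] - best) ** 2                 # reach the chosen offset
    jerk = np.sum(np.diff(path, n=2) ** 2)         # penalize harsh changes
    return track + 10.0 * jerk

x0 = np.linspace(0.0, best, steps)
result = minimize(cost, x0, method="L-BFGS-B")
print(result.x.round(2))                            # smooth lateral profile
```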


Besides planning its own path, the vehicle must also "estimate" the plans of other road users: using the same method, it plans plausible paths for other vehicles from their recognized type and basic parameters such as speed and acceleration, and reacts accordingly.


However, road conditions around the world are ever-changing and highly complex. Relying on discrete search alone would consume enormous resources and make decisions take too long, so Tesla combines a deep neural network with Monte Carlo tree search, improving decision-making efficiency by nearly an order of magnitude.

[Figure 4: Efficiency of the different search approaches]
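
The sketch below is not MCTS itself but a simplified best-first search that makes the same point: when a stand-in "policy score" (playing the role of a learned neural prior) orders which branches to expand, a full-depth plan is found after a handful of node expansions instead of enumerating every action sequence. The action set, plan depth, and scoring rule are invented for the demo.

```python
import heapq

# Toy illustration of why a learned prior speeds up planning search:
# instead of enumerating every action sequence (5**4 = 625 leaves here),
# branches are expanded in the order a stand-in "policy score" ranks them.
ACTIONS = ["keep", "left", "right", "slow", "fast"]
TARGET = ("keep", "left", "left", "keep")     # hidden "good" plan for the demo

def policy_score(path):
    """Stand-in for a neural-network prior: counts how many actions of the
    partial plan agree with the hidden good plan. A real system would use
    a learned model here instead."""
    return sum(a == b for a, b in zip(path, TARGET))

def guided_search(depth=4, budget=1000):
    """Best-first expansion ordered by the prior; returns plan + node count."""
    frontier = [(0, ())]
    expanded = 0
    while frontier and expanded < budget:
        _, path = heapq.heappop(frontier)     # most promising node first
        expanded += 1
        if len(path) == depth:
            return path, expanded
        for a in ACTIONS:
            child = path + (a,)
            heapq.heappush(frontier, (-policy_score(child), child))
    return None, expanded

plan, n = guided_search()
print(plan, "nodes expanded:", n)             # 5 expansions instead of ~625 leaves
```

With the hand-crafted prior above, the search expands only 5 nodes versus 625 leaf sequences for exhaustive enumeration, which is the kind of order-of-magnitude gap described in the text.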


The overall architecture of the final planning module is shown in Figure 5. Data is first processed into the 4D vector space by the pure-vision architecture; then, using the object recognition and shared features obtained earlier, a deep neural network searches for the global optimum, and the final plan is handed to the vehicle's control and actuation systems for execution.

[Figure 5: Overall architecture of visual recognition plus planning and execution]


Of course, no matter how good the neural-network architecture and processing methods are, they depend on a large, effective dataset. As the data moved from 2D to 3D and 4D, Tesla's manual-labeling team of roughly 1,000 people moved with it, labeling directly in the 4D vector space. A label drawn once in vector space is automatically projected into the individual images of each camera, greatly multiplying the amount of labeled data. Even so, manual labeling alone falls far short of what autonomous-driving training requires.

[Figure 6: Manual annotation in the 4D vector space]
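
The "label once in vector space, project into every camera" step boils down to standard pinhole-camera projection. The sketch below uses made-up intrinsics and camera poses to show how a single 3D label lands at different pixel coordinates in each camera image.

```python
import numpy as np

# A label placed once in the 3D vector space can be re-projected into every
# camera image with that camera's pose and intrinsics (pinhole model).
def project(point_world, R, t, K):
    """Project a 3D world point into pixel coordinates for one camera."""
    p_cam = R @ point_world + t              # world -> camera frame
    p_img = K @ p_cam                        # camera frame -> image plane
    return p_img[:2] / p_img[2]              # perspective divide -> (u, v)

K = np.array([[800.0, 0, 640], [0, 800.0, 360], [0, 0, 1]])   # assumed intrinsics
label_3d = np.array([2.0, 0.5, 10.0])        # one labeled point, metres

# Two made-up camera poses: a front camera and a slightly rotated one.
cams = {
    "front":  (np.eye(3),                    np.zeros(3)),
    "pillar": (np.array([[0.98, 0, 0.2],
                         [0.0,  1, 0.0],
                         [-0.2, 0, 0.98]]),  np.array([0.3, 0.0, 0.0])),
}
for name, (R, t) in cams.items():
    print(name, project(label_3d, R, t, K).round(1))
```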


Since humans are better at semantic recognition, while computers excel at geometry, triangulation, tracking, and reconstruction, Tesla wants a model in which humans and computers "divide the work harmoniously" and label data together.


Tesla has built a massive automatic-labeling pipeline: video clips of 45 seconds to one minute, together with large amounts of accompanying sensor data, are fed to neural networks for offline processing, and a large fleet of machines and AI algorithms then generates labeled datasets that can be used to train the network.

[Figure 7: Automatic annotation pipeline for video clips]


To identify drivable areas such as roads, lanes, and intersections, Tesla uses NeRF ("neural radiance fields"), an image-processing technique that lifts 2D images into 3D. Given known XY coordinates, a neural network predicts the ground height, generating countless XYZ points along with semantics such as road edges, lane lines, and road surface. These information points are projected back into the camera images, the projections are compared with the image-segmentation results the network produced earlier, and the images from all cameras are optimized jointly; combining the time and space dimensions then yields a more complete reconstructed scene.

[Figure 8: Road reconstruction]
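
A stripped-down version of the "query (x, y), predict ground height" idea is sketched below: a small MLP is fitted to synthetic road-surface samples and can then be queried at arbitrary coordinates. The network size, the synthetic slope, and the omission of semantics and of the reprojection check against segmentation are all simplifications.

```python
import torch
import torch.nn as nn

# Minimal "implicit ground" model: for any road-plane coordinate (x, y),
# a small MLP returns the ground height z. Predicted 3D points could then
# be re-projected into the cameras and checked against segmentation.
ground_net = nn.Sequential(nn.Linear(2, 64), nn.ReLU(),
                           nn.Linear(64, 64), nn.ReLU(),
                           nn.Linear(64, 1))

xy = torch.rand(512, 2) * 50.0                # sampled road-plane coords, metres
z = 0.02 * xy[:, :1] + 0.01 * xy[:, 1:2]      # synthetic gently sloping ground

def predict_height(points_xy):
    return ground_net(points_xy / 50.0)       # normalize inputs for the MLP

opt = torch.optim.Adam(ground_net.parameters(), lr=1e-2)
for _ in range(500):
    opt.zero_grad()
    loss = nn.functional.mse_loss(predict_height(xy), z)
    loss.backward()
    opt.step()

query = torch.tensor([[10.0, 20.0]])
print(predict_height(query).item())           # roughly 0.02*10 + 0.01*20 = 0.4
```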


This technique also cross-checks the road information reconstructed by different vehicles passing the same location: the prediction is considered correct only when the information agrees at every location point. Combined, this yields an effective way of annotating the road surface.

[Figure 9: Annotations from multiple video clips overlap and cross-check one another]
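
A toy version of this cross-checking is sketched below: several trips report their reconstructed ground height at the same locations, and a label is accepted only where all trips agree within a tolerance. The numbers and the agreement rule are invented for illustration.

```python
import numpy as np

# Consistency check: each row is one map location, each column one trip's
# reconstructed ground height there. Accept a label only where all trips
# agree within a tolerance around the median.
reports = np.array([
    [0.41, 0.40, 0.43, 0.39],   # location A: four trips agree
    [0.10, 0.52, 0.48, 0.51],   # location B: one outlier trip
])
median = np.median(reports, axis=1, keepdims=True)
agree = np.abs(reports - median) < 0.05
accepted = agree.all(axis=1)
print(accepted)                  # [ True False ] -- location B is rejected
```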


This is fundamentally different from high-definition maps: as long as the annotations generated from the video clips keep getting more accurate and stay consistent with the actual road conditions seen in the video, the data does not need to be maintained like a map.


At the same time, these techniques can also be used to identify and reconstruct static objects; both textured and texture-less objects can be labeled from these 3D information points, and the labeled points are very useful for recognizing arbitrary obstacles from the cameras.

[Figure 10: Reconstruction of static objects from 3D information points]


Another benefit of processing and labeling this data offline is that the in-car network can only predict other moving objects moment by moment, whereas the offline network knows both the past and the future because the clip is fixed. It can therefore estimate, calibrate, and optimize the speed and acceleration of all objects from the recorded data, even through occlusions, and label them accordingly. A network trained on these labels then judges other moving objects far more accurately, which in turn makes the planning module's job easier.

[Figure 11: Offline calibration and labeling of vehicle and pedestrian speed and acceleration]
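
The benefit of knowing past and future can be shown with ordinary non-causal smoothing: the sketch below estimates velocity and acceleration from a fixed, noisy position track using a symmetric Savitzky-Golay window around every sample, something an online, frame-by-frame network cannot do. The motion model, frame rate, and noise level are made up.

```python
import numpy as np
from scipy.signal import savgol_filter

# Offline, the whole clip is fixed, so every sample's velocity and
# acceleration can be estimated from both its past AND its future.
dt = 0.1                                        # assumed frame interval, seconds
t = np.arange(0.0, 5.0, dt)
rng = np.random.default_rng(0)
position = 2.0 * t + 0.5 * t**2 + rng.normal(0, 0.05, t.size)   # noisy track

# Savitzky-Golay derivatives use a symmetric window around each sample.
velocity = savgol_filter(position, window_length=15, polyorder=2,
                         deriv=1, delta=dt)
acceleration = savgol_filter(position, window_length=15, polyorder=2,
                             deriv=2, delta=dt)

print(velocity[25].round(2))       # close to the true 2 + t = 4.5 m/s
print(acceleration[25].round(2))   # close to the true 1.0 m/s^2, up to noise
```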


Putting these pieces together, all road-related, static, and dynamic objects in the video data can be recognized, predicted, and reconstructed, and their dynamics annotated.

[Figure 12: Reconstruction and annotation of the surroundings from video clips]


This kind of video annotation has become the core of training the autonomous-driving networks. In one project, three months of training the network on such data reproduced all the functions of the millimeter-wave radar with better accuracy, which is why the radar was removed.

[Figure 13: Camera-only estimates of speed and distance]


This approach is highly effective, but it needs large amounts of video data for training. Tesla therefore also developed "simulation scene technology" to generate the "edge cases" that are rare in the real world for autonomous-driving training. As shown in the figure below, Tesla engineers can vary the environment and other parameters (obstacles, collisions, comfort, and so on) in the simulated scene, which greatly improves training efficiency.

[Figure 14: Simulated scenario]
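
One hypothetical way to picture "simulated edge cases from parameters" is a simple sweep over scenario parameters, as sketched below; the parameter names and values are invented and bear no relation to Tesla's simulator.

```python
import itertools
from dataclasses import dataclass

# Hypothetical parameterized scenario generation: rare "edge cases" are
# produced by sweeping environment and actor parameters instead of
# waiting for them to occur on real roads.
@dataclass
class Scenario:
    weather: str
    time_of_day: str
    obstacle: str
    obstacle_speed_mps: float

weathers = ["clear", "rain", "dense_fog"]
times = ["noon", "night"]
obstacles = [("jaywalker", 1.5), ("stalled_truck", 0.0), ("cyclist", 6.0)]

scenarios = [Scenario(w, t, name, v)
             for w, t, (name, v) in itertools.product(weathers, times, obstacles)]
print(len(scenarios), scenarios[0])      # 18 combinations from three short lists
```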


Tesla uses this simulation mode to train the network and has already used 300 million images with 5 billion annotations; it will continue using simulation to tackle harder problems in the future.

[Figure 15: Improvements from simulation and plans for the coming months]


In summary, improving the autonomous-driving networks faster requires processing and computing over massive numbers of video clips. Removing the millimeter-wave radar alone, for example, required processing 2.5 million video clips and generating more than 10 billion annotations; as a result, hardware is increasingly the bottleneck on development speed.
