Comparison of pure vision and sensor fusion solutions
Tesla’s purely visual solution
Tesla demonstrated its pure-vision FSD (Full Self-Driving) system at its 2021 AI Day. Although FSD still counts only as an L2 system (the driver must be ready to take over at any time), it performs well when compared against other L2 driver-assistance systems. Moreover, this pure-vision solution incorporates many of the successful ideas from recent deep-learning research and is particularly distinctive in how it fuses multiple cameras. I personally think it is worth studying, at least from a technical standpoint.
Multi-camera configuration of the Tesla FSD system
Let's digress for a moment and talk about the head of Tesla's AI and vision team, Andrej Karpathy. Born in 1986, he received his PhD from Stanford University in 2015 under Professor Fei-Fei Li, a leading figure in computer vision and machine learning. His research focused on the intersection of natural language processing and computer vision, and on the application of deep neural networks to both. Musk recruited this young talent in 2016 and later put him in charge of Tesla's AI department; he is the chief architect of the algorithms behind the pure-vision FSD system.
In his AI Day presentation, Andrej first noted that Tesla's vision system five years ago obtained detection results on individual images and then mapped them into vector space (Vector Space). This "vector space" is one of the core concepts of the talk. As I understand it, it is the representation of all targets in the environment in the world coordinate system: for object detection, descriptive attributes such as a target's 3D position, size, orientation, and velocity form a vector, and the space made up of the description vectors of all targets is the vector space. The task of the visual perception system is to transform information from image space into vector space. There are two ways to do this: either complete all perception tasks in image space, map the results into vector space, and finally fuse the results from multiple cameras; or first transform the image features into vector space, fuse the features from multiple cameras there, and finally complete all perception tasks in vector space.
Andrej gave two examples of why the first approach is inadequate. First, because of perspective projection, results that look good in the image can have poor accuracy in vector space, especially in distant regions. As shown in the figure below, the positions of lane lines (blue) and road edges (red) become very inaccurate once projected into vector space, too inaccurate to support downstream autonomous-driving functions.
Perceptual results in image space (top) and their projection in vector space (bottom)
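To get a feel for the scale of the problem, here is a back-of-the-envelope sketch (my own illustration, not from the talk) using a flat-ground pinhole model with an assumed focal length and camera height: the same one-pixel detection error costs only a few centimeters of range accuracy nearby, but meters of error far away.

```python
# Flat-ground pinhole model: a ground point at distance d ahead of the camera
# projects v = f * h / d pixels below the principal point.
f_px = 1000.0   # assumed focal length in pixels
h = 1.5         # assumed camera height in meters

def distance_from_row(v_px: float) -> float:
    """Ground distance corresponding to an image row offset v_px (flat-ground model)."""
    return f_px * h / v_px

for d in (10.0, 30.0, 60.0):
    v = f_px * h / d                        # row offset where a point at distance d appears
    d_shifted = distance_from_row(v - 1.0)  # same point, detected one pixel too high
    print(f"at {d:4.0f} m: 1-pixel error -> {d_shifted - d:5.2f} m of range error")
```

With these made-up numbers, a one-pixel error costs about 0.07 m at 10 m but roughly 2.5 m at 60 m, which is why results that look fine in the image fall apart in vector space at range.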
Second, in a multi-camera system a single camera may not see a complete target because of its limited field of view. In the example below, a large truck appears across the fields of view of several cameras, but many of them see only part of the target and therefore cannot produce a correct detection from this incomplete information, so the quality of the subsequent fusion cannot be guaranteed either. This is in fact a general problem of decision-level fusion across multiple sensors.
A single camera's limited field of view
Based on the above analysis, image-space perception followed by decision-level fusion is not a good solution. Performing fusion and perception directly in vector space avoids these problems, and this is the core idea of the FSD perception system. Realizing it requires solving two key problems: how to transform features from image space into vector space, and how to obtain annotated data in vector space.
Spatial transformation of features
For the spatial transformation of features, the usual approach is to use the camera calibration to map image pixels into the world coordinate system. However, this is an ill-posed problem and requires additional constraints. Autonomous-driving applications typically use a ground-plane constraint: the target is assumed to lie on the ground, and the ground is assumed to be flat. This constraint is too strong and cannot be satisfied in many scenarios.
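For reference, here is a minimal sketch of this classical calibration-plus-ground-plane mapping, with made-up intrinsics and camera height: the pixel's viewing ray is intersected with a horizontal ground plane, which is exactly the assumption that breaks down on slopes or for objects above the ground.

```python
import numpy as np

# Example intrinsics and mounting height (made-up values for illustration).
K = np.array([[1000.0,    0.0, 640.0],
              [   0.0, 1000.0, 360.0],
              [   0.0,    0.0,   1.0]])
cam_height = 1.5  # meters above the (assumed horizontal) ground plane

def pixel_to_ground(u: float, v: float) -> np.ndarray:
    """Intersect the viewing ray of pixel (u, v) with the ground plane
    (camera coordinates: x right, y down, z forward)."""
    ray = np.linalg.inv(K) @ np.array([u, v, 1.0])   # direction of the viewing ray
    if ray[1] <= 0:
        raise ValueError("pixel is at or above the horizon; the ground constraint fails")
    scale = cam_height / ray[1]                      # stretch the ray until it hits the ground
    return ray * scale                               # 3D point in camera coordinates

print(pixel_to_ground(700.0, 420.0))  # a pixel slightly below the principal point
```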
There are three core points in Tesla's solution. First, the correspondence between image space and vector space is established with a Transformer using self-attention, in which the positional encoding of the vector space plays a very important role. I will not go into the implementation details here; I plan to write a separate article about them when I have time. Intuitively, the feature at each position in vector space can be regarded as a weighted combination of the features at all positions in the image, with the corresponding positions naturally receiving larger weights. This weighted combination is realized automatically through attention and positional encoding: it requires no manual design and is learned end-to-end from the downstream tasks.
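As a rough sketch of how such a learned weighting could look (this is my reading of the idea, not Tesla's implementation), each cell of the output grid in vector space can carry a learned positional query that attends over the features of all cameras; the attention weights are exactly the "weighted combination" described above. All sizes below are arbitrary illustration values.

```python
import torch
import torch.nn as nn

n_cams, H, W, C = 6, 12, 20, 64          # cameras, feature-map height/width, channels
bev_h, bev_w = 32, 32                     # output grid in vector space

img_feats = torch.randn(n_cams * H * W, 1, C)                 # flattened multi-camera features
img_pos   = nn.Parameter(torch.randn(n_cams * H * W, 1, C))   # image positional encoding
bev_query = nn.Parameter(torch.randn(bev_h * bev_w, 1, C))    # positional queries in vector space

attn = nn.MultiheadAttention(embed_dim=C, num_heads=8)
bev_feats, _ = attn(query=bev_query,
                    key=img_feats + img_pos,
                    value=img_feats)
bev_feats = bev_feats.permute(1, 2, 0).reshape(1, C, bev_h, bev_w)  # a BEV feature map
print(bev_feats.shape)  # torch.Size([1, 64, 32, 32])
```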
Second, in mass production the camera calibration differs from vehicle to vehicle, so the input data can be inconsistent with what the pre-trained model expects. The calibration therefore needs to be provided to the neural network as an additional input. A simple method is to concatenate the calibration parameters of all cameras, encode them with an MLP, and feed the result into the network. A better approach, however, is to rectify the images from the different cameras using the calibration, so that corresponding cameras on different vehicles produce consistent images.
Finally, video (multi-frame) input is used to extract temporal information, which increases the stability of the outputs, handles occlusions better, and allows target motion to be predicted. An additional input here is the vehicle's own motion (available from the IMU), which lets the network align feature maps from different time steps. Temporal information can be processed with 3D convolutions, Transformers, or RNNs; FSD uses an RNN, which in my experience does strike the best balance between accuracy and computational cost.
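A simplified sketch of this temporal part (not Tesla's actual RNN module, just an illustration of the mechanism): the ego-motion warps the previous bird's-eye-view feature map into the current frame, and a small recurrent-style update fuses it with the new features. Grid size, resolution, and the ego-motion values are made-up examples.

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

C, S, cell_m = 64, 32, 0.5                # channels, BEV grid size, meters per cell
prev_bev = torch.randn(1, C, S, S)        # features from the previous time step
curr_bev = torch.randn(1, C, S, S)        # features from the current time step

def warp_by_ego_motion(feat, dx_m, dyaw_rad):
    """Shift/rotate the previous BEV features so they line up with the current frame."""
    half_extent = S * cell_m / 2.0
    tx = dx_m / half_extent                      # normalized translation for affine_grid
    cos, sin = math.cos(dyaw_rad), math.sin(dyaw_rad)
    theta = torch.tensor([[[cos, -sin, 0.0],
                           [sin,  cos,  tx]]], dtype=feat.dtype)
    grid = F.affine_grid(theta, feat.shape, align_corners=False)
    return F.grid_sample(feat, grid, align_corners=False)

aligned_prev = warp_by_ego_motion(prev_bev, dx_m=1.2, dyaw_rad=0.02)  # ~1.2 m forward motion
update = nn.Conv2d(2 * C, C, kernel_size=3, padding=1)                # stand-in for an RNN cell
fused = torch.tanh(update(torch.cat([aligned_prev, curr_bev], dim=1)))
print(fused.shape)  # torch.Size([1, 64, 32, 32])
```

The real system would apply a proper recurrent cell across many time steps, but the align-then-update structure is the point being illustrated here.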
With these algorithmic improvements, the quality of FSD's output in vector space improves greatly. In the comparison below, the left side shows the output of the image-space perception + decision-level fusion scheme, while the right side shows the feature-transformation + vector-space perception and fusion scheme described above.
Image space perception (bottom left) vs. vector space perception (bottom right)
Annotations in vector space
Since this is a deep-learning system, data and annotation are naturally key links. Annotation in image space is intuitive, but what the system ultimately needs is annotation in vector space. Tesla's approach is to reconstruct a 3D scene from the images of multiple cameras and annotate that 3D scene directly. An annotator only needs to label once in 3D and can see the projection of the labels into each image in real time, adjusting them as needed.
Annotation in 3D space
Manual annotation is only one part of the overall annotation system. To obtain annotations faster and at higher quality, automatic annotation and simulation are also needed. The auto-labeling system first generates annotations from each camera individually and then merges them using various spatial and temporal cues; figuratively speaking, the cameras get together and agree on a consistent annotation. Beyond cooperation between cameras, multiple Tesla vehicles driving through the same scene can also merge and refine its annotation; GPS and IMU data are needed to obtain each vehicle's position and attitude so that the outputs of different vehicles can be spatially aligned. Automatic annotation solves the efficiency problem, but some rare scenes, such as the pedestrian running on a highway shown in the talk, require a simulator to generate synthetic data. Together, these techniques form Tesla's complete data collection and annotation system.
Vision + lidar solution
At almost the same time, HaoMo Zhixing announced that it was introducing Transformers into its data intelligence system MANA and gradually applying them to real-world road perception problems such as obstacle detection, lane-line detection, drivable-area segmentation, and traffic-sign detection. This shows how the technical routes of volume car makers converge once they have extremely large datasets to build on. In an era when autonomous-driving technology is flourishing, choosing the right track and building one's own technical advantages is vitally important for both Tesla and HaoMo.
In the development of autonomous driving there has long been debate over which sensors to use, currently centered on whether to follow a pure-vision route or a lidar route. Tesla adopts a pure-vision solution based on first principles, a choice also grounded in its fleet of over a million vehicles and tens of billions of kilometers of real-world driving data. There are two main arguments for using lidar. First, the gap in data scale is hard for other autonomous-driving companies to close, so to gain a competitive advantage they must increase the capability of their sensors; the cost of semi-solid-state lidar has already dropped to a few hundred dollars, which basically meets the needs of mass-produced models. Second, given the current state of the technology, pure vision can support L2/L2+ applications, but for L3/L4 applications (such as robotaxis) lidar is still indispensable.
In this context, whoever can amass massive data while supporting both cameras and lidar will undoubtedly have a first-mover advantage in the competition. According to the AI Day presentation by Gu Weihao, CEO of HaoMo Zhixing, the MANA system uses a Transformer to fuse visual and lidar data at a low level, thereby achieving deep fusion across space, time, and sensors.
Visual perception module
After the camera captures raw data, the data must be processed by an ISP (Image Signal Processor) before it can be used by the downstream neural network. The usual job of an ISP is to produce a visually pleasing image, but a neural network does not actually need to "see" a pleasing image; visual quality is designed for human eyes. MANA therefore treats the ISP as a layer of the neural network, so the network determines the ISP parameters, and calibrates the camera, according to the downstream tasks. This preserves as much of the original image information as possible and keeps the captured images as consistent as possible, in terms of processing parameters, with the images the network was trained on.
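A toy sketch of this "ISP as a network layer" idea (my illustration, not MANA's actual ISP): a couple of classic ISP operations, per-channel white balance and a gamma curve, are written as differentiable parameters, so the downstream perception loss rather than human viewing preference determines their values.

```python
import torch
import torch.nn as nn

class LearnableISP(nn.Module):
    def __init__(self):
        super().__init__()
        self.wb_gain = nn.Parameter(torch.ones(3))      # per-channel white-balance gains
        self.log_gamma = nn.Parameter(torch.zeros(1))   # gamma, stored in log space

    def forward(self, raw_rgb):                         # raw_rgb in [0, 1], shape (N, 3, H, W)
        x = raw_rgb * self.wb_gain.view(1, 3, 1, 1)
        return x.clamp(min=1e-6) ** torch.exp(self.log_gamma)

isp = LearnableISP()
backbone_input = isp(torch.rand(2, 3, 256, 512))        # then fed to the backbone as usual
print(backbone_input.shape)                             # torch.Size([2, 3, 256, 512])
```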
The processed image data is fed into the backbone network. The DarkNet backbone used by HaoMo is similar to a multi-layer convolutional residual network (ResNet), the most commonly used type of backbone in the industry. The features from the backbone are then sent to different heads for different tasks, grouped into three categories: global tasks, road tasks, and object tasks. The tasks share the backbone features, and each task has its own neck network to extract task-specific features, which is essentially the same idea as Tesla's HydraNet. What distinguishes the MANA perception system is a dedicated neck, Global Context Pooling, that extracts global information for the global tasks. This is actually very important, because global tasks (such as drivable-area detection) depend heavily on understanding the scene, and scene understanding depends on extracting global information.
Vision and lidar perception modules of the MANA system
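Structurally, the shared-backbone, multi-head layout of the vision module can be sketched as follows (the tiny backbone and all sizes are placeholders, and since the exact Global Context Pooling design was not detailed in the talk, plain global average pooling stands in for it here).

```python
import torch
import torch.nn as nn

class TinyBackbone(nn.Module):
    def __init__(self, c=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, c, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(c, c, 3, stride=2, padding=1), nn.ReLU())
    def forward(self, x):
        return self.net(x)

class GlobalContextNeck(nn.Module):
    """Pools the whole feature map so global tasks can 'see' the entire scene."""
    def forward(self, feat):
        return torch.flatten(nn.functional.adaptive_avg_pool2d(feat, 1), 1)

backbone = TinyBackbone()
global_neck = GlobalContextNeck()
road_head   = nn.Conv2d(64, 2, 1)      # e.g. lane / road-edge segmentation logits
object_head = nn.Conv2d(64, 9, 1)      # e.g. per-cell object attributes
global_head = nn.Linear(64, 5)         # e.g. scene-level classes

feat = backbone(torch.rand(1, 3, 256, 512))   # one shared forward pass
outputs = {
    "road":   road_head(feat),
    "object": object_head(feat),
    "global": global_head(global_neck(feat)),
}
print({k: tuple(v.shape) for k, v in outputs.items()})
```

The key property is that a single backbone forward pass feeds all heads, while only the global head consumes the pooled, scene-level feature.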
Lidar perception module
Lidar perception uses the PointPillars algorithm, a commonly used point-cloud 3D object detector in the industry. Its characteristic is to project the 3D information onto a 2D top-down view and then perform feature extraction and object detection on the 2D data, much as in vision tasks. The advantage of this approach is that it avoids very computationally expensive 3D convolutions, so the algorithm is fast overall; PointPillars was also the first point-cloud object detection algorithm to meet real-time processing requirements.
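A minimal sketch of the pillar idea (greatly simplified; the real PointPillars learns per-pillar features with a small PointNet before scattering them): the point cloud is binned into vertical columns on a top-down grid, and simple per-pillar statistics turn it into a 2D pseudo-image that an image-style detection backbone can consume.

```python
import numpy as np

grid_range, cell = 40.0, 0.5                    # +/-40 m around the car, 0.5 m pillars
n_cells = int(2 * grid_range / cell)            # 160 x 160 BEV grid

# Fake lidar points (x, y, z) for illustration only.
xy = np.random.uniform(-grid_range, grid_range, size=(5000, 2))
z  = np.random.uniform(0.0, 3.0, size=(5000, 1))
points = np.hstack([xy, z])

ix = ((points[:, 0] + grid_range) / cell).astype(int).clip(0, n_cells - 1)
iy = ((points[:, 1] + grid_range) / cell).astype(int).clip(0, n_cells - 1)

pseudo_image = np.zeros((2, n_cells, n_cells), dtype=np.float32)   # channels: count, max height
np.add.at(pseudo_image[0], (ix, iy), 1.0)                          # points per pillar
np.maximum.at(pseudo_image[1], (ix, iy), points[:, 2])             # highest point in each pillar

print(pseudo_image.shape)   # (2, 160, 160) -- ready for a 2D detection backbone
```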
In previous versions of MANA, visual and lidar data were processed separately and fused only at the level of their respective outputs, that is, post-fusion. This preserves the independence of the two subsystems as much as possible and provides mutual safety redundancy. However, post-fusion also prevents the neural network from fully exploiting the complementarity of the two heterogeneous sensors' data and learning the most valuable features.
Fusion perception module
The three-in-one fusion of space, time, and sensors mentioned earlier is the key point that distinguishes the MANA perception system from others. As Gu Weihao put it at AI Day, most current perception systems suffer from "discontinuity in time and fragmentation in space".
Fusion perception module of the MANA system
Spatial fragmentation arises because multiple homogeneous or heterogeneous sensors sit in different spatial coordinate systems. For homogeneous sensors (such as multiple cameras), different mounting positions and angles give them different fields of view (FOV); each FOV is limited, so the data from multiple sensors must be fused to obtain 360-degree perception around the vehicle, which is essential for autonomous-driving systems at L2 and above. For heterogeneous sensors (such as cameras and lidar), the acquisition principles differ, so the data they produce differ greatly in content and form: cameras capture images with rich texture and semantic information, suited to object classification and scene understanding, while lidar captures point clouds with very accurate spatial positions, suited to perceiving 3D structure and detecting obstacles. If the system processes each sensor individually and only fuses the results afterwards, it cannot take advantage of the complementary information contained in the multi-sensor data.
The discontinuity in time arises because the system processes data frame by frame, with tens of milliseconds between frames, and focuses on single-frame results while treating temporal fusion as a post-processing step; for example, a separate object tracking module strings together the single-frame detections. This is also a late-fusion strategy and therefore fails to fully exploit the useful temporal information.
So how can these two problems be solved? The answer is to use Transformers for pre-fusion in both space and time.
Consider spatial fusion first. Unlike the role the Transformer plays in general vision tasks (such as image classification and object detection), its main role in spatial pre-fusion is not feature extraction but coordinate transformation. This is similar to Tesla's technique, but goes further by adding lidar and performing multi-sensor (cross-modal) early fusion, which is the job of the Cross-Domain Association module. A previous article introduced the basic working principle of the Transformer: simply put, it computes the correlations between the elements of the input and uses those correlations for feature extraction. Coordinate transformation can be formalized as a similar process. For example, to convert images from multiple cameras into a 3D spatial coordinate system consistent with the lidar point cloud, the system must find the correspondence between each point in the 3D coordinate system and the image pixels. The traditional geometric method projects a 3D point to a point in the image and uses a small neighborhood around it (say 3x3 pixels) to compute the value assigned to the 3D point. A Transformer instead connects the 3D point to every image location and lets self-attention (that is, the correlation computation) decide which image locations contribute to it. As shown in the figure below, the Transformer first encodes the image features and then decodes them into 3D space, with the coordinate transformation embedded in the attention computation. This breaks the neighborhood constraint of the traditional method: the algorithm can look at a much larger part of the scene and perform the coordinate transformation based on its understanding of the scene. Moreover, because the transformation happens inside the neural network, its parameters can be adjusted automatically according to the downstream tasks, making the whole process data-driven and task-dependent. With a very large dataset behind it, Transformer-based transformation of the spatial coordinate system is entirely feasible.
Use Transformer to convert the image coordinate system to the three-dimensional space coordinate system
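A sketch of how such cross-modal attention could be wired (my reading of the cross-domain association idea, not MANA's code): each cell of the lidar BEV grid issues a query built from its pillar feature and a 3D position encoding, and attends over the encoded image features of all cameras; the attention weights decide which pixels contribute to that 3D location, replacing the fixed 3x3 neighborhood of the geometric method. All sizes are illustrative.

```python
import torch
import torch.nn as nn

C, n_bev, n_img_tokens = 64, 32 * 32, 6 * 12 * 20

lidar_bev  = torch.randn(n_bev, 1, C)                      # pillar features per BEV cell
bev_pos    = nn.Parameter(torch.randn(n_bev, 1, C))        # 3D position encoding of each cell
img_tokens = torch.randn(n_img_tokens, 1, C)               # encoded multi-camera features
img_pos    = nn.Parameter(torch.randn(n_img_tokens, 1, C))

cross_attn = nn.MultiheadAttention(embed_dim=C, num_heads=8)
img_context, _ = cross_attn(query=lidar_bev + bev_pos,     # 3D positions ask the images
                            key=img_tokens + img_pos,
                            value=img_tokens)

fused = torch.cat([lidar_bev, img_context], dim=-1)        # camera + lidar features per cell
print(fused.shape)  # torch.Size([1024, 1, 128])
```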
Now consider pre-fusion in time. This is easier to understand than spatial pre-fusion, because the Transformer was originally designed for sequential data. The feature queue, the sequence of outputs from the spatial fusion module over time, can be regarded as the words of a sentence, so a Transformer can naturally be used to extract temporal features. Compared with Tesla's RNN-based temporal fusion, the Transformer-based solution has stronger feature-extraction capability but lower runtime efficiency. HaoMo's plan also mentions RNNs, and I believe the two approaches are being compared, and perhaps combined to some extent, to take advantage of both. In addition, with lidar available, HaoMo also uses SLAM-style tracking and optical flow algorithms, which allow fast self-localization and scene perception and further help maintain temporal continuity.
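A small sketch of what temporal pre-fusion over the feature queue might look like (illustrative only): the spatially fused features from the last few time steps are treated as the tokens of a sentence and passed through a Transformer encoder, so every time step can attend to every other instead of being consumed one by one as in an RNN.

```python
import torch
import torch.nn as nn

T, C = 8, 64                                        # queue length and feature width
feature_queue = torch.randn(T, 1, C)                # one token per past time step
time_pos = nn.Parameter(torch.randn(T, 1, C))       # temporal position encoding

layer = nn.TransformerEncoderLayer(d_model=C, nhead=8, dim_feedforward=128)
temporal_encoder = nn.TransformerEncoder(layer, num_layers=2)

fused_over_time = temporal_encoder(feature_queue + time_pos)
current = fused_over_time[-1]                       # temporally fused feature for "now"
print(current.shape)                                # torch.Size([1, 64])
```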
Cognitive module
Besides the perception module, HaoMo also has some distinctive designs in the cognitive module, that is, the path-planning part. Gu Weihao explained at AI Day that the biggest difference between the cognitive module and the perception module is that the cognitive module has no fixed "ruler" against which to measure its performance, and it must weigh many factors such as safety, comfort, and efficiency, which undoubtedly makes its design harder. HaoMo's answer to these problems is scene digitization plus large-scale reinforcement learning.
Scene digitization is a parameterized representation of the different scenes encountered on the road. Parameterization makes it possible to classify scenes effectively and handle them differently. Scene parameters are divided by granularity into macro and micro: macro parameters include weather, lighting, and road conditions, while micro parameters describe things such as the vehicle's driving speed and its relationship to surrounding obstacles.
Macroscopic scene clustering in MANA system
Microscopic scenes in the MANA system (car-following scene as an example)
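As a toy illustration of scene digitization (the field names are my own guesses at the kind of macro and micro parameters meant here, not MANA's actual schema):

```python
from dataclasses import dataclass

@dataclass
class MacroScene:
    weather: str          # e.g. "rain", "clear"
    lighting: str         # e.g. "day", "night", "backlit"
    road_type: str        # e.g. "highway", "urban"

@dataclass
class MicroScene:
    ego_speed_mps: float          # own driving speed
    lead_distance_m: float        # gap to the vehicle being followed
    lead_speed_mps: float         # its speed
    lateral_offset_m: float       # position within the lane

sample = (MacroScene("clear", "day", "highway"),
          MicroScene(ego_speed_mps=27.0, lead_distance_m=45.0,
                     lead_speed_mps=25.0, lateral_offset_m=0.1))
print(sample)
```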
Once the various scenes have been digitized, artificial-intelligence algorithms can learn from them, and reinforcement learning, the method behind the famous AlphaGo, is generally a good fit for this task. Unlike Go, however, the evaluation criterion for driving is not winning or losing but how reasonable and safe the behavior is; correctly evaluating each driving decision is the key to designing the reinforcement-learning algorithm in the cognitive system. HaoMo's strategy is to imitate the behavior of human drivers, which is also the fastest and most effective approach. Of course, data from a handful of drivers is nowhere near enough; this strategy, too, rests on massive amounts of human driving data.
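To make the evaluation problem concrete, here is a deliberately simplistic, hypothetical scoring function (my own illustration, not HaoMo's reward design): a planned trajectory is scored by safety and comfort terms plus how closely it matches what human drivers did in similar, digitized scenes.

```python
def score_trajectory(plan, human_reference, min_gap_m, max_jerk):
    # Hypothetical inputs: `plan` and `human_reference` are lists of (x, y) waypoints.
    safety = 0.0 if min_gap_m > 2.0 else -10.0                 # hard penalty for close calls
    comfort = -max_jerk                                        # smoother is better
    imitation = -sum((px - hx) ** 2 + (py - hy) ** 2           # stay close to human behavior
                     for (px, py), (hx, hy) in zip(plan, human_reference))
    return safety + 0.5 * comfort + 0.1 * imitation

print(score_trajectory([(0, 0), (1, 0.1)], [(0, 0), (1, 0.0)],
                       min_gap_m=8.0, max_jerk=0.4))
```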