A brief analysis of autonomous driving visual perception algorithms

Publisher: EtherealHeart | Last updated: 2023-08-04 | Source: elecfans

Autonomous Driving Visual Perception Algorithm (Part 1)

Environmental perception is the first link in the autonomous driving pipeline and the bridge between the vehicle and its environment. The overall performance of an autonomous driving system depends to a large extent on the quality of its perception system. Currently, there are two mainstream technical routes for environmental perception:


① A vision-dominated multi-sensor fusion solution, typified by Tesla;

② A LiDAR-centric solution with other sensors as auxiliaries, typified by Google, Baidu, etc.

We will introduce the key visual perception algorithms in environmental perception. Their task scope and the technical fields they belong to are shown in the figure below. We divide them into two sections, covering the context and development of 2D and 3D visual perception algorithms respectively.

[Figure: task scope and technical fields of visual perception algorithms]

In this section, we first introduce 2D visual perception algorithms for several tasks widely used in autonomous driving, including image- and video-based 2D object detection and tracking and semantic segmentation of 2D scenes. In recent years, deep learning has penetrated every area of visual perception and achieved good results, so we focus on classic deep learning algorithms.

01 Object Detection

1.1 Two-stage detection

Two-stage means the detection process has two steps: first extract candidate object regions, then perform CNN-based classification and recognition on each region. For this reason, two-stage methods are also called region-proposal-based object detection. Representative algorithms include the R-CNN series (R-CNN, Fast R-CNN, Faster R-CNN).

Faster R-CNN is the first end-to-end detection network. In the first stage, a Region Proposal Network (RPN) generates candidate boxes from the feature map, and ROIPooling aligns the candidate features to a fixed size; in the second stage, fully connected layers perform refined classification and box regression. Faster R-CNN also introduces the idea of anchors to reduce the difficulty of regression and speed up training: anchors of different sizes and aspect ratios are generated at each position of the feature map and serve as references for box regression. With anchors, the regression task only has to handle relatively small offsets, which makes the network easier to train. The figure below shows the Faster R-CNN network structure.

[Figure: Faster R-CNN network structure]
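To make the anchor idea concrete, here is a minimal sketch that generates anchors of several scales and aspect ratios at every feature-map position, roughly in the spirit of the RPN in Faster R-CNN; the stride, scales, and ratios below are illustrative assumptions, not the paper's exact settings.

```python
import numpy as np

def generate_anchors(feat_h, feat_w, stride=16,
                     scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    """Return (feat_h * feat_w * len(scales) * len(ratios), 4) anchors
    in (x1, y1, x2, y2) image coordinates, centered on each feature-map cell."""
    anchors = []
    for scale in scales:
        for ratio in ratios:
            # Keep the anchor area ~ scale**2 while varying the aspect ratio (ratio = h / w).
            w = scale * np.sqrt(1.0 / ratio)
            h = scale * np.sqrt(ratio)
            anchors.append([-w / 2, -h / 2, w / 2, h / 2])
    base = np.array(anchors)                      # (A, 4) anchors around the origin

    # Shift the base anchors to every feature-map location (cell center in image space).
    xs = (np.arange(feat_w) + 0.5) * stride
    ys = (np.arange(feat_h) + 0.5) * stride
    cx, cy = np.meshgrid(xs, ys)
    shifts = np.stack([cx, cy, cx, cy], axis=-1).reshape(-1, 1, 4)   # (H*W, 1, 4)
    return (shifts + base).reshape(-1, 4)

# Example: a 38x50 feature map from a ~600x800 image with stride 16.
all_anchors = generate_anchors(38, 50)
print(all_anchors.shape)   # (17100, 4) = 38 * 50 * 9 anchors
```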

The first stage of Cascade R-CNN is exactly the same as Faster R-CNN, while the second stage cascades multiple RoIHead layers. Most subsequent work consists of incremental improvements to these networks or combinations of earlier ideas, with few breakthrough advances.

1.2 Single-stage detection

Compared with two-stage algorithms, single-stage algorithms extract features only once to perform detection: they are faster, but generally slightly less accurate. The pioneer of this family is YOLO, followed by SSD and RetinaNet, which introduced further improvements. Later YOLO versions (YOLOv2 through YOLOv5) incorporated many of these performance-boosting tricks. Although its accuracy is lower than that of two-stage detectors, YOLO's faster running speed has made it the mainstream choice in industry. The figure below shows the YOLOv3 network structure.

[Figure: YOLOv3 network structure]
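As a rough illustration of how a YOLO-style head turns raw grid predictions into boxes, the sketch below applies the YOLOv3 decoding rule (sigmoid offsets for the box center relative to its grid cell, exponential scaling of the anchor for width and height); the tensor layout and anchor sizes are assumptions made for the example.

```python
import torch

def decode_yolo(pred, anchors, stride):
    """pred: (B, A, H, W, 5+C) raw head output with (tx, ty, tw, th, obj, cls...).
    anchors: (A, 2) anchor (w, h) in pixels. Returns center-format boxes in pixels."""
    B, A, H, W, _ = pred.shape
    gy, gx = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")

    # b_x = (sigmoid(t_x) + c_x) * stride, and similarly for y.
    cx = (torch.sigmoid(pred[..., 0]) + gx) * stride
    cy = (torch.sigmoid(pred[..., 1]) + gy) * stride
    # b_w = p_w * exp(t_w), b_h = p_h * exp(t_h), where (p_w, p_h) is the anchor size.
    pw = anchors[:, 0].reshape(1, A, 1, 1)
    ph = anchors[:, 1].reshape(1, A, 1, 1)
    w = pw * torch.exp(pred[..., 2])
    h = ph * torch.exp(pred[..., 3])
    obj = torch.sigmoid(pred[..., 4])              # objectness score
    return torch.stack([cx, cy, w, h], dim=-1), obj
```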

1.3 Anchor-free detection

This type of method generally represents an object as a set of key points and uses a CNN to regress their positions. The key points can be the center point of the box (CenterNet), its corner points (CornerNet), or a set of representative points (RepPoints). CenterNet turns object detection into a center-point prediction problem: each object is represented by its center point, and its bounding box is obtained by additionally predicting the offset of the center point and the box width and height. Classification information is encoded in heatmaps, with one heatmap per category. When a coordinate contains the center point of an object, a key point is generated there, and each key point is spread over the heatmap with a Gaussian kernel. The figure below shows the details.

[Figure: CenterNet center-point prediction details]
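The sketch below renders one key point onto a class heatmap with a 2D Gaussian, which is roughly how CenterNet-style training targets are built; the fixed radius used here is a simplification, since the paper derives the radius from the box size.

```python
import numpy as np

def draw_gaussian(heatmap, center, radius):
    """Splat a 2D Gaussian peak (value 1 at the center) onto `heatmap` in place."""
    diameter = 2 * radius + 1
    sigma = diameter / 6.0
    y, x = np.ogrid[-radius:radius + 1, -radius:radius + 1]
    gaussian = np.exp(-(x * x + y * y) / (2 * sigma * sigma))

    cx, cy = center
    H, W = heatmap.shape
    # Clip the Gaussian patch so it stays inside the heatmap boundaries.
    left, right = min(cx, radius), min(W - cx, radius + 1)
    top, bottom = min(cy, radius), min(H - cy, radius + 1)
    patch = gaussian[radius - top:radius + bottom, radius - left:radius + right]
    region = heatmap[cy - top:cy + bottom, cx - left:cx + right]
    # Element-wise max so overlapping objects of the same class keep the stronger peak.
    np.maximum(region, patch, out=region)
    return heatmap

# One heatmap per class; here a single 128x128 map with an object centered at (40, 60).
hm = np.zeros((128, 128), dtype=np.float32)
draw_gaussian(hm, center=(40, 60), radius=6)
```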

RepPoints represents an object as a set of representative points and uses deformable convolution to adapt to shape changes. The point set is finally converted into a bounding box, which is used to compute the loss against the manual annotations.

1.4 Transformer Detection

Whether single-stage or two-stage, and whether anchors are used or not, the detectors above make little use of the attention mechanism. To address this, Relation Net and DETR use Transformers to bring attention into object detection. Relation Net uses a Transformer to model the relationships between different objects and fuses this relation information into the features to enhance them. DETR proposes a completely new Transformer-based detection architecture, opening a new era of object detection. The figure below shows the DETR pipeline: a CNN first extracts image features, a Transformer then models global spatial relationships, and the output is finally matched with the manual annotations by a bipartite matching algorithm.

[Figure: DETR algorithm flow]
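As a hedged sketch of the set-based matching step, the code below builds a cost matrix from classification probability and L1 box distance and solves it with the Hungarian algorithm via SciPy's linear_sum_assignment; the cost weights are illustrative, and DETR's full matching cost also includes a generalized IoU term.

```python
import torch
from scipy.optimize import linear_sum_assignment

def match_predictions(pred_logits, pred_boxes, gt_labels, gt_boxes,
                      w_cls=1.0, w_l1=5.0):
    """pred_logits: (N, C), pred_boxes: (N, 4) normalized cxcywh,
    gt_labels: (M,), gt_boxes: (M, 4). Returns matched (pred_idx, gt_idx) pairs."""
    prob = pred_logits.softmax(-1)                      # (N, C) class probabilities
    cost_cls = -prob[:, gt_labels]                      # (N, M): higher prob -> lower cost
    cost_l1 = torch.cdist(pred_boxes, gt_boxes, p=1)    # (N, M) L1 box distance
    cost = w_cls * cost_cls + w_l1 * cost_l1
    pred_idx, gt_idx = linear_sum_assignment(cost.detach().numpy())
    return pred_idx, gt_idx
```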

The table below compares some of the algorithms above, using mAP on the MSCOCO dataset for accuracy and FPS for speed. Because the network designs involve many different choices (such as input sizes and backbone networks) and each algorithm was implemented on different hardware platforms, the accuracy and speed numbers are not fully comparable; they are given only as a rough reference.

[Table: rough comparison of detection algorithms (mAP on MSCOCO vs. FPS)]

02 Target Tracking

In autonomous driving applications, the input is video, and there are many targets of interest, such as vehicles, pedestrians, and bicycles, so this is a typical multi-object tracking (MOT) task. The most popular framework for MOT is tracking-by-detection, which proceeds as follows:

① An object detector outputs the object boxes on each single-frame image;

② Extract the features of each detected target, usually including visual features and motion features;

③ Calculate the similarity between target detections from adjacent frames based on the features to determine the probability that they are from the same target;

④ Match the target detections of adjacent frames and assign the same ID to detections of the same target (a minimal sketch of steps ③ and ④ follows this list).
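A minimal sketch of steps ③ and ④, assuming the similarity is plain IoU between boxes of adjacent frames and the assignment is solved with the Hungarian algorithm; real trackers combine this with motion and appearance cues.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def iou_matrix(boxes_a, boxes_b):
    """Pairwise IoU for two sets of (x1, y1, x2, y2) boxes, shape (Na, Nb)."""
    a = boxes_a[:, None, :]                      # (Na, 1, 4)
    b = boxes_b[None, :, :]                      # (1, Nb, 4)
    ix1, iy1 = np.maximum(a[..., 0], b[..., 0]), np.maximum(a[..., 1], b[..., 1])
    ix2, iy2 = np.minimum(a[..., 2], b[..., 2]), np.minimum(a[..., 3], b[..., 3])
    inter = np.clip(ix2 - ix1, 0, None) * np.clip(iy2 - iy1, 0, None)
    area_a = (a[..., 2] - a[..., 0]) * (a[..., 3] - a[..., 1])
    area_b = (b[..., 2] - b[..., 0]) * (b[..., 3] - b[..., 1])
    return inter / (area_a + area_b - inter + 1e-9)

def associate(prev_boxes, prev_ids, curr_boxes, iou_thresh=0.3, next_id=0):
    """Assign track IDs to current detections by matching them to the previous frame."""
    ious = iou_matrix(prev_boxes, curr_boxes)
    rows, cols = linear_sum_assignment(-ious)    # maximize the total IoU
    curr_ids = [-1] * len(curr_boxes)
    for r, c in zip(rows, cols):
        if ious[r, c] >= iou_thresh:             # reject weak matches
            curr_ids[c] = prev_ids[r]
    for c in range(len(curr_boxes)):             # unmatched detections start new tracks
        if curr_ids[c] == -1:
            curr_ids[c] = next_id
            next_id += 1
    return curr_ids, next_id
```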

Deep learning plays a role in all four steps above, but mainly in the first two. In step ①, it is chiefly used to provide high-quality object detectors, so a relatively accurate method is usually chosen. SORT uses Faster R-CNN as its detector and combines a Kalman filter with the Hungarian algorithm, greatly improving the speed of multi-object tracking while reaching SOTA accuracy at the time; it is also widely used in practice. In step ②, deep learning mainly means using a CNN to extract the visual features of objects. The main feature of DeepSORT is that it adds appearance information, borrowing a ReID module to extract deep features and thereby reducing the number of ID switches. Its overall flow chart is shown below:

[Figure: DeepSORT overall flow chart]
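To illustrate the motion-feature side mentioned above, here is a bare constant-velocity Kalman predict/update on a box center, written in plain NumPy; SORT's actual state also includes the box scale and aspect ratio, so this should be read as a simplified assumption rather than the original formulation.

```python
import numpy as np

# State: [x, y, vx, vy]; measurement: observed box center [x, y].
F = np.array([[1, 0, 1, 0],     # constant-velocity transition (dt = 1 frame)
              [0, 1, 0, 1],
              [0, 0, 1, 0],
              [0, 0, 0, 1]], dtype=float)
H = np.array([[1, 0, 0, 0],     # we only observe the position
              [0, 1, 0, 0]], dtype=float)
Q = np.eye(4) * 1e-2            # process noise (assumed)
R = np.eye(2) * 1e-1            # measurement noise (assumed)

def predict(x, P):
    return F @ x, F @ P @ F.T + Q

def update(x, P, z):
    y = z - H @ x                                   # innovation
    S = H @ P @ H.T + R
    K = P @ H.T @ np.linalg.inv(S)                  # Kalman gain
    x = x + K @ y
    P = (np.eye(4) - K @ H) @ P
    return x, P

# One step: predict where the tracked center goes, then correct with the new detection.
x, P = np.array([100., 50., 2., 0.]), np.eye(4)
x, P = predict(x, P)
x, P = update(x, P, z=np.array([103., 50.5]))
```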

In addition, there is a framework for simultaneous detection and tracking. A representative is CenterTrack, which derives from the single-stage anchor-free detector CenterNet introduced earlier. Compared with CenterNet, CenterTrack takes the previous frame's RGB image and object-center heatmap as additional inputs and adds an offset branch for association between consecutive frames. Compared with multi-stage tracking-by-detection, CenterTrack performs detection and matching with a single network, which improves the speed of MOT.
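A hedged sketch of offset-based association in the spirit of CenterTrack: each current-frame center is displaced by its predicted offset to estimate where it was in the previous frame and is then greedily matched to the closest unmatched previous-frame center; the function names and distance threshold are illustrative.

```python
import numpy as np

def associate_by_offset(curr_centers, offsets, prev_centers, prev_ids, max_dist=30.0):
    """curr_centers, offsets: (N, 2); prev_centers: (M, 2); prev_ids: list of M track IDs."""
    displaced = curr_centers - offsets            # estimated positions in the previous frame
    used = set()
    curr_ids = [-1] * len(curr_centers)
    for i, p in enumerate(displaced):
        if len(prev_centers) == 0:
            break
        d = np.linalg.norm(prev_centers - p, axis=1)
        j = int(np.argmin(d))
        if d[j] < max_dist and j not in used:     # greedy nearest-center match
            curr_ids[i] = prev_ids[j]
            used.add(j)
    return curr_ids                               # -1 means a new, unmatched object
```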

03 Semantic Segmentation

Semantic segmentation is used in the lane-line detection and drivable-area detection tasks of autonomous driving. Representative algorithms include FCN, U-Net, and the DeepLab series. DeepLab performs multi-scale processing of the input with dilated (atrous) convolution and the ASPP (Atrous Spatial Pyramid Pooling) structure; its earlier versions also apply the conditional random field (CRF) common in traditional semantic segmentation methods to refine the segmentation results. The figure below shows the DeepLab v3+ network structure.

[Figure: DeepLab v3+ network structure]
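The sketch below is a simplified ASPP module in PyTorch: parallel 3x3 dilated convolutions with different rates plus an image-level pooling branch, concatenated and fused by a 1x1 convolution; the channel counts and dilation rates are assumptions for illustration, and the batch-norm and activation layers of the real model are omitted.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleASPP(nn.Module):
    """Simplified Atrous Spatial Pyramid Pooling: multi-rate dilated convs in parallel."""
    def __init__(self, in_ch=2048, out_ch=256, rates=(6, 12, 18)):
        super().__init__()
        self.branch1x1 = nn.Conv2d(in_ch, out_ch, 1)
        self.branches = nn.ModuleList(
            nn.Conv2d(in_ch, out_ch, 3, padding=r, dilation=r) for r in rates)
        self.image_pool = nn.Sequential(nn.AdaptiveAvgPool2d(1),
                                        nn.Conv2d(in_ch, out_ch, 1))
        self.project = nn.Conv2d(out_ch * (len(rates) + 2), out_ch, 1)

    def forward(self, x):
        h, w = x.shape[-2:]
        feats = [self.branch1x1(x)] + [b(x) for b in self.branches]
        # Image-level feature: global pooling, then upsample back to the feature-map size.
        pooled = F.interpolate(self.image_pool(x), size=(h, w), mode="bilinear",
                               align_corners=False)
        return self.project(torch.cat(feats + [pooled], dim=1))

# Example: a 2048-channel backbone feature map at 1/16 resolution.
y = SimpleASPP()(torch.randn(1, 2048, 32, 32))   # -> (1, 256, 32, 32)
```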

In recent years, the STDC algorithm has adopted a structure similar to FCN, removing U-Net's complex decoder. During downsampling, however, it uses the ARM module to continuously fuse information from feature maps of different levels, avoiding FCN's shortcoming of considering only isolated pixel relationships. STDC achieves a good balance between speed and accuracy and can meet the real-time requirements of autonomous driving systems. Its pipeline is shown in the figure below.

[Figure: STDC algorithm flow]

Autonomous Driving Visual Perception Algorithm (Part 2)

In the previous section, we introduced 2D visual perception algorithms. In this section, we introduce 3D scene perception, which is essential for autonomous driving: 2D perception cannot provide depth information or the three-dimensional size of a target, yet this information is key for the autonomous driving system to judge its surroundings correctly. The most direct way to obtain 3D information is LiDAR, but LiDAR has its own drawbacks, such as high cost, the difficulty of mass-producing automotive-grade devices, and sensitivity to weather. Camera-only 3D perception therefore remains a meaningful and valuable research direction. Below we survey some 3D perception algorithms based on monocular and binocular cameras.

01 Monocular 3D perception
