A brief analysis of the mainstream visual perception framework design in the autonomous driving industry

Publisher: DelightfulGaze · Last updated: 2024-03-08 · Source: elecfans

This article gives a brief overview of visual perception in the autonomous driving industry, moving from sensor comparison through data collection and annotation to the perception algorithms themselves. It discusses the difficulties and solutions for each module and closes with the mainstream framework design of the perception module.


A visual perception system uses cameras as its sensor input and, after a series of computations, accurately perceives the environment around the vehicle. Its purpose is to supply the fusion module with accurate, rich information, including the category, distance, speed, and orientation of each detected object, as well as abstract semantic information. Road-traffic perception therefore covers three main tasks:


Dynamic object detection (vehicles, pedestrians, and non-motorized vehicles)

Static object recognition (traffic signs and traffic lights)

Drivable-area segmentation (road area and lane lines)

If all three task types are handled by the forward pass of a single deep neural network, the system's detection speed improves and the parameter count shrinks, while detection and segmentation accuracy can still be raised by deepening the backbone. As the figure below shows, the visual perception task decomposes into object detection, image segmentation, object measurement, image classification, and so on.

[Figure: decomposition of the visual perception task into detection, segmentation, measurement, and classification]
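As a minimal illustration of this shared-backbone, multi-head idea, the PyTorch sketch below runs detection, segmentation, and classification heads off a single forward pass. All channel counts, head shapes, and class counts are invented for illustration, not taken from any production network:

```python
import torch
import torch.nn as nn

class MultiTaskPerceptionNet(nn.Module):
    """Toy multi-task net: one shared backbone, several task heads."""
    def __init__(self, num_classes=10, num_seg_classes=3):
        super().__init__()
        # Shared backbone: features are computed once and reused by every head.
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
        )
        # Detection head: one (x, y, w, h, objectness) vector per feature cell.
        self.det_head = nn.Conv2d(64, 5, 1)
        # Segmentation head: per-pixel class logits, upsampled to input size.
        self.seg_head = nn.Sequential(
            nn.Conv2d(64, num_seg_classes, 1),
            nn.Upsample(scale_factor=4, mode="bilinear", align_corners=False),
        )
        # Classification head: global pooling plus a linear layer.
        self.cls_head = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(64, num_classes)
        )

    def forward(self, x):
        feats = self.backbone(x)          # one backbone pass
        return {
            "detection": self.det_head(feats),
            "segmentation": self.seg_head(feats),
            "classification": self.cls_head(feats),
        }

net = MultiTaskPerceptionNet()
out = net(torch.randn(1, 3, 256, 256))    # all tasks in a single forward pass
print({k: v.shape for k, v in out.items()})
```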

▍Sensor components

Front-view camera

The field of view is narrow: a camera module of roughly 52° is typically mounted at the center of the front windshield. It mainly senses the scene far ahead of the vehicle, with a perception range generally within 120 meters.

Panoramic wide-angle camera

The field of view is larger: typically six camera modules of roughly 100° are mounted around the vehicle to sense the full 360° surroundings (an installation scheme similar to Tesla's). Wide-angle cameras exhibit a certain amount of distortion.

Surround view fisheye camera

The surround-view fisheye camera has a very wide field of view, over 180°, and perceives close range well. It is usually used in parking scenarios such as APA and AVP. Four cameras are installed, below the left and right rearview mirrors and below the front and rear license plates, to support image stitching, parking-space detection, visualization, and other functions.

▍Camera calibration

Calibration quality directly affects the accuracy of target ranging. Calibration consists of intrinsic calibration and extrinsic calibration.

Intrinsic calibration corrects image distortion; extrinsic calibration unifies the coordinate systems of multiple sensors, moving each sensor's coordinate origin to the center of the vehicle's rear axle.
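Concretely, with the extrinsic rotation R and translation t from calibration, a point in the camera frame maps into the vehicle (rear-axle) frame as p_vehicle = R · p_camera + t. A minimal NumPy sketch, with made-up extrinsic values:

```python
import numpy as np

# Hypothetical extrinsics: camera frame -> vehicle (rear-axle) frame.
R = np.eye(3)                      # rotation (identity here for brevity)
t = np.array([1.8, 0.0, 1.3])      # camera 1.8 m ahead of and 1.3 m above the rear axle

def camera_to_vehicle(p_cam: np.ndarray) -> np.ndarray:
    """Map a 3D point from the camera frame to the rear-axle frame."""
    return R @ p_cam + t

print(camera_to_vehicle(np.array([0.0, 0.0, 10.0])))  # a point 10 m along the camera axis
```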

The best-known calibration method is Zhang Zhengyou's checkerboard method. In the laboratory, a checkerboard target is usually built to calibrate the camera, as shown below:

[Figure: checkerboard target for camera calibration]
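A sketch of this procedure using OpenCV's implementation of Zhang's method; the board size, square size, and image folder below are placeholders, not values from the article:

```python
import glob
import cv2
import numpy as np

BOARD = (9, 6)      # inner corners per row/column (placeholder)
SQUARE = 0.025      # square edge length in meters (placeholder)

# 3D corner positions on the board plane (z = 0).
objp = np.zeros((BOARD[0] * BOARD[1], 3), np.float32)
objp[:, :2] = np.mgrid[0:BOARD[0], 0:BOARD[1]].T.reshape(-1, 2) * SQUARE

obj_points, img_points = [], []
for path in glob.glob("calib_images/*.jpg"):   # hypothetical image folder
    gray = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
    found, corners = cv2.findChessboardCorners(gray, BOARD)
    if found:
        obj_points.append(objp)
        img_points.append(corners)

# Recover the intrinsic matrix K and distortion coefficients.
ret, K, dist, rvecs, tvecs = cv2.calibrateCamera(
    obj_points, img_points, gray.shape[::-1], None, None)
print("intrinsics:\n", K, "\ndistortion:", dist.ravel())
```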

Factory calibration

For mass-produced autonomous vehicles, however, calibrating each car by hand with a calibration plate is impractical. Instead, a dedicated calibration site is built, and each vehicle is calibrated as it leaves the factory.

Online calibration

In addition, because the camera pose drifts after the vehicle has run for some time or driven over bumps, the perception system also includes an online calibration model, which typically uses detected cues such as vanishing points or lane lines to update the pitch angle in real time.
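A minimal sketch of the vanishing-point idea: given the camera's focal length and principal point, the image row of the lane lines' vanishing point yields the current pitch angle. All intrinsic values below are invented for illustration:

```python
import math

def pitch_from_vanishing_point(v_row: float, c_y: float, f_y: float) -> float:
    """Estimate camera pitch (radians) from the vanishing-point image row.

    Under the pinhole model, a vanishing point above the principal point
    means the camera is pitched down, and vice versa.
    """
    return math.atan2(v_row - c_y, f_y)

# Example with made-up intrinsics: f_y = 1000 px, c_y = 540 px.
pitch = pitch_from_vanishing_point(v_row=525.0, c_y=540.0, f_y=1000.0)
print(math.degrees(pitch))   # about -0.86 degrees
```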

▍Data Annotation

Natural road scenes contain all kinds of unexpected situations, so a large amount of real-vehicle data must be collected for training. High-quality data annotation is therefore crucial: everything the perception system must detect has to be annotated. Annotation takes two forms, object-level and pixel-level:

Object-level annotation is shown below:

[Figure: object-level annotation example]

Pixel-level annotation is shown below:

[Figure: pixel-level annotation example]

Because the detection and segmentation tasks in a perception system are usually implemented with deep learning, a data-driven technology, they require large amounts of data and labels for each iteration. To improve annotation efficiency, a semi-automatic scheme can be used: a neural network embedded in the annotation tool produces an initial annotation, which annotators then correct manually; after a period of time, the new data and labels are loaded back in, and the cycle repeats.
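A schematic version of that loop is sketched below. Every function here is a stand-in for real tooling, not the API of any actual annotation tool:

```python
# Schematic semi-automatic annotation loop; all functions are hypothetical stubs.

def model_pre_annotate(model, frames):
    """Run the embedded network to produce initial labels."""
    return [model(frame) for frame in frames]

def human_correct(initial_labels):
    """Annotators fix the model's mistakes in the labeling tool."""
    return initial_labels  # placeholder: corrections happen in the UI

def retrain(model, frames, labels):
    """Fine-tune the network on the newly corrected data."""
    return model  # placeholder training step

def annotation_cycle(model, unlabeled_batches, rounds=3):
    dataset = []
    for _ in range(rounds):
        frames = unlabeled_batches.pop(0)
        labels = human_correct(model_pre_annotate(model, frames))
        dataset.extend(zip(frames, labels))
        model = retrain(model, frames, labels)  # a better model pre-annotates the next round
    return model, dataset
```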

▍Functional division

Visual perception divides into several functional modules: object detection and tracking, object measurement, drivable area, lane-line detection, static object detection, and so on.

Object Detection and Tracking

This module identifies dynamic objects such as vehicles (cars, trucks, electric bikes, bicycles) and pedestrians, outputs each detection's category and 3D information, and matches detections across frames to keep the output boxes stable and to predict each object's trajectory. Regressing the full 3D box directly with a neural network is not very accurate, so a vehicle is usually decomposed into front, body, rear, and tire parts, which are then assembled into a 3D box.

Difficulties in object detection: occlusion is frequent and the accuracy of the heading angle is hard to guarantee; the wide variety of pedestrians and vehicles invites false detections; and multi-object tracking suffers from ID switches.

Purely visual detection degrades in bad weather and tends to miss objects at night under dim lighting. Fusing in LiDAR results greatly improves target recall.

Object detection scheme:

Detecting multiple targets, especially vehicles, requires a 3D bounding box. The advantage of 3D is that it provides the vehicle's heading angle and height. Adding a multi-object tracking algorithm assigns a persistent ID to each vehicle and pedestrian.

Deep learning is a probabilistic method: even with strong feature extraction it cannot cover every dynamic-object appearance. In engineering practice, geometric constraints derived from real scenes can be added (e.g., the length-to-width ratio of cars and trucks falls within a fixed range, the distance to a vehicle cannot change suddenly between frames, pedestrian height is bounded, and so on).

The benefit of such geometric constraints is a higher detection rate and a lower false-detection rate; for example, a car should not be misclassified as a truck. A practical module trains a 3D detection model (or a 2.5D model) and pairs it with back-end multi-object tracking and a distance-measurement method based on monocular geometry.
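A toy example of such constraints applied as post-filters on tracked detections; all field names and thresholds below are illustrative, not production values:

```python
# Illustrative plausibility checks on 3D detections; thresholds are made up.
ASPECT_RANGE = {"car": (1.8, 3.0), "truck": (2.5, 5.0)}  # length / width
MAX_PED_HEIGHT_M = 2.3
MAX_RANGE_JUMP_M = 3.0   # per frame, for an already-tracked target

def plausible(det: dict, prev_range_m: float | None = None) -> bool:
    """Reject detections that violate simple geometric priors."""
    if det["cls"] == "pedestrian" and det["height_m"] > MAX_PED_HEIGHT_M:
        return False
    if det["cls"] in ASPECT_RANGE:
        lo, hi = ASPECT_RANGE[det["cls"]]
        if not lo <= det["length_m"] / det["width_m"] <= hi:
            return False
    # A tracked target's distance cannot change abruptly between frames.
    if prev_range_m is not None and abs(det["range_m"] - prev_range_m) > MAX_RANGE_JUMP_M:
        return False
    return True

det = {"cls": "car", "length_m": 4.5, "width_m": 1.8, "range_m": 30.0}
print(plausible(det, prev_range_m=29.2))   # True
```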

Object Measurement

Object measurement covers the target's longitudinal and lateral distance and its longitudinal and lateral speed. Building on the output of detection and tracking, the distance and speed of dynamic obstacles such as vehicles are computed from the 2D image with the help of priors such as the ground plane, or the object's position in the world coordinate system is regressed directly by the network.

Difficulties of monocular measurement:

How can a monocular system, which lacks depth information, measure an object's distance in a given direction? Several questions must be answered first:

What are the requirements?

What priors are available?

What maps are available?

What accuracy is required?

What resources can be provided?

And if we lean heavily on pattern recognition to make up for the missing depth, is it robust enough to meet the strict detection-accuracy requirements of a mass-produced product?

Monocular measurement solutions:

The first approach establishes the geometric relationship between an object's world coordinates and its image pixel coordinates through an optical geometric model (the pinhole camera model). Combined with the camera's intrinsic and extrinsic calibration, this yields the distance to the vehicle or obstacle ahead.
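Concretely, for a pixel assumed to lie on a flat ground plane (e.g., where the target's box meets the road), the pinhole model gives the longitudinal distance Z = f_y · h / (v − c_y), where h is the camera mounting height and v the pixel row; the lateral offset follows as X = (u − c_x) · Z / f_x. A minimal sketch with invented intrinsics and a zero-pitch camera:

```python
# Ground-plane ranging under the pinhole model; intrinsics are invented.
F_X, F_Y = 1000.0, 1000.0      # focal lengths in pixels
C_X, C_Y = 960.0, 540.0        # principal point
CAM_HEIGHT_M = 1.5             # camera mounting height above the road

def ground_point_to_range(u: float, v: float):
    """Range (Z, X) in meters to a pixel assumed to lie on the flat road.

    Valid only for pixels below the horizon (v > C_Y) on a level road
    with a correctly calibrated, zero-pitch camera.
    """
    z = F_Y * CAM_HEIGHT_M / (v - C_Y)   # longitudinal distance
    x = (u - C_X) * z / F_X              # lateral offset
    return z, x

# Bottom edge of a detected vehicle at pixel (1100, 600):
print(ground_point_to_range(1100.0, 600.0))  # -> (25.0, 3.5)
```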

The second approach regresses, from collected image samples, a direct functional relationship between image pixel coordinates and vehicle distance. This method lacks theoretical grounding; it is pure data fitting, so it is limited by the extraction accuracy of the fitted parameters and is comparatively less robust.

Drivable Area

Drivable-area segmentation delineates vehicles, ordinary road edges, curb edges, boundaries with no visible obstacle, and unknown boundaries, and outputs the safe area through which the vehicle can pass.

Difficulties in road segmentation:

In complex scenes, boundary shapes are varied and hard to generalize over. Unlike detection tasks with clear categories (such as vehicles, pedestrians, and traffic lights), the drivable space must delimit the vehicle's safe driving area, and everything that blocks forward motion must be segmented: uncommon water-filled barriers, cones, potholes, unpaved roads, green belts, tile-shaped road boundaries, crossroads, T-junctions, and so on.
