A brief analysis of mainstream visual perception framework design in the autonomous driving industry


Calibration parameter correction: when the vehicle accelerates or decelerates, drives over bumps, or goes up or down a ramp, the camera's pitch angle changes. The original calibration parameters are then no longer accurate; after projection to the world coordinate system, large ranging errors appear, and the boundary of the passable space shrinks or expands.
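To see why a small pitch change matters, here is a minimal flat-ground ranging sketch (Python; the intrinsics and camera height are assumed example values, not from the article). Distance to a road point is recovered from its image row, so a one-degree pitch error moves a 35 m estimate by more than 10 m:

```python
import math

def ground_distance(v, fy, cy, cam_height, pitch_rad):
    """Flat-ground ranging for a forward-facing pinhole camera:
    distance to the point where image row v meets the road plane.
    v: pixel row; fy, cy: intrinsics; cam_height: meters above the
    road; pitch_rad: camera pitch (positive = tilted down)."""
    ray_angle = pitch_rad + math.atan2(v - cy, fy)  # angle below horizontal
    if ray_angle <= 0:
        return float("inf")  # ray at or above the horizon never hits the road
    return cam_height / math.tan(ray_angle)

# Example with assumed values: a 1-degree uncorrected pitch error
# turns a 35.0 m estimate into roughly 24.4 m.
base = ground_distance(v=540, fy=1000.0, cy=500.0, cam_height=1.4, pitch_rad=0.0)
off = ground_distance(v=540, fy=1000.0, cy=500.0, cam_height=1.4,
                      pitch_rad=math.radians(1.0))
print(f"{base:.1f} m vs {off:.1f} m")
```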

Boundary point selection and post-processing: the passable space is defined chiefly by its edges, so burrs and jitter along the edges must be filtered out to keep the boundary smooth. Boundary points on the side of an obstacle are easily projected incorrectly into the world coordinate system, which can cause the passable lane next to the vehicle ahead to be marked as impassable.

Passable space solution:

First, calibrate the camera (online calibration is best, though its accuracy may be compromised). If real-time online calibration is not feasible, read the vehicle's IMU and use its pitch angle to adaptively correct the calibration parameters.
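A minimal sketch of this IMU fallback, assuming the IMU publishes a pitch angle in radians; the filter gain and class name are illustrative, not from any specific stack. The pitch it returns would replace the static calibration pitch in a projection such as the ranging sketch above:

```python
class PitchCompensator:
    """Tracks a low-pass-filtered pitch deviation from the IMU and
    applies it on top of the static calibration pitch."""

    def __init__(self, calib_pitch_rad, alpha=0.2):
        self.calib = calib_pitch_rad
        self.alpha = alpha       # filter gain, assumed value
        self.delta = 0.0         # filtered deviation from calibration pitch

    def update(self, imu_pitch_rad):
        # Low-pass filtering suppresses vibration noise from bumpy roads
        # while still following sustained pitch changes (ramps, braking).
        self.delta = ((1 - self.alpha) * self.delta
                      + self.alpha * (imu_pitch_rad - self.calib))
        return self.calib + self.delta  # pitch to use for projection
```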

Second, select a suitable lightweight semantic segmentation network, label the categories that need to be segmented, and cover as wide a range of scenes as possible; extract boundary points in polar coordinates and smooth the edge points with a filtering algorithm.
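A sketch of the polar-coordinate boundary extraction and smoothing step, under the assumption that the segmentation output has already been projected to a binary bird's-eye-view free-space mask (1 = passable); the ray count and median-filter width are illustrative:

```python
import numpy as np

def boundary_points_polar(mask, origin, num_rays=180, max_r=400):
    """Cast rays from `origin` (e.g., bottom-center of a bird's-eye-view
    free-space mask); the first non-free pixel along each ray gives the
    passable-space boundary radius at that angle."""
    h, w = mask.shape
    ox, oy = origin
    radii = np.full(num_rays, float(max_r))
    angles = np.linspace(0.0, np.pi, num_rays)  # 180-degree forward fan
    for i, a in enumerate(angles):
        for r in range(1, max_r):
            x = int(ox + r * np.cos(a))
            y = int(oy - r * np.sin(a))  # image y axis points down
            if not (0 <= x < w and 0 <= y < h) or mask[y, x] == 0:
                radii[i] = r
                break
    # Median-filter the radii across neighboring rays to remove the
    # burrs and jitter mentioned above.
    k = 2
    smoothed = np.array([np.median(radii[max(0, i - k):i + k + 1])
                         for i in range(num_rays)])
    return angles, smoothed
```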

Lane Detection

Lane line detection covers the various lane line types: single-side and double-side lines, solid lines, dashed lines, double lines, line colors (white/yellow/blue), and special lane lines (merge lines, deceleration markings, etc.).

Difficulties in lane line detection:

There are many types of lane lines, and lines on irregular roads are hard to detect. Standing water, invalidated markings, road repairs, and shadows easily cause false or missed detections.

When driving uphill or downhill, on bumpy roads, or when the vehicle starts or stops, the fitted lane lines tend to come out trapezoidal or inverted-trapezoidal.

Lane lines on curves, at long range, and in roundabouts are difficult to fit, and the detection results tend to flicker.

Lane detection solution:

First, the traditional image-processing approach: correct the camera distortion, apply a perspective transform to each frame to convert the camera image to a bird's-eye view, extract lane line feature points with feature operators or color spaces, and fit the lane curves using histograms and sliding windows. The biggest drawback of traditional algorithms is their poor adaptability to new scenes.
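A condensed sketch of this traditional pipeline in Python/OpenCV, assuming the camera matrix, distortion coefficients, and perspective-transform points come from offline calibration; for brevity it uses one fixed window around each histogram base instead of full sliding windows:

```python
import cv2
import numpy as np

def fit_lanes_traditional(img, K, dist, src_pts, dst_pts):
    """Classic pipeline: undistort -> bird's-eye view -> gradient
    threshold -> histogram base search -> quadratic fit per line.
    K, dist, src_pts, dst_pts are assumed calibration inputs."""
    und = cv2.undistort(img, K, dist)
    M = cv2.getPerspectiveTransform(src_pts, dst_pts)
    bev = cv2.warpPerspective(und, M, (img.shape[1], img.shape[0]))
    gray = cv2.cvtColor(bev, cv2.COLOR_BGR2GRAY)
    # Gradient threshold picks up painted edges; a color-space threshold
    # (e.g., the HLS S channel) could be OR-ed in for yellow lines.
    sobelx = np.absolute(cv2.Sobel(gray, cv2.CV_64F, 1, 0))
    binary = (sobelx > 40).astype(np.uint8)
    # A histogram over the lower half locates each lane line's base column.
    hist = binary[binary.shape[0] // 2:].sum(axis=0)
    mid = hist.shape[0] // 2
    bases = (hist[:mid].argmax(), mid + hist[mid:].argmax())
    fits = []
    for base in bases:
        lo = max(0, base - 80)
        ys, xs = np.nonzero(binary[:, lo:base + 80])
        if len(xs) > 50:  # only fit when enough feature points survive
            fits.append(np.polyfit(ys, xs + lo, 2))
    return fits  # polynomial coefficients x = f(y) in BEV pixels
```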

Second, detecting lane lines with neural networks is similar to passable space detection: select a suitable lightweight network and label the data. The difficulty lies in fitting the lane lines (cubic or quartic polynomials), so in post-processing, vehicle information (speed, acceleration, steering) and sensor data can be combined for dead reckoning to make the fitted lane lines as stable as possible.
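A sketch of the dead-reckoning idea in post-processing, assuming a cubic fit x = p(y) in the vehicle frame (y forward, x lateral) and access to ego speed and yaw rate; the lookahead range and blend gain are illustrative:

```python
import numpy as np

def propagate_lane(prev_coeffs, speed, yaw_rate, dt):
    """Predict where last frame's lane polynomial x = p(y) should lie
    now, given ego motion (a crude small-angle dead-reckoning step)."""
    ds, dpsi = speed * dt, yaw_rate * dt
    y = np.linspace(0.0, 60.0, 30)  # lookahead samples in meters, assumed
    # Shift the curve toward the vehicle by ds and rotate by -dpsi.
    x = np.polyval(prev_coeffs, y + ds) - dpsi * y
    return np.polyfit(y, x, 3)

def smooth_lane(prev_coeffs, new_coeffs, speed, yaw_rate, dt, gain=0.4):
    """Blend the fresh detection with the motion-propagated previous fit
    so the polynomial does not flicker from frame to frame."""
    if prev_coeffs is None:
        return np.asarray(new_coeffs)
    pred = propagate_lane(prev_coeffs, speed, yaw_rate, dt)
    return gain * np.asarray(new_coeffs) + (1 - gain) * pred
```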

Static object detection

Static object detection covers the detection and recognition of static targets such as traffic lights and traffic signs.

Difficulties in static object detection:

Traffic lights and traffic signs are small objects that occupy very few pixels in the image, and they are even harder to identify at distant intersections. Under strong light, even the human eye struggles to distinguish them, yet a car stopped at the zebra crossing must correctly identify the traffic light before making its next decision.

There are many types of traffic signs, and the collected data is prone to class imbalance.

Traffic lights are strongly affected by illumination: red and yellow are hard to tell apart under different lighting conditions, and at night red lights resemble street lights and shop lights, which easily leads to false detections.

Static object detection solution:

Recognizing traffic lights through perception alone performs only moderately well and adapts poorly. Where conditions permit (for example, a fixed, limited park scenario), V2X, high-precision maps, and other sources can be used as redundant backups, prioritized as V2X > high-precision map > perception. If the GPS signal is weak, fall back on the perception results; in most cases, however, V2X alone is sufficient to cover many scenarios.
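A sketch of this source-priority arbitration; every name here is an assumption for illustration, not Apollo's or any real V2X API. V2X can deliver the light phase directly, while the HD map contributes the light's position, which narrows the camera's region of interest when perception must take over:

```python
def resolve_traffic_light(v2x_phase, map_light_roi, detect_fn):
    """Illustrative arbitration with priority V2X > HD map > perception.
    v2x_phase: phase string from V2X, or None when unavailable.
    map_light_roi: image region from the HD map, or None.
    detect_fn: camera-based detector callable, detect_fn(roi=...)."""
    if v2x_phase is not None:
        return "v2x", v2x_phase          # highest-priority source
    phase = detect_fn(roi=map_light_roi)  # perception fallback, map-guided if possible
    source = "hd_map+perception" if map_light_roi is not None else "perception"
    return source, phase
```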

▍Common problems

Although the perception subtasks are implemented independently of each other, they have upstream and downstream dependencies and share common algorithmic issues:

Source of truth

Definition, calibration, analysis, and comparison are not just a matter of looking at the test-result plots or the frame rate. The accuracy of the ranging results under different conditions (daytime, rain, occlusion, etc.) must be verified against lidar or RTK data as the ground truth.
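For instance, a small evaluation helper of the kind implied here, assuming detections have already been matched to lidar/RTK ground-truth distances and tagged by condition; the statistics chosen are illustrative:

```python
import numpy as np

def ranging_error_report(est, truth, condition_tags):
    """Compare estimated distances against lidar/RTK ground truth and
    report relative-error statistics per scene condition.
    est, truth: matched distance arrays in meters;
    condition_tags: one label per sample, e.g. 'day', 'rain', 'occluded'."""
    est, truth = np.asarray(est, float), np.asarray(truth, float)
    rel_err = np.abs(est - truth) / np.maximum(truth, 1e-6)
    report = {}
    for tag in set(condition_tags):
        idx = [i for i, t in enumerate(condition_tags) if t == tag]
        report[tag] = {
            "mean_rel_err": float(rel_err[idx].mean()),
            "p95_rel_err": float(np.percentile(rel_err[idx], 95)),
        }
    return report
```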

Resource consumption

Multiple coexisting networks and multiple shared cameras consume CPU and GPU resources, so how should these networks be allocated? The forward passes of several networks may share some convolutional layers; can those be reused? Threads and processes can be introduced to handle each module and coordinate the functional blocks more efficiently. On the multi-camera side, work on encoding and decoding the camera streams so that multi-view input is achieved without losing frame rate.
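A toy PyTorch sketch of the layer-sharing idea: one backbone computed once per frame feeds several task heads, instead of each task running its own full network. The architecture is illustrative, not Apollo's actual model:

```python
import torch
import torch.nn as nn

class SharedBackboneMultiHead(nn.Module):
    """One shared backbone, several task heads: the convolutional
    features are computed once and reused by every branch."""

    def __init__(self):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
        )
        self.lane_head = nn.Conv2d(64, 2, 1)       # lane segmentation logits
        self.freespace_head = nn.Conv2d(64, 2, 1)  # passable-space logits
        self.light_head = nn.Conv2d(64, 4, 1)      # traffic-light classes

    def forward(self, x):
        feat = self.backbone(x)  # shared computation, done once per frame
        return (self.lane_head(feat),
                self.freespace_head(feat),
                self.light_head(feat))

# Example: one forward pass serves all three tasks.
lanes, freespace, lights = SharedBackboneMultiHead()(torch.zeros(1, 3, 256, 512))
```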

Multi-view fusion

A car is generally equipped with four cameras (front, rear, left, and right). As an object moves from behind the car to in front of it, it is seen first by the rear camera, then by a side camera, and finally by the front camera. Throughout this process the object's ID should remain unchanged (the same object must not change identity because a different camera observes it), and the reported distance should not jump sharply when switching between cameras.
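A minimal sketch of ID handover in the world frame, assuming all cameras' detections have already been projected into one common coordinate system; the nearest-neighbor gate is illustrative, and a production tracker would also use appearance and motion models:

```python
import numpy as np

def handover_ids(active_tracks, new_dets, gate_m=2.0):
    """Keep object IDs stable across camera switches: match each new
    world-frame detection to the nearest active track and reuse its ID
    when within the gate; otherwise start a new track.
    active_tracks: {track_id: (x, y)}; new_dets: list of (x, y)."""
    assignments = {}
    next_id = max(active_tracks, default=0) + 1
    unused = dict(active_tracks)  # each track may be claimed only once
    for det in new_dets:
        if unused:
            tid = min(unused, key=lambda t: np.hypot(det[0] - unused[t][0],
                                                     det[1] - unused[t][1]))
            if np.hypot(det[0] - unused[tid][0], det[1] - unused[tid][1]) < gate_m:
                assignments[tuple(det)] = tid
                del unused[tid]
                continue
        assignments[tuple(det)] = next_id  # unmatched: new identity
        next_id += 1
    return assignments
```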

Scenario Definition

For each perception module, the data set must be clearly partitioned, i.e., the scenes must be defined, so that algorithm verification is targeted. For example, dynamic object detection can be split into scenes where the ego vehicle is stationary versus moving; traffic light detection can be subdivided into specific scenes such as left-turn, straight-ahead, and U-turn traffic lights. Verify on both public and proprietary data sets.
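A small sketch of such scenario bucketing, assuming each labeled sample carries scene tags; the tag names are illustrative:

```python
def split_by_scenario(samples):
    """Group labeled samples by scene tag so each perception module can
    be verified on targeted subsets. `samples` is assumed to be a list
    of dicts with a 'tags' field, e.g. {'tags': ['night', 'u_turn_light']}."""
    buckets = {}
    for s in samples:
        for tag in s.get("tags", ["untagged"]):
            buckets.setdefault(tag, []).append(s)
    return buckets
```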

▍Module Architecture

Many researchers and small to mid-sized companies draw on the open-source perception frameworks Apollo and Autoware when developing perception systems, so here we introduce the modular composition of the Apollo perception system.

Camera input --> image preprocessing --> neural network --> multiple branches (traffic light recognition, lane line recognition, 2D object recognition lifted to 3D) --> post-processing --> output results (type, distance, speed, and heading of each detected object)

That is, camera data is input; detection, classification, segmentation, and other computations are performed on each frame; finally, multi-frame information is used to track multiple targets and output the results. The full perception flow chart is shown below:

[Figure: Apollo camera perception flow chart]
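A structural sketch of this flow, with each stage injected as a callable; all stage names are placeholders for real modules, not Apollo's actual APIs:

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict

@dataclass
class CameraPerception:
    """Skeleton of the pipeline above: preprocess -> shared network ->
    per-task branches -> post-processing/tracking."""
    preprocess: Callable[[Any], Any]             # resize, normalize, undistort
    network: Callable[[Any], Any]                # shared inference, once per frame
    branches: Dict[str, Callable[[Any], Any]]    # lights, lanes, 2D-to-3D objects
    postprocess: Callable[[Dict[str, Any]], Any] # tracking, output assembly

    def run(self, frame):
        feat = self.network(self.preprocess(frame))
        outputs = {name: head(feat) for name, head in self.branches.items()}
        return self.postprocess(outputs)  # type, distance, speed, heading
```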

The core of the pipeline above is still the neural network. Its accuracy, speed, and hardware resource utilization all have to be measured and traded off, and no link is easy to do well: object detection is the most prone to false or missed detections; lane detection struggles to fit cubic or quartic curves; small objects such as traffic lights are hard to detect (intersections are often more than 50 meters long); and the boundary points of the passable space carry demanding accuracy requirements.

