The system design and software development of next-generation smart cars involve many problems, including architecture design, function development, and vehicle control. At the root of these problems lies environmental perception capability.
Beyond the hardware performance of the sensors themselves, the software side raises its own issues: the algorithm model, the training of the neural network, the capacity to process sensor data, and so on.
Currently, the development of perception capability generally follows this pipeline: camera input --> image preprocessing --> neural network --> branch processing structures --> post-processing --> output results. The branch structures include traffic light recognition, lane line recognition, and the conversion of 2D object detections to 3D; the final outputs include the object type, its distance, and its speed along the detected direction, etc.
At present, the key to all perception problems is still the neural network algorithm. For the processing capability of the domain controller, the focus must be on calculation accuracy, real-time performance, and computing-power utilization; these are the premise for ensuring that objects are neither missed nor misdetected. In particular, because the sensing hardware delivers ultra-high-resolution images, the way single- or multi-camera inputs are processed deserves special attention. The difficulties, or core optimization directions, of this type of perception task mainly lie in the following:
① How to handle high-resolution input
② How to improve dense small target detection
③ How to solve the overlapping problem of multiple targets
④ How to use a small amount of training data to solve the problem of target diversity
⑤ How to use a monocular camera to accurately estimate the target position
Camera data calibration in perception
Monocular ranging establishes the geometric relationship between the world coordinates of the observed object and its image pixel coordinates through an optical geometric model (i.e., the pinhole imaging model). Combined with the calibration results for the camera's intrinsic and extrinsic parameters, the distance to the vehicle or obstacle ahead can then be obtained. Whether a monocular or a binocular camera is used, the intrinsic and extrinsic parameters must be calibrated before data detection. The calibration process computes the conversion from world coordinates to image coordinates through the following formula.
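The formula is the standard pinhole projection, written here from the quantities defined below (the principal point (c_x, c_y) is added for completeness):

$$
s \begin{bmatrix} x \\ y \\ 1 \end{bmatrix}
= \underbrace{\begin{bmatrix} f_x & 0 & c_x \\ 0 & f_y & c_y \\ 0 & 0 & 1 \end{bmatrix}}_{\text{intrinsics } K}
\underbrace{\begin{bmatrix} R & t \end{bmatrix}}_{\text{extrinsics}}
\begin{bmatrix} X_w \\ Y_w \\ Z_w \\ 1 \end{bmatrix}
$$

Here (X_w, Y_w, Z_w) are the world coordinates of a point, (x, y) its pixel coordinates, and s a projective scale factor.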
Camera intrinsic calibration is used to correct image distortion, and extrinsic calibration is used to unify the coordinate systems of multiple cameras, moving their respective coordinate origins to the center of the vehicle's rear axle. f_x and f_y represent the focal lengths of the camera, and x and y represent the target's position in the image coordinate system. It is not hard to see from the above formula that the camera calibration results strongly affect how an image position is mapped into the world coordinate system.
When the camera is actually installed on the vehicle, there are two calibration methods: production-line calibration and real-time calibration. Production-line calibration uses the grid information of a calibration board to calibrate the camera pose; typically Zhang Zhengyou's classic checkerboard model is used for corner-point calibration, or a dot-pattern board is used for online calibration. In addition, since the camera position may drift after a period of vehicle operation or after driving over bumpy roads, an online real-time calibration model is also set up: during actual driving, detection results such as the lane-line vanishing point or the lane lines themselves are used to update changes in the pitch angle in real time and thereby optimize the calibration parameters.
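A minimal sketch of checkerboard intrinsic calibration with OpenCV; the 9x6 inner-corner layout and the image path are illustrative assumptions:

```python
import glob
import cv2
import numpy as np

PATTERN = (9, 6)  # inner-corner layout of the (hypothetical) checkerboard

# World coordinates of the corners on the board plane (Z = 0), in board units
objp = np.zeros((PATTERN[0] * PATTERN[1], 3), np.float32)
objp[:, :2] = np.mgrid[0:PATTERN[0], 0:PATTERN[1]].T.reshape(-1, 2)

obj_points, img_points, img_size = [], [], None
for path in glob.glob("calib_images/*.png"):   # illustrative path
    gray = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
    img_size = gray.shape[::-1]                # (width, height)
    found, corners = cv2.findChessboardCorners(gray, PATTERN)
    if found:
        corners = cv2.cornerSubPix(
            gray, corners, (11, 11), (-1, -1),
            (cv2.TERM_CRITERIA_EPS + cv2.TERM_CRITERIA_MAX_ITER, 30, 1e-3))
        obj_points.append(objp)
        img_points.append(corners)

# K holds fx, fy, cx, cy; dist holds distortion coefficients;
# rvecs/tvecs are the per-view extrinsics of the board
ret, K, dist, rvecs, tvecs = cv2.calibrateCamera(
    obj_points, img_points, img_size, None, None)
print("reprojection error:", ret)
print("intrinsics:\n", K)
```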
Effective detection of targets in ultra-large resolution images
To achieve target detection in large images, a common method is to set a traversal window, slide this window across the ultra-high-resolution image to crop it into multiple sub-images, perform target extraction on each sub-image separately, and finally stitch together the extraction results of all sub-images and smooth them.
With a domain controller designed around current chip computing power, the usual processing logic for ultra-high-resolution images is to resize the image by some means, or to downsample it according to certain criteria (for example, into an N×N sub-image network) to reduce the resolution. However, both approaches may cause targets to be missed.
Two important issues need to be resolved here:
1) How to set the traversal window size;
Generally, a fixed-size traversal window cannot tile the full image into an integer number of sub-images, so many window edges have to be filled in by padding or extending the image. When training the network, we usually need to bring all samples to the same size, but the usual resize function destroys the aspect ratio of the image, and the aspect ratio matters greatly for detection quality. To better retain image features, the edge sub-images should be scaled with letterbox to the same size as the traversal sub-window. Letterbox resizes the image while maintaining the aspect ratio: it first resizes and then pads the surroundings with zero pixels as needed, as in the sketch below.
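A minimal letterbox sketch, assuming OpenCV and a square target size; zero padding follows the text above (a gray value such as 114 is also common in YOLO-style pipelines):

```python
import cv2

def letterbox(img, new_size=640, pad_value=0):
    """Resize while keeping the aspect ratio, then pad to new_size x new_size."""
    h, w = img.shape[:2]
    scale = min(new_size / h, new_size / w)           # uniform scale factor
    nh, nw = int(round(h * scale)), int(round(w * scale))
    resized = cv2.resize(img, (nw, nh), interpolation=cv2.INTER_LINEAR)

    # Distribute the remaining space as symmetric borders
    top = (new_size - nh) // 2
    bottom = new_size - nh - top
    left = (new_size - nw) // 2
    right = new_size - nw - left
    padded = cv2.copyMakeBorder(resized, top, bottom, left, right,
                                cv2.BORDER_CONSTANT, value=pad_value)
    # scale and (left, top) are needed later to map boxes back to the original image
    return padded, scale, (left, top)
```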
2) If a target falls on the edge of a cropped sub-image, how to ensure that it is not cut off;
Note that if a target happens to sit at the edge of a window, it occupies few pixels and is truncated, so it is easily split during sliding-window detection, which ultimately makes it harder to detect. Therefore, sliding-window cropping must use a certain overlap between neighboring windows. The side effect is that a target cut in two at a window edge now falls inside the overlapped region, so the same target appears repeatedly in multiple detection crops. The solution is to merge the detection results of all sub-images and filter them with non-maximum suppression, as in the sketch below.
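A minimal sketch of overlapped tiling plus a global merge, assuming a `detector` callable that returns boxes as (x1, y1, x2, y2, score) in crop coordinates; the tile size, overlap, and IoU threshold are illustrative:

```python
import numpy as np

def tile_coords(img_h, img_w, tile=1024, overlap=128):
    """Yield top-left corners of overlapping tiles covering the whole image."""
    step = tile - overlap
    ys = list(range(0, max(img_h - tile, 0) + 1, step))
    xs = list(range(0, max(img_w - tile, 0) + 1, step))
    if ys[-1] + tile < img_h:
        ys.append(img_h - tile)          # extra row so the bottom edge is covered
    if xs[-1] + tile < img_w:
        xs.append(img_w - tile)          # extra column for the right edge
    for y in ys:
        for x in xs:
            yield x, y

def nms(boxes, iou_thr=0.5):
    """Greedy non-maximum suppression over (x1, y1, x2, y2, score) rows."""
    boxes = boxes[boxes[:, 4].argsort()[::-1]]
    keep = []
    while len(boxes):
        best, boxes = boxes[0], boxes[1:]
        keep.append(best)
        if not len(boxes):
            break
        xx1 = np.maximum(best[0], boxes[:, 0])
        yy1 = np.maximum(best[1], boxes[:, 1])
        xx2 = np.minimum(best[2], boxes[:, 2])
        yy2 = np.minimum(best[3], boxes[:, 3])
        inter = np.clip(xx2 - xx1, 0, None) * np.clip(yy2 - yy1, 0, None)
        area_b = (best[2] - best[0]) * (best[3] - best[1])
        area_o = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
        iou = inter / (area_b + area_o - inter + 1e-9)
        boxes = boxes[iou < iou_thr]     # drop duplicates of the kept box
    return np.stack(keep) if keep else np.zeros((0, 5))

def detect_large_image(img, detector, tile=1024, overlap=128):
    """Run `detector` on overlapping crops, shift boxes back, merge with NMS."""
    h, w = img.shape[:2]
    all_boxes = []
    for x, y in tile_coords(h, w, tile, overlap):
        crop = img[y:y + tile, x:x + tile]
        for x1, y1, x2, y2, score in detector(crop):
            all_boxes.append([x1 + x, y1 + y, x2 + x, y2 + y, score])
    if not all_boxes:
        return np.zeros((0, 5))
    return nms(np.array(all_boxes, dtype=np.float32))
```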
During target detection, the rotation-invariance property of autonomous-driving images can also be exploited: rotating images as a form of data augmentation generates objects in more orientations and alleviates the problem.
At the same time, to retain as much of the original image information as possible, the original image is generally first upsampled by a factor of two to generate a set of sampled images. To keep subsequent processing targeted and real-time, downsampling should then be performed after Gaussian blurring. In practice, to improve computing efficiency, large downsampling factors are often used (for example, a 32x downsampling rate). What needs attention during downsampling is to avoid over-downsampling, because it may cause small targets in the high-resolution image to be filtered out entirely. A good approach is to reduce the downsampling factor per stage and increase the number of sampling layers, which effectively improves feature-extraction capability; see the sketch below.
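A minimal blur-then-downsample sketch using OpenCV's Gaussian pyramid operations; the number of levels is illustrative:

```python
import cv2

def gaussian_pyramid(img, levels=4):
    """Each level is Gaussian-blurred and then downsampled by 2 (cv2.pyrDown),
    so the total reduction is 2**levels in small steps rather than one
    aggressive resize that would wipe out small targets."""
    pyramid = [img]
    for _ in range(levels):
        img = cv2.pyrDown(img)   # blur + 2x downsample, limits aliasing
        pyramid.append(img)
    return pyramid

# cv2.pyrUp doubles the size per step and can be used for the initial 2x
# upsampling mentioned above.
```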
In addition, image target-detection tasks may suffer from foreground/background imbalance, and the amount of data may differ greatly between categories. First, data up- and down-sampling can be used to balance the classes; second, data augmentation can increase the proportion of foreground targets in an image; finally, the detection weights of different targets can be adjusted by designing the cost function to control how strongly each is optimized, as in the sketch below.
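A minimal sketch of a class-weighted cross-entropy loss in PyTorch; the class list and weight values are illustrative assumptions, with rarer classes given larger weights:

```python
import torch
import torch.nn as nn

# Hypothetical class order: background, car, pedestrian, cone.
# Rare foreground classes get larger weights so their errors cost more.
class_weights = torch.tensor([0.2, 1.0, 2.0, 4.0])
criterion = nn.CrossEntropyLoss(weight=class_weights)

logits = torch.randn(8, 4, requires_grad=True)   # batch of 8 predictions over 4 classes
targets = torch.randint(0, 4, (8,))              # ground-truth class indices
loss = criterion(logits, targets)
loss.backward()                                  # gradients reflect the class weights
```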
3) How to ensure that small target objects will not be missed
In high-resolution images, small-target detection has always been difficult. The usual approach is multi-scale training with image pyramids. A feature pyramid carries different information from shallow to deep layers: the shallow layers carry more fine-grained detail features, while the deeper layers carry more semantic feature information. By downsampling the original large image to varying degrees, several pyramid levels of decreasing resolution are generated, and a sub-image classifier is then slid over each level, from shallow to deep, so that targets of different sizes are detected effectively, as in the sketch below.
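A minimal multi-scale sketch over an image pyramid, reusing the `detect_large_image` and `nms` helpers sketched earlier; the scale factors are illustrative, and boxes are mapped back to original-image coordinates before the final merge:

```python
import cv2
import numpy as np

def detect_multiscale(img, detector, scales=(1.0, 0.5, 0.25), iou_thr=0.5):
    """Run the detector on several pyramid levels and merge all boxes."""
    all_boxes = []
    for s in scales:
        level = img if s == 1.0 else cv2.resize(
            img, None, fx=s, fy=s, interpolation=cv2.INTER_AREA)
        boxes = detect_large_image(level, detector)   # tiling + NMS per level
        if len(boxes):
            boxes = boxes.copy()
            boxes[:, :4] /= s                         # back to original coordinates
            all_boxes.append(boxes)
    if not all_boxes:
        return np.zeros((0, 5))
    return nms(np.concatenate(all_boxes, axis=0), iou_thr)
```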
Optimization scheme for monocular visual depth information estimation
Current assisted-driving or autonomous-driving systems usually use monocular vision for target depth estimation. Monocular ranging mainly performs target recognition through inter-frame image matching, then estimates the target distance from the target's size in the image. Monocular ranging amounts to projecting a 3D scene onto a 2D image, and extracting geometric position coordinates from a single image requires considering not only local cues but also the global context of the whole video frame. This process builds on convolutional neural networks, whose core lies in local connections within the receptive field, weight sharing in the convolutions, and spatial or temporal downsampling in the pooling layers. The biggest advantage of convolutional neural networks is their powerful feature-extraction ability, which makes them strong at detecting local details; conversely, their ability to capture global target information is relatively weak.
Monocular visual estimation only establishes the geometric relationship between the world coordinates of the observed object and its image pixel coordinates through the optical geometric model (i.e., the pinhole imaging model); combined with the camera's intrinsic and extrinsic calibration results, the distance to the vehicle or obstacle ahead is calculated. The advantages of monocular estimation are low cost, a simple system structure, and modest computational requirements. The disadvantage is that the recognition process requires matching against a huge number of data samples, which leads to both a large ranging delay and a low accuracy rate. In this respect it falls well short of binocular cameras, which measure distance directly from disparity maps.
To compensate for this shortcoming in global detection capability, the Transformer was proposed in 2017, with the attention mechanism at its core. Its built-in long-range interactions guarantee coverage from shallow to deep layers and greatly improve global modeling capability. Combining CNN- and Transformer-based image detection and tracking therefore improves vehicle target-tracking capability. The attention operation itself is sketched below, and the basic architecture is described after it.
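A minimal sketch of the scaled dot-product attention at the heart of the Transformer; the tensor shapes are illustrative:

```python
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(q, k, v):
    """q, k, v: (batch, tokens, dim). Every token attends to every other token,
    which is what gives the Transformer its global receptive field."""
    d = q.size(-1)
    scores = q @ k.transpose(-2, -1) / d ** 0.5   # (batch, tokens, tokens)
    weights = F.softmax(scores, dim=-1)
    return weights @ v                            # (batch, tokens, dim)

x = torch.randn(2, 196, 64)                       # e.g. 14x14 patch tokens
out = scaled_dot_product_attention(x, x, x)       # self-attention
```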
In this architecture, the input image information is first encoded and decoded; the decoded features represent high-resolution, local pixel-level features. The decoded features then pass through global attention, which computes a unit width vector for each input image. The vector output has two parts: one defines how the depth range is divided into intervals for the depth image; the other carries the information needed for pixel-level depth calculation.
For the Transformer output, a set of two-dimensional convolution kernels is convolved with the decoded feature map to obtain the range attention map R. Second, the input unit vector of a given size (h, w) is convolved with a p×p kernel and stride s, producing an output tensor of size h/p × w/p × s. Finally, after normalization, a unit width vector is generated for calculating the depth-interval widths b of the image.
Final depth map information = global information R + local information b.
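A minimal sketch of how such interval widths and a per-pixel probability map over the intervals can be combined into a depth map, in the spirit of adaptive-bin depth estimation; the shapes and depth range are illustrative assumptions:

```python
import torch

def depth_from_bins(bin_widths, probs, d_min=0.1, d_max=80.0):
    """bin_widths: (batch, n_bins), each row normalized to sum to 1.
    probs: (batch, n_bins, H, W), per-pixel probability over the bins.
    Returns (batch, 1, H, W) depth as a probability-weighted sum of bin centers."""
    widths = bin_widths * (d_max - d_min)               # widths in meters
    edges = d_min + torch.cumsum(widths, dim=1)         # right edge of each bin
    centers = edges - 0.5 * widths                      # bin centers
    centers = centers.view(centers.size(0), centers.size(1), 1, 1)
    return (probs * centers).sum(dim=1, keepdim=True)   # expectation over bins

# Example with made-up shapes
b = torch.softmax(torch.randn(2, 64), dim=1)            # normalized bin widths
p = torch.softmax(torch.randn(2, 64, 120, 160), dim=1)  # per-pixel bin probabilities
depth = depth_from_bins(b, p)                           # (2, 1, 120, 160)
```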
Although the above monocular deep-learning approach has good feature-extraction capability, even the best feature-extraction operator cannot cover the characteristics of all dynamic objects in a scene; for example, cars are easily misidentified as trucks. In engineering development, geometric constraints derived from the real scene (such as size information, spatial position information, and motion-coherence information) can be added to raise the detection rate and reduce the false-detection rate, so that a 3D detection model can be trained and then, together with back-end multi-target tracking optimization and monocular geometry-based ranging, complete the detection function module.
Accuracy problem of road scene information detection
1. Improvement plan for drivable area detection and analysis
Typical visual detection problems can be grouped into several major categories. The previous section discussed small-target detection. For the next generation of autonomous driving systems, the problems that must be solved also include drivable-area detection: this detection separates vehicles, road edges, and obstacle-free regions, and finally outputs a safe area through which the vehicle can pass.
Drivable area detection (well-lit vs. night)
Drivable-area detection is essentially a semantic segmentation problem in deep learning. Techniques commonly used there, such as dilated convolutions, pooling pyramids, path aggregation, and context encoding, all apply well. However, many problems remain in drivable-area detection:
First, and most important, the detected static boundaries or dynamic obstacle boundaries still carry some uncertainty, and this uncertainty makes it impossible to plan effective trajectories and make state decisions for the vehicle's driving behavior. To address this, the semantic edge information can be corrected by matching it against curbs, lane lines, and target-box results, and the drivable area can be defined as a vector envelope or as a raster map (a sketch of the raster-map form follows these three points).
Second, drivable-area detection is prone to data imbalance, and this imbalance mostly shows up in the training stage; a reasonable loss function and data up-sampling rate need to be defined for optimization.
Third, drivable-area detection can be degraded by factors such as lighting, dust, heavy snow, and fog, so vision and radar must be fully combined for obstacle detection to guarantee detection stability.
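A minimal sketch of turning a binary drivable-area segmentation mask into the coarse raster (grid) map mentioned above; the grid cell size and occupancy threshold are illustrative assumptions:

```python
import numpy as np

def mask_to_raster(mask, cell=32, thresh=0.5):
    """mask: (H, W) binary array, 1 = drivable. Returns a coarse grid where a
    cell is marked drivable only if enough of its pixels are drivable."""
    h, w = mask.shape
    gh, gw = h // cell, w // cell
    # Crop to a multiple of the cell size, then average within each cell
    cropped = mask[:gh * cell, :gw * cell].astype(np.float32)
    cells = cropped.reshape(gh, cell, gw, cell).mean(axis=(1, 3))
    return (cells >= thresh).astype(np.uint8)   # (gh, gw) raster map

# Example: a mask whose lower half is drivable
m = np.zeros((480, 640), dtype=np.uint8)
m[240:, :] = 1
grid = mask_to_raster(m)   # 15 x 20 grid, bottom rows marked drivable
```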
2. Improvement plan for lane line detection problem
In the visual perception of autonomous driving, lane lines serve as the basis for lateral centering control, and detecting them is the most basic requirement. Many lane-line detection algorithms have been developed; the main detection difficulties include:
First, lane lines have an elongated morphology, which requires continuity in tracking and may even require certain image-stitching techniques. The corresponding detection methods need to use different hierarchical division mechanisms to obtain the global spatial structure, and corner detection can also be used to secure the positioning accuracy of local details.
Second, the shape of lane lines is easily affected by external interference (such as occlusion, wear, or discontinuity where the road changes), so there are many uncertainties. The solution is to use algorithms with strong inference capability to reason about these edge cases.
Third, while driving-assistance functions are active (such as automatic lane changing and lane keeping), the vehicle switches between the left and right lane lines as it crosses over them. Besides adding a filter delay, the solution can also be optimized by assigning fixed index numbers to the lane lines in advance.
3. Traffic signs and cone recognition problems
In autonomous driving systems, recognizing small targets such as traffic signs and cones is an important and urgent problem. Such problems are usually still handled by basic neural networks for feature extraction and generalization, but this requires a large number of prior databases as support. Traffic-sign detection has the following difficulties. First, since traffic signs, cones, and barrels are small targets, the detection process requires stronger feature extraction and may even need more pyramid layers in the neural network. Second, different signs and cones (such as circular traffic lights, arrow traffic lights, countdown traffic lights; ice-cream cones, triangular cones, trapezoidal cones, etc.) have different shapes, and this diversity is a problem that has to be solved. Third, the scenes are highly complex, for example the installation position and orientation of signal lights at intersections, or the start and end points of cone lines in construction areas.
Summary
Visual perception in intelligent driving has always been a key concern in the industry. It not only affects subsequent trajectory planning and decision-making control, but is also the key to whether an assisted-driving system can be further upgraded to autonomous driving. We have already been paying attention to the scene recognition and detection capabilities that overall visual perception requires; in the future, more attention needs to go to overcoming the scene limitations of visual perception. This article has laid out the different solution paths from the perspectives of visual perception tasks, capabilities, limitations, and improvement plans, and they have good implementation value in engineering applications.