The importance of road environment perception for mobile robot navigation

Publisher: 独行侠客 | Author: Lemontree | Source: INDEMIND | Last updated: 2023-07-23

This article focuses on the importance of road environment perception for mobile robot navigation and the related technologies.

01 Construction of a 3D road geometry model

First, the mechanism behind the 3D geometric model is multi-view geometry: to recover the 3D geometric structure of a scene, the camera must observe it from two different positions. As shown in Figure 1, two cameras placed at different positions can be used to build a 3D geometric model; equivalently, a single camera can be moved continuously and 3D reconstruction performed frame by frame.

The main principle is to use the image planes of the left and right cameras to compute R and T. In SLAM, corresponding points are first matched between the images; at least eight correspondences are then used, together with SVD, to solve for the essential matrix, which is decomposed to obtain the rotation R and translation T. Once the relative pose of the two cameras is known, the coordinates of the corresponding 3D points can be triangulated.

Figure 1 Depth estimation
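As a minimal sketch of this pipeline (matched points → essential matrix → R, T → triangulated 3D points), the following assumes OpenCV; the variable names `pts_left`, `pts_right` and `K` are illustrative, not from the original work:

```python
# Minimal sketch of two-view pose recovery and triangulation with OpenCV.
# Assumes pts_left, pts_right are matched pixel coordinates (Nx2 float32)
# and K is the 3x3 camera intrinsic matrix.
import cv2
import numpy as np

def two_view_reconstruction(pts_left, pts_right, K):
    # Estimate the essential matrix from matched points (RANSAC rejects outliers).
    E, mask = cv2.findEssentialMat(pts_left, pts_right, K, method=cv2.RANSAC)

    # Decompose E (internally via SVD) into relative rotation R and translation t.
    # For a monocular pair, t is only known up to scale.
    _, R, t, mask_pose = cv2.recoverPose(E, pts_left, pts_right, K, mask=mask)

    # Projection matrices of the two views: P = K [R | t].
    P0 = K @ np.hstack([np.eye(3), np.zeros((3, 1))])
    P1 = K @ np.hstack([R, t])

    # Triangulate; the result is homogeneous, so divide by the last row.
    pts_h = cv2.triangulatePoints(P0, P1, pts_left.T, pts_right.T)
    pts_3d = (pts_h[:3] / pts_h[3]).T
    return R, t, pts_3d
```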

For binocular (stereo) vision, the image planes need to be rectified first, because with the approach above, matching feature points is a two-dimensional search problem and the amount of computation is relatively large.

Therefore, the binocular images are warped onto the common (red) plane in Figure 2. After the epipoles are pushed to infinity, corresponding-point matching becomes a one-dimensional search: for a point selected in the left image, the corresponding point in the right image only needs to be searched for along the same row.

The advantage of using a binocular camera for depth estimation is that a fixed baseline is obtained through camera calibration. The one-dimensional search then saves a great deal of computation and yields a dense disparity map, which in turn corresponds to a dense depth map and finally a dense set of 3D points.

Figure 2 Stereo matching
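A minimal sketch of this step, assuming a rectified pair and using OpenCV's semi-global matcher (one possible traditional matcher, not necessarily the one used by the authors); `fx` and `B` are the focal length in pixels and the baseline from calibration:

```python
# Dense disparity and depth from a rectified stereo pair.
# img_left, img_right: rectified 8-bit grayscale images.
import cv2
import numpy as np

def stereo_depth(img_left, img_right, fx, B, max_disp=128):
    # Semi-global block matching: thanks to rectification, the search for each
    # left-image pixel is a 1-D scan along the same row of the right image.
    matcher = cv2.StereoSGBM_create(minDisparity=0,
                                    numDisparities=max_disp,
                                    blockSize=5)
    disp = matcher.compute(img_left, img_right).astype(np.float32) / 16.0  # SGBM output is fixed-point x16

    # Depth from disparity: Z = fx * B / d (valid only where d > 0).
    depth = np.full_like(disp, np.inf)
    valid = disp > 0
    depth[valid] = fx * B / disp[valid]
    return disp, depth
```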

With the development of the field, many deep learning networks can now be used to obtain disparity maps, but most of them are data-driven, and a major problem with data-driven methods is that the ground truth is often unknown.

Of course, a LiDAR can be synchronized with the cameras, its point cloud projected onto the binocular images, and the resulting depth converted into disparity. Although this provides ground truth, its quality is limited by the accuracy of the camera-LiDAR calibration.
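A hedged sketch of how such sparse disparity "ground truth" can be produced; `pts_lidar`, the extrinsics `R_cl`, `t_cl` (LiDAR to camera), intrinsics `K`, focal length `fx`, baseline `B`, and the image size are all assumed inputs:

```python
# Project a LiDAR point cloud into the left camera and convert depth to disparity.
import numpy as np

def lidar_to_sparse_disparity(pts_lidar, R_cl, t_cl, K, fx, B, h, w):
    # Transform points into the camera frame; keep those in front of the camera.
    pts_cam = pts_lidar @ R_cl.T + t_cl
    pts_cam = pts_cam[pts_cam[:, 2] > 0.5]

    # Perspective projection into pixel coordinates.
    uvw = pts_cam @ K.T
    uv = uvw[:, :2] / uvw[:, 2:3]
    u, v, z = uv[:, 0].astype(int), uv[:, 1].astype(int), pts_cam[:, 2]

    # Keep projections that land inside the image.
    ok = (u >= 0) & (u < w) & (v >= 0) & (v < h)

    # Depth to disparity: d = fx * B / Z. Accuracy is limited by the
    # camera-LiDAR calibration, as noted in the text.
    disp = np.zeros((h, w), dtype=np.float32)
    disp[v[ok], u[ok]] = fx * B / z[ok]
    return disp
```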

Based on this, we explored a number of self-supervised methods and designed the PVStereo structure shown in Figure 3.

Figure 3 PVStereo structure

As can be seen, a traditional matching method is used to match the images at different pyramid levels. A pixel's disparity is assumed to be reliable only if it is consistent no matter which pyramid level it comes from, which is consistent with the assumptions made in deep learning. Traditional pyramid voting therefore produces a disparity map that is relatively accurate but sparse.

Inspired by the KITTI dataset, where sparse ground truth is enough to train strong networks, traditional methods are used here to estimate pseudo ground-truth disparities, avoiding the need for real ground truth when training the network. A sketch of this idea follows.
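The following is an illustrative sketch of the pyramid-voting idea only, not the exact PVStereo pipeline: run a traditional matcher (here OpenCV block matching on grayscale images) at several scales and keep only the pixels whose disparities agree across levels.

```python
# Pyramid voting sketch: accurate-but-sparse pseudo ground truth.
import cv2
import numpy as np

def pyramid_voting_disparity(img_l, img_r, levels=3, tol=1.0):
    h, w = img_l.shape[:2]
    candidates = []
    for lvl in range(levels):
        scale = 1.0 / (2 ** lvl)
        size = (int(w * scale), int(h * scale))
        dl = cv2.resize(img_l, size)
        dr = cv2.resize(img_r, size)
        matcher = cv2.StereoBM_create(numDisparities=64, blockSize=15)
        disp = matcher.compute(dl, dr).astype(np.float32) / 16.0
        # Rescale both the image size and the disparity values back to full resolution.
        disp = cv2.resize(disp, (w, h)) / scale
        candidates.append(disp)

    stack = np.stack(candidates)                    # levels x h x w
    spread = stack.max(axis=0) - stack.min(axis=0)  # cross-level disagreement
    voted = np.where((spread < tol) & (stack[0] > 0), stack[0], 0.0)
    return voted  # only cross-level-consistent pixels survive
```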

Building on this, the recurrent OptStereo network is proposed, as shown in Figure 4. It first constructs a multi-scale cost volume and then uses a recurrent unit to iteratively update a high-resolution disparity estimate. This avoids the error-accumulation problem of the coarse-to-fine paradigm and, thanks to its simplicity and efficiency, achieves a good trade-off between accuracy and speed.
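The toy PyTorch sketch below illustrates only the recurrent-refinement idea just described, not the actual OptStereo architecture: a single update unit repeatedly refines a high-resolution disparity estimate with additive increments (here applied per pixel for simplicity; real networks use convolutional recurrent units).

```python
import torch
import torch.nn as nn

class RecurrentRefiner(nn.Module):
    def __init__(self, feat_ch=64, hidden_ch=64):
        super().__init__()
        self.gru = nn.GRUCell(feat_ch + 1, hidden_ch)  # input: cost features + current disparity
        self.head = nn.Linear(hidden_ch, 1)             # predicts a disparity increment

    def forward(self, cost_feat, disp_init, iters=8):
        # cost_feat: (N, feat_ch) per-pixel features sampled from the cost volume
        # disp_init: (N, 1) initial disparity, refined over `iters` steps
        n = cost_feat.shape[0]
        hidden = cost_feat.new_zeros(n, self.gru.hidden_size)
        disp = disp_init
        for _ in range(iters):
            x = torch.cat([cost_feat, disp], dim=1)
            hidden = self.gru(x, hidden)
            disp = disp + self.head(hidden)  # additive updates at full resolution,
                                             # instead of a coarse-to-fine cascade
        return disp
```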

The experimental results are relatively robust, but outliers may appear in some scenarios.

Figure 4 Disparity map generation

Since ground truth is difficult to obtain, one option is to use traditional methods to estimate pseudo ground truth and train the network with it; another is to train in an unsupervised manner. Building on the previous work, CoT-Stereo was proposed, as shown in Figure 5.

It uses two networks, A and B, with exactly the same structure but different initializations, like two students who start out with different knowledge. Network A shares the knowledge it believes is correct with network B, and network B does the same for network A, so the two continuously learn from each other and improve together.

Figure 5 CoT-Stereo architecture
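A hedged sketch of the co-teaching idea only (not the exact CoT-Stereo training loop): each network is assumed to output a disparity map and a per-pixel confidence, selects its confident pixels, and uses them as pseudo-labels to supervise the other network.

```python
import torch
import torch.nn.functional as F

def co_teaching_step(net_a, net_b, opt_a, opt_b, left, right, conf_thresh=0.9):
    # Assumed interface: each network returns (disparity, confidence).
    disp_a, conf_a = net_a(left, right)
    disp_b, conf_b = net_b(left, right)

    # A teaches B with its confident pixels, and vice versa; targets are
    # detached so gradients do not flow back into the "teacher".
    mask_a = (conf_a > conf_thresh).float()
    mask_b = (conf_b > conf_thresh).float()
    loss_b = (mask_a * F.smooth_l1_loss(disp_b, disp_a.detach(), reduction='none')).mean()
    loss_a = (mask_b * F.smooth_l1_loss(disp_a, disp_b.detach(), reduction='none')).mean()

    opt_a.zero_grad(); loss_a.backward(); opt_a.step()
    opt_b.zero_grad(); loss_b.backward(); opt_b.step()
```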

The unsupervised binocular estimation results are also compared with many other methods. Although the accuracy cannot match that of fully supervised methods, the network achieves a good overall balance between accuracy and efficiency, as shown in Figure 6.

Figure 6 Experimental results

How can depth or disparity be converted into surface-normal information? In some perception tasks, depth alone turns out not to be very informative, and training on RGB-D data brings other problems. Surface normals, by contrast, provide roughly the same kind of information whether a surface is near or far, so they serve as useful additional input for many tasks.

We found little or almost no existing work on quickly converting depth or disparity maps into surface normals, so this is what is studied here. The original goal was to translate depth into normals while consuming almost no computing resources. The overall framework is shown in Figure 7.

Figure 7 Three-Filters-to-Normal framework

As can be seen, the starting point is the basic perspective projection: a 3D point is mapped to image coordinates using the camera intrinsics. If the local surface around the point is assumed to satisfy a plane equation, then combining these two relations yields, perhaps surprisingly, a simple expression for the inverse depth 1/Z.
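A hedged reconstruction of that step in standard pinhole notation (the symbols here are generic, not necessarily the paper's):

```latex
\[
u = f_x \frac{X}{Z} + c_x, \qquad v = f_y \frac{Y}{Z} + c_y,
\qquad n_x X + n_y Y + n_z Z + d = 0 .
\]
Dividing the plane equation by $Z$ and substituting $X/Z$ and $Y/Z$ from the projection model gives
\[
\frac{1}{Z} \;=\; -\frac{1}{d}\left( n_x \frac{u - c_x}{f_x} + n_y \frac{v - c_y}{f_y} + n_z \right),
\]
so the inverse depth $1/Z$ is an affine function of the pixel coordinates $(u, v)$, and its partial derivatives
\[
\frac{\partial (1/Z)}{\partial u} = -\frac{n_x}{d\, f_x}, \qquad
\frac{\partial (1/Z)}{\partial v} = -\frac{n_y}{d\, f_y}
\]
recover $n_x$ and $n_y$ up to a common scale, with $n_z$ then following from the plane relation.
```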

After some calculation, it turns out that the partial derivative of 1/Z with respect to u is very easy to handle in image processing: 1/Z is proportional to the disparity, so taking partial derivatives of 1/Z amounts to convolving the disparity map with gradient filters. The normal-vector estimation therefore does not need to convert the depth map into a 3D point cloud, run KNN, and fit local planes as traditional methods do, which is a complicated process. Instead, given the depth map or disparity map, the normals can be computed directly.
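The sketch below illustrates this filtering idea (inspired by, but not identical to, the published Three-Filters-to-Normal method), using the derivation above; `inv_z` can be either 1/Z or the disparity map, since the constant factor f·B cancels after normalization:

```python
# Surface normals directly from an inverse-depth (or disparity) map:
# image gradients replace point-cloud construction, KNN, and plane fitting.
import cv2
import numpy as np

def normals_from_inverse_depth(inv_z, fx, fy, cx, cy):
    h, w = inv_z.shape
    u, v = np.meshgrid(np.arange(w, dtype=np.float32),
                       np.arange(h, dtype=np.float32))

    # Horizontal/vertical gradient filters on the inverse-depth map
    # (3x3 Sobel, divided by 8 to approximate a unit-step derivative).
    gu = cv2.Sobel(inv_z, cv2.CV_32F, 1, 0, ksize=3) / 8.0
    gv = cv2.Sobel(inv_z, cv2.CV_32F, 0, 1, ksize=3) / 8.0

    # n_x and n_y up to a common scale, from the derivatives of 1/Z;
    # n_z then follows from the plane relation.
    nx = fx * gu
    ny = fy * gv
    nz = inv_z - nx * (u - cx) / fx - ny * (v - cy) / fy

    n = np.stack([nx, ny, nz], axis=-1)
    n /= (np.linalg.norm(n, axis=-1, keepdims=True) + 1e-12)
    return n
```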

We ran a series of experiments with this method; the results are shown in Figure 8. Compared with the mainstream methods at the time, it strikes a very good balance between speed and accuracy: although its accuracy is slightly lower than the best, it surpasses most other methods, and at a resolution of 640×480 it runs at 260 Hz on a single CPU core and up to 21 kHz with CUDA.

Figure 8 Experimental results

With the above information in hand, the scene must then be analyzed. The current mainstream approaches are semantic segmentation and object/instance segmentation; for scene understanding, and semantic segmentation in particular, most traditional methods work on RGB information alone.

The focus here is RGB-X, i.e., how to extract features from RGB plus depth or surface normals. The main application is freespace (drivable-area) detection, that is, detecting the area that can be driven over. The framework shown in Figure 9 is proposed.

Figure 9 Network structure

A dual-path structure is used to extract features separately: one path extracts features from the RGB information, and the other extracts features from depth or normals (depth is first converted into normals). The features from the two streams are then fused to obtain a better representation that contains both the texture information of the RGB image and the geometric information of the depth image, and the fused features finally produce a better semantic segmentation map.
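A minimal PyTorch sketch of this dual-path idea (illustrative only, not the published network): two small encoders, simple concatenation-based fusion, and a decoder that outputs the segmentation map.

```python
import torch
import torch.nn as nn

class DualPathSeg(nn.Module):
    def __init__(self, num_classes=2, ch=32):
        super().__init__()
        def encoder(in_ch):
            return nn.Sequential(
                nn.Conv2d(in_ch, ch, 3, stride=2, padding=1), nn.ReLU(inplace=True),
                nn.Conv2d(ch, ch, 3, stride=2, padding=1), nn.ReLU(inplace=True))
        self.rgb_enc = encoder(3)       # texture path (RGB)
        self.normal_enc = encoder(3)    # geometry path (surface normals)
        self.fuse = nn.Conv2d(2 * ch, ch, 1)  # concatenate, then 1x1 fusion
        self.decoder = nn.Sequential(
            nn.Upsample(scale_factor=4, mode='bilinear', align_corners=False),
            nn.Conv2d(ch, num_classes, 3, padding=1))

    def forward(self, rgb, normals):
        f = torch.cat([self.rgb_enc(rgb), self.normal_enc(normals)], dim=1)
        return self.decoder(self.fuse(f))
```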

Some improvements are made on top of this version, as shown in Figure 10. Since the fusion structure of the network is relatively complex, there is room for further improvement, so two changes are made. First, deep supervision is used to add constraints to the different channels during learning, which addresses the gradient-explosion problem. Second, because the previous network converged too quickly, a new module called SNE+ is designed, which performs better than SNE.

Figure 10 Improved network structure

The work above is all feature-level fusion; here some data-level fusion is studied as well: how can performance be improved using multiple views but a single ground truth? The network structure shown in Figure 11 is proposed.

It is mainly based on the planar homography: a homography matrix can be estimated from four pairs of corresponding points, and once the homography and the left and right images are known, one image can be warped into the viewpoint of the other. As shown, given a reference image and a target image, the homography matrix is estimated from the corresponding points, and the target image can then be warped directly into a generated image.
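A hedged sketch of this homography step using OpenCV; `pts_ref` and `pts_tgt` are assumed Nx2 arrays of matched points lying on the road plane, and the names are illustrative only:

```python
# Estimate a planar homography and warp the target view into the reference view.
import cv2
import numpy as np

def warp_to_reference(img_tgt, pts_tgt, pts_ref, out_size):
    # A homography has 8 degrees of freedom, so 4 point pairs are sufficient;
    # RANSAC makes the estimate robust to mismatched points.
    H, inliers = cv2.findHomography(pts_tgt, pts_ref, cv2.RANSAC, 3.0)

    # Re-render the target image from the reference viewpoint. This is only
    # valid for points that actually lie on the plane that H models.
    generated = cv2.warpPerspective(img_tgt, H, out_size)  # out_size = (width, height)
    return H, generated
```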
