Research status of visual SLAM

Publisher: RadiantDusk | Last updated: 2024-01-10 | Source: elecfans

Some studies use the pose estimates of visual SLAM to annotate point clouds in the mapping stage. Zhang et al. (2018b) proposed a monocular visual SLAM method aided by a one-dimensional laser rangefinder, which achieves effective drift correction on low-cost hardware and addresses the scale drift that commonly affects monocular SLAM. Scherer et al. (2012) used drones to map waterways and riverside vegetation: the state was estimated with a fusion framework combining visual odometry and IMU measurements, while lidar was used to detect obstacles and map the river boundary. However, this method produced point clouds containing occluded points, which reduced the accuracy of state estimation to some extent. Huang et al. (2019) addressed this problem with a direct SLAM method that includes occluded-point detection and coplanar-point detection mechanisms.
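As a concrete illustration of how a 1D range measurement can resolve the monocular scale ambiguity, the sketch below (a toy example, not the actual method of Zhang et al., 2018b) fits a least-squares scale factor between the up-to-scale depths estimated by monocular SLAM and the metric ranges reported by the rangefinder; pairing each reading with the corresponding map-point depth is an assumption of the example.

```python
# Toy sketch: recovering the metric scale of a monocular SLAM map from a 1D
# laser rangefinder. Assumes each laser reading can be paired with the
# SLAM-estimated depth of the map point hit by the beam (an assumption of this
# example, not a detail taken from the cited paper).
import numpy as np

def estimate_scale(slam_depths, laser_ranges):
    """Least-squares scale s minimizing ||s * slam_depths - laser_ranges||."""
    d = np.asarray(slam_depths, dtype=float)
    r = np.asarray(laser_ranges, dtype=float)
    return float(np.dot(d, r) / np.dot(d, d))

# Example: monocular depths drifted to roughly half of the true metric scale.
depths = [1.02, 2.51, 3.98]          # up-to-scale depths from monocular SLAM
ranges = [2.05, 5.00, 8.01]          # metric ranges from the 1D rangefinder
s = estimate_scale(depths, ranges)   # ≈ 2.0
metric_depths = s * np.asarray(depths)
```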

3.4.2 LiDAR-guided methods

LiDAR-guided methods use visual information to improve the accuracy of loop closure detection, or construct a joint optimization function of LiDAR feature transformation error and visual reprojection error in the pose estimation stage to improve the robustness of pose estimation. For example, Bai et al. (2016) use convolutional neural networks to extract features for loop closure detection, effectively avoiding mismatches between loop closure scenes by constraining the matching range, and ensuring the real-time performance of the SLAM system through feature compression. Liang et al. (2016) combined scan matching with ORB-feature-based loop detection to improve the weak loop closure performance of LiDAR-based SLAM. Zhu et al. (2018) proposed a 3D laser SLAM method with visual loop detection, which performs loop detection using keyframes and a visual bag of words. In addition, the iterative closest point (ICP) method (Arun et al., 1987) can also be improved by lidar-visual fusion: Pande et al. (2011) used visual information to estimate the rigid-body transformation and, on this basis, proposed a generalized ICP framework.
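To make the idea of such a joint cost concrete, the Python sketch below stacks a LiDAR point-correspondence residual and a visual reprojection residual into one least-squares problem. The pure-translation pose model, the residual weights, and the pinhole intrinsics are assumptions made for this toy example, not a formulation taken from any of the cited papers.

```python
# Toy joint optimization: one sensor pose (reduced to a 3D translation t for
# brevity) is estimated from both LiDAR correspondences and visual reprojection.
import numpy as np
from scipy.optimize import least_squares

FX, FY, CX, CY = 500.0, 500.0, 320.0, 240.0  # assumed pinhole intrinsics

def joint_residuals(t, lidar_src, lidar_dst, landmarks, pixels,
                    w_lidar=1.0, w_vis=0.05):
    # LiDAR term: source points shifted by t should land on their correspondences.
    r_lidar = (lidar_src + t - lidar_dst).ravel()
    # Visual term: landmarks expressed in the camera frame (camera at t,
    # identity rotation) should reproject onto their observed pixels.
    p_cam = landmarks - t
    u = FX * p_cam[:, 0] / p_cam[:, 2] + CX
    v = FY * p_cam[:, 1] / p_cam[:, 2] + CY
    r_vis = np.stack([u, v], axis=1) - pixels
    return np.concatenate([w_lidar * r_lidar, w_vis * r_vis.ravel()])

# Synthetic data: the true translation is (0.2, 0.0, 0.1).
rng = np.random.default_rng(0)
t_true = np.array([0.2, 0.0, 0.1])
lidar_src = rng.uniform(-5, 5, (30, 3))
lidar_dst = lidar_src + t_true
landmarks = rng.uniform([-2, -2, 4], [2, 2, 10], (20, 3))
p_cam = landmarks - t_true
pixels = np.stack([FX * p_cam[:, 0] / p_cam[:, 2] + CX,
                   FY * p_cam[:, 1] / p_cam[:, 2] + CY], axis=1)
sol = least_squares(joint_residuals, np.zeros(3),
                    args=(lidar_src, lidar_dst, landmarks, pixels))
print(sol.x)  # ≈ t_true
```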

3.4.3 Vision-LiDAR mutual correction methods

Most of the above methods run a single SLAM pipeline and use the other sensor as an auxiliary device. Some studies instead combine the two SLAM pipelines so that they correct each other. VLOAM (Zhang and Singh, 2015) is a classic real-time method of this kind: it uses the camera poses estimated by visual odometry within one lidar sweep to correct the motion distortion of the laser point cloud, and uses the relative pose estimated from matching adjacent corrected LiDAR scans to refine the visually estimated pose; the corrected point cloud is then registered to the local map for subsequent pose optimization.
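The de-skewing step described above can be sketched as follows: each LiDAR point is corrected with a pose interpolated, by its timestamp, between the poses at the start and end of the sweep as estimated by visual odometry. The yaw-plus-translation interpolation and the function signature below are simplifications for illustration, not VLOAM's actual implementation.

```python
# Toy sketch of LiDAR de-skewing with odometry poses: every point is mapped
# into the frame at the start of the sweep using a pose interpolated by its
# timestamp. Only yaw and translation are interpolated here; real systems
# interpolate full SE(3) poses (e.g. with SLERP).
import numpy as np

def deskew_points(points, timestamps, t_start, t_end, trans_end, yaw_end):
    """points: (N, 3) raw LiDAR points; timestamps: (N,) acquisition times;
    trans_end / yaw_end: sensor motion over the sweep from visual odometry."""
    points = np.asarray(points, dtype=float)
    trans_end = np.asarray(trans_end, dtype=float)
    alpha = (np.asarray(timestamps, dtype=float) - t_start) / (t_end - t_start)
    corrected = np.empty_like(points)
    for i, (p, a) in enumerate(zip(points, alpha)):
        yaw = a * yaw_end                       # interpolated heading change
        c, s = np.cos(yaw), np.sin(yaw)
        R = np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])
        corrected[i] = R @ p + a * trans_end    # sensor pose at this point's time
    return corrected
```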

Seo and Chou (2019) proposed a parallel SLAM method that runs lidar SLAM and visual SLAM simultaneously and uses the measurement residuals of both modalities to optimize the back end. Jiang et al. (2019) defined the cost function of graph optimization using both LiDAR constraints and feature-point constraints, and constructed a 2.5D map to speed up loop closure detection. At present, there are fewer research results and practical applications of visual-LiDAR fusion SLAM than of visual-inertial fusion, and further exploration and research are needed.

3.5 Visual-LIDAR-IMU SLAM

At present, multi-sensor fusion methods (such as vision-LiDAR-IMU fusion SLAM) are considered suitable for L3 autonomous driving and have attracted the attention of many scholars. LiDAR-based SLAM systems can capture a wide range of environmental details, but they easily fail in scenes lacking structural information, such as the long corridors or open squares that are common in autonomous driving. Vision-based methods perform well in scenes with rich texture and can easily re-identify previously visited scenes (Shin et al., 2020), but they are very sensitive to illumination changes, rapid motion, and the initialization process. Therefore, LiDAR and vision sensors are often fused with an IMU to improve the accuracy and robustness of the system: the IMU can remove motion distortion from point clouds, bridge short stretches of feature-poor environments, and help the visual system recover scale information.
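The following minimal sketch illustrates why the IMU is useful as a bridge between camera or LiDAR updates: its measurements can propagate position, velocity, and orientation by dead reckoning over short feature-poor stretches. Biases, noise handling, and preintegration are omitted; the first-order rotation update is an assumption made for brevity.

```python
# Minimal IMU dead-reckoning sketch: one integration step that propagates
# position p, velocity v and rotation R from body-frame IMU readings.
import numpy as np

GRAVITY = np.array([0.0, 0.0, -9.81])

def propagate(p, v, R, acc_body, gyro, dt):
    """p, v: (3,) arrays; R: (3, 3) rotation; acc_body, gyro: (3,) IMU readings."""
    wx, wy, wz = gyro * dt
    dR = np.eye(3) + np.array([[0.0, -wz,  wy],
                               [ wz, 0.0, -wx],
                               [-wy,  wx, 0.0]])   # small-angle rotation update
    R_new = R @ dR
    acc_world = R @ acc_body + GRAVITY             # specific force in world frame
    v_new = v + acc_world * dt
    p_new = p + v * dt + 0.5 * acc_world * dt * dt
    return p_new, v_new, R_new
```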

At present, there are relatively few research results on visual-LiDAR-IMU fusion SLAM (Debeunne and Vivet, 2020b). Some scholars build a visual-IMU fusion system (visual-inertial system, VIS) and a LiDAR-IMU fusion system (LiDAR-inertial system, LIS) and then fuse these two separate modules into a LiDAR-visual-inertial system (LVIS) with better performance (Chen et al., 2018). This paper therefore also reviews the research status of LiDAR-IMU fusion SLAM methods. LiDAR-IMU fusion schemes fall into two categories: loosely coupled and tightly coupled. Typical loosely coupled schemes are LOAM (Figure 16 (a)) and LeGO-LOAM (Shan and Englot, 2018), in which the IMU measurements are not used in the optimization step. Tightly coupled schemes are still in the development stage, but they generally improve the accuracy and robustness of the system considerably. Among the currently available tightly coupled systems, LIO-Mapping (Ye et al., 2019) uses the optimization process of VINS-Mono to minimize IMU residuals and LiDAR measurement errors; since it aims to optimize all measurements, its real-time performance is poor. Zuo et al. proposed LIC-Fusion, shown in Figure 16 (b), which fuses LiDAR features extracted from the point cloud with sparse visual features; in the figure, the blue and red LiDAR points are plane and edge features, respectively, and the estimated trajectory is marked in green. To save computing resources, LIO-SAM (Figure 16 (c)) introduces a sliding-window optimization algorithm and uses a factor graph to jointly optimize the measurement constraints of the IMU and LiDAR. LINS (Figure 16 (e)), designed specifically for ground vehicles, uses an error-state iterated Kalman filter to correct the estimated state.
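As a rough illustration of the iterated error-state Kalman correction used by filters such as LINS, the toy sketch below repeatedly re-linearizes a range measurement to a known landmark and re-estimates a 2D position. The state, measurement model, noise value, and convergence threshold are all assumptions for illustration; the real LINS formulation operates on a full vehicle state.

```python
# Toy iterated Kalman update: a 2D position is corrected with one range
# measurement, re-linearizing around the latest estimate at each iteration.
import numpy as np

def iterated_update(x_prior, P, landmark, range_meas, R_meas=0.01, iters=5):
    x = x_prior.copy()
    for _ in range(iters):
        diff = x - landmark
        pred = np.linalg.norm(diff)               # predicted range h(x)
        H = (diff / pred).reshape(1, 2)           # Jacobian of h at current x
        S = H @ P @ H.T + R_meas                  # innovation covariance
        K = P @ H.T / S                           # Kalman gain
        innov = range_meas - pred - (H @ (x_prior - x)).item()
        x_new = x_prior + K.flatten() * innov     # corrected state
        if np.linalg.norm(x_new - x) < 1e-9:
            x = x_new
            break
        x = x_new
    P = (np.eye(2) - K @ H) @ P                   # posterior covariance
    return x, P

# Usage: prior at (1, 1), range of 3.2 m to a landmark at (4, 0).
x, P = iterated_update(np.array([1.0, 1.0]), 0.5 * np.eye(2),
                       np.array([4.0, 0.0]), 3.2)
```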

Zhang and Singh (2018) proposed a tightly coupled LVIO (LiDAR-visual-inertial odometry) system that uses a coarse-to-fine state estimation approach: a rough estimate from IMU prediction is successively refined by VIO and then LIO. LVIO is currently the algorithm with the highest test accuracy on the KITTI dataset. Zuo et al. (2019) implemented online spatiotemporal calibration of multiple sensors based on the MSCKF framework. Unfortunately, the implementations of Zhang and Singh (2018) and Zuo et al. (2019) are not currently open source. Shan et al. (2021) released LVI-SAM (Figure 16 (d)), the latest tightly coupled visual-LiDAR-IMU scheme, which uses smoothing and mapping algorithms to improve real-time performance. The authors treat the visual-inertial and LiDAR-inertial modules as two independent subsystems: when enough feature points are detected, the two subsystems are linked together, and when one subsystem fails, they can operate independently without affecting each other. Table 5 summarizes the main algorithms of visual-inertial SLAM frameworks in recent years.
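The failure-aware coupling described for LVI-SAM can be pictured with the conceptual sketch below: each subsystem reports a health indicator, and the output falls back to whichever subsystem is still valid. The thresholds, the health criteria, and the placeholder averaging fusion are assumptions for illustration, not the published algorithm.

```python
# Conceptual sketch of failure-aware coupling of two odometry subsystems.
def fuse_odometry(vis_pose, vis_tracked_features, lidar_pose, lidar_inliers,
                  min_features=50, min_inliers=100):
    vis_ok = vis_pose is not None and vis_tracked_features >= min_features
    lidar_ok = lidar_pose is not None and lidar_inliers >= min_inliers
    if vis_ok and lidar_ok:
        # Both subsystems healthy: combine them (placeholder: element-wise mean).
        return [(a + b) / 2.0 for a, b in zip(vis_pose, lidar_pose)]
    if lidar_ok:
        return lidar_pose   # vision degraded: rely on the LiDAR-inertial subsystem
    if vis_ok:
        return vis_pose     # LiDAR degraded: rely on the visual-inertial subsystem
    return None             # both failed: fall back to pure IMU propagation
```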

[Figure 16: representative LiDAR-inertial and LiDAR-visual-inertial systems: (a) LOAM, (b) LIC-Fusion, (c) LIO-SAM, (d) LVI-SAM, (e) LINS]

[Table 5: summary of the main visual-inertial SLAM algorithms in recent years]

4 Discussion

Although visual SLAM has achieved great success in the localization and mapping of autonomous vehicles, as described above, the existing technology is not yet mature enough to fully solve the current problems, and vision-based localization and mapping solutions are still in their infancy. To meet the requirements of autonomous driving in complex urban environments, future researchers face many challenges, and the practical application of these technologies should be treated as a systems-level research problem. Moreover, SLAM is only one component of the complex system that is an autonomous vehicle: the driving system cannot rely entirely on SLAM and must also include modules such as control, object detection, path planning, and decision-making. This section discusses the key open issues of vision and vision-based SLAM for autonomous vehicle applications and offers overall observations and inferences about future development trends.

4.1 Real-time Performance

The application of autonomous vehicles requires visual SLAM systems to respond as quickly as possible. For vision algorithms, a frequency of 10 Hz is generally considered the minimum frame rate required for a vehicle to drive autonomously on urban roads. On the one hand, some vision algorithms have been explicitly designed for real-time performance; on the other hand, performance can be further improved with higher-specification hardware such as GPUs. In addition, various environmental dynamics (such as scene changes, moving obstacles, and illumination variations) should be considered to ensure the accuracy and robustness of the system. Currently, in specific scenarios such as automated valet parking (AVP), cameras are most commonly used for obstacle detection, obstacle avoidance, and lane keeping.

4.2 Positioning

Autonomous driving in urban road scenarios is still at the research stage between L2 and L3, and one of the key issues is that vehicle positioning is still coarse. High-quality autonomous driving is inseparable from accurate positioning: ideally the vehicle should navigate at the centimeter level even in an unmapped environment. This accuracy cannot be achieved by relying solely on conventional GPS receivers with an accuracy of about 10 meters. Expensive differential GPS (DGPS) receivers are often installed to reach it, but this introduces redundancy, whereas the visual SLAM algorithm itself can provide precise positioning. As described in this paper, other GPS-independent methods of relative positioning have been studied, such as visual-inertial fusion, visual-LiDAR fusion, and visual-LiDAR-IMU fusion. In visual-inertial fusion, the drift error introduced by the IMU grows rapidly and degrades accuracy. In visual-LiDAR fusion, positioning robustness cannot be guaranteed because these systems lack dead-reckoning (DR) sensors such as wheel encoders and IMUs. As for visual-LiDAR-IMU fusion, to the best of the paper's knowledge no mature vision-based fusion SLAM algorithm has yet been successfully deployed on real-world autonomous vehicles, although many promising fusion methods have been proposed in recent years. As the cost of LiDAR sensors decreases, we believe that visual-LiDAR-IMU fusion is the ultimate solution for high-precision positioning of autonomous vehicles.
