Research status of visual SLAM

Publisher: RadiantDusk | Last updated: 2024-01-10 | Source: elecfans


3.2 Visual-Inertial SLAM

IMU sensors offer a good solution to tracking failure when the camera moves through challenging environments (low texture and/or changing illumination); conversely, visual sensors can compensate for the accumulated drift of the IMU. Because of this complementarity, the combination of vision and IMU is often called a "golden pair" and has good development prospects in fields such as autonomous driving (Sun and Tian, 2019). The main approach of VI-SLAM is to incorporate IMU information into the front end of the visual SLAM system, which is then also called a visual-inertial odometry (VIO) system. Generally, VI-SLAM systems can be divided into two categories: filter-based methods and optimization-based methods.

3.2.1 Filter-based methods

In 2007, Mourikis and Roumeliotis (2007) proposed the multi-state constraint Kalman filter (MSCKF), the earliest visual-inertial SLAM system based on the extended Kalman filter (EKF) algorithm.

Compared with pure visual odometry, MSCKF (Figure 5 (a)) can tolerate more aggressive motion and short periods of texture loss, and is therefore more robust. In 2012, Stephan (2012) proposed SSF (Figure 5 (b)), a time-delay-compensated single- and multi-sensor fusion framework based on the EKF and a loosely coupled approach. In 2013, Li and Mourikis (2013) pointed out the inconsistency of MSCKF in the state estimation process. In 2017, Paul et al. (2017) proposed MSCKF 2.0, which greatly improved accuracy, consistency, and computational efficiency. In addition, ROVIO (Robust Visual Inertial Odometry) (Bloesch et al., 2015) (Figure 5 (c)) and MSCKF-VIO (Ke et al.) (Figure 5 (d)) are also excellent filter-based works from recent years.
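
To make the filter-based structure concrete, below is a deliberately tiny sketch (Python/NumPy) of the predict/update loop such systems build on: IMU samples drive the EKF prediction at a high rate, and visual measurements correct the state at a lower rate. The 1-D state, noise values, and measurements are toy assumptions for illustration only; this is not the MSCKF algorithm itself.

```python
import numpy as np

class TinyVioEkf:
    """Toy 1-D EKF: IMU-driven prediction, vision-driven correction."""

    def __init__(self, dt=0.005):
        self.x = np.zeros(2)            # state: [position, velocity]
        self.P = np.eye(2) * 0.01       # state covariance
        self.dt = dt
        self.Q = np.diag([1e-6, 1e-4])  # process noise (simplified IMU noise)
        self.R = np.array([[1e-2]])     # measurement noise (visual observation)

    def predict(self, accel):
        """Propagate state and covariance with one accelerometer sample."""
        dt = self.dt
        F = np.array([[1.0, dt],
                      [0.0, 1.0]])
        self.x = F @ self.x + np.array([0.5 * dt**2, dt]) * accel
        self.P = F @ self.P @ F.T + self.Q

    def update(self, z_pos):
        """Correct the state with a (scalar) visual position measurement."""
        H = np.array([[1.0, 0.0]])
        y = np.array([z_pos]) - H @ self.x      # innovation
        S = H @ self.P @ H.T + self.R           # innovation covariance
        K = self.P @ H.T @ np.linalg.inv(S)     # Kalman gain
        self.x = self.x + K @ y
        self.P = (np.eye(2) - K @ H) @ self.P

# Usage: IMU samples arrive at a high rate, visual updates at a lower rate.
ekf = TinyVioEkf()
for k in range(200):
    ekf.predict(accel=0.1)                                   # hypothetical accelerometer reading
    if k % 40 == 0:
        ekf.update(z_pos=0.5 * 0.1 * (k * ekf.dt) ** 2)      # hypothetical visual fix
```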


3.2.2 Optimization-based methods

As far as optimization-based VI-SLAM systems are concerned, the most classic framework is OKVIS. In 2015, Leutenegger et al. proposed OKVIS, which uses IMU measurements to predict the current state, while spatial points and 2D image features constitute the reprojection error. The predicted IMU state and the optimized parameters form the IMU error term, which is then combined with the reprojection error for joint optimization. In 2017, Tong et al. (2017) proposed VINS-Mono, regarded as an excellent monocular VI-SLAM system, which uses an optical-flow front end and a sliding-window-based nonlinear optimization back end (Cheng et al., 2021b). In addition, the initialization method of VINS-Mono is noteworthy: it adopts a disjoint approach (as does VI-ORBSLAM; Mur-Artal and Tardós, 2017), which first initializes the pure-vision subsystem and then estimates the IMU biases (accelerometer and gyroscope), gravity, scale, and velocity.
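
To illustrate the structure of such a tightly coupled cost, the toy sketch below stacks simplified "visual" residuals and an IMU relative-motion residual into a single nonlinear least-squares problem solved with SciPy. The 2-D relative-position "reprojection" model, the synthetic data, and the weights are illustrative assumptions, not the OKVIS or VINS-Mono implementation.

```python
import numpy as np
from scipy.optimize import least_squares

landmarks = np.array([[2.0, 1.0], [3.0, -1.0], [4.0, 0.5]])   # known landmark positions
imu_delta = np.array([1.0, 0.2])                               # IMU-predicted relative motion p1 - p0
obs0 = landmarks - np.array([0.0, 0.0]) + 0.01                 # noisy observations from pose 0
obs1 = landmarks - np.array([1.0, 0.2]) - 0.01                 # noisy observations from pose 1
w_visual, w_imu = 1.0, 10.0                                    # information weights

def residuals(x):
    p0, p1 = x[:2], x[2:]
    r_vis0 = (obs0 - (landmarks - p0)).ravel()       # "reprojection" residuals, pose 0
    r_vis1 = (obs1 - (landmarks - p1)).ravel()       # "reprojection" residuals, pose 1
    r_imu = (p1 - p0) - imu_delta                    # IMU error term
    return np.concatenate([w_visual * r_vis0, w_visual * r_vis1, w_imu * r_imu])

x0 = np.zeros(4)                                     # initial guess for [p0, p1]
sol = least_squares(residuals, x0)
print("estimated poses:", sol.x.reshape(2, 2))
```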

Tests on the KITTI and EuRoC datasets have shown that VINS-Mono achieves positioning accuracy comparable to OKVIS, with more complete and robust initialization and loop-closure stages. In 2019, the VINS-Mono team proposed VINS-Fusion (Tong et al., 2019), a stereo version that integrates GPS information. As shown in Figure 6 (c), the addition of GPS measurements yields good positioning and mapping results in outdoor environments, making it a good candidate for autonomous-vehicle applications. In 2020, Campos et al. (2020) proposed ORB-SLAM3, a feature-based, tightly coupled visual-inertial SLAM system. Its main contributions are a more efficient initialization process based on maximum a posteriori (MAP) estimation and a multi-map capability that relies on a new place-recognition method with improved recall. In addition, the system can perform visual, visual-inertial, and multi-map SLAM using monocular, stereo, and RGB-D cameras. Experimental results for outdoor scenes are shown in Figure 6 (d).

The pipeline of ORB-SLAM3 is similar to that of ORB-SLAM2: the whole system consists of three threads, namely tracking, local mapping, and loop closing. In addition, ORB-SLAM3 can survive long periods of poor visual information: when tracking is lost, it starts a new map, and when a mapped area is revisited, the new map is seamlessly merged with the previous one. Table 3 summarizes the main visual-inertial SLAM algorithms of recent years. At present, optimization-based VI-SLAM methods have become the mainstream. In addition to the methods above, other state-of-the-art works include, but are not limited to, BASALT, Kimera, ICE-BA, Maplab, and StructVIO.


3.3 Testing and Evaluation

To intuitively compare the positioning performance of the above SLAM methods, several representative algorithms were tested on the same onboard computer, equipped with an Intel Core i7-9700 CPU, 16 GB RAM, and Ubuntu 18.04 with ROS Melodic, and compared against one of our previous works (Cheng et al., 2021a). As described in Cheng et al. (2021a), an improved trust-region iteration strategy was proposed based on the traditional Gauss-Newton (GN) linear iteration strategy and integrated into the VI-ORBSLAM framework (Mur-Artal and Tardós, 2017) to achieve faster initialization and higher positioning accuracy. The model of the trust-region iteration strategy is shown in Figure 7. It combines the steepest-descent algorithm and the GN algorithm, approximating the objective function with a model that is trusted only in a region around the current point; at each iteration, a minimization subproblem is solved inside this trust region.
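
As a concrete reference for how steepest descent and Gauss-Newton can be blended inside a trust region, the sketch below implements the classic dogleg step in Python/NumPy. This is a generic textbook formulation operating on an assumed residual vector r and Jacobian J, not the exact iteration strategy of Cheng et al. (2021a).

```python
import numpy as np

def dogleg_step(J, r, radius):
    """One dogleg trust-region step for the least-squares cost 0.5*||r||^2.

    J: Jacobian of the residuals r at the current point; radius: trust-region size.
    """
    g = J.T @ r                                        # gradient of 0.5*||r||^2
    p_gn = -np.linalg.lstsq(J, r, rcond=None)[0]       # Gauss-Newton step
    if np.linalg.norm(p_gn) <= radius:
        return p_gn                                    # GN step already lies inside the region
    alpha = (g @ g) / (g @ (J.T @ J) @ g)              # optimal steepest-descent step length
    p_sd = -alpha * g                                  # Cauchy point
    if np.linalg.norm(p_sd) >= radius:
        return -radius * g / np.linalg.norm(g)         # truncated steepest-descent step
    # Otherwise move from the Cauchy point toward the GN point until the boundary.
    d = p_gn - p_sd
    a, b, c = d @ d, 2 * p_sd @ d, p_sd @ p_sd - radius**2
    tau = (-b + np.sqrt(b * b - 4 * a * c)) / (2 * a)
    return p_sd + tau * d
```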


The initial parameters that need to be estimated include the scale factor, velocity, gravity direction, and the accelerometer and gyroscope biases. To make all of these variables observable, the pure visual ORB-SLAM system first runs for several seconds. The specific steps of the method are as follows. First, visual initialization is performed, including ORB feature extraction, map initialization, and initial pose estimation. Second, the IMU and camera are frequency-aligned using IMU pre-integration, and keyframes are generated. Third, an improved trust-region-based iterative strategy is used to estimate the gyroscope bias, and the gravity direction is refined. Finally, the accelerometer bias and the visual scale are estimated on the basis of the previous estimates. The pipeline of this previous work is shown in Figure 8.
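
For readers unfamiliar with the pre-integration step mentioned above, the heavily simplified sketch below (Python/NumPy) accumulates gyroscope and accelerometer samples between two keyframes into relative rotation, velocity, and position increments. Bias correction and noise-covariance propagation, which a real VI-SLAM initialization handles, are omitted; the function names are illustrative.

```python
import numpy as np

def skew(w):
    """Skew-symmetric matrix of a 3-vector."""
    return np.array([[0, -w[2], w[1]],
                     [w[2], 0, -w[0]],
                     [-w[1], w[0], 0]])

def exp_so3(w):
    """Rodrigues formula: axis-angle vector -> rotation matrix."""
    theta = np.linalg.norm(w)
    if theta < 1e-8:
        return np.eye(3) + skew(w)
    K = skew(w / theta)
    return np.eye(3) + np.sin(theta) * K + (1 - np.cos(theta)) * K @ K

def preintegrate(gyro, accel, dt):
    """gyro, accel: (N, 3) raw IMU samples between two keyframes; dt: sample period."""
    dR = np.eye(3)                 # relative rotation increment
    dv = np.zeros(3)               # relative velocity increment
    dp = np.zeros(3)               # relative position increment
    for w, a in zip(gyro, accel):
        dp += dv * dt + 0.5 * (dR @ a) * dt**2
        dv += (dR @ a) * dt
        dR = dR @ exp_so3(w * dt)
    return dR, dv, dp
```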


The 2D trajectories of the algorithms on the EuRoC V2_01_easy sequence are shown in Figure 9. It can be seen that each algorithm deviates from the ground truth (GT) to a different degree: the trajectory of the paper's algorithm (red line) stays closest to the GT (black dashed line), while VI-ORBSLAM (blue line) drifts the most. The position curves in the X, Y, and Z directions are shown in Figure 10, and the comparison curves of the Euler angles (roll, pitch, and yaw) are shown in Figure 11. Table 4 reports the quantitative root mean square error (RMSE) results and frame rates obtained on the same CPU platform (i7-9700) across all 11 sequences. Because all algorithms are multi-threaded, the third column of Table 4 gives the frame rate when processing the image stream. Figures 12 and 13 show the RMSE and cumulative distribution function (CDF) of the translation error, respectively, and Figures 14 and 15 show the same for the orientation error.
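
As a reference for how the translation RMSE reported in such tables is typically obtained, the sketch below compares a time-synchronized estimated trajectory with ground truth and takes the root mean square of the per-frame position errors. The alignment here only removes the mean rather than performing the full Umeyama/SE(3) alignment used by standard evaluation tools, and the data are synthetic.

```python
import numpy as np

def translation_rmse(estimated, ground_truth):
    """estimated, ground_truth: time-synchronized (N, 3) position arrays."""
    est = estimated - estimated.mean(axis=0)         # crude alignment: remove the mean
    gt = ground_truth - ground_truth.mean(axis=0)
    err = np.linalg.norm(est - gt, axis=1)            # per-frame translation error
    return np.sqrt(np.mean(err ** 2))

# Hypothetical usage with synthetic trajectories:
gt = np.cumsum(np.random.randn(500, 3) * 0.01, axis=0)
est = gt + np.random.randn(500, 3) * 0.02
print(f"ATE RMSE: {translation_rmse(est, gt):.3f} m")
```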

It can be seen that the paper's previous work, a fast monocular visual-inertial system with an improved iterative initialization strategy, achieved the best positioning accuracy on almost all sequences. In fact, thanks to the excellent initialization process, the paper's method provides the best translation and orientation performance on six and seven sequences, respectively, and the system can quickly resume operation even when it cannot extract ORB features.


3.4 Vision-LiDAR SLAM

Vision and LiDAR have complementary strengths. Vision can obtain a large amount of texture information from the environment and has strong scene-recognition capability, while LiDAR does not depend on ambient lighting, offers good reliability, and provides higher range-measurement accuracy. Therefore, in the field of autonomous driving, a SLAM system that fuses vision and LiDAR can provide a smarter and more reliable environment-perception and state-estimation solution. Such systems follow the classic SLAM architecture with three main steps: (i) data processing; (ii) estimation; (iii) global mapping. According to the relative weight of vision and LiDAR in the system, vision-LiDAR SLAM schemes can be divided into three categories: vision-guided methods, LiDAR-guided methods, and vision-LiDAR mutual-correction methods.

3.4.1 Vision-guided methods

Visual SLAM, especially monocular visual SLAM, cannot effectively recover the depth of feature points, whereas LiDAR excels at exactly this. To compensate for this shortcoming, researchers have tried to integrate LiDAR data into visual SLAM systems. The representative work of vision-guided SLAM is LIMO (Graeter et al., 2018). This method projects the spatial point cloud obtained by the LiDAR onto the image plane to estimate the scale (depth) of the visual features; the difference between the feature scale recovered by the LiDAR and the scale estimated from the camera pose is then constructed as an error term and used as a constraint in back-end optimization.
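
The core depth-association step can be sketched as follows (Python/NumPy): LiDAR points are transformed into the camera frame with calibrated extrinsics and projected with a pinhole model, and each image feature then looks up the depth of nearby projected points. The function names, the median-depth heuristic, and the 5-pixel radius are illustrative assumptions, not LIMO's actual implementation.

```python
import numpy as np

def project_lidar_to_image(points_lidar, T_cam_lidar, K):
    """points_lidar: (N, 3); T_cam_lidar: 4x4 extrinsics; K: 3x3 intrinsics.

    Returns pixel coordinates (M, 2) and depths (M,) of points in front of the camera.
    """
    pts_h = np.hstack([points_lidar, np.ones((len(points_lidar), 1))])
    pts_cam = (T_cam_lidar @ pts_h.T).T[:, :3]            # LiDAR frame -> camera frame
    in_front = pts_cam[:, 2] > 0.1                        # keep points in front of the camera
    pts_cam = pts_cam[in_front]
    uv = (K @ pts_cam.T).T
    uv = uv[:, :2] / uv[:, 2:3]                           # perspective division
    return uv, pts_cam[:, 2]

def depth_for_feature(feature_uv, lidar_uv, lidar_depth, radius_px=5.0):
    """Assign a feature the median depth of projected LiDAR points within radius_px pixels."""
    d = np.linalg.norm(lidar_uv - feature_uv, axis=1)
    near = d < radius_px
    return np.median(lidar_depth[near]) if near.any() else None
```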

Shin et al. (2018b) proposed a method that uses LiDAR to provide sparse depth point clouds for visual SLAM. Since the resolution of the camera is much higher than that of the LiDAR, a large number of pixels are left without depth information. To solve this problem, De Silva et al. (2018) used a Gaussian regression model to interpolate the missing depth values after computing the geometric transformation between the two sensors. This method uses LiDAR to directly initialize the features detected in the image, achieving an effect comparable to that of an RGB-D sensor. There are also studies that integrate LiDAR into visual SLAM to improve the practical value of the solution, for example by reducing cost, improving performance, and enhancing system robustness.
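
The interpolation idea can be sketched with scikit-learn's Gaussian process regressor: sparse depths at projected LiDAR pixel locations are fitted and then queried at other pixels so that more features receive a depth value. The kernel choice and hyperparameters are illustrative assumptions, not the exact model of De Silva et al. (2018).

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

def interpolate_depth(sparse_uv, sparse_depth, query_uv):
    """sparse_uv: (N, 2) pixels with known LiDAR depth; query_uv: (M, 2) pixels to fill in."""
    kernel = RBF(length_scale=20.0) + WhiteKernel(noise_level=0.01)   # assumed kernel
    gpr = GaussianProcessRegressor(kernel=kernel, normalize_y=True)
    gpr.fit(sparse_uv, sparse_depth)
    depth_mean, depth_std = gpr.predict(query_uv, return_std=True)
    return depth_mean, depth_std   # predicted depth and its uncertainty per query pixel
```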
