Research status of visual SLAM


Therefore, since embedded computers have limited computing power, ORB is considered the more suitable descriptor for autonomous-vehicle applications. Other image feature descriptors used for VO include, but are not limited to, DAISY (Tola et al., 2010), ASIFT (Morel and Yu, 2009), MROGH (Fan et al., 2011a), HARRIS (Wang et al., 2008), LDAHash (Fan et al., 2011b), D-BRIEF (Trzcinski and Lepetit, 2012), VLFeat (Vedaldi and Fulkerson, 2010), FREAK (Alahi et al., 2012), Shape Context (Belongie et al., 2002), and PCA-SIFT (Ke and Sukthankar, 2004).
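Since the low computational cost of ORB is the point being made here, the following minimal sketch shows ORB keypoint and descriptor extraction with OpenCV. The image path and feature budget are placeholder assumptions, not values from any cited work.

```python
# Minimal ORB extraction sketch (OpenCV); "frame.png" and nfeatures are placeholders.
import cv2

img = cv2.imread("frame.png", cv2.IMREAD_GRAYSCALE)

# ORB pairs the FAST detector with a rotation-aware binary BRIEF descriptor,
# which keeps extraction cheap enough for embedded platforms.
orb = cv2.ORB_create(nfeatures=1000)
keypoints, descriptors = orb.detectAndCompute(img, None)

# Each descriptor is 32 bytes (256 bits) and is matched with the Hamming distance.
print(len(keypoints), "keypoints, descriptor array shape:", descriptors.shape)
```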

2.3 Backend

The backend receives the camera poses estimated by the frontend and optimizes them to obtain a globally consistent motion trajectory and environment map (Sunderhauf and Protzel, 2012). Compared with the diverse frontend algorithms, current backend algorithms fall mainly into two categories: filter-based methods (such as the extended Kalman filter (EKF); Bailey et al., 2006) and optimization-based methods (such as factor graphs; Wrobel, 2001). They are described as follows. Filter-based methods mainly use the Bayesian principle to estimate the current state from the previous state and the current observation data (Liu, 2019).

Typical filter-based methods include the extended Kalman filter (EKF) (Bailey et al., 2006), the unscented Kalman filter (UKF) (Wan and Merwe, 2000), and the particle filter (PF) (Arnaud et al., 2000). The typical EKF-based SLAM method, for example, is relatively successful in small-scale environments. However, because it stores the full covariance matrix, its memory requirement grows with the square of the number of state variables, which has always limited its application in large unknown scenes. For optimization-based methods, the core idea of nonlinear optimization (graph optimization) is to cast the backend problem as a graph: the vehicle poses and environmental features at different times are the vertices, and the constraints between them are represented by edges (Liang et al., 2013). After the graph is constructed, an optimization algorithm solves for the poses so that the states at the vertices best satisfy the constraints on the corresponding edges. Once the optimization has been run, the resulting graph gives the motion trajectory and the environment map. At present, most mainstream visual SLAM systems use nonlinear optimization methods.
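To make the graph-optimization idea concrete, here is a minimal toy sketch (not taken from any cited system): four 1-D poses act as vertices, odometry measurements and one loop-closure measurement act as edges, and a least-squares solver adjusts the poses until they best satisfy the edge constraints.

```python
# Toy 1-D pose-graph optimization: vertices are poses, edges are relative-motion
# constraints, and least squares makes the poses satisfy the edges as well as possible.
import numpy as np
from scipy.optimize import least_squares

# Edges: (i, j, measured displacement x_j - x_i). The last edge is a "loop closure"
# that contradicts the slightly drifting odometry and pulls the trajectory back.
edges = [(0, 1, 1.0), (1, 2, 1.1), (2, 3, 0.9), (0, 3, 2.8)]

def residuals(x):
    # One residual per edge: measurement minus the relative displacement
    # predicted by the current pose estimates.
    return np.array([meas - (x[j] - x[i]) for i, j, meas in edges])

x_init = np.array([0.0, 1.0, 2.1, 3.0])   # initial guess accumulated from odometry
sol = least_squares(residuals, x_init)
print(sol.x - sol.x[0])                   # optimized poses, anchored at the first pose
```

Real systems do the same with 6-DoF poses and 3-D landmarks, typically through libraries such as g2o, Ceres, or GTSAM.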

2.4 Loop closure

The task of loop closure is to let the system recognize the current scene from sensor information and, when the vehicle returns to a previously visited position, determine that the area has been visited before, thereby eliminating the accumulated error of the SLAM system (Newman and Ho, 2005). For visual SLAM, traditional loop closure detection methods mainly use the bag-of-words (BoW) model (Galvez-Lopez and Tardos, 2012), which is implemented as follows: i) construct a vocabulary of K words by K-means clustering of the local features extracted from the images; ii) represent each image as a K-dimensional numerical vector based on the number of occurrences of each word; iii) measure the difference between scenes by comparing these vectors and identify whether the autonomous vehicle has returned to an already visited scene.
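A minimal sketch of steps i)-iii) is given below, assuming ORB local features, a scikit-learn k-means vocabulary, and cosine similarity with an arbitrary threshold. Production systems such as DBoW2 additionally use hierarchical vocabularies and TF-IDF weighting, which are omitted here.

```python
# Bag-of-words loop detection sketch following steps i)-iii) above.
# Image paths, K, and the similarity threshold are placeholder assumptions.
import cv2
import numpy as np
from sklearn.cluster import KMeans

def orb_descriptors(path):
    img = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
    _, desc = cv2.ORB_create(500).detectAndCompute(img, None)
    return desc.astype(np.float32)

keyframe_paths = ["kf_%03d.png" % i for i in range(10)]     # keyframes seen so far

# i) build a K-word vocabulary by clustering the local descriptors
K = 64
vocabulary = KMeans(n_clusters=K, n_init=4).fit(
    np.vstack([orb_descriptors(p) for p in keyframe_paths]))

def bow_vector(path):
    # ii) K-dimensional histogram of word occurrences, L1-normalized
    words = vocabulary.predict(orb_descriptors(path))
    hist = np.bincount(words, minlength=K).astype(np.float32)
    return hist / max(hist.sum(), 1.0)

# iii) compare the current image against stored keyframes via cosine similarity
current = bow_vector("current.png")
for path in keyframe_paths:
    candidate = bow_vector(path)
    score = float(current @ candidate /
                  (np.linalg.norm(current) * np.linalg.norm(candidate) + 1e-9))
    if score > 0.8:                     # threshold chosen arbitrarily for the sketch
        print("possible loop closure with", path, "score", round(score, 3))
```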

2.5 Mapping

A fundamental component of autonomous vehicles is the ability to build a map of the environment and localize on the map. Mapping is one of the two tasks of a visual SLAM system (i.e., localization and mapping), and it plays an important role in navigation, obstacle avoidance, and environment reconstruction for autonomous driving. In general, map representations can be divided into two categories: metric maps and topological maps. Metric maps describe the relative positional relationships between map elements, while topological maps emphasize the connectivity between map elements. For classic SLAM systems, metric maps can be further divided into sparse maps and dense maps. Sparse maps contain only a small amount of information in the scene, which is suitable for localization, while dense maps contain more information, which is beneficial for vehicles to perform navigation tasks based on the map.
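As a toy illustration of the sparse-versus-dense distinction, the sketch below stores the same obstacle first as a landmark in a sparse map and then as a cell in a dense occupancy grid; the names, resolution, and grid size are illustrative assumptions, not part of any cited system.

```python
# Sparse landmark map vs. dense occupancy grid (toy metric-map illustration).
import numpy as np

# Sparse map: a few 3-D landmark positions keyed by feature id, enough for localization.
sparse_map = {101: np.array([2.0, 0.5, 1.2]),
              102: np.array([3.1, -0.4, 0.9])}

# Dense map: a 2-D occupancy grid at 0.1 m resolution (0 = free, 1 = occupied),
# which a vehicle can use directly for navigation and obstacle avoidance.
resolution = 0.1
grid = np.zeros((200, 200), dtype=np.uint8)

def mark_occupied(x, y):
    """Mark the grid cell containing world point (x, y) as occupied."""
    grid[int(y / resolution), int(x / resolution)] = 1

mark_occupied(2.0, 0.5)   # the first landmark, stored densely this time
print(len(sparse_map), "landmarks,", int(grid.sum()), "occupied cell(s)")
```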

03 SOTA Research

3.1 Visual SLAM

Similar to the VO subsystem described above, pure visual SLAM systems can be divided into two categories according to how they use image information: feature-based methods and direct methods. Feature-based methods estimate the camera motion between adjacent frames and build the environment map by extracting and matching feature points. Their disadvantage is that extracting feature points and computing descriptors is time-consuming. Therefore, some researchers proposed abandoning the computation of keypoints and descriptors, which gave rise to direct methods (Zou et al., 2020).
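To make the feature-based pipeline concrete, the following sketch extracts and matches ORB features between two adjacent frames and recovers the relative camera motion from the essential matrix using OpenCV. The image paths and camera intrinsics are placeholder assumptions.

```python
# Feature-based frontend step: ORB matching between adjacent frames, then
# relative pose from the essential matrix. Paths and intrinsics are placeholders.
import cv2
import numpy as np

K = np.array([[700.0,   0.0, 320.0],
              [  0.0, 700.0, 240.0],
              [  0.0,   0.0,   1.0]])        # assumed pinhole intrinsics

img1 = cv2.imread("frame_t0.png", cv2.IMREAD_GRAYSCALE)
img2 = cv2.imread("frame_t1.png", cv2.IMREAD_GRAYSCALE)

orb = cv2.ORB_create(1000)
kp1, des1 = orb.detectAndCompute(img1, None)
kp2, des2 = orb.detectAndCompute(img2, None)

# Brute-force Hamming matching of the binary ORB descriptors.
matches = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True).match(des1, des2)
pts1 = np.float32([kp1[m.queryIdx].pt for m in matches])
pts2 = np.float32([kp2[m.trainIdx].pt for m in matches])

# Essential matrix with RANSAC, then decomposition into rotation R and a
# unit-scale translation direction t between the two frames.
E, mask = cv2.findEssentialMat(pts1, pts2, K, method=cv2.RANSAC, threshold=1.0)
_, R, t, _ = cv2.recoverPose(E, pts1, pts2, K, mask=mask)
print("relative rotation:\n", R, "\ntranslation direction:", t.ravel())
```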

In addition, according to sensor type, visual SLAM can be divided into monocular, binocular (stereo), RGB-D, and event-camera-based methods; according to map density, it can be divided into sparse, dense, and semi-dense SLAM. These are introduced as follows:

3.1.1 Feature-based methods

In 2007, Davison et al. (2007) proposed the first real-time monocular visual SLAM system, MonoSLAM. The result of its real-time feature patch orientation estimation is shown in Figure 3(a). The EKF is used in the backend to track the sparse feature points obtained from the frontend, with the camera pose and landmark positions as state variables whose mean and covariance are updated. In the same year, Klein and Murray (2007) proposed PTAM, a parallel tracking and mapping system that runs tracking and mapping in parallel; its feature extraction and mapping process is shown in Figure 3(b). For the first time, the frontend and the backend were separated, with the backend based on nonlinear optimization, and a keyframe mechanism was proposed.

Keyframes are linked in sequence to optimize the motion trajectory and feature positions, and many subsequent visual SLAM system designs have adopted similar approaches. In 2015, Mur-Artal et al. (2015) proposed ORB-SLAM, a relatively complete keyframe-based monocular SLAM method. Compared with the dual-thread mechanism of PTAM, this method divides the whole system into three threads: tracking, mapping, and loop closure. It should be noted that feature extraction and matching, map construction, and loop detection are all based on ORB features. Figure 3(c) shows the real-time feature extraction process (left column) and the trajectory tracking and mapping results (right column) of a monocular camera in a university road environment.

In 2017, Mur-Artal and Tardos proposed the follow-up version ORB-SLAM2 (Mur-Artal and Tardos, 2017). This version supports loop detection and relocalization, has real-time map-reuse capability, and the improved framework adds interfaces for stereo and RGB-D cameras. The left column of Figure 3(d) shows the stereo trajectory estimation and feature extraction of ORB-SLAM2, and the right column shows the keyframes and dense point cloud mapping results of an RGB-D camera in an indoor scene. The consecutive green squares in the figure form the keyframe trajectory, and the dense 3D scene map built from the RGB-D camera surrounds the keyframes.

[Figure 3: feature-based visual SLAM systems. (a) MonoSLAM; (b) PTAM; (c) ORB-SLAM; (d) ORB-SLAM2]

3.1.2 Direct-based methods

In 2011, Newcombe et al. (2011b) proposed DTAM, a monocular SLAM framework based on the direct method. Unlike feature-based methods, DTAM uses an inverse-depth parameterization to estimate feature depth; the camera pose is computed by direct image alignment, and a dense map is built with an optimization-based method (Figure 4(a)). In 2014, Engel et al. (2014) proposed LSD-SLAM (Figure 4(b)), a successful application of the direct method in a monocular visual SLAM framework, which applies a pixel-oriented approach to a semi-dense monocular SLAM system. Compared with feature-based methods, LSD-SLAM is less sensitive to weak feature conditions, but the system is fragile when the camera intrinsics or the illumination change. In 2017, Forster et al. (2017) proposed SVO (Semi-direct Visual Odometry). It uses a sparse direct method (also called a semi-direct method) to track keypoints (bottom of Figure 4(c)) and estimates the pose from the information around those keypoints. The top of Figure 4(c) shows the trajectory and sparse map in an indoor environment. Since the semi-direct method tracks sparse features and neither computes descriptors nor processes dense information, SVO has lower time complexity and stronger real-time performance.

In 2016, Engel et al. (2018) proposed DSO, which also uses a sparse direct method to achieve higher accuracy at faster operating speeds. However, SVO and DSO are only visual odometry systems: lacking backend optimization and loop closure modules, their tracking error accumulates over time. Figure 4(d) shows the 3D reconstruction and tracking results of DSO (monocular visual odometry). The direct method has the advantages of fast computation and insensitivity to weak feature conditions, but it rests on the strong assumption that the grayscale of a point remains unchanged across images, so it is very sensitive to changes in illumination. In contrast, the feature-point method offers better invariance to such changes.
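The grayscale-constancy assumption mentioned above can be written as a per-pixel photometric residual. The sketch below is a toy numpy version under an assumed pinhole model; it is not the implementation used in DTAM, LSD-SLAM, SVO, or DSO.

```python
# Photometric residual under the grayscale-constancy assumption: a pixel warped
# from frame 1 into frame 2 via its depth and the relative pose should keep its
# intensity. Intrinsics and the test values are placeholder assumptions.
import numpy as np

K = np.array([[700.0,   0.0, 320.0],
              [  0.0, 700.0, 240.0],
              [  0.0,   0.0,   1.0]])            # assumed pinhole intrinsics

def photometric_residual(I1, I2, u, v, depth, R, t):
    """Intensity difference I1(u, v) - I2(warped (u, v)) for a single pixel."""
    # Back-project pixel (u, v) of frame 1 to a 3-D point using its depth.
    p1 = depth * np.linalg.inv(K) @ np.array([u, v, 1.0])
    # Transform the point into frame 2 and re-project it.
    p2 = K @ (R @ p1 + t)
    u2, v2 = p2[:2] / p2[2]
    # Nearest-neighbour lookup for brevity; real systems interpolate sub-pixel values.
    return float(I1[v, u] - I2[int(round(v2)), int(round(u2))])

# Sanity check: identical images and identity motion give a zero residual.
I = np.random.rand(480, 640).astype(np.float32)
r = photometric_residual(I, I, u=320, v=240, depth=2.0, R=np.eye(3), t=np.zeros(3))
print("residual:", r)
```

Direct methods minimize the sum of such residuals over many pixels with respect to the camera pose (and, in dense or semi-dense systems, the pixel depths), which is why they break down when the constancy assumption is violated by exposure or illumination changes.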

In 2020, Zubizarreta et al. (2020) proposed the direct sparse mapping method DSM, a complete monocular visual SLAM system based on the photometric bundle adjustment (PBA) algorithm. Table 1 summarizes the main features of state-of-the-art visual SLAM frameworks together with their advantages and disadvantages. Beyond the typical frameworks above, other related work includes (i) sparse visual SLAM, (ii) semi-dense visual SLAM, and (iii) dense visual SLAM. As can be seen, the field of visual SLAM has produced many results, and this paper reviews only the popular methods. Even though visual SLAM provides good localization and mapping results, all of these solutions have advantages and disadvantages. In this work, the advantages and disadvantages of sparse-based, dense-based, feature-based, direct-based, monocular, stereoscopic, RGB-D, and event-camera methods are summarized in Table 2.
