Autonomous vehicles require precise positioning and mapping in diverse driving environments, and SLAM technology is a strong candidate for meeting this need. LIDAR and camera sensors are the most common choices for localization and perception. However, after one to two decades of development, LIDAR-based SLAM methods appear to have changed relatively little. Compared with LIDAR-based solutions, visual SLAM offers low cost, easy installation, and strong scene recognition capability. Indeed, in the field of autonomous driving, researchers are trying to replace LIDAR sensors with cameras, or to fuse other sensors around a camera-centric system.
Based on the current state of visual SLAM research, this paper reviews visual SLAM technology. Specifically, it first explains the typical structure of a visual SLAM system. It then comprehensively reviews the latest research on visual and vision-based (i.e., visual-inertial, visual-LIDAR, and visual-LIDAR-IMU) SLAM, and compares the positioning accuracy reported in previous work with that of well-known frameworks on public datasets. Finally, the key issues and future development trends of visual SLAM technology for autonomous vehicles are discussed.
01 Introduction
With the development of robotics and artificial intelligence (AI) technologies, autonomous vehicles have become a hot topic in industry and academia (Badue et al., 2021). To navigate safely, a vehicle must build an accurate representation of the surrounding environment and estimate its own state within it (i.e., ego-vehicle localization). Traditional localization methods are based on GPS or real-time kinematic (RTK) positioning systems (Cadena et al., 2016b). However, because of signal reflections, timing errors, and atmospheric conditions, GPS measurement errors can reach on the order of a dozen meters, which is unacceptable for vehicle navigation, especially when the vehicle is driving through tunnels or urban canyons (Cheng et al., 2019). RTK can correct these errors using correction signals from fixed calibration base stations, but such systems rely on additional, costly infrastructure (Infotip Service GmbH, 2019).
SLAM is considered a good solution for the positioning and navigation of autonomous vehicles: it estimates the pose of a moving vehicle in real time while simultaneously building a map of the surrounding environment (Durrant-Whyte and Bailey, 2006). According to the type of sensor, SLAM methods fall mainly into two categories: LIDAR SLAM and visual SLAM. Because LIDAR SLAM was developed earlier than visual SLAM, it is relatively mature in autonomous driving applications (Debeunne and Vivet, 2020a). Compared with cameras, LIDAR sensors are less sensitive to lighting changes and nighttime conditions, and they provide 3D map information over a larger field of view (FOV). However, their high cost and the long development cycle required for large-scale deployment make LIDAR sensors difficult to popularize. In contrast, visual SLAM offers rich information and easy installation, and it makes the overall system cheaper and lighter.
Currently, visual SLAM systems can run on micro personal computers (PCs) and embedded devices, and even on mobile devices such as smartphones (Klein and Murray, 2009). Unlike indoor or outdoor mobile robots, autonomous vehicles operate under more complex conditions, especially when driving autonomously in urban environments: the environment covers a much larger area and contains dynamic obstacles, so existing visual SLAM methods are not yet sufficiently accurate and robust (Cadena et al., 2016a).
Issues such as error accumulation, lighting changes, and rapid motion lead to problematic estimates. Various approaches have been proposed to address these issues in the context of autonomous vehicles, for example, feature-point, direct, semi-direct, and point-line-fusion algorithms for visual odometry (VO) (Singandhupe and La, 2019), and extended Kalman filter (EKF) and graph-based optimization algorithms for pose estimation (Takleh et al., 2018). Meanwhile, vision-based multi-sensor fusion methods have also attracted considerable attention as a way to improve the accuracy of autonomous systems. In a vision-based SLAM system, besides the mapping module, the front end collects sensor data (e.g., from cameras or inertial measurement units (IMUs)) and runs the VO or visual-inertial odometry (VIO) pipeline, while the back end performs optimization and loop closure. Relocalization is usually treated as an additional module that improves the accuracy of visual SLAM systems (Taketomi et al., 2017).
This paper reviews visual SLAM methods, primarily from the perspective of the positioning accuracy of visual SLAM systems. Methods that may be applicable to autonomous driving scenarios are examined in as much detail as possible, including pure visual SLAM, visual-inertial SLAM, and visual-LIDAR-inertial SLAM, and the positioning accuracy reported in previous work is compared with that of well-known methods on public datasets. This review provides a detailed overview of visual SLAM technology and can serve as a friendly guide for new researchers in the field of autonomous vehicles. It can also be used as a reference for experienced researchers looking for possible directions for future work.
02 Principles of Visual SLAM
The classic structure of a visual SLAM system can be divided into five parts: the camera sensor module, the front-end module, the back-end module, the loop closure module, and the mapping module. As shown in Figure 1, the camera sensor module collects image data; the front-end module tracks image features between adjacent frames to obtain an initial estimate of camera motion and perform local mapping; the back-end module numerically optimizes the front-end results and further refines the motion estimate; the loop closure module eliminates accumulated errors by computing image similarity in large-scale environments; and the mapping module reconstructs the surrounding environment (Gao et al., 2017).
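To make the interaction between these five modules more concrete, the following Python sketch wires them together in a minimal processing loop. It is only a structural illustration: all class and method names (FrontEnd, BackEnd, LoopCloser, Mapper, slam_loop, etc.) are hypothetical placeholders and do not correspond to any particular SLAM framework.

```python
# Minimal structural sketch of the classic visual SLAM pipeline described above.
# All class and method names are illustrative placeholders, not a real framework.

class FrontEnd:
    def track(self, prev_frame, curr_frame):
        """Track features between adjacent frames; return a rough motion estimate."""
        ...

class BackEnd:
    def optimize(self, rough_pose, local_map):
        """Numerically refine the front-end estimate (e.g., bundle adjustment)."""
        ...

class LoopCloser:
    def detect_and_correct(self, keyframes):
        """Detect revisited places by image similarity and correct accumulated drift."""
        ...

class Mapper:
    def update(self, pose, frame):
        """Extend the reconstruction of the surrounding environment."""
        ...

def slam_loop(camera, front_end, back_end, loop_closer, mapper):
    prev_frame = camera.grab()        # camera sensor module: image acquisition
    keyframes, local_map = [], []
    while True:
        curr_frame = camera.grab()
        rough_pose = front_end.track(prev_frame, curr_frame)  # front end: VO
        pose = back_end.optimize(rough_pose, local_map)       # back end: optimization
        loop_closer.detect_and_correct(keyframes)             # loop closure
        mapper.update(pose, curr_frame)                       # mapping
        prev_frame = curr_frame
```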
2.1 Camera Sensor
According to sensor type, common visual sensors can be divided mainly into monocular, binocular, RGB-D, and event cameras, as shown in Figure 2. Popular visual sensor manufacturers and products on the market include, but are not limited to, the following:
MYNTAI: S1030 series (stereo camera with IMU), D1000 series (depth camera), D1200 series (for smartphones);
Stereolabs: ZED stereo camera (depth range: 1.5 to 20 meters);
Intel: 200 series, 300 series, Module D400 series, D415 (active infrared binocular, rolling shutter), D435 (active infrared binocular, global shutter), D435i (integrated IMU);
Microsoft: Azure Kinect (with microphone array and IMU), Kinect-v1 (structured light), Kinect-v2 (ToF);
Occipital: Structure camera (used with the iPad);
Samsung: 2nd and 3rd generation dynamic cameras and event-based vision solutions (Son et al., 2017b).
2.2 Frontend
The front end of visual SLAM is known as visual odometry (VO). It roughly estimates the camera motion and the orientation of features based on information from adjacent frames. To obtain accurate poses with a fast response, an efficient VO is required. Currently, front-end methods can be divided mainly into two categories: feature-based methods and direct methods (including semi-direct methods) (Zou et al., 2020). This section mainly reviews feature-based VO methods; semi-direct and direct methods are described later.
A VO system based on feature points runs more stably and is relatively insensitive to lighting changes and dynamic targets. Feature extraction methods with good scale and rotation invariance can greatly improve the reliability and stability of a VO system (Chen et al., 2019). In 1999, Lowe proposed the scale-invariant feature transform (SIFT) algorithm, which was refined and fully developed in 2004 (Lowe, 2004). The algorithm extracts and describes image feature points in three steps: i) construct the scale space with a difference-of-Gaussians pyramid and identify points of interest using the Gaussian differential function; ii) determine the position and scale of each candidate and locate the key points; iii) assign an orientation to each key point and compute its descriptor. SIFT is computationally expensive. SURF (Herbert et al., 2007) improves on SIFT: it addresses SIFT's heavy computational load and poor real-time performance while retaining the excellent properties of the SIFT operator. Nevertheless, SURF still has significant limitations when applied to real-time SLAM systems, which motivated feature extraction algorithms that emphasize computation speed while maintaining adequate performance. In 2011, Viswanathan (2011) presented a local corner detection method based on templates and machine learning, namely the FAST corner detector. FAST takes the pixel under test as the center of a circle of fixed radius; when the grayscale difference between enough pixels on that circle and the center pixel is sufficiently large, the center point is considered a corner. However, FAST corners carry no direction or scale information, so they are neither rotation nor scale invariant.
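To make the trade-off between these detectors concrete, the short OpenCV sketch below runs SIFT and FAST on the same image and reports how many keypoints each finds. It is only an illustrative sketch, assuming opencv-python is installed and that a grayscale test image exists at the placeholder path "frame.png".

```python
# Hedged sketch: comparing SIFT and FAST keypoint detection with OpenCV.
# Assumes opencv-python (cv2) is installed; "frame.png" is a placeholder image path.
import cv2

img = cv2.imread("frame.png", cv2.IMREAD_GRAYSCALE)

# SIFT: scale- and rotation-invariant, but computationally expensive.
sift = cv2.SIFT_create()
sift_kp, sift_desc = sift.detectAndCompute(img, None)

# FAST: very fast corner detector, but provides no orientation or scale information.
fast = cv2.FastFeatureDetector_create(threshold=20)
fast_kp = fast.detect(img, None)

print(f"SIFT keypoints: {len(sift_kp)}, FAST corners: {len(fast_kp)}")
```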
In 2012, Rublee et al. (2012) proposed the oriented FAST and rotated BRIEF (ORB) algorithm, which combines FAST corner points with BRIEF descriptors. The algorithm first builds an image pyramid, then detects FAST key points and computes their feature descriptors. ORB adopts the binary-string BRIEF descriptor (Michael et al., 2010), which is very fast to compute, so ORB is fast enough for real-time feature detection. In addition, ORB is relatively robust to noise, has good rotation and scale invariance, and can be applied in real-time SLAM systems. In 2016, Chien et al. (2016) compared and evaluated the SIFT, SURF, and ORB feature extraction algorithms for VO applications. Extensive tests on the KITTI dataset (Geiger et al., 2013) showed that SIFT extracts features most accurately, while ORB has a much smaller computational cost.
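To relate ORB back to the feature-based VO front end described above, the following sketch matches ORB features between two consecutive frames and recovers the relative camera rotation and translation from the essential matrix. It is a minimal illustrative example, not a complete VO system; the image paths and the intrinsic matrix K are placeholder values rather than calibrated parameters from any specific dataset.

```python
# Hedged sketch: a minimal two-frame ORB front end (feature matching + relative pose).
# Image paths and the intrinsic matrix K below are placeholders for illustration only.
import cv2
import numpy as np

img1 = cv2.imread("frame_0.png", cv2.IMREAD_GRAYSCALE)
img2 = cv2.imread("frame_1.png", cv2.IMREAD_GRAYSCALE)

# Detect ORB keypoints and compute their binary descriptors in both frames.
orb = cv2.ORB_create(nfeatures=2000)
kp1, des1 = orb.detectAndCompute(img1, None)
kp2, des2 = orb.detectAndCompute(img2, None)

# Binary BRIEF-style descriptors are matched with Hamming distance.
matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
matches = sorted(matcher.match(des1, des2), key=lambda m: m.distance)

pts1 = np.float32([kp1[m.queryIdx].pt for m in matches])
pts2 = np.float32([kp2[m.trainIdx].pt for m in matches])

# Placeholder pinhole intrinsics (fx, fy, cx, cy); use calibrated values in practice.
K = np.array([[718.856, 0.0, 607.19],
              [0.0, 718.856, 185.22],
              [0.0, 0.0, 1.0]])

# Estimate the essential matrix with RANSAC, then recover the relative rotation R
# and the (unit-scale) translation direction t between the two frames.
E, mask = cv2.findEssentialMat(pts1, pts2, K, method=cv2.RANSAC,
                               prob=0.999, threshold=1.0)
_, R, t, _ = cv2.recoverPose(E, pts1, pts2, K, mask=mask)
print("Relative rotation:\n", R, "\nTranslation direction:\n", t.ravel())
```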