SLAM Frameworks: A Comparison of Common Solutions


SLAM is the first problem a robot encounters when entering an unknown environment, whether indoors, outdoors, in the air, or underwater. This article introduces the basics of SLAM sensors and the visual SLAM framework.

In recent years, robotics has developed rapidly around the world, and great effort has gone into deploying robots in practical scenarios: from indoor mobile robots to outdoor vehicles, aerial and underwater exploration robots, all of which have received widespread attention.

In most cases, we encounter a basic difficulty when working on robots: localization and mapping, the so-called SLAM problem. Without accurate localization and a map, a sweeping robot cannot move autonomously around a room and can only bump into things at random, and a household robot cannot reach a specified room on command. Likewise, in virtual and augmented reality, without the localization provided by SLAM, users cannot roam through a scene. In these application areas, SLAM supplies spatial positioning information to the application layer and uses its map for map construction or scene generation.

Sensors

When we talk about SLAM, the first question is about the sensor. How SLAM is implemented, and how difficult it is, depend closely on the type of sensor and how it is mounted. Sensors fall into two main categories, laser and vision, and vision is further divided into three sub-directions. Let us look at the characteristics of each member of this large family.

1. LiDAR Sensors

LiDAR is the oldest and most thoroughly studied SLAM sensor. It provides distance information between the robot and obstacles in the surrounding environment. Common LiDARs, such as those from SICK and Velodyne, and the domestic rplidar, can all be used for SLAM. A LiDAR measures the angle and distance of obstacles around the robot, making SLAM, obstacle avoidance, and related functions straightforward to implement. Mainstream 2D laser sensors scan obstacles within a single plane, which suits robots with planar motion (such as sweeping robots) that need to localize and build a 2D grid map. This kind of map is very practical for robot navigation, since most robots cannot fly or climb stairs and remain confined to the ground.

In the history of SLAM research, early work almost exclusively used laser sensors for mapping, mostly with filtering methods such as the Kalman filter and the particle filter. The advantages of lasers are high accuracy, high speed, and low computational cost, which make real-time SLAM easy to achieve. The disadvantage is price: a laser can easily cost tens of thousands of yuan, which greatly increases the cost of a robot, so much laser-related research focuses on reducing sensor cost. Because it was studied early, EKF-SLAM theory for lasers is now very mature. At the same time, its shortcomings are also well understood: loops are difficult to represent, linearization errors can be severe, and the covariance matrix of the landmarks must be maintained, which incurs nontrivial space and time overhead.
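As a toy sketch of how a planar laser scan becomes a 2D grid map, the Python snippet below (NumPy only, assuming the robot pose is known) marks the cells hit by a fake 180-degree scan as occupied; the resolution and ranges are illustrative, not from a real sensor.

import numpy as np

# A fake 180-degree planar scan: one range reading per degree, all hitting
# a wall 3 m away. A real driver would supply these values from the LiDAR.
angles = np.linspace(-np.pi / 2, np.pi / 2, 181)
ranges = np.full_like(angles, 3.0)

resolution = 0.05                      # 5 cm per grid cell
grid = np.zeros((200, 200), dtype=np.uint8)
origin = np.array([100, 100])          # robot sits at the center of the grid

# Convert polar readings to Cartesian hit points, then to grid indices.
hits = np.stack([ranges * np.cos(angles), ranges * np.sin(angles)], axis=1)
cells = (hits / resolution).astype(int) + origin
grid[cells[:, 1], cells[:, 0]] = 1     # mark scan endpoints as occupied

print(grid.sum(), "cells marked occupied")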

2. Vision Sensors

Visual SLAM is one of the hot topics of SLAM research in the 21st century. On the one hand, vision is very intuitive, and it makes people wonder: if humans can recognize a road with their eyes, why can't a robot? On the other hand, with the increase in processing speed, many vision algorithms that were once considered too slow for real-time use can now run at 10 Hz or above. This improvement in computing power has also driven the development of visual SLAM.

In terms of sensors, visual SLAM research falls mainly into three categories: monocular, binocular (or multi-camera), and RGBD. There are also special cameras such as fisheye and panoramic cameras, but they remain a minority in research and development. In addition, visual SLAM combined with an inertial measurement unit (IMU) is one of the current research hotspots. In terms of implementation difficulty, the three approaches can be roughly ranked as: monocular > binocular > RGBD.

Monocular camera: Monocular SLAM, abbreviated MonoSLAM, means completing SLAM with a single camera. Its advantage is that the sensor is extremely simple and cheap, so monocular SLAM has always been popular with researchers. Compared with other visual sensors, the biggest problem with a monocular camera is that it cannot accurately obtain depth. This is a double-edged sword.

On the one hand, since the absolute depth is unknown, monocular SLAM cannot recover the true scale of the robot's trajectory or of the map. Intuitively, if the trajectory and the room are both scaled by a factor of two, the monocular images look exactly the same. Therefore, monocular SLAM can only estimate relative depth and must solve the problem in the similarity transformation space Sim(3) rather than the Euclidean space SE(3) of rigid motions. If we must solve it in SE(3), external information, such as GPS or an IMU, is needed to determine the scale of the trajectory and the map.
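As a minimal NumPy sketch of this scale ambiguity (with made-up intrinsics and points), the snippet below shows that scaling the scene and the camera translation by the same factor leaves every pixel observation unchanged, which is exactly why images alone cannot fix the absolute scale.

import numpy as np

K = np.array([[500.0, 0.0, 320.0],      # assumed pinhole intrinsics
              [0.0, 500.0, 240.0],
              [0.0, 0.0, 1.0]])

def project(points_w, R, t):
    """Project world points (N x 3) with extrinsics [R | t] and intrinsics K."""
    p_cam = points_w @ R.T + t          # world frame -> camera frame
    p_img = p_cam @ K.T                 # camera frame -> homogeneous pixels
    return p_img[:, :2] / p_img[:, 2:]  # perspective division

points = np.array([[1.0, 0.5, 4.0],     # a few made-up landmarks
                   [-0.8, 0.2, 6.0],
                   [0.3, -0.4, 5.0]])
R = np.eye(3)                           # no rotation between the two views
t = np.array([0.2, 0.0, 0.0])           # a small translation of the camera

s = 2.0                                 # blow the whole world up by a factor of 2
pix_original = project(points, R, t)
pix_scaled = project(s * points, R, s * t)

print(np.allclose(pix_original, pix_scaled))   # True: the images are identical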

On the other hand, a monocular camera cannot recover the distance of objects from a single image. To estimate this relative depth, monocular SLAM relies on triangulation across motion to solve for the camera motion and estimate the spatial positions of pixels. In other words, its trajectory and map only converge after the camera has moved; if the camera does not move, pixel depths cannot be determined. Moreover, the motion cannot be a pure rotation, which creates some trouble for applications of monocular SLAM. Fortunately, in everyday use the camera usually both rotates and translates. The inability to determine depth also has an advantage: it makes monocular SLAM insensitive to the scale of the environment, so it can be used both indoors and outdoors.
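The snippet below isolates the triangulation step using OpenCV's cv2.triangulatePoints: two assumed camera poses observe one synthetic world point, and its 3D position is recovered from the two projections. With zero baseline the problem becomes degenerate, which is why a purely rotating camera cannot recover depth.

import numpy as np
import cv2

K = np.array([[500.0, 0.0, 320.0],
              [0.0, 500.0, 240.0],
              [0.0, 0.0, 1.0]])

# Projection matrices of two assumed views: the second camera is translated
# along x, giving a non-zero baseline.
P1 = K @ np.hstack([np.eye(3), np.zeros((3, 1))])
P2 = K @ np.hstack([np.eye(3), np.array([[-0.5], [0.0], [0.0]])])

X = np.array([[0.3], [-0.2], [4.0], [1.0]])    # one synthetic world point

x1 = P1 @ X
x1 = x1[:2] / x1[2]                            # its pixel in view 1
x2 = P2 @ X
x2 = x2[:2] / x2[2]                            # its pixel in view 2

X_hat = cv2.triangulatePoints(P1, P2, x1, x2)  # 4 x 1 homogeneous estimate
X_hat = X_hat[:3] / X_hat[3]
print(X_hat.ravel())                           # ~ [0.3, -0.2, 4.0]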

Binocular camera: Compared with a monocular camera, a binocular (stereo) camera estimates the position of a point in space using the baseline between the two cameras. Unlike the monocular case, stereo vision can estimate depth both while moving and while stationary, which removes many of the troubles of monocular vision. However, configuring and calibrating a binocular or multi-camera rig is relatively complex, and the usable depth range is limited by the baseline and the image resolution. Computing per-pixel distances from a stereo pair is also computationally intensive, and is now commonly accelerated with dedicated hardware such as an FPGA or GPU.
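A quick sketch of the underlying relation for a rectified stereo pair, depth = f * B / d, shows why the usable range is tied to baseline and resolution; the focal length, baseline, and disparities below are illustrative rather than from a real calibration.

fx = 500.0          # focal length in pixels (assumed)
baseline = 0.12     # distance between the two cameras, in meters (assumed)

def stereo_depth(disparity_px):
    """Depth in meters for a given pixel disparity; bigger disparity = closer."""
    return fx * baseline / disparity_px

for d in (60.0, 15.0, 3.0):
    print(f"disparity {d:5.1f} px -> depth {stereo_depth(d):6.2f} m")

# At small disparities (far objects) a matching error of +/- 0.5 px already
# shifts the estimated depth by meters, which is what limits the depth range.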

RGBD camera: RGBD = RGB + Depth map. RGBD cameras are a type of camera that began to emerge around 2010. Their defining feature is that they directly measure the distance from the camera to each pixel in the image, using structured light or the Time-of-Flight (ToF) principle. They therefore provide richer information than a traditional camera and avoid the time-consuming and laborious depth computation required by monocular or binocular setups. Commonly used RGBD cameras include the Kinect/Kinect V2 (Microsoft) and the Xtion (ASUS). However, most RGBD cameras still suffer from a narrow measurement range, high noise, and a small field of view; because of the range limitation, they are mainly used for indoor SLAM.
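The sketch below shows why no extra depth computation is needed with an RGBD camera: every pixel's measured depth is simply back-projected through assumed pinhole intrinsics into a 3D point. The depth image here is synthetic.

import numpy as np

fx, fy, cx, cy = 500.0, 500.0, 320.0, 240.0   # assumed intrinsics
depth = np.full((480, 640), 2.0)              # fake depth map: 2 m everywhere

v, u = np.indices(depth.shape)                # per-pixel row/column coordinates
z = depth
x = (u - cx) * z / fx                         # back-project through the pinhole model
y = (v - cy) * z / fy
points = np.stack([x, y, z], axis=-1).reshape(-1, 3)

print(points.shape)   # (307200, 3): one 3D point per pixel, no triangulation needed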

Visual SLAM Framework

Visual SLAM follows an almost standard basic framework. A SLAM system is divided into four modules (excluding sensor data acquisition): visual odometry (VO), backend, map construction, and loop closure. Here we briefly introduce the role of each module; the following sections describe each in more detail.

Visual Odometry in SLAM Framework

Visual Odometry (VO), or the visual odometer, estimates the relative motion (ego-motion) of the robot between two moments. In laser SLAM, we can match the current scan against the global map and use a method such as ICP to solve for the relative motion. For a camera moving in Euclidean space, we usually need to estimate a transformation matrix in three-dimensional space, an element of SE(3), or of Sim(3) in the monocular case. Solving for this matrix is the core problem of VO, and the approaches divide into feature-based methods and direct methods that do not use features.
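As a small sketch of the two transformation groups just mentioned, the snippet below builds a 4 x 4 SE(3) matrix (rotation plus translation) and a Sim(3) matrix (the same plus a scale factor, the quantity a monocular system cannot observe); the angle, translation, and scale are arbitrary examples.

import numpy as np

def rot_z(theta):
    """Rotation by theta radians about the z axis."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])

def se3(R, t):
    """Homogeneous 4x4 rigid-body transform [R t; 0 1]."""
    T = np.eye(4)
    T[:3, :3] = R
    T[:3, 3] = t
    return T

def sim3(s, R, t):
    """Similarity transform [sR t; 0 1]: SE(3) plus an extra scale."""
    T = se3(R, t)
    T[:3, :3] *= s
    return T

T_se3 = se3(rot_z(0.1), [0.2, 0.0, 0.0])
T_sim3 = sim3(1.5, rot_z(0.1), [0.2, 0.0, 0.0])

p = np.array([1.0, 0.0, 0.0, 1.0])   # a homogeneous 3D point
print(T_se3 @ p)                     # rigid motion: lengths are preserved
print(T_sim3 @ p)                    # same motion plus a scale of 1.5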

Feature Matching

The feature-based approach is currently the mainstream for VO. Given two images, features are first extracted from each, and the camera's transformation matrix is then computed from the feature matches between them. The most commonly used features are point features such as Harris corners, SIFT, SURF, and ORB. If an RGBD camera is used, the camera motion can be estimated directly from feature points with known depth. Given a set of feature points and their correspondences, solving for the camera pose is known as the PnP problem (Perspective-n-Point). PnP can be solved with nonlinear optimization to obtain the pose relationship between the two frames.
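The sketch below walks through this feature-based pipeline with OpenCV: ORB features are extracted and matched, then cv2.solvePnPRansac recovers the relative pose. The image file names and intrinsics are placeholders, and the constant 2 m depth assigned to the first frame's keypoints stands in for a real RGBD depth map.

import numpy as np
import cv2

# Hypothetical consecutive frames; replace with real images.
img1 = cv2.imread("frame1.png", cv2.IMREAD_GRAYSCALE)
img2 = cv2.imread("frame2.png", cv2.IMREAD_GRAYSCALE)
assert img1 is not None and img2 is not None, "supply two real frames"

orb = cv2.ORB_create(2000)
kp1, des1 = orb.detectAndCompute(img1, None)
kp2, des2 = orb.detectAndCompute(img2, None)

matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
matches = sorted(matcher.match(des1, des2), key=lambda m: m.distance)[:200]

K = np.array([[500.0, 0.0, 320.0],    # assumed intrinsics
              [0.0, 500.0, 240.0],
              [0.0, 0.0, 1.0]])

# Pretend every matched keypoint in frame 1 has a known depth of 2 m
# (in practice this would come from the RGBD depth image).
pts_3d, pts_2d = [], []
for m in matches:
    u, v = kp1[m.queryIdx].pt
    z = 2.0
    pts_3d.append([(u - K[0, 2]) * z / K[0, 0], (v - K[1, 2]) * z / K[1, 1], z])
    pts_2d.append(kp2[m.trainIdx].pt)

ok, rvec, tvec, inliers = cv2.solvePnPRansac(
    np.array(pts_3d), np.array(pts_2d), K, None)
print(ok, rvec.ravel(), tvec.ravel())  # relative pose of frame 2 w.r.t. frame 1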

The approach to VO that does not use features is called the direct method. It writes the image pixels directly into a pose-estimation objective to find the relative motion between frames. For example, in RGBD SLAM, ICP (Iterative Closest Point) can be used to solve for the transformation between two point clouds. For monocular SLAM, we can match pixels between two images, or match an image against a global model. Typical examples of direct methods are SVO and LSD-SLAM, which use direct methods in monocular SLAM and have achieved good results. At present, direct methods require more computation than feature-based VO and also place higher demands on the camera's frame rate.
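For the ICP case mentioned above, the sketch below shows a single alignment step, the closed-form SVD (Kabsch) solution, assuming point correspondences are already known; a full ICP would re-associate nearest neighbours and iterate. The two point clouds are synthetic.

import numpy as np

def align_svd(src, dst):
    """Best-fit R, t such that R @ src_i + t ~= dst_i for matched rows."""
    mu_s, mu_d = src.mean(axis=0), dst.mean(axis=0)
    H = (src - mu_s).T @ (dst - mu_d)
    U, _, Vt = np.linalg.svd(H)
    R = Vt.T @ U.T
    if np.linalg.det(R) < 0:            # guard against a reflection solution
        Vt[-1] *= -1
        R = Vt.T @ U.T
    t = mu_d - R @ mu_s
    return R, t

rng = np.random.default_rng(0)
src = rng.normal(size=(100, 3))                       # a random source cloud
theta = 0.3
R_true = np.array([[np.cos(theta), -np.sin(theta), 0.0],
                   [np.sin(theta),  np.cos(theta), 0.0],
                   [0.0, 0.0, 1.0]])
dst = src @ R_true.T + np.array([0.5, -0.2, 0.1])     # the same cloud, moved

R_est, t_est = align_svd(src, dst)
print(np.allclose(R_est, R_true), t_est)              # True, ~[0.5 -0.2 0.1]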

SLAM Framework Backend

Once VO has estimated the inter-frame motion, the robot's trajectory can in principle be obtained by chaining these estimates together. However, visual odometry, like ordinary wheel odometry, suffers from accumulated error (drift). Intuitively, if between times t1 and t2 the estimated turn is 1 degree smaller than the actual turn, every subsequent pose is off by that 1 degree. Over time, a room that is actually square may be reconstructed as a skewed polygon, and the estimated trajectory drifts badly. For this reason, SLAM also feeds the inter-frame motions into a module called the backend for further processing.
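The toy simulation below reproduces this effect: the robot drives a square, but each estimated 90-degree turn is short by 1 degree, so the integrated trajectory no longer closes. The numbers are purely illustrative.

import numpy as np

def integrate(turn_deg, n_sides=4, side_len=1.0):
    """Dead-reckon a polygonal path from per-corner turn estimates."""
    pos, heading = np.zeros(2), 0.0
    path = [pos.copy()]
    for _ in range(n_sides):
        pos = pos + side_len * np.array([np.cos(heading), np.sin(heading)])
        path.append(pos.copy())
        heading += np.deg2rad(turn_deg)
    return np.array(path)

truth = integrate(90.0)      # true square: the path returns to the origin
drifted = integrate(89.0)    # every turn underestimated by 1 degree

print(np.linalg.norm(truth[-1]))     # ~ 0.0: the loop closes
print(np.linalg.norm(drifted[-1]))   # a clear gap that the backend must correct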
