A multi-level map construction algorithm suitable for dynamic scenes-EEWORLD

Author: Xinggang Hu

Localization and mapping for visual SLAM in dynamic scenes face great challenges. In recent years, many research works have proposed effective solutions to the localization problem. However, relatively few works address building long-term consistent maps in dynamic scenes, which seriously hinders the development of map applications. To solve this problem, we designed a multi-level map building system for dynamic scenes. In this system, multi-target tracking, the DBSCAN clustering algorithm, and depth information are used to correct the object detection results, accurately extract static point clouds, and build dense point cloud maps and octree maps. We propose a plane map building algorithm specifically for dynamic scenes, which involves the extraction, filtering, data association, and fusion optimization of planes in dynamic environments. In addition, an object map building algorithm specifically for dynamic scenes is introduced, including object parameterization, data association, and update optimization. Extensive experiments on public datasets and real scenes verify the accuracy of the multi-level map constructed in this study and the robustness of the proposed algorithm. Finally, by using the constructed object map for dynamic object tracking, we demonstrate the practical application prospects of the algorithm.

Main Contributions

This paper proposes a multi-level map construction algorithm for dynamic scenes, as shown in Figure 1. First, we use YOLOX [8] to obtain the semantic information of the scene, adopt a multi-target tracking algorithm to compensate for missed detections, and use the DBSCAN density clustering algorithm and depth information to further refine the detection bounding boxes of potential moving objects. Then, we extract point clouds and planes, and parameterize objects using principal component analysis (PCA) and a minimum bounding rectangle. The extracted point clouds, planes, and objects are then filtered. Next, based on the camera pose provided by our previous study [9], we stitch and fuse the point clouds, perform data association and update optimization on planes and objects, and convert the dense point cloud map into an octree map. Finally, a multi-level map is constructed, including dense point cloud maps, octree maps, plane maps, and object maps, thereby enriching the application scenarios of maps. The effectiveness of our algorithm is verified through experiments conducted on publicly available datasets and in real-world scenarios.
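As a rough illustration of the object parameterization step, the sketch below fits a PCA-aligned bounding rectangle to 2D points, for example an object's points projected onto the ground plane. This is only an assumption about how PCA and a bounding rectangle might be combined: the function name `pca_bounding_rectangle` and the 2D projection are hypothetical, and a PCA-aligned rectangle is not guaranteed to be the minimum-area rectangle that the paper refers to.

```python
import math


def pca_bounding_rectangle(points_xy):
    """Fit a PCA-aligned bounding rectangle to 2D points.

    Returns (center_x, center_y, length, width, yaw), with yaw in radians
    measured from the x-axis to the direction of largest variance.
    """
    n = len(points_xy)
    if n < 2:
        raise ValueError("need at least two points")

    mean_x = sum(x for x, _ in points_xy) / n
    mean_y = sum(y for _, y in points_xy) / n

    # Entries of the 2x2 covariance matrix.
    sxx = sum((x - mean_x) ** 2 for x, _ in points_xy) / n
    syy = sum((y - mean_y) ** 2 for _, y in points_xy) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points_xy) / n

    # Closed-form direction of the first principal axis of a 2x2 covariance.
    yaw = 0.5 * math.atan2(2.0 * sxy, sxx - syy)
    c, s = math.cos(yaw), math.sin(yaw)

    # Express every point in the principal-axis frame (u along yaw, v across).
    us = [c * (x - mean_x) + s * (y - mean_y) for x, y in points_xy]
    vs = [-s * (x - mean_x) + c * (y - mean_y) for x, y in points_xy]

    length = max(us) - min(us)
    width = max(vs) - min(vs)

    # Rectangle center in the principal frame, rotated back to the input frame.
    uc = 0.5 * (max(us) + min(us))
    vc = 0.5 * (max(vs) + min(vs))
    center_x = mean_x + c * uc - s * vc
    center_y = mean_y + s * uc + c * vc
    return center_x, center_y, length, width, yaw
```

Adding the vertical extent of the points to such a rectangle would give a simple oriented cuboid, which is one lightweight way to represent an object in a map.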

Figure 1. System framework of the multi-level map construction algorithm for dynamic scenes. The light green part is the input module, which is responsible for inputting RGB images and depth images. The dark green part is the preprocessing module, which is mainly responsible for obtaining and preprocessing semantic information. The yellow, blue and brown modules are map construction modules, which represent the general process of building dense point cloud maps and octree maps, plane maps and object maps respectively. The purple part is the output module, which is responsible for outputting the multi-level map constructed by the map construction module.

The contributions of this paper are summarized as follows:

The point cloud is filtered based on the corrected object detection results, and a clean point cloud map and octree map containing only static elements are constructed.

A method for constructing planar maps in dynamic scenes is proposed to achieve the perception of environmental structure.

We propose a method to build a map of objects in dynamic scenes, enabling SLAM to meet more advanced requirements such as environment understanding, object manipulation, and semantic augmented reality.

To the best of our knowledge, this is the first work to construct a planar map in a dynamic scene, and the first work to accurately parameterize objects and build an accurate and complete lightweight object map.

Contents

Construction of geometric maps

A. Construction of dense point cloud map and octree map

When semantic prior information is available, point clouds inside the object detection box or semantic mask can be deleted according to their semantic category, producing a dense point cloud map that contains only static elements. However, if only the raw semantic results are used, the "missed detection" and "under-segmentation" problems can leave dynamic objects incompletely removed. This paper uses YOLOX to acquire semantic information and addresses both problems as follows. To handle missed detections, a multi-target tracking algorithm compensates for frames in which the detector fails. To handle under-segmentation, the DBSCAN clustering algorithm is first used to extract the foreground points within the bounding box of each potential moving object. The detection box is then expanded based on the depth of the neighboring pixels along the box boundary and the depth of the foreground points. To guard against errors introduced by DBSCAN clustering, the expansion in each of the four directions of the detection box is limited to 50 pixels. In each keyframe, the pixels outside the corrected bounding boxes of potential moving objects are extracted and mapped into the 3D world coordinate system. Then, based on the camera pose provided by our previous study, the point clouds extracted from different keyframes are stitched and fused, and then downsampled through voxel grid filtering. To improve storage efficiency and support tasks such as navigation and obstacle avoidance, the point cloud map is converted into an octree map.
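The bounding box correction can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function `expand_box`, the depth tolerance `depth_tol`, and the rule "grow a side while the adjacent row or column still contains a pixel near the foreground depth" are all hypothetical. Only the 50-pixel limit per direction comes from the text.

```python
MAX_EXPAND_PX = 50  # per-direction expansion limit stated in the text


def expand_box(depth, box, fg_depth, depth_tol=0.2, max_expand=MAX_EXPAND_PX):
    """Grow a detection box until its borders no longer touch the foreground.

    depth     : 2D list of depth values in meters (0 means invalid).
    box       : (x_min, y_min, x_max, y_max), inclusive pixel coordinates.
    fg_depth  : representative depth of the clustered foreground points.
    depth_tol : assumed tolerance for "same depth as the foreground".
    """
    height, width = len(depth), len(depth[0])
    x0, y0, x1, y1 = box

    def is_foreground(x, y):
        d = depth[y][x]
        return d > 0 and abs(d - fg_depth) <= depth_tol

    # Each side moves outward one pixel at a time, at most max_expand times.
    for _ in range(max_expand):  # left
        if x0 == 0 or not any(is_foreground(x0 - 1, y) for y in range(y0, y1 + 1)):
            break
        x0 -= 1
    for _ in range(max_expand):  # right
        if x1 == width - 1 or not any(is_foreground(x1 + 1, y) for y in range(y0, y1 + 1)):
            break
        x1 += 1
    for _ in range(max_expand):  # top
        if y0 == 0 or not any(is_foreground(x, y0 - 1) for x in range(x0, x1 + 1)):
            break
        y0 -= 1
    for _ in range(max_expand):  # bottom
        if y1 == height - 1 or not any(is_foreground(x, y1 + 1) for x in range(x0, x1 + 1)):
            break
        y1 += 1
    return x0, y0, x1, y1
```

In this sketch `fg_depth` would typically be something like the median depth of the largest DBSCAN cluster inside the original box. The hard cap keeps a bad cluster from growing the box across the whole image.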

B. Construction of a plane map

The PE algorithm [30] is used for plane extraction to obtain the parameters and point cloud of each plane in the current camera coordinate system, and then the edge points of the plane are extracted. Subsequently, the PCL point cloud library is used to fit each plane a second time to obtain refined parameters and inliers, and the outliers among the plane edge points are removed. In this process, the planes are filtered according to factors such as depth information, inlier ratio, and positional relationship with the object detection boxes. After the plane map is initialized, the planes detected in the current frame are compared with the existing planes in the map to establish data association. However, in complex dynamic scenes, the detected planes often contain significant errors and randomness, which causes plane data association to fail. As more observations accumulate, the estimates of two planes that failed to associate gradually improve, making a later association easier. Therefore, in the local map construction thread, the planes in the map are compared pairwise. If two planes meet the association conditions, they are treated as the same plane that previously failed to associate. The plane with fewer observations is then merged into the plane with more observations, the merged plane is optimized, and the plane with fewer observations is removed from the map.
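The pairwise check and merge in the local mapping thread might look like the sketch below. The association conditions are not spelled out here, so the angle and offset thresholds, the `Plane` class, and the observation-weighted average used in place of the paper's fusion optimization are all assumptions made for illustration.

```python
import math
from dataclasses import dataclass


@dataclass
class Plane:
    normal: tuple      # unit normal (nx, ny, nz)
    d: float           # offset, so that normal . x = d for points on the plane
    observations: int  # number of times the plane has been observed


def _dot(a, b):
    return sum(p * q for p, q in zip(a, b))


def is_same_plane(a, b, max_angle_deg=10.0, max_offset=0.05):
    """Assumed association test: similar normal direction and similar offset."""
    dot = _dot(a.normal, b.normal)
    sign = 1.0 if dot >= 0 else -1.0  # (n, d) and (-n, -d) describe one plane
    angle = math.degrees(math.acos(min(1.0, abs(dot))))
    return angle <= max_angle_deg and abs(a.d - sign * b.d) <= max_offset


def merge_planes(a, b):
    """Merge the plane with fewer observations into the one with more."""
    keep, drop = (a, b) if a.observations >= b.observations else (b, a)
    sign = 1.0 if _dot(keep.normal, drop.normal) >= 0 else -1.0
    total = keep.observations + drop.observations
    # Observation-weighted average, a stand-in for a real fusion optimization.
    n = tuple((keep.observations * k + drop.observations * sign * q) / total
              for k, q in zip(keep.normal, drop.normal))
    d = (keep.observations * keep.d + drop.observations * sign * drop.d) / total
    norm = math.sqrt(_dot(n, n))
    return Plane(tuple(c / norm for c in n), d / norm, total)


def fuse_map_planes(planes):
    """Compare map planes pairwise and merge those that pass the test."""
    planes = list(planes)
    i = 0
    while i < len(planes):
        j = i + 1
        while j < len(planes):
            if is_same_plane(planes[i], planes[j]):
                planes[i] = merge_planes(planes[i], planes[j])
                del planes[j]  # the absorbed plane leaves the map
            else:
                j += 1
        i += 1
    return planes
```

Running the pairwise pass repeatedly as observations accumulate is what lets two noisy estimates of the same surface eventually pass the test and collapse into one map plane.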

Building an object map

A. Object parameterization and data association

Since the objects to be modeled usually belong to the background and are far away from the camera, the extracted map points are usually sparse and of poor quality, so removing outliers from them with clustering algorithms is not feasible. Therefore, dense point clouds are used in each frame for object modeling, and these point clouds are processed using the DBSCAN density clustering algorithm. In the current frame k, each detected instance is tested for association against every object instance in the map. Motion IoU, projection IoU, 3D-IoU, and non-parametric statistics are common object data association strategies. Each has limitations on its own, but when integrated they complement each other, resulting in a more powerful, accurate, and versatile object data association algorithm.
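To make the IoU-based strategies concrete, here is a minimal sketch of intersection over union for axis-aligned boxes. It works in any dimension, so the same function covers a 2D projection IoU and a simplified 3D-IoU. The axis-aligned simplification, the `associate` helper, and the 0.3 threshold are assumptions, and the non-parametric statistical tests are not shown.

```python
def box_iou(a, b):
    """IoU of two axis-aligned boxes, each given as (min_corner, max_corner)."""
    (a_min, a_max), (b_min, b_max) = a, b

    inter = 1.0
    for lo_a, hi_a, lo_b, hi_b in zip(a_min, a_max, b_min, b_max):
        overlap = min(hi_a, hi_b) - max(lo_a, lo_b)
        if overlap <= 0:
            return 0.0  # boxes are disjoint along this axis
        inter *= overlap

    def volume(lo, hi):
        v = 1.0
        for l, h in zip(lo, hi):
            v *= (h - l)
        return v

    union = volume(a_min, a_max) + volume(b_min, b_max) - inter
    return inter / union if union > 0 else 0.0


def associate(detection, map_objects, min_iou=0.3):
    """Return the index of the best-overlapping map object, or None."""
    best_index, best_iou = None, min_iou
    for index, obj in enumerate(map_objects):
        iou = box_iou(detection, obj)
        if iou >= best_iou:
            best_index, best_iou = index, iou
    return best_index
```

A detection that returns `None` would be a candidate for a new object instance. In a combined scheme, a match from one cue could be accepted only when a second cue agrees.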

Figure 2. Outlier removal of map points. (a) Determine the desktop plane. (b) Remove outliers based on the distance from the point to the plane. (c) Use the isolation forest algorithm to remove outliers.

B. Object update and optimization

We use dense point clouds to parameterize detection instances and sparse map points to parameterize object instances. This approach compensates for the scarcity of map points in a single frame while avoiding the significant time cost of processing dense point clouds over multiple frames. After successful data association, the object's map points and parameters are updated. Subsequently, outliers are removed from these map points using the distance from each map point to the plane associated with the object (for example, the desktop plane), together with the isolation forest algorithm, as shown in Figure 2.
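The distance-based part of this outlier removal can be sketched as below. The exact criterion is not given in the text, so the rule used here (keep points whose signed height above the supporting plane lies within a band) and the names `signed_distance`, `filter_by_plane`, `min_dist`, and `max_dist` are hypothetical. The isolation forest step is not shown.

```python
def signed_distance(point, normal, d):
    """Signed distance from a 3D point to the plane normal . x = d.

    Assumes `normal` is a unit vector pointing away from the supporting
    surface, for example upward for a desktop.
    """
    return sum(n * p for n, p in zip(normal, point)) - d


def filter_by_plane(points, normal, d, min_dist=0.0, max_dist=0.5):
    """Keep map points lying within [min_dist, max_dist] of the plane.

    Points below the plane or implausibly far above it are treated as
    outliers. Both bounds are assumed values, in meters.
    """
    return [p for p in points
            if min_dist <= signed_distance(p, normal, d) <= max_dist]
```

For an object resting on a desk at a height of 0.7 m, calling `filter_by_plane(points, (0.0, 0.0, 1.0), 0.7)` would drop map points that fall under the desk surface or float well above the object.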

Experiments

We evaluated the performance of our algorithm on the TUM RGB-D dataset and applied the algorithm to dynamic object tracking in real scenes. The main focus of this study is map construction. Since the test sequences do not provide a ground truth map, the experiments mainly aim to demonstrate the map construction results qualitatively. Our algorithm runs on a laptop equipped with an i9-12900H CPU, a 3060 GPU, and 16 GB of memory.

Construction of geometric maps

The results of dense point cloud map and octree map construction are shown in Figure 3. It can be observed that the ORB-SLAM2 algorithm is unable to perform localization and map construction in high-dynamic scenes because it lacks a module for handling dynamic objects. In low-dynamic scenes, the algorithm retains the point cloud of dynamic objects. Because object detection misses some detections and the bounding boxes do not completely cover potential moving objects, the dense point cloud map built by removing only the points inside the original detection bounding boxes still contains a large number of residual traces of these objects.