A benchmark for surround-view semantic occupancy grid perception

Publisher: 快乐家庭 | Last updated: 2024-06-11 | Source: elecfans

Abstract

Semantic occupancy grid perception is crucial for autonomous driving, as autonomous vehicles require fine-grained perception of 3D urban scenes. However, existing benchmarks lack diversity in urban scenes and evaluate only front-view predictions. To comprehensively evaluate surround-view perception algorithms, we propose OpenOccupancy, the first benchmark for surround-view semantic occupancy grid perception. In the OpenOccupancy benchmark, we extend the large-scale nuScenes dataset with dense semantic occupancy annotations. Previous annotations rely on superimposing LiDAR point clouds, so some occupancy labels are missed due to the sparsity of LiDAR scans. To alleviate this problem, we introduce an Augmenting And Purifying (AAP) pipeline that roughly doubles the annotation density, involving about 4,000 human hours of labeling. In addition, camera-based, LiDAR-based, and multimodal baseline models are established for the OpenOccupancy benchmark. Furthermore, since the difficulty of surround-view occupancy perception lies in the computational burden of high-resolution 3D prediction, we propose the Cascaded Occupancy Network (CONet) to refine the coarse predictions, improving performance by about 30% relative to the baseline. We hope the OpenOccupancy benchmark will promote the development of surround-view occupancy perception algorithms.


Main Contributions

Despite the growing interest in semantic occupancy grid perception, most related benchmarks target indoor scenes. SemanticKITTI extends occupancy perception to driving scenes, but its dataset is relatively small and limited in diversity, which hinders the generalization and evaluation of occupancy perception algorithms. In addition, SemanticKITTI only evaluates front-view occupancy results, while surround-view perception is more critical for safe driving. To address these issues, we propose OpenOccupancy, the first benchmark for surround-view semantic occupancy perception. The OpenOccupancy benchmark introduces nuScenes-Occupancy, which extends the large-scale nuScenes dataset with dense semantic occupancy annotations.

[Table 1]

As shown in Table 1, nuScenes-Occupancy annotates about 40 times more scenes and about 20 times more frames than SemanticKITTI. Notably, directly annotating occupancy labels at this scale by hand is almost impractical, so the Augmenting And Purifying (AAP) pipeline is introduced to annotate and densify the occupancy labels efficiently.

[Figure 1]

Figure 1: nuScenes-Occupancy provides dense semantic occupancy grid labels for all keyframes in the nuScenes dataset. The annotated ground truth is shown here, with a volume size of 40 × 512 × 512 voxels and a voxel size of 0.2 meters.

Figure 1 visualizes the dense annotations. To facilitate future research, we establish camera-based, LiDAR-based, and multimodal baselines in the OpenOccupancy benchmark. Experimental results show that camera-based methods perform better on small objects (e.g., bicycles, pedestrians, motorcycles), while LiDAR-based methods perform better on large structured regions (e.g., drivable surfaces, sidewalks). Notably, by adaptively fusing intermediate features from the two modalities, the multimodal baseline improves the overall performance of the camera-based and LiDAR-based baselines by 47% and 29%, respectively. Given the computational burden of surround-view occupancy perception, the proposed baselines can only generate low-resolution predictions. For efficient high-resolution occupancy perception, we propose the Cascaded Occupancy Network (CONet), which builds a coarse-to-fine pipeline on top of the proposed baselines and improves performance by about 30% relative to the original baseline. The main contributions are summarized as follows:

We propose OpenOccupancy, the first benchmark for surround-view occupancy perception in driving scenarios.

We introduce the AAP pipeline on the nuScenes dataset to efficiently annotate and densify surround-view semantic occupancy labels, yielding nuScenes-Occupancy, the first dataset for surround-view semantic occupancy segmentation.

We establish camera-based, LiDAR-based, and multimodal baselines on the OpenOccupancy benchmark. In addition, we introduce CONet to reduce the computational burden of high-resolution occupancy prediction, improving performance by about 30% relative to the baselines.

Main Content

Surround-view semantic occupancy grid perception

Surround-view semantic occupancy perception refers to generating a complete 3D representation of the scene, including volumetric occupancy and semantic labels. Unlike the monocular paradigm of front-view perception, surround-view occupancy algorithms aim to produce semantic occupancy for the full panoramic driving scene. Specifically, given a 360-degree input X_i (such as a LiDAR sweep or surround-view images), the perception algorithm must predict the surrounding occupancy labels F(X_i) ∈ R^{D×H×W}, where D, H, and W denote the volume size of the whole scene. Notably, the surround-view input range is about 5 times larger than the area covered by a front-view sensor, so the core challenge of surround-view occupancy grid perception is to efficiently construct a high-resolution occupancy representation.
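To make the grid concrete, here is a minimal sketch of how 3D points map into the (D, H, W) annotation volume, assuming the 0.2 m voxels and 40 × 512 × 512 volume from Figure 1. The metric point-cloud range below is an assumption chosen so those dimensions work out exactly; it is not stated in the article.

```python
import numpy as np

VOXEL_SIZE = 0.2                              # meters per voxel (Figure 1)
GRID_SHAPE = (40, 512, 512)                   # (D, H, W) voxels (Figure 1)
# Assumed metric range: x, y in [-51.2, 51.2), z in [-5.0, 3.0), chosen so
# that 0.2 m voxels yield exactly a (40, 512, 512) volume.
RANGE_MIN = np.array([-51.2, -51.2, -5.0])    # (x, y, z) lower bounds

def points_to_voxel_indices(points: np.ndarray) -> np.ndarray:
    """Map (N, 3) points in meters to integer (d, h, w) voxel indices."""
    xyz = np.floor((points - RANGE_MIN) / VOXEL_SIZE).astype(np.int64)
    dhw = xyz[:, ::-1]                        # grid axes are ordered (z, y, x)
    # keep only points that fall inside the annotated volume
    inside = np.all((dhw >= 0) & (dhw < np.array(GRID_SHAPE)), axis=1)
    return dhw[inside]
```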

nuScenes-Occupancy

SemanticKITTI is the first dataset for outdoor occupancy perception, but it lacks diversity in driving scenarios and only evaluates front-view perception. To create a large-scale surround-view occupancy perception dataset, we introduce nuScenes-Occupancy, which adds dense semantic occupancy annotations to the nuScenes dataset, using the Augmenting And Purifying (AAP) pipeline to annotate and densify the occupancy labels efficiently.

The entire AAP process is shown in Algorithm 1.

[Algorithm 1: the AAP pipeline]

As shown in Figure 2, the pseudo-labels complement the initial annotations, and the augmented-and-purified labels are denser and more accurate. Notably, each frame of the augmented-and-purified annotations contains about 400,000 occupied voxels, roughly 2 times denser than the initial annotations. In total, nuScenes-Occupancy contains 28,130 training frames and 6,019 validation frames, with 17 semantic labels assigned to the occupied voxels in each frame.
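Since Algorithm 1 appears only as a figure here, the following is a rough sketch of one AAP step as described above, operating on dense (D, H, W) label volumes. The merge rule (initial labels win over pseudo-labels) and the mask-based purification are assumptions for illustration, not the authors' exact procedure.

```python
import numpy as np

def augment_and_purify(initial: np.ndarray, pseudo: np.ndarray,
                       keep_mask: np.ndarray) -> np.ndarray:
    """One AAP step on (D, H, W) label volumes, where 0 means 'empty'.

    initial:   labels from superimposed LiDAR sweeps (sparse, with holes)
    pseudo:    pseudo-labels predicted by a pretrained occupancy model
    keep_mask: boolean purification mask from human annotators
    """
    # Augment: fill voxels the sparse LiDAR missed with pseudo-labels,
    # keeping the initial annotation wherever one exists.
    augmented = np.where(initial > 0, initial, pseudo)
    # Purify: clear voxels the annotators rejected as noise.
    return np.where(keep_mask, augmented, 0)
```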

[Figure 2]

Figure 2: Comparison of the original annotations, pseudo-annotations, and augmented-and-purified annotations. The red and blue circles highlight regions where the enhanced annotations are denser and more accurate.

OpenOccupancy Baseline

Most existing occupancy perception methods are designed for front-view perception. Extending them to surround-view occupancy perception requires processing each camera view separately, which is inefficient, and the overlapping regions of two adjacent outputs may be inconsistent. To alleviate these problems, we establish baselines that learn surround-view semantic occupancy consistently from 360-degree inputs (such as LiDAR sweeps or surround-view images). Specifically, we propose camera-based, LiDAR-based, and multimodal baselines for the OpenOccupancy benchmark, as shown in Figure 3.

[Figure 3]

Figure 3: The overall architecture of the three proposed baselines. The LiDAR branch uses a 3D encoder to extract voxelized LiDAR features. The camera branch uses a 2D encoder to learn surround-view features, which are then transformed into 3D camera voxel features. In the multimodal branch, an adaptive fusion module dynamically integrates the features of the two modalities. All three branches use a 3D decoder and an occupancy head to produce semantic occupancy. In the occupancy result map, the red and purple circles indicate that the multimodal branch generates more complete and accurate predictions.
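As a minimal illustration of the adaptive fusion step in Figure 3, the sketch below blends camera and LiDAR voxel features with a learned per-voxel gate. This gating form is an assumption made for illustration, not the paper's exact fusion module.

```python
import torch
import torch.nn as nn

class AdaptiveFusion(nn.Module):
    """Blend camera and LiDAR voxel features with a learned per-voxel gate."""
    def __init__(self, channels: int):
        super().__init__()
        # predict a per-voxel weight from the concatenated modalities
        self.gate = nn.Sequential(
            nn.Conv3d(2 * channels, channels, kernel_size=3, padding=1),
            nn.Conv3d(channels, 1, kernel_size=1),
            nn.Sigmoid(),
        )

    def forward(self, cam_feat: torch.Tensor, lidar_feat: torch.Tensor):
        # cam_feat, lidar_feat: (B, C, D, H, W) voxel features
        w = self.gate(torch.cat([cam_feat, lidar_feat], dim=1))
        # weight near 1 favors the camera branch, near 0 the LiDAR branch
        return w * cam_feat + (1.0 - w) * lidar_feat
```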

Experiments

In the OpenOccupancy benchmark, surround-view semantic occupancy perception is evaluated on nuScenes-Occupancy, with comprehensive experiments on the proposed baselines, CONet, and modern occupancy perception algorithms. All models are trained on 8 A100 GPUs with a batch size of 8 for 24 epochs. Using the benchmark, we analyze the surround-view occupancy perception performance of six modern methods (MonoScene, TPVFormer, 3DSketch, AICNet, LMSCNet, JS3C-Net) as well as the proposed baselines and CONet. The results in Table 2 show that:

The surround-view occupancy perception paradigm performs better than single-view methods.

The proposed baselines are adaptable and scalable for surround-view occupancy perception.

Camera and LiDAR information complement each other, and the multimodal baseline significantly improves performance.

The difficulty of surround-view occupancy perception lies in the computational burden of high-resolution 3D prediction, which the proposed CONet alleviates.

[Table 2]

Visual results (see Figure 5) verify that CONet can generate fine occupancy grid results from coarse predictions.

[Figure 5]

Figure 5: Visualization of semantic occupancy predictions. Row 1 shows the surround-view images; rows 2 and 3 show the coarse and fine occupancy in the camera views generated by the multimodal baseline and the multimodal CONet; row 4 compares their global-view predictions.
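To make the coarse-to-fine idea concrete, here is a rough sketch of a cascaded refinement step in the spirit of CONet. The dense refinement head and the "class 0 = empty" convention are assumptions; the paper's actual design is aimed at avoiding high-resolution computation on empty space.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CoarseToFine(nn.Module):
    """Refine a low-resolution occupancy prediction at higher resolution."""
    def __init__(self, coarse_net: nn.Module, refine_head: nn.Module,
                 scale: int = 2):
        super().__init__()
        self.coarse_net = coarse_net
        self.refine_head = refine_head
        self.scale = scale

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        coarse = self.coarse_net(x)                    # (B, K, D, H, W) logits
        up = F.interpolate(coarse, scale_factor=self.scale,
                           mode="trilinear", align_corners=False)
        # Refinement is applied densely here for brevity; an efficient
        # implementation would gather only the voxels the coarse stage
        # marks as occupied and refine just those.
        occupied = up.argmax(dim=1, keepdim=True) > 0  # class 0 assumed empty
        refined = self.refine_head(up)                 # same (B, K, ...) shape
        return torch.where(occupied, refined, up)
```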

Summary

In this paper, we present OpenOccupancy, the first benchmark for surround-view semantic occupancy perception in driving scenarios. Specifically, we introduce nuScenes-Occupancy, which extends the nuScenes dataset with dense semantic occupancy annotations produced by the proposed AAP pipeline. We establish camera-based, LiDAR-based, and multimodal baselines on the OpenOccupancy benchmark. In addition, we propose CONet to reduce the computational burden of high-resolution occupancy prediction. Comprehensive experiments on the OpenOccupancy benchmark show that the camera-based and LiDAR-based baselines complement each other, and that the multimodal baseline improves on them by 47% and 29%, respectively. Moreover, the proposed CONet improves on the baselines by about 30% with little latency overhead. We hope the OpenOccupancy benchmark will foster the development of semantic occupancy perception in driving scenarios.
