Current Status and Challenges of Public Datasets for Autonomous Driving

As data collection equipment improves, autonomous driving datasets are being upgraded and iterated as well. Major autonomous driving companies and research institutes in China and abroad have released autonomous driving datasets one after another, providing important research material for future technical development in the field. The review "Autonomous Driving Open Source Data System: Current Situation and Future" systematically surveys open source datasets for autonomous driving, which matters for fostering a virtuous cycle in the industry ecosystem. It was released by Shanghai Artificial Intelligence Laboratory together with Shanghai Jiao Tong University, Fudan University, Baidu, BYD, NIO and other organizations. For the first time, the review systematically surveys more than 70 open source autonomous driving datasets from China and abroad, and summarizes how to build high-quality datasets, the core role of data in the data-algorithm closed-loop system, and how to use large generative models to produce data at scale. On this basis, it analyzes and discusses in depth the characteristics, data scale, and key scientific and technical issues that future third-generation autonomous driving datasets should address.

Overview

As one of the important application areas of artificial intelligence, autonomous driving is expected to reshape existing modes of traffic and transportation, greatly improve traffic efficiency and safety, and have a profound impact on future urban and social development. At present, China's intelligent connected vehicle industry has entered the trial and start-up stage of commercialization. Road testing and demonstration scenarios are maturing, autonomous driving functions are iterating faster, vehicle networking application scenarios are becoming richer, and relevant laws, regulations and policies at all levels are being introduced at an accelerating pace. Together, these factors have pushed the market into a period of rapid development. On the one hand, autonomous driving technology requires a large amount of data to train algorithm models to recognize and understand the road environment, so that they can make correct decisions and actions and deliver accurate, stable and safe driving. Data construction is therefore crucial to the development of autonomous driving technology. On the other hand, the emergence of large models in natural language processing and general vision has further confirmed the importance of massive, high-quality data and has inspired the construction of autonomous driving datasets.

Figure: Structure of the review article

Autonomous driving datasets

This review divides the more than 70 open source datasets into two generations. The first generation is represented by KITTI, which was proposed in 2012. Its input sensor modalities consist of front-view cameras and a lidar, and it defines a series of perception tasks. The second generation is represented by the nuScenes and Waymo datasets. Sensor modalities have become more complex: surround-view cameras, lidars, positioning information, and high-precision maps are now common components. Downstream tasks cover perception, mapping, prediction, and path planning together.

Sensor modalities are becoming more complex: surround-view cameras, lidar, ultrasonic radar sensors, GPS, IMU, high-precision (HD) maps, etc.

The size and diversity of datasets are growing: In terms of data richness, the collection time of mainstream autonomous driving datasets has gradually increased from about 10 hours at the beginning to about 100 hours. As automatic labeling technology and labeling tools have evolved, datasets of more than 1,000 hours have appeared in recent years. The diversity of driving scenarios is another key factor in the performance of autonomous driving systems. To improve how algorithms perform in specific scenarios, some datasets are collected in multiple cities on multiple continents.

Dataset tasks extend from perception to prediction and planning: The downstream tasks of datasets such as Cityscapes and Mapillary, launched in 2016, focus on dynamic object detection. Datasets such as SemanticKITTI and DrivingStereo, launched in 2019, introduced tasks such as semantic segmentation, depth estimation, and optical flow estimation. Traditional prediction and planning modules are generally solved with numerical calculation, optimization, search and similar methods. Datasets such as nuScenes, Waymo, and Argoverse V2, released from around 2019 onward, include not only perception tasks but also prediction and planning tasks. This makes it possible to study multiple tasks on the same dataset, and it has helped steer the community from the traditional multi-module paradigm toward end-to-end autonomous driving research.
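The generational split described above can be sketched as a small data structure. This is only an illustration: the `DatasetCard` record, the sensor and task labels, and the `generation` heuristic are assumptions made for this example, not the criteria the review itself uses.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DatasetCard:
    """Hypothetical summary record for one open source driving dataset."""
    name: str
    year: int
    sensors: frozenset
    tasks: frozenset


# Assumed markers of a second-generation dataset, loosely based on the text:
# surround-view cameras together with HD maps, or tasks beyond perception.
SECOND_GEN_SENSORS = frozenset({"surround_camera", "hd_map"})
BEYOND_PERCEPTION = frozenset({"prediction", "planning"})


def generation(card: DatasetCard) -> int:
    """Return 1 or 2 using an illustrative heuristic (not an official rule)."""
    has_rich_sensors = SECOND_GEN_SENSORS <= card.sensors
    goes_beyond_perception = bool(BEYOND_PERCEPTION & card.tasks)
    return 2 if has_rich_sensors or goes_beyond_perception else 1


kitti = DatasetCard(
    name="KITTI",
    year=2012,
    sensors=frozenset({"camera", "lidar"}),
    tasks=frozenset({"detection", "tracking"}),
)
nuscenes = DatasetCard(
    name="nuScenes",
    year=2019,
    sensors=frozenset({"surround_camera", "lidar", "radar", "gps_imu", "hd_map"}),
    tasks=frozenset({"detection", "tracking", "prediction", "planning"}),
)

for card in (kitti, nuscenes):
    print(card.name, "-> generation", generation(card))
```

A record like this makes the three trends (sensor complexity, scale, task coverage) comparable across datasets; a field for collection hours could be added once reliable figures are at hand.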

Figure: Estimated impact of open source datasets for autonomous driving

Data-algorithm closed-loop system

A modular autonomous driving system includes components such as perception, decision-making, planning, and control, most of which are implemented as data-driven neural network models. For these modules, massive, high-quality data is a necessary condition for good performance.

First, massive data is needed to solve several problems in existing autonomous driving systems. A long-standing problem in autonomous driving engineering is the long tail. It arises because the training data is insufficient: a small number of cases are never learned by the model, so at inference time the model cannot give correct results for these edge scenarios. In addition, rule-based modules currently rely on manually designed rules that make the module's output conform to hand-crafted logic. This approach is time-consuming and labor-intensive, and it is difficult to cover every situation, which may cause the autonomous driving system to fail in unseen scenarios. Replacing these modules with data-driven neural networks is one possible solution.

Second, noise in the data inevitably harms the optimization process during neural network training and reduces model performance. Data quality covers not only the resolution and synchronization of sensor data but also the accuracy of labels. A quality problem in either aspect directly affects the performance and safety of the autonomous driving system. In summary, massive, high-quality data has become an indispensable part of building an autonomous driving system.
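One round of such a data loop can be sketched as follows: frames on which the current model is not confident are mined from newly collected data and added to the training set. This is a minimal sketch under stated assumptions; `Frame`, `closed_loop_round`, the confidence threshold, and the toy confidence table are all hypothetical and stand in for a real model and labeling pipeline.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass(frozen=True)
class Frame:
    """Hypothetical record for one collected driving frame."""
    frame_id: str
    scenario: str


def closed_loop_round(
    train_set: List[Frame],
    new_frames: List[Frame],
    confidence: Callable[[Frame], float],
    threshold: float = 0.5,
) -> Tuple[List[Frame], List[Frame]]:
    """Mine low-confidence frames and append them to the training set.

    Returns the enlarged training set and the list of newly mined frames.
    In a real system the mined frames would be labeled and the model
    retrained before the next round; both steps are omitted here.
    """
    known_ids = {f.frame_id for f in train_set}
    hard_cases = [
        f for f in new_frames
        if confidence(f) < threshold and f.frame_id not in known_ids
    ]
    return train_set + hard_cases, hard_cases


# Toy stand-in for a model: confidence is looked up per scenario, and
# scenarios the "model" has never seen get confidence 0.0.
TOY_CONFIDENCE = {"highway_cruise": 0.95, "construction_zone": 0.30}


def toy_confidence(frame: Frame) -> float:
    return TOY_CONFIDENCE.get(frame.scenario, 0.0)


train = [Frame("f0", "highway_cruise")]
incoming = [
    Frame("f1", "highway_cruise"),
    Frame("f2", "construction_zone"),
    Frame("f3", "overturned_truck"),
]
train, mined = closed_loop_round(train, incoming, toy_confidence)
print([f.frame_id for f in mined])
```

In this toy run the common highway frame is skipped, while the low-confidence and unseen scenarios are kept, which is the mechanism by which a closed loop gradually covers long-tail cases.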

A new generation of autonomous driving datasets in the era of large models

Current foundation models have achieved remarkable results in natural language processing and computer vision, but there is not yet a large model for the vertical field of autonomous driving on the market. Taking large models in other fields as a reference, the new generation of datasets should at least reach a data volume comparable to that of those fields in order to enable large models for autonomous driving. Once the amount of data is sufficient, the richness of scenes matters more to algorithm performance. Autonomous vehicles will inevitably encounter scenes outside the training data in the real world, and large-scale deployment of autonomous driving technology requires the model to behave correctly in rare scenes to avoid danger or functional failure. Most traffic scenes do not need a very large amount of data to be covered, so more attention should be paid to long-tail scenes. Because some traffic scenes, such as car crashes, are very rare, the lack of data for them has a large impact on the performance of the autonomous driving system.
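One common way to give rare scenes more attention during training is to sample them with weights inversely proportional to their frequency. The sketch below illustrates that idea only; it is not a method described in the review, and the function name and scenario tags are assumptions for the example.

```python
from collections import Counter
from typing import List


def balanced_sampling_weights(scenario_tags: List[str]) -> List[float]:
    """Return one sampling weight per sample.

    Each sample gets weight 1 / (number_of_scenarios * count_of_its_scenario),
    so every scenario receives the same total probability mass and the
    weights sum to 1.
    """
    counts = Counter(scenario_tags)
    num_scenarios = len(counts)
    return [1.0 / (num_scenarios * counts[tag]) for tag in scenario_tags]


# Toy dataset: 8 ordinary frames and 2 rare crash frames (made-up tags).
tags = ["normal_driving"] * 8 + ["crash"] * 2
weights = balanced_sampling_weights(tags)

print("weight of a normal frame:", weights[0])
print("weight of a crash frame:", weights[-1])
```

With these weights a sampler draws crash frames as often as normal frames in aggregate, even though they make up a fifth of the data. Reweighting cannot create information that was never collected, which is why the text stresses gathering long-tail scenes in the first place.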

First- and second-generation autonomous driving datasets can no longer meet the development needs of autonomous driving systems, and the construction of a new generation of datasets needs to be put on the agenda. In the era of large models, big data has become an indispensable feature of the new generation of datasets. At the same time, modular autonomous driving systems run into problems such as high iteration costs and a limited performance ceiling during deployment, and end-to-end autonomous driving architectures are gradually gaining favor in the industry. Multimodal sensors, high-quality annotations, and the logical reasoning capabilities of models also deserve attention. Based on this, the review summarizes the development goals of the new generation of datasets: multimodal, with both quality and quantity; end-to-end and decision-oriented; intelligent, with logical reasoning.

Figure: Outlook for autonomous driving datasets in the era of large models

Conclusion

The review comprehensively examines the current status and challenges of public datasets for autonomous driving. Based on the data-algorithm closed-loop system and the current development trend of large models, it proposes a vision and plan for the next generation of autonomous driving datasets. It systematically summarizes the datasets used in the development of autonomous driving and shows how challenges and leaderboards help drive community development. It also gives an overall analysis of the data-algorithm closed-loop system for autonomous driving, summarizes the role of each important link, and finally uses application cases to show how the closed-loop system is applied.
