Construction of a general target perception system for autonomous driving

Publisher: 幸福梦想 | Last updated: 2022-08-03 | Source: 车东西

Introduction:


On July 26, the "SenseTime Jueying Autonomous Driving Technology Open Course", jointly planned and launched by SenseTime Jueying and the Zhidongxi Open Course, concluded successfully. Dr. Wang Zhe, Director of SenseTime Jueying, gave a live lecture on the theme of "Construction of General Target Perception System for Autonomous Driving".


Dr. Wang Zhe first analyzed the three major challenges in building a general target perception system for autonomous driving. He then explained the construction of SenseTime Jueying's general perception capabilities from the three dimensions of data, algorithms, and computing power, and shared the practical application of SenseTime Jueying's general perception capabilities using a tow truck as an example.


This open class was divided into two parts: the main lecture and a Q&A session. This article is a review of Dr. Wang Zhe's lecture.


Thank you all for joining today's live broadcast. It is my honor to represent our team and give this talk. The project is called General Object Perception, or GOP.


The topic of today's lecture is "Construction of General Target Perception System for Autonomous Driving", and it is divided into four parts:


1. Challenges in building a general target perception system


2. Three dimensions of problem solving


3. Practice and results


4. What we are doing


1 Challenges in Building a General Target Perception System




Let's start by asking a question: Why do we need general object perception?


As the figure above shows, autonomous driving technology has continued to move into production in recent years, and some mass-produced models have begun to be delivered. As the assisted-driving functions of L2+ and L3 intelligent vehicles evolve, scenario complexity keeps increasing: from single scenarios to multiple scenarios, and from highways to urban roads.


As scenario coverage grows, especially once autonomous vehicles enter urban areas, the challenges also grow. In urban areas, for example, you may encounter all kinds of vehicles carrying out tasks. In the past it was enough to know that these were cars, but in the city, knowing their specific types and what tasks they are performing lets you adjust your driving strategy accordingly.


In addition, construction scenes are common in urban areas, and they may contain traffic warning objects of many different forms; perception of these long-tail signs needs to be covered well. Urban areas also contain traffic lights of many kinds, including occluded or truncated ones, complex scenes with multiple traffic lights that must be matched against the map, and temporary traffic lights. All of these examples pose a great challenge to the perception algorithm, which therefore needs to recognize very rich semantic elements and cover targets of widely different forms.


We divide the challenges of general object perception into three aspects:


First, the targets form an open set: when an autonomous vehicle drives on the road, the targets it interacts with cannot be enumerated or pre-defined in advance, so there is no way to know beforehand what kind of objects will be encountered on any given day.


There are two directions for tackling this type of problem. First, are there online algorithms on the vehicle side that can handle it, or at least recognize that such an object is present and work with multi-sensor solutions for detection and obstacle avoidance? Second, an autonomous driving system must be continuously upgraded and iterated: after encountering an object from the open set, can a better algorithm be iterated in a very short time to solve the detection and tracking problems this type of object causes?
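To make the first direction concrete, here is a minimal Python sketch of routing detections that the classifier cannot confidently assign to a known class into a generic "unknown obstacle" category, so that downstream planning can still avoid them. The class names and thresholds are hypothetical assumptions for illustration, not SenseTime Jueying's actual pipeline.

```python
# A minimal sketch, assuming a detector that outputs per-class confidence
# scores plus a class-agnostic objectness score. Class names and thresholds
# are hypothetical, not SenseTime Jueying's actual pipeline.
from dataclasses import dataclass
from typing import Dict, Optional, Tuple

KNOWN_CLASSES = ("car", "pedestrian", "cyclist", "traffic_cone")  # hypothetical closed label set

@dataclass
class Detection:
    box: Tuple[float, float, float, float]  # (x, y, w, h) in image coordinates
    class_scores: Dict[str, float]          # class name -> confidence
    objectness: float                       # class-agnostic "something is there" score

def label_detection(det: Detection,
                    class_threshold: float = 0.5,
                    objectness_threshold: float = 0.3) -> Optional[str]:
    """Return a known class name, 'unknown_obstacle', or None (discard)."""
    best_class = max(det.class_scores, key=det.class_scores.get)
    if best_class in KNOWN_CLASSES and det.class_scores[best_class] >= class_threshold:
        return best_class              # confidently recognized: normal handling
    if det.objectness >= objectness_threshold:
        return "unknown_obstacle"      # cannot name it, but it must still be avoided
    return None                        # likely background noise

# Example: a tarpaulin on the road that the classifier has never seen.
tarp = Detection(box=(320.0, 400.0, 80.0, 40.0),
                 class_scores={"car": 0.12, "pedestrian": 0.05,
                               "cyclist": 0.03, "traffic_cone": 0.08},
                 objectness=0.71)
print(label_detection(tarp))  # -> unknown_obstacle
```

In practice such a class-agnostic fallback would be fused with multi-sensor detections, but the idea is the same: an object does not need a name before it can be avoided.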


Second, the semantic granularity of targets keeps becoming finer, from only a few traffic-participant classes at the beginning to increasingly fine-grained semantic labels.


Third, the appearance of targets within each category follows a long-tail distribution: most targets look similar, but there are always some that differ slightly. These long-tail cases pose a greater challenge to perception algorithms.




Let’s look at some examples of these three challenges.


First, the target category is an open set. As shown in the figure above, these are some real scenes that our autonomous driving team encountered during road testing, including rocks falling on the road, pennant strings, tarpaulins on the ground, flying plastic bags, dogs running on the road, and even birds flying in the sky.


Generally speaking, these objects are not among the targets defined or detected in autonomous driving or in academic benchmarks. There are two main difficulties. First, these categories cannot be exhaustively enumerated and are very diverse; if you try, you will find there is no clean, top-down labeling system that can manage and maintain them. Second, these categories appear relatively rarely, so a very large data base is needed to mine them.




The second point is that the semantic level needs to be continuously refined. When we first started doing assisted-driving functions such as LKA and FCW, it may have been enough to identify a few traffic participants, knowing that there were pedestrians, vehicles, and non-motorized vehicles ahead, and then perform obstacle avoidance or simple vehicle handling. But with a finer semantic level, for example being able to further distinguish large freight vehicles, privileged vehicles, and small vehicles among motor vehicles, the driving strategy can be tailored to each type, and different vehicle types can be avoided in different ways.


Privileged vehicles can be further subdivided. China's Road Traffic Safety Law defines four types of privileged vehicles: police cars, ambulances, fire trucks, and road rescue vehicles. When carrying out urgent tasks, these vehicles are allowed to disregard traffic signals and speed limits, so they have a comparatively strong right of way.


In addition, we have added some categories of our own, such as school buses, which are used to pick up and drop off students. What makes a school bus distinctive? There is a stop sign near the door: it is folded while the bus is driving and unfolds when students get off, so the shape of the vehicle itself changes. If the perception algorithm can recognize it as a school bus, the driving behavior can be adjusted accordingly.
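As an illustration of such a fine-grained semantic hierarchy, the sketch below encodes the vehicle sub-categories mentioned above as a small label tree and maps levels of the tree to hypothetical planning hints. The names and hints are illustrative assumptions, not SenseTime Jueying's actual label system.

```python
# Illustrative sketch of a fine-grained semantic hierarchy; category names and
# policy hints are hypothetical assumptions, not SenseTime Jueying's label system.
SEMANTIC_TREE = {
    "vehicle": {
        "small_vehicle": {},
        "large_freight_vehicle": {},
        "school_bus": {},
        "privileged_vehicle": {            # the four types defined in China's Road Traffic Safety Law
            "police_car": {},
            "ambulance": {},
            "fire_truck": {},
            "road_rescue_vehicle": {},
        },
    },
}

# Hypothetical mapping from label to a planning hint: the finer the label,
# the more specific the strategy the downstream planner can choose.
POLICY_HINT = {
    "small_vehicle": "normal following distance",
    "large_freight_vehicle": "keep extra lateral distance, avoid blind spots",
    "school_bus": "expect stops; do not overtake when the stop sign is unfolded",
    "privileged_vehicle": "yield right of way proactively",
}

def policy_for(label: str) -> str:
    """Walk the tree; a label inherits the hint of its closest labeled ancestor."""
    def walk(node, inherited):
        for name, children in node.items():
            hint = POLICY_HINT.get(name, inherited)
            if name == label:
                return hint
            found = walk(children, hint)
            if found is not None:
                return found
        return None
    return walk(SEMANTIC_TREE, None) or "default obstacle avoidance"

print(policy_for("ambulance"))   # -> yield right of way proactively (inherited)
print(policy_for("school_bus"))  # -> expect stops; do not overtake when the stop sign is unfolded
```

The point of such a hierarchy is that a fine-grained label can always fall back to the nearest ancestor the planner knows how to handle.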




The third is that target shapes present a long-tail distribution. The picture above shows vehicles of the same broad type in very different shapes: tractors and other agricultural machinery, trailers, car carriers, vans, garbage trucks, police cars, vehicles carrying prominent loads, rescue vehicles, fire trucks, and so on. They are all vehicles, but their shapes vary enormously.
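One generic way to counter such long-tailed category frequencies at training time is to oversample the rare classes. The sketch below uses a repeat-factor-style rule, where a class's repeat factor grows as the square root of a frequency threshold over its observed frequency; it is a common technique shown for illustration, not necessarily the one used by SenseTime Jueying, and the label counts are made up.

```python
# A minimal sketch of repeat-factor-style oversampling for long-tailed data.
# Generic technique for illustration; label counts below are hypothetical.
import math
from collections import Counter

def repeat_factors(class_counts: Counter, threshold: float = 0.01) -> dict:
    """Per-class repeat factor r_c = max(1, sqrt(threshold / f_c)),
    where f_c is the class's share of all labeled instances."""
    total = sum(class_counts.values())
    return {cls: max(1.0, math.sqrt(threshold / (count / total)))
            for cls, count in class_counts.items()}

# Hypothetical counts: ordinary cars dominate, special vehicles are rare.
counts = Counter({"car": 100_000, "truck": 20_000, "police_car": 300,
                  "fire_truck": 120, "tractor": 60})
for cls, r in repeat_factors(counts).items():
    print(f"{cls:12s} repeat factor {r:.2f}")
# Rare classes such as "tractor" are repeated several times per epoch,
# so the detector sees enough examples of unusual vehicle shapes.
```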


So how to solve these three challenges?


2 Three Dimensions of Problem Solving


We address the above problems along three dimensions: data, algorithms, and computing power.




From the data dimension, we first tackle the open-set problem. This is a problem no autonomous driving company can avoid: how do we understand and organize all the targets encountered in driving scenarios? Based on current domain knowledge, we divide the targets that autonomous driving needs to interact with into four major categories.


The four major categories are traffic participants, traffic facilities, animals, and other obstacles on the road.


Why is it summarized like this?


Traffic participants are among the highest-priority objects in autonomous driving. We must be especially careful to avoid collisions with them because people are involved: they are usually intelligent agents, such as bicycles or vehicles driven by people, and personal injury must be avoided. Traffic participants act with a certain rationality and have their own value functions. We usually detect and track them, and downstream modules even need to predict each traffic participant's behavior and trajectory, so these objects have a very high priority.


The second category is the various traffic facilities that appear on the road. We define traffic facilities as objects with certain functional attributes in the traffic scene, such as lane lines, traffic lights, traffic islands, and traffic warning objects. Traffic facilities generally define the structure of a road and the drivable area, and they tell everyone which traffic rules and signals must be followed within that area.


Traffic warning objects are a typical example. They are obstacles in themselves, such as a water-filled barrier or a traffic cone: if one is placed in front of you, you need to drive around it. At the same time, they serve as warnings that there is a construction or accident scene nearby that needs attention. Because they carry this functional meaning, they are grouped under traffic facilities.


The third major category is animals. Why are animals a separate category? We believe animals are naturally moving objects, and in autonomous driving the distinction between moving and non-moving objects matters greatly for downstream decision-making and planning; for a moving object, its speed and motion state may need to be estimated. Animals can move, but they do not understand traffic rules and may cross the road at random, which makes them different from traffic participants.
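The sketch below writes down this four-way taxonomy together with the attributes the talk attaches to each group: priority for planning, whether trajectory prediction is needed, and whether the target follows traffic rules. The field names and priority values are illustrative assumptions, not SenseTime Jueying's actual schema.

```python
# A sketch of the four-way target taxonomy described above; field names and
# priority values are illustrative assumptions, not SenseTime Jueying's schema.
from dataclasses import dataclass
from enum import Enum

class TargetCategory(Enum):
    TRAFFIC_PARTICIPANT = "traffic_participant"  # pedestrians, vehicles, cyclists
    TRAFFIC_FACILITY = "traffic_facility"        # lane lines, traffic lights, warning objects
    ANIMAL = "animal"                            # moving, but does not follow traffic rules
    OTHER_OBSTACLE = "other_obstacle"            # rocks, tarpaulins, debris on the road

@dataclass(frozen=True)
class CategoryProfile:
    priority: int               # 3 = highest attention from planning
    needs_prediction: bool      # should downstream modules predict its trajectory?
    follows_traffic_rules: bool

CATEGORY_PROFILES = {
    TargetCategory.TRAFFIC_PARTICIPANT: CategoryProfile(3, True, True),
    TargetCategory.ANIMAL:              CategoryProfile(2, True, False),
    TargetCategory.TRAFFIC_FACILITY:    CategoryProfile(1, False, False),
    TargetCategory.OTHER_OBSTACLE:      CategoryProfile(1, False, False),
}

# Example: an animal detection is still routed to trajectory prediction even
# though it is not a traffic participant, because it is naturally a moving object.
profile = CATEGORY_PROFILES[TargetCategory.ANIMAL]
print(profile.needs_prediction, profile.follows_traffic_rules)  # True False
```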

