"People are used to dividing everything into black and white, but unfortunately, reality is all gray." This sentence written by Liu Cixin is also a true portrayal of the autonomous driving industry. Two schools of thought, focusing on perception and focusing on maps, talk about Taoism in Huashan Mountains, and undercurrents hit the water. But right now, there is no optimal solution to completely rid the car of human intervention.
Because whichever shortcut one takes, building a smarter car is the only way through. Especially as autonomous driving scenarios extend from highways to urban roads, improving the vehicle's perception and cognitive capabilities becomes increasingly critical.
On the one hand, however information-rich a high-precision map may be, the roads it describes are constantly changing. In Beijing, road topology changed at an average of 5.06 points per 100 kilometers over half a year, and Guangzhou sees an average of two road-diversion construction projects per day; only continuous data collection and uploading can keep a map fresh. On the other hand, road participants are disorderly and random: beyond vehicles, unpredictable factors such as pedestrians and non-motorized vehicles are a major test for the advancement of autonomous driving.
Wu Xinzhou, vice president of autonomous driving at Xpeng Motors, once put it bluntly: "Compared with highway NGP, if I had to express it with a number, urban NGP may be more than a hundred times harder." Yet to reach large-scale mass production of autonomous driving, urban NGP is a hurdle that must be cleared.
The "9981 Difficulty" of Training Data
To date, many Chinese cities, including Beijing, Chongqing, Wuhan, Shenzhen, Guangzhou, and Changsha, have allowed commercial trial operation of self-driving vehicles in specific areas and time windows. Not long ago, Beijing issued road test permits for the stage of fully driverless operation with remote supervision.
Unmanned testing of autonomous driving has thus moved from "safety operator in the passenger seat" and "nobody in front, operator in the back" to a third stage: "operator remotely outside the vehicle". Throughout, one constant theme is using a steady stream of data to polish the autonomous driving perception model. The model sets the upper limit of capability; data is the driving force behind it. So the first question is: how can more valuable training data be obtained at lower cost and higher efficiency?
Image source: Tianfeng Securities
It may sound a bit incredible, but take data labeling as an example. In the past, the common industry practice was to annotate single 2D image frames, roughly one frame per second of video. Yet real video contains more than ten frames per second; in other words, many frames in between were never labeled, and that portion became a "wasted" resource.
Moreover, as autonomous driving annotation moves to 4D space (3D space plus the time dimension), the minimum annotation unit becomes a Clip, essentially a short video containing camera and sensor data, which makes manual annotation even harder.
A research report from Tianfeng Securities shows that autonomous driving at L3 and above requires large amounts of 3D point cloud data. It demands not only real-time processing and analysis of the data returned by sensors, but also handling of curved lane lines and of wear and damage that distort shape and reflectivity, all of which pose great challenges to recognition accuracy.
Therefore, if these discrete frames are expanded into Clip form, the cost of manual annotation and rework will inevitably drive up the cost of training autonomous driving models. This is the key reason Tesla went from outsourcing data annotation, to building its own manual annotation team, to pushing for automated annotation. Domestic car companies such as Xpeng have likewise built fully automatic labeling systems, raising efficiency by a claimed factor of nearly 45,000: a labeling task that once took 2,000 people a year can now be completed in about 16.7 days.
Beyond car companies, autonomous driving firms are also actively experimenting, among them Haomo Zhixing, which launched a video self-supervised large model on top of its data intelligence system MANA. Put simply: apply image masks to hide certain regions of the video, give the model the previous frame, let it guess the next frame, and it learns to extract features on its own.
Image source: Haomo Zhixing
Fully annotated Clips are then given to the model for fine-tuning. Iterating this way, the model's accuracy and precision improve through deep learning. With this video self-supervised large model, Haomo Zhixing reduced Clip annotation cost by 98%. Moreover, since the large model running on the server generalizes better, once training is complete and the model is deployed on the vehicle-side autonomous driving platform, its predictive ability is stronger.
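The mask-and-predict idea above can be sketched in a few lines. This is only a toy illustration under stated assumptions, not Haomo's MANA pipeline: the `mask_frame` helper, the patch size, and the identity "model" standing in for a trained network are all hypothetical, chosen to show where the label-free loss signal comes from.

```python
import numpy as np

def mask_frame(frame, patch=4, mask_ratio=0.5, rng=None):
    """Zero out randomly chosen square patches of a frame (the 'image mask' step)."""
    rng = rng if rng is not None else np.random.default_rng(0)
    h, w = frame.shape
    masked = frame.copy()
    keep = np.ones((h // patch, w // patch), dtype=bool)
    drop = rng.choice(keep.size, size=int(keep.size * mask_ratio), replace=False)
    keep.flat[drop] = False
    for i in range(h // patch):
        for j in range(w // patch):
            if not keep[i, j]:
                masked[i * patch:(i + 1) * patch, j * patch:(j + 1) * patch] = 0.0
    return masked, keep

def self_supervised_loss(model, prev_frame, next_frame):
    """Predict the next frame from a masked previous frame.
    The MSE between prediction and the real next frame is the training
    signal -- no human annotation is involved anywhere."""
    masked, _ = mask_frame(prev_frame)
    pred = model(masked)
    return float(np.mean((pred - next_frame) ** 2))

# Toy 'model': identity predictor, assuming a nearly static scene.
rng = np.random.default_rng(42)
frame_t = rng.random((16, 16))
frame_t1 = frame_t + 0.01 * rng.standard_normal((16, 16))
loss = self_supervised_loss(lambda x: x, frame_t, frame_t1)
```

A real system would replace the identity lambda with a video transformer and minimize this loss over millions of unlabeled clips; the point is that the video itself supplies the supervision.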
However, this alone is not enough. At this stage, autonomous driving's appetite for data is far from satisfied, and a rich data distribution is the prerequisite for training and optimizing perception models.
When building an autonomous driving system, whether data is pre-collected by dedicated collection vehicles or streamed back from mass-produced vehicles, development cycles remain long and costs high. Simulation is therefore regarded as an accelerator of autonomous driving development and is widely used in the industry: systems typically undergo extensive simulation testing before being installed in mass-production vehicles.
However, Ai Rui, vice president of technology at Haomo Zhixing, pointed out that, given the differing characteristics of each sensor, current simulation technology still has much room for improvement. For example, lidar generally has a lower noise floor than millimeter-wave radar, and the two respond very differently to conditions such as rain, snow, and fog, which makes modeling them in the same scene harder.
"It's like watching a movie. No matter how well-done CG animation is, it can still be distinguished from real scenes." Compared with over-reliance on simulation technology, Millimeter Zhixing is interested in using low-cost general scene generation to achieve high-cost Advantages of corner cases.
This is also the fundamental reason Haomo Zhixing introduced NeRF (Neural Radiance Fields) into its 3D reconstruction large model. NeRF is a 3D reconstruction technique that emerged in 2020; able to synthesize a 360-degree surround view from just a few pictures, it quickly became popular in e-commerce.
In autonomous driving, NeRF not only helps reconstruct scene data but also allows the viewpoint to be adjusted. In this way, driving under extreme road conditions can be simulated and long-tail scenarios covered more comprehensively. Lighting changes, nighttime effects, and so on can also be simulated to generate the data needed.
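At the heart of NeRF's ability to render a scene from any viewpoint is volume rendering: along each camera ray, sampled densities and colors are alpha-composited into one pixel. The sketch below shows only that compositing step, with hand-picked sample values rather than a trained network, and makes no claim about Haomo's actual implementation.

```python
import numpy as np

def composite_ray(densities, colors, deltas):
    """Alpha-composite samples along one camera ray (NeRF volume rendering).
    densities: sigma at each sample point; colors: RGB per sample;
    deltas: spacing between consecutive samples."""
    alphas = 1.0 - np.exp(-densities * deltas)                       # segment opacity
    trans = np.cumprod(np.concatenate(([1.0], 1.0 - alphas[:-1])))   # transmittance
    weights = trans * alphas                                         # per-sample weight
    rgb = (weights[:, None] * colors).sum(axis=0)                    # rendered pixel
    return rgb, weights

# One ray with 4 samples: empty space, then an opaque red surface.
sigma = np.array([0.0, 0.0, 50.0, 50.0])
rgb_samples = np.array([[0.0, 0.0, 0.0],
                        [0.0, 0.0, 0.0],
                        [1.0, 0.0, 0.0],
                        [1.0, 0.0, 0.0]])
delta = np.full(4, 0.1)
pixel, weights = composite_ray(sigma, rgb_samples, delta)
```

Because the weights depend only on the ray being marched, moving the camera simply marches different rays through the same learned field, which is exactly what makes viewpoint adjustment and synthetic night or lighting variants possible.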
After adding the data generated with NeRF, Haomo Zhixing cut its perception error rate by at least 30% from the previous baseline. More data is better, but the key is not only vertical "quantity" but also horizontal "richness". Facing the mountain of data, accumulation is the only way forward.
Tesla has a fleet of a million vehicles; Xpeng, a hundred thousand; Haomo Zhixing relies on the brand scale of Great Wall Motors. By the end of 2022 its cumulative mileage exceeded 25 million kilometers, nearly 20 models carry the HPilot system, and monthly installations are growing at over 200%. Haomo expects HPilot to land in 100 Chinese cities by the first half of 2024.
Autonomous driving "entering the city" is more difficult to recognize than to perceive
Building perception capability from big data is only the first step toward autonomous driving. Beyond that, Tsinghua University professor Deng Zhidong pointed out in an interview with domestic media that a core technical difficulty of autonomous driving is how the car understands complex dynamic driving scenarios (DDS) so as to guarantee safety.
In his view, human driving is grounded in cognitive understanding, relying on interpretable visual perception and the brain to reach decisions. By contrast, it is hard for autonomous vehicles to acquire human-level perception, prediction, and cognitive decision-making in complex dynamic environments.
Earlier, Haomo Zhixing launched a surround-view perception algorithm (BEV) based on the Transformer architecture and gradually applied it on real roads. CEO Gu Weihao noted, however, that once the BEV solution was on vehicles, detection of lane lines and common obstacles was relatively good, and detection range and measurement accuracy under various complex conditions improved markedly. But hard problems remain, above all the stable detection of the many irregularly shaped obstacles on urban roads with a vision-based solution.
There are generally two approaches. One is to expand the semantic whitelist: taking tire recognition as an example, this means collecting large amounts of tire data and enlarging the labeled sample set, which is time-consuming and labor-intensive. The more general alternative may achieve twice the result with half the effort: there is no need to understand what the obstacle is; based on information such as its height, judge whether it obstructs traffic, and if so avoid it or drive around it.
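The class-agnostic idea, "don't classify it, just decide whether it blocks the path", can be sketched as a height filter over a point cloud. Everything here is a simplified assumption: the ground is taken as a flat plane, the 15 cm drivability threshold, the 0.5 m grid, and the `blocking_obstacles` name are all hypothetical, not Haomo's actual logic.

```python
import numpy as np

def blocking_obstacles(points, ground_z=0.0, min_height=0.15, cell=0.5):
    """Flag grid cells containing points higher than a drivability threshold
    above the (assumed flat) ground -- no object classification needed.
    points: (N, 3) array of x, y, z in the vehicle frame.
    Returns the set of (ix, iy) cells the planner should avoid or go around."""
    heights = points[:, 2] - ground_z
    tall = points[heights > min_height]          # keep only points that stick up
    return {(int(x // cell), int(y // cell)) for x, y, _ in tall}

# A low manhole cover (drivable) and a fallen tire (must be avoided).
cloud = np.array([
    [2.0, 0.0, 0.02],   # manhole cover: 2 cm, below threshold
    [5.0, 1.0, 0.25],   # tire sidewall: 25 cm, blocks the lane
    [5.1, 1.1, 0.30],
])
blocked = blocking_obstacles(cloud)
```

The tire is flagged without the system ever knowing it is a tire, which is exactly why this route scales better than endlessly growing a labeled whitelist.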
For this purpose, Haomo has launched a multi-modal mutual-supervision large model and a dynamic environment large model. The former uses the differing characteristics of cameras, lidar, millimeter-wave radar, and other sensors to supervise one another in identifying general obstacles and general structures. The latter is somewhat similar to the video self-supervised large model, and its purpose is to strengthen the system's perception capability.