Xiaomi's autonomous driving technology: Algorithms

Publisher: TranquilWhisper | Latest update time: 2024-08-12 | Source: elecfans

Xiaomi Auto has not released detailed information about its autonomous driving algorithms, but we can get a glimpse of them through the academic papers Xiaomi Auto has published. At present there are two main papers. One is "SOGDet: Semantic-Occupancy Guided Multi-view 3D Object Detection", whose authors are mostly from the National University of Singapore, with only two from Xiaomi Auto. The other is "SurroundSDF: Implicit 3D Scene Understanding Based on Signed Distance Field", with eight credited authors: six from Xiaomi Auto and two from the School of Software Engineering of Xi'an Jiaotong University, one of whom later joined Xiaomi Auto. The core of both papers is the occupancy network, which Lei Jun also mentioned at the Xiaomi Auto launch event.


The first of these two papers focuses on 3D perception, the second on 3D scene reconstruction. 3D perception papers are inevitably ranked on the nuScenes test dataset. Most people are not keen to read dense, difficult papers, so let's first look at the scores of Xiaomi's two algorithm papers.


Image source: SOGDet: Semantic-Occupancy Guided Multi-view 3D Object Detection paper

The NDS score is 58.1, which is quite low: Huawei's TransFusion scored 71.7 back in October 2021, and Leapmotor's EA-LSS scored 77.6. However, the latter two are essentially bounding-box based rather than occupancy-network based, so the comparison is somewhat unfair.
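For reference, NDS (nuScenes Detection Score) is a weighted combination of mAP and five true-positive error metrics (translation, scale, orientation, velocity, attribute). A minimal sketch of the composite, with illustrative input numbers that are not from either paper:

```python
# Hedged sketch of the nuScenes Detection Score (NDS).
# NDS = (1/10) * [5 * mAP + sum over the 5 TP metrics of (1 - min(1, err))].

def nds(mAP, tp_errors):
    """Combine mAP with the five true-positive error metrics into NDS."""
    return (5 * mAP + sum(1 - min(1.0, e) for e in tp_errors)) / 10.0

# Illustrative values only (mATE, mASE, mAOE, mAVE, mAAE):
score = nds(0.45, [0.35, 0.26, 0.42, 0.33, 0.19])
print(round(score, 3))  # 0.57
```

Lower true-positive errors raise the score, which is why two models with similar mAP can differ noticeably in NDS.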


Image source: SOGDet: Semantic-Occupancy Guided Multi-view 3D Object Detection paper

Compared with TPVFormer, another top occupancy network architecture, proposed by Tsinghua University, there is basically no difference.


Image source: "SurroundSDF: Implicit 3D Scene Understanding Based on Signed Distance Field" paper

The algorithm in "SurroundSDF: Implicit 3D Scene Understanding Based on Signed Distance Field" ranks first in mIoU among all occupancy network models. mIoU (Mean Intersection over Union) is the standard metric for semantic segmentation: for each class it computes the intersection over union between the ground-truth and predicted segmentation, then averages over classes. In the formula below, i indexes the ground truth and j the prediction:

mIoU = (1 / (k + 1)) × Σ_{i=0..k} [ p_ii / ( Σ_j p_ij + Σ_j p_ji − p_ii ) ]

where p_ij is the number of elements of ground-truth class i predicted as class j, and k + 1 is the number of classes.
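As a quick sketch, the per-class IoU can be computed from a confusion matrix and averaged; a hedged numpy example with a made-up 2-class matrix:

```python
import numpy as np

# Hedged sketch: mIoU from a confusion matrix.
# conf[i, j] counts elements of ground-truth class i predicted as class j.

def miou(conf):
    conf = np.asarray(conf, dtype=float)
    tp = np.diag(conf)                 # p_ii: correct predictions per class
    fp = conf.sum(axis=0) - tp         # predicted as class i but actually other
    fn = conf.sum(axis=1) - tp         # actually class i but predicted other
    iou = tp / (tp + fp + fn)          # intersection over union per class
    return iou.mean()                  # mean over classes

# Toy 2-class example (values are illustrative):
conf = [[8, 2],
        [1, 9]]
print(round(miou(conf), 4))  # 0.7386
```

Class 0 here gets IoU 8/11 and class 1 gets 9/12; mIoU is their average.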


Image source: "SurroundSDF: Implicit 3D Scene Understanding Based on Signed Distance Field" paper

The 3D scene reconstruction score can basically be considered first place.

Let’s take a closer look at these two papers.


Image source: SOGDet: Semantic-Occupancy Guided Multi-view 3D Object Detection paper

SOGDet combines 3D object detection with a 3D semantic-segmentation occupancy network, mainly to improve perception of the non-road environment and build a complete, realistic 3D scene, so that the autonomous driving decision system can better understand its surroundings and produce correct path planning. The non-road environment includes vegetation (green belts, grass, etc.), sidewalks, terrain, and man-made structures.
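To make the idea concrete, here is a minimal, hypothetical sketch of such a semantic occupancy grid in numpy; the class ids, labels, and grid size are made up for illustration and are not SOGDet's actual configuration:

```python
import numpy as np

# Hedged sketch: a semantic occupancy grid -- a dense voxel volume where each
# cell holds a class id, including the non-road classes emphasized in the text.
CLASSES = {0: "empty", 1: "drivable_surface", 2: "sidewalk",
           3: "vegetation", 4: "terrain", 5: "manmade", 6: "vehicle"}

# A tiny 4 x 4 x 2 grid (x, y, z), initialized as empty space.
grid = np.zeros((4, 4, 2), dtype=np.int8)
grid[:, :2, 0] = 1        # drivable surface on the ground layer
grid[:, 2, 0] = 2         # a sidewalk strip
grid[:, 3, :] = 3         # vegetation occupying both height layers
grid[1, 0, 0] = 6         # one vehicle voxel on the road

occupied = np.count_nonzero(grid)
print(occupied)           # 20 voxels are not empty
```

Unlike a list of bounding boxes, every voxel carries a label, so vegetation, sidewalks, and terrain are represented even though no detector would box them.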


Image source: SOGDet: Semantic-Occupancy Guided Multi-view 3D Object Detection paper

There is nothing unique about the network architecture of Xiaomi's SOGDet; after all, the foundational network components were built by Google and Meta. At present, top autonomous driving networks basically have three parts. The backbone is still CNN-based: a full Transformer has too much computational complexity to deploy, so most teams still use ResNet-50/101, while a few use Google's ViT, which is hard to deploy in practice. The view-transformation part uses a View Transformer for the BEV conversion, here still the classic LSS (Lift, Splat, Shoot) method proposed by NVIDIA, where:

Lift — for each camera image, explicitly estimate a depth distribution over the downsampled image-plane feature points, producing a frustum (point cloud) that carries the image features;

Splat — scatter the frustums (point clouds) of all cameras into the BEV grid using the cameras' intrinsic and extrinsic parameters, and sum-pool the frustum points falling in each grid cell to form a BEV feature map;

Shoot — process the BEV feature map with a task head and output the perception results. LSS was proposed in 2020, and many improvements have been made since, mainly Depth Correction and camera-aware depth prediction.

In addition, efficient voxel pooling has been proposed to accelerate the BEVDepth method, and multi-frame fusion to improve detection quality and velocity estimation. At the task level, deconvolution and MLP heads output either semantic occupancy or detection bounding boxes.
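The Splat step described above can be sketched in a few lines of numpy; the grid layout, cell size, and feature dimension here are illustrative assumptions, not the papers' actual configuration:

```python
import numpy as np

# Hedged sketch of LSS "Splat": frustum points carrying image features are
# scattered into a BEV grid with sum-pooling over each cell.

def splat(points_xy, feats, grid_size=4, cell=1.0):
    """Sum-pool per-point features into a grid_size x grid_size BEV map."""
    C = feats.shape[1]
    bev = np.zeros((grid_size, grid_size, C))
    ix = (points_xy[:, 0] / cell).astype(int)
    iy = (points_xy[:, 1] / cell).astype(int)
    # Drop points that fall outside the BEV grid.
    keep = (ix >= 0) & (ix < grid_size) & (iy >= 0) & (iy < grid_size)
    # Unbuffered scatter-add: multiple points in one cell accumulate.
    np.add.at(bev, (ix[keep], iy[keep]), feats[keep])
    return bev

pts = np.array([[0.2, 0.3], [0.7, 0.1], [2.5, 3.9], [9.0, 9.0]])  # last is out of range
fts = np.ones((4, 8))
bev = splat(pts, fts)
print(bev[0, 0].sum())   # two points fell into cell (0, 0): 2 * 8 = 16.0
```

`np.add.at` is used rather than plain fancy-indexed `+=` so that repeated indices accumulate correctly, which is exactly the sum-pooling behavior the text describes.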

Now let's look at the paper with greater Xiaomi Auto involvement, "SurroundSDF: Implicit 3D Scene Understanding Based on Signed Distance Field". It mainly concerns a 3D semantic-segmentation occupancy network, so the main metric is mIoU.

Xiaomi Auto SurroundSDF Network Architecture


Image source: "SurroundSDF: Implicit 3D Scene Understanding Based on Signed Distance Field" paper

Let me explain SDF briefly. A signed distance field (SDF) is a variant of the distance field: it maps each position to its distance to the nearest surface (edge) in 3D (2D) space. Distance fields appear in many areas, including image processing, physics, and computer graphics; in computer graphics the distance is usually signed, with the sign indicating whether a position is inside the mesh. Whether 2D or 3D, graphics can be stored explicitly or implicitly. A 3D model, for example, can store its geometry directly as a mesh, or be represented implicitly via an SDF, a point cloud, or neural rendering. 2D assets (textures) are generally stored directly as RGB or HSV parameters, but such raster images show jagged edges when enlarged, so obtaining high-definition output requires ever larger storage. A vector representation avoids this, and SDFs were created for exactly this need; this is what Lei Jun calls "ultra-high-resolution vectors". The technique is also used in mobile games: the best-known example is "Genshin Impact", where facial shadows are rendered using SDFs.
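The simplest possible SDF illustrates the sign convention described above; a hedged sketch for a circle (the shapes and values are chosen for illustration):

```python
import numpy as np

# Hedged sketch: the SDF of a circle of radius r. The value is negative
# inside the shape, zero on the boundary, and positive outside -- the sign
# tells you whether a point is inside, exactly as described in the text.

def circle_sdf(p, center=(0.0, 0.0), r=1.0):
    p = np.asarray(p, dtype=float)
    return float(np.linalg.norm(p - np.asarray(center)) - r)

print(circle_sdf((2.0, 0.0)))   # 1.0: one unit outside the circle
print(circle_sdf((0.0, 0.0)))   # -1.0: at the center, inside
print(circle_sdf((1.0, 0.0)))   # 0.0: exactly on the surface
```

Because the function is defined at every real-valued coordinate, not just on a pixel grid, it can be evaluated at any resolution, which is the "vector" property the text refers to.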

The network architecture of Xiaomi Auto's SurroundSDF differs from that of the previous paper only in the final output head; the backbone network, LSS, and voxel stages are identical.

SurroundSDF aims to address the challenges of vision-based 3D scene understanding in autonomous driving systems. Specifically, it tackles the following problems:

Continuity and accuracy: existing object-free methods fail to construct continuous, accurate obstacle surfaces when predicting semantics on a discrete voxel grid. SurroundSDF achieves continuous perception of the 3D scene from surround-view images by implicitly predicting the signed distance field (SDF) and a semantic field.

Lack of accurate SDF ground truth: since accurate SDF ground truth is difficult to obtain, the paper proposes a new weakly supervised paradigm called the Sandwich Eikonal formulation, which improves surface accuracy by imposing correct, dense constraints on both sides of the surface. The Eikonal equation is a type of nonlinear partial differential equation that arises in wave-propagation problems. Briefly: it can compute the travel time of a seismic wave from the source point to any point in space, thereby describing the wave's travel-time field in the medium; solving it quickly is significant for accelerating the reconstruction of that field and thus reducing earthquake damage. In image processing, the Eikonal equation is used to compute multi-point distance fields, denoise images, and extract shortest paths on discrete and parameterized surfaces.
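The connection between SDFs and the Eikonal equation is that a true signed distance field satisfies |∇φ| = 1 away from the surface; Eikonal-style losses penalize deviations from this. A hedged numerical check for the SDF of a sphere, using central finite differences (the test point is arbitrary):

```python
import numpy as np

# Hedged sketch: verify the eikonal property |grad phi| = 1 numerically
# for the exact SDF of a unit sphere, phi(p) = |p| - r.

def sphere_sdf(p, r=1.0):
    return float(np.linalg.norm(p) - r)

def grad_norm(f, p, h=1e-5):
    """Norm of the gradient of f at p via central finite differences."""
    p = np.asarray(p, dtype=float)
    g = np.zeros(3)
    for i in range(3):
        e = np.zeros(3)
        e[i] = h
        g[i] = (f(p + e) - f(p - e)) / (2 * h)
    return float(np.linalg.norm(g))

print(round(grad_norm(sphere_sdf, [0.5, 1.2, -0.3]), 6))  # 1.0
```

A learned field that violates this property is not a valid distance field, which is why the paper can use the constraint as weak supervision in place of exact SDF labels.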

3D Semantic Segmentation and Continuous 3D Geometric Reconstruction: SurroundSDF aims to simultaneously solve the problems of 3D semantic segmentation and continuous 3D geometric reconstruction in one framework, leveraging the powerful representation capability of SDF.

Long-tail problem and coarse description of 3D scenes: Despite the progress made in 3D object detection algorithms, the long-tail problem and coarse description of 3D scenes remain challenges, requiring a deeper understanding of 3D geometry and semantics.

Tesla's AI Day also introduced "Implicit Neural Representation" (INR). Take images as an example: the most common representation is discrete pixels on a two-dimensional grid. But the real world we see can be considered continuous, or approximately continuous, so we can consider representing the true image with a continuous function. However, we have no way of knowing the exact form of that function, so the idea is to approximate it with a neural network; that is INR. For an image, the INR maps two-dimensional coordinates to RGB values; for a video, it maps the time t and the two-dimensional image coordinates XY to RGB values; for a three-dimensional shape, it maps the coordinates XYZ to 0 or 1, indicating whether that position in space is inside or outside the object. An INR is a continuous function whose complexity is proportional to the complexity of the signal, not to the signal's resolution. For example, if a 16*16 image and a 32*32 image have the same content, their INR is the same; in other words, a representation fitted even at the lowest resolution can be sampled continuously to yield high-resolution output.
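The resolution-independence argument above can be sketched directly; here a fixed analytic function stands in for a trained MLP (a hedged illustration of the concept, not Tesla's or Xiaomi's implementation):

```python
import numpy as np

# Hedged sketch of the INR idea: the signal is a continuous function of
# coordinates, so "resolution" is just a choice of sample points.

def implicit_image(x, y):
    """Continuous grayscale 'image' on [0, 1]^2 (a stand-in for an MLP)."""
    return 0.5 + 0.5 * np.sin(6.0 * x) * np.cos(6.0 * y)

def render(f, res):
    """Sample the same continuous function at any chosen resolution."""
    xs = np.linspace(0.0, 1.0, res)
    gx, gy = np.meshgrid(xs, xs, indexing="ij")
    return f(gx, gy)

lo = render(implicit_image, 16)   # 16 x 16 sampling
hi = render(implicit_image, 32)   # 32 x 32 sampling of the SAME function
print(lo.shape, hi.shape)
# Both grids sample one underlying function, so shared sample points agree:
print(bool(np.isclose(lo[0, 0], hi[0, 0])))  # True
```

Nothing about `implicit_image` changes between the two renders; only the sampling density does, which is exactly the sense in which an INR's complexity is tied to the signal, not the resolution.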
