Xiaomi Auto has not released detailed information about its autonomous driving algorithms, but the academic papers it has published offer a glimpse. At present, Xiaomi Auto has two main papers. One is "SOGDet: Semantic-Occupancy Guided Multi-view 3D Object Detection", whose authors come from the National University of Singapore, with only two from Xiaomi Auto. The other is "SurroundSDF: Implicit 3D Scene Understanding Based on Signed Distance Field", which lists eight authors: six from Xiaomi Auto and two from the School of Software Engineering at Xi'an Jiaotong University, one of whom later joined Xiaomi Auto. The core of both papers is the occupancy network, which Lei Jun also mentioned at the Xiaomi Auto launch event.
The first of these two papers focuses on 3D perception, and the second on 3D scene reconstruction. 3D perception papers are inevitably ranked on the nuScenes test set. Most people are not interested in reading dense, obscure papers, so let's first look at the scores of Xiaomi's two papers.
Image source: SOGDet: Semantic-Occupancy Guided Multi-view 3D Object Detection paper
The NDS (nuScenes Detection Score) is 58.1, which is quite low. Huawei's TransFusion scored 71.7 in October 2021, and Leapmotor's EA-LSS scored 77.6. However, the latter two are essentially bounding-box detectors rather than occupancy networks, so the comparison is somewhat unfair.
Image source: SOGDet: Semantic-Occupancy Guided Multi-view 3D Object Detection paper
Compared with TPVFormer, another top-tier occupancy network architecture, there is essentially no difference. TPVFormer was proposed by researchers at Tsinghua University and PhiGent Robotics.
Image source: "SurroundSDF: Implicit 3D Scene Understanding Based on Signed Distance Field" paper
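The algorithm in the paper "SurroundSDF: Implicit 3D Scene Understanding Based on Signed Distance Field" ranks first in mIoU among the occupancy network models compared. mIoU (mean Intersection over Union) is a standard metric for semantic segmentation. For each class it computes the ratio of the intersection to the union of two sets, the ground truth and the predicted segmentation, and then averages over all classes. With i denoting the ground-truth class and j the predicted class, the standard formula is:

$$\mathrm{mIoU} = \frac{1}{k+1}\sum_{i=0}^{k}\frac{p_{ii}}{\sum_{j=0}^{k}p_{ij}+\sum_{j=0}^{k}p_{ji}-p_{ii}}$$

where $p_{ij}$ is the number of samples (pixels or voxels) whose ground-truth class is i and whose predicted class is j, and k+1 is the number of classes.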
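As a rough illustration of how mIoU is computed in practice, the sketch below builds a confusion matrix from integer class labels and averages the per-class IoU. The function name `mean_iou` and the choice to skip classes that appear in neither the ground truth nor the prediction are assumptions made for this example; benchmark toolkits may handle such classes differently.

```python
import numpy as np

def mean_iou(gt, pred, num_classes):
    """Mean IoU from integer class labels (gt = ground truth i, pred = prediction j)."""
    gt = np.asarray(gt).ravel()
    pred = np.asarray(pred).ravel()
    # conf[i, j] = number of samples with ground-truth class i predicted as class j
    conf = np.bincount(gt * num_classes + pred,
                       minlength=num_classes ** 2).reshape(num_classes, num_classes)
    tp = np.diag(conf)                                 # p_ii
    union = conf.sum(axis=1) + conf.sum(axis=0) - tp   # sum_j p_ij + sum_j p_ji - p_ii
    valid = union > 0                                  # assumption: ignore absent classes
    return float((tp[valid] / union[valid]).mean())

# Two classes, four voxels: class 0 has IoU 1/2, class 1 has IoU 2/3.
print(mean_iou([0, 0, 1, 1], [0, 1, 1, 1], num_classes=2))
```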
Image source: "SurroundSDF: Implicit 3D Scene Understanding Based on Signed Distance Field" paper
Its score for 3D scene reconstruction can also essentially be considered the best.
Let’s take a closer look at these two papers.
Image source: SOGDet: Semantic-Occupancy Guided Multi-view 3D Object Detection paper
SOGDet combines 3D object detection with 3D semantic occupancy prediction, mainly to improve perception of the non-road environment and build a complete, realistic 3D scene, so that the autonomous driving decision-making system can better understand its surroundings and plan the correct path. The non-road environment includes vegetation (green belts, grass, etc.), sidewalks, terrain, and man-made structures.
Image source: SOGDet: Semantic-Occupancy Guided Multi-view 3D Object Detection paper
There is nothing unique about SOGDet's network architecture; after all, the foundations of these networks were laid by Google and Meta. At present, top autonomous driving networks basically consist of three parts. The backbone is still CNN-based, out of necessity: Transformers are too computationally expensive to deploy, so almost everyone still uses ResNet-50/101. A few use Google's ViT, but it is hard to deploy in practice. The second part uses a view transformer for the BEV transformation. Here the classic LSS (Lift, Splat, Shoot) method proposed by NVIDIA is still used, where:
Lift: after the image plane of each camera image is downsampled, explicitly estimates a depth distribution for every feature point, producing a frustum (point cloud) that carries the image features;
Splat: uses the cameras' intrinsic and extrinsic parameters to distribute the frustums (point clouds) of all cameras into the BEV grid, and sum-pools the frustum points that fall into each grid cell to form a BEV feature map;
Shoot: uses task heads to process the BEV feature map and output perception results.
LSS was proposed in 2020, and many improvements have been made since, mainly depth correction and camera-aware depth prediction.
In addition, efficient voxel pooling was proposed to accelerate BEVDepth, and multi-frame fusion was proposed to improve object detection and velocity estimation. At the task level, deconvolution and an MLP are used to output either the semantic occupancy prediction or the object-detection bounding boxes.
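The Lift and Splat steps described above can be sketched in a few lines of NumPy for a single camera. This is only a minimal illustration under stated assumptions, not the implementation used in either paper: the function names (`lift`, `frustum_points`, `splat`), the pinhole intrinsic matrix `K`, the 4×4 `cam_to_ego` transform, and the choice of the ego x-y plane as the BEV plane are all hypothetical, and a real system uses learned depth logits and an optimized voxel-pooling kernel.

```python
import numpy as np

def lift(features, depth_logits):
    """Lift: weight image features by a per-pixel depth distribution.

    features:     (C, H, W) image features from the backbone
    depth_logits: (D, H, W) unnormalized scores over D discrete depth bins
    returns:      (D, H, W, C) frustum features
    """
    # Softmax over the depth axis gives a categorical depth distribution per pixel.
    shifted = depth_logits - depth_logits.max(axis=0, keepdims=True)
    depth_prob = np.exp(shifted)
    depth_prob /= depth_prob.sum(axis=0, keepdims=True)
    # (D, H, W, 1) * (1, H, W, C) -> (D, H, W, C)
    return depth_prob[..., None] * np.transpose(features, (1, 2, 0))[None]

def frustum_points(depth_bins, H, W, K, cam_to_ego):
    """Back-project every (pixel, depth bin) pair into the ego frame."""
    us, vs = np.meshgrid(np.arange(W), np.arange(H))                   # both (H, W)
    pix = np.stack([us, vs, np.ones_like(us)], axis=-1).astype(float)  # (H, W, 3)
    rays = pix @ np.linalg.inv(K).T                  # camera-frame rays with z = 1
    pts_cam = depth_bins[:, None, None, None] * rays[None]            # (D, H, W, 3)
    R, t = cam_to_ego[:3, :3], cam_to_ego[:3, 3]
    return pts_cam @ R.T + t

def splat(points, feats, x_range, y_range, cell):
    """Splat: sum-pool frustum features into a BEV grid over the ego x-y plane."""
    nx = int(round((x_range[1] - x_range[0]) / cell))
    ny = int(round((y_range[1] - y_range[0]) / cell))
    C = feats.shape[-1]
    pts = points.reshape(-1, 3)
    f = feats.reshape(-1, C)
    ix = np.floor((pts[:, 0] - x_range[0]) / cell).astype(int)
    iy = np.floor((pts[:, 1] - y_range[0]) / cell).astype(int)
    keep = (ix >= 0) & (ix < nx) & (iy >= 0) & (iy < ny)   # drop points outside the grid
    bev = np.zeros((nx, ny, C))
    np.add.at(bev, (ix[keep], iy[keep]), f[keep])           # unbuffered sum per cell
    return bev
```

For a multi-camera rig, the BEV maps returned by `splat` for each camera can simply be added together, since sum-pooling is associative. The Shoot step would then run a task head on the resulting BEV feature map.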
Next, let's look at the more substantial of Xiaomi Auto's two papers, "SurroundSDF: Implicit 3D Scene Understanding Based on Signed Distance Field". This paper is mainly about 3D semantic occupancy prediction, so its main metric is mIoU.
Xiaomi Auto SurroundSDF Network Architecture
Image source: "SurroundSDF: Implicit 3D Scene Understanding Based on Signed Distance Field" paper
A brief explanation of SDF: a signed distance field (SDF) is a variant of the distance field, which maps each position to its distance from the nearest surface in 3D space (or the nearest edge in 2D). Distance fields are used in image processing, physics, and computer graphics. In computer graphics the distance field is usually signed, with the sign indicating whether a position lies inside the mesh. Both 2D and 3D graphics can be stored in two ways, explicitly or implicitly. For example, a 3D model can store its data directly as a mesh, or it can be represented by an SDF, a point cloud, or neural rendering. A 2D asset (here meaning a texture) is generally stored directly as RGB or HSV values, but jagged edges appear when the image is enlarged, so obtaining a high-definition image requires more storage space. A vector-like representation is needed instead, and SDF was created to meet this demand; this is what Lei Jun calls ultra-high-resolution vectors. The technique is also used in mobile games: the most typical example is the top-ranked mobile game "Genshin Impact", in which facial shadows are produced with SDF.
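To make the idea concrete, here is a minimal sketch of an analytic SDF for a sphere, assuming the common convention that the distance is negative inside the shape and positive outside. The helper names `sphere_sdf` and `occupancy_grid` are illustrative only. Because the shape is stored as a function rather than as pixels or voxels, it can be sampled at any resolution without jagged, upscaled edges.

```python
import numpy as np

def sphere_sdf(p, center, radius):
    """Signed distance from points p (..., 3) to a sphere surface.

    Convention assumed here: negative inside, zero on the surface, positive outside.
    """
    return np.linalg.norm(np.asarray(p, dtype=float) - center, axis=-1) - radius

def occupancy_grid(resolution, half_extent=2.0):
    """Sample the same unit-sphere SDF on the z = 0 plane at any resolution."""
    axis = np.linspace(-half_extent, half_extent, resolution)
    xs, ys = np.meshgrid(axis, axis, indexing="ij")
    pts = np.stack([xs, ys, np.zeros_like(xs)], axis=-1)
    return sphere_sdf(pts, np.zeros(3), 1.0) < 0.0   # True where the point is inside

origin = np.zeros(3)
print(sphere_sdf([0.0, 0.0, 0.0], origin, 1.0))  # negative: inside, 1 unit from the surface
print(sphere_sdf([3.0, 0.0, 0.0], origin, 1.0))  # positive: outside, 2 units from the surface
print(occupancy_grid(8).shape, occupancy_grid(64).shape)
```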
The SurroundSDF network architecture differs from the one in the previous paper only in the final output head. The backbone, LSS, and voxel stages are exactly the same.
SurroundSDF aims to address the challenges of vision-based 3D scene understanding in autonomous driving systems. Specifically, it attempts to solve the following problems:
Continuity and accuracy: existing object-free methods predict the semantics of discrete voxel grids and therefore fail to construct continuous, accurate obstacle surfaces. SurroundSDF achieves continuous perception of 3D scenes from surround-view images by implicitly predicting the signed distance field (SDF) and the semantic field.
Lack of accurate SDF ground truth: because accurate SDF ground truth is difficult to obtain, the paper proposes a new weakly supervised paradigm, called the Sandwich Eikonal formulation, which improves the accuracy of the perceived surface by imposing correct and dense constraints on both sides of the surface. The Eikonal equation is a nonlinear partial differential equation that arises in wave propagation problems. For example, it can be used to compute the travel time of seismic waves from the source to any point in space, and thus describe the travel-time field of the wave in the medium; solving it quickly speeds up reconstruction of that field, which helps reduce the losses caused by earthquakes. In image processing, the Eikonal equation is used to compute distance fields from multiple points, to denoise images, and to extract shortest paths on discrete and parameterized surfaces. A signed distance field satisfies the Eikonal equation with unit gradient norm, |∇f| = 1, almost everywhere, which is why the equation can act as a constraint when true SDF values are unavailable.
3D Semantic Segmentation and Continuous 3D Geometric Reconstruction: SurroundSDF aims to simultaneously solve the problems of 3D semantic segmentation and continuous 3D geometric reconstruction in one framework, leveraging the powerful representation capability of SDF.
Long-tail problem and coarse description of 3D scenes: Despite the progress made in 3D object detection algorithms, the long-tail problem and coarse description of 3D scenes remain challenges, requiring a deeper understanding of 3D geometry and semantics.
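The Eikonal constraint mentioned above can be illustrated numerically. The sketch below measures how far the gradient norm of a sampled field deviates from 1, which is the plain Eikonal residual often used as a regularizer for SDF prediction. It is not the paper's Sandwich Eikonal formulation, whose details are not reproduced in this article; the function name `eikonal_residual` and the finite-difference setup are assumptions made for the example.

```python
import numpy as np

def eikonal_residual(field, spacing):
    """Mean squared deviation of |grad f| from 1 on a regularly sampled 3D field."""
    grads = np.gradient(field, spacing)             # finite differences along x, y, z
    grad_norm = np.sqrt(sum(g ** 2 for g in grads))
    return float(np.mean((grad_norm - 1.0) ** 2))

# Sample a true SDF (unit sphere) and a field that is not a distance field.
axis = np.linspace(-2.0, 2.0, 41)
spacing = axis[1] - axis[0]
x, y, z = np.meshgrid(axis, axis, axis, indexing="ij")
true_sdf = np.sqrt(x ** 2 + y ** 2 + z ** 2) - 1.0
scaled = 2.0 * true_sdf                             # same zero level set, gradient norm 2

print(eikonal_residual(true_sdf, spacing))          # close to 0
print(eikonal_residual(scaled, spacing))            # close to 1
```

Both fields describe the same surface (their zero level sets coincide), but only the first is a valid distance field, which is what the residual detects.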
Tesla's AI Day also brought up "implicit neural representation" (INR). Take images as an example: the most common way to represent one is as discrete pixels in two-dimensional space. But the real world we see can be considered continuous, or approximately continuous, so we can consider using a continuous function to represent the true state of the image. We have no way of knowing the exact form of this function, so a neural network is used to approximate it; this is INR. INR is used for images, videos, and voxel reconstruction. For an image, the INR function maps two-dimensional coordinates to RGB values. For a video, it maps the time t and the two-dimensional image coordinates (x, y) to RGB values. For a three-dimensional shape, it maps the three-dimensional coordinates (x, y, z) to 0 or 1, indicating whether a position in space is inside or outside the object. Because INR is a continuous function, the complexity of the function (network) is proportional to the complexity of the signal but independent of its resolution. For example, if a 16×16 image and a 32×32 image have the same content, their INRs will be the same. In other words, a representation learned at low resolution can be sampled continuously at any higher resolution.
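The resolution independence described above can be shown with a toy example. In the sketch below a hand-written continuous function stands in for a trained network (a real INR would be an MLP fitted to data); the names `inr` and `sample_image` are hypothetical. The same function is sampled at 16×16 and 32×32 without storing any pixels.

```python
import numpy as np

def inr(xy):
    """Stand-in for a trained network: maps continuous (x, y) in [0, 1]^2 to RGB."""
    x, y = xy[..., 0], xy[..., 1]
    r = 0.5 + 0.5 * np.sin(2 * np.pi * x)
    g = 0.5 + 0.5 * np.cos(2 * np.pi * y)
    b = x * y
    return np.stack([r, g, b], axis=-1)

def sample_image(resolution):
    """Query the continuous function at the pixel centres of a square grid."""
    axis = (np.arange(resolution) + 0.5) / resolution
    ys, xs = np.meshgrid(axis, axis, indexing="ij")
    return inr(np.stack([xs, ys], axis=-1))          # (resolution, resolution, 3)

low = sample_image(16)
high = sample_image(32)
print(low.shape, high.shape)
```

Averaging each 2×2 block of the 32×32 sample closely reproduces the 16×16 sample, since both are views of the same underlying function.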