Sensor fusion is an important topic in many perception systems, such as autonomous driving and robotics. Pairing a transformer-based detection head with CNN-based feature encoders (which extract features from the raw sensor data) has become one of the best-performing multi-sensor fusion frameworks for 3D detection, ranking near the top of many benchmark leaderboards.
This paper provides a review of recent literature on transformer-based 3D object detection, focusing mainly on sensor fusion. It introduces the basics of vision transformers (ViT) and briefly discusses several non-transformer-based, less dominant approaches to sensor fusion for autonomous driving. Finally, it summarizes the role of transformers in sensor fusion and proposes future research directions in this area.
For more information, please refer to: https://github.com/ApoorvRoboticist/Transformers-SensorFusion
Sensor fusion is the integration of perception data from different information sources. By utilizing the complementary information captured by different sensors, fusion reduces the uncertainty of state estimation and makes 3D object detection more robust. Object attributes are not equally identifiable in every modality, so it is necessary to draw on multiple modalities and extract complementary information from them. For example, LiDAR can better localize potential objects, radar can better estimate the speed of objects in the scene, and, last but not least, cameras can classify objects thanks to their dense pixel information.
Why is sensor fusion difficult?
Sensor data from different modalities often differ greatly in data distribution, in addition to each sensor living in its own coordinate space. For example, LiDAR operates in Cartesian coordinates, radar in polar coordinates, and images in perspective space. The spatial misalignment introduced by these different coordinate systems makes it difficult to merge the modalities. Another problem with multimodal input is time synchronization: when a network is fed both camera and LiDAR data, the two streams arrive on asynchronous timelines.
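Before any learned fusion, the modalities must be brought into a common frame using the calibration between sensors. The following minimal sketch (the matrix names, shapes, and specific sensors are assumptions for illustration) converts radar polar returns to Cartesian coordinates and projects LiDAR points into a camera's perspective space:

import numpy as np

def radar_polar_to_cartesian(r, azimuth):
    """Convert radar range/azimuth (polar) returns to Cartesian x, y."""
    return np.stack([r * np.cos(azimuth), r * np.sin(azimuth)], axis=-1)

def lidar_to_image(points_xyz, T_cam_from_lidar, K):
    """Project LiDAR points (N, 3) into a camera's perspective space.
    T_cam_from_lidar: 4x4 extrinsic calibration; K: 3x3 camera intrinsics."""
    pts_h = np.concatenate([points_xyz, np.ones((len(points_xyz), 1))], axis=1)
    pts_cam = (T_cam_from_lidar @ pts_h.T).T[:, :3]   # LiDAR frame -> camera frame
    in_front = pts_cam[:, 2] > 0                      # keep points in front of the camera
    uvw = (K @ pts_cam[in_front].T).T
    return uvw[:, :2] / uvw[:, 2:3]                   # pixel coordinates (u, v)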
The overall architecture of existing sensor fusion models is shown above: a transformer-based head (green) and CNN-based feature extractors (blue) predict 3D bird's-eye-view (BEV) bounding boxes (yellow blocks), and each sensor contributes intermediate BEV features (purple blocks). The fusion pipeline is set up to receive inputs from multi-view cameras, LiDAR, and radar.
While CNNs can capture global context within a single modality, it is non-trivial to extend them to multiple modalities and to accurately model the interactions between pairwise features. To overcome this limitation, global contextual reasoning about the 2D scene is integrated directly into each modality's feature extraction layers via the transformer's attention mechanism. Recent advances in sequence modeling and audio-visual fusion have shown that transformer-based architectures are very effective at modeling information interactions in sequential or cross-modal data.
Field Background
Fusion level: Multi-sensor fusion has recently attracted more and more interest in the 3D detection community. Existing methods can be divided into detection-level, proposal-level, and point-level fusion, depending on how early or late the different modalities (camera, radar, LiDAR, etc.) are fused.
Detection-level, or late, fusion is the simplest form of fusion: each modality produces its own BEV detections independently, which are then post-processed with Hungarian matching and Kalman filtering to aggregate them and remove duplicates. However, this approach cannot exploit the fact that each sensor may contribute different attributes to a single bounding-box prediction. CLOCs fuses the outputs of LiDAR-based 3D object detection and 2D detection. It operates on the two sets of output candidates before non-maximum suppression and uses the geometric consistency between the two sets of predictions to eliminate false positives (FPs), because the same FP is unlikely to be detected simultaneously by different modalities.
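As a concrete illustration of such post-processing, the sketch below (the box format, the use of center distance as the cost, and the 2 m gating threshold are all illustrative assumptions) matches per-modality BEV detections with the Hungarian algorithm and drops camera boxes that duplicate a LiDAR box:

import numpy as np
from scipy.optimize import linear_sum_assignment

def late_fuse(lidar_boxes, camera_boxes, max_dist=2.0):
    """lidar_boxes: (N, 2) and camera_boxes: (M, 2) arrays of BEV box centers."""
    if len(lidar_boxes) == 0 or len(camera_boxes) == 0:
        return np.concatenate([lidar_boxes.reshape(-1, 2), camera_boxes.reshape(-1, 2)])
    # pairwise center distance as the matching cost
    cost = np.linalg.norm(lidar_boxes[:, None, :] - camera_boxes[None, :, :], axis=-1)
    rows, cols = linear_sum_assignment(cost)
    # a camera box matched within the gate is a duplicate of a LiDAR box
    duplicate_cam = {c for r, c in zip(rows, cols) if cost[r, c] < max_dist}
    keep_cam = [c for c in range(len(camera_boxes)) if c not in duplicate_cam]
    return np.concatenate([lidar_boxes, camera_boxes[keep_cam]], axis=0)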
Point-level, also known as early, fusion augments the LiDAR point cloud with camera features, using a calibration (transformation) matrix to establish hard correspondences between LiDAR points and image pixels. However, this approach suffers from point sparsity and is sensitive to even slight errors in the calibration parameters between the two sensors.
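A minimal "painting"-style sketch of point-level fusion follows; it assumes the LiDAR points have already been projected to pixel coordinates uv (e.g., with the projection code earlier), and the function name and shapes are illustrative:

import numpy as np

def paint_points(points_xyz, uv, image_features):
    """points_xyz: (N, 3); uv: (N, 2) pixel coords; image_features: (H, W, C)."""
    H, W, C = image_features.shape
    u, v = uv[:, 0].astype(int), uv[:, 1].astype(int)
    valid = (u >= 0) & (u < W) & (v >= 0) & (v < H)
    painted = np.zeros((len(points_xyz), C), dtype=image_features.dtype)
    painted[valid] = image_features[v[valid], u[valid]]   # hard per-point correspondence
    # note: a slightly wrong calibration shifts every lookup, which is the
    # sensitivity to calibration errors mentioned above
    return np.concatenate([points_xyz, painted], axis=1)  # (N, 3 + C) decorated points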
Proposal-level, or deep, fusion is currently the most studied approach in the literature, and advances in transformers [5, 6, 7] unlock ways for intermediate features to interact even though they come from different sensors. A representative work, MV3D, extracts initial bounding boxes from LiDAR features and iteratively refines them using image features. BEVFusion generates camera-based BEV features, as emphasized in [10, 11, 12, 13]; the camera and LiDAR modalities are combined in a shared BEV space, and a BEV decoder predicts the 3D boxes as the final output. In TransFuser, the BEV representations of single-view images and LiDAR are fused across several intermediate feature maps by transformers in the encoder. This yields a 512-dimensional feature vector at the encoder output, which constitutes a compact representation of local and global context.
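Once camera and LiDAR features live on the same BEV grid, the simplest fusion operator is concatenation followed by a small convolutional mixer, loosely in the spirit of BEVFusion; the channel sizes in the sketch below are illustrative assumptions:

import torch
import torch.nn as nn

class BEVConcatFuser(nn.Module):
    """Concatenate per-modality BEV features and mix them with a conv block."""
    def __init__(self, cam_channels=80, lidar_channels=256, out_channels=256):
        super().__init__()
        self.fuse = nn.Sequential(
            nn.Conv2d(cam_channels + lidar_channels, out_channels, 3, padding=1),
            nn.BatchNorm2d(out_channels),
            nn.ReLU(inplace=True),
        )

    def forward(self, cam_bev, lidar_bev):
        # cam_bev: (B, 80, H, W), lidar_bev: (B, 256, H, W) on the same BEV grid
        return self.fuse(torch.cat([cam_bev, lidar_bev], dim=1))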
TransFuser then feeds the encoder output into a GRU (gated recurrent unit) and predicts differentiable ego-route waypoints with an L1 regression loss. Beyond being multimodal, the 4D network [16] also adds time as a fourth dimension of the problem. First, temporal features of the camera and LiDAR are extracted separately [17]. To add different kinds of image context, the paper collects three levels of image features: high-resolution images, low-resolution images, and video. The cross-modal information is then fused using a transformation matrix to obtain the 2D context of a given 3D pillar center, defined as the center point (x_o, y_o, z_o) of a BEV grid cell.
Transformer-based fusion network background
The method can be divided into three steps:
1. Apply a neural network-based backbone to extract spatial features from all modalities individually;
2. Iteratively refine a small set of learned embeddings (target query/proposal) in the transformer module to generate a set of 3D box predictions;
3. Calculate loss;
The architecture is shown in Figure 1.
(1) Backbone
Camera: Multiple camera images are fed to the backbone (e.g., ResNet-101) and FPN, and features are obtained;
LiDAR: VoxelNet with a 0.1 m voxel size or PointPillars with a 0.2 m pillar size is typically used to encode the points. After the 3D backbone and FPN, multi-scale BEV feature maps are obtained.
Radar: Location, intensity, and velocity are converted into features via an MLP (see the sketch below).
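A minimal sketch of that radar branch, with the per-point input layout (x, y, z, RCS, vx, vy) and layer sizes as illustrative assumptions:

import torch
import torch.nn as nn

class RadarPointEncoder(nn.Module):
    """Lift each radar return (position, intensity, velocity) to a feature vector."""
    def __init__(self, in_dim=6, hidden=64, out_dim=128):
        # in_dim = (x, y, z, rcs, vx, vy) per radar point -- an assumed layout
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(inplace=True),
            nn.Linear(hidden, out_dim),
        )

    def forward(self, radar_points):
        # radar_points: (B, N, 6) -> (B, N, out_dim) per-point radar features
        return self.mlp(radar_points)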
(2) Query Initialization
In the seminal work [5], a sparse set of queries is learned as network parameters and is representative of the entire training data. Queries of this type need more time, i.e., more sequential decoder layers (usually 6), to iteratively converge to the actual 3D objects in the scene. Recently, however, input-dependent queries [20] have been proposed as a better initialization strategy, which can shrink a 6-layer transformer decoder down to even a single decoder layer. TransFusion uses a center heatmap to initialize its queries, and BEVFormer introduces dense queries laid out as an evenly spaced BEV grid.
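The sketch below illustrates heatmap-based, input-dependent query initialization in that spirit: the top-k peaks of a class heatmap predicted on the BEV feature map become the initial object queries. All shapes, names, and the choice of 200 queries are illustrative assumptions:

import torch

def init_queries_from_heatmap(heatmap, bev_features, num_queries=200):
    """heatmap: (B, K_classes, H, W); bev_features: (B, C, H, W)."""
    B, K, H, W = heatmap.shape
    scores, _ = heatmap.max(dim=1)                      # (B, H, W): best class score per cell
    topk_scores, topk_idx = scores.flatten(1).topk(num_queries, dim=1)
    ys = torch.div(topk_idx, W, rounding_mode='floor')  # BEV grid row of each peak
    xs = topk_idx % W                                   # BEV grid column of each peak
    flat_feats = bev_features.flatten(2)                # (B, C, H*W)
    queries = torch.gather(
        flat_feats, 2, topk_idx.unsqueeze(1).expand(-1, flat_feats.size(1), -1)
    ).transpose(1, 2)                                   # (B, num_queries, C) initial queries
    positions = torch.stack([xs, ys], dim=-1)           # query reference positions on the grid
    return queries, positions, topk_scores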
(3) Transformer Decoder
To refine the object proposals, repeated transformer decoder blocks are applied sequentially, as in ViT-style models, where each block consists of a self-attention layer and a cross-attention layer. Self-attention among the target queries performs pairwise reasoning between different object candidates. Cross-attention between the target queries and the feature map aggregates relevant context into each query based on the learned attention weights. Because of the huge feature size, cross-attention is the slowest step in the chain, but techniques that restrict the attention window have been proposed [24]. After these sequential decoders, each d-dimensional refined query is decoded independently by an FFN, following [14]. The FFN predicts the center offsets δx, δy from the query location, the bounding-box height z, the dimensions l, w, h as log(l), log(w), log(h), the yaw angle α as sin(α) and cos(α), the velocity as vx, vy, and finally the class probability for each of the K semantic classes.
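A compact sketch of one such refinement block follows, with self-attention among the target queries, cross-attention from queries to the flattened sensor features, and an FFN head regressing the box parameters listed above (all dimensions and head sizes are illustrative assumptions):

import torch
import torch.nn as nn

class FusionDecoderLayer(nn.Module):
    def __init__(self, d_model=256, n_heads=8, num_classes=10):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm1, self.norm2, self.norm3 = (nn.LayerNorm(d_model) for _ in range(3))
        self.ffn = nn.Sequential(nn.Linear(d_model, 512), nn.ReLU(), nn.Linear(512, d_model))
        # prediction heads: center offsets (2) + height (1) + log dims (3)
        # + sin/cos yaw (2) + velocity (2), plus class logits
        self.box_head = nn.Linear(d_model, 2 + 1 + 3 + 2 + 2)
        self.cls_head = nn.Linear(d_model, num_classes)

    def forward(self, queries, sensor_tokens):
        # queries: (B, N, d_model); sensor_tokens: (B, M, d_model) flattened fused features
        q = self.norm1(queries + self.self_attn(queries, queries, queries)[0])   # query <-> query
        q = self.norm2(q + self.cross_attn(q, sensor_tokens, sensor_tokens)[0])  # query <-> features
        q = self.norm3(q + self.ffn(q))
        return q, self.box_head(q), self.cls_head(q)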
(4) Loss calculation
Set-to-set matching between the predictions and the ground-truth (GT) boxes is computed with the Hungarian algorithm, based on a pairwise matching cost between each prediction and each GT box.
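A representative DETR-style form of this cost, with classification and box-regression terms weighted by per-paper hyperparameters (the terms and weights below are an illustrative sketch, not the definition from any single method), is

C_{\text{match}}(\hat{y}_i, y_j) = \lambda_{\text{cls}}\, L_{\text{cls}}(\hat{p}_i, c_j) + \lambda_{\text{reg}}\, \lVert \hat{b}_i - b_j \rVert_1, \qquad \hat{\sigma} = \arg\min_{\sigma} \sum_j C_{\text{match}}(\hat{y}_{\sigma(j)}, y_j),

where \hat{p}_i and \hat{b}_i are the predicted class probabilities and box parameters of query i, and c_j, b_j are the class and box of GT object j.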
Transformer-based sensor fusion
TransFusion: solves the modality-misalignment problem through soft association of features. Its first decoder layer generates sparse queries from the LiDAR BEV features. The second decoder layer enriches these LiDAR queries with image features via soft association, exploiting a locality inductive bias by cross-attending only to the image region around the bounding box decoded from each query. It also has an image-guided query initialization layer.
FUTR3D: Closely related to [6], it is robust to any number of sensor modalities. Its MAFS (Modality-Agnostic Feature Sampler) takes a 3D query and gathers features from multi-view cameras, high-resolution LiDAR, low-resolution LiDAR, and radar. Specifically, it first decodes the query into a 3D coordinate, which is then used as an anchor to iteratively collect features from all modalities. BEV features are used for LiDAR and camera, while for radar the top-k nearest radar points are selected in MAFS. For each query i, all of these features F are concatenated and passed through an MLP Φ, as shown below.
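In the notation of this description (⊕ denoting concatenation of the per-modality features sampled around query i's reference point; the symbol names are illustrative rather than copied from the paper), the fused per-query feature takes the form

F_i^{\text{fused}} = \Phi\big( F_i^{\text{cam}} \oplus F_i^{\text{lidar}} \oplus F_i^{\text{radar}} \big),

so the MLP Φ mixes whichever modalities are present, which is what makes the sampler modality-agnostic.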