Analysis of Transformer-based Autonomous Driving Sensor Fusion Technology

Publisher: 幸福旅程 | Last updated: 2024-05-10 | Source: 人工智能AI大模型与汽车自动驾驶

Sensor fusion is an important topic in many perception systems, such as autonomous driving and robotics. Architectures that pair a transformer-based detection head with CNN-based feature encoders (which extract features from raw sensor data) have become some of the highest-performing multi-sensor fusion frameworks for 3D detection, ranking among the best on many benchmarks.


This article reviews recent literature on transformer-based 3D object detection, focusing mainly on sensor fusion. It introduces the basics of vision transformers (ViT) and briefly discusses several non-transformer-based, less dominant approaches to sensor fusion for autonomous driving. Finally, it summarizes the role of transformers in sensor fusion and proposes future research directions in this area.


For more information, please refer to: https://github.com/ApoorvRoboticist/Transformers-SensorFusion


Sensor fusion is the integration of perception data from different information sources. By exploiting the complementary information captured by different sensors, fusion reduces the uncertainty of state estimation and makes 3D object detection more robust. Object attributes are not equally identifiable in every modality, so it is necessary to combine modalities and extract complementary information from them. For example, lidar localizes potential objects well, radar estimates object velocities well, and, last but not least, cameras classify objects through dense pixel information.


Why is sensor fusion difficult?


Sensor data from different modalities often differ greatly in data distribution, and each sensor also operates in its own coordinate space: LiDAR in Cartesian coordinates, radar in polar coordinates, and images in perspective space. The spatial misalignment introduced by these different coordinate systems makes it difficult to merge the modalities. Another problem with multimodal input is time synchronization: when a network is fed both camera and LiDAR data, their capture timelines are asynchronous.
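As a concrete illustration of the coordinate-space mismatch, here is a minimal sketch (illustrative only, not taken from any of the cited papers; the calibration matrices T_cam_from_lidar and K are assumed to be provided by the sensor setup) that converts radar returns from polar to Cartesian coordinates and projects LiDAR points into the camera's perspective view:

import numpy as np

def radar_polar_to_cartesian(ranges, azimuths):
    """Convert radar returns given as (range, azimuth) into Cartesian x/y coordinates."""
    x = ranges * np.cos(azimuths)
    y = ranges * np.sin(azimuths)
    return np.stack([x, y], axis=-1)

def project_lidar_to_image(points_xyz, T_cam_from_lidar, K):
    """Project LiDAR points (N, 3) into pixel coordinates using the extrinsic
    transform T_cam_from_lidar (4x4) and the camera intrinsics K (3x3)."""
    pts_h = np.concatenate([points_xyz, np.ones((len(points_xyz), 1))], axis=1)  # homogeneous (N, 4)
    pts_cam = (T_cam_from_lidar @ pts_h.T).T[:, :3]                              # points in camera frame
    in_front = pts_cam[:, 2] > 0                                                 # keep points ahead of the camera
    uvw = (K @ pts_cam[in_front].T).T
    return uvw[:, :2] / uvw[:, 2:3]                                              # perspective divide -> (u, v)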



The overall architecture of existing sensor fusion models is shown above: a transformer-based head (green) and CNN-based feature extractors (blue) predict 3D bird's-eye-view (BEV) bounding boxes (yellow blocks), and each sensor produces intermediate BEV features (purple blocks). The fusion setup receives input from multi-view cameras, lidar, and radar.


While CNNs can capture global context within a single modality, it is non-trivial to extend this to multiple modalities and to accurately model the interactions between pairwise features. To overcome this limitation, global contextual reasoning about the 2D scene is integrated directly into the feature extraction layers of each modality using the transformer's attention mechanism. Recent advances in sequence modeling and audio-visual fusion have shown that transformer-based architectures are very effective at modeling information interactions in sequential or cross-modal data.
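As a minimal sketch of this idea (PyTorch; the dimensions, names, and single-layer design are illustrative assumptions, not the architecture of any specific paper), flattened camera BEV features can query flattened LiDAR BEV features with one cross-attention layer:

import torch
import torch.nn as nn

class CrossModalAttention(nn.Module):
    """Toy cross-attention block: one modality queries another for context."""
    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, query_feats, context_feats):
        # query_feats:   (B, N_q, C)  e.g. flattened camera BEV features
        # context_feats: (B, N_kv, C) e.g. flattened LiDAR BEV features
        fused, _ = self.attn(query_feats, context_feats, context_feats)
        return self.norm(query_feats + fused)   # residual + norm, as in a standard transformer layer

cam = torch.randn(2, 32 * 32, 256)   # flattened 32x32 camera BEV grid (kept small for the example)
lid = torch.randn(2, 32 * 32, 256)   # flattened 32x32 LiDAR BEV grid
out = CrossModalAttention()(cam, lid)  # camera features enriched with LiDAR context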


Field Background


Fusion level: Multi-sensor fusion has recently attracted increasing interest in the 3D detection community. Existing methods can be divided into detection-level, proposal-level, and point-level fusion, depending on how early or late the different modalities (i.e., camera, radar, lidar, etc.) are fused.


Detection-level, or late, fusion is the simplest form of fusion: each modality produces its own BEV detections individually, which are then post-processed with the Hungarian matching algorithm and Kalman filtering to aggregate them and remove duplicate detections. However, this approach cannot exploit the fact that each sensor can contribute to different attributes of a single bounding box prediction. CLOCs fuses the results of LiDAR-based 3D object detection and 2D image detection: it operates on the two sets of output candidates before non-maximum suppression and uses the geometric consistency between the two sets of predictions to eliminate false positives (FPs), since the same FP is unlikely to be detected simultaneously in both modalities.
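A minimal sketch of detection-level fusion, assuming each sensor pipeline already outputs BEV detections as (x, y, score); the cost here is plain center distance, which simplifies the geometric checks and Kalman filtering used in practice:

import numpy as np
from scipy.optimize import linear_sum_assignment

def late_fuse(dets_a, dets_b, max_dist=2.0):
    """dets_a, dets_b: (N, 3) / (M, 3) arrays of [x, y, score] from two sensors.
    Matched pairs are merged with a score-weighted average; unmatched boxes are kept."""
    cost = np.linalg.norm(dets_a[:, None, :2] - dets_b[None, :, :2], axis=-1)  # pairwise center distances
    rows, cols = linear_sum_assignment(cost)                                   # Hungarian matching
    matched_a, matched_b, fused = set(), set(), []
    for i, j in zip(rows, cols):
        if cost[i, j] < max_dist:                                              # accept only nearby matches
            w = dets_a[i, 2] / (dets_a[i, 2] + dets_b[j, 2])
            fused.append(w * dets_a[i] + (1 - w) * dets_b[j])
            matched_a.add(i)
            matched_b.add(j)
    fused += [dets_a[i] for i in range(len(dets_a)) if i not in matched_a]     # keep unmatched detections
    fused += [dets_b[j] for j in range(len(dets_b)) if j not in matched_b]
    return np.stack(fused)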


Point-level fusion, also known as early fusion, augments the LiDAR point cloud with camera features: a transformation matrix is used to establish hard correspondences between LiDAR points and image pixels. However, this approach suffers from point-cloud sparsity and is sensitive to even slight errors in the calibration parameters of the two sensors.
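A minimal sketch of this hard-association step (self-contained NumPy; the nearest-neighbour feature lookup and the assumption that all points lie in front of the camera are simplifications): each LiDAR point is projected into the image with the calibration matrices and decorated with the camera feature at that pixel.

import numpy as np

def decorate_points(points_xyz, image_feats, T_cam_from_lidar, K):
    """Append the camera feature at each point's projected pixel to the point.
    points_xyz: (N, 3) LiDAR points assumed to lie in front of the camera;
    image_feats: (H, W, C) camera feature map."""
    pts_h = np.concatenate([points_xyz, np.ones((len(points_xyz), 1))], axis=1)
    pts_cam = (T_cam_from_lidar @ pts_h.T).T[:, :3]
    uvw = (K @ pts_cam.T).T
    uv = uvw[:, :2] / uvw[:, 2:3]                                    # (N, 2) pixel coordinates
    h, w = image_feats.shape[:2]
    u = np.clip(np.round(uv[:, 0]).astype(int), 0, w - 1)            # nearest-neighbour feature lookup
    v = np.clip(np.round(uv[:, 1]).astype(int), 0, h - 1)
    return np.concatenate([points_xyz, image_feats[v, u]], axis=1)   # (N, 3 + C) decorated points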


Proposal-level, or deep, fusion is currently the most studied approach in the literature, and advances in transformers [5, 6, 7] make it possible for intermediate features to interact even though they come from different sensors. A representative work, MV3D, extracts initial bounding boxes from LiDAR features and iteratively refines them using image features. BEVFusion generates camera-based BEV features, as emphasized in [10, 11, 12, 13]; the camera and LiDAR modalities are concatenated in BEV space, and a BEV decoder predicts 3D boxes as the final output. In TransFuser, the BEV representations of single-view images and LiDAR are fused over several intermediate feature maps by transformers in the encoder, resulting in a 512-dimensional feature vector that constitutes a compact representation of local and global context.


TransFuser then feeds this output into a GRU (gated recurrent unit) and predicts differentiable ego waypoints with an L1 regression loss. In addition to being multimodal, 4D-Net [16] adds time as a fourth dimension to the problem. First, the temporal features of the camera and lidar streams are extracted separately [17]. To add different contexts to the image representation, the paper collects three levels of image features, namely high-resolution images, low-resolution images, and video. Cross-modal information is then fused using a transformation matrix to obtain the 2D context of a given 3D pillar center, which is defined by the center point (x_o, y_o, z_o) of the BEV grid cell.
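A minimal sketch of such a waypoint head (the hidden sizes, the number of waypoints, and the zero-initialized start position are assumptions; TransFuser's actual head differs in detail, e.g. it also conditions on a goal location): a GRU unrolled from the fused feature predicts ego waypoints, trained with an L1 loss.

import torch
import torch.nn as nn

class WaypointHead(nn.Module):
    """GRU that unrolls K future ego waypoints from a fused scene feature."""
    def __init__(self, feat_dim=512, hidden=64, num_waypoints=4):
        super().__init__()
        self.init_h = nn.Linear(feat_dim, hidden)   # fused feature -> initial hidden state
        self.gru = nn.GRUCell(input_size=2, hidden_size=hidden)
        self.out = nn.Linear(hidden, 2)             # hidden state -> (dx, dy) offset
        self.num_waypoints = num_waypoints

    def forward(self, fused_feat):
        h = self.init_h(fused_feat)
        wp = torch.zeros(fused_feat.size(0), 2, device=fused_feat.device)  # start at the ego position
        waypoints = []
        for _ in range(self.num_waypoints):
            h = self.gru(wp, h)
            wp = wp + self.out(h)                   # accumulate predicted offsets
            waypoints.append(wp)
        return torch.stack(waypoints, dim=1)        # (B, K, 2)

pred = WaypointHead()(torch.randn(8, 512))
loss = nn.functional.l1_loss(pred, torch.randn(8, 4, 2))  # L1 regression loss against GT waypoints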


Transformer-based fusion network background


The method can be divided into three steps:


1. Apply a neural network-based backbone to extract spatial features from all modalities individually;


2. Iteratively refine a small set of learned embeddings (target query/proposal) in the transformer module to generate a set of 3D box predictions;


3. Calculate loss;


The architecture is shown in Figure 1.


(1) Backbone


Camera: multi-view camera images are fed to a backbone (e.g., ResNet-101) and an FPN to obtain image features;


LiDAR: VoxelNet with a 0.1 m voxel size or PointPillars with a 0.2 m pillar size is typically used to encode the points. After the 3D backbone and FPN, multi-scale BEV feature maps are obtained.


Radar: location, intensity, and velocity are converted into features via an MLP (see the sketch below).
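A minimal sketch of the radar branch mentioned above (the attribute set and dimensions are illustrative assumptions): a per-point MLP lifts raw radar attributes into the shared feature width used by the fusion decoder.

import torch
import torch.nn as nn

class RadarEncoder(nn.Module):
    """Per-point MLP that lifts raw radar attributes into the shared feature width."""
    def __init__(self, in_dim=5, feat_dim=256):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(in_dim, 128), nn.ReLU(inplace=True),
            nn.Linear(128, feat_dim),
        )

    def forward(self, radar_points):
        # radar_points: (B, N, 5) = [x, y, z, intensity, radial_velocity] per return
        return self.mlp(radar_points)  # (B, N, 256) radar tokens for the fusion decoder

tokens = RadarEncoder()(torch.randn(2, 300, 5))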


(2) Query Initialization


In the seminal work [5], a sparse set of queries is learned as network parameters and is therefore representative of the entire training data. Such queries need longer, i.e., more sequential decoder layers (usually 6), to iteratively converge on the actual 3D objects in the scene. Recently, however, input-dependent queries [20] have been proposed as a better initialization strategy, which can reduce a 6-layer transformer decoder to even a single decoder layer. TransFusion uses a center heatmap to initialize its queries, and BEVFormer introduces dense queries arranged as an evenly spaced BEV grid.
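A minimal sketch contrasting the two initialization strategies (PyTorch; the shapes and the top-k peak selection are illustrative, not the exact TransFusion or BEVFormer implementation):

import torch
import torch.nn as nn

num_queries, dim = 200, 256

# (a) Input-independent: queries are learned network parameters shared across all scenes.
learned_queries = nn.Embedding(num_queries, dim)
q_static = learned_queries.weight.unsqueeze(0).expand(4, -1, -1)        # (B, 200, 256)

# (b) Input-dependent: seed queries from the peaks of a class-agnostic BEV center heatmap.
def heatmap_queries(bev_feats, heatmap, k=200):
    # bev_feats: (B, C, H, W) LiDAR BEV features; heatmap: (B, 1, H, W) predicted object centers
    b, c, h, w = bev_feats.shape
    scores = heatmap.flatten(2).squeeze(1)                              # (B, H*W)
    topk = scores.topk(k, dim=1).indices                                # indices of the k strongest centers
    flat = bev_feats.flatten(2).transpose(1, 2)                         # (B, H*W, C)
    return torch.gather(flat, 1, topk.unsqueeze(-1).expand(-1, -1, c))  # (B, k, C) content-seeded queries

q_dynamic = heatmap_queries(torch.randn(4, dim, 128, 128), torch.rand(4, 1, 128, 128))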


(3) Transformers Decoder


To refine the object proposals, repeated transformer decoder blocks are applied sequentially, as in ViT-style models, where each block consists of a self-attention layer and a cross-attention layer. Self-attention among the object queries performs pairwise reasoning between different object candidates. Cross-attention between the object queries and the feature maps aggregates relevant context into each query based on the learned attention weights. Because of the large feature size, cross-attention is the slowest step in the chain, but techniques to reduce the attention window have been proposed [24]. After these sequential decoders, each d-dimensional refined query is decoded independently by FFN heads, following [14]: the FFN predicts the center offset δx, δy from the query location, the bounding-box height z, the dimensions l, w, h as log(l), log(w), log(h), the yaw angle α as sin(α) and cos(α), the velocity as vx, vy, and finally the class probability for each of the K semantic classes.
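A minimal sketch of one such decoder block together with the FFN prediction heads (PyTorch; real implementations add positional encodings, dropout, and windowed or deformable attention):

import torch
import torch.nn as nn

class FusionDecoderLayer(nn.Module):
    """Self-attention over object queries, cross-attention into sensor features, then prediction heads."""
    def __init__(self, dim=256, heads=8, num_classes=10):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1, self.norm2 = nn.LayerNorm(dim), nn.LayerNorm(dim)
        # Independent FFN heads decode each refined query into box parameters.
        self.center = nn.Linear(dim, 2)      # (δx, δy) offset from the query location
        self.height = nn.Linear(dim, 1)      # z
        self.size = nn.Linear(dim, 3)        # (log l, log w, log h)
        self.yaw = nn.Linear(dim, 2)         # (sin α, cos α)
        self.vel = nn.Linear(dim, 2)         # (vx, vy)
        self.cls = nn.Linear(dim, num_classes)

    def forward(self, queries, sensor_feats):
        q = self.norm1(queries + self.self_attn(queries, queries, queries)[0])   # pairwise reasoning
        q = self.norm2(q + self.cross_attn(q, sensor_feats, sensor_feats)[0])    # aggregate sensor context
        return q, (self.center(q), self.height(q), self.size(q),
                   self.yaw(q), self.vel(q), self.cls(q))

q, preds = FusionDecoderLayer()(torch.randn(2, 200, 256), torch.randn(2, 32 * 32, 256))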


(4) Loss calculation


A set-based prediction loss is used: predictions are matched to ground-truth (GT) boxes via the Hungarian algorithm, where the matching cost is defined as follows.
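The article's equation is not reproduced here; a typical DETR-style matching cost, which the fusion works discussed below follow in spirit (the weights λ and the exact terms vary by paper and are assumptions), balances classification against box regression:

C_{\mathrm{match}}(i,j) = \lambda_{\mathrm{cls}}\, L_{\mathrm{cls}}(\hat{p}_i, c_j) + \lambda_{\mathrm{reg}}\, \lVert \hat{b}_i - b_j \rVert_1 + \lambda_{\mathrm{IoU}}\, L_{\mathrm{IoU}}(\hat{b}_i, b_j)

where \hat{p}_i and \hat{b}_i are the class probabilities and box parameters decoded from query i, and c_j and b_j are the class and box of ground-truth object j; predictions left unmatched are assigned to a "no object" class.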



Transformer-based sensor fusion

TransFusion: addresses the modality misalignment problem through soft association of features. The first decoder layer generates sparse queries from LiDAR BEV features; the second decoder layer enriches these LiDAR queries with image features via soft association, exploiting a locality inductive bias by applying cross-attention only around the bounding box decoded from each query. It also includes an image-guided query initialization layer.


FUTR3D: Closely related to [6], it is robust to any number and combination of sensor modalities. Its Modality-Agnostic Feature Sampler (MAFS) accepts a 3D query and gathers features from multi-view cameras, high-resolution lidar, low-resolution lidar, and radar. Specifically, it first decodes the query into a 3D coordinate, which is then used as an anchor to iteratively collect features from all modalities. BEV features are used for lidar and camera, while for radar the top-k nearest radar points are selected in MAFS. For each query i, all of these features F are concatenated and passed through an MLP layer Φ, as sketched below.
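The equation itself is not reproduced in the article; from the description (per-query features of all modalities concatenated and passed through the MLP Φ), it can be sketched as:

\mathcal{F}^{\,i}_{\mathrm{fus}} = \Phi\!\left( \left[ \mathcal{F}^{\,i}_{\mathrm{cam}} ;\; \mathcal{F}^{\,i}_{\mathrm{lid}} ;\; \mathcal{F}^{\,i}_{\mathrm{rad}} \right] \right)

where [ · ; · ] denotes concatenation of the per-query camera, lidar, and radar features.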


CMT: The Cross-Modal Transformer encodes 3D coordinates into multimodal tokens through coordinate encoding. Queries from a position-guided query generator interact with the multimodal tokens in the transformer decoder, and the object parameters are then predicted. Point-based query denoising is further introduced to accelerate training convergence by injecting local priors.

UVTR: Unifying Voxel-based Representation with Transformer unifies multimodal representations in voxel space for accurate and robust single-modal or cross-modal 3D detection. A modality-specific space is first designed to represent the different inputs in voxel space without heavy compression, which alleviates semantic ambiguity and enables spatial interaction; this representation is more complex and information-dense than other BEV methods. For the image voxel space, perspective features are transformed into a predefined space through a view transform, and a CNN-based voxel encoder handles multi-view feature interaction. For the point voxel space, 3D points are naturally converted to voxels, and sparse convolutions aggregate spatial information over these voxel features. Compared with images, semantic ambiguity in the z direction is greatly reduced thanks to the precise positions in the point cloud.

LIFT: The LiDAR Image Fusion Transformer aligns 4D spatiotemporal information across sensors. In contrast to [16], it exploits sequential multimodal data comprehensively. For the sequential data, a vehicle-pose prior is used to remove the effect of ego-motion between temporal frames. The paper encodes lidar sweeps and camera images into sparse BEV grid features and proposes a sensor-time 4D attention module to capture their mutual correlation.

DeepInteraction takes a slightly different approach: previous methods are structurally limited because a large amount of imperfect information is fused into a single unified representation, which can discard much of the modality-specific representational strength, as shown in [3, 9]. Instead of deriving a single fused BEV representation, DeepInteraction learns and maintains two modality-specific representations and enables inter-modality interaction, so that information exchange and modality-specific advantages emerge naturally. The authors call this a multiple-input multiple-output (MIMO) structure: it takes the two modality-specific representations as input and produces two refined representations as output. The paper uses a DETR3D-like query that is sequentially updated from LiDAR and visual features, with sequential cross-attention layers in the transformer-based decoder.

AutoAlign: Instead of establishing deterministic correspondences via sensor projection matrices as other approaches do, the paper uses a learnable alignment map to model the mapping relationship between images and point clouds. This mapping enables the model to automatically align non-homogeneous features in a dynamic, data-driven manner, and a cross-attention module adaptively aggregates pixel-level image features for each voxel.

