Future autonomous vehicles will be smarter and able to predict the actions of other vehicles - EEWORLD

At the recently concluded CVPR, isee, Peking University, UCLA, and MIT jointly presented a method called Multi-Agent Tensor Fusion (MATF). The model encodes the scene and the past trajectories of multiple agents into a multi-agent tensor, then applies convolutional fusion to capture interactions among the agents while preserving the spatial structure of the agents and the scene. The model is trained with an adversarial loss to learn stochastic predictions. Experiments on highway driving and crowded pedestrian datasets show that the model achieves state-of-the-art prediction accuracy.

Driving is a social activity. Consider the impressive multi-agent social interaction in this scene (featuring a headache-inducing roundabout):

The drivers navigate a complex scenario while remaining largely safe. Remarkably, human drivers maintain a high level of traffic safety while interacting closely with other road users in a shared environment, even though they cannot fully know the intentions of the other drivers. So how do human drivers accomplish this feat?

Social prediction is an essential part of driving

Human drivers use their social intelligence to predict how other traffic participants' future actions will depend on their interactions with the driver and the scene. By predicting the trajectories of nearby traffic participants, drivers can proactively plan safe interactions and minimize emergency responses, such as hard braking when an unexpected situation arises.

However, a human driver can never predict with complete certainty what trajectory another vehicle will follow. Drivers often find themselves wondering: "Will he yield?" "Will he suddenly speed up?" "How slowly will he go?"

Learn to predict

The researchers developed a neural network architecture that learns from large-scale data to make probabilistic predictions about the trajectories of other agents. Their approach relies only on training data collected during driving, and aims to generalize as much as possible across environments, scenarios, and types of vehicles and agents (trucks, cars, buses, motorcycles, bicycles, pedestrians, etc.).

isee, together with Peking University, UCLA, and MIT, developed a new method called Multi-Agent Tensor Fusion (MATF). By aligning scene features and agent trajectory features in a multi-agent tensor (MAT) representation, it combines the advantages of spatial-centric and agent-centric representations, as shown below. Because it relies on convolution operations, the MAT encoding naturally handles scenes with varying numbers of agents, and the computational cost of predicting the trajectories of all agents in a scene is linear in the number of agents. GAN training allows MATF to learn a distribution over predicted trajectories that captures the uncertainty about how the situation will develop. MATF predicts trajectories jointly, which lets it account for interactive behaviors such as slowing down and yielding between vehicles.

The architecture works as follows. MATF first encodes the scene, and separately processes each agent's past trajectory with a recurrent neural network to encode all relevant information about that agent. The network then spatially aligns the scene and agent features into a multi-agent tensor, preserving all local and non-local spatial relations in the scene. Next, a learned fully convolutional mapping fuses this tensor, and the resulting fused multi-agent tensor serves as the final encoding of the multi-agent driving scene. The same convolutional mapping is shared by all agents and applied to them simultaneously, so it captures the spatial relationships and interactions among all agents in the scene. Finally, MATF learns to probabilistically decode the fused multi-agent tensor into predicted trajectories that are sensitive to scene features and to the trajectories of surrounding agents.
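The spatial alignment step can be illustrated with a small dependency-free Python sketch. It is not the researchers' implementation: the function name `build_multi_agent_tensor`, the nested-list layout, and the rule of summing encodings when two agents share a grid cell are all assumptions made for illustration. The sketch only shows the idea of writing each agent's encoded trajectory into the scene feature grid at that agent's location; the learned convolutional fusion and decoding stages are omitted.

```python
def build_multi_agent_tensor(scene_features, agent_encodings, agent_cells):
    """Place per-agent encodings on the scene feature grid (illustrative only).

    scene_features: H x W grid; each cell is a list of scene feature values.
    agent_encodings: one list of floats per agent (e.g. a recurrent encoder's output).
    agent_cells: one (row, col) grid position per agent.
    Returns a new H x W grid whose cells hold scene channels followed by agent channels.
    """
    agent_dim = len(agent_encodings[0]) if agent_encodings else 0

    # Copy the scene features and append zero-filled agent channels to every cell.
    tensor = [[list(cell) + [0.0] * agent_dim for cell in row]
              for row in scene_features]

    for encoding, (row, col) in zip(agent_encodings, agent_cells):
        cell = tensor[row][col]
        scene_dim = len(cell) - agent_dim
        # Assumed rule: agents sharing a cell have their encodings summed.
        for k, value in enumerate(encoding):
            cell[scene_dim + k] += value
    return tensor


# Hypothetical toy input: a 2 x 3 grid with one scene channel and two agents.
scene = [[[0.1], [0.2], [0.3]],
         [[0.4], [0.5], [0.6]]]
agents = [[1.0, -1.0], [0.5, 0.5]]
cells = [(0, 0), (1, 2)]
mat = build_multi_agent_tensor(scene, agents, cells)
```

Because the agents live in one shared grid rather than in a fixed-length list, the same tensor shape serves a scene with two agents or twenty, which is what lets a convolutional mapping process all of them at once.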

The researchers use a conditional Generative Adversarial Network (GAN) training technique to learn a probability distribution over trajectories given the MATF encoding. GANs make it possible to learn high-fidelity generative models that capture the distribution of the observed data. In a driving context, the modes of the distribution correspond to the different maneuvers a vehicle or pedestrian may perform, such as following a lane or path versus changing lanes or paths. The spread around each mode corresponds to how the maneuver is performed: fast, slow, aggressive, cautious, and so on. GANs naturally capture both types of variability. Importantly, the GAN algorithm trains the model to generate joint trajectories that account for interactions between vehicles, such as yielding and collision avoidance.
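For reference, the standard conditional GAN objective can be written as below. The article does not give the exact loss used for MATF, so this is the textbook formulation rather than the researchers' own. The symbols are assumptions introduced here: $c$ is the conditioning information (the fused encoding), $y$ is an observed future trajectory, $z$ is a random noise vector, $G$ is the generator, and $D$ is the discriminator.

```latex
\min_{G} \max_{D} \;
  \mathbb{E}_{(c,\,y)}\big[\log D(y, c)\big]
  + \mathbb{E}_{c,\,z}\big[\log\big(1 - D(G(c, z),\, c)\big)\big]
```

Sampling different values of $z$ for the same $c$ yields different plausible trajectories, which is how such a model can represent both distinct maneuvers and the variation in how each one is carried out.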

Results

The researchers first applied their model to predicting vehicle trajectories, using large-scale driving data collected by isee. The figure below shows five scenarios, with each vehicle's past trajectory shown in a different color, followed by 100 sampled future trajectories. The ground truth trajectories are shown in black, and the lane centers are shown in gray. (a) shows a complex scenario involving five vehicles; MATF accurately predicts the trajectories and velocity profiles of all vehicles. In (b), MATF correctly predicts that the red vehicle will complete the lane change. In (c), MATF captures the uncertainty about whether the red vehicle will take the highway exit. In (d), once the purple vehicle has passed the highway exit, MATF predicts that it will not take the exit. In (e), MATF fails to accurately predict the ground truth trajectory of the red vehicle; however, a small number of the sampled trajectories do show the vehicle initiating a lane change, reflecting the low prior probability of spontaneous lane changes learned from the dataset.
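One common way to compare a set of sampled trajectories against the ground truth is the average displacement error, taking the best of the N samples. The article does not say which metric the researchers used, so the Python sketch below is only an assumed illustration; the function names and the toy coordinates are hypothetical.

```python
import math


def average_displacement_error(predicted, ground_truth):
    """Mean Euclidean distance between matching (x, y) points of two trajectories."""
    if len(predicted) != len(ground_truth) or not ground_truth:
        raise ValueError("trajectories must be non-empty and equally long")
    total = sum(math.hypot(px - gx, py - gy)
                for (px, py), (gx, gy) in zip(predicted, ground_truth))
    return total / len(ground_truth)


def best_of_n_error(samples, ground_truth):
    """Smallest average displacement error among the sampled trajectories."""
    return min(average_displacement_error(s, ground_truth) for s in samples)


# Hypothetical toy data: the ground truth keeps its lane, one sample drifts sideways.
truth = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
samples = [
    [(0.0, 1.0), (1.0, 1.0), (2.0, 1.0)],   # offset by 1 at every step
    [(0.0, 0.0), (1.0, 0.0), (2.0, 1.5)],   # matches until the last step
]
best = best_of_n_error(samples, truth)
```

A best-of-N score rewards a model whose samples cover the true outcome, even when most samples follow a different maneuver, as in scenario (e).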

Next, the researchers applied their model to predicting the trajectories of pedestrians and several other types of agents in the Stanford Drone Dataset, a large, state-of-the-art dataset containing trajectories of pedestrians, cyclists, skateboarders, carts, cars, and buses traveling around a university campus. In the figure below, the blue line shows the past trajectory, the red line shows the ground truth trajectory, and the green line shows the predicted trajectory. The trajectories of all the agents shown in the figure were predicted jointly in a single forward pass through the network. The model predicts that: (1) two agents entering the roundabout from the top will exit to the left; (2) an agent coming from the left on the pathway above the roundabout will turn left and move toward the top of the image; and (3) an agent will slow down at the doorway of the building above and to the right of the roundabout. An interesting failure case (4) shows an agent at the top right corner of the roundabout turning right to move toward the top of the image; the model predicts the turn but fails to accurately predict the turn angle.