End-to-end Contextual Perception and Prediction with Interaction Transformer
Lingyun Luke Li, Bin Yang, Ming Liang, Wenyuan Zeng, Mengye Ren, Sean Segal, Raquel Urtasun

TL;DR
This paper introduces the Interaction Transformer, a novel end-to-end neural network architecture that models actor interactions for improved 3D object detection and motion prediction in self-driving scenarios, achieving real-time performance and state-of-the-art results.
Contribution
We propose the Interaction Transformer, a new recurrent neural network architecture that explicitly models spatial-temporal interactions for better perception and prediction in autonomous driving.
Findings
Outperforms the state of the art on both the ATG4D and nuScenes datasets.
Significantly reduces predicted trajectory collisions.
Operates in real-time with end-to-end training.
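To make the collision finding concrete, here is a minimal sketch of how collisions between predicted trajectories might be counted. It is an illustration under stated assumptions, not the paper's evaluation code: actors are approximated as discs with a hypothetical fixed `radius`, each trajectory is a list of (x, y) waypoints sampled at the same timesteps, and the name `count_collisions` is invented for this example.

```python
import math
from itertools import combinations


def count_collisions(trajectories, radius=1.0):
    """Count actor pairs whose predicted centers come closer than 2 * radius.

    Assumption: every actor is a disc of the same radius, and all
    trajectories are sampled at the same future timesteps.
    """
    collisions = 0
    for traj_a, traj_b in combinations(trajectories, 2):
        for (ax, ay), (bx, by) in zip(traj_a, traj_b):
            if math.hypot(ax - bx, ay - by) < 2 * radius:
                collisions += 1
                break  # count each colliding pair only once
    return collisions


# Three hypothetical actors, three future timesteps each.
example = [
    [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)],    # drives right
    [(4.0, 0.0), (3.0, 0.0), (2.5, 0.0)],    # drives left, toward the first actor
    [(0.0, 10.0), (0.0, 11.0), (0.0, 12.0)], # far from both
]
print(count_collisions(example))  # prints 1
```

A lower count on the same scenes would indicate predictions that are more consistent with one another, which is the kind of improvement the finding describes.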
Abstract
In this paper, we tackle the problem of detecting objects in 3D and forecasting their future motion in the context of self-driving. Towards this goal, we design a novel approach that explicitly takes into account the interactions between actors. To capture their spatial-temporal dependencies, we propose a recurrent neural network with a novel Transformer architecture, which we call the Interaction Transformer. Importantly, our model can be trained end-to-end, and runs in real-time. We validate our approach on two challenging real-world datasets: ATG4D and nuScenes. We show that our approach can outperform the state-of-the-art on both datasets. In particular, we significantly improve the social compliance between the estimated future trajectories, resulting in far fewer collisions between the predicted actors.
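The idea of letting each actor's representation depend on the other actors can be sketched with plain scaled dot-product self-attention, the basic Transformer building block. This is a generic illustration and not the Interaction Transformer itself: it uses the raw feature vectors as queries, keys, and values with no learned projections, no recurrence, and no spatial information, and the names `softmax` and `actor_attention` are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def actor_attention(features):
    """Single-head scaled dot-product self-attention over actor features.

    Assumption: queries, keys, and values are the input features
    themselves (no learned weight matrices).
    """
    dim = len(features[0])
    updated = []
    for query in features:
        # Similarity of this actor to every actor, including itself.
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in features
        ]
        weights = softmax(scores)
        # New feature: attention-weighted average of all actors' features.
        updated.append(
            [sum(w * value[j] for w, value in zip(weights, features))
             for j in range(dim)]
        )
    return updated


# Three hypothetical actors with 2-dimensional features.
actors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
context = actor_attention(actors)
```

Each row of `context` is a convex combination of all actors' features, so every actor's updated representation carries information about the others. A learned model would add projection matrices and, as the abstract describes, apply the interaction step inside a recurrent network.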
Taxonomy
Methods: Linear Layer · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Attention Is All You Need · Multi-Head Attention · Byte Pair Encoding · Label Smoothing · Dropout · Residual Connection · Softmax
