DONUT: A Decoder-Only Model for Trajectory Prediction
Markus Knoche, Daan de Geus, Bastian Leibe

TL;DR
DONUT introduces a decoder-only autoregressive model for trajectory prediction in autonomous driving. By unrolling trajectories and using an 'overprediction' strategy to better anticipate the future, it outperforms existing methods.
Contribution
The paper presents a novel decoder-only architecture for trajectory prediction, inspired by language models, with an overprediction strategy to enhance forecasting accuracy.
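To make the overprediction idea concrete, here is a minimal sketch of how an auxiliary longer-horizon target could be added to a next-token training loss. It is not DONUT's actual loss. Several choices are assumptions for illustration only: each "token" is a fixed-length chunk of (x, y) positions; the auxiliary task looks exactly one token further ahead; the error metric is a mean Euclidean distance; the weight `aux_weight = 0.5` is arbitrary; and only a single predicted mode is considered. The helper names `chunk_error` and `overprediction_loss` are hypothetical.

```python
from typing import List, Tuple

Point = Tuple[float, float]
SubTrajectory = List[Point]  # one token: a fixed-length chunk of (x, y) positions


def chunk_error(pred: SubTrajectory, target: SubTrajectory) -> float:
    """Mean Euclidean distance between corresponding points of two chunks."""
    dists = [((xp - xt) ** 2 + (yp - yt) ** 2) ** 0.5
             for (xp, yp), (xt, yt) in zip(pred, target)]
    return sum(dists) / len(dists)


def overprediction_loss(tokens: List[SubTrajectory],
                        next_preds: List[SubTrajectory],
                        over_preds: List[SubTrajectory],
                        aux_weight: float = 0.5) -> float:
    """Combine the main next-token loss with an auxiliary overprediction loss.

    tokens[t]      ground-truth sub-trajectory at step t
    next_preds[t]  prediction made at step t for tokens[t + 1] (main task)
    over_preds[t]  prediction made at step t for tokens[t + 2] (auxiliary task;
                   the one-token-further horizon is an assumption of this sketch)
    """
    main = [chunk_error(next_preds[t], tokens[t + 1])
            for t in range(len(tokens) - 1)]
    aux = [chunk_error(over_preds[t], tokens[t + 2])
           for t in range(len(tokens) - 2)]
    main_loss = sum(main) / len(main)
    aux_loss = sum(aux) / len(aux) if aux else 0.0
    return main_loss + aux_weight * aux_loss


# Toy usage: perfect next-token predictions, overprediction off by 1 m.
tokens = [[(0.0, 0.0)], [(1.0, 0.0)], [(2.0, 0.0)]]
next_preds = [[(1.0, 0.0)], [(2.0, 0.0)]]
over_preds = [[(3.0, 0.0)]]
print(overprediction_loss(tokens, next_preds, over_preds))  # 0.5
```

In this sketch the auxiliary output only contributes to the training loss; the intent is that being asked to look further ahead shapes the model's representations, even if only the next-token prediction is used when forecasting.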
Findings
Outperforms encoder-decoder baselines on the Argoverse 2 benchmark
Achieves state-of-the-art results in single-agent motion forecasting
Demonstrates improved iterative prediction consistency
Abstract
Predicting the motion of other agents in a scene is highly relevant for autonomous driving, as it allows a self-driving car to anticipate. Inspired by the success of decoder-only models for language modeling, we propose DONUT, a Decoder-Only Network for Unrolling Trajectories. Unlike existing encoder-decoder forecasting models, we encode historical trajectories and predict future trajectories with a single autoregressive model. This allows the model to make iterative predictions in a consistent manner, and ensures that the model is always provided with up-to-date information, thereby enhancing performance. Furthermore, inspired by multi-token prediction for language modeling, we introduce an 'overprediction' strategy that gives the model the auxiliary task of predicting trajectories at longer temporal horizons. This allows the model to better anticipate the future and further improves…
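The abstract's central idea is that one autoregressive model handles both the observed history and its own predictions, so each step is conditioned on the most recent context. The loop below is a minimal, single-mode sketch of that unrolling under stated assumptions. A trajectory is split into fixed-length sub-trajectories ("tokens"), and the learned network is replaced by a hypothetical constant-velocity stand-in, `constant_velocity`, purely so the example runs. The names `unroll` and `predict_next` are illustrative, not part of any published DONUT code.

```python
from typing import Callable, List, Tuple

Point = Tuple[float, float]
SubTrajectory = List[Point]  # one token: a fixed-length chunk of (x, y) positions


def unroll(history: List[SubTrajectory],
           predict_next: Callable[[List[SubTrajectory]], SubTrajectory],
           num_future_tokens: int) -> List[SubTrajectory]:
    """Autoregressively roll out future sub-trajectories.

    The same predictor consumes observed tokens and its own earlier
    predictions, so every step is conditioned on the latest context.
    """
    context = list(history)  # copy so the caller's history is not mutated
    future: List[SubTrajectory] = []
    for _ in range(num_future_tokens):
        nxt = predict_next(context)  # predict the next chunk from all tokens so far
        future.append(nxt)
        context.append(nxt)          # feed the prediction back in as input
    return future


def constant_velocity(context: List[SubTrajectory]) -> SubTrajectory:
    """Hypothetical stand-in for the network: extrapolate the last chunk's velocity."""
    last = context[-1]
    (x0, y0), (x1, y1) = last[-2], last[-1]
    vx, vy = x1 - x0, y1 - y0
    return [(x1 + vx * (k + 1), y1 + vy * (k + 1)) for k in range(len(last))]


history = [[(0.0, 0.0), (1.0, 0.0)], [(2.0, 0.0), (3.0, 0.0)]]
future = unroll(history, constant_velocity, num_future_tokens=2)
print(future)  # [[(4.0, 0.0), (5.0, 0.0)], [(6.0, 0.0), (7.0, 0.0)]]
```

Because prediction is just "append one more token", the same loop can be restarted whenever new observations arrive, which is the kind of consistent iterative prediction the abstract refers to; an encoder-decoder model would instead re-encode the history separately from the decoding step.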
Taxonomy
Topics: Autonomous Vehicle Technology and Safety · Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis
