RedMotion: Motion Prediction via Redundancy Reduction
Royden Wagner, Omer Sahin Tas, Marvin Klemp, Carlos Fernandez, Christoph Stiller

TL;DR
RedMotion is a transformer-based motion prediction model that learns road environment representations via redundancy reduction. It outperforms PreTraM, Traj-MAE, and GraphDINO in a semi-supervised setting and achieves competitive results in the Waymo Motion Prediction Challenge.
Contribution
It introduces a novel redundancy reduction approach for environment embeddings in motion prediction, combining transformer decoding and self-supervised learning.
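The redundancy reduction principle applied to embeddings of two augmented views can be sketched as a cross-correlation objective. This is a minimal illustration, assuming a Barlow Twins-style loss; the function names, the plain-list data layout, and the `off_diag_weight` value are assumptions made here, not details taken from the paper. It uses only the Python standard library for readability, whereas a practical version would use a tensor library.

```python
import math


def standardize(column, eps=1e-12):
    """Shift a list of floats to zero mean and scale it to unit standard deviation."""
    mean = sum(column) / len(column)
    var = sum((x - mean) ** 2 for x in column) / len(column)
    std = math.sqrt(var) + eps
    return [(x - mean) / std for x in column]


def redundancy_reduction_loss(z_a, z_b, off_diag_weight=5e-3):
    """Barlow Twins-style loss for two (batch, dim) nested lists of embeddings.

    z_a and z_b are assumed to be embeddings of two augmented views of the
    same batch of road environments. off_diag_weight is a hypothetical value.
    """
    batch = len(z_a)
    # Transpose to (dim, batch) and standardize each dimension over the batch.
    cols_a = [standardize(list(col)) for col in zip(*z_a)]
    cols_b = [standardize(list(col)) for col in zip(*z_b)]
    loss = 0.0
    for i, col_a in enumerate(cols_a):
        for j, col_b in enumerate(cols_b):
            # Entry (i, j) of the cross-correlation matrix between the views.
            c_ij = sum(a * b for a, b in zip(col_a, col_b)) / batch
            if i == j:
                # Invariance term: matching dimensions should correlate fully.
                loss += (c_ij - 1.0) ** 2
            else:
                # Redundancy term: different dimensions should be decorrelated.
                loss += off_diag_weight * c_ij ** 2
    return loss
```

The loss is near zero when both views yield the same embeddings with uncorrelated dimensions, and it grows when matching dimensions disagree or when different dimensions carry the same information.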
Findings
Outperforms PreTraM, Traj-MAE, and GraphDINO in semi-supervised settings.
Achieves competitive results in the Waymo Motion Prediction Challenge.
Provides an open-source implementation for reproducibility.
Abstract
We introduce RedMotion, a transformer model for motion prediction in self-driving vehicles that learns environment representations via redundancy reduction. Our first type of redundancy reduction is induced by an internal transformer decoder and reduces a variable-sized set of local road environment tokens, representing road graphs and agent data, to a fixed-sized global embedding. The second type of redundancy reduction is obtained by self-supervised learning and applies the redundancy reduction principle to embeddings generated from augmented views of road environments. Our experiments reveal that our representation learning approach outperforms PreTraM, Traj-MAE, and GraphDINO in a semi-supervised setting. Moreover, RedMotion achieves competitive results compared to HPTR or MTR++ in the Waymo Motion Prediction Challenge. Our open-source implementation is available at:…
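The first type of reduction, from a variable-sized token set to a fixed-sized embedding, can be illustrated with cross-attention from a fixed number of query vectors. The sketch below is a simplified assumption, not the paper's decoder: it uses a single attention head, no learned projections, and the tokens serve as both keys and values. The names `reduce_tokens` and `queries` are hypothetical, and the queries stand in for parameters that would be learned in practice.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def reduce_tokens(queries, tokens):
    """Cross-attend a fixed set of query vectors over a variable-length token list.

    Returns one output vector per query, so the output size depends only on
    len(queries) and never on len(tokens).
    """
    dim = len(queries[0])
    outputs = []
    for q in queries:
        # Scaled dot-product score between this query and every token.
        scores = [
            sum(qi * ti for qi, ti in zip(q, t)) / math.sqrt(dim) for t in tokens
        ]
        weights = softmax(scores)
        # Attention-weighted average of the tokens (keys and values are shared).
        outputs.append(
            [sum(w * t[d] for w, t in zip(weights, tokens)) for d in range(dim)]
        )
    return outputs
```

Calling `reduce_tokens` with three queries returns three vectors whether the scene contains two tokens or two hundred, which is the property that makes the resulting global embedding fixed-sized.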
Peer Reviews
Decision · Submitted to ICLR 2024
This paper proposes a label-free method for pre-training a map encoder. I think the approach is novel, and I think map-only encoders are increasingly relevant for motion prediction in that, generally speaking, self-driving companies have a stream of motion data, and if the map encoding is fixed, training on new motion data as it comes in should be much more parameter efficient compared to training from scratch every time. I also appreciate the 3D visuals of the motion predictions in the paper, w
I think the main weakness of this paper is that I'm not convinced that both contributions claimed by the paper are valid. Specifically, the first claimed contribution is that the authors design an architecture that reduces the variable-length set of map objects to an encoding of fixed length. I think the authors should clarify how this encoding is different from the "latent query attention" used in Wayformer, which also reduces a variable-length set of objects to an encoding of fixed length. This su
* The paper presents solid contributions on representation learning for autonomous driving scenes in the motion prediction context.
* The general explanation of the obtained embeddings for the road map is clear and easy to understand. The experiments with local attention provide
* The presented results are competitive with the state-of-the-art even though the method focused on the pre-trained representation instead of actually achieving SOTA results on motion prediction benchmarks. That shows a
1. One potential weakness of the paper is the lack of closed-loop evaluation of the strategy. While motion prediction benchmarks are still used, it has been shown that motion prediction alone tends to have poorer quality when evaluated in a closed loop. Examples of that are the NuPlan [1] evaluation and the more recent Waymo closed-loop prediction benchmarks.
2. For the ablations I missed acquiring a bit more understanding on the relevance of RBT versus just a simple tokenization. It wo
1. Reducing model size to speed up models is important in the autonomous driving field.
1. Although the work focuses on efficiency, no statistics about inference memory footprint or inference time are reported. I suggest the authors report these related statistics and compare with open-sourced classic baselines like MTR/QCNet. Otherwise, the claimed advantage of reduction cannot be verified.
2. Wrong experiment setting. The authors state that *we use 100% of the training data for pre-training and fine-tune on only 12.5%*. However, the data used for pretraining is already annotated data for
Taxonomy
Topics: Traffic Prediction and Management Techniques · Autonomous Vehicle Technology and Safety · Video Surveillance and Tracking Methods
Methods: Multi-Head Attention · Attention Is All You Need · Position-Wise Feed-Forward Layer · Label Smoothing · Layer Normalization · Adam · Absolute Position Encodings · Residual Connection · Byte Pair Encoding · Linear Layer
