MoST: Multi-modality Scene Tokenization for Motion Prediction
Norman Mu, Jingwei Ji, Zhenpei Yang, Nate Harada, Haotian Tang, Kan Chen, Charles R. Qi, Runzhou Ge, Kratarth Goel, Zoey Yang, Scott Ettinger, Rami Al-Rfou, Dragomir Anguelov, Yin Zhou

TL;DR
This paper introduces MoST, a novel multi-modality scene tokenization method that encodes scene elements using pre-trained models, improving motion prediction accuracy and robustness over traditional symbolic and raw sensor approaches.
Contribution
MoST proposes a scene tokenization framework leveraging pre-trained models for open-vocabulary scene encoding, enhancing interpretability and performance in motion prediction.
Findings
Significant performance improvements on the Waymo dataset
Efficient encoding with a few hundred tokens
Compatibility with transformer architectures
Abstract
Many existing motion prediction approaches rely on symbolic perception outputs to generate agent trajectories, such as bounding boxes, road graph information and traffic lights. This symbolic representation is a high-level abstraction of the real world, which may render the motion prediction model vulnerable to perception errors (e.g., failures in detecting open-vocabulary obstacles) while missing salient information from the scene context (e.g., poor road conditions). An alternative paradigm is end-to-end learning from raw sensors. However, this approach suffers from the lack of interpretability and requires significantly more training resources. In this work, we propose tokenizing the visual world into a compact set of scene elements and then leveraging pre-trained image foundation models and LiDAR neural networks to encode all the scene elements in an open-vocabulary manner. The…
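To make the idea of a "compact set of scene elements" concrete, here is a minimal sketch of one way per-point features could be collapsed into one token per element. It assumes each scene element is a group of pixels or LiDAR points and that a token is the mean of the pre-trained features in that group. The grouping, the mean pooling, the function name `pool_tokens`, and the labels `ground` and `agent_0` are all illustrative assumptions, not details taken from the abstract.

```python
from collections import defaultdict


def pool_tokens(features, element_ids):
    """Mean-pool per-point feature vectors into one token per scene element.

    features:    list of equal-length feature vectors, one per pixel or point
                 (stand-ins for the output of a pre-trained encoder).
    element_ids: list of hashable element labels, same length as features.
    Returns a dict mapping each element label to its mean feature vector.
    """
    if len(features) != len(element_ids):
        raise ValueError("features and element_ids must have the same length")

    sums = {}                  # running per-element feature sums
    counts = defaultdict(int)  # number of points seen per element
    for vec, eid in zip(features, element_ids):
        if eid not in sums:
            sums[eid] = [0.0] * len(vec)
        acc = sums[eid]
        for i, value in enumerate(vec):
            acc[i] += value
        counts[eid] += 1

    # Divide each sum by its count to obtain the mean-pooled token.
    return {eid: [s / counts[eid] for s in acc] for eid, acc in sums.items()}


# Toy input (hypothetical): four points with 2-D features, two elements.
features = [[1.0, 0.0], [3.0, 0.0], [0.0, 2.0], [0.0, 4.0]]
element_ids = ["ground", "ground", "agent_0", "agent_0"]
tokens = pool_tokens(features, element_ids)
```

Under these assumptions the number of tokens equals the number of elements rather than the number of pixels or points, which is how a scene could shrink to a few hundred tokens that a transformer can attend over.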
Taxonomy
Topics: Human Pose and Action Recognition · Advanced Vision and Imaging · Video Analysis and Summarization
Methods: Sparse Evolutionary Training
