MoST: Multi-modality Scene Tokenization for Motion Prediction

Norman Mu; Jingwei Ji; Zhenpei Yang; Nate Harada; Haotian Tang; Kan Chen; Charles R. Qi; Runzhou Ge; Kratarth Goel; Zoey Yang; Scott Ettinger; Rami Al-Rfou; Dragomir Anguelov; Yin Zhou

arXiv:2404.19531 · cs.CV · May 1, 2024 · 1 citation

TL;DR

This paper introduces MoST, a novel multi-modality scene tokenization method that encodes scene elements using pre-trained models, improving motion prediction accuracy and robustness over traditional symbolic and raw sensor approaches.

Contribution

The paper proposes MoST, a scene tokenization framework that uses pre-trained models to encode scene elements in an open-vocabulary manner, aiming to improve motion prediction performance while remaining more interpretable than end-to-end learning from raw sensors.

Findings

01. Significant performance improvements on the Waymo dataset
02. Efficient encoding with a few hundred tokens
03. Compatibility with transformer architectures

Abstract

Many existing motion prediction approaches rely on symbolic perception outputs to generate agent trajectories, such as bounding boxes, road graph information and traffic lights. This symbolic representation is a high-level abstraction of the real world, which may render the motion prediction model vulnerable to perception errors (e.g., failures in detecting open-vocabulary obstacles) while missing salient information from the scene context (e.g., poor road conditions). An alternative paradigm is end-to-end learning from raw sensors. However, this approach suffers from the lack of interpretability and requires significantly more training resources. In this work, we propose tokenizing the visual world into a compact set of scene elements and then leveraging pre-trained image foundation models and LiDAR neural networks to encode all the scene elements in an open-vocabulary manner. The…
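As a rough illustration of what "tokenizing the visual world into a compact set of scene elements" could look like in practice, the sketch below mean-pools dense per-pixel (or per-point) features into one vector per scene element. This is a minimal sketch under stated assumptions, not the authors' implementation: the function name `pool_scene_tokens`, the use of mean pooling, the element categories, and the toy feature values are all hypothetical, and the features stand in for the output of a frozen pre-trained encoder.

```python
from typing import Dict, List, Sequence


def pool_scene_tokens(
    features: Sequence[Sequence[float]],
    element_ids: Sequence[int],
) -> Dict[int, List[float]]:
    """Mean-pool dense features into one token per scene element.

    Hypothetical helper: `features[i]` is the feature vector of pixel/point i,
    and `element_ids[i]` is the id of the scene element it belongs to.
    """
    if len(features) != len(element_ids):
        raise ValueError("features and element_ids must have the same length")

    sums: Dict[int, List[float]] = {}
    counts: Dict[int, int] = {}
    for feat, eid in zip(features, element_ids):
        if eid not in sums:
            sums[eid] = [0.0] * len(feat)
            counts[eid] = 0
        for i, value in enumerate(feat):
            sums[eid][i] += value
        counts[eid] += 1

    # One averaged vector ("token") per scene element.
    return {eid: [s / counts[eid] for s in total] for eid, total in sums.items()}


# Toy input: four 2-D features (made-up values) assigned to two
# hypothetical scene elements (0 = ground region, 1 = an agent).
features = [[1.0, 0.0], [3.0, 2.0], [0.0, 4.0], [2.0, 4.0]]
element_ids = [0, 0, 1, 1]
tokens = pool_scene_tokens(features, element_ids)
print(tokens)
```

In this toy case four dense features collapse to two tokens. The same idea, applied to real image and LiDAR features, is one way a scene could be reduced to a few hundred tokens suitable as transformer input.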

Taxonomy

Topics: Human Pose and Action Recognition · Advanced Vision and Imaging · Video Analysis and Summarization

Methods: Sparse Evolutionary Training