S4-Driver: Scalable Self-Supervised Driving Multimodal Large Language Model with Spatio-Temporal Visual Representation

Yichen Xie; Runsheng Xu; Tong He; Jyh-Jing Hwang; Katie Luo; Jingwei Ji; Hubert Lin; Letian Chen; Yiren Lu; Zhaoqi Leng; Dragomir Anguelov; Mingxing Tan

arXiv:2505.24139·cs.CV·June 4, 2025

TL;DR

S4-Driver introduces a self-supervised, scalable motion planning approach for autonomous driving that leverages a novel 3D visual representation from multi-view, multi-frame inputs without human annotations.

Contribution

It proposes a new sparse volume strategy to convert 2D visual features into 3D space within a multimodal LLM, enhancing trajectory prediction in autonomous driving.
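As a rough illustration of what lifting 2D features into a sparse 3D volume can mean, the sketch below projects candidate 3D points into a single image with a simple pinhole model and keeps only the points that land inside the feature map. This is not the paper's implementation: the pinhole camera, the nearest-neighbor sampling, the scalar features, and every name here (`lift_features_to_sparse_volume`, `voxel_centers`, `focal`, `cx`, `cy`) are assumptions made for the example.

```python
from typing import Dict, List, Tuple

Point3D = Tuple[float, float, float]


def lift_features_to_sparse_volume(
    feature_map: List[List[float]],
    voxel_centers: List[Point3D],
    focal: float,
    cx: float,
    cy: float,
) -> Dict[Point3D, float]:
    """Attach a 2D feature to each 3D point that is visible in the image.

    Hypothetical helper, not taken from the paper. Points are given in the
    camera frame (x right, y down, z forward). The result is sparse: points
    behind the camera or outside the image are simply left out.
    """
    height = len(feature_map)
    width = len(feature_map[0])
    volume: Dict[Point3D, float] = {}
    for x, y, z in voxel_centers:
        if z <= 0.0:
            continue  # behind the camera, so it cannot be observed
        # Pinhole projection from camera coordinates to pixel coordinates.
        u = focal * x / z + cx
        v = focal * y / z + cy
        col, row = int(round(u)), int(round(v))
        # Keep only points whose projection falls inside the feature map.
        if 0 <= row < height and 0 <= col < width:
            volume[(x, y, z)] = feature_map[row][col]
    return volume


features = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]]
points = [(0.0, 0.0, 1.0), (1.0, 0.0, 1.0), (0.0, 0.0, -1.0), (5.0, 0.0, 1.0)]
sparse_volume = lift_features_to_sparse_volume(features, points, 1.0, 1.0, 1.0)
```

Of the four candidate points, only the first two survive: the third lies behind the camera and the fourth projects outside the image. A multi-view, multi-frame setting would repeat this per camera and per time step and aggregate the results, which the sketch leaves out.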

Findings

01

Outperforms existing supervised methods on nuScenes and Waymo datasets.

02

Requires no human annotations for training.

03

Demonstrates scalability with large unannotated driving logs.

Abstract

The latest advancements in multi-modal large language models (MLLMs) have spurred a strong renewed interest in end-to-end motion planning approaches for autonomous driving. Many end-to-end approaches rely on human annotations to learn intermediate perception and prediction tasks, while purely self-supervised approaches, which learn directly from sensor inputs to generate planning trajectories without human annotations, often underperform the state of the art. We observe a key gap in the input representation space: end-to-end approaches built on MLLMs are often pretrained with reasoning tasks in 2D image space rather than the native 3D space in which autonomous vehicles plan. To this end, we propose S4-Driver, a scalable self-supervised motion planning algorithm with spatio-temporal visual representation, based on the popular PaLI multimodal large language model. S4-Driver uses a novel…
