Drive-JEPA: Video JEPA Meets Multimodal Trajectory Distillation for End-to-End Driving
Linhan Wang, Zichong Yang, Chen Bai, Guoxiang Zhang, Xiaotong Liu, Xiaoyin Zheng, Xiao-Xiao Long, Chang-Tien Lu, Cheng Lu

TL;DR
Drive-JEPA combines V-JEPA video pretraining with multimodal trajectory distillation for end-to-end autonomous driving, reaching 93.3 PDMS on the v1 dataset and 87.8 EPDMS on the v2 dataset.
Contribution
The paper adapts V-JEPA to pretrain a ViT encoder on driving videos and introduces a proposal-centric planner that uses trajectory distillation to learn multimodal driving behavior.
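To make the pretraining idea concrete, here is a minimal toy sketch of a JEPA-style objective: an online encoder sees only the visible tokens, a predictor guesses the latents of the masked tokens, and a slowly updated target encoder supplies the regression targets. This is not the authors' implementation. The linear "encoders", the mean-pooled context summary, the function names (`ema_update`, `jepa_latent_loss`) and all shapes are assumptions chosen for illustration. The L1 latent regression and the EMA target encoder follow the general V-JEPA recipe.

```python
import numpy as np

def ema_update(target_w, online_w, momentum=0.99):
    """Move the target-encoder weights slowly toward the online encoder."""
    return momentum * target_w + (1.0 - momentum) * online_w

def jepa_latent_loss(tokens, pos, mask, enc_w, tgt_w, pred_w):
    """Toy JEPA loss: predict latents of masked tokens from visible ones.

    tokens: (N, D) patch tokens, pos: (N, P) position embeddings,
    mask: (N,) bool with True marking tokens hidden from the online encoder.
    All weights are plain matrices (hypothetical stand-ins for a ViT).
    """
    visible = tokens[~mask] @ enc_w              # (Nv, H) context latents
    summary = visible.mean(axis=0)               # (H,) pooled context
    n_masked = int(mask.sum())
    # The predictor input is the context summary plus the masked position.
    pred_in = np.concatenate(
        [np.tile(summary, (n_masked, 1)), pos[mask]], axis=1
    )                                            # (Nm, H + P)
    pred = pred_in @ pred_w                      # (Nm, H) predicted latents
    # Targets come from the target encoder and are treated as constants.
    target = tokens[mask] @ tgt_w                # (Nm, H)
    return float(np.abs(pred - target).mean())   # L1 regression in latent space

rng = np.random.default_rng(0)
N, D, H, P = 16, 8, 4, 3
tokens = rng.normal(size=(N, D))
pos = rng.normal(size=(N, P))
mask = np.zeros(N, dtype=bool)
mask[N // 2:] = True                             # hide the second half
enc_w = rng.normal(size=(D, H))
tgt_w = enc_w.copy()
pred_w = rng.normal(size=(H + P, H))

loss = jepa_latent_loss(tokens, pos, mask, enc_w, tgt_w, pred_w)
tgt_w = ema_update(tgt_w, enc_w)                 # one target-encoder EMA step
```

In a real training loop the loss would be backpropagated through the online encoder and predictor only, with the target encoder updated solely through the EMA step.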
Findings
Outperforms prior methods by 3 PDMS in the perception-free setting
Achieves 93.3 PDMS on v1 and 87.8 EPDMS on v2 datasets
Sets a new state of the art on end-to-end driving benchmarks
Abstract
End-to-end autonomous driving increasingly leverages self-supervised video pretraining to learn transferable planning representations. However, pretraining video world models for scene understanding has so far brought only limited improvements. This limitation is compounded by the inherent ambiguity of driving: each scene typically provides only a single human trajectory, making it difficult to learn multimodal behaviors. In this work, we propose Drive-JEPA, a framework that integrates Video Joint-Embedding Predictive Architecture (V-JEPA) with multimodal trajectory distillation for end-to-end driving. First, we adapt V-JEPA for end-to-end driving, pretraining a ViT encoder on large-scale driving videos to produce predictive representations aligned with trajectory planning. Second, we introduce a proposal-centric planner that distills diverse simulator-generated trajectories alongside…
Taxonomy
Topics: Autonomous Vehicle Technology and Safety · Robotic Path Planning Algorithms · Human Pose and Action Recognition
