Drive-JEPA: Video JEPA Meets Multimodal Trajectory Distillation for End-to-End Driving

Linhan Wang; Zichong Yang; Chen Bai; Guoxiang Zhang; Xiaotong Liu; Xiaoyin Zheng; Xiao-Xiao Long; Chang-Tien Lu; Cheng Lu

arXiv:2601.22032·cs.CV·January 30, 2026

TL;DR

Drive-JEPA combines V-JEPA video pretraining with multimodal trajectory distillation for end-to-end autonomous driving, reaching 93.3 PDMS on the v1 dataset and 87.8 EPDMS on the v2 dataset.

Contribution

Drive-JEPA adapts V-JEPA to pretraining on driving video and introduces a proposal-centric planner that uses trajectory distillation to learn multimodal driving behavior.
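To make the pretraining idea concrete, here is a minimal sketch of a JEPA-style latent prediction objective: a predictor's output for masked tokens is regressed onto embeddings from a target encoder, whose weights follow the online encoder through an exponential moving average. This is an illustration under stated assumptions, not the paper's implementation. The function names `masked_l1_loss` and `ema_update`, the L1 distance, the momentum value, and the toy embeddings are all hypothetical choices made for the example.

```python
def ema_update(target_params, online_params, momentum=0.99):
    """Move target-encoder weights a small step toward the online encoder.

    Assumption: parameters are flattened into plain lists of floats.
    """
    return [momentum * t + (1.0 - momentum) * o
            for t, o in zip(target_params, online_params)]


def masked_l1_loss(predicted, target, mask):
    """Mean absolute error in embedding space, over masked tokens only.

    predicted, target: one embedding (list of floats) per token.
    mask: True where the token was hidden from the context encoder.
    """
    total, count = 0.0, 0
    for p_tok, t_tok, is_masked in zip(predicted, target, mask):
        if not is_masked:
            continue  # visible tokens do not contribute to the loss
        total += sum(abs(p - t) for p, t in zip(p_tok, t_tok))
        count += len(p_tok)
    return total / count if count else 0.0


# Toy example: three tokens with 2-D embeddings, the last two masked.
predicted = [[0.0, 1.0], [2.0, 2.0], [1.0, 1.0]]
target = [[0.0, 0.0], [1.0, 3.0], [1.0, 1.0]]
mask = [False, True, True]

loss = masked_l1_loss(predicted, target, mask)
new_target = ema_update([1.0, 0.0], [0.0, 1.0], momentum=0.9)
```

The loss is computed in representation space rather than pixel space, which is the defining property of joint-embedding predictive training. In a real setup the target embeddings would be treated as constants (no gradient), and only the online encoder and predictor would be optimized.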

Findings

01 Outperforms prior methods by 3 PDMS in the perception-free setting

02 Achieves 93.3 PDMS on the v1 dataset and 87.8 EPDMS on the v2 dataset

03 Sets a new state of the art on end-to-end driving benchmarks

Abstract

End-to-end autonomous driving increasingly leverages self-supervised video pretraining to learn transferable planning representations. However, pretraining video world models for scene understanding has so far brought only limited improvements. This limitation is compounded by the inherent ambiguity of driving: each scene typically provides only a single human trajectory, making it difficult to learn multimodal behaviors. In this work, we propose Drive-JEPA, a framework that integrates Video Joint-Embedding Predictive Architecture (V-JEPA) with multimodal trajectory distillation for end-to-end driving. First, we adapt V-JEPA for end-to-end driving, pretraining a ViT encoder on large-scale driving videos to produce predictive representations aligned with trajectory planning. Second, we introduce a proposal-centric planner that distills diverse simulator-generated trajectories alongside…
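The abstract is cut off before it describes the distillation objective, so the following is only a hedged sketch of how supervision from several teacher trajectories (one human, others simulator-generated) could be turned into soft labels over a fixed set of planner proposals. It is not the paper's loss. The helper names `ade`, `soft_targets`, and `distillation_loss`, the distance measure, the temperature, and the toy trajectories are all assumptions made for illustration.

```python
import math


def ade(a, b):
    """Average displacement error between two equal-length 2-D trajectories."""
    return sum(math.dist(p, q) for p, q in zip(a, b)) / len(a)


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]


def soft_targets(proposals, teachers, temperature=1.0):
    """Soft label per proposal: closer to ANY teacher means higher weight."""
    neg_dist = [-min(ade(p, t) for t in teachers) / temperature
                for p in proposals]
    return softmax(neg_dist)


def distillation_loss(scores, targets):
    """Cross-entropy between the soft labels and the planner's score softmax."""
    probs = softmax(scores)
    return -sum(t * math.log(q) for t, q in zip(targets, probs))


# Toy scene: three proposals; the human drove straight, a simulator
# rollout (hypothetical) turned left. Both count as valid modes.
proposals = [
    [(1, 0), (2, 0)],    # straight
    [(1, 1), (2, 2)],    # left
    [(1, -1), (2, -2)],  # right
]
teachers = [[(1, 0), (2, 0)], [(1, 1), (2, 2)]]

targets = soft_targets(proposals, teachers)
loss = distillation_loss([0.0, 0.0, 0.0], targets)
```

With a single human trajectory, only the "straight" proposal would receive a high label. Adding the simulator-generated teacher gives "left" the same weight, which is one simple way to expose a planner to more than one acceptable behavior per scene.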

Taxonomy

Topics: Autonomous Vehicle Technology and Safety · Robotic Path Planning Algorithms · Human Pose and Action Recognition