Learning to Drive is a Free Gift: Large-Scale Label-Free Autonomy Pretraining from Unposed In-The-Wild Videos

Matthew Strong; Wei-Jer Chang; Quentin Herau; Jiezhi Yang; Yihan Hu; Chensheng Peng; Wei Zhan

arXiv:2602.22091 · cs.CV · March 6, 2026

Open Access

TL;DR

This paper introduces a label-free, teacher-guided framework that learns autonomous driving representations from unposed in-the-wild videos. Using a single monocular camera, it outperforms multi-camera and LiDAR baselines on NAVSIM and performs strongly on semantic, geometric, and motion prediction benchmarks.

Contribution

The paper proposes a novel multi-modal, self-supervised approach that learns a unified pseudo-4D representation from raw videos, requiring neither annotations nor pose information.

Findings

01. Outperforms multi-camera and LiDAR baselines with a single monocular camera on NAVSIM

02. Effective transfer to downstream autonomous driving planning tasks

03. Strong performance on semantic, geometric, and motion prediction benchmarks

Abstract

Ego-centric driving videos available online provide an abundant source of visual data for autonomous driving, yet their lack of annotations makes it difficult to learn representations that capture both semantic structure and 3D geometry. Recent advances in large feedforward spatial models demonstrate that point maps and ego-motion can be inferred in a single forward pass, suggesting a promising direction for scalable driving perception. We therefore propose a label-free, teacher-guided framework for learning autonomous driving representations directly from unposed videos. Unlike prior self-supervised approaches that focus primarily on frame-to-frame consistency, we posit that safe and reactive driving depends critically on temporal context. To this end, we leverage a feedforward architecture equipped with a lightweight autoregressive module, trained using multi-modal supervisory signals…

Taxonomy

Topics: Advanced Vision and Imaging · Autonomous Vehicle Technology and Safety · Generative Adversarial Networks and Image Synthesis