Back to the Features: DINO as a Foundation for Video World Models

Federico Baldassarre; Marc Szafraniec; Basile Terver; Vasil Khalidov; Francisco Massa; Yann LeCun; Patrick Labatut; Maximilian Seitzer; Piotr Bojanowski

arXiv:2507.19468 · cs.CV · July 28, 2025

TL;DR

DINO-world is a versatile video world model that predicts future frames in DINOv2's latent space, demonstrating strong performance across diverse scenes and enabling action-conditioned planning.

Contribution

Introduces a video world model that pairs a pre-trained image encoder (DINOv2) with a future predictor trained on a large-scale uncurated video dataset, improving future prediction and enabling planning.

Findings

01. Outperforms previous models on video prediction benchmarks

02. Shows strong understanding of intuitive physics

03. Enables action-conditioned planning in latent space

Abstract

We present DINO-world, a powerful generalist video world model trained to predict future frames in the latent space of DINOv2. By leveraging a pre-trained image encoder and training a future predictor on a large-scale uncurated video dataset, DINO-world learns the temporal dynamics of diverse scenes, from driving and indoor scenes to simulated environments. We show that DINO-world outperforms previous models on a variety of video prediction benchmarks, e.g. segmentation and depth forecasting, and demonstrates strong understanding of intuitive physics. Furthermore, we show that it is possible to fine-tune the predictor on observation-action trajectories. The resulting action-conditioned world model can be used for planning by simulating candidate trajectories in latent space.
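The planning idea in the last sentence can be illustrated with a minimal sketch. The abstract does not specify the planner, the cost function, or the predictor's interface, so everything below is an assumption made for illustration: `toy_predictor` is a hypothetical stand-in for a learned action-conditioned predictor (here it simply adds the action to the latent), the planner is plain random shooting, and the cost is the squared distance between the final predicted latent and a goal latent.

```python
import random


def toy_predictor(latent, action):
    """Hypothetical stand-in for a learned predictor: next latent = latent + action."""
    return [z + a for z, a in zip(latent, action)]


def squared_distance(u, v):
    """Squared Euclidean distance between two latent vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def rollout(predictor, latent, actions):
    """Simulate a candidate trajectory entirely in latent space."""
    for action in actions:
        latent = predictor(latent, action)
    return latent


def plan_random_shooting(predictor, start, goal, horizon=5,
                         num_candidates=256, action_scale=1.0, seed=0):
    """Sample random action sequences and keep the one ending closest to the goal."""
    rng = random.Random(seed)
    dim = len(start)
    best_actions, best_cost = None, float("inf")
    for _ in range(num_candidates):
        # One candidate: `horizon` actions, each with the same size as the latent.
        actions = [[rng.uniform(-action_scale, action_scale) for _ in range(dim)]
                   for _ in range(horizon)]
        cost = squared_distance(rollout(predictor, start, actions), goal)
        if cost < best_cost:
            best_actions, best_cost = actions, cost
    return best_actions, best_cost


# Toy 2-D "latents"; in practice these would be encoder features of
# the current observation and of a goal observation (assumption).
start = [0.0, 0.0]
goal = [2.0, -1.0]
best_actions, best_cost = plan_random_shooting(toy_predictor, start, goal)
baseline_cost = squared_distance(start, goal)
```

In this sketch only the first action of `best_actions` would typically be executed before replanning from the new observation; that receding-horizon loop is a common design choice, not something the abstract states.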
