Latent Video Prediction Learns Better World Models
Ali J Alrasheed, Aryan Yazdan Parast, Basim Azam, James Bailey, Naveed Akhtar

TL;DR
This paper systematically evaluates latent video prediction models as world models and reports that they are more robust, and encode physical and temporal information better, than pixel-based models.
Contribution
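It presents what the authors describe as the first systematic study of four matched-capacity video foundation models (V-JEPA 2.1, V-JEPA 2, VideoPrism, and VideoMAEv2) across five robustness axes, highlighting the advantages of latent-prediction models over pixel-based ones.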
Findings
Latent prediction models degrade more gracefully under corruption.
They preserve usable class structure under occlusion, rather than mere geometric stability.
They encode the arrow of time and physical cues without pixel reconstruction.
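To make the temporal-direction finding concrete, here is a minimal sketch of one way such sensitivity could be probed: compare the features an encoder produces for a clip with those it produces for the same clip played backwards. This is not the paper's protocol. The function `reversal_similarity`, the two toy encoders, and the random stand-in "clips" are all hypothetical and exist only for illustration; a real study would substitute a frozen video model and real video data.

```python
import numpy as np

def reversal_similarity(encode, clips):
    """Mean cosine similarity between features of each clip and its reversed copy.

    clips:  array of shape (N, T, D), N clips of T frames with D values each.
    encode: callable mapping one (T, D) clip to a 1-D feature vector
            (hypothetical stand-in for a frozen video encoder).
    A value near 1 means the features ignore playback direction.
    """
    sims = []
    for clip in clips:
        f_fwd = encode(clip)
        f_rev = encode(clip[::-1])  # same frames, reversed in time
        denom = np.linalg.norm(f_fwd) * np.linalg.norm(f_rev)
        sims.append(float(f_fwd @ f_rev / denom))
    return float(np.mean(sims))

def mean_pool(clip):
    # Toy order-invariant encoder: averages frames, so direction is lost.
    return clip.mean(axis=0)

def frame_diff(clip):
    # Toy order-sensitive encoder: averages frame-to-frame differences,
    # which flip sign when the clip is reversed.
    return (clip[1:] - clip[:-1]).mean(axis=0)

# Random arrays stand in for video clips (assumption for illustration only).
rng = np.random.default_rng(0)
clips = rng.normal(size=(8, 16, 32))

sim_invariant = reversal_similarity(mean_pool, clips)
sim_sensitive = reversal_similarity(frame_diff, clips)
print(f"order-invariant encoder: {sim_invariant:.3f}")
print(f"order-sensitive encoder: {sim_sensitive:.3f}")
```

Under this sketch, an encoder that discards temporal order scores close to 1, while one that tracks frame-to-frame change scores much lower. A similarity score alone does not show that the arrow of time is decodable from the features; a classifier trained to separate forward from reversed clips would be a stronger check.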
Abstract
Self-supervised video models are increasingly framed as world models, yet their evaluation remains largely confined to a single top-1 accuracy score on clean benchmarks. This leaves a major gap in comprehending their potential as world models. We present the first systematic study addressing this gap, analyzing four matched-capacity frontier video foundation models, V-JEPA 2.1, V-JEPA 2, VideoPrism, and VideoMAEv2, across five robustness axes relevant to their deployment as video world models: feature discriminability, corruption robustness, fine-grained discrimination, occlusion robustness, and sensitivity to temporal direction. Our evaluations establish that across all five axes, latent-prediction models form a distinct and consistent profile. They degrade more gracefully under pixel corruption, preserve usable class structure rather than mere geometric stability under occlusion,…
