VFMF: World Modeling by Forecasting Vision Foundation Model Features
Gabrijel Boduljak, Yushi Lan, Christian Rupprecht, Andrea Vedaldi

TL;DR
This paper introduces a generative world forecasting model that predicts future states in vision foundation model (VFM) feature space using autoregressive flow matching, improving accuracy and interpretability over deterministic methods.
Contribution
It proposes generative forecasting in VFM feature space using autoregressive flow matching, so that uncertainty over plausible futures is modeled explicitly instead of being averaged away by deterministic regression.
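To make the flow matching idea concrete, here is a minimal toy sketch, not the paper's model. It assumes a one-dimensional "feature" with two equally likely futures (the hypothetical `MODES`), a standard normal noise source, and the linear path x_t = (1 - t) * x0 + t * x1. Under those assumptions the marginal velocity has a closed form, which stands in here for the velocity a trained network would predict. The sketch is neither autoregressive nor conditioned on past frames; it only shows how integrating a velocity field turns noise into samples that each commit to one future.

```python
import math
import random

MODES = (-1.0, 1.0)  # hypothetical: two equally likely future feature values


def velocity(x, t):
    """Exact marginal velocity E[x1 - x0 | x_t = x] for the linear path
    x_t = (1 - t) * x0 + t * x1, with x0 ~ N(0, 1) and x1 uniform over MODES.
    In a real system a learned network would approximate this function."""
    s = 1.0 - t
    # Log-likelihood of x under each mode: x_t | x1 = m is N(t * m, s^2)
    logits = [-((x - t * m) ** 2) / (2.0 * s * s) for m in MODES]
    top = max(logits)  # subtract the max for numerical stability
    weights = [math.exp(l - top) for l in logits]
    expected_x1 = sum(w * m for w, m in zip(weights, MODES)) / sum(weights)
    # Given x_t = x and x1 = m, the path velocity is (m - x) / (1 - t)
    return (expected_x1 - x) / s


def sample(rng, steps=50):
    """Draw noise and integrate dx/dt = velocity(x, t) with Euler steps."""
    x = rng.gauss(0.0, 1.0)
    for i in range(steps):
        t = i / steps  # t stays below 1, so 1 - t never reaches zero
        x += velocity(x, t) / steps
    return x


rng = random.Random(0)
samples = [sample(rng) for _ in range(200)]
near_left = sum(1 for x in samples if x < 0)
near_right = sum(1 for x in samples if x > 0)
print(near_left, near_right, min(samples), max(samples))
```

Each sample ends close to one of the two modes, and repeated sampling covers both, which is the behaviour a generative forecaster needs in order to represent several plausible futures.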
Findings
Outperforms regression-based methods in accuracy and sharpness
Produces diverse and interpretable future predictions
Effective across multiple output modalities such as segmentation and depth
Abstract
Forecasting from partial observations is central to world modeling. Many recent methods represent the world through images, and reduce forecasting to stochastic video generation. Although such methods excel at realism and visual fidelity, predicting pixels is computationally intensive and not directly useful in many applications, as it requires translating RGB into signals useful for decision making. An alternative approach uses features from vision foundation models (VFMs) as world representations, performing deterministic regression to predict future world states. These features can be directly translated into actionable signals such as semantic segmentation and depth, while remaining computationally efficient. However, deterministic regression averages over multiple plausible futures, undermining forecast accuracy by failing to capture uncertainty. To address this crucial limitation,…
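The claim that deterministic regression averages over plausible futures can be illustrated with a small toy sketch. It assumes a hypothetical one-dimensional future feature that takes the value -1 or +1 with equal probability (say, an object moving left or right); none of these names or values come from the paper. The constant prediction that minimizes mean squared error is the sample mean, which lies near 0 and therefore matches neither outcome.

```python
import random


def mse(prediction, outcomes):
    """Mean squared error of one fixed prediction against observed outcomes."""
    return sum((prediction - y) ** 2 for y in outcomes) / len(outcomes)


random.seed(0)
# Hypothetical futures: the feature ends at -1 or +1 with equal probability
futures = [random.choice([-1.0, 1.0]) for _ in range(1000)]

# The MSE-optimal constant prediction is the sample mean of the outcomes
mean_prediction = sum(futures) / len(futures)

# Distance from that prediction to the nearest outcome that can actually occur
gap = min(abs(mean_prediction - mode) for mode in (-1.0, 1.0))

print("mean prediction:", mean_prediction)
print("MSE of mean prediction:", mse(mean_prediction, futures))
print("MSE of committing to +1:", mse(1.0, futures))
print("gap to nearest real outcome:", gap)
```

The mean prediction achieves the lowest squared error yet never corresponds to a future that occurs, which is the blurring effect that motivates sampling from a generative model instead.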
Taxonomy
Topics: Generative Adversarial Networks and Image Synthesis · Machine Learning in Healthcare · Multimodal Machine Learning Applications
