VFMF: World Modeling by Forecasting Vision Foundation Model Features

Gabrijel Boduljak; Yushi Lan; Christian Rupprecht; Andrea Vedaldi

arXiv:2512.11225 · cs.CV · December 15, 2025

TL;DR

This paper introduces a generative world forecasting model that predicts future states in vision foundation model feature space using autoregressive flow matching, improving accuracy and interpretability over deterministic methods.

Contribution

The paper proposes generative forecasting in VFM feature space using autoregressive flow matching, so that forecasts capture uncertainty over multiple plausible futures instead of averaging them, as deterministic regression does.
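To make the two ingredients named above concrete, here is a minimal sketch of flow matching with an autoregressive rollout. It assumes the common linear interpolation path between noise and data and a plain Euler sampler; the page does not say which path, sampler, network, or conditioning scheme the paper uses. All function names (`flow_matching_pair`, `euler_sample`, `rollout`, `toy_velocity`) are hypothetical, and `toy_velocity` is a hand-written stand-in for a trained network, not the authors' model.

```python
import numpy as np

def flow_matching_pair(x0, x1, t):
    """Training pair for linear-path flow matching (assumed path).

    x0: noise sample, x1: target feature, t: time in [0, 1].
    Returns the point on the path and the velocity a network would regress.
    """
    x_t = (1.0 - t) * x0 + t * x1
    v_target = x1 - x0
    return x_t, v_target

def euler_sample(velocity, noise, context, steps=10):
    """Integrate dx/dt = velocity(x, t, context) from t=0 to t=1 with Euler steps."""
    x = noise
    dt = 1.0 / steps
    for i in range(steps):
        t = i * dt
        x = x + dt * velocity(x, t, context)
    return x

def rollout(velocity, context, horizon, rng, steps=10):
    """Autoregressive forecast: sample one frame, append it to the context, repeat."""
    frames = list(context)
    for _ in range(horizon):
        noise = rng.standard_normal(frames[-1].shape)
        frames.append(euler_sample(velocity, noise, np.stack(frames), steps))
    return np.stack(frames[len(context):])

def toy_velocity(x, t, context):
    # Hypothetical stand-in for a trained network: the exact velocity of the
    # linear path that ends at the most recent context frame.
    return (context[-1] - x) / (1.0 - t)

rng = np.random.default_rng(0)
context = np.ones((2, 4))  # two past frames of 4-dimensional features
forecast = rollout(toy_velocity, context, horizon=3, rng=rng)  # shape (3, 4)
```

With a learned velocity field in place of `toy_velocity`, different noise draws would yield different forecasts, which is what lets a sampler of this kind represent several plausible futures.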

Findings

01 Outperforms regression-based methods in accuracy and sharpness

02 Produces diverse and interpretable future predictions

03 Effective across multiple output modalities such as segmentation and depth

Abstract

Forecasting from partial observations is central to world modeling. Many recent methods represent the world through images, and reduce forecasting to stochastic video generation. Although such methods excel at realism and visual fidelity, predicting pixels is computationally intensive and not directly useful in many applications, as it requires translating RGB into signals useful for decision making. An alternative approach uses features from vision foundation models (VFMs) as world representations, performing deterministic regression to predict future world states. These features can be directly translated into actionable signals such as semantic segmentation and depth, while remaining computationally efficient. However, deterministic regression averages over multiple plausible futures, undermining forecast accuracy by failing to capture uncertainty. To address this crucial limitation,…
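The averaging effect described in the abstract can be seen in a toy setting. The sketch below is an illustration under assumed conditions, not the paper's experiment: a one-dimensional "future feature" takes the value -1 or +1 with equal probability from the same context, and the variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup (assumed): from one observed context, the future feature is
# either -1 or +1 with equal probability, i.e. two plausible futures.
futures = rng.choice([-1.0, 1.0], size=10_000)

def mse(pred):
    """Mean squared error of a single constant prediction."""
    return float(np.mean((futures - pred) ** 2))

# The constant that minimises MSE is the mean, which lies near 0:
# a value that never actually occurs among the futures.
mse_optimal = float(futures.mean())

# A generative forecaster instead draws samples, each a plausible future.
samples = rng.choice(futures, size=5)
```

The regression answer has lower error than committing to either mode, yet it matches neither future; sampled forecasts each land on one of the two real outcomes.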

Code & Models

Models: Gabrijel/vfmf (🤗 Hugging Face model)

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Machine Learning in Healthcare · Multimodal Machine Learning Applications