Bootstrapping Action-Grounded Visual Dynamics in Unified Vision-Language Models

Yifu Qiu; Yftah Ziser; Anna Korhonen; Shay B. Cohen; Edoardo M. Ponti

arXiv:2506.06006·cs.CV·February 13, 2026

TL;DR

This paper studies whether unified vision-language models can predict a future visual state from a previous observation and a language action. It proposes using inverse dynamics prediction to bootstrap forward dynamics prediction, which improves image editing performance.

Contribution

The paper introduces a method that uses inverse dynamics prediction to bootstrap forward dynamics prediction in vision-language models, improving their ability to generate plausible future frames.

Findings

01. Achieved image editing performance competitive with state-of-the-art models.

02. Improved image editing accuracy by 7-13% over existing methods.

03. Obtained the best human evaluation scores across Aurora-Bench subsets.

Abstract

Can unified vision-language models (VLMs) perform forward dynamics prediction (FDP), i.e., predicting the future state (in image form) given the previous observation and an action (in language form)? We find that VLMs struggle to generate physically plausible transitions between frames from instructions. Nevertheless, we identify a crucial asymmetry in multimodal grounding: fine-tuning a VLM to learn inverse dynamics prediction (IDP), effectively captioning the action between frames, is significantly easier than learning FDP. In turn, IDP can be used to bootstrap FDP through two main strategies: 1) weakly supervised learning from synthetic data and 2) inference time verification. Firstly, IDP can annotate actions for unlabelled pairs of video frame observations to expand the training data scale for FDP. Secondly, IDP can assign rewards to multiple samples of FDP to score them,…
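The two bootstrapping strategies named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: every name below (`annotate_pairs`, `best_of_n`, `toy_idp_caption`, `toy_idp_score`) is hypothetical, frames are reduced to single floats, actions to the strings "increase" and "decrease", and the reward is an assumed stand-in, since the abstract is cut off before it explains how IDP scores are used.

```python
from typing import Callable, Iterable, List, Tuple

Frame = float   # toy stand-in for an image (assumption for illustration)
Action = str    # toy stand-in for a language instruction


def annotate_pairs(
    frame_pairs: Iterable[Tuple[Frame, Frame]],
    idp_caption: Callable[[Frame, Frame], Action],
) -> List[Tuple[Frame, Action, Frame]]:
    """Strategy 1 (weak supervision): label unlabelled (before, after)
    frame pairs with IDP-predicted actions, yielding FDP training triples."""
    return [(before, idp_caption(before, after), after)
            for before, after in frame_pairs]


def best_of_n(
    before: Frame,
    action: Action,
    fdp_sample: Callable[[Frame, Action], Frame],
    idp_score: Callable[[Frame, Frame, Action], float],
    n: int = 4,
) -> Frame:
    """Strategy 2 (inference-time verification): draw n FDP candidates and
    keep the one to which the IDP-based scorer assigns the highest reward."""
    candidates = [fdp_sample(before, action) for _ in range(n)]
    return max(candidates, key=lambda after: idp_score(before, after, action))


def toy_idp_caption(before: Frame, after: Frame) -> Action:
    """Hypothetical IDP model: names the action that explains the change."""
    return "increase" if after > before else "decrease"


def toy_idp_score(before: Frame, after: Frame, action: Action) -> float:
    """Hypothetical reward: how strongly the change agrees with the action."""
    delta = after - before
    return delta if action == "increase" else -delta


# Strategy 1: turn unlabelled pairs into (before, action, after) triples.
triples = annotate_pairs([(0.0, 1.0), (2.0, 1.0)], toy_idp_caption)

# Strategy 2: a fixed candidate stream stands in for stochastic FDP sampling.
_stream = iter([-0.5, 0.3, 0.9, 0.1])
best = best_of_n(0.0, "increase", lambda b, a: next(_stream), toy_idp_score, n=4)
```

In a real system, the toy callables would be replaced by the fine-tuned VLM acting as captioner, scorer, and frame generator. The sketch only shows how the easier inverse task supplies both training labels and a selection signal for the harder forward task.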

Code & Models

Repositories

yfqiu-nlp/vlm-world-model (jax, official)

Taxonomy

Topics: Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis · Human Pose and Action Recognition