Bootstrapping Action-Grounded Visual Dynamics in Unified Vision-Language Models
Yifu Qiu, Yftah Ziser, Anna Korhonen, Shay B. Cohen, Edoardo M. Ponti

TL;DR
This paper asks whether unified vision-language models can predict future visual states from actions, and proposes using inverse dynamics prediction to bootstrap forward dynamics prediction, which improves image editing performance.
Contribution
It introduces a method to leverage inverse dynamics prediction for bootstrapping forward dynamics prediction in vision-language models, enhancing their ability to generate plausible future frames.
Findings
Achieved image editing performance competitive with state-of-the-art models.
Improved image editing accuracy by 7-13% over existing methods.
Obtained the best human evaluation scores across Aurora-Bench subsets.
Abstract
Can unified vision-language models (VLMs) perform forward dynamics prediction (FDP), i.e., predicting the future state (in image form) given the previous observation and an action (in language form)? We find that VLMs struggle to generate physically plausible transitions between frames from instructions. Nevertheless, we identify a crucial asymmetry in multimodal grounding: fine-tuning a VLM to learn inverse dynamics prediction (IDP), effectively captioning the action between frames, is significantly easier than learning FDP. In turn, IDP can be used to bootstrap FDP through two main strategies: 1) weakly supervised learning from synthetic data and 2) inference time verification. Firstly, IDP can annotate actions for unlabelled pairs of video frame observations to expand the training data scale for FDP. Secondly, IDP can assign rewards to multiple samples of FDP to score them,…
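The two bootstrapping strategies described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names (`pseudo_label`, `best_of_n`), the callable signatures, and the toy integer "frames" are all assumptions made for illustration. In practice the IDP and FDP components would be calls to a fine-tuned VLM, and the reward could be a likelihood rather than the exact-match score used here.

```python
from typing import Callable, List, Tuple

# Hypothetical stand-ins: a "frame" is an int and an "action" is a string.
Frame = int
Action = str


def pseudo_label(
    frame_pairs: List[Tuple[Frame, Frame]],
    caption_action: Callable[[Frame, Frame], Action],
) -> List[Tuple[Frame, Action, Frame]]:
    """Strategy 1: use an IDP model to annotate unlabelled frame pairs,
    yielding (observation, action, next frame) triplets for FDP training."""
    return [(before, caption_action(before, after), after)
            for before, after in frame_pairs]


def best_of_n(
    observation: Frame,
    action: Action,
    sample_next: Callable[[Frame, Action], Frame],
    score_action: Callable[[Frame, Frame, Action], float],
    n: int = 4,
) -> Frame:
    """Strategy 2: sample n FDP candidates and keep the one that the
    IDP-based scorer rates as most consistent with the requested action."""
    candidates = [sample_next(observation, action) for _ in range(n)]
    return max(candidates, key=lambda nxt: score_action(observation, nxt, action))


# Toy components (assumptions, for demonstration only).
def toy_caption(before: Frame, after: Frame) -> Action:
    return f"add {after - before}"


def toy_score(before: Frame, after: Frame, action: Action) -> float:
    # Reward 1.0 when the inferred action matches the requested one.
    return 1.0 if toy_caption(before, after) == action else 0.0


def make_toy_sampler(offsets: List[int]) -> Callable[[Frame, Action], Frame]:
    # Deterministic "noisy" FDP sampler that replays a fixed list of offsets.
    it = iter(offsets)
    return lambda observation, action: observation + next(it)


labelled = pseudo_label([(1, 3), (5, 6)], toy_caption)
chosen = best_of_n(10, "add 2", make_toy_sampler([5, 2, 7]), toy_score, n=3)
```

Here `labelled` holds synthetic training triplets, and `chosen` is the candidate whose inferred action matches the instruction. The design point is that the same IDP component serves both as a data annotator and as a verifier at inference time.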
Taxonomy
Topics: Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis · Human Pose and Action Recognition
