AVID: Adapting Video Diffusion Models to World Models
Marc Rigter, Tarun Gupta, Agrin Hilmkil, Chao Ma

TL;DR
This paper introduces AVID, a method that adapts pretrained video diffusion models, which are trained without action labels, into action-conditioned world models for decision-making tasks such as robotics, without needing access to the pretrained model's parameters.
Contribution
AVID proposes a novel adapter-based approach that adapts pretrained video diffusion models for action-conditioned world modeling without access to their parameters.
Findings
AVID outperforms existing baselines on video game and robotics data.
Pretrained video models can be effectively adapted for embodied AI tasks.
AVID enables realistic action-conditioned video generation from video models that were pretrained without action labels.
Abstract
Large-scale generative models have achieved remarkable success in a number of domains. However, for sequential decision-making problems, such as robotics, action-labelled data is often scarce, and therefore scaling up foundation models for decision-making remains a challenge. A potential solution lies in leveraging widely-available unlabelled videos to train world models that simulate the consequences of actions. If the world model is accurate, it can be used to optimize decision-making in downstream tasks. Image-to-video diffusion models are already capable of generating highly realistic synthetic videos. However, these models are not action-conditioned, and the most powerful models are closed-source, which means they cannot be finetuned. In this work, we propose to adapt pretrained video diffusion models to action-conditioned world models, without access to the parameters of the…
Peer Reviews
Decision · Submitted to ICLR 2025
- The paper has a nice motivation: how can existing foundation models be adapted to add action conditioning, so as to make them more relevant and useful for embodied robotics applications?
- The paper writing is clear: first the limitations of prior work are built up, and then a solution is proposed.
- The way the paper starts with the motivation near L51-52 is a bit misleading. The paper cannot actually fix the issues in L51-52 because it still assumes access to the internal inference pipeline of these closed-source models: if I understand it correctly, the method needs access to a diffusion model's noise prediction at each of the N reverse diffusion steps that happen at inference. For closed-source models, this information is not available.
- The performance gain in the quantitat
1. The paper is well written and easy to follow.
2. The main idea of training a lightweight adapter for action-labeled domains is reasonable. It balances finetuning efficiency and task performance.
3. Baseline comparisons are comprehensive. The authors compare against many alternative baselines to demonstrate the effectiveness of their approach, and they provide qualitative visualizations of the quality of generated videos and the usefulness of learned masks.
1. The presentation in Section 3.2 is a little unclear. It is hard to connect the analysis of the limitations of previous work [1] to the motivation for the proposed approach.
2. The novelty is somewhat limited. The main difference from previous work is that the domain-specific adapter outputs an element-wise mask that is used to combine the noise predictions from the pre-trained model and the adapter.
3. The experimental domains are only two datasets within action-conditioned world modeling.
[1] Yang, Mengjiao, et al.
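The mask-based combination that this review describes can be illustrated with a small sketch. It is not the authors' implementation: the convex blend `mask * eps_pretrained + (1 - mask) * eps_adapter`, the function name `combine_noise_predictions`, and the tensor layout are all assumptions made for illustration, and the paper's exact formula may differ. What the sketch does reflect from the reviews is that the pretrained model is used only through its noise prediction at each denoising step, never through its parameters.

```python
import numpy as np

def combine_noise_predictions(eps_pretrained, eps_adapter, mask):
    """Blend two noise predictions with an element-wise mask.

    Assumed form (hypothetical, for illustration only):
        eps = mask * eps_pretrained + (1 - mask) * eps_adapter
    where mask values near 1 trust the pretrained model and
    values near 0 trust the adapter.
    """
    if not (eps_pretrained.shape == eps_adapter.shape == mask.shape):
        raise ValueError("all inputs must share the same shape")
    mask = np.clip(mask, 0.0, 1.0)  # keep the blend convex
    return mask * eps_pretrained + (1.0 - mask) * eps_adapter

# Toy example: random arrays stand in for the two models' outputs
# at a single reverse-diffusion step.
rng = np.random.default_rng(0)
shape = (4, 3, 8, 8)  # (frames, channels, height, width), hypothetical layout
eps_pre = rng.standard_normal(shape)
eps_ada = rng.standard_normal(shape)

# Trust the pretrained model on the left half of each frame
# and the adapter on the right half.
mask = np.ones(shape)
mask[..., 4:] = 0.0

combined = combine_noise_predictions(eps_pre, eps_ada, mask)
```

In a full sampler this blend would be evaluated once per reverse-diffusion step, which is why the first review points out that the pretrained model's per-step noise predictions must be accessible.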
1. The authors propose a novel method to condition pre-trained video diffusion models on action sequences without access to the pre-trained model's parameters.
2. The authors mathematically highlight the limitations of the adaptation method proposed in "Probabilistic Adaptation of Text-to-Video Models" and this other approach.
3. The authors demonstrate that their adaptation method has better action consistency compared to the other approach, using a new metric that they introduce.
4. The auth
1. In Table 2, the action-conditioned diffusion baseline has a better Action Error Ratio than the proposed approach for all three (small, medium, large) variants. While the authors do note this as a limitation, it needs to be explained and investigated further. If it is better to just train an action-conditioned diffusion model from scratch, why is there a need to adapt pre-trained models?
2. Instead of using the action embedding to just scale and shift the t-th frame feature, have the authors expl
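The scale-and-shift conditioning mentioned in the last review item can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's architecture: the linear projections `w_scale` and `w_shift`, the `(1 + scale)` parameterization, the function name `scale_shift_frames`, and the `(T, C)` feature layout are all hypothetical. The only property taken from the review is that the action embedding for frame t modulates the feature of frame t alone.

```python
import numpy as np

def scale_shift_frames(features, action_emb, w_scale, w_shift):
    """Modulate per-frame features with per-frame action embeddings.

    features:   (T, C) one feature vector per frame
    action_emb: (T, D) one action embedding per frame
    w_scale, w_shift: (D, C) hypothetical linear projections

    Frame t is scaled and shifted using only the action for frame t.
    """
    scale = action_emb @ w_scale  # (T, C)
    shift = action_emb @ w_shift  # (T, C)
    return features * (1.0 + scale) + shift

rng = np.random.default_rng(0)
T, C, D = 5, 16, 7  # frames, feature channels, action embedding size
features = rng.standard_normal((T, C))
actions = rng.standard_normal((T, D))
w_scale = 0.1 * rng.standard_normal((D, C))
w_shift = 0.1 * rng.standard_normal((D, C))

out = scale_shift_frames(features, actions, w_scale, w_shift)
```

With zero projection weights the features pass through unchanged, so under this parameterization the conditioning starts out as an identity map.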
Taxonomy
Topics: Video Analysis and Summarization
Methods: Diffusion · Adapter
