AVID: Adapting Video Diffusion Models to World Models

Marc Rigter; Tarun Gupta; Agrin Hilmkil; Chao Ma

arXiv:2410.12822·cs.CV·November 26, 2024

Open Access · 1 Repo · 3 Reviews

TL;DR

This paper introduces AVID, a method to adapt pretrained, unlabelled video diffusion models into action-conditioned world models for decision-making tasks, especially in robotics, without needing access to the original model parameters.

Contribution

AVID proposes a novel adapter-based approach to fine-tune pretrained video diffusion models for action-conditioned world modeling without access to their parameters.

Findings

1. AVID outperforms existing baselines on video game and robotics data.
2. Pretrained video models can be effectively adapted for embodied AI tasks.
3. AVID enables realistic action-conditioned video generation from unlabelled models.

Abstract

Large-scale generative models have achieved remarkable success in a number of domains. However, for sequential decision-making problems, such as robotics, action-labelled data is often scarce, and therefore scaling up foundation models for decision-making remains a challenge. A potential solution lies in leveraging widely available unlabelled videos to train world models that simulate the consequences of actions. If the world model is accurate, it can be used to optimize decision-making in downstream tasks. Image-to-video diffusion models are already capable of generating highly realistic synthetic videos. However, these models are not action-conditioned, and the most powerful models are closed-source, which means they cannot be finetuned. In this work, we propose to adapt pretrained video diffusion models to action-conditioned world models, without access to the parameters of the…

Peer Reviews

Decision·Submitted to ICLR 2025

Reviewer 01 · Rating 5 · Confidence 4

Strengths

- The paper has a nice motivation: how does one adapt existing foundation models to add action conditioning to them, so as to make them more relevant and useful for embodied robotics applications?
- The paper writing is clear; first the limitations of prior work are built up and then a solution is proposed.

Weaknesses

- The way the paper starts with the motivation near L51-52 is a bit misleading. The paper cannot actually fix the issues in L51-52 because it still assumes access to the internal inference pipeline of these closed-source models: if I understand it correctly, the method needs access to the diffusion model's noise prediction at each of the N reverse diffusion steps that happen at inference. For closed-source models, this information is not available.
- The performance gain in the quantitat
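The reviewer's first point can be made concrete with a toy sampling loop. The sketch below is not the paper's sampler: the function names, the caller-supplied `combine` rule, and the simplified update `x - step_size * eps` are all assumptions made for illustration. What it shows is only the access pattern, namely that the pretrained model's noise prediction is queried once at every one of the N reverse steps, even though its parameters are never touched.

```python
from typing import Callable, List

Vector = List[float]


def sample_with_adapter(
    x_init: Vector,
    num_steps: int,
    pretrained_eps: Callable[[Vector, int], Vector],  # black box: no weights needed
    adapter_eps: Callable[[Vector, int], Vector],     # small trainable adapter
    combine: Callable[[Vector, Vector], Vector],      # hypothetical combination rule
    step_size: float = 0.1,
) -> Vector:
    """Toy reverse loop: one black-box noise query per reverse step."""
    x = list(x_init)
    for t in reversed(range(num_steps)):
        eps_pre = pretrained_eps(x, t)   # intermediate output of the pretrained model
        eps_ada = adapter_eps(x, t)
        eps = combine(eps_pre, eps_ada)
        # Placeholder update, NOT a real DDPM/DDIM step.
        x = [xi - step_size * e for xi, e in zip(x, eps)]
    return x


# Usage: record which timesteps the black box was queried at.
queried_steps: List[int] = []


def toy_pretrained(x: Vector, t: int) -> Vector:
    queried_steps.append(t)
    return [0.0 for _ in x]


def toy_adapter(x: Vector, t: int) -> Vector:
    return [0.5 for _ in x]


def add(a: Vector, b: Vector) -> Vector:
    return [ai + bi for ai, bi in zip(a, b)]


result = sample_with_adapter([1.0, 1.0], 4, toy_pretrained, toy_adapter, add)
```

A service that only returns finished videos would not expose `pretrained_eps`, which is the gap the reviewer is pointing at.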

Reviewer 02 · Rating 6 · Confidence 3

Strengths

1. The paper is well written and easy to follow.
2. The main idea of training a lightweight adapter for action-labeled domains is reasonable. It balances finetuning efficiency and task performance.
3. Baseline comparisons are comprehensive. The authors compare against many alternative baselines to demonstrate the effectiveness of their approach, and provide qualitative visualizations of the quality of the generated videos and the usefulness of the learned masks.

Weaknesses

1. The presentation in Section 3.2 is a little unclear. It is hard to connect the analysis of the limitations of previous work [1] to the motivation for the proposed approach.
2. The novelty is somewhat limited. The main difference from previous work is that the domain-specific adapter outputs an element-wise mask that is used to combine the noise predictions from the pre-trained model and the adapter.
3. The experimental domains are only two datasets within action-conditioned world modeling.

[1] Yang, Mengjiao, et al.
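The mechanism described in weakness 2 can be sketched in a few lines. This is a minimal illustration, assuming a convex blend `mask * pretrained + (1 - mask) * adapter` with mask values in [0, 1]; the exact weighting used in the paper is not given on this page and may differ. The function name and the flat-list representation are hypothetical, and a real implementation would operate on video tensors.

```python
from typing import List

Vector = List[float]


def combine_noise_predictions(
    eps_pretrained: Vector, eps_adapter: Vector, mask: Vector
) -> Vector:
    """Blend two noise predictions element-wise (assumed convex form)."""
    if not (len(eps_pretrained) == len(eps_adapter) == len(mask)):
        raise ValueError("all inputs must have the same length")
    if any(m < 0.0 or m > 1.0 for m in mask):
        raise ValueError("mask values must lie in [0, 1]")
    return [
        m * p + (1.0 - m) * a
        for p, a, m in zip(eps_pretrained, eps_adapter, mask)
    ]


# mask = 1 keeps the pretrained prediction, mask = 0 keeps the adapter's.
hard = combine_noise_predictions([1.0, 2.0], [3.0, 4.0], [1.0, 0.0])
soft = combine_noise_predictions([1.0, 2.0], [3.0, 4.0], [0.5, 0.5])
```

Under this assumed form, the mask lets the adapter take over only in the elements where it chooses to, leaving the pretrained prediction in place elsewhere.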

Reviewer 03 · Rating 6 · Confidence 3

Strengths

1. The authors propose a novel method to condition pre-trained video diffusion models on action sequences without access to the pre-trained model's parameters.
2. The authors mathematically highlight the limitations of the adaptation method proposed in "Probabilistic Adaptation of Text-to-Video Models" and this other approach.
3. The authors demonstrate that their adaptation method has better action consistency compared to the other approach, using a new metric that they introduce.
4. The auth

Weaknesses

1. In Table 2, action-conditioned diffusion has a better Action Error Ratio than the proposed approach for all three (small, medium, large) variants. While the authors do note this as a limitation, it needs to be explained and investigated further. If it is better to just train an action-conditioned diffusion model from scratch, why is there a need to adapt pre-trained models?
2. Instead of using the action embedding to just scale and shift the t-th frame feature, have the authors expl
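For readers unfamiliar with the conditioning scheme that weakness 2 refers to, the following is a minimal sketch of scale-and-shift conditioning on a single frame's feature vector. It assumes the scale and shift have already been produced from the action embedding by some learned projection (not shown), and it uses the common `(1 + scale)` convention so that zero outputs leave the feature unchanged. Both choices, along with the function name, are assumptions and not details confirmed by this page.

```python
from typing import List

Vector = List[float]


def scale_shift(frame_features: Vector, scale: Vector, shift: Vector) -> Vector:
    """Modulate one frame's features with a per-channel scale and shift.

    `scale` and `shift` stand in for the output of a hypothetical learned
    projection of the action taken at that frame.
    """
    if not (len(frame_features) == len(scale) == len(shift)):
        raise ValueError("all inputs must have the same length")
    return [f * (1.0 + s) + b for f, s, b in zip(frame_features, scale, shift)]


# Zero scale and shift act as the identity; non-zero values modulate channels.
unchanged = scale_shift([1.0, 2.0], [0.0, 0.0], [0.0, 0.0])
modulated = scale_shift([1.0, 2.0], [0.0, 1.0], [0.5, 0.0])
```

Because the action only enters through a per-channel affine transform, this is a fairly restrictive form of conditioning, which is presumably why the reviewer asks about alternatives.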

Code & Models

Repositories

microsoft/causica
pytorch · Official

Taxonomy

Topics: Video Analysis and Summarization

Methods: Diffusion · Adapter