VideoAgent: Self-Improving Video Generation
Achint Soni, Sreyas Venkataraman, Abhranil Chandra, Sebastian Fischmeister, Percy Liang, Bo Dai, Sherry Yang

TL;DR
VideoAgent introduces a self-improving framework for robotic video generation that reduces hallucinations and improves task success by refining generated videos through external feedback and environment data collection.
Contribution
It presents a novel self-conditioning consistency method enabling inference-time refinement of generated videos for robotic control tasks.
Findings
Significantly reduces hallucination in generated videos.
Boosts success rate of robotic manipulation tasks.
Effective in refining real-robot videos.
Abstract
Video generation has been used to generate visual plans for controlling robotic systems. Given an image observation and a language instruction, previous work has generated video plans which are then converted to robot controls to be executed. However, a major bottleneck in leveraging video generation for control lies in the quality of the generated videos, which often suffer from hallucinatory content and unrealistic physics, resulting in low task success when control actions are extracted from the generated videos. While scaling up dataset and model size provides a partial solution, integrating external feedback is both natural and essential for grounding video generation in the real world. With this observation, we propose VideoAgent for self-improving generated video plans based on external feedback. Instead of directly executing the generated video plan, VideoAgent first refines the…
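The refine-before-execute idea described above can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (`refine_video_plan`, `toy_generate`, `toy_refine`, `toy_critique`), the `max_rounds` parameter, the example instruction, and the integer "quality scores" standing in for video frames are all assumptions made for illustration. The sketch only shows the control flow of generating a plan, asking an external critic (for example a VLM) for feedback, and refining until the plan is accepted or a round budget runs out.

```python
from typing import Callable, List, Tuple

# Hypothetical stand-in: a "video" is a list of per-frame quality scores (0-10).
Video = List[int]


def refine_video_plan(
    generate: Callable[[str, str], Video],
    refine: Callable[[Video, str, str], Video],
    critique: Callable[[Video, str], Tuple[bool, str]],
    observation: str,
    instruction: str,
    max_rounds: int = 3,
) -> Tuple[Video, int]:
    """Generate a plan, then refine it until the critic accepts it
    or max_rounds refinement steps have been used."""
    video = generate(observation, instruction)
    for round_idx in range(max_rounds):
        accepted, feedback = critique(video, instruction)
        if accepted:
            return video, round_idx
        # Condition the next attempt on the previous video and the feedback.
        video = refine(video, instruction, feedback)
    return video, max_rounds


# Toy components (assumptions, not real models).
def toy_generate(observation: str, instruction: str) -> Video:
    return [2, 4]


def toy_refine(video: Video, instruction: str, feedback: str) -> Video:
    return [min(10, score + 3) for score in video]


def toy_critique(video: Video, instruction: str) -> Tuple[bool, str]:
    accepted = all(score >= 7 for score in video)
    return accepted, "ok" if accepted else "some frames look implausible"


final_video, rounds_used = refine_video_plan(
    toy_generate, toy_refine, toy_critique, "initial frame", "open the drawer"
)
print(final_video, rounds_used)
```

In a real system the accepted plan would then be converted to robot controls; that step is omitted here because the summary does not describe it in detail.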
Peer Reviews
Decision·Submitted to ICLR 2025
1. This paper proposes several novel techniques, such as self-conditioning consistency and incorporating VLM feedback, that improve the quality of the generated video plans through iterative refinement.
2. The proposed method adopts a self-improving loop by finetuning the video models with additional successful trajectories collected online.
3. The authors provide extensive evaluation results as well as experiment details, which back up the efficacy of the proposed method.
4. This paper is well-mot…
1. The loop of collecting successful data through environment interaction and finetuning video models might incur large computational overhead, and the improvement seems to become marginal after two online iterations in Figure 4.
2. [nitpick] I believe it is essential to reduce hallucinations and improve the quality of video plans for better decision-making performance, and I appreciate the overall contribution. However, in the Metaworld example (Table 1) provided in this work, simply replanning…
1. The idea of VideoAgent is novel; I haven’t seen an iterative approach to improving video generation quality before.
2. Obtaining feedback on videos from a VLM is feasible.
1. The writing section can be strengthened. The distinction between the consistency model and DDPM is vague in this paper, yet it is crucial. Given that I believe DDPM + DDIM can achieve the same results (referring to the implementation of video generation models and video refinement models), it’s even more necessary to explain the necessity and motivation for using the consistency model. Specific issues can be referenced in the questions section.
2. The experimental section of the paper is rel…
The paper presents a novel method which uses feedback from VLMs to refine generated video plans. It introduces a self-improving consistency model which predicts the clean video from a generated video and feedback from the VLM. The proposed method can be continuously improved through online fine-tuning. Experiment results indicate that the proposed method effectively enhances video generation and improves performance on downstream manipulation tasks. The paper also includes ablation studies to ex…
1. The paper leverages a self-conditioning consistency loss (Eqn. 7) for video refinement. The first term in Eqn. 7 is a diffusion loss and the second term is for consistency. The reason for including the second term in Eqn. 7 is not very clear. It seems that it encourages generating a consistent $x^{(0)}$ from different $\hat x_{i}$. Is it possible to provide more explanation of why the consistency loss needs to be included? It would be great to include an ablation study which compares the…
Taxonomy
Topics: Video Analysis and Summarization
