Anchored Video Generation: Decoupling Scene Construction and Temporal Synthesis in Text-to-Video Diffusion Models
Mariam Hassan, Bastien Van Delft, Wuyang Li, Alexandre Alahi

TL;DR
This paper introduces Anchored Video Generation, a modular approach that improves text-to-video synthesis by decoupling scene construction and temporal animation, leading to better scene consistency, efficiency, and control.
Contribution
The paper proposes a novel three-stage pipeline that separates reasoning, composition, and temporal synthesis, significantly enhancing video quality and efficiency in text-to-video models.
Findings
Sets a new state of the art on the T2V CompBench benchmark
Improves all tested models on VBench2
Reduces sampling steps by 70% without performance loss
Abstract
State-of-the-art Text-to-Video (T2V) diffusion models can generate visually impressive results, yet they still frequently fail to compose complex scenes or follow logical temporal instructions. In this paper, we argue that many errors, including apparent motion failures, originate from the model's inability to construct a semantically correct or logically consistent initial frame. We introduce Anchored Video Generation (AVG), a modular pipeline that decouples these tasks by decomposing the Text-to-Video generation into three specialized stages: (1) Reasoning, where a Large Language Model (LLM) rewrites the video prompt to describe only the initial scene, resolving temporal ambiguities; (2) Composition, where a Text-to-Image (T2I) model synthesizes a high-quality, compositionally-correct anchor frame from this new prompt; and (3) Temporal Synthesis, where a video model, finetuned to…
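The three stages named in the abstract can be read as a simple function composition. The sketch below is illustrative only and is not the authors' implementation. The class name `AnchoredPipeline`, the three callables, and the toy stand-ins are hypothetical. Because the abstract is cut off before it describes how the video model is conditioned, passing both the anchor frame and the original prompt to the third stage is an assumption.

```python
from dataclasses import dataclass
from typing import Callable, List

Frame = str  # stand-in type; a real frame would be an image array or tensor


@dataclass
class AnchoredPipeline:
    """Hypothetical wiring of the three stages; all names are illustrative."""
    rewrite_prompt: Callable[[str], str]          # stage 1: LLM reasoning
    compose_anchor: Callable[[str], Frame]        # stage 2: T2I composition
    animate: Callable[[Frame, str], List[Frame]]  # stage 3: temporal synthesis

    def generate(self, video_prompt: str) -> List[Frame]:
        # Stage 1: describe only the initial scene of the video.
        scene_prompt = self.rewrite_prompt(video_prompt)
        # Stage 2: synthesize the anchor frame from the rewritten prompt.
        anchor = self.compose_anchor(scene_prompt)
        # Stage 3: animate from the anchor. Conditioning on the original
        # prompt as well is an assumption, not stated in the abstract.
        return self.animate(anchor, video_prompt)


# Toy stand-ins so the sketch runs without any model weights.
def toy_rewrite(prompt: str) -> str:
    return "first frame of: " + prompt


def toy_compose(scene_prompt: str) -> Frame:
    return "<image | " + scene_prompt + ">"


def toy_animate(anchor: Frame, prompt: str, num_frames: int = 4) -> List[Frame]:
    # The anchor is used verbatim as frame 0; later frames are placeholders.
    return [anchor] + [f"<frame {t} | {prompt}>" for t in range(1, num_frames)]


pipeline = AnchoredPipeline(toy_rewrite, toy_compose, toy_animate)
frames = pipeline.generate("a cat jumps onto a table")
```

Keeping each stage behind a plain callable mirrors the modularity the abstract emphasizes: in this sketch, any of the three components could be swapped without touching the other two.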
Taxonomy
Topics: Generative Adversarial Networks and Image Synthesis · Multimodal Machine Learning Applications · Human Motion and Animation
