Anchored Video Generation: Decoupling Scene Construction and Temporal Synthesis in Text-to-Video Diffusion Models

Mariam Hassan; Bastien Van Delft; Wuyang Li; Alexandre Alahi

arXiv:2512.16371 · cs.CV · March 26, 2026

Open Access

TL;DR

This paper introduces Anchored Video Generation, a modular approach that improves text-to-video synthesis by decoupling scene construction and temporal animation, leading to better scene consistency, efficiency, and control.

Contribution

The paper proposes a three-stage pipeline that separates reasoning, composition, and temporal synthesis, improving video quality and sampling efficiency in text-to-video models.

Findings

01. Sets a new state of the art on the T2V CompBench benchmark

02. Improves all tested models on VBench2

03. Reduces sampling steps by 70% without performance loss

Abstract

State-of-the-art Text-to-Video (T2V) diffusion models can generate visually impressive results, yet they still frequently fail to compose complex scenes or follow logical temporal instructions. In this paper, we argue that many errors, including apparent motion failures, originate from the model's inability to construct a semantically correct or logically consistent initial frame. We introduce Anchored Video Generation (AVG), a modular pipeline that decouples these tasks by decomposing the Text-to-Video generation into three specialized stages: (1) Reasoning, where a Large Language Model (LLM) rewrites the video prompt to describe only the initial scene, resolving temporal ambiguities; (2) Composition, where a Text-to-Image (T2I) model synthesizes a high-quality, compositionally-correct anchor frame from this new prompt; and (3) Temporal Synthesis, where a video model, finetuned to…
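The three stages described in the abstract can be pictured as a simple chain of function calls. The sketch below is only an illustration of that decoupling, not the authors' implementation: the class name `AnchoredPipeline`, the three callables, and the string-based `Frame` placeholder are all hypothetical, and the assumption that the video model receives both the anchor frame and the original prompt is ours, since the abstract is truncated before it specifies the conditioning.

```python
from dataclasses import dataclass
from typing import Callable, List

# Hypothetical placeholder: a real system would use image tensors here.
Frame = str


@dataclass
class AnchoredPipeline:
    """Illustrative three-stage chain; every component is a stand-in."""

    rewrite_prompt: Callable[[str], str]          # stage 1: LLM reasoning
    compose_anchor: Callable[[str], Frame]        # stage 2: T2I composition
    animate: Callable[[Frame, str], List[Frame]]  # stage 3: temporal synthesis

    def generate(self, video_prompt: str) -> List[Frame]:
        # Stage 1: describe only the initial scene of the video prompt.
        scene_prompt = self.rewrite_prompt(video_prompt)
        # Stage 2: synthesize the anchor frame from the rewritten prompt.
        anchor = self.compose_anchor(scene_prompt)
        # Stage 3 (assumed conditioning): animate from the anchor frame,
        # guided by the original video prompt.
        return self.animate(anchor, video_prompt)


# Stub components so the sketch runs without any model weights.
def stub_llm(prompt: str) -> str:
    return "initial scene of: " + prompt


def stub_t2i(scene_prompt: str) -> Frame:
    return f"<anchor|{scene_prompt}>"


def stub_video(anchor: Frame, prompt: str) -> List[Frame]:
    # The anchor is kept as the first frame; later frames are dummies.
    return [anchor] + [f"<frame {i}|{prompt}>" for i in range(1, 4)]


pipeline = AnchoredPipeline(stub_llm, stub_t2i, stub_video)
frames = pipeline.generate("a cat jumps onto a table")
```

Because each stage is an independent callable, any one of them can be swapped without touching the others, which is the modularity the abstract emphasizes.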

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Multimodal Machine Learning Applications · Human Motion and Animation