A$^2$RD: Agentic Autoregressive Diffusion for Long Video Consistency
Do Xuan Long, Yale Song, Min-Yen Kan, Tomas Pfister, Long T. Le

TL;DR
A$^2$RD is an agentic autoregressive diffusion framework that improves long video synthesis through self-improvement and consistency mechanisms, outperforming existing methods on new benchmarks.
Contribution
The paper introduces A$^2$RD, an architecture for long video synthesis that decouples creative synthesis from consistency enforcement and improves each segment through a self-improving cycle. It also introduces LVBench-C, a challenging benchmark.
Findings
A$^2$RD outperforms baselines by up to 30% in consistency.
Achieves a 20% improvement in narrative coherence.
Human evaluations show better motion and transition smoothness.
Abstract
Synthesizing consistent and coherent long video remains a fundamental challenge. Existing methods suffer from semantic drift and narrative collapse over long horizons. We present A$^2$RD, an Agentic Auto-Regressive Diffusion architecture that decouples creative synthesis from consistency enforcement. A$^2$RD formulates long video synthesis as a closed-loop process that synthesizes and self-improves video segment-by-segment through a Retrieve--Synthesize--Refine--Update cycle. It comprises three core components: (i) Multimodal Video Memory that tracks video progression across modalities; (ii) Adaptive Segment Generation that switches among generation modes for natural progression and visual consistency; and (iii) Hierarchical Test-Time Self-Improvement that self-improves each segment at frame and video levels to prevent error propagation. We further introduce LVBench-C, a challenging…
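The Retrieve--Synthesize--Refine--Update cycle described in the abstract can be pictured as a simple control loop. The sketch below is only an illustration of that loop's shape, not the authors' implementation: the `Memory` class, the `generate_long_video` function, the `max_refine_steps` parameter, and the string-based toy "segments" are all hypothetical stand-ins, and real segments would come from a video diffusion model rather than string formatting.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class Memory:
    """Hypothetical stand-in for a video memory; stores finished segments."""
    entries: List[str] = field(default_factory=list)

    def retrieve(self, k: int = 2) -> List[str]:
        # Retrieve: return up to k most recent segments as context.
        return self.entries[-k:]

    def update(self, segment: str) -> None:
        # Update: record the accepted segment for later steps.
        self.entries.append(segment)


def generate_long_video(
    prompts: List[str],
    synthesize: Callable[[str, List[str]], str],
    refine: Callable[[str, List[str]], str],
    memory: Optional[Memory] = None,
    max_refine_steps: int = 2,  # assumed cap; not taken from the paper
) -> List[str]:
    """Run one Retrieve -> Synthesize -> Refine -> Update pass per prompt."""
    if memory is None:
        memory = Memory()
    segments: List[str] = []
    for prompt in prompts:
        context = memory.retrieve()
        segment = synthesize(prompt, context)
        # Refine until the segment stops changing or the cap is reached.
        for _ in range(max_refine_steps):
            improved = refine(segment, context)
            if improved == segment:
                break
            segment = improved
        memory.update(segment)
        segments.append(segment)
    return segments


# Toy stand-ins (assumptions) so the loop can be run end to end.
def toy_synthesize(prompt: str, context: List[str]) -> str:
    return f"{prompt}|ctx={len(context)}"


def toy_refine(segment: str, context: List[str]) -> str:
    # Marks a segment as refined once, then leaves it unchanged.
    return segment if segment.endswith("*") else segment + "*"


video = generate_long_video(["a", "b", "c"], toy_synthesize, toy_refine)
```

Because refinement happens before the memory is updated, later segments are conditioned only on already-refined content, which is one plausible reading of how the closed loop limits error propagation.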
