A$^2$RD: Agentic Autoregressive Diffusion for Long Video Consistency

Do Xuan Long; Yale Song; Min-Yen Kan; Tomas Pfister; Long T. Le

arXiv:2605.06924·cs.CV·May 11, 2026


TL;DR

A$^2$RD is an agentic autoregressive diffusion framework that improves long video synthesis through test-time self-improvement and consistency mechanisms, and it outperforms existing methods on new benchmarks.

Contribution

The paper introduces A$^2$RD, an architecture for long video synthesis that decouples creative synthesis from consistency enforcement and self-improves each segment through a Retrieve--Synthesize--Refine--Update cycle. It also introduces LVBench-C, a challenging benchmark.

Findings

01

A$^2$RD outperforms baselines by up to 30% in consistency.

02

A$^2$RD achieves a 20% improvement in narrative coherence.

03

Human evaluations show better motion and transition smoothness.

Abstract

Synthesizing consistent and coherent long video remains a fundamental challenge. Existing methods suffer from semantic drift and narrative collapse over long horizons. We present A$^2$RD, an Agentic Auto-Regressive Diffusion architecture that decouples creative synthesis from consistency enforcement. A$^2$RD formulates long video synthesis as a closed-loop process that synthesizes and self-improves video segment-by-segment through a Retrieve--Synthesize--Refine--Update cycle. It comprises three core components: (i) Multimodal Video Memory that tracks video progression across modalities; (ii) Adaptive Segment Generation that switches among generation modes for natural progression and visual consistency; and (iii) Hierarchical Test-Time Self-Improvement that self-improves each segment at frame and video levels to prevent error propagation. We further introduce LVBench-C, a challenging…
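
As a rough illustration of the closed-loop Retrieve--Synthesize--Refine--Update cycle described in the abstract, the control flow might be sketched as follows. This is a toy sketch under stated assumptions, not the authors' implementation: segments are plain strings, and every name (`VideoMemory`, `generate_long_video`, the `synthesize`, `score`, and `refine` callbacks, `threshold`, `max_refine`) is hypothetical. It also omits the adaptive generation modes and the frame-level versus video-level split that the paper describes.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class VideoMemory:
    """Hypothetical stand-in for the video memory: one entry per accepted segment."""
    entries: List[str] = field(default_factory=list)

    def retrieve(self, k: int = 2) -> List[str]:
        # Retrieve: return the k most recent entries as context.
        return self.entries[-k:]

    def update(self, segment: str) -> None:
        # Update: record the accepted segment for later steps.
        self.entries.append(segment)


def generate_long_video(
    prompts: List[str],
    synthesize: Callable[[str, List[str]], str],
    score: Callable[[str, List[str]], float],
    refine: Callable[[str, List[str]], str],
    threshold: float = 0.5,   # assumed acceptance threshold
    max_refine: int = 3,      # assumed cap on refinement rounds
) -> List[str]:
    memory = VideoMemory()
    video: List[str] = []
    for prompt in prompts:
        context = memory.retrieve()
        # Synthesize: draft the next segment conditioned on retrieved context.
        segment = synthesize(prompt, context)
        # Refine: repair the draft until it passes the check or the cap is hit.
        for _ in range(max_refine):
            if score(segment, context) >= threshold:
                break
            segment = refine(segment, context)
        memory.update(segment)
        video.append(segment)
    return video


# Toy callbacks: a segment counts as "consistent" only if it mentions the fox.
def toy_synthesize(prompt: str, context: List[str]) -> str:
    return prompt

def toy_score(segment: str, context: List[str]) -> float:
    return 1.0 if "fox" in segment else 0.0

def toy_refine(segment: str, context: List[str]) -> str:
    return segment + " (fox)"

video = generate_long_video(
    ["a fox runs", "it jumps a log"], toy_synthesize, toy_score, toy_refine
)
print(video)
```

In this toy run the first segment passes the check unchanged, while the second is refined once before it is written to memory, so a flawed draft is corrected before later segments can condition on it.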
