Streaming Autoregressive Video Generation via Diagonal Distillation
Jinxiu Liu, Xuanming Liu, Kangfu Mei, Yandong Wen, Ming-Hsuan Yang, Weiyang Liu

TL;DR
This paper introduces Diagonal Distillation, a novel method for real-time streaming video generation that improves temporal coherence and reduces latency by leveraging asymmetric step scheduling and optical flow modeling.
Contribution
The paper proposes Diagonal Distillation, which better exploits temporal context and aligns noise prediction during chunk generation, enabling efficient, high-quality video synthesis.
Findings
Generates 5-second videos at 31 FPS in 2.61 seconds.
Achieves 277.3x speedup over undistilled models.
Improves motion coherence and reduces error propagation.
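As a rough sanity check, the figures in the list above can be related to one another with a few lines of arithmetic. This is only a sketch, not a reproduction of the paper's measurements. It assumes that the 31 FPS throughput is averaged over the whole 2.61-second run and that the 277.3x speedup refers to the same 5-second workload. Neither assumption is confirmed by the summary.

```python
# Figures quoted in the Findings list.
video_seconds = 5.0         # length of the generated clip
wall_clock_seconds = 2.61   # reported generation time
throughput_fps = 31.0       # reported generation throughput
speedup = 277.3             # reported speedup over the undistilled model

# Real-time factor: seconds of video produced per second of compute.
real_time_factor = video_seconds / wall_clock_seconds

# Frames produced, ASSUMING 31 FPS is an average over the full run.
frames_generated = throughput_fps * wall_clock_seconds

# Playback frame rate implied by that frame count (same assumption).
implied_playback_fps = frames_generated / video_seconds

# Undistilled generation time, ASSUMING the speedup applies to this workload.
undistilled_seconds = wall_clock_seconds * speedup

print(f"real-time factor:      {real_time_factor:.2f}x")
print(f"frames generated:      {frames_generated:.0f}")
print(f"implied playback rate: {implied_playback_fps:.1f} FPS")
print(f"implied undistilled:   {undistilled_seconds / 60:.1f} min")
```

Under these assumptions the clip is produced at roughly 1.9x real time, corresponds to about 81 frames (about 16 FPS playback), and would take on the order of 12 minutes with the undistilled model.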
Abstract
Large pretrained diffusion models have significantly enhanced the quality of generated videos, and yet their use in real-time streaming remains limited. Autoregressive models offer a natural framework for sequential frame synthesis but require heavy computation to achieve high fidelity. Diffusion distillation can compress these models into efficient few-step variants, but existing video distillation approaches largely adapt image-specific methods that neglect temporal dependencies. These techniques often excel in image generation but underperform in video synthesis, exhibiting reduced motion coherence, error accumulation over long sequences, and a latency-quality trade-off. We identify two factors that result in these limitations: insufficient utilization of temporal context during step reduction and implicit prediction of subsequent noise levels in next-chunk prediction (i.e., exposure…
Peer Reviews
Decision: ICLR 2026 Poster
1. The proposed design is novel and well-motivated. The diagonal denoising idea is conceptually elegant, aligning the temporal and diffusion-step dimensions in a unified framework. This bridges autoregressive conditioning with step-efficient diffusion distillation. 2. This work addresses critical problems in streaming generation: insufficient utilization of temporal context during step reduction and implicit prediction of subsequent noise levels in next-chunk prediction (exposure bias). 3. The
1. Limited novelty in the distillation objective. While the diagonal scheduling and forcing mechanism are new, the underlying distillation objective remains close to prior work, namely Self Forcing with Distribution Matching Distillation. The conceptual leap may be seen as an engineering refinement rather than a fundamentally new learning principle. 2. Some formatting flaws, e.g., the order of Tables 1, 2, and 3. The caption font is too small. The resolution seems to be very different in Figure 7.
* Timely problem: addresses online/streaming latency, not just offline T2V. * Coherent design: diagonal schedule + noisy conditioning + flow loss form a simple, compatible recipe. * Strong practicality: low first-frame latency, high FPS, and straightforward cache reuse. * Empirical support: consistent speedups with minimal quality loss; informative ablations on step allocation and losses. * Clarity: figures and narrative make the training–inference mismatch and diagonal rationale intuitive.
* Longer Videos Test: While 45 seconds is impressive, many streaming use cases (e.g., live streams) require minutes of content. Does error accumulation reemerge for 1–5 minute videos, and if so, can the diagonal strategy be extended (e.g., adaptive step resets)? * Insufficient Analysis of Step Allocation Heuristics: A quantitative comparison of more step sequences (beyond the 6 evaluated) would clarify how step allocation impacts the quality-efficiency frontier. For example, does 5422222 yield b
1. The diagonal allocation of denoising effort across time (many steps early, few later) is a simple but appealing scheduling concept for AR diffusion models, explicitly exploiting temporal priors accumulated early. The Diagonal Forcing mechanism, feeding noised previous-chunk states (rather than clean frames) as the KV cache, targets exposure bias in a way that is tailored to AR diffusion, not borrowed wholesale from image distillation. 2. Adding Flow Distribution Matching to align motion dist
1. Early sections mention a “diagonal attention mechanism operating jointly across time and denoising steps,” but the implementation centers on scheduling (step counts per chunk) plus conditioning with a noised KV cache. It’s unclear whether there is any architectural change to attention patterns (e.g., block-sparse or strided attention over (time × step) axes) beyond cache reuse. If there is special attention, the paper needs explicit architecture diagrams and tensor shapes; if not, the phrase
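The step allocation discussed in the reviews above (more denoising steps for early chunks, fewer for later ones) can be illustrated as a per-chunk step budget. The sketch below is not the authors' code. It assumes that a sequence such as "5422222" encodes one denoising-step count per chunk, and it compares that against a hypothetical uniform 4-step schedule chosen purely for illustration. The helper names are made up for this example.

```python
def parse_schedule(digits: str) -> list[int]:
    """Read a digit string as one denoising-step count per chunk (assumed encoding)."""
    return [int(d) for d in digits]


def is_front_loaded(steps: list[int]) -> bool:
    """True if no chunk uses more steps than the chunk before it."""
    return all(a >= b for a, b in zip(steps, steps[1:]))


def total_steps(steps: list[int]) -> int:
    """Total number of denoising passes across all chunks."""
    return sum(steps)


# Sequence mentioned in the reviews; the per-chunk reading is an assumption.
diagonal = parse_schedule("5422222")

# Hypothetical baseline: the same number of chunks, 4 steps each.
uniform = [4] * len(diagonal)

saving = 1 - total_steps(diagonal) / total_steps(uniform)

print("diagonal schedule:", diagonal, "front-loaded:", is_front_loaded(diagonal))
print("total steps:", total_steps(diagonal), "vs uniform:", total_steps(uniform))
print(f"fewer denoising passes: {saving:.0%}")
```

Under this reading, the front-loaded schedule spends 19 denoising passes over seven chunks against 28 for the uniform baseline. How that trade affects quality is exactly what the reviewers ask the authors to quantify.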
Taxonomy
Topics: Video Coding and Compression Technologies · Image and Video Quality Assessment · Advanced Vision and Imaging
