Playing with Transformer at 30+ FPS via Next-Frame Diffusion
Xinle Cheng, Tianyu He, Jiayi Xu, Junliang Guo, Di He, Jiang Bian

TL;DR
This paper introduces Next-Frame Diffusion, a transformer-based autoregressive video model that generates video in real time at over 30 FPS by combining video-specific consistency distillation with speculative sampling, improving sampling efficiency and visual quality over autoregressive baselines.
Contribution
The work presents an autoregressive diffusion transformer for video that reaches real-time generation through few-step sampling (via consistency distillation) and parallel sampling (via speculative sampling).
Findings
Achieves over 30 FPS video generation on an A100 GPU.
Outperforms autoregressive baselines in visual quality and sampling efficiency.
Introduces video-specific consistency distillation and speculative sampling methods.
Abstract
Autoregressive video models offer distinct advantages over bidirectional diffusion models in creating interactive video content and supporting streaming applications with arbitrary duration. In this work, we present Next-Frame Diffusion (NFD), an autoregressive diffusion transformer that incorporates block-wise causal attention, enabling iterative sampling and efficient inference via parallel token generation within each frame. Nonetheless, achieving real-time video generation remains a significant challenge for such models, primarily due to the high computational cost associated with diffusion sampling and the hardware inefficiencies inherent to autoregressive generation. To address this, we introduce two innovations: (1) We extend consistency distillation to the video domain and adapt it specifically for video models, enabling efficient inference with few sampling steps; (2) To fully…
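The block-wise causal attention mentioned in the abstract can be illustrated with a small mask-construction sketch. This is a minimal illustration, not the authors' implementation: the function name `block_causal_mask` and the parameters `num_frames` and `tokens_per_frame` are hypothetical, and the sketch assumes every frame is flattened into the same number of tokens. Tokens attend to every token in their own frame and in all earlier frames, but never to later frames.

```python
def block_causal_mask(num_frames: int, tokens_per_frame: int) -> list[list[bool]]:
    """Build a block-wise causal mask (hypothetical helper, for illustration only).

    mask[q][k] is True when query token q may attend to key token k, i.e.
    when k belongs to the same frame as q or to an earlier frame.
    """
    total = num_frames * tokens_per_frame
    mask = []
    for q in range(total):
        q_frame = q // tokens_per_frame  # frame index of the query token
        # Allow attention to every key whose frame index is <= the query's frame.
        row = [(k // tokens_per_frame) <= q_frame for k in range(total)]
        mask.append(row)
    return mask


# Example: 3 frames with 2 tokens each ("x" = may attend, "." = masked).
mask = block_causal_mask(num_frames=3, tokens_per_frame=2)
for row in mask:
    print("".join("x" if allowed else "." for allowed in row))
# Prints:
# xx....
# xx....
# xxxx..
# xxxx..
# xxxxxx
# xxxxxx
```

Because the mask is dense within each frame block, all tokens of a frame can be denoised in parallel, while the lower-triangular block structure keeps generation causal across frames.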
Peer Reviews
Decision: ICLR 2026 Conference Withdrawn Submission
- Well written: The paper is clear and concise, communicating its ideas without unnecessary complexity.
- Compelling scalability analysis: The model's performance improves with increased size, indicating potential for further gains with larger models and datasets.
- **Concerns on incremental novelty.** The proposed method appears to integrate existing techniques such as consistency distillation and speculative decoding, which may limit its conceptual novelty and suggest an incremental contribution.
- **Limited data validation and strong assumptions in speculative decoding.** The assumption that actions remain identical during speculative decoding may be overly restrictive; I expect evaluation on more complex datasets to better validate the robustness and
1. Clean combination of block-causal attention with diffusion-based frame-wise prediction, plus an explicit extension of consistency distillation to video. The speculative-sampling idea for action-conditioned video is straightforward but practical.
2. Results table transparently shows the speed–quality trade-off: e.g., NFD (310M) FVD 212 at 6.15 FPS vs NFD+ (310M) FVD 227 at 31.14 FPS.
3. The paper is easy to follow, with an architectural figure and explicit training/sampling equations, plus a p
1. Novelty relative to prior “frame-wise” diffusion variants appears incremental. The work builds on a now-active line of next-frame diffusion with parallel per-frame token generation and noise-perturbed contexts (e.g., Diffusion Forcing, CausVid/self-forcing-style distillations). While the paper positions its sCM extension and speculative sampling as first, readers may see these as natural adaptations rather than conceptual leaps. A head-to-head with Diffusion Forcing and CausVid under identica
- The paper is **well-structured** and **clearly written**, making the main ideas easy to follow.
- The **methodology and motivation** are clearly articulated.
- The **experimental comparisons** are straightforward and provide solid empirical evidence for the proposed improvements.
- The **novelty** of the contributions is limited. Both *consistency distillation* and *speculative sampling* are existing ideas that have been explored in prior works.
- The paper primarily adapts these known techniques to the *next-frame diffusion* setting, resulting in a contribution that feels **incremental rather than groundbreaking**.
- The paper would benefit from a deeper analysis or theoretical insight to strengthen its originality.
1. **Targeted Engineering for Acceleration:** The paper presents specific engineering designs aimed at accelerating video diffusion models, which is a significant contribution to the field.
2. **State-of-the-Art Performance:** The proposed model achieves state-of-the-art visual quality and sampling efficiency when compared to baseline methods, as demonstrated by the quantitative results.
1. **Lack of Qualitative Comparison:** Despite achieving strong numerical results, the paper lacks a qualitative comparison with baseline methods, such as a user study. Visual examples are crucial for a comprehensive evaluation of generative models.
2. **Hybrid Contribution:** The paper's contributions appear to be a mix of engineering and algorithmic improvements. However, the engineering enhancements are not explored to their full potential, and the algorithmic advancements are largely ad
Taxonomy
Topics: Integrated Circuits and Semiconductor Failure Analysis · Surface and Thin Film Phenomena
