Autoregressive Adversarial Post-Training for Real-Time Interactive Video Generation
Shanchuan Lin, Ceyuan Yang, Hao He, Jianwen Jiang, Yuxi Ren, Xin Xia, Yang Zhao, Xuefeng Xiao, Lu Jiang

TL;DR
This paper introduces a novel autoregressive adversarial post-training method that transforms large-scale video diffusion models into efficient, real-time, interactive video generators capable of streaming high-resolution videos at 24fps.
Contribution
The paper presents a post-training paradigm that combines autoregressive generation with adversarial training to turn a pre-trained video diffusion model into a real-time, interactive video generator.
Findings
Achieves 24fps streaming at 736x416 resolution on a single H100 GPU.
Supports up to 1440 frames (one minute at 24fps) at 1280x720 resolution on 8 H100 GPUs.
Reduces error accumulation in long video generation through student-forcing training.
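The frame counts and frame rate above imply a simple timing budget. The following is a minimal sketch of that arithmetic only; the function names are hypothetical, and the per-frame figure is an average over displayed frames, not a measured latency from the paper.

```python
# Arithmetic implied by the reported numbers (24 fps, 1440 frames).
# Function names are illustrative, not from the paper.
FPS = 24
MAX_FRAMES = 1440


def duration_seconds(frames: int, fps: int = FPS) -> float:
    """Length of a clip with `frames` frames played back at `fps`."""
    return frames / fps


def per_frame_budget_ms(fps: int = FPS) -> float:
    """Average wall-clock time available per displayed frame for real-time playback."""
    return 1000.0 / fps


print(duration_seconds(MAX_FRAMES))       # 60.0
print(round(per_frame_budget_ms(), 2))    # 41.67
```

To keep up with 24fps playback, generation must therefore average under roughly 42 ms per displayed frame, which is why a single network evaluation per step matters.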
Abstract
Existing large-scale video generation models are computationally intensive, preventing adoption in real-time and interactive applications. In this work, we propose autoregressive adversarial post-training (AAPT) to transform a pre-trained latent video diffusion model into a real-time, interactive video generator. Our model autoregressively generates a latent frame at a time using a single neural function evaluation (1NFE). The model can stream the result to the user in real time and receive interactive responses as controls to generate the next latent frame. Unlike existing approaches, our method explores adversarial training as an effective paradigm for autoregressive generation. This not only allows us to design an architecture that is more efficient for one-step generation while fully utilizing the KV cache, but also enables training the model in a student-forcing manner that proves…
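The generation loop the abstract describes (one latent frame per step, one network call per frame, a cache of past context, and a user control fed in at each step) can be sketched as follows. This is a toy illustration under stated assumptions, not the authors' implementation: `toy_generator`, `ToyCache`, and `stream` are hypothetical names, frames are plain lists of floats, and the "cache" stores one scalar per past frame as a stand-in for a transformer KV cache.

```python
from typing import Iterable, Iterator, List

Frame = List[float]  # stand-in for a latent frame (assumption for illustration)


class ToyCache:
    """Stand-in for a KV cache: stores one summary per past frame, computed once."""

    def __init__(self) -> None:
        self.entries: List[float] = []

    def append(self, frame: Frame) -> None:
        self.entries.append(sum(frame) / len(frame))

    def context(self) -> float:
        return sum(self.entries) / len(self.entries)


def toy_generator(prev_frame: Frame, control: float, cache: ToyCache) -> Frame:
    """Hypothetical one-step generator: a single call yields the next frame (1NFE)."""
    ctx = cache.context()
    return [0.5 * x + control + 0.1 * ctx for x in prev_frame]


def stream(first_frame: Frame, controls: Iterable[float]) -> Iterator[Frame]:
    """Yield frames one at a time so each can be shown before the next control arrives."""
    cache = ToyCache()
    frame = list(first_frame)
    for control in controls:
        cache.append(frame)                            # cache the frame just consumed
        frame = toy_generator(frame, control, cache)   # one call per frame
        yield frame                                    # the model's own output is the next input


for f in stream([1.0, 1.0], [0.0, 1.0]):
    print(f)
```

Note that each step consumes the frame the generator itself produced. Student-forcing training exposes the model to this same input pattern during training, in contrast to teacher forcing, which feeds ground-truth frames.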
Taxonomy
Topics: Generative Adversarial Networks and Image Synthesis · Image Enhancement Techniques · Human Pose and Action Recognition
