QVGen: Pushing the Limit of Quantized Video Generative Models
Yushi Huang, Ruihao Gong, Jing Liu, Yifu Ding, Chengtao Lv, Haotong Qin, Jun Zhang

TL;DR
QVGen introduces a quantization-aware training framework for video diffusion models, enabling high-quality, low-bit inference with reduced computational costs, outperforming existing methods in efficiency and quality.
Contribution
The paper presents a novel QAT framework with auxiliary modules and a rank-decay strategy, achieving near full-precision quality at 4-bit quantization for video DMs.
Findings
Achieves quality comparable to full precision at 4-bit quantization.
Outperforms existing quantization methods on multiple SOTA video DMs.
Significant improvements in Dynamic Degree and Scene Consistency metrics.
Abstract
Video diffusion models (DMs) have enabled high-quality video synthesis. Yet, their substantial computational and memory demands pose serious challenges to real-world deployment, even on high-end GPUs. As a commonly adopted solution, quantization has achieved notable success in reducing cost for image DMs, while its direct application to video DMs remains ineffective. In this paper, we present QVGen, a novel quantization-aware training (QAT) framework tailored for high-performance and inference-efficient video DMs under extremely low-bit quantization (e.g., 4-bit or below). We begin with a theoretical analysis demonstrating that reducing the gradient norm is essential to facilitate convergence for QAT. To this end, we introduce auxiliary modules ($\Phi$) to mitigate large quantization errors, leading to significantly enhanced convergence. To eliminate the inference overhead of $\Phi$, we…
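The abstract does not say how the auxiliary modules are constructed, so the following is only a minimal sketch under stated assumptions: $\Phi$ is modeled as an extra full-precision linear path whose weight equals the quantization residual $W - Q(W)$, and $Q$ is a symmetric per-tensor 4-bit quantizer. The names `fake_quantize`, `W`, and `Phi` are hypothetical and are not taken from the paper.

```python
import numpy as np

def fake_quantize(w: np.ndarray, bits: int = 4) -> np.ndarray:
    """Symmetric per-tensor uniform quantization followed by dequantization."""
    qmax = 2 ** (bits - 1) - 1              # 7 for signed 4-bit
    scale = np.abs(w).max() / qmax          # one scale shared by the whole tensor
    q = np.clip(np.round(w / scale), -qmax - 1, qmax)
    return q * scale

rng = np.random.default_rng(0)
W = rng.normal(size=(64, 64))               # stand-in full-precision weight
x = rng.normal(size=(8, 64))                # stand-in activations

W_q = fake_quantize(W, bits=4)
Phi = W - W_q                               # ASSUMPTION: auxiliary weight = residual

y_fp = x @ W.T                              # full-precision output
y_q = x @ W_q.T                             # 4-bit output
y_aux = y_q + x @ Phi.T                     # quantized path plus auxiliary path

err_q = np.linalg.norm(y_fp - y_q) / np.linalg.norm(y_fp)
err_aux = np.linalg.norm(y_fp - y_aux) / np.linalg.norm(y_fp)
print(f"relative output error, 4-bit only: {err_q:.3f}")
print(f"relative output error, with Phi:   {err_aux:.2e}")
```

Under these assumptions the auxiliary path absorbs the quantization error, but it also adds a full-precision matrix multiply at inference time, which is the overhead the abstract says still has to be eliminated.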
Peer Reviews
Decision: ICLR 2026 Poster
The work addresses the challenging and critical task of efficient, high-fidelity video generation under ultra-low-bit quantization, an area with clear importance for practical deployment. Provides a regret-based convergence analysis (see Theorem 3.1, Page 4) linking gradient norm to QAT performance, justifying the introduction of $\Phi$. The auxiliary module ($\Phi$) is elegantly conceived and is integrated with a flexible, theoretically justified rank-decay scheme, allowing benefits during tr
While the use of singular value decomposition is effective (Fig. 4, Section 3.2), the alternative strategies (Sparse, Residual Quantization) examined in Table 6 are somewhat strawman/naive and do not fully explore more sophisticated structured pruning or adaptive fading that could yield competitive trade-offs. There is little discussion of possible pathological cases where the SVD approach might fail, particularly if the singular spectrum decays slowly. The key result (Theorem 3.1, Page 4) relies o
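The concern about slowly decaying singular spectra can be illustrated with a small numerical sketch. It is not the paper's rank-decay procedure: it only compares how much of a matrix is lost when the top `rank` SVD components are kept, for two synthetic spectra chosen here as assumptions (`fast` and `slow`). The helper names are hypothetical.

```python
import numpy as np

def truncation_error(singular_values: np.ndarray, rank: int) -> float:
    """Relative Frobenius error of the best rank-`rank` approximation."""
    tail = np.sum(singular_values[rank:] ** 2)
    total = np.sum(singular_values ** 2)
    return float(np.sqrt(tail / total))

def matrix_with_spectrum(s: np.ndarray, rng: np.random.Generator) -> np.ndarray:
    """Build a square matrix whose singular values are exactly `s`."""
    n = len(s)
    u, _ = np.linalg.qr(rng.normal(size=(n, n)))   # random orthogonal factor
    v, _ = np.linalg.qr(rng.normal(size=(n, n)))   # random orthogonal factor
    return u @ np.diag(s) @ v.T

rng = np.random.default_rng(0)
n, rank = 64, 8
fast = np.exp(-0.5 * np.arange(n))                 # rapidly decaying spectrum
slow = 1.0 / np.sqrt(1.0 + np.arange(n))           # slowly decaying spectrum

errors = {}
for name, s in {"fast": fast, "slow": slow}.items():
    m = matrix_with_spectrum(s, rng)
    u, sv, vt = np.linalg.svd(m, full_matrices=False)
    m_r = (u[:, :rank] * sv[:rank]) @ vt[:rank]    # keep the top-`rank` components
    errors[name] = np.linalg.norm(m - m_r) / np.linalg.norm(m)
    # Eckart-Young: the measured error matches the discarded-tail formula
    assert np.isclose(errors[name], truncation_error(sv, rank))
    print(f"{name} decay: rank-{rank} relative error = {errors[name]:.3f}")
```

With a fast-decaying spectrum almost nothing is lost at low rank, whereas a slow-decaying spectrum leaves a large share of the energy in the discarded components, which is the pathological case the reviewer asks about.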
1. The work could be highly impactful for the community of quantized video generation models due to its state-of-the-art performance. 2. The analysis of the importance of reducing the gradient norm is valid and motivating for the proposed method. 3. Although the method first introduces full-precision parameters, the authors devise effective solutions to reduce the rank to even 0, which means eliminating the need for additional full-precision storage. From the results, such a two-stage pipeline i
I did not find many weaknesses, but would like to list some minor points below: 1. It seems that the method is not tailored for video diffusion models and has potential for other models, like image generation and image backbone. The authors are encouraged to conduct experiments on these widely adopted benchmarks. 2. It is encouraged to include another baseline of fine-tuning the model using the same data under full precision, which is useful to reflect the effect introduced by additional data a
1. The first full QAT method for video generation models I'm aware of. 2. Experiments are conducted on four SOTA open-source video DMs (CogVideoX and Wan), with parameter scales from 1.3B to 14B, providing broad coverage. 3. It validates practical efficiency gains and demonstrates orthogonality with other acceleration techniques like SVG. 4. The provided experimental materials are comprehensive, and the ablation studies are extensive.
1. Since some other QAT methods are trained using only LoRA, a comparison of training time and memory (GPU VRAM) requirements against these methods should be provided for a comprehensive assessment of algorithm efficiency. 2. Quantization-related initialization settings should be specified, such as the choice of quantizer (e.g., granularity, symmetric/asymmetric) and which layers, if any, are not quantized. 3. In Fig. 3, why is the initial training loss of the proposed method better than that of Q-DM?
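To make the quantizer choices raised in point 2 concrete, here is a hedged, standard-library sketch of the two axes the reviewer names: symmetric versus asymmetric grids, and per-tensor versus per-channel granularity. It says nothing about which settings QVGen uses. All function names and data values are hypothetical, and the asymmetric variant keeps a floating-point offset instead of an integer zero-point for brevity.

```python
from typing import List

def quantize_symmetric(values: List[float], bits: int = 4) -> List[float]:
    """Symmetric uniform quantizer: signed integer grid centered on zero."""
    qmax = 2 ** (bits - 1) - 1
    scale = max(abs(v) for v in values) / qmax
    if scale == 0.0:
        scale = 1.0                          # all-zero input: any scale works
    return [max(-qmax - 1, min(qmax, round(v / scale))) * scale for v in values]

def quantize_asymmetric(values: List[float], bits: int = 4) -> List[float]:
    """Asymmetric (affine) uniform quantizer: grid spans [min, max]."""
    levels = 2 ** bits - 1
    lo, hi = min(values), max(values)
    scale = (hi - lo) / levels
    if scale == 0.0:
        scale = 1.0                          # constant input: any scale works
    return [min(levels, max(0, round((v - lo) / scale))) * scale + lo for v in values]

def mse(a: List[float], b: List[float]) -> float:
    """Mean squared error between two equal-length lists."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

# Granularity: two output channels with very different magnitudes.
rows = [[0.01, -0.02, 0.015, 0.005], [1.0, -2.0, 1.5, 0.5]]
flat = [v for row in rows for v in row]
err_per_tensor = mse(flat, quantize_symmetric(flat))
err_per_channel = mse(flat, [q for row in rows for q in quantize_symmetric(row)])

# Symmetry: non-negative data wastes half of a symmetric grid.
skewed = [0.0, 0.12, 0.47, 0.88, 1.0, 0.31]
err_sym = mse(skewed, quantize_symmetric(skewed))
err_asym = mse(skewed, quantize_asymmetric(skewed))

print(f"per-tensor MSE {err_per_tensor:.2e} vs per-channel MSE {err_per_channel:.2e}")
print(f"symmetric MSE  {err_sym:.2e} vs asymmetric MSE  {err_asym:.2e}")
```

On this toy data, a single per-tensor scale rounds the small-magnitude channel to zero, and a symmetric grid spends half of its levels on negative values that never occur, which is why these settings matter when comparing 4-bit results across methods.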
Taxonomy
Topics: Generative Adversarial Networks and Image Synthesis
Methods: Diffusion
