DVD-Quant: Data-free Video Diffusion Transformers Quantization
Zhiteng Li, Hanxuan Li, Junyi Wu, Kai Liu, Haotong Qin, Linghe Kong, Guihai Chen, Yulun Zhang, Xiaokang Yang

TL;DR
DVD-Quant is a data-free post-training quantization method for Video Diffusion Transformers. It combines calibration-free error reduction (BGR and ARQ) with adaptive bit-width allocation (δ-GBS) to speed up inference while preserving video quality.
Contribution
It presents the first data-free post-training quantization framework for Video DiTs, overcoming calibration and performance issues of prior methods.
Findings
Achieves approximately 2× speedup over full-precision models.
Enables W4A4 PTQ for Video DiTs without quality loss.
Maintains high visual fidelity after quantization.
Abstract
Diffusion Transformers (DiTs) have emerged as the state-of-the-art architecture for video generation, yet their computational and memory demands hinder practical deployment. While post-training quantization (PTQ) presents a promising approach to accelerate Video DiT models, existing methods suffer from two critical limitations: (1) dependence on computation-heavy and inflexible calibration procedures, and (2) considerable performance deterioration after quantization. To address these challenges, we propose DVD-Quant, a novel Data-free quantization framework for Video DiTs. Our approach integrates three key innovations: (1) Bounded-init Grid Refinement (BGR) and (2) Auto-scaling Rotated Quantization (ARQ) for calibration data-free quantization error reduction, as well as (3) δ-Guided Bit Switching (δ-GBS) for adaptive bit-width allocation. Extensive experiments across…
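As a rough illustration of why rotation can reduce low-bit quantization error, the sketch below applies symmetric 4-bit fake quantization to activations with one outlier channel, with and without an orthonormal Hadamard rotation. It shows only the generic rotation idea and is not the authors' ARQ or BGR. The helper names `hadamard` and `fake_quant`, the per-token scaling, the tensor shapes, and the outlier magnitude are all assumptions made for this example.

```python
import numpy as np

def hadamard(n):
    """Orthonormal Hadamard matrix via the Sylvester construction (n must be a power of 2)."""
    H = np.array([[1.0]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H / np.sqrt(n)

def fake_quant(x, bits=4):
    """Symmetric uniform quantize-dequantize with one scale per row (hypothetical per-token scaling)."""
    qmax = 2 ** (bits - 1) - 1                        # 7 for signed 4-bit
    scale = np.abs(x).max(axis=-1, keepdims=True) / qmax
    return np.clip(np.round(x / scale), -qmax - 1, qmax) * scale

rng = np.random.default_rng(0)
X = rng.standard_normal((16, 64))    # activations: 16 tokens, 64 channels (assumed sizes)
W = rng.standard_normal((32, 64))    # weights: 32 output channels (assumed sizes)
X[:, 3] *= 50.0                      # inject a single outlier channel

H = hadamard(64)

# Rotating both operands leaves the full-precision product unchanged,
# because (X H)(W H)^T = X H H^T W^T = X W^T for orthonormal H.
assert np.allclose((X @ H) @ (W @ H).T, X @ W.T)

# Quantization error without rotation vs. with rotation (mapped back by H^T).
err_plain = np.linalg.norm(X - fake_quant(X))
err_rot = np.linalg.norm(X - fake_quant(X @ H) @ H.T)
print(f"plain: {err_plain:.2f}, rotated: {err_rot:.2f}")
```

In this toy setting the outlier inflates the per-token scale, so most other channels round to zero. The rotation spreads the outlier's energy across all channels, which lets the 4-bit grid cover the values more evenly.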
Peer Reviews
Decision: ICLR 2026 Poster
- A systematic analysis reveals three key insights that motivate the proposed quantization solutions. These findings are valuable for future research. - Strong low-bit performance is achieved without retraining. - Broad applicability to video DiTs: Designed around Video DiT characteristics (e.g., temporal variations), making it more suitable than generic PTQ baselines for large-scale video generation models. - Modular and complementary components: Each module targets a distinct bottleneck (wei
Although a better accuracy–deployability trade-off is achieved, I still have the following concerns: - Generalization beyond Video DiTs (Hunyuan): The framework leverages timestep dynamics typical of Video DiTs. It is unclear how well it transfers to other generative models, such as Wanx or even image generators. - Effect of various details: Choices like rotation block size, scaling granularity (per-channel vs per-tensor), and quantization granularity (weight group size) can materially affect outcomes; the method m
1. The motivation behind the proposed BGR method is clear and intuitively appealing. 2. The paper is well written, with clear explanations and easy-to-follow reasoning.
1. While rotation techniques have been extensively applied in LLM quantization, the paper only discusses QuaRot and lacks a broader comparison or discussion with several relevant works such as **DuQuant**, **RoSTE**, and **ResQ**. 2. The proposed auto-scaling rotation mechanism appears similar to the approach adopted in DuQuant, which combines SmoothQuant with rotation—this overlap should be clarified. 3. The paper lacks comparisons with several state-of-the-art baselines, such as **SVDQuant**.
1. Successfully enabling W4A4 PTQ for complex Video DiT models is a notable achievement, as this extreme quantization level typically causes existing methods to fail completely. 2. Comprehensive and Synergistic Approach: The three proposed components (BGR, ARQ, δ-GBS) address distinct and well-motivated challenges (weight distribution, activation outliers, temporal redundancy) and are shown to work effectively together. 3. Strong Empirical Validation: The paper includes extensive experiments o
1. Hyperparameter Sensitivity: The performance of the adaptive δ-GBS mechanism depends on a threshold δ. While mentioned, the paper does not deeply explore the sensitivity of the results to this value or provide a robust method for selecting it across different models or tasks. 2. Limited Model Scope: While tested on HunyuanVideo and briefly on Wan2.1, it's unclear how generalizable the method is to the wider family of DiT-based models (e.g., Latte, Sora's architecture) or other diffusion tasks
Taxonomy
Topics: Image and Signal Denoising Methods · Advanced Image Processing Techniques · Computer Graphics and Visualization Techniques
