DVD-Quant: Data-free Video Diffusion Transformers Quantization

Zhiteng Li; Hanxuan Li; Junyi Wu; Kai Liu; Haotong Qin; Linghe Kong; Guihai Chen; Yulun Zhang; Xiaokang Yang

arXiv:2505.18663·cs.CV·March 9, 2026

TL;DR

DVD-Quant is a data-free post-training quantization method for Video Diffusion Transformers. It reduces quantization error without calibration data and allocates bit-widths adaptively, giving roughly 2× speedup over full-precision models while preserving visual quality.

Contribution

It presents the first data-free post-training quantization framework for Video DiTs, overcoming calibration and performance issues of prior methods.

Findings

01. Achieves approximately 2× speedup over full-precision models.

02. Enables W4A4 PTQ for Video DiTs without quality loss.

03. Maintains high visual fidelity after quantization.

Abstract

Diffusion Transformers (DiTs) have emerged as the state-of-the-art architecture for video generation, yet their computational and memory demands hinder practical deployment. While post-training quantization (PTQ) presents a promising approach to accelerate Video DiT models, existing methods suffer from two critical limitations: (1) dependence on computation-heavy and inflexible calibration procedures, and (2) considerable performance deterioration after quantization. To address these challenges, we propose DVD-Quant, a novel Data-free quantization framework for Video DiTs. Our approach integrates three key innovations: (1) Bounded-init Grid Refinement (BGR) and (2) Auto-scaling Rotated Quantization (ARQ) for calibration data-free quantization error reduction, as well as (3) $δ$-Guided Bit Switching ($δ$-GBS) for adaptive bit-width allocation. Extensive experiments across…
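
As a rough illustration of why rotation can help low-bit quantization in general, the sketch below applies a plain orthonormal Hadamard rotation before symmetric 4-bit round-to-nearest quantization. This is in the spirit of generic rotation-based methods, not the paper's ARQ, whose auto-scaling step is not described on this page. The toy vector, the helper names (`hadamard`, `matvec`, `fake_quant`, `mse`), and the per-tensor scale are all assumptions made for the example.

```python
import math

def hadamard(n):
    """Orthonormal n x n Hadamard matrix via Sylvester construction (n = power of two)."""
    h = [[1.0]]
    while len(h) < n:
        h = [row + row for row in h] + [row + [-v for v in row] for row in h]
    s = 1.0 / math.sqrt(n)
    return [[v * s for v in row] for row in h]

def matvec(m, x):
    return [sum(a * b for a, b in zip(row, x)) for row in m]

def fake_quant(x, bits=4):
    """Symmetric per-tensor round-to-nearest quantization followed by dequantization."""
    qmax = 2 ** (bits - 1) - 1                    # 7 for 4-bit
    scale = max(abs(v) for v in x) / qmax
    if scale == 0.0:
        return list(x)
    return [max(-qmax - 1, min(qmax, round(v / scale))) * scale for v in x]

def mse(a, b):
    return sum((p - q) ** 2 for p, q in zip(a, b)) / len(a)

# Hypothetical activation vector with one large outlier channel.
x = [0.5, -0.5, 0.5, 8.0]
H = hadamard(len(x))

# Direct 4-bit quantization: the outlier sets the scale, small entries round to zero.
err_plain = mse(x, fake_quant(x))

# Rotate, quantize, rotate back. The Sylvester Hadamard matrix is symmetric and
# orthonormal, so it is its own inverse.
x_rot = matvec(H, x)
err_rot = mse(x, matvec(H, fake_quant(x_rot)))

print(f"MSE without rotation: {err_plain:.4f}")
print(f"MSE with rotation:    {err_rot:.4f}")
```

On this toy input the rotation spreads the outlier across all coordinates, so the quantization step shrinks and the small entries are no longer rounded to zero. Real activations will not always behave this cleanly.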

Peer Reviews

Decision·ICLR 2026 Poster

Reviewer 01 · Rating 6 · Confidence 2

Strengths

- A systematic analysis reveals three key insights that motivate the proposed quantization solutions. These findings are valuable for future research.
- Strong low-bit performance is achieved without retraining.
- Broad applicability to video DiTs: Designed around Video DiT characteristics (e.g., temporal variations), making it more suitable than generic PTQ baselines for large-scale video generation models.
- Modular and complementary components: Each module targets a distinct bottleneck (wei

Weaknesses

Although a better accuracy–deployability trade-off is achieved, I still have the following concerns:
- Generalization beyond Video DiTs (Hunyuan): The framework leverages timestep dynamics typical of Video DiTs. It is unclear how well it transfers to other generative models, such as Wanx, or even to image generators.
- Effect of various details: Choices like rotation block size, scaling granularity (per-channel vs per-tensor), and quantization granularity (weight group size) can materially affect outcomes; the method m
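
To make the scaling-granularity point above concrete, here is a minimal sketch comparing per-tensor and per-channel (per-row) scales under symmetric 4-bit round-to-nearest quantization. It is a generic illustration, not the paper's configuration; the weight matrix and the helper names `fake_quant_rows` and `mse_rows` are hypothetical.

```python
def fake_quant_rows(w, bits=4, per_channel=True):
    """Quantize-dequantize a matrix with one scale per row or one shared scale."""
    qmax = 2 ** (bits - 1) - 1
    if per_channel:
        scales = [max(abs(v) for v in row) / qmax for row in w]
    else:
        shared = max(abs(v) for row in w for v in row) / qmax
        scales = [shared] * len(w)
    out = []
    for row, s in zip(w, scales):
        if s == 0.0:
            out.append([0.0] * len(row))
        else:
            out.append([max(-qmax - 1, min(qmax, round(v / s))) * s for v in row])
    return out

def mse_rows(a, b):
    diffs = [(p - q) ** 2 for ra, rb in zip(a, b) for p, q in zip(ra, rb)]
    return sum(diffs) / len(diffs)

# Hypothetical weights: one small-magnitude row and one large-magnitude row.
w = [[0.02, -0.03, 0.01],
     [2.0, -3.5, 1.0]]

per_tensor = fake_quant_rows(w, per_channel=False)
per_channel = fake_quant_rows(w, per_channel=True)
err_tensor = mse_rows(w, per_tensor)
err_channel = mse_rows(w, per_channel)

print(f"per-tensor MSE:  {err_tensor:.2e}")
print(f"per-channel MSE: {err_channel:.2e}")
```

With a shared scale, the large row dictates the step size and the small row collapses to zero; a per-row scale avoids this at the cost of storing one scale per channel.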

Reviewer 02 · Rating 4 · Confidence 5

Strengths

1. The motivation behind the proposed BGR method is clear and intuitively appealing.
2. The paper is well written, with clear explanations and easy-to-follow reasoning.

Weaknesses

1. While rotation techniques have been extensively applied in LLM quantization, the paper only discusses QuaRot and lacks a broader comparison or discussion with several relevant works such as **DuQuant**, **RoSTE**, and **ResQ**.
2. The proposed auto-scaling rotation mechanism appears similar to the approach adopted in DuQuant, which combines SmoothQuant with rotation—this overlap should be clarified.
3. The paper lacks comparisons with several state-of-the-art baselines, such as **SVDQuant**.

Reviewer 03 · Rating 6 · Confidence 2

Strengths

1. Successfully enabling W4A4 PTQ for complex Video DiT models is a notable achievement, as this extreme quantization level typically causes existing methods to fail completely.
2. Comprehensive and Synergistic Approach: The three proposed components (BGR, ARQ, δ-GBS) address distinct and well-motivated challenges (weight distribution, activation outliers, temporal redundancy) and are shown to work effectively together.
3. Strong Empirical Validation: The paper includes extensive experiments o

Weaknesses

1. Hyperparameter Sensitivity: The performance of the adaptive δ-GBS mechanism depends on a threshold δ. While mentioned, the paper does not deeply explore the sensitivity of the results to this value or provide a robust method for selecting it across different models or tasks.
2. Limited Model Scope: While tested on HunyuanVideo and briefly on Wan2.1, it's unclear how generalizable the method is to the wider family of DiT-based models (e.g., Latte, Sora's architecture) or other diffusion tasks
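
The threshold-sensitivity concern can be pictured with a small sketch. This page does not give the exact δ-GBS criterion, so the rule below is purely an assumption: switch to a higher bit-width when the relative change of a feature between consecutive timesteps exceeds δ. The names `relative_change` and `choose_bits`, the 4/8-bit choice, and the toy features are all hypothetical.

```python
def relative_change(prev, curr):
    """L1 change between two feature vectors, normalized by the L1 norm of the first."""
    num = sum(abs(c - p) for p, c in zip(prev, curr))
    den = sum(abs(p) for p in prev) or 1.0
    return num / den

def choose_bits(features, delta, low_bits=4, high_bits=8):
    """Assumed rule: use high_bits when the feature changed by more than delta."""
    bits = [high_bits]  # first step has no predecessor; keep it at high precision
    for prev, curr in zip(features, features[1:]):
        changed = relative_change(prev, curr) > delta
        bits.append(high_bits if changed else low_bits)
    return bits

# Hypothetical per-timestep features: small drift, one large jump, small drift.
features = [[1.0, 2.0], [1.1, 2.1], [2.0, 3.5], [2.0, 3.6]]

for delta in (0.05, 0.1, 1.0):
    print(delta, choose_bits(features, delta))
```

Even in this toy setting, the share of high-precision steps moves from three of four to one of four as δ grows, which is the kind of sensitivity the reviewer asks the authors to characterize.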

Code & Models

Repositories

lhxcs/dvd-quant · Official

Taxonomy

Topics: Image and Signal Denoising Methods · Advanced Image Processing Techniques · Computer Graphics and Visualization Techniques