DLLMQuant: Quantizing Diffusion-based Large Language Models
Chen Xu, Dawei Yang

TL;DR
This paper introduces DLLMQuant, a post-training quantization framework tailored to diffusion-based large language models (DLLMs). It targets the quantization challenges specific to DLLMs to improve efficiency with minimal accuracy loss.
Contribution
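DLLMQuant presents three novel techniques that tailor quantization to DLLMs' dynamic masking, iterative generation, and bidirectional attention mechanisms: Temporal-Mask Adaptive Sampling (TMAS), Interaction-Aware Activation Quantization (IA-AQ), and Certainty-Guided Quantization (CGQ).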
Findings
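Retains accuracy after quantization, where directly applied PTQ degrades severely (e.g., AWQ suffers a 16% accuracy drop on LLADA under W4A4).
Improves the efficiency of DLLMs with minimal performance loss.
Addresses quantization challenges specific to DLLMs, including token distributions that shift across decoding steps and quantization errors that accumulate during iterative generation.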
Abstract
Diffusion-based large language models (DLLMs) have shown promise for non-autoregressive text generation, but their deployment is constrained by large model sizes and heavy computational costs. Post-training quantization (PTQ), a widely used method for compressing and accelerating Large Language Models (LLMs), suffers from severe accuracy degradation and reduced generalization performance when directly applied to DLLMs (e.g., AWQ suffers a 16% accuracy drop on LLADA under W4A4). This paper explores how DLLMs' key mechanisms (dynamic masking, iterative generation, and bidirectional attention) clash with quantization. We identify three core issues: 1) Iterative generation and dynamic masking ratios lead to distinct token distributions across decoding steps, which are not adequately captured by existing PTQ calibration methods; 2) Quantization errors are accumulated and amplified…
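For readers unfamiliar with the W4A4 setting mentioned in the abstract (4-bit weights and 4-bit activations), the following is a minimal sketch of symmetric uniform "fake" quantization applied to a toy linear layer. It is not DLLMQuant, AWQ, or any method from the paper. The function name `fake_quantize`, the per-tensor scaling choice, and the random toy tensors are assumptions made purely for illustration.

```python
import numpy as np

def fake_quantize(x, n_bits=4):
    """Symmetric per-tensor uniform quantization followed by dequantization.

    Hypothetical helper for illustration only; real PTQ methods use more
    careful scaling (per-channel, per-group, calibrated clipping, etc.).
    """
    qmax = 2 ** (n_bits - 1) - 1              # 7 for signed 4-bit integers
    max_abs = float(np.max(np.abs(x)))
    scale = max_abs / qmax if max_abs > 0 else 1.0
    q = np.clip(np.round(x / scale), -qmax - 1, qmax)  # integer grid
    return q * scale                          # map back to floating point

rng = np.random.default_rng(0)
weights = rng.normal(size=(8, 16))            # toy weight matrix (assumed)
activations = rng.normal(size=(4, 16))        # toy activations for 4 tokens

# Full-precision output versus output with both operands quantized to 4 bits.
reference = activations @ weights.T
w4a4 = fake_quantize(activations) @ fake_quantize(weights).T

rel_error = np.linalg.norm(reference - w4a4) / np.linalg.norm(reference)
print(f"relative output error under toy W4A4: {rel_error:.3f}")
```

The single per-tensor scale is fixed by the largest magnitude in the tensor. If activation statistics shift between decoding steps, as the abstract describes for DLLMs, a scale calibrated on one step may fit another step poorly, which is one intuition for why calibration matters in this setting.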
