AQUATIC-Diff: Additive Quantization for Truly Tiny Compressed Diffusion Models
Adil Hasan, Thomas Peyrin

TL;DR
This paper introduces AQUATIC-Diff, a novel additive vector quantization method that significantly reduces the size and computational requirements of diffusion models, enabling more efficient and accessible media generation.
Contribution
It applies codebook-based additive vector quantization to diffusion models, achieving state-of-the-art low-bit compression and FLOPs savings with broad hardware support.
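To make "codebook-based additive vector quantization" concrete, here is a minimal NumPy sketch of the general idea: each group of g weights is stored as M small integer indices, and the group is rebuilt by summing one codeword from each of M codebooks. The function names `decode_additive` and `bits_per_weight`, and the values of M, K, and g, are illustrative assumptions rather than the paper's settings or code.

```python
import numpy as np

def decode_additive(codes, codebooks):
    """Rebuild weight groups as the sum of one codeword per codebook.

    codes:     (n_groups, M) integer indices
    codebooks: (M, K, g) float codewords
    returns:   (n_groups, g) reconstructed weights
    """
    M = codebooks.shape[0]
    # Pick codeword codes[:, m] from codebook m, then add the M picks together.
    return sum(codebooks[m][codes[:, m]] for m in range(M))

def bits_per_weight(M, K, g):
    """Index storage only: M indices of log2(K) bits shared by g weights."""
    return M * np.log2(K) / g

rng = np.random.default_rng(0)
M, K, g = 2, 256, 8  # hypothetical configuration, chosen for illustration
codebooks = rng.normal(size=(M, K, g)).astype(np.float32)
codes = rng.integers(0, K, size=(16, M))
w_hat = decode_additive(codes, codebooks)  # shape (16, 8)
```

Under this accounting, two 8-bit indices per group of 8 weights cost 2 bits per weight, and a single 12-bit index per group of 8 costs 1.5 bits per weight, which is one way fractional widths such as W1.5 can arise. The count ignores the storage of the codebooks themselves and any scales.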
Findings
Lowered sFID by 1.92 points at W4A8 compared to the full-precision model.
Achieved the best results for FID, sFID, and ISC at W2A8.
Demonstrated FLOPs savings on arbitrary hardware.
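One generic way a codebook format can cut floating-point work is a lookup-table matrix-vector product: the dot products between every codeword and every input group are computed once, and each output is then assembled by additions only. The sketch below illustrates that general technique; this summary does not say whether AQUATIC-Diff's kernels work this way, and `lut_matvec` and its array layout are hypothetical.

```python
import numpy as np

def lut_matvec(codes, codebooks, x):
    """Compute y = W_hat @ x without materializing W_hat.

    codes:     (d_out, n_groups, M) integer indices
    codebooks: (M, K, g) float codewords
    x:         (n_groups * g,) input vector
    """
    M, K, g = codebooks.shape
    d_out, n_groups, _ = codes.shape
    xg = x.reshape(n_groups, g)
    # table[m, k, j] = dot(codebooks[m, k], x group j):
    # M * K * d_in multiply-adds, independent of d_out.
    table = np.einsum("mkg,jg->mkj", codebooks, xg)
    y = np.zeros(d_out, dtype=table.dtype)
    j = np.arange(n_groups)
    for m in range(M):
        # Gather precomputed dot products and add them; no multiplications here.
        y += table[m][codes[:, :, m], j].sum(axis=1)
    return y
```

A dense product costs d_out * d_in multiply-adds, while this version costs M * K * d_in multiply-adds for the table plus d_out * n_groups * M additions, so it only pays off when M * K is small relative to d_out.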
Abstract
Significant investments have been made towards the commodification of diffusion models for generation of diverse media. Their mass-market adoption is however still hobbled by the intense hardware resource requirements of diffusion model inference. Model quantization strategies tailored specifically towards diffusion models have been useful in easing this burden, yet have generally explored the Uniform Scalar Quantization (USQ) family of quantization methods. In contrast, Vector Quantization (VQ) methods, which operate on groups of multiple related weights as the basic unit of compression, have seen substantial success in Large Language Model (LLM) quantization. In this work, we apply codebook-based additive vector quantization to the problem of diffusion model compression. Our resulting approach achieves a new Pareto frontier for the extremely low-bit weight quantization on the standard…
Peer Reviews
Decision: ICLR 2025 Conference Withdrawn Submission
The paper applies codebook-based additive vector quantization to diffusion models for the first time, adapting techniques previously used for LLM quantization. The method achieves unprecedented compression levels, including the first successful W1.5A8 quantization. The approach allows for a dynamic trade-off between quantization-time GPU hours and inference-time savings, combining benefits of both Post-Training Quantization (PTQ) and Quantization-Aware Training (QAT).
The paper primarily focuses on the LDM-4 ImageNet model. What about others? While the paper mentions that the most time-consuming stage is highly parallelizable, it doesn't provide a detailed analysis of the computational requirements for the quantization process compared to existing methods - AWQ, GPTQ, QuaRot [1], TesseraQ [2]. Main concern for me is what is the main contribution here? I feel that such quantization tricks already exist for LLM domain. So what have the authors found or are sp
* The FLOPs reduction achieved by AQUATIC-Diff is noteworthy. Unlike traditional weight-only quantization, which primarily reduces bitwise operations (BOPs), reducing FLOPs can directly decrease latency on off-the-shelf hardware.
* The authors have included their code with the submission to ensure reproducibility.
* The paper is difficult to follow, with overly long sentences (e.g., in the abstract) and inconsistent citation formatting. Figures are also hard to interpret due to short captions, and there is a missing reference (Line 485).
* The paper's technical contribution and novelty appear limited. It primarily applies vector quantization to diffusion models with quantization-aware fine-tuning. Techniques like Convolutional Kernel-Aware Quantization and Layer Heterogeneity-Aware Quantization (LAQ) seem
1. This work utilizes the emerging techniques from recent quantization works.
2. I like how authors report multiple aspects of evaluation, such as algorithm runtime, FIDs, IS, and precision. They are being honest with all metrics.
* My major concern about this work is it is too incremental from existing works. I am not against combination works (meaning applying method from A to B). However, I expect to have some interesting observations or insights when applying other works into Diffusion models. For example:
1. It is intuitive that vector quantization may obtain better performance than uniform quantization. But does the author verify the weight distribution in diffusion models? From the experiments, I did not observe
1. The problem being addressed is important.
2. The paper contains code.
3. The method shows good empirical results in Table 2.
1. The method aims to compress the model, but it is not clear if this translates to any benefit during inference, after accounting for the method overhead (e.g., it has more scales). The only part that seems relevant to this issue is Table 5. Table 5 claims to have latency and FLOPs results, but unfortunately I don't see any latency results. Also, why is the baseline 32/32? Why not 16/16 or 8/8, which should have near 32/32 image quality? From the reported FLOPs I can estimate the proposed metho
Taxonomy
Topics: Advanced Data Compression Techniques · Advanced Neuroimaging Techniques and Applications · Image and Video Quality Assessment
