TL;DR
PermuQuant introduces a channel reordering technique based on statistical similarity to improve post-training quantization of diffusion models, significantly reducing error and memory footprint.
Contribution
It proposes a novel channel sorting method for per-group quantization that adapts to calibration data, improving low-bit model performance.
Findings
Reduces quantization error across multiple diffusion models.
Achieves up to 1.8× speedup and 3.5× memory reduction.
Outperforms existing PTQ methods in quality and efficiency.
Abstract
Large-scale visual generative models have achieved remarkable performance. However, their high computational and memory costs make deployment challenging in resource-constrained scenarios, such as interactive applications and personal single-GPU usage. Post-training quantization (PTQ) offers a practical solution by compressing pretrained models without expensive retraining. However, existing PTQ methods still suffer from severe quality degradation under extremely low-bit settings. In this paper, we identify channel ordering as an important but underexplored factor in per-group quantization. In this setting, each contiguous group shares one quantization scale. When channels with very different statistics are placed in the same group, the scale can be dominated by outliers and cause large quantization errors. Based on this observation, we propose PermuQuant, a simple and effective PTQ…
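The grouping effect described in the abstract can be illustrated with a small NumPy sketch. This is not the PermuQuant algorithm: the abstract is truncated before the method is described, so the sort key used here (per-channel maximum absolute value) is an assumption chosen only for illustration, as are the function names `quantize_per_group` and `quantize_with_permutation` and the toy weight matrix. The sketch applies symmetric 4-bit fake quantization with one scale per contiguous group of channels, then compares the error with and without reordering.

```python
import numpy as np

def quantize_per_group(w, group_size, bits=4):
    """Symmetric fake-quantization: within each row, every contiguous
    group of `group_size` channels shares a single scale."""
    rows, channels = w.shape
    assert channels % group_size == 0, "channels must divide evenly into groups"
    qmax = 2 ** (bits - 1) - 1                      # 7 for 4-bit symmetric
    groups = w.reshape(rows, channels // group_size, group_size)
    scale = np.abs(groups).max(axis=-1, keepdims=True) / qmax
    scale = np.where(scale == 0, 1.0, scale)        # guard all-zero groups
    q = np.clip(np.round(groups / scale), -qmax, qmax)
    return (q * scale).reshape(rows, channels)      # dequantized values

def quantize_with_permutation(w, perm, group_size, bits=4):
    """Reorder channels, quantize per group, then restore the original order."""
    w_hat_sorted = quantize_per_group(w[:, perm], group_size, bits)
    inverse = np.argsort(perm)                      # undoes the reordering
    return w_hat_sorted[:, inverse]

# Toy weights (hypothetical): small and large channels are interleaved,
# so every contiguous group of 4 contains two outlier channels.
channel_scale = np.array([0.1, 7.0, 0.2, 7.0, 0.3, 7.0, 0.4, 7.0])
w = np.stack([channel_scale, -0.5 * channel_scale])

# Assumed sort key for illustration: per-channel maximum absolute value.
perm = np.argsort(np.abs(w).max(axis=0), kind="stable")

err_plain = np.mean((w - quantize_per_group(w, group_size=4)) ** 2)
err_perm = np.mean((w - quantize_with_permutation(w, perm, group_size=4)) ** 2)
print(f"MSE without reordering: {err_plain:.6f}")
print(f"MSE with reordering:    {err_perm:.6f}")
```

In this toy case, the unordered groups take their scale from the large channels, so the small channels all round to zero. After sorting, the small channels get their own finer scale and the error drops. In a real network the permutation also has to be undone or absorbed by the neighboring layer, which the inverse index above only hints at.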
