LoRaQ: Optimized Low Rank Approximation for 4-bit Quantization
Yann Bouquet, Alireza Khodamoradi, Sophie Yáng Shen, Kristof Denolf, Mathieu Salzmann

TL;DR
LoRaQ introduces a data-free, low-rank approximation method for 4-bit quantization, enabling fully sub-16 bit models that outperform existing approaches in resource-constrained deployment of diffusion transformers.
Contribution
LoRaQ removes the need for high-precision auxiliary branches and data-dependent calibration, achieving fully quantized models with improved performance over state-of-the-art methods.
Findings
LoRaQ outperforms existing methods on the PixArt-Σ and SANA diffusion models.
It enables a fully sub-16 bit pipeline for diffusion transformers.
Mixed-precision configurations, with the low-rank branch quantized to settings such as W8A8, W6A6, or W4A8, yield superior results.
Abstract
Post-training quantization (PTQ) is essential for deploying large diffusion transformers on resource-constrained hardware, but aggressive 4-bit quantization significantly degrades generative performance. Low-rank approximation methods have emerged as a promising solution by appending auxiliary linear branches to restore performance. However, current state-of-the-art approaches assume these branches must retain high precision (W16A16) and rely on heavy, data-dependent calibration for initialization. We challenge both limitations with LoRaQ (Low-Rank Approximated Quantization), a simple, data-free calibration approach that optimizes quantization error compensation. By overcoming the need for high-precision branches, LoRaQ enables the first fully sub-16 bit pipeline, allowing the low-rank branch itself to be quantized. We demonstrate that, at equal memory overhead, LoRaQ outperforms the…
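The general idea of compensating quantization error with a low-rank branch can be sketched as follows. This is a minimal, data-free illustration only: it assumes symmetric per-row round-to-nearest quantization and a plain truncated SVD of the residual, which is a generic baseline and not LoRaQ's optimized procedure (the abstract does not spell that out). The function names `fake_quantize` and `low_rank_compensate`, the matrix shape, the rank, and the 8-bit branch setting are all hypothetical choices made for the example.

```python
import numpy as np

def fake_quantize(w, bits):
    """Symmetric per-row round-to-nearest quantization (assumed scheme).

    Returns the dequantized float values so errors can be compared directly.
    """
    qmax = 2 ** (bits - 1) - 1                      # e.g. 7 for 4 bits
    max_abs = np.max(np.abs(w), axis=1, keepdims=True)
    scale = np.where(max_abs > 0, max_abs / qmax, 1.0)
    q = np.clip(np.round(w / scale), -qmax - 1, qmax)
    return q * scale

def low_rank_compensate(w, rank, bits=4):
    """Quantize w, then fit a rank-`rank` branch to the residual via SVD.

    No calibration data is used: only the weight matrix itself.
    """
    w_q = fake_quantize(w, bits)
    residual = w - w_q
    u, s, vt = np.linalg.svd(residual, full_matrices=False)
    a = u[:, :rank] * s[:rank]                      # shape (out, rank)
    b = vt[:rank, :]                                # shape (rank, in)
    return w_q, a, b

# Hypothetical weight matrix standing in for one linear layer.
rng = np.random.default_rng(0)
w = rng.standard_normal((64, 128))

w_q, a, b = low_rank_compensate(w, rank=8, bits=4)

# Error of plain 4-bit quantization vs. 4-bit plus a float low-rank branch.
err_plain = np.linalg.norm(w - w_q)
err_comp = np.linalg.norm(w - (w_q + a @ b))

# Quantizing the branch factors too (here to 8 bits, an assumed setting)
# keeps every tensor below 16 bits.
a_q, b_q = fake_quantize(a, 8), fake_quantize(b, 8)
err_branch_q = np.linalg.norm(w - (w_q + a_q @ b_q))

print(err_plain, err_comp, err_branch_q)
```

In this sketch the branch adds `rank * (out + in)` extra parameters per layer, so the rank and the branch bit width together set the memory overhead at which methods are compared.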
