TL;DR
LoRAQuant introduces a mixed-precision quantization technique for LoRA adapters in large language models, enabling ultra-low bitwidth compression while maintaining or improving task performance.
Contribution
It presents a novel post-training quantization method that reparameterizes LoRA adapters with SVD to focus on important components for efficient low-bit quantization.
Findings
Achieves lower bitwidths than existing methods
Maintains or improves performance across multiple tasks
Effective on models like LLaMA and Mistral
Abstract
Low-Rank Adaptation (LoRA) has become a popular technique for parameter-efficient fine-tuning of large language models (LLMs). In many real-world scenarios, multiple adapters are loaded simultaneously to enable LLM customization for personalized user experiences or to support a diverse range of tasks. Although each adapter is lightweight in isolation, their aggregate cost becomes substantial at scale. To address this, we propose LoRAQuant, a mixed-precision post-training quantization method tailored to LoRA. Specifically, LoRAQuant reparameterizes each adapter by singular value decomposition (SVD) to concentrate the most important information into specific rows and columns. This makes it possible to quantize the important components to higher precision, while quantizing the rest to ultra-low bitwidth. We conduct comprehensive experiments with LLaMA 2-7B, LLaMA 2-13B, and Mistral 7B…
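The reparameterization described in the abstract can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the function names (`svd_reparam`, `rtn_quantize`, `sign_quantize`, `quantize_adapter`), the split point `h`, the 3-bit setting, and the choice of a min-max round-to-nearest quantizer for the important components and a sign quantizer for the rest are all assumptions made for the example.

```python
import numpy as np


def svd_reparam(B, A):
    """Rewrite B @ A as B2 @ A2, with rank-1 components ordered by singular value."""
    r = B.shape[1]
    U, S, Vt = np.linalg.svd(B @ A, full_matrices=False)
    U, S, Vt = U[:, :r], S[:r], Vt[:r]   # B @ A has rank at most r
    root = np.sqrt(S)
    return U * root, root[:, None] * Vt  # (U S^{1/2}), (S^{1/2} V^T)


def rtn_quantize(x, bits):
    """Min-max uniform round-to-nearest; returns the dequantized values."""
    lo, hi = x.min(), x.max()
    scale = (hi - lo) / (2 ** bits - 1)
    if scale == 0:
        return x.copy()
    return np.round((x - lo) / scale) * scale + lo


def sign_quantize(x):
    """1-bit quantizer: keep only the sign, with one shared scale (mean magnitude)."""
    return np.abs(x).mean() * np.where(x >= 0, 1.0, -1.0)


def quantize_adapter(B2, A2, h, high_bits=3):
    """Hypothetical split: first h components at high_bits, the rest at 1 bit."""
    Bq, Aq = np.empty_like(B2), np.empty_like(A2)
    for i in range(B2.shape[1]):
        if i < h:
            Bq[:, i] = rtn_quantize(B2[:, i], high_bits)
            Aq[i] = rtn_quantize(A2[i], high_bits)
        else:
            Bq[:, i] = sign_quantize(B2[:, i])
            Aq[i] = sign_quantize(A2[i])
    return Bq, Aq
```

Because the singular values are sorted in descending order, the leading columns of the new left factor and the leading rows of the new right factor carry the largest components, which is what makes a per-component precision split possible. Forming `B @ A` densely is done here only for clarity; for large layers one would likely work from the factors directly.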
Peer Reviews
Decision · ICLR 2026 Conference Withdrawn Submission
1. Clear motivation and simple SVD-based mixed-precision design. 2. Strong low-bit results with minimal accuracy loss. 3. Comprehensive experiments and ablations across multiple LLMs and tasks.
1. LoRA adapters are already small, so quantizing them offers little real-world benefit. In the multi-LoRA case, the adapters can usually be saved offline and loaded during inference. The proposed work is useful only when multiple LoRA adapters have to be loaded simultaneously. It would be better for the authors to explore further where the proposed approach can have an impact. 2. In the scenario that multiple LoRA adapters have to be loaded simultaneously, there can
1). The paper is clearly written and easy to follow. 2). The ablation studies are clear and well presented.
1). The technical contribution has limited novelty. Essentially, the paper applies the same algorithm as SVDQuant to LoRA weights. Other quantization techniques mentioned, such as 1-bit or 2-bit quantizers and optimizations for reducing quantization error, are standard. 2). The evaluation misses an important baseline—SVDQuant. 3). The performance improvements are limited, especially compared to other quantization methods with similar average bits. 4). The evaluation is conducted only on relat
1. Instead of applying generic quantization, LoRAQuant proposes to reparameterize LoRA through SVD, leveraging the inherent low-rank structure to guide mixed-precision assignment. 2. The proposed LoRAQuant method reparameterizes $BA$ into $US^{1/2}$ and $S^{1/2}V^T$, where the singular values naturally encode the importance of each latent direction. 3. LoRAQuant achieves competitive or superior accuracy under <2 bits on average, significantly outperforming standard baselines such as RTN, GPTQ, PB-LL
1. While the paper focuses on compressing LoRA adapters, it is unclear how significant the overall memory savings are when the base model remains dominant. For example, when the LoRA rank is 64, the adapter typically constitutes only about 2–3% of the base model parameters. In such cases, quantizing only the LoRA part may have marginal benefits compared to compressing or quantizing the base model itself to an extreme degree. 2. The proposed straight-through optimization requires about 100 steps
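The 2–3% figure quoted in the review above can be sanity-checked with a short calculation. The sketch below assumes LLaMA 2-7B-like dimensions (hidden size 4096, MLP size 11008, 32 layers, roughly 6.74B base parameters) and assumes that rank-64 LoRA is attached to all seven linear projections per layer; the review does not state which model or modules it had in mind, and `lora_param_count` is a hypothetical helper.

```python
def lora_param_count(rank, layer_shapes, num_layers):
    """Parameters in LoRA factors A (rank x d_in) and B (d_out x rank)."""
    per_layer = sum(rank * (d_in + d_out) for d_in, d_out in layer_shapes)
    return per_layer * num_layers


# Assumed LLaMA 2-7B-like shapes (not stated in the review).
hidden, mlp, layers = 4096, 11008, 32
shapes = (
    [(hidden, hidden)] * 4   # q, k, v, o projections
    + [(hidden, mlp)] * 2    # gate and up projections
    + [(mlp, hidden)]        # down projection
)

adapter_params = lora_param_count(64, shapes, layers)
base_params = 6.74e9         # approximate base model size
ratio = adapter_params / base_params

print(f"adapter parameters: {adapter_params:,}")
print(f"share of base model: {ratio:.2%}")
```

Under these assumptions a single adapter has about 160M parameters, or roughly 2.4% of the base model, consistent with the reviewer's estimate. The same helper also makes the paper's motivation concrete: the adapter share grows linearly with the number of adapters held in memory at once.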
- Clear motivation: Addresses the practical issue of memory overhead when serving multiple LoRAs simultaneously.
- Systematic evaluation: The experiments cover several models and tasks, with quantitative comparisons against GPTQ, PB-LLM, and BiLLM.
- Empirical rigor: The ablation studies are comprehensive, isolating the impact of SVD splitting, optimization, and dynamic precision allocation.
- Limited novelty: The main technical components (SVD, mixed-precision quantization, and straight-through optimization) are all well-established techniques in the quantization and model compression literature.
  - Similar SVD-based quantization methods (e.g., SVDQuant, PiSSA) already exist.
  - Mixed-precision binarization has been explored in PB-LLM and BiLLM.
  - Using SVD to rank component importance is standard practice in low-rank adaptation.
- Narrow scope of contribution: The method applies k
