AutoQRA: Joint Optimization of Mixed-Precision Quantization and Low-rank Adapters for Efficient LLM Fine-Tuning

Changhai Zhou; Shiyang Zhang; Yuhua Zhou; Qian Qiao; Jun Gao; Cheng Jin; Kaizhou Qin; Weizhong Zhang

arXiv:2602.22268 · cs.LG · February 27, 2026

TL;DR

AutoQRA is a joint optimization framework that tunes the quantization bit-width and the LoRA rank of each layer together rather than one after the other, so that fine-tuning large language models under a fixed memory budget approaches full-precision quality.

Contribution

The paper proposes a two-stage optimization method that combines evolutionary search with Bayesian optimization to jointly select per-layer quantization bit-widths and LoRA ranks.

Findings

01

Achieves near full-precision fine-tuning performance

02

Keeps the memory footprint comparable to uniform 4-bit methods

03

Effectively compensates for quantization noise during training

Abstract

Quantization followed by parameter-efficient fine-tuning has emerged as a promising paradigm for downstream adaptation under tight GPU memory constraints. However, this sequential pipeline fails to leverage the intricate interaction between quantization bit-width and LoRA rank. Specifically, a carefully optimized quantization allocation with low quantization error does not always translate to strong fine-tuning performance, and different bit-width and rank configurations can lead to significantly varying outcomes under the same memory budget. To address this limitation, we propose AutoQRA, a joint optimization framework that simultaneously optimizes the bit-width and LoRA rank configuration for each layer during the mixed quantized fine-tuning process. To tackle the challenges posed by the large discrete search space and the high evaluation cost associated with frequent fine-tuning…
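
To make the "same memory budget" point concrete, the following is a minimal sketch of how the weight memory of a per-layer (bit-width, rank) configuration might be counted. It is not the paper's accounting: the `LayerConfig` class, the helper names, the 4096 x 4096 layer shape, and the choice to store adapters in 16 bits are all assumptions made for illustration, and the count ignores quantization metadata (scales, zero-points), optimizer state, and activations.

```python
from dataclasses import dataclass

# Assumption: LoRA adapter matrices are kept in 16-bit precision.
ADAPTER_BITS = 16


@dataclass(frozen=True)
class LayerConfig:
    """Hypothetical per-layer choice: base-weight bit-width and LoRA rank."""
    d_in: int
    d_out: int
    bits: int   # bit-width of the frozen, quantized base weight
    rank: int   # rank r of the LoRA adapter (A: r x d_in, B: d_out x r)


def layer_bits(cfg: LayerConfig) -> int:
    """Bits used by one layer: quantized base weight plus both adapter matrices."""
    base = cfg.d_in * cfg.d_out * cfg.bits
    adapter = cfg.rank * (cfg.d_in + cfg.d_out) * ADAPTER_BITS
    return base + adapter


def total_bytes(configs: list[LayerConfig]) -> int:
    """Total weight memory in bytes for a list of per-layer configurations."""
    return sum(layer_bits(c) for c in configs) // 8


d = 4096  # illustrative layer width, not taken from the paper

# Uniform baseline: every layer at 4 bits with rank 16.
uniform = [LayerConfig(d, d, bits=4, rank=16) for _ in range(3)]

# Mixed allocation: bit-widths and ranks are redistributed across layers.
mixed = [
    LayerConfig(d, d, bits=8, rank=8),
    LayerConfig(d, d, bits=2, rank=32),
    LayerConfig(d, d, bits=2, rank=8),
]

print(total_bytes(uniform), total_bytes(mixed))
```

Under these assumptions both configurations occupy the same number of bytes, yet they distribute precision and adapter capacity very differently across layers. A memory count alone therefore cannot rank them, which is why each candidate's fine-tuning quality has to be evaluated or estimated by the search.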

Taxonomy

Topics: Parallel Computing and Optimization Techniques · Advanced Data Compression Techniques · Advanced Neural Network Applications