FLRQ: Faster LLM Quantization with Flexible Low-Rank Matrix Sketching

Hongyaoxing Gul; Lijuan Hu; Shuzi Niu; Fangfang Liu

arXiv:2601.05684·cs.LG·January 12, 2026

TL;DR

FLRQ introduces a fast, flexible low-rank quantization method for large language models, significantly improving quantization quality and efficiency by adaptively selecting optimal ranks for each layer without costly fine-tuning.

Contribution

The paper proposes FLRQ, a novel low-rank quantization approach that quickly identifies optimal ranks for each layer, reducing computational overhead and enhancing model compression.

Findings

01 Achieves state-of-the-art quantization quality.

02 Demonstrates superior efficiency over existing methods.

03 Robust across diverse models and datasets.

Abstract

Traditional post-training quantization (PTQ) is considered an effective approach to reduce model size and accelerate inference of large-scale language models (LLMs). However, existing low-rank PTQ methods require costly fine-tuning to determine a compromise rank for diverse data and layers in large models, failing to exploit their full potential. Additionally, the current SVD-based low-rank approximation compounds the computational overhead. In this work, we thoroughly analyze the varying effectiveness of low-rank approximation across different layers in representative models. Accordingly, we introduce Flexible Low-Rank Quantization (FLRQ), a novel solution designed to quickly identify the accuracy-optimal ranks and aggregate them to achieve minimal storage combinations. FLRQ comprises two powerful components, Rank1-Sketch-based Flexible…
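For readers unfamiliar with the setting, the following is a minimal sketch of generic SVD-based low-rank compensation for a quantized weight matrix, the kind of baseline the abstract describes as computationally costly. It is not FLRQ itself (the abstract above is truncated before the method is described). The function names `quantize_uniform` and `low_rank_compensate`, the symmetric per-tensor quantizer, the fixed rank, and the random test matrix are all assumptions chosen for illustration.

```python
import numpy as np


def quantize_uniform(w, bits=4):
    """Symmetric per-tensor round-to-nearest quantization (illustrative assumption)."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(w).max() / qmax
    # Round to the integer grid, clip to the signed range, and map back to floats.
    return np.clip(np.round(w / scale), -qmax - 1, qmax) * scale


def low_rank_compensate(w, bits=4, rank=8):
    """Return (w_q, a, b) such that w is approximated by w_q + a @ b.

    The factors a and b come from a truncated SVD of the quantization
    residual w - w_q. This is a generic baseline, not the paper's method.
    """
    w_q = quantize_uniform(w, bits)
    u, s, vt = np.linalg.svd(w - w_q, full_matrices=False)
    a = u[:, :rank] * s[:rank]  # scale the kept left singular vectors
    b = vt[:rank, :]            # kept right singular vectors
    return w_q, a, b


# Hypothetical weight matrix; real layers are much larger.
rng = np.random.default_rng(0)
w = rng.standard_normal((64, 48))

w_q, a, b = low_rank_compensate(w, bits=4, rank=8)
err_plain = np.linalg.norm(w - w_q)
err_comp = np.linalg.norm(w - (w_q + a @ b))
print(f"plain: {err_plain:.4f}, compensated: {err_comp:.4f}")
```

In this sketch the rank is a fixed hyperparameter shared by every matrix, which is the "compromise rank" situation the abstract criticizes; choosing a different rank per layer would change only the `rank` argument passed for each weight matrix.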

Taxonomy

Topics: Advanced Neural Network Applications · Domain Adaptation and Few-Shot Learning · Speech Recognition and Synthesis