FLRQ: Faster LLM Quantization with Flexible Low-Rank Matrix Sketching
Hongyaoxing Gul, Lijuan Hu, Shuzi Niu, Fangfang Liu

TL;DR
FLRQ is a fast, flexible low-rank quantization method for large language models. It adaptively selects a rank for each layer without costly fine-tuning, improving both quantization quality and efficiency.
Contribution
The paper proposes FLRQ, a novel low-rank quantization approach that quickly identifies optimal ranks for each layer, reducing computational overhead and enhancing model compression.
Findings
Achieves state-of-the-art quantization quality.
Demonstrates superior efficiency over existing methods.
Robust across diverse models and datasets.
Abstract
Traditional post-training quantization (PTQ) is considered an effective approach to reduce model size and accelerate inference of large-scale language models (LLMs). However, existing low-rank PTQ methods require costly fine-tuning to determine a compromise rank for diverse data and layers in large models, failing to exploit their full potential. Additionally, the current SVD-based low-rank approximation compounds the computational overhead. In this work, we thoroughly analyze the varying effectiveness of low-rank approximation across different layers in representative models. Accordingly, we introduce Flexible Low-Rank Quantization (FLRQ), a novel solution designed to quickly identify the accuracy-optimal ranks and aggregate them to achieve minimal storage combinations. FLRQ comprises two powerful components, Rank1-Sketch-based Flexible…
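The general idea the abstract builds on, keeping a small low-rank part of a weight matrix in full precision and quantizing only the residual, can be illustrated with a short sketch. This is not the FLRQ algorithm: the abstract is truncated before its components are described, so everything below is an assumption made for illustration. That includes the function names (`rank1_power`, `quantize`, `low_rank_quantize`), the per-matrix round-to-nearest quantizer, the use of power iteration to peel off rank-1 components without a full SVD, and the `min_gain` stopping rule that picks a rank per matrix.

```python
import math
import random

def matvec(A, x):
    return [sum(a * b for a, b in zip(row, x)) for row in A]

def rmatvec(A, y):
    # Computes A^T y.
    return [sum(A[i][j] * y[i] for i in range(len(A))) for j in range(len(A[0]))]

def norm(x):
    return math.sqrt(sum(v * v for v in x))

def fro(A):
    return math.sqrt(sum(v * v for row in A for v in row))

def rank1_power(A, iters=50, seed=0):
    """Approximate the leading singular triplet (sigma, u, v) by power iteration."""
    rng = random.Random(seed)
    v = [rng.gauss(0, 1) for _ in A[0]]
    nv = norm(v)
    v = [x / nv for x in v]
    sigma = 0.0
    for _ in range(iters):
        u = matvec(A, v)
        nu = norm(u)
        if nu == 0.0:  # A has no component along v (e.g. A is zero)
            return 0.0, u, v
        u = [x / nu for x in u]
        w = rmatvec(A, u)
        sigma = norm(w)
        v = [x / sigma for x in w]
    return sigma, u, v

def quantize(A, bits):
    """Symmetric round-to-nearest with one scale per matrix; returns dequantized values."""
    qmax = 2 ** (bits - 1) - 1
    amax = max(abs(v) for row in A for v in row)
    if amax == 0.0:
        return [row[:] for row in A]
    scale = amax / qmax
    return [[max(-qmax, min(qmax, round(v / scale))) * scale for v in row] for row in A]

def low_rank_quantize(W, bits=4, max_rank=4, min_gain=0.5):
    """Peel rank-1 components while each one removes at least `min_gain`
    (hypothetical threshold) of the residual norm, then quantize the residual."""
    residual = [row[:] for row in W]
    factors = []
    for _ in range(max_rank):
        before = fro(residual)
        if before == 0.0:
            break
        s, u, v = rank1_power(residual)
        trial = [[residual[i][j] - s * u[i] * v[j] for j in range(len(v))]
                 for i in range(len(u))]
        if (before - fro(trial)) / before < min_gain:
            break  # the next component is not worth its storage
        residual = trial
        factors.append((s, u, v))
    Wq = quantize(residual, bits)
    # Reconstruction: dequantized residual plus the full-precision low-rank part.
    W_hat = [[Wq[i][j] + sum(s * u[i] * v[j] for s, u, v in factors)
              for j in range(len(W[0]))] for i in range(len(W))]
    return W_hat, len(factors)

def err(A, B):
    return fro([[a - b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)])

# Toy weight matrix: one dominant rank-1 direction plus small noise.
rng = random.Random(1)
W = [[10 * (i + 1) * (j + 1) / 64 + rng.gauss(0, 0.05) for j in range(8)]
     for i in range(8)]
plain = quantize(W, 4)
W_hat, rank = low_rank_quantize(W, bits=4)
print(rank, err(W, plain), err(W, W_hat))
```

On this toy matrix the dominant direction inflates the quantization scale, so removing it before quantizing shrinks the reconstruction error. A layer without such structure would gain little from extra rank, which is the intuition behind choosing the rank per layer rather than fixing one value globally.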
Taxonomy
Topics: Advanced Neural Network Applications · Domain Adaptation and Few-Shot Learning · Speech Recognition and Synthesis
