FineQuant: Unlocking Efficiency with Fine-Grained Weight-Only Quantization for LLMs
Young Jin Kim, Rawn Henry, Raffy Fahim, Hany Hassan Awadalla

TL;DR
FineQuant introduces a weight-only quantization technique for large language models that reduces memory usage and speeds up inference with minimal accuracy loss, without requiring additional fine-tuning.
Contribution
The paper presents a novel heuristic approach for fine-grained weight-only quantization applicable to both dense and MoE models, enhancing efficiency without extra training.
Findings
Achieves up to 3.65x throughput improvement on large models
Incurs only minimal accuracy degradation after quantization
Supports efficient GPU GEMMs for quantized models
Abstract
Large Language Models (LLMs) have achieved state-of-the-art performance across various language tasks but pose challenges for practical deployment due to their substantial memory requirements. Furthermore, the latest generative models suffer from high inference costs caused by the memory bandwidth bottleneck in the auto-regressive decoding process. To address these issues, we propose an efficient weight-only quantization method that reduces memory consumption and accelerates inference for LLMs. To ensure minimal quality degradation, we introduce a simple and effective heuristic approach that utilizes only the model weights of a pre-trained model. This approach is applicable to both Mixture-of-Experts (MoE) and dense models without requiring additional fine-tuning. To demonstrate the effectiveness of our proposed method, we first analyze the challenges and issues associated with LLM…
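To make "fine-grained weight-only quantization" concrete, here is a minimal sketch in plain Python. It is not the authors' implementation: the abstract does not specify the heuristic, the group size, or the kernel design, so the symmetric max-abs scaling, the group size of 4, and every function name below are assumptions chosen for illustration only. The sketch stores weights as low-bit integers with one scale per small group, keeps activations in floating point, and dequantizes weights on the fly inside the matrix-vector product.

```python
# Illustrative sketch only. Symmetric max-abs scaling, group_size, and all
# function names are assumptions, not details taken from the paper.

def quantize_groupwise(row, group_size=4, bits=4):
    """Quantize one weight row to signed integers, one scale per group."""
    qmax = 2 ** (bits - 1) - 1  # 7 for 4-bit, 127 for 8-bit
    q, scales = [], []
    for start in range(0, len(row), group_size):
        group = row[start:start + group_size]
        max_abs = max(abs(w) for w in group)
        scale = max_abs / qmax if max_abs > 0 else 1.0
        scales.append(scale)
        # Round to nearest integer and clamp to the symmetric range.
        q.extend(max(-qmax, min(qmax, round(w / scale))) for w in group)
    return q, scales


def dequantize_groupwise(q, scales, group_size=4):
    """Recover approximate float weights from integers and group scales."""
    return [qi * scales[i // group_size] for i, qi in enumerate(q)]


def matvec_weight_only(q_rows, scale_rows, x, group_size=4):
    """Compute y = W x with float activations and quantized weights.

    Weights are dequantized element by element inside the loop, which
    mimics (very loosely) fusing dequantization into the GEMM.
    """
    out = []
    for q, scales in zip(q_rows, scale_rows):
        acc = 0.0
        for i, qi in enumerate(q):
            acc += qi * scales[i // group_size] * x[i]
        out.append(acc)
    return out


def mean_abs_error(a, b):
    """Average absolute difference between two equal-length sequences."""
    return sum(abs(u - v) for u, v in zip(a, b)) / len(a)


if __name__ == "__main__":
    # A row with one large outlier (4.0) next to many small weights.
    row = [0.01, -0.02, 0.03, 0.02, 0.5, -0.8, 4.0, 0.1]

    # Coarse: a single scale for the whole row.
    q_c, s_c = quantize_groupwise(row, group_size=8)
    # Fine-grained: one scale per group of four weights.
    q_f, s_f = quantize_groupwise(row, group_size=4)

    print("coarse error:", mean_abs_error(row, dequantize_groupwise(q_c, s_c, 8)))
    print("fine error:  ", mean_abs_error(row, dequantize_groupwise(q_f, s_f, 4)))
    print("y:", matvec_weight_only([q_f], [s_f], [1.0] * 8, group_size=4))
```

In this toy row the outlier forces a large scale on everything that shares its group, so a single per-row scale rounds the small weights to zero, while per-group scales preserve the weights in the outlier-free group. Smaller groups cost extra storage for the scales, which is the trade-off any fine-grained scheme has to balance against the roughly 4x saving from storing 4-bit integers instead of 16-bit floats.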
Taxonomy
Topics: Topic Modeling · Speech Recognition and Synthesis · Natural Language Processing Techniques
