FineQuant: Unlocking Efficiency with Fine-Grained Weight-Only Quantization for LLMs

Young Jin Kim; Rawn Henry; Raffy Fahim; Hany Hassan Awadalla

arXiv:2308.09723 · cs.LG · August 22, 2023 · 2 cites

Open Access

TL;DR

FineQuant introduces a weight-only quantization technique for large language models that reduces memory usage and speeds up inference with minimal accuracy loss, without requiring additional fine-tuning.

Contribution

The paper presents a novel heuristic approach for fine-grained weight-only quantization applicable to both dense and MoE models, enhancing efficiency without extra training.

Findings

01 Achieves up to 3.65x throughput improvement on large models

02 Maintains minimal accuracy degradation with quantization

03 Supports efficient GPU GEMMs for quantized models

Abstract

Large Language Models (LLMs) have achieved state-of-the-art performance across various language tasks but pose challenges for practical deployment due to their substantial memory requirements. Furthermore, the latest generative models suffer from high inference costs caused by the memory bandwidth bottleneck in the auto-regressive decoding process. To address these issues, we propose an efficient weight-only quantization method that reduces memory consumption and accelerates inference for LLMs. To ensure minimal quality degradation, we introduce a simple and effective heuristic approach that utilizes only the model weights of a pre-trained model. This approach is applicable to both Mixture-of-Experts (MoE) and dense models without requiring additional fine-tuning. To demonstrate the effectiveness of our proposed method, we first analyze the challenges and issues associated with LLM…
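
To make the idea of fine-grained weight-only quantization concrete, the following is a minimal sketch in pure Python. It is not the paper's heuristic, which the truncated abstract above does not spell out; it assumes a generic symmetric scheme in which each contiguous group of weights gets its own scale. The function names, the 4-bit width, and the group sizes are illustrative assumptions, not values taken from the paper.

```python
from typing import List, Tuple


def quantize_groupwise(
    weights: List[float], group_size: int = 4, bits: int = 4
) -> Tuple[List[int], List[float]]:
    """Symmetric quantization with one scale per group (illustrative only)."""
    if len(weights) % group_size != 0:
        raise ValueError("len(weights) must be a multiple of group_size")
    qmax = 2 ** (bits - 1) - 1  # 7 for signed 4-bit values
    quantized: List[int] = []
    scales: List[float] = []
    for start in range(0, len(weights), group_size):
        group = weights[start:start + group_size]
        max_abs = max(abs(w) for w in group)
        # Guard against an all-zero group, where any scale would do.
        scale = max_abs / qmax if max_abs > 0 else 1.0
        scales.append(scale)
        for w in group:
            q = round(w / scale)
            quantized.append(max(-qmax, min(qmax, q)))  # clamp to the int range
    return quantized, scales


def dequantize_groupwise(
    quantized: List[int], scales: List[float], group_size: int = 4
) -> List[float]:
    """Recover approximate weights by multiplying each value by its group scale."""
    return [q * scales[i // group_size] for i, q in enumerate(quantized)]


def mean_abs_error(a: List[float], b: List[float]) -> float:
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


# Hypothetical weights: four small values followed by four large ones.
weights = [0.1, -0.2, 0.05, 0.15, 8.0, -3.0, 2.0, 1.0]

# Fine-grained: one scale per group of 4 weights.
q_fine, s_fine = quantize_groupwise(weights, group_size=4)
fine_err = mean_abs_error(weights, dequantize_groupwise(q_fine, s_fine, 4))

# Coarse: a single scale shared by all 8 weights.
q_coarse, s_coarse = quantize_groupwise(weights, group_size=8)
coarse_err = mean_abs_error(weights, dequantize_groupwise(q_coarse, s_coarse, 8))

print(f"fine-grained error: {fine_err:.4f}, coarse error: {coarse_err:.4f}")
```

With a single shared scale, the large weights dominate and the small ones collapse to zero, so the coarse error is higher than the fine-grained error on this input. Smaller groups reduce that error at the cost of storing more scales, which is the trade-off any fine-grained scheme has to balance.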

Taxonomy

Topics: Topic Modeling · Speech Recognition and Synthesis · Natural Language Processing Techniques