WKVQuant: Quantizing Weight and Key/Value Cache for Large Language Models Gains More

Yuxuan Yue; Zhihang Yuan; Haojie Duanmu; Sifan Zhou; Jianlong Wu; Liqiang Nie

arXiv:2402.12065 · cs.LG · February 21, 2024 · 2 cites

TL;DR

WKVQuant is a post-training quantization (PTQ) framework that reduces the memory footprint of large language models by quantizing only the weights and the key/value (KV) cache, keeping accuracy close to that of weight-only quantization.

Contribution

WKVQuant introduces past-only quantization for the attention computation and a two-dimensional (2D) quantization strategy for handling the distribution of the KV cache.
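The summary names a 2D quantization strategy without spelling out the two dimensions. As a rough illustration only, the sketch below assumes they are a static per-channel scale (fixed from calibration data) followed by dynamic per-token uniform quantization. The function names, the max-abs calibration rule, and the 4-bit default are all hypothetical choices made for this example, not details taken from the paper.

```python
def quantize_token(row, n_bits=4):
    """Asymmetric uniform quantization of one token vector -> (codes, scale, offset)."""
    lo, hi = min(row), max(row)
    levels = (1 << n_bits) - 1
    scale = (hi - lo) / levels or 1.0  # fall back to 1.0 for a constant row
    codes = [round((x - lo) / scale) for x in row]
    return codes, scale, lo


def dequantize_token(codes, scale, lo):
    """Map integer codes back to approximate real values."""
    return [c * scale + lo for c in codes]


def channel_scales(calib):
    """Assumed static step: one scale per channel, the max |value| seen in calibration."""
    n_channels = len(calib[0])
    return [max(abs(row[c]) for row in calib) or 1.0 for c in range(n_channels)]


def quant_2d(x, ch_scale, n_bits=4):
    """Dimension 1: divide each channel by its static scale.
    Dimension 2: quantize each token (row) with its own dynamic range."""
    quantized = []
    for row in x:
        smoothed = [v / s for v, s in zip(row, ch_scale)]
        quantized.append(quantize_token(smoothed, n_bits))
    return quantized


def dequant_2d(quantized, ch_scale):
    """Undo the token-wise quantization, then restore the channel scales."""
    restored = []
    for codes, scale, lo in quantized:
        smoothed = dequantize_token(codes, scale, lo)
        restored.append([v * s for v, s in zip(smoothed, ch_scale)])
    return restored


# Toy KV slice: 2 tokens x 3 channels, with one large-magnitude channel.
x = [[0.1, -0.2, 50.0], [0.3, 0.1, -40.0]]
scales = channel_scales(x)
approx = dequant_2d(quant_2d(x, scales), scales)
```

Scaling channels first keeps a single large-magnitude channel from consuming the whole per-token range, which is the usual motivation for combining the two axes.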

Findings

01 Memory savings comparable to weight-activation quantization
02 Performance approaches weight-only quantization
03 Effective quantization of key/value caches

Abstract

Large Language Models (LLMs) face significant deployment challenges due to their substantial memory requirements and the computational demands of the auto-regressive text generation process. This paper addresses these challenges by focusing on the quantization of LLMs, a technique that reduces memory consumption by converting model parameters and activations into low-bit integers. We critically analyze the existing quantization approaches, identifying their limitations in balancing the accuracy and efficiency of the quantized LLMs. To advance beyond these limitations, we propose WKVQuant, a post-training quantization (PTQ) framework especially designed for quantizing weights and the key/value (KV) cache of LLMs. Specifically, we incorporate past-only quantization to improve the computation of attention. Additionally, we introduce a two-dimensional quantization strategy to handle the distribution of the KV cache, along with…
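The abstract mentions past-only quantization but is cut off before defining it. The sketch below illustrates one plausible reading, stated here as an assumption: during decoding, the current token's key and value are used at full precision in the attention computation, while only previously cached entries are stored in low-bit form. The class `PastOnlyKVCache`, its `step` method, and the simple min/max quantizer are hypothetical names and choices for this example.

```python
import math


def quantize(vec, n_bits=4):
    """Asymmetric uniform quantization of one vector -> (codes, scale, offset)."""
    lo, hi = min(vec), max(vec)
    scale = (hi - lo) / ((1 << n_bits) - 1) or 1.0  # constant vector: any scale works
    return [round((x - lo) / scale) for x in vec], scale, lo


def dequantize(entry):
    """Recover approximate real values from a cached (codes, scale, offset) entry."""
    codes, scale, lo = entry
    return [c * scale + lo for c in codes]


def attend(q, keys, values):
    """Single-head scaled dot-product attention for one query vector."""
    d = len(q)
    scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in keys]
    m = max(scores)  # subtract the max for a numerically stable softmax
    weights = [math.exp(s - m) for s in scores]
    z = sum(weights)
    return [sum(w * v[j] for w, v in zip(weights, values)) / z
            for j in range(len(values[0]))]


class PastOnlyKVCache:
    """Assumed behavior: past K/V are held quantized; the current K/V stay full precision."""

    def __init__(self, n_bits=4):
        self.n_bits = n_bits
        self.k, self.v = [], []  # quantized entries for past tokens only

    def step(self, q, k_new, v_new):
        # Attention sees dequantized past entries plus the unquantized current token.
        keys = [dequantize(e) for e in self.k] + [k_new]
        values = [dequantize(e) for e in self.v] + [v_new]
        out = attend(q, keys, values)
        # Only after it has been used does the current token enter the cache in low-bit form.
        self.k.append(quantize(k_new, self.n_bits))
        self.v.append(quantize(v_new, self.n_bits))
        return out


cache = PastOnlyKVCache(n_bits=4)
first = cache.step([1.0, 0.0], [0.5, -0.5], [2.0, 3.0])
second = cache.step([0.0, 1.0], [0.1, 0.2], [4.0, 5.0])
```

Under this reading, each token contributes to its own attention step without quantization error, and the memory saving comes entirely from the stored history.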

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques