KLLM: Fast LLM Inference with K-Means Quantization
Xueying Wu, Baijun Zhou, Zhihui Gao, Yuzhe Fu, Qilin Zheng, Yintao He, and Hai Li

TL;DR
KLLM is an LLM inference accelerator that applies K-Means quantization to both weights and activations, reducing memory and computation demands while handling activation outliers and the non-uniform structure of K-Means-quantized data.
Contribution
The paper presents KLLM, an accelerator that combines index-based computation with a lightweight outlier detection engine to run K-Means-quantized LLM inference efficiently.
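To make "index-based computation" concrete, here is a minimal Python sketch of one way a dot product could be evaluated directly on codebook indices. It is an illustration under assumptions, not KLLM's confirmed scheme: the function names, codebooks, and index vectors are all hypothetical. The idea is to count how often each (weight index, activation index) pair occurs, then multiply each centroid pair once.

```python
def dot_dequantized(w_idx, a_idx, w_book, a_book):
    """Baseline: look up both centroids and multiply for every element."""
    return sum(w_book[i] * a_book[j] for i, j in zip(w_idx, a_idx))


def dot_index_based(w_idx, a_idx, w_book, a_book):
    """Hypothetical index-based variant: count index pairs, then multiply once per pair."""
    counts = [[0] * len(a_book) for _ in w_book]
    for i, j in zip(w_idx, a_idx):
        counts[i][j] += 1  # integer increment instead of a floating-point multiply
    return sum(
        counts[i][j] * w_book[i] * a_book[j]
        for i in range(len(w_book))
        for j in range(len(a_book))
    )


# Made-up 2-bit codebooks and index vectors, for illustration only.
w_book = [-0.5, 0.0, 0.5, 1.0]
a_book = [-1.0, 0.0, 2.0, 4.0]
w_idx = [0, 2, 3, 1, 2, 0]
a_idx = [2, 2, 0, 3, 1, 2]

baseline = dot_dequantized(w_idx, a_idx, w_book, a_book)
indexed = dot_index_based(w_idx, a_idx, w_book, a_book)
print(baseline, indexed)
```

With codebooks of size Kw and Ka, the floating-point multiplications depend on Kw × Ka rather than on the vector length, which is why this kind of reformulation pays off for long vectors and small codebooks.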
Findings
Achieves high accuracy with non-uniform K-Means quantization.
Reduces inference latency and memory footprint.
Effectively handles activation outliers during online inference.
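As a rough illustration of what handling activation outliers can involve, the following Python sketch separates the largest-magnitude activations from the rest so that only the remaining values need to be quantized. The top-k selection rule and the names `split_outliers` and `merge` are assumptions made for this example; the summary does not describe how KLLM's detection engine actually selects outliers.

```python
import heapq


def split_outliers(acts, num_outliers):
    """Return (inliers, outliers); outliers maps position -> original value."""
    top = heapq.nlargest(num_outliers, range(len(acts)), key=lambda n: abs(acts[n]))
    outliers = {n: acts[n] for n in top}
    # Outlier positions are zeroed so they do not stretch the quantization range.
    inliers = [0.0 if n in outliers else x for n, x in enumerate(acts)]
    return inliers, outliers


def merge(inliers, outliers):
    """Recombine the two parts into the original activation vector."""
    return [outliers.get(n, x) for n, x in enumerate(inliers)]


# Made-up activation vector with two large-magnitude entries.
acts = [0.1, -0.3, 12.0, 0.2, -9.5, 0.05]
inliers, outliers = split_outliers(acts, num_outliers=2)
print(inliers, outliers)
```

In this sketch the inlier part spans a much narrower range than the full vector, so a small codebook can represent it more precisely, while the few outliers are kept at their original precision.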
Abstract
Large language model (LLM) inference poses significant challenges due to its intensive memory and computation demands. Weight and activation quantization (WAQ) offers a promising solution by reducing both memory footprint and arithmetic complexity. Traditional WAQ designs rely on uniform integer quantization for hardware efficiency, but often suffer from significant model performance degradation at low precision. In contrast, K-Means quantization, a non-uniform technique, achieves higher accuracy by aligning with the Gaussian-like distributions of weights and activations in LLMs. However, two key challenges prevent the efficient deployment of K-Means-based WAQ designs for LLM inference: (1) The non-uniform structure of K-Means-quantized data precludes direct execution on low-precision compute units, necessitating dequantization and floating-point matrix multiplications (MatMuls) during…
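The K-Means quantization the abstract refers to can be sketched in a few lines of Python. This is a minimal, assumption-laden example rather than the paper's implementation: it runs plain Lloyd's algorithm on scalar values, and the quantile initialization, the 16-entry (4-bit) codebook, and the synthetic Gaussian "weights" are all choices made for illustration.

```python
import random


def kmeans_1d(values, k, iters=20):
    """Lloyd's algorithm on scalars; returns (centroids, indices)."""
    ordered = sorted(values)
    # Assumed initialization: centroids at evenly spaced quantiles of the data.
    centroids = [ordered[(2 * c + 1) * len(ordered) // (2 * k)] for c in range(k)]

    def assign():
        # Nearest-centroid index for every value.
        return [min(range(k), key=lambda c: abs(v - centroids[c])) for v in values]

    for _ in range(iters):
        indices = assign()
        for c in range(k):
            members = [v for v, i in zip(values, indices) if i == c]
            if members:  # leave an empty cluster's centroid where it is
                centroids[c] = sum(members) / len(members)
    return centroids, assign()


def dequantize(centroids, indices):
    """Map stored indices back to their floating-point centroid values."""
    return [centroids[i] for i in indices]


random.seed(0)
weights = [random.gauss(0.0, 1.0) for _ in range(1000)]  # synthetic stand-in data
codebook, codes = kmeans_1d(weights, k=16)               # 16 centroids = 4-bit indices
restored = dequantize(codebook, codes)
mse = sum((w - r) ** 2 for w, r in zip(weights, restored)) / len(weights)
print(len(codebook), mse)
```

Only the small codebook and the per-element indices need to be stored. Because the centroids are not evenly spaced, arithmetic on the indices themselves is meaningless, so a conventional pipeline must call something like `dequantize` before each MatMul; that is the overhead the abstract identifies.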
