KLLM: Fast LLM Inference with K-Means Quantization

Xueying Wu, Baijun Zhou, Zhihui Gao, Yuzhe Fu, Qilin Zheng, Yintao He, and Hai Li

arXiv:2507.23035 · cs.LG · September 11, 2025

TL;DR

KLLM is an LLM inference accelerator built around K-Means quantization of both weights and activations. It reduces memory and computation demands while addressing two obstacles to K-Means-based inference: the non-uniform structure of the quantized data and activation outliers.

Contribution

The paper presents KLLM, an accelerator that combines index-based computation with a lightweight outlier detection engine to run K-Means-quantized LLM inference efficiently.
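As a rough illustration of what index-based computation can mean, the sketch below evaluates a dot product directly on centroid indices. It relies on the identity that the sum of w_codebook[p] * a_codebook[q] over all positions equals the sum over index pairs (p, q) of (number of occurrences of that pair) times the centroid product. This is only one possible reading, not a description of KLLM's actual datapath; the function names and the toy codebooks are hypothetical.

```python
def index_dot(w_idx, a_idx, w_codebook, a_codebook):
    """Dot product computed on indices: count index pairs, then weight the
    counts by centroid products. Hypothetical sketch, not KLLM's design."""
    kw, ka = len(w_codebook), len(a_codebook)
    counts = [[0] * ka for _ in range(kw)]
    for p, q in zip(w_idx, a_idx):
        counts[p][q] += 1  # integer-only work per element
    # Only kw * ka floating-point multiply-accumulates, independent of vector length.
    return sum(counts[p][q] * w_codebook[p] * a_codebook[q]
               for p in range(kw) for q in range(ka))


def dequant_dot(w_idx, a_idx, w_codebook, a_codebook):
    """Reference path: dequantize every element, then multiply in floating point."""
    return sum(w_codebook[p] * a_codebook[q] for p, q in zip(w_idx, a_idx))


# Toy 2-bit codebooks (4 centroids each); the values are made up for illustration.
w_codebook = [-0.5, 0.0, 0.25, 1.0]
a_codebook = [-1.0, 0.5, 2.0, 4.0]
w_idx = [0, 3, 2, 1, 3, 0, 2, 2]
a_idx = [1, 1, 0, 3, 2, 1, 3, 0]

fast = index_dot(w_idx, a_idx, w_codebook, a_codebook)
ref = dequant_dot(w_idx, a_idx, w_codebook, a_codebook)
print(fast, ref)
```

With small codebooks, the per-element work reduces to counting, and the floating-point cost depends on the codebook sizes rather than on the vector length.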

Findings

01  Achieves high accuracy with non-uniform K-Means quantization.

02  Reduces inference latency and memory footprint.

03  Effectively handles activation outliers during online inference.

Abstract

Large language model (LLM) inference poses significant challenges due to its intensive memory and computation demands. Weight and activation quantization (WAQ) offers a promising solution by reducing both memory footprint and arithmetic complexity. Traditional WAQ designs rely on uniform integer quantization for hardware efficiency, but often suffer from significant model performance degradation at low precision. In contrast, K-Means quantization, a non-uniform technique, achieves higher accuracy by aligning with the Gaussian-like distributions of weights and activations in LLMs. However, two key challenges prevent the efficient deployment of K-Means-based WAQ designs for LLM inference: (1) The non-uniform structure of K-Means-quantized data precludes direct execution on low-precision compute units, necessitating dequantization and floating-point matrix multiplications (MatMuls) during…
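To make the abstract's contrast between uniform and K-Means quantization concrete, here is a minimal sketch that quantizes synthetic Gaussian values to 16 levels both ways and compares the mean squared error. It assumes a plain 1-D Lloyd's algorithm started from the uniform grid, so the K-Means error cannot end up above the uniform one. The synthetic data, function names, and iteration count are illustrative assumptions and are not taken from the paper.

```python
import random


def nearest(levels, v):
    """Index of the quantization level closest to v."""
    return min(range(len(levels)), key=lambda j: abs(levels[j] - v))


def uniform_levels(values, k):
    """k evenly spaced levels spanning [min, max] (a uniform quantization grid)."""
    lo, hi = min(values), max(values)
    step = (hi - lo) / (k - 1)
    return [lo + i * step for i in range(k)]


def kmeans_levels(values, k, iters=20):
    """1-D Lloyd's algorithm, started from the uniform grid."""
    levels = uniform_levels(values, k)
    for _ in range(iters):
        sums, counts = [0.0] * k, [0] * k
        for v in values:
            j = nearest(levels, v)
            sums[j] += v
            counts[j] += 1
        # Move each level to the mean of its cluster; empty clusters stay put.
        levels = [sums[j] / counts[j] if counts[j] else levels[j] for j in range(k)]
    return levels


def mse(values, levels):
    """Mean squared error after snapping each value to its nearest level."""
    return sum((v - levels[nearest(levels, v)]) ** 2 for v in values) / len(values)


random.seed(0)
weights = [random.gauss(0.0, 1.0) for _ in range(2000)]  # synthetic stand-in for weights
k = 16  # 16 levels, i.e. 4-bit indices
uniform_err = mse(weights, uniform_levels(weights, k))
kmeans_err = mse(weights, kmeans_levels(weights, k))
print(f"uniform MSE: {uniform_err:.5f}  K-Means MSE: {kmeans_err:.5f}")
```

The K-Means levels crowd toward zero, where most Gaussian samples lie, while the uniform grid spends as many levels on the sparse tails as on the dense center.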