GPU-Accelerated INT8 Quantization for KV Cache Compression in Large Language Models

Maanas Taneja; Purab Shingvi

arXiv:2601.04719 · cs.LG · January 9, 2026

TL;DR

This paper implements and evaluates GPU-accelerated INT8 quantization for compressing the KV cache in large language models, cutting cache memory by 4× with minimal accuracy loss and negligible computational overhead.

Contribution

It introduces four CUDA kernel variants for INT8 quantization (naive, tiled, coarsened, and vectorized) and benchmarks them on workloads of up to 1 billion elements.

Findings

01

4× memory reduction with minimal accuracy loss

02

Up to 1,694× speedup over CPU implementations

03

Reconstruction error below 0.004 and attention score error below 0.1, even for 8K-dimensional heads
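
The 4× figure in the first finding can be checked with simple arithmetic. The sketch below assumes an FP32 baseline (4 bytes per element, which is what a 4× ratio against 1-byte INT8 implies) and ignores the small extra storage for quantization scales. The function name `kv_cache_bytes` and the model shape are hypothetical, chosen only for illustration; they do not come from the paper.

```python
def kv_cache_bytes(num_layers, num_heads, head_dim, seq_len, bytes_per_elem, batch=1):
    """Bytes needed to cache keys and values for every layer and head."""
    # Factor of 2: one tensor for keys and one for values.
    return 2 * batch * num_layers * num_heads * head_dim * seq_len * bytes_per_elem


# Hypothetical model shape (assumption, not taken from the paper).
shape = dict(num_layers=32, num_heads=32, head_dim=128, seq_len=4096)

fp32_bytes = kv_cache_bytes(**shape, bytes_per_elem=4)  # assumed FP32 baseline
int8_bytes = kv_cache_bytes(**shape, bytes_per_elem=1)  # INT8 codes, scales ignored

print(f"FP32 cache: {fp32_bytes / 2**30:.1f} GiB")
print(f"INT8 cache: {int8_bytes / 2**30:.1f} GiB")
print(f"Reduction:  {fp32_bytes / int8_bytes:.1f}x")
```

For this shape the cache shrinks from 4 GiB to 1 GiB. Because `seq_len` enters as a plain factor, doubling the sequence length doubles both numbers, which is the linear growth the abstract describes.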

Abstract

The key-value (KV) cache in large language models presents a significant memory bottleneck during inference, growing linearly with sequence length and often exceeding the memory footprint of model weights themselves. We implement and evaluate GPU-accelerated INT8 quantization for KV cache compression, achieving 4$\times$ memory reduction with minimal accuracy degradation. We develop four CUDA kernel variants (naive, tiled, coarsened, and vectorized) and benchmark them across realistic workload sizes up to 1 billion elements. Our vectorized kernel achieves up to 1,694$\times$ speedup over CPU baselines while maintaining reconstruction error below 0.004 and attention score error below 0.1 even for 8K-dimensional heads. These results demonstrate that INT8 quantization provides a practical approach for reducing memory pressure in LLM inference with negligible computational overhead…
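
To make the quantize/dequantize round trip concrete, here is a minimal CPU reference sketch in plain Python. It assumes symmetric per-row absmax scaling, a common INT8 scheme. The abstract does not say which scaling granularity the paper's kernels use, so treat this as an illustration of the general technique rather than the authors' method. The helper names `quantize_row`, `dequantize_row`, and `max_abs_error` are hypothetical.

```python
def quantize_row(row):
    """Symmetric absmax quantization of one row of floats to INT8 codes.

    Assumption: one scale per row, codes clamped to [-127, 127].
    """
    absmax = max(abs(x) for x in row)
    # Guard against an all-zero row, where any scale reproduces the input.
    scale = absmax / 127.0 if absmax > 0 else 1.0
    codes = [max(-127, min(127, round(x / scale))) for x in row]
    return codes, scale


def dequantize_row(codes, scale):
    """Map INT8 codes back to approximate float values."""
    return [c * scale for c in codes]


def max_abs_error(original, restored):
    """Largest element-wise reconstruction error."""
    return max(abs(a - b) for a, b in zip(original, restored))


row = [0.5, -1.0, 0.25, 0.0, 0.75]
codes, scale = quantize_row(row)
restored = dequantize_row(codes, scale)
err = max_abs_error(row, restored)

print("codes:", codes)
print("max reconstruction error:", err)
print("rounding bound (scale / 2):", scale / 2)
```

With round-to-nearest, each element's error is bounded by half the scale. For a row whose largest magnitude is 1.0, that bound is 1/254, roughly 0.0039. Whether that setting is what lies behind the reported reconstruction error is not stated in the abstract.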

Taxonomy

Topics: Parallel Computing and Optimization Techniques · Natural Language Processing Techniques · Big Data and Digital Economy