GPU-Accelerated INT8 Quantization for KV Cache Compression in Large Language Models
Maanas Taneja, Purab Shingvi

TL;DR
This paper presents GPU-accelerated INT8 quantization for compressing KV caches in large language models, cutting cache memory by 4× with minimal accuracy loss and running quantization up to 1,694× faster than CPU baselines.
Contribution
It introduces four CUDA kernel variants for INT8 quantization (naive, tiled, coarsened, and vectorized) and benchmarks them on workloads of up to 1 billion elements.
Findings
4× memory reduction with minimal accuracy loss
Up to 1,694× speedup over CPU implementations
Reconstruction error below 0.004 and attention error below 0.1
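The findings above rest on a quantize/dequantize round trip. The following is a minimal CPU sketch of that idea, not the paper's CUDA kernels. It assumes symmetric absmax scaling with one scale per row, which this page does not confirm, and the helper names (`quantize_row`, `dequantize_row`, `max_abs_error`) are hypothetical. Under that assumption the rounding error per element is at most half a quantization step, so inputs bounded by 1 in magnitude give a worst case of 1/254 ≈ 0.0039.

```python
import random

def quantize_row(row):
    """Symmetric absmax quantization of one row of floats to INT8 codes.

    Assumption: one scale per row, codes restricted to [-127, 127].
    """
    max_abs = max(abs(v) for v in row)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0  # avoid division by zero
    codes = [max(-127, min(127, round(v / scale))) for v in row]
    return codes, scale

def dequantize_row(codes, scale):
    """Map INT8 codes back to floats using the stored scale."""
    return [c * scale for c in codes]

def max_abs_error(original, restored):
    """Largest element-wise reconstruction error."""
    return max(abs(a - b) for a, b in zip(original, restored))

random.seed(0)
row = [random.uniform(-1.0, 1.0) for _ in range(128)]  # synthetic data
codes, scale = quantize_row(row)
restored = dequantize_row(codes, scale)
err = max_abs_error(row, restored)
print(f"scale={scale:.6f} max_abs_error={err:.6f}")
```

Each code fits in one byte, and the only extra storage is one scale per row, which is why the overhead is small when rows are long.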
Abstract
The key-value (KV) cache in large language models presents a significant memory bottleneck during inference, growing linearly with sequence length and often exceeding the memory footprint of model weights themselves. We implement and evaluate GPU-accelerated INT8 quantization for KV cache compression, achieving 4× memory reduction with minimal accuracy degradation. We develop four CUDA kernel variants -- naive, tiled, coarsened, and vectorized -- and benchmark them across realistic workload sizes up to 1 billion elements. Our vectorized kernel achieves up to 1,694× speedup over CPU baselines while maintaining reconstruction error below 0.004 and attention score error below 0.1 even for 8K-dimensional heads. These results demonstrate that INT8 quantization provides a practical approach for reducing memory pressure in LLM inference with negligible computational overhead…
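The linear growth and the 4× figure in the abstract can be checked with simple arithmetic. The sketch below uses a hypothetical model configuration (32 layers, 32 KV heads, head dimension 128) that does not come from the paper, and it assumes a 4-byte FP32 baseline, since a 2-byte FP16 baseline would give 2× instead. Per-row scale storage is ignored.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim,
                   seq_len, batch_size, bytes_per_element):
    """Size of the KV cache in bytes.

    The factor 2 accounts for storing both keys and values.
    """
    return (2 * num_layers * num_kv_heads * head_dim
            * seq_len * batch_size * bytes_per_element)

# Hypothetical configuration, chosen only for illustration.
config = dict(num_layers=32, num_kv_heads=32, head_dim=128, batch_size=1)

fp32_4k = kv_cache_bytes(seq_len=4096, bytes_per_element=4, **config)
int8_4k = kv_cache_bytes(seq_len=4096, bytes_per_element=1, **config)
fp32_8k = kv_cache_bytes(seq_len=8192, bytes_per_element=4, **config)

gib = 1024 ** 3
print(f"FP32 @ 4K tokens: {fp32_4k / gib:.1f} GiB")
print(f"INT8 @ 4K tokens: {int8_4k / gib:.1f} GiB")
print(f"FP32 @ 8K tokens: {fp32_8k / gib:.1f} GiB")
```

Doubling the sequence length doubles the cache, while switching the element type from 4 bytes to 1 byte divides it by four at any length.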
Taxonomy
Topics: Parallel Computing and Optimization Techniques · Natural Language Processing Techniques · Big Data and Digital Economy
