KVSink: Understanding and Enhancing the Preservation of Attention Sinks in KV Cache Quantization for LLMs

Zunhai Su; Kehong Yuan

arXiv:2508.04257·cs.CL·August 7, 2025

TL;DR

This paper investigates how attention sinks behave in large language models during inference, analyzes how they interact with KV cache quantization, and introduces KVSink, a method that better preserves attention sinks under quantization, improving the accuracy of quantized models.

Contribution

It provides a detailed understanding of attention sinks in LLMs and proposes KVSink, a novel, low-overhead method for better preservation of attention sinks during KV cache quantization.

Findings

01

KVSink outperforms the Preserve-First-N strategy in preserving attention sinks.

02

Applying KVSink to KVQuant improves perplexity and reduces outliers.

03

A better understanding of attention sink dynamics leads to better quantization strategies.

Abstract

Key-Value (KV) cache quantization has become a widely adopted optimization technique for efficient large language model (LLM) inference, reducing KV cache memory usage and mitigating memory-bound constraints. Recent studies have emphasized the importance of preserving the original precision of KVs for the first few tokens to protect attention sinks. While this approach has proven effective in mitigating performance degradation, its underlying principles remain insufficiently understood. Moreover, it fails to address the recent discovery that attention sinks can emerge beyond the initial token positions. In this work, we elucidate the underlying mechanisms of attention sinks during inference by examining their role in the cross-layer evolution of extreme activation outliers. Additionally, we provide a comprehensive analysis of the interplay between attention sinks…
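
As a rough illustration of the baseline the abstract describes (keeping the KVs of the first few tokens at original precision), the sketch below quantizes a toy key cache while leaving the first `n_keep` positions untouched. This is not KVSink itself. The function names, the per-token min-max quantizer, and the toy values are all assumptions made for illustration; real KV cache quantizers use different schemes.

```python
from typing import List


def quantize_dequantize(vec: List[float], bits: int) -> List[float]:
    """Uniform min-max quantization of one vector, returned in dequantized form.

    Illustrative quantizer only (an assumption, not the scheme of any paper).
    """
    lo, hi = min(vec), max(vec)
    if hi == lo:
        return list(vec)  # constant vector: nothing to quantize
    levels = (1 << bits) - 1          # number of steps between lo and hi
    scale = (hi - lo) / levels
    return [lo + round((x - lo) / scale) * scale for x in vec]


def preserve_first_n(cache: List[List[float]], n_keep: int, bits: int) -> List[List[float]]:
    """Quantize every token's vector except the first n_keep positions."""
    return [
        list(vec) if pos < n_keep else quantize_dequantize(vec, bits)
        for pos, vec in enumerate(cache)
    ]


# Toy key cache: one vector per token position (hypothetical values).
keys = [
    [50.0, -0.1, 0.2, 0.05],  # position 0: sink-like token with one extreme channel
    [0.3, -0.4, 0.1, 0.2],
    [0.25, 0.5, -0.3, 0.1],
]

# Position 0 keeps its original values; positions 1 and 2 are quantized to 2 bits.
protected = preserve_first_n(keys, n_keep=1, bits=2)

# For contrast: quantizing the sink-like vector lets the outlier set the range,
# so the three small channels all fall onto the same quantization level.
sink_quantized = quantize_dequantize(keys[0], bits=2)
```

Because the protected positions are fixed in advance, a scheme like this cannot cover a sink that appears at a later position, which is the limitation the abstract points to.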
