KV-Latent: Dimensional-level KV Cache Reduction with Frequency-aware Rotary Positional Embedding
Luohe Shi, Zuchao Li, Lefei Zhang, Guoming Liu, Baoyuan Qi, Hai Zhao

TL;DR
This paper introduces KV-Latent, a method that reduces KV cache size in Transformer-based LLMs by down-sampling Key-Value vectors into a lower-dimensional latent space, improving inference efficiency with minimal additional training.
Contribution
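The paper proposes the KV-Latent paradigm, which significantly reduces KV cache size and improves inference speed while requiring extra training that amounts to less than 1% of pre-training. It also improves the stability of Rotary Positional Embedding on lower-dimensional vectors by modifying how its frequencies are sampled.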
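To make the size claim concrete, here is a minimal sketch of how KV cache memory scales with the per-head Key and Value dimensions. The function name `kv_cache_bytes`, the model shape (32 layers, 32 KV heads, 4096 tokens, fp16), and the reduced dimension of 64 are all hypothetical values chosen for illustration; none of them are taken from the paper.

```python
def kv_cache_bytes(layers, kv_heads, k_dim, v_dim, seq_len, batch=1, bytes_per_elem=2):
    """Approximate KV cache size in bytes for a decoder-only Transformer.

    Each layer stores one Key vector (k_dim) and one Value vector (v_dim)
    per KV head, per token, per sequence in the batch.
    """
    per_token = layers * kv_heads * (k_dim + v_dim)
    return per_token * seq_len * batch * bytes_per_elem


# Hypothetical baseline: 128-dimensional Keys and Values, fp16 (2 bytes).
baseline = kv_cache_bytes(layers=32, kv_heads=32, k_dim=128, v_dim=128, seq_len=4096)

# Hypothetical latent setting: both dimensions down-sampled to 64.
latent = kv_cache_bytes(layers=32, kv_heads=32, k_dim=64, v_dim=64, seq_len=4096)

print(f"baseline: {baseline / 2**30:.1f} GiB, latent: {latent / 2**30:.1f} GiB")
```

Because the footprint is linear in `k_dim + v_dim`, halving both dimensions halves the cache, and the same factor applies to the data that must be read from memory at every decoding step.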
Findings
Significant reduction in KV cache footprint.
Improved inference speed with minimal training overhead.
Enhanced stability of Rotary Positional Embedding.
Abstract
Large language models (LLMs) based on Transformer Decoders have become the preferred choice for conversational generative AI. Despite the overall superiority of the Decoder architecture, the gradually growing Key-Value (KV) cache during inference has emerged as a primary efficiency bottleneck, in terms of both memory consumption and data transfer bandwidth. To address these challenges, we propose a paradigm called KV-Latent. By down-sampling the Key-Value vector dimensions into a latent space, we can significantly reduce the KV cache footprint and improve inference speed with only a small amount of extra training, less than 1% of what pre-training takes. In addition, we enhance the stability of Rotary Positional Embedding applied to lower-dimensional vectors by modifying its frequency sampling mechanism, avoiding noise introduced by higher frequencies while retaining position…
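The frequency-sampling idea can be illustrated with a small sketch. Standard Rotary Positional Embedding assigns each pair of dimensions a rotation frequency of base^(-2i/d). The exact sampling rule used by the paper is not given in this summary, so the `low_frequency_rope` variant below is purely an illustrative assumption: it keeps only the lowest frequencies of a full-dimension spectrum when the head dimension shrinks. Both function names and the dimensions 128 and 64 are hypothetical.

```python
def rope_frequencies(dim, base=10000.0):
    """Standard RoPE frequencies: one per pair of dimensions, highest first."""
    return [base ** (-2.0 * i / dim) for i in range(dim // 2)]


def low_frequency_rope(latent_dim, full_dim, base=10000.0):
    """Illustrative assumption, not the paper's formula.

    Take the spectrum of the full head dimension and keep only its
    latent_dim // 2 lowest frequencies, discarding the highest ones.
    """
    full = rope_frequencies(full_dim, base)
    return full[-(latent_dim // 2):]


standard = rope_frequencies(64)          # naive RoPE at the reduced dimension
filtered = low_frequency_rope(64, 128)   # low-frequency subset of the 128-dim spectrum

print(f"highest frequency, standard: {standard[0]:.4f}")
print(f"highest frequency, filtered: {filtered[0]:.4f}")
```

In this sketch the reduced head still receives 32 rotation frequencies, but the fastest-rotating ones are dropped. That is one way to read the goal stated in the abstract; the paper's actual mechanism may differ.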
Taxonomy
Topics: Advanced Data Storage Technologies · Algorithms and Data Compression · Caching and Content Delivery
Methods: Dropout · Label Smoothing · Byte Pair Encoding · Absolute Position Encodings · Layer Normalization · Dense Connections · Softmax · Transformer
