KV-Latent: Dimensional-level KV Cache Reduction with Frequency-aware Rotary Positional Embedding

Luohe Shi; Zuchao Li; Lefei Zhang; Guoming Liu; Baoyuan Qi; Hai Zhao

arXiv:2507.11273 · cs.CL · July 16, 2025

TL;DR

This paper introduces KV-Latent, a method to reduce KV cache size in Transformer-based LLMs by down-sampling vectors into a latent space, enhancing inference efficiency with minimal additional training.

Contribution

The paper proposes the KV-Latent paradigm, which significantly reduces KV cache size and improves inference speed at the cost of a small amount of extra training (less than 1% of what pre-training takes), and it improves the stability of Rotary Positional Embedding on lower-dimensional vectors.

Findings

01. Significant reduction in KV cache footprint.

02. Improved inference speed with minimal training overhead.

03. Enhanced stability of Rotary Positional Embedding.
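
To make the third finding more concrete, the sketch below prints the standard Rotary Positional Embedding frequency schedule, theta_i = base^(-2i/d), and contrasts it with a purely hypothetical low-frequency-biased schedule. The function `low_frequency_subset` and the dimensions 128 and 32 are assumptions made for illustration only; this page does not describe the paper's actual frequency sampling mechanism, so this should not be read as the authors' method.

```python
def rope_frequencies(head_dim, base=10000.0):
    """Standard RoPE schedule: theta_i = base^(-2i/d) for i = 0 .. d/2 - 1."""
    return [base ** (-2.0 * i / head_dim) for i in range(head_dim // 2)]


def low_frequency_subset(full_dim, latent_dim, base=10000.0):
    """Hypothetical variant, NOT taken from the paper: keep only the
    latent_dim/2 lowest frequencies of the full_dim schedule."""
    freqs = rope_frequencies(full_dim, base)  # sorted from highest to lowest
    return freqs[-(latent_dim // 2):]


# Assumed dimensions for illustration: original head size 128, latent size 32.
standard = rope_frequencies(32)
biased = low_frequency_subset(128, 32)

# The standard schedule always starts at frequency 1.0 (one radian per token),
# while the hypothetical subset contains only slowly rotating pairs.
print("highest standard frequency:", max(standard))
print("highest biased frequency:  ", max(biased))
```

Both schedules contain the same number of rotation pairs, so the cached key keeps the same latent size; only the rotation speeds differ. Dropping every fast-rotating pair would also discard fine-grained position information, which is why any real scheme has to balance noise against positional resolution.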

Abstract

Large language models (LLMs) based on Transformer Decoders have become the preferred choice for conversational generative AI. Despite the overall superiority of the Decoder architecture, the gradually increasing Key-Value (KV) cache during inference has emerged as a primary efficiency bottleneck, in terms of both memory consumption and data transfer bandwidth. To address these challenges, we propose a paradigm called KV-Latent. By down-sampling the Key-Value vector dimensions into a latent space, we can significantly reduce the KV Cache footprint and improve inference speed with only a small amount of extra training, less than 1% of what pre-training takes. In addition, we enhance the stability of Rotary Positional Embedding applied to lower-dimensional vectors by modifying its frequency sampling mechanism, avoiding noise introduced by higher frequencies while retaining position…
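
The footprint argument in the abstract can be checked with simple arithmetic: the cache stores one key vector and one value vector per layer, per head, per token, so its size scales linearly with the per-head key and value dimensions. The sketch below is a back-of-the-envelope calculation, not code from the paper; the helper `kv_cache_bytes` and the model shape (32 layers, 32 KV heads, 4096 tokens, 16-bit elements) are assumptions chosen for illustration.

```python
def kv_cache_bytes(num_layers, num_kv_heads, key_dim, value_dim,
                   seq_len, batch_size=1, bytes_per_elem=2):
    """Bytes needed to cache keys and values for every layer and token."""
    # Each cached token holds key_dim + value_dim elements per head.
    per_token = num_kv_heads * (key_dim + value_dim)
    return num_layers * per_token * seq_len * batch_size * bytes_per_elem


# Hypothetical model shape, not taken from the paper.
shape = dict(num_layers=32, num_kv_heads=32, seq_len=4096)

full = kv_cache_bytes(key_dim=128, value_dim=128, **shape)
latent = kv_cache_bytes(key_dim=64, value_dim=64, **shape)

print(f"full cache:   {full / 2**30:.2f} GiB")
print(f"latent cache: {latent / 2**30:.2f} GiB")
print(f"ratio:        {latent / full:.2f}")
```

Under these assumed numbers, halving both per-head dimensions halves the cache, and the same proportional saving applies to the bytes that must be read from memory at each decoding step.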

Code & Models

Repositories

shiluohe/kv-latent (PyTorch, official)

Taxonomy

Topics: Advanced Data Storage Technologies · Algorithms and Data Compression · Caching and Content Delivery

Methods: Dropout · Label Smoothing · Byte Pair Encoding · Absolute Position Encodings · Layer Normalization · Dense Connections · Softmax · Transformer