LaCache: Ladder-Shaped KV Caching for Efficient Long-Context Modeling of Large Language Models

Dachuan Shi; Yonggan Fu; Xiangchi Yuan; Zhongzhi Yu; Haoran You; Sixu Li; Xin Dong; Jan Kautz; Pavlo Molchanov; Yingyan (Celine) Lin

arXiv:2507.14204 · cs.LG · July 22, 2025

TL;DR

LaCache introduces a ladder-shaped KV cache and iterative compression to improve long-range context handling in LLMs, enabling efficient, memory-constrained continuous generation without sacrificing performance.

Contribution

LaCache presents a novel, training-free KV cache structure with dynamic compression, enhancing long-range modeling and continuous generation in LLMs under fixed memory budgets.

Findings

01 Significantly improves the capture of long-range dependencies in LLMs.

02 Enables continuous generation without out-of-memory errors.

03 Validated across multiple models and benchmarks.

Abstract

Recent advancements in Large Language Models (LLMs) have spurred interest in numerous applications requiring robust long-range capabilities, essential for processing extensive input contexts and continuously generating extended outputs. As sequence lengths increase, the number of Key-Value (KV) pairs in LLMs escalates, creating a significant efficiency bottleneck. In this paper, we propose a new KV cache optimization paradigm called LaCache, a training-free method for efficient and accurate generative inference of LLMs. LaCache enables LLMs to simultaneously address both of the critical challenges in long-range modeling: robust long-range capabilities and continuous generation without running out of memory (OOM). Specifically, LaCache integrates two key innovations: (1) a ladder-shaped KV cache pattern that stores KV pairs not only sequentially (left-to-right within each layer) but also…
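
As a rough illustration of why the number of KV pairs becomes a bottleneck as sequences grow, the sketch below estimates the size of an uncompressed KV cache for a standard decoder-only transformer. The function name and the model dimensions (32 layers, 32 KV heads, head dimension 128, 16-bit values) are illustrative assumptions, not figures from the paper, and the code does not implement LaCache itself.

```python
def kv_cache_bytes(seq_len, num_layers, num_kv_heads, head_dim,
                   bytes_per_value=2, batch_size=1):
    """Estimate the memory of an uncompressed KV cache in bytes.

    Each layer stores one key and one value vector per token and per KV head,
    hence the leading factor of 2. All arguments are hypothetical model
    settings chosen for illustration.
    """
    return (2 * batch_size * num_layers * num_kv_heads
            * head_dim * seq_len * bytes_per_value)


# Hypothetical model: 32 layers, 32 KV heads, head dimension 128, fp16 values.
per_token = kv_cache_bytes(1, num_layers=32, num_kv_heads=32, head_dim=128)

# The cache grows linearly with the number of cached tokens.
for seq_len in (4_096, 32_768, 131_072):
    total = kv_cache_bytes(seq_len, num_layers=32, num_kv_heads=32, head_dim=128)
    print(f"{seq_len:>7} tokens: {total / 2**30:.1f} GiB")
```

Under these assumed dimensions each cached token costs 0.5 MiB, so the cache grows without bound during continuous generation unless some entries are evicted or compressed, which is the setting a fixed memory budget addresses.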
