Latent-Condensed Transformer for Efficient Long Context Modeling
Zeng You, Yaofo Chen, Qiuwu Chen, Ying Sun, Shuhai Zhang, Yingjian Li, Yaowei Wang, Mingkui Tan

TL;DR
The paper introduces Latent-Condensed Attention (LCA), which condenses long-context information directly in the latent space of Multi-head Latent Attention (MLA), reducing both attention computation and KV cache size without adding parameters.
Contribution
LCA jointly optimizes context condensation and sparse attention, extends to various architectures, and offers theoretical guarantees alongside practical speedups in long-context modeling.
Findings
LCA achieves up to 2.5× speedup in prefilling.
LCA reduces KV cache by 90% at 128K context length.
LCA remains competitive in performance with existing methods.
Abstract
Large language models (LLMs) face significant challenges in processing long contexts due to the linear growth of the key-value (KV) cache and quadratic complexity of self-attention. Existing approaches address these bottlenecks separately: Multi-head Latent Attention (MLA) reduces the KV cache by projecting tokens into a low-dimensional latent space, while sparse attention reduces computation. However, sparse methods cannot operate natively on MLA's compressed latent structure, missing opportunities for joint optimization. In this paper, we propose Latent-Condensed Attention (LCA), which directly condenses context within MLA's latent space, where the representation is disentangled into semantic latent vectors and positional keys. LCA separately aggregates semantic vectors via query-aware pooling and preserves positional keys via anchor selection. This approach jointly reduces both…
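The two condensation steps named in the abstract, query-aware pooling of semantic latent vectors and anchor selection of positional keys, can be illustrated with a minimal NumPy sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: the function name `condense_block`, fixed-size contiguous blocks, softmax weights from a dot product with a single latent-space query, and choosing the highest-weight token of each block as its anchor. The paper's actual formulation may differ on all of these points.

```python
import numpy as np

def condense_block(latents, rope_keys, query, block_size):
    """Hypothetical sketch: condense T tokens into ceil(T / block_size) entries.

    latents:   (T, d_c) semantic latent vectors
    rope_keys: (T, d_r) positional keys
    query:     (d_c,)   query assumed to live in the latent space
    """
    T = latents.shape[0]
    pooled, anchors = [], []
    for start in range(0, T, block_size):
        block = latents[start:start + block_size]
        # Query-aware pooling (assumed form): softmax over scaled dot products.
        scores = block @ query / np.sqrt(block.shape[1])
        weights = np.exp(scores - scores.max())
        weights /= weights.sum()
        pooled.append(weights @ block)
        # Anchor selection (assumed rule): keep the positional key of the
        # highest-weight token unchanged instead of averaging positions.
        anchor = start + int(np.argmax(weights))
        anchors.append(rope_keys[anchor])
    return np.stack(pooled), np.stack(anchors)

rng = np.random.default_rng(0)
latents = rng.normal(size=(16, 8))
rope_keys = rng.normal(size=(16, 4))
query = rng.normal(size=8)
pooled, anchors = condense_block(latents, rope_keys, query, block_size=4)
print(pooled.shape, anchors.shape)
```

With 16 tokens and a block size of 4, the sketch keeps 4 pooled latent vectors and 4 positional keys, so the number of cached entries shrinks by the block size in this toy setting. The positional keys are copied rather than averaged, which mirrors the abstract's distinction between aggregating semantic vectors and preserving positional keys.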
