CacheFocus: Dynamic Cache Re-Positioning for Efficient Retrieval-Augmented Generation
Kun-Hui Lee, Eunhwan Park, Donghoon Han, Seung-Hoon Na

TL;DR
CacheFocus improves retrieval-augmented generation by dynamically re-positioning cached keys, enabling efficient long-input processing without additional training and outperforming existing approaches on Natural Questions and TriviaQA.
Contribution
The paper introduces CacheFocus, a training-free cache re-positioning technique that enhances length normalization and reduces inference latency for long-context LLMs.
Findings
Outperforms existing methods on Natural Questions and TriviaQA datasets.
Maintains performance with input lengths exceeding 4K tokens.
Effectively manages long-text generation without degradation.
Abstract
Large Language Models (LLMs) excel across a variety of language tasks yet are constrained by limited input lengths and high computational costs. Existing approaches, such as relative positional encodings (e.g., RoPE, ALiBi) and sliding window mechanisms, partially alleviate these issues but often require additional training or suffer from performance degradation with longer inputs. In this paper, we introduce CacheFocus, a method that enhances length normalization and reduces inference latency without any further training. Our approach leverages query-independent, offline caching to efficiently reuse a Context KV Cache Store. We address the problem of amplified abnormal token distributions by re-positioning cached keys and introducing Layer-Adaptive Cache Pruning to discard low-relevance caches during pre-filling. Additionally, our Adaptive…
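To make the idea of re-positioning cached keys concrete, here is a minimal sketch, not the paper's implementation. It assumes keys were cached after a RoPE-style rotary encoding with interleaved dimension pairs. Since a rotary encoding is a per-pair rotation whose angle is linear in the position, a key cached at one position can be moved to another by rotating it through the position offset, without recomputing it from the raw key. The function names `rope_rotate` and `reposition_key`, and the `base` default, are illustrative assumptions.

```python
import math


def rope_rotate(vec, position, base=10000.0):
    """Apply a RoPE-style rotation for `position` to an even-length vector.

    Dimensions are paired as (0, 1), (2, 3), ...; the pair starting at
    index i is rotated by the angle position * base ** (-i / dim).
    """
    dim = len(vec)
    if dim % 2 != 0:
        raise ValueError("vector length must be even")
    out = [0.0] * dim
    for i in range(0, dim, 2):
        theta = position * base ** (-i / dim)
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out[i] = x * c - y * s
        out[i + 1] = x * s + y * c
    return out


def reposition_key(cached_key, old_pos, new_pos, base=10000.0):
    """Move a key cached at `old_pos` to `new_pos` (hypothetical helper).

    Rotations compose additively in the angle, so rotating the cached key
    by the offset (new_pos - old_pos) matches encoding the raw key at new_pos.
    """
    return rope_rotate(cached_key, new_pos - old_pos, base)


# Usage: a key encoded offline at position 5 is moved to position 42.
raw_key = [0.1, -0.2, 0.3, 0.4]
cached = rope_rotate(raw_key, 5)
moved = reposition_key(cached, 5, 42)
direct = rope_rotate(raw_key, 42)
max_diff = max(abs(a - b) for a, b in zip(moved, direct))
```

In this sketch `moved` and `direct` should agree up to floating-point error, which is what lets an offline, query-independent cache be reused at whatever positions a later prompt assigns to the passage.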
Taxonomy
Topics: Caching and Content Delivery · Algorithms and Data Compression · Advanced Data Storage Technologies
Methods: Pruning
