TL;DR
This paper introduces attention-state memory, a training-free method that externalizes long prefixes into a lightweight, lookup-based memory, improving the efficiency and accuracy of long-context generation with large language models.
Contribution
The paper proposes externalizing a long conditioning prefix into a lookup-based memory of precomputed attention states between prefix and query tokens, with no additional training.
Findings
Improves accuracy over in-context learning with 1K-8K memory budgets.
Reduces attention latency by 1.36x at 8K memory.
Surpasses full-attention RAG performance on the NBA benchmark with only 20% memory.
Abstract
Modern large language model (LLM) applications increasingly rely on long conditioning prefixes to control model behavior at inference time. While prefix-augmented inference is effective, it incurs two structural limitations: i) the prefix's influence fades as generation proceeds, and ii) attention computation over the prefix scales linearly with its length. Existing approaches either keep the prefix in attention while compressing it, or internalize it into model parameters through gradient-based training. The former still attends to the prefix at inference, while the latter is training-intensive and ill-suited to prefix updates. To address these issues, we propose attention-state memory, a training-free approach that externalizes the prefix into a lightweight, lookup-based memory of precomputed attention states between prefix and query tokens. On ManyICLBench with LLaMA-3.1-8B, our…
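To make the phrase "precomputed attention states" concrete, here is a minimal sketch in plain Python. It is not the paper's algorithm: the excerpt does not say how the memory is built or looked up, so the sketch assumes the query vector is known when the prefix state is stored. It shows only a standard identity: softmax attention over a prefix can be summarized as an output vector plus the log-sum-exp of its scores, and that summary can later be merged exactly with attention over the remaining tokens. The names `attend` and `merge` and the toy vectors are illustrative assumptions.

```python
import math

def attend(q, keys, values):
    """Scaled dot-product attention for one query vector.

    Returns the attention output and the log-sum-exp (LSE) of the scores.
    The pair (output, LSE) is what this sketch calls an attention state.
    """
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
    m = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - m) for s in scores]
    z = sum(weights)
    dim = len(values[0])
    out = [sum(w * v[d] for w, v in zip(weights, values)) / z for d in range(dim)]
    return out, m + math.log(z)

def merge(state_a, state_b):
    """Combine two attention states computed over disjoint key sets."""
    out_a, lse_a = state_a
    out_b, lse_b = state_b
    m = max(lse_a, lse_b)
    wa = math.exp(lse_a - m)  # relative softmax mass of segment a
    wb = math.exp(lse_b - m)  # relative softmax mass of segment b
    out = [(wa * a + wb * b) / (wa + wb) for a, b in zip(out_a, out_b)]
    return out, m + math.log(wa + wb)

# Toy data (hypothetical): a 3-token prefix and a 2-token local context.
q = [0.5, -1.0]
prefix_keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
prefix_values = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
context_keys = [[0.5, 0.5], [-1.0, 2.0]]
context_values = [[0.0, 1.0], [2.0, -1.0]]

# Computed once and stored; stands in for the externalized memory entry.
prefix_state = attend(q, prefix_keys, prefix_values)

# At generation time: attend only to the local context, then merge.
context_state = attend(q, context_keys, context_values)
merged_out, merged_lse = merge(prefix_state, context_state)

# Reference: full attention over prefix and context together.
full_out, full_lse = attend(
    q, prefix_keys + context_keys, prefix_values + context_values
)
```

In this toy setting the merged result matches full attention up to floating-point error, so the prefix keys and values are not touched at generation time. A real system cannot know future queries in advance, which is presumably where the lookup step described in the abstract comes in.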
