Context Memorization for Efficient Long Context Generation

Yasuyuki Okoshi, Hao Mark Chen, Guanxi Lu, Hongxiang Fan, Masato Motomura, Daichi Fujiki

arXiv:2605.18226 · cs.CL · May 19, 2026

TL;DR

This paper introduces attention-state memory, a training-free method that externalizes long prefixes into a lightweight memory, improving efficiency and accuracy in long context generation for large language models.

Contribution

The paper proposes attention-state memory, an externalized memory for long prefixes that improves efficiency and accuracy without additional training.

Findings

01. Improves accuracy over in-context learning with 1K-8K memory budgets.

02. Reduces attention latency by 1.36x at 8K memory.

03. Surpasses full-attention RAG performance on NBA benchmark with only 20% memory.

Abstract

Modern large language model (LLM) applications increasingly rely on long conditioning prefixes to control model behavior at inference time. While prefix-augmented inference is effective, it incurs two structural limitations: i) the prefix's influence fades as generation proceeds, and ii) attention computation over the prefix scales linearly with its length. Existing approaches either keep the prefix in attention while compressing it, or internalize it into model parameters through gradient-based training. The former still attends to the prefix at inference, while the latter is training-intensive and ill-suited to prefix updates. To address these issues, we propose attention-state memory, a training-free approach that externalizes the prefix into a lightweight, lookup-based memory of precomputed attention states between prefix and query tokens. On ManyICLBench with LLaMA-3.1-8B, our…
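The abstract is cut off before it describes how the memory is built or queried, so the following is only an illustrative sketch of the general idea that attention over a prefix can be computed separately and combined later. It is not the authors' method. It uses the standard identity that softmax attention over concatenated key blocks equals a normalizer-weighted mix of the per-block attention outputs. All names (`attention_state`, `merge_states`, the toy vectors) are hypothetical, the usual 1/sqrt(d) scaling is omitted for brevity, and the query is assumed to be known when the prefix state is computed, which a real lookup-based memory would have to approximate.

```python
import math


def dot(a, b):
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def attention_state(q, keys, values):
    """Softmax attention of one query over a block of keys/values.

    Returns (output, lse), where lse is the log of the softmax normalizer.
    The pair is what this sketch calls an "attention state" (an assumption,
    not a definition taken from the paper).
    """
    scores = [dot(q, k) for k in keys]
    m = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - m) for s in scores]
    z = sum(weights)
    dim = len(values[0])
    out = [sum(w * v[d] for w, v in zip(weights, values)) / z for d in range(dim)]
    return out, m + math.log(z)


def merge_states(state_a, state_b):
    """Combine two attention states as if their key blocks were concatenated."""
    out_a, lse_a = state_a
    out_b, lse_b = state_b
    m = max(lse_a, lse_b)
    w_a = math.exp(lse_a - m)
    w_b = math.exp(lse_b - m)
    total = w_a + w_b
    out = [(w_a * a + w_b * b) / total for a, b in zip(out_a, out_b)]
    return out, m + math.log(total)


# Toy data: a 3-token "prefix" and 2 "local" tokens, all hypothetical.
prefix_keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
prefix_values = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
local_keys = [[0.5, -0.5], [-1.0, 0.25]]
local_values = [[-1.0, 0.0], [0.0, -2.0]]
query = [0.3, -0.7]

# The prefix state could be computed once and stored outside the attention pass.
prefix_state = attention_state(query, prefix_keys, prefix_values)
local_state = attention_state(query, local_keys, local_values)
merged_out, merged_lse = merge_states(prefix_state, local_state)

# Reference: attend over prefix and local tokens together.
full_out, full_lse = attention_state(
    query, prefix_keys + local_keys, prefix_values + local_values
)
```

In this toy setting `merged_out` matches `full_out` up to floating-point error, so the prefix keys never need to be revisited once their state is stored. How the paper indexes such states for unseen queries is not recoverable from the truncated abstract.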

Code & Models

Repositories

yasu0001/AttentionMemory (GitHub)
