TL;DR
This paper introduces ILRe, a novel context compression method for causal language models that significantly reduces computational complexity and memory usage while maintaining or improving performance in long-context scenarios.
Contribution
ILRe proposes a new intermediate layer retrieval technique that streamlines long-context processing without additional training or operator modifications.
Findings
Reduces prefill complexity from O(L^2) to O(L)
Cuts memory footprint to a fraction of full context requirements
Achieves near full-context performance with 180x speedup on long inputs
Abstract
Large Language Models (LLMs) have demonstrated success across many benchmarks. However, they still exhibit limitations in long-context scenarios, primarily due to their short effective context length, quadratic computational complexity, and high memory overhead when processing lengthy inputs. To mitigate these issues, we introduce a novel context compression pipeline, called Intermediate Layer Retrieval (ILRe), which determines one intermediate decoder layer offline, encodes context by streaming chunked prefill only up to that layer, and recalls tokens by the attention scores between the input query and full key cache in that specified layer. In particular, we propose a multi-pooling kernels allocating strategy in the token recalling process to maintain the completeness of semantics. Our approach not only reduces the prefilling complexity from O(L^2) to O(L) and trims the memory…
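The token-recall step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the even split of the token budget across kernel sizes, and the union of the per-kernel picks are all assumptions made for illustration, since the abstract does not say how the pooling kernels are allocated. It scores every cached key against a single query vector, smooths the scores with max pooling at several kernel sizes, and keeps the top-scoring positions.

```python
import math


def attention_scores(query, keys):
    """Softmax over scaled dot products between one query and every cached key."""
    d = len(query)
    logits = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d) for key in keys]
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def max_pool_1d(scores, kernel):
    """Centered 1-D max pooling, stride 1, same output length (odd kernel assumed)."""
    half = kernel // 2
    n = len(scores)
    return [max(scores[max(0, i - half): min(n, i + half + 1)]) for i in range(n)]


def recall_tokens(query, keys, budget, kernels=(1, 3, 5)):
    """Return sorted indices of context tokens to keep.

    Assumption (not from the paper): the budget is split evenly across
    kernel sizes and the per-kernel selections are unioned.
    """
    scores = attention_scores(query, keys)
    per_kernel = max(1, budget // len(kernels))
    selected = set()
    for kernel in kernels:
        pooled = max_pool_1d(scores, kernel)
        # Rank by pooled score, breaking ties by position for determinism.
        ranked = sorted(range(len(pooled)), key=lambda i: (-pooled[i], i))
        selected.update(ranked[:per_kernel])
    return sorted(selected)


if __name__ == "__main__":
    # Toy key cache: eight 2-d keys, only position 4 aligns with the query.
    keys = [[0.0, 0.0] for _ in range(8)]
    keys[4] = [4.0, 0.0]
    print(recall_tokens([1.0, 0.0], keys, budget=2, kernels=(1, 3)))
```

In this toy setup the kernel of size 1 picks the single highest-scoring token, while the wider kernel also pulls in a neighbor of that token, which is one plausible reading of how pooling helps keep semantically contiguous spans together.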
Peer Reviews
Decision·Submitted to ICLR 2026
1. Compared with using the full context, ILRe significantly reduces TTFT. 2. It achieves performance improvements on RULER and LongBench.
Overall, I believe this paper does not present a fundamentally novel contribution compared with previous methods (e.g., SnapKV). The addition of techniques such as max pooling and multiple kernel sizes lacks sufficient innovation. Moreover, the writing is confusing to me. 1. Figure 1 lacks a comparison with baselines (such as StreamingLLM and SnapKV). 2. The setup in Figure 2 is unclear. What does Recall mean here? It seems that the authors placed some key information in the appendix (for examp
* The paper presents a training-free method to compress tokens and thus reduce the memory and compute used in long-context scenarios, and it demonstrates promising performance on the datasets tested. * The method is intuitive and presented clearly. The discovery that one layer's attention scores can be used to perform compression is interesting.
1. The benchmarks are limited. While the authors have conducted evaluation on LongBench, the datasets in LongBench primarily consist of relatively short contexts (<10K), which are relatively easy to prefill. As the paper aims to compress extremely long text, it would be good to evaluate on [Infini-Bench](https://arxiv.org/pdf/2402.13718), which consists of contexts of >100K. 2. The baselines are also limited. While the paper has included two KV cache compression baselines (SnapKV and StreamingLLM), I b
1. This work provides comprehensive experimental data on its context compression methods.
1. Method Effectiveness: Many previous studies have shown that lossy context compression leads to significant information loss. As seen in Table 6, when DCA is removed, the performance of pure ILRe extrapolation on RULER drops rapidly. This raises doubts as to whether the improvements on RULER are mainly due to DCA. 2. Unclear Writing Structure: The paper lacks an intuitive introduction to the core ideas behind the method, instead jumping straight into the details, which makes it difficu
