CLAA: Cross-Layer Attention Aggregation for Accelerating LLM Prefill
Bradley McDanel, Steven Li, Harshit Khaitan

TL;DR
This paper introduces CLAA, a method that aggregates token importance scores across layers in long-context LLMs, accelerating inference by reducing Time-to-First-Token (TTFT) without sacrificing accuracy.
Contribution
The paper proposes a novel cross-layer attention aggregation technique that stabilizes token importance estimation and improves inference speed in long-context LLMs.
Findings
CLAA reduces Time-to-First-Token by up to 39%.
Aggregating scores across layers improves token importance stability.
Existing heuristics show high variance in token rankings across layers.
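To make the second and third findings concrete, here is a minimal sketch of cross-layer aggregation. It is not the authors' implementation: the reviews describe CLAA's aggregation as a maximum across layers, and everything else here (the function names `aggregate_max` and `top_k_indices`, the toy scores, plain Python lists instead of tensors) is an assumption made for illustration.

```python
from typing import List


def aggregate_max(layer_scores: List[List[float]]) -> List[float]:
    """Element-wise maximum over layers.

    layer_scores[l][t] is the (hypothetical) importance score that
    layer l assigns to prompt token t. A token keeps the highest score
    any layer gave it, so one layer that ranks it poorly cannot
    remove it on its own.
    """
    if not layer_scores:
        raise ValueError("need at least one layer of scores")
    num_tokens = len(layer_scores[0])
    if any(len(row) != num_tokens for row in layer_scores):
        raise ValueError("every layer must score the same tokens")
    return [max(column) for column in zip(*layer_scores)]


def top_k_indices(scores: List[float], k: int) -> List[int]:
    """Indices of the k highest-scoring tokens, returned in prompt order."""
    ranked = sorted(range(len(scores)), key=lambda i: (-scores[i], i))
    return sorted(ranked[:k])


# Toy scores for 4 prompt tokens from 3 layers (made-up numbers).
layer_scores = [
    [0.90, 0.10, 0.05, 0.20],  # layer 0 favours token 0
    [0.10, 0.10, 0.80, 0.30],  # layer 1 favours token 2
    [0.20, 0.05, 0.10, 0.70],  # layer 2 favours token 3
]

# Single-layer rankings disagree with each other.
for layer, scores in enumerate(layer_scores):
    print(f"layer {layer} top-2:", top_k_indices(scores, 2))

# The aggregated ranking keeps tokens that any layer considered important.
aggregated = aggregate_max(layer_scores)
print("aggregated top-3:", top_k_indices(aggregated, 3))  # [0, 2, 3]
```

In this toy case each single layer selects a different pair of tokens, while the aggregated scores retain all three tokens that some layer ranked first.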
Abstract
The prefill stage in long-context LLM inference remains a computational bottleneck. Recent token-ranking heuristics accelerate inference by selectively processing a subset of semantically relevant tokens. However, existing methods suffer from unstable token importance estimation, often varying between layers. Evaluating token-ranking quality independently from heuristic-specific architectures is challenging. To address this, we introduce an Answer-Informed Oracle, which defines ground-truth token importance by measuring attention from generated answers back to the prompt. This oracle reveals that existing heuristics exhibit high variance across layers: rankings can degrade sharply at specific layers, a failure mode invisible to end-to-end benchmarks. The diagnosis suggests a simple fix: aggregate scores across layers rather than relying on any single one. We implement this as…
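The oracle idea in the abstract can be sketched as follows. This is an illustrative reading, not the paper's code: it assumes the attention from each generated answer token to each prompt token has already been reduced to a single weight (for example by averaging over heads and layers), averages those weights over answer tokens, and compares a heuristic's top-k tokens with the oracle's top-k at each layer. The names `oracle_scores` and `overlap_at_k` and all numbers are hypothetical.

```python
from typing import List, Set


def oracle_scores(answer_to_prompt_attn: List[List[float]]) -> List[float]:
    """Mean attention each prompt token receives from the answer tokens.

    answer_to_prompt_attn[a][p] is the attention weight from generated
    answer token a back to prompt token p (assumed pre-reduced over
    heads and layers).
    """
    if not answer_to_prompt_attn:
        raise ValueError("need at least one answer token")
    num_answer = len(answer_to_prompt_attn)
    return [sum(column) / num_answer for column in zip(*answer_to_prompt_attn)]


def top_k_set(scores: List[float], k: int) -> Set[int]:
    """Set of indices of the k highest scores (ties broken by lower index)."""
    ranked = sorted(range(len(scores)), key=lambda i: (-scores[i], i))
    return set(ranked[:k])


def overlap_at_k(heuristic: List[float], oracle: List[float], k: int) -> float:
    """Fraction of the oracle's top-k tokens that the heuristic also selects."""
    return len(top_k_set(heuristic, k) & top_k_set(oracle, k)) / k


# Two answer tokens attending over three prompt tokens (made-up weights).
attn = [
    [0.5, 0.1, 0.4],
    [0.3, 0.1, 0.6],
]
oracle = oracle_scores(attn)

# Hypothetical per-layer heuristic scores for the same three prompt tokens.
heuristic_by_layer = [
    [0.6, 0.1, 0.3],    # agrees with the oracle's top-2
    [0.9, 0.05, 0.05],  # misses token 2 at this layer
]
for layer, scores in enumerate(heuristic_by_layer):
    print(f"layer {layer} overlap@2:", overlap_at_k(scores, oracle, 2))
```

Tracking this overlap layer by layer is one way to expose the failure mode the abstract describes, where a heuristic's ranking degrades at a specific layer even though end-to-end scores look fine.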
Peer Reviews
Decision · Submitted to ICLR 2026
1. The paper identifies and empirically demonstrates that existing single-layer methods suffer from significant layer-wise instability (Figure 2), which is a common limitation that impacts their effectiveness. The Answer-Informed Oracle is an important contribution that enables principled comparison of token ranking heuristics independently from architectural details. This could become a valuable tool for future research.
2. CLAA's design is straightforward - maximum aggregation across layers w
1. The paper is missing critical comparisons with several recent and relevant methods. Specifically, the following papers also focus on improving TTFT but are not shown in the baselines:
   * Random-LTD: Random and Layerwise Token Dropping Brings Efficient Training for Large-scale Transformers
   * Compressing Context to Enhance Inference Efficiency of Large Language Models
   * LazyLLM: Dynamic Token Pruning for Efficient Long Context LLM Inference
2. The core technical contribution (max aggregation ac
1. The author proposes a principled way of checking the validity of a token estimation method, upon which they design a new way of aggregating attention scores to conduct token dropping.
2. The work compares many important baselines from both a quality and an efficiency perspective.
3. LongBench, RULER, and NIAH are used for long-context evaluations, along with both prefill TTFT and KV cache / memory profiling, which is very comprehensive.
1. Prior works have shown that token pruning can become less effective on shorter, standard tasks; such tasks should be included for completeness.
2. The evaluation focuses on the 8B model and should ideally include other sizes to further prove its practicality.
3. The method seems more like an incremental step from prior works that use attention as the main metric for token estimation; the main novel part is how to aggregate the attention scores more robustly compared to GemFilter.
4. RULER re
- Novel and principled evaluation framework: The Answer-Informed Oracle is a clever contribution that provides a ground-truth way to evaluate token ranking quality independent of architectural differences. The idea of using attention from the actual generated answer to score prompt tokens is intuitive and well-motivated. This makes it much easier to compare different heuristics fairly, which has been a real problem in this area.
- Strong empirical results with comprehensive evaluation: The exper
- Limited model coverage in experiments: The paper only evaluates on Llama-3.2-3B, Llama-3.1-8B, and Mistral-Nemo-12B. It would be more convincing to see results on other model families like Qwen3 or larger models (e.g., 30B+ scale). Different architectures might have different attention patterns, and larger models might show different layer-wise stability characteristics. The current evaluation leaves some doubt about whether these findings generalize broadly across the model landscape.
- Missi
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Advanced Graph Neural Networks
