Knowing When to Stop: Efficient Context Processing via Latent Sufficiency Signals
Roy Xie, Junlin Wang, Paul Rosu, Chunyuan Deng, Bolun Sun, Zihao Lin, Bhuwan Dhingra

TL;DR
This paper introduces a dynamic context cutoff method for large language models, enabling them to self-terminate processing once sufficient information is obtained, improving efficiency and accuracy.
Contribution
The paper uncovers inherent sufficiency signals in attention heads and leverages them for self-termination, a novel approach to context efficiency in LLMs.
Findings
3.4% accuracy improvement over baselines
1.33x token reduction on average
Superior performance compared to other methods at similar token reductions
Abstract
Large language models (LLMs) process entire input contexts indiscriminately, which is inefficient when the information required to answer a query is localized within the context. We present dynamic context cutoff, a novel method enabling LLMs to self-terminate processing upon acquiring sufficient task-relevant information. Through analysis of model internals, we discover that specific attention heads inherently encode "sufficiency signals" -- detectable through lightweight classifiers -- that predict when critical information has been processed. This reveals a new efficiency paradigm: models' internal understanding naturally dictates processing needs rather than external compression heuristics. Comprehensive experiments across six QA datasets (up to 40K tokens) with three model families (LLaMA/Qwen/Mistral, 1B-70B) demonstrate 3.4% accuracy improvement while achieving 1.33x token…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsTopic Modeling · Natural Language Processing Techniques
MethodsSoftmax · Attention Is All You Need
