TL;DR
DYSCO is a training-free decoding algorithm that improves long-context reasoning in language models. At each decoding step it uses retrieval heads to identify task-relevant tokens and up-weights attention to them, with reported gains of up to 25% on long-context reasoning benchmarks.
Contribution
The paper introduces DYSCO, a novel, training-free decoding method that leverages retrieval heads to improve long-context reasoning in existing language models.
Findings
DYSCO improves performance on long-context reasoning benchmarks by up to 25%.
The method is applicable to any off-the-shelf language model.
Dynamic attention rescaling and retrieval-head guided selection are key to its effectiveness.
Abstract
Understanding and reasoning over long contexts is a crucial capability for language models (LMs). Although recent models support increasingly long context windows, their accuracy often deteriorates as input length grows. In practice, models often struggle to keep attention aligned with the most relevant context throughout decoding. In this work, we propose DYSCO, a novel decoding algorithm for improving long-context reasoning. DYSCO leverages retrieval heads, a subset of attention heads specialized for long-context retrieval, to identify task-relevant tokens at each decoding step and explicitly up-weight them. By doing so, DYSCO dynamically adjusts attention during generation to better utilize relevant context. The method is training-free and can be applied directly to any off-the-shelf LM. Across multiple instruction-tuned and reasoning models, DYSCO consistently improves performance…
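The abstract describes two steps per decoding step: retrieval heads pick out task-relevant tokens, and attention to those tokens is up-weighted. The sketch below is only an illustration of that general idea, not the paper's implementation. The aggregation rule (mean over retrieval heads), the top-k selection, the multiplicative `boost`, and all function names are assumptions introduced here; the abstract does not specify them.

```python
import math

def softmax(logits):
    # Numerically stable softmax over a list of floats.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def select_relevant_tokens(retrieval_attn, top_k):
    # retrieval_attn: one attention distribution over context tokens per
    # retrieval head. ASSUMPTION: heads are averaged and the top_k tokens
    # are kept; the paper may aggregate and select differently.
    n = len(retrieval_attn[0])
    scores = [sum(head[i] for head in retrieval_attn) / len(retrieval_attn)
              for i in range(n)]
    ranked = sorted(range(n), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:top_k])

def upweight_attention(logits, selected, boost):
    # ASSUMPTION: up-weighting is modeled as adding log(boost) to the
    # pre-softmax logits of the selected tokens, which multiplies their
    # unnormalized weight by `boost` before renormalization.
    bias = math.log(boost)
    chosen = set(selected)
    adjusted = [x + bias if i in chosen else x for i, x in enumerate(logits)]
    return softmax(adjusted)

# Toy example: two hypothetical retrieval heads over four context tokens.
retrieval_attn = [[0.10, 0.60, 0.10, 0.20],
                  [0.05, 0.70, 0.05, 0.20]]
selected = select_relevant_tokens(retrieval_attn, top_k=2)

# Attention logits of some other head, uniform before rescaling.
weights = upweight_attention([0.0, 0.0, 0.0, 0.0], selected, boost=2.0)
print(selected, weights)
```

In this toy setup the selection and rescaling would be recomputed at every generated token, which is what makes the adjustment dynamic rather than a one-time reweighting of the prompt.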
