Less Is More: Fast and Accurate Reasoning with Cross-Head Unified Sparse Attention

Lijie Yang; Zhihao Zhang; Arti Jain; Shijie Cao; Baihong Yuan; Yiwei Chen; Zhihao Jia; Ravi Netravali

arXiv:2508.07101·cs.CL·April 29, 2026

TL;DR

LessIsMore is a training-free sparse attention method for long-horizon reasoning. It selects a single set of important tokens shared across attention heads and keeps a stable window of recent tokens, which speeds up decoding without accuracy loss.

Contribution

The paper presents a training-free sparse attention mechanism that enforces cross-head unified token selection and preserves recent context through a stable recency window, with the goal of keeping reasoning accuracy intact over long generations.

Findings

01. Matches or improves accuracy with fewer attended tokens.

02. Achieves up to 1.6× decoding speedup and 1.72× faster sparse attention.

03. Demonstrates effectiveness across multiple models and benchmarks.

Abstract

Large reasoning models achieve strong performance through test-time scaling, but this incurs substantial computational overhead due to long decoding from short prompts. While sparse attention can reduce latency and memory usage, existing methods often degrade reasoning accuracy because selection errors accumulate over long generation horizons, or require costly retraining. We introduce LessIsMore, a training-free sparse attention mechanism for long-horizon reasoning. Our key insight is that token importance in reasoning is global and stable: critical tokens are largely shared across attention heads and remain stable over decoding steps. Guided by this structure, LessIsMore enforces cross-head unified token selection and preserves recent context via a stable recency window, yielding a globally consistent token set that can be reused across layers. Across multiple model families and…
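The selection step described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the function name `unified_select`, the parameters `budget` and `recent`, and the choice to aggregate heads by summing their scores are all assumptions made for illustration, and the paper's actual aggregation rule may differ. The sketch only shows the two ideas the abstract names: one token set shared by every head, and a recency window that is always kept.

```python
from typing import List


def unified_select(head_scores: List[List[float]], budget: int, recent: int) -> List[int]:
    """Pick one set of token indices that every head will attend to.

    head_scores[h][t] is head h's attention score for past token t.
    Summing scores across heads is an assumption of this sketch,
    not a rule taken from the paper.
    """
    num_tokens = len(head_scores[0])
    if any(len(row) != num_tokens for row in head_scores):
        raise ValueError("every head must score the same tokens")
    budget = min(budget, num_tokens)
    recent = min(recent, budget)

    # Recency window: the last `recent` tokens are always kept.
    window_start = num_tokens - recent
    keep = set(range(window_start, num_tokens))

    # Cross-head aggregation: sum each older token's score over all heads.
    totals = [sum(row[t] for row in head_scores) for t in range(window_start)]

    # Fill the remaining budget with the highest-scoring older tokens.
    ranked = sorted(range(window_start), key=lambda t: totals[t], reverse=True)
    keep.update(ranked[: budget - len(keep)])
    return sorted(keep)


# Toy example: 2 heads scoring 6 past tokens (made-up numbers).
scores = [
    [0.50, 0.05, 0.05, 0.30, 0.05, 0.05],
    [0.40, 0.10, 0.05, 0.35, 0.05, 0.05],
]
selected = unified_select(scores, budget=4, recent=2)
print(selected)  # [0, 3, 4, 5]
```

In the toy example, tokens 4 and 5 are kept by the recency window, and tokens 0 and 3 win the remaining two slots because both heads score them highly. The abstract states that the resulting token set can be reused across layers; that reuse is not modelled in this sketch.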

Code & Models

Repositories

DerrickYLJ/LessIsMore (GitHub)
