Lil: Less is Less When Applying Post-Training Sparse-Attention Algorithms in Long-Decode Stage

Junhao Hu; Fangze Li; Mingtao Xu; Feifan Meng; Shiju Zhao; Tiancheng Hu; Ting Peng; Anmin Liu; Wenrui Huang; Chenxu Liu; Ziyue Hua; Tao Xie

arXiv:2601.03043 · cs.CL · April 21, 2026

TL;DR

This paper shows that post-training sparse-attention algorithms can increase end-to-end inference complexity in LLMs, because information loss induces longer decoded sequences. It proposes an early-stopping algorithm that reduces token consumption by up to 90% while keeping accuracy degradation below 2%.

Contribution

The paper identifies the paradoxical 'Less is Less' (Lil) phenomenon in sparse attention during long decoding and introduces an early-stopping algorithm to mitigate it.

Findings

01. Sparse attention can increase end-to-end complexity due to information loss.

02. The proposed early-stopping algorithm reduces token consumption by up to 90%.

03. Accuracy degradation remains below 2% with the new method.
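The first finding can be illustrated with a toy cost model. The sketch below is not the paper's theoretical analysis: it assumes decode cost is proportional to the number of key/value entries attended per generated token, and that sparse attention caps that number at a fixed budget. The function names and the example sizes are hypothetical.

```python
def dense_decode_cost(prompt_len: int, decode_len: int) -> int:
    """Attended KV entries over a dense decode: step i attends to prompt_len + i tokens."""
    return sum(prompt_len + i for i in range(1, decode_len + 1))


def sparse_decode_cost(prompt_len: int, decode_len: int, budget: int) -> int:
    """Same count when each step attends to at most `budget` tokens (assumed sparse policy)."""
    return sum(min(budget, prompt_len + i) for i in range(1, decode_len + 1))


def break_even_sparse_len(prompt_len: int, dense_len: int, budget: int) -> int:
    """Smallest sparse decode length whose total cost reaches the dense total cost."""
    target = dense_decode_cost(prompt_len, dense_len)
    steps, cost = 0, 0
    while cost < target:
        steps += 1
        cost += min(budget, prompt_len + steps)
    return steps


if __name__ == "__main__":
    # Hypothetical sizes: 1000-token prompt, 1000-token dense answer, budget of 512.
    print(dense_decode_cost(1000, 1000))           # 1500500
    print(sparse_decode_cost(1000, 1000, 512))     # 512000
    print(break_even_sparse_len(1000, 1000, 512))  # 2931
```

Under these assumptions, sparse decoding is cheaper only if it finishes in fewer than 2931 tokens. If information loss stretches the answer beyond roughly three times its dense length, the sparse run costs more in total even though each step is cheaper.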

Abstract

Large language models (LLMs) demonstrate strong capabilities across a wide range of complex tasks and are increasingly deployed at scale, placing significant demands on inference efficiency. Prior work typically decomposes inference into prefill and decode stages, with the decode stage dominating total latency. To reduce time and memory complexity in the decode stage, a line of work introduces sparse-attention algorithms. In this paper, we show, both empirically and theoretically, that sparse attention can paradoxically increase end-to-end complexity: information loss often induces significantly longer sequences, a phenomenon we term "Less is Less" (Lil). To mitigate the Lil problem, we propose an early-stopping algorithm that detects the threshold where information loss exceeds information gain during sparse decoding. Our early-stopping algorithm reduces token consumption by up to…
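The abstract does not say how information gain and loss are measured, so the following is only a generic sketch of a threshold-style stopping rule, not the authors' algorithm. It assumes some external estimator yields a (gain, loss) pair at each decode step. The function name, the signals, and the `patience` parameter are all hypothetical.

```python
from typing import Iterable, Optional, Tuple


def early_stop_step(
    signals: Iterable[Tuple[float, float]], patience: int = 3
) -> Optional[int]:
    """Return the 0-based decode step at which to stop, or None if never triggered.

    `signals` yields hypothetical (gain, loss) estimates per decode step.
    Decoding stops once loss has exceeded gain for `patience` consecutive steps.
    """
    streak = 0
    for step, (gain, loss) in enumerate(signals):
        # Count consecutive steps where estimated loss outweighs estimated gain.
        streak = streak + 1 if loss > gain else 0
        if streak >= patience:
            return step
    return None


if __name__ == "__main__":
    # Made-up signal values, for illustration only.
    trace = [(1.0, 0.1), (0.8, 0.9), (0.9, 0.2), (0.3, 0.5), (0.2, 0.6), (0.1, 0.7)]
    print(early_stop_step(trace, patience=3))  # 5
```

Requiring several consecutive steps is one way to avoid stopping on a single noisy estimate. Whether the paper uses anything similar is not stated in the text above.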
