Accelerating Prefilling via Decoding-time Contribution Sparsity

Zhiyuan He; Yike Zhang; Chengruidong Zhang; Huiqiang Jiang; Yuqing Yang; Lili Qiu

arXiv:2507.21526 · cs.CL · April 22, 2026

TL;DR

This paper introduces TriangleMix, a training-free static attention pattern that exploits decoding-time contribution sparsity in large language models to speed up attention computation during prefilling while keeping performance nearly lossless.

Contribution

The paper proposes TriangleMix, a training-free static attention pattern that uses dense attention in a subset of layers and Triangle attention in the others, reducing attention overhead during LLM prefilling.

Findings

01

Triangle attention achieves a 15.3x speedup in attention computation on 128K-token inputs.

02

TriangleMix preserves nearly lossless performance compared to dense attention.

03

Combining TriangleMix with dynamic sparsity yields an additional 6-19% reduction in time-to-first-token (TTFT).

Abstract

Large Language Models (LLMs) incur quadratic attention complexity with input length, creating a major time bottleneck in the prefilling stage. Existing acceleration methods largely exploit attention score sparsity by estimating blocks with high attention scores and applying dynamic sparse attention. In this work, we identify another untapped form of sparsity in the prefilling stage, namely decoding-time contribution sparsity, where many attention blocks exhibit nontrivial attention scores during prefilling yet contribute negligibly to subsequent decoding, as indicated by gradient-based analysis. Building on this observation, we propose TriangleMix, a training-free static attention pattern that uses dense attention in a subset of layers and switches to Triangle attention in the others. Extensive experiments show that TriangleMix preserves nearly lossless performance relative to dense…
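To make the layer-wise pattern concrete, here is a minimal sketch in Python with NumPy. It is not the authors' implementation. The abstract does not spell out the shape of Triangle attention, so the sketch assumes it keeps three parts of the causal mask: a few initial "sink" keys, a local sliding window, and the full rows of the last few queries, dropping the remaining middle region. The function names, the parameters `sink`, `window`, and `last_q`, and the choice to make the first layers dense are all hypothetical and chosen for illustration only.

```python
import numpy as np


def causal_mask(n):
    """Dense causal mask: query i may attend to keys 0..i."""
    return np.tril(np.ones((n, n), dtype=bool))


def triangle_mask(n, sink=4, window=8, last_q=8):
    """ASSUMED Triangle pattern (not taken from the abstract).

    Keeps, within the causal mask:
      - the first `sink` keys for every query,
      - a sliding window of the `window` most recent keys,
      - every causal key for the last `last_q` queries.
    Everything else (the "middle" region) is dropped.
    """
    q = np.arange(n)[:, None]  # query positions as a column
    k = np.arange(n)[None, :]  # key positions as a row
    causal = k <= q
    sinks = k < sink
    local = (q - k) < window
    last_rows = q >= n - last_q
    return causal & (sinks | local | last_rows)


def layer_masks(n, num_layers, num_dense_layers, **triangle_kwargs):
    """Static schedule: dense masks for the first `num_dense_layers`
    layers, Triangle masks for the rest. Which layers stay dense is an
    assumption made for this sketch."""
    return [
        causal_mask(n) if layer < num_dense_layers
        else triangle_mask(n, **triangle_kwargs)
        for layer in range(num_layers)
    ]


n = 64
dense = causal_mask(n)
tri = triangle_mask(n)
masks = layer_masks(n, num_layers=4, num_dense_layers=2)

# Fraction of causal query-key pairs that the assumed Triangle mask keeps.
kept_fraction = tri.sum() / dense.sum()
```

Because the mask depends only on the sequence length and fixed parameters, it can be built once without estimating attention scores at run time, which is what distinguishes a static pattern from the dynamic sparse methods mentioned in the abstract. Under these assumptions, the kept region grows linearly with sequence length for fixed `sink`, `window`, and `last_q`, whereas the dense causal mask grows quadratically.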

Code & Models

Repositories

https://aka.ms/TriangleMix (GitHub)
