TL;DR
This paper introduces PBS-Attn, a permutation-based block-sparse attention method that improves efficiency and accuracy in long-context language models, achieving up to 2.75x speedup over full attention.
Contribution
The paper proposes a novel permutation-based block-sparse attention mechanism that enhances block-level sparsity and computational efficiency in large language models.
Findings
PBS-Attn outperforms existing block-sparse attention methods in average accuracy.
It achieves up to 2.75x speedup in long-context prefilling.
It closely matches the full attention baseline in the reported experiments.
Abstract
Scaling the context length of large language models (LLMs) offers significant benefits but is computationally expensive. This expense stems primarily from the self-attention mechanism, whose quadratic complexity with respect to sequence length presents a major bottleneck for both memory and latency. Fortunately, the attention matrix is often sparse, particularly for long sequences, suggesting an opportunity for optimization. Block-sparse attention has emerged as a promising solution that partitions sequences into blocks and skips computation for a subset of these blocks. However, the effectiveness of this method is highly dependent on the underlying attention patterns, which can lead to sub-optimal block-level sparsity. For instance, important key tokens for queries within a single block may be scattered across numerous other blocks, leading to computational redundancy. In this work, we…
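The scattering problem described in the abstract can be illustrated with a small NumPy sketch. This is not the paper's algorithm: it ignores causal masking, uses a single head, and the importance score (mean attention probability from one query block) and the helper `blocks_touched` are assumptions made purely for illustration. It shows two things: permuting keys and values together leaves the attention output unchanged, and sorting keys by importance gathers the important ones into fewer blocks.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    # Plain (non-causal) scaled dot-product attention for one head.
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores) @ v

def blocks_touched(key_positions, block_size):
    # Hypothetical helper: number of distinct key blocks holding these keys.
    return len({int(i) // block_size for i in key_positions})

rng = np.random.default_rng(0)
n, d, block_size = 64, 16, 8
q = rng.standard_normal((block_size, d))  # one block of queries
k = rng.standard_normal((n, d))
v = rng.standard_normal((n, d))

# Toy importance score (an assumption): mean attention probability per key.
importance = softmax(q @ k.T / np.sqrt(d)).mean(axis=0)

# Permutation that places the most important keys first.
perm = np.argsort(-importance, kind="stable")
top = perm[:block_size]  # original positions of the most important keys

# Keys and values must be permuted together to preserve the output.
out_full = attention(q, k, v)
out_perm = attention(q, k[perm], v[perm])
same_output = bool(np.allclose(out_full, out_perm))

# Blocks needed to cover the important keys before and after permutation.
inv = np.empty(n, dtype=int)
inv[perm] = np.arange(n)  # inv[i] = new position of original key i
before = blocks_touched(top, block_size)
after = blocks_touched(inv[top], block_size)

print(same_output, before, after)
```

After the permutation the important keys occupy a single block, so a block-sparse kernel could skip the remaining blocks for this query block; in their original order the same keys are typically spread over several blocks.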
Peer Reviews
Decision·Submitted to ICLR 2026
1. **Clear formulation:** The formulation of the attention computation and the analysis of permutation properties are clearly presented.
2. **Competitive results:** PBS-Attn achieves competitive performance across four sparse attention baselines. Although it does not consistently achieve the best score on every benchmark, it attains the highest average performance overall.
1. **Relation to prior work:** The paper lacks sufficient discussion of its relation and distinction from previous research. The idea of using permutation to better aggregate sparse attention scores has been explored in prior works such as [1]. The authors are encouraged to highlight the key differences and contributions relative to these methods.
2. **Global importance score computation:** The rationale for using the last query block to compute global importance is not clearly explained. It rem
1. Novel idea with solid theory: The paper introduces a new optimization axis for sparse attention: token permutation. It builds on formal proofs of permutation invariance in attention, making the approach conceptually sound and mathematically rigorous.
2. Strong empirical results: PBS-Attn achieves up to 2.75× speedup with minimal accuracy loss, showing consistent gains across two major long-context LLMs and benchmarks.
3. Orthogonal contribution: The method complements existing block-selection
1. Incomplete evaluation: Missing key long-context benchmarks such as InfiniteBench and RULER, which limits understanding of the method’s scalability and robustness across diverse context lengths.
2. Unclear GQA handling: It remains unclear whether GQA heads share the same permutation pattern or whether the permutation is based on query heads rather than key-value heads.
3. Limited generality of the scoring method: The query-aware key permutation relies on the final queries, which may not perfor
1. The paper presents a well-motivated idea grounded in a clear theoretical foundation.
2. The segmented permutation framework is plug-and-play and agnostic to the block selection algorithm. The approach is modular, supporting extensions and integration with existing block-sparse attention methods.
3. PBS-Attn achieves near-full-attention accuracy with substantial runtime savings, outperforming recent baselines such as FlexPrefill and XAttention.
1. The method currently targets only the prefill stage. Its applicability to decoding or training phases is not explored.
2. The paper asserts “minimal performance degradation,” but from Table 1 and Table 2, there are domains or tasks (e.g., Qwen-2.5-7B-1M on LongBench, Code and Few-Shot Learning categories) where PBS-Attn performs slightly below the full attention baseline. No qualitative or error analyses are provided to identify failure modes or classes of inputs for which the approach may un
