HSR-Enhanced Sparse Attention Acceleration
Bo Chen, Yingyu Liang, Zhizhou Sha, Zhenmei Shi, Zhao Song

TL;DR
This paper presents a novel method using Half-Space Reporting to accelerate sparse attention in large language models, significantly reducing computation time for long-context tasks with minimal error.
Contribution
It introduces an HSR-based approach that efficiently identifies the active entries of the attention matrix, enabling faster sparse attention computation in LLMs for long contexts.
Findings
Achieves $O(mn^{4/5})$ time complexity for generation decoding.
Reduces prompt prefilling time to $O(mn^{1 - 1/\lfloor d/2\rfloor} + mn^{4/5})$.
Introduces negligible error in Softmax attention computations.
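To make the two bounds above concrete, the following sketch compares the exponents of $n$ and the resulting asymptotic speedup over dense $O(mn)$ attention. It assumes $n$ is the context length, $m$ the query length, and $d$ the hidden dimension (the summary itself does not define these symbols), and it ignores constants and logarithmic factors. The function names are hypothetical and only illustrate the arithmetic.

```python
from fractions import Fraction

DECODING_EXPONENT = Fraction(4, 5)  # from the O(m n^{4/5}) decoding bound


def prefill_exponent(d: int) -> Fraction:
    """Dominant exponent of n in O(m n^{1 - 1/floor(d/2)} + m n^{4/5})."""
    if d < 2:
        raise ValueError("need d >= 2 so that floor(d/2) >= 1")
    hsr_term = 1 - Fraction(1, d // 2)
    # The larger of the two exponents dominates as n grows.
    return max(hsr_term, DECODING_EXPONENT)


def speedup_over_dense(n: int, exponent: Fraction) -> float:
    """Ratio (m n) / (m n^exponent) = n^(1 - exponent), constants ignored."""
    return n ** float(1 - exponent)


if __name__ == "__main__":
    for d in (8, 12, 64):
        print(d, prefill_exponent(d))
    print(speedup_over_dense(100_000, DECODING_EXPONENT))
```

Under these assumptions, the $n^{4/5}$ term dominates prefilling only while $\lfloor d/2\rfloor \le 5$; for $d \ge 12$ the first term takes over, and its exponent approaches 1 as $d$ grows.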
Abstract
Large Language Models (LLMs) have demonstrated remarkable capabilities across various applications, but their performance on long-context tasks is often limited by the computational complexity of attention mechanisms. We introduce a novel approach to accelerate attention computation in LLMs, particularly for long-context scenarios. We leverage the inherent sparsity within attention mechanisms, both in conventional Softmax attention and ReLU attention (with $\mathrm{ReLU}^\alpha$ activation, $\alpha \in \mathbb{N}_+$), to significantly reduce the running time complexity. Our method employs a Half-Space Reporting (HSR) data structure to identify non-zero or "massively activated" entries in the attention matrix. We present theoretical analyses for two key scenarios: generation decoding and prompt prefilling. Our approach achieves a running time of $O(mn^{4/5})$, significantly faster than…
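The idea of using half-space reporting to find active attention entries can be sketched as follows. This is not the authors' implementation: `half_space_report` is a hypothetical brute-force stand-in that scans all keys in $O(nd)$ time, whereas a real HSR data structure would answer the same query sublinearly. The attention form used here, $D^{-1}\,\mathrm{ReLU}^\alpha(QK^\top/\sqrt{d} - b)\,V$ with threshold $b$, is an assumption about the ReLU attention the abstract refers to.

```python
import numpy as np


def half_space_report(keys, q, threshold):
    """Brute-force stand-in for an HSR query: indices i with <q, keys[i]> > threshold."""
    return np.nonzero(keys @ q > threshold)[0]


def relu_attention_dense(Q, K, V, b, alpha=1):
    """Reference: evaluates all m x n entries, most of which are zero."""
    d = Q.shape[1]
    A = np.maximum(Q @ K.T / np.sqrt(d) - b, 0.0) ** alpha
    row_sums = A.sum(axis=1, keepdims=True)
    # Rows with no active entry are left as zeros.
    return A @ V / np.where(row_sums > 0, row_sums, 1.0)


def relu_attention_sparse(Q, K, V, b, alpha=1):
    """Touches only the keys reported as active for each query."""
    d = Q.shape[1]
    out = np.zeros((Q.shape[0], V.shape[1]))
    for i, q in enumerate(Q):
        # <q, k> / sqrt(d) - b > 0  is the half-space  <q, k> > b * sqrt(d).
        idx = half_space_report(K, q, b * np.sqrt(d))
        if idx.size == 0:
            continue
        w = (K[idx] @ q / np.sqrt(d) - b) ** alpha
        out[i] = w @ V[idx] / w.sum()
    return out


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    Q = rng.standard_normal((4, 16))
    K = rng.standard_normal((64, 16))
    V = rng.standard_normal((64, 8))
    dense = relu_attention_dense(Q, K, V, b=1.0)
    sparse = relu_attention_sparse(Q, K, V, b=1.0)
    print(np.allclose(dense, sparse))
```

Because ReLU zeroes every entry outside the half-space, the sparse version reproduces the dense result while only reading the reported keys. For Softmax attention the same query would select the largest entries, and the discarded ones would contribute a (small) approximation error.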
Peer Reviews
Decision · ICLR 2025 Conference Withdrawn Submission
1. The integration of HSR for sparse attention significantly contributes to reducing computational costs in attention mechanisms. 2. The paper rigorously proves the effectiveness of the approach, including detailed bounds for approximation errors and sparse matrix management.
1. While the theoretical basis is strong, the paper does not fully explore the practicality of implementing HSR across various LLMs, especially in comparison with other baseline methods on a wider range of benchmarks. 2. The empirical results lack details on latency and memory usage in LLM settings, which are crucial for assessing real-world efficiency. 3. The absence of accessible code for implementation makes it challenging to independently verify the method’s performance.
1. The complexity is smaller than the original $O(mn)$. 2. The method is accurate for ReLU attention.
1. The presentation is not clear. For example, it is hard to see the difference between Algorithm 2 and Algorithm 3. 2. The assumptions in the theoretical results are not justified. They may not be relevant in practical settings. 3. There are no experimental results demonstrating that the proposed method can speed up attention computation in practice.
1. Addressing a Timely and Important Problem. The paper tackles the critical issue of optimizing sparse attention mechanisms in Large Language Models (LLMs), specifically focusing on both ReLU and Softmax attention sparsity. 2. Leveraging the HSR Data Structure. By leveraging the Half-Space Reporting (HSR) data structure, the paper reduces computational complexity in sparse attention and activation. 3. Theoretical Analysis. It provides rigorous theoretical proofs, ensuring the proposed methods a
1. Very limited evaluation and lack of comparisons. 2. No cost or quantitative analysis of HSR. See my questions for details.
Using the HSR data structure to detect non-zero entries in the attention mechanism is novel and interesting. Since it can detect exact non-zero entries, we can reconstruct ReLU attention and show negligible errors (as far as the author claims, but I have concerns about the evaluation) in softmax attention.
1. Lack of proper downstream task evaluation. - I do not think the perplexity evaluation reflects real-world performance. Also, the Y-axis range of Figure 2 is far too large, considering that a 0.1 difference in perplexity is significant in downstream tasks. I suggest evaluating the method on InfiniteBench (https://github.com/OpenBMB/InfiniteBench) or something more realistic. 2. Lack of latency evaluation (the claimed latency improvement may be marginal). - Can you provide wall-clock latency on GPU?
Taxonomy
Methods: Attention Is All You Need · Softmax
