TL;DR
This paper introduces Dynamic Hierarchical Sparse Attention (DHSA), a novel, data-driven method for long-context modeling in resource-limited on-device LLMs that improves efficiency and accuracy without retraining.
Contribution
DHSA adaptively segments sequences and computes importance scores dynamically, outperforming static and heuristic methods in accuracy and efficiency for long-context LLMs.
Findings
DHSA reduces prefill latency by 20–60% and peak memory usage by 35% relative to dense attention.
DHSA achieves 6-18% higher accuracy than baseline methods.
DHSA matches dense attention accuracy with lower computational cost.
Abstract
The quadratic cost of attention hinders the scalability of long-context LLMs, especially in resource-constrained settings. Existing static sparse methods, such as sliding windows or global tokens, exploit the sparsity of attention to reduce this cost, but adapt poorly to content-dependent variations in attention because their patterns are fixed. While previous work has proposed several dynamic approaches to improve flexibility, they still depend on predefined templates or heuristic mechanisms. Such strategies reduce generality and prune tokens that remain contextually important, limiting their accuracy across diverse tasks. To tackle these bottlenecks of existing methods for long-context modeling, we introduce Dynamic Hierarchical Sparse Attention (DHSA), a data-driven framework that dynamically predicts attention sparsity online without retraining. Our proposed DHSA adaptively…
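The general idea of chunk-level sparsity prediction described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation. The function name `chunk_sparse_attention`, the mean-pooled chunk summaries, the dot-product chunk scoring, and the caller-supplied `boundaries` are all assumptions made for illustration; the abstract does not specify how DHSA computes any of these. The sketch also builds a dense mask for clarity, so it shows only the selection logic and gives none of the latency or memory savings of a real sparse kernel.

```python
import numpy as np

def chunk_sparse_attention(q, k, v, boundaries, top_chunks):
    """Single-head causal attention restricted to the top-scoring key chunks per query.

    q, k, v     : arrays of shape (n, d)
    boundaries  : sorted chunk start indices, beginning with 0
                  (hypothetical input; a learned predictor could supply these)
    top_chunks  : number of chunks each query may attend to
    Returns (output of shape (n, d), boolean token mask of shape (n, n)).
    """
    n, d = q.shape
    starts = np.asarray(boundaries)
    ends = np.append(starts[1:], n)
    positions = np.arange(n)

    # Chunk index of every token.
    chunk_id = np.searchsorted(starts, positions, side="right") - 1

    # Assumed chunk summary: mean of the keys in each chunk.
    chunk_keys = np.stack([k[s:e].mean(axis=0) for s, e in zip(starts, ends)])

    # Coarse query-to-chunk scores, shape (n, num_chunks).
    chunk_scores = q @ chunk_keys.T

    # A chunk is visible to a query only if it starts at or before the query.
    visible = starts[None, :] <= positions[:, None]
    chunk_scores = np.where(visible, chunk_scores, -np.inf)

    # Always keep the query's own chunk so every row has at least one key.
    chunk_scores[positions, chunk_id] = np.inf

    # Select the highest-scoring chunks for each query.
    n_keep = min(top_chunks, len(starts))
    keep = np.argsort(-chunk_scores, axis=1)[:, :n_keep]
    chunk_mask = np.zeros((n, len(starts)), dtype=bool)
    np.put_along_axis(chunk_mask, keep, True, axis=1)
    chunk_mask &= visible

    # Expand the chunk-level mask to token level and apply causality.
    token_mask = chunk_mask[:, chunk_id] & np.tril(np.ones((n, n), dtype=bool))

    # Masked softmax attention over the surviving keys.
    scores = np.where(token_mask, (q @ k.T) / np.sqrt(d), -np.inf)
    scores = scores - scores.max(axis=1, keepdims=True)
    weights = np.exp(scores)
    weights /= weights.sum(axis=1, keepdims=True)
    return weights @ v, token_mask

# Usage: 12 tokens split into 4 variable-length chunks, 2 chunks kept per query.
rng = np.random.default_rng(0)
q, k, v = (rng.standard_normal((12, 4)) for _ in range(3))
out, mask = chunk_sparse_attention(q, k, v, boundaries=[0, 3, 7, 10], top_chunks=2)
```

When `top_chunks` equals the number of chunks, the mask reduces to the ordinary causal mask and the output matches dense attention, which gives a convenient sanity check for this kind of sketch.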
Peer Reviews
Decision·ICLR 2026 Conference Desk Rejected Submission
- The dynamic partitioning of tokens is a novel and effective idea.
- The accuracy evaluation results look promising.

- DHSA requires training an MLP layer to predict chunk boundaries, making it harder to deploy than existing methods.
- The efficiency evaluation is not very comprehensive.
1. This paper tries to tackle an important problem of how to improve the efficiency of LLM inference in long context by leveraging sparsity in attention.
2. A clear hierarchical routing formulation with a concrete sparse attention pipeline. The design and implementation details are explained well.
3. The paper reports accuracy improvements over existing static sparse attention baselines and lower latency over full dense attention.

1. Missing comparisons to other more recent dynamic sparse baselines. Current baselines are mostly static patterns on static template.
2. Missing upper bound analysis with oracle top-k baseline to show how close the number of tokens selected is to the optimal choice. Missing latency comparison with baselines other than dense attention.
3. Still not clear why dynamic chunking is needed if there is an accurate way to estimate the contribution of each chunk to the overall attention.
4. Not clear
* Efficient long-context handling: Matches dense attention accuracy while cutting prefill latency by 20–60% and peak memory usage by 35% at 8K context, and scales to 100K context on a single 24 GB GPU (where dense kernels fail).
* Input-adaptive sparsity: Avoids rigid static patterns or heuristics; dynamically predicts attention sparsity via data-driven chunking and similarity, adapting to diverse tasks/inputs.
* Easy integration: Functions as a drop-in module for standard decoder-only Transform

* Hyperparameter dependence: Its performance relies on hyperparameters like the number of chunks and preserved keys, whose optimal settings vary across models, tasks, and hardware, lacking adaptive allocation strategies.
* Boundary predictor constraints: The boundary detector requires training on specific datasets (e.g., Long Data Collections) and may need adjustments for diverse text types, introducing potential generalization gaps.
* Hardware adaptability limitations: While tested on NVIDIA
This paper introduces a technically elegant and well-motivated solution to one of the most critical bottlenecks in modern LLMs: efficient long-context inference. The proposed DHSA framework combines dynamic boundary detection with hierarchical sparsity prediction, achieving strong accuracy-efficiency trade-offs across tasks such as LongBench and Needle-in-a-Haystack. Its design as a training-free, drop-in module makes it immediately applicable on-device. The empirical results show c
The contribution is incremental relative to recent dynamic sparsity and KV compression literature (e.g., MInference, H2O, PyramidKV), with limited theoretical grounding for why hierarchical chunking yields near-optimal sparsity prediction. The dependency on hyperparameter tuning for chunk size and sparsity budgets limits generalizability across architectures and devices. The method's scalability beyond 100K context is mentioned but needs to be empirically validated. The experimenta
