Elastic Attention: Test-time Adaptive Sparsity Ratios for Efficient Transformers
Zecheng Tang, Quantong Qiu, Yi Yang, Zhiyi Hong, Haiya Xiang, Kebin Liu, Qingqing Dang, Juntao Li, Min Zhang

TL;DR
Elastic Attention introduces a dynamic, input-dependent sparsity mechanism for transformers: attention heads are assigned to different computation modes during inference, which improves efficiency while keeping strong performance in long-context scenarios.
Contribution
We propose Elastic Attention, a novel method that dynamically adjusts attention sparsity ratios at test time using an Attention Router, enhancing scalability and adaptability of large language models.
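To make the notion of a sparsity ratio concrete, the following is a rough cost sketch, not the paper's own accounting. It assumes the sparsity ratio is the fraction of heads run in a sparse mode, and that the sparse mode is a streaming-style pattern in which each query attends to a fixed number of keys; the function name and the `window` and `sink` values are hypothetical.

```python
def relative_attention_cost(seq_len, sparse_head_fraction, window=1024, sink=4):
    """Estimate attention cost relative to running every head with full attention.

    Assumptions (illustrative only, not taken from the paper):
      - a full-attention head scores seq_len * seq_len query-key pairs
        (causal masking is ignored for simplicity);
      - a sparse head scores at most `window + sink` keys per query.
    """
    if not 0.0 <= sparse_head_fraction <= 1.0:
        raise ValueError("sparse_head_fraction must lie in [0, 1]")
    full_cost = seq_len * seq_len
    # A sparse head can never look at more keys than the sequence contains.
    sparse_cost = seq_len * min(seq_len, window + sink)
    mixed_cost = ((1.0 - sparse_head_fraction) * full_cost
                  + sparse_head_fraction * sparse_cost)
    return mixed_cost / full_cost


if __name__ == "__main__":
    for fraction in (0.0, 0.5, 1.0):
        print(fraction, relative_attention_cost(65536, fraction))
```

Under these assumptions the sparse heads contribute very little at long sequence lengths, so the total cost is dominated by the heads that remain in full attention. This is why a ratio that can change per input matters more than the details of the sparse pattern.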
Findings
Achieves strong performance with efficient inference on long-context benchmarks.
Enables dynamic adjustment of attention modes during inference.
Requires only 12 hours of training on 8xA800 GPUs.
Abstract
The quadratic complexity of standard attention mechanisms poses a significant scalability bottleneck for large language models (LLMs) in long-context scenarios. While hybrid attention strategies that combine sparse and full attention within a single model offer a viable solution, they typically employ static computation ratios (i.e., fixed proportions of sparse versus full attention) and fail to adapt to the varying sparsity sensitivities of downstream tasks during inference. To address this issue, we propose Elastic Attention, which allows the model to dynamically adjust its overall sparsity based on the input. This is achieved by integrating a lightweight Attention Router into the existing pretrained model, which dynamically assigns each attention head to different computation modes. Within only 12 hours of training on 8xA800 GPUs, our method enables models to achieve both strong…
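The routing idea in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the mean pooling, the two-layer MLP, the sigmoid threshold, and every name (`init_router`, `route_heads`, `router_dim`, `threshold`) are hypothetical, and the weights are random rather than trained.

```python
import numpy as np


def init_router(hidden_dim, num_heads, router_dim=32, seed=0):
    """Create random weights for a small two-layer router (hypothetical layout)."""
    rng = np.random.default_rng(seed)
    return {
        "w1": rng.normal(0.0, 0.02, (hidden_dim, router_dim)),
        "b1": np.zeros(router_dim),
        "w2": rng.normal(0.0, 0.02, (router_dim, num_heads)),
        "b2": np.zeros(num_heads),
    }


def route_heads(hidden_states, params, threshold=0.5):
    """Assign each attention head to sparse (True) or full (False) attention.

    hidden_states: array of shape (seq_len, hidden_dim) for one input.
    Returns (use_sparse, sparsity_ratio), where use_sparse has shape
    (num_heads,) and sparsity_ratio is the fraction of heads routed to sparse.
    """
    # Summarise the input with mean pooling (an assumption of this sketch).
    pooled = hidden_states.mean(axis=0)
    # Lightweight MLP: linear -> ReLU -> linear, giving one logit per head.
    hidden = np.maximum(pooled @ params["w1"] + params["b1"], 0.0)
    logits = hidden @ params["w2"] + params["b2"]
    # Sigmoid turns each logit into a probability of choosing the sparse mode.
    p_sparse = 1.0 / (1.0 + np.exp(-logits))
    use_sparse = p_sparse > threshold
    return use_sparse, float(use_sparse.mean())


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    x = rng.normal(size=(128, 64))  # stand-in for one layer's hidden states
    mask, ratio = route_heads(x, init_router(hidden_dim=64, num_heads=8))
    print(mask, ratio)
```

Because the decision depends on the pooled input, two different prompts can end up with different per-head assignments and therefore different overall sparsity. A static hybrid scheme, by contrast, would fix the mask ahead of time.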
Code & Models
- 🤗 LCM-Lab/full_xattn_64k_qwen3-4b_wfrozen (model, 1 download)
- 🤗 LCM-Lab/full_streaming_64k_qwen3-4b_end0.7_wfrozen (model, 1 download)
- 🤗 LCM-Lab/full_xattn_64k_qwen3-8b_end0.7_wfrozen (model, 1 download)
- 🤗 LCM-Lab/full_streaming_64k_qwen3-4b_MLP2.0_wfrozen (model, 2 downloads)
- 🤗 LCM-Lab/full_xattn_64k_llama3.1-8b_wfrozen (model)
- 🤗 LCM-Lab/full_streaming_64k_qwen3-4b_MLP3.0_wfrozen (model)
- 🤗 LCM-Lab/full_streaming_64k_qwen3-4b_MLP8.0_wfrozen (model, 1 download)
- 🤗 LCM-Lab/infllm_qwen3-8b (model)
- 🤗 LCM-Lab/moba_qwen3-8b (model, 3 downloads)
- 🤗 LCM-Lab/moba_llama (model)
Taxonomy
Topics: Topic Modeling · Advanced Neural Network Applications · Domain Adaptation and Few-Shot Learning
