Switch Attention: Towards Dynamic and Fine-grained Hybrid Transformers
Yusheng Zhao, Hourun Li, Bohan Wu, Yichun Yin, Lifeng Shang, Jingyang Yuan, Meng Zhang, Ming Zhang

TL;DR
This paper introduces Switch Attention, a hybrid transformer mechanism that dynamically allocates computation between full and sliding window attention, improving efficiency and performance on long-context language tasks.
Contribution
The paper proposes Switch Attention (SwiAttn), a hybrid transformer that routes each token at each layer to either full attention or sliding window attention, enabling more efficient long-context modeling.
Findings
Outperforms existing models across 23 benchmark datasets.
Effectively handles both regular and long context lengths.
Demonstrates improved efficiency and accuracy.
Abstract
The attention mechanism has been the core component in modern transformer architectures. However, the computation of standard full attention scales quadratically with the sequence length, serving as a major bottleneck in long-context language modeling. Sliding window attention restricts the context length for better efficiency at the cost of narrower receptive fields. While existing efforts attempt to take the benefits from both sides by building hybrid models, they often resort to static, heuristically designed alternating patterns that limit efficient allocation of computation in various scenarios. In this paper, we propose Switch Attention (SwiAttn), a novel hybrid transformer that enables dynamic and fine-grained routing between full attention and sliding window attention. For each token at each transformer layer, SwiAttn dynamically routes the computation to either a full-attention…
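To make the routing idea in the abstract concrete, here is a minimal NumPy sketch of per-token selection between a causal full-attention mask and a causal sliding-window mask. The abstract is truncated before it describes the actual router, so everything about the router here is an assumption for illustration: the names `routed_attention` and `w_router`, the single linear logit per token, and the hard threshold at zero are hypothetical and are not taken from the paper.

```python
import numpy as np

def causal_mask(n):
    # True where query i may attend to key j (j <= i).
    return np.tril(np.ones((n, n), dtype=bool))

def sliding_window_mask(n, window):
    # Causal mask restricted to the `window` most recent keys (self included).
    i = np.arange(n)[:, None]
    j = np.arange(n)[None, :]
    return (j <= i) & (j > i - window)

def masked_attention(q, k, v, mask):
    # Standard scaled dot-product attention with a boolean mask.
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores = np.where(mask, scores, -np.inf)
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v

def routed_attention(x, q, k, v, w_router, window):
    # ASSUMPTION: a hypothetical hard router, one logit per token.
    # logit > 0 -> that token's query row uses the full causal mask,
    # otherwise it uses the sliding-window mask.
    n = x.shape[0]
    use_full = (x @ w_router) > 0
    mask = np.where(use_full[:, None],
                    causal_mask(n),
                    sliding_window_mask(n, window))
    return masked_attention(q, k, v, mask), use_full
```

For a sequence of 8 tokens and a window of 3, the full causal mask allows 36 query-key pairs while the sliding-window mask allows 21, which is the gap that routing is meant to exploit. This sketch still builds a dense n-by-n score matrix, so it only illustrates the semantics; an efficient implementation would need to avoid computing the masked-out entries.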
