Short-Range Dependency Effects on Transformer Instability and a Decomposed Attention Solution
Suvadeep Hajra

TL;DR
This paper identifies self-attention's limited ability to capture short-range dependencies as one cause of transformer training instability and proposes a decomposed attention method that improves stability, reduces perplexity, and speeds up inference.
Contribution
The paper introduces Long Short-attention (LS-attention), a decomposed attention mechanism that separates local and global attention to enhance training stability and efficiency.
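To make the local/global split concrete, here is a minimal NumPy sketch of one generic way to partition causal attention into a short-range band and a long-range remainder. It is an illustration under stated assumptions, not the paper's implementation: the function names, the `window` parameter, and the choice to return the two branches separately are all hypothetical, and this summary does not say how LS-attention defines or combines its local and global parts.

```python
import numpy as np

def local_global_masks(n, window):
    """Split an n x n causal mask into a local band and a long-range remainder.

    `window` (hypothetical parameter) is how many most recent positions,
    including the current one, count as "local".
    """
    i = np.arange(n)[:, None]  # query positions
    j = np.arange(n)[None, :]  # key positions
    causal = j <= i
    local = causal & (i - j < window)
    global_ = causal & ~local
    return local, global_

def masked_attention(q, k, v, mask):
    """Scaled dot-product attention restricted to positions where mask is True.

    Rows with no allowed position (e.g. early tokens in the long-range
    branch) return a zero vector instead of NaN.
    """
    d = q.shape[-1]
    logits = (q @ k.T) / np.sqrt(d)
    logits = np.where(mask, logits, -1e30)
    logits = logits - logits.max(axis=-1, keepdims=True)
    weights = np.exp(logits) * mask  # masked positions contribute exactly zero
    denom = weights.sum(axis=-1, keepdims=True)
    weights = np.divide(weights, denom, out=np.zeros_like(weights), where=denom > 0)
    return weights @ v

# Toy usage: 6 tokens, 4-dimensional head, local window of 2 tokens.
rng = np.random.default_rng(0)
q, k, v = (rng.normal(size=(6, 4)) for _ in range(3))
local_mask, global_mask = local_global_masks(6, window=2)
local_out = masked_attention(q, k, v, local_mask)    # short-range branch
global_out = masked_attention(q, k, v, global_mask)  # long-range branch
```

The two masks are disjoint and together cover the full causal mask, so each query-key pair is handled by exactly one branch in this sketch.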
Findings
LS-attention reduces validation perplexity to 40% of that obtained with baseline methods.
It achieves similar perplexity with only 5% of the GPU hours of other methods.
Inference latency is decreased by up to 36% with LS-attention.
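To put the perplexity figure in context, the short sketch below uses the standard definition of perplexity as the exponential of the mean per-token negative log-likelihood (NLL). Under that definition, scaling perplexity to 40% of a baseline corresponds to lowering the mean NLL by ln(2.5), about 0.92 nats per token. The helper names and the toy NLL values are hypothetical and are not taken from the paper.

```python
import math

def perplexity(token_nlls):
    """Perplexity = exp(mean per-token negative log-likelihood, in nats)."""
    return math.exp(sum(token_nlls) / len(token_nlls))

def nll_drop_for_ratio(ratio):
    """Mean-NLL reduction (nats) that scales perplexity by `ratio`."""
    return -math.log(ratio)

# Toy per-token NLL values (made up for illustration only).
baseline_nlls = [3.0, 3.2, 2.8]
drop = nll_drop_for_ratio(0.4)
improved_nlls = [x - drop for x in baseline_nlls]

ratio = perplexity(improved_nlls) / perplexity(baseline_nlls)
print(f"NLL drop: {drop:.3f} nats, perplexity ratio: {ratio:.2f}")
```

By the same arithmetic, 5% of the GPU hours is a 20x reduction in compute, and a 36% latency decrease leaves 64% of the original latency.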
Abstract
Transformer language models have driven significant progress across various fields, including natural language processing and computer vision. A central component of these models is the self-attention (SA) mechanism, which learns rich vector representations of tokens by modeling their relationships with others in a sequence. However, despite extensive research, transformers continue to suffer from training instability -- often manifesting as spikes or divergence in the training loss during a run. In this work, we identify one source of this instability: SA's limited ability to capture short-range dependencies, especially in tasks like language modeling, where almost every token heavily relies on its nearby neighbors. This limitation causes the pre-softmax logits of SA to grow rapidly, destabilizing training. To address this, we propose decomposing the SA into local (short-range) and…
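The role of pre-softmax logit growth described in the abstract can be illustrated with a small diagnostic. The sketch below does not reproduce the paper's experiments; it only shows, on random toy data, that scaling queries and keys raises the largest attention logit quadratically and pushes the softmax toward a near one-hot distribution (low entropy). The function name `attention_logit_stats` is hypothetical.

```python
import numpy as np

def attention_logit_stats(q, k):
    """Return (max |pre-softmax logit|, mean attention entropy in nats)."""
    d = q.shape[-1]
    logits = (q @ k.T) / np.sqrt(d)
    # Numerically stable softmax over the key axis.
    shifted = logits - logits.max(axis=-1, keepdims=True)
    p = np.exp(shifted)
    p /= p.sum(axis=-1, keepdims=True)
    # Small epsilon avoids log(0) for fully saturated rows.
    entropy = -(p * np.log(p + 1e-12)).sum(axis=-1).mean()
    return float(np.abs(logits).max()), float(entropy)

# Toy usage: the same random queries/keys at increasing scales.
rng = np.random.default_rng(0)
q = rng.normal(size=(8, 16))
k = rng.normal(size=(8, 16))
for scale in (1.0, 3.0, 10.0):
    max_logit, entropy = attention_logit_stats(scale * q, scale * k)
    print(f"scale={scale:>4}: max |logit|={max_logit:8.2f}, entropy={entropy:.3f}")
```

Tracking a statistic like the maximum logit per layer during training is one simple way to spot this kind of growth early; whether the paper uses this exact diagnostic is not stated in this summary.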
Taxonomy
Topics: Power Transformer Diagnostics and Insulation · High voltage insulation and dielectric phenomena · Power Quality and Harmonics
Methods: Softmax · Attention Is All You Need
