Short-Range Dependency Effects on Transformer Instability and a Decomposed Attention Solution

Suvadeep Hajra

arXiv:2505.15548·cs.LG·May 22, 2025

TL;DR

This paper identifies short-range dependency limitations in self-attention as a cause of transformer training instability and proposes a decomposed attention method that improves stability, reduces perplexity, and speeds up inference.

Contribution

The paper introduces Long Short-attention (LS-attention), a decomposed attention mechanism that separates local and global attention to enhance training stability and efficiency.
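The idea of splitting attention into local and global heads can be sketched in a few lines. This is a minimal illustration and not the paper's implementation: the sliding-window mask for the local head, the absence of learned projections (the input is used directly as queries, keys, and values), and the names `causal_mask`, `attention_head`, and `ls_attention` are all assumptions made for this example.

```python
import math

def softmax(row):
    """Numerically stable softmax over a list; -inf entries get weight 0."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def causal_mask(n, window=None):
    """mask[i][j] is True when query i may attend to key j.

    window=None gives full causal attention (a "global" head);
    window=w restricts each query to the w most recent positions
    (a "local" head). The windowing rule is an assumption.
    """
    return [[j <= i and (window is None or i - j < window) for j in range(n)]
            for i in range(n)]

def attention_head(q, k, v, mask):
    """Single-head scaled dot-product attention on lists of vectors."""
    d = len(q[0])
    out = []
    for i, qi in enumerate(q):
        # Pre-softmax logits; masked positions are set to -inf.
        logits = [sum(a * b for a, b in zip(qi, kj)) / math.sqrt(d)
                  if mask[i][j] else float("-inf")
                  for j, kj in enumerate(k)]
        w = softmax(logits)
        out.append([sum(w[j] * v[j][c] for j in range(len(v)))
                    for c in range(len(v[0]))])
    return out

def ls_attention(x, window=2):
    """Hypothetical decomposition: one local head and one global head,
    with their outputs concatenated per position."""
    n = len(x)
    local = attention_head(x, x, x, causal_mask(n, window))
    glob = attention_head(x, x, x, causal_mask(n))
    return [l + g for l, g in zip(local, glob)]

x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
out = ls_attention(x, window=2)
print(len(out), len(out[0]))  # 3 positions, local and global features concatenated
```

The local head here only sees the last `window` tokens, while the global head sees the full causal context. A real model would add learned query, key, value, and output projections and use several heads of each kind.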

Findings

01

LS-attention reduces validation perplexity to 40% of that achieved by baseline methods.

02

LS-attention reaches similar perplexity using only 5% of the GPU hours required by other methods.

03

Inference latency is decreased by up to 36% with LS-attention.

Abstract

Transformer language models have driven significant progress across various fields, including natural language processing and computer vision. A central component of these models is the self-attention (SA) mechanism, which learns rich vector representations of tokens by modeling their relationships with others in a sequence. However, despite extensive research, transformers continue to suffer from training instability -- often manifesting as spikes or divergence in the training loss during a run. In this work, we identify one source of this instability: SA's limited ability to capture short-range dependencies, especially in tasks like language modeling, where almost every token heavily relies on its nearby neighbors. This limitation causes the pre-softmax logits of SA to grow rapidly, destabilizing training. To address this, we propose decomposing the SA into local (short-range) and…
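To give a feel for why growing pre-softmax logits matter, the sketch below scales a fixed set of logits and reports how concentrated the resulting softmax distribution becomes. It illustrates a general property of softmax only and is not taken from the paper's analysis; the logit values and scale factors are arbitrary choices for this example.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def entropy(p):
    """Shannon entropy in nats; zero-probability entries contribute nothing."""
    return -sum(x * math.log(x) for x in p if x > 0)

base = [1.0, 0.5, 0.0, -0.5]  # arbitrary example logits
for scale in (1, 10, 100):
    p = softmax([scale * x for x in base])
    # As the scale grows, the largest weight approaches 1 and entropy approaches 0.
    print(scale, round(max(p), 4), round(entropy(p), 4))
```

As the logits grow, the attention weights collapse toward a one-hot distribution, so a single key dominates each query's output.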

Taxonomy

Topics: Power Transformer Diagnostics and Insulation · High voltage insulation and dielectric phenomena · Power Quality and Harmonics

Methods: Softmax · Attention Is All You Need