Scaling Bidirectional Spans and Span Violations in Attention Mechanism

Jongwook Kim; Sangheon Yun; Sukjin Yoon

arXiv:2512.13033·cs.LG·December 16, 2025

Open Access

TL;DR

This paper introduces an optimization framework for Transformers that decomposes attention gradients into spans and violations, leading to improved training efficiency and performance, especially on larger datasets.

Contribution

It presents a gradient decomposition method for Transformer training that splits the attention backward pass into span and violation components and scales them selectively, while leaving the forward-pass QKV structure unchanged.

Findings

01. The method achieved a 0.56% reduction in validation loss on WikiText-2.

02. The standard attention gradient is shown to be suboptimal.

03. Selective scaling of gradient components improves learning signals.

Abstract

The canonical $O(N^{2})$ Transformer remains the empirical performance frontier in sequence modeling, and its training can be further optimized by addressing geometric inefficiency. We propose an optimization framework that leverages an asymmetric projection to decompose the backward-pass gradients into parallel spans and orthogonal violations, while keeping the canonical forward-pass $QKV$ structure intact. Through consistent experimental validation across various decomposition and projection setups, we provide strong theoretical evidence: the standard attention gradient is suboptimal. We demonstrated that selectively scaling these components, focusing primarily on $0^{\text{th}}$ order bidirectional parallel spans, yields the most effective learning signal. On the limited WikiText-2 dataset, and using a crude configuration, this method achieved a $0.56\%$ reduction in validation loss,…
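
As a rough illustration of the general idea in the abstract, and not the authors' method, the sketch below splits a gradient vector into a component parallel to a reference direction and an orthogonal remainder, then rescales each part independently. The reference vector `ref`, the scale factors `alpha` and `beta`, and all function names are hypothetical. The paper's asymmetric projection and its definition of spans are not specified in this summary, so a plain orthogonal projection is swapped in here.

```python
def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def decompose(grad, ref):
    """Split grad into a part parallel to ref and a part orthogonal to it.

    Assumption: a plain orthogonal projection onto ref. The paper describes
    an asymmetric projection whose details are not given in this summary.
    """
    denom = dot(ref, ref)
    if denom == 0.0:
        # No direction to project onto: everything counts as orthogonal.
        return [0.0] * len(grad), list(grad)
    coeff = dot(grad, ref) / denom
    parallel = [coeff * r for r in ref]
    orthogonal = [g - p for g, p in zip(grad, parallel)]
    return parallel, orthogonal


def rescale(grad, ref, alpha=1.0, beta=1.0):
    """Recombine the two parts with separate (hypothetical) scale factors."""
    parallel, orthogonal = decompose(grad, ref)
    return [alpha * p + beta * o for p, o in zip(parallel, orthogonal)]


grad = [3.0, 4.0]   # stand-in for a backward-pass gradient
ref = [1.0, 0.0]    # stand-in for the direction that defines the "span"

parallel, orthogonal = decompose(grad, ref)
scaled = rescale(grad, ref, alpha=2.0, beta=1.0)  # emphasize the parallel part
```

With `alpha = beta = 1.0` the recombined vector equals the original gradient, so a scheme of this shape only changes the backward signal when the two factors differ; the forward computation is never touched.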

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Model Reduction and Neural Networks · Stochastic Gradient Optimization Techniques