A Residual-Aware Theory of Position Bias in Transformers

Hanna Herasimchyk; Robin Labryga; Tomislav Prusina; Sören Laue

arXiv:2602.16837·cs.LG·February 20, 2026

Open Access

TL;DR

This paper develops a residual-aware theoretical framework to explain the position bias in Transformer models, revealing how residual connections prevent attention collapse and cause a U-shaped attention distribution across token positions.

Contribution

The paper introduces a residual-aware theory of cumulative attention rollout that explains position bias and the Lost-in-the-Middle phenomenon in causal Transformers.

Findings

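01. Under realistic conditions, residual connections prevent attention from collapsing onto the first token at infinite depth.

02. At finite depth, causal Transformers exhibit a U-shaped position bias, with attention concentrating on early and late tokens.

03. The U-shaped bias gives a principled architectural explanation for the Lost-in-the-Middle phenomenon.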

Abstract

Transformer models systematically favor certain token positions, yet the architectural origins of this position bias remain poorly understood. Under causal masking at infinite depth, prior theoretical analyses of attention rollout predict an inevitable collapse of attention onto the first token. Such collapse, however, does not occur in practice. We resolve this discrepancy with a residual-aware theory of cumulative attention rollout. By incorporating residual connections, we show that this architectural component prevents collapse under realistic conditions. At finite depth, we prove that causal Transformers induce a U-shaped position bias, with attention concentrating on early and late tokens. This result provides a principled architectural explanation for the Lost-in-the-Middle phenomenon.
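
The contrast described in the abstract can be illustrated with a small numerical sketch. This is not the paper's construction. It assumes uniform causal attention in every layer, a fixed residual weight, and the common rollout form in which each layer contributes a mixing matrix `(1 - r) * A + r * I`. The names `uniform_causal_attention`, `rollout`, and `residual_weight` are hypothetical and chosen only for this example.

```python
import numpy as np

def uniform_causal_attention(n):
    """Row-stochastic causal attention: token i attends uniformly to tokens 0..i.
    (Toy assumption; real attention weights are input-dependent.)"""
    mask = np.tril(np.ones((n, n)))
    return mask / mask.sum(axis=1, keepdims=True)

def rollout(n, depth, residual_weight):
    """Cumulative rollout: product over layers of (1 - r) * A + r * I.
    residual_weight = 0 recovers attention-only rollout."""
    A = uniform_causal_attention(n)
    layer = (1.0 - residual_weight) * A + residual_weight * np.eye(n)
    R = np.eye(n)
    for _ in range(depth):
        R = layer @ R
    return R

n = 16

# Attention-only rollout at large depth: mass collapses onto the first token.
no_residual = rollout(n, depth=200, residual_weight=0.0)

# Rollout with a residual term at small depth: the identity path keeps mass
# on the final token itself, while repeated causal averaging favors early tokens.
with_residual = rollout(n, depth=4, residual_weight=0.5)

# Inspect how the final token distributes its cumulative attention.
last_no_res = no_residual[-1]
last_res = with_residual[-1]
print("no residual, first-token mass:", last_no_res[0])
print("with residual, first / middle / last:",
      last_res[0], last_res[n // 2], last_res[-1])
```

In this toy setting the final token's row places more weight on both the first and the last position than on the midpoint, which is the qualitative finite-depth pattern the abstract describes. With a fixed residual weight and input-independent attention, this simplified model would still drift toward the first token as depth grows, so it should not be read as reproducing the paper's infinite-depth result.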

Taxonomy

Topics: Advanced Memory and Neural Computing · Neural dynamics and brain function · Face Recognition and Perception