A Residual-Aware Theory of Position Bias in Transformers
Hanna Herasimchyk, Robin Labryga, Tomislav Prusina, Sören Laue

TL;DR
This paper develops a residual-aware theoretical framework for position bias in Transformer models. It shows that residual connections prevent the attention collapse onto the first token predicted by earlier analyses, and that causal Transformers at finite depth exhibit a U-shaped attention distribution across token positions.
Contribution
The paper introduces a residual-aware theory of cumulative attention rollout that explains position bias and the Lost-in-the-Middle phenomenon in Transformers.
Findings
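Residual connections prevent the collapse of attention onto the first token that prior rollout analyses predict under causal masking at infinite depth.
Causal Transformers exhibit a U-shaped position bias at finite depth, with attention concentrating on early and late tokens.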
The theory explains the architectural origins of position bias phenomena.
Abstract
Transformer models systematically favor certain token positions, yet the architectural origins of this position bias remain poorly understood. Under causal masking at infinite depth, prior theoretical analyses of attention rollout predict an inevitable collapse of attention onto the first token. Such collapse, however, does not occur in practice. We resolve this discrepancy with a residual-aware theory of cumulative attention rollout. By incorporating residual connections, we show that this architectural component prevents collapse under realistic conditions. At finite depth, we prove that causal Transformers induce a U-shaped position bias, with attention concentrating on early and late tokens. This result provides a principled architectural explanation for the Lost-in-the-Middle phenomenon.
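The contrast the abstract draws between plain and residual-aware rollout can be illustrated with a small numerical toy. The sketch below is not the paper's model: it assumes uniform causal attention (each token attends equally to itself and all earlier tokens), identical layers, and a fixed hypothetical mixing coefficient `residual_weight` that blends each attention matrix with the identity. It only shows how a residual term keeps mass on late positions at finite depth; it does not reproduce the paper's infinite-depth result.

```python
import numpy as np

def causal_uniform_attention(n):
    """Row-stochastic causal attention: token i attends uniformly to tokens 0..i."""
    mask = np.tril(np.ones((n, n)))
    return mask / mask.sum(axis=1, keepdims=True)

def rollout(attn, depth, residual_weight=0.0):
    """Cumulative rollout over `depth` identical layers.

    Each layer is residual_weight * I + (1 - residual_weight) * attn.
    residual_weight is an illustrative assumption, not a value from the paper.
    """
    n = attn.shape[0]
    layer = residual_weight * np.eye(n) + (1.0 - residual_weight) * attn
    return np.linalg.matrix_power(layer, depth)

n_tokens, depth = 6, 2
attn = causal_uniform_attention(n_tokens)

# Last row: how much each input position contributes to the final token.
last_plain = rollout(attn, depth)[-1]
last_residual = rollout(attn, depth, residual_weight=0.5)[-1]

print("plain rollout:   ", np.round(last_plain, 3))
print("residual rollout:", np.round(last_residual, 3))
```

In this toy setting, the plain rollout row decreases monotonically from the first token to the last, while the residual-aware row is high at both ends and lowest at an interior position. Changing `n_tokens`, `depth`, or `residual_weight` shifts where the minimum falls.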
Taxonomy
Topics: Advanced Memory and Neural Computing · Neural dynamics and brain function · Face Recognition and Perception
