Residual Stream Duality in Modern Transformer Architectures

Yifan Zhang

arXiv:2603.16039·cs.LG·May 15, 2026

TL;DR

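This paper introduces a duality perspective on the Transformer residual stream: at a fixed token position, a causal depth-wise residual attention read is the same local operator as causal short sliding-window attention (ShortSWA), written over depth rather than over sequence. The paper uses this symmetry to guide design choices for model modifications and efficiency.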

Contribution

It presents a two-axis view of Transformers, connecting residual operations to sequence and depth dimensions, and discusses implications for model design and hardware efficiency.

Findings

01. Residual stream duality links depth-wise residuals to sliding-window attention.

02. Learned depth aggregation can outperform uniform residual accumulation.

03. Sequence-axis ShortSWA is hardware-friendly for large-scale models.

Abstract

Recent work has made clear that the residual pathway is not mere optimization plumbing; it is part of the model's representational machinery. We agree, but argue that the cleanest way to organize this design space is through a two-axis view of the Transformer. A decoder evolves information along two ordered dimensions: sequence position and layer depth. Self-attention already provides adaptive mixing along the sequence axis, whereas the residual stream usually performs fixed addition along the depth axis. If we fix a token position and treat layer index as the ordered variable, then a causal depth-wise residual attention read is exactly the same local operator as causal short sliding-window attention (ShortSWA), except written over depth rather than over sequence. This is the core residual stream duality behind Transformer$^{2}$. This perspective also clarifies the recent literature.…
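
The duality described in the abstract can be illustrated with a small NumPy sketch. This is not the paper's implementation: the function name `causal_window_attention`, the tensor layout `[layers, tokens, dim]`, the window size, and the absence of learned query/key/value projections are all assumptions made for illustration. The point is only that one causal windowed operator can be applied unchanged along either ordered axis.

```python
import numpy as np

def causal_window_attention(x, window):
    """Causal sliding-window softmax attention along axis 0 of x ([n, d]).

    Hypothetical helper: x serves as queries, keys, and values at once
    (no learned projections), so only the operator's structure is shown.
    """
    n, d = x.shape
    scores = x @ x.T / np.sqrt(d)  # [n, n] similarity scores
    idx = np.arange(n)
    # Row i may read column j only if i - window < j <= i.
    mask = (idx[None, :] <= idx[:, None]) & (idx[None, :] > idx[:, None] - window)
    scores = np.where(mask, scores, -np.inf)
    scores = scores - scores.max(axis=1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=1, keepdims=True)
    return weights @ x, weights

rng = np.random.default_rng(0)
# Assumed layout: hidden states for 6 layers, 5 tokens, width 8.
states = rng.normal(size=(6, 5, 8))

# Sequence axis: fix a layer, mix adaptively across token positions.
seq_out, seq_w = causal_window_attention(states[2], window=3)

# Depth axis: fix a token, mix adaptively across layer indices.
depth_out, depth_w = causal_window_attention(states[:, 4, :], window=3)
```

In this sketch the two calls differ only in which slice of `states` is passed in, which is the sense in which the depth-wise read and ShortSWA are the same local operator. A standard residual stream would correspond to replacing the learned, input-dependent `weights` over depth with fixed ones.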

Code & Models

Repositories

yifanzhang-pro/residual-stream-duality
github