TL;DR
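This paper introduces a duality perspective on the Transformer residual stream: a causal depth-wise residual attention read is the same local operator as causal short sliding-window attention (ShortSWA), written over depth rather than over sequence. This symmetry guides design choices for model modifications and efficiency.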
Contribution
It presents a two-axis view of the Transformer, in which information evolves along sequence position and layer depth, relates residual operations along depth to attention along sequence, and discusses implications for model design and hardware efficiency.
Findings
Residual stream duality links depth-wise residuals to sliding-window attention.
Learned depth aggregation can outperform uniform residual accumulation.
Sequence-axis ShortSWA is hardware-friendly for large-scale models.
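The contrast between uniform residual accumulation and learned depth aggregation can be illustrated with a minimal sketch. The function name `aggregate_depth` and the per-layer scalar weights are hypothetical and chosen only for illustration; the paper's actual aggregation rule may differ. The sketch shows only that a plain residual sum is the special case where every layer's weight is 1, not that learned weights perform better.

```python
import numpy as np

def aggregate_depth(layer_outputs, weights=None):
    """Combine per-layer outputs of shape (L, d) into one vector of shape (d,).

    weights: optional array of shape (L,), standing in for learned
    per-layer coefficients (an assumption for illustration). When omitted,
    every layer gets weight 1, which is plain residual accumulation.
    """
    num_layers = layer_outputs.shape[0]
    if weights is None:
        weights = np.ones(num_layers)
    return weights @ layer_outputs

# Three layers, two features per layer output.
outputs = np.arange(6.0).reshape(3, 2)

uniform = aggregate_depth(outputs)                               # plain sum over depth
reweighted = aggregate_depth(outputs, np.array([0.2, 0.3, 0.5]))  # non-uniform mix
```

In a trained model the weights would be parameters, or input-dependent as in an attention read over depth.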
Abstract
Recent work has made clear that the residual pathway is not mere optimization plumbing; it is part of the model's representational machinery. We agree, but argue that the cleanest way to organize this design space is through a two-axis view of the Transformer. A decoder evolves information along two ordered dimensions: sequence position and layer depth. Self-attention already provides adaptive mixing along the sequence axis, whereas the residual stream usually performs fixed addition along the depth axis. If we fix a token position and treat layer index as the ordered variable, then a causal depth-wise residual attention read is exactly the same local operator as causal short sliding-window attention (ShortSWA), except written over depth rather than over sequence. This is the core residual stream duality behind Transformer. This perspective also clarifies the recent literature.…
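To make the duality concrete, here is a minimal NumPy sketch, not the paper's implementation. It assumes hidden states are stored as an array of shape (layers, tokens, features) and, for brevity, uses the states themselves as queries, keys, and values (no learned projections). The function name `causal_window_attention` and the `window` parameter are illustrative. The same function is applied once along the sequence axis (layer fixed) and once along the depth axis (token fixed).

```python
import numpy as np

def causal_window_attention(q, k, v, window):
    """Causal sliding-window softmax attention along axis 0.

    q, k, v have shape (n, d), where n indexes an ordered axis
    (sequence position or layer depth). Step i attends to steps
    max(0, i - window + 1) through i.
    """
    n, d = q.shape
    scores = q @ k.T / np.sqrt(d)
    idx = np.arange(n)
    # allowed[i, j] is True when step j lies in the causal window of step i.
    allowed = (idx[None, :] <= idx[:, None]) & (idx[None, :] > idx[:, None] - window)
    scores = np.where(allowed, scores, -np.inf)
    # Subtract the row maximum for numerical stability before exponentiating.
    scores = scores - scores.max(axis=1, keepdims=True)
    weights = np.exp(scores)
    weights /= weights.sum(axis=1, keepdims=True)
    return weights @ v

rng = np.random.default_rng(0)
states = rng.standard_normal((4, 6, 8))  # (layers, tokens, features)

# Sequence axis: fix a layer, order by token position (ShortSWA).
layer_states = states[2]                 # shape (6, 8)
seq_out = causal_window_attention(layer_states, layer_states, layer_states, window=3)

# Depth axis: fix a token, order by layer index (depth-wise residual read).
token_states = states[:, 5]              # shape (4, 8)
depth_out = causal_window_attention(token_states, token_states, token_states, window=3)
```

Only the slicing differs between the two calls; the operator is identical. In a real decoder the depth-wise read would be computed incrementally as each layer is produced, rather than over a stored stack of states.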
