Transformers with Selective Access to Early Representations

Skye Gunasekaran; Téa Wright; Rui-Jie Zhu; Jason Eshraghian

arXiv:2605.03953 · cs.LG · May 7, 2026

TL;DR

The paper introduces SATFormer, a Transformer variant that selectively reuses early layer representations through a context-dependent gating mechanism, improving performance on retrieval-heavy tasks while maintaining efficiency.

Contribution

The paper treats early-representation reuse as a retrieval problem and proposes selective access to early representations, which outperforms static residual methods across various model sizes.

Findings

01

SATFormer improves validation loss and zero-shot accuracy over baselines.

02

Gains are strongest on retrieval-intensive benchmarks, at roughly 1.5 points.

03

Analysis shows sparse, depth-dependent, and category-sensitive access patterns.

Abstract

Several recent Transformer architectures expose later layers to representations computed in the earliest layers, motivated by the observation that low-level features can become harder to recover as the residual stream is repeatedly transformed through depth. The cheapest among these methods add static value residuals: learned mixing coefficients that expose the first-layer value projection V_1 uniformly across tokens and heads. More expressive dense or dynamic alternatives recover finer-grained access, but at higher memory cost and lower throughput. The usefulness of V_1 is unlikely to be constant across tokens, heads, and contexts; different positions plausibly require different amounts of access to early lexical or semantic information. We therefore treat early-representation reuse as a retrieval problem rather than a connectivity problem, and introduce Selective Access Transformer…
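To make the contrast in the abstract concrete, here is a minimal pure-Python sketch. It is not the SATFormer implementation, and the truncated abstract does not specify the actual mechanism. The first function follows the abstract's description of a static value residual: one learned coefficient mixes V_1 into a later layer's values identically for every token and head. The second is a hypothetical context-dependent variant in which a sigmoid gate, computed from each token's hidden state, chooses a separate mixing weight per token and head. The names `static_value_residual`, `gated_value_residual`, `gate_weights`, and `gate_bias`, the convex-mix form, and the `[tokens][heads][dim]` layout are all assumptions made for illustration.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def static_value_residual(v_layer, v_first, lam):
    """Mix V_1 into the current values with ONE scalar shared by all tokens and heads.

    v_layer, v_first: nested lists shaped [tokens][heads][dim].
    lam: a single learned mixing coefficient in [0, 1].
    """
    return [
        [
            [(1.0 - lam) * a + lam * b for a, b in zip(vec_l, vec_1)]
            for vec_l, vec_1 in zip(tok_l, tok_1)
        ]
        for tok_l, tok_1 in zip(v_layer, v_first)
    ]


def gated_value_residual(v_layer, v_first, hidden, gate_weights, gate_bias):
    """Hypothetical selective variant: one gate per (token, head), computed from context.

    hidden:       [tokens][d_model] hidden states at the current layer.
    gate_weights: [heads][d_model] assumed linear gate projection.
    gate_bias:    [heads] assumed gate bias.
    Returns the mixed values and the gates that produced them.
    """
    mixed, gates = [], []
    for t, h_t in enumerate(hidden):
        tok_out, tok_gates = [], []
        for h, (w, b) in enumerate(zip(gate_weights, gate_bias)):
            g = sigmoid(sum(wi * xi for wi, xi in zip(w, h_t)) + b)
            tok_out.append(
                [(1.0 - g) * a + g * c for a, c in zip(v_layer[t][h], v_first[t][h])]
            )
            tok_gates.append(g)
        mixed.append(tok_out)
        gates.append(tok_gates)
    return mixed, gates


# Toy input: 2 tokens, 1 head, value dimension 2, model dimension 1.
v_layer = [[[1.0, 2.0]], [[3.0, 4.0]]]
v_first = [[[0.0, 0.0]], [[10.0, 10.0]]]
hidden = [[-100.0], [100.0]]  # token 0 closes its gate, token 1 opens it

static_out = static_value_residual(v_layer, v_first, lam=0.5)
gated_out, gates = gated_value_residual(v_layer, v_first, hidden, [[1.0]], [0.0])
```

In the toy input, the static version applies the same 50/50 mix to both tokens, while the gated version leaves token 0 almost untouched and replaces token 1's values almost entirely with V_1. That per-position difference is the kind of selectivity the abstract argues a uniform coefficient cannot express.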

Code & Models

Repositories

SkyeGunasekaran/SATFormer (GitHub)
