TL;DR
The paper introduces SATFormer, a Transformer variant that selectively reuses early layer representations through a context-dependent gating mechanism, improving performance on retrieval-heavy tasks while maintaining efficiency.
Contribution
It proposes a retrieval-based approach for selective access to early representations, outperforming static residual methods across various model sizes.
Findings
SATFormer improves validation loss and zero-shot accuracy over baselines.
It shows strong gains on retrieval-intensive benchmarks, improving by about 1.5 points.
Analysis shows sparse, depth-dependent, and category-sensitive access patterns.
Abstract
Several recent Transformer architectures expose later layers to representations computed in the earliest layers, motivated by the observation that low-level features can become harder to recover as the residual stream is repeatedly transformed through depth. The cheapest among these methods add static value residuals: learned mixing coefficients that expose the first-layer value projection V_1 uniformly across tokens and heads. More expressive dense or dynamic alternatives recover finer-grained access, but at higher memory cost and lower throughput. The usefulness of V_1 is unlikely to be constant across tokens, heads, and contexts; different positions plausibly require different amounts of access to early lexical or semantic information. We therefore treat early-representation reuse as a retrieval problem rather than a connectivity problem, and introduce Selective Access Transformer…
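The contrast the abstract draws between uniform and selective access to V_1 can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions, not SATFormer's actual mechanism, since the abstract is cut off before the method is described. The convex mixing form, the sigmoid gate, and the names `static_value_residual`, `gated_value_residual`, `lam`, and `w_gate` are all hypothetical.

```python
import numpy as np

def static_value_residual(v_layer, v_first, lam):
    # Assumed form: a single learned scalar `lam` mixes V_1 into the current
    # layer's values identically for every token and every head.
    return (1.0 - lam) * v_layer + lam * v_first

def gated_value_residual(x, v_layer, v_first, w_gate):
    # Hypothetical context-dependent alternative: a sigmoid gate computed from
    # the hidden state x gives one weight per (token, head) pair.
    gate = 1.0 / (1.0 + np.exp(-(x @ w_gate)))   # shape (tokens, heads)
    # Broadcast the gate over the head dimension before scaling V_1.
    return v_layer + gate[..., None] * v_first, gate

rng = np.random.default_rng(0)
tokens, heads, d_model, d_head = 4, 2, 8, 3
x = rng.normal(size=(tokens, d_model))              # hidden states
v_layer = rng.normal(size=(tokens, heads, d_head))  # current-layer values
v_first = rng.normal(size=(tokens, heads, d_head))  # first-layer values V_1
w_gate = rng.normal(size=(d_model, heads))          # hypothetical gate weights

static_out = static_value_residual(v_layer, v_first, lam=0.5)
gated_out, gate = gated_value_residual(x, v_layer, v_first, w_gate)
print(gate.shape)
```

In the static version every position receives the same share of V_1, whereas in the gated sketch each token and head can draw on it to a different degree, which is the kind of variation the abstract argues for.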
