Attention Projection Mixing with Exogenous Anchors

Jonathan Su

arXiv:2601.08131·cs.CL·January 29, 2026

TL;DR

ExoFormer learns exogenous anchor projections outside the sequential layer stack and reuses them through a normalized mixing framework, improving the performance and data efficiency of transformer models.

Contribution

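The paper proposes ExoFormer, which learns anchor projections outside the sequential layer stack and mixes them into queries, keys, values, and gate logits through a unified normalized framework. This removes the requirement in internal-anchor designs that the first layer serve as both a reusable anchor and an effective computational block.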

Findings

01

ExoFormer variants outperform internal-anchor models.

02

Dynamic ExoFormer achieves 1.5x accuracy points with fewer tokens.

03

Normalized anchor sources are crucial for stable reuse.

Abstract

Cross-layer reuse of early attention projections can improve optimization and data efficiency, but it creates a structural conflict: the first layer must simultaneously act as a stable, reusable anchor for all deeper layers and as an effective computational block. We demonstrate that this tension constrains the performance of internal-anchor designs. We propose ExoFormer, which resolves the conflict by learning exogenous anchor projections outside the sequential layer stack. We introduce a unified normalized mixing framework that mixes queries, keys, values, and gate logits using learnable coefficients (exploring coefficient granularities: elementwise, headwise, and scalar), and we show that normalizing anchor sources is key to stable reuse. ExoFormer variants consistently outperform their internal-anchor counterparts, and the dynamic variant yields 1.5x downstream accuracy points while…
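The mixing step described in the abstract can be illustrated with a minimal sketch. The abstract only states that anchor sources are normalized and mixed using learnable coefficients at scalar, headwise, or elementwise granularity; the additive form, the per-head RMS normalization, and all names below (rms_normalize, expand_coeff, mix_with_anchor, lam_local, lam_anchor) are assumptions made for illustration, not the paper's implementation.

```python
import math

def rms_normalize(vec, eps=1e-6):
    """Scale a vector to (approximately) unit root-mean-square.
    RMS normalization is an assumed choice; the source only says
    anchor sources are normalized."""
    rms = math.sqrt(sum(x * x for x in vec) / len(vec) + eps)
    return [x / rms for x in vec]

def expand_coeff(coeff, num_heads, head_dim):
    """Broadcast a coefficient to shape [num_heads][head_dim].
    Accepts a scalar, a per-head list, or a full per-element nested list."""
    if isinstance(coeff, (int, float)):  # scalar granularity
        return [[float(coeff)] * head_dim for _ in range(num_heads)]
    if all(isinstance(c, (int, float)) for c in coeff):  # headwise granularity
        if len(coeff) != num_heads:
            raise ValueError("headwise coefficient needs one value per head")
        return [[float(c)] * head_dim for c in coeff]
    # elementwise granularity
    if len(coeff) != num_heads or any(len(row) != head_dim for row in coeff):
        raise ValueError("elementwise coefficient must match [num_heads][head_dim]")
    return [[float(c) for c in row] for row in coeff]

def mix_with_anchor(local, anchor, lam_local, lam_anchor):
    """Mix one token's projection (e.g. a query) with an anchor projection.
    local, anchor: nested lists of shape [num_heads][head_dim].
    Assumed form: lam_local * local + lam_anchor * normalize(anchor)."""
    num_heads, head_dim = len(local), len(local[0])
    a = expand_coeff(lam_local, num_heads, head_dim)
    b = expand_coeff(lam_anchor, num_heads, head_dim)
    mixed = []
    for h in range(num_heads):
        anchor_n = rms_normalize(anchor[h])  # normalize the anchor source per head
        mixed.append([a[h][d] * local[h][d] + b[h][d] * anchor_n[d]
                      for d in range(head_dim)])
    return mixed

# Example: two heads of dimension two, with a headwise anchor coefficient.
local_q = [[1.0, 2.0], [3.0, 4.0]]
anchor_q = [[3.0, 4.0], [0.0, 2.0]]
mixed_q = mix_with_anchor(local_q, anchor_q, lam_local=1.0, lam_anchor=[0.5, 0.0])
```

In a trained model the coefficients would be learnable parameters and the same operation would apply to keys, values, and gate logits; here they are plain numbers so the three granularities are easy to compare.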

Taxonomy

Topics: Advanced Neural Network Applications · Advanced Graph Neural Networks · Generative Adversarial Networks and Image Synthesis