Attention Projection Mixing with Exogenous Anchors
Jonathan Su

TL;DR
ExoFormer learns exogenous anchor projections outside the layer stack and mixes them into each layer through a normalized mixing framework, improving attention reuse, performance, and efficiency in transformer models.
Contribution
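The paper proposes ExoFormer, which learns anchor projections outside the sequential layer stack and combines them with each layer's queries, keys, values, and gate logits through a unified normalized mixing framework. This removes the conflict faced by internal-anchor designs, where the first layer must act as both a stable, reusable anchor and an effective computational block.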
Findings
ExoFormer variants outperform internal-anchor models.
Dynamic ExoFormer achieves 1.5x accuracy points with fewer tokens.
Normalized anchor sources are crucial for stable reuse.
Abstract
Cross-layer reuse of early attention projections can improve optimization and data efficiency, but it creates a structural conflict: the first layer must simultaneously act as a stable, reusable anchor for all deeper layers and as an effective computational block. We demonstrate that this tension constrains the performance of internal-anchor designs. We propose ExoFormer, which resolves the conflict by learning exogenous anchor projections outside the sequential layer stack. We introduce a unified normalized mixing framework that mixes queries, keys, values, and gate logits using learnable coefficients (exploring coefficient granularities: elementwise, headwise, and scalar), and we show that normalizing anchor sources is key to stable reuse. ExoFormer variants consistently outperform their internal-anchor counterparts, and the dynamic variant yields 1.5x downstream accuracy points while…
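The mixing step described in the abstract can be illustrated with a minimal NumPy sketch. The abstract excerpt does not give the mixing formula, so everything here is an assumption made for illustration: the linear form `lam_layer * layer_proj + lam_anchor * normalize(anchor_proj)`, the choice of RMS normalization, and the names `rms_normalize`, `make_coeff`, and `mix_with_anchor`. Only the three coefficient granularities (scalar, headwise, elementwise) and the idea of normalizing the anchor source come from the text.

```python
import numpy as np


def rms_normalize(x, eps=1e-6):
    """Rescale each feature vector (last axis) to roughly unit root-mean-square.

    Assumption: RMS normalization stands in for whatever normalization the
    paper applies to anchor sources.
    """
    rms = np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)
    return x / rms


def make_coeff(granularity, num_heads, head_dim, init):
    """Build a mixing coefficient at one of three granularities.

    Shapes are chosen so the coefficient broadcasts against a projection of
    shape (tokens, num_heads, head_dim).
    """
    shapes = {
        "scalar": (),                          # one value for the whole layer
        "headwise": (num_heads, 1),            # one value per attention head
        "elementwise": (num_heads, head_dim),  # one value per feature
    }
    return np.full(shapes[granularity], init, dtype=np.float64)


def mix_with_anchor(layer_proj, anchor_proj, lam_layer, lam_anchor):
    """Hypothetical mixing rule: weighted sum of the layer's own projection
    and the normalized exogenous anchor projection."""
    return lam_layer * layer_proj + lam_anchor * rms_normalize(anchor_proj)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    tokens, num_heads, head_dim = 4, 2, 8
    layer_q = rng.standard_normal((tokens, num_heads, head_dim))
    anchor_q = 50.0 * rng.standard_normal((tokens, num_heads, head_dim))

    for granularity in ("scalar", "headwise", "elementwise"):
        lam_layer = make_coeff(granularity, num_heads, head_dim, init=1.0)
        lam_anchor = make_coeff(granularity, num_heads, head_dim, init=0.5)
        mixed_q = mix_with_anchor(layer_q, anchor_q, lam_layer, lam_anchor)
        print(granularity, lam_anchor.shape, mixed_q.shape)
```

In a trained model the coefficients would be learnable parameters, and the same rule would be applied separately to queries, keys, values, and gate logits. In this sketch, normalizing the anchor keeps its contribution on a fixed scale regardless of the raw anchor magnitude, which is one plausible reading of why normalization would matter for stable reuse.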
Taxonomy
Topics: Advanced Neural Network Applications · Advanced Graph Neural Networks · Generative Adversarial Networks and Image Synthesis
