Position-Agnostic Pre-Projection for Transformer Attention: Nonlinear Feature Construction and Content Skip Before Q/K/V

Chirag Shinde

arXiv:2604.10791 · cs.CL · April 14, 2026

TL;DR

This paper introduces position-agnostic pre-projection and content skip connections in transformer attention, improving language modeling performance without additional cache overhead.

Contribution

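It proposes two modifications to transformer attention blocks: a nonlinear pre-projection MLP placed between layer norm and the Q/K/V projections, and a content skip connection that routes the pre-projection's features around the attention mechanism. In frozen-probe experiments on Pythia-160M and 410M, the combination gives the strongest results among the methods compared.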

Findings

01 +40.6% LAMBADA accuracy at 160M scale

02 -39% perplexity at 160M scale

03 Deeper layers activate content bypass more strongly

Abstract

We propose two complementary modifications to transformer attention blocks. First, a non-linear pre-projection MLP is inserted between layer norm and Q/K/V projections, constructing richer features in a position-agnostic manner before any positional encoding is applied. Second, a content skip connection routes the pre-projection's features around the attention mechanism, allowing content information to bypass position-aware attention where beneficial. In frozen-probe experiments on Pythia-160M and 410M, the combined approach achieves the strongest results across methods: +40.6% LAMBADA accuracy and -39% perplexity at 160M scale. Learned skip connection weights reveal a consistent pattern across model sizes: later transformer layers activate the content bypass more strongly than earlier layers, suggesting that deeper layers benefit from content information that does not pass through…
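
To make the two modifications concrete, here is a minimal NumPy sketch of one attention block, written under several assumptions that the abstract does not confirm: a single head, a two-layer GELU MLP as the pre-projection, a scalar `skip_gate` per layer as the learned skip weight, and the skip added directly to the residual output. The names `init_params`, `pre_project`, `attention_block`, and `skip_gate` are hypothetical, and the positional encoding step is only marked by a comment.

```python
import numpy as np


def layer_norm(x, eps=1e-5):
    """Normalize each token vector to zero mean and unit variance (no learned scale)."""
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)


def gelu(x):
    """Tanh approximation of GELU; the choice of nonlinearity is an assumption."""
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))


def softmax(x):
    """Row-wise softmax, shifted by the row maximum for numerical stability."""
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)


def init_params(d_model, d_hidden, rng):
    """Hypothetical parameter layout: a two-layer MLP plus Q/K/V/output matrices."""
    shapes = {
        "W1": (d_model, d_hidden), "W2": (d_hidden, d_model),
        "Wq": (d_model, d_model), "Wk": (d_model, d_model),
        "Wv": (d_model, d_model), "Wo": (d_model, d_model),
    }
    return {name: rng.normal(scale=0.1, size=shape) for name, shape in shapes.items()}


def pre_project(x, p):
    """Per-token MLP applied after layer norm. It never sees token positions."""
    h = layer_norm(x)
    return gelu(h @ p["W1"]) @ p["W2"]


def attention_block(x, p, skip_gate):
    """x: (seq_len, d_model). Single-head causal attention with a content skip."""
    f = pre_project(x, p)
    q, k, v = f @ p["Wq"], f @ p["Wk"], f @ p["Wv"]
    # A positional encoding on q and k would be applied here; omitted in this sketch.
    scores = q @ k.T / np.sqrt(q.shape[-1])
    future = np.triu(np.ones(scores.shape, dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)  # mask out future tokens
    attended = softmax(scores) @ v @ p["Wo"]
    # Content skip: pre-projection features bypass attention, scaled by the gate.
    return x + attended + skip_gate * f


rng = np.random.default_rng(0)
params = init_params(d_model=8, d_hidden=16, rng=rng)
tokens = rng.normal(size=(5, 8))
out = attention_block(tokens, params, skip_gate=0.5)
```

In this sketch, setting `skip_gate` to zero reduces the block to ordinary attention over the pre-projected features, while a larger gate lets more content reach the output without passing through the attention weights. Because `pre_project` acts on each token independently, reordering the input tokens simply reorders its output, which is one way to read "position-agnostic".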
