Half the Nonlinearity Is Wasted: Measuring and Reallocating the Transformer's MLP Budget

Peter Balogh

arXiv:2603.03459·cs.LG·March 10, 2026

Open Access

TL;DR

This paper shows that most transformer MLP computations are near-linear and can be routed to a linear surrogate at a small perplexity cost. This points toward more efficient models and suggests the full nonlinearity is often unnecessary.

Contribution

It introduces a gate with $d + 1$ parameters that measures when MLP nonlinearity is needed and selectively replaces the full MLP with a linear surrogate, showing how much of the MLP budget can be simplified.

Findings

01. Most MLP computations are near-linear and can be replaced with a minimal increase in perplexity.

02. In GPT-2, gating routes 25-56% of MLP computations to the linear surrogate at less than 1% perplexity cost.

03. Replacing certain MLP layers with linear matrices can improve perplexity, indicating that some nonlinearities are harmful.

Abstract

We investigate when transformer MLP nonlinearity is actually necessary. A gate with $d + 1$ parameters decides when to replace the full MLP with a linear surrogate. Through systematic investigation across six models (162M-2.8B parameters), two architectures, and three corpora, we establish that the need for nonlinearity cannot be predicted from token identity: cross-corpus correlation is near zero ($r < 0.05$). The routing decision is fully contextual. Despite weak per-instance predictability, the gate exploits a heavily skewed distribution where most MLP computations are near-linear, achieving 25-56% linear routing at <1% perplexity cost in GPT-2. In GPT-2 Large, 11 of 36 layers beat baseline with gating and no layer exceeds 3.7% all-linear cost. This success is architecture-dependent: Pythia models show higher costs, though Pythia-2.8B's full 32-layer sweep reveals one layer that narrowly beats…
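The gating idea in the abstract can be illustrated with a minimal sketch. It is not the paper's implementation. It assumes the gate is a single logistic unit over the $d$-dimensional MLP input (a weight vector of length $d$ plus one bias, hence $d + 1$ parameters), that a gate probability at or above a threshold routes the token to the linear surrogate, and that the surrogate is an affine map. All names (`gate_prob`, `gated_mlp`, `linear_surrogate`), the toy sizes, and the random weights are hypothetical.

```python
import math
import random

def gelu(v):
    # tanh approximation of GELU
    return 0.5 * v * (1.0 + math.tanh(math.sqrt(2.0 / math.pi) * (v + 0.044715 * v ** 3)))

def matvec(m, x):
    return [sum(mij * xj for mij, xj in zip(row, x)) for row in m]

def mlp(x, w_in, w_out):
    # Toy two-layer MLP standing in for the full transformer MLP.
    return matvec(w_out, [gelu(h) for h in matvec(w_in, x)])

def linear_surrogate(x, a, c):
    # Assumed form of the surrogate: an affine map a @ x + c.
    return [y + ci for y, ci in zip(matvec(a, x), c)]

def gate_prob(x, w, b):
    # Logistic gate with len(w) + 1 parameters (numerically stable sigmoid).
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def gated_mlp(x, gate_w, gate_b, w_in, w_out, a, c, threshold=0.5):
    # Assumption: high gate probability means "linear is good enough".
    if gate_prob(x, gate_w, gate_b) >= threshold:
        return linear_surrogate(x, a, c), True
    return mlp(x, w_in, w_out), False

# Hypothetical toy setup: d = 4, hidden width 8, untrained random weights.
random.seed(0)
d, hidden = 4, 8
w_in = [[random.gauss(0, 1) for _ in range(d)] for _ in range(hidden)]
w_out = [[random.gauss(0, 1) for _ in range(hidden)] for _ in range(d)]
a = [[random.gauss(0, 1) for _ in range(d)] for _ in range(d)]
c = [0.0] * d
gate_w, gate_b = [1.0, 0.0, 0.0, 0.0], 0.0  # d + 1 gate parameters

x_easy = [2.0, 0.1, -0.3, 0.5]   # gate logit 2.0, routed to the surrogate
x_hard = [-2.0, 0.1, -0.3, 0.5]  # gate logit -2.0, routed to the full MLP
print(gated_mlp(x_easy, gate_w, gate_b, w_in, w_out, a, c)[1])
print(gated_mlp(x_hard, gate_w, gate_b, w_in, w_out, a, c)[1])
```

In this sketch the routing depends only on the input vector at that position, which matches the abstract's point that the decision is contextual rather than tied to token identity. How the gate and surrogate are trained is not covered here.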

Taxonomy

Topics: Optical Network Technologies · Magnetic properties of thin films · Copper Interconnects and Reliability