Half the Nonlinearity Is Wasted: Measuring and Reallocating the Transformer's MLP Budget
Peter Balogh

TL;DR
This paper shows that most transformer MLP computations are near-linear and can be routed to a linear surrogate with minimal perplexity loss, which suggests that much of the MLP nonlinearity is unnecessary and could be reallocated.
Contribution
It introduces a gating mechanism to measure and selectively replace MLPs with linear surrogates, demonstrating significant potential for model simplification and efficiency.
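As a rough illustration of the routing idea, the NumPy sketch below sends each token either through a full MLP or through a linear surrogate. The sigmoid gate on the token's hidden state, the 0.5 threshold, the GELU activation, and all function and parameter names are assumptions made for this example; the summary does not specify the paper's actual gate.

```python
import numpy as np

def gelu(x):
    # Tanh approximation of GELU (assumed activation for this sketch).
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def mlp(x, W1, b1, W2, b2):
    # Standard two-layer transformer MLP block.
    return gelu(x @ W1 + b1) @ W2 + b2

def linear_surrogate(x, A, c):
    # Single affine map standing in for the whole MLP.
    return x @ A + c

def gated_mlp(x, mlp_params, lin_params, w_gate, b_gate, threshold=0.5):
    # Hypothetical gate: sigmoid(w . x + b) scores how much a token
    # needs the nonlinearity; tokens above the threshold get the full MLP.
    p_nonlinear = 1.0 / (1.0 + np.exp(-(x @ w_gate + b_gate)))
    use_mlp = p_nonlinear > threshold
    out = linear_surrogate(x, *lin_params)
    if use_mlp.any():
        out[use_mlp] = mlp(x[use_mlp], *mlp_params)
    # Fraction of tokens routed to the linear path.
    linear_fraction = 1.0 - use_mlp.mean()
    return out, linear_fraction
```

Here `x` has shape `(n_tokens, d_model)`, and the returned `linear_fraction` corresponds to the "linear routing" percentage reported in the findings.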
Findings
Most MLP computations are near-linear and can be replaced with minimal perplexity increase.
In GPT-2, gating achieves 25-56% linear routing at less than 1% perplexity cost.
Replacing certain MLP layers with linear matrices can improve perplexity, indicating some nonlinearities are harmful.
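One way to probe how near-linear an MLP is on a given set of inputs is to fit an affine map to its input/output pairs and look at the per-token residuals. The sketch below uses ordinary least squares for the fit; this is an assumed procedure for illustration, not necessarily how the paper builds its surrogates, and the function names are hypothetical.

```python
import numpy as np

def fit_linear_surrogate(X, Y):
    # Least-squares fit of Y ~ X @ A + c (assumed fitting procedure).
    # X: (n_tokens, d_in) MLP inputs, Y: (n_tokens, d_out) MLP outputs.
    X1 = np.hstack([X, np.ones((X.shape[0], 1))])
    coef, *_ = np.linalg.lstsq(X1, Y, rcond=None)
    return coef[:-1], coef[-1]

def per_token_relative_error(X, Y, A, c):
    # Relative residual per token; small values mean the MLP
    # behaves almost linearly on that token.
    resid = np.linalg.norm(Y - (X @ A + c), axis=1)
    scale = np.linalg.norm(Y, axis=1) + 1e-12
    return resid / scale
```

A histogram of `per_token_relative_error` over a corpus would show whether the distribution is skewed toward near-linear tokens, which is the property the gate is described as exploiting.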
Abstract
We investigate when transformer MLP nonlinearity is actually necessary. A gate with parameters decides when to replace the full MLP with a linear surrogate. Through systematic investigation across six models (162M-2.8B parameters), two architectures, and three corpora, we establish that the need for nonlinearity cannot be predicted from token identity: cross-corpus correlation is zero. The routing decision is fully contextual. Despite weak per-instance predictability, the gate exploits a heavily skewed distribution where most MLP computations are near-linear, achieving 25-56% linear routing at <1% perplexity cost in GPT-2. In GPT-2 Large, 11 of 36 layers beat baseline with gating and no layer exceeds 3.7% all-linear cost. This success is architecture-dependent: Pythia models show higher costs, though Pythia-2.8B's full 32-layer sweep reveals one layer that narrowly beats…
Taxonomy
Topics: Optical Network Technologies · Magnetic properties of thin films · Copper Interconnects and Reliability
