The Discrete Charm of the MLP: Binary Routing of Continuous Signals in Transformer Feed-Forward Layers

Peter Balogh

arXiv:2603.10985·cs.LG·March 12, 2026

TL;DR

This paper reveals that transformer MLP layers perform binary routing of continuous signals, with neurons acting as consensus switches that determine whether tokens require nonlinear processing, explaining the limitations of polynomial approximations.

Contribution

It uncovers the binary routing mechanism in transformer MLP layers and characterizes its developmental stages and functional importance, providing a new perspective on neural computation.

Findings

01

Binarized neuron activations capture the routing decision (whether a token needs nonlinear processing) well, even though the routed signals themselves are continuous.

02

Removing consensus neurons significantly increases perplexity, confirming their functional role.

03

Binary routing explains the failure of polynomial approximations in nonlinear layers.
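One intuition behind the third finding, offered as an illustration and not as the paper's own argument: a low-degree polynomial fits a smooth function closely but cannot track a hard ON/OFF switch. The minimal sketch below assumes a toy 1-D step gate, a `tanh` comparison curve, and degree 5; none of these choices come from the paper.

```python
import numpy as np

# Toy 1-D inputs; the grid and both target functions are illustrative assumptions.
x = np.linspace(-1.0, 1.0, 201)
gate = (x > 0).astype(float)   # hard binary switch: OFF below zero, ON above
smooth = np.tanh(x)            # smooth nonlinearity for comparison


def max_fit_error(y, degree):
    """Worst-case error of a least-squares polynomial fit of the given degree."""
    coeffs = np.polynomial.polynomial.polyfit(x, y, degree)
    fitted = np.polynomial.polynomial.polyval(x, coeffs)
    return float(np.max(np.abs(fitted - y)))


gate_err = max_fit_error(gate, 5)
smooth_err = max_fit_error(smooth, 5)
print(f"max error on step gate: {gate_err:.3f}")
print(f"max error on tanh:      {smooth_err:.5f}")
```

The error on the step stays large near the jump however the coefficients are chosen, because a continuous curve has to pass through intermediate values there.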

Abstract

We show that MLP layers in transformer language models perform binary routing of continuous signals: the decision of whether a token needs nonlinear processing is well-captured by binary neuron activations, even though the signals being routed are continuous. In GPT-2 Small (124M parameters), we find that specific neurons implement a consensus architecture -- seven "default-ON" neurons and one exception handler (N2123 in Layer 11) that are 93-98% mutually exclusive -- creating a binary routing switch. A cross-layer analysis reveals a developmental arc: early layers (L1-3) use single gateway neurons to route exceptions without consensus quorums; middle layers (L4-6) show diffuse processing with neither gateway nor consensus; and late layers (L7-11) crystallize full consensus/exception architectures with increasing quorum size (1 to 3 to 7 consensus neurons). Causal validation confirms…
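The mutual-exclusivity figure in the abstract can be pictured with a small sketch. It assumes that "mutually exclusive" means exactly one of two binarized neurons is ON for a given token, and that a neuron counts as ON when its activation exceeds zero; both definitions are assumptions, as are the helper names and the synthetic activations, which stand in for real GPT-2 data.

```python
import numpy as np


def binarize(acts, threshold=0.0):
    """Map continuous activations to ON/OFF states (the threshold is an assumption)."""
    return np.asarray(acts) > threshold


def mutual_exclusivity(a_on, b_on):
    """Fraction of tokens on which exactly one of the two neurons is ON."""
    return float(np.mean(np.logical_xor(a_on, b_on)))


# Synthetic stand-in data: a hidden per-token routing variable drives one
# "consensus" neuron and one "exception" neuron in opposite directions.
rng = np.random.default_rng(0)
n_tokens = 10_000
is_exception = rng.random(n_tokens) < 0.1
noise = 0.3 * rng.standard_normal((n_tokens, 2))
consensus_act = np.where(is_exception, -1.0, 1.0) + noise[:, 0]
exception_act = np.where(is_exception, 1.0, -1.0) + noise[:, 1]

score = mutual_exclusivity(binarize(consensus_act), binarize(exception_act))
print(f"mutual exclusivity on synthetic data: {score:.3f}")
```

On real activations the same function would be applied to each consensus/exception pair over a token sample; the score printed here reflects only the synthetic noise level and is not a reproduction of the reported 93-98%.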

Taxonomy

Topics: Ferroelectric and Negative Capacitance Devices · Neural dynamics and brain function · Advanced Memory and Neural Computing