Secret mixtures of experts inside your LLM

Enric Boix-Adsera

arXiv:2512.18452 · cs.LG · December 23, 2025

TL;DR

This paper hypothesizes, and validates empirically on pretrained LLMs, that the dense MLP layers in large language models can be well approximated by sparsely-activating Mixture of Experts (MoE) layers. The result offers insight into what MLP layers compute and can guide the design of more efficient architectures.

Contribution

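It introduces a novel theoretical connection between MoE models and Sparse Autoencoder (SAE) structure in activation space, and empirically validates that MLPs in pretrained LLMs behave like sparse MoE layers. The validation depends on the activation distribution: the results hold for neural network activations but not for Gaussian data.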

Findings

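01 MLP layers in dense LLMs can be well approximated by sparsely-activating MoE layers.

02 The approximation quality depends on the activation distribution: it holds for neural network activations but not for Gaussian data.

03 The results suggest new directions for designing efficient MoE architectures.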

Abstract

Despite being one of the earliest neural network layers, the Multilayer Perceptron (MLP) is arguably one of the least understood parts of the transformer architecture due to its dense computation and lack of easy visualization. This paper seeks to understand the MLP layers in dense LLMs by hypothesizing that these layers secretly perform an approximately sparse computation: namely, that they can be well approximated by sparsely-activating Mixture of Experts (MoE) layers. Our hypothesis is based on a novel theoretical connection between MoE models and Sparse Autoencoder (SAE) structure in activation space. We empirically validate the hypothesis on pretrained LLMs, and demonstrate that the activation distribution matters: these results do not hold for Gaussian data, but rather rely crucially on structure in the distribution of neural network activations. Our results shine…
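To make the hypothesis concrete, the following is a minimal NumPy sketch of what "approximating a dense MLP by a sparse MoE" can mean. It is an illustration under stated assumptions, not the paper's construction: the names `dense_mlp`, `moe_from_mlp`, and `relative_error` are hypothetical, the experts are formed by simply splitting the hidden neurons into equal groups, and the router is an oracle that scores each expert by the norm of its contribution. The inputs are random stand-ins, so the printed errors say nothing about the paper's empirical findings, which concern real LLM activations.

```python
import numpy as np


def dense_mlp(x, w_in, w_out):
    """Dense ReLU MLP: x -> relu(x @ w_in) @ w_out."""
    return np.maximum(x @ w_in, 0.0) @ w_out


def moe_from_mlp(x, w_in, w_out, num_experts, top_k):
    """Sparse MoE built from the same weights (illustrative assumption).

    Hidden neurons are split into `num_experts` groups; each group is one
    expert. For every input, only the `top_k` highest-scoring experts are kept.
    """
    d_hidden = w_in.shape[1]
    groups = np.array_split(np.arange(d_hidden), num_experts)
    hidden = np.maximum(x @ w_in, 0.0)

    # contributions[e] is expert e's output for every input: (n, d_out).
    # All experts are evaluated here for clarity; a real MoE layer would
    # evaluate only the experts chosen by a cheap learned router.
    contributions = np.stack([hidden[:, g] @ w_out[g, :] for g in groups])

    # Oracle router (assumption): score each expert by its output norm.
    scores = np.linalg.norm(contributions, axis=2)      # (num_experts, n)
    keep = np.argsort(-scores, axis=0)[:top_k]          # (top_k, n)
    mask = np.zeros_like(scores, dtype=bool)
    np.put_along_axis(mask, keep, True, axis=0)

    # Sum only the selected experts' outputs for each input.
    return (contributions * mask[:, :, None]).sum(axis=0)


def relative_error(x, w_in, w_out, num_experts, top_k):
    """Relative Frobenius error between the dense MLP and its sparse MoE."""
    y = dense_mlp(x, w_in, w_out)
    y_hat = moe_from_mlp(x, w_in, w_out, num_experts, top_k)
    return np.linalg.norm(y - y_hat) / np.linalg.norm(y)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    d_model, d_hidden, n = 16, 64, 256
    x = rng.standard_normal((n, d_model))               # toy inputs only
    w_in = rng.standard_normal((d_model, d_hidden)) / np.sqrt(d_model)
    w_out = rng.standard_normal((d_hidden, d_model)) / np.sqrt(d_hidden)
    for k in (1, 2, 4, 8):
        err = relative_error(x, w_in, w_out, num_experts=8, top_k=k)
        print(f"top_k={k}: relative error = {err:.4f}")
```

When `top_k` equals `num_experts`, every expert is kept and the MoE reproduces the dense MLP exactly, because the dense output is the sum of the per-group contributions. The interesting regime is `top_k` much smaller than `num_experts`: the hypothesis in the abstract is that, on real LLM activations, a suitable sparse MoE still tracks the dense layer closely.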

Taxonomy

Topics: Stochastic Gradient Optimization Techniques · Neural Networks and Reservoir Computing · Random lasers and scattering media