MLPMoE: Zero-Shot Architectural Metamorphosis of Dense LLM MLPs into Static Mixture-of-Experts
Ivan Novikov

TL;DR
This paper presents MLPMoE, a training-free method that converts the dense MLPs in transformer models into static mixture-of-experts structures, reducing parameter count and computational cost with minimal impact on performance.
Contribution
MLPMoE introduces a novel, deterministic tensor slicing approach to transform dense MLPs into static MoE structures without training or calibration data.
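The slicing-and-summation idea can be illustrated with a small, dependency-free sketch. It is not the author's implementation: the gated (SwiGLU-style) MLP form, the equal-sized contiguous slices, and every name below (`gated_mlp`, `slice_into_experts`, `static_moe`) are assumptions chosen for illustration. It shows only the underlying algebra: cutting the hidden dimension into blocks and summing the per-block outputs reproduces the dense output up to floating-point rounding.

```python
import math
import random

def silu(v):
    # SiLU activation; assumed here, the source does not name the activation.
    return v / (1.0 + math.exp(-v))

def matvec(W, x):
    # W is a list of rows; returns W @ x.
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def gated_mlp(x, W_gate, W_up, W_down):
    # y = W_down @ (silu(W_gate @ x) * (W_up @ x))
    # W_gate, W_up: (d_hidden, d_model); W_down: (d_model, d_hidden)
    h = [silu(g) * u for g, u in zip(matvec(W_gate, x), matvec(W_up, x))]
    return matvec(W_down, h)

def slice_into_experts(W_gate, W_up, W_down, num_experts):
    # Hypothetical helper: cut the hidden dimension into equal contiguous blocks.
    d_hidden = len(W_gate)
    assert d_hidden % num_experts == 0, "hidden size must divide evenly"
    step = d_hidden // num_experts
    experts = []
    for e in range(num_experts):
        lo, hi = e * step, (e + 1) * step
        experts.append((
            W_gate[lo:hi],                    # rows of the gate projection
            W_up[lo:hi],                      # rows of the up projection
            [row[lo:hi] for row in W_down],   # matching columns of the down projection
        ))
    return experts

def static_moe(x, experts):
    # No router: every expert runs and the outputs are summed.
    y = [0.0] * len(experts[0][2])
    for W_g, W_u, W_d in experts:
        y = [a + b for a, b in zip(y, gated_mlp(x, W_g, W_u, W_d))]
    return y

random.seed(0)
d_model, d_hidden, num_experts = 4, 8, 4

def rand_matrix(rows, cols):
    return [[random.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]

W_gate = rand_matrix(d_hidden, d_model)
W_up = rand_matrix(d_hidden, d_model)
W_down = rand_matrix(d_model, d_hidden)
x = [random.gauss(0.0, 1.0) for _ in range(d_model)]

experts = slice_into_experts(W_gate, W_up, W_down, num_experts)
dense_out = gated_mlp(x, W_gate, W_up, W_down)
moe_out = static_moe(x, experts)
max_diff = max(abs(a - b) for a, b in zip(dense_out, moe_out))
```

Because the activation acts elementwise on the hidden units, each slice can be evaluated independently, which is why no training or calibration data is needed for the sliced form to match the dense one. Any savings would then come from dropping or shrinking slices, which this sketch does not attempt.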
Findings
Parameter reduction of about 20% with minimal perplexity increase
The transformation itself keeps perplexity within 0.05% of the dense model
Operates post hoc without gradient updates or router training
Abstract
Large Language Models (LLMs) are predominantly deployed as dense transformers, where every parameter in every feed-forward block is activated for every token. While architecturally simple, this is computationally inefficient, since inference costs scale linearly with parameter count. Recent upcycling methods such as MoEfication, CMoE, ToMoE, and MoORE reveal that much of the useful computation lives in sparse, semi-modular substructures inside dense feed-forward networks, but these approaches typically rely on clustering, activation profiling, singular value decomposition, or custom routing that requires calibration data. This paper introduces MLPMoE (MLP Mixture-of-Experts), a training-free, deterministic transformation that restructures the dense MLP in transformer blocks into a static, high-cardinality mixture of experts. The transformation uses simple tensor slicing and summation,…
Taxonomy
Topics: Topic Modeling · Machine Learning in Materials Science · Advanced Graph Neural Networks
