ButterflyMoE: Sub-Linear Ternary Experts via Structured Butterfly Orbits

Aryan Karmore

arXiv:2601.13563 · cs.LG · March 6, 2026



TL;DR

ButterflyMoE treats each expert as a learned rotation of one shared set of quantized weights, cutting memory use enough to deploy mixture-of-experts models on edge devices with minimal accuracy loss.

Contribution

The paper proposes ButterflyMoE, a method that derives each expert from a learned rotation of a shared ternary-quantized substrate, so that memory scales sub-linearly in the number of experts.
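To make "learned rotations of a shared substrate" concrete, here is a minimal Python sketch. It assumes that each rotation is a butterfly product of 2×2 Givens rotations (log2(d) stages with d/2 angles each) and that an expert applies the shared ternary matrix to the rotated input; the paper's exact parameterization may differ. The names `random_angles`, `butterfly_rotate`, and `expert_forward` are hypothetical and chosen only for illustration.

```python
import math
import random


def random_angles(d, rng):
    """Hypothetical helper: d/2 angles per stage, log2(d) stages (d a power of two)."""
    stages = d.bit_length() - 1  # equals log2(d) when d is a power of two
    return [[rng.uniform(-math.pi, math.pi) for _ in range(d // 2)]
            for _ in range(stages)]


def butterfly_rotate(x, angles):
    """Apply a butterfly of 2x2 Givens rotations to the vector x.

    Stage s pairs index i with i + 2**s, as in an FFT butterfly.
    Each stage is orthogonal, so the whole product is a rotation.
    """
    d = len(x)
    if d < 2 or d & (d - 1) != 0:
        raise ValueError("d must be a power of two")
    if len(angles) != d.bit_length() - 1:
        raise ValueError("expected log2(d) stages of angles")
    y = list(x)
    stride = 1
    for stage_angles in angles:
        k = 0  # index of the current pair within this stage
        for start in range(0, d, 2 * stride):
            for i in range(start, start + stride):
                j = i + stride
                c, s = math.cos(stage_angles[k]), math.sin(stage_angles[k])
                y[i], y[j] = c * y[i] - s * y[j], s * y[i] + c * y[j]
                k += 1
        stride *= 2
    return y


def expert_forward(shared_ternary, angles, x):
    """Assumed expert: rotate the input, then apply the shared ternary matrix."""
    rotated = butterfly_rotate(x, angles)
    return [sum(w * v for w, v in zip(row, rotated)) for row in shared_ternary]


rng = random.Random(0)
d = 8
# One shared matrix with entries in {-1, 0, +1}, reused by every expert.
shared_ternary = [[rng.choice((-1, 0, 1)) for _ in range(d)] for _ in range(d)]
x = [rng.gauss(0.0, 1.0) for _ in range(d)]
# Each expert stores only its own angles: (d/2) * log2(d) numbers.
expert_angles = [random_angles(d, rng) for _ in range(4)]
outputs = [expert_forward(shared_ternary, a, x) for a in expert_angles]
```

In this sketch the experts differ only in their angle sets, so adding an expert adds (d/2)·log2(d) parameters rather than a new d×d matrix.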

Findings

01. Achieves a 150× memory reduction with 256 experts on language benchmarks.

02. Incurs negligible accuracy loss despite the memory savings.

03. Enables expert models to run on memory-constrained edge devices.

Abstract

Conventional expert models store $N$ independent expert weight matrices, requiring $O(N \cdot d^{2})$ memory that grows linearly with $N$ and exceeds the memory budget of edge devices. Current compression methods such as quantization, pruning, and low-rank factorization reduce constant factors but leave the scaling bottleneck unresolved. We introduce ButterflyMoE, a method that treats experts not as independent weight matrices but as geometric reorientations of a unified shared quantized substrate. Diversity among experts arises from viewing shared capacity from different angles, not from redundant storage. By applying learned rotations to a shared ternary prototype, the experts together require $O(d^{2} + N \cdot d \log d)$ memory, sub-linear in the number of experts. The key insight: training these rotations with quantization reduces activation outliers and stabilizes extremely low-bit training, where static methods…
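The two scaling laws in the abstract can be compared with a short parameter count. This is a sketch under stated assumptions: each expert is taken to store one butterfly rotation with log2(d) stages of d/2 angles, and bit widths are ignored. The function names and the example sizes (d = 512, N = 256) are hypothetical.

```python
def dense_moe_params(num_experts: int, d: int) -> int:
    """N independent d x d expert matrices: N * d^2 weights."""
    return num_experts * d * d


def butterfly_moe_params(num_experts: int, d: int) -> int:
    """One shared d x d prototype plus per-expert rotation angles.

    Assumption: one butterfly rotation per expert with log2(d) stages
    of d/2 angles each, i.e. (d/2) * log2(d) angles per expert.
    """
    if d < 2 or d & (d - 1) != 0:
        raise ValueError("d must be a power of two")
    stages = d.bit_length() - 1  # log2(d) for a power-of-two d
    return d * d + num_experts * (d // 2) * stages


d, n = 512, 256
dense = dense_moe_params(n, d)        # 67108864
shared = butterfly_moe_params(n, d)   # 851968
print(dense, shared, round(dense / shared, 1))
```

The printed ratio is a raw weight count under these assumptions. It is not meant to reproduce the reported 150× figure, which also depends on the bit widths used for the shared prototype and on the actual architecture.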


Taxonomy

Topics: Stochastic Gradient Optimization Techniques · Ferroelectric and Negative Capacitance Devices · Neural Networks and Reservoir Computing