ButterflyMoE: Sub-Linear Ternary Experts via Structured Butterfly Orbits
Aryan Karmore

TL;DR
ButterflyMoE treats the experts of a mixture-of-experts model as geometric reorientations of shared parameters, which significantly reduces memory usage and enables deployment on edge devices with minimal accuracy loss.
Contribution
The paper proposes ButterflyMoE, a novel method that uses learned rotations of a shared quantized substrate to achieve sub-linear memory scaling for expert models.
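To make the scaling argument concrete, here is a rough back-of-the-envelope sketch in Python. It compares storing one dense matrix per expert against storing one shared ternary matrix plus a butterfly-structured rotation per expert. The function names, the 2-bit ternary packing, the fp16 angles, and the choice of a single rotation per expert with (d/2)·log2(d) angles are all assumptions made for illustration; the paper's exact accounting may differ, and the numbers this produces are not the paper's reported figures.

```python
import math

def dense_moe_bytes(num_experts, d, bytes_per_weight=4):
    """Memory for num_experts independent d x d matrices (fp32 by default)."""
    return num_experts * d * d * bytes_per_weight

def shared_substrate_bytes(num_experts, d, bits_per_ternary=2, bytes_per_angle=2):
    """Memory for one shared ternary d x d matrix plus per-expert rotations.

    Assumes d is a power of two and that each expert stores one butterfly
    rotation with log2(d) stages of d/2 angles each (hypothetical layout).
    """
    substrate = d * d * bits_per_ternary / 8       # shared once, not per expert
    stages = int(math.log2(d))
    angles_per_expert = (d // 2) * stages          # O(d log d) instead of O(d^2)
    rotations = num_experts * angles_per_expert * bytes_per_angle
    return substrate + rotations

if __name__ == "__main__":
    d = 512
    for n in (8, 64, 256):
        dense = dense_moe_bytes(n, d)
        shared = shared_substrate_bytes(n, d)
        print(f"experts={n:4d} dense={dense:,} B shared={shared:,.0f} B")
```

Under these assumptions the per-expert cost drops from d² weights to (d/2)·log2(d) angles, so the shared substrate is paid for once and adding experts becomes comparatively cheap.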
Findings
Achieves 150× memory reduction with 256 experts on language benchmarks.
Maintains negligible accuracy loss despite significant memory savings.
Enables expert models to run on edge devices with constrained memory.
Abstract
Mixture-of-experts models with linear memory scaling store an independent weight matrix for every expert, so memory grows in proportion to the number of experts and exceeds the memory budget of edge devices. Current compression methods such as quantization, pruning, and low-rank factorization reduce constant factors but leave the scaling bottleneck unresolved. We introduce ButterflyMoE, a method that treats experts not as independent weight matrices but as geometric reorientations of a unified shared quantized substrate. Diversity among experts arises from viewing shared capacity from different angles, not from redundant storage. By applying learned rotations to a shared ternary prototype, the experts together require memory that is sub-linear in the number of experts. The key insight: training these rotations with quantization reduces activation outliers and stabilizes extremely low-bit training, where static methods…
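The idea of an expert as a rotated view of a shared ternary prototype can be sketched in a few lines of NumPy. This is a minimal illustration, not the paper's implementation: the absmean-style ternarization, the Givens-rotation butterfly layout, applying the rotation on the input side only, and the names `ternarize`, `butterfly_rotate`, and `expert_forward` are all assumptions chosen for the example.

```python
import numpy as np

def ternarize(w):
    """Quantize w to {-1, 0, +1} with one scale per matrix (assumed scheme)."""
    scale = np.mean(np.abs(w)) + 1e-8
    return np.clip(np.round(w / scale), -1, 1), scale

def butterfly_rotate(x, angles):
    """Rotate the last axis of x with log2(d) stages of Givens rotations.

    angles has shape (log2(d), d // 2); d must be a power of two.
    Each stage pairs index i with i + stride, so the product is orthogonal.
    """
    d = x.shape[-1]
    y = x.copy()
    idx = np.arange(d)
    for s in range(angles.shape[0]):
        stride = 1 << s
        lo = idx[(idx // stride) % 2 == 0]   # first element of each pair
        hi = lo + stride                     # its partner
        c, sn = np.cos(angles[s]), np.sin(angles[s])
        a, b = y[..., lo], y[..., hi]        # fancy indexing returns copies
        y[..., lo] = c * a - sn * b
        y[..., hi] = sn * a + c * b
    return y

def expert_forward(x, ternary, scale, angles):
    """One expert: rotate the input, then apply the shared ternary weights."""
    return butterfly_rotate(x, angles) @ (scale * ternary).T

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    d, num_experts = 8, 4
    ternary, scale = ternarize(rng.normal(size=(d, d)))   # stored once
    thetas = rng.uniform(-np.pi, np.pi, size=(num_experts, 3, d // 2))
    x = rng.normal(size=(2, d))
    outputs = [expert_forward(x, ternary, scale, t) for t in thetas]
    print(np.stack(outputs).shape)
```

Each expert here differs only in its small array of angles, while the ternary matrix is shared by all of them. In a trainable version the angles and the latent full-precision weights would be learned jointly, which is the setting the abstract refers to when it says the rotations are trained with quantization.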
Taxonomy
Topics: Stochastic Gradient Optimization Techniques · Ferroelectric and Negative Capacitance Devices · Neural Networks and Reservoir Computing
