OmniMoE: An Efficient MoE by Orchestrating Atomic Experts at Scale
Jingze Shi, Zhangyang Peng, Yizhang Zhu, Yifan Wu, Guang Liu, Yuyu Luo

TL;DR
OmniMoE introduces a novel system-algorithm co-designed framework that enables extremely fine-grained expert routing in MoE architectures, significantly improving efficiency and accuracy at scale.
Contribution
OmniMoE proposes vector-level Atomic Experts with a Cartesian Product Router and Expert-Centric Scheduling, enabling scalable, efficient MoE with maximal expert granularity.
Findings
Achieves 50.9% zero-shot accuracy on seven benchmarks.
Reduces inference latency from 73 ms to 6.7 ms (roughly a 10.9x speedup).
Outperforms existing coarse- and fine-grained MoE baselines.
Abstract
Mixture-of-Experts (MoE) architectures are evolving towards finer granularity to improve parameter efficiency. However, existing MoE designs face an inherent trade-off between the granularity of expert specialization and hardware execution efficiency. We propose OmniMoE, a system-algorithm co-designed framework that pushes expert granularity to its logical extreme. OmniMoE introduces vector-level Atomic Experts, enabling scalable routing and execution within a single MoE layer, while retaining a shared dense MLP branch for general-purpose processing. Although this atomic design maximizes capacity, it poses severe challenges for routing complexity and memory access. To address these, OmniMoE adopts a system-algorithm co-design: (i) a Cartesian Product Router that decomposes the massive index space to reduce routing complexity from O(N) to O(sqrt(N)); and (ii) Expert-Centric Scheduling…
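The abstract does not spell out how the Cartesian Product Router works, so the following is only a minimal sketch of the general idea of factorized routing, under stated assumptions. It assumes the N experts are arranged on a sqrt(N) x sqrt(N) grid and that an expert's score is the sum of one row score and one column score, in the style of product-key lookups. The function names (`top_k`, `cartesian_route`, `brute_force_route`) and the additive scoring rule are hypothetical and are not taken from the paper. Under these assumptions the router scores only 2·sqrt(N) values per token instead of N; for N = 1,048,576 experts that is 2,048 scores.

```python
import heapq
from itertools import product


def top_k(scores, k):
    """Return the k (index, score) pairs with the highest scores."""
    return heapq.nlargest(k, enumerate(scores), key=lambda pair: pair[1])


def cartesian_route(row_scores, col_scores, k):
    """Select the top-k experts on a rows x cols grid (illustrative sketch).

    Assumption: expert (i, j) has score row_scores[i] + col_scores[j].
    With additive scores, every global top-k expert lies in the product of
    the top-k rows and top-k columns, so only k * k candidates are compared.
    """
    n_cols = len(col_scores)
    top_rows = top_k(row_scores, k)
    top_cols = top_k(col_scores, k)
    candidates = [
        (i * n_cols + j, row_score + col_score)  # flat expert index, score
        for (i, row_score), (j, col_score) in product(top_rows, top_cols)
    ]
    return heapq.nlargest(k, candidates, key=lambda pair: pair[1])


def brute_force_route(row_scores, col_scores, k):
    """Reference router that scores all rows * cols experts explicitly."""
    n_cols = len(col_scores)
    full = [
        (i * n_cols + j, row_score + col_score)
        for i, row_score in enumerate(row_scores)
        for j, col_score in enumerate(col_scores)
    ]
    return heapq.nlargest(k, full, key=lambda pair: pair[1])


if __name__ == "__main__":
    rows = [0.1, 0.9, 0.3, 0.5]   # hypothetical row scores for one token
    cols = [0.2, 0.7, 0.05, 0.4]  # hypothetical column scores for one token
    for expert_index, score in cartesian_route(rows, cols, k=3):
        print(expert_index, round(score, 2))
```

On this 4 x 4 toy grid the factorized router and the brute-force reference select the same three experts. The saving comes from never materializing the full score table, which matters when N reaches the scale implied by vector-level experts.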
Taxonomy
Topics: Advanced Neural Network Applications · Mobile Crowdsensing and Crowdsourcing · IoT and Edge/Fog Computing
