EC2MoE: Adaptive End-Cloud Pipeline Collaboration Enabling Scalable Mixture-of-Experts Inference
Zheming Yang, Yunqing Hu, Sheng Sun, and Wen Ji

TL;DR
EC2MoE is an adaptive end-cloud pipeline collaboration framework for Mixture-of-Experts inference across heterogeneous environments, with reported gains of 2.2x to 5.1x in throughput and a 53% to 67% reduction in end-to-end latency.
Contribution
The paper presents a hardware-aware expert selection mechanism and a pipeline optimization strategy for scalable MoE inference in end-cloud settings.
Findings
Increases throughput by 2.2x to 5.1x.
Reduces end-to-end latency by 53% to 67%.
Maintains high accuracy and scalability under dynamic conditions.
Abstract
The Mixture-of-Experts (MoE) paradigm has emerged as a promising solution to scale up model capacity while maintaining inference efficiency. However, deploying MoE models across heterogeneous end-cloud environments poses new challenges in expert scheduling, communication overhead, and resource heterogeneity. In this paper, we propose EC2MoE, an adaptive framework for scalable MoE inference via end-cloud pipeline collaboration. First, we design a hardware-aware lightweight group gate network that enhances expert selection and computational efficiency. By incorporating a hardware-aware local expert selection mechanism, the system adaptively filters candidate experts based on real-time device profiles. A lightweight group gate module then integrates local and global gating outputs to achieve high-quality expert routing with minimal overhead. Second, we develop a pipeline optimization…
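The abstract's first component (filter candidate experts against a real-time device profile, then merge local and global gating outputs) can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not specify the profile fields, the fusion rule, or the top-k step. The names `DeviceProfile`, `Expert`, `filter_candidates`, `group_gate`, the blending weight `alpha`, and the linear blend itself are all hypothetical choices made for illustration.

```python
import math
from dataclasses import dataclass


@dataclass
class DeviceProfile:
    # Hypothetical real-time device profile; the field names are assumptions.
    free_memory_mb: float
    compute_budget_gflops: float


@dataclass
class Expert:
    # Hypothetical per-expert resource footprint.
    expert_id: int
    memory_mb: float
    cost_gflops: float


def filter_candidates(experts, profile):
    """Keep only experts that fit the device's current memory and compute budget."""
    return [
        e for e in experts
        if e.memory_mb <= profile.free_memory_mb
        and e.cost_gflops <= profile.compute_budget_gflops
    ]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [x / total for x in exps]


def group_gate(local_logits, global_logits, candidates, alpha=0.5, top_k=2):
    """Blend local and global gate logits over the candidates and pick top_k.

    The linear blend with weight `alpha` is an assumed fusion rule, not the
    rule used in the paper. Returns (expert_id, weight) pairs whose weights
    sum to 1, ordered from highest to lowest weight.
    """
    ids = [e.expert_id for e in candidates]
    if not ids:
        return []
    blended = [alpha * local_logits[i] + (1 - alpha) * global_logits[i] for i in ids]
    probs = softmax(blended)
    ranked = sorted(zip(ids, probs), key=lambda pair: pair[1], reverse=True)[:top_k]
    norm = sum(p for _, p in ranked)
    return [(i, p / norm) for i, p in ranked]
```

In this sketch, an expert with a high gate score is still skipped if it does not fit the device's current budget, which is the intent of "hardware-aware" filtering as the abstract describes it; how the actual system trades routing quality against device constraints is not recoverable from the truncated abstract.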
Taxonomy
Topics: Scientific Computing and Data Management · Data Stream Mining Techniques · Big Data and Business Intelligence
