CoMoE: Collaborative Optimization of Expert Aggregation and Offloading for MoE-based LLMs at Edge
Muqing Li, Ning Li, Xin Yuan, Wenchao Xu, Quan Chen, Song Guo, Haijun Zhang

TL;DR
CoMoE is a dynamic, resource-aware framework that jointly optimizes expert aggregation and offloading for MoE-based large language models on mobile edge devices, significantly reducing memory use and inference latency while maintaining performance.
Contribution
CoMoE introduces an adaptive optimization framework for MoE deployment in mobile edge environments that adjusts expert aggregation and offloading dynamically.
Findings
70% memory reduction compared to baselines
10.5% lower inference latency than existing methods
Enables deployment of large-scale MoE models on resource-constrained devices
Abstract
The proliferation of large language models (LLMs) has driven the adoption of Mixture-of-Experts (MoE) architectures as a promising solution to scale model capacity while controlling computational costs. However, deploying MoE models in resource-constrained mobile edge computing environments presents significant challenges due to their large memory footprint and dynamic expert activation patterns. To address these challenges, we propose a novel dynamic resource-aware collaborative optimization framework that jointly optimizes expert aggregation granularity and offloading strategies based on real-time device resource states, network conditions, and input characteristics in mobile edge environments, denoted as CoMoE. In CoMoE, we first systematically analyze existing expert aggregation techniques, including expert parameter merging, knowledge distillation, and parameter sharing…
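To make the two ideas named in the abstract concrete, the following is a minimal sketch, not the authors' algorithm. It assumes a simple greedy rule for offloading (keep the most frequently activated experts on the device until a memory budget is reached) and a count-weighted average as one possible form of expert parameter merging. The names `Expert`, `plan_placement`, and `merge_weights`, along with all sizes and counts, are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class Expert:
    name: str
    size_mb: float         # memory footprint of the expert's parameters
    activation_count: int  # how often the router selected this expert


def plan_placement(experts, memory_budget_mb):
    """Greedy split (an assumption, not CoMoE's policy): keep the most
    frequently activated experts on the device, offload the rest."""
    on_device, offloaded = [], []
    used = 0.0
    for e in sorted(experts, key=lambda e: e.activation_count, reverse=True):
        if used + e.size_mb <= memory_budget_mb:
            on_device.append(e.name)
            used += e.size_mb
        else:
            offloaded.append(e.name)
    return on_device, offloaded


def merge_weights(weight_vectors, counts):
    """Count-weighted average of expert parameter vectors, one simple
    form of expert parameter merging."""
    total = sum(counts)
    dim = len(weight_vectors[0])
    return [
        sum(c * w[i] for w, c in zip(weight_vectors, counts)) / total
        for i in range(dim)
    ]


# Hypothetical experts: four equally sized experts with skewed usage.
experts = [
    Expert("e0", 100.0, 50),
    Expert("e1", 100.0, 30),
    Expert("e2", 100.0, 15),
    Expert("e3", 100.0, 5),
]
on_device, offloaded = plan_placement(experts, memory_budget_mb=250.0)

# Merge two toy parameter vectors, weighting the first three times as much.
merged = merge_weights([[1.0, 2.0], [3.0, 4.0]], counts=[3, 1])
```

With a 250 MB budget only the two most active experts fit on the device, and the remaining two are marked for offloading. A real system would also have to weigh network conditions and input characteristics, which this sketch ignores.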
