SlimCaching: Edge Caching of Mixture-of-Experts for Distributed Inference
Qian Chen, Xianhao Chen, Kaibin Huang

TL;DR
This paper proposes a caching strategy for distributed inference with Mixture-of-Experts (MoE) large language models, placing experts on edge servers so as to minimize inference latency under storage constraints.
Contribution
It introduces an optimization framework for expert caching in MoE models, with algorithms that carry provable guarantees for both the single-expert (Top-1) case and the general Top-K case, improving inference efficiency.
Findings
Significant latency reduction in distributed MoE inference
Effective caching algorithms with theoretical approximation guarantees
Enhanced scalability of large language models on edge devices
Abstract
Mixture-of-Experts (MoE) models improve the scalability of large language models (LLMs) by activating only a small subset of relevant experts per input. However, the sheer number of expert networks in an MoE model introduces a significant storage/memory burden for an edge device. To address this challenge, we consider a scenario where experts are dispersed across an edge network for distributed inference. Based on the popular Top-K expert selection strategy, we formulate a latency minimization problem by optimizing expert caching on edge servers under storage constraints. When K = 1, the problem reduces to a monotone submodular maximization problem with knapsack constraints, for which we design a greedy-based algorithm with a (1 - 1/e)-approximation guarantee. For the general case where K ≥ 1, expert co-activation within the same MoE layer introduces non-submodularity, which…
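To make the submodular special case in the abstract more concrete, the following is a minimal sketch of a density-based greedy for placing experts on storage-limited servers. It is an illustration under simplifying assumptions, not the paper's algorithm: the single-user latency model, the fallback `cloud_latency`, and the names `expected_latency` and `greedy_cache` are all hypothetical, and this plain ratio greedy is not claimed to achieve the approximation guarantee stated in the abstract.

```python
from typing import Dict, Set, Tuple


def expected_latency(
    placement: Dict[str, Set[str]],
    probs: Dict[str, float],
    server_latency: Dict[str, float],
    cloud_latency: float,
) -> float:
    """Expected latency when each request needs exactly one expert (assumed model).

    A request for expert e is served by the fastest server caching e,
    or by a hypothetical fallback with latency `cloud_latency` otherwise.
    """
    total = 0.0
    for expert, p in probs.items():
        best = cloud_latency
        for server, cached in placement.items():
            if expert in cached:
                best = min(best, server_latency[server])
        total += p * best
    return total


def greedy_cache(
    probs: Dict[str, float],
    sizes: Dict[str, float],
    capacity: Dict[str, float],
    server_latency: Dict[str, float],
    cloud_latency: float,
) -> Tuple[Dict[str, Set[str]], float]:
    """Repeatedly add the (server, expert) pair with the best latency
    reduction per unit of storage, as long as it fits and helps."""
    placement: Dict[str, Set[str]] = {s: set() for s in capacity}
    free = dict(capacity)
    current = expected_latency(placement, probs, server_latency, cloud_latency)

    while True:
        best_pair, best_ratio, best_latency = None, 0.0, current
        for server in capacity:
            for expert in probs:
                if expert in placement[server] or sizes[expert] > free[server]:
                    continue
                # Evaluate the marginal gain of caching `expert` on `server`.
                placement[server].add(expert)
                new = expected_latency(placement, probs, server_latency, cloud_latency)
                placement[server].remove(expert)
                ratio = (current - new) / sizes[expert]
                if ratio > best_ratio:
                    best_pair, best_ratio, best_latency = (server, expert), ratio, new
        if best_pair is None:
            break  # nothing fits or nothing reduces latency
        server, expert = best_pair
        placement[server].add(expert)
        free[server] -= sizes[expert]
        current = best_latency

    return placement, current


if __name__ == "__main__":
    # Toy instance with made-up numbers, purely for illustration.
    placement, latency = greedy_cache(
        probs={"e0": 0.5, "e1": 0.3, "e2": 0.2},
        sizes={"e0": 2, "e1": 1, "e2": 1},
        capacity={"near": 2, "far": 2},
        server_latency={"near": 1.0, "far": 3.0},
        cloud_latency=10.0,
    )
    print(placement, latency)
```

Under this toy model, the latency reduction is monotone and submodular in the set of cached (server, expert) pairs, because a second copy of an expert can only help less than the first. When several experts are co-activated per token, that diminishing-returns property can fail, which is the difficulty the abstract points to for the general case.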
Taxonomy
Topics: Privacy-Preserving Technologies in Data · Mobile Crowdsensing and Crowdsourcing · Stochastic Gradient Optimization Techniques
