SlimCaching: Edge Caching of Mixture-of-Experts for Distributed Inference

Qian Chen; Xianhao Chen; Kaibin Huang

arXiv:2507.06567 · cs.LG · March 3, 2026

TL;DR

This paper proposes a caching strategy for distributed inference with Mixture-of-Experts (MoE) large language models, placing experts on edge servers so as to minimize inference latency under storage constraints.

Contribution

It introduces an optimization framework for expert caching in MoE models, with algorithms that carry provable guarantees for both the Top-1 case ($K = 1$) and the general Top-$K$ case ($K \geq 1$), improving inference efficiency.

Findings

01. Significant latency reduction in distributed MoE inference

02. Effective caching algorithms with theoretical approximation guarantees

03. Enhanced scalability of large language models on edge devices

Abstract

Mixture-of-Experts (MoE) models improve the scalability of large language models (LLMs) by activating only a small subset of relevant experts per input. However, the sheer number of expert networks in an MoE model introduces a significant storage/memory burden for an edge device. To address this challenge, we consider a scenario where experts are dispersed across an edge network for distributed inference. Based on the popular Top-$K$ expert selection strategy, we formulate a latency minimization problem by optimizing expert caching on edge servers under storage constraints. When $K = 1$, the problem reduces to a monotone submodular maximization problem with knapsack constraints, for which we design a greedy-based algorithm with a $(1 - 1/e)$-approximation guarantee. For the general case where $K \geq 1$, expert co-activation within the same MoE layer introduces non-submodularity, which…
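To make the $K = 1$ setting more concrete, here is a minimal Python sketch of a density-based greedy for caching experts under per-server storage limits. It is an illustration under assumptions, not the paper's algorithm: the names `expected_saving` and `greedy_cache`, the per-server latency model, the `fallback` latency for uncached experts, and all numbers are hypothetical. The toy objective is of the facility-location type, which is monotone submodular, but a plain density greedy like this one does not by itself carry the $(1 - 1/e)$ guarantee stated in the abstract.

```python
from typing import Dict, Optional, Set, Tuple

Placement = Dict[str, Set[str]]  # server name -> set of cached expert names


def expected_saving(placement: Placement, prob: Dict[str, float],
                    latency: Dict[str, float], fallback: float) -> float:
    """Expected latency saved versus always using the fallback path (assumed model)."""
    total = 0.0
    for expert, p in prob.items():
        # Latency of the fastest server holding this expert, else the fallback.
        best = min(
            (latency[s] for s, cached in placement.items() if expert in cached),
            default=fallback,
        )
        total += p * max(fallback - best, 0.0)
    return total


def greedy_cache(prob: Dict[str, float], size: Dict[str, float],
                 latency: Dict[str, float], capacity: Dict[str, float],
                 fallback: float) -> Tuple[Placement, float]:
    """Repeatedly add the (server, expert) pair with the best gain per unit of storage."""
    placement: Placement = {s: set() for s in capacity}
    free = dict(capacity)  # remaining storage on each server
    current = 0.0
    while True:
        best_ratio = 0.0
        best_move: Optional[Tuple[str, str]] = None
        best_value = current
        for s in capacity:
            for e in prob:
                if e in placement[s] or size[e] > free[s]:
                    continue  # already cached here, or does not fit
                placement[s].add(e)
                value = expected_saving(placement, prob, latency, fallback)
                placement[s].remove(e)
                ratio = (value - current) / size[e]
                if ratio > best_ratio + 1e-12:
                    best_ratio, best_move, best_value = ratio, (s, e), value
        if best_move is None:
            return placement, current  # no feasible move improves the objective
        s, e = best_move
        placement[s].add(e)
        free[s] -= size[e]
        current = best_value


# Hypothetical toy instance: three experts, two edge servers.
prob = {"e0": 0.5, "e1": 0.3, "e2": 0.2}   # assumed Top-1 activation probabilities
size = {"e0": 1.0, "e1": 1.0, "e2": 1.0}   # assumed storage cost per expert
latency = {"near": 1.0, "far": 3.0}        # assumed per-server access latency
capacity = {"near": 1.0, "far": 2.0}       # assumed storage budget per server
placement, saving = greedy_cache(prob, size, latency, capacity, fallback=10.0)
print(placement, saving)
```

In this toy instance the most popular expert ends up on the lowest-latency server and the remaining experts fill the slower one. Re-evaluating the objective for every candidate pair keeps the sketch short; a practical implementation would compute marginal gains incrementally.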

Taxonomy

Topics: Privacy-Preserving Technologies in Data · Mobile Crowdsensing and Crowdsourcing · Stochastic Gradient Optimization Techniques