Not All Models Suit Expert Offloading: On Local Routing Consistency of Mixture-of-Expert Models
Jingcong Liang, Siyuan Wang, Miren Tian, Yitong Li, Duyu Tang, Zhongyu Wei

TL;DR
This paper studies local routing consistency in Mixture-of-Experts (MoE) models: it proposes two metrics (SRP and SCH) to measure it, analyzes 20 MoE models, and identifies factors that affect routing consistency and load balance, with implications for memory-efficient deployment through expert offloading.
Contribution
The paper introduces two metrics for measuring local routing consistency in MoE models and provides a comprehensive analysis of factors influencing routing behavior across diverse models.
Findings
Strong trade-off between local routing consistency and local load balance.
Global load balance can coexist with local routing consistency.
Domain-specialized experts enhance routing consistency.
Abstract
Mixture-of-Experts (MoE) enables efficient scaling of large language models (LLMs) with sparsely activated experts during inference. To effectively deploy large MoE models on memory-constrained devices, many systems introduce *expert offloading* that caches a subset of experts in fast memory, leaving others on slow memory to run on CPU or load on demand. While some research has exploited the locality of expert activations, where consecutive tokens activate similar experts, the degree of this **local routing consistency** varies across models and remains understudied. In this paper, we propose two metrics to measure local routing consistency of MoE models: (1) **Segment Routing Best Performance (SRP)**, which evaluates how well a fixed group of experts can cover the needs of a segment of tokens, and (2) **Segment Cache Best Hit Rate (SCH)**, which measures the hit rate of an expert cache…
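To make the idea behind the first metric concrete, the following is a minimal sketch of how well a fixed group of experts can cover the activations of each token segment. It is not the paper's SRP definition, which this excerpt does not give. The function names (`best_fixed_group`, `segment_coverage`), the per-token routing representation as sets of expert ids, and the choice to score coverage as the fraction of expert activations that fall inside the group are all assumptions made for illustration.

```python
from collections import Counter
from typing import Sequence, Set


def best_fixed_group(segment: Sequence[Set[int]], group_size: int) -> Set[int]:
    """Return the group_size experts activated most often within one segment.

    Picking the most frequent experts maximizes the number of activations
    covered by a fixed group of that size (illustrative objective, assumed).
    """
    counts = Counter(e for token_experts in segment for e in token_experts)
    # Sort by count (descending), then by expert id so ties break deterministically.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return {expert for expert, _ in ranked[:group_size]}


def segment_coverage(routing: Sequence[Set[int]], segment_len: int, group_size: int) -> float:
    """Fraction of expert activations served by the best fixed group per segment."""
    covered = 0
    total = 0
    for start in range(0, len(routing), segment_len):
        segment = routing[start:start + segment_len]
        group = best_fixed_group(segment, group_size)
        for token_experts in segment:
            covered += len(token_experts & group)
            total += len(token_experts)
    return covered / total if total else 0.0


# Hypothetical routing traces for one MoE layer (top-2 routing, 8 experts).
consistent = [{0, 1}, {0, 1}, {0, 2}, {0, 1}, {4, 5}, {4, 5}, {4, 6}, {5, 6}]
scattered = [{0, 1}, {2, 3}, {4, 5}, {6, 7}, {0, 2}, {1, 3}, {4, 6}, {5, 7}]

print(segment_coverage(consistent, segment_len=4, group_size=2))  # 0.8125
print(segment_coverage(scattered, segment_len=4, group_size=2))   # 0.25
```

In this toy example, the trace whose consecutive tokens reuse the same experts is covered far better by a two-expert group than the scattered trace, which is the kind of difference between models that the abstract refers to as local routing consistency.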
Peer Reviews
Decision: ICLR 2026 Poster
The authors tackle a specific and important problem in LLM deployment. The work proposes two metrics (SRP and SCH) to analyze the local routing consistency. The empirical analysis is quite thorough with 20 different MoE models and several datasets from both general training corpora and downstream tasks. The paper considers several reasons that could be responsible for the consistency - model architecture, expert specialization to specific domains or vocabulary subsets and load balancing. The fin
1. There is no explicit connection of proposed metrics to throughput. For instance, how would SRP/SCH affect throughput given cache size, communication time and LLM forward propagation time? There are no measurements of throughput in any of the experiments either.
2. What advantage does SCH have compared to common cache algorithm hit rate? Can we not determine required cache size by analyzing LRU hit rate vs $\rho$?
3. In analysis in Section 3.3, the authors suggest that applying MoE on every
This paper offers a novel analytical perspective on the important and practical problem of efficient MoE model deployment. The concept of "local routing consistency" is insightful. It provides a clear and quantifiable framework for evaluating and comparing the deployment potential of different MoE models in resource-constrained environments. The experimental evaluation is thorough and comprehensive, standing out as a primary strength of this work. The authors analyze up to 20 representative Mo
Although the analysis is insightful, its conclusions are primarily based on correlation rather than causation. For example, the study observes that architectures with "MoE on every layer" and "no shared experts" correlate with high routing consistency and conjectures that dense modules might "interfere with or weaken routing signals". Ablation studies, such as modifying these architectural features on the same backbone (even a small one is ok) and observing the change in consistency, would great
- The idea is intuitive and based on an interesting observation that the activation consistency could change due to MoE architectures. Early studies in this field were unable to cover the later architectural changes in MoEs, so the proposed study addresses an area that has not been well studied.
- The selected models and datasets are comprehensive, covering both classic and recent variants of MoE LLMs (within hardware constraints). Based on these empirical results, the authors have provided som
- The abstract is poorly written. It does not clearly mention the different MoE architectures (shared experts, small/large experts) as the potential reasons for the local consistency discrepancy, which I believe is the most important contribution of this paper.
- While the authors have shown that the different MoE models could have vastly different expert activation consistency, the authors did not perform an end-to-end empirical evaluation on the expert offloading performance with the propose
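The reviewers' question about comparing against the hit rate of a common cache algorithm can be illustrated with a plain LRU baseline. The sketch below simulates an LRU expert cache over a routing trace and reports its hit rate for a given cache size. It is an illustrative baseline only, not the paper's SCH metric or any experiment from the paper. The function name `lru_hit_rate` and the trace format are assumptions, and because $\rho$ is not defined in this excerpt, the cache size is expressed as an absolute number of experts.

```python
from collections import OrderedDict
from typing import Iterable, Set


def lru_hit_rate(routing: Iterable[Set[int]], cache_size: int) -> float:
    """Simulate an LRU expert cache and return hits / total expert activations."""
    cache: "OrderedDict[int, None]" = OrderedDict()  # least recently used first
    hits = 0
    total = 0
    for token_experts in routing:
        # Visit experts in sorted order so the simulation is deterministic.
        for expert in sorted(token_experts):
            total += 1
            if expert in cache:
                hits += 1
                cache.move_to_end(expert)      # mark as most recently used
            else:
                cache[expert] = None           # load the missing expert
                if len(cache) > cache_size:
                    cache.popitem(last=False)  # evict the least recently used
    return hits / total if total else 0.0


# Hypothetical routing trace for one MoE layer (top-2 routing).
trace = [{0, 1}, {0, 1}, {0, 2}, {0, 1}]

for size in (2, 3):
    print(size, lru_hit_rate(trace, cache_size=size))
# 2 0.5
# 3 0.625
```

Sweeping `cache_size` over a real routing trace would give the kind of hit-rate curve the reviewer asks about. How such a curve relates to SCH is a question for the full paper.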
Taxonomy
Topics: Mobile Crowdsensing and Crowdsourcing · Big Data and Digital Economy · IoT and Edge/Fog Computing
Methods: Mixture of Experts
