Multi-Layer Scheduling for MoE-Based LLM Reasoning

Yifan Sun; Gholamreza Haffari; Minxian Xu; Rajkumar Buyya; Adel N. Toosi

arXiv:2602.21626 · cs.DC · March 4, 2026

Open Access

TL;DR

This paper introduces a multi-layer scheduling framework for efficient MoE-based LLM inference. It optimizes request handling, resource utilization, and expert routing, reducing TTFT latency by up to 17.8% and TPOT latency by up to 13.3%.

Contribution

The paper proposes a multi-layer scheduling approach for MoE LLMs that addresses request-level, engine-level, and expert-level scheduling challenges to improve performance over existing frameworks.

Findings

01. Up to 17.8% reduction in time-to-first-token (TTFT) latency.

02. Up to 13.3% reduction in time-per-output-token (TPOT) latency.

03. Consistent performance improvements across diverse workloads.
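
For readers unfamiliar with the two latency metrics, the sketch below shows how TTFT (time to first token) and TPOT (time per output token) are commonly computed from per-request timestamps, and how a percentage reduction is derived. It is a minimal illustration, not the paper's measurement code: `RequestTrace`, its fields, and the helper names are hypothetical, and the sample numbers are made up.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class RequestTrace:
    """Hypothetical per-request timing record (all times in seconds)."""
    arrival: float            # when the request reached the server
    token_times: List[float]  # completion time of each output token


def ttft(trace: RequestTrace) -> float:
    """Time to first token: delay from arrival to the first output token."""
    return trace.token_times[0] - trace.arrival


def tpot(trace: RequestTrace) -> float:
    """Time per output token: mean gap between tokens after the first."""
    n = len(trace.token_times)
    if n < 2:
        return 0.0  # a single-token response has no inter-token gap
    return (trace.token_times[-1] - trace.token_times[0]) / (n - 1)


def relative_reduction(baseline: float, improved: float) -> float:
    """Percentage reduction of `improved` relative to `baseline`."""
    return (baseline - improved) / baseline * 100.0


# Illustrative values only; they are not taken from the paper.
trace = RequestTrace(arrival=0.0, token_times=[0.5, 0.6, 0.7, 0.8])
print(f"TTFT = {ttft(trace):.3f} s, TPOT = {tpot(trace):.3f} s")
print(f"Reduction = {relative_reduction(2.0, 1.5):.1f}%")
```

An "up to" figure such as those above would then be the largest such reduction observed across the evaluated workloads.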

Abstract

Large Language Models (LLMs) have achieved remarkable success across a wide range of tasks, but serving them efficiently at scale remains a critical challenge due to their substantial computational and latency demands. While most existing inference frameworks rely on simple scheduling strategies such as First-Come-First-Serve (FCFS) at the engine level and Round-Robin (RR) at the scheduler or coordinator level, they often fail to fully utilize system resources and may suffer from issues such as head-of-line blocking and load imbalance. Recent advances in Mixture-of-Experts (MoE) models have also introduced new challenges in scheduling arising from expert parallelism and routing complexity. This research proposes a multi-layer scheduling framework tailored for MoE-based LLM serving. It targets scheduling at three levels: request-level, engine-level, and expert-level. At the request level,…
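
The two baseline problems named in the abstract, head-of-line blocking under FCFS and load imbalance under Round-Robin, can be illustrated with a toy simulation. This is a sketch under simplifying assumptions (known job costs, one job at a time per engine), and the comparison policies used here, shortest-first ordering and least-loaded dispatch, are textbook alternatives chosen for contrast. They are not the scheduling policies proposed in the paper, and all function names are hypothetical.

```python
from typing import List


def mean_wait(service_times: List[float]) -> float:
    """Mean queueing delay when jobs run one at a time in the given order."""
    waited, clock = 0.0, 0.0
    for s in service_times:
        waited += clock  # this job waited for everything ahead of it
        clock += s
    return waited / len(service_times)


def round_robin(costs: List[float], n_engines: int) -> List[float]:
    """Assign job i to engine i mod n_engines, ignoring job cost."""
    loads = [0.0] * n_engines
    for i, c in enumerate(costs):
        loads[i % n_engines] += c
    return loads


def least_loaded(costs: List[float], n_engines: int) -> List[float]:
    """Assign each job to the engine with the smallest accumulated load."""
    loads = [0.0] * n_engines
    for c in costs:
        loads[loads.index(min(loads))] += c
    return loads


# Head-of-line blocking: one long request delays the short ones behind it.
jobs = [10.0, 1.0, 1.0]
print(mean_wait(jobs))          # FCFS order: 7.0
print(mean_wait(sorted(jobs)))  # shortest-first order: 1.0

# Load imbalance: alternating heavy and light requests over two engines.
costs = [8.0, 1.0, 8.0, 1.0]
print(round_robin(costs, 2))    # [16.0, 2.0]
print(least_loaded(costs, 2))   # [9.0, 9.0]
```

Real serving systems do not know request costs in advance, which is part of why cost-oblivious policies such as FCFS and RR remain common defaults.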

Taxonomy

Topics: Big Data and Digital Economy · IoT and Edge/Fog Computing · Software System Performance and Reliability