Multi-Layer Scheduling for MoE-Based LLM Reasoning
Yifan Sun, Gholamreza Haffari, Minxian Xu, Rajkumar Buyya, Adel N. Toosi

TL;DR
This paper introduces a multi-layer scheduling framework for efficient MoE-based LLM inference. It optimizes request handling, resource utilization, and expert routing, reducing time-to-first-token (TTFT) latency by up to 17.8% and time-per-output-token (TPOT) latency by up to 13.3%.
Contribution
It proposes a novel multi-layer scheduling approach for MoE LLMs that addresses request-, engine-, and expert-level challenges to improve performance over existing frameworks.
Findings
Up to 17.8% reduction in TTFT latency.
Up to 13.3% reduction in TPOT latency.
Consistent performance improvements across diverse workloads.
Abstract
Large Language Models (LLMs) have achieved remarkable success across a wide range of tasks, but serving them efficiently at scale remains a critical challenge due to their substantial computational and latency demands. Most existing inference frameworks rely on simple scheduling strategies such as First-Come-First-Serve (FCFS) at the engine level and Round-Robin (RR) at the scheduler or coordinator level; as a result, they often fail to fully utilize system resources and may suffer from issues such as head-of-line blocking and load imbalance. Recent advances in Mixture-of-Experts (MoE) models have also introduced new scheduling challenges arising from expert parallelism and routing complexity. This research proposes a multi-layer scheduling framework tailored for MoE-based LLM serving. It targets scheduling at three levels: request-level, engine-level, and expert-level. At the request level,…
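The two failure modes named in the abstract, head-of-line blocking under FCFS and load imbalance under Round-Robin, can be illustrated with a toy model. The sketch below is not the paper's scheduler: the request costs, the function names, and the cost-aware alternatives (serving shortest requests first, dispatching to the least-loaded engine) are all assumptions chosen for illustration, and it further assumes each request's cost is known in advance, which a real serving system would have to estimate.

```python
from typing import List


def mean_wait(costs: List[float]) -> float:
    """Mean time a request waits before it starts, when one engine
    serves the requests strictly in the given order."""
    waited, clock = 0.0, 0.0
    for cost in costs:
        waited += clock   # this request waited for everything ahead of it
        clock += cost
    return waited / len(costs)


def round_robin_loads(costs: List[float], engines: int) -> List[float]:
    """Total work per engine when request i goes to engine i mod N."""
    loads = [0.0] * engines
    for i, cost in enumerate(costs):
        loads[i % engines] += cost
    return loads


def least_loaded_loads(costs: List[float], engines: int) -> List[float]:
    """Total work per engine when each request goes to the engine
    with the least work assigned so far (hypothetical alternative)."""
    loads = [0.0] * engines
    for cost in costs:
        target = loads.index(min(loads))
        loads[target] += cost
    return loads


# Head-of-line blocking: one long request arrives ahead of three short ones.
arrivals = [10.0, 1.0, 1.0, 1.0]          # hypothetical service times
fcfs_wait = mean_wait(arrivals)            # arrival order
shortest_first_wait = mean_wait(sorted(arrivals))

# Load imbalance: long and short requests alternate across two engines.
mixed = [8.0, 1.0, 8.0, 1.0]               # hypothetical service times
rr_loads = round_robin_loads(mixed, 2)
balanced_loads = least_loaded_loads(mixed, 2)

print(fcfs_wait, shortest_first_wait)      # 8.25 1.5
print(rr_loads, balanced_loads)            # [16.0, 2.0] [9.0, 9.0]
```

In this toy setting the long request at the head of the queue raises the mean wait from 1.5 to 8.25 time units, and Round-Robin places both long requests on the same engine. Real workloads add batching, preemption, and uncertain output lengths, so the numbers only show the direction of the effect.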
Taxonomy
Topics: Big Data and Digital Economy · IoT and Edge/Fog Computing · Software System Performance and Reliability
