BrownoutServe: SLO-Aware Inference Serving under Bursty Workloads for MoE-based LLMs
Jianmin Hu, Minxian Xu, Kejiang Ye, Chengzhong Xu

TL;DR
BrownoutServe is a dynamic inference serving framework for MoE-based large language models that improves throughput and reduces latency under bursty workloads through adaptive expert management and token processing.
Contribution
It introduces a novel brownout mechanism and united experts to enhance resource utilization and maintain SLOs in MoE LLM inference serving.
Findings
Achieves up to 2.07x throughput improvement.
Reduces SLO violations by 90.28%.
Maintains acceptable inference accuracy.
Abstract
In recent years, the Mixture-of-Experts (MoE) architecture has been widely applied to large language models (LLMs), providing a promising solution that activates only a subset of the model's parameters during computation, thereby reducing overall memory requirements and allowing for faster inference compared to dense models. Despite these advantages, existing systems still suffer from low efficiency due to static model placement and a lack of adaptation to dynamic workloads. This leads to suboptimal resource utilization and increased latency, especially during periods of bursty requests. To address these challenges, this paper introduces BrownoutServe, a novel serving framework designed to optimize inference efficiency and maintain service reliability for MoE-based LLMs under dynamic computational demands and traffic conditions. BrownoutServe introduces "united experts" that integrate…
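To make "activates only a subset of the model's parameters" concrete, here is a minimal sketch of generic top-k expert routing in plain Python. It illustrates the standard MoE idea only and is not BrownoutServe's brownout mechanism or its united experts; the names `top_k_route` and `moe_forward`, the scalar toy experts, and the choice of k = 2 are all assumptions made for illustration.

```python
import math

def top_k_route(gate_logits, k=2):
    """Return (expert_index, weight) pairs for the k highest-scoring experts."""
    # Pick the k experts with the largest gate logits (ties keep index order).
    top = sorted(range(len(gate_logits)),
                 key=lambda i: gate_logits[i], reverse=True)[:k]
    # Softmax over the selected logits only, so the weights sum to 1.
    m = max(gate_logits[i] for i in top)
    exps = [math.exp(gate_logits[i] - m) for i in top]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]

def moe_forward(x, experts, gate_logits, k=2):
    """Combine outputs of the selected experts; unselected experts never run."""
    return sum(w * experts[i](x) for i, w in top_k_route(gate_logits, k))

# Four toy "experts" (hypothetical scalar stand-ins for feed-forward blocks).
experts = [lambda x: x + 1, lambda x: 2 * x, lambda x: x * x, lambda x: -x]
gate_logits = [0.1, 2.0, 2.0, -1.0]  # assumed gate scores for a single token

routes = top_k_route(gate_logits, k=2)
y = moe_forward(3.0, experts, gate_logits, k=2)
print(routes, y)
```

With these assumed gate scores, only experts 1 and 2 are evaluated for the token, each with weight 0.5; the other two experts contribute no computation.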
