Janus: Disaggregating Attention and Experts for Scalable MoE Inference
Zhexiang Zhang, Ye Wang, Yumiao Zhao, Jiayu Xiao, Qianjing Yang, Xiangyu Wang, Jingzhe Jiang, Qizhen Weng, Ruichuan Chen, Shaohuai Shi, Adel N. Toosi, Yin Chen, Minchen Yu

TL;DR
JANUS is a scalable, resource-efficient MoE inference system that disaggregates attention and expert layers, enabling independent resource management and improved performance.
Contribution
JANUS introduces disaggregation of attention and MoE layers onto separate GPU worker pools, an adaptive two-phase communication mechanism, and a microsecond-scale activation scheduler for efficient MoE inference.
Findings
Up to 4.7x throughput improvement over baselines
Reduces inference latency by balancing activated experts
Minimizes GPU cost while meeting latency SLOs
Abstract
Serving large Mixture-of-Experts (MoE) models is challenging because of their large memory footprints, heterogeneous resource demands, and highly dynamic inference workloads. Most existing MoE inference systems deploy the entire model as a monolithic unit, forcing attention and MoE layers to share the same resource configuration despite their different scaling behaviors and resource bottlenecks. Such coarse-grained provisioning leads to resource inefficiency and suboptimal performance. We present JANUS, a scalable and resource-efficient MoE inference system built around three key principles. First, JANUS disaggregates attention and MoE layers onto separate GPU worker pools, enabling independent resource provisioning for the two layer types, and uses an adaptive two-phase communication mechanism for low-latency data exchange. Second, because MoE-layer execution is often memory-bound and…
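To make the idea of independent resource provisioning concrete, here is a minimal sketch that sizes an attention pool and an MoE pool separately so that a latency SLO is met with the fewest GPUs. It is an illustration only, not the JANUS algorithm: the function name `cheapest_split`, the toy latency model (each pool's latency is its per-layer work divided by its GPU count), and all numbers are assumptions introduced here.

```python
from itertools import product


def cheapest_split(attn_work_ms, moe_work_ms, slo_ms, max_gpus=16):
    """Return (attn_gpus, moe_gpus) with the fewest total GPUs that meets slo_ms.

    Hypothetical toy model: each pool's latency is its work divided by the
    number of GPUs in that pool. Returns None if no split within max_gpus
    per pool satisfies the SLO.
    """
    best = None
    for attn_gpus, moe_gpus in product(range(1, max_gpus + 1), repeat=2):
        # The two pools are sized independently; their latencies add up.
        latency_ms = attn_work_ms / attn_gpus + moe_work_ms / moe_gpus
        if latency_ms > slo_ms:
            continue
        # Keep the first split found with the smallest total GPU count.
        if best is None or attn_gpus + moe_gpus < sum(best):
            best = (attn_gpus, moe_gpus)
    return best


if __name__ == "__main__":
    # Assumed workload: attention-heavy, with a 12 ms latency target.
    print(cheapest_split(attn_work_ms=40.0, moe_work_ms=10.0, slo_ms=12.0))
```

Because the two pool sizes are searched separately, the cheapest configuration can give the pools unequal GPU counts, which a monolithic deployment with one shared resource configuration cannot express.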
