Janus: Disaggregating Attention and Experts for Scalable MoE Inference

Zhexiang Zhang; Ye Wang; Yumiao Zhao; Jiayu Xiao; Qianjing Yang; Xiangyu Wang; Jingzhe Jiang; Qizhen Weng; Ruichuan Chen; Shaohuai Shi; Adel N. Toosi; Yin Chen; Minchen Yu

arXiv:2512.13525 · cs.DC · April 29, 2026

TL;DR

JANUS is a scalable, resource-efficient MoE inference system that disaggregates attention and expert layers, enabling independent resource management and improved performance.

Contribution

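JANUS contributes three things: a design that disaggregates attention and MoE layers onto separate GPU worker pools, an adaptive two-phase communication mechanism between those pools, and a microsecond-scale activation scheduler for efficient MoE inference.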

Findings

01. Up to 4.7x throughput improvement over baselines

02. Reduces inference latency by balancing activated experts

03. Minimizes GPU cost while meeting latency SLOs
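
To make the second finding concrete, here is a minimal Python sketch of why the number of activated experts per GPU can matter. It is not the JANUS scheduler. It assumes, for illustration only, that a GPU's MoE-layer time for a batch grows with the number of distinct experts it has to run, so the busiest GPU sets the latency. The function names, the toy routing table, and both placements are hypothetical.

```python
from collections import defaultdict


def activated_experts_per_gpu(token_routes, placement):
    """Count the distinct experts activated on each GPU for one batch.

    token_routes: list of lists; the expert ids selected for each token.
    placement: dict mapping expert id -> GPU id (hypothetical static placement).
    """
    active = defaultdict(set)
    for experts in token_routes:
        for expert in experts:
            active[placement[expert]].add(expert)
    gpus = sorted(set(placement.values()))
    # GPUs with no activated expert are reported with a count of 0.
    return {gpu: len(active.get(gpu, ())) for gpu in gpus}


def bottleneck(counts):
    """Assumed latency proxy: the largest activated-expert count on any GPU."""
    return max(counts.values())


# Toy batch: 4 tokens, top-2 routing over 8 experts (illustrative values).
token_routes = [[0, 1], [1, 2], [0, 3], [2, 3]]

# Placement A: experts 0-3 on GPU 0, experts 4-7 on GPU 1.
contiguous = {e: e // 4 for e in range(8)}
# Placement B: even experts on GPU 0, odd experts on GPU 1.
interleaved = {e: e % 2 for e in range(8)}

counts_a = activated_experts_per_gpu(token_routes, contiguous)
counts_b = activated_experts_per_gpu(token_routes, interleaved)
print(counts_a, bottleneck(counts_a))
print(counts_b, bottleneck(counts_b))
```

With this toy batch, placement A puts all four activated experts on one GPU, while placement B splits them two and two, which halves the assumed bottleneck. A real system would have to make this kind of decision per batch and under a tight time budget.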

Abstract

Serving large Mixture-of-Experts (MoE) models is challenging because of their large memory footprints, heterogeneous resource demands, and highly dynamic inference workloads. Most existing MoE inference systems deploy the entire model as a monolithic unit, forcing attention and MoE layers to share the same resource configuration despite their different scaling behaviors and resource bottlenecks. Such coarse-grained provisioning leads to resource inefficiency and suboptimal performance. We present JANUS, a scalable and resource-efficient MoE inference system built around three key principles. First, JANUS disaggregates attention and MoE layers onto separate GPU worker pools, enabling independent resource provisioning for the two layer types, and uses an adaptive two-phase communication mechanism for low-latency data exchange. Second, because MoE-layer execution is often memory-bound and…
