Huawei Cloud Model-as-a-Service on the CloudMatrix384 SuperPod

Ao Xiao; Bangzheng He; Baoquan Zhang; Baoxing Huai; Bingji Wang; Bo Wang; Bo Xu; Boyi Hou; Chan Yang; Changhong Liu; Cheng Cui; Chenyu Zhu; Cong Feng; Daohui Wang; Dayun Lin; Duo Zhao; Fengshao Zou; Fu Wang; Gangqiang Zhang; Gengyuan Dan; Guanjie Chen; Guodong Guan; Guodong Yang; Haifeng Li; Haipei Zhu; Haley Li; Hao Feng; Hao Huang; Hao Xu; Hengrui Ma; Hengtao Fan; Hui Liu; Jia Li; Jiang Liu; Jiang Xu; Jie Meng; Jinhan Xin; Junhao Hu; Juwei Chen; Lan Yu; Lanxin Miao; Liang Liu; Linan Jing; Lu Zhou; Meina Han; Mingkun Deng; Mingyu Deng; Naitian Deng; Nizhong Lin; Peihan Zhao; Peng Pan; Pengfei Shen; Ping Li; Qi Zhang; Qian Wang; Qin Zh…; Qingrong Xia; Qingyi Zhang; Qunchao Fu; Ren Guo; Ruimin Gao; Shaochun Li; Sheng Long; Shentian Li; Shining Wan; Shuai Shen; Shuangfu Zeng; Shuming Jing; Siqi Yang; Song Zhang; Tao Xu; Tianlin Du; Ting Chen; Wanxu Wu; Wei Jiang; Weinan Tong; Weiwei Chen; Wen Peng; Wenli Zhou; Wenquan Yang; Wenxin Liang; Xiang Liu; Xiaoli Zhou; Xin Jin; Xinyu Duan; Xu Li; Xu Zhang; Xusheng Chen; Yalong Shan; Yang Gan; Yao Lu; Yi Deng; Yi Zheng; Ying Xiong; Yingfei Zheng; Yiyun Zheng; Yizhou Shan; Yong Gao; Yong Zhang; Yongqiang Yang; Yuanjin Gong; Yue Yu; Yuetao Chen; Yukun Zhu; Yulong He; Yusu Zhao; Yuyan Wu; Zenan Zhang; Zhaojin Zhuo; Zhaoyang Ji; Zhefeng Wang; Zheng Wang; Zhenan Fan; Zhenhua Yang; Zhenli Sheng; Zhibin Yu; Zhigang Ji; Zhihao Ren; Zhipeng Bian; Zhixia Liu; Zhiyu Dong; Zhonghua Li; Zhou Yu; Zhuoming Shen; Zhuwei Peng; Zi Ye; Zihao Xiang; Zimin Fu; Zixuan Zhang

arXiv:2508.02520 · cs.DC · March 3, 2026

Open Access

TL;DR

This paper introduces xDeepServe, a novel production serving system for large-scale MoE LLMs on Huawei's CloudMatrix384 SuperPod, emphasizing disaggregation, low latency, and high throughput.

Contribution

It presents a disaggregated execution architecture and communication layer enabling efficient serving of MoE LLMs at scale on SuperPods.

Findings

01. Achieves 2400 tokens/sec per chip in peak decoding

02. Supports diverse models including DeepSeek, Kimi, GLM, Qwen, and MiniMax

03. Reduces time-per-output-token to ~50 ms
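
To put findings 01 and 03 side by side, here is a back-of-envelope sketch in Python. It rests on two assumptions that the listing does not confirm: that the peak per-chip throughput and the ~50 ms time-per-output-token are observed at the same operating point, and that all 384 chips named in the abstract decode at the peak rate at once. Read the results as rough upper bounds, not reported measurements.

```python
# Back-of-envelope arithmetic on the listed findings (illustrative only).
CHIPS = 384                          # chip count stated in the abstract
SERVERS = 48                         # server count stated in the abstract
PEAK_TOKENS_PER_SEC_PER_CHIP = 2400  # finding 01
TPOT_MS = 50                         # finding 03 (approximate)

# 384 chips spread evenly over 48 servers.
chips_per_server = CHIPS // SERVERS

# One stream that emits a token every 50 ms produces 1000 / 50 tokens per second.
tokens_per_sec_per_stream = 1000 / TPOT_MS

# ASSUMPTION: peak throughput and the 50 ms TPOT hold simultaneously.
implied_streams_per_chip = PEAK_TOKENS_PER_SEC_PER_CHIP / tokens_per_sec_per_stream

# ASSUMPTION: every chip decodes at peak; a disaggregated deployment would
# dedicate some chips to prefill, so the real figure would be lower.
pod_peak_tokens_per_sec = CHIPS * PEAK_TOKENS_PER_SEC_PER_CHIP

print(chips_per_server, tokens_per_sec_per_stream,
      implied_streams_per_chip, pod_peak_tokens_per_sec)
```

Under these assumptions, each chip would be sustaining on the order of a hundred concurrent decode streams, which is why batching across many requests matters for reaching the per-chip throughput figure.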

Abstract

Scaled-out MoE LLMs and scaled-up SuperPods create new systems challenges for production Model-as-a-Service (MaaS), requiring disaggregation, low-latency communication, and decentralized serving. This report presents xDeepServe, the production serving system behind Huawei Cloud's MaaS offering on CloudMatrix384, a 48-server SuperPod with 384 Ascend 910C chips connected by a high-bandwidth UB fabric and global shared memory. It serves models including DeepSeek, Kimi, GLM, Qwen, and MiniMax, among others. xDeepServe is built around Transformerless, a disaggregated execution architecture that decomposes transformer inference into modular units -- attention, feedforward, and MoE -- and supports disaggregated Prefill-Decode and MoE-Attention deployments. To enable disaggregation, we develop XCCL, a memory-semantic communication layer providing microsecond-level point-to-point and scalable…
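
The idea of disaggregated Prefill-Decode serving mentioned in the abstract can be illustrated with a toy Python sketch. Everything here is hypothetical: the `Request` type, the `prefill_worker` and `decode_worker` functions, the in-process `Queue` standing in for a communication layer, and the arithmetic "model" are illustration-only names and do not reflect xDeepServe, Transformerless, or XCCL internals. The sketch shows only the structural point that prompt processing and token generation can run as separate stages that hand off per-request state.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from queue import Queue


@dataclass
class Request:
    """Hypothetical per-request state handed from prefill to decode."""
    request_id: int
    prompt: list[int]
    kv_cache: list[int] = field(default_factory=list)
    output: list[int] = field(default_factory=list)


def prefill_worker(req: Request) -> Request:
    # Toy stand-in for prefill: the "KV cache" is just a copy of the prompt.
    req.kv_cache = list(req.prompt)
    return req


def decode_worker(req: Request, max_new_tokens: int) -> Request:
    # Toy stand-in for decoding: each new token depends on the cache so far.
    for _ in range(max_new_tokens):
        next_token = sum(req.kv_cache) % 100
        req.output.append(next_token)
        req.kv_cache.append(next_token)
    return req


def serve(prompts: list[list[int]], max_new_tokens: int = 3) -> dict[int, list[int]]:
    # The queue is a placeholder for whatever transfers state between stages.
    handoff: Queue[Request] = Queue()
    for request_id, prompt in enumerate(prompts):
        handoff.put(prefill_worker(Request(request_id, prompt)))

    results: dict[int, list[int]] = {}
    while not handoff.empty():
        finished = decode_worker(handoff.get(), max_new_tokens)
        results[finished.request_id] = finished.output
    return results


print(serve([[1, 2, 3], [50, 60]], max_new_tokens=2))
```

In a real deployment the two stages would run on different devices and the handoff would move the KV cache over the interconnect, which is where a low-latency communication layer becomes the critical component.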

Taxonomy

Topics: Parallel Computing and Optimization Techniques · Big Data and Digital Economy · Cloud Computing and Resource Management