Relay Buffer Independent Communication over Pooled HBM for Efficient MoE Inference on Ascend

Tianlun Hu; Tiancheng Hu; Shengsheng Litang; Sheng Wang; Xiaoming Bao; Yuxing Li; Wei Wang; Zhongzhe Hu; Lijun Li; Hongwei Sun; Jingbin Zhou

arXiv:2605.06055·cs.DC·May 11, 2026

TL;DR

This paper introduces a relay-buffer-free communication method for MoE inference on Ascend systems. It reduces dispatch and combine latency by placing tokens directly into destination expert windows and removing most intermediate relay and reordering buffers.

Contribution

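The paper proposes a communication design for MoE inference that replaces explicit relay buffers with direct placement into destination expert windows and direct reading from remote expert windows, built on globally pooled high-bandwidth memory (HBM) and symmetric-memory allocation.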

Findings

01 Reduced dispatch and combine latency in MoE workloads.

02 Improved time to first token (TTFT) and maintained competitive time per output token (TPOT).

03 Expanded feasible scheduling space under latency constraints.

Abstract

Mixture-of-Experts (MoE) inference requires large-scale token exchange across devices, making dispatch and combine major bottlenecks in both prefill and decode. Beyond network transfer, routing-driven layout transformation, temporary relay, and output restoration can add substantial overhead. Existing MoE communication paths are often buffer-centric, using explicit inter-process relay and reordering buffers around collective transfer. This report presents a relay-buffer-free communication design for MoE inference acceleration on Ascend systems. The design reorganizes dispatch and combine around direct placement into destination expert windows and direct reading from remote expert windows. Built on globally pooled high-bandwidth memory and symmetric-memory allocation, it removes most intermediate relay and reordering buffers while retaining only lightweight control state, including…
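The direct-placement idea in the abstract can be illustrated with a small single-process simulation. This is only a sketch under stated assumptions, not the paper's implementation and not an Ascend API: plain Python lists stand in for symmetric-memory expert windows, routing is top-1, and the names `dispatch_direct`, `run_experts`, `combine_direct`, `fill`, and `placements` are hypothetical. The per-window fill counter and the recorded slot indices are an assumed form of control state; the abstract does not specify what the real design keeps.

```python
def dispatch_direct(tokens_per_rank, routes_per_rank, num_experts, capacity):
    """Write every token straight into its destination expert's window."""
    # One preallocated list per expert stands in for a symmetric-memory window.
    windows = [[None] * capacity for _ in range(num_experts)]
    fill = [0] * num_experts  # next free slot per window (hypothetical control state)
    placements = []           # per rank: (expert, slot) for each token, in token order
    for tokens, routes in zip(tokens_per_rank, routes_per_rank):
        rank_placements = []
        for token, expert in zip(tokens, routes):
            slot = fill[expert]
            if slot >= capacity:
                raise ValueError(f"expert {expert} window is full")
            windows[expert][slot] = token  # direct placement, no staging copy
            fill[expert] = slot + 1
            rank_placements.append((expert, slot))
        placements.append(rank_placements)
    return windows, fill, placements


def run_experts(windows, fill, expert_fn):
    """Apply a stand-in expert function in place on the occupied slots."""
    for expert, window in enumerate(windows):
        for slot in range(fill[expert]):
            window[slot] = expert_fn(expert, window[slot])


def combine_direct(windows, placements):
    """Each rank reads its results directly from the expert windows."""
    return [[windows[e][s] for e, s in rank] for rank in placements]


tokens_per_rank = [[1.0, 2.0, 3.0], [4.0, 5.0]]
routes_per_rank = [[1, 0, 1], [0, 1]]  # top-1 expert id per token

windows, fill, placements = dispatch_direct(tokens_per_rank, routes_per_rank, 2, 4)
# Toy experts: expert 0 scales by 10, expert 1 scales by 100.
run_experts(windows, fill, lambda expert, x: x * 10 ** (expert + 1))
outputs = combine_direct(windows, placements)
```

In this toy version the only data that moves besides the tokens themselves is the small `(expert, slot)` record per token, which is what lets combine restore the original token order without a separate reordering buffer. A buffer-centric path would instead pack tokens into a send buffer, exchange them, and unpack them on the receiving side.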

