Relay Buffer Independent Communication over Pooled HBM for Efficient MoE Inference on Ascend
Tianlun Hu, Tiancheng Hu, Shengsheng Litang, Sheng Wang, Xiaoming Bao, Yuxing Li, Wei Wang, Zhongzhe Hu, Lijun Li, Hongwei Sun, Jingbin Zhou

TL;DR
This paper introduces a relay-buffer-free communication method for MoE inference on Ascend systems. It reduces dispatch and combine latency by placing tokens directly into destination expert windows and removing most intermediate relay and reordering buffers.
Contribution
It proposes a novel communication design that removes relay buffers in MoE inference, enabling faster and more efficient execution on globally pooled high-bandwidth memory systems.
Findings
Reduced dispatch and combine latency in MoE workloads.
Improved time to first token (TTFT) and maintained competitive time per output token (TPOT).
Expanded feasible scheduling space under latency constraints.
Abstract
Mixture-of-Experts (MoE) inference requires large-scale token exchange across devices, making dispatch and combine major bottlenecks in both prefill and decode. Beyond network transfer, routing-driven layout transformation, temporary relay, and output restoration can add substantial overhead. Existing MoE communication paths are often buffer-centric, using explicit inter-process relay and reordering buffers around collective transfer. This report presents a relay-buffer-free communication design for MoE inference acceleration on Ascend systems. The design reorganizes dispatch and combine around direct placement into destination expert windows and direct reading from remote expert windows. Built on globally pooled high-bandwidth memory and symmetric-memory allocation, it removes most intermediate relay and reordering buffers while retaining only lightweight control state, including…
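The direct-placement idea in the abstract can be illustrated with a small single-process simulation. This is only a sketch under stated assumptions, not the authors' implementation: plain Python lists stand in for expert windows in pooled HBM, routing is top-1, and the function names (`dispatch_direct`, `combine_direct`) and the per-rank count/offset bookkeeping are hypothetical choices made for illustration.

```python
from typing import List, Tuple

Slot = Tuple[int, int]  # (expert index, position inside that expert's window)


def dispatch_direct(
    tokens_per_rank: List[List[float]],
    routes_per_rank: List[List[int]],
    num_experts: int,
) -> Tuple[List[List[float]], List[List[Slot]]]:
    """Write each token straight into its destination expert window.

    Assumption: top-1 routing, so routes_per_rank[r][i] is a single expert id.
    """
    num_ranks = len(tokens_per_rank)

    # Assumed control state: counts[e][r] = tokens that rank r sends to expert e.
    counts = [[0] * num_ranks for _ in range(num_experts)]
    for r, routes in enumerate(routes_per_rank):
        for e in routes:
            counts[e][r] += 1

    # An exclusive prefix sum over ranks gives every rank a disjoint region
    # in each window, so writers never collide and no staging copy is needed.
    offsets = [[0] * num_ranks for _ in range(num_experts)]
    for e in range(num_experts):
        running = 0
        for r in range(num_ranks):
            offsets[e][r] = running
            running += counts[e][r]

    windows = [[0.0] * sum(counts[e]) for e in range(num_experts)]
    slots: List[List[Slot]] = []
    for r, (tokens, routes) in enumerate(zip(tokens_per_rank, routes_per_rank)):
        cursor = [offsets[e][r] for e in range(num_experts)]
        rank_slots: List[Slot] = []
        for tok, e in zip(tokens, routes):
            windows[e][cursor[e]] = tok          # direct placement
            rank_slots.append((e, cursor[e]))    # remember where it landed
            cursor[e] += 1
        slots.append(rank_slots)
    return windows, slots


def combine_direct(
    windows: List[List[float]], slots: List[List[Slot]]
) -> List[List[float]]:
    """Each rank reads its results back from the expert windows, in token order."""
    return [[windows[e][pos] for e, pos in rank_slots] for rank_slots in slots]


# Two ranks, two experts; expert e scales its inputs by 10 * (e + 1).
tokens = [[1.0, 2.0, 3.0], [4.0, 5.0]]
routes = [[0, 1, 0], [1, 0]]
windows, slots = dispatch_direct(tokens, routes, num_experts=2)
expert_out = [[x * 10 * (e + 1) for x in win] for e, win in enumerate(windows)]
restored = combine_direct(expert_out, slots)
```

In this toy setup, the recorded slots let the combine step restore the original token order by reading from the windows directly, which is the role a separate reordering buffer would otherwise play.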
