EPD-Serve: A Flexible Multimodal EPD Disaggregation Inference Serving System On Ascend
Fan Bai, Pai Peng, Zhengzhi Tang, Zhe Wang, Gong Chen, Xiang Lu, Yinuo Li, Huan Lin, Weizhe Lin, Yaoyuan Wang, Xiaosong Li

TL;DR
EPD-Serve is a novel disaggregated inference system for multimodal models that improves resource utilization and throughput by decoupling pipeline stages and optimizing cross-node communication, especially under high concurrency.
Contribution
The paper introduces EPD-Serve, a stage-level disaggregated inference framework that enhances efficiency and scalability for multimodal large models on heterogeneous hardware.
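To make the idea of stage-level disaggregation concrete, here is a minimal sketch in which the three stages run as independent workers connected only by queues. It is an illustration under stated assumptions, not EPD-Serve's implementation: the function names (encode, prefill, decode, run_stage, serve), the request fields, and the use of Python asyncio queues in place of cross-device transfers are all hypothetical.

```python
import asyncio

STOP = None  # sentinel that tells a stage to shut down (hypothetical convention)

def encode(req):
    # Stand-in for a multimodal encoder: derive a fake "feature" from the image bytes.
    return {**req, "features": len(req["image"])}

def prefill(req):
    # Stand-in for prompt processing: build a fake KV cache, one entry per prompt word.
    return {**req, "kv_cache": [req["features"]] * len(req["prompt"].split())}

def decode(req):
    # Stand-in for token generation: report a fake output-token count.
    return {"id": req["id"], "tokens": len(req["kv_cache"]) + 1}

async def run_stage(work, inbox, outbox):
    # Each stage only sees its own inbox and outbox, so stages can be
    # placed and scaled independently of one another.
    while True:
        item = await inbox.get()
        if item is STOP:
            await outbox.put(STOP)  # propagate shutdown downstream
            return
        await outbox.put(work(item))

async def serve(requests):
    q_enc, q_pre, q_dec, q_out = (asyncio.Queue() for _ in range(4))
    workers = [
        asyncio.create_task(run_stage(encode, q_enc, q_pre)),
        asyncio.create_task(run_stage(prefill, q_pre, q_dec)),
        asyncio.create_task(run_stage(decode, q_dec, q_out)),
    ]
    for req in requests:
        await q_enc.put(req)
    await q_enc.put(STOP)

    results = []
    while (item := await q_out.get()) is not STOP:
        results.append(item)
    await asyncio.gather(*workers)
    return results

demo_requests = [
    {"id": 0, "image": b"abcd", "prompt": "describe this image"},
    {"id": 1, "image": b"xy", "prompt": "hi"},
]
demo_results = asyncio.run(serve(demo_requests))
print(demo_results)
```

Because each stage communicates only through its queues, one stage can be replicated or moved to different hardware without changing the others, which is the property a disaggregated design relies on.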
Findings
Increases end-to-end throughput by up to 69.48% under high concurrency.
Meets strict service-level objectives (SLOs): time to first token (TTFT) below 2000 ms and time per output token (TPOT) below 50 ms.
Effectively utilizes heterogeneous hardware through dynamic orchestration and hierarchical communication mechanisms.
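The latency thresholds in the findings above can be checked per request with a few lines of arithmetic. The sketch below assumes a common convention that this summary does not spell out: TTFT (time to first token) is the delay from request arrival to the first output token, and TPOT (time per output token) is the mean gap between consecutive output tokens after the first. The RequestTrace type and the function names are hypothetical.

```python
from dataclasses import dataclass
from typing import List

TTFT_SLO_MS = 2000.0  # threshold quoted in the findings
TPOT_SLO_MS = 50.0    # threshold quoted in the findings

@dataclass
class RequestTrace:
    arrival_ms: float            # when the request arrived
    token_times_ms: List[float]  # emission time of each output token

def ttft_ms(trace: RequestTrace) -> float:
    # Time from arrival to the first output token.
    return trace.token_times_ms[0] - trace.arrival_ms

def tpot_ms(trace: RequestTrace) -> float:
    # Mean gap between consecutive output tokens after the first (assumed definition).
    n = len(trace.token_times_ms)
    if n < 2:
        return 0.0
    return (trace.token_times_ms[-1] - trace.token_times_ms[0]) / (n - 1)

def meets_slo(trace: RequestTrace) -> bool:
    return ttft_ms(trace) < TTFT_SLO_MS and tpot_ms(trace) < TPOT_SLO_MS

fast = RequestTrace(arrival_ms=0.0, token_times_ms=[1500.0, 1540.0, 1580.0, 1620.0])
slow = RequestTrace(arrival_ms=0.0, token_times_ms=[2500.0, 2520.0])
print(ttft_ms(fast), tpot_ms(fast), meets_slo(fast))
print(ttft_ms(slow), tpot_ms(slow), meets_slo(slow))
```

In this example the first trace passes both thresholds, while the second fails on TTFT alone even though its token gap is short.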
Abstract
With the widespread adoption of large multimodal models, efficient inference across text, image, audio, and video modalities has become critical. However, existing multimodal inference systems typically employ monolithic architectures that tightly couple the Encode, Prefill, and Decode stages on homogeneous hardware, neglecting the heterogeneous computational characteristics of each stage. This design leads to inefficient resource utilization and limited system throughput. To address these issues, we propose EPD-Serve, a stage-level disaggregated inference serving system for multimodal models. EPD-Serve decouples the inference pipeline into independent Encode, Prefill, and Decode stages, enabling logical isolation and flexible co-located deployment through dynamic orchestration. Leveraging the Ascend interconnect topology, EPD-Serve introduces asynchronous feature prefetching between…
Taxonomy
Topics: Parallel Computing and Optimization Techniques · Embedded Systems Design Techniques · Advanced Neural Network Applications
