MultiPath Memory Access: Breaking Host-GPU Bandwidth Bottlenecks in LLM Services
Lingfeng Tang, Daoping Zhang, Junjie Chen, Peihao Huang, Feng Jin, Chengguang Xu, Yuxin Chen, Feiqiang Sun, Guo Chen

TL;DR
MMA is a novel software system that enables efficient multipath host-GPU data transfer in multi-GPU servers, significantly improving bandwidth and reducing latency for large language model serving.
Contribution
It introduces what the authors describe as the first software-defined multipath host-GPU data transfer system, which leverages intra-server links without hardware, driver, or application changes.
Findings
Achieves 245 GB/s peak bandwidth, 4.62x faster than native CUDA copies.
Reduces KV cache fetch TTFT by up to 2.38x.
Lowers model wake-up latency by up to 2.48x.
Abstract
Host-GPU data movement has become a latency-critical bottleneck in LLM serving, surfacing in common paths such as model-weight movement and KV cache offload/fetch. Today, each host-GPU copy is effectively confined to the PCIe path of the target GPU, even though modern multi-GPU servers contain additional PCIe links on peer GPUs and high-bandwidth GPU interconnects. This leaves substantial intra-server I/O capacity unused. To address this issue, we present Multipath Memory Access (MMA), a software-defined multipath memory access system for host-GPU data transfer. To the best of our knowledge, MMA is the first software-defined system to enable efficient multipath host-GPU data transfer within a single multi-GPU server. MMA expands a single host-GPU copy across available direct and relay paths without hardware, driver, or application changes. It preserves CUDA stream semantics with a…
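To make the idea of expanding one copy across direct and relay paths concrete, here is a minimal sketch of one possible splitting policy: dividing a single transfer into contiguous byte ranges proportional to each path's bandwidth. This is an illustration only, not MMA's actual algorithm, which the text above does not specify. The names `Path` and `split_copy`, the 4096-byte alignment, and the bandwidth figures are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Path:
    """A hypothetical transfer path (direct PCIe link or relay via a peer GPU)."""
    name: str
    gb_per_s: float  # assumed sustained bandwidth; illustrative, not measured


def split_copy(total_bytes, paths, align=4096):
    """Split one copy into contiguous (name, offset, size) ranges.

    Each path receives a share proportional to its assumed bandwidth,
    rounded down to a multiple of `align`. The last path takes the
    remainder so that the ranges cover the buffer exactly.
    """
    if total_bytes < 0:
        raise ValueError("total_bytes must be non-negative")
    if not paths:
        raise ValueError("at least one path is required")

    total_bw = sum(p.gb_per_s for p in paths)
    ranges = []
    offset = 0
    for i, p in enumerate(paths):
        if i == len(paths) - 1:
            size = total_bytes - offset  # remainder goes to the last path
        else:
            share = int(total_bytes * p.gb_per_s / total_bw)
            size = share // align * align
        ranges.append((p.name, offset, size))
        offset += size
    return ranges


if __name__ == "__main__":
    # Hypothetical topology: one direct path plus two relay paths.
    example_paths = [
        Path("direct", 50.0),
        Path("relay_a", 25.0),
        Path("relay_b", 25.0),
    ]
    for name, off, size in split_copy(1 << 20, example_paths):
        print(f"{name}: offset={off} size={size}")
```

A static proportional split is the simplest choice; a real system would likely also need to account for contention on shared links and for completion ordering, which this sketch ignores.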
