Analyzing Reverse Address Translation Overheads in Multi-GPU Scale-Up Pods
Amel Fatima, Tuan Ta, Bradford M. Beckmann

TL;DR
This paper systematically studies the performance impact of Reverse Address Translation in large-scale multi-GPU clusters, revealing bottlenecks and proposing optimization techniques to improve collective communication efficiency.
Contribution
It presents the first detailed analysis of Reverse Address Translation overheads and introduces optimization strategies like fused kernels and TLB prefetching for scalable GPU workloads.
Findings
Cold TLB misses increase latency by up to 1.4x in small collectives.
Warmed caches reduce translation overheads for larger collectives.
The proposed techniques, such as fused kernels and TLB prefetching, can hide translation latency, improving throughput and scalability.
Abstract
Distributed ML workloads rely heavily on collective communication across multi-GPU, multi-node systems. Emerging scale-up fabrics, such as NVLink and UALink, enable direct memory access across nodes but introduce a critical destination-side translation step: translating Network Physical Addresses (NPAs) to System Physical Addresses (SPAs), which we term Reverse Address Translation. Despite its importance, the performance impact of Reverse Address Translation remains poorly understood. In this work, we present the first systematic study of Reverse Address Translation in large-scale GPU clusters. Using an extended ASTRA-sim framework with OMNeT++ as the network backend, we model Link MMUs and Link TLBs and evaluate their effect on All-to-All collective communication across varying input sizes and GPU counts. Our analysis shows that cold TLB misses dominate latency…
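To make the destination-side translation step concrete, here is a minimal toy sketch of a Link TLB that caches NPA-to-SPA page mappings. It is not the paper's ASTRA-sim model: the class name `LinkTLB`, the helper `receive`, the LRU replacement policy, the page size, and every latency constant are assumptions chosen only for illustration. It shows why a single cold miss weighs heavily on a small transfer and is amortized over a larger one.

```python
from collections import OrderedDict

# All constants below are hypothetical, picked only to make the effect visible.
PAGE_SIZE = 2 * 1024 * 1024  # assumed translation granularity (2 MiB)
HIT_NS = 5                   # assumed Link TLB hit latency
MISS_NS = 500                # assumed page-walk latency on a miss


class LinkTLB:
    """Toy LRU cache mapping NPA page numbers to SPA page numbers."""

    def __init__(self, entries):
        self.entries = entries
        self.cache = OrderedDict()
        self.hits = 0
        self.misses = 0

    def translate(self, npa):
        """Return (spa, latency_ns) for one Network Physical Address."""
        page, offset = divmod(npa, PAGE_SIZE)
        if page in self.cache:
            self.cache.move_to_end(page)  # refresh LRU position
            self.hits += 1
            latency = HIT_NS
        else:
            self.misses += 1
            latency = MISS_NS
            if len(self.cache) >= self.entries:
                self.cache.popitem(last=False)  # evict least recently used
            # A fixed page offset stands in for a real page-table walk.
            self.cache[page] = page + 0x1000
        spa = self.cache[page] * PAGE_SIZE + offset
        return spa, latency


def receive(tlb, base_npa, num_bytes, chunk=256):
    """Sum translation latency for an incoming buffer, one lookup per chunk."""
    total_ns = 0
    for off in range(0, num_bytes, chunk):
        _, latency = tlb.translate(base_npa + off)
        total_ns += latency
    return total_ns


small_tlb = LinkTLB(entries=64)
cold_small = receive(small_tlb, 0, 4096)   # first touch: one miss, then hits
warm_small = receive(small_tlb, 0, 4096)   # same page again: hits only

large_tlb = LinkTLB(entries=64)
cold_large = receive(large_tlb, 0, 1 << 20)  # 1 MiB, still one cold miss

print("small cold/warm (ns):", cold_small, warm_small)
print("avg ns per lookup, small vs large:",
      cold_small / 16, cold_large / 4096)
```

Under these assumed numbers, the cold 4 KiB transfer spends most of its translation time on the single miss, while the 1 MiB transfer spreads the same miss over thousands of hits, so its average lookup cost approaches the hit latency.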
