Optimizing Attention on GPUs by Exploiting GPU Architectural NUMA Effects
Mansi Choudhary, Karthik Sangaiah, Sonali Singh, Muhammad Osama, Lisa Wu Wills, Ganesh Dasika

TL;DR
This paper introduces a NUMA-aware scheduling strategy for large-scale attention workloads on disaggregated (multi-chiplet) GPUs. By aligning attention heads with GPU NUMA domains to exploit intra-chiplet cache reuse, it improves performance by up to 50% on AMD's MI300X.
Contribution
It presents Swizzled Head-first Mapping, a novel spatially-aware scheduling method that optimizes attention workloads on multi-chiplet GPU architectures, addressing NUMA-induced locality issues.
Findings
Achieves up to 50% higher performance on AMD MI300X than state-of-the-art attention algorithms that use conventional scheduling.
Maintains high L2 cache hit rates of 80-97%.
Demonstrates the importance of NUMA-aware scheduling for scalable AI workloads.
Abstract
The rise of disaggregated AI GPUs has exposed a critical bottleneck in large-scale attention workloads: non-uniform memory access (NUMA). As multi-chiplet designs become the norm for scaling compute capabilities, memory latency and bandwidth vary sharply across compute regions, undermining the performance of traditional GPU kernel scheduling strategies that assume uniform memory access. We identify how these NUMA effects distort locality in multi-head attention (MHA) and present Swizzled Head-first Mapping, a spatially-aware scheduling strategy that aligns attention heads with GPU NUMA domains to exploit intra-chiplet cache reuse. On AMD's MI300X architecture, our method achieves up to 50% higher performance over state-of-the-art attention algorithms using conventional scheduling techniques and sustains consistently high L2 cache hit rates of 80-97%. These results demonstrate that…
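The idea of aligning attention heads with NUMA domains can be illustrated with a small index-remapping sketch. This is not the authors' implementation: the abstract does not give the exact mapping. The sketch assumes 8 chiplets (XCDs), each with a private L2 cache, and assumes the hardware hands out workgroup slots round-robin across them. All function names are hypothetical. It compares a plain head-first ordering, which scatters each head's blocks across every chiplet, with a swizzled ordering that keeps all blocks of one head on one chiplet.

```python
NUM_XCDS = 8  # assumption: 8 chiplets, each with its own L2 cache


def xcd_of(slot: int, num_xcds: int = NUM_XCDS) -> int:
    """Assumed dispatch policy: workgroup slots go round-robin across XCDs."""
    return slot % num_xcds


def naive_head_first(slot: int, blocks_per_head: int) -> tuple[int, int]:
    """Plain head-first order: consecutive slots walk the blocks of one head."""
    return divmod(slot, blocks_per_head)  # (head, block)


def swizzled_head_first(slot: int, num_heads: int, blocks_per_head: int,
                        num_xcds: int = NUM_XCDS) -> tuple[int, int]:
    """Hypothetical swizzle: every block of a given head lands on one XCD."""
    assert num_heads % num_xcds == 0, "sketch assumes heads divide evenly"
    xcd = slot % num_xcds        # chiplet this slot is dispatched to
    local = slot // num_xcds     # position in that chiplet's own queue
    heads_per_xcd = num_heads // num_xcds
    head_in_xcd, block = divmod(local, blocks_per_head)
    return xcd * heads_per_xcd + head_in_xcd, block


def xcds_touched_per_head(mapping, num_heads: int, blocks_per_head: int):
    """For each head, collect the set of XCDs that process its blocks."""
    touched = {h: set() for h in range(num_heads)}
    for slot in range(num_heads * blocks_per_head):
        head, _ = mapping(slot)
        touched[head].add(xcd_of(slot))
    return touched


if __name__ == "__main__":
    H, B = 16, 32  # illustrative sizes, not taken from the paper
    naive = xcds_touched_per_head(lambda s: naive_head_first(s, B), H, B)
    swz = xcds_touched_per_head(lambda s: swizzled_head_first(s, H, B), H, B)
    print("naive: max XCDs per head =", max(len(v) for v in naive.values()))
    print("swizzled: max XCDs per head =", max(len(v) for v in swz.values()))
```

Under these assumptions, the key and value tiles of a head are fetched into a single L2 rather than being duplicated across all of them, which is the kind of intra-chiplet reuse the abstract describes.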
Taxonomy
Topics: Parallel Computing and Optimization Techniques · Advanced Neural Network Applications · Big Data and Digital Economy
