Enabling Large-Reach TLBs for High-Throughput Processors by Exploiting Memory Subregion Contiguity
Chao Yu, Yuebin Bai, Rui Wang

TL;DR
This paper introduces MESC, a method that exploits memory subregion contiguity to significantly improve TLB efficiency and GPU performance without altering memory allocation policies.
Contribution
MESC leverages OS-identified memory contiguity to coalesce multiple page translations into single TLB entries, enhancing translation efficiency for GPUs.
Findings
Achieves 77.2% performance improvement on translation-sensitive workloads.
Reduces dynamic translation energy by 76.4%.
Effectively coalesces up to 512 pages into one TLB entry.
Abstract
Accelerators such as GPUs are increasingly relied on to meet future performance demands, and sharing a single virtual memory space between CPUs and GPUs is increasingly adopted to simplify programming. However, address translation, a key component of virtual memory, is becoming a performance bottleneck for GPUs. Because of the SIMT execution model, a single TLB miss in a GPU can stall hundreds of threads, degrading performance dramatically. Through real-system analysis, we observe that memory allocated by the OS exhibits advanced contiguity (e.g., hundreds of contiguous pages), and that more large memory regions with advanced contiguity tend to be allocated as working sets grow. Leveraging this observation, we propose MESC to improve translation efficiency for GPUs. The key idea of MESC is to divide each large page frame (2MB in size) in the virtual memory space into memory subregions of fixed size…
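To make the notion of subregion contiguity concrete, the following is a minimal Python sketch and not the paper's hardware mechanism. It assumes 4 KB base pages (so a 2MB frame holds 512 pages, matching the 512-page figure in the findings) and models the page table as a plain dictionary from virtual page number to physical frame number. The abstract is truncated before it gives the subregion size, so `subregion_pages` is a hypothetical parameter, and the function name `contiguous_subregions` is illustrative only.

```python
from typing import Dict, List

BASE_PAGE_SIZE = 4 * 1024            # assumption: 4 KB base pages
LARGE_FRAME_SIZE = 2 * 1024 * 1024   # 2MB large page frame, as in the abstract
PAGES_PER_FRAME = LARGE_FRAME_SIZE // BASE_PAGE_SIZE  # 512 base pages


def contiguous_subregions(page_table: Dict[int, int],
                          frame_base_vpn: int,
                          subregion_pages: int) -> List[bool]:
    """For each fixed-size subregion of one 2MB virtual frame, report whether
    its virtual pages map to consecutive physical frame numbers.

    subregion_pages is a hypothetical parameter; the source excerpt does not
    state the subregion size.
    """
    if subregion_pages <= 0 or PAGES_PER_FRAME % subregion_pages != 0:
        raise ValueError("subregion_pages must evenly divide the 512-page frame")

    flags = []
    frame_end = frame_base_vpn + PAGES_PER_FRAME
    for start in range(frame_base_vpn, frame_end, subregion_pages):
        first_pfn = page_table.get(start)
        # A subregion is contiguous if page i maps to first_pfn + i for all i.
        ok = first_pfn is not None and all(
            page_table.get(start + i) == first_pfn + i
            for i in range(subregion_pages)
        )
        flags.append(ok)
    return flags


# Toy page table: one fully contiguous 2MB frame, except that virtual page 70
# has been mapped to an unrelated physical frame.
page_table = {vpn: 1000 + vpn for vpn in range(PAGES_PER_FRAME)}
page_table[70] = 9999

flags = contiguous_subregions(page_table, frame_base_vpn=0, subregion_pages=64)
print(flags)
```

With 64-page subregions the frame splits into eight subregions, and only the one containing virtual page 70 is reported as non-contiguous. A conventional 2MB large page could not be used for this frame at all, whereas per-subregion tracking still leaves seven candidates for coalescing, which is the intuition behind exploiting contiguity at subregion granularity.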
Taxonomy
Topics: Parallel Computing and Optimization Techniques · Advanced Data Storage Technologies · Interconnection Networks and Systems
