A Limits Study of Memory-side Tiering Telemetry
Vinicius Petrucci, Felippe Zacarias, David Roberts

TL;DR
This paper investigates the limitations of current memory tiering strategies and demonstrates how programmable, device-level telemetry can significantly improve memory management and performance in future systems.
Contribution
It introduces a CXL-based memory request logger and a Hotness Monitoring Unit (HMU) that enhance memory tiering through precise access monitoring and data movement strategies.
Findings
Potential 1.94x speedup over Linux NUMA balancing tiering on a Deep Learning Recommendation Model (DLRM)
Over 90% of pages offloaded to CXL memory
Only a 3% slowdown compared to Host-DRAM allocation at that offload level
Abstract
Increasing workload demands and emerging technologies necessitate the use of various memory and storage tiers in computing systems. This paper presents results from a CXL-based Experimental Memory Request Logger that reveals precise memory access patterns at runtime without interfering with the running workloads. We use it for software emulation of future memory telemetry hardware. By combining reactive placement based on data address monitoring, proactive data movement, and compiler hints, a Hotness Monitoring Unit (HMU) within memory modules can greatly improve memory tiering solutions. Analysis of page placement using profiled access counts on a Deep Learning Recommendation Model (DLRM) indicates a potential 1.94x speedup over Linux NUMA balancing tiering, and only a 3% slowdown compared to Host-DRAM allocation while offloading over 90% of pages to CXL memory. The study underscores…
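As a rough illustration of the "page placement using profiled access counts" idea in the abstract, the sketch below aggregates an address trace into per-page counts, keeps the hottest pages in host DRAM, and sends the rest to CXL memory. It is not the authors' implementation: the 4 KiB page size, the function names (`count_page_accesses`, `place_pages`, `dram_hit_ratio`), the DRAM capacity parameter, and the synthetic trace are all assumptions made for this example.

```python
from collections import Counter

PAGE_SHIFT = 12  # assumption: 4 KiB pages


def count_page_accesses(addresses):
    """Aggregate a trace of addresses into per-page access counts."""
    return Counter(addr >> PAGE_SHIFT for addr in addresses)


def place_pages(page_counts, dram_capacity_pages):
    """Keep the hottest pages in host DRAM; place all other pages in CXL.

    Ties are broken by page number so the result is deterministic.
    """
    ranked = sorted(page_counts, key=lambda p: (-page_counts[p], p))
    dram = set(ranked[:dram_capacity_pages])
    cxl = set(ranked[dram_capacity_pages:])
    return dram, cxl


def dram_hit_ratio(page_counts, dram_pages):
    """Fraction of all profiled accesses that land on DRAM-resident pages."""
    total = sum(page_counts.values())
    hits = sum(page_counts[p] for p in dram_pages)
    return hits / total if total else 0.0


# Hypothetical, skewed trace: two hot pages and eighteen cold pages.
trace = (
    [0x1000 + 64 * (i % 64) for i in range(50)]    # page 1, 50 accesses
    + [0x2000 + 64 * (i % 64) for i in range(30)]  # page 2, 30 accesses
    + [(3 + i) << PAGE_SHIFT for i in range(18)]   # pages 3..20, 1 access each
)

counts = count_page_accesses(trace)
dram, cxl = place_pages(counts, dram_capacity_pages=2)
offloaded = len(cxl) / len(counts)
ratio = dram_hit_ratio(counts, dram)
print(f"offloaded to CXL: {offloaded:.0%}, accesses served by DRAM: {ratio:.1%}")
```

With a skewed access distribution like this synthetic one, most pages can sit in the slower tier while most accesses still hit DRAM, which is the intuition behind offloading a large share of pages at a small slowdown. The numbers printed here come from the toy trace only and say nothing about the paper's DLRM results.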
