MAO: Machine learning approach for NUMA optimization in Warehouse Scale Computers
Yueji Liu, Jun Jin, Wenhui Shu, Shiyong Li, Yongzhan He

TL;DR
This paper presents MAO, a machine learning-based system deployed at Baidu to optimize NUMA memory access, significantly improving workload performance and resource efficiency across large-scale servers.
Contribution
The paper introduces MAO, a novel production system combining online and offline modules with a new NUMA Sensitivity model for effective NUMA optimization in large-scale data centers.
Findings
Achieved 12.1% latency reduction in Baidu's Feed system.
Realized 9.8% CPU resource savings.
Successfully deployed MAO on over 100,000 servers.
Abstract
Non-Uniform Memory Access (NUMA) architecture poses numerous performance challenges for today's cloud workloads. Given the complexity and massive scale of modern warehouse-scale computers (WSCs), considerable effort is needed to improve memory access locality on NUMA architectures. At Baidu, we have found that NUMA optimization brings significant performance benefits to major workloads such as Search and Feed (Baidu's recommendation system). However, conducting NUMA optimization across a large-scale cluster involves many subtle complexities and workload-specific optimizations. In this paper, we present MAO (Memory Access Optimizer), a solution deployed in Baidu's production environment that helps improve memory access locality for Baidu's various workloads. MAO includes an online module and an offline module. The online module is responsible for the…
Taxonomy
Topics: Scheduling and Optimization Algorithms
