Optimizing Fine-Grained Parallelism Through Dynamic Load Balancing on Multi-Socket Many-Core Systems
Wenyi Wang, Maxime Gonthier, Poornima Nookala, Haochen Pan, Ian Foster, Ioan Raicu, Kyle Chard

TL;DR
This paper presents novel lock-free data structures and synchronization strategies that significantly improve fine-grained parallel task performance on multi-socket many-core systems, reducing overhead and enhancing scalability.
Contribution
It introduces three components: XQueue, a lock-less concurrent task queue; a scalable distributed tree barrier; and two lock-less, NUMA-aware load balancing strategies. Together they improve parallel runtime efficiency.
Findings
Performance improved by up to 1522.8× with new synchronization methods.
Lock-less load balancing enhances performance by up to 4×.
Significant reduction in synchronization overhead on many-core systems.
Abstract
Achieving efficient task parallelism on many-core architectures is an important challenge. The widely used GNU OpenMP implementation of the popular OpenMP parallel programming model incurs high overhead for fine-grained, short-running tasks due to time spent on runtime synchronization. In this work, we introduce and analyze three key advances that collectively achieve significant performance gains. First, we introduce XQueue, a lock-less concurrent queue implementation to replace GNU's priority task queue and remove the global task lock. Second, we develop a scalable, efficient, and hybrid lock-free/lock-less distributed tree barrier to address the high hardware synchronization overhead from GNU's centralized barrier. Third, we develop two lock-less and NUMA-aware load balancing strategies. We evaluate our implementation using Barcelona OpenMP Task Suite (BOTS) benchmarks. We show that…
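To make the idea of a queue that needs no lock concrete, here is a minimal sketch of a single-producer/single-consumer ring buffer. It is not XQueue and is not taken from the paper: the class name `SpscQueue`, its methods, and the fixed power-of-two capacity are all assumptions chosen for illustration. It shows only the general principle that when exactly one thread pushes and exactly one thread pops, plain atomic loads and stores with acquire/release ordering are enough, with no mutex and no compare-and-swap.

```cpp
#include <atomic>
#include <cstddef>

// Hypothetical illustration only: a bounded single-producer/single-consumer
// queue. This is NOT the paper's XQueue; names and layout are assumptions.
template <typename T, std::size_t Capacity>
class SpscQueue {
    static_assert(Capacity > 0 && (Capacity & (Capacity - 1)) == 0,
                  "Capacity must be a nonzero power of two");

public:
    // Call from the producer thread only. Returns false if the queue is full.
    bool try_push(const T& value) {
        const std::size_t tail = tail_.load(std::memory_order_relaxed);
        const std::size_t head = head_.load(std::memory_order_acquire);
        if (tail - head == Capacity) {
            return false;  // no free slot
        }
        slots_[tail & (Capacity - 1)] = value;
        // Release: the slot write becomes visible before the new tail does.
        tail_.store(tail + 1, std::memory_order_release);
        return true;
    }

    // Call from the consumer thread only. Returns false if the queue is empty.
    bool try_pop(T& out) {
        const std::size_t head = head_.load(std::memory_order_relaxed);
        const std::size_t tail = tail_.load(std::memory_order_acquire);
        if (head == tail) {
            return false;  // nothing to consume
        }
        out = slots_[head & (Capacity - 1)];
        // Release: the slot read completes before the slot is handed back.
        head_.store(head + 1, std::memory_order_release);
        return true;
    }

private:
    T slots_[Capacity];
    // Separate cache lines so producer and consumer do not false-share.
    alignas(64) std::atomic<std::size_t> head_{0};
    alignas(64) std::atomic<std::size_t> tail_{0};
};
```

In use, one worker thread would call `try_push` and another would call `try_pop`, each retrying or doing other work on a `false` return. A runtime with many workers would need more structure than this single queue provides, for example one queue per producer/consumer pair; whether and how the paper does this is not stated in the abstract above.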
Taxonomy
Topics: Distributed and Parallel Computing Systems · Parallel Computing and Optimization Techniques · Graph Theory and Algorithms
