Optimizing Fine-Grained Parallelism Through Dynamic Load Balancing on Multi-Socket Many-Core Systems

Wenyi Wang; Maxime Gonthier; Poornima Nookala; Haochen Pan; Ian Foster; Ioan Raicu; Kyle Chard

arXiv:2502.05293 · cs.DC · March 20, 2025

TL;DR

This paper presents novel lock-free data structures and synchronization strategies that significantly improve fine-grained parallel task performance on multi-socket many-core systems, reducing overhead and enhancing scalability.

Contribution

The paper introduces three components: XQueue, a lock-less task queue; a scalable distributed tree barrier; and two NUMA-aware load balancing strategies. Together they target the runtime synchronization overhead of GNU OpenMP for fine-grained tasks.
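To make the idea of a lock-less task queue concrete, here is a minimal sketch of a bounded single-producer/single-consumer ring buffer. It is not XQueue's code: the class name `SpscRingQueue`, the helper `transfer`, and the single-producer/single-consumer restriction are assumptions made for illustration. The sketch also relies on CPython's global interpreter lock to order the index reads and writes; a C or C++ version would need explicit acquire/release atomics instead.

```python
import threading
import time


class SpscRingQueue:
    """Bounded queue for exactly one producer and one consumer, with no locks.

    Only the producer writes `_tail`; only the consumer writes `_head`.
    `None` marks "empty", so `None` itself cannot be stored as an item.
    """

    def __init__(self, capacity):
        # One slot stays unused so that "full" and "empty" look different.
        self._slots = [None] * (capacity + 1)
        self._head = 0  # next slot to read (consumer-owned)
        self._tail = 0  # next slot to write (producer-owned)

    def try_push(self, item):
        nxt = (self._tail + 1) % len(self._slots)
        if nxt == self._head:
            return False  # queue is full
        self._slots[self._tail] = item
        self._tail = nxt  # publish only after the slot is filled
        return True

    def try_pop(self):
        if self._head == self._tail:
            return None  # queue is empty
        item = self._slots[self._head]
        self._slots[self._head] = None
        self._head = (self._head + 1) % len(self._slots)
        return item


def transfer(n_items, capacity=8):
    """Send the integers 0..n_items-1 from a producer thread to a consumer thread."""
    queue = SpscRingQueue(capacity)
    received = []

    def producer():
        for i in range(n_items):
            while not queue.try_push(i):
                time.sleep(0)  # queue full: yield and retry

    def consumer():
        while len(received) < n_items:
            item = queue.try_pop()
            if item is None:
                time.sleep(0)  # queue empty: yield and retry
            else:
                received.append(item)

    threads = [threading.Thread(target=producer), threading.Thread(target=consumer)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return received
```

Calling `transfer(300)` should return the integers in the order they were pushed. Because each index has a single writer, neither side ever needs to take a lock; that single-writer property is what the sketch is meant to show.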

Findings

01. Performance improved by up to 1522.8× with new synchronization methods.

02. Lock-less load balancing enhances performance by up to 4×.

03. Significant reduction in synchronization overhead on many-core systems.

Abstract

Achieving efficient task parallelism on many-core architectures is an important challenge. The widely used GNU OpenMP implementation of the popular OpenMP parallel programming model incurs high overhead for fine-grained, short-running tasks due to time spent on runtime synchronization. In this work, we introduce and analyze three key advances that collectively achieve significant performance gains. First, we introduce XQueue, a lock-less concurrent queue implementation to replace GNU's priority task queue and remove the global task lock. Second, we develop a scalable, efficient, and hybrid lock-free/lock-less distributed tree barrier to address the high hardware synchronization overhead from GNU's centralized barrier. Third, we develop two lock-less and NUMA-aware load balancing strategies. We evaluate our implementation using Barcelona OpenMP Task Suite (BOTS) benchmarks. We show that…
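The contrast between a centralized barrier and a tree barrier can be sketched as follows. This is an illustrative assumption, not the paper's barrier: the names `TreeBarrier` and `run_phases`, the binary fan-out, and the use of monotonically increasing epoch numbers are all choices made here. Threads report arrival up a tree, so each thread polls only its own children instead of all threads contending on one counter; release is a single word written by the root. As written, it relies on CPython's global interpreter lock for memory ordering.

```python
import threading
import time


class TreeBarrier:
    """Tree-structured arrival with a single release word.

    Every shared variable has exactly one writer, so no lock is needed:
    arrived[i] is written only by thread i, and `released` only by thread 0.
    """

    def __init__(self, n_threads, fanout=2):
        self.n_threads = n_threads
        self.fanout = fanout
        self.arrived = [0] * n_threads  # highest epoch each subtree has reached
        self.released = 0               # highest epoch the root has released

    def _children(self, tid):
        first = tid * self.fanout + 1
        return range(first, min(first + self.fanout, self.n_threads))

    def wait(self, tid, epoch):
        # Arrival: wait for every child subtree, then report upward.
        for child in self._children(tid):
            while self.arrived[child] < epoch:
                time.sleep(0)
        self.arrived[tid] = epoch
        if tid == 0:
            self.released = epoch  # root has seen everyone: release
        else:
            while self.released < epoch:
                time.sleep(0)


def run_phases(n_threads=4, n_phases=3):
    """Run n_phases of work separated by barriers; return the order of log entries."""
    barrier = TreeBarrier(n_threads)
    log = []

    def worker(tid):
        for phase in range(n_phases):
            log.append(phase)             # stand-in for real work
            barrier.wait(tid, phase + 1)  # epochs start at 1

    threads = [threading.Thread(target=worker, args=(t,)) for t in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return log
```

If the barrier is correct, every entry for phase k appears in the log before any entry for phase k+1, so the returned list is already sorted.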

Taxonomy

Topics: Distributed and Parallel Computing Systems · Parallel Computing and Optimization Techniques · Graph Theory and Algorithms