A Proof of Concept for Optimizing Task Parallelism by Locality Queues
Markus Wittmann, Georg Hager

TL;DR
This paper introduces a technique using locality queues to optimize task parallelism in OpenMP, reducing memory access bottlenecks on ccNUMA systems while maintaining dynamic scheduling within local domains.
Contribution
It proposes a novel locality queue method that improves data access locality in task parallelism without sacrificing dynamic scheduling capabilities.
Findings
Significant performance improvements on ccNUMA systems.
Effective balancing of load and data locality.
Demonstrated with a blocked six-point stencil solver used as a toy model.
Abstract
Task parallelism as employed by the OpenMP task construct, although ideal for tackling irregular problems or typical producer/consumer schemes, bears some potential for performance bottlenecks if locality of data access is important, which is typically the case for memory-bound code on ccNUMA systems. We present a programming technique which ameliorates adverse effects of dynamic task distribution by sorting tasks into locality queues, each of which is preferably processed by threads that belong to the same locality domain. Dynamic scheduling is fully preserved inside each domain, and is preferred over possible load imbalance even if non-local access is required. The effectiveness of the approach is demonstrated using a blocked six-point stencil solver as a toy model.
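The queue policy described in the abstract can be illustrated with a small single-threaded sketch. It is not the authors' OpenMP implementation: the class `LocalityQueues`, the helper `block_home_domain`, the contiguous block-to-domain placement, and the round-robin order in which remote queues are searched are all assumptions made for illustration. It only shows the selection rule: a task is sorted into the queue of the domain that holds its data, a thread serves its own domain's queue first, and it takes remote work rather than sit idle when the local queue is empty.

```python
from collections import deque


class LocalityQueues:
    """One FIFO task queue per locality domain (hypothetical helper)."""

    def __init__(self, num_domains):
        self.queues = [deque() for _ in range(num_domains)]

    def enqueue(self, task, home_domain):
        # Sort the task into the queue of the domain that holds its data.
        self.queues[home_domain].append(task)

    def dequeue(self, thread_domain):
        # Prefer the thread's own domain so memory accesses stay local.
        if self.queues[thread_domain]:
            return self.queues[thread_domain].popleft(), thread_domain
        # Local queue is empty: accept non-local access instead of idling.
        for offset in range(1, len(self.queues)):
            domain = (thread_domain + offset) % len(self.queues)
            if self.queues[domain]:
                return self.queues[domain].popleft(), domain
        return None  # no work left in any queue


def block_home_domain(block_index, num_blocks, num_domains):
    # Assumption: blocks are placed on domains in contiguous chunks.
    return block_index * num_domains // num_blocks
```

With eight blocks and two domains, for example, blocks 0 to 3 land in the queue of domain 0 and blocks 4 to 7 in the queue of domain 1; a thread in domain 1 works through its four local blocks before it starts taking blocks that belong to domain 0. In a real multithreaded setting each queue would also need synchronization, which this sketch omits.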
Taxonomy
Topics: Parallel Computing and Optimization Techniques · Distributed and Parallel Computing Systems · Interconnection Networks and Systems
