Locality Optimization for Data Parallel Programs
Eric Hielscher, Alex Rubinsteyn, Dennis Shasha

TL;DR
This paper presents a set of locality optimization techniques for data parallel programs in Parakeet, a Python JIT compiler, using tiled operators and autotuning to improve cache and register usage, resulting in significant speedups.
Contribution
Introduction of tiled data parallel operators and a novel tiling transformation with runtime autotuning for improved locality in Parakeet.
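As a rough illustration of what autotuning a tile size at runtime can look like, the sketch below times a tiled matrix multiply at several candidate tile sizes and keeps the fastest. It is not Parakeet's implementation: `tiled_matmul`, `autotune_tile`, and the candidate list are hypothetical names chosen for this example, and the search is a plain exhaustive timing loop.

```python
import time

def tiled_matmul(a, b, n, tile):
    """Multiply two n x n matrices (lists of lists), visiting them tile by tile."""
    c = [[0.0] * n for _ in range(n)]
    for ii in range(0, n, tile):
        for kk in range(0, n, tile):
            for jj in range(0, n, tile):
                # Work on one tile-sized block of each operand at a time.
                for i in range(ii, min(ii + tile, n)):
                    row_a, row_c = a[i], c[i]
                    for k in range(kk, min(kk + tile, n)):
                        a_ik, row_b = row_a[k], b[k]
                        for j in range(jj, min(jj + tile, n)):
                            row_c[j] += a_ik * row_b[j]
    return c

def autotune_tile(candidates, run):
    """Time run(tile) once per candidate and return the fastest tile size."""
    best_tile, best_time = None, float("inf")
    for tile in candidates:
        start = time.perf_counter()
        run(tile)
        elapsed = time.perf_counter() - start
        if elapsed < best_time:
            best_tile, best_time = tile, elapsed
    return best_tile

if __name__ == "__main__":
    n = 64
    a = [[float(i + j) for j in range(n)] for i in range(n)]
    b = [[float(i - j) for j in range(n)] for i in range(n)]
    chosen = autotune_tile([4, 8, 16, 32], lambda t: tiled_matmul(a, b, n, t))
    print("chosen tile size:", chosen)
```

In interpreted Python the timing differences mostly reflect interpreter overhead rather than cache behavior, so the chosen size is not meaningful in itself; the example only shows the shape of the search loop.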
Findings
Significant speedups on data locality benchmarks.
Effective automatic tiling for cache and registers.
Enhanced performance of data parallel Python programs.
Abstract
Productivity languages such as NumPy and Matlab make it much easier to implement data-intensive numerical algorithms. However, these languages can be intolerably slow for programs that don't map well to their built-in primitives. In this paper, we discuss locality optimizations for our system Parakeet, a just-in-time compiler and runtime system for an array-oriented subset of Python. Parakeet dynamically compiles whole user functions to high performance multi-threaded native code. Parakeet makes extensive use of the classic data parallel operators Map, Reduce, and Scan. We introduce a new set of data parallel operators, TiledMap, TiledReduce, and TiledScan, that break up their computations into local pieces of bounded size to make better use of small fast memories. We introduce a novel tiling transformation to generate tiled operators automatically. Applying this transformation…
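The idea of breaking a Map or Reduce into pieces of bounded size can be sketched in plain Python. This is only an illustration under assumptions: `tiled_map`, `tiled_reduce`, and the `tile_size` parameter are hypothetical names for this example, not Parakeet's TiledMap or TiledReduce operators, and the reduce version assumes `combine` is associative with `init` as its identity.

```python
from functools import reduce
from operator import add

def tiled_map(f, xs, tile_size):
    """Apply f to every element of xs, processing one fixed-size tile at a time."""
    out = []
    for start in range(0, len(xs), tile_size):
        tile = xs[start:start + tile_size]  # bounded-size working set
        out.extend(f(x) for x in tile)
    return out

def tiled_reduce(combine, xs, init, tile_size):
    """Reduce xs by first reducing each tile, then folding in the partial result.

    Assumes combine is associative and init is its identity element.
    """
    acc = init
    for start in range(0, len(xs), tile_size):
        tile = xs[start:start + tile_size]
        partial = reduce(combine, tile)  # local reduction over one tile
        acc = combine(acc, partial)
    return acc

if __name__ == "__main__":
    data = list(range(10))
    print(tiled_map(lambda x: x * x, data, tile_size=4))
    print(tiled_reduce(add, data, 0, tile_size=3))
```

Each tile touches at most `tile_size` elements before moving on, which is the property that lets a compiled version keep its working set in a small fast memory such as a cache or registers.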
Taxonomy
Topics: Parallel Computing and Optimization Techniques · Advanced Data Storage Technologies · Algorithms and Data Compression
