Partitioning Unstructured Sparse Tensor Algebra for Load-Balanced Parallel Execution
Atharva Chougule, Alexander J Root, Rubens Lacouture, Bobby Yan, Rohan Yadav, Fredrik Kjolstad

TL;DR
This paper introduces a partitioning algorithm for sparse tensor algebra that provably load-balances parallel execution on multi-core CPUs and GPUs, producing code that is competitive with vendor libraries and faster than general-purpose strategies.
Contribution
It presents the first provably load-balanced partitioning algorithm for any sparse tensor algebra expression, generalizing parallel merging to multi-dimensional, hierarchical data structures.
Findings
Generated code is competitive with vendor libraries like Intel MKL and NVIDIA cuSPARSE.
Outperforms general-purpose strategies by 2.0–6.4× in unoptimized scenarios.
Achieves load balancing for diverse sparse tensor algebra expressions.
Abstract
Sparse tensor algebra is challenging to efficiently parallelize due to the irregular, data-dependent, and potentially skewed structure of sparse computation. We propose the first partitioning algorithm that provably load balances the computation of any sparse tensor algebra expression across parallel execution units. Our algorithm generalizes parallel merging algorithms to any number of operands, and to multi-dimensional, hierarchical sparse data structures. We implement our algorithm within an existing sparse tensor algebra compilation framework to automatically generate parallel sparse tensor algebra kernels that target multi-core CPUs and GPUs. We show that our generated code is competitive with hand-implemented parallelization strategies used by vendor libraries like Intel MKL and NVIDIA cuSPARSE (geo-means of --) and Taco (geo-means of --),…
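For intuition about the "parallel merging" that the abstract says is being generalized, the following is a minimal sketch of the classic two-operand merge-path partitioning, not the paper's algorithm. It assumes two sorted coordinate lists and splits their merge into pieces of near-equal total work, however skewed the inputs are. The function names `merge_path_split` and `partition_merge` are hypothetical and chosen only for illustration.

```python
def merge_path_split(a, b, diag):
    """Return (i, j) with i + j == diag such that merging a[:i] and b[:j]
    gives the first `diag` elements of the full merge (ties take from `a`)."""
    lo = max(0, diag - len(b))   # fewest elements that must come from a
    hi = min(diag, len(a))       # most elements that can come from a
    while lo < hi:
        mid = (lo + hi) // 2
        # If a[mid] sorts no later than b[diag - mid - 1], then a[mid]
        # belongs in the prefix, so the split must take more from a.
        if a[mid] <= b[diag - mid - 1]:
            lo = mid + 1
        else:
            hi = mid
    return lo, diag - lo


def partition_merge(a, b, parts):
    """Split the merge of sorted lists a and b into `parts` pieces whose
    sizes differ by at most one element. Each piece is ((i0, j0), (i1, j1)),
    meaning it merges a[i0:i1] with b[j0:j1]."""
    total = len(a) + len(b)
    bounds = [merge_path_split(a, b, (k * total) // parts)
              for k in range(parts + 1)]
    return list(zip(bounds[:-1], bounds[1:]))


if __name__ == "__main__":
    # Skewed inputs: splitting `a` alone evenly would balance poorly.
    a = [1, 2, 3, 4, 100]
    b = [5, 6, 7, 8, 9, 10, 11]
    for (i0, j0), (i1, j1) in partition_merge(a, b, 3):
        print(a[i0:i1], b[j0:j1])
```

Each piece can be handed to a separate worker, which merges its two slices independently. The binary search along each diagonal costs O(log(min(len(a), len(b)))), so the partitioning step is cheap relative to the merge itself. Extending this idea to more than two operands and to hierarchical sparse formats is the contribution described in the abstract and is not attempted here.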
