Deinsum: Practically I/O Optimal Multilinear Algebra

Alexandros Nikolaos Ziogas; Grzegorz Kwasniewski; Tal Ben-Nun; Timo Schneider; Torsten Hoefler

arXiv:2206.08301·cs.DC·June 17, 2022

Open Access

TL;DR

Deinsum is an automated framework that derives data movement-optimal distributed schedules for multilinear algebra, significantly improving performance and scalability on supercomputers by optimizing tensor computations expressed in Einstein notation.

Contribution

Deinsum introduces a mathematically rigorous method that automatically generates data movement-optimal distributed schedules for multilinear algebra, in contrast to the heuristics that state-of-the-art libraries rely on.

Findings

01. Achieved up to 19x speedup on 512 nodes of the Piz Daint supercomputer.

02. Demonstrated improved performance on tensor kernel classes such as Matricized Tensor Times Khatri-Rao Products.

03. Validated the scalability and efficiency of the approach in high-performance computing environments.

Abstract

Multilinear algebra kernel performance on modern massively-parallel systems is determined mainly by data movement. However, deriving data movement-optimal distributed schedules for programs with many high-dimensional inputs is a notoriously hard problem. State-of-the-art libraries rely on heuristics and often fall back to suboptimal tensor folding and BLAS calls. We present Deinsum, an automated framework for distributed multilinear algebra computations expressed in Einstein notation, based on rigorous mathematical tools to address this problem. Our framework automatically derives data movement-optimal tiling and generates corresponding distributed schedules, further optimizing the performance of local computations by increasing their arithmetic intensity. To show the benefits of our approach, we test it on two important tensor kernel classes: Matricized Tensor Times Khatri-Rao Products…
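To make the kernel class named in the abstract concrete, the sketch below writes a three-mode Matricized Tensor Times Khatri-Rao Product in Einstein notation and compares it with the "fold the tensor, then call a matrix multiply" route that the abstract attributes to existing libraries. It is a single-node NumPy illustration only, not Deinsum's API or its distributed schedule; the function names, tensor shapes, and the choice of mode are assumptions made for the example.

```python
import numpy as np


def mttkrp_einsum(X, B, C):
    """Mode-0 MTTKRP in Einstein notation: M[i, r] = sum_{j,k} X[i,j,k] * B[j,r] * C[k,r]."""
    return np.einsum("ijk,jr,kr->ir", X, B, C)


def mttkrp_folded(X, B, C):
    """Same result via tensor folding: unfold X to a matrix and multiply by the Khatri-Rao product."""
    I, J, K = X.shape
    R = B.shape[1]
    # Khatri-Rao product (column-wise Kronecker product) of B and C, shape (J*K, R).
    # Row j*K + k holds B[j, :] * C[k, :], matching the C-order unfolding of X below.
    khatri_rao = (B[:, None, :] * C[None, :, :]).reshape(J * K, R)
    # Unfold X along mode 0 and perform a single matrix multiplication.
    return X.reshape(I, J * K) @ khatri_rao


# Hypothetical small problem sizes, chosen only for illustration.
rng = np.random.default_rng(0)
X = rng.standard_normal((4, 5, 6))
B = rng.standard_normal((5, 3))
C = rng.standard_normal((6, 3))

M_einsum = mttkrp_einsum(X, B, C)
M_folded = mttkrp_folded(X, B, C)
print(np.allclose(M_einsum, M_folded))
```

The folded variant materializes the full Khatri-Rao matrix of shape (J*K, R) before the multiplication, which is the kind of extra intermediate data that a data movement-aware schedule would try to avoid.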

Taxonomy

Topics: Parallel Computing and Optimization Techniques · Tensor decomposition and applications · Quantum Computing Algorithms and Architecture