Engineering Boolean Matrix Multiplication for Multiple-Accelerator Shared-Memory Architectures

Matti Karppa; Petteri Kaski

arXiv:1909.01554·cs.DS·September 5, 2019

TL;DR

This paper presents high-performance, energy-efficient algorithms for Boolean and binary matrix multiplication on multi-accelerator shared-memory systems, enabling large-scale computations close to hardware memory limits.

Contribution

The paper engineers open-source implementations of Boolean and binary bit-matrix multiplication for multiple-accelerator shared-memory architectures, scaling to input sizes close to the available shared memory capacity.

Findings

01

Boolean product of two terabinary-bit square matrices computed in approximately 2100 seconds (1.0 Pbop/s) on an NVIDIA DGX-1

02

Binary product of two terabinary-bit square matrices computed in less than 950 seconds (2.4 effective Pbop/s) on the same system

03

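Energy cost of 3.3 pJ/bop (2.1 kWh per Boolean product) and 1.5 effective pJ/bop (0.92 kWh per binary product), with power consumption taken at peak system power (3.5 kW)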

Abstract

We study the problem of multiplying two bit matrices with entries either over the Boolean algebra $(0, 1, \lor, \land)$ or over the binary field $(0, 1, +, \cdot)$. We engineer high-performance open-source algorithm implementations for contemporary multiple-accelerator shared-memory architectures, with the objective of time-and-energy-efficient scaling up to input sizes close to the available shared memory capacity. For example, given two terabinary-bit square matrices as input, our implementations compute the Boolean product in approximately 2100 seconds (1.0 Pbop/s at 3.3 pJ/bop for a total of 2.1 kWh/product) and the binary product in less than 950 seconds (2.4 effective Pbop/s at 1.5 effective pJ/bop for a total of 0.92 kWh/product) on an NVIDIA DGX-1 with power consumption at peak system power (3.5 kW). Our contributions are (a) for the binary product, we use alternative-basis…
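To make the two products in the abstract concrete, the following is a minimal single-threaded sketch, not the authors' multi-accelerator implementation. It assumes each matrix row is packed into one Python integer, with bit j holding the entry in column j. The function name `bit_matrix_product` and the `binary_field` flag are hypothetical names chosen for this illustration. The only difference between the two products is how rows are accumulated: bitwise OR for the Boolean algebra, bitwise XOR (addition modulo 2) for the binary field.

```python
def bit_matrix_product(a_rows, b_rows, binary_field=False):
    """Multiply two bit matrices stored as one integer per row.

    Bit j of a row integer is the entry in column j (an assumption of
    this sketch). With binary_field=False the product is taken over the
    Boolean algebra (OR of ANDs); with binary_field=True it is taken
    over the binary field (XOR of ANDs, i.e. arithmetic modulo 2).
    """
    result = []
    for a in a_rows:
        acc = 0
        k = 0
        # Row i of the product combines every row k of B for which
        # A[i][k] = 1, so one word-wide operation handles a whole row.
        while a:
            if a & 1:
                if binary_field:
                    acc ^= b_rows[k]   # addition over the binary field
                else:
                    acc |= b_rows[k]   # disjunction over the Boolean algebra
            a >>= 1
            k += 1
        result.append(acc)
    return result


# A = [[1, 1], [0, 1]] and B = [[1, 0], [1, 1]], with bit 0 as column 0.
A = [0b11, 0b10]
B = [0b01, 0b11]
boolean_c = bit_matrix_product(A, B)                     # [[1, 1], [1, 1]]
binary_c = bit_matrix_product(A, B, binary_field=True)   # [[0, 1], [1, 1]]
```

The two results differ only in the top-left entry, where two ones meet: 1 OR 1 = 1 in the Boolean algebra, while 1 + 1 = 0 in the binary field.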

Taxonomy

Topics: Parallel Computing and Optimization Techniques · Interconnection Networks and Systems · Complexity and Algorithms in Graphs