Engineering Boolean Matrix Multiplication for Multiple-Accelerator Shared-Memory Architectures
Matti Karppa, Petteri Kaski

TL;DR
This paper engineers high-performance, open-source implementations of Boolean and binary matrix multiplication for multiple-accelerator shared-memory systems, targeting time- and energy-efficient scaling to input sizes close to the available shared memory capacity.
Contribution
It contributes algorithm implementations and engineering strategies for multiplying large bit matrices, over both the Boolean algebra and the binary field, on multiple-accelerator shared-memory architectures.
Findings
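Boolean product of two terabinary-bit square matrices computed in approximately 2100 seconds (1.0 Pbop/s at 3.3 pJ/bop, 2.1 kWh/product) on an NVIDIA DGX-1
Binary product of the same input size computed in less than 950 seconds (2.4 effective Pbop/s at 1.5 effective pJ/bop, 0.92 kWh/product)
Energy figures assume power consumption at peak system power (3.5 kW)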
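The energy totals can be sanity-checked from the run times and the peak system power quoted in the abstract. The sketch below is only that arithmetic (energy = power × time), assuming constant draw at peak power; the constant and function names are illustrative and not taken from the paper. Because the Boolean run time is reported only approximately, the result matches the quoted 2.1 kWh only approximately.

```python
# Figures quoted in the abstract.
PEAK_POWER_KW = 3.5       # peak system power of the NVIDIA DGX-1
BOOLEAN_SECONDS = 2100    # approximate time for the Boolean product
BINARY_SECONDS = 950      # upper bound on time for the binary product


def energy_kwh(power_kw: float, seconds: float) -> float:
    """Energy in kWh for a run at constant power (assumed: peak draw throughout)."""
    return power_kw * seconds / 3600.0


boolean_kwh = energy_kwh(PEAK_POWER_KW, BOOLEAN_SECONDS)
binary_kwh = energy_kwh(PEAK_POWER_KW, BINARY_SECONDS)
print(f"Boolean: {boolean_kwh:.2f} kWh, binary: {binary_kwh:.2f} kWh")
```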
Abstract
We study the problem of multiplying two bit matrices with entries either over the Boolean algebra (0, 1, ∨, ∧) or over the binary field (0, 1, +, ·). We engineer high-performance open-source algorithm implementations for contemporary multiple-accelerator shared-memory architectures, with the objective of time-and-energy-efficient scaling up to input sizes close to the available shared memory capacity. For example, given two terabinary-bit square matrices as input, our implementations compute the Boolean product in approximately 2100 seconds (1.0 Pbop/s at 3.3 pJ/bop for a total of 2.1 kWh/product) and the binary product in less than 950 seconds (2.4 effective Pbop/s at 1.5 effective pJ/bop for a total of 0.92 kWh/product) on an NVIDIA DGX-1 with power consumption at peak system power (3.5 kW). Our contributions are (a) for the binary product, we use alternative-basis…
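To make the two products concrete, here is a minimal single-threaded sketch, not the paper's multi-accelerator implementation. It uses the classical cubic method with rows packed into Python integers, so that a whole row is combined with one bitwise operation. The only difference between the two products is the accumulation step: OR for the Boolean algebra, XOR (addition modulo 2) for the binary field. The function names `pack_rows`, `unpack_rows`, and `bit_matmul` are hypothetical.

```python
def pack_rows(matrix):
    """Pack each 0/1 row into an int; bit j of the int holds column j."""
    return [sum(bit << j for j, bit in enumerate(row)) for row in matrix]


def unpack_rows(rows, n_cols):
    """Inverse of pack_rows for rows of known width."""
    return [[(r >> j) & 1 for j in range(n_cols)] for r in rows]


def bit_matmul(a, b, binary_field=False):
    """Multiply 0/1 matrices a (n x k) and b (k x m), given as lists of lists.

    Boolean algebra: C[i][j] = OR over k of (A[i][k] AND B[k][j]).
    Binary field:    C[i][j] = XOR over k of (A[i][k] AND B[k][j]).
    """
    b_rows = pack_rows(b)
    n_cols = len(b[0])
    out = []
    for row in a:
        acc = 0
        for k, bit in enumerate(row):
            if bit:
                # Row k of B contributes to this output row only when A[i][k] = 1.
                acc = acc ^ b_rows[k] if binary_field else acc | b_rows[k]
        out.append(acc)
    return unpack_rows(out, n_cols)


A = [[1, 1], [0, 1]]
B = [[1, 0], [1, 1]]
print(bit_matmul(A, B))                     # Boolean product
print(bit_matmul(A, B, binary_field=True))  # binary (mod 2) product
```

For these inputs the two products differ in the top-left entry: 1 ∨ 1 = 1 over the Boolean algebra, whereas 1 + 1 = 0 over the binary field.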
Taxonomy
Topics: Parallel Computing and Optimization Techniques · Interconnection Networks and Systems · Complexity and Algorithms in Graphs
