TL;DR
This paper introduces a practical framework for implementing and benchmarking more than 20 fast matrix multiplication algorithms, and shows that they can outperform both vendor implementations of the classical algorithm and Strassen's fast algorithm on modest problem sizes and shapes.
Contribution
It develops a code generation tool that automatically implements sequential and shared-memory parallel variants of each fast algorithm, including a novel parallelization scheme, and it analyzes the practical performance and implementation issues of these algorithms.
Findings
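Novel fast algorithms can significantly outperform vendor implementations of the classical algorithm, as well as Strassen's algorithm, on modest problem sizes.
The best choice of fast algorithm depends on both the size and the shape of the matrices.
Several practical implementation issues arise on shared-memory machines, and they point to directions for further research.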
Abstract
Matrix multiplication is a fundamental computation in many scientific disciplines. In this paper, we show that novel fast matrix multiplication algorithms can significantly outperform vendor implementations of the classical algorithm and Strassen's fast algorithm on modest problem sizes and shapes. Furthermore, we show that the best choice of fast algorithm depends not only on the size of the matrices but also the shape. We develop a code generation tool to automatically implement multiple sequential and shared-memory parallel variants of each fast algorithm, including our novel parallelization scheme. This allows us to rapidly benchmark over 20 fast algorithms on several problem sizes. Furthermore, we discuss a number of practical implementation issues for these algorithms on shared-memory machines that can direct further research on making fast algorithms practical.
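For readers unfamiliar with the term, the sketch below illustrates what a "fast" algorithm means here, using the textbook form of Strassen's algorithm that the abstract names as a baseline. It is not the paper's generated code: the function names (`classical_multiply`, `strassen`) and the `cutoff` parameter are hypothetical, and the sketch assumes square integer matrices stored as lists of rows. The key point is that each recursion level uses 7 block multiplications instead of the classical 8.

```python
def classical_multiply(A, B):
    """Classical O(n^3) product of two matrices stored as lists of rows."""
    inner, cols = len(B), len(B[0])
    return [[sum(row[k] * B[k][j] for k in range(inner)) for j in range(cols)]
            for row in A]


def add(X, Y):
    """Elementwise sum of two equally sized matrices."""
    return [[x + y for x, y in zip(rx, ry)] for rx, ry in zip(X, Y)]


def sub(X, Y):
    """Elementwise difference of two equally sized matrices."""
    return [[x - y for x, y in zip(rx, ry)] for rx, ry in zip(X, Y)]


def split(M):
    """Split a square matrix of even dimension into four quadrants."""
    h = len(M) // 2
    return ([r[:h] for r in M[:h]], [r[h:] for r in M[:h]],
            [r[:h] for r in M[h:]], [r[h:] for r in M[h:]])


def strassen(A, B, cutoff=2):
    """Textbook Strassen recursion for square matrices (illustrative only).

    Falls back to the classical algorithm when the dimension is at most
    `cutoff` (a hypothetical tuning parameter) or is odd.
    """
    n = len(A)
    if n <= cutoff or n % 2:
        return classical_multiply(A, B)

    A11, A12, A21, A22 = split(A)
    B11, B12, B21, B22 = split(B)

    # Seven recursive block products instead of the classical eight.
    M1 = strassen(add(A11, A22), add(B11, B22), cutoff)
    M2 = strassen(add(A21, A22), B11, cutoff)
    M3 = strassen(A11, sub(B12, B22), cutoff)
    M4 = strassen(A22, sub(B21, B11), cutoff)
    M5 = strassen(add(A11, A12), B22, cutoff)
    M6 = strassen(sub(A21, A11), add(B11, B12), cutoff)
    M7 = strassen(sub(A12, A22), add(B21, B22), cutoff)

    # Combine the products into the four quadrants of the result.
    C11 = add(sub(add(M1, M4), M5), M7)
    C12 = add(M3, M5)
    C21 = add(M2, M4)
    C22 = add(add(sub(M1, M2), M3), M6)

    top = [left + right for left, right in zip(C11, C12)]
    bottom = [left + right for left, right in zip(C21, C22)]
    return top + bottom
```

Calling `strassen(A, B)` on two square integer matrices should give the same result as `classical_multiply(A, B)`. The extra block additions are part of the cost of a fast algorithm, which is one reason practical speedups depend on problem size and shape.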
