Cache oblivious storage and access heuristics for blocked matrix-matrix multiplication
Nicolas Bock, Emanuel H. Rubensson, Paweł Sałek, Anders M. N. Niklasson, Matt Challacombe

TL;DR
This paper examines how the execution order of the submatrix multiplications in blocked matrix-matrix multiplication affects performance. It finds that submatrices need not be stored contiguously in memory to reach near-optimal performance: the choice of execution order alone yields speedups of up to four times for small block sizes.
Contribution
It demonstrates that execution order, rather than contiguous memory storage, is crucial for optimizing blocked matrix multiplication performance.
Findings
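The execution order of the submatrix multiplications significantly impacts performance.
Submatrices need not be stored contiguously in memory to reach near-optimal performance.
Choosing the execution order yields a speedup of up to four times for small block sizes.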
Abstract
We investigate effects of ordering in blocked matrix-matrix multiplication. We find that submatrices do not have to be stored contiguously in memory to achieve near-optimal performance. Instead, it is the choice of execution order of the submatrix multiplications that leads to a speedup of up to four times for small block sizes. This is in contrast to results for single matrix elements showing that contiguous memory allocation quickly becomes irrelevant as the block size increases.
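The separation between storage layout and execution order described in the abstract can be illustrated with a minimal sketch. It assumes square matrices whose size is a multiple of the block size, stores every submatrix as its own independently allocated list (so blocks are not contiguous with one another), and lets the caller choose the order in which the submatrix products are executed. The summary does not say which ordering the authors use; the Morton (Z-order) key below is an assumption chosen for illustration, and all function names are hypothetical. Pure Python will not reproduce the cache effects or the reported speedup; the sketch only shows that the result is independent of the execution order while the traversal itself changes.

```python
def split_blocks(M, bs):
    """Split a dense n x n list-of-lists into separately allocated bs x bs blocks."""
    nb = len(M) // bs
    return {
        (bi, bj): [row[bj * bs:(bj + 1) * bs] for row in M[bi * bs:(bi + 1) * bs]]
        for bi in range(nb)
        for bj in range(nb)
    }


def block_mult_add(C, A, B):
    """Accumulate C += A * B for one triple of bs x bs blocks."""
    bs = len(A)
    for r in range(bs):
        for k in range(bs):
            a = A[r][k]
            for c in range(bs):
                C[r][c] += a * B[k][c]


def morton_key(i, j, k, bits):
    """Interleave the bits of (i, j, k) into a single Z-order key (illustrative choice)."""
    key = 0
    for b in range(bits):
        key |= ((i >> b) & 1) << (3 * b + 2)
        key |= ((j >> b) & 1) << (3 * b + 1)
        key |= ((k >> b) & 1) << (3 * b)
    return key


def product_order(nb, order="morton"):
    """List all (i, j, k) block products C[i,j] += A[i,k] * B[k,j] in the chosen order."""
    triples = [(i, j, k) for i in range(nb) for j in range(nb) for k in range(nb)]
    if order == "morton":
        bits = max(1, (nb - 1).bit_length())
        triples.sort(key=lambda t: morton_key(t[0], t[1], t[2], bits))
    return triples  # "row" keeps the plain nested-loop order


def blocked_matmul(Ab, Bb, nb, bs, order="morton"):
    """Multiply two block-stored matrices, executing block products in the given order."""
    Cb = {(i, j): [[0] * bs for _ in range(bs)] for i in range(nb) for j in range(nb)}
    for i, j, k in product_order(nb, order):
        block_mult_add(Cb[(i, j)], Ab[(i, k)], Bb[(k, j)])
    return Cb
```

Calling `blocked_matmul(split_blocks(A, bs), split_blocks(B, bs), n // bs, bs, order)` with `order="row"` or `order="morton"` produces the same product blocks; only the sequence in which blocks of A, B, and C are touched differs, which is the quantity the paper identifies as performance-critical.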
Taxonomy
Topics: Advanced Data Storage Technologies · Parallel Computing and Optimization Techniques · Distributed and Parallel Computing Systems
