GEMMFIP: Unifying GEMM in BLIS
RuQing G. Xu, Field G. Van Zee, Robert A. van de Geijn

TL;DR
This paper introduces GEMMFIP, a unified approach for high-performance matrix multiplication across small and large matrices by fusing packing with computation, simplifying tuning and improving efficiency.
Contribution
It proposes a novel technique that unifies optimization strategies for small and large matrices in GEMM, implemented within the BLIS framework.
Findings
Achieves high performance for both small and large matrices
Simplifies tuning of general-purpose matrix libraries
Demonstrates effectiveness across multiple architectures
Abstract
Matrix libraries often focus on achieving high performance for problems considered to be either "small" or "large", as these two scenarios tend to respond best to different optimization strategies. We propose a unified technique for implementing matrix operations like general matrix multiplication (GEMM) that can achieve high performance for both small and large problem sizes. The key is to fuse packing -- an operation that copies data to a contiguous layout in memory and which is critical for large matrix performance -- with the first computational "pass" over that data. This boosts performance across the problem size spectrum. As a result, tuning general-purpose libraries becomes simpler since it obviates the need to carefully express and parameterize logic that chooses between a "small matrix" strategy and a "large matrix" strategy. A prototype implementation of the technique built…
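The fusion idea in the abstract can be illustrated with a minimal pure-Python sketch. It assumes a simplified GotoBLAS-style loop nest: no outer nc loop, no packing of A, and no SIMD. The names `gemm_fip` and `micro_kernel` and the block sizes `mc`, `kc`, `nr` are hypothetical and are not the BLIS API or the authors' implementation. On the first pass over each micro-panel of B (the first `ic` iteration), the kernel reads B from its original strided location and copies each element into a contiguous buffer as it consumes it. Later passes reuse that buffer, so there is no separate packing pass. Pure Python shows only the control flow, not the performance effect.

```python
def micro_kernel(A, C, ic, mb, pc, kb, jr, nb, packed, B=None):
    """C[ic:ic+mb, jr:jr+nb] += A[ic:ic+mb, pc:pc+kb] @ (kb x nb panel of B).

    If B is given, the panel is read from the original B and written into
    `packed` in the same loop (fused packing). Otherwise it is read back from
    `packed`, a flat, row-major, contiguous copy of the micro-panel.
    """
    for p in range(kb):
        for j in range(nb):
            if B is not None:
                b = B[pc + p][jr + j]
                packed[p * nb + j] = b      # pack while computing
            else:
                b = packed[p * nb + j]      # reuse the contiguous copy
            for i in range(ic, ic + mb):
                C[i][jr + j] += A[i][pc + p] * b


def gemm_fip(A, B, C, mc=4, kc=4, nr=2):
    """C += A @ B, packing B during the first computational pass over it."""
    m, k, n = len(A), len(B), len(B[0])
    for pc in range(0, k, kc):              # kc-blocked rank-k updates
        kb = min(kc, k - pc)
        panels = {}                         # jr -> packed micro-panel of B
        for ic in range(0, m, mc):          # row blocks of A and C
            mb = min(mc, m - ic)
            for jr in range(0, n, nr):      # nr-wide micro-panels of B
                nb = min(nr, n - jr)
                if jr not in panels:        # first pass: fuse packing in
                    panels[jr] = [0] * (kb * nb)
                    micro_kernel(A, C, ic, mb, pc, kb, jr, nb, panels[jr], B=B)
                else:                       # later passes: packed data only
                    micro_kernel(A, C, ic, mb, pc, kb, jr, nb, panels[jr])
    return C
```

With `mc` smaller than the number of rows, only the first row block pays for packing, and it does so inside the kernel's own loads. When the whole problem fits in a single row block, B is never re-read from the packed buffer. This is the small-matrix case, where a separate packing pass would be pure overhead.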
Taxonomy
Topics: Parallel Computing and Optimization Techniques · Distributed and Parallel Computing Systems · Interconnection Networks and Systems
