LP-GEMM: Integrating Layout Propagation into GEMM Operations
César Guedes Carneiro, Lucas Alvarenga, Guido Araujo, Sandro Rigo

TL;DR
LP-GEMM introduces layout propagation across sequential GEMMs, reducing redundant data packing and significantly improving performance in scientific computing and machine learning workloads.
Contribution
It proposes a novel GEMM kernel decomposition that enables packing-layout propagation, eliminating unnecessary data repacking while maintaining BLAS correctness.
Findings
Achieves 2.25x speedup over OpenBLAS on x86 for sequential GEMMs.
Demonstrates practical performance gains in Llama-3.2 inference using only BLAS-level GEMM calls.
Provides a portable implementation on x86 and RISC-V architectures.
Abstract
In Scientific Computing and modern Machine Learning (ML) workloads, sequences of dependent General Matrix Multiplications (GEMMs) often dominate execution time. While state-of-the-art BLAS libraries aggressively optimize individual GEMM calls, they remain constrained by the BLAS API, which requires each call to independently pack input matrices and restore outputs to a canonical memory layout. In sequential GEMMs, these constraints cause redundant packing and unpacking, wasting valuable computational resources. This paper introduces LP-GEMM, a decomposition of the GEMM kernel that enables packing-layout propagation across sequential GEMM operations. This approach eliminates unnecessary data repacking while preserving full BLAS semantic correctness at the boundaries. We evaluate LP-GEMM on x86 (AVX-512) and RISC-V (RVV 1.0) architectures across MLP-like and Attention-like workloads.…
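The redundancy described in the abstract can be illustrated with a small pure-Python sketch. This is not the authors' kernel. The panel height `MR`, the helper names (`pack_rows`, `unpack_rows`, `gemm_packed`), and the choice to pack only the left operand are assumptions made for illustration. A real BLAS kernel packs both operands and uses SIMD micro-kernels. The sketch compares a chain in which every GEMM packs its input and restores a canonical output against a chain in which the packed layout is carried from one GEMM to the next.

```python
MR = 2  # hypothetical micro-panel height; real kernels tune this per architecture


def pack_rows(mat, mr=MR):
    """Pack a row-major matrix into panels of up to `mr` rows.

    Each panel is stored column by column, so the `mr` values a
    micro-kernel needs for one step along k sit next to each other.
    """
    rows, cols = len(mat), len(mat[0])
    panels = []
    for i0 in range(0, rows, mr):
        h = min(mr, rows - i0)
        panels.append([[mat[i0 + i][k] for i in range(h)] for k in range(cols)])
    return panels


def unpack_rows(panels):
    """Restore the canonical row-major layout from packed panels."""
    out = []
    for panel in panels:
        h = len(panel[0])
        for i in range(h):
            out.append([col[i] for col in panel])
    return out


def gemm_packed(a_panels, b):
    """Multiply packed A by row-major B; the result stays in packed layout."""
    n = len(b[0])
    out = []
    for panel in a_panels:
        h = len(panel[0])
        c_panel = [[0] * h for _ in range(n)]
        for k, a_col in enumerate(panel):
            for j in range(n):
                bkj = b[k][j]
                for i in range(h):
                    c_panel[j][i] += a_col[i] * bkj
        out.append(c_panel)
    return out


def chain_baseline(a, bs, stats):
    """Every GEMM packs its input and unpacks its output (BLAS-style calls)."""
    x = a
    for b in bs:
        packed = pack_rows(x)
        stats["pack"] += 1
        x = unpack_rows(gemm_packed(packed, b))
        stats["unpack"] += 1
    return x


def chain_propagated(a, bs, stats):
    """Pack once, keep the packed layout between GEMMs, unpack once at the end."""
    packed = pack_rows(a)
    stats["pack"] += 1
    for b in bs:
        packed = gemm_packed(packed, b)
    stats["unpack"] += 1
    return unpack_rows(packed)
```

For a chain of n dependent GEMMs, the baseline in this sketch performs n pack and n unpack steps, while the propagated version performs one of each. Both return the same canonical result at the chain boundary.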
