LP-GEMM: Integrating Layout Propagation into GEMM Operations

César Guedes Carneiro; Lucas Alvarenga; Guido Araujo; Sandro Rigo

arXiv:2604.04599·cs.DC·April 7, 2026

TL;DR

LP-GEMM introduces layout propagation across sequential GEMMs, reducing redundant data packing and significantly improving performance in scientific computing and machine learning workloads.

Contribution

The paper proposes a decomposition of the GEMM kernel that enables packing-layout propagation across sequential GEMM calls, eliminating unnecessary data repacking while preserving BLAS semantic correctness at the boundaries.

Findings

01. Achieves 2.25x speedup over OpenBLAS on x86 for sequential GEMMs.

02. Demonstrates practical performance gains in Llama-3.2 inference using only BLAS-level GEMM calls.

03. Provides a portable implementation on x86 and RISC-V architectures.

Abstract

In Scientific Computing and modern Machine Learning (ML) workloads, sequences of dependent General Matrix Multiplications (GEMMs) often dominate execution time. While state-of-the-art BLAS libraries aggressively optimize individual GEMM calls, they remain constrained by the BLAS API, which requires each call to independently pack input matrices and restore outputs to a canonical memory layout. In sequential GEMMs, these constraints cause redundant packing and unpacking, wasting valuable computational resources. This paper introduces LP-GEMM, a decomposition of the GEMM kernel that enables packing-layout propagation across sequential GEMM operations. This approach eliminates unnecessary data repacking while preserving full BLAS semantic correctness at the boundaries. We evaluate LP-GEMM on x86 (AVX-512) and RISC-V (RVV 1.0) architectures across MLP-like and Attention-like workloads.…
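
The redundancy described in the abstract can be illustrated with a small toy model. The sketch below is not the authors' kernel: the panel width `NR`, the column-panel layout, and the helper names (`pack`, `unpack`, `gemm_packed`, `chain_blas`, `chain_lp`) are all assumptions made for illustration, and only the propagated operand is packed, whereas a real BLAS kernel packs both inputs. It contrasts a chain where every GEMM packs its input and restores a canonical output with a chain where the packed layout is carried from one GEMM to the next.

```python
from collections import Counter

NR = 4  # hypothetical panel width; a real kernel would tie this to the vector length
counts = Counter()  # tallies layout conversions so the two chains can be compared


def pack(B, nr=NR):
    """Copy a row-major K x N matrix into column panels of width nr (zero-padded)."""
    counts["pack"] += 1
    n = len(B[0])
    panels = [
        [[row[j] if j < n else 0 for j in range(j0, j0 + nr)] for row in B]
        for j0 in range(0, n, nr)
    ]
    return panels, n


def unpack(packed, nr=NR):
    """Restore the canonical row-major layout from column panels."""
    counts["unpack"] += 1
    panels, n = packed
    rows = len(panels[0])
    return [[panels[j // nr][r][j % nr] for j in range(n)] for r in range(rows)]


def gemm_packed(A, packed):
    """Multiply A by an already-packed B; the result stays in the packed layout."""
    panels, n = packed
    out = [
        [[sum(a * p_row[c] for a, p_row in zip(a_row, panel))
          for c in range(len(panel[0]))]
         for a_row in A]
        for panel in panels
    ]
    return out, n


def chain_blas(weights, X):
    """Baseline: every call packs its input and restores the canonical output."""
    for W in weights:
        X = unpack(gemm_packed(W, pack(X)))
    return X


def chain_lp(weights, X):
    """Propagated layout: pack once, keep intermediates packed, unpack at the end."""
    packed = pack(X)
    for W in weights:
        packed = gemm_packed(W, packed)
    return unpack(packed)
```

In this toy model, a chain of n dependent GEMMs performs n packs and n unpacks in `chain_blas` but only one of each in `chain_lp`, while both return the same canonical result at the chain boundary.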
