Library Liberation: Competitive Performance Matmul Through Compiler-composed Nanokernels

Arun Thangamani; Md Asghar Ahmad Shahid; Adam Siemieniuk; Rolf Morel; Renato Golin; Alexander Heinecke

arXiv:2511.13764·cs.LG·November 19, 2025

Open Access

TL;DR

This paper presents an MLIR-based compilation scheme that automatically generates high-performance, scalable microkernels for AI workloads, reducing reliance on hand-crafted libraries and improving hardware utilization.

Contribution

It introduces a novel compiler technique for composing nanokernels from IR constructs, enabling automatic generation of near-peak performance microkernels tailored to hardware.

Findings

01

Generated nanokernels are of production quality.

02

Performance is competitive with state-of-the-art libraries.

03

Supports both vector and tile CPU instructions.

Abstract

The rapidly evolving landscape of AI and machine learning workloads has widened the gap between high-level domain operations and efficient hardware utilization. Achieving near-peak performance still demands deep hardware expertise: experts either handcraft target-specific kernels (e.g., DeepSeek) or rely on specialized libraries (e.g., CUTLASS), both of which add complexity and limit scalability for most ML practitioners. This paper introduces a compilation scheme that automatically generates scalable, high-performance microkernels by leveraging MLIR dialects to bridge domain-level operations and processor capabilities. Our approach removes dependence on low-level libraries by enabling the compiler to auto-generate near-optimal code directly. At its core is a mechanism for composing nanokernels from low-level IR constructs with near-optimal register utilization, forming efficient…
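The idea of a microkernel assembled from smaller nanokernels can be pictured with a toy model. The following is a minimal pure-Python sketch, not the paper's MLIR pipeline: the tile sizes `MR` and `NR` and the functions `nanokernel`, `microkernel`, and `matmul` are hypothetical names chosen for illustration, and the accumulator list only stands in for a block of hardware registers.

```python
# Illustrative model only: names and tile sizes are assumptions, not the paper's.

MR, NR = 2, 4  # accumulator tile shape (stand-in for a block of vector registers)


def nanokernel(acc, a_col, b_row):
    """One rank-1 update of the accumulator tile: acc += outer(a_col, b_row)."""
    for i, a in enumerate(a_col):
        for j, b in enumerate(b_row):
            acc[i][j] += a * b


def microkernel(A, B, C, i0, j0, K):
    """Compute one MR x NR tile of C by chaining K nanokernel steps."""
    acc = [[0.0] * NR for _ in range(MR)]  # tile is kept local for the whole K loop
    for k in range(K):
        a_col = [A[i0 + i][k] for i in range(MR)]
        b_row = [B[k][j0 + j] for j in range(NR)]
        nanokernel(acc, a_col, b_row)
    for i in range(MR):  # single write-back per tile
        for j in range(NR):
            C[i0 + i][j0 + j] += acc[i][j]


def matmul(A, B):
    """Return A @ B for shapes whose M and N are multiples of MR and NR."""
    M, K, N = len(A), len(B), len(B[0])
    assert M % MR == 0 and N % NR == 0, "sketch handles full tiles only"
    C = [[0.0] * N for _ in range(M)]
    for i0 in range(0, M, MR):
        for j0 in range(0, N, NR):
            microkernel(A, B, C, i0, j0, K)
    return C
```

In this model the accumulator tile is written back to `C` only once per tile, which is the property that register blocking aims for; how the paper's compiler actually selects tile shapes and instructions is not described in the text shown here.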

Taxonomy

Topics: Parallel Computing and Optimization Techniques · Cloud Computing and Resource Management · Big Data and Digital Economy