High Performance GPU Code Generation for Matrix-Matrix Multiplication using MLIR: Some Early Results
Navdeep Katel, Vivek Khandelwal, Uday Bondhugula

TL;DR
This paper reports early results on automatically generating high-performance GPU code for matrix multiplication with MLIR. The generated code is competitive with cuBLAS on NVIDIA tensor cores, which points to MLIR's potential for building domain-specific libraries.
Contribution
The paper introduces an MLIR-based pipeline that automatically generates matrix-multiplication code for NVIDIA GPU tensor cores, and shows that the generated code performs competitively with the hand-tuned libraries that deep learning relies on today.
Findings
Achieved 95-119% of cuBLAS performance for FP32 on an NVIDIA GeForce RTX 3090.
Achieved 80-160% of cuBLAS performance for FP16 on an NVIDIA GeForce RTX 3090.
Demonstrated MLIR's potential for automatic, high-performance code generation.
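To make the "percent of cuBLAS" figures above concrete, here is a minimal sketch of how such a ratio is typically derived from kernel timings. It assumes the usual 2·M·N·K operation count for a matrix multiplication. The function names and the timings are hypothetical and are not taken from the paper, whose exact measurement methodology is not stated in this summary.

```python
def matmul_flops(m: int, n: int, k: int) -> int:
    """Floating-point operations for C = A @ B with A (m x k) and B (k x n).

    Each of the m*n outputs needs k multiplies and k adds.
    """
    return 2 * m * n * k


def tflops(m: int, n: int, k: int, seconds: float) -> float:
    """Achieved throughput in TFLOP/s for one matmul that took `seconds`."""
    return matmul_flops(m, n, k) / seconds / 1e12


def percent_of_baseline(generated_tflops: float, baseline_tflops: float) -> float:
    """Throughput of the generated kernel as a percentage of the baseline."""
    return 100.0 * generated_tflops / baseline_tflops


# Hypothetical timings (assumptions for illustration, not measured results).
size = 8192
generated = tflops(size, size, size, seconds=0.020)  # auto-generated kernel
baseline = tflops(size, size, size, seconds=0.019)   # library baseline

print(f"{percent_of_baseline(generated, baseline):.1f}% of baseline")
```

A value above 100% means the generated kernel ran faster than the library baseline for that problem size, which is how a range such as 80-160% can arise across different matrix shapes.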
Abstract
This report presents some early results on code generation targeting tensor cores on NVIDIA GPUs using the MLIR compiler infrastructure. The state-of-the-art in high-performance deep learning today is primarily driven by manually optimized, highly tuned libraries. The approach to develop such libraries is often not modular or reusable to the same extent that compiler infrastructure like LLVM is. Manual optimization typically does not use a standard intermediate representation (IR), although the optimizations performed can be encoded as a sequence of transformation steps and customized passes on an IR. Hand tuning may also miss exploration of design points only reachable easily by automatic code generation. We believe that until the recent introduction of MLIR (Multi-Level Intermediate Representation), IR infrastructure was not geared to tackle the problem of automatic generation of…
Taxonomy
Topics: Parallel Computing and Optimization Techniques · Advanced Data Storage Technologies · Distributed and Parallel Computing Systems
