Analyzing Latency Hiding and Parallelism in an MLIR-based AI Kernel Compiler

Javed Absar; Samarth Narang; Muthu Baskaran

arXiv:2602.20204 · cs.PL · February 25, 2026

Open Access

TL;DR

This paper evaluates how vectorization, multi-threading, and double buffering techniques improve latency hiding and parallelism in an MLIR-based AI kernel compiler, providing a benchmark methodology and quantitative insights.

Contribution

The paper introduces a benchmark methodology for analyzing three compiler-controlled mechanisms (vectorization, multi-threading, and double buffering) and quantifies their impact on kernel performance in an MLIR-based compilation pipeline.

Findings

01. Vectorization provides the primary gain for bandwidth-sensitive kernels.

02. Multi-threading delivers substantial improvements at larger problem sizes.

03. Double buffering improves performance by overlapping DMA transfers with computation.
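
The overlap behind finding 03 can be illustrated with a simple idealized timing model. The sketch below is not taken from the paper: the function names and the per-tile costs are hypothetical, and it assumes constant load and compute times per tile with exactly two buffers. It only shows why overlapping transfers with compute shortens the total time.

```python
def serial_time(load, compute, tiles):
    """No overlap: every tile is loaded, then computed."""
    return tiles * (load + compute)


def double_buffered_time(load, compute, tiles):
    """Idealized two-buffer pipeline.

    The first load is fully exposed. After that, loading tile i+1
    overlaps with computing tile i, so each step costs the slower
    of the two. The last tile's compute has nothing left to overlap.
    """
    if tiles == 0:
        return 0
    return load + (tiles - 1) * max(load, compute) + compute


if __name__ == "__main__":
    # Hypothetical per-tile costs in arbitrary time units, not measurements.
    load, compute, tiles = 4, 6, 8
    print("serial:         ", serial_time(load, compute, tiles))
    print("double-buffered:", double_buffered_time(load, compute, tiles))
```

Under these assumptions the steady-state cost per tile drops from `load + compute` to `max(load, compute)`, so the benefit is largest when the two costs are roughly balanced.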

Abstract

AI kernel compilation for edge devices depends on the compiler's ability to exploit parallelism and hide memory latency in the presence of hierarchical memory and explicit data movement. This paper reports a benchmark methodology and corresponding results for three compiler-controlled mechanisms in an MLIR-based compilation pipeline: vectorization (Vec), multi-threading (MT) across hardware contexts, and double buffering (DB) using ping-pong scratchpad buffers to overlap DMA transfers with compute. Using Triton/Inductor-generated kernels, we present an ablation ladder that separates the contribution of Vec, MT, and DB, and we quantify how MT speedup scales with problem size using GELU as a representative activation kernel. The results show that vectorization provides the primary gain for bandwidth-sensitive kernels, MT delivers substantial improvements once scheduling overhead is…
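
To make the ping-pong buffering pattern from the abstract concrete, here is a minimal sketch of a tiled GELU loop that alternates between two buffers. It is an illustration only and not the compiler's generated code: `gelu_ping_pong` and the tile size are hypothetical, the "DMA load" is a plain list slice, and the loop runs sequentially, so it shows the buffer rotation rather than real overlap.

```python
import math


def gelu(x):
    """Exact GELU: x * Phi(x), where Phi is the standard normal CDF."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def gelu_ping_pong(data, tile=4):
    """Tile-wise GELU over two alternating buffers (sequential simulation)."""
    out = []
    n_tiles = (len(data) + tile - 1) // tile
    if n_tiles == 0:
        return out

    buffers = [None, None]        # ping and pong scratchpad stand-ins
    buffers[0] = data[0:tile]     # prologue: load the first tile

    for i in range(n_tiles):
        cur, nxt = i % 2, (i + 1) % 2
        if i + 1 < n_tiles:
            # On real hardware this would be an asynchronous DMA issued
            # before the compute below; here it is an ordinary copy.
            start = (i + 1) * tile
            buffers[nxt] = data[start:start + tile]
        # Compute on the current buffer while the other one is being filled.
        out.extend(gelu(v) for v in buffers[cur])
    return out


if __name__ == "__main__":
    values = [0.5 * k for k in range(-5, 6)]
    assert gelu_ping_pong(values, tile=4) == [gelu(v) for v in values]
    print(gelu_ping_pong(values, tile=4)[:3])
```

The key structural point is that the load for tile `i + 1` targets the buffer that tile `i` is not using, which is what lets a real DMA engine run concurrently with compute.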

Taxonomy

Topics: Parallel Computing and Optimization Techniques · Advanced Data Storage Technologies · Embedded Systems Design Techniques