Hexagon-MLIR: An AI Compilation Stack For Qualcomm's Neural Processing Units (NPUs)

Mohammed Javed Absar; Muthu Baskaran; Abhikrant Sharma; Abhilash Bhandari; Ankit Aggarwal; Arun Rangasamy; Dibyendu Das; Fateme Hosseini; Franck Slama; Iulian Brumar; Jyotsna Verma; Krishnaprasad Bindumadhavan; Mitesh Kothari; Mohit Gupta; Ravishankar Kolachana; Richard Lethin; Samarth Narang; Sanjay Motilal Ladwa; Shalini Jain; Snigdha Suresh Dalvi; Tasmia Rahman; Venkat Rasagna Reddy Komatireddy; Vivek Vasudevbhai Pandya; Xiyue Shi; Zachary Zipper

arXiv:2602.19762·cs.PL·February 24, 2026

Open Access

TL;DR

Hexagon-MLIR is an open-source, MLIR-based compilation stack for Qualcomm's Hexagon NPUs. It compiles Triton kernels and PyTorch models to NPU binaries automatically and generates mega-kernels that keep data in the NPU's Tightly Coupled Memory (TCM), which speeds up deployment and reduces memory-bandwidth bottlenecks.

Contribution

The paper introduces a unified, open-source, MLIR-based compilation stack for Qualcomm Hexagon NPUs that lowers both Triton kernels and PyTorch models and generates mega-kernels that keep data local to the NPU's TCM.

Findings

1. Automates compilation from Triton kernels to NPU binaries.
2. Maximizes data locality in TCM to reduce bandwidth bottlenecks.
3. Supports faster deployment of AI workloads on Qualcomm NPUs.

Abstract

In this paper, we present Hexagon-MLIR, an open-source compilation stack that targets the Qualcomm Hexagon Neural Processing Unit (NPU) and provides unified support for lowering Triton kernels and PyTorch models. Built using the MLIR framework, our compiler applies a structured sequence of passes to exploit NPU architectural features to accelerate AI workloads. It enables faster deployment of new Triton kernels (hand-written or subgraphs from PyTorch 2.0) for our target by providing automated compilation from kernel to binary. By ingesting Triton kernels, we generate mega-kernels that maximize data locality in the NPU's Tightly Coupled Memory (TCM), reducing the bandwidth bottlenecks inherent in library-based approaches. This initiative complements our commercial toolchains by providing developers with an open-source MLIR-based compilation stack that gives them a path to advance AI…
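
The data-locality argument in the abstract can be illustrated with a toy traffic model. The sketch below is not the Hexagon-MLIR API or the authors' algorithm; every name in it (run_unfused, run_fused, the tile size) is a hypothetical stand-in, and the "local buffer" is ordinary Python memory used only to mimic a TCM-resident tile. It counts how many elements cross the main-memory boundary when a chain of elementwise ops runs one library call at a time versus as a single fused kernel.

```python
# Toy model: count elements moved to or from "main memory".
# All names are hypothetical illustrations, not part of Hexagon-MLIR.

def run_unfused(x, ops):
    """Library style: every op reads its full input from main memory
    and writes its full output back."""
    traffic = 0
    data = list(x)
    for op in ops:
        traffic += len(data)              # read the whole tensor
        data = [op(v) for v in data]
        traffic += len(data)              # write the whole tensor
    return data, traffic


def run_fused(x, ops, tile=4):
    """Fused style: load each tile once into a small local buffer,
    apply every op there, then store the result once."""
    traffic = 0
    out = []
    for start in range(0, len(x), tile):
        local = list(x[start:start + tile])   # stand-in for a TCM-resident tile
        traffic += len(local)                 # single read from main memory
        for op in ops:
            local = [op(v) for v in local]    # intermediates never leave the tile
        out.extend(local)
        traffic += len(local)                 # single write to main memory
    return out, traffic


# A three-op chain: scale, shift, ReLU.
ops = [lambda v: v * 2.0, lambda v: v + 1.0, lambda v: max(v, 0.0)]
x = [float(i - 8) for i in range(16)]

unfused_out, unfused_traffic = run_unfused(x, ops)
fused_out, fused_traffic = run_fused(x, ops)
```

Under this model both variants compute the same result, but the unfused version moves each intermediate tensor through main memory, so its traffic grows with the number of ops, while the fused version's traffic stays at one read plus one write per element. Real NPU behavior also depends on TCM capacity, DMA scheduling, and vectorization, none of which this sketch models.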

Taxonomy

Topics: Advanced Neural Network Applications · Parallel Computing and Optimization Techniques · Machine Learning in Materials Science