A Tensor Compiler for Processing-In-Memory Architectures

Peiming Yang; Sankeerth Durvasula; Ivan Fernandez; Mohammad Sadrosadati; Onur Mutlu; Gennady Pekhimenko; Christina Giannoula

arXiv:2511.15503 · cs.AR · November 20, 2025

TL;DR

This paper introduces DCC, a novel data-centric ML compiler for PIM architectures that jointly optimizes data rearrangements and compute code, significantly accelerating ML kernels and LLM inference.

Contribution

DCC is the first compiler to systematically co-optimize data rearrangements and compute code across diverse PIM backends in a unified framework.

Findings

01. Achieves up to 7.68x speedup on HBM-PIM
02. Achieves up to 13.17x speedup on AttAcc PIM
03. Accelerates GPT-3 and LLaMA-2 by up to 7.71x over GPU

Abstract

Processing-In-Memory (PIM) devices integrated with high-performance Host processors (e.g., GPUs) can accelerate memory-intensive kernels in Machine Learning (ML) models, including Large Language Models (LLMs), by leveraging high memory bandwidth at PIM cores. However, Host processors and PIM cores require different data layouts: Hosts need consecutive elements distributed across DRAM banks, while PIM cores need them within local banks. This necessitates data rearrangements in ML kernel execution that pose significant performance and programmability challenges, further exacerbated by the need to support diverse PIM backends. Current compilation approaches lack systematic optimization for diverse ML kernels across multiple PIM backends and may largely ignore data rearrangements during compute code optimization. We demonstrate that data rearrangements and compute code optimization are…
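
The layout mismatch described in the abstract can be illustrated with a toy address model. The sketch below is not DCC's algorithm or any real PIM device's address mapping; the function names, the single-element interleaving granularity, and the flat (bank, offset) address are all assumptions made for illustration. It only shows why the same logical array lands in different physical places under a Host-friendly layout and a PIM-friendly layout, and therefore why a rearrangement step is needed between them.

```python
# Toy model (assumed, not from the paper): a 1-D array of
# num_banks * elems_per_bank elements stored across DRAM banks.
# A physical location is modeled as a (bank, offset) pair.

def host_layout(i, num_banks, elems_per_bank):
    """Host-friendly: consecutive elements are spread across banks."""
    return (i % num_banks, i // num_banks)

def pim_layout(i, num_banks, elems_per_bank):
    """PIM-friendly: consecutive elements stay inside one local bank."""
    return (i // elems_per_bank, i % elems_per_bank)

def rearrangement(num_banks, elems_per_bank):
    """Map each Host-layout location to its PIM-layout location."""
    n = num_banks * elems_per_bank
    return {
        host_layout(i, num_banks, elems_per_bank):
            pim_layout(i, num_banks, elems_per_bank)
        for i in range(n)
    }

def count_moved(num_banks, elems_per_bank):
    """Number of elements whose location differs between the layouts."""
    moves = rearrangement(num_banks, elems_per_bank)
    return sum(1 for src, dst in moves.items() if src != dst)

if __name__ == "__main__":
    banks, per_bank = 4, 2
    for i in range(banks * per_bank):
        print(i, host_layout(i, banks, per_bank), pim_layout(i, banks, per_bank))
    print("elements that must move:", count_moved(banks, per_bank))
```

In this toy configuration with 4 banks and 2 elements per bank, only the first and last elements share a location in both layouts; every other element has to move. Real devices add channels, rows, and coarser interleaving granularities, which this model deliberately ignores.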

Taxonomy

Topics: Parallel Computing and Optimization Techniques · Ferroelectric and Negative Capacitance Devices · Advanced Neural Network Applications