A Tensor Compiler for Processing-In-Memory Architectures
Peiming Yang, Sankeerth Durvasula, Ivan Fernandez, Mohammad Sadrosadati, Onur Mutlu, Gennady Pekhimenko, Christina Giannoula

TL;DR
This paper introduces DCC, a novel data-centric ML compiler for PIM architectures that jointly optimizes data rearrangements and compute code, significantly accelerating ML kernels and LLM inference.
Contribution
DCC is the first compiler to systematically co-optimize data rearrangements and compute code across diverse PIM backends in a unified framework.
Findings
Achieves up to 7.68x speedup on HBM-PIM
Achieves up to 13.17x speedup on AttAcc PIM
Accelerates GPT-3 and LLaMA-2 inference by up to 7.71x over a GPU
Abstract
Processing-In-Memory (PIM) devices integrated with high-performance Host processors (e.g., GPUs) can accelerate memory-intensive kernels in Machine Learning (ML) models, including Large Language Models (LLMs), by leveraging high memory bandwidth at PIM cores. However, Host processors and PIM cores require different data layouts: Hosts need consecutive elements distributed across DRAM banks, while PIM cores need them within local banks. This necessitates data rearrangements in ML kernel execution that pose significant performance and programmability challenges, further exacerbated by the need to support diverse PIM backends. Current compilation approaches lack systematic optimization for diverse ML kernels across multiple PIM backends and may largely ignore data rearrangements during compute code optimization. We demonstrate that data rearrangements and compute code optimization are…
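The layout mismatch described in the abstract can be illustrated with a toy model. The sketch below is not DCC's transformation or any PIM vendor's API: the function names (`host_layout`, `pim_layout`, `rearrange_host_to_pim`), the element-granularity round-robin interleaving, and the four-bank configuration are all assumptions chosen for illustration. It shows only why the same logical array must be physically reordered when moving from a Host-friendly layout to a PIM-friendly one.

```python
# Toy model of the two layouts; all names and parameters here are hypothetical.

def host_layout(data, num_banks):
    """Host-style layout: consecutive elements are spread round-robin across banks."""
    banks = [[] for _ in range(num_banks)]
    for i, x in enumerate(data):
        banks[i % num_banks].append(x)
    return banks


def pim_layout(data, num_banks):
    """PIM-style layout: consecutive elements stay together inside one bank."""
    chunk = -(-len(data) // num_banks)  # ceiling division
    return [data[b * chunk:(b + 1) * chunk] for b in range(num_banks)]


def rearrange_host_to_pim(banks):
    """Recover the logical order from the interleaved layout, then re-chunk per bank."""
    num_banks = len(banks)
    total = sum(len(b) for b in banks)
    linear = [banks[i % num_banks][i // num_banks] for i in range(total)]
    return pim_layout(linear, num_banks)


data = list(range(8))
host = host_layout(data, 4)        # [[0, 4], [1, 5], [2, 6], [3, 7]]
pim = rearrange_host_to_pim(host)  # [[0, 1], [2, 3], [4, 5], [6, 7]]
```

In this model every element except the first and last changes bank during the rearrangement, which gives a rough sense of why the abstract treats data movement as a cost to be weighed alongside the compute code rather than separately.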
Taxonomy
Topics: Parallel Computing and Optimization Techniques · Ferroelectric and Negative Capacitance Devices · Advanced Neural Network Applications
