PAPI: Exploiting Dynamic Parallelism in Large Language Model Decoding with a Processing-In-Memory-Enabled Computing System

Yintao He; Haiyu Mao; Christina Giannoula; Mohammad Sadrosadati; Juan Gómez-Luna; Huawei Li; Xiaowei Li; Ying Wang; Onur Mutlu

arXiv:2502.15470 · cs.AR · February 28, 2025

TL;DR

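This paper introduces PAPI, a PIM-enabled heterogeneous system that speeds up large language model decoding by scheduling kernels at runtime according to their current characteristics (for example, whether a kernel is memory-bound), reporting 1.8× and 11.1× speedups over state-of-the-art heterogeneous and PIM-only LLM accelerators, respectively.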

Contribution

PAPI's dynamic kernel characterization and scheduling approach adapts to changing kernel behavior at runtime, outperforming both statically mapped heterogeneous architectures and homogeneous (PIM-only) architectures.

Findings

01. Achieves a 1.8× speedup over a state-of-the-art heterogeneous LLM accelerator.

02. Achieves an 11.1× speedup over a state-of-the-art PIM-only LLM accelerator.

03. Effectively handles dynamic kernel behavior in LLM decoding.

Abstract

Large language models (LLMs) are widely used for natural language understanding and text generation. An LLM relies on a time-consuming step called LLM decoding to generate output tokens. Several prior works focus on improving the performance of LLM decoding using parallelism techniques, such as batching and speculative decoding. State-of-the-art LLM decoding has both compute-bound and memory-bound kernels. Some prior works statically identify and map these different kernels to a heterogeneous architecture consisting of both processing-in-memory (PIM) units and computation-centric accelerators. We observe that characteristics of LLM decoding kernels (e.g., whether or not a kernel is memory-bound) can change dynamically due to parameter changes to meet user and/or system demands, making (1) static kernel mapping to PIM units and computation-centric accelerators suboptimal, and (2)…
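The observation that a kernel can shift between memory-bound and compute-bound as parallelism parameters change can be illustrated with a simple roofline-style estimate. The sketch below is not PAPI's scheduler. It assumes a hypothetical fully connected kernel whose weights are read once per step, treats the number of tokens processed in parallel as batch size times speculation length, and uses a made-up threshold of 100 FLOPs per byte to pick a device. All names (`fc_kernel`, `choose_device`, `THRESHOLD_FLOPS_PER_BYTE`) are illustrative assumptions.

```python
from dataclasses import dataclass


@dataclass
class Kernel:
    flops: float        # floating-point operations performed
    bytes_moved: float  # bytes read from or written to memory


def fc_kernel(tokens: int, d_in: int, d_out: int, bytes_per_elem: int = 2) -> Kernel:
    """Cost model for a fully connected layer applied to `tokens` vectors at once.

    Assumption: the weight matrix is read once, and each token's input and
    output vectors are moved once.
    """
    flops = 2.0 * tokens * d_in * d_out
    bytes_moved = bytes_per_elem * (d_in * d_out + tokens * d_in + tokens * d_out)
    return Kernel(flops, bytes_moved)


def arithmetic_intensity(kernel: Kernel) -> float:
    """FLOPs per byte moved; low values indicate a memory-bound kernel."""
    return kernel.flops / kernel.bytes_moved


# Hypothetical cut-off, not a value from the paper.
THRESHOLD_FLOPS_PER_BYTE = 100.0


def choose_device(kernel: Kernel, threshold: float = THRESHOLD_FLOPS_PER_BYTE) -> str:
    """Map memory-bound kernels to PIM and compute-bound ones to the accelerator."""
    return "accelerator" if arithmetic_intensity(kernel) >= threshold else "pim"


if __name__ == "__main__":
    # (batch size, speculation length) pairs chosen only for illustration.
    for batch_size, spec_len in [(1, 1), (4, 4), (64, 4)]:
        tokens = batch_size * spec_len
        k = fc_kernel(tokens, d_in=4096, d_out=4096)
        print(batch_size, spec_len, round(arithmetic_intensity(k), 2), choose_device(k))
```

Under these assumptions the same kernel lands on different devices depending on the runtime parameters, which is why a mapping fixed ahead of time can be suboptimal. A real system would replace the fixed threshold with measured hardware characteristics.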
