PIM-GPT: A Hybrid Process-in-Memory Accelerator for Autoregressive Transformers
Yuting Wu, Ziyu Wang, Wei D. Lu

TL;DR
PIM-GPT is a DRAM-based process-in-memory (PIM) accelerator that speeds up GPT inference and improves its energy efficiency by executing matrix operations directly within the memory chips.
Contribution
This work introduces PIM-GPT, the first DRAM-based PIM architecture tailored for GPT inference, combining hardware design and software mapping for end-to-end acceleration.
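To make "executing matrix operations inside memory" concrete, the sketch below distributes the rows of a weight matrix across DRAM banks. Each bank computes dot products on its local rows, and only the short per-bank result vectors are gathered. This is an illustrative sketch under stated assumptions, not the paper's actual mapping scheme: the row-wise split, the bank count, and the helper names (`partition_rows`, `bank_mac`, `pim_gemv`) are all hypothetical.

```python
# Hypothetical sketch: row-wise partitioning of a matrix-vector product (GEMV)
# across DRAM banks. Not PIM-GPT's real mapping; names are illustrative only.
from typing import List


def partition_rows(num_rows: int, num_banks: int) -> List[range]:
    """Split row indices into num_banks contiguous, nearly equal ranges."""
    base, extra = divmod(num_rows, num_banks)
    ranges = []
    start = 0
    for b in range(num_banks):
        size = base + (1 if b < extra else 0)
        ranges.append(range(start, start + size))
        start += size
    return ranges


def bank_mac(rows: List[List[float]], x: List[float]) -> List[float]:
    """Multiply-accumulate over the rows stored locally in one bank."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in rows]


def pim_gemv(weights: List[List[float]], x: List[float], num_banks: int) -> List[float]:
    """Compute y = W @ x: x is broadcast to every bank, partial outputs are gathered."""
    y: List[float] = []
    for rows in partition_rows(len(weights), num_banks):
        local = [weights[r] for r in rows]  # weight rows stay inside the bank
        y.extend(bank_mac(local, x))        # only len(rows) results leave the bank
    return y


if __name__ == "__main__":
    W = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
    x = [1.0, 1.0]
    print(pim_gemv(W, x, num_banks=2))  # [3.0, 7.0, 11.0]
```

The point of the row-wise split is that the large operand (the weight matrix) never crosses the chip boundary. Only the input vector is broadcast in and the much smaller output vector is read out.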
Findings
Achieves up to 137x speedup over a GPU baseline
Provides up to 602x energy efficiency improvement
Supports multiple GPT models with up to 1.4 billion parameters
Abstract
Decoder-only Transformer models such as GPT have demonstrated exceptional performance in text generation by autoregressively predicting the next token. However, the efficiency of running GPT on current hardware systems is limited by a low compute-to-memory ratio and heavy memory access. Process-in-memory (PIM) architectures can minimize off-chip data movement and exploit high internal bandwidth, which makes them promising candidates for accelerating memory-bound tasks such as GPT inference. In this work, we propose a PIM accelerator, PIM-GPT, which achieves end-to-end acceleration of GPT inference with high performance and high energy efficiency. PIM-GPT leverages DRAM-based PIM designs to execute multiply-accumulate (MAC) operations directly in the DRAM chips, eliminating the need to move matrix data off-chip. Non-linear functions and data communication are supported by an…
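The "low compute-to-memory ratio" of autoregressive decoding can be checked with simple arithmetic. When one token is generated at a time, each weight matrix is used in a matrix-vector product, so every weight fetched from memory supports only about one multiply and one add. The sketch below computes this arithmetic intensity (FLOPs per byte moved) for a single linear layer. The layer size of 2048 x 2048, the 2-byte (fp16) element size, and the function name are hypothetical choices for illustration, not values from the paper.

```python
# Hypothetical sketch: arithmetic intensity (FLOPs per byte) of one linear layer.
# Assumes weights and activations are each read/written once from memory.


def arithmetic_intensity(d_in: int, d_out: int, tokens: int, bytes_per_elem: int = 2) -> float:
    """FLOPs / bytes for Y = X @ W^T with X of shape (tokens, d_in), W of shape (d_out, d_in)."""
    flops = 2 * tokens * d_in * d_out  # one multiply and one add per weight per token
    bytes_moved = bytes_per_elem * (
        d_in * d_out        # weight matrix
        + tokens * d_in     # input activations
        + tokens * d_out    # output activations
    )
    return flops / bytes_moved


if __name__ == "__main__":
    # Autoregressive decode: one token at a time -> GEMV.
    print(f"decode (1 token):    {arithmetic_intensity(2048, 2048, 1):.2f} FLOP/byte")    # 1.00
    # Processing many tokens at once (e.g. a prompt) -> GEMM.
    print(f"batched (512 tokens): {arithmetic_intensity(2048, 2048, 512):.2f} FLOP/byte")  # 341.33
```

At roughly 1 FLOP per byte, a single-token decode step sits far below the ratio at which a modern GPU becomes compute-bound. Throughput is therefore set by memory bandwidth, which is the bottleneck that performing MAC operations inside the DRAM is meant to relieve.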
Taxonomy
Topics: Advancements in Semiconductor Devices and Circuit Design · Semiconductor materials and devices · Advanced Data Storage Technologies
