Pimba: A Processing-in-Memory Acceleration for Post-Transformer Large Language Model Serving

Wonung Kim; Yubin Lee; Yoonsung Kim; Jinwoo Hwang; Seongryong Oh; Jiyong Jung; Aziz Huseynov; Woong Gyu Park; Chang Hyun Park; Divya Mahajan; Jongse Park

arXiv:2507.10178 · cs.AR · September 17, 2025

TL;DR

Pimba is a processing-in-memory (PIM) serving system for both transformer and post-transformer LLMs. It targets the memory-bandwidth bottleneck of batched inference while keeping hardware cost in check, and reports up to 4.1x higher token throughput than GPU-based systems.

Contribution

The paper presents a novel PIM architecture with shared processing units for efficient state updates and attention, enabling unified serving of diverse LLM architectures.

Findings

01. Achieves up to 4.1x higher token throughput compared to GPU-based systems.

02. Effectively supports both transformer and post-transformer LLMs within a unified framework.

03. Reduces hardware costs by optimizing low-precision arithmetic methods.

Abstract

Transformers are the driving force behind today's Large Language Models (LLMs), serving as the foundation for their performance and versatility. Yet, their compute and memory costs grow with sequence length, posing scalability challenges for long-context inferencing. In response, the algorithm community is exploring alternative architectures, such as state space models (SSMs), linear attention, and recurrent neural networks (RNNs), which we refer to as post-transformers. This shift presents a key challenge: building a serving system that efficiently supports both transformer and post-transformer LLMs within a unified framework. To address this challenge, we analyze the performance characteristics of transformer and post-transformer LLMs. Despite their algorithmic differences, both are fundamentally limited by memory bandwidth under batched inference due to attention in transformers and…
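The abstract's claim that attention is memory-bandwidth-bound under batched inference can be illustrated with a back-of-the-envelope estimate. The sketch below is not taken from the paper. It assumes a single decode step, 16-bit KV-cache entries, and a simplified cost model that counts only the two matrix-vector products over the cache (softmax, the query read, and the output write are ignored). The function names and parameters are hypothetical.

```python
def decode_attention_flops(seq_len, head_dim, num_heads, batch=1):
    """Approximate FLOPs for one decode step of attention.

    Per head, q.K^T and scores.V are each a matrix-vector product over
    seq_len cached vectors of size head_dim: 2 * seq_len * head_dim FLOPs each.
    """
    per_head = 2 * (2 * seq_len * head_dim)
    return per_head * num_heads * batch


def decode_attention_bytes(seq_len, head_dim, num_heads, batch=1, bytes_per_elem=2):
    """Approximate bytes read from the KV cache for one decode step.

    K and V are each read once. Every request in the batch has its own
    cache, so the traffic scales linearly with batch size.
    """
    per_head = 2 * seq_len * head_dim * bytes_per_elem
    return per_head * num_heads * batch


def arithmetic_intensity(seq_len, head_dim, num_heads, batch=1, bytes_per_elem=2):
    """FLOPs performed per byte of KV-cache traffic."""
    flops = decode_attention_flops(seq_len, head_dim, num_heads, batch)
    traffic = decode_attention_bytes(seq_len, head_dim, num_heads, batch, bytes_per_elem)
    return flops / traffic


if __name__ == "__main__":
    # Illustrative shapes only (assumed, not from the paper).
    for batch in (1, 8, 64):
        ai = arithmetic_intensity(seq_len=4096, head_dim=128, num_heads=32, batch=batch)
        print(f"batch={batch:3d}  FLOPs/byte={ai:.2f}")
```

Under these assumptions the intensity stays at about one FLOP per byte regardless of batch size or sequence length, because batching adds KV-cache traffic in proportion to the added compute. Weight-bound layers behave differently, since a batch reuses the same weights. That low, batch-independent ratio is the usual reason decode-time attention is treated as bandwidth-limited rather than compute-limited.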
