Pimba: A Processing-in-Memory Acceleration for Post-Transformer Large Language Model Serving
Wonung Kim, Yubin Lee, Yoonsung Kim, Jinwoo Hwang, Seongryong Oh, Jiyong Jung, Aziz Huseynov, Woong Gyu Park, Chang Hyun Park, Divya Mahajan, Jongse Park

TL;DR
Pimba introduces a processing-in-memory system optimized for both transformer and post-transformer LLMs, significantly improving inference throughput by addressing memory bandwidth and hardware cost challenges.
Contribution
The paper presents a novel PIM architecture with shared processing units for efficient state updates and attention, enabling unified serving of diverse LLM architectures.
Findings
Achieves up to 4.1x higher token throughput compared to GPU-based systems.
Effectively supports both transformer and post-transformer LLMs within a unified framework.
Reduces hardware costs by optimizing low-precision arithmetic methods.
Abstract
Transformers are the driving force behind today's Large Language Models (LLMs), serving as the foundation for their performance and versatility. Yet, their compute and memory costs grow with sequence length, posing scalability challenges for long-context inferencing. In response, the algorithm community is exploring alternative architectures, such as state space models (SSMs), linear attention, and recurrent neural networks (RNNs), which we refer to as post-transformers. This shift presents a key challenge: building a serving system that efficiently supports both transformer and post-transformer LLMs within a unified framework. To address this challenge, we analyze the performance characteristics of transformer and post-transformer LLMs. Despite their algorithmic differences, both are fundamentally limited by memory bandwidth under batched inference due to attention in transformers and…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
