TokenStack: A Heterogeneous HBM-PIM Architecture and Runtime for Efficient LLM Inference
Zhuoran Li, Zhuohang Bian, Zihao Huang, Guangyu Sun, Yun Liang, Youwei Zhuo

TL;DR
TokenStack is a heterogeneous HBM-PIM architecture that accelerates large language model inference by splitting each memory stack into PIM-enabled compute layers for hot key-value (KV) data and dense capacity layers for weights, activations, and cold KV data.
Contribution
TokenStack proposes a vertically heterogeneous HBM-PIM design with a dedicated control die and topology-aware KV placement to improve LLM serving performance.
Findings
Increases token throughput by 1.62x over AttAcc.
Enhances serving capacity by 1.70x while maintaining SLOs.
Reduces per-token energy consumption by 30-47%.
Abstract
Large language model (LLM) serving is now limited by the key-value (KV) cache. During decode, each new token rereads prior KV state, so attention becomes a bandwidth- and capacity-heavy memory task. HBM-PIM helps by moving attention closer to memory, but current stack organizations still waste resources. In practice, only hot KV blocks benefit from near-memory compute. Weights, activations, and cold KV mainly need dense storage and GPU-visible bandwidth. A uniform HBM-PIM stack makes all layers pay for PIM logic, while a dedicated-PIM design such as AttAcc recovers capacity but shrinks the HBM bandwidth left for GPU-side work. We propose TokenStack, a vertically heterogeneous HBM-PIM architecture for KV-centric LLM serving that leverages HBM4's logic-die substrate. TokenStack separates each stack into dense capacity layers and PIM-enabled compute layers, then uses the logic base die as…
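The abstract's observation that each decoded token rereads prior KV state can be made concrete with a rough back-of-the-envelope calculation. The sketch below is not from the paper. The function names and the model configuration (32 layers, 8 KV heads, head dimension 128, FP16) are hypothetical values chosen for illustration, and the estimate ignores weights, activations, and any caching or compression.

```python
def kv_bytes_per_decode_step(num_layers, num_kv_heads, head_dim,
                             context_len, bytes_per_elem=2):
    """Approximate KV-cache bytes read to decode one token for one sequence.

    Assumes every layer reads the full K and V tensors for all prior
    tokens (hence the factor of 2) and that elements are stored in
    bytes_per_elem bytes (2 for FP16). Illustrative only.
    """
    elems_per_token = num_layers * num_kv_heads * head_dim
    return 2 * elems_per_token * context_len * bytes_per_elem


def min_read_time_s(num_bytes, bandwidth_bytes_per_s):
    """Lower bound on read time if the step is purely bandwidth-bound."""
    return num_bytes / bandwidth_bytes_per_s


if __name__ == "__main__":
    # Hypothetical configuration, not taken from the paper.
    for ctx in (1024, 4096, 16384):
        n = kv_bytes_per_decode_step(32, 8, 128, ctx)
        print(f"context={ctx:6d}  KV read per token = {n / 2**20:.0f} MiB")
```

Under these assumptions the KV traffic per decoded token grows linearly with context length, which is why attention during decode is treated as a bandwidth- and capacity-heavy memory task rather than a compute-bound one.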
