TokenStack: A Heterogeneous HBM-PIM Architecture and Runtime for Efficient LLM Inference

Zhuoran Li; Zhuohang Bian; Zihao Huang; Guangyu Sun; Yun Liang; Youwei Zhuo

arXiv:2605.05639 · cs.AR · May 8, 2026

TL;DR

TokenStack is a heterogeneous HBM-PIM architecture and runtime that accelerates large language model inference by splitting each memory stack into dense capacity layers and PIM-enabled compute layers, so that near-memory compute is spent on hot key-value (KV) blocks rather than on all data.

Contribution

The paper proposes a vertically heterogeneous HBM-PIM design with a dedicated control die and topology-aware KV placement to improve LLM serving performance.

Findings

01. Increases token throughput by 1.62x over AttAcc.

02. Enhances serving capacity by 1.70x while maintaining SLOs.

03. Reduces per-token energy consumption by 30-47%.

Abstract

Large language model (LLM) serving is now limited by the key-value (KV) cache. During decode, each new token rereads prior KV state, so attention becomes a bandwidth- and capacity-heavy memory task. HBM-PIM helps by moving attention closer to memory, but current stack organizations still waste resources. In practice, only hot KV blocks benefit from near-memory compute. Weights, activations, and cold KV mainly need dense storage and GPU-visible bandwidth. A uniform HBM-PIM stack makes all layers pay for PIM logic, while a dedicated-PIM design such as AttAcc recovers capacity but shrinks the HBM bandwidth left for GPU-side work. We propose TokenStack, a vertically heterogeneous HBM-PIM architecture for KV-centric LLM serving that leverages HBM4's logic-die substrate. TokenStack separates each stack into dense capacity layers and PIM-enabled compute layers, then uses the logic base die as…
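The abstract's claim that decode-time attention is a bandwidth- and capacity-heavy memory task can be made concrete with a rough back-of-the-envelope calculation. The sketch below is illustrative only: the model shape, context length, batch size, and bandwidth figure are hypothetical values chosen for the example and are not taken from the paper. It uses the standard KV-cache sizing rule (one key and one value vector per layer per KV head) and assumes every decode step streams the full cache once.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelShape:
    """Hypothetical decoder shape for illustration; not from the paper."""
    n_layers: int
    n_kv_heads: int
    head_dim: int
    bytes_per_elem: int  # e.g. 2 for FP16/BF16


def kv_bytes_per_token(m: ModelShape) -> int:
    # One K vector and one V vector per layer per KV head.
    return 2 * m.n_layers * m.n_kv_heads * m.head_dim * m.bytes_per_elem


def kv_bytes_read_per_step(m: ModelShape, context_len: int, batch: int) -> int:
    # Each decode step rereads the KV state of every prior token
    # for every sequence in the batch.
    return kv_bytes_per_token(m) * context_len * batch


def min_attention_time_s(bytes_read: int, bandwidth_bytes_per_s: float) -> float:
    # Lower bound: time to stream the KV cache once at the given bandwidth.
    return bytes_read / bandwidth_bytes_per_s


# Assumed example values (hypothetical).
shape = ModelShape(n_layers=32, n_kv_heads=8, head_dim=128, bytes_per_elem=2)
per_token = kv_bytes_per_token(shape)
per_step = kv_bytes_read_per_step(shape, context_len=8192, batch=16)

print(f"KV bytes per token: {per_token}")
print(f"KV bytes read per decode step: {per_step / 2**30:.1f} GiB")
print(f"Streaming time at 1 TB/s: {min_attention_time_s(per_step, 1e12) * 1e3:.2f} ms")
```

Under these assumed numbers the cache traffic per step grows linearly with both context length and batch size, which is the pressure that motivates moving attention over hot KV blocks closer to memory.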
