PIMphony: Overcoming Bandwidth and Capacity Inefficiency in PIM-based Long-Context LLM Inference System

Hyucksung Kwon; Kyungmo Koo; Janghyeon Kim; Woongkyu Lee; Minjae Lee; Gyeonggeun Jung; Hyungdeok Lee; Yousub Jung; Jaehan Park; Yosub Song; Byeongsu Yang; Haerang Choi; Guhyun Kim; Jongsoon Won; Woojae Shin; Changhyun Kim; Gyeongcheol Shin; Yongkee Kwon; Ilkon Kim; Euicheol Lim; John Kim; Jungwook Choi

arXiv:2412.20166 · cs.AR · December 29, 2025

Open Access

TL;DR

PIMphony is a PIM orchestrator for long-context LLM inference that tackles channel underutilization, I/O bottlenecks, and memory waste from static KV cache management with three co-designed techniques, improving throughput by up to 11.3x on PIM-only systems.

Contribution

The paper introduces three co-designed techniques, Token-Centric PIM Partitioning (TCP), Dynamic PIM Command Scheduling (DCS), and a Dynamic PIM Access (DPA) controller, that together optimize PIM-based long-context LLM inference.

Findings

01. Up to 11.3x throughput improvement on PIM-only systems.

02. Significant reduction in memory waste and I/O bottlenecks.

03. Enables efficient deployment of large long-context LLMs.

Abstract

The expansion of long-context Large Language Models (LLMs) creates significant memory system challenges. While Processing-in-Memory (PIM) is a promising accelerator, we identify that it suffers from critical inefficiencies when scaled to long contexts: severe channel underutilization, performance-limiting I/O bottlenecks, and massive memory waste from static KV cache management. In this work, we propose PIMphony, a PIM orchestrator that systematically resolves these issues with three co-designed techniques. First, Token-Centric PIM Partitioning (TCP) ensures high channel utilization regardless of batch size. Second, Dynamic PIM Command Scheduling (DCS) mitigates the I/O bottleneck by overlapping data movement and computation. Finally, a Dynamic PIM Access (DPA) controller enables dynamic memory management to eliminate static memory waste. Implemented via an MLIR-based compiler and…
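
The abstract names three inefficiencies without giving implementation details, so the following is only a toy model of each one, not the paper's method. Every function name, the round-robin placement, the hypothetical baseline that assigns one (request, head) pair per channel, the unlimited-buffer pipeline, and the page-based allocator are assumptions made here for illustration.

```python
import math


def channel_utilization(work_items: int, channels: int) -> float:
    """Fraction of channel slots kept busy when equal-sized work items
    are dealt round-robin across channels (assumed placement policy)."""
    if work_items == 0:
        return 0.0
    rounds = math.ceil(work_items / channels)
    return work_items / (rounds * channels)


def serial_time(io, compute):
    """Total time when every transfer finishes before its compute starts
    and nothing overlaps."""
    return sum(io) + sum(compute)


def overlapped_time(io, compute):
    """Total time for a two-stage pipeline: transfers run back to back,
    and compute on chunk i starts once its data has arrived and the
    previous compute has finished (assumes unlimited buffering)."""
    io_done = 0.0
    compute_done = 0.0
    for t_io, t_compute in zip(io, compute):
        io_done += t_io
        compute_done = max(compute_done, io_done) + t_compute
    return compute_done


def static_waste(context_lens, max_len):
    """Unused token slots when every request reserves max_len up front."""
    return sum(max_len - n for n in context_lens)


def paged_waste(context_lens, page):
    """Unused token slots when memory is granted in fixed-size pages
    on demand (hypothetical dynamic allocator)."""
    return sum(math.ceil(n / page) * page - n for n in context_lens)


if __name__ == "__main__":
    # Hypothetical setup: batch of 1, 8 heads, 32 channels, 4096 tokens.
    print("per-(request, head) utilization:", channel_utilization(1 * 8, 32))
    print("per-token utilization:", channel_utilization(4096, 32))

    # Three chunks, each with 2 time units of I/O and 3 of compute.
    io, compute = [2, 2, 2], [3, 3, 3]
    print("serial:", serial_time(io, compute))
    print("overlapped:", overlapped_time(io, compute))

    # Two requests of very different lengths, 32768-token static reservation.
    lens = [1000, 30000]
    print("static waste:", static_waste(lens, 32768))
    print("paged waste:", paged_waste(lens, 256))
```

In this toy setting, spreading work by token keeps all channels busy even at batch size 1, overlapping hides part of the transfer time behind computation, and on-demand allocation wastes at most one partial page per request. The real system's partitioning, scheduling, and controller design may differ substantially.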

Taxonomy

Topics: semigroups and automata theory

Methods: Softmax · Attention Is All You Need · Dual Multimodal Attention