PAM: Processing Across Memory Hierarchy for Efficient KV-centric LLM Serving System

Lian Liu; Shixin Zhao; Yutian Zhou; Yintao He; Mengdi Wang; Yinhe Han; Ying Wang

arXiv:2602.11521 · cs.AR · February 13, 2026

TL;DR

PAM is a KV-centric LLM serving system that coordinates heterogeneous Processing-In-Memory (PIM) devices within a hierarchical memory architecture, so that KV-related operations (attention computation and KV cache storage) get both the bandwidth and the capacity they require.

Contribution

The paper introduces PAM, a hierarchical, heterogeneous memory system with new algorithms for efficient KV processing in LLM serving, addressing bandwidth and capacity bottlenecks.

Findings

01. PAM significantly improves LLM serving efficiency.

02. PAM balances memory bandwidth and capacity effectively.

03. PAM demonstrates scalability in large-scale AI deployments.

Abstract

The widespread adoption of Large Language Models (LLMs) has exponentially increased the demand for efficient serving systems. With growing requests and context lengths, key-value (KV)-related operations, including attention computation and KV cache storage, have emerged as critical bottlenecks. They require massive memory bandwidth and capacity. Unfortunately, existing LLM serving systems, optimized for compute-bound workloads, fail to handle these memory-intensive operations effectively. Even with Processing-In-Memory (PIM) technology, current single-level memory designs cannot simultaneously satisfy the bandwidth and capacity requirements. To address these challenges, we propose Processing Across Memory (PAM), a KV-centric LLM serving system that coordinates heterogeneous PIM-enabled memory devices within a hierarchical architecture. PAM introduces a novel computing paradigm to…
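
To make the bandwidth and capacity pressure described in the abstract concrete, here is a minimal back-of-the-envelope sketch of KV cache sizing. It uses the standard estimate (keys plus values, per layer, per KV head, per token). The model shape, batch size, and decode rate below are illustrative assumptions, not figures from the paper, and the function names are hypothetical.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len,
                   batch_size=1, bytes_per_elem=2):
    """Approximate KV cache footprint in bytes.

    The leading factor of 2 accounts for storing both keys and values.
    bytes_per_elem=2 assumes FP16/BF16 storage (an assumption).
    """
    return (2 * num_layers * num_kv_heads * head_dim
            * seq_len * batch_size * bytes_per_elem)


def decode_read_bandwidth(cache_bytes, decode_steps_per_second):
    """Lower bound on KV read bandwidth in bytes/s during decoding.

    Assumes each decode step reads the whole KV cache once and
    ignores weight traffic, so real demand is higher.
    """
    return cache_bytes * decode_steps_per_second


GIB = 1024 ** 3

# Hypothetical model shape: 32 layers, 32 KV heads, head dimension 128.
per_seq = kv_cache_bytes(num_layers=32, num_kv_heads=32, head_dim=128,
                         seq_len=4096)
batch = kv_cache_bytes(num_layers=32, num_kv_heads=32, head_dim=128,
                       seq_len=4096, batch_size=32)

print(f"KV cache per 4096-token sequence: {per_seq / GIB:.1f} GiB")
print(f"KV cache for a batch of 32:       {batch / GIB:.1f} GiB")
print(f"Read bandwidth at 20 steps/s:     "
      f"{decode_read_bandwidth(batch, 20) / GIB:.0f} GiB/s")
```

Under these assumed numbers, capacity grows linearly with both context length and batch size, while the bandwidth needed to stream the cache on every decode step grows with it. This is the kind of tension between bandwidth and capacity that the abstract says a single-level memory design cannot resolve.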

Taxonomy

Topics: Big Data and Digital Economy · Cloud Computing and Resource Management · Parallel Computing and Optimization Techniques