PAM: Processing Across Memory Hierarchy for Efficient KV-centric LLM Serving System
Lian Liu, Shixin Zhao, Yutian Zhou, Yintao He, Mengdi Wang, Yinhe Han, Ying Wang

TL;DR
PAM is a KV-centric LLM serving system that coordinates heterogeneous PIM-enabled memory devices in a hierarchical architecture, so that key-value operations get both the memory bandwidth and the memory capacity they need.
Contribution
The paper introduces PAM, a hierarchical, heterogeneous memory system with new algorithms for efficient KV processing in LLM serving, addressing bandwidth and capacity bottlenecks.
Findings
PAM significantly improves LLM serving efficiency.
PAM balances memory bandwidth and capacity effectively.
PAM demonstrates scalability in large-scale AI deployments.
Abstract
The widespread adoption of Large Language Models (LLMs) has exponentially increased the demand for efficient serving systems. With growing requests and context lengths, key-value (KV)-related operations, including attention computation and KV cache storage, have emerged as critical bottlenecks. They require massive memory bandwidth and capacity. Unfortunately, existing LLM serving systems, optimized for compute-bound workloads, fail to handle these memory-intensive operations effectively. Even with Processing-In-Memory (PIM) technology, current single-level memory designs cannot simultaneously satisfy the bandwidth and capacity requirements. To address these challenges, we propose Processing Across Memory (PAM), a KV-centric LLM serving system that coordinates heterogeneous PIM-enabled memory devices within a hierarchical architecture. PAM introduces a novel computing paradigm to…
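To make the capacity pressure described in the abstract concrete, the following is a minimal back-of-the-envelope sketch of KV cache sizing for a standard transformer decoder. It is not taken from the paper: the function name `kv_cache_bytes` and the model shape (32 layers, 32 KV heads, head dimension 128, 16-bit values) are assumptions chosen only for illustration.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim,
                   context_len, batch_size, bytes_per_elem=2):
    """Bytes needed to hold keys and values for every layer, head, and token."""
    # Factor 2: one key vector and one value vector per token, per head, per layer.
    per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem
    return per_token * context_len * batch_size


# Hypothetical model shape, for illustration only (not from the paper).
total = kv_cache_bytes(num_layers=32, num_kv_heads=32, head_dim=128,
                       context_len=4096, batch_size=1)
print(f"{total / 2**30:.1f} GiB")  # 2.0 GiB
```

Under these assumptions the cache grows linearly with both context length and the number of concurrent requests. In standard attention, each decoding step also reads a sequence's cached keys and values, which is why the same growth puts pressure on bandwidth as well as capacity.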
Taxonomy
Topics: Big Data and Digital Economy · Cloud Computing and Resource Management · Parallel Computing and Optimization Techniques
