Understand and Accelerate Memory Processing Pipeline for Disaggregated LLM Inference

Zifan He; Rui Ma; Yizhou Sun; Jason Cong

arXiv:2603.29002·cs.DC·May 12, 2026

TL;DR

This paper unifies several LLM memory optimization techniques into a four-step memory processing pipeline, measures a 22%-97% memory processing overhead in inference, and shows that heterogeneous GPU-FPGA systems can substantially accelerate inference and reduce energy consumption.

Contribution

The paper introduces a unified memory processing pipeline for LLMs, argues that heterogeneous hardware is well suited to accelerating it, and validates this on a GPU-FPGA system, reporting results up to 2.2x faster and 4.7x more energy-efficient.

Findings

01

Memory processing overhead in LLM inference ranges from 22% to 97%.

02

Heterogeneous GPU-FPGA systems can be up to 2.2x faster and 4.7x more energy-efficient.

03

Heterogeneous systems are practical for accelerating LLM memory processing.

Abstract

Modern large language models (LLMs) increasingly depend on efficient long-context processing and generation mechanisms, including sparse attention, retrieval-augmented generation (RAG), and compressed contextual memory, to support complex reasoning. We show that these optimizations can be unified into a four-step memory processing pipeline: Prepare Memory, Compute Relevancy, Retrieval, and Apply to Inference. Through systematic profiling, we identify a 22%-97% memory processing overhead in LLM inference and strong heterogeneity in its computational characteristics. Motivated by this insight, we argue that heterogeneous systems are well-suited to accelerate memory processing and thus end-to-end inference. We demonstrate this approach on a GPU-FPGA system by offloading sparse, irregular, and memory-bound operations to FPGAs while retaining compute-intensive operations on GPUs.…
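The four steps named in the abstract can be illustrated with a toy retrieval example. This is a minimal sketch for orientation only, not the authors' implementation: the function names, the letter-count embedding, and the dot-product scoring are all assumptions chosen to keep the example self-contained, and a real system would use learned embeddings or KV-cache entries instead.

```python
import math
from typing import List, Tuple

Vector = List[float]
Memory = List[Tuple[Vector, str]]


def embed(text: str) -> Vector:
    """Hypothetical stand-in for an encoder: L2-normalized letter counts."""
    counts = [0.0] * 26
    for ch in text.lower():
        if "a" <= ch <= "z":
            counts[ord(ch) - ord("a")] += 1.0
    norm = math.sqrt(sum(v * v for v in counts)) or 1.0
    return [v / norm for v in counts]


def prepare_memory(chunks: List[str]) -> Memory:
    """Step 1, Prepare Memory: build (key, payload) entries from raw context."""
    return [(embed(chunk), chunk) for chunk in chunks]


def compute_relevancy(query: Vector, memory: Memory) -> List[float]:
    """Step 2, Compute Relevancy: score every entry against the query (dot product)."""
    return [sum(q * k for q, k in zip(query, key)) for key, _ in memory]


def retrieve(scores: List[float], memory: Memory, k: int) -> List[str]:
    """Step 3, Retrieval: keep the payloads of the k highest-scoring entries."""
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    return [memory[i][1] for i in top]


def apply_to_inference(prompt: str, retrieved: List[str]) -> str:
    """Step 4, Apply to Inference: here, simply prepend retrieved text to the prompt."""
    return "\n".join(retrieved + [prompt])


memory = prepare_memory(["fpga fpga fpga", "gpu gpu gpu", "zzz"])
scores = compute_relevancy(embed("fpga"), memory)
context = retrieve(scores, memory, k=2)
model_input = apply_to_inference("Which device handles sparse operations?", context)
```

Even in this toy form the steps have different shapes: scoring is a dense pass over all entries, while retrieval is a selection with data-dependent access. That kind of variation in computational character is what the abstract refers to as heterogeneity.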

Code & Models

Repositories

oswaldhe/HeteroLLM (GitHub)
