TL;DR
This paper unifies several LLM memory optimization techniques into a four-step pipeline, measures memory processing overheads of 22% to 97% in inference, and shows that a heterogeneous GPU-FPGA system can be up to 2.2x faster and 4.7x more energy-efficient.
Contribution
It introduces a unified four-step memory processing pipeline for LLMs, argues that heterogeneous hardware suits the pipeline's varied computational characteristics, and validates this on a GPU-FPGA system, which is up to 2.2x faster and 4.7x more energy-efficient.
Findings
Memory processing overhead in LLM inference ranges from 22% to 97%.
Heterogeneous GPU-FPGA systems can be up to 2.2x faster and 4.7x more energy-efficient.
Heterogeneous systems are practical for accelerating LLM memory processing.
Abstract
Modern large language models (LLMs) increasingly depend on efficient long-context processing and generation mechanisms, including sparse attention, retrieval-augmented generation (RAG), and compressed contextual memory, to support complex reasoning. We show that these optimizations can be unified into a four-step memory processing pipeline: Prepare Memory, Compute Relevancy, Retrieval, and Apply to Inference. Through systematic profiling, we identify a 22%-97% memory processing overhead in LLM inference and strong heterogeneity in its computational characteristics. Motivated by this insight, we argue that heterogeneous systems are well-suited to accelerate memory processing and thus end-to-end inference. We demonstrate this approach on a GPU-FPGA system by offloading sparse, irregular, and memory-bound operations to FPGAs while retaining compute-intensive operations on GPUs.…
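The four steps named in the abstract can be illustrated with a toy, RAG-style sketch. This is not the paper's implementation: the function names, the letter-count `toy_embed` encoder, the dot-product scoring, and the top-k selection are all assumptions chosen only to make the stage boundaries concrete.

```python
def toy_embed(text):
    """Hypothetical encoder: vowel counts stand in for a real embedding model."""
    return [text.lower().count(ch) for ch in "aeiou"]

def prepare_memory(chunks, embed):
    """Step 1 (Prepare Memory): store each chunk next to a vector key."""
    return [(embed(chunk), chunk) for chunk in chunks]

def compute_relevancy(query_vec, memory):
    """Step 2 (Compute Relevancy): dot product of the query with every key."""
    return [sum(q * k for q, k in zip(query_vec, key)) for key, _ in memory]

def retrieve(scores, memory, top_k):
    """Step 3 (Retrieval): select the top_k highest-scoring chunks."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return [memory[i][1] for i in ranked[:top_k]]

def apply_to_inference(query, selected):
    """Step 4 (Apply to Inference): here, just prepend the chunks to the prompt."""
    return "\n".join(selected + [query])

def run_pipeline(query, chunks, top_k=1):
    memory = prepare_memory(chunks, toy_embed)
    scores = compute_relevancy(toy_embed(query), memory)
    selected = retrieve(scores, memory, top_k)
    return apply_to_inference(query, selected)

if __name__ == "__main__":
    print(run_pipeline("aa", ["aaa", "eee", "iii"], top_k=1))
```

In this sketch, steps 2 and 3 are the data-dependent scoring and selection stages, while step 4 hands the selected context to the dense model computation. How the paper actually partitions work between GPU and FPGA is not specified beyond the abstract's description.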
