Hardware-Software Co-design for 3D-DRAM-based LLM Serving Accelerator
Cong Li, Yihan Yin, Chenhao Xue, Zhao Wang, Fujun Bai, Yixin Guo, Xiping Jiang, Qiang Wu, Yuan Xie, Guangyu Sun

TL;DR
Helios is a hardware-software co-designed accelerator for 3D-DRAM-based LLM serving that improves efficiency and speed by dynamically managing KV caches and optimizing distributed attention execution.
Contribution
The paper introduces Helios, a hardware-software co-designed accelerator with dynamic KV cache management and distributed attention execution tailored to LLM serving workloads.
Findings
Achieves 3.25x speedup over existing designs.
Provides 3.36x better energy efficiency.
Reduces time-between-tokens degradation by up to 76%.
Abstract
Large language models (LLMs) have been widely deployed for online generative services, where numerous LLM instances jointly handle workloads with fluctuating request arrival rates and variable request lengths. To efficiently execute coexisting compute-intensive and memory-intensive operators, the near-memory processing (NMP) based computing paradigm has been widely proposed. However, existing NMP designs adopt coarse-grained KV cache management and an inflexible attention execution flow. These limitations prevent such designs from efficiently handling highly dynamic LLM serving workloads, limiting their ability to accelerate LLM serving. To tackle these problems, we propose Helios, a Hybrid-bonding-based LLM Serving accelerator. Helios aims to bridge the fundamental gap between the dynamic nature of KV cache management in LLM serving and the distributed,…
Taxonomy
Topics: Parallel Computing and Optimization Techniques · Embedded Systems Design Techniques · Big Data and Digital Economy
