Hardware-Software Co-design for 3D-DRAM-based LLM Serving Accelerator

Cong Li; Yihan Yin; Chenhao Xue; Zhao Wang; Fujun Bai; Yixin Guo; Xiping Jiang; Qiang Wu; Yuan Xie; Guangyu Sun

arXiv:2603.04797·cs.AR·March 6, 2026

TL;DR

Helios is a hardware-software co-designed accelerator for 3D-DRAM-based LLM serving that improves efficiency and speed by dynamically managing KV caches and optimizing distributed attention execution.

Contribution

The paper introduces Helios, a novel co-designed accelerator with dynamic cache management and distributed attention execution tailored for LLM workloads.

Findings

01. Achieves 3.25x speedup over existing designs.
02. Provides 3.36x better energy efficiency.
03. Reduces time-between-tokens degradation by up to 76%.

Abstract

Large language models (LLMs) have been widely deployed for online generative services, where numerous LLM instances jointly handle workloads with fluctuating request arrival rates and variable request lengths. To efficiently execute the coexisting compute-intensive and memory-intensive operators, computing paradigms based on near-memory processing (NMP) have been widely proposed. However, existing NMP designs adopt coarse-grained KV cache management and an inflexible attention execution flow. These limitations prevent such proposals from efficiently handling highly dynamic LLM serving workloads, limiting their ability to accelerate LLM serving. To tackle these problems, we propose Helios, a Hybrid-bonding-based LLM Serving accelerator. Helios aims to bridge the fundamental gap between the dynamic nature of KV cache management in LLM serving and the distributed,…

Taxonomy

Topics: Parallel Computing and Optimization Techniques · Embedded Systems Design Techniques · Big Data and Digital Economy