Rethinking Compute Substrates for 3D-Stacked Near-Memory LLM Decoding: Microarchitecture-Scheduling Co-Design
Chenyang Ai, Yixing Zhang, Haoran Wu, Yudong Pan, Lechuan Zhao, Wenhui OU

TL;DR
This paper introduces a microarchitecture and scheduling co-design for 3D-stacked near-memory processing (NMP) that accelerates large language model (LLM) decoding, reporting a 2.91x speedup over prior designs and 2.40x higher energy efficiency.
Contribution
The paper rethinks the compute microarchitecture of 3D-stacked NMP designs for LLM decoding, replacing MAC tree-based compute units with more area-efficient, reconfigurable systolic arrays.
Findings
Achieves 2.91x speedup over prior designs
Attains 2.40x higher energy efficiency
Demonstrates effectiveness on both dense and MoE models
Abstract
Large language model (LLM) decoding is a major inference bottleneck because its low arithmetic intensity makes performance highly sensitive to memory bandwidth. 3D-stacked near-memory processing (NMP) provides substantially higher local memory bandwidth than conventional off-chip interfaces, making it a promising substrate for decode acceleration. However, our analysis shows that this bandwidth advantage also shifts many decode operators on 3D-stacked NMP back into the compute-bound regime. Under the tight area budget of the logic die, the design of the compute substrate itself becomes a first-order challenge. We therefore rethink the compute microarchitecture of prior 3D-stacked NMP designs. First, we replace prior MAC tree-based compute units with a more area-efficient systolic array, and we further observe that decode operators exhibit substantial shape diversity, making…
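The bandwidth argument in the abstract can be illustrated with a simple roofline estimate. The sketch below is not taken from the paper: the device names, peak throughput, bandwidth figures, and operator shape are all hypothetical values chosen only to show how the same decode operator can be memory-bound behind an off-chip interface and compute-bound once local bandwidth rises.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Device:
    """A roofline device model. All instances below use hypothetical numbers."""
    name: str
    peak_flops: float     # peak compute throughput, FLOP/s
    mem_bandwidth: float  # sustained memory bandwidth, bytes/s

    @property
    def ridge_point(self) -> float:
        # Arithmetic intensity (FLOP/byte) at which compute and memory balance.
        return self.peak_flops / self.mem_bandwidth


def gemm_intensity(batch: int, k: int, n: int, bytes_per_elem: int = 2) -> float:
    """Arithmetic intensity of a (batch x k) @ (k x n) decode-style matmul."""
    flops = 2 * batch * k * n                                    # one multiply and one add per MAC
    bytes_moved = bytes_per_elem * (k * n + batch * k + batch * n)  # weights + inputs + outputs
    return flops / bytes_moved


def regime(device: Device, intensity: float) -> str:
    """Classify an operator as compute-bound or memory-bound on a device."""
    return "compute-bound" if intensity >= device.ridge_point else "memory-bound"


def attainable_flops(device: Device, intensity: float) -> float:
    """Roofline bound: the smaller of the compute roof and the bandwidth roof."""
    return min(device.peak_flops, device.mem_bandwidth * intensity)


# Hypothetical devices with the same compute budget but different bandwidth.
off_chip = Device("off-chip interface", peak_flops=32e12, mem_bandwidth=1e12)   # ridge point 32 FLOP/byte
stacked = Device("3D-stacked NMP", peak_flops=32e12, mem_bandwidth=16e12)       # ridge point 2 FLOP/byte

# Hypothetical decode operator: batch of 8 tokens against a 4096 x 4096 FP16 weight matrix.
intensity = gemm_intensity(batch=8, k=4096, n=4096)

for dev in (off_chip, stacked):
    print(f"{dev.name}: ridge={dev.ridge_point:.1f} FLOP/B, "
          f"operator intensity={intensity:.2f} FLOP/B, {regime(dev, intensity)}")
```

Under these assumed numbers the operator's intensity (just under 8 FLOP/byte) sits below the off-chip ridge point and above the 3D-stacked one, so raising bandwidth alone moves the limit from memory to compute. That is the situation in which the area efficiency of the compute units starts to determine performance.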
