Rethinking Compute Substrates for 3D-Stacked Near-Memory LLM Decoding: Microarchitecture-Scheduling Co-Design

Chenyang Ai; Yixing Zhang; Haoran Wu; Yudong Pan; Lechuan Zhao; Wenhui OU

arXiv:2604.04253 · cs.AR · April 10, 2026

TL;DR

This paper proposes a microarchitecture and scheduling co-design for 3D-stacked near-memory processing that accelerates large language model decoding, reporting a 2.91x speedup over prior designs and 2.40x higher energy efficiency.

Contribution

The paper rethinks the compute microarchitecture of 3D-stacked NMP designs for LLM decoding, replacing MAC tree-based compute units with more area-efficient systolic arrays that can be reconfigured for decode workloads.

Findings

01 Achieves 2.91x speedup over prior designs

02 Attains 2.40x higher energy efficiency

03 Demonstrates effectiveness on both dense and MoE models

Abstract

Large language model (LLM) decoding is a major inference bottleneck because its low arithmetic intensity makes performance highly sensitive to memory bandwidth. 3D-stacked near-memory processing (NMP) provides substantially higher local memory bandwidth than conventional off-chip interfaces, making it a promising substrate for decode acceleration. However, our analysis shows that this bandwidth advantage also shifts many decode operators on 3D-stacked NMP back into the compute-bound regime. Under the tight area budget of the logic die, the design of the compute substrate itself becomes a first-order challenge. We therefore rethink the compute microarchitecture of prior 3D-stacked NMP designs. First, we replace prior MAC tree-based compute units with a more area-efficient systolic array, and we further observe that decode operators exhibit substantial shape diversity, making…
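The abstract's claim that more bandwidth pushes decode operators back into the compute-bound regime can be illustrated with a simple roofline check. The sketch below is not taken from the paper: the substrate names, the peak-throughput and bandwidth figures, and the operator shape are all hypothetical values chosen only to show the effect. An operator is compute-bound when its arithmetic intensity (FLOPs per byte moved) exceeds the ridge point, which is peak compute divided by memory bandwidth.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Substrate:
    """A compute substrate described by two roofline parameters."""
    name: str
    peak_flops: float  # FLOP/s (hypothetical value, not from the paper)
    mem_bw: float      # bytes/s (hypothetical value, not from the paper)

    @property
    def ridge(self) -> float:
        # Arithmetic intensity at which compute and memory time are equal.
        return self.peak_flops / self.mem_bw


def gemm_intensity(batch: int, d_in: int, d_out: int,
                   bytes_per_elem: int = 2) -> float:
    """FLOPs per byte for a (batch x d_in) @ (d_in x d_out) decode projection.

    Assumes each weight, input, and output element is moved exactly once.
    """
    flops = 2 * batch * d_in * d_out
    bytes_moved = bytes_per_elem * (d_in * d_out + batch * d_in + batch * d_out)
    return flops / bytes_moved


def regime(sub: Substrate, intensity: float) -> str:
    """Classify an operator on a substrate by comparing against the ridge."""
    return "compute-bound" if intensity >= sub.ridge else "memory-bound"


# Hypothetical substrates: same compute, 10x difference in bandwidth.
OFF_CHIP = Substrate("off-chip interface", peak_flops=40e12, mem_bw=1e12)
STACKED = Substrate("3D-stacked NMP", peak_flops=40e12, mem_bw=10e12)

if __name__ == "__main__":
    ai = gemm_intensity(batch=8, d_in=4096, d_out=4096)
    for sub in (OFF_CHIP, STACKED):
        print(f"{sub.name}: ridge={sub.ridge:.1f} FLOP/B, "
              f"intensity={ai:.2f} FLOP/B, {regime(sub, ai)}")
```

Under these assumed numbers, the same batch-8 projection sits below the ridge point on the off-chip substrate and above it on the stacked one, which is the shift the abstract describes. With compute now the limiter, how many FLOP/s fit in the logic die's area budget becomes the deciding factor.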
