DeepStack: Scalable and Accurate Design Space Exploration for Distributed 3D-Stacked AI Accelerators

Zhiwen Mo; Guoyu Li; Hao Mark Chen; Yu Cheng; Zhengju Tang; Qianzhou Wang; Lei Wang; Shuang Liang; Lingxiao Ma; Xianqi Zhou; Yuxiao Guo; Wayne Luk; Jilong Xue; Hongxiang Fan

arXiv:2604.04750 · cs.AR · April 10, 2026

TL;DR

DeepStack is a high-speed, accurate performance modeling tool for co-designing distributed 3D-stacked AI accelerators, enabling extensive exploration of design options for improved throughput.

Contribution

The paper introduces a fast and accurate performance model for distributed 3D-stacked AI systems that supports large-scale design space exploration and hardware-software co-optimization.

Findings

01. DeepStack runs up to 100,000x faster than state-of-the-art simulators.

02. It enables exploration of 2.5 × 10^14 design points across multiple hardware parameters.

03. DeepStack finds that batch size and parallelism strategy are critical for optimal architecture design.
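To give a sense of scale for the design-space size reported in the findings above, here is a minimal sketch of how the number of points in a full Cartesian sweep grows as the product of per-parameter option counts, and how long an exhaustive sweep would take at a given per-point evaluation cost. The parameter names, option counts, and the one-second evaluation cost are hypothetical and chosen only for illustration; they are not taken from the paper, and this is not DeepStack's model or API.

```python
import math


def design_space_size(option_counts):
    """Points in a full Cartesian sweep: product of per-parameter option counts."""
    return math.prod(option_counts.values())


def sweep_years(num_points, seconds_per_point):
    """Wall-clock years to evaluate every point serially at a fixed cost per point."""
    seconds_per_year = 365 * 24 * 3600
    return num_points * seconds_per_point / seconds_per_year


# Hypothetical toy space; names and counts are assumptions, not from the paper.
toy_space = {
    "dram_layers": 4,
    "banks_per_layer": 8,
    "compute_tiles": 16,
    "batch_size": 10,
    "parallelism_strategy": 12,
}
toy_points = design_space_size(toy_space)

# The only figure taken from the source: the reported design-space size.
reported_points = 2.5e14

# Assumed cost of one second per point (hypothetical) for a serial exhaustive sweep.
years_at_one_second = sweep_years(reported_points, 1.0)

print(f"toy space: {toy_points} points")
print(f"reported space at 1 s/point: {years_at_one_second:.3g} years")
```

Because the count is multiplicative, adding one more parameter with k options multiplies the space by k, which is why the per-point evaluation cost of the performance model dominates how much of the space can be explored.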

Abstract

Advances in hybrid bonding and packaging have driven growing interest in 3D DRAM-stacked accelerators with higher memory bandwidth and capacity. As LLMs scale to hundreds of billions or trillions of parameters, distributed inference across multiple 3D chips becomes essential. With cross-stack co-design increasingly critical, we propose DeepStack, an accurate and efficient performance model and tool to enable early-stage system-hardware co-design space exploration (DSE) for distributed 3D-stacked AI systems. At the hardware level, DeepStack captures fine-grained 3D memory semantics such as transaction-aware bandwidth, bank activation constraints, buffering limitations, and thermal-power modeling. At the system level, DeepStack incorporates comprehensive parallelization strategies and execution scheduling for distributed LLM inference. With novel modeling techniques such as dual-stage…
