Exploring the Efficiency of 3D-Stacked AI Chip Architecture for LLM Inference with Voxel
Yiqi Liu, Noelle Crawford, Michael Wang, Jilong Xue, Jian Huang

TL;DR
This paper introduces Voxel, a simulation framework for exploring the efficiency of 3D-stacked AI chips in LLM inference, considering hardware, software, and mapping optimizations.
Contribution
We develop Voxel, a fast, compiler-aware simulation tool enabling comprehensive co-exploration of 3D-stacked AI chip architectures for LLM inference.
Findings
Efficiency depends on mappings from tiles to cores and DRAM banks.
Multiple factors, such as compute paradigms, NoC topology, and bandwidth, influence performance.
Voxel and the study results are released as open source for public research.
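To make the first finding concrete, here is a toy sketch of why the tile-to-core and tile-to-bank mapping matters. It is not Voxel's model: the 4x4 mesh, the assumption of one DRAM bank stacked directly above each core, the hop-count cost, and every name in the code are hypothetical choices made for illustration only.

```python
# Toy model (NOT Voxel): cores sit on a 2D mesh NoC, and we assume one DRAM
# bank is stacked directly above each core. A tile's cost is the Manhattan
# hop count between the core that computes it and the core hosting the bank
# that stores it. Zero hops means the access only uses the vertical link.

MESH_W, MESH_H = 4, 4  # hypothetical 4x4 mesh


def coords(idx):
    """Map a linear core/bank index to (x, y) mesh coordinates."""
    return idx % MESH_W, idx // MESH_W


def hops(core, bank):
    """Manhattan distance on the mesh between a core and a bank's host core."""
    cx, cy = coords(core)
    bx, by = coords(bank)
    return abs(cx - bx) + abs(cy - by)


def total_hops(mapping):
    """Sum hop counts over all tiles; mapping is a list of (core, bank) pairs."""
    return sum(hops(core, bank) for core, bank in mapping)


n = MESH_W * MESH_H

# Mapping A: tile t runs on core t and reads from the bank above core t.
local_mapping = [(t, t) for t in range(n)]

# Mapping B: tile t runs on core t but its data lives in the next bank over.
shifted_mapping = [(t, (t + 1) % n) for t in range(n)]

print("local mapping hops:  ", total_hops(local_mapping))
print("shifted mapping hops:", total_hops(shifted_mapping))
```

Under these assumptions the same set of tiles generates no NoC traffic in the first mapping and a nonzero amount in the second, which is the kind of sensitivity the finding describes. A real study would also need to account for bandwidth, contention, and compute time, none of which this sketch models.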
Abstract
To overcome the well-known memory bottleneck of AI chips, 3D-stacked architectures that employ advanced packaging technology with high-density through-silicon via (TSV) pins have proven to be a promising solution. The 3D-stacked AI chip enables ultra-high memory bandwidth between compute and memory by stacking numerous DRAM banks atop many AI cores in a distributed manner. However, exploring the efficiency of the 3D-stacked AI chip is not easy because of its distributed nature: multiple intertwined factors must be considered carefully, ranging from the upper-level computing paradigm to machine learning (ML) compiler optimizations and the underlying hardware architecture. In this paper, we develop Voxel, a fast and compiler-aware end-to-end simulation framework to facilitate exploring the efficiency of 3D-stacked AI chips for large language model (LLM) inference. Voxel…
