A3D-MoE: Acceleration of Large Language Models with Mixture of Experts via 3D Heterogeneous Integration
Wei-Hsing Huang, Janak Sharda, Cheng-Jhih Shih, Yuyao Kong, Faaiq Waqar, Pin-Jun Chen, Yingyan (Celine) Lin, Shimeng Yu

TL;DR
A3D-MoE is a 3D heterogeneous integration system for MoE-based large language models. It reduces latency and energy consumption and raises hardware utilization through its dataflow, scheduling, and expert placement strategies.
Contribution
The paper presents a novel 3D integration architecture and techniques that address hardware utilization, latency, and energy challenges in MoE-based large language models.
Findings
Latency reduced by 1.8x to 2x
Energy consumption decreased by 2x to 4x
Throughput improved by 1.44x to 1.8x
Abstract
Conventional large language models (LLMs) are equipped with dozens of GB to TB of model parameters, making inference highly energy-intensive and costly, as all the weights need to be loaded to onboard processing elements during computation. Recently, the Mixture-of-Experts (MoE) architecture has emerged as an efficient alternative, promising inference with fewer activated weights per token. Nevertheless, fine-grained MoE-based LLMs face several challenges: 1) variable workloads during runtime create arbitrary GEMV-GEMM ratios that reduce hardware utilization, 2) traditional MoE-based scheduling for LLM serving cannot fuse attention operations with MoE operations, leading to increased latency and decreased hardware utilization, and 3) despite being more efficient than conventional LLMs, loading experts from DRAM still consumes significant energy and requires substantial DRAM…
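The first challenge in the abstract, arbitrary GEMV-GEMM ratios, can be illustrated with a toy routing simulation. The sketch below is not the paper's method: it assumes a uniformly random top-k router in place of a learned gate, and the expert count (64) and top-k (8) are illustrative values chosen here. An expert that receives exactly one token performs a matrix-vector product (GEMV), an expert that receives several performs a matrix-matrix product (GEMM), and the mix shifts with batch size.

```python
import random
from collections import Counter


def route_tokens(num_tokens, num_experts, top_k, seed=0):
    """Count tokens per expert under a uniformly random top-k router.

    The random router is an assumption standing in for a learned gate.
    """
    rng = random.Random(seed)
    load = Counter()
    for _ in range(num_tokens):
        # Each token activates top_k distinct experts.
        for expert in rng.sample(range(num_experts), top_k):
            load[expert] += 1
    return load


def classify_kernels(load, num_experts):
    """Classify each expert's work by how many tokens it received."""
    gemv = sum(1 for e in range(num_experts) if load[e] == 1)  # one token: GEMV
    gemm = sum(1 for e in range(num_experts) if load[e] > 1)   # several tokens: GEMM
    idle = num_experts - gemv - gemm                           # no tokens: skipped
    return {"gemv": gemv, "gemm": gemm, "idle": idle}


if __name__ == "__main__":
    for batch in (1, 4, 16, 64):
        load = route_tokens(batch, num_experts=64, top_k=8)
        print(batch, classify_kernels(load, num_experts=64))
```

With a single token every active expert runs a GEMV; as the batch grows, more experts receive multiple tokens and shift to GEMM, so the ratio a fixed hardware configuration must serve changes at runtime.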
Taxonomy
Topics: Advanced Neural Network Applications · Big Data and Digital Economy · Machine Learning in Materials Science
