A3D-MoE: Acceleration of Large Language Models with Mixture of Experts via 3D Heterogeneous Integration

Wei-Hsing Huang; Janak Sharda; Cheng-Jhih Shih; Yuyao Kong; Faaiq Waqar; Pin-Jun Chen; Yingyan (Celine) Lin; Shimeng Yu

arXiv:2507.19142·cs.AR·July 28, 2025

Open Access

TL;DR

A3D-MoE introduces a 3D heterogeneous integration system for MoE-based large language models that reduces latency and energy consumption and raises hardware utilization through new dataflow, scheduling, and expert placement strategies.

Contribution

The paper presents a novel 3D integration architecture and techniques that address hardware utilization, latency, and energy challenges in MoE-based large language models.

Findings

01 Latency reduced by 1.8x to 2x

02 Energy consumption decreased by 2x to 4x

03 Throughput improved by 1.44x to 1.8x

Abstract

Conventional large language models (LLMs) carry dozens of GB to TB of model parameters, making inference highly energy-intensive and costly because all the weights must be loaded to onboard processing elements during computation. Recently, the Mixture-of-Experts (MoE) architecture has emerged as an efficient alternative, promising inference with fewer activated weights per token. Nevertheless, fine-grained MoE-based LLMs face several challenges: 1) variable workloads during runtime create arbitrary GEMV-GEMM ratios that reduce hardware utilization; 2) traditional MoE-based scheduling for LLM serving cannot fuse attention operations with MoE operations, leading to increased latency and decreased hardware utilization; and 3) despite being more efficient than conventional LLMs, loading experts from DRAM still consumes significant energy and requires substantial DRAM…
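The first challenge in the abstract (runtime workloads producing arbitrary GEMV-GEMM ratios) can be illustrated with a toy routing simulation. This is a minimal sketch and not the paper's method: the uniform random router, the function names `route_tokens` and `classify_workload`, and the expert counts are all assumptions chosen for illustration. An expert that receives exactly one token does a matrix-vector product (GEMV); one that receives several tokens does a matrix-matrix product (GEMM).

```python
import random
from collections import Counter


def route_tokens(num_tokens, num_experts, top_k, seed=0):
    """Send each token to top_k distinct experts.

    Assumption: experts are picked uniformly at random, standing in
    for a learned router. Returns a Counter of tokens per expert.
    """
    rng = random.Random(seed)
    counts = Counter()
    for _ in range(num_tokens):
        for expert in rng.sample(range(num_experts), top_k):
            counts[expert] += 1
    return counts


def classify_workload(counts, num_experts):
    """Count experts that are idle, GEMV-shaped (1 token), or GEMM-shaped (>1 token)."""
    summary = {"idle": 0, "gemv": 0, "gemm": 0}
    for expert in range(num_experts):
        n = counts.get(expert, 0)
        if n == 0:
            summary["idle"] += 1
        elif n == 1:
            summary["gemv"] += 1
        else:
            summary["gemm"] += 1
    return summary


if __name__ == "__main__":
    # Hypothetical configuration: 64 experts, 4 active per token.
    for batch in (1, 4, 16, 64):
        counts = route_tokens(batch, num_experts=64, top_k=4, seed=batch)
        print(batch, classify_workload(counts, num_experts=64))
```

Running the loop for several batch sizes shows the mix of GEMV-shaped and GEMM-shaped experts shifting as the number of in-flight tokens changes, which is the kind of variability a fixed-function compute array handles poorly.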

Taxonomy

Topics: Advanced Neural Network Applications · Big Data and Digital Economy · Machine Learning in Materials Science