ELMoE-3D: Leveraging Intrinsic Elasticity of MoE for Hybrid-Bonding-Enabled Self-Speculative Decoding in On-Premises Serving
Yuseon Choi, Jingu Lee, Jungjun Oh, Sunjoo Whang, Byeongcheol Kim, Minsung Kim, Hoi-Jun Yoo, Sangjin Kim

TL;DR
ELMoE-3D introduces a hardware-software co-designed framework that unifies cache acceleration and speculative decoding, significantly improving speed and energy efficiency for large-scale MoE language models on on-premises hardware.
Contribution
It proposes Elastic Self-Speculative Decoding (Elastic-SD), which leverages two intrinsic elasticity axes of MoE (expert and bit), and a bit-sliced architecture that supports bit-nested execution.
Findings
Achieves a 6.6x speedup and 4.4x higher energy efficiency over naive MoE serving.
Delivers a 2.2x speedup and 1.4x higher energy efficiency over prior accelerators.
Effective across batch sizes 1-16 on 3D-stacked hardware.
Abstract
Mixture-of-Experts (MoE) models have become the dominant architecture for large-scale language models, yet on-premises serving remains fundamentally memory-bound as batching turns sparse per-token compute into dense memory activation. Memory-centric architectures (PIM, NMP) improve bandwidth but leave compute underutilized under MoE's low arithmetic intensity at high batch sizes. Speculative decoding (SD) trades idle compute for fewer target invocations, yet verification must load experts even for rejected tokens, severely limiting its benefit in MoE, especially at low batch sizes. We propose ELMoE-3D, a hybrid-bonding (HB)-based HW-SW co-designed framework that unifies cache-based acceleration and speculative decoding to offer overall speedup across batch sizes. We identify two intrinsic elasticity axes of MoE, expert and bit, and jointly scale them to construct Elastic Self-Speculative…
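The abstract's claim that batching turns sparse per-token compute into dense memory activation can be illustrated with a small back-of-the-envelope sketch. It is not taken from the paper. It assumes each token independently picks its top-k experts uniformly at random, which real routers do not do, and the function name and the 64-expert, top-8 configuration are hypothetical.

```python
def expected_active_experts(num_experts: int, top_k: int, batch_tokens: int) -> float:
    """Expected number of distinct experts touched in one MoE layer.

    Assumption (not from the paper): every token selects top_k distinct
    experts uniformly at random, independently of the other tokens.
    """
    # Probability that a given expert is selected by none of the tokens.
    p_untouched = (1.0 - top_k / num_experts) ** batch_tokens
    return num_experts * (1.0 - p_untouched)


if __name__ == "__main__":
    # Hypothetical configuration: 64 experts per layer, 8 active per token.
    for batch in (1, 2, 4, 8, 16):
        active = expected_active_experts(num_experts=64, top_k=8, batch_tokens=batch)
        print(f"batch={batch:2d}  expected active experts={active:5.1f} of 64")
```

Under this simplified model, the number of expert weight sets that must be read grows quickly with the token count even though each token still uses only top_k experts. If one further assumes that verifying g draft tokens per sequence routes like a batch of B × (g + 1) tokens, the same function gives a rough picture of why verification loads many experts even when most draft tokens are rejected.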
