ELMoE-3D: Leveraging Intrinsic Elasticity of MoE for Hybrid-Bonding-Enabled Self-Speculative Decoding in On-Premises Serving

Yuseon Choi; Jingu Lee; Jungjun Oh; Sunjoo Whang; Byeongcheol Kim; Minsung Kim; Hoi-Jun Yoo; Sangjin Kim

arXiv:2604.14626 · cs.LG · April 24, 2026

TL;DR

ELMoE-3D introduces a hardware-software co-designed framework that unifies cache acceleration and speculative decoding, significantly improving speed and energy efficiency for large-scale MoE language models on on-premises hardware.

Contribution

The paper proposes Elastic Self-Speculative Decoding (Elastic-SD), which leverages the intrinsic elasticity axes of MoE, together with a bit-sliced architecture that supports bit-nested execution.

Findings

01 Achieves 6.6x speedup and 4.4x energy efficiency over naive MoE serving.

02 Delivers 2.2x speedup and 1.4x energy efficiency over prior accelerators.

03 Effective across batch sizes 1-16 on 3D-stacked hardware.

Abstract

Mixture-of-Experts (MoE) models have become the dominant architecture for large-scale language models, yet on-premises serving remains fundamentally memory-bound as batching turns sparse per-token compute into dense memory activation. Memory-centric architectures (PIM, NMP) improve bandwidth but leave compute underutilized under MoE's low arithmetic intensity at high batch sizes. Speculative decoding (SD) trades idle compute for fewer target invocations, yet verification must load experts even for rejected tokens, severely limiting its benefit in MoE especially at low batch sizes. We propose ELMoE-3D, a hybrid-bonding (HB)-based HW-SW co-designed framework that unifies cache-based acceleration and speculative decoding to offer overall speedup across batch sizes. We identify two intrinsic elasticity axes of MoE, expert and bit, and jointly scale them to construct Elastic Self-Speculative…
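The abstract's remark that batching turns sparse per-token compute into dense memory activation can be illustrated with a small back-of-the-envelope sketch. It is not taken from the paper: it assumes each token independently routes to a uniformly random set of `top_k` experts (real routers are neither uniform nor independent), and the values `NUM_EXPERTS = 64` and `TOP_K = 8` are hypothetical. Under that assumption, the expected number of distinct experts a batch touches in one layer is E * (1 - (1 - k/E)^B).

```python
def expected_active_experts(num_experts: int, top_k: int, batch_size: int) -> float:
    """Expected number of distinct experts touched by one batch in one MoE layer.

    Assumption (not from the paper): every token independently selects a
    uniformly random set of `top_k` distinct experts out of `num_experts`.
    """
    if not 0 < top_k <= num_experts:
        raise ValueError("top_k must be in the range 1..num_experts")
    if batch_size < 0:
        raise ValueError("batch_size must be non-negative")
    # Probability that a given expert is NOT chosen by a single token.
    miss = 1.0 - top_k / num_experts
    # The expert stays idle only if every token in the batch misses it.
    return num_experts * (1.0 - miss ** batch_size)


if __name__ == "__main__":
    NUM_EXPERTS, TOP_K = 64, 8  # hypothetical configuration
    for batch in (1, 2, 4, 8, 16):
        active = expected_active_experts(NUM_EXPERTS, TOP_K, batch)
        print(f"batch={batch:2d}  active~{active:5.1f}  "
              f"fraction={active / NUM_EXPERTS:.2f}")
```

With these hypothetical values a single token touches 8 of 64 experts, while a batch of 16 is expected to touch most of them, so the weight traffic per layer approaches that of a dense model even though per-token compute stays sparse.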
