SliceMoE: Bit-Sliced Expert Caching under Miss-Rate Constraints for Efficient MoE Inference
Yuseon Choi, Sangjin Kim, Jungjun Oh, Gwangtae Park, Byeongcheol Kim, and Hoi-Jun Yoo

TL;DR
SliceMoE is an energy-efficient MoE inference framework that combines bit-sliced expert caching with mixed-precision quantization to cut decode energy and latency for on-device deployment.
Contribution
The paper proposes SliceMoE, which combines Dynamic Bit-Sliced Caching (DBSC), Calibration-Free Asymmetric Matryoshka Quantization (AMAT), and Predictive Cache Warmup (PCW) for efficient MoE inference under miss-rate constraints.
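To make the idea of slice-level caching concrete, here is a minimal sketch of a cache that stores expert weights as separate bit slices instead of whole experts. It is an illustration under stated assumptions, not the paper's DBSC policy: the class name `SliceCache`, the split into exactly two slices (`"msb"` and `"lsb"`), and the LRU eviction rule are all hypothetical choices made for this example.

```python
from collections import OrderedDict


class SliceCache:
    """Toy LRU cache whose unit of storage is one bit slice of one expert.

    Assumption for illustration: each expert has two slices. The "msb" slice
    alone gives a low-precision expert; "msb" plus "lsb" gives high precision.
    """

    def __init__(self, capacity_slices):
        if capacity_slices < 2:
            raise ValueError("need room for at least one full expert")
        self.capacity = capacity_slices
        self.slots = OrderedDict()  # (expert_id, slice_name) -> True, oldest first
        self.misses = 0

    def _touch(self, key):
        if key in self.slots:
            self.slots.move_to_end(key)  # hit: mark as most recently used
            return
        self.misses += 1  # miss: slice would be loaded from slower storage
        if len(self.slots) >= self.capacity:
            self.slots.popitem(last=False)  # evict the least recently used slice
        self.slots[key] = True

    def fetch(self, expert_id, high_precision):
        """Make the slices needed for the requested precision resident."""
        self._touch((expert_id, "msb"))
        if high_precision:
            self._touch((expert_id, "lsb"))
            # Keep "msb" newer than "lsb" so the refinement slice is evicted
            # first and an "lsb" slice is never left without its "msb".
            self.slots.move_to_end((expert_id, "msb"))


# Same slot budget, two precision choices.
low = SliceCache(capacity_slices=4)
high = SliceCache(capacity_slices=4)
for _ in range(2):            # two passes over the same four experts
    for expert in range(4):
        low.fetch(expert, high_precision=False)
        high.fetch(expert, high_precision=True)
print("low-precision misses:", low.misses)
print("high-precision misses:", high.misses)
```

In this toy setting, four slots hold four experts at low precision but only two at high precision, which is the sense in which assigning precision on demand can expand effective expert capacity.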
Findings
Reduces decode-stage energy consumption by up to 2.85x.
Improves decode latency by up to 1.81x.
Maintains accuracy close to that of high-bit quantization while using energy-efficient caching.
Abstract
MoE models offer efficient scaling through conditional computation, but their large parameter size and expensive expert offloading make on-device deployment challenging. Existing acceleration techniques such as prefetching or expert clustering often increase energy usage or reduce expert diversity. We present SliceMoE, an energy-efficient MoE inference framework for miss-rate-constrained deployment. SliceMoE introduces Dynamic Bit-Sliced Caching (DBSC), which caches experts at slice-level granularity and assigns precision on demand to expand effective expert capacity. To support mixed-precision experts without memory duplication, we propose Calibration-Free Asymmetric Matryoshka Quantization (AMAT), a truncation-based scheme that maintains compatibility between low-bit and high-bit slices. We further introduce Predictive Cache Warmup (PCW) to reduce early-decode cold misses by reshaping…
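The abstract describes AMAT as a truncation-based scheme in which low-bit and high-bit slices stay compatible without duplicating memory. The sketch below shows the general idea of truncation-based nested quantization, not the paper's exact method: weights are quantized once with an asymmetric uniform quantizer, and the low-precision view is simply the most significant bits of the same code. The 8-bit and 4-bit widths, the function names, and the midpoint fill for truncated bits are assumptions made for illustration.

```python
def quantize_asym(weights, bits=8):
    """Asymmetric uniform quantization: map [min, max] onto [0, 2**bits - 1]."""
    w_min, w_max = min(weights), max(weights)
    levels = (1 << bits) - 1
    scale = (w_max - w_min) / levels if w_max > w_min else 1.0
    codes = [min(levels, max(0, round((w - w_min) / scale))) for w in weights]
    return codes, scale, w_min


def split_slices(codes, low_bits=4):
    """Split each code into a most-significant and a least-significant slice."""
    msb = [c >> low_bits for c in codes]
    lsb = [c & ((1 << low_bits) - 1) for c in codes]
    return msb, lsb


def dequantize(msb, scale, zero, low_bits=4, lsb=None):
    """Rebuild weights from the MSB slice alone, or from both slices."""
    if lsb is None:
        # Low precision: truncated bits are unknown, so use their midpoint.
        codes = [(m << low_bits) + (1 << (low_bits - 1)) for m in msb]
    else:
        # High precision: concatenating the slices recovers the full code.
        codes = [(m << low_bits) | l for m, l in zip(msb, lsb)]
    return [c * scale + zero for c in codes]


weights = [-1.0, -0.25, 0.0, 0.4, 1.0]
codes, scale, zero = quantize_asym(weights)
msb, lsb = split_slices(codes)
low_view = dequantize(msb, scale, zero)             # uses the MSB slice only
high_view = dequantize(msb, scale, zero, lsb=lsb)   # uses both slices
```

Because the low-bit view is a prefix of the high-bit code, the same stored MSB slice serves both precisions, and raising the precision of a cached expert only requires loading the LSB slice.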
