SliceMoE: Bit-Sliced Expert Caching under Miss-Rate Constraints for Efficient MoE Inference

Yuseon Choi, Sangjin Kim, Jungjun Oh, Gwangtae Park, Byeongcheol Kim, and Hoi-Jun Yoo

arXiv:2512.12990 · cs.AR · April 3, 2026

TL;DR

SliceMoE is an energy-efficient MoE inference framework that uses bit-sliced expert caching and mixed-precision quantization to cut decode energy and latency for on-device deployment.

Contribution

The paper proposes SliceMoE, which combines Dynamic Bit-Sliced Caching (DBSC), a calibration-free, truncation-based quantization scheme (AMAT), and Predictive Cache Warmup (PCW) for efficient MoE inference under miss-rate constraints.

Findings

01. Reduces decode-stage energy consumption by up to 2.85×.

02. Reduces decode latency by up to 1.81×.

03. Keeps accuracy close to that of high-bit experts while caching energy-efficiently.

Abstract

MoE models offer efficient scaling through conditional computation, but their large parameter size and expensive expert offloading make on-device deployment challenging. Existing acceleration techniques such as prefetching or expert clustering often increase energy usage or reduce expert diversity. We present SliceMoE, an energy-efficient MoE inference framework for miss-rate-constrained deployment. SliceMoE introduces Dynamic Bit-Sliced Caching (DBSC), which caches experts at slice-level granularity and assigns precision on demand to expand effective expert capacity. To support mixed-precision experts without memory duplication, we propose Calibration-Free Asymmetric Matryoshka Quantization (AMAT), a truncation-based scheme that maintains compatibility between low-bit and high-bit slices. We further introduce Predictive Cache Warmup (PCW) to reduce early-decode cold misses by reshaping…
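The abstract describes AMAT only as a truncation-based scheme in which low-bit and high-bit slices stay compatible, so the following is a minimal sketch of that general idea rather than the paper's method. It assumes an 8-bit code split into two 4-bit slices and a plain min/max uniform quantizer; the function names (`quantize_uint8`, `split_slices`, `dequantize`) and the slice widths are hypothetical choices made for illustration.

```python
def quantize_uint8(weights, lo, hi):
    """Uniform min/max quantization of floats to 8-bit codes in [0, 255].

    Assumption: a simple uniform quantizer stands in for whatever
    quantizer the paper actually uses.
    """
    scale = (hi - lo) / 255.0
    codes = [min(255, max(0, round((w - lo) / scale))) for w in weights]
    return codes, scale


def split_slices(code):
    """Split an 8-bit code into a 4-bit MSB slice and a 4-bit LSB slice."""
    return code >> 4, code & 0xF


def dequantize(msb, lsb, lo, scale):
    """Reconstruct a weight from the slices that are available.

    lsb=None models the case where only the MSB slice is cached: the
    missing low bits are treated as zero (pure truncation), so the same
    stored bits serve both the low-bit and the high-bit view.
    """
    code = (msb << 4) | (lsb if lsb is not None else 0)
    return lo + code * scale


if __name__ == "__main__":
    lo, hi = -1.0, 1.0
    weights = [-1.0, -0.3, 0.0, 0.42, 1.0]
    codes, scale = quantize_uint8(weights, lo, hi)
    for w, c in zip(weights, codes):
        msb, lsb = split_slices(c)
        low_bit = dequantize(msb, None, lo, scale)   # MSB slice only
        high_bit = dequantize(msb, lsb, lo, scale)   # both slices
        print(f"w={w:+.3f}  4-bit view={low_bit:+.3f}  8-bit view={high_bit:+.3f}")
```

Because the 4-bit view is just the top half of the 8-bit code, a cache can hold the MSB slice alone for a low-precision expert and add the LSB slice later to raise precision, without storing a second copy of the weights. How the paper assigns precision per expert and how it handles the asymmetry named in AMAT are not covered by this sketch.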
