MoBiE: Efficient Inference of Mixture of Binary Experts under Post-Training Quantization

Zhixiong Zhao; Zukang Xu; Zhixuan Chen; Dawei Yang

arXiv:2604.06798·cs.LG·April 22, 2026

TL;DR

MoBiE is a binarization framework for MoE-based large language models that reduces cross-expert redundancy and quantization-induced routing shifts, achieving high efficiency without performance loss.

Contribution

MoBiE introduces three techniques for binarizing MoE models: joint SVD across experts, integration of global loss gradients into local Hessian-based importance metrics, and an error constraint guided by the input null space.
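As an illustration of the joint SVD idea only, here is a minimal NumPy sketch, not the authors' implementation. It assumes the expert weight matrices are concatenated side by side and factored with a single SVD, so that every expert is expressed in one shared left basis. The helper name `joint_svd`, the stacking axis, and the toy dimensions are assumptions made for this example.

```python
import numpy as np

def joint_svd(expert_weights, rank):
    """Hypothetical helper: factor each expert as W_e ~= U_r @ C_e with a shared basis U_r."""
    # Concatenate experts along the input dimension: shape (d_out, n_experts * d_in).
    stacked = np.concatenate(expert_weights, axis=1)
    U, S, Vt = np.linalg.svd(stacked, full_matrices=False)
    U_r = U[:, :rank]                             # shared basis, shape (d_out, rank)
    coeffs = [U_r.T @ W for W in expert_weights]  # per-expert coefficients, shape (rank, d_in)
    return U_r, coeffs

# Toy experts that deliberately share a rank-3 output subspace (synthetic data).
rng = np.random.default_rng(0)
d_out, d_in, n_experts, rank = 16, 12, 4, 3
shared = rng.standard_normal((d_out, rank))
experts = [shared @ rng.standard_normal((rank, d_in)) for _ in range(n_experts)]

U_r, coeffs = joint_svd(experts, rank)

# Worst-case relative reconstruction error over all experts.
rel_err = max(
    np.linalg.norm(W - U_r @ C) / np.linalg.norm(W)
    for W, C in zip(experts, coeffs)
)
```

In this toy setting the experts are redundant by construction, so one rank-3 basis reconstructs all of them almost exactly. Real expert weights are only approximately redundant, and how MoBiE combines the decomposition with binarization is not specified in this summary.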

Findings

1. Reduces perplexity by 52.2% on Qwen3-30B-A3B.
2. Improves zero-shot performance by 43.4%.
3. Over 2x inference speedup and shorter quantization time.

Abstract

Mixture-of-Experts (MoE) based large language models (LLMs) offer strong performance but suffer from high memory and computation costs. Weight binarization provides extreme efficiency, yet existing binary methods designed for dense LLMs struggle with MoE-specific issues, including cross-expert redundancy, task-agnostic importance estimation, and quantization-induced routing shifts. To this end, we propose MoBiE, the first binarization framework tailored for MoE-based LLMs. MoBiE is built on three core innovations: (1) using joint SVD to reduce cross-expert redundancy; (2) integrating global loss gradients into local Hessian metrics to enhance weight importance estimation; (3) introducing an error constraint guided by the input null space to mitigate routing distortion. Notably, MoBiE achieves these optimizations while incurring no additional storage overhead, striking a…
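The role of the input null space can be illustrated with a short NumPy sketch. This is a generic linear-algebra demonstration under stated assumptions, not MoBiE's actual constraint: it shows that a weight error confined to the null space of a set of calibration inputs leaves the layer output on those inputs unchanged, so anything computed downstream from that output (such as router logits) would not shift. The variable names, shapes, and the pseudoinverse-based projector are choices made for this example.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out, n_tokens = 32, 8, 8

# Hypothetical calibration inputs: columns are tokens, with fewer tokens than
# input dimensions so that a non-trivial null space exists.
X = rng.standard_normal((d_in, n_tokens))
W = rng.standard_normal((d_out, d_in))      # full-precision layer weight

# Projector onto the orthogonal complement of the column space of X.
# Any row vector v satisfies (v @ P_null) @ X ~= 0.
P_null = np.eye(d_in) - X @ np.linalg.pinv(X)

delta = rng.standard_normal((d_out, d_in))  # stand-in for a quantization error
delta_null = delta @ P_null                 # same error restricted to the null space

out_ref = W @ X
out_free = (W + delta) @ X                  # unconstrained error changes the output
out_null = (W + delta_null) @ X             # null-space error does not

max_shift_free = np.abs(out_free - out_ref).max()
max_shift_null = np.abs(out_null - out_ref).max()
```

Here `max_shift_null` is zero up to floating-point error while `max_shift_free` is not. A real binarization error cannot be chosen freely, so a practical method can only steer the error toward this subspace; how MoBiE does so is not described in this summary.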

Code & Models

Repositories

Kishon-zzx/MoBiE (GitHub)
