PuzzleMoE: Efficient Compression of Large Mixture-of-Experts Models via Sparse Expert Merging and Bit-packed inference
Yushu Zhao, Zheng Wang, Minjia Zhang

TL;DR
PuzzleMoE is a training-free compression method for Mixture-of-Experts (MoE) models. It merges experts sparsely and uses a bit-packed encoding, which reduces memory use and improves inference efficiency while keeping accuracy close to that of the uncompressed model.
Contribution
It introduces a sparse expert merging technique based on a dual mask, together with a bit-packed encoding scheme, enabling high compression and efficient inference without retraining.
Findings
Compresses MoE models by up to 50% while maintaining accuracy.
Outperforms prior methods by up to 16.7% on MMLU at 50% compression.
Achieves up to 1.28× inference speedup.
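As a rough illustration of what a 50% reduction can mean in practice, the sketch below estimates expert-weight memory for Mixtral-8x7B before and after halving the number of stored experts. The configuration values come from the publicly released Mixtral-8x7B model config rather than from this page, and the sketch assumes that only expert weights are compressed, that they are stored in 16-bit precision, and that pairwise merging halves them exactly. The function names are hypothetical.

```python
# Back-of-the-envelope estimate; config values are from the public
# Mixtral-8x7B release, not from this page.
HIDDEN_SIZE = 4096
INTERMEDIATE_SIZE = 14336
NUM_LAYERS = 32
NUM_EXPERTS = 8
BYTES_PER_PARAM = 2  # assumption: 16-bit weights


def expert_params(hidden: int, intermediate: int) -> int:
    """Parameters in one expert: three projection matrices (gate, up, down)."""
    return 3 * hidden * intermediate


def expert_memory_bytes(num_experts: int) -> int:
    """Total bytes for all expert weights across all layers."""
    per_expert = expert_params(HIDDEN_SIZE, INTERMEDIATE_SIZE)
    return per_expert * num_experts * NUM_LAYERS * BYTES_PER_PARAM


original = expert_memory_bytes(NUM_EXPERTS)
# Assumption: merging experts in pairs leaves half as many stored experts.
merged = expert_memory_bytes(NUM_EXPERTS // 2)

print(f"expert weights: {original / 1e9:.1f} GB -> {merged / 1e9:.1f} GB")
```

The estimate ignores attention layers, embeddings, and router weights, so the end-to-end saving for the whole model would be somewhat smaller than half.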
Abstract
Mixture-of-Experts (MoE) models have shown strong potential in scaling language models efficiently by activating only a small subset of experts per input. However, their widespread deployment remains limited due to the high memory overhead associated with storing all expert parameters, particularly as the number of experts increases. To address this challenge, prior works have explored expert dropping and merging strategies, yet they often suffer from performance drop at high compression ratios. In this paper, we introduce PuzzleMoE, a training-free MoE compression method that achieves both high accuracy and efficient inference through two key innovations: First, PuzzleMoE performs sparse expert merging by identifying element-wise weight redundancy and specialization. It uses a dual-mask to capture both shared and expert-specific parameters. Second, to avoid the overhead of storing…
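The dual-mask idea can be sketched roughly as follows. This is a minimal illustration and not the authors' algorithm: the relative similarity threshold, the use of weight magnitude as the tie-breaking criterion, and the names `merge_pair` and `reconstruct` are all assumptions, since the excerpt does not state the actual criteria. Entries where two experts nearly agree are shared as an average; the remaining entries are kept for only one of the two experts.

```python
from typing import List, Tuple


def merge_pair(
    w1: List[float], w2: List[float], sim_threshold: float = 0.1
) -> Tuple[List[float], List[bool], List[bool]]:
    """Merge two flat weight lists into one list plus a keep-mask per expert.

    Assumed rule: entries whose relative difference is within
    `sim_threshold` are shared (averaged); otherwise the entry with the
    larger magnitude is kept and assigned to its own expert only.
    """
    merged: List[float] = []
    keep1: List[bool] = []
    keep2: List[bool] = []
    for a, b in zip(w1, w2):
        scale = max(abs(a), abs(b))
        if scale == 0.0 or abs(a - b) <= sim_threshold * scale:
            merged.append((a + b) / 2.0)  # shared parameter
            keep1.append(True)
            keep2.append(True)
        elif abs(a) >= abs(b):
            merged.append(a)  # specific to expert 1
            keep1.append(True)
            keep2.append(False)
        else:
            merged.append(b)  # specific to expert 2
            keep1.append(False)
            keep2.append(True)
    return merged, keep1, keep2


def reconstruct(merged: List[float], keep: List[bool]) -> List[float]:
    """Recover one expert's sparse weights from the merged list and its mask."""
    return [m if k else 0.0 for m, k in zip(merged, keep)]


w1 = [1.0, 0.5, -0.2, 0.0]
w2 = [1.05, -0.1, 3.0, 0.0]
merged, keep1, keep2 = merge_pair(w1, w2)
expert1 = reconstruct(merged, keep1)
expert2 = reconstruct(merged, keep2)
```

In this toy example the first and last entries are shared, the second is kept only for expert 1, and the third only for expert 2, so a single merged list plus two masks stands in for both experts.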
Peer Reviews
Decision: Submitted to ICLR 2026
* The method is training-free and exceptionally fast. Compressing Mixtral-8x7B takes only 2 minutes, drastically outperforming the 90 minutes required for search-based NAEE or 55 minutes for SVD-based D2.
* The dual-mask merging strategy is highly effective, demonstrating state-of-the-art accuracy. At 50% compression, it shows minimal degradation, while prior methods suffer catastrophic accuracy drops. On Mixtral-8x7B MMLU, PuzzleMoE achieves 65.7%, whereas NAEE and HC-SMOE collapse to 47.3% and
* The bit-packing scheme is the method's primary strength but also its critical weakness. It is rigidly tied to 2-to-1 pairwise merging. As the ablation in Table 6 confirms, merging 3+ experts is infeasible because the required metadata (5+ bits) exceeds the 4 bits available (1 sign + 3 freed exponent). This fundamentally limits PuzzleMoE to a fixed 50% expert compression ratio (or 75% at 25% sparsity), lacking the flexibility of other pruning methods.
* The paper's claim of compatibility with q
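The bit budget described in this review can be illustrated with a small packing sketch. It treats the top four bits of a 16-bit word as metadata (1 sign bit plus 3 freed exponent bits, per the review) and the remaining twelve bits as the weight payload. Which exponent bits are actually freed, how the exponent is re-encoded, and what each metadata bit means are not stated in this excerpt, so the layout and the names `pack`, `unpack`, and `metadata_fits` are assumptions made for illustration only.

```python
META_BITS = 4  # 1 sign bit + 3 freed exponent bits, as stated in the review
PAYLOAD_BITS = 16 - META_BITS  # assumed: the rest of the 16-bit word
PAYLOAD_MASK = (1 << PAYLOAD_BITS) - 1


def pack(meta: int, payload: int) -> int:
    """Place `meta` in the top META_BITS of a 16-bit word, `payload` below it."""
    if not 0 <= meta < (1 << META_BITS):
        raise ValueError("metadata does not fit in the available bits")
    if not 0 <= payload <= PAYLOAD_MASK:
        raise ValueError("payload does not fit in the remaining bits")
    return (meta << PAYLOAD_BITS) | payload


def unpack(word: int) -> tuple:
    """Split a 16-bit word back into (meta, payload)."""
    return word >> PAYLOAD_BITS, word & PAYLOAD_MASK


def metadata_fits(required_bits: int) -> bool:
    """Check whether a merge scheme's per-weight metadata fits the budget."""
    return required_bits <= META_BITS


word = pack(0b1011, 0x0ABC)
meta, payload = unpack(word)
```

Under these assumptions, `metadata_fits(4)` holds for pairwise merging while `metadata_fits(5)` does not, which mirrors the reviewer's point that merging three or more experts exceeds the four available bits.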
- **Novel and intuitive merging/unmerging framework**
  - The "puzzle" analogy, where experts are assembled from shared pieces, provides a clear and compelling conceptual model for structured expert compression (Fig. 2; Sec. 3.2; p.4). This enhances the clarity and impact of the proposed method.
  - The framework allows for flexible, data-driven expert reconstruction during inference, which is more sophisticated than static merging or pruning techniques (Algorithm 1; Sec. 3.2.2; p.6).
  - The ap
- **Complexity and overhead of the unmerging mechanism**
  - The learnable unmerging mechanism introduces additional parameters (the unmerging coefficients) and computational steps, which could offset some of the gains from expert merging. The overhead is not fully quantified in terms of memory and latency (Sec. 3.2.2; p.6).
  - The process of learning the unmerging coefficients requires a separate optimization step, which adds complexity to the training pipeline. The sensitivity to the hyperpar
- The proposed dual-mask merging mechanism is technically coherent and effectively preserves both shared and expert-specific information.
- The work includes comprehensive experiments across several modern MoE architectures and diverse tasks, including reasoning benchmarks like GSM8K.
- The writing is organized and reproducible, with detailed algorithms, ablations, and implementation notes that enhance transparency.
- The proposed design mainly integrates ideas from prior expert merging (e.g., HC-SMoE, Sub-MoE) and bit-level quantization methods rather than introducing a fundamentally new approach.
- The reported inference acceleration (~1.2×) is relatively small given the 50% compression ratio; the main benefit appears to be memory reduction rather than compute efficiency.
- Pairwise merging may not scale efficiently to larger expert counts, and the method’s complexity for >128 experts or hierarchical r
Taxonomy
Topics: Domain Adaptation and Few-Shot Learning · Mobile Crowdsensing and Crowdsourcing · Machine Learning and Data Classification
