Mixture Compressor for Mixture-of-Experts LLMs Gains More
Wei Huang, Yue Liao, Jianhui Liu, Ruifei He, Haoru Tan, Shiming Zhang, Hongsheng Li, Si Liu, Xiaojuan Qi

TL;DR
This paper introduces MC, a training-free compression method for MoE-LLMs that significantly reduces memory and computation through adaptive quantization and dynamic per-token expert selection, with minimal accuracy loss.
Contribution
The paper proposes a novel, training-free Mixture-Compressor (MC) that combines adaptive quantization and dynamic pruning to compress MoE-LLMs efficiently.
Findings
76.6% model compression at 2.54 bits with 3.8% accuracy loss
15% reduction in activated parameters during inference with less than 0.6% performance drop
Effective trade-off between model efficiency and accuracy demonstrated through extensive experiments
Abstract
Mixture-of-Experts large language models (MoE-LLMs) mark a significant step forward for language models; however, they encounter two critical challenges in practice: 1) expert parameters lead to considerable memory consumption and loading latency; and 2) the currently activated experts are redundant, as many tokens may only require a single expert. Motivated by these issues, we investigate MoE-LLMs and make two key observations: a) different experts exhibit varying behaviors in activation reconstruction error, routing scores, and activation frequencies, highlighting their differing importance, and b) not all tokens are equally important -- only a small subset is critical. Building on these insights, we propose MC, a training-free Mixture-Compressor for MoE-LLMs, which leverages the significance of both experts and tokens to achieve extreme compression. First, to mitigate storage and…
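Observation (a) suggests spending more bits on the experts that matter most. The following is a minimal sketch of that idea, not the authors' implementation: the reviews note that the paper solves the allocation with integer programming, and this toy version stands in for it with an exhaustive search over a handful of experts. The function name `allocate_bits`, the importance score (activation frequency times mean routing score), and the error table are all illustrative assumptions.

```python
from itertools import product


def allocate_bits(importance, recon_error, bit_choices=(1, 2, 3), avg_bits=2.0):
    """Pick one bit-width per expert by exhaustive search (toy scale only).

    importance[i]     -- hypothetical significance score of expert i
    recon_error[i][b] -- hypothetical reconstruction error of expert i at b bits
    avg_bits          -- budget on the mean bit-width across experts
    """
    n = len(importance)
    budget = avg_bits * n
    best_cost, best_plan = float("inf"), None
    for plan in product(bit_choices, repeat=n):
        if sum(plan) > budget:
            continue  # violates the average bit-width budget
        # Importance-weighted reconstruction error of this assignment.
        cost = sum(importance[i] * recon_error[i][b] for i, b in enumerate(plan))
        if cost < best_cost:
            best_cost, best_plan = cost, plan
    return best_plan, best_cost


# Made-up calibration statistics for four experts (assumptions, not paper data).
frequency = [0.40, 0.30, 0.20, 0.10]      # how often each expert is activated
routing_score = [0.6, 0.5, 0.4, 0.3]      # mean routing weight when activated
importance = [f * s for f, s in zip(frequency, routing_score)]

# Same hypothetical error curve for every expert: fewer bits, larger error.
recon_error = [{1: 1.0, 2: 0.3, 3: 0.1} for _ in importance]

plan, cost = allocate_bits(importance, recon_error, avg_bits=2.0)
print(plan, round(cost, 3))
```

With these toy numbers the most important expert receives 3 bits and the least important receives 1 bit, while the mean stays at 2 bits. Exhaustive search grows exponentially with the number of experts, so a real allocation would need a proper integer-programming solver.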
Peer Reviews
Decision·ICLR 2025 Poster
- MoE efficiency is a relatively underexplored area compared with general, dense LLMs. This work is a welcome addition.
- The two proposed designs deliver decent task performance, especially PMQ over BSP in Table 2.
- The novelty of the proposed work is limited, as both mixed-precision quantization and token-dependent expert pruning are well-explored avenues for efficient MoE inference.
- Potential lack of baseline: BSP is the only truly relevant comparison to PMQ due to its mixed-precision approach. No pruning comparisons are provided for ODP.
- Most datasets used in Table 2 are common-sense intelligence tasks. Extensive literature across various fields has shown that such tasks (and ppl) are relatively ro
1. Pre-loading is an intuitive yet effective method to cope with overheads in loading expert parameters.
2. This method leveraged the uneven features learned by different expert heads as guidance to optimize quantization effort with integer programming while providing valid expert significance analysis to defend the assumption.
3. This method introduced token relevance from the attention heat map to the criterion of parameter pruning, offering salient pruning instructions without utilizing exter
While this research adopts weight-only pruning, we encourage the authors to compare it against other popular pruning methods in the second stage, to demonstrate that weight-only pruning is sufficient and effective among the methods considered.
1. The proposed PMQ method innovatively considers multiple factors (activation reconstruction error, routing scores, and frequencies) in determining bit-width allocation.
2. The paper presents a comprehensive solution that addresses both static model compression and dynamic inference optimization.
3. The authors provide extensive empirical validation across multiple benchmarks and model sizes, demonstrating the method’s robustness and scalability.
1. The paper lacks ablation studies on the impact of different hyperparameters (μ threshold, protection ratio) on model performance.
2. The paper does not adequately address the potential compounding effects of quantization errors across multiple MoE layers, particularly in deeper networks where error propagation could be more significant.
3. The paper lacks a comprehensive error analysis to identify which types of tasks or linguistic phenomena are most affected by the compression techniques.
4.
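For readers unfamiliar with the two hyperparameters named in the review above, the following toy routine sketches how a μ threshold and a protection ratio could interact in dynamic expert pruning. It assumes top-2 routing, drops the second expert when its routing weight is small relative to the first, and exempts the most important tokens from pruning. The exact rule, the function name `prune_experts`, and the token importance scores are assumptions for illustration, not the paper's confirmed criterion.

```python
def prune_experts(routing_weights, token_importance, mu=0.5, protect_ratio=0.25):
    """Return how many experts (1 or 2) each token keeps.

    routing_weights  -- list of (w_top1, w_top2) per token, with w_top1 >= w_top2
    token_importance -- hypothetical per-token importance score
    mu               -- prune the second expert when w_top2 / w_top1 < mu
    protect_ratio    -- fraction of most important tokens that are never pruned
    """
    n = len(routing_weights)
    n_protected = int(n * protect_ratio)
    # Rank token indices by importance, highest first.
    ranked = sorted(range(n), key=lambda t: token_importance[t], reverse=True)
    protected = set(ranked[:n_protected])

    kept = []
    for t, (w_top1, w_top2) in enumerate(routing_weights):
        if t not in protected and w_top2 / w_top1 < mu:
            kept.append(1)  # second expert contributes little: skip it
        else:
            kept.append(2)  # keep both experts
    return kept


# Made-up routing weights and importance scores for four tokens.
weights = [(0.9, 0.1), (0.6, 0.4), (0.8, 0.2), (0.55, 0.45)]
importance_scores = [0.9, 0.1, 0.2, 0.3]

kept = prune_experts(weights, importance_scores, mu=0.5, protect_ratio=0.25)
print(kept)
```

In this toy input the first token has a very weak second expert but is the most important token, so the protection ratio keeps both of its experts. Only the third token is pruned down to a single expert.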
Taxonomy
Topics: Scientific Computing and Data Management
Methods: Pruning
