Mixture Compressor for Mixture-of-Experts LLMs Gains More

Wei Huang; Yue Liao; Jianhui Liu; Ruifei He; Haoru Tan; Shiming Zhang; Hongsheng Li; Si Liu; Xiaojuan Qi

arXiv:2410.06270·cs.LG·February 25, 2025

Open Access · 1 Repo · 3 Reviews

TL;DR

This paper introduces MC, a training-free compression method for MoE-LLMs that significantly reduces memory and computation by adaptive quantization and dynamic token expert selection, with minimal accuracy loss.

Contribution

The paper proposes a novel, training-free Mixture-Compressor (MC) that combines adaptive quantization and dynamic pruning to compress MoE-LLMs efficiently.

Findings

01

76.6% model compression at 2.54 bits with 3.8% accuracy loss

02

15% reduction in activated parameters during inference with less than 0.6% performance drop

03

Effective trade-off between model efficiency and accuracy demonstrated through extensive experiments
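
Finding 02 concerns dropping experts per token at inference time. This page does not give the paper's exact pruning rule, so the following is only a minimal sketch under stated assumptions: top-2 routing, a hypothetical threshold `mu` on the ratio between a lower-ranked expert's routing weight and the top weight, and a hypothetical `protected` flag for tokens judged important. The function name and all parameters are illustrative, not taken from the paper or its repository.

```python
from typing import List, Tuple


def select_experts(
    routing_weights: List[float],
    mu: float = 0.5,
    protected: bool = False,
    top_k: int = 2,
) -> List[Tuple[int, float]]:
    """Return (expert_index, renormalized_weight) pairs for one token.

    Assumption (not from the paper): a lower-ranked expert is dropped when
    its routing weight is below mu times the top expert's weight, unless the
    token is marked as protected. Routing weights are assumed positive,
    as they would be after a softmax.
    """
    # Rank experts by routing weight and keep the usual top-k candidates.
    ranked = sorted(enumerate(routing_weights), key=lambda p: p[1], reverse=True)
    ranked = ranked[:top_k]
    top_weight = ranked[0][1]

    if protected:
        # Protected tokens keep every top-k expert.
        kept = ranked
    else:
        # The top expert is always kept; the others must pass the ratio test.
        kept = [ranked[0]] + [p for p in ranked[1:] if p[1] >= mu * top_weight]

    # Renormalize so the kept weights sum to 1.
    total = sum(w for _, w in kept)
    return [(i, w / total) for i, w in kept]


# One dominant expert: the second expert is pruned for an ordinary token.
pruned = select_experts([0.1, 0.7, 0.2], mu=0.5)
# The same token, if protected, keeps both experts.
kept_both = select_experts([0.1, 0.7, 0.2], mu=0.5, protected=True)
```

In this sketch, a larger `mu` prunes more aggressively, and protecting a token trades compute for accuracy on that token.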

Abstract

Mixture-of-Experts large language models (MoE-LLMs) mark a significant step forward for language models; however, they encounter two critical challenges in practice: 1) expert parameters lead to considerable memory consumption and loading latency; and 2) the currently activated experts are redundant, as many tokens may only require a single expert. Motivated by these issues, we investigate MoE-LLMs and make two key observations: a) different experts exhibit varying behaviors on activation reconstruction error, routing scores, and activated frequencies, highlighting their differing importance, and b) not all tokens are equally important -- only a small subset is critical. Building on these insights, we propose MC, a training-free Mixture-Compressor for MoE-LLMs, which leverages the significance of both experts and tokens to achieve extreme compression. First, to mitigate storage and…

Peer Reviews

Decision·ICLR 2025 Poster

Reviewer 01 · Rating 8 · Confidence 4

Strengths

- MoE efficiency is a relatively underexplored area in comparison to general, dense LLMs. This work is a welcome addition.
- Two proposed designs are able to deliver decent task performance, especially for PMQ over BSP in Table 2.

Weaknesses

- The novelty of the proposed work is limited, as both mixed-precision quantization and token-dependent expert pruning are well-explored avenues for efficient MoE inference.
- Potential lack of baseline: BSP is the only truly relevant comparison to PMQ due to its mixed-precision approach. No pruning comparisons are provided for ODP.
- Most datasets used in Table 2 are common-sense intelligence tasks. Extensive literature across various fields has shown that such tasks (and ppl) are relatively ro

Reviewer 02 · Rating 8 · Confidence 4

Strengths

1. Pre-loading is an intuitive yet effective method to cope with overheads in loading expert parameters.
2. This method leveraged the uneven features learned by different expert heads as guidance to optimize quantization effort with integer programming while providing valid expert significance analysis to defend the assumption.
3. This method introduced token relevance from the attention heat map to the criterion of parameter pruning, offering salient pruning instructions without utilizing exter

Weaknesses

While this research adopts weight-only pruning, we encourage the authors to compare the effectiveness of other popular pruning methods in the second stage to demonstrate that weight-only pruning is sufficient and effective among the methods selected.

Reviewer 03 · Rating 6 · Confidence 4

Strengths

1. The proposed PMQ method innovatively considers multiple factors (activation reconstruction error, routing scores, and frequencies) in determining bit-width allocation.
2. The paper presents a comprehensive solution that addresses both static model compression and dynamic inference optimization.
3. The authors provide extensive empirical validation across multiple benchmarks and model sizes, demonstrating the method’s robustness and scalability.
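
The reviews describe PMQ as allocating per-expert bit-widths from expert importance, with one reviewer noting that integer programming is used. The paper's actual objective and solver are not shown on this page, so the sketch below is an illustration only: it assumes a precomputed importance score per expert, a hypothetical shared table of quantization error per bit-width, and an average-bit budget, and it solves the resulting tiny integer program by exhaustive search rather than with a real solver.

```python
from itertools import product
from typing import Dict, Optional, Sequence, Tuple


def allocate_bits(
    importance: Sequence[float],
    error: Dict[int, float],
    bit_options: Sequence[int] = (1, 2, 3),
    avg_bits: float = 2.5,
) -> Optional[Tuple[int, ...]]:
    """Choose one bit-width per expert under an average-bit budget.

    Minimizes sum_i importance[i] * error[bits_i] subject to
    sum_i bits_i <= avg_bits * num_experts. Both the objective and the
    inputs are assumptions made for illustration. Returns None if no
    assignment fits the budget.
    """
    n = len(importance)
    budget = avg_bits * n
    best_cost = float("inf")
    best_plan: Optional[Tuple[int, ...]] = None

    # Exhaustive search is fine for one MoE layer with a handful of experts
    # (3 options for 8 experts gives 6561 candidate plans).
    for plan in product(bit_options, repeat=n):
        if sum(plan) > budget:
            continue
        cost = sum(importance[i] * error[plan[i]] for i in range(n))
        if cost < best_cost:
            best_cost, best_plan = cost, plan
    return best_plan


# Hypothetical numbers: four experts, error shrinking as bits increase.
scores = [0.4, 0.1, 0.3, 0.2]
err_by_bits = {1: 1.0, 2: 0.3, 3: 0.1}
plan = allocate_bits(scores, err_by_bits, avg_bits=2.0)
mean_bits = sum(plan) / len(plan)
```

With these made-up inputs, the most important expert receives the extra bit and the least important one gives a bit up, which is the qualitative behavior the reviews attribute to mixed-precision allocation.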

Weaknesses

1. The paper lacks ablation studies on the impact of different hyperparameters (μ threshold, protection ratio) on model performance.
2. The paper does not adequately address the potential compounding effects of quantization errors across multiple MoE layers, particularly in deeper networks where error propagation could be more significant.
3. The paper lacks a comprehensive error analysis to identify which types of tasks or linguistic phenomena are most affected by the compression techniques.
4.

Code & Models

Repositories

aaronhuang-778/mc-moe
PyTorch · Official

Taxonomy

Topics: Scientific Computing and Data Management

Methods: Pruning