CAMERA: Multi-Matrix Joint Compression for MoE Models via Micro-Expert Redundancy Analysis
Yuzhuang Xu, Xu Han, Yuanchi Zhang, Yixuan Wang, Yijun Liu, Shiyu Ji, Qingfu Zhu, Wanxiang Che

TL;DR
This paper introduces CAMERA, a micro-expert-based compression framework for MoE models that reduces parameter count and computational cost while maintaining performance across multiple tasks.
Contribution
The paper proposes CAMERA, a training-free method for analyzing micro-expert redundancy, together with two compression techniques for MoE models: CAMERA-P for pruning and CAMERA-Q for quantization.
Findings
CAMERA-P outperforms baselines at 20-60% pruning ratios.
CAMERA-Q achieves superior results with 2-bit quantization.
CAMERA completes micro-expert analysis of large models in under 5 minutes.
Abstract
Large Language Models (LLMs) with Mixture-of-Experts (MoE) architectures are distinguished by their strong performance scaling with increasing parameters across a wide range of tasks, yet they also suffer from substantial computational and storage overheads. Notably, the performance gains of MoE models do not scale proportionally with the growth in expert parameters. While prior works attempt to reduce parameters via expert-level pruning, merging, or decomposition, they still suffer from challenges in both performance and computational efficiency. In this paper, we address these challenges by introducing micro-expert as a finer-grained compression unit that spans across matrices. We first establish a more fundamental perspective, viewing MoE layers as mixtures of micro-experts, and present CAMERA, a lightweight and training-free framework for identifying micro-expert redundancy. Our…
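To make the idea of a unit that "spans across matrices" concrete, the sketch below assumes a common gated feed-forward expert with three weight matrices (here called w_gate, w_up, and w_down, with a SiLU activation). These names, shapes, and the activation are illustrative assumptions and are not taken from the abstract. Under that assumption, the expert output splits exactly into a sum of per-index terms, each using one row of w_gate, one row of w_up, and one column of w_down; one such triple is what the sketch treats as a micro-expert. It does not reproduce CAMERA's redundancy analysis, only the decomposition that such an analysis could operate on.

```python
import math
import random

def silu(z):
    return z / (1.0 + math.exp(-z))

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def expert_forward(x, w_gate, w_up, w_down):
    """Gated FFN expert: w_down @ (silu(w_gate @ x) * (w_up @ x))."""
    hidden = [silu(dot(g, x)) * dot(u, x) for g, u in zip(w_gate, w_up)]
    return [dot(row, hidden) for row in w_down]

def micro_expert_sum(x, w_gate, w_up, w_down, keep=None):
    """Same output, written as a sum over micro-experts.

    Micro-expert i = (row i of w_gate, row i of w_up, column i of w_down).
    `keep` (hypothetical parameter) lists the indices that are retained;
    omitting an index removes that micro-expert from all three matrices.
    """
    keep = range(len(w_gate)) if keep is None else keep
    out = [0.0] * len(w_down)
    for i in keep:
        coeff = silu(dot(w_gate[i], x)) * dot(w_up[i], x)  # scalar activation
        for r in range(len(w_down)):
            out[r] += coeff * w_down[r][i]                 # scaled column i
    return out

# Small random expert: model width d, hidden width h (arbitrary toy sizes).
random.seed(0)
d, h = 4, 6
x = [random.gauss(0, 1) for _ in range(d)]
w_gate = [[random.gauss(0, 1) for _ in range(d)] for _ in range(h)]
w_up = [[random.gauss(0, 1) for _ in range(d)] for _ in range(h)]
w_down = [[random.gauss(0, 1) for _ in range(h)] for _ in range(d)]

full = expert_forward(x, w_gate, w_up, w_down)
summed = micro_expert_sum(x, w_gate, w_up, w_down)
assert all(math.isclose(a, b, abs_tol=1e-9) for a, b in zip(full, summed))
```

Because the decomposition is exact, dropping index i from the sum is the same as deleting row i of w_gate, row i of w_up, and column i of w_down, which shrinks all three matrices together rather than removing a whole expert.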
Taxonomy
Topics: Distributed and Parallel Computing Systems
