CAMERA: Multi-Matrix Joint Compression for MoE Models via Micro-Expert Redundancy Analysis

Yuzhuang Xu; Xu Han; Yuanchi Zhang; Yixuan Wang; Yijun Liu; Shiyu Ji; Qingfu Zhu; Wanxiang Che

arXiv:2508.02322 · cs.CL · November 27, 2025

TL;DR

This paper introduces CAMERA, a micro-expert-based compression framework for MoE models that reduces parameter count and computational cost while maintaining performance across multiple tasks.

Contribution

The paper proposes CAMERA, a training-free method for micro-expert redundancy analysis, together with two compression techniques for MoE models: CAMERA-P for pruning and CAMERA-Q for quantization.

Findings

01 CAMERA-P outperforms baselines at 20-60% pruning ratios.

02 CAMERA-Q achieves superior results with 2-bit quantization.

03 Micro-expert analysis of large models completes in under 5 minutes.

Abstract

Large Language Models (LLMs) with Mixture-of-Experts (MoE) architectures are distinguished by their strong performance scaling with increasing parameters across a wide range of tasks, yet they also suffer from substantial computational and storage overheads. Notably, the performance gains of MoE models do not scale proportionally with the growth in expert parameters. While prior works attempt to reduce parameters via expert-level pruning, merging, or decomposition, they still suffer from challenges in both performance and computational efficiency. In this paper, we address these challenges by introducing micro-expert as a finer-grained compression unit that spans across matrices. We first establish a more fundamental perspective, viewing MoE layers as mixtures of micro-experts, and present CAMERA, a lightweight and training-free framework for identifying micro-expert redundancy. Our…
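To make "a compression unit that spans across matrices" concrete, here is a minimal sketch. It assumes a SwiGLU-style expert with three matrices, and it assumes that a micro-expert is the matched slice across them: row i of the gate matrix, row i of the up matrix, and column i of the down matrix. The names `expert_forward` and `micro_expert_forward`, the toy dimensions, and the `keep` argument are all hypothetical and are not taken from the paper. The sketch only shows that an expert's output can be written as a sum of per-slice contributions, so dropping a slice removes one term from all three matrices at once. It does not implement CAMERA's redundancy criterion; the kept indices below are arbitrary.

```python
import math
import random


def silu(z):
    """SiLU activation, z * sigmoid(z)."""
    return z / (1.0 + math.exp(-z))


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def expert_forward(x, w_gate, w_up, w_down):
    """Gated FFN expert: y = W_down @ (silu(W_gate @ x) * (W_up @ x))."""
    hidden = [silu(dot(g, x)) * dot(u, x) for g, u in zip(w_gate, w_up)]
    return [dot(row, hidden) for row in w_down]


def micro_expert_forward(x, w_gate, w_up, w_down, keep=None):
    """Same expert written as a sum over matched slices (assumed micro-experts).

    Slice i uses row i of w_gate, row i of w_up, and column i of w_down.
    `keep` lists the slice indices to retain; None keeps all of them.
    """
    d_model, d_ff = len(w_down), len(w_gate)
    keep = range(d_ff) if keep is None else keep
    y = [0.0] * d_model
    for i in keep:
        # Scalar activation of slice i.
        a = silu(dot(w_gate[i], x)) * dot(w_up[i], x)
        # Its contribution is that scalar times column i of w_down.
        for j in range(d_model):
            y[j] += a * w_down[j][i]
    return y


def rand_matrix(rows, cols):
    return [[random.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]


random.seed(0)
d_model, d_ff = 4, 6  # toy sizes, chosen only for illustration
x = [random.gauss(0.0, 1.0) for _ in range(d_model)]
w_gate = rand_matrix(d_ff, d_model)
w_up = rand_matrix(d_ff, d_model)
w_down = rand_matrix(d_model, d_ff)

full = expert_forward(x, w_gate, w_up, w_down)
summed = micro_expert_forward(x, w_gate, w_up, w_down)
max_diff = max(abs(a - b) for a, b in zip(full, summed))

# Keeping an arbitrary subset of slices shrinks all three matrices together.
pruned = micro_expert_forward(x, w_gate, w_up, w_down, keep=[0, 2, 3])
```

Under these assumptions, `max_diff` is zero up to floating-point rounding, which is why a slice can be scored and removed as a single unit instead of pruning each matrix independently.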

Taxonomy

Topics: Distributed and Parallel Computing Systems