$\mu$-MoE: Test-Time Pruning as Micro-Grained Mixture-of-Experts

Toshiaki Koike-Akino; Jing Liu; Ye Wang

arXiv:2505.18451·cs.LG·May 27, 2025

TL;DR

This paper introduces $\mu$-MoE, a test-time pruning method that runs activation-aware pruning adaptively for each prompt, reducing inference complexity without retraining.

Contribution

The paper formulates prompt-dependent, activation-aware pruning at inference time as a mixture of micro-experts. This addresses the domain shift that can arise when compression relies on calibration data and the downstream task is unknown.

Findings

01

$\mu$-MoE adapts to task- and prompt-dependent structured sparsity on the fly.

02

Pruning is executed per prompt while still reducing complexity at inference.

03

Several experiments demonstrate this adaptability across downstream tasks.

Abstract

To tackle the huge computational demand of large foundation models, activation-aware compression techniques that require no retraining have been introduced. However, because these techniques rely on calibration data, domain shift may arise for unknown downstream tasks. With a computationally efficient calibration, activation-aware pruning can instead be executed adaptively for every prompt while still reducing complexity at inference. We formulate this as a mixture of micro-experts, called $\mu$-MoE. Several experiments demonstrate that $\mu$-MoE can dynamically adapt to task/prompt-dependent structured sparsity on the fly.
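
To make the idea of per-prompt activation-aware pruning concrete, here is a minimal sketch in plain Python. It is not the authors' implementation. It assumes a Wanda-style importance score, $|w_{ij}| \cdot \lVert x_j \rVert_2$, where the norm is taken over the tokens of the current prompt, and it keeps the top-scoring weights in each output row. The function names (`column_norms`, `prompt_mask`, `pruned_linear`), the `keep_ratio` parameter, and the toy weights are all hypothetical and chosen only for illustration.

```python
import math

def column_norms(activations):
    """L2 norm of each input feature across the prompt's tokens."""
    n_features = len(activations[0])
    return [math.sqrt(sum(tok[j] ** 2 for tok in activations))
            for j in range(n_features)]

def prompt_mask(weight, activations, keep_ratio):
    """Build a 0/1 mask per output row from the current prompt.

    Assumption: score_ij = |w_ij| * ||x_j||_2 (Wanda-style), with the
    top `keep_ratio` fraction of weights kept in every row.
    """
    norms = column_norms(activations)
    mask = []
    for row in weight:
        scores = [abs(w) * n for w, n in zip(row, norms)]
        k = max(1, round(keep_ratio * len(row)))
        order = sorted(range(len(row)), key=lambda j: scores[j], reverse=True)
        keep = set(order[:k])
        mask.append([1 if j in keep else 0 for j in range(len(row))])
    return mask

def pruned_linear(weight, mask, x):
    """Apply the masked weight matrix to one token vector x."""
    return [sum(w * m * xj for w, m, xj in zip(row, mrow, x))
            for row, mrow in zip(weight, mask)]

# Toy layer: 2 outputs, 4 inputs (hypothetical values).
W = [[0.5, -2.0, 0.1, 1.0],
     [1.5,  0.2, -0.3, 0.4]]

# Two prompts that excite different input features.
prompt_a = [[1.0, 0.0, 3.0, 0.0], [2.0, 0.0, 1.0, 0.0]]
prompt_b = [[0.0, 1.0, 0.0, 2.0], [0.0, 2.0, 0.0, 1.0]]

mask_a = prompt_mask(W, prompt_a, keep_ratio=0.5)
mask_b = prompt_mask(W, prompt_b, keep_ratio=0.5)
print(mask_a)  # [[1, 0, 1, 0], [1, 0, 1, 0]]
print(mask_b)  # [[0, 1, 0, 1], [0, 1, 0, 1]]
```

The same weight matrix yields different masks for the two prompts, which is the sense in which each retained weight can be read as a micro-expert selected by the input. A fixed calibration set would instead commit to a single mask for every prompt.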

Taxonomy

Methods: Pruning