$\mu$-MoE: Test-Time Pruning as Micro-Grained Mixture-of-Experts
Toshiaki Koike-Akino, Jing Liu, Ye Wang

TL;DR
This paper introduces $\mu$-MoE, a dynamic test-time pruning method that adapts model sparsity to each task-specific prompt, reducing inference cost without retraining.
Contribution
It proposes a novel mixture of micro-experts framework enabling adaptive, prompt-dependent pruning during inference, addressing domain shift issues in activation-aware compression.
Findings
$\mu$-MoE achieves effective task-specific sparsity.
It reduces inference complexity dynamically, on a per-prompt basis.
It adapts across various downstream tasks.
Abstract
To tackle the huge computational demand of large foundation models, activation-aware compression techniques without retraining have been introduced. However, since these rely on calibration data, domain shift may arise for unknown downstream tasks. With a computationally efficient calibration, activation-aware pruning can be executed adaptively for every prompt while still achieving reduced complexity at inference. We formulate it as a mixture of micro-experts, called $\mu$-MoE. Several experiments demonstrate that $\mu$-MoE can dynamically adapt to task/prompt-dependent structured sparsity on the fly.
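The idea of recalibrating an activation-aware pruning mask for every prompt can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes a Wanda-style importance score (weight magnitude times the L2 norm of the matching input feature over the prompt's tokens), and the function names, the `keep_ratio` parameter, and the toy matrices are all hypothetical. For simplicity it keeps the top-scoring weights per output row, so the structured-sparsity patterns mentioned in the abstract are not modelled.

```python
import math

def column_norms(X):
    """L2 norm of each input feature across the prompt's tokens."""
    n_features = len(X[0])
    return [math.sqrt(sum(row[j] ** 2 for row in X)) for j in range(n_features)]

def prune_for_prompt(W, X, keep_ratio):
    """Return a copy of W with low-importance weights zeroed for this prompt.

    Assumed score (Wanda-style): |W[i][j]| * ||X[:, j]||_2.
    Each output row keeps its top `keep_ratio` fraction of weights.
    """
    norms = column_norms(X)
    k = max(1, round(keep_ratio * len(norms)))
    pruned = []
    for w_row in W:
        scores = [abs(w) * n for w, n in zip(w_row, norms)]
        ranked = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
        keep = set(ranked[:k])
        pruned.append([w if j in keep else 0.0 for j, w in enumerate(w_row)])
    return pruned

def linear(W, x):
    """Plain matrix-vector product y = W x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

# Toy layer with 2 outputs and 4 inputs (hypothetical values).
W = [[0.9, -0.1, 0.5, 0.2],
     [0.1, 0.8, -0.3, 0.4]]

# Two prompts whose activations emphasize different input features.
prompt_a = [[1.0, 0.0, 0.1, 0.0], [2.0, 0.1, 0.0, 0.0]]
prompt_b = [[0.0, 1.0, 0.0, 2.0], [0.1, 0.0, 0.0, 3.0]]

W_a = prune_for_prompt(W, prompt_a, keep_ratio=0.5)
W_b = prune_for_prompt(W, prompt_b, keep_ratio=0.5)
y_a = linear(W_a, prompt_a[-1])  # forward pass with the prompt-specific mask
```

In this toy case the two prompts select different surviving weights from the same layer, which is the sense in which each retained weight behaves like a prompt-routed micro-expert. The per-prompt cost is one pass over the activations plus a sort per row.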
Taxonomy
Methods: Pruning
