TL;DR
EvoESAP is a non-uniform expert pruning method for sparse Mixture-of-Experts language models. It optimizes how the pruning budget is allocated across layers, preserving more model quality than uniform pruning while reducing memory footprint and serving cost.
Contribution
The paper proposes EvoESAP, an evolutionary search framework that optimizes non-uniform sparsity allocation across layers. The search is guided by ESAP, a bounded, stable, and cheap-to-compute proxy metric, and the resulting allocations outperform uniform pruning.
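To make the idea of searching over layer-wise allocations concrete, here is a minimal sketch of an evolutionary loop that keeps the total number of pruned experts fixed while shifting budget between layers. Everything in it is an illustrative assumption, not the paper's algorithm: the `mutate` and `evolve` helpers, the "move one pruned expert between two layers" mutation, the keep-the-best selection rule, and the toy fitness function with made-up `sensitivity` weights are all hypothetical. In the paper's setting the fitness would be the ESAP score of the pruned model.

```python
import random

def mutate(alloc, max_per_layer, rng):
    """Move one pruned expert from one layer to another; the total stays fixed.
    (Hypothetical mutation operator, chosen for illustration.)"""
    child = list(alloc)
    donors = [i for i, k in enumerate(child) if k > 0]
    if not donors:
        return child
    src = rng.choice(donors)
    receivers = [i for i, k in enumerate(child) if k < max_per_layer and i != src]
    if not receivers:
        return child
    dst = rng.choice(receivers)
    child[src] -= 1
    child[dst] += 1
    return child

def evolve(init, fitness, max_per_layer, generations=200, offspring=8, seed=0):
    """Simple keep-the-best evolutionary search over per-layer pruning counts."""
    rng = random.Random(seed)
    best, best_fit = list(init), fitness(init)
    for _ in range(generations):
        children = [mutate(best, max_per_layer, rng) for _ in range(offspring)]
        cand = max(children, key=fitness)
        cand_fit = fitness(cand)
        if cand_fit > best_fit:
            best, best_fit = cand, cand_fit
    return best, best_fit

# Toy stand-in for a real quality metric: layers with a larger (made-up)
# sensitivity weight are penalized more for each expert pruned from them.
sensitivity = [4.0, 1.0, 1.0, 0.5]

def toy_fitness(alloc):
    return -sum(w * k * k for w, k in zip(sensitivity, alloc))

uniform = [4, 4, 4, 4]  # uniform allocation: 4 of 8 experts pruned in each layer
best, best_fit = evolve(uniform, toy_fitness, max_per_layer=8)
print(best, sum(best) == sum(uniform))
```

Starting from the uniform allocation means the search can only return something at least as good as uniform under the chosen fitness, which is why a cheap and stable fitness signal matters when many candidates are compared.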
Findings
EvoESAP improves open-ended generation performance by up to 19.6% at 50% sparsity.
EvoESAP consistently outperforms uniform pruning across models from 7B to 30B parameters.
The method maintains competitive accuracy on multiple-choice tasks despite aggressive sparsity.
Abstract
Sparse Mixture-of-Experts (SMoE) language models achieve strong capability at low per-token compute, yet deployment remains constrained by memory footprint and throughput because the full expert pool must still be stored and served. Post-training expert pruning reduces this cost, but most methods focus on which experts to prune within each layer and default to a uniform layer-wise sparsity allocation, even though the layer-wise allocation can strongly affect performance. We decouple pruning into within-layer expert ranking and across-layer budget allocation, and introduce Expected Speculative Acceptance Proxy (ESAP), a speculative-decoding-inspired, teacher-forced metric that measures how well a pruned model matches the full model without costly autoregressive decoding. ESAP is bounded and stable, enabling cheap comparison of many candidates.…
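The abstract as shown here does not give the ESAP formula, so the following is only a sketch of one plausible reading. In standard speculative sampling, a draft token drawn from distribution q is accepted against target distribution p with expected probability sum_x min(p(x), q(x)), which lies in [0, 1]. The hypothetical `expected_acceptance` function below averages that quantity over token positions, with both models fed the same teacher-forced prefix. The function names, the use of raw logits as input, and the plain averaging are assumptions made for illustration.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def expected_acceptance(full_logits, pruned_logits):
    """Mean over positions of sum_x min(p_full(x), p_pruned(x)).

    Each argument is a list of per-position logit vectors, assumed to be
    computed by the full and pruned model on the same teacher-forced text.
    (Assumed form of the proxy; not taken from the paper.)
    """
    scores = []
    for full_row, pruned_row in zip(full_logits, pruned_logits):
        p = softmax(full_row)
        q = softmax(pruned_row)
        scores.append(sum(min(a, b) for a, b in zip(p, q)))
    return sum(scores) / len(scores)

# Two positions, vocabulary of three tokens (toy numbers).
full = [[2.0, 0.0, -1.0], [0.5, 0.5, 0.5]]
pruned = [[0.0, 2.0, -1.0], [0.5, 0.5, 0.5]]
print(expected_acceptance(full, full))    # identical models score (numerically) 1
print(expected_acceptance(full, pruned))  # a mismatch lowers the score
```

Because the score needs only one forward pass per model over fixed text and no sampling, it can be computed for many candidate allocations without autoregressive decoding, which matches the "cheap comparison" motivation in the abstract.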
