Enhancing Mixture-of-Experts Specialization via Cluster-Aware Upcycling

Sanghyeok Chu, Pyunghwan Ahn, Gwangmo Song, SeungHwan Kim, Honglak Lee, and Bohyung Han

arXiv:2604.13508 · cs.CV · April 20, 2026

TL;DR

This paper introduces Cluster-aware Upcycling, a novel initialization method for Mixture-of-Experts models that leverages semantic clustering and self-distillation to improve specialization, diversity, and performance.

Contribution

It proposes a cluster-aware initialization strategy that incorporates semantic structure into MoE, breaking symmetry and enhancing early expert specialization.

Findings

01. Outperforms existing methods on CLIP benchmarks

02. Produces more diverse and disentangled expert representations

03. Reduces inter-expert similarity and improves routing confidence

Abstract

Sparse Upcycling provides an efficient way to initialize a Mixture-of-Experts (MoE) model from pretrained dense weights instead of training from scratch. However, since all experts start from identical weights and the router is randomly initialized, the model suffers from expert symmetry and limited early specialization. We propose Cluster-aware Upcycling, a strategy that incorporates semantic structure into MoE initialization. Our method first partitions the dense model's input activations into semantic clusters. Each expert is then initialized using the subspace representations of its corresponding cluster via truncated SVD, and the router's initial weights are set to the cluster centroids. This cluster-aware initialization breaks expert symmetry and encourages early specialization aligned with the data distribution. Furthermore, we introduce an expert-ensemble self-distillation…
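The initialization steps named in the abstract (cluster the activations, take a truncated SVD per cluster, set the router to the centroids) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the abstract does not say how the SVD subspace is turned into expert weights, so the projection of the dense weight onto each cluster's top singular directions is one plausible reading chosen here. The function names `kmeans` and `cluster_aware_init`, the use of plain Lloyd's k-means, and the `rank` parameter are all hypothetical.

```python
import numpy as np


def kmeans(x, k, iters=25, seed=0):
    """Plain Lloyd's k-means (an assumed stand-in for the paper's clustering)."""
    rng = np.random.default_rng(seed)
    centroids = x[rng.choice(len(x), size=k, replace=False)].copy()
    for _ in range(iters):
        dists = ((x[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=-1)
        labels = dists.argmin(axis=1)
        for j in range(k):
            members = x[labels == j]
            if len(members) > 0:  # keep the old centroid if a cluster empties
                centroids[j] = members.mean(axis=0)
    # Final assignment against the last centroid update.
    dists = ((x[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=-1)
    return centroids, dists.argmin(axis=1)


def cluster_aware_init(activations, w_dense, num_experts, rank):
    """Return (expert_weights, router_weights).

    activations: (n, d_model) inputs seen by the dense layer.
    w_dense:     (d_model, d_ff) pretrained dense weight.
    ASSUMPTION: each expert is the dense weight restricted to the top-`rank`
    right-singular subspace of its cluster's activations.
    """
    centroids, labels = kmeans(activations, num_experts)
    experts = []
    for j in range(num_experts):
        members = activations[labels == j]
        if len(members) == 0:
            # Fallback for an empty cluster: plain upcycling copy.
            experts.append(w_dense.copy())
            continue
        # Truncated SVD: keep the top-`rank` right singular vectors.
        _, _, vt = np.linalg.svd(members, full_matrices=False)
        basis = vt[:rank]  # (rank, d_model), orthonormal rows
        experts.append(basis.T @ (basis @ w_dense))
    # Router rows are the cluster centroids, so the logit for expert j
    # is the dot product of a token with centroid j.
    router = centroids.copy()
    return experts, router
```

With these weights, routing logits for a batch `x` would be `x @ router.T`, so a token tends to be sent to the expert whose cluster centroid it aligns with most. Because each expert sees a different projection of the dense weight, the experts no longer start out identical, which is the symmetry-breaking effect the abstract describes.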

Code & Models

Repositories

https://sanghyeokchu.github.io/cluster-aware-upcycling