Routing Manifold Alignment Improves Generalization of Mixture-of-Experts LLMs
Zhongyang Li, Ziyue Li, Tianyi Zhou

TL;DR
This paper introduces Routing Manifold Alignment (RoMA), a lightweight finetuning method that aligns routing weights with task embeddings to enhance the generalization of Mixture-of-Experts large language models.
Contribution
RoMA is a novel manifold regularization technique that improves MoE LLM routing by encouraging similar task samples to share expert choices, leading to better downstream performance.
Findings
RoMA significantly improves MoE LLM performance across benchmarks.
Lightweight finetuning of routers suffices for effective alignment.
RoMA ties task understanding (from task embeddings) to solution generation through expert selection.
Abstract
Sparse Mixture-of-Experts (MoE) models have been widely adopted in recent large language models because they can efficiently scale up model capability without increasing inference cost. However, evaluations on broad downstream tasks reveal a consistent suboptimality of the routers in existing MoE LLMs, which results in a severe performance gap (e.g., 10-20% in accuracy) relative to optimal routing. In this paper, we show that aligning the manifold of routing weights with that of task embeddings can effectively reduce the gap and improve MoE LLMs' generalization performance. Our method, "Routing Manifold Alignment (RoMA)", introduces an additional manifold regularization term in the post-training objective and only requires lightweight finetuning of routers (with other parameters frozen). Specifically, the regularization encourages the routing weights of each sample to be close to those of its…
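The regularizer described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names, the cosine-similarity neighbor search, the softmax-style neighbor weights, and the squared-distance penalty are all assumptions chosen for illustration. The sketch only shows the general shape of a manifold regularization term that pulls each sample's routing weights toward those of its nearest neighbors in a task embedding space.

```python
import numpy as np

def knn_indices(emb, k):
    """Return (indices, similarities) of the k nearest neighbors of each row.

    Assumption: neighbors are ranked by cosine similarity in embedding space.
    """
    unit = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    sim = unit @ unit.T
    np.fill_diagonal(sim, -np.inf)            # a sample is not its own neighbor
    idx = np.argsort(-sim, axis=1)[:, :k]     # top-k most similar rows
    return idx, np.take_along_axis(sim, idx, axis=1)

def manifold_regularizer(routing, emb, k=3):
    """Similarity-weighted squared distance between each sample's routing
    weights and those of its k embedding-space neighbors, averaged over samples.

    routing: (n_samples, n_experts) routing weights (hypothetical layout)
    emb:     (n_samples, dim) task embeddings from some embedding model
    """
    idx, sim = knn_indices(emb, k)
    w = np.exp(sim)                            # assumed neighbor weighting
    w = w / w.sum(axis=1, keepdims=True)       # normalize weights per sample
    diff = routing[:, None, :] - routing[idx]  # shape (n, k, n_experts)
    sq_dist = (diff ** 2).sum(axis=-1)         # shape (n, k)
    return float((w * sq_dist).sum(axis=1).mean())

# Toy data: two task clusters in embedding space.
emb = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])
aligned = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]])
misaligned = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0]])

print(manifold_regularizer(aligned, emb, k=1))     # neighbors share experts
print(manifold_regularizer(misaligned, emb, k=1))  # neighbors disagree
```

In a router-only finetuning setup, a term of this kind would be added to the task loss with a weighting coefficient while all non-router parameters stay frozen. The penalty is zero when neighboring samples already share routing weights and grows as their expert choices diverge.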
Peer Reviews
Decision: ICLR 2026 Poster
1. The high level idea of manifold alignment is intuitive and original. 2. The result that a substantive performance gain is attainable purely through better routing is compelling. 3. The authors present their work in two frontier backbones, OLMoE and DeepSeekMoE, and extensively validate their work across a host of model sizes. The authors also do a good job of validating their method against a range of downstream tasks.
**Missing baselines**. Perhaps the largest issue is that the authors essentially present a parameter-efficient finetuning method for MoE but do not compare with existing MoE-PEFT / PEFT works. The authors do a good job of comparing with a wide range of lightweight baselines including prompt and prefix tuning up to Dense BP and C3PO, but, in my view, comparison with LoRA [1] or one of its new SOTA variants (perhaps DoRA [2]), and a PEFT-MoE method, such as MoLE [3] is vital to assessing the validi
1. RoMA introduces a novel manifold alignment perspective to MoE routing, unifying task understanding (via embeddings) with expert selection. 2. The paper extensively evaluates RoMA on two recent MoE architectures (OLMoE and DeepSeekMoE) across eight diverse benchmarks. It outperforms strong baselines, including C3PO, Dense BP, and tuning methods, and shows competitive or superior performance to dense models with up to 34B parameters. 3. The authors systematically analyze key design choices—laye
1. While inference cost is unchanged, the training cost of RoMA—especially the nearest-neighbor search over the training set—is not discussed. 2. The paper uses a pre-trained embedding model E(⋅) to compute task embeddings but does not justify the choice or explore its sensitivity. The impact of different embedding models on RoMA’s performance remains unclear.
- **Clear, practical objective:** The manifold regularizer is standard, well-posed, and easy to implement for router-only fine-tuning.
- **Empirical breadth:** The paper reports consistent gains over a wide suite (MMLU, ARC-C/E, HellaSwag, etc.) and shows competitive performance relative to a strong test-time routing method (C3PO).
- **Focused ablations:** The paper varies the number of routed layers, token choice, and neighborhood size, which are some of the design levers that plausibly matter for
**Related works section is really sparse.** Much of the relevant literature is missing, including works such as MoEfication. **“Oracle” routing is defined but not computed, yet used as an empirical anchor.** The paper defines a per-example oracle $r_i^*=\arg\min_r \mathcal L_{\mathrm{CE}}(f(x_i,r),y_i)$ to claim a 10–20% “oracle gap,” and it reports Oracle rows in Table 1. However, no procedure is provided for obtaining these oracle numbers (se
Taxonomy
Topics: Advanced Graph Neural Networks · Complex Network Analysis Techniques · Domain Adaptation and Few-Shot Learning
