FineRMoE: Dimension Expansion for Finer-Grained Expert with Its Upcycling Approach
Ning Liao, Xiaoxing Wang, Xiaohan Qin, Junchi Yan

TL;DR
FineRMoE extends fine-grained MoE expert design from the intermediate dimension alone to both the intermediate and output dimensions, and pairs the architecture with an upcycling approach, aiming to move past the single-dimension limit and improve efficiency and performance across multiple benchmarks.
Contribution
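The paper proposes an architecture that extends fine-grained MoE to both the intermediate and output dimensions, a bi-level sparse forward computation with a specialized routing mechanism to govern activation, and a generalized, cost-effective upcycling method for building the model without training from scratch.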
Findings
Achieves 6x higher parameter efficiency
Reduces prefill latency by 281x
Increases decoding throughput by 136x
Abstract
As revealed by the scaling law of fine-grained MoE, model performance ceases to be improved once the granularity of the intermediate dimension exceeds the optimal threshold, limiting further gains from single-dimension fine-grained design. To address this bottleneck, we propose FineRMoE (FineR-Grained MoE), an architecture that extends fine-grained expert design to both intermediate and output dimensions, aiming to enhance expert specialization beyond the single-dimension limit. We further introduce a bi-level sparse forward computation paradigm and a specialized routing mechanism to govern the activation. In addition, to obviate the prohibitive cost of training FineRMoE from scratch, we devise a generalized upcycling method to build FineRMoE in a cost-effective manner. Extensive experiments demonstrate the superior performance achieved by FineRMoE across ten standard benchmarks.…
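The two dimensions named in the abstract can be pictured with a toy dense FFN. The sketch below is a minimal illustration, not the authors' implementation or their upcycling method: it assumes a plain two-layer ReLU FFN with made-up sizes, and every name in it (`W_up`, `W_down`, `N_INTER`, `N_OUT`, `split_ffn`) is hypothetical. It shows only the underlying identity: slicing the intermediate dimension and summing the partial results, then slicing the output dimension and concatenating the slices, reproduces the dense output when every slice is active.

```python
import math
import random

random.seed(0)
D_MODEL, D_FF = 4, 8      # toy sizes (assumptions, not from the paper)
N_INTER, N_OUT = 4, 2     # slices along intermediate / output dimension (assumptions)

def rand_matrix(rows, cols):
    return [[random.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]

W_up = rand_matrix(D_MODEL, D_FF)      # d_model -> d_ff
W_down = rand_matrix(D_FF, D_MODEL)    # d_ff -> d_model
x = [random.gauss(0.0, 1.0) for _ in range(D_MODEL)]

def chunks(n, parts):
    """Split range(n) into `parts` equal contiguous index ranges."""
    size = n // parts
    return [range(i * size, (i + 1) * size) for i in range(parts)]

def hidden(x, cols):
    """ReLU(x @ W_up), restricted to the given intermediate units."""
    return {j: max(0.0, sum(x[i] * W_up[i][j] for i in range(D_MODEL)))
            for j in cols}

def dense_ffn(x):
    h = hidden(x, range(D_FF))
    return [sum(h[j] * W_down[j][o] for j in range(D_FF))
            for o in range(D_MODEL)]

def split_ffn(x):
    out = []
    for out_cols in chunks(D_MODEL, N_OUT):          # output-dimension slices
        acc = [0.0] * len(out_cols)
        for inter_cols in chunks(D_FF, N_INTER):     # intermediate-dimension slices
            h = hidden(x, inter_cols)
            for k, o in enumerate(out_cols):
                acc[k] += sum(h[j] * W_down[j][o] for j in inter_cols)
        out.extend(acc)                              # concatenate along the output dimension
    return out

dense_out = dense_ffn(x)
split_out = split_ffn(x)
all_close = all(math.isclose(a, b, abs_tol=1e-9)
                for a, b in zip(dense_out, split_out))
```

With all slices active, `all_close` is true up to floating-point rounding. Sparsity would enter once a router activates only some of the slices; how FineRMoE selects them is specified in the paper, not in this sketch.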
Peer Reviews
Decision: Submitted to ICLR 2026
* Extended Fine-Grained Design: The proposed method innovatively extends fine-grained expert design from the intermediate dimension to the output dimension of MoE models, addressing the long-standing issue of dimensional inconsistency that limited output-dimension specialization in previous MoEs, thus enhancing expert redundancy reduction and specialization.
* Generalized Upcycling Method: The proposed upcycling approach resolves incompatibilities between existing upcycling techniques (for singl
* The proposed method’s performance depends on the proper tuning of four hyperparameters. Improper configurations may lead to suboptimal expert specialization or increased computational overhead, adding complexity to model deployment.
* This paper lacks experiments to validate the effectiveness of FineRMoE in Reinforcement Learning scenarios. Throughout the experimental sections, the evaluations are exclusively conducted on ten standard benchmarks covering knowledge, reasoning, code, and math, w
To my knowledge, no well-known Sparse MoE (SMoE) models have adopted concatenation as an internal mechanism. The experimental results from this exploration could potentially aid future development in this area.
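The contrast between concatenation and summation discussed in this review can be seen on a toy example. This is a minimal sketch under stated assumptions, not the paper's routing mechanism: the four expert outputs, the router scores, and the helper names (`top_k`, `combine_sum`, `combine_concat`) are all hypothetical. It only illustrates that summing the selected experts' outputs mixes them in shared coordinates, whereas concatenating keeps each expert's contribution in its own coordinates.

```python
def top_k(scores, k):
    """Indices of the k largest scores, highest first."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]

# Toy outputs for one token: 4 hypothetical experts, each emitting 2 values.
expert_outputs = [[1.0, 0.0], [0.0, 2.0], [3.0, 3.0], [-1.0, 1.0]]
router_scores = [0.1, 0.7, 0.05, 0.15]   # made-up scores

selected = top_k(router_scores, k=2)      # experts 1 and 3

def combine_sum(outputs, idx):
    """Add the selected outputs coordinate by coordinate (contributions mix)."""
    width = len(outputs[0])
    return [sum(outputs[i][d] for i in idx) for d in range(width)]

def combine_concat(outputs, idx):
    """Place the selected outputs side by side (contributions stay separate)."""
    return [v for i in idx for v in outputs[i]]

summed = combine_sum(expert_outputs, selected)          # [-1.0, 3.0]
concatenated = combine_concat(expert_outputs, selected) # [0.0, 2.0, -1.0, 1.0]
```

Concatenating k outputs multiplies the width by k, so a layer that keeps its output width fixed would need each expert to emit a narrower slice of the output.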
- The most important problem with this paper is that it doesn't discuss a strong necessity for introducing the additional structure (concat) into sparse experts. The paper describes the qualitative features of the proposed method (e.g., that concatenation allows different information to coexist without being mixed, unlike summation), but it fails to mention a specific situation that *must* be solved by the proposed method rather than by other implementations. This makes it difficult to distingui
1. High reproducibility: all training details, datasets, and hyperparameters are explicitly reported, increasing the paper’s credibility.
2. Robust empirical validation: experiments span multiple model sizes and show consistent gains.
3. Practical relevance: the proposed method can be readily applied to existing pretrained dense LLMs.
4. Experimental evidence: In the reported experiments, the model’s performance drops after CT, yet the proposed method achieves improvement, which strongly support
1. Missing comparison with Drop-Upcycling: Although cited as a related method, Drop-Upcycling is not included in Sec. 4.1 baseline comparisons. The omission leaves unclear whether FineRMoE’s gains hold against the strongest existing upcycling techniques.
2. Unclear effectiveness when CT does not degrade: In cases where CT does not cause performance drops (e.g., with weaker pretrained models such as Llama-3, or using higher quality datasets), it remains unclear whether FineRMoE would still outper
Taxonomy
Topics: Domain Adaptation and Few-Shot Learning · Generative Adversarial Networks and Image Synthesis · Advanced Neural Network Applications
