Less is More: Undertraining Experts Improves Model Upcycling
Stefan Horoi, Guy Wolf, Eugene Belilovsky, Gintare Karolina Dziugaite

TL;DR
This paper investigates how prolonged expert fine-tuning can harm model upcycling performance, revealing that early stopping strategies can enhance the reuse of models in multi-task systems.
Contribution
It challenges the assumption that longer fine-tuning always improves transfer, showing that early stopping can prevent memorization issues and improve upcycling results.
Findings
Long fine-tuning degrades merging performance.
Memorization of difficult examples causes degradation.
Early stopping improves upcycling performance.
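To make the early-stopping finding concrete, here is a minimal sketch of a reduce-on-plateau stopping rule. It is an illustration only and not the authors' implementation: the function name `plateau_early_stop` and all hyperparameters (`lr`, `factor`, `patience`, `min_lr`) are assumptions chosen for the example.

```python
def plateau_early_stop(val_losses, lr=1e-3, factor=0.1, patience=2, min_lr=1e-5):
    """Return (stop_index, final_lr) for a sequence of validation losses.

    Hypothetical rule: whenever the validation loss fails to improve for
    `patience` consecutive evaluations, multiply the learning rate by
    `factor`. Training stops once the reduced rate would fall below `min_lr`.
    """
    best = float("inf")
    bad_evals = 0
    for i, loss in enumerate(val_losses):
        if loss < best:
            # New best validation loss: reset the plateau counter.
            best = loss
            bad_evals = 0
        else:
            bad_evals += 1
        if bad_evals >= patience:
            new_lr = lr * factor
            if new_lr < min_lr:
                # The rate cannot be reduced further: stop here.
                return i, lr
            lr = new_lr
            bad_evals = 0
    # No stop was triggered: training ran through every evaluation.
    return len(val_losses) - 1, lr


if __name__ == "__main__":
    losses = [1.0, 0.8, 0.7] + [0.7] * 7
    print(plateau_early_stop(losses, lr=1.0, factor=0.5, patience=2, min_lr=0.25))
```

Under such a rule, an expert whose validation loss flattens early is checkpointed well before a fixed-length training schedule would end.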
Abstract
Modern deep learning is increasingly characterized by the use of open-weight foundation models that can be fine-tuned on specialized datasets. This has led to a proliferation of expert models and adapters, often shared via platforms like HuggingFace and AdapterHub. To leverage these resources, numerous model upcycling methods have emerged, enabling the reuse of fine-tuned models in multi-task systems. A natural pipeline has thus formed to harness the benefits of transfer learning and amortize sunk training costs: models are pre-trained on general data, fine-tuned on specific tasks, and then upcycled into more general-purpose systems. A prevailing assumption is that improvements at one stage of this pipeline propagate downstream, leading to gains at subsequent steps. In this work, we challenge that assumption by examining how expert fine-tuning affects model upcycling. We show that long…
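As a rough illustration of the "upcycle" step in this pipeline, the sketch below merges several fine-tuned experts with task arithmetic, one commonly used merging rule. The listing above does not say which merging methods the paper evaluates, so treat this as an assumed example. The function name `merge_task_arithmetic`, the `scale` parameter, and the dict-of-lists weight format are all hypothetical.

```python
def merge_task_arithmetic(pretrained, experts, scale=1.0):
    """Merge experts by adding their scaled task vectors to the base weights.

    pretrained: dict mapping parameter name -> list of floats (base weights).
    experts:    list of dicts with the same keys and shapes (fine-tuned weights).
    scale:      assumed scaling coefficient applied to the summed task vectors.
    """
    merged = {}
    for name, base in pretrained.items():
        # Sum the task vectors (expert minus base) for this parameter.
        summed = [0.0] * len(base)
        for expert in experts:
            for j, (w, b) in enumerate(zip(expert[name], base)):
                summed[j] += w - b
        merged[name] = [b + scale * d for b, d in zip(base, summed)]
    return merged


if __name__ == "__main__":
    base = {"w": [1.0, 2.0]}
    experts = [{"w": [2.0, 2.0]}, {"w": [1.0, 4.0]}]
    print(merge_task_arithmetic(base, experts, scale=0.5))
```

With this rule, the state each expert reaches during fine-tuning directly determines its task vector, which is why the choice of stopping point can matter downstream.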
Peer Reviews
Decision · Submitted to ICLR 2026
1. Writing is clear and easy to understand.
2. The experiments cover a wide range of settings, across both the vision and NLP domains, model merging and model MoErging, and different architectures.
1. The explanation of why overtraining hurts merging performance is unconvincing. The authors argue that overtrained experts hurt merging performance because later training steps “memorize” difficult samples, which are then forgotten during the merging stage. However, this at most explains why overtrained experts do not improve over undertrained experts, as the forgetting also happens with the undertrained experts (or it is even worse with undertrained experts, because acco
- The paper introduces a new perspective on why further training of experts hurts overall performance.
- The paper shows that LoRA trained for as few as 4 steps leads to better performance than overtrained experts after merging. Further, the paper shows that training an expert for only 1/8 of the training iterations leads to better performance.
- The paper is well-written and easy to follow.
- The result that demonstrates the benefit of undertraining in the context of model merging.
- The paper lacks discussion of other works that analyze the influence of the fine-tuning stage on the performance of merged models [A,B,C,D].
- The paper proposes to reduce LR on plateau, which is not new but rather a common technique.
- It is a bit hard to see why reducing LR on plateau is an effective early stopping strategy for model upcycling.
- The strategy of reducing LR on plateau has weak
- The paper presents a well-designed empirical study with clear hypotheses and extensive experiments across vision and language domains.
- Novel perspective: It challenges a strong but untested assumption in model upcycling research and provides evidence for the benefits of undertraining.
- Clarity and reproducibility: The methodology, datasets, and hyperparameters are clearly described, and the open-source implementation is a strong plus.
- The inclusion of MoE evaluations adds depth and rel
* The main limitation, also acknowledged by the authors, is the lack of actionable implementation details. The paper recommends publishing intermediate checkpoints and applying early stopping, but does not specify how many checkpoints to release, how to store or curate them, or how early stopping parameters should vary across tasks or domains.
* Experimental diversity: The study relies on single model families in each domain (ViT for vision and T5 for NLP). Evaluating additional architectures (e
Taxonomy
Topics: Statistics Education and Methodologies · Complex Systems and Decision Making · Data Analysis with R
Methods: Mixture of Experts · Early Stopping · Sparse Evolutionary Training
