TL;DR
mSFT is an iterative algorithm for multi-task supervised fine-tuning that removes each sub-dataset from the training mixture once it begins to overfit, so faster-learning tasks stop early while slower ones keep training.
Contribution
Introduces mSFT, a novel overfitting-aware search algorithm that optimizes dataset mixtures during multi-task supervised fine-tuning.
Findings
mSFT consistently outperforms 4 baselines across 10 benchmarks and 6 base models.
It maintains robust gains across diverse dataset sizes and task granularities.
At low compute budgets, mSFT reduces training FLOPs while improving performance.
Abstract
Current language model training commonly applies multi-task Supervised Fine-Tuning (SFT) using a homogeneous compute budget across all sub-datasets. This approach is fundamentally sub-optimal: heterogeneous learning dynamics cause faster-learning tasks to overfit early while slower ones remain under-fitted. To address this, we introduce mSFT, an iterative, overfitting-aware search algorithm for multi-task data mixtures. mSFT trains the model on an active mixture, identifies and excludes the earliest overfitting sub-dataset, and reverts to that specific optimal checkpoint before continuing. Extensive evaluations demonstrate that mSFT consistently outperforms 4 baselines across 10 benchmarks and 6 base models. Further analysis confirms that mSFT maintains robust gains across diverse dataset sizes and task granularities, and is insensitive to its single new hyperparameter (compute budget).…
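The loop described in the abstract (train on the active mixture, exclude the earliest overfitting sub-dataset, revert to its best checkpoint, continue) can be sketched in a few lines of Python. This is only one possible reading of the abstract, not the authors' implementation. The names `msft_sketch`, `val_loss`, `budget`, and `max_rounds` are hypothetical, real training is replaced by a toy validation-loss function, and treating the compute budget as a per-round step count is an assumption.

```python
from typing import Callable, Dict, Iterable


def msft_sketch(
    val_loss: Callable[[str, int], float],
    datasets: Iterable[str],
    budget: int,
    max_rounds: int = 100,
) -> Dict[str, int]:
    """Return the global step at which each sub-dataset leaves the mixture."""
    active = set(datasets)
    step = 0  # global step of the checkpoint we currently train from
    excluded_at: Dict[str, int] = {}

    for _ in range(max_rounds):
        if not active:
            break
        # Stand-in for training on the active mixture for `budget` steps:
        # record each active sub-dataset's validation loss across the window.
        window = {
            d: [val_loss(d, step + k) for k in range(budget + 1)] for d in active
        }
        # Offset of the lowest validation loss inside the window, per sub-dataset.
        best = {d: losses.index(min(losses)) for d, losses in window.items()}
        # Assumption: a sub-dataset counts as overfitting when its loss bottoms
        # out before the end of the window.
        overfit = {d: k for d, k in best.items() if k < budget}
        if not overfit:
            step += budget  # nothing has overfit yet: keep the whole window
            continue
        # Exclude the earliest overfitter and revert to its best checkpoint.
        first = min(overfit, key=lambda d: (overfit[d], d))
        step += overfit[first]
        active.remove(first)
        excluded_at[first] = step
    return excluded_at


# Toy curves (made up for illustration): each sub-dataset's validation loss
# is minimal at a different global step.
optimum = {"math": 3, "code": 7, "chat": 12}


def toy_loss(dataset: str, t: int) -> float:
    return float((t - optimum[dataset]) ** 2)


schedule = msft_sketch(toy_loss, optimum, budget=5)
```

With these toy curves, each sub-dataset is dropped at the step where its own validation loss is lowest, rather than all sub-datasets sharing one stopping point.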
