Metis-SPECS: Decoupling Multimodal Learning via Self-distilled Preference-based Cold Start
Kun Chen, Peng Shi, Haibo Qiu, Zhixiong Zeng, Siqi Yang, Wenji Mao, Lin Ma

TL;DR
This paper introduces SPECS, a novel cold start framework for multimodal learning that uses self-distillation and preference-based training to improve generalization and downstream reinforcement learning performance.
Contribution
It proposes a decoupled learning approach that generates preference data via self-distillation and focuses on transferable surface-form criteria, enhancing multimodal model training.
Findings
Preference-based training outperforms supervised fine-tuning in cold start scenarios.
SPECS improves benchmark performance by up to 12.2%.
The framework reduces in-distribution stuckness and stabilizes training.
Abstract
Reinforcement learning (RL) with verifiable rewards has recently catalyzed a wave of "MLLM-r1" approaches that bring RL to vision language models. Most representative paradigms begin with a cold start, typically employing supervised fine-tuning (SFT), to initialize the policy before RL. However, SFT-based cold start adopts a reasoning paradigm that is intertwined with task solution and output format, which may induce instruction-style overfitting, weaken out-of-distribution generalization, and ultimately affect downstream RL. We revisit the cold start along two views, its training method and data construction, and introduce the Generalization Factor (GF) coefficient to quantify the generalization capability under different methods. Our empirical study finds that preference-based training methods (e.g., DPO) generalize better than SFT-based methods in cold start. Motivated by this, we…
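The abstract names DPO as an example of preference-based training. For readers unfamiliar with it, the following is a minimal sketch of the standard DPO loss for a single preference pair. It is not the SPECS implementation: the function names, the scalar log-probability inputs, and the default `beta=0.1` are assumptions chosen for illustration.

```python
import math


def softplus(x: float) -> float:
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def dpo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,  # assumed value; the source does not state one
) -> float:
    """Standard DPO loss for one (chosen, rejected) response pair.

    Each argument is the summed log-probability of a full response
    under either the trainable policy or the frozen reference model.
    """
    # How much more the policy likes each response than the reference does.
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    logits = beta * (chosen_margin - rejected_margin)
    # -log(sigmoid(logits)) is the same as softplus(-logits).
    return softplus(-logits)
```

When the policy matches the reference, the loss equals log 2; it falls below that value as the policy raises the chosen response relative to the rejected one. An SFT objective, by contrast, only maximizes the likelihood of a single target response.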
