TL;DR
SkillFactory is a supervised fine-tuning method that uses samples generated by the model itself to prime it for cognitive skills, so that it generalizes better and is more robust after reinforcement learning.
Contribution
The paper presents a novel fine-tuning approach that leverages self-sampled data to enable models to acquire cognitive skills before reinforcement learning.
Findings
SkillFactory improves model generalization to harder tasks post-RL.
Models trained with SkillFactory utilize cognitive skills effectively.
SkillFactory models are more robust to regression on out-of-domain tasks.
Abstract
Reasoning models leveraging long chains of thought employ various cognitive skills, such as verification of their answers, backtracking, retrying by an alternate method, and more. Previous work has shown that when a base language model exhibits these skills, further training with reinforcement learning (RL) can teach the model to leverage them. How can we get models to leverage skills that aren't exhibited by base models? Our work, SkillFactory, is a method for fine-tuning models to roughly learn these skills during a supervised fine-tuning (SFT) stage prior to RL. Our approach does not rely on distillation from a stronger model, but instead uses samples from the model itself, rearranged to provide training data in the format of those skills. These "silver" SFT traces may be imperfect, but are nevertheless effective for priming a model to acquire skills during RL. Our evaluation shows…
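The idea of rearranging a model's own samples into skill-formatted training data can be illustrated with a minimal sketch. This is not the authors' implementation: the `Attempt` type, the `build_silver_trace` function, the use of a reference answer to label attempts, and the verification wording are all hypothetical, chosen only to show how several self-sampled attempts might be stitched into one try, verify, retry trace.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Attempt:
    """One self-sampled solution from the base model (hypothetical type)."""
    reasoning: str
    answer: str


def build_silver_trace(
    question: str,
    attempts: List[Attempt],
    reference_answer: str,
    max_wrong: int = 2,
) -> Optional[str]:
    """Stitch the model's own attempts into a single trace that shows
    attempt -> verification -> retry, ending on a correct attempt.

    Assumption: attempts are labeled by comparing against a reference answer.
    Returns None when no sampled attempt is correct.
    """
    wrong = [a for a in attempts if a.answer != reference_answer]
    right = [a for a in attempts if a.answer == reference_answer]
    if not right:
        return None  # nothing correct to end the trace on

    parts = [f"Question: {question}"]
    # Place a few incorrect attempts first, each followed by a templated
    # verification step that rejects it and triggers a retry.
    for a in wrong[:max_wrong]:
        parts.append(a.reasoning)
        parts.append(
            f"Verification: the answer {a.answer} does not hold up. "
            "Let me retry with a different approach."
        )
    # Finish with a correct attempt and a verification that accepts it.
    final = right[0]
    parts.append(final.reasoning)
    parts.append(f"Verification: the answer {final.answer} checks out.")
    parts.append(f"Final answer: {final.answer}")
    return "\n".join(parts)


if __name__ == "__main__":
    samples = [Attempt("2+2 is 5.", "5"), Attempt("2+2 is 4.", "4")]
    print(build_silver_trace("What is 2+2?", samples, "4"))
```

Traces built this way are "silver" in the abstract's sense: the verification text is templated rather than genuinely reasoned, so the result is imperfect, but it places the model's own outputs in the format of the target skill.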
