TL;DR
This paper empirically investigates when large language models can learn reasoning with weak supervision, highlighting the importance of reasoning faithfulness and training dynamics for successful generalization.
Contribution
It identifies reasoning faithfulness as a key predictor of learning success under weak supervision and demonstrates the combined effect of supervised fine-tuning and continual pre-training.
Findings
Models that generalize show prolonged reward growth before saturation.
Reasoning faithfulness predicts learning success, unlike output diversity.
Supervised fine-tuning on reasoning traces and continual pre-training on domain data improve performance under weak supervision.
Abstract
Large language models have achieved significant reasoning improvements through reinforcement learning with verifiable rewards (RLVR). Yet as model capabilities grow, constructing high-quality reward signals becomes increasingly difficult, making it essential to understand when RLVR can succeed under weaker forms of supervision. We conduct a systematic empirical study across diverse model families and reasoning domains under three weak supervision settings: scarce data, noisy rewards, and self-supervised proxy rewards. We find that generalization is governed by training reward saturation dynamics: models that generalize exhibit a prolonged pre-saturation phase during which training reward and downstream performance climb together, while models that saturate rapidly memorize rather than learn. We identify reasoning faithfulness, defined as the extent to which intermediate steps logically…
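To make two of the weak supervision settings named in the abstract concrete, the following is a minimal sketch, not the paper's implementation. It assumes exact-match answer checking as the verifiable reward, models a noisy reward as a random flip of that signal, and uses majority voting over sampled answers as one common self-supervised proxy reward. The function names, the `flip_prob` parameter, and the choice of majority voting are all illustrative assumptions; the source does not specify how the paper constructs these signals.

```python
import random
from collections import Counter


def verifiable_reward(answer: str, reference: str) -> float:
    """Exact-match reward: 1.0 if the answer equals the reference, else 0.0.

    Assumption: exact string match stands in for a real verifier.
    """
    return 1.0 if answer.strip() == reference.strip() else 0.0


def noisy_reward(answer: str, reference: str, flip_prob: float,
                 rng: random.Random) -> float:
    """Verifiable reward whose value is flipped with probability flip_prob.

    Hypothetical noise model for illustration; not taken from the paper.
    """
    reward = verifiable_reward(answer, reference)
    if rng.random() < flip_prob:
        return 1.0 - reward
    return reward


def majority_vote_rewards(answers: list[str]) -> list[float]:
    """Proxy reward without labels: 1.0 for answers matching the most
    frequent sampled answer, 0.0 otherwise.

    Assumption: majority voting is one possible self-supervised proxy.
    """
    if not answers:
        return []
    counts = Counter(a.strip() for a in answers)
    majority_answer, _ = counts.most_common(1)[0]
    return [1.0 if a.strip() == majority_answer else 0.0 for a in answers]


if __name__ == "__main__":
    rng = random.Random(0)
    samples = ["42", "42", "41", "42"]
    print([noisy_reward(a, "42", flip_prob=0.2, rng=rng) for a in samples])
    print(majority_vote_rewards(samples))
```

In this sketch the proxy reward never consults a reference answer, so it can reinforce a confidently wrong majority. That is one reason a setting like this counts as weak supervision.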
