Incentivizing Strong Reasoning from Weak Supervision
Yige Yuan, Teng Xiao, Shuchang Tao, Xue Wang, Jinyang Gao, Bolin Ding, Bingbing Xu

TL;DR
This paper proposes a cost-effective way to improve the reasoning abilities of large language models by training them with supervision from significantly weaker models, recovering most of the gains of expensive reinforcement learning techniques.
Contribution
The paper introduces a weak-supervision approach to incentivizing reasoning in LLMs, reducing reliance on costly high-quality demonstrations and reinforcement learning.
Findings
Supervision from significantly weaker reasoners substantially improves the stronger student's reasoning performance.
The approach recovers up to 94% of the gains of reinforcement learning.
The approach is effective across diverse benchmarks and model architectures.
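One plausible reading of "recovers up to 94% of the gains" is the weakly supervised model's improvement over the base model, divided by the improvement that reinforcement learning achieves over the same base. The sketch below illustrates that ratio. The function name `gain_recovered`, its parameters, and the scores are hypothetical and chosen only for illustration; the summary does not state the paper's exact formula or any benchmark numbers.

```python
def gain_recovered(base: float, weak_sft: float, rl: float) -> float:
    """Fraction of the RL improvement over the base model that the
    weakly supervised model recovers (assumed definition, not the paper's)."""
    rl_gain = rl - base
    if rl_gain <= 0:
        raise ValueError("RL score must exceed the base score")
    return (weak_sft - base) / rl_gain


if __name__ == "__main__":
    # Hypothetical benchmark accuracies, not taken from the paper.
    ratio = gain_recovered(base=40.0, weak_sft=58.8, rl=60.0)
    print(f"Recovered fraction of RL gain: {ratio:.2f}")
```

Under this reading, a ratio near 1.0 means weak supervision closes almost the entire gap between the base model and the RL-trained model.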
Abstract
Large language models (LLMs) have demonstrated impressive performance on reasoning-intensive tasks, but enhancing their reasoning abilities typically relies on either reinforcement learning (RL) with verifiable signals or supervised fine-tuning (SFT) with high-quality long chain-of-thought (CoT) demonstrations, both of which are expensive. In this paper, we study a novel problem of incentivizing the reasoning capacity of LLMs without expensive high-quality demonstrations and reinforcement learning. We investigate whether the reasoning capabilities of LLMs can be effectively incentivized via supervision from significantly weaker models. We further analyze when and why such weak supervision succeeds in eliciting reasoning abilities in stronger models. Our findings show that supervision from significantly weaker reasoners can substantially improve student reasoning performance, recovering…
Taxonomy
Topics: Explainable Artificial Intelligence (XAI) · Multimodal Machine Learning Applications · Intelligent Tutoring Systems and Adaptive Learning
