When Can LLMs Learn to Reason with Weak Supervision?

Salman Rahman; Jingyan Shen; Anna Mordvina; Hamid Palangi; Saadia Gabriel; Pavel Izmailov

arXiv:2604.18574·cs.LG·April 21, 2026

TL;DR

This paper empirically investigates when large language models can learn reasoning with weak supervision, highlighting the importance of reasoning faithfulness and training dynamics for successful generalization.

Contribution

The paper identifies reasoning faithfulness as a key predictor of learning success under weak supervision, and shows that supervised fine-tuning on reasoning traces combined with continual pre-training on domain data improves performance in that setting.

Findings

01

Models that generalize show prolonged reward growth before saturation.

02

Reasoning faithfulness predicts learning success; output diversity does not.

03

Fine-tuning on reasoning traces and domain pre-training improve weak supervision performance.

Abstract

Large language models have achieved significant reasoning improvements through reinforcement learning with verifiable rewards (RLVR). Yet as model capabilities grow, constructing high-quality reward signals becomes increasingly difficult, making it essential to understand when RLVR can succeed under weaker forms of supervision. We conduct a systematic empirical study across diverse model families and reasoning domains under three weak supervision settings: scarce data, noisy rewards, and self-supervised proxy rewards. We find that generalization is governed by training reward saturation dynamics: models that generalize exhibit a prolonged pre-saturation phase during which training reward and downstream performance climb together, while models that saturate rapidly memorize rather than learn. We identify reasoning faithfulness, defined as the extent to which intermediate steps logically…
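
The three weak supervision settings named in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the function names, the exact-match verifier, the label-flipping noise model, and the use of majority voting as the self-supervised proxy are all assumptions made here for illustration, and the paper's actual setups may differ.

```python
import random
from collections import Counter
from typing import List, Sequence


def exact_match_reward(answer: str, reference: str) -> float:
    """Verifiable reward: 1.0 if the final answer matches the reference, else 0.0."""
    return 1.0 if answer.strip() == reference.strip() else 0.0


def subsample(dataset: Sequence, k: int, rng: random.Random) -> list:
    """Scarce data (assumed form): train on only k randomly chosen examples."""
    return rng.sample(list(dataset), k)


def noisy_reward(answer: str, reference: str, flip_prob: float,
                 rng: random.Random) -> float:
    """Noisy reward (assumed form): flip the verifiable reward with probability flip_prob."""
    reward = exact_match_reward(answer, reference)
    return 1.0 - reward if rng.random() < flip_prob else reward


def majority_vote_reward(answers: Sequence[str]) -> List[float]:
    """Proxy reward without labels (assumed form): reward each sampled
    answer that agrees with the most common answer in the group."""
    counts = Counter(a.strip() for a in answers)
    majority, _ = counts.most_common(1)[0]
    return [1.0 if a.strip() == majority else 0.0 for a in answers]


rng = random.Random(0)
clean = exact_match_reward(" 42 ", "42")             # label is available and trusted
never_flipped = noisy_reward("42", "42", 0.0, rng)   # flip_prob=0 recovers the clean reward
always_flipped = noisy_reward("42", "42", 1.0, rng)  # flip_prob=1 inverts every reward
proxy = majority_vote_reward(["4", "4", "5"])        # no reference answer needed
tiny_train_set = subsample(range(1000), 8, rng)      # only 8 training examples
```

In this sketch the proxy reward never consults a reference answer, so it can reinforce a confidently wrong majority; that is the kind of failure a study of weak supervision has to account for.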

Code & Models

Repositories

pavelslab-nyu/rlvr-weak-supervision (GitHub)
