When Is Compositional Reasoning Learnable from Verifiable Rewards?
Daniel Barzilai, Yotam Wolf, Ronen Basri

TL;DR
This paper provides a theoretical framework for understanding when compositional reasoning can be learned through reinforcement learning with verifiable rewards (RLVR), highlighting the role of the task-advantage ratio.
Contribution
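The paper introduces the task-advantage ratio, a joint property of the compositional problem and the base model, as the quantity that characterizes which compositional problems are learnable in RLVR from outcome-level feedback alone. It then analyzes the conditions under which learning succeeds.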
Findings
Compositional problems in which correct intermediate steps provide a clear advantage are efficiently learnable with RLVR.
The absence of a structural advantage can lead to convergence to suboptimal solutions.
The quality of the base model influences whether such advantages are present, and therefore the learning outcome.
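The first two findings can be made concrete with a hypothetical toy sketch. It is not the paper's model, and it does not compute the paper's task-advantage ratio. A softmax policy picks one of `k` intermediate steps. A fixed downstream stage, standing in for the base model, then produces a correct final answer with probability `p_good` after the correct step and `p_bad` after any other step, and only the final outcome is rewarded. Under these assumptions, the exact policy gradient moves each logit by pi(z) * A(z), where A(z) = P(success | z) - P(success) is the advantage of intermediate step z. The function name `train_intermediate_step`, its parameters, and the values `lr=1.0` and `steps=200` are all illustrative choices.

```python
import numpy as np


def softmax(logits: np.ndarray) -> np.ndarray:
    """Numerically stable softmax."""
    e = np.exp(logits - logits.max())
    return e / e.sum()


def train_intermediate_step(k: int = 4, correct: int = 0, p_good: float = 0.6,
                            p_bad: float = 0.2, lr: float = 1.0,
                            steps: int = 200) -> np.ndarray:
    """Exact policy-gradient ascent on a hypothetical two-stage task.

    The policy chooses one of k intermediate steps; a fixed downstream stage
    then answers correctly with probability p_good after the correct step and
    p_bad after any other step. Only the final outcome is rewarded.
    Returns the learned distribution over intermediate steps.
    """
    success = np.full(k, p_bad)   # P(final answer correct | intermediate step z)
    success[correct] = p_good
    theta = np.zeros(k)           # logits of a uniform initial policy
    for _ in range(steps):
        pi = softmax(theta)
        expected_reward = pi @ success           # J = E_z[P(success | z)]
        advantage = success - expected_reward    # A(z) for every step z
        # For a softmax policy, dJ/dtheta_z = pi(z) * A(z).
        theta += lr * pi * advantage
    return softmax(theta)


clear_gap = train_intermediate_step(p_good=0.6, p_bad=0.2)
no_gap = train_intermediate_step(p_good=0.3, p_bad=0.3)
small_gap = train_intermediate_step(p_good=0.21, p_bad=0.2)
print(clear_gap.round(3))
print(no_gap.round(3))
print(small_gap.round(3))
```

With a clear gap, the probability of the correct intermediate step approaches 1. When `p_good == p_bad`, every advantage is zero and the policy never leaves its uniform initialization, so the outcome reward carries no signal about the intermediate step. A small gap still moves the policy in the right direction, but much more slowly.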
Abstract
The emergence of compositional reasoning in large language models through reinforcement learning with verifiable rewards (RLVR) has been a key driver of recent empirical successes. Despite this progress, it remains unclear which compositional problems are learnable in this setting using outcome-level feedback alone. In this work, we theoretically study the learnability of compositional problems in autoregressive models under RLVR training. We identify a quantity that we call the task-advantage ratio, a joint property of the compositional problem and the base model, that characterizes which tasks and compositions are learnable from outcome-level feedback. On the positive side, using this characterization, we show that compositional problems where correct intermediate steps provide a clear advantage are efficiently learnable with RLVR. We also analyze how such an advantage naturally…
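Outcome-level feedback can be illustrated with a minimal sketch. It assumes a hypothetical two-step task, f(g(3)) with g(x) = x + 1 and f(y) = 2y, and represents each trajectory as a list of step strings. It also uses a simple group-mean baseline for the advantage, a common choice in RLVR practice that the paper does not necessarily analyze. The helper names `verifiable_reward` and `group_advantages` are illustrative. The verifier inspects only the final answer, so a trajectory that reaches `8` through a wrong intermediate step receives the same reward as a fully correct one.

```python
from statistics import mean


def verifiable_reward(trajectory: list[str], target: str) -> float:
    """Outcome-level reward: 1.0 iff the final step equals the target.

    Intermediate steps are never inspected.
    """
    return 1.0 if trajectory and trajectory[-1] == target else 0.0


def group_advantages(rewards: list[float]) -> list[float]:
    """Advantage of each sampled trajectory relative to the group-mean reward."""
    baseline = mean(rewards)
    return [r - baseline for r in rewards]


# Hypothetical task: compute f(g(3)) with g(x) = x + 1 and f(y) = 2 * y, so the
# correct intermediate step is "4" and the correct final answer is "8".
samples = [
    ["4", "8"],   # correct intermediate step, correct answer
    ["5", "8"],   # wrong intermediate step, final answer guessed correctly
    ["4", "9"],   # correct intermediate step, wrong answer
    ["5", "10"],  # wrong intermediate step, consistent but wrong answer
]
rewards = [verifiable_reward(t, "8") for t in samples]
advantages = group_advantages(rewards)
print(rewards)     # [1.0, 1.0, 0.0, 0.0]
print(advantages)  # [0.5, 0.5, -0.5, -0.5]
```

The lucky trajectory gets the same positive advantage as the fully correct one, and the trajectory with a correct intermediate step but a wrong answer is penalized. Any credit for the intermediate step therefore has to arrive indirectly, through how often correct steps lead to correct final answers.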
