Reinforcement Learning with Conditional Expectation Reward
Changyi Xiao, Caijun Xu, Yixin Cao

TL;DR
This paper introduces Conditional Expectation Reward (CER), a novel verification method for reinforcement learning in language models that uses the model itself as an implicit verifier, enabling application to diverse reasoning tasks without external rules.
Contribution
The paper proposes CER, a new soft reward mechanism that replaces domain-specific verifiers with the language model's own likelihood, broadening reinforcement learning applicability.
Findings
CER improves reasoning performance across mathematical and general tasks.
CER provides a graded reward signal reflecting answer correctness.
Experimental results validate CER's effectiveness and flexibility.
Abstract
Reinforcement Learning with Verifiable Rewards (RLVR) has proven effective in enhancing the reasoning capabilities of large language models, particularly in domains such as mathematics where reliable rule-based verifiers can be constructed. However, the reliance on handcrafted, domain-specific verification rules substantially limits the applicability of RLVR to general reasoning domains with free-form answers, where valid answers often exhibit significant variability, making it difficult to establish complete and accurate rules. To address this limitation, we propose Conditional Expectation Reward (CER), which leverages the large language model itself as an implicit verifier, and is therefore applicable to general domains and eliminates the need for external verifiers or auxiliary models. CER is defined as the expected likelihood of generating the reference answer conditioned on the…
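The abstract describes CER as an expected likelihood of generating the reference answer, but the text is cut off before it says what the expectation is conditioned on. The sketch below is therefore only an illustration of the general idea of a likelihood-based soft reward, not the authors' definition. It assumes the caller has already obtained per-token log-probabilities of the reference answer under several sampled contexts. The function names and the input layout are hypothetical.

```python
import math
from typing import Sequence


def sequence_likelihood(token_logprobs: Sequence[float]) -> float:
    """Likelihood of one answer: exp of the summed per-token log-probabilities."""
    return math.exp(sum(token_logprobs))


def expected_likelihood_reward(
    per_context_logprobs: Sequence[Sequence[float]],
) -> float:
    """Monte Carlo average of the reference-answer likelihood.

    Each inner sequence holds the reference answer's token log-probs under
    one sampled context (assumption: how contexts are sampled is left to
    the caller, since the source text is truncated at that point).
    """
    if not per_context_logprobs:
        raise ValueError("need at least one sampled context")
    likelihoods = [sequence_likelihood(lp) for lp in per_context_logprobs]
    return sum(likelihoods) / len(likelihoods)


if __name__ == "__main__":
    # Toy numbers only: two sampled contexts, reference answer of 2 and 1 tokens.
    toy = [[math.log(0.5), math.log(0.5)], [math.log(1.0)]]
    print(expected_likelihood_reward(toy))
```

Because each likelihood lies between 0 and 1, the averaged value is a graded signal rather than a binary pass/fail, which is the property the Findings attribute to CER.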
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Multimodal Machine Learning Applications
