TL;DR
This paper introduces K2V, a framework that extends RLVR to knowledge-intensive domains by automatically synthesizing verifiable data and verifying the model's reasoning process, improving LLM reasoning without significantly compromising general capabilities.
Contribution
K2V is a novel framework that combines automated data synthesis and reasoning verification for RLVR in knowledge-intensive domains.
Findings
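K2V improves LLM reasoning in knowledge-intensive domains without significantly compromising general capabilities.
Automated synthesis of verifiable data addresses the scarcity of high-quality verifiable training data in these domains.
Verifying the reasoning process, rather than only the final answer, targets flawed reasoning and sparse reward signals.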
Abstract
Reinforcement learning with verifiable rewards (RLVR) has demonstrated promising potential to enhance the reasoning capabilities of large language models (LLMs) in domains such as mathematics and coding. However, its application to knowledge-intensive domains has not been effectively explored due to the scarcity of high-quality verifiable data. Furthermore, current RLVR focuses solely on the correctness of final answers, which leaves it prone to flawed reasoning and sparse reward signals. In this work, we propose Knowledge-to-Verification (K2V), a framework that extends RLVR to knowledge-intensive domains through automated verifiable data synthesis, while enabling verification of the LLM's reasoning process. Extensive experiments demonstrate that K2V enhances the reasoning of LLMs in knowledge-intensive domains without significantly compromising the model's general capabilities.…
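To make the contrast between answer-only rewards and process-aware rewards concrete, here is a minimal Python sketch. It is not the paper's implementation: the abstract does not specify how K2V scores reasoning, so the `Check` type, the keyword-based checks, the exact-match answer comparison, and the `process_weight` parameter are all hypothetical choices made for illustration.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Check:
    """Hypothetical verifiable criterion applied to the reasoning text."""
    name: str
    passes: Callable[[str], bool]


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial formatting differences are ignored."""
    return " ".join(text.lower().split())


def answer_reward(predicted: str, reference: str) -> float:
    """Answer-only reward: 1.0 on a normalized exact match, else 0.0 (sparse signal)."""
    return 1.0 if normalize(predicted) == normalize(reference) else 0.0


def process_reward(reasoning: str, checks: List[Check]) -> float:
    """Fraction of reasoning checks that pass; 0.0 when no checks are supplied."""
    if not checks:
        return 0.0
    passed = sum(1 for check in checks if check.passes(reasoning))
    return passed / len(checks)


def combined_reward(
    predicted: str,
    reference: str,
    reasoning: str,
    checks: List[Check],
    process_weight: float = 0.5,  # assumed mixing weight, not taken from the paper
) -> float:
    """Blend final-answer correctness with the share of reasoning checks passed."""
    if not 0.0 <= process_weight <= 1.0:
        raise ValueError("process_weight must lie in [0, 1]")
    outcome = answer_reward(predicted, reference)
    process = process_reward(reasoning, checks)
    return (1.0 - process_weight) * outcome + process_weight * process


# Illustrative checks for a single knowledge question (made up for this example).
checks = [
    Check("mentions pancreas", lambda r: "pancreas" in r.lower()),
    Check("mentions glucose", lambda r: "glucose" in r.lower()),
]
reasoning = "The pancreas releases this hormone."

right = combined_reward("Insulin", "insulin", reasoning, checks)
wrong = combined_reward("Glucagon", "insulin", reasoning, checks)
print(right, wrong)
```

With an answer-only reward, the wrong response above would score 0.0 regardless of how much of its reasoning was sound. In this sketch it still receives partial credit for the check it passes, which is one simple way a denser signal could be obtained under the stated assumptions.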
