RLVR Training of LLMs Does Not Improve Thinking Ability for General QA: Evaluation Method and a Simple Solution
Kaiyuan Li, Jing-Cheng Pang, Yang Yu

TL;DR
Reinforcement learning from verifiable rewards (RLVR) improves reasoning on verifiable tasks, but the gains do not automatically transfer to general question answering (GQA); explicit training methods such as START are needed to improve both thinking and answer quality.
Contribution
The paper introduces a new evaluation framework for reasoning quality, demonstrates the limited transfer of RLVR to GQA, and proposes START, a simple training method that improves GQA performance.
Findings
RLVR improves reasoning on verifiable tasks but not on GQA.
Explicit GQA training remains necessary despite RLVR.
START enhances reasoning and answer quality on GQA benchmarks.
Abstract
Reinforcement learning from verifiable rewards (RLVR) stimulates the thinking processes of large language models (LLMs), substantially enhancing their reasoning abilities on verifiable tasks. It is often assumed that similar gains should transfer to general question answering (GQA), but this assumption has not been thoroughly validated. To assess whether RLVR automatically improves LLM performance on GQA, we propose a Cross-Generation evaluation framework that measures the quality of intermediate reasoning by feeding the generated thinking context into LLMs of varying capabilities. Our evaluation leads to a discouraging finding: the efficacy of the thinking process on GQA tasks is markedly lower than on verifiable tasks, suggesting that explicit training on GQA remains necessary in addition to training on verifiable tasks. We further observe that direct RL training on GQA is less…
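The idea of feeding a generated thinking context into LLMs of varying capabilities can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `cross_generation_gain`, the prompt template, the comparison against a no-thinking baseline, and the stub answerers and scorer are all assumptions made for illustration.

```python
from typing import Callable, Dict

# Hypothetical interfaces (not from the paper):
# an answerer maps a prompt to an answer; a scorer rates (question, answer) in [0, 1].
Answerer = Callable[[str], str]
Scorer = Callable[[str, str], float]


def cross_generation_gain(
    question: str,
    thinking: str,
    answerers: Dict[str, Answerer],
    score: Scorer,
) -> Dict[str, float]:
    """Score change for each answerer when it is given another model's thinking.

    Assumption: thinking quality is approximated by the score with the
    thinking context minus the score without it.
    """
    gains = {}
    for name, answer in answerers.items():
        baseline = score(question, answer(question))
        # Assumed prompt template; the source does not specify one.
        prompt = f"{question}\n\nThinking:\n{thinking}\n\nAnswer:"
        with_thinking = score(question, answer(prompt))
        gains[name] = with_thinking - baseline
    return gains


# Stub models standing in for LLMs of different capability.
def weak_model(prompt: str) -> str:
    # Only answers correctly if the context already mentions the answer.
    return "Paris" if "Paris" in prompt else "unknown"


def strong_model(prompt: str) -> str:
    # Answers correctly with or without help.
    return "Paris"


def exact_match(question: str, answer: str) -> float:
    # Toy scorer with a hard-coded reference answer for the demo question.
    return 1.0 if answer == "Paris" else 0.0


gains = cross_generation_gain(
    question="What is the capital of France?",
    thinking="The capital city of France is Paris.",
    answerers={"weak": weak_model, "strong": strong_model},
    score=exact_match,
)
print(gains)  # {'weak': 1.0, 'strong': 0.0}
```

In this toy setup a useful thinking trace lifts the weak answerer while leaving the strong one unchanged; a trace that helps no answerer would yield gains near zero across the board.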
Taxonomy
Topics: Topic Modeling · Multimodal Machine Learning Applications · Natural Language Processing Techniques
