Incentivizing LLMs to Self-Verify Their Answers
Fuxiang Zhang, Jiacheng Xu, Chaojie Wang, Ce Cui, Yang Liu, Bo An

TL;DR
This paper introduces a self-verification framework for LLMs that trains models to assess their own answers, leading to improved reasoning performance and effective test-time scaling without external reward models.
Contribution
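The paper proposes a unified reinforcement learning approach that trains a single LLM to both generate and verify answers, avoiding the distribution mismatch between a task-specific post-trained generator and a general external reward model, and improving reasoning accuracy.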
Findings
Models trained with self-verification outperform baseline models.
Self-verification enables effective test-time scaling.
The approach generalizes across different reasoning tasks.
Abstract
Large Language Models (LLMs) have demonstrated remarkable progress in complex reasoning tasks through both post-training and test-time scaling laws. While prevalent test-time scaling approaches are often realized by using external reward models to guide the model generation process, we find that only marginal gains can be acquired when scaling a model post-trained on specific reasoning tasks. We identify that the limited improvement stems from distribution discrepancies between the specific post-trained generator and the general reward model. To address this, we propose a framework that incentivizes LLMs to self-verify their own answers. By unifying answer generation and verification within a single reinforcement learning (RL) process, we train models that can effectively assess the correctness of their own solutions. The trained model can further scale its performance at inference time…
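As a rough illustration of how self-verification could drive test-time scaling without an external reward model, the sketch below samples several answers and weights each one by the model's own correctness estimate. The abstract is truncated before it describes the inference-time procedure, so this weighted-voting scheme is an assumption rather than the authors' method, and `generate`, `verify`, and `self_verified_vote` are hypothetical names standing in for calls to the same trained model.

```python
from collections import defaultdict
from typing import Callable, Dict


def self_verified_vote(
    question: str,
    generate: Callable[[str], str],
    verify: Callable[[str, str], float],
    n_samples: int = 8,
) -> str:
    """Return the answer with the highest total self-verification score.

    `generate` and `verify` are hypothetical wrappers around the SAME model:
    one samples a final answer, the other returns the model's own estimate
    (0.0 to 1.0) that the answer is correct. This is an assumed aggregation
    rule, not one taken from the paper.
    """
    if n_samples < 1:
        raise ValueError("n_samples must be at least 1")
    totals: Dict[str, float] = defaultdict(float)
    for _ in range(n_samples):
        answer = generate(question)
        # Each sampled answer contributes its self-assessed score as a vote.
        totals[answer] += verify(question, answer)
    # Ties resolve to the answer that was sampled first (dict insertion order).
    return max(totals, key=totals.get)


if __name__ == "__main__":
    # Toy stand-ins for the model: "41" is sampled more often,
    # but the verifier trusts "42" far more.
    samples = iter(["41", "42", "41", "42", "41"])
    pick = self_verified_vote(
        "toy question",
        generate=lambda q: next(samples),
        verify=lambda q, a: 0.9 if a == "42" else 0.2,
        n_samples=5,
    )
    print(pick)
```

In the toy run, plain majority voting would choose "41" (three of five samples), while the verification-weighted vote prefers "42". Increasing `n_samples` is the knob that trades extra inference compute for accuracy under this assumed scheme.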
Taxonomy
Topics: Digital Rights Management and Security
