S$^2$R: Teaching LLMs to Self-verify and Self-correct via Reinforcement Learning

Ruotian Ma; Peisong Wang; Cheng Liu; Xingyan Liu; Jiaqi Chen; Bang Zhang; Xin Zhou; Nan Du; Jia Li

arXiv:2502.12853·cs.CL·February 19, 2025

TL;DR

This paper introduces S$^2$R, a resource-efficient framework that teaches LLMs to self-verify and self-correct during inference, significantly improving reasoning accuracy with minimal additional data and training.

Contribution

S$^2$R is a novel framework that enhances LLM reasoning by combining supervised fine-tuning with reinforcement learning to teach self-verification and self-correction, while requiring less data and training effort than existing approaches.

Findings

01

Qwen2.5-math-7B accuracy improved from 51.0% to 81.6%.

02

Achieved superior performance with only 3.1k initialization samples.

03

Validated effectiveness across multiple models and benchmarks.

Abstract

Recent studies have demonstrated the effectiveness of LLM test-time scaling. However, existing approaches to incentivize LLMs' deep thinking abilities generally require large-scale data or significant training efforts. Meanwhile, it remains unclear how to improve the thinking abilities of less powerful base models. In this work, we introduce S$^2$R, an efficient framework that enhances LLM reasoning by teaching models to self-verify and self-correct during inference. Specifically, we first initialize LLMs with iterative self-verification and self-correction behaviors through supervised fine-tuning on carefully curated data. The self-verification and self-correction skills are then further strengthened by both outcome-level and process-level reinforcement learning, with minimized resource requirements, enabling the model to adaptively refine its reasoning process during inference. Our…
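The iterative self-verify and self-correct behavior described in the abstract can be pictured as a simple loop: propose an answer, check it, and retry if the check fails. The sketch below is only an illustration under stated assumptions, not the paper's implementation. In S$^2$R the verification and correction are behaviors of the trained model itself, whereas here they are split into hypothetical callables (`propose`, `verify`) with toy stand-ins, and the function name and `max_rounds` parameter are invented for this example.

```python
from typing import Callable, List, Tuple


def solve_with_self_correction(
    problem: str,
    propose: Callable[[str, List[str]], str],
    verify: Callable[[str, str], bool],
    max_rounds: int = 3,
) -> Tuple[str, List[str]]:
    """Propose an answer, verify it, and retry until it passes or rounds run out.

    `propose` and `verify` are hypothetical stand-ins for model behavior.
    Returns the last answer and the full list of attempts.
    """
    if max_rounds < 1:
        raise ValueError("max_rounds must be at least 1")
    attempts: List[str] = []
    for _ in range(max_rounds):
        # The proposer sees earlier (rejected) attempts so it can try something new.
        answer = propose(problem, attempts)
        attempts.append(answer)
        if verify(problem, answer):
            break  # Verification passed: stop correcting.
    return attempts[-1], attempts


def toy_propose(problem: str, previous: List[str]) -> str:
    # Toy proposer (assumption): cycles through fixed candidate answers.
    candidates = ["5", "6", "4"]
    return candidates[len(previous) % len(candidates)]


def toy_verify(problem: str, answer: str) -> bool:
    # Toy verifier (assumption): knows the correct answer to "2 + 2 = ?".
    return answer == str(2 + 2)


final, trace = solve_with_self_correction("2 + 2 = ?", toy_propose, toy_verify)
print(final, trace)
```

With the toy stand-ins, the first two attempts fail verification and the third passes, so the loop stops there. If `max_rounds` is exhausted first, the function returns the last unverified attempt.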

Code & Models

Repositories

nineabyss/s2r · Official

Datasets

S2R-data/S2R-dataset
dataset · 6 downloads

Videos

S²R: Teaching LLMs to Self-verify and Self-correct via Reinforcement Learning

Taxonomy

Topics: Artificial Intelligence in Law

Methods: Balanced Selection