Learning to Correct: Calibrated Reinforcement Learning for Multi-Attempt Chain-of-Thought

Muhammed Emrullah Ildiz; Halil Alperen Gozeten; Ege Onur Taga; Samet Oymak

arXiv:2604.17912·cs.LG·April 21, 2026

TL;DR

This paper introduces Calibrated Attempt-Level (CAL) GRPO, a reinforcement learning method for multi-attempt chain-of-thought reasoning that weights per-attempt rewards to obtain unbiased, low-variance gradients of the Verification@K objective.

Contribution

The paper proposes a weighting strategy for per-attempt rewards in multi-attempt RL that yields unbiased gradients, and supports it with theoretical analysis and experiments against naive pass/fail weighting.

Findings

01. CAL-GRPO achieves better Verification@K performance than naive methods.

02. Theoretical analysis explains how per-attempt rewards affect training.

03. Experiments validate the unbiasedness and effectiveness of CAL-GRPO.

Abstract

State-of-the-art reasoning models use long chain-of-thought (CoT) to solve increasingly complex problems with more test-time computation. In this work, we explore a long CoT setting where the model makes up to K successive attempts at solving a problem, and each attempt may build on earlier ones after the model receives hard verifier feedback. This motivates RL methods that can harness per-attempt rewards by carefully weighting individual attempts. We study optimizing the Verification@K reward (the model succeeds by the K-th attempt) and show that naively weighting the attempts by their pass/fail outcomes results in biased gradients. We introduce Calibrated Attempt-Level (CAL) GRPO by devising a weighting strategy that obtains unbiased gradients while maintaining small variance. Our theory reveals how incorporating per-attempt rewards influences the training and the eventual…
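To make the Verification@K reward concrete, here is a minimal Python sketch of the quantity the abstract describes: a rollout earns reward 1 if any of its first K attempts passes the verifier. The function names, the list-of-booleans representation, and the toy rollouts are assumptions made for illustration; this is not the paper's code and it does not implement the CAL-GRPO weighting.

```python
from typing import Sequence


def verification_at_k(attempt_passed: Sequence[bool], k: int) -> int:
    """Return 1 if any of the first k attempts passed the verifier, else 0."""
    if k < 1:
        raise ValueError("k must be at least 1")
    return int(any(attempt_passed[:k]))


def mean_verification_at_k(rollouts: Sequence[Sequence[bool]], k: int) -> float:
    """Average Verification@K over a batch of rollouts (hypothetical helper)."""
    if not rollouts:
        raise ValueError("rollouts must be non-empty")
    return sum(verification_at_k(r, k) for r in rollouts) / len(rollouts)


# Toy data (assumed): each inner list holds per-attempt verifier outcomes,
# and a rollout stops after its first successful attempt.
rollouts = [
    [False, True],          # succeeds on the second attempt
    [False, False, False],  # never succeeds within three attempts
    [True],                 # succeeds on the first attempt
]

for k in (1, 2, 3):
    print(k, mean_verification_at_k(rollouts, k))
```

In this toy batch only one of three rollouts succeeds on the first attempt, while two of three succeed by the second, so the average rises from K = 1 to K = 2 and stays flat at K = 3.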
