Learning to Correct: Calibrated Reinforcement Learning for Multi-Attempt Chain-of-Thought
Muhammed Emrullah Ildiz, Halil Alperen Gozeten, Ege Onur Taga, Samet Oymak

TL;DR
This paper introduces Calibrated Attempt-Level (CAL) GRPO, a reinforcement learning method that uses per-attempt rewards in multi-attempt chain-of-thought reasoning, yielding unbiased gradient estimates and better Verification@K performance.
Contribution
The paper proposes a novel weighting strategy for RL in multi-attempt reasoning, ensuring unbiased gradients and demonstrating theoretical and empirical advantages over existing methods.
Findings
CAL-GRPO achieves better Verification@K performance than naive methods.
Theoretical analysis explains how per-attempt rewards affect training.
Experiments validate the unbiasedness and effectiveness of CAL-GRPO.
Abstract
State-of-the-art reasoning models utilize long chain-of-thought (CoT) to solve increasingly complex problems using more test-time computation. In this work, we explore a long CoT setting where the model makes up to K successive attempts at solving a problem, in which each attempt is allowed to build on earlier ones after the model receives hard verifier feedback. This motivates RL methods that can harness per-attempt rewards by carefully weighting individual attempts. We study optimizing the Verification@K reward (the model succeeds by the K-th attempt) and show that naively weighting the attempts by their pass/fail outcomes results in biased gradients. We introduce Calibrated Attempt-Level (CAL) GRPO by devising a weighting strategy to obtain unbiased gradients while maintaining small variance. Our theory reveals how incorporating per-attempt rewards influences the training and the eventual…
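As a rough illustration of the Verification@K reward described in the abstract (the model succeeds by the K-th attempt), the sketch below computes it from per-attempt pass/fail outcomes. The function names and the toy rollouts are hypothetical and not taken from the paper, and the sketch covers only the reward itself, not the CAL-GRPO weighting.

```python
from typing import Sequence


def verification_at_k(attempt_passed: Sequence[bool], k: int) -> int:
    """Return 1 if any of the first k attempts passed the verifier, else 0.

    `attempt_passed` holds one hard pass/fail verdict per attempt, in order.
    (Hypothetical helper; the name is an assumption, not the paper's.)
    """
    if k < 1:
        raise ValueError("k must be at least 1")
    return int(any(attempt_passed[:k]))


def mean_verification_at_k(rollouts: Sequence[Sequence[bool]], k: int) -> float:
    """Average Verification@K over a batch of multi-attempt rollouts."""
    if not rollouts:
        raise ValueError("rollouts must be non-empty")
    return sum(verification_at_k(r, k) for r in rollouts) / len(rollouts)


# Toy data (made up for illustration): each inner list is one rollout's
# sequence of verifier verdicts, one entry per attempt.
rollouts = [
    [False, True],          # succeeds on the 2nd attempt
    [False, False, False],  # never succeeds within 3 attempts
    [True],                 # succeeds on the 1st attempt
    [False, False, True],   # succeeds on the 3rd attempt
]

for k in (1, 2, 3):
    print(f"Verification@{k} = {mean_verification_at_k(rollouts, k)}")
```

Because a rollout counts as a success once any attempt up to K passes, this quantity can only stay the same or increase as K grows.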
