VI-CuRL: Stabilizing Verifier-Independent RL Reasoning via Confidence-Guided Variance Reduction
Xin-Qiang Cai, Masashi Sugiyama

TL;DR
This paper introduces VI-CuRL, a verifier-independent reinforcement learning framework that uses model confidence to stabilize training and improve reasoning in large language models, eliminating reliance on external verifiers.
Contribution
We develop a confidence-guided curriculum reinforcement learning method that reduces variance and enhances stability without external verifiers, supported by theoretical guarantees and empirical results.
Findings
VI-CuRL outperforms verifier-independent baselines on six benchmarks.
The method reduces gradient variance and prevents training collapse.
The estimator is proven to be asymptotically unbiased.
Abstract
Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as a dominant paradigm for enhancing the reasoning of Large Language Models (LLMs), yet its reliance on external verifiers limits its scalability. Recent findings suggest that RLVR primarily functions by eliciting latent capabilities, motivating the development of verifier-free algorithms. However, in such settings, standard methods like Group Relative Policy Optimization (GRPO) face a critical challenge: destructive gradient variance that often leads to training collapse. To address this issue, we introduce Verifier-Independent Curriculum Reinforcement Learning (VI-CuRL), a framework that leverages the model's intrinsic confidence to construct a curriculum independent of external verifiers. By prioritizing high-confidence samples, VI-CuRL effectively manages the bias-variance trade-off, specifically targeting the reduction of action…
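The abstract describes GRPO-style training in which a confidence-based curriculum prioritizes high-confidence samples. The sketch below illustrates one way such a curriculum could gate GRPO's group-relative advantages. It is not the authors' implementation: the confidence score (mean token log-probability), the top-fraction selection rule, the linear `keep_fraction_schedule`, and all function names are hypothetical assumptions, and the `rewards` array stands in for whatever verifier-free signal is used, which the abstract does not specify.

```python
import numpy as np


def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantages: standardize rewards within each prompt's group.

    rewards: array of shape (num_prompts, group_size).
    """
    mean = rewards.mean(axis=1, keepdims=True)
    std = rewards.std(axis=1, keepdims=True)
    return (rewards - mean) / (std + eps)


def sequence_confidence(token_logprobs, mask):
    """Hypothetical confidence score: mean log-probability of the response tokens.

    token_logprobs, mask: arrays of shape (num_prompts, group_size, seq_len).
    """
    total = (token_logprobs * mask).sum(axis=-1)
    return total / np.maximum(mask.sum(axis=-1), 1)


def confidence_curriculum_mask(confidence, keep_fraction):
    """Hypothetical rule: keep the top `keep_fraction` of prompts by mean group confidence."""
    prompt_conf = confidence.mean(axis=1)
    k = max(1, int(round(keep_fraction * prompt_conf.shape[0])))
    keep = np.zeros(prompt_conf.shape[0], dtype=bool)
    keep[np.argsort(-prompt_conf)[:k]] = True
    return keep


def keep_fraction_schedule(step, total_steps, start=0.3, end=1.0):
    """Hypothetical linear schedule: begin with the most confident prompts, expand to all."""
    t = min(max(step / max(total_steps, 1), 0.0), 1.0)
    return start + (end - start) * t


def curriculum_advantages(rewards, confidence, step, total_steps):
    """Zero out advantages of prompts that fall outside the current curriculum."""
    keep = confidence_curriculum_mask(
        confidence, keep_fraction_schedule(step, total_steps)
    )
    adv = group_relative_advantages(rewards)
    return adv * keep[:, None], keep


# Usage on synthetic data (shapes only; no real model outputs).
rng = np.random.default_rng(0)
num_prompts, group_size, seq_len = 8, 4, 16
rewards = rng.random((num_prompts, group_size))
token_logprobs = -rng.random((num_prompts, group_size, seq_len))
mask = np.ones_like(token_logprobs)

conf = sequence_confidence(token_logprobs, mask)  # shape (8, 4)
adv, keep = curriculum_advantages(rewards, conf, step=0, total_steps=100)
print(int(keep.sum()))  # 2 prompts kept at step 0 (round(0.3 * 8) = 2)
```

Masking whole prompts, rather than reweighting them, keeps the sketch simple: low-confidence groups contribute no gradient early on, and the schedule gradually admits them as training proceeds.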
Taxonomy
Topics: Topic Modeling · Multimodal Machine Learning Applications · Explainable Artificial Intelligence (XAI)
