Confidence Is All You Need: Few-Shot RL Fine-Tuning of Language Models

Pengyi Li; Matvey Skripkin; Alexander Zubrey; Andrey Kuznetsov; Ivan Oseledets

arXiv:2506.06395 · cs.CL · June 12, 2025

Open Access · 3 Reviews

TL;DR

This paper introduces RLSC, a reinforcement learning method that uses a language model's own confidence as a reward signal, enabling effective fine-tuning with minimal labeled data for improved mathematical reasoning accuracy.

Contribution

The paper presents RLSC, a self-confidence-based RL approach that removes the need for external rewards or labels when fine-tuning large language models for reasoning tasks.

Findings

1. RLSC improves accuracy on five math benchmarks, with reported gains from +9.7% (AMC23) to +21.7% (Minerva Math) on Qwen2.5-Math-7B.
2. It requires only 16 samples per question and 10 or 20 training steps.
3. The method is simple, scalable, and does not rely on external reward models.

Abstract

Large language models (LLMs) excel at reasoning, yet post-training remains critical for aligning their behavior with task goals. Existing reinforcement learning (RL) methods often depend on costly human annotations or external reward models. We propose Reinforcement Learning via Self-Confidence (RLSC), which uses the model's own confidence as the reward signal, eliminating the need for labels, preference models, or reward engineering. Applied to Qwen2.5-Math-7B with only 16 samples per question and 10 or 20 training steps, RLSC improves accuracy by +13.4% on AIME2024, +21.2% on MATH500, +21.7% on Minerva Math, +20.8% on OlympiadBench, and +9.7% on AMC23. RLSC provides a simple, scalable post-training method for inference models, requiring only a small number of samples and unlabelled supervision.
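The abstract does not spell out how "confidence" becomes a training signal. One plausible reading, assumed here purely for illustration, is the expected probability the model assigns to its own sampled answer, which for a categorical distribution is the sum of squared probabilities. The toy sketch below applies gradient ascent on that quantity to the logits of a three-answer distribution. The function names, learning rate, and step count are hypothetical, and this is not the authors' training code.

```python
import math

def softmax(logits):
    """Convert logits to probabilities (numerically stable)."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def self_confidence(probs):
    """Assumed confidence measure: E_{y~p}[p(y)] = sum_y p(y)^2."""
    return sum(p * p for p in probs)

def confidence_gradient(probs):
    """Gradient of sum_y p(y)^2 with respect to each logit z_k.

    With p = softmax(z): dF/dz_k = 2 * p_k * (p_k - F).
    """
    f = self_confidence(probs)
    return [2.0 * p * (p - f) for p in probs]

def sharpen(logits, steps=20, lr=1.0):
    """Gradient ascent on the assumed confidence objective (toy settings)."""
    logits = list(logits)
    for _ in range(steps):
        probs = softmax(logits)
        grad = confidence_gradient(probs)
        logits = [z + lr * g for z, g in zip(logits, grad)]
    return logits

# Toy distribution over three candidate answers to one question.
initial_logits = [1.0, 0.5, 0.0]
before = softmax(initial_logits)
after = softmax(sharpen(initial_logits))
print("confidence before:", self_confidence(before))
print("confidence after: ", self_confidence(after))
```

In this toy setting no label or external reward enters the update; the only signal is the distribution itself, so probability mass moves toward whichever answer was already most likely.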

Peer Reviews

Decision · ICLR 2026 Conference Withdrawn Submission

Reviewer 01 · Rating 4 · Confidence 2

Strengths

- This paper converts self-confidence into a direct, differentiable objective, and training requires no external rewards or labels and uses few samples per question.
- This paper reports convergence within 15–30 steps and strong efficiency compared to TTRL’s multi-sample majority voting.
- Empirical results demonstrate that RLSC improves performance across multiple backbones and benchmarks.

Weaknesses

- The method explicitly sharpens the output distribution, which could harm exploration/diversity or best-of-N (BoN) performance. Pass@k metrics should be reported.
- Benchmarks are predominantly mathematical reasoning. Generalization to other domains (code and instruction following) is not demonstrated, limiting external validity.
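For context on the pass@k request above: pass@k is commonly estimated from n sampled completions, c of which are correct, as 1 - C(n-c, k) / C(n, k). The sketch below is a minimal implementation of that standard estimator. The function name and the example values are illustrative and do not come from the paper or the review.

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased pass@k estimate from n samples with c correct ones.

    Returns the probability that at least one of k samples drawn
    without replacement from the n completions is correct.
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    if n - c < k:
        # Fewer than k incorrect samples: every draw of k contains a correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Illustrative values only: 16 samples, 4 of them correct.
print(pass_at_k(16, 4, 1))
print(pass_at_k(16, 4, 4))
```

A model whose distribution has collapsed onto one answer tends to show pass@k close to pass@1, which is why the metric is a useful check on lost diversity.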

Reviewer 02 · Rating 4 · Confidence 4

Strengths

- The motivation behind the proposed method is sound.
- The paper is well organized and easy to follow.
- Experimental results show that the proposed method is effective.

Weaknesses

- The citation format used in the paper requires revision. For example, "Models such as DeepSeek-R1 Guo et al. (2025)" should be "Models such as DeepSeek-R1 (Guo et al., 2025)".
- The notation in Table 3 is unclear. LLaMA-8B should be written as LLaMA-3.1-8B; Qwen-Math-1.5B should be written as Qwen2.5-Math-1.5B; Gemma-4B-pt should be written as Gemma-2-4B-pt. A similar issue occurs from Line 081 to Line 087.
- Missing references: [1] Wang, Yiping et al. “Reinforcement Learning for Reasoni

Reviewer 03 · Rating 4 · Confidence 4

Strengths

1. The proposed method relies on the model's own log-probabilities to improve performance, removing the need for a reward as in RL fine-tuning. This simplification is a valuable contribution in cases where rewards are difficult to obtain.
2. The experiments cover a wide array of base model classes (Qwen, Llama, Gemma, etc.) and several common math benchmarks. The results show that the proposed method can match RL fine-tuning in model improvement.

Weaknesses

1. One key issue with the proposed method of reinforcing the highest-probability response is that it cannot correct cases where the most likely response is initially wrong. As mentioned in Sec. 2.1, the motivation behind the proposed method is biasing the probability distribution towards the most likely response, which cannot make the necessary correction. Such cases are exactly why external information from a reward signal is needed to improve a model. Is this method to be used as a final fine

Taxonomy

Topics: Topic Modeling · Multimodal Machine Learning Applications · Machine Learning and Data Classification