Post-Training Large Language Models via Reinforcement Learning from Self-Feedback

Carel van Niekerk; Renato Vukovic; Benjamin Matthias Ruppik; Hsien-chin Lin; Milica Gašić

arXiv:2507.21931·cs.CL·July 30, 2025

TL;DR

This paper introduces Reinforcement Learning from Self-Feedback (RLSF), a method for improving large language models' calibration and reasoning by using the model's own confidence as an intrinsic reward during post-training.

Contribution

RLSF is a post-training technique that fine-tunes a model on its own confidence signal, without external labels, to improve calibration and reasoning.

Findings

01 Improves model calibration and reasoning performance.

02 Requires no human labels or external rewards.

03 Strengthens step-by-step reasoning in LLMs.

Abstract

Large Language Models (LLMs) often produce plausible but poorly calibrated answers, limiting their reliability on reasoning-intensive tasks. We present Reinforcement Learning from Self-Feedback (RLSF), a post-training stage that uses the model's own confidence as an intrinsic reward, mimicking how humans learn in the absence of external feedback. After a frozen LLM generates several chain-of-thought solutions, we define and compute the confidence of each final answer span and rank the traces accordingly. These synthetic preferences are then used to fine-tune the policy with standard preference optimization, similar to RLHF yet requiring no human labels, gold answers, or externally curated rewards. RLSF simultaneously (i) refines the model's probability estimates -- restoring well-behaved calibration -- and (ii) strengthens step-by-step reasoning, yielding improved performance on…
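
The ranking step described in the abstract can be illustrated with a small sketch. The abstract does not say how answer confidence is defined, so this example assumes the geometric mean of the token probabilities in the final answer span; that choice, along with the names `Trace`, `answer_confidence`, `build_preference_pairs`, and the `min_gap` threshold, is hypothetical and not taken from the paper.

```python
import math
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Trace:
    """One sampled chain-of-thought solution (hypothetical container)."""
    text: str
    # Log-probabilities of the tokens in the final answer span only.
    answer_logprobs: List[float]


def answer_confidence(trace: Trace) -> float:
    """Assumed confidence: geometric mean of answer-token probabilities."""
    if not trace.answer_logprobs:
        return 0.0
    mean_logprob = sum(trace.answer_logprobs) / len(trace.answer_logprobs)
    return math.exp(mean_logprob)


def build_preference_pairs(
    traces: List[Trace], min_gap: float = 0.0
) -> List[Tuple[Trace, Trace]]:
    """Rank traces by confidence and emit (chosen, rejected) pairs.

    A pair is kept only if the confidence gap exceeds `min_gap`
    (an assumed filter, useful for dropping near-ties).
    """
    ranked = sorted(traces, key=answer_confidence, reverse=True)
    pairs = []
    for i, chosen in enumerate(ranked):
        for rejected in ranked[i + 1:]:
            gap = answer_confidence(chosen) - answer_confidence(rejected)
            if gap > min_gap:
                pairs.append((chosen, rejected))
    return pairs


# Toy example with hand-written log-probabilities.
traces = [
    Trace("... so the answer is 12", [math.log(0.5)]),
    Trace("... so the answer is 14", [math.log(0.9), math.log(0.9)]),
    Trace("... so the answer is 7", [math.log(0.1), math.log(0.4)]),
]
pairs = build_preference_pairs(traces, min_gap=0.35)
```

The resulting pairs could then be passed to any preference-optimization trainer as chosen/rejected examples. No gold answers enter the ranking, which matches the label-free setup the abstract describes.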
