No Free Lunch: Rethinking Internal Feedback for LLM Reasoning

Yanzhi Zhang; Zhaoxi Zhang; Haoxiang Guan; Yilin Cheng; Yitong Duan; Chen Wang; Yue Wang; Shuxin Zheng; Jiyan He

arXiv:2506.17219·cs.LG·June 26, 2025

Open Access

TL;DR

This paper studies internal, model-derived feedback signals as an alternative to external-reward reinforcement learning for improving large language model reasoning. Results are mixed: gains depend on the training stage and on whether the model has been instruction-tuned.

Contribution

The paper investigates Reinforcement Learning from Internal Feedback (RLIF), analyzing its theoretical properties and empirical performance and highlighting both its potential and its limitations in LLM training.

Findings

01

RLIF can improve early-stage reasoning performance of LLMs.

02

Performance of RLIF degrades with further training, sometimes below initial levels.

03

RLIF offers limited benefits for instruction-tuned models.

Abstract

Reinforcement learning has emerged as a powerful paradigm for post-training large language models (LLMs) to improve reasoning. Approaches like Reinforcement Learning from Human Feedback (RLHF) and Reinforcement Learning with Verifiable Rewards (RLVR) have shown strong results, but they require extensive external supervision. We investigate an alternative class of methods, Reinforcement Learning from Internal Feedback (RLIF), which relies solely on intrinsic model-derived signals instead of external rewards. In particular, we leverage unsupervised reward proxies such as token-level entropy, trajectory-level entropy, and self-certainty. Our theoretical analysis shows these internal objectives are partially equivalent, and we empirically evaluate various RLIF strategies on challenging math reasoning benchmarks. Experimental results demonstrate that RLIF can boost the reasoning performance…
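The three reward proxies named in the abstract can be illustrated with a small sketch. The abstract does not give formulas, so the definitions below are assumptions based on common usage: token-level entropy as the mean Shannon entropy of each next-token distribution, trajectory-level entropy as estimated from the log-probability of the sampled sequence, and self-certainty as the mean KL divergence from a uniform distribution to each next-token distribution. All function names and the toy distributions are hypothetical and are not taken from the paper.

```python
import math

# Each "step" is a next-token probability distribution over a toy vocabulary.
# Distributions are assumed strictly positive, as softmax outputs are.

def token_entropy(probs):
    """Shannon entropy (in nats) of a single next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def token_level_entropy_reward(step_probs):
    """Negative mean per-token entropy; higher means more peaked steps."""
    return -sum(token_entropy(p) for p in step_probs) / len(step_probs)

def trajectory_log_prob_reward(step_probs, sampled_ids):
    """Log-probability of the sampled sequence.

    Its negation is a one-sample estimate of trajectory-level entropy,
    so maximizing it lowers that estimate (assumed definition).
    """
    return sum(math.log(p[t]) for p, t in zip(step_probs, sampled_ids))

def self_certainty_reward(step_probs):
    """Mean KL(U || p) between uniform U and each next-token distribution
    (assumed definition); zero when every step is uniform."""
    total = 0.0
    for p in step_probs:
        u = 1.0 / len(p)
        total += sum(u * math.log(u / q) for q in p)
    return total / len(step_probs)

# Toy comparison: a peaked two-step rollout versus a uniform one.
confident = [[0.97, 0.01, 0.01, 0.01], [0.01, 0.97, 0.01, 0.01]]
uncertain = [[0.25] * 4, [0.25] * 4]
sampled = [0, 1]

for name, rollout in (("confident", confident), ("uncertain", uncertain)):
    print(
        name,
        round(token_level_entropy_reward(rollout), 3),
        round(trajectory_log_prob_reward(rollout, sampled), 3),
        round(self_certainty_reward(rollout), 3),
    )
```

Under these assumed definitions, all three signals rank the peaked rollout above the uniform one without consulting any ground-truth answer, which is what makes them usable as unsupervised rewards and also why they can reward confidence regardless of correctness.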

Taxonomy

Topics: Multimodal Machine Learning Applications · Explainable Artificial Intelligence (XAI) · Topic Modeling

Methods: Balanced Selection