No Free Lunch: Rethinking Internal Feedback for LLM Reasoning
Yanzhi Zhang, Zhaoxi Zhang, Haoxiang Guan, Yilin Cheng, Yitong Duan, Chen Wang, Yue Wang, Shuxin Zheng, Jiyan He

TL;DR
This paper examines internal, model-derived feedback as an alternative to external reward-based reinforcement learning for improving large language model reasoning, and finds mixed results that depend on the training stage and on whether the model is instruction-tuned.
Contribution
It studies Reinforcement Learning from Internal Feedback (RLIF), analyzing the theoretical relationship between its internal objectives and its empirical performance on math reasoning benchmarks, and highlighting both its potential and its limitations for LLM training.
Findings
RLIF can improve early-stage reasoning performance of LLMs.
Performance under RLIF degrades as training continues, sometimes falling below the initial level.
RLIF offers limited benefits for instruction-tuned models.
Abstract
Reinforcement learning has emerged as a powerful paradigm for post-training large language models (LLMs) to improve reasoning. Approaches like Reinforcement Learning from Human Feedback (RLHF) and Reinforcement Learning with Verifiable Rewards (RLVR) have shown strong results, but they require extensive external supervision. We investigate an alternative class of methods, Reinforcement Learning from Internal Feedback (RLIF), which relies solely on intrinsic model-derived signals instead of external rewards. In particular, we leverage unsupervised reward proxies such as token-level entropy, trajectory-level entropy, and self-certainty. Our theoretical analysis shows these internal objectives are partially equivalent, and we empirically evaluate various RLIF strategies on challenging math reasoning benchmarks. Experimental results demonstrate that RLIF can boost the reasoning performance…
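The three internal signals named in the abstract can be illustrated with a small sketch. The abstract does not give the paper's exact definitions, so the formulas below are assumptions: token-level entropy is taken as the mean entropy of the next-token distribution, the trajectory-level signal as the log-probability of the sampled sequence (a one-sample estimate of negative trajectory entropy), and self-certainty as the mean KL divergence from the uniform distribution to the next-token distribution. The function name `internal_signals` and its inputs are hypothetical.

```python
import math

def log_softmax(logits):
    # Numerically stable log-probabilities via the log-sum-exp trick.
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]

def internal_signals(step_logits, sampled_ids):
    """Compute three label-free reward proxies for one sampled response.

    step_logits: one list of vocabulary logits per generated token.
    sampled_ids: the token index actually sampled at each step.
    All three definitions are assumptions made for illustration.
    """
    n = len(step_logits)
    vocab = len(step_logits[0])
    entropy_sum = 0.0
    seq_logprob = 0.0
    certainty_sum = 0.0
    for logits, tok in zip(step_logits, sampled_ids):
        lp = log_softmax(logits)
        # Entropy of the next-token distribution at this step.
        entropy_sum += -sum(math.exp(l) * l for l in lp)
        # Log-probability of the token that was sampled.
        seq_logprob += lp[tok]
        # KL(U || p) = -(1/V) * sum_j log(V * p_j), zero when p is uniform.
        certainty_sum += -sum(math.log(vocab) + l for l in lp) / vocab
    return {
        "neg_token_entropy": -entropy_sum / n,   # higher = more confident
        "seq_logprob": seq_logprob,              # higher = lower trajectory entropy
        "self_certainty": certainty_sum / n,     # higher = further from uniform
    }

if __name__ == "__main__":
    uniform = internal_signals([[0.0] * 4, [0.0] * 4], [1, 2])
    peaked = internal_signals([[5.0, 0.0, 0.0, 0.0]], [0])
    print(uniform)
    print(peaked)
```

All three quantities rise as the model's output distribution becomes more peaked, which fits the abstract's statement that the internal objectives are partially equivalent. Each one also rewards confidence regardless of correctness, so none of them is a substitute for an external check on the answer.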
Taxonomy
Topics: Multimodal Machine Learning Applications · Explainable Artificial Intelligence (XAI) · Topic Modeling
Methods: Balanced Selection
