Back to Basics: Revisiting Exploration in Reinforcement Learning for LLM Reasoning via Generative Probabilities
Pengyi Li, Elizaveta Goncharova, Andrey Kuznetsov, Ivan Oseledets

TL;DR
This paper introduces a novel advantage re-weighting mechanism to improve exploration and diversity in reinforcement learning for large language models, addressing mode collapse and entropy reduction.
Contribution
It proposes a new advantage re-weighting method that balances confidence levels across responses, enhancing diversity without sacrificing accuracy in LLM reasoning tasks.
Findings
Significantly increases generative diversity and response entropy.
Outperforms existing methods by 5.7% in Pass@1 and 13.9% in Pass@32 on Qwen2.5-7B.
Effectively mitigates entropy collapse in reinforcement learning for LLMs.
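The Pass@1 and Pass@32 figures above are pass@k scores. As a minimal sketch, assuming the commonly used unbiased estimator (draw n samples per problem, count the c correct ones, and estimate the chance that at least one of k draws is correct), the metric can be computed as below. The summary does not state which estimator the authors used, and the function name `pass_at_k` is illustrative only.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one problem.

    n: number of sampled responses
    c: number of those responses that are correct
    k: budget of attempts being scored (k <= n)
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    # Fewer than k incorrect samples: every size-k subset contains a correct one.
    if n - c < k:
        return 1.0
    # 1 - P(all k drawn samples are incorrect), sampling without replacement.
    return 1.0 - comb(n - c, k) / comb(n, k)
```

A benchmark score is then the mean of `pass_at_k` over all problems. A policy that has collapsed onto a single mode can keep pass@1 high while gaining little at larger k, which is why the two numbers are reported together.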
Abstract
Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as an indispensable paradigm for enhancing reasoning in Large Language Models (LLMs). However, standard policy optimization methods, such as Group Relative Policy Optimization (GRPO), often converge to low-entropy policies, leading to severe mode collapse and limited output diversity. We analyze this issue from the perspective of sampling probability dynamics, identifying that the standard objective disproportionately reinforces the highest-likelihood paths, thereby suppressing valid alternative reasoning chains. To address this, we propose a novel Advantage Re-weighting Mechanism (ARM) designed to equilibrate the confidence levels across all correct responses. By incorporating Prompt Perplexity and Answer Confidence into the advantage estimation, our method dynamically reshapes the reward signal to attenuate the gradient…
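To make the idea of advantage re-weighting concrete, the sketch below first computes standard group-relative (GRPO-style) advantages and then applies a hypothetical confidence-based weight. The weighting rule is an assumption for illustration and is not the paper's ARM formula, which the truncated abstract does not give; it also omits the Prompt Perplexity term. The names `reweight_by_confidence` and `mean_token_logprobs` are invented here.

```python
from math import exp
from statistics import mean, pstdev


def grpo_advantages(rewards, eps=1e-6):
    """Group-relative advantages: standardize rewards within one prompt's group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def reweight_by_confidence(advantages, rewards, mean_token_logprobs):
    """Hypothetical re-weighting (NOT the paper's formula).

    Correct responses that the policy already assigns high probability to
    receive a smaller positive advantage, so less-likely correct responses
    get relatively more of the update.
    """
    reweighted = []
    for adv, reward, logprob in zip(advantages, rewards, mean_token_logprobs):
        # Geometric-mean token probability, used as a confidence proxy in (0, 1].
        confidence = exp(logprob)
        if reward > 0 and adv > 0:
            reweighted.append(adv * (1.0 - confidence))
        else:
            # Incorrect responses keep their standard advantage.
            reweighted.append(adv)
    return reweighted


# Two correct and two incorrect responses for the same prompt.
rewards = [1.0, 1.0, 0.0, 0.0]
mean_token_logprobs = [-0.05, -1.5, -0.7, -0.7]  # first response is high-confidence
advantages = grpo_advantages(rewards)
reweighted = reweight_by_confidence(advantages, rewards, mean_token_logprobs)
```

Under this assumed rule, both correct responses start with the same advantage, but the high-confidence one ends up with the smaller weight, which matches the abstract's stated goal of not disproportionately reinforcing the highest-likelihood paths.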
Taxonomy
Topics: Topic Modeling · Multimodal Machine Learning Applications · Natural Language Processing Techniques
