Asymmetric REINFORCE for off-Policy Reinforcement Learning: Balancing positive and negative rewards
Charles Arnal, Gaëtan Narozniak, Vivien Cabannes, Yunhao Tang, Julia Kempe, Remi Munos

TL;DR
This paper introduces an asymmetric off-policy REINFORCE algorithm that emphasizes positive rewards to improve reinforcement learning for language models, backed by theoretical analysis and empirical validation.
Contribution
It provides a theoretical framework for off-policy REINFORCE with a tunable baseline and demonstrates the benefits of focusing on positive rewards in off-policy RL for LLMs.
Findings
Off-policy REINFORCE enjoys a policy improvement guarantee when the baseline lower-bounds the expected reward.
Focusing more on positive rewards enhances off-policy RL performance.
Experiments show improved performance on reasoning tasks when fine-tuning LLMs.
Abstract
Reinforcement learning (RL) is increasingly used to align large language models (LLMs). Off-policy methods offer greater implementation simplicity and data efficiency than on-policy techniques, but often result in suboptimal performance. In this work, we study the intermediate range of algorithms between off-policy RL and supervised fine-tuning by analyzing a simple off-policy REINFORCE algorithm, where the advantage is defined as $A = r - V$, with $r$ a reward and $V$ some tunable baseline. Intuitively, lowering $V$ emphasizes high-reward samples, while raising it penalizes low-reward ones more heavily. We first provide a theoretical analysis of this off-policy REINFORCE algorithm, showing that when the baseline $V$ lower-bounds the expected reward, the algorithm enjoys a policy improvement guarantee. Our analysis reveals that while on-policy updates can safely leverage both positive and…
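The update the abstract describes, where each sample is weighted by its reward minus a tunable baseline, can be illustrated on a small bandit. The following is a minimal sketch and not the paper's implementation: it assumes a softmax policy over three arms, computes the exact expectation over a fixed behavior policy instead of using sampled data, and the names `expected_gradient`, `train`, `mu`, the reward values, and the step size are all hypothetical choices made for illustration.

```python
import math


def softmax(logits):
    """Convert logits into a probability vector (numerically stable)."""
    top = max(logits)
    exps = [math.exp(x - top) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def expected_gradient(logits, mu, rewards, baseline):
    """Expected off-policy REINFORCE gradient with respect to the logits.

    Arms y are drawn from the behavior policy mu (not from the current
    policy pi), and each contributes (r(y) - baseline) * grad log pi(y).
    For a softmax policy, d log pi(y) / d logit_k = 1[k == y] - pi_k.
    """
    pi = softmax(logits)
    advantages = [r - baseline for r in rewards]
    mean_adv = sum(m * a for m, a in zip(mu, advantages))
    return [m * a - p * mean_adv for m, a, p in zip(mu, advantages, pi)]


def expected_reward(logits, rewards):
    """Expected reward of the softmax policy defined by the logits."""
    pi = softmax(logits)
    return sum(p * r for p, r in zip(pi, rewards))


def train(logits, mu, rewards, baseline, lr=1.0, steps=1):
    """Run plain gradient ascent on the logits (illustrative settings)."""
    logits = list(logits)
    for _ in range(steps):
        grad = expected_gradient(logits, mu, rewards, baseline)
        logits = [x + lr * g for x, g in zip(logits, grad)]
    return logits


# Hypothetical toy problem: uniform behavior policy, so its expected
# reward is 0.5, and a baseline of 0.0 sits below that value.
rewards = [1.0, 0.5, 0.0]
mu = [1 / 3, 1 / 3, 1 / 3]
init_logits = [0.0, 0.0, 0.0]
low = train(init_logits, mu, rewards, baseline=0.0)
print(expected_reward(init_logits, rewards), expected_reward(low, rewards))
```

In this toy setting, a baseline of 0.0 makes every advantage nonnegative, so the step only moves probability toward arms in proportion to their reward, much like reward-weighted supervised fine-tuning. Raising `baseline` introduces negative advantages that push probability away from low-reward arms; trying several values is a simple way to explore the trade-off the abstract describes.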
Taxonomy
Topics: Behavioral and Psychological Studies · Reinforcement Learning in Robotics
Methods: REINFORCE · ALIGN
