Asymmetric REINFORCE for off-Policy Reinforcement Learning: Balancing positive and negative rewards

Charles Arnal; Gaëtan Narozniak; Vivien Cabannes; Yunhao Tang; Julia Kempe; Remi Munos

arXiv:2506.20520·cs.LG·December 1, 2025

TL;DR

This paper introduces an asymmetric off-policy REINFORCE algorithm that emphasizes positive rewards to improve reinforcement learning for language models, backed by theoretical analysis and empirical validation.

Contribution

It provides a theoretical framework for off-policy REINFORCE with a tunable baseline and demonstrates the benefits of focusing on positive rewards in off-policy RL for LLMs.

Findings

1. Off-policy REINFORCE enjoys a policy improvement guarantee when the baseline $V$ lower-bounds the expected reward.

2. Focusing more on positive rewards improves off-policy RL performance.

3. Experiments show improved performance on reasoning tasks in LLM fine-tuning.

Abstract

Reinforcement learning (RL) is increasingly used to align large language models (LLMs). Off-policy methods offer greater implementation simplicity and data efficiency than on-policy techniques, but often result in suboptimal performance. In this work, we study the intermediate range of algorithms between off-policy RL and supervised fine-tuning by analyzing a simple off-policy REINFORCE algorithm, where the advantage is defined as $A = r - V$, with $r$ a reward and $V$ some tunable baseline. Intuitively, lowering $V$ emphasizes high-reward samples, while raising it penalizes low-reward ones more heavily. We first provide a theoretical analysis of this off-policy REINFORCE algorithm, showing that when the baseline $V$ lower-bounds the expected reward, the algorithm enjoys a policy improvement guarantee. Our analysis reveals that while on-policy updates can safely leverage both positive and…
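
The role of the advantage $A = r - V$ can be illustrated with a minimal sketch. It assumes a softmax policy over a small discrete action set, with samples drawn by some other behavior policy and no importance-sampling correction. The function names and the toy setup are hypothetical and are not taken from the paper's implementation.

```python
import numpy as np


def softmax(logits):
    """Numerically stable softmax over a 1-D array of logits."""
    z = logits - np.max(logits)
    e = np.exp(z)
    return e / e.sum()


def off_policy_reinforce_grad(logits, actions, rewards, baseline):
    """Gradient w.r.t. logits of mean((r - V) * log pi(a)).

    `actions` and `rewards` are off-policy samples (assumed to come from a
    behavior policy different from pi); `baseline` is the tunable scalar V.
    """
    pi = softmax(logits)
    advantages = rewards - baseline  # A = r - V
    grad = np.zeros_like(logits)
    for a, adv in zip(actions, advantages):
        one_hot = np.zeros_like(logits)
        one_hot[a] = 1.0
        # d/d(logits) of log pi(a) for a softmax policy is (one_hot - pi)
        grad += adv * (one_hot - pi)
    return grad / len(actions)


# Toy example: three actions, only action 0 is rewarded.
logits = np.zeros(3)
actions = np.array([0, 1, 2])
rewards = np.array([1.0, 0.0, 0.0])

grad_low_v = off_policy_reinforce_grad(logits, actions, rewards, baseline=0.0)
grad_high_v = off_policy_reinforce_grad(logits, actions, rewards, baseline=1.0)
```

With `baseline=0.0` only the rewarded sample contributes (its log-probability is pushed up), whereas with `baseline=1.0` the update comes entirely from pushing down the unrewarded samples. Choosing a very low baseline makes every advantage positive, so the update resembles supervised fine-tuning on the sampled data.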


Taxonomy

Topics: Behavioral and Psychological Studies · Reinforcement Learning in Robotics

Methods: REINFORCE · ALIGN