Trust the Batch, On- or Off-Policy: Adaptive Policy Optimization for RL Post-Training

Rasool Fakoor; Murdock Aubry; Nicholas Stranges; Alexander J. Smola

arXiv:2605.12380·cs.LG·May 13, 2026

TL;DR

This paper introduces an adaptive policy optimization method for RL post-training that adjusts to shifts in the data distribution, whether the batch is on- or off-policy, without adding hyper-parameters, with the aim of improving robustness and performance.

Contribution

It proposes a batch-adaptive objective using normalized effective sample size to replace fixed hyper-parameters, simplifying tuning and enhancing stability.
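As a rough illustration of the quantity named above, the sketch below computes a normalized effective sample size for one batch using the standard Kish formula, (sum of weights)^2 / (N * sum of squared weights). It is not the paper's objective: this page does not say how the authors define or use the quantity, and the function name `normalized_ess` and the choice of log importance ratios as input are assumptions made for illustration.

```python
import math


def normalized_ess(log_ratios):
    """Kish effective sample size divided by batch size, in [1/N, 1].

    Assumed input (hypothetical, not taken from the paper):
    log_ratios[i] = log pi_current(a_i | s_i) - log pi_behavior(a_i | s_i)
    """
    if not log_ratios:
        raise ValueError("batch must be non-empty")
    # Subtract the max before exponentiating for numerical stability.
    # The ratio below is unchanged when all weights are rescaled together.
    shift = max(log_ratios)
    weights = [math.exp(x - shift) for x in log_ratios]
    total = sum(weights)
    total_sq = sum(w * w for w in weights)
    return (total * total) / (len(weights) * total_sq)


# All ratios equal to 1: the batch behaves as fully on-policy.
on_policy = normalized_ess([0.0, 0.0, 0.0, 0.0])

# One sample carries a ratio of 9: the batch is dominated by that sample.
off_policy = normalized_ess([0.0, 0.0, 0.0, math.log(9.0)])

print(on_policy, off_policy)
```

A value near 1 means the importance weights are nearly uniform, so the batch looks on-policy; a value near 1/N means a handful of samples dominate. Under the assumptions above, a statistic like this can be computed from the batch itself rather than fixed before training, which is the general idea the contribution statement describes.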

Findings

01 The method matches or exceeds tuned baselines across various settings.

02 It removes the need for additional hyper-parameters and retuning.

03 Experiments demonstrate improved robustness to data distribution mismatches.

Abstract

Reinforcement learning is structurally harder than supervised learning because the policy changes the data distribution it learns from. The resulting fragility is especially visible in large-model training, where the training and rollout systems differ in numerical precision, sampling, and other implementation details. Existing methods manage this fragility by adding hyper-parameters to the training objective, which makes the algorithm more sensitive to its configuration and requires retuning whenever the task, model scale, or distribution mismatch changes. This fragility traces to two concerns that current objectives entangle through hyper-parameters set before training begins: a trust-region concern, that updates should not move the policy too far from its current value, and an off-policy concern, that data from older or different behavior policies should influence the update only to…

Code & Models

Repositories

FeynRL-project/FeynRL (GitHub)
