TL;DR
This paper introduces an adaptive policy optimization method for reinforcement learning that dynamically adjusts to data distribution shifts without extra hyper-parameters, improving robustness and performance.
Contribution
It proposes a batch-adaptive objective using normalized effective sample size to replace fixed hyper-parameters, simplifying tuning and enhancing stability.
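The summary above does not give the paper's exact formula, so the following is only a minimal sketch of one common definition of normalized effective sample size (Kish's estimator divided by the batch size), not the paper's objective. The function name `normalized_ess` and the choice to take per-sample log importance ratios as input are assumptions made for illustration.

```python
import math


def normalized_ess(log_ratios):
    """Kish effective sample size divided by batch size.

    `log_ratios` holds per-sample log importance ratios, e.g.
    log pi_new(a|s) - log pi_behavior(a|s). This input convention is an
    assumption for illustration, not taken from the paper.
    The result lies in [1/N, 1] for a batch of N samples.
    """
    if not log_ratios:
        raise ValueError("need at least one log-ratio")
    # Subtracting the max avoids overflow in exp(); the statistic is
    # invariant to rescaling all weights by a common factor.
    shift = max(log_ratios)
    weights = [math.exp(lr - shift) for lr in log_ratios]
    total = sum(weights)
    total_sq = sum(w * w for w in weights)
    return (total * total) / (len(weights) * total_sq)
```

Under this definition a value near 1 means the batch is close to on-policy (all ratios roughly equal), while a value near 1/N means a single sample dominates. A batch-adaptive objective could in principle use such a statistic in place of a fixed threshold, though how the paper does so is not stated in this summary.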
Findings
The method matches or exceeds tuned baselines across various settings.
It removes the need for additional hyper-parameters and retuning.
Experiments demonstrate improved robustness to data distribution mismatches.
Abstract
Reinforcement learning is structurally harder than supervised learning because the policy changes the data distribution it learns from. The resulting fragility is especially visible in large-model training, where the training and rollout systems differ in numerical precision, sampling, and other implementation details. Existing methods manage this fragility by adding hyper-parameters to the training objective, which makes the algorithm more sensitive to its configuration and requires retuning whenever the task, model scale, or distribution mismatch changes. This fragility traces to two concerns that current objectives entangle through hyper-parameters set before training begins: a trust-region concern, that updates should not move the policy too far from its current value, and an off-policy concern, that data from older or different behavior policies should influence the update only to…
