SOPE: Stabilizing Off-Policy Evaluation for Online RL with Prior Data
Carlo Romeo, Girolamo Macaluso, Alessandro Sestini, Andrew D. Bagdanov

TL;DR
SOPE introduces an adaptive early-stopping method for the offline training phases of online reinforcement learning with prior data. It uses an off-policy evaluation signal to decide when each phase should end, rather than relying on a fixed training length.
Contribution
SOPE proposes an automated mechanism that dynamically sets the length of offline training phases, removing task-dependent manual schedule tuning and reducing computational cost.
Findings
SOPE improves performance by up to 45.6% on continuous control tasks.
It reduces computational costs by up to 22x compared to baseline methods.
Adaptive training schedules outperform static, manually tuned schedules.
Abstract
Incorporating prior data into online reinforcement learning accelerates training but typically forces a difficult trade-off between high computational costs and long, multi-stage training pipelines. While fixed-length stabilization phases are significantly more computationally efficient than static update schedules, they require task-dependent manual tuning, risking either the waste of prior knowledge or severe overfitting. To address this, we propose SOPE, an algorithm that uses an actor-aligned Off-Policy Policy Evaluation (OPE) signal as an automated early-stopping mechanism to dynamically control the length of offline training phases. By evaluating the critic on a held-out validation split under the current policy's action distribution, SOPE halts gradient updates exactly when out-of-distribution benefits saturate, eliminating the need for manual schedule tuning. Evaluated on 25…
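The early-stopping idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (`ope_signal`, `run_offline_phase`), the patience-based stopping rule, the `min_delta` threshold, and the assumption that saturation means the signal stops increasing are all hypothetical choices made for illustration. The only parts taken from the abstract are that the critic is evaluated on held-out validation states using actions from the current policy, and that offline updates stop once that signal saturates.

```python
from typing import Callable, Sequence


def ope_signal(critic: Callable, policy: Callable, val_states: Sequence) -> float:
    """Mean critic value on held-out states, using the current policy's actions.

    Assumption: a deterministic policy call stands in for sampling from the
    policy's action distribution.
    """
    values = [critic(s, policy(s)) for s in val_states]
    return sum(values) / len(values)


def run_offline_phase(update_step: Callable[[], None],
                      signal_fn: Callable[[], float],
                      max_steps: int,
                      eval_every: int = 10,
                      patience: int = 3,
                      min_delta: float = 1e-3) -> int:
    """Run offline updates until the validation signal stops improving.

    Hypothetical rule: stop after `patience` consecutive evaluations that do
    not beat the best signal so far by more than `min_delta`.
    Returns the number of update steps actually performed.
    """
    best = float("-inf")
    stale = 0
    for step in range(1, max_steps + 1):
        update_step()
        if step % eval_every != 0:
            continue
        signal = signal_fn()
        if signal > best + min_delta:
            best = signal
            stale = 0
        else:
            stale += 1
            if stale >= patience:
                return step
    return max_steps


# Toy usage: a one-parameter "critic" whose weight converges towards 1.0,
# so the validation signal rises and then flattens out.
params = {"w": 0.0}
val_states = [1.0, 2.0, 3.0]


def update_step() -> None:
    params["w"] += 0.5 * (1.0 - params["w"])


def toy_critic(state: float, action: float) -> float:
    return params["w"] * state * action


def toy_policy(state: float) -> float:
    return 1.0


stopped_at = run_offline_phase(
    update_step,
    lambda: ope_signal(toy_critic, toy_policy, val_states),
    max_steps=200,
    eval_every=1,
    patience=2,
)
print(f"offline phase stopped after {stopped_at} of 200 steps")
```

In this toy setting the loop ends well before the 200-step budget because the signal flattens as the weight converges. In a real pipeline, `update_step` would perform critic and actor gradient updates on the prior data, and the thresholds would need to be chosen for the task at hand.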
