COOPO: Cyclic Offline-Online Policy Optimization Algorithm
Qisai Liu, Zhanhong Jiang, Joshua Russell Waite, Aditya Balu, Cody Fleming, Soumik Sarkar

TL;DR
COOPO introduces a cyclic offline-online policy optimization framework that reduces environment interactions, minimizes distributional shift, and improves sample efficiency and performance in reinforcement learning.
Contribution
The paper presents a cyclic training framework that alternates between offline and online RL to limit forgetting of offline knowledge and distribution drift, improving sample efficiency and robustness.
Findings
Outperforms state-of-the-art hybrid methods on D4RL benchmarks.
Reduces online environment interactions while maintaining or improving returns.
Guarantees monotonic improvement under standard coverage assumptions.
Abstract
Offline reinforcement learning struggles with distributional shift and constrained performance due to static dataset limitations, while online RL demands a prohibitive number of environment interactions. Recent hybrid offline-to-online methods bridge these domains but suffer from distribution drift during transitions and catastrophic forgetting of offline knowledge. We introduce COOPO (Cyclic Offline-Online Policy Optimization), a generalized framework that repeatedly cycles between constrained offline training and online fine-tuning. Each cycle first anchors the policy to the dataset via KL-regularized advantage-weighted offline updates to minimize distributional shift and then fine-tunes it online using any policy optimization method for stable exploration. Crucially, periodically returning to offline training eliminates forgetting and drift while maximizing dataset reuse. The cyclic…
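The cycle described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the summary does not give the exact objective, so the offline phase below uses a generic advantage-weighted log-likelihood step (weights exp(A / beta), the familiar form that arises from KL-regularized policy improvement), and the online phase uses plain REINFORCE as a stand-in for "any policy optimization method". The three-armed bandit, the tiny dataset, and all names (`offline_step`, `online_step`, `run_cycles`, `beta`, `max_weight`) are hypothetical and chosen only for illustration.

```python
import numpy as np

def softmax(logits):
    z = np.exp(logits - logits.max())  # subtract max for numerical stability
    return z / z.sum()

def advantage_weights(advantages, beta=1.0, max_weight=20.0):
    """Weights exp(A / beta), clipped at max_weight (clipping is an assumption)."""
    adv = np.asarray(advantages, dtype=float)
    return np.minimum(np.exp(adv / beta), max_weight)

def offline_step(logits, actions, advantages, lr=0.5, beta=1.0):
    """One advantage-weighted log-likelihood ascent step on dataset actions."""
    probs = softmax(logits)
    weights = advantage_weights(advantages, beta)
    one_hot = np.eye(len(logits))[actions]  # shape (N, num_actions)
    # Gradient of w * log softmax(logits)[a] w.r.t. logits is w * (one_hot - probs).
    grad = (weights[:, None] * (one_hot - probs)).mean(axis=0)
    return logits + lr * grad

def online_step(logits, reward_means, rng, lr=0.5, batch=32):
    """One REINFORCE step with a mean baseline on a toy bandit (stand-in optimizer)."""
    probs = softmax(logits)
    actions = rng.choice(len(logits), size=batch, p=probs)
    rewards = reward_means[actions] + 0.1 * rng.standard_normal(batch)
    centered = rewards - rewards.mean()
    one_hot = np.eye(len(logits))[actions]
    grad = (centered[:, None] * (one_hot - probs)).mean(axis=0)
    return logits + lr * grad

def run_cycles(num_cycles=3, offline_steps=5, online_steps=5, seed=0):
    """Alternate offline and online phases; return final logits and the phase log."""
    rng = np.random.default_rng(seed)
    reward_means = np.array([0.0, 0.5, 1.0])          # hypothetical bandit
    data_actions = np.array([0, 1, 1, 2])             # hypothetical offline dataset
    data_advantages = np.array([-0.5, 0.0, 0.0, 0.5])
    logits = np.zeros(3)
    phases = []
    for _ in range(num_cycles):
        for _ in range(offline_steps):   # anchor the policy to the dataset
            logits = offline_step(logits, data_actions, data_advantages)
            phases.append("offline")
        for _ in range(online_steps):    # fine-tune with fresh interactions
            logits = online_step(logits, reward_means, rng)
            phases.append("online")
    return logits, phases

if __name__ == "__main__":
    final_logits, phase_log = run_cycles()
    print(softmax(final_logits), phase_log[:3])
```

The point of the sketch is the control flow: each cycle returns to the fixed dataset before collecting more online data, so the dataset is reused in every cycle rather than only once at the start.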
