OMPO: A Unified Framework for RL under Policy and Dynamics Shifts
Yu Luo, Tianying Ji, Fuchun Sun, Jianwei Zhang, Huazhe Xu, Xianyuan Zhan

TL;DR
OMPO introduces a unified approach for reinforcement learning under policy and dynamics shifts by matching transition occupancy, leading to improved performance across diverse environments and settings.
Contribution
The paper proposes a novel occupancy-matching framework with a tractable min-max formulation and an actor-critic architecture, advancing RL under policy and dynamics shifts.
Findings
OMPO outperforms existing baselines in various environments.
OMPO performs well with domain randomization in robotics.
The method is effective under both stationary and nonstationary dynamics.
Abstract
Training reinforcement learning policies using environment interaction data collected from varying policies or dynamics presents a fundamental challenge. Existing works often overlook the distribution discrepancies induced by policy or dynamics shifts, or rely on specialized algorithms with task priors, thus often resulting in suboptimal policy performances and high learning variances. In this paper, we identify a unified strategy for online RL policy learning under diverse settings of policy and dynamics shifts: transition occupancy matching. In light of this, we introduce a surrogate policy learning objective by considering the transition occupancy discrepancies and then cast it into a tractable min-max optimization problem through dual reformulation. Our method, dubbed Occupancy-Matching Policy Optimization (OMPO), features a specialized actor-critic structure equipped with a…
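As a rough illustration of the quantity the abstract refers to, the sketch below estimates a discounted occupancy over (s, a, s') transitions from logged trajectories and compares two such estimates. It is not the OMPO algorithm or its min-max objective: the function names, the tabular setting, the self-normalization over finite trajectories, and the use of total variation as the discrepancy are all assumptions made for this example.

```python
from collections import defaultdict


def transition_occupancy(trajectories, gamma=0.99):
    """Empirical discounted occupancy over (s, a, s') triples.

    Each trajectory is a list of (state, action, next_state) tuples with
    hashable entries. Weights gamma**t are summed per triple and then
    normalized so the result sums to 1 (an assumption for finite data).
    """
    occ = defaultdict(float)
    for traj in trajectories:
        for t, (s, a, s_next) in enumerate(traj):
            occ[(s, a, s_next)] += gamma ** t
    total = sum(occ.values())
    return {key: weight / total for key, weight in occ.items()}


def total_variation(p, q):
    """Total variation distance between two occupancy dictionaries."""
    keys = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)


if __name__ == "__main__":
    # Hypothetical data: same actions, but the second dataset was collected
    # under different dynamics, so the observed next states differ.
    data_a = [[(0, 1, 0), (0, 1, 1)]]
    data_b = [[(0, 1, 1), (1, 0, 1)]]
    occ_a = transition_occupancy(data_a, gamma=0.5)
    occ_b = transition_occupancy(data_b, gamma=0.5)
    print(total_variation(occ_a, occ_b))
```

Because the occupancy is defined over full transitions rather than state-action pairs, a change in the policy and a change in the dynamics both show up as a shift in the same distribution, which is the sense in which a single matching objective can cover both kinds of shift.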
Taxonomy
Topics: Multi-Agent Systems and Negotiation · Software Reliability and Analysis Research
