Robust Regularized Policy Iteration under Transition Uncertainty
Hongqiang Lin, Zhenghui Fu, Weihao Tang, Pengfei Wang, Yiding Sun, Qixian Huang, Dongxu Zhang

TL;DR
This paper introduces RRPI, a robust policy iteration method for offline RL that explicitly accounts for transition uncertainty, improving performance and safety under distribution shifts.
Contribution
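The paper formulates offline RL as robust policy optimization over an uncertainty set of transition kernels and proposes RRPI, which replaces the intractable max-min objective with a tractable KL-regularized surrogate solved by policy iteration on a robust regularized Bellman operator.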
Findings
RRPI outperforms recent baselines on D4RL benchmarks.
RRPI's learned Q-values are low where uncertainty is high, which underpins its robust performance.
The method guarantees monotonic improvement and convergence in robust policy optimization.
Abstract
Offline reinforcement learning (RL) enables data-efficient and safe policy learning without online exploration, but its performance often degrades under distribution shift. The learned policy may visit out-of-distribution state-action pairs where value estimates and learned dynamics are unreliable. To address policy-induced extrapolation and transition uncertainty in a unified framework, we formulate offline RL as robust policy optimization, treating the transition kernel as a decision variable within an uncertainty set and optimizing the policy against the worst-case dynamics. We propose Robust Regularized Policy Iteration (RRPI), which replaces the intractable max-min bilevel objective with a tractable KL-regularized surrogate and derives an efficient policy iteration procedure based on a robust regularized Bellman operator. We provide theoretical guarantees by showing that the…
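The abstract does not state the exact form of the robust regularized Bellman operator, so the following is only an illustrative sketch of the general idea, not the authors' algorithm. It assumes a tabular MDP and a KL penalty on the adversarial transition kernel relative to a nominal model `P0`, in which case the inner worst-case problem has the well-known closed form -(1/beta) log E_{s'~P0}[exp(-beta V(s'))]. All names (`soft_worst_case_value`, `robust_policy_evaluation`, `robust_policy_iteration`, `beta`) are hypothetical, and the greedy improvement step is a simplification chosen for brevity.

```python
import numpy as np

def soft_worst_case_value(P0, V, beta):
    """KL-penalized worst-case next-state value for every (s, a).

    Assumed form: -(1/beta) * log E_{s'~P0}[exp(-beta * V(s'))].
    P0 has shape (S, A, S); V has shape (S,).
    """
    z = -beta * V
    z_max = z.max()
    # Subtract the max before exponentiating for numerical stability.
    log_mean = z_max + np.log(P0 @ np.exp(z - z_max))  # shape (S, A)
    return -log_mean / beta

def robust_policy_evaluation(P0, R, pi, gamma, beta, iters=500):
    """Fixed-point iteration of the assumed robust regularized backup."""
    Q = np.zeros_like(R)
    for _ in range(iters):
        V = (pi * Q).sum(axis=1)  # value of policy pi under current Q
        Q = R + gamma * soft_worst_case_value(P0, V, beta)
    return Q

def robust_policy_iteration(P0, R, gamma=0.9, beta=1.0, rounds=50):
    """Alternate robust evaluation and greedy improvement (a simplification)."""
    S, A = R.shape
    pi = np.full((S, A), 1.0 / A)  # start from the uniform policy
    for _ in range(rounds):
        Q = robust_policy_evaluation(P0, R, pi, gamma, beta)
        greedy = np.eye(A)[Q.argmax(axis=1)]  # one-hot greedy policy
        if np.array_equal(greedy, pi):
            break  # policy is stable under the robust backup
        pi = greedy
    return pi, Q

# Usage on a small random MDP (synthetic data, for illustration only).
rng = np.random.default_rng(0)
S, A = 4, 2
P0 = rng.random((S, A, S))
P0 /= P0.sum(axis=2, keepdims=True)  # normalize into a transition kernel
R = rng.random((S, A))
pi, Q = robust_policy_iteration(P0, R)
```

In this sketch a larger `beta` makes the adversary cheaper to deviate from `P0`, so the backup becomes more pessimistic; as `beta` approaches zero it recovers the ordinary expected-value backup under the nominal model.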
Taxonomy
Topics: Reinforcement Learning in Robotics · Domain Adaptation and Few-Shot Learning · Model Reduction and Neural Networks
