Online Policy Optimization for Robust MDP
Jing Dong, Jingwei Li, Baoxiang Wang, Jingzhao Zhang

TL;DR
This paper introduces an efficient online robust policy optimization algorithm for Markov decision processes, addressing environmental uncertainties and providing the first regret bounds in this setting.
Contribution
It proposes a novel optimistic policy optimization method for online robust MDPs with theoretical guarantees, incorporating a new update rule via Fenchel conjugates.
Findings
First regret bound established for online robust MDPs.
Algorithm demonstrates provable efficiency in uncertain environments.
Addresses the exploration-exploitation trade-off when the transition model is only known to lie in an uncertainty set around an unknown nominal model.
Abstract
Reinforcement learning (RL) has exceeded human performance in many synthetic settings such as video games and Go. However, real-world deployment of end-to-end RL models is less common, as RL models can be very sensitive to slight perturbation of the environment. The robust Markov decision process (MDP) framework -- in which the transition probabilities belong to an uncertainty set around a nominal model -- provides one way to develop robust models. While previous analysis shows RL algorithms are effective assuming access to a generative model, it remains unclear whether RL can be efficient under a more realistic online setting, which requires a careful balance between exploration and exploitation. In this work, we consider online robust MDP by interacting with an unknown nominal system. We propose a robust optimistic policy optimization algorithm that is provably efficient. To address…
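To make the robust MDP idea in the abstract concrete, the following is a minimal sketch of a single robust Bellman backup, in which each transition distribution may be replaced by the worst case inside an uncertainty set around the nominal model. It is illustrative only. The L1-ball uncertainty set, the function names `worst_case_expectation` and `robust_backup`, and the `radius` and `gamma` parameters are assumptions made for this example and are not taken from the paper. The sketch also assumes the nominal model is known, which the online setting studied in the paper does not.

```python
import numpy as np


def worst_case_expectation(p_nominal, values, radius):
    """Minimise p @ values over distributions p with ||p - p_nominal||_1 <= radius.

    Assumption for illustration: the uncertainty set is an L1 ball around
    the nominal distribution, intersected with the probability simplex.
    """
    p = np.asarray(p_nominal, dtype=float).copy()
    values = np.asarray(values, dtype=float)
    lowest = int(np.argmin(values))

    # Moving mass m between two states changes the L1 distance by 2m,
    # so at most radius / 2 can be shifted onto the lowest-value state.
    budget = min(radius / 2.0, 1.0 - p[lowest])
    p[lowest] += budget

    # Remove the same total mass from the highest-value states first.
    for s in np.argsort(values)[::-1]:
        if budget <= 0.0:
            break
        if s == lowest:
            continue
        removed = min(p[s], budget)
        p[s] -= removed
        budget -= removed

    return float(p @ values)


def robust_backup(P, R, V, radius, gamma=1.0):
    """One robust backup: Q[s, a] = R[s, a] + gamma * min_{p in U(s, a)} p @ V.

    P has shape (S, A, S), R has shape (S, A), V has shape (S,).
    The set U(s, a) is chosen independently for every state-action pair.
    """
    n_states, n_actions = R.shape
    Q = np.empty((n_states, n_actions))
    for s in range(n_states):
        for a in range(n_actions):
            Q[s, a] = R[s, a] + gamma * worst_case_expectation(P[s, a], V, radius)
    return Q
```

With `radius = 0` the backup reduces to the ordinary (non-robust) Bellman backup on the nominal model. Larger radii make the value estimates more pessimistic.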
Taxonomy
Topics: Reinforcement Learning in Robotics · Advanced Malware Detection Techniques · Artificial Intelligence in Games
