Optimistic Policy Learning under Pessimistic Adversaries with Regret and Violation Guarantees
Sourav Ganguly, Kartik Pandit, Arnob Ghosh

TL;DR
This paper introduces a reinforcement learning framework that models exogenous factors as adversaries and provides safety and optimality guarantees in environments with strategic external influences.
Contribution
It proposes RHC-UCRL, a model-based algorithm that explicitly accounts for adversarial dynamics and separates different types of uncertainty, achieving regret and constraint violation guarantees.
Findings
RHC-UCRL achieves sub-linear regret.
The algorithm guarantees bounded constraint violations.
Explicit adversarial modeling improves safety in RL.
Abstract
Real-world decision-making systems operate in environments where state transitions depend not only on the agent's actions but also on exogenous factors outside its control (competing agents, environmental disturbances, or strategic adversaries). Formally, the transition depends on the agent's action, the adversary/external action, and an additive noise term. Ignoring such factors can yield policies that are optimal in isolation but fail catastrophically in deployment, particularly when safety constraints must be satisfied. Standard Constrained MDP formulations assume the agent is the sole driver of state evolution, an assumption that breaks down in safety-critical settings. Existing robust RL approaches address this via distributional robustness over transition kernels, but do not explicitly model the…
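The transition structure described in the abstract can be illustrated with a small sketch. This is not the paper's RHC-UCRL algorithm: the scalar linear dynamics, the names `step`, `worst_case_next_state`, and `is_safe`, and the threshold `SAFE_LIMIT` are all hypothetical, chosen only to show how an action that looks safe when the external action is ignored can violate a constraint under a pessimistic adversary.

```python
import random

SAFE_LIMIT = 1.5  # hypothetical safety threshold on the state


def step(state, agent_action, adversary_action, noise_std=0.0, rng=None):
    """Next state from the agent's action, the external action, and additive noise.

    The linear form is an illustrative assumption, not the paper's model.
    """
    rng = rng or random.Random(0)
    noise = rng.gauss(0.0, noise_std) if noise_std > 0 else 0.0
    return state + agent_action + adversary_action + noise


def worst_case_next_state(state, agent_action, adversary_actions):
    """Pessimistic view: the adversary picks the action that maximizes the next state."""
    return max(step(state, agent_action, b) for b in adversary_actions)


def is_safe(next_state):
    """The (hypothetical) constraint is satisfied when the state stays at or below the limit."""
    return next_state <= SAFE_LIMIT


# Evaluate the same agent action with and without the external influence.
nominal = step(0.0, 1.0, 0.0)
pessimistic = worst_case_next_state(0.0, 1.0, [-1.0, 0.0, 1.0])
print(is_safe(nominal), is_safe(pessimistic))
```

In this toy setting the agent's action satisfies the constraint when the external action is zero, but not under the worst-case external action, which is the failure mode the abstract attributes to ignoring exogenous factors.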
