Sample Complexity of Distributionally Robust Off-Dynamics Reinforcement Learning with Online Interaction
Yiting He, Zhishuai Liu, Weixin Wang, Pan Xu

TL;DR
This paper investigates the sample complexity of distributionally robust off-dynamics reinforcement learning in online settings, introducing a new measure of exploration difficulty and proposing an optimal algorithm with theoretical guarantees.
Contribution
It introduces the supremal visitation ratio to quantify exploration difficulty and presents the first efficient algorithm for online RMDPs, together with matching regret lower bounds.
Findings
The supremal visitation ratio measures the mismatch between training and deployment dynamics.
An unbounded ratio makes online learning exponentially hard.
The proposed algorithm achieves optimal regret dependence on both the visitation ratio and the number of episodes.
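As a rough illustration of the first two findings, the sketch below compares per-step state visitation under a training kernel and a single deployment kernel in a tiny tabular MDP, and reports the largest deployment-to-training ratio. It is only a simplified proxy: the paper's exact definition of the supremal visitation ratio is not reproduced in this summary, and the function names, the two-state example, and the restriction to one fixed policy and one deployment kernel are all assumptions made for illustration.

```python
def visitation(P, pi, init, horizon):
    """Per-step state visitation d_h(s) for policy pi under kernel P.

    P[s][a][s2] is a transition probability, pi[s][a] an action probability,
    and init[s] the initial state distribution (all hypothetical inputs).
    """
    n = len(init)
    d = [list(init)]
    for _ in range(horizon - 1):
        prev = d[-1]
        nxt = [0.0] * n
        for s in range(n):
            for a, pa in enumerate(pi[s]):
                for s2 in range(n):
                    nxt[s2] += prev[s] * pa * P[s][a][s2]
        d.append(nxt)
    return d


def visitation_ratio(P_train, P_deploy, pi, init, horizon):
    """Largest d_deploy / d_train over steps and states.

    Illustrative proxy only (an assumption, not the paper's definition).
    Returns infinity when deployment reaches a state that training never visits.
    """
    d_tr = visitation(P_train, pi, init, horizon)
    d_de = visitation(P_deploy, pi, init, horizon)
    worst = 0.0
    for row_tr, row_de in zip(d_tr, d_de):
        for p_tr, p_de in zip(row_tr, row_de):
            if p_de == 0.0:
                continue  # state not reached in deployment, no mismatch to count
            worst = max(worst, float("inf") if p_tr == 0.0 else p_de / p_tr)
    return worst


if __name__ == "__main__":
    # Two states, one action; state 1 is absorbing in both environments.
    pi = [[1.0], [1.0]]
    init = [1.0, 0.0]
    P_train = [[[0.9, 0.1]], [[0.0, 1.0]]]    # training rarely reaches state 1
    P_deploy = [[[0.5, 0.5]], [[0.0, 1.0]]]   # deployment reaches it often
    print(visitation_ratio(P_train, P_deploy, pi, init, horizon=2))

    # If training can never reach state 1, the proxy is unbounded.
    P_never = [[[1.0, 0.0]], [[0.0, 1.0]]]
    print(visitation_ratio(P_never, P_deploy, pi, init, horizon=2))
```

When the training kernel assigns zero probability to a state that deployment reaches, the proxy returns infinity, which loosely mirrors the unbounded case described in the findings: no amount of online interaction with the training environment yields data about that state.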
Abstract
Off-dynamics reinforcement learning (RL), where training and deployment transition dynamics differ, can be formulated as learning in a robust Markov decision process (RMDP) where uncertainties in transition dynamics are imposed. Existing literature mostly assumes access to generative models allowing arbitrary state-action queries or pre-collected datasets with good state coverage of the deployment environment, bypassing the challenge of exploration. In this work, we study a more realistic and challenging setting where the agent is limited to online interaction with the training environment. To capture the intrinsic difficulty of exploration in online RMDPs, we introduce the supremal visitation ratio, a novel quantity that measures the mismatch between the training dynamics and the deployment dynamics. We show that if this ratio is unbounded, online learning becomes…
Taxonomy
Topics: Reinforcement Learning in Robotics · Advanced Bandit Algorithms Research · Age of Information Optimization
