Minimax Optimal Strategy for Delayed Observations in Online Reinforcement Learning
Harin Lee, Kevin Jamieson

TL;DR
This paper introduces a minimax-optimal reinforcement learning algorithm for environments with delayed state observations, with regret bounds that are tight up to logarithmic factors and a general analytical framework for structured MDPs.
Contribution
The paper proposes an algorithm that combines the augmentation method with the upper confidence bound (UCB) approach to handle delayed observations, and it establishes the optimality of the resulting regret bound for tabular MDPs.
Findings
Regret bound of $\tilde{\mathcal{O}}(H\sqrt{D_{\max} SAK})$ for the proposed method.
Matching lower bound up to logarithmic factors, confirming optimality.
General framework for structured MDPs with decomposed transition dynamics.
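To give a feel for how the bound scales, the sketch below evaluates the leading term $H\sqrt{D_{\max} SAK}$ while ignoring constants and logarithmic factors. The function name `regret_bound_scale` and the example values are illustrative assumptions, not quantities from the paper.

```python
import math


def regret_bound_scale(H: int, D_max: int, S: int, A: int, K: int) -> float:
    """Leading term H * sqrt(D_max * S * A * K) of the regret bound.

    Constants and logarithmic factors are dropped, so this only shows
    how the bound scales; it is not a numerical regret guarantee.
    """
    return H * math.sqrt(D_max * S * A * K)


# Hypothetical problem sizes, chosen only for illustration.
small = regret_bound_scale(H=10, D_max=4, S=5, A=2, K=1000)
large = regret_bound_scale(H=10, D_max=4, S=5, A=2, K=4000)

# Quadrupling the number of episodes doubles the leading term,
# so the average regret per episode is halved.
print(small, large, small / 1000, large / 4000)
```

Because the term grows like the square root of K, the average regret per episode shrinks as more episodes are played, which is the behaviour expected of a learning algorithm.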
Abstract
We study reinforcement learning with delayed state observation, where the agent observes the current state after some random number of time steps. We propose an algorithm that combines the augmentation method and the upper confidence bound approach. For tabular Markov decision processes (MDPs), we derive a regret bound of $\tilde{\mathcal{O}}(H\sqrt{D_{\max} SAK})$, where $S$ and $A$ are the cardinalities of the state and action spaces, $H$ is the time horizon, $K$ is the number of episodes, and $D_{\max}$ is the maximum length of the delay. We also provide a matching lower bound up to logarithmic factors, showing the optimality of our approach. Our analytical framework formulates this problem as a special case of a broader class of MDPs, where their transition dynamics decompose into a known component and an unknown but structured component. We establish general results for this…
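The augmentation idea mentioned in the abstract can be illustrated with a minimal sketch, assuming the common construction in which the agent's decision state is the most recently observed state together with the actions taken since that observation. The names `AugmentedState` and `DelayTracker` are hypothetical and are not taken from the paper, and the paper's algorithm may define the augmented state differently.

```python
from dataclasses import dataclass
from typing import Hashable, List, Tuple


@dataclass(frozen=True)
class AugmentedState:
    """Most recently observed state plus the actions taken since then."""
    last_observed_state: Hashable
    pending_actions: Tuple[Hashable, ...]


class DelayTracker:
    """Tracks the augmented state within one episode (illustrative only)."""

    def __init__(self, initial_state: Hashable, max_delay: int) -> None:
        self.max_delay = max_delay
        self._obs_step = 0                  # step index of the newest observed state
        self._obs_state = initial_state     # assumed observed at step 0
        self._actions: List[Hashable] = []  # action taken at each step so far

    def act(self, action: Hashable) -> None:
        """Record the action taken at the current step."""
        self._actions.append(action)

    def observe(self, step: int, state: Hashable) -> None:
        """Record a delayed observation of the state visited at `step`."""
        # Observations older than the newest one carry no new information here.
        if step > self._obs_step:
            self._obs_step = step
            self._obs_state = state

    def augmented(self) -> AugmentedState:
        """Return the current augmented state."""
        pending = tuple(self._actions[self._obs_step:])
        if len(pending) > self.max_delay:
            raise ValueError("delay exceeded the assumed maximum")
        return AugmentedState(self._obs_state, pending)


tracker = DelayTracker(initial_state="s0", max_delay=2)
tracker.act("a0")
tracker.act("a1")
before = tracker.augmented()   # still only s0 observed, two actions pending
tracker.observe(1, "s1")       # the state at step 1 arrives late
after = tracker.augmented()    # one action pending since s1
print(before, after)
```

Under this construction the agent always acts on information it actually has, at the cost of a decision space that grows with the number of pending actions, which is why the maximum delay enters the regret bound.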
Taxonomy
Topics: Reinforcement Learning in Robotics · Advanced Bandit Algorithms Research · Age of Information Optimization
