Minimax Optimal Strategy for Delayed Observations in Online Reinforcement Learning

Harin Lee; Kevin Jamieson

arXiv:2603.03480·cs.LG·March 5, 2026

TL;DR

This paper introduces a reinforcement learning algorithm for environments with delayed state observations. Its regret bound matches a lower bound up to logarithmic factors, and the analysis extends to a general framework for structured MDPs.

Contribution

The paper proposes an algorithm that combines the augmentation method with the upper confidence bound (UCB) approach to handle delayed state observations, and proves a regret bound for tabular MDPs that is optimal up to logarithmic factors.

Findings

1. Regret bound of $\tilde{O}(H D_{\max} S A K)$ for the proposed method.

2. Matching lower bound up to logarithmic factors, confirming optimality.

3. General framework for structured MDPs with decomposed transition dynamics.

Abstract

We study reinforcement learning with delayed state observation, where the agent observes the current state after some random number of time steps. We propose an algorithm that combines the augmentation method and the upper confidence bound approach. For tabular Markov decision processes (MDPs), we derive a regret bound of $\tilde{O}(H D_{\max} S A K)$, where $S$ and $A$ are the cardinalities of the state and action spaces, $H$ is the time horizon, $K$ is the number of episodes, and $D_{\max}$ is the maximum length of the delay. We also provide a matching lower bound up to logarithmic factors, showing the optimality of our approach. Our analytical framework formulates this problem as a special case of a broader class of MDPs, where their transition dynamics decompose into a known component and an unknown but structured component. We establish general results for this…
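To make the augmentation idea in the abstract concrete, here is a minimal Python sketch. It is not the paper's algorithm: it assumes a constant delay (the paper treats random delays), uses a deterministic toy transition, and leaves out the UCB component entirely. The names `ConstantDelayAugmenter`, `simulate`, and `roll_forward` are hypothetical. The sketch only shows the bookkeeping for an augmented state made of the most recently observed state plus the actions taken since that observation.

```python
from collections import deque
from typing import Callable, Deque, List, Tuple

# Augmented state: (last observed state, actions taken since that observation).
AugState = Tuple[int, Tuple[int, ...]]


class ConstantDelayAugmenter:
    """Hypothetical helper: bookkeeping for a constant observation delay."""

    def __init__(self, initial_state: int, delay: int) -> None:
        if delay < 0:
            raise ValueError("delay must be non-negative")
        self.delay = delay
        self.last_observed = initial_state
        self.pending: Deque[int] = deque()  # actions whose outcomes are unseen

    def augmented_state(self) -> AugState:
        return (self.last_observed, tuple(self.pending))

    def act(self, action: int) -> None:
        self.pending.append(action)

    def observe(self, state: int) -> None:
        # The delayed observation is the outcome of the oldest pending action.
        self.pending.popleft()
        self.last_observed = state


def simulate(
    step: Callable[[int, int], int],
    initial_state: int,
    actions: List[int],
    delay: int,
) -> List[Tuple[AugState, int]]:
    """Return (augmented state, true hidden state) pairs at each decision time."""
    aug = ConstantDelayAugmenter(initial_state, delay)
    history = [initial_state]  # true states s_0, s_1, ...
    trace = []
    for t, action in enumerate(actions):
        trace.append((aug.augmented_state(), history[-1]))
        history.append(step(history[-1], action))
        aug.act(action)
        if len(aug.pending) > delay:
            # State s_{t+1-delay} becomes visible only now.
            aug.observe(history[t + 1 - delay])
    return trace


def roll_forward(step: Callable[[int, int], int], aug_state: AugState) -> int:
    """In a deterministic toy model the augmented state pins down the true state."""
    state, pending = aug_state
    for action in pending:
        state = step(state, action)
    return state


def toy_step(state: int, action: int) -> int:
    # Assumed toy dynamics: a deterministic walk on 5 states.
    return (state + action) % 5


trace = simulate(toy_step, initial_state=0, actions=[1, 2, 0, 3, 1], delay=2)
```

In this toy setting the action buffer never holds more than `delay` entries at decision time, so the augmented state space has at most $S \cdot A^{d}$ elements for a constant delay $d$. With stochastic transitions the augmented state would determine only a distribution over the current state, not the state itself.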

Taxonomy

Topics: Reinforcement Learning in Robotics · Advanced Bandit Algorithms Research · Age of Information Optimization