Model-based Bootstrap of Controlled Markov Chains
Ziwei Su, Imon Banerjee, Diego Klabjan

TL;DR
This paper develops a bootstrap method for transition kernel estimation in controlled Markov chains, enabling valid confidence intervals for offline reinforcement learning tasks.
Contribution
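It introduces a novel bootstrap law of large numbers for the visitation counts and a novel use of the martingale CLT for the bootstrap transition increments, then extends bootstrap consistency to offline policy evaluation (OPE) and optimal policy recovery (OPR), with supporting experiments.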
Findings
Bootstrap CIs outperform baselines in coverage accuracy.
Method provides asymptotically valid confidence intervals for value functions.
Experiments on RiverSwim demonstrate effectiveness at small sample sizes.
Abstract
We propose and analyze a model-based bootstrap for transition kernels in finite controlled Markov chains (CMCs) with possibly nonstationary or history-dependent control policies, a setting that arises naturally in offline reinforcement learning (RL) when the behavior policy generating the data is unknown. We establish distributional consistency of the bootstrap transition estimator in both a single long-chain regime and the episodic offline RL regime. The key technical tools are a novel bootstrap law of large numbers (LLN) for the visitation counts and a novel use of the martingale central limit theorem (CLT) for the bootstrap transition increments. We extend bootstrap distributional consistency to the downstream targets of offline policy evaluation (OPE) and optimal policy recovery (OPR) via the delta method by verifying Hadamard differentiability of the Bellman operators, yielding…
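To make the idea concrete, here is a minimal Python sketch of a model-based bootstrap for a tabular transition kernel, followed by a percentile confidence interval for a policy's value function. It is an illustration under stated assumptions, not the authors' procedure: the function names, the uniform fallback for unvisited state-action pairs, the known reward table `R`, the discount `gamma`, and the choice to hold the observed visit counts fixed while resampling next states are all assumptions introduced here.

```python
import numpy as np


def estimate_kernel(counts):
    """Row-normalize counts[s, a, s'] into a transition estimate.

    Unvisited (s, a) pairs fall back to a uniform row (an assumption of this sketch).
    """
    totals = counts.sum(axis=2, keepdims=True)
    n_states = counts.shape[2]
    uniform = np.full_like(counts, 1.0 / n_states, dtype=float)
    return np.where(totals > 0, counts / np.maximum(totals, 1), uniform)


def policy_value(P, R, pi, gamma):
    """Solve the Bellman evaluation equation V = R_pi + gamma * P_pi V."""
    P_pi = np.einsum("sa,sat->st", pi, P)  # state-to-state kernel under pi
    R_pi = (pi * R).sum(axis=1)            # expected one-step reward under pi
    return np.linalg.solve(np.eye(P.shape[0]) - gamma * P_pi, R_pi)


def bootstrap_value_ci(counts, R, pi, gamma=0.9, n_boot=500, alpha=0.05, seed=0):
    """Percentile bootstrap CI for V^pi, resampling next states from the fitted kernel."""
    rng = np.random.default_rng(seed)
    P_hat = estimate_kernel(counts)
    n_sa = counts.sum(axis=2).astype(int)  # observed visit counts, held fixed here
    n_states, n_actions, _ = counts.shape
    values = np.empty((n_boot, n_states))
    for b in range(n_boot):
        boot_counts = np.zeros(counts.shape, dtype=float)
        for s in range(n_states):
            for a in range(n_actions):
                if n_sa[s, a] > 0:
                    # Draw bootstrap next-state counts from the estimated row.
                    boot_counts[s, a] = rng.multinomial(n_sa[s, a], P_hat[s, a])
        values[b] = policy_value(estimate_kernel(boot_counts), R, pi, gamma)
    lower = np.quantile(values, alpha / 2, axis=0)
    upper = np.quantile(values, 1 - alpha / 2, axis=0)
    return policy_value(P_hat, R, pi, gamma), lower, upper
```

Given a count array of shape (states, actions, states), a reward table of shape (states, actions), and a target policy of shape (states, actions), `bootstrap_value_ci(counts, R, pi)` returns the plug-in value estimate together with per-state lower and upper interval endpoints. Holding the visit counts fixed is a simplification; a scheme that regenerates whole trajectories from the fitted kernel would also randomize the counts.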
