Provable Offline Reinforcement Learning for Structured Cyclic MDPs
Kyungbok Lee, Angelica Cristello Sarteau, Michael R. Kosorok

TL;DR
This paper introduces CycleFQI, a modular offline RL algorithm for structured cyclic MDPs, providing theoretical guarantees and demonstrating effectiveness on diabetes data.
Contribution
We propose CycleFQI, a stage-wise fitted Q-iteration method with theoretical analysis for cyclic MDPs, addressing offline learning challenges and enabling partial policy control.
Findings
CycleFQI achieves finite-sample suboptimality bounds.
It demonstrates global convergence under Besov regularity.
It is effective on both simulated and real-world diabetes data.
Abstract
We introduce a novel cyclic Markov decision process (MDP) framework for multi-step decision problems with heterogeneous stage-specific dynamics, transitions, and discount factors across the cycle. In this setting, offline learning is challenging: optimizing a policy at any stage shifts the state distributions of subsequent stages, propagating mismatch across the cycle. To address this, we propose a modular structural framework that decomposes the cyclic process into stage-wise sub-problems. While generally applicable, we instantiate this principle as CycleFQI, an extension of fitted Q-iteration enabling theoretical analysis and interpretation. It uses a vector of stage-specific Q-functions, tailored to each stage, to capture within-stage sequences and transitions between stages. This modular design enables partial control, allowing some stages to be optimized while others follow…
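The stage-wise idea described in the abstract can be illustrated with a minimal tabular sketch. This is not the authors' implementation: it assumes one decision step per stage (no within-stage sequences), integer-coded states and actions, and per-cell averaging in place of a general regression step. The names `cycle_fqi`, `datasets`, `gammas`, and `fixed_policies` are hypothetical. Each stage keeps its own Q-table, and its regression target bootstraps from the Q-table of the next stage in the cycle; a stage listed in `fixed_policies` is evaluated under that fixed policy instead of being maximized, which mimics partial control.

```python
import numpy as np

def cycle_fqi(datasets, gammas, n_states, n_actions, fixed_policies=None, n_iters=200):
    """Tabular sketch of stage-wise fitted Q-iteration on a cycle of K stages.

    datasets[k] = (s, a, r, s_next): offline transitions collected at stage k,
    where s_next is a state of stage (k + 1) % K.
    gammas[k] is the stage-specific discount factor.
    fixed_policies maps a stage index to an array giving a fixed action per state
    (hypothetical convention for stages that are not optimized).
    """
    K = len(datasets)
    fixed_policies = fixed_policies or {}
    Q = [np.zeros((n_states, n_actions)) for _ in range(K)]
    for _ in range(n_iters):
        new_Q = []
        for k in range(K):
            s, a, r, s_next = datasets[k]
            nxt = (k + 1) % K  # the cycle wraps from the last stage back to the first
            if nxt in fixed_policies:
                # Next stage follows a fixed policy: evaluate it, do not maximize.
                v_next = Q[nxt][s_next, fixed_policies[nxt][s_next]]
            else:
                # Next stage is optimized: greedy value.
                v_next = Q[nxt][s_next].max(axis=1)
            target = r + gammas[k] * v_next
            # "Regression" step: average the targets within each (state, action) cell.
            sums = np.zeros((n_states, n_actions))
            counts = np.zeros((n_states, n_actions))
            np.add.at(sums, (s, a), target)
            np.add.at(counts, (s, a), 1)
            new_Q.append(np.divide(sums, counts, out=np.zeros_like(sums), where=counts > 0))
        Q = new_Q
    return Q

# Toy two-stage cycle with a single state and two actions per stage (made-up data).
toy = [
    (np.array([0, 0]), np.array([0, 1]), np.array([0.0, 1.0]), np.array([0, 0])),  # stage 0
    (np.array([0, 0]), np.array([0, 1]), np.array([2.0, 0.0]), np.array([0, 0])),  # stage 1
]
Q_full = cycle_fqi(toy, gammas=[0.5, 0.5], n_states=1, n_actions=2)
Q_partial = cycle_fqi(toy, gammas=[0.5, 0.5], n_states=1, n_actions=2,
                      fixed_policies={1: np.array([1])})  # stage 1 always takes action 1
```

Comparing `Q_full` with `Q_partial` shows how fixing one stage changes the values learned at the other stage, since every target depends on what happens later in the cycle. A practical version would swap the cell averaging for a function approximator fitted per stage.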
Taxonomy
Topics: Reinforcement Learning in Robotics · Adaptive Dynamic Programming Control · Advanced Bandit Algorithms Research
