Provable Offline Reinforcement Learning for Structured Cyclic MDPs

Kyungbok Lee; Angelica Cristello Sarteau; Michael R. Kosorok

arXiv:2602.11679 · stat.ML · February 13, 2026

Open Access

TL;DR

This paper introduces CycleFQI, a modular offline RL algorithm for structured cyclic MDPs, providing theoretical guarantees and demonstrating effectiveness on diabetes data.

Contribution

We propose CycleFQI, a stage-wise fitted Q-iteration method with theoretical analysis for cyclic MDPs, addressing offline learning challenges and enabling partial policy control.

Findings

01. CycleFQI achieves finite-sample suboptimality bounds.

02. It demonstrates global convergence under Besov regularity.

03. It is effective on simulated and real-world diabetes data.

Abstract

We introduce a novel cyclic Markov decision process (MDP) framework for multi-step decision problems with heterogeneous stage-specific dynamics, transitions, and discount factors across the cycle. In this setting, offline learning is challenging: optimizing a policy at any stage shifts the state distributions of subsequent stages, propagating mismatch across the cycle. To address this, we propose a modular structural framework that decomposes the cyclic process into stage-wise sub-problems. While generally applicable, we instantiate this principle as CycleFQI, an extension of fitted Q-iteration enabling theoretical analysis and interpretation. It uses a vector of stage-specific Q-functions, tailored to each stage, to capture within-stage sequences and transitions between stages. This modular design enables partial control, allowing some stages to be optimized while others follow…
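The stage-wise idea in the abstract (one Q-function per stage, each bootstrapping from the next stage in the cycle) can be illustrated with a minimal tabular sketch. This is not the authors' implementation. It assumes finite states and actions, exactly one decision per stage (so within-stage sequences are not modeled), that every stage-k transition lands in stage (k + 1) mod K, and that the regression step is a plain per-cell average. The name `cycle_fqi`, its arguments, and the toy data are hypothetical.

```python
import numpy as np


def cycle_fqi(datasets, gammas, n_states, n_actions, n_iters=200):
    """Tabular sketch: one Q-table per stage, refit against the next stage.

    datasets[k] holds (s, a, r, s_next) tuples logged in stage k;
    s_next is assumed to be a state of stage (k + 1) % K.
    gammas[k] is the stage-specific discount factor.
    """
    K = len(datasets)
    Q = [np.zeros((n_states, n_actions)) for _ in range(K)]
    for _ in range(n_iters):
        new_Q = []
        for k in range(K):
            nxt = (k + 1) % K  # the cycle wraps from the last stage to the first
            sums = np.zeros((n_states, n_actions))
            counts = np.zeros((n_states, n_actions))
            for s, a, r, s_next in datasets[k]:
                # Bellman target built from the *next* stage's Q-table
                target = r + gammas[k] * Q[nxt][s_next].max()
                sums[s, a] += target
                counts[s, a] += 1
            # Least squares on one-hot features reduces to a per-cell mean;
            # cells never seen in the data are left at 0 (an assumption).
            fitted = np.divide(sums, counts,
                               out=np.zeros_like(sums), where=counts > 0)
            new_Q.append(fitted)
        Q = new_Q
    return Q


# Hypothetical two-stage toy problem with a single state and two actions.
toy_data = [
    [(0, 0, 0.0, 0), (0, 1, 1.0, 0)],  # stage 0: action 1 pays 1
    [(0, 0, 2.0, 0), (0, 1, 0.0, 0)],  # stage 1: action 0 pays 2
]
Q = cycle_fqi(toy_data, gammas=[0.5, 0.5], n_states=1, n_actions=2)
greedy = [int(q[0].argmax()) for q in Q]
print(greedy)
```

In the toy problem the greedy policy picks action 1 in stage 0 and action 0 in stage 1. A function-approximation version would replace the per-cell average with a regression fit per stage.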

Taxonomy

Topics: Reinforcement Learning in Robotics · Adaptive Dynamic Programming Control · Advanced Bandit Algorithms Research