Generalized Linear Markov Decision Process

Sinian Zhang; Kaicheng Zhang; Ziping Xu; Tianxi Cai; Doudou Zhou

arXiv:2506.00818·stat.ML·June 3, 2025

Open Access · 3 Reviews

TL;DR

This paper introduces the Generalized Linear MDP framework, extending linear MDPs to model nonlinear rewards with theoretical guarantees, improving sample efficiency in reward-scarce RL settings.

Contribution

It proposes the GLMDP framework with nonlinear reward modeling, establishes Bellman completeness, and develops offline RL algorithms with provable performance guarantees.

Findings

01

Algorithms achieve theoretical policy suboptimality bounds.

02

Demonstrates improved sample efficiency in reward-limited scenarios.

03

Handles nonlinear and discrete reward structures effectively.

Abstract

The linear Markov Decision Process (MDP) framework offers a principled foundation for reinforcement learning (RL) with strong theoretical guarantees and sample efficiency. However, its restrictive assumption (that both transition dynamics and reward functions are linear in the same feature space) limits its applicability in real-world domains, where rewards often exhibit nonlinear or discrete structures. Motivated by applications such as healthcare and e-commerce, where data is scarce and reward signals can be binary or count-valued, we propose the Generalized Linear MDP (GLMDP) framework, an extension of the linear MDP framework that models rewards using generalized linear models (GLMs) while maintaining linear transition dynamics. We establish the Bellman completeness of GLMDPs with respect to a new function class that accommodates nonlinear rewards and develop two offline RL algorithms:…
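To make the reward side of the abstract concrete, the sketch below simulates binary rewards whose mean is a logistic function of a linear score and recovers the parameter by maximum likelihood. The logistic link, the two-dimensional features, the parameter values, and the function names are all illustrative assumptions; this is ordinary GLM fitting, not a reproduction of the paper's offline RL algorithms.

```python
import math
import random


def sigmoid(z):
    """Logistic mean function: maps a linear score to a probability."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def fit_bernoulli_glm(features, rewards, step=1.0, n_iter=200):
    """Gradient descent on the average Bernoulli negative log-likelihood."""
    n, d = len(features), len(features[0])
    theta = [0.0] * d
    for _ in range(n_iter):
        grad = [0.0] * d
        for x, r in zip(features, rewards):
            # Residual between predicted mean reward and observed reward.
            resid = sigmoid(sum(t * xi for t, xi in zip(theta, x))) - r
            for j in range(d):
                grad[j] += resid * x[j] / n
        theta = [t - step * g for t, g in zip(theta, grad)]
    return theta


random.seed(0)
theta_true = [1.0, -0.5]  # hypothetical reward parameter, not from the paper

# Hypothetical reward features for 2000 logged state-action pairs.
features = [[random.gauss(0, 1), random.gauss(0, 1)] for _ in range(2000)]

# Binary rewards with E[r | s, a] = sigmoid(<features, theta_true>).
rewards = [
    1 if random.random() < sigmoid(sum(t * xi for t, xi in zip(theta_true, x))) else 0
    for x in features
]

theta_hat = fit_bernoulli_glm(features, rewards)
print(theta_hat)
```

A linear least-squares fit to the same binary rewards would target the wrong functional form, which is the kind of mismatch the abstract attributes to the standard linear MDP assumption.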

Peer Reviews

Decision·Submitted to ICLR 2026

Reviewer 01 · Rating 8 · Confidence 2

Strengths

- A new framework: it extends the linear MDP framework to generalized linear models (GLMDP), enabling nonlinear or discrete reward modeling while preserving tractable Bellman updates.
- Theoretical completeness: Proves Bellman completeness and sample-efficient guarantees for both offline and online algorithms under the generalized setting.
- Algorithmic soundness: Designs both pessimistic (GPEVI) and optimistic (GLSVI-UCB) algorithms with rigorous finite-sample and regret bounds.
- Empirical val

Weaknesses

The regret analysis assumes globally bounded derivatives of the link function (Assumption 3), which may exclude common choices like the logistic function. Could the authors explain more about this?
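The reviewer's concern can be illustrated numerically. The review does not quote Assumption 3, so the sketch below assumes, hypothetically, that it asks for the link derivative to be bounded above and bounded away from zero over all real inputs. The logistic function meets the upper bound (its derivative never exceeds 1/4) but its derivative decays to zero as the input grows, so no positive global lower bound exists.

```python
import math


def sigmoid(z):
    """Logistic function."""
    return 1.0 / (1.0 + math.exp(-z))


def sigmoid_derivative(z):
    """Derivative of the logistic function: s(z) * (1 - s(z))."""
    s = sigmoid(z)
    return s * (1.0 - s)


# Evaluate the derivative at increasingly large inputs.
points = [0.0, 2.0, 5.0, 10.0, 20.0]
derivs = [sigmoid_derivative(z) for z in points]

for z, dz in zip(points, derivs):
    print(f"z = {z:5.1f}  derivative = {dz:.3e}")
```

On a bounded range of linear scores the derivative does stay above a positive constant, which is how such conditions are often stated for the logistic case; whether the paper restricts the domain this way is not clear from the excerpt.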

Reviewer 02 · Rating 4 · Confidence 3

Strengths

1. This paper is well-written and easy to follow.
2. This paper is well executed. It proposes a Generalized Linear MDP framework, which retains linear transitions while modeling rewards with generalized linear models under potentially different feature maps. The authors show that GLMDPs are Bellman complete with respect to a new function class. The authors design algorithms with provable guarantees for GLMDPs in both offline and online settings.

Weaknesses

1. MDPs with general function approximation are widely studied in the RL theory literature, e.g., RKHS MDPs and MDPs with Bellman Eluder dimension. The authors should elaborate more on the advantages and motivation of the proposed Generalized Linear MDP framework compared to existing frameworks for MDPs with general function approximation.
2. The algorithm design and theoretical analysis in this paper seem to be a combination of existing techniques in linear MDPs (i.e., least squares value itera

Reviewer 03 · Rating 2 · Confidence 2

Strengths

The theoretical analysis carried out is rigorous and thorough. The problem setting seems interesting and promising, especially the semi-supervised nature of the problem.

Weaknesses

Altogether, the work that has been presented by the authors has all the ingredients needed for a compelling paper. However, the writing in the paper lacks focus and requires significant improvements. As a reader, the paper is very difficult to follow. For instance, consider the concept of a link function, perhaps the most crucial element of the authors’ proposed framework. This concept is introduced in line 49, subsequently mentioned numerous times in the paper, yet never defined (either inform

Taxonomy

Topics: Simulation Techniques and Applications