Generalized Linear Markov Decision Process
Sinian Zhang, Kaicheng Zhang, Ziping Xu, Tianxi Cai, Doudou Zhou

TL;DR
This paper introduces the Generalized Linear MDP framework, extending linear MDPs to model nonlinear rewards with theoretical guarantees, improving sample efficiency in reward-scarce RL settings.
Contribution
It proposes the GLMDP framework with nonlinear reward modeling, establishes Bellman completeness, and develops offline RL algorithms with provable performance guarantees.
Findings
The proposed algorithms come with provable bounds on policy suboptimality.
Demonstrates improved sample efficiency in reward-limited scenarios.
Handles nonlinear and discrete reward structures effectively.
Abstract
The linear Markov Decision Process (MDP) framework offers a principled foundation for reinforcement learning (RL) with strong theoretical guarantees and sample efficiency. However, its restrictive assumption, that both transition dynamics and reward functions are linear in the same feature space, limits its applicability in real-world domains, where rewards often exhibit nonlinear or discrete structures. Motivated by applications such as healthcare and e-commerce, where data is scarce and reward signals can be binary or count-valued, we propose the Generalized Linear MDP (GLMDP) framework, an extension of the linear MDP framework that models rewards using generalized linear models (GLMs) while maintaining linear transition dynamics. We establish the Bellman completeness of GLMDPs with respect to a new function class that accommodates nonlinear rewards and develop two offline RL algorithms:…
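To make the modeling assumption in the abstract concrete, here is a minimal sketch of a tabular environment in which the mean reward is a GLM of one feature map and the transition kernel is linear in a second, separate feature map. Everything specific is an illustrative assumption and not taken from the paper: the sizes, the names `phi_r`, `phi_p`, `theta`, `mu`, and `step`, the logistic link, and the Bernoulli reward noise.

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions = 4, 2
d_r, d_p = 3, 5  # reward and transition feature dimensions (assumed, may differ)


def sigmoid(z):
    """Logistic inverse link, one possible GLM choice for binary rewards."""
    return 1.0 / (1.0 + np.exp(-z))


# Hypothetical reward features and parameter: E[r | s, a] = sigmoid(phi_r(s, a)^T theta).
phi_r = rng.normal(size=(n_states, n_actions, d_r))
theta = rng.normal(size=d_r)
mean_reward = sigmoid(phi_r @ theta)  # shape (n_states, n_actions)

# Hypothetical transition features: each phi_p(s, a) is a probability vector
# over d_p latent components, and each row of mu is a next-state distribution,
# so P(. | s, a) = phi_p(s, a)^T mu is a valid distribution, linear in phi_p.
phi_p = rng.dirichlet(np.ones(d_p), size=(n_states, n_actions))
mu = rng.dirichlet(np.ones(n_states), size=d_p)
P = phi_p @ mu  # shape (n_states, n_actions, n_states)


def step(s, a):
    """Sample a binary reward and a next state from the sketched model."""
    r = int(rng.binomial(1, mean_reward[s, a]))
    s_next = int(rng.choice(n_states, p=P[s, a]))
    return r, s_next


r, s_next = step(0, 1)
```

Swapping `sigmoid` for `np.exp` and the Bernoulli draw for a Poisson draw would give a count-valued reward under the same structure; the transition side is unchanged in either case.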
Peer Reviews
Decision·Submitted to ICLR 2026
- A new framework: it extends the linear MDP framework to generalized linear models (GLMDP), enabling nonlinear or discrete reward modeling while preserving tractable Bellman updates.
- Theoretical completeness: Proves Bellman completeness and sample-efficient guarantees for both offline and online algorithms under the generalized setting.
- Algorithmic soundness: Designs both pessimistic (GPEVI) and optimistic (GLSVI-UCB) algorithms with rigorous finite-sample and regret bounds.
- Empirical val
The regret analysis assumes globally bounded derivatives of the link function (Assumption 3), which may exclude common choices like the logistic function. Could the authors explain more about this?
1. This paper is well-written and easy to follow.
2. This paper is well executed. It proposes a Generalized Linear MDP framework, which retains linear transitions while modeling rewards with generalized linear models under potentially different feature maps. The authors show that GLMDPs are Bellman complete with respect to a new function class. The authors design algorithms with provable guarantees for GLMDPs in both offline and online settings.
1. MDPs with general function approximation are widely studied in the RL theory literature, e.g., RKHS MDPs and MDPs with Bellman Eluder dimension. The authors should elaborate more on the advantages and motivation of the proposed Generalized Linear MDP framework compared to existing frameworks for MDPs with general function approximation.
2. The algorithm design and theoretical analysis in this paper seem to be a combination of existing techniques in linear MDPs (i.e., least squares value itera
The theoretical analysis carried out is rigorous and thorough. The problem setting seems interesting and promising, especially the semi-supervised nature of the problem.
Altogether, the work that has been presented by the authors has all the ingredients needed for a compelling paper. However, the writing in the paper lacks focus and requires significant improvements. As a reader, the paper is very difficult to follow. For instance, consider the concept of a link function, perhaps the most crucial element of the authors’ proposed framework. This concept is introduced in line 49, subsequently mentioned numerous times in the paper, yet never defined (either inform
Taxonomy
Topics: Simulation Techniques and Applications
