Why Policy Gradient Algorithms Work for Undiscounted Total-Reward MDPs

Jongmin Lee; Ernest K. Ryu

arXiv:2510.18340·cs.LG·February 11, 2026

Open Access · 3 Reviews

TL;DR

This paper extends the theoretical analysis of policy gradient algorithms to the undiscounted total-reward setting in reinforcement learning, where the discount factor is $γ = 1$. This is the setting used in recent policy-based RL for large language models.

Contribution

It introduces a novel analysis framework for policy gradients in undiscounted MDPs, utilizing transient visitation measures and state classification invariance.

Findings

01

Policy gradient methods work for undiscounted total-reward MDPs.

02

The classification of states into recurrent and transient is invariant over policies that assign strictly positive probability to every action.

03

A new transient visitation measure replaces classical state visitation measures.
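Finding 03 can be made concrete with a small numerical sketch. This page does not show the paper's definition of the transient visitation measure, so the code below assumes the standard construction from finite Markov chain theory: restrict the policy-induced transition matrix to the transient states (the block `Q`) and use the fundamental matrix $(I - Q)^{-1}$, whose entries are expected visit counts. The function name `transient_visitation`, the toy chain, and the reward vector are all hypothetical.

```python
import numpy as np

def transient_visitation(P, transient, mu):
    """Expected number of visits to each transient state.

    P         : (n, n) row-stochastic matrix of the chain induced by a fixed policy
    transient : indices of the transient states (assumed known here)
    mu        : (n,) initial state distribution
    """
    Q = P[np.ix_(transient, transient)]            # transient-to-transient block
    # Fundamental matrix N = (I - Q)^{-1} = sum_k Q^k. The series converges
    # without any discounting because Q has spectral radius below 1.
    N = np.linalg.inv(np.eye(len(transient)) - Q)
    return mu[transient] @ N

# Hypothetical toy chain: states 0 and 1 are transient, state 2 is absorbing.
P = np.array([[0.5, 0.3, 0.2],
              [0.2, 0.5, 0.3],
              [0.0, 0.0, 1.0]])
mu = np.array([1.0, 0.0, 0.0])
transient = [0, 1]

d = transient_visitation(P, transient, mu)

# Assumed reward: nonzero only on transient states, so the undiscounted
# total reward is finite and equals the visitation-weighted reward.
r = np.array([1.0, 2.0, 0.0])
total_reward = d @ r[transient]
print(d, total_reward)
```

In this sketch the visit counts need not sum to one, unlike a normalized discounted visitation distribution. They are expected counts before absorption.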

Abstract

The classical policy gradient method is the theoretical and conceptual foundation of modern policy-based reinforcement learning (RL) algorithms. Most rigorous analyses of such methods, particularly those establishing convergence guarantees, assume a discount factor $γ < 1$. In contrast, a recent line of work on policy-based RL for large language models uses the undiscounted total-reward setting with $γ = 1$, rendering much of the existing theory inapplicable. In this paper, we provide analyses of the policy gradient method for undiscounted expected total-reward infinite-horizon MDPs based on two key insights: (i) the classification of the MDP states into recurrent and transient states is invariant over the set of policies that assign strictly positive probability to every action (as is typical in deep RL models employing a softmax output layer) and (ii) the classical…
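Insight (i) can be checked numerically on a toy problem. The sketch below is a sanity check under stated assumptions, not the paper's proof: the 3-state, 2-action MDP and the helper names `induced_chain`, `recurrent_states`, and `softmax` are hypothetical. It classifies the states of the policy-induced chain for several random softmax policies, then for one deterministic policy that violates the strict-positivity condition.

```python
import numpy as np

def induced_chain(P, pi):
    """Transition matrix under policy pi. P: (S, A, S), pi: (S, A)."""
    return np.einsum("sa,sat->st", pi, P)

def recurrent_states(P_pi):
    """Recurrent states of a finite chain, using reachability only."""
    n = P_pi.shape[0]
    reach = (P_pi > 0) | np.eye(n, dtype=bool)
    for k in range(n):                       # Warshall transitive closure
        reach |= reach[:, [k]] & reach[[k], :]
    # In a finite chain, i is recurrent iff every state reachable from i
    # can reach i back (its communicating class is closed).
    return {i for i in range(n)
            if all(reach[j, i] for j in range(n) if reach[i, j])}

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

# Hypothetical MDP: state 2 is terminal (absorbing under both actions).
P = np.zeros((3, 2, 3))
P[0, 0, 0] = 1.0                  # state 0, action 0: stay in 0
P[0, 1, 1] = 1.0                  # state 0, action 1: move to 1
P[1, 0, 0] = P[1, 0, 2] = 0.5     # state 1, action 0: back to 0 or terminate
P[1, 1, 2] = 1.0                  # state 1, action 1: terminate
P[2, :, 2] = 1.0

rng = np.random.default_rng(0)
classes = [recurrent_states(induced_chain(P, softmax(rng.normal(size=(3, 2)))))
           for _ in range(20)]

# A deterministic policy that always picks action 0 is not strictly positive.
pi_det = np.array([[1.0, 0.0], [1.0, 0.0], [1.0, 0.0]])
det_class = recurrent_states(induced_chain(P, pi_det))
print(classes[0], det_class)
```

Every softmax policy yields the same recurrent set, since only the support of the induced transition matrix matters and strictly positive policies share that support. The deterministic policy changes the support (state 0 becomes absorbing), which is why the invariance is stated for strictly positive policies.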

Peer Reviews

Decision·Submitted to ICLR 2026

Reviewer 01 · Rating 6 · Confidence 4

Strengths

The problem studied is important but has not received much attention because of some pathological conditions (such as the lack of continuity of the value function with respect to the policy and the existence of solutions to the Bellman equation). The authors provide a principled approach to this and analyze standard policy gradient algorithms in this setting. The paper is overall well written and the core intuition behind their analysis is easy to follow.

Weaknesses

The technical novelty is not entirely clear. The analysis is pretty much exactly the same as in prior works, except that it considers the transient state probabilities instead of the entire probability transition matrix. The policy gradient analysis seems moot since the NPG bounds seem to perform much better and have better convergence properties (no dependence on state and action spaces). The results would be a lot more compelling if the state and action spaces are countable (In practice many systems

Reviewer 02 · Rating 8 · Confidence 4

Strengths

1. The paper is nicely written and to the point. The presentation is clear and the paper is well organized, making the paper easy to read (at least for someone familiar with the theory of policy gradient methods). I enjoyed reading it.
2. The contributions are solid and novelty is clear, building on a few insights from the fundamental theory of (finite state) Markov chains (recurrence and transience). The simplification of Lemma 1, 2 which also leads to the policy gradient theorem is a nice ins

Weaknesses

1. **Technical novelty**: the development of the results mainly follows existing techniques (Xiao 2022 in particular) with a few adaptations using the theory of Markov chains (for finite states) which are fairly straightforward (as the proofs on p. 13 show, for instance). Overall, I think that technical novelty is limited, but this is not a major weakness in my opinion as there are some interesting distinctions with the discounted setting and the analysis is well executed.
2. **Corollary 1**: $\al

Reviewer 03 · Rating 0 · Confidence 5

Strengths

None.

Weaknesses

- The paper overlooks the extensive literature on average-cost MDPs, where the undiscounted infinite-horizon problem has already been analyzed in depth from multiple theoretical perspectives.
- The assumption that the value function remains finite without discounting or averaging is unjustified and mathematically inconsistent under standard reward structures. The analysis ignores the fundamental role of discounting or averaging in ensuring convergence of value functions, leaving the proposed fr

Taxonomy

Topics: Reinforcement Learning in Robotics · Domain Adaptation and Few-Shot Learning · Stochastic Gradient Optimization Techniques