Operator-Theoretic Foundations and Policy Gradient Methods for General MDPs with Unbounded Costs
Abhishek Gupta, Aditya Mahajan

TL;DR
This paper develops an operator-theoretic framework for MDPs with general state and action spaces, establishing existence of optimal policies, deriving policy gradients, and proposing a majorization-minimization algorithm (MM-RKHS) that is reported to be more efficient than PPO.
Contribution
It introduces a novel existence theorem, a policy difference lemma, and a majorization-minimization policy gradient algorithm for general MDPs, extending reinforcement learning techniques.
Findings
New existence result for optimal policies in general MDPs.
A policy gradient algorithm based on integral probability metrics.
The MM-RKHS algorithm outperforms PPO in efficiency and convergence.
Abstract
Markov decision processes (MDPs) are viewed as the optimization of an objective function over certain linear operators on general function spaces. A new existence result is established for optimal policies in general MDPs, which differs from the existence results derived previously in the literature. Using the well-established perturbation theory of linear operators, a policy difference lemma is established for general MDPs, and the Gâteaux derivative of the objective function as a function of the policy operator is derived. By upper bounding the policy difference via the theory of integral probability metrics, a new majorization-minimization type policy gradient algorithm for general MDPs is derived. This leads to a generalization of many well-known algorithms in reinforcement learning to cases with general state and action spaces. Further, by taking the integral probability…
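To make the policy difference lemma mentioned in the abstract concrete, the sketch below checks its classical finite-state, discounted-cost form numerically. This is not the paper's general-space result, only the familiar special case: for policies with induced kernels P_old and P_new, v_new - v_old = (I - gamma P_new)^{-1} (T_new v_old - v_old), where T_new is the Bellman operator of the new policy. The two-state MDP, its costs, both policies, and the helper names `induced` and `evaluate` are hypothetical, chosen only for illustration.

```python
def induced(policy, P, c):
    """Kernel and per-state cost induced by a stochastic policy.

    policy[s][a] is the probability of action a in state s,
    P[s][a][t] the transition probability, c[s][a] the one-step cost.
    """
    n, m = len(policy), len(policy[0])
    P_pi = [[sum(policy[s][a] * P[s][a][t] for a in range(m)) for t in range(n)]
            for s in range(n)]
    c_pi = [sum(policy[s][a] * c[s][a] for a in range(m)) for s in range(n)]
    return P_pi, c_pi


def evaluate(P_pi, g, gamma, iters=2000):
    """Apply (I - gamma * P_pi)^{-1} to g by fixed-point iteration.

    Iterating v <- g + gamma * P_pi v sums the Neumann series, which
    converges because gamma < 1.
    """
    n = len(g)
    v = [0.0] * n
    for _ in range(iters):
        v = [g[s] + gamma * sum(P_pi[s][t] * v[t] for t in range(n))
             for s in range(n)]
    return v


# Hypothetical two-state, two-action MDP (illustrative numbers only).
gamma = 0.9
P = [[[0.8, 0.2], [0.1, 0.9]],
     [[0.5, 0.5], [0.3, 0.7]]]
c = [[1.0, 2.0], [0.5, 3.0]]
pi_old = [[0.5, 0.5], [0.5, 0.5]]
pi_new = [[0.9, 0.1], [0.2, 0.8]]

P_old, c_old = induced(pi_old, P, c)
P_new, c_new = induced(pi_new, P, c)
v_old = evaluate(P_old, c_old, gamma)
v_new = evaluate(P_new, c_new, gamma)

# Bellman residual of v_old under the new policy: T_new v_old - v_old.
residual = [c_new[s] + gamma * sum(P_new[s][t] * v_old[t] for t in range(2)) - v_old[s]
            for s in range(2)]

# Policy difference computed two ways; they should agree up to rounding.
diff_via_lemma = evaluate(P_new, residual, gamma)
diff_direct = [v_new[s] - v_old[s] for s in range(2)]
print(diff_via_lemma, diff_direct)
```

The same resolvent, (I - gamma P_new)^{-1}, is the object that perturbation theory of linear operators handles in settings where the kernels are no longer finite matrices.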
