Policy Gradient Methods for Non-Markovian Reinforcement Learning
Avik Kar, Siddharth Chandak, Rahul Singh, Soumitra Sinhahajari, Eric Moulines, Shalabh Bhatnagar, Nicholas Bambos

TL;DR
This paper introduces a policy gradient method for reinforcement learning in non-Markovian environments, in which the agent state dynamics are optimized jointly with the control policy to maximize expected cumulative reward.
Contribution
It develops a novel policy gradient theorem for Agent State-Markov policies and proposes an efficient algorithm with convergence guarantees.
Findings
ASMPG, the proposed policy gradient algorithm, outperforms baseline methods on non-Markovian tasks.
The proposed gradient theorem extends classical results to non-Markovian settings.
Finite-time and almost-sure convergence guarantees are established for ASMPG.
Abstract
We study policy gradient methods for reinforcement learning in non-Markovian decision processes (NMDPs), where observations and rewards depend on the entire interaction history. To handle this dependence, the agent maintains an internal state that is recursively updated to provide a compact summary of past observations and actions. In contrast to approaches that treat the agent state dynamics as fixed or learn it via predictive objectives, we propose a reward-centric formulation that jointly optimizes the agent state dynamics and the control policy to maximize the expected cumulative reward. To this end, we consider a class of Agent State-Markov (ASM) policies, comprising an agent state dynamics and a control policy that maps the agent state to actions. We establish a novel policy gradient theorem for ASM policies, extending the classical policy gradient results from the Markovian…
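To make the structure described in the abstract concrete, the following is a minimal sketch of an agent that keeps a recursively updated internal state and picks actions from that state alone. It is an illustration under stated assumptions, not the paper's ASMPG algorithm: the tabular softmax parameterization, the names `ASMPolicy`, `theta`, `phi`, and `rollout`, and the toy environment (reward 1 when the action matches the observation from two steps earlier) are all hypothetical choices made here for the example.

```python
import math
import random


def softmax(logits):
    # Numerically stable softmax over a list of floats.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sample(probs, rng):
    # Draw an index from a categorical distribution.
    u, acc = rng.random(), 0.0
    for i, p in enumerate(probs):
        acc += p
        if u < acc:
            return i
    return len(probs) - 1


class ASMPolicy:
    """Hypothetical tabular policy: agent state dynamics plus a control policy."""

    def __init__(self, n_states, n_obs, n_actions, rng):
        self.rng = rng
        # theta[z][a]: control-policy logit for action a in agent state z.
        self.theta = [[0.0] * n_actions for _ in range(n_states)]
        # phi[z][a][o][z2]: logit for moving from agent state z to z2
        # after taking action a and receiving observation o.
        self.phi = [[[[0.0] * n_states for _ in range(n_obs)]
                     for _ in range(n_actions)]
                    for _ in range(n_states)]

    def act(self, z):
        # The action depends on the history only through the agent state z.
        return sample(softmax(self.theta[z]), self.rng)

    def update_state(self, z, a, o):
        # Recursive update: the next agent state depends on (z, a, o) only.
        return sample(softmax(self.phi[z][a][o]), self.rng)


def rollout(policy, horizon, rng):
    # Toy non-Markovian environment (an assumption for this sketch):
    # binary observations are drawn uniformly, and the reward at step t is 1
    # when the action equals the observation received at step t - 2.
    z, observations, total = 0, [], 0.0
    for _ in range(horizon):
        a = policy.act(z)
        if len(observations) >= 2 and a == observations[-2]:
            total += 1.0
        o = rng.randrange(2)
        observations.append(o)
        z = policy.update_state(z, a, o)
    return total


rng = random.Random(0)
policy = ASMPolicy(n_states=4, n_obs=2, n_actions=2, rng=rng)
episode_return = rollout(policy, horizon=50, rng=rng)
```

In this sketch both `theta` and `phi` are free parameters, which mirrors the reward-centric idea of treating the agent state dynamics as something to be optimized rather than fixed. How the gradient with respect to these parameters is computed is the subject of the paper's policy gradient theorem and is not reproduced here.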
