Policy Gradient Methods for Non-Markovian Reinforcement Learning

Avik Kar, Siddharth Chandak, Rahul Singh, Soumitra Sinhahajari, Eric Moulines, Shalabh Bhatnagar, Nicholas Bambos

arXiv:2605.10816 · cs.LG · May 12, 2026


TL;DR

This paper introduces a policy gradient method for reinforcement learning in non-Markovian environments, in which the agent-state dynamics are optimized jointly with the control policy to maximize reward.

Contribution

It establishes a novel policy gradient theorem for Agent State-Markov (ASM) policies and proposes an efficient algorithm, ASMPG, with convergence guarantees.
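For orientation only: the statement of the paper's theorem is not reproduced on this page, so the identity below is the standard likelihood-ratio (REINFORCE) argument applied to a policy with a stochastic agent state. It is not the authors' result. In assumed notation, Z_t is the agent state (Z_0 fixed), π_θ(a | z) is the control policy, σ_φ(z' | z, a, o) is the agent-state dynamics, and the environment's distribution over (O_{t+1}, R_{t+1}) may depend on the whole history H_t but not on (θ, φ). Because the environment factors drop out of the log-likelihood of a trajectory τ, both parameter sets receive a score-function gradient:

```latex
\log p_{\theta,\phi}(\tau)
  = \sum_{t=0}^{T-1} \log \pi_\theta(A_t \mid Z_t)
  + \sum_{t=0}^{T-2} \log \sigma_\phi(Z_{t+1} \mid Z_t, A_t, O_{t+1})
  + \text{(environment terms independent of } \theta, \phi \text{)}

\nabla_\theta J(\theta,\phi)
  = \mathbb{E}\Big[\, G \sum_{t=0}^{T-1} \nabla_\theta \log \pi_\theta(A_t \mid Z_t) \Big],
\qquad
\nabla_\phi J(\theta,\phi)
  = \mathbb{E}\Big[\, G \sum_{t=0}^{T-2} \nabla_\phi \log \sigma_\phi(Z_{t+1} \mid Z_t, A_t, O_{t+1}) \Big],
\qquad
G = \sum_{t=0}^{T-1} \gamma^{t} R_{t+1}
```

When Z_t = O_t deterministically, the σ term disappears and the expression reduces to the familiar REINFORCE gradient for a policy conditioned on the current observation.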

Findings

1. ASMPG outperforms baseline methods on non-Markovian tasks.
2. The proposed gradient theorem extends classical results to non-Markovian settings.
3. Finite-time and almost sure convergence are established for ASMPG.

Abstract

We study policy gradient methods for reinforcement learning in non-Markovian decision processes (NMDPs), where observations and rewards depend on the entire interaction history. To handle this dependence, the agent maintains an internal state that is recursively updated to provide a compact summary of past observations and actions. In contrast to approaches that treat the agent state dynamics as fixed or learn them via predictive objectives, we propose a reward-centric formulation that jointly optimizes the agent state dynamics and the control policy to maximize the expected cumulative reward. To this end, we consider a class of Agent State-Markov (ASM) policies, comprising agent state dynamics and a control policy that maps the agent state to actions. We establish a novel policy gradient theorem for ASM policies, extending the classical policy gradient results from the Markovian…
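The recursive agent-state update and the joint, reward-driven optimization described in the abstract can be illustrated with a minimal sketch. Everything in it is an assumption for illustration: the toy cue-recall task, the tabular softmax parameterization (theta for the control policy, phi for the agent-state dynamics), stochastic agent-state transitions, and a plain REINFORCE estimator with a leave-one-out baseline. It is not the paper's ASMPG algorithm, whose update rules are not given in this excerpt. It only shows how a single return-weighted score-function estimate can update both the agent-state dynamics and the control policy.

```python
import numpy as np

# Hypothetical toy NMDP (illustration only, not from the paper): a cue in {0, 1}
# is observed at t = 0, then only a blank observation (index 2) follows. Reward 1
# is paid at the last step iff the last action equals the cue, so a policy that
# sees only the current observation cannot do better than chance.
N_OBS, N_ACT, N_Z, T = 3, 2, 2, 4


def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()


def run_episode(theta, phi, rng):
    """Roll out one episode with an ASM-style policy.

    theta[z, a]:       logits of the control policy pi(a | z)
    phi[z, a, o, z2]:  logits of the agent-state dynamics sigma(z2 | z, a, o)
    Returns (episode return, summed score for theta, summed score for phi).
    """
    s_theta, s_phi = np.zeros_like(theta), np.zeros_like(phi)
    cue = int(rng.integers(2))
    z, a_prev = 0, 0  # assumed fixed initial agent state and dummy previous action
    for t in range(T):
        obs = cue if t == 0 else 2
        # Agent-state update z' ~ sigma(. | z, a_prev, obs); score = onehot - probs.
        p_z = softmax(phi[z, a_prev, obs])
        z_next = int(rng.choice(N_Z, p=p_z))
        s_phi[z, a_prev, obs] += np.eye(N_Z)[z_next] - p_z
        z = z_next
        # The action depends on the agent state only, not on the raw history.
        p_a = softmax(theta[z])
        a = int(rng.choice(N_ACT, p=p_a))
        s_theta[z] += np.eye(N_ACT)[a] - p_a
        a_prev = a
    ret = float(a == cue)  # reward only at the final step (gamma = 1)
    return ret, s_theta, s_phi


def train(iters=300, batch=16, lr=0.5, seed=0):
    """Joint stochastic gradient ascent on (theta, phi) with REINFORCE."""
    rng = np.random.default_rng(seed)
    theta = 0.01 * rng.standard_normal((N_Z, N_ACT))
    phi = 0.01 * rng.standard_normal((N_Z, N_ACT, N_OBS, N_Z))
    history = []
    for _ in range(iters):
        eps = [run_episode(theta, phi, rng) for _ in range(batch)]
        G = np.array([e[0] for e in eps])
        # Leave-one-out baseline: reduces variance without biasing the gradient.
        b = (G.sum() - G) / (batch - 1)
        g_theta = sum((G[i] - b[i]) * eps[i][1] for i in range(batch)) / batch
        g_phi = sum((G[i] - b[i]) * eps[i][2] for i in range(batch)) / batch
        theta += lr * g_theta  # both parameter sets follow the same reward signal
        phi += lr * g_phi
        history.append(float(G.mean()))
    return theta, phi, history


theta, phi, hist = train()
print(f"mean return: first 20 iters {np.mean(hist[:20]):.2f}, "
      f"last 20 iters {np.mean(hist[-20:]):.2f}")
```

Because the agent state is sampled rather than computed deterministically, both parameter sets receive gradients through the log-probability of the sampled trajectory. A deterministic, differentiable state update (for example, a recurrent network) would instead require backpropagating through the state recursion.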
