Safe Policy Improvement for POMDPs via Finite-State Controllers
Thiago D. Simão, Marnix Suilen, Nils Jansen

TL;DR
This paper introduces a safe policy improvement (SPI) method for POMDPs in which a finite-state controller represents the behavior policy. The method improves the policy offline, without access to the environment, by mapping the POMDP to a fully observable history MDP and estimating that MDP from historical data.
Contribution
It proposes a novel approach to safe policy improvement in POMDPs: assuming that a finite-state controller represents the behavior policy, it estimates a history MDP from data and uses that estimate to update the policy offline.
Findings
The method improves on the behavior policy with high probability.
Experimental results demonstrate its effectiveness on benchmark problems.
The approach remains applicable even when finite memory is not sufficient to derive optimal policies.
Abstract
We study safe policy improvement (SPI) for partially observable Markov decision processes (POMDPs). SPI is an offline reinforcement learning (RL) problem that assumes access to (1) historical data about an environment, and (2) the so-called behavior policy that previously generated this data by interacting with the environment. SPI methods require access to neither a model nor the environment itself, and aim to reliably improve the behavior policy in an offline manner. Existing methods make the strong assumption that the environment is fully observable. In our novel approach to the SPI problem for POMDPs, we assume that a finite-state controller (FSC) represents the behavior policy and that finite memory is sufficient to derive optimal policies. This assumption allows us to map the POMDP to a finite-state fully observable MDP, the history MDP. We estimate this MDP by combining the…
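To make the idea of estimating a history MDP from logged data more concrete, the following is a minimal sketch and not the authors' implementation. It assumes, for illustration only, that each state of the estimated MDP is a pair (FSC memory node, observation), that the logged data consists of (observation, action, next observation) steps, and that the controller's memory update is available as a function. The names `estimate_history_mdp`, `memory_update`, and `toy_memory_update` are hypothetical, and rewards are omitted for brevity.

```python
from collections import defaultdict


def estimate_history_mdp(trajectories, memory_update, initial_memory):
    """Maximum-likelihood transition estimate over (memory node, observation) states.

    trajectories: list of episodes; each episode is a list of
        (observation, action, next_observation) steps.
    memory_update: hypothetical FSC memory function
        (node, observation, action) -> next node.
    initial_memory: memory node the controller starts each episode in.
    """
    # counts[(state, action)][next_state] = number of times the transition was seen
    counts = defaultdict(lambda: defaultdict(int))
    for episode in trajectories:
        node = initial_memory
        for obs, action, next_obs in episode:
            # Replay the controller's memory alongside the logged data.
            next_node = memory_update(node, obs, action)
            counts[((node, obs), action)][(next_node, next_obs)] += 1
            node = next_node

    # Normalize counts into empirical transition probabilities.
    model = {}
    for key, successors in counts.items():
        total = sum(successors.values())
        model[key] = {state: n / total for state, n in successors.items()}
    return model, counts


def toy_memory_update(node, obs, action):
    # Assumed two-node controller: node 1 remembers that "hint" was observed.
    return 1 if obs == "hint" or node == 1 else 0


episodes = [
    [("start", "a", "hint"), ("hint", "a", "end")],
    [("start", "a", "start"), ("start", "a", "hint")],
]
model, counts = estimate_history_mdp(episodes, toy_memory_update, initial_memory=0)
```

In this toy run, the pair `((0, "start"), "a")` was observed three times, twice leading to `(0, "hint")`, so its estimated probability is 2/3. The raw counts are returned alongside the probabilities because they record how much data supports each estimate.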
Taxonomy
Topics: Reinforcement Learning in Robotics
