TL;DR
This paper addresses the complex problem of synthesizing policies in POMDPs that guarantee reaching a goal state almost surely while avoiding bad states, introducing two novel algorithms and demonstrating their effectiveness.
Contribution
The paper introduces two new algorithms—a SAT-based iterative method and a decision-diagram approach—for computing winning regions in POMDPs with almost-sure reachability.
Findings
Algorithms are feasible and effective in practice.
Empirical evaluation confirms the approaches' efficiency.
Provides tools for safe exploration in POMDPs.
Abstract
Partially-Observable Markov Decision Processes (POMDPs) are a well-known stochastic model for sequential decision making under limited information. We consider the EXPTIME-hard problem of synthesising policies that almost-surely reach some goal state without ever visiting a bad state. In particular, we are interested in computing the winning region, that is, the set of system configurations from which a policy exists that satisfies the reachability specification. A direct application of such a winning region is the safe exploration of POMDPs by, for instance, restricting the behavior of a reinforcement learning agent to the region. We present two algorithms: A novel SAT-based iterative approach and a decision-diagram based alternative. The empirical evaluation demonstrates the feasibility and efficacy of the approaches.
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Code & Models
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
