Partially Observable Reference Policy Programming: Solving POMDPs Sans Numerical Optimisation
Edward Kim, Hanna Kurniawati

TL;DR
This paper introduces a new online approximate POMDP solver that samples deep future histories and guarantees bounded performance loss, outperforming existing benchmarks in complex, dynamic environments.
Contribution
It presents Partially Observable Reference Policy Programming, a novel algorithm with theoretical performance guarantees and superior empirical results on large-scale, dynamic POMDP problems.
Findings
Outperforms current online benchmarks in complex scenarios
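Bounds performance loss by the average, rather than the maximum, of the sampling approximation errors
Evaluated on two large-scale problems with dynamically evolving environments, including a helicopter emergency scenario in the Corsica region requiring approximately 150 planning steps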
Abstract
This paper proposes Partially Observable Reference Policy Programming, a novel anytime online approximate POMDP solver which samples meaningful future histories very deeply while simultaneously forcing a gradual policy update. We provide theoretical guarantees for the algorithm's underlying scheme which say that the performance loss is bounded by the average of the sampling approximation errors rather than the usual maximum, a crucial requirement given the sampling sparsity of online planning. Empirical evaluations on two large-scale problems with dynamically evolving environments -- including a helicopter emergency scenario in the Corsica region requiring approximately 150 planning steps -- corroborate the theoretical results and indicate that our solver considerably outperforms current online benchmarks.
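To make the distinction between an average-error and a maximum-error bound concrete, the following minimal sketch compares the two aggregates on a sequence of per-iteration approximation errors. The error values and the function name `error_summaries` are hypothetical and chosen only for illustration; they are not taken from the paper, and the code does not reproduce the paper's actual bound or its constants.

```python
from statistics import mean


def error_summaries(errors):
    """Return (average, maximum) of the absolute per-iteration errors.

    `errors` is a hypothetical sequence of sampling approximation errors,
    one per planning iteration. It is an illustrative input, not data
    from the paper.
    """
    magnitudes = [abs(e) for e in errors]
    return mean(magnitudes), max(magnitudes)


# Hypothetical scenario: sparse sampling makes most iterations accurate,
# but one iteration has a large approximation error.
errors = [0.1, 0.1, 0.1, 0.1, 2.0, 0.1, 0.1, 0.1]
avg_err, max_err = error_summaries(errors)

print(f"average error: {avg_err:.4f}")
print(f"maximum error: {max_err:.4f}")
```

In this toy sequence a single poorly sampled iteration sets the maximum, while the average stays much closer to the typical error. A bound stated in terms of the average is therefore less sensitive to occasional large errors, which is the property the abstract highlights as important under sparse online sampling.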
Taxonomy
Topics: Antibiotics Pharmacokinetics and Efficacy · Machine Learning and Algorithms
