Provable Distributional Value Iteration under Partial Observability
Larry Preuett III, Qiuyi Zhang, and Muhammad Aurangzeb Ahmad

TL;DR
This paper extends Distributional Reinforcement Learning to POMDPs, introducing distributional Bellman operators and a planning algorithm (DPBVI) that account for both uncertainty about the environment's state and variability in outcomes.
Contribution
It proposes a distributional Bellman operator for POMDPs, proves its convergence, and develops DPBVI, a novel planning algorithm combining distributional RL with point-based methods.
Findings
DPBVI recovers classical PBVI in the risk-neutral case
The new operators converge under the supremum p-Wasserstein metric
The distributional approach captures the full return distribution in POMDPs, not only its expectation
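
To make the metric in the findings concrete, here is a minimal Python sketch of a supremum p-Wasserstein distance between two collections of empirical return distributions. It is an illustration under stated assumptions, not the paper's construction: distributions are represented as equal-size, equally weighted 1-D samples, and the dictionary keys (written here as hypothetical `(belief, action)` labels) stand in for whatever index set the paper takes the supremum over. For such samples, the p-Wasserstein distance reduces to the p-norm of the difference between sorted samples.

```python
import numpy as np


def wasserstein_p(x, y, p=2.0):
    """p-Wasserstein distance between two equal-size, equally weighted 1-D samples.

    Assumption of this sketch: both samples have the same number of points,
    so the optimal coupling simply matches sorted values.
    """
    x = np.sort(np.asarray(x, dtype=float))
    y = np.sort(np.asarray(y, dtype=float))
    if x.shape != y.shape:
        raise ValueError("this sketch assumes equal sample counts")
    return float(np.mean(np.abs(x - y) ** p) ** (1.0 / p))


def sup_wasserstein_p(dists_a, dists_b, p=2.0):
    """Largest p-Wasserstein distance over the shared keys of two collections.

    The keys are hypothetical (belief, action) labels chosen for illustration.
    """
    return max(wasserstein_p(dists_a[k], dists_b[k], p) for k in dists_a)


# Two return distributions with the same mean but different spread.
steady = [1.0, 1.0, 1.0, 1.0]
risky = [-1.0, 3.0, -1.0, 3.0]

same_mean = np.isclose(np.mean(steady), np.mean(risky))
gap = wasserstein_p(steady, risky, p=2.0)

collection_a = {("b0", "a0"): steady, ("b0", "a1"): [0.0, 1.0, 2.0, 3.0]}
collection_b = {("b0", "a0"): risky, ("b0", "a1"): [0.0, 1.0, 2.0, 3.0]}
sup_gap = sup_wasserstein_p(collection_a, collection_b, p=2.0)
```

In this example the two distributions agree in expectation, so a risk-neutral criterion cannot separate them, yet their Wasserstein distance is strictly positive. That is the kind of information a full return distribution retains and a scalar value discards.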
Abstract
In many real-world planning tasks, agents must tackle uncertainty about the environment's state and variability in the outcomes induced by stochastic dynamics and rewards. Motivated by recent progress in world model approaches, where latent models approximate beliefs and support planning, we extend Distributional Reinforcement Learning (DistRL), which models the entire return distribution for fully observable domains, to Partially Observable Markov Decision Processes (POMDPs). Concretely, we introduce new distributional Bellman operators for partial observability and prove their convergence under the supremum p-Wasserstein metric. We also propose a finite representation of these return distributions via psi-vectors, generalizing the classical alpha-vectors in POMDP solvers. Building on this, we develop Distributional Point-Based Value Iteration (DPBVI), which integrates psi-vectors into…
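
The relationship between psi-vectors and alpha-vectors can be illustrated with a small Python sketch. The abstract does not define psi-vectors, so everything below is an assumption made for illustration: a "psi-vector" is modeled as one array of return samples per state, the return distribution at a belief is modeled as the belief-weighted mixture of those per-state distributions, and the function names are hypothetical. Under these assumptions, taking per-state means yields an ordinary alpha-vector, and the mean return at a belief equals the familiar dot product of belief and alpha-vector.

```python
import numpy as np


def alpha_from_psi(psi):
    """Collapse per-state return samples to per-state expected returns.

    `psi` is a list with one 1-D sample array per state (an assumed,
    illustrative representation, not the paper's definition).
    """
    return np.array([np.mean(samples) for samples in psi])


def belief_return_mean(belief, psi):
    """Risk-neutral value at a belief: dot product of belief and alpha-vector."""
    return float(np.dot(belief, alpha_from_psi(psi)))


def sample_belief_return(belief, psi, n, rng):
    """Draw n returns from the belief-weighted mixture of per-state distributions."""
    states = rng.choice(len(psi), size=n, p=belief)
    return np.array([rng.choice(psi[s]) for s in states])


# Two states: state 0 has a noisy return, state 1 a deterministic one.
psi = [np.array([0.0, 2.0]), np.array([10.0, 10.0])]
belief = np.array([0.5, 0.5])

alpha = alpha_from_psi(psi)
value = belief_return_mean(belief, psi)
draws = sample_belief_return(belief, psi, n=1000, rng=np.random.default_rng(0))
```

The sampled returns keep the shape of the distribution (here, a mixture over the values 0, 2, and 10), while `value` keeps only its mean. This mirrors, in a simplified form, the abstract's statement that psi-vectors generalize alpha-vectors.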
