MDP Planning as Policy Inference

David Tolpin

arXiv:2602.17375 · cs.LG · April 14, 2026

TL;DR

This paper casts episodic MDP planning as Bayesian inference over policies: the policy is a latent variable, variational sequential Monte Carlo approximates the posterior over policies, and the dispersion of that posterior represents uncertainty about optimal behavior.

Contribution

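It introduces a policy inference framework that adapts variational sequential Monte Carlo (VSMC) to deterministic policies in discrete MDPs with stochastic dynamics. Acting is done by posterior predictive sampling, which admits a Thompson-sampling interpretation.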

Findings

01 Inferred policy distributions reveal structure and uncertainty in various domains.

02 The approach produces qualitatively different behaviors compared to Soft Actor-Critic.

03 Posterior dispersion captures uncertainty over optimal policies.

Abstract

We cast episodic Markov decision process (MDP) planning as Bayesian inference over policies. A policy is treated as the latent variable and is assigned an unnormalized probability of optimality that is monotone in its expected return, yielding a posterior distribution whose modes coincide with return-maximizing solutions while posterior dispersion represents uncertainty over optimal behavior. To approximate this posterior in discrete domains, we adapt variational sequential Monte Carlo (VSMC) to inference over deterministic policies under stochastic dynamics, introducing a sweep that enforces policy consistency across revisited states and couples transition randomness across particles to avoid confounding from simulator noise. Acting is performed by posterior predictive sampling, which induces a stochastic control policy through a Thompson-sampling interpretation rather than entropy…
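The core idea in the abstract can be illustrated on a toy problem. The sketch below is not the paper's VSMC algorithm: it enumerates every deterministic policy of a small, made-up two-state MDP and computes the posterior exactly. The transition table `P`, the rewards `R`, the horizon, the exponential weighting `exp(J / temperature)`, and the `temperature` value are all assumptions chosen for illustration; the abstract only requires the weight to be monotone in expected return.

```python
import itertools
import math
import random

# Hypothetical toy MDP (not from the paper): 2 states, 2 actions.
# P[s][a] lists (next_state, probability); R[s][a] is the immediate reward.
P = {
    0: {0: [(0, 0.9), (1, 0.1)], 1: [(0, 0.2), (1, 0.8)]},
    1: {0: [(0, 0.5), (1, 0.5)], 1: [(1, 1.0)]},
}
R = {0: {0: 0.0, 1: 0.0}, 1: {0: 1.0, 1: 0.8}}
HORIZON = 5
START = 0


def expected_return(policy, horizon=HORIZON, start=START):
    """Expected return of a deterministic policy via backward induction."""
    value = {s: 0.0 for s in P}
    for _ in range(horizon):
        value = {
            s: R[s][policy[s]]
            + sum(p * value[s2] for s2, p in P[s][policy[s]])
            for s in P
        }
    return value[start]


def policy_posterior(temperature=0.5):
    """Exact posterior over all deterministic policies.

    Weights are exp(J / temperature), an assumed monotone transform of the
    expected return J, so the posterior mode is a return-maximizing policy.
    """
    policies = [
        dict(zip(P, actions))
        for actions in itertools.product([0, 1], repeat=len(P))
    ]
    returns = [expected_return(pi) for pi in policies]
    top = max(returns)  # subtract the maximum for numerical stability
    weights = [math.exp((j - top) / temperature) for j in returns]
    total = sum(weights)
    return policies, returns, [w / total for w in weights]


def act(state, policies, probs, rng):
    """Thompson-style acting: sample a policy from the posterior, follow it."""
    sampled = rng.choices(policies, weights=probs, k=1)[0]
    return sampled[state]


if __name__ == "__main__":
    policies, returns, probs = policy_posterior()
    for pi, j, p in zip(policies, returns, probs):
        print(pi, round(j, 3), round(p, 3))
    print("action at start:", act(START, policies, probs, random.Random(0)))
```

Lowering `temperature` concentrates the posterior on the best policy, while raising it spreads mass over near-optimal policies, which is one way to read the posterior dispersion the abstract mentions. Exact enumeration grows exponentially with the number of states, which is why an approximate method is needed for anything larger than a toy.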
