Duality and Policy Evaluation in Distributionally Robust Bayesian Diffusion Control
Jose Blanchet, Jiayi Cheng, Yuewei Ling, Hao Liu, Yang Liu

TL;DR
This paper introduces a distributionally robust Bayesian control framework for diffusion control problems under parameter uncertainty, addressing prior misspecification and distribution shifts with a duality-based optimization approach.
Contribution
It proposes a distributionally robust Bayesian control (DRBC) formulation with a strong duality result that reduces the robust evaluation over priors to a low-dimensional optimization and enables practical policy learning under uncertainty.
Findings
Efficient algorithm validated on a linear-quadratic control example.
Demonstrated effectiveness in real-data portfolio selection.
Reduces over-pessimism compared to classical robust control.
Abstract
We study diffusion control problems under parameter uncertainty. Controllers based on plug-in estimation can be brittle due to potential distribution shifts. Bayesian control with a prior on the parameters offers a formulation with beliefs about such shifts. However, as with any Bayesian model, the prior may be misspecified. To mitigate misspecification and reduce over-pessimism compared to classical robust control approaches (e.g., Hansen and Sargent, 2008), we propose a distributionally robust Bayesian control (DRBC) formulation in which an adversary perturbs the prior within a divergence neighborhood of a baseline prior. We develop a strong duality result that reduces the distributionally robust prior evaluation to a low-dimensional optimization and yields a practical simulation-based policy evaluation and learning procedure with structured policy parameterizations. We validate…
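To make the phrase "reduces the distributionally robust prior evaluation to a low-dimensional optimization" concrete, here is a minimal sketch under stated assumptions. It is not the paper's theorem or algorithm: it assumes the divergence neighborhood is a KL ball of radius `delta` around an empirical baseline prior, and it uses the standard KL dual, sup over Q with KL(Q || P) <= delta of E_Q[c] = inf over lam > 0 of lam * delta + lam * log E_P[exp(c / lam)], which is a one-dimensional problem. The function names, the search interval, and the quadratic cost are all hypothetical; in a simulation-based procedure each cost would itself be an estimate for one parameter draw.

```python
import math
import random


def log_mean_exp(values):
    """Numerically stable log of the average of exp(v) over `values`."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values) / len(values))


def kl_dual_objective(costs, delta, lam):
    """Dual objective lam * delta + lam * log E_P[exp(cost / lam)], lam > 0."""
    return lam * delta + lam * log_mean_exp([c / lam for c in costs])


def kl_worst_case_cost(costs, delta, lo=-10.0, hi=10.0, iters=80):
    """Approximate the worst-case expected cost over a KL ball of priors.

    Golden-section search over u = log(lam). The dual is convex in lam,
    hence unimodal in u. The interval [lo, hi] is an assumption.
    """
    ratio = (math.sqrt(5.0) - 1.0) / 2.0

    def f(u):
        return kl_dual_objective(costs, delta, math.exp(u))

    a, b = lo, hi
    c, d = b - ratio * (b - a), a + ratio * (b - a)
    fc, fd = f(c), f(d)
    for _ in range(iters):
        if fc < fd:
            # Minimum lies in [a, d]: shrink from the right.
            b, d, fd = d, c, fc
            c = b - ratio * (b - a)
            fc = f(c)
        else:
            # Minimum lies in [c, b]: shrink from the left.
            a, c, fc = c, d, fd
            d = a + ratio * (b - a)
            fd = f(d)
    return min(fc, fd)


random.seed(0)
# Hypothetical setup: drift draws from a Gaussian baseline prior and a
# quadratic cost for a policy tuned to a drift of 1.2.
drifts = [random.gauss(1.0, 0.5) for _ in range(2000)]
costs = [(b - 1.2) ** 2 for b in drifts]
nominal = sum(costs) / len(costs)
robust = kl_worst_case_cost(costs, delta=0.1)
print(nominal, robust)
```

The robust value is at least the nominal Bayesian average and grows with `delta`, so the radius controls how pessimistic the evaluation is; a radius near zero recovers the plain Bayesian evaluation.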
Peer Reviews
Decision · Submitted to ICLR 2026
The problem formulation seems to capture an important special case of the distributionally-robust control problem, where the uncertainty is confined to the drift term in a diffusion process. The mathematical approach is highly sophisticated, with some nontrivial results from the authors and some usage of recent advanced Monte Carlo estimation tools. I read the theorem statements in the main body, but did not check the proofs in the appendix. However, the tools used seem appropriate. I am not f
The introduction to this paper seemed to suggest general-purpose implications in learning-based control. For example, the statement: *"Our motivating application is continuous-time control with unknown dynamics"*, or the list of related work citing contextual bandits, policy gradient, $Q$-learning, etc. However, as far as I can tell, the contributions are highly specialized to the specific finance-inspired diffusion problem considered here. As a reviewer from the RL/control side of things: If
- The paper's motivation is well-grounded. The detrimental effect of an imprecise prior is also illustrated through numerical experiments.
- Both theoretical tractability and numerical performance are considered, each in well-motivated settings with convincing results.
- Assumptions and experimental setups are accompanied by proper discussions.
- The motivation for using phi-divergence beyond the simple strong duality formula seems inadequate. For example, a phi-divergence ambiguity set requires that the distribution in the first argument be absolutely continuous with respect to the one in the second argument. It also cannot capture the geometry of the support of the distribution. For finance applications this would be less of an issue through normalization, but for most control tasks the geometry of the support does matter.
- While prior-only robustness has been explored in static DRO and Bayesian optimization, extending it to a linear-diffusion model is interesting, although the setting is quite specific and the resulting time-inconsistency issue is not addressed.
- The quotient-space construction and strong duality (Thm 2) are neat, and the asymptotic $\mathcal{O}_p(n^{-1/2})$ convergence rate for the rMLMC estimator is well motivated.
- The quotient-space duality and the rMLMC-based unbiased evaluation are app
The analysis and algorithm are developed entirely under a specific linear diffusion model with constant drift and volatility. The “policy learning” step optimizes terminal wealth rather than a general state-action mapping, and there’s no evidence that the ideas extend beyond this setup. As such, it is not clear how the approach contributes to the broader ICLR community, which typically values methods applicable to generic stochastic control, reinforcement learning, or optimization under u
Taxonomy
Topics: Advanced Bandit Algorithms Research · Reinforcement Learning in Robotics · Advanced Control Systems Optimization
Methods: Diffusion
