Constrained Stochastic Optimal Control with a Baseline Performance Guarantee
Yinlam Chow, Mohammad Ghavamzadeh

TL;DR
This paper shows how a simulated MDP, built using baseline policies, can be used to compute a policy whose performance in the real environment is guaranteed to be better than the baseline policy's, with applications such as news recommendation, health care diagnosis, and digital online marketing.
Contribution
It presents an offline, iterative algorithm that computes an improved policy in a simulated MDP, together with a bound on the sub-optimality of the resulting control policy, contributing to safe policy improvement.
Findings
Performance bound on sub-optimality of the derived policy
Policy computed in the simulated MDP is guaranteed to perform better than the baseline policy in the real environment
Applicable to domains such as news recommendation, health care diagnosis, and digital online marketing
Abstract
In this paper, we show how a simulated Markov decision process (MDP), built using the so-called baseline policies, can be used to compute a different policy, namely the simulated optimal policy, whose performance is guaranteed to be better than that of the baseline policy in the real environment. This technique has broad applications in fields such as news recommendation systems, health care diagnosis, and digital online marketing. Our proposed algorithm iteratively solves for a "good" policy in the simulated MDP in an offline setting. Furthermore, we provide a performance bound on the sub-optimality of the control policy generated by the proposed algorithm.
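The general idea of solving for a policy in a simulated MDP offline can be illustrated with a minimal sketch. This is not the paper's algorithm: it assumes the simulated MDP is estimated by counting transitions in data collected under a baseline policy with some exploration, and it uses plain value iteration as the iterative solver. The toy environment (TRUE_P, TRUE_R), the baseline policy, and all function names are hypothetical and exist only for illustration.

```python
import random

def estimate_model(samples, n_states, n_actions):
    """Build a simulated MDP (P_hat, R_hat) from (s, a, r, s_next) samples by counting."""
    counts = [[[0] * n_states for _ in range(n_actions)] for _ in range(n_states)]
    reward_sum = [[0.0] * n_actions for _ in range(n_states)]
    for s, a, r, s_next in samples:
        counts[s][a][s_next] += 1
        reward_sum[s][a] += r
    P_hat = [[None] * n_actions for _ in range(n_states)]
    R_hat = [[0.0] * n_actions for _ in range(n_states)]
    for s in range(n_states):
        for a in range(n_actions):
            n = sum(counts[s][a])
            if n == 0:
                # Assumption: an unvisited pair becomes a zero-reward self-loop.
                P_hat[s][a] = [1.0 if j == s else 0.0 for j in range(n_states)]
            else:
                P_hat[s][a] = [c / n for c in counts[s][a]]
                R_hat[s][a] = reward_sum[s][a] / n
    return P_hat, R_hat

def q_value(P, R, V, s, a, gamma):
    """One-step lookahead value of taking action a in state s."""
    return R[s][a] + gamma * sum(p * v for p, v in zip(P[s][a], V))

def value_iteration(P, R, gamma=0.9, tol=1e-10):
    """Iteratively solve the simulated MDP; return a greedy policy and its values."""
    n_states, n_actions = len(P), len(P[0])
    V = [0.0] * n_states
    while True:
        V_new = [max(q_value(P, R, V, s, a, gamma) for a in range(n_actions))
                 for s in range(n_states)]
        delta = max(abs(x - y) for x, y in zip(V, V_new))
        V = V_new
        if delta < tol:
            break
    policy = [max(range(n_actions), key=lambda a: q_value(P, R, V, s, a, gamma))
              for s in range(n_states)]
    return policy, V

def evaluate_policy(P, R, policy, gamma=0.9, tol=1e-10):
    """Iterative policy evaluation of a deterministic policy in the given MDP."""
    n_states = len(P)
    V = [0.0] * n_states
    while True:
        V_new = [q_value(P, R, V, s, policy[s], gamma) for s in range(n_states)]
        delta = max(abs(x - y) for x, y in zip(V, V_new))
        V = V_new
        if delta < tol:
            return V

# Hypothetical "real" environment, used here only to generate baseline data.
TRUE_P = [
    [[0.8, 0.2, 0.0], [0.1, 0.6, 0.3]],
    [[0.5, 0.5, 0.0], [0.0, 0.3, 0.7]],
    [[0.3, 0.0, 0.7], [0.0, 0.1, 0.9]],
]
TRUE_R = [[0.0, 0.1], [0.2, 0.5], [1.0, 0.8]]
baseline = [0, 0, 0]  # hypothetical baseline: always take action 0

def collect(policy, n_steps, explore=0.3, seed=0):
    """Roll out the baseline with occasional random actions and log transitions."""
    rng = random.Random(seed)
    s, samples = 0, []
    for _ in range(n_steps):
        a = rng.randrange(2) if rng.random() < explore else policy[s]
        s_next = rng.choices(range(3), weights=TRUE_P[s][a])[0]
        samples.append((s, a, TRUE_R[s][a], s_next))
        s = s_next
    return samples

samples = collect(baseline, 5000)
P_hat, R_hat = estimate_model(samples, 3, 2)
sim_policy, V_sim = value_iteration(P_hat, R_hat)   # policy solved in the simulated MDP
V_base = evaluate_policy(P_hat, R_hat, baseline)    # baseline evaluated in the same model
```

In this sketch, V_sim is at least V_base in every state, but only inside the simulated MDP. Carrying such a comparison over to the real environment is the guarantee the paper addresses, and nothing in the code above establishes it.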
Taxonomy
Topics: Risk and Portfolio Optimization · Advanced Control Systems Optimization · Stochastic processes and financial applications
