An Incremental Off-policy Search in a Model-free Markov Decision Process Using a Single Sample Path
Ajin George Joseph, Shalabh Bhatnagar

TL;DR
This paper introduces a data-efficient, stable, and robust stochastic approximation algorithm based on the cross entropy method for solving a modified control problem in model-free MDPs using a single sample trajectory, without a generative model.
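For readers unfamiliar with the cross entropy method mentioned above, the following is a minimal sketch of its generic batch form: sample candidates from a Gaussian, keep the best-scoring fraction, and refit the Gaussian to those elites. It is not the paper's algorithm, which is an incremental stochastic approximation variant driven by a single sample trajectory. The function names, the toy objective, and all parameter values here are hypothetical and chosen only for illustration.

```python
import numpy as np

def cross_entropy_search(objective, dim, n_iters=50, n_samples=200,
                         elite_frac=0.1, seed=0):
    """Generic batch cross entropy method maximizing `objective` over R^dim."""
    rng = np.random.default_rng(seed)
    mean = np.zeros(dim)        # initial sampling mean (assumed)
    std = np.full(dim, 2.0)     # initial sampling spread (assumed)
    n_elite = max(1, int(elite_frac * n_samples))
    for _ in range(n_iters):
        # Draw candidate solutions from the current Gaussian.
        samples = rng.normal(mean, std, size=(n_samples, dim))
        scores = np.array([objective(x) for x in samples])
        # Keep the highest-scoring candidates (the elite set).
        elite = samples[np.argsort(scores)[-n_elite:]]
        # Refit the sampling distribution to the elite set.
        mean = elite.mean(axis=0)
        std = elite.std(axis=0) + 1e-6  # small floor avoids a zero spread
    return mean

def toy_objective(x):
    # Hypothetical non-convex test function with its global maximum at x = 1.
    return -np.sum((x - 1.0) ** 2) + 0.5 * np.sum(np.cos(5.0 * (x - 1.0)))

best = cross_entropy_search(toy_objective, dim=2)
print(best)
```

In the paper's setting the objective would be estimated from the single available trajectory rather than evaluated directly, which is what makes the problem harder than this toy case.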
Contribution
It presents a novel off-policy search algorithm that converges globally to the optimal policy in a restricted setting with limited data and no generative model.
Findings
The algorithm is proven to converge to a globally optimal policy.
Experimental results show superior performance over state-of-the-art methods.
The method is efficient in both computation and storage.
Abstract
In this paper, we consider a modified version of the control problem in a model-free Markov decision process (MDP) setting with large state and action spaces. The control problem most commonly addressed in the contemporary literature is to find an optimal policy which maximizes the value function, i.e., the long run discounted reward of the MDP. The current settings also assume access to a generative model of the MDP with the hidden premise that observations of the system behaviour in the form of sample trajectories can be obtained with ease from the model. In this paper, we consider a modified version, where the cost function is the expectation of a non-convex function of the value function without access to the generative model. Rather, we assume that a sample trajectory generated using an a priori chosen behaviour policy is made available. In this restricted setting, we solve the…
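In symbols, the contrast the abstract draws between the standard and the modified control problem can be written roughly as follows. The notation is an assumption made for illustration and is not taken from the paper: $\gamma$ is the discount factor, $r_t$ the reward at step $t$, $\nu$ a distribution over initial states, and $L$ the non-convex function applied to the value function.

```latex
% Standard control problem: maximize the long-run discounted reward.
V^{\pi}(s) = \mathbb{E}_{\pi}\Big[\sum_{t=0}^{\infty} \gamma^{t} r_{t} \,\Big|\, s_{0} = s\Big],
\qquad \pi^{*} \in \arg\max_{\pi} V^{\pi}(s).

% Modified problem (assumed notation): the objective is the expectation
% of a non-convex function L of the value function.
J(\pi) = \mathbb{E}_{s \sim \nu}\big[ L\big(V^{\pi}(s)\big) \big].
```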
