Sample-efficient Iterative Lower Bound Optimization of Deep Reactive Policies for Planning in Continuous MDPs
Siow Meng Low, Akshat Kumar, Scott Sanner

TL;DR
This paper introduces ILBO, a novel iterative lower bound optimization method for deep reactive policies in continuous MDPs, significantly improving sample efficiency and solution quality over existing approaches.
Contribution
The paper proposes ILBO, a framework that iteratively optimizes deep reactive policies (DRPs) against a locally tight lower bound on the planning objective, reducing sample complexity and improving solution quality.
Findings
ILBO outperforms state-of-the-art DRP planners in sample efficiency.
ILBO produces higher-quality solutions, with lower variance in solution quality.
ILBO generalizes well to new problem instances without retraining.
Abstract
Recent advances in deep learning have enabled optimization of deep reactive policies (DRPs) for continuous MDP planning by encoding a parametric policy as a deep neural network and exploiting automatic differentiation in an end-to-end model-based gradient descent framework. This approach has proven effective for optimizing DRPs in nonlinear continuous MDPs, but it requires a large number of sampled trajectories to learn effectively and can suffer from high variance in solution quality. In this work, we revisit the overall model-based DRP objective and instead take a minorization-maximization perspective to iteratively optimize the DRP w.r.t. a locally tight lower-bounded objective. This novel formulation of DRP learning as iterative lower bound optimization (ILBO) is particularly appealing because (i) each step is structurally easier to optimize than the overall objective, (ii) it…
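To make the minorization-maximization idea concrete, the following is a minimal sketch on a one-dimensional toy problem. It is not the paper's DRP objective or its specific lower bound: the function `objective`, the constant `CURVATURE`, and the helper names are all assumptions chosen for illustration. The sketch builds a quadratic lower bound that touches the objective at the current iterate, maximizes that bound in closed form, and repeats, so the objective value never decreases from one iteration to the next.

```python
import math

# Toy objective (an assumption for illustration only): maximised at x = 2.
def objective(x):
    return -math.log(math.cosh(x - 2.0))

# Derivative of the toy objective.
def gradient(x):
    return -math.tanh(x - 2.0)

# The second derivative is -sech^2(x - 2), which lies in [-1, 0],
# so a curvature constant of 1 yields a valid quadratic lower bound.
CURVATURE = 1.0

def lower_bound(x, anchor):
    """Quadratic minorizer of `objective`, tight at x == anchor."""
    step = x - anchor
    return (objective(anchor)
            + gradient(anchor) * step
            - 0.5 * CURVATURE * step ** 2)

def mm_maximize(x0, iterations=10):
    """Repeatedly maximise the lower bound built at the current iterate."""
    xs = [x0]
    for _ in range(iterations):
        anchor = xs[-1]
        # Closed-form maximiser of the quadratic lower bound.
        xs.append(anchor + gradient(anchor) / CURVATURE)
    return xs

trajectory = mm_maximize(5.0)
values = [objective(x) for x in trajectory]
```

Because each surrogate lies below the objective and agrees with it at the current iterate, improving the surrogate cannot make the objective worse. That is the generic monotonic-improvement argument behind minorization-maximization schemes; the conditions under which it carries over to DRP learning are specific to the paper and are not reproduced here.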
Taxonomy
Topics: Machine Learning and Algorithms · Oil and Gas Production Techniques
