Deterministic Uncertainty Propagation for Improved Model-Based Offline Reinforcement Learning
Abdullah Akgül, Manuel Haußmann, Melih Kandemir

TL;DR
This paper introduces MOMBO, a deterministic method for propagating uncertainty in model-based offline reinforcement learning, which improves convergence speed and provides tighter suboptimality guarantees compared to Monte Carlo sampling methods.
Contribution
The paper proposes MOMBO, a novel deterministic approach using moment matching for uncertainty propagation, reducing sampling variance and enhancing convergence in offline RL.
Findings
MOMBO converges faster than Monte Carlo-based methods.
Tighter suboptimality guarantees are achieved with MOMBO.
MOMBO outperforms existing approaches on benchmark tasks.
Abstract
Current approaches to model-based offline reinforcement learning often incorporate uncertainty-based reward penalization to address the distributional shift problem. These approaches, commonly known as pessimistic value iteration, use Monte Carlo sampling to estimate the Bellman target to perform temporal difference-based policy evaluation. We find that the randomness caused by this sampling step significantly delays convergence. We present a theoretical result demonstrating the strong dependency of suboptimality on the number of Monte Carlo samples taken per Bellman target calculation. Our main contribution is a deterministic approximation to the Bellman target that uses progressive moment matching, a method originally developed for deterministic variational inference. The resulting algorithm, which we call Moment Matching Offline Model-Based Policy Optimization (MOMBO), propagates…
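The contrast the abstract draws between a sampled and a deterministic estimate can be illustrated with a small, generic sketch of moment matching through a ReLU network. This is not the authors' implementation: the scalar network, the function names (`relu_moments`, `q_moments`, `q_monte_carlo`), and the example weights are all assumptions made for illustration. The sketch treats the network input as Gaussian, pushes its mean and variance through each layer in closed form, and compares the result with a Monte Carlo average over sampled inputs.

```python
import math
import random


def normal_pdf(z):
    """Standard normal density."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def normal_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def linear_moments(mean, var, w, b):
    """Mean and variance of y = w * x + b for a scalar x (exact)."""
    return w * mean + b, w * w * var


def relu_moments(mean, var):
    """Mean and variance of max(x, 0) for x ~ N(mean, var) (closed form)."""
    if var <= 0.0:
        return max(mean, 0.0), 0.0
    std = math.sqrt(var)
    z = mean / std
    first = mean * normal_cdf(z) + std * normal_pdf(z)
    second = (mean * mean + var) * normal_cdf(z) + mean * std * normal_pdf(z)
    return first, max(second - first * first, 0.0)


def q_moments(mean, var, layers):
    """Deterministic pass: propagate (mean, var) layer by layer.

    `layers` is a hypothetical list of scalar (weight, bias) pairs with a
    ReLU after every layer except the last. After the first ReLU the
    activation is no longer Gaussian, so deeper passes are approximations.
    """
    for i, (w, b) in enumerate(layers):
        mean, var = linear_moments(mean, var, w, b)
        if i < len(layers) - 1:
            mean, var = relu_moments(mean, var)
    return mean, var


def q_monte_carlo(mean, var, layers, n_samples, rng):
    """Sampling pass: average the network output over sampled inputs."""
    total = 0.0
    for _ in range(n_samples):
        x = rng.gauss(mean, math.sqrt(var))
        for i, (w, b) in enumerate(layers):
            x = w * x + b
            if i < len(layers) - 1:
                x = max(x, 0.0)
        total += x
    return total / n_samples


# Illustrative (made-up) network and input distribution.
layers = [(1.5, -0.2), (0.8, 0.1)]
det_mean, det_var = q_moments(0.3, 0.5, layers)
few = q_monte_carlo(0.3, 0.5, layers, 5, random.Random(0))
many = q_monte_carlo(0.3, 0.5, layers, 100_000, random.Random(0))
print(det_mean, det_var, few, many)
```

In this toy setting the deterministic mean is exact, because the only ReLU receives a Gaussian input, whereas the sampled estimate changes with the seed and the number of samples. A pessimistic target could then be formed from the propagated moments, for example as the mean minus a multiple of the standard deviation, though that particular form is an assumption here rather than something the text above specifies.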
Taxonomy
Topics: Fault Detection and Control Systems · Smart Grid Security and Resilience · Software Reliability and Analysis Research
Methods: Sparse Evolutionary Training
