Model-based Offline Reinforcement Learning with Local Misspecification
Kefan Dong, Yannis Flet-Berliac, Allen Nie, Emma Brunskill

TL;DR
This paper introduces a new model-based offline reinforcement learning approach that explicitly accounts for model misspecification and distribution mismatch, providing theoretical guarantees and an empirical algorithm for policy selection.
Contribution
It presents a novel lower bound on policy performance considering model misspecification and proposes an empirical algorithm for optimal offline policy selection.
Findings
Proves a safe policy improvement theorem by establishing pessimism approximations to the value function.
Analyzes the lower bound in the LQR setting.
Demonstrates competitive performance on D4RL tasks.
Abstract
We present a model-based offline reinforcement learning policy performance lower bound that explicitly captures dynamics model misspecification and distribution mismatch, and we propose an empirical algorithm for optimal offline policy selection. Theoretically, we prove a novel safe policy improvement theorem by establishing pessimism approximations to the value function. Our key insight is to jointly consider selecting over dynamics models and policies: as long as a dynamics model can accurately represent the dynamics of the state-action pairs visited by a given policy, it is possible to approximate the value of that particular policy. We analyze our lower bound in the LQR setting and also show performance competitive with previous lower bounds on policy selection across a set of D4RL tasks.
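The joint selection over dynamics models and policies described in the abstract can be illustrated with a minimal sketch. It does not reproduce the paper's actual bound or algorithm. It assumes that, for every (policy, model) pair, an estimated return and a misspecification penalty on the state-action pairs that policy visits are already available. The function names, the penalty form (a simple subtraction), and the numbers are all hypothetical.

```python
from typing import List, Tuple


def policy_lower_bounds(
    model_values: List[List[float]],
    error_penalties: List[List[float]],
) -> List[Tuple[int, float]]:
    """For each policy, return (best model index, pessimistic value estimate).

    model_values[i][j]: assumed estimate of policy i's return under model j.
    error_penalties[i][j]: assumed penalty for model j's error on the
        state-action pairs visited by policy i (hypothetical input).
    """
    bounds = []
    for values, penalties in zip(model_values, error_penalties):
        # Penalized estimate of this policy under every candidate model.
        penalized = [v - p for v, p in zip(values, penalties)]
        # Each penalized estimate is treated as a lower bound, so keep the tightest.
        best_model = max(range(len(penalized)), key=penalized.__getitem__)
        bounds.append((best_model, penalized[best_model]))
    return bounds


def select_policy(
    model_values: List[List[float]],
    error_penalties: List[List[float]],
) -> int:
    """Return the index of the policy with the highest pessimistic estimate."""
    bounds = policy_lower_bounds(model_values, error_penalties)
    return max(range(len(bounds)), key=lambda i: bounds[i][1])


# Illustrative numbers only: two policies, two candidate dynamics models.
values = [[10.0, 8.0], [12.0, 9.0]]
penalties = [[1.0, 0.5], [6.0, 0.5]]
print(policy_lower_bounds(values, penalties))
print(select_policy(values, penalties))
```

In this toy input, policy 1 has the highest raw estimate under model 0, but that model carries a large penalty on the state-action pairs policy 1 visits, so the sketch prefers policy 0, whose estimate is backed by a model that is accurate where that policy goes.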
Taxonomy
Topics: Reinforcement Learning in Robotics · Smart Grid Energy Management · Advanced Bandit Algorithms Research
