Model-based RL as a Minimalist Approach to Horizon-Free and Second-Order Bounds
Zhiyong Wang, Dongruo Zhou, John C.S. Lui, Wen Sun

TL;DR
This paper demonstrates that simple model-based RL algorithms using MLE and optimistic/pessimistic planning can achieve nearly horizon-free and second-order regret bounds, with broad applicability and straightforward analysis.
Contribution
It shows that standard MLE-based model learning combined with optimistic/pessimistic planning attains strong theoretical guarantees without complex algorithmic modifications.
Findings
Achieves nearly horizon-free regret bounds.
Attains second-order, instance-dependent bounds.
Applicable to both online and offline RL settings.
Abstract
Learning a transition model via Maximum Likelihood Estimation (MLE) followed by planning inside the learned model is perhaps the most standard and simplest Model-based Reinforcement Learning (RL) framework. In this work, we show that such a simple Model-based RL scheme, when equipped with optimistic and pessimistic planning procedures, achieves strong regret and sample complexity bounds in online and offline RL settings. In particular, we demonstrate that under the conditions where the trajectory-wise reward is normalized between zero and one and the transition is time-homogeneous, it achieves nearly horizon-free and second-order bounds. Nearly horizon-free means that our bounds have no polynomial dependence on the horizon of the Markov Decision Process. A second-order bound is a type of instance-dependent bound that scales with respect to the variances of the returns of the policies…
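The "MLE, then plan inside the learned model" recipe described in the abstract can be illustrated with a minimal tabular sketch. This is not the paper's algorithm: it omits the optimistic and pessimistic planning over a set of plausible models and simply plans greedily in the single MLE model. The function names `fit_mle_model` and `plan`, the uniform fallback for unvisited state-action pairs, and the toy two-state MDP are all assumptions made for illustration. Rewards are scaled by 1/H so the trajectory-wise reward stays in [0, 1], and the same transition model is reused at every step (time-homogeneous).

```python
import numpy as np

def fit_mle_model(transitions, n_states, n_actions):
    """Tabular MLE of P(s' | s, a): visit counts normalized per (s, a)."""
    counts = np.zeros((n_states, n_actions, n_states))
    for s, a, s_next in transitions:
        counts[s, a, s_next] += 1.0
    totals = counts.sum(axis=2, keepdims=True)
    # Assumption for this sketch: unvisited (s, a) pairs get a uniform guess.
    uniform = np.full_like(counts, 1.0 / n_states)
    return np.where(totals > 0, counts / np.maximum(totals, 1.0), uniform)

def plan(p_hat, reward, horizon):
    """Finite-horizon value iteration inside the learned model.

    The same p_hat is used at every step (time-homogeneous transitions).
    Returns a per-step greedy policy and the value at the first step.
    """
    n_states = p_hat.shape[0]
    value = np.zeros(n_states)
    policy = np.zeros((horizon, n_states), dtype=int)
    for h in reversed(range(horizon)):
        q = reward + p_hat @ value      # shape (n_states, n_actions)
        policy[h] = q.argmax(axis=1)
        value = q.max(axis=1)
    return policy, value

# Toy data (hypothetical): action 0 stays put, action 1 switches state.
data = [(0, 0, 0), (0, 1, 1), (1, 0, 1), (1, 1, 0)]
H = 4
# Per-step reward 1/H in state 1, so the total return lies in [0, 1].
reward = np.array([[0.0, 0.0], [1.0 / H, 1.0 / H]])

p_hat = fit_mle_model(data, n_states=2, n_actions=2)
policy, value = plan(p_hat, reward, H)
print(policy[0], value)
```

In this toy problem the planner moves from state 0 to state 1 at the first step and then stays there. An optimistic or pessimistic variant would replace the single `p_hat` with a search over models that are consistent with the data under the likelihood criterion.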
Taxonomy
Topics: Modeling and Simulation Systems
