Model-Based Reinforcement Learning with Double Oracle Efficiency in Policy Optimization and Offline Estimation
Haichen Hu, Jian Qian, David Simchi-Levi

TL;DR
This paper introduces a novel offline oracle-efficient reinforcement learning algorithm that achieves optimal regret bounds with oracle complexity independent of state and action space sizes, applicable to large and infinite environments.
Contribution
It presents the first doubly oracle-efficient RL algorithm with oracle complexity independent of environment size, extending to infinite state and action spaces.
Findings
Achieves $\tilde{O}(\sqrt{T})$ regret with minimal oracle calls.
Oracle complexity is independent of state and action space sizes.
Generalizes to linear MDPs with infinite state and action spaces.
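For readers unfamiliar with the term, regret in episodic RL is commonly defined as below. This is the standard textbook form, not a formula quoted from the paper; the notation ($\pi^\star$ for an optimal policy, $\pi_t$ for the policy played in episode $t$, $V_1^{\pi}(s_1)$ for the value of policy $\pi$ from the initial state $s_1$) is assumed here for illustration, as is the reading of $T$ as the number of episodes.

$$
\mathrm{Reg}(T) \;=\; \sum_{t=1}^{T} \Bigl( V_1^{\pi^\star}(s_1) - V_1^{\pi_t}(s_1) \Bigr)
$$

Under this definition, a bound of order $\sqrt{T}$ up to logarithmic factors means the average per-episode gap to the optimal policy shrinks roughly like $1/\sqrt{T}$.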
Abstract
Reinforcement learning (RL) in large environments often suffers from severe computational bottlenecks, as conventional regret minimization algorithms require repeated, costly calls to planning and statistical estimation oracles. While recent advances have explored offline oracle-efficient algorithms, their computational complexity typically scales with the cardinality of the state and action spaces, rendering them intractable for large-scale or continuous environments. In this paper, we address this fundamental limitation by studying offline oracle-efficient episodic RL through the lens of log-barrier and log-determinant regularization. Specifically, for tabular Markov Decision Processes (MDPs), we propose a novel algorithm that achieves the optimal regret bound while requiring only calls to both the offline statistical estimation and planning…
