Efficient Offline Policy Optimization with a Learned Model
Zichen Liu, Siyi Li, Wee Sun Lee, Shuicheng Yan, Zhongwen Xu

TL;DR
This paper introduces a regularized one-step look-ahead method for offline policy optimization. On Atari benchmarks it is more efficient and more stable than MuZero Unplugged, especially when the learned model is inaccurate.
Contribution
It proposes a novel regularized one-step look-ahead approach that reduces computational costs and improves stability in offline RL, addressing limitations of MCTS-based methods.
Findings
Achieves 43% better performance than MuZero Unplugged on Atari benchmarks.
Uses only 5.6% of the wall-clock time required by MuZero Unplugged.
Maintains stable performance even with inaccurate learned models.
Abstract
MuZero Unplugged presents a promising approach for offline policy learning from logged data. It conducts Monte-Carlo Tree Search (MCTS) with a learned model and leverages the Reanalyze algorithm to learn purely from offline data. For good performance, MCTS requires accurate learned models and a large number of simulations, which costs a large amount of compute time. This paper investigates a few hypotheses about where MuZero Unplugged may not work well in offline RL settings, including 1) learning with limited data coverage; 2) learning from offline data of stochastic environments; 3) improperly parameterized models given the offline data; 4) learning with a low compute budget. We propose a regularized one-step look-ahead approach to tackle these issues. Instead of planning with the expensive MCTS, we use the learned model to construct an advantage estimate based on a one-step rollout. Policy…
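The one-step advantage estimate mentioned in the abstract can be illustrated with a minimal sketch. It assumes the common form A(s, a) = r(s, a) + discount * v(s') - v(s), where the reward and next state come from the learned model; the abstract excerpt does not spell out the exact estimator, so this form is an assumption. The function names, the toy chain model, and the discount value are hypothetical and not taken from the paper.

```python
import numpy as np

def one_step_advantage(state, num_actions, dynamics, reward_fn, value_fn, discount=0.99):
    """Estimate A(s, a) for every action with a single model step (no tree search).

    Assumed form: A(s, a) = r(s, a) + discount * v(s') - v(s),
    where r and s' are predicted by a learned model.
    """
    v_s = value_fn(state)
    adv = np.empty(num_actions)
    for a in range(num_actions):
        next_state = dynamics(state, a)  # one-step rollout in the model
        adv[a] = reward_fn(state, a) + discount * value_fn(next_state) - v_s
    return adv

# Hypothetical stand-in for a learned model: a 3-state chain with
# two actions (0 = stay, 1 = move right). All numbers are made up.
toy_values = np.array([0.0, 1.0, 2.0])

def toy_dynamics(s, a):
    return min(s + a, 2)

def toy_reward(s, a):
    return 0.5 * a

def toy_value(s):
    return float(toy_values[s])

adv = one_step_advantage(0, 2, toy_dynamics, toy_reward, toy_value, discount=0.9)
print(adv)  # advantage of "stay" and "move right" from state 0
```

Each action costs one model call here, whereas MCTS needs many simulations per decision. This is where the compute saving described in the abstract would come from.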
Taxonomy
Topics: Machine Learning and Data Classification · Reinforcement Learning in Robotics · Advanced Neural Network Applications
Methods: Residual Connection · Batch Normalization · Convolution · Residual Block · Average Pooling · Prioritized Experience Replay · MuZero · Monte-Carlo Tree Search
