Efficient Offline Policy Optimization with a Learned Model

Zichen Liu; Siyi Li; Wee Sun Lee; Shuicheng Yan; Zhongwen Xu

arXiv:2210.05980·cs.LG·February 16, 2023

TL;DR

This paper introduces a regularized one-step look-ahead method for offline policy optimization that outperforms MuZero Unplugged in efficiency and stability, especially with inaccurate models, demonstrated on Atari benchmarks.

Contribution

It proposes a novel regularized one-step look-ahead approach that reduces computational costs and improves stability in offline RL, addressing limitations of MCTS-based methods.

Findings

01 Achieves 43% better performance than MuZero Unplugged on Atari benchmarks.

02 Uses only 5.6% of the wall-clock time required by MuZero Unplugged.

03 Maintains stable performance even with inaccurate learned models.

Abstract

MuZero Unplugged presents a promising approach for offline policy learning from logged data. It conducts Monte-Carlo Tree Search (MCTS) with a learned model and leverages the Reanalyze algorithm to learn purely from offline data. For good performance, MCTS requires accurate learned models and a large number of simulations, and therefore a large amount of compute time. This paper investigates several hypotheses about where MuZero Unplugged may not work well in the offline RL setting, including 1) learning with limited data coverage; 2) learning from offline data of stochastic environments; 3) improperly parameterized models given the offline data; 4) operating with a low compute budget. We propose a regularized one-step look-ahead approach to tackle these issues. Instead of planning with the expensive MCTS, we use the learned model to construct an advantage estimation based on a one-step rollout. Policy…
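As a rough illustration of what "an advantage estimation based on a one-step rollout" can look like, the sketch below computes A(s, a) = r(s, a) + γ·v(s′) − v(s) by querying a model once per candidate action, with no tree search. It is not taken from the paper or the sail-sg/rosmo repository: the names `one_step_advantages`, `reweight_prior`, `toy_model_step`, and `toy_value` are hypothetical, the toy model is invented for the example, and the exponential reweighting of a prior policy is only one plausible way to turn advantages into a policy target, assumed here because the abstract is truncated before it describes that step.

```python
import math
from typing import Callable, List, Sequence, Tuple

State = int  # hypothetical toy state type, for illustration only


def one_step_advantages(
    state: State,
    actions: Sequence[int],
    model_step: Callable[[State, int], Tuple[float, State]],
    value_fn: Callable[[State], float],
    discount: float,
) -> List[float]:
    """Return r(s, a) + discount * v(s') - v(s) for each candidate action.

    The model is queried exactly once per action (a one-step rollout).
    """
    v_s = value_fn(state)
    advantages = []
    for action in actions:
        reward, next_state = model_step(state, action)
        advantages.append(reward + discount * value_fn(next_state) - v_s)
    return advantages


def reweight_prior(prior: Sequence[float], advantages: Sequence[float]) -> List[float]:
    """Assumed policy target: prior reweighted by exp(advantage), then normalized."""
    max_adv = max(advantages)  # subtract the max for numerical stability
    weights = [p * math.exp(a - max_adv) for p, a in zip(prior, advantages)]
    total = sum(weights)
    return [w / total for w in weights]


# Toy stand-ins for a learned model and value function (assumptions).
def toy_model_step(state: State, action: int) -> Tuple[float, State]:
    # Action 1 moves right and pays reward 1; action 0 stays put and pays 0.
    return (1.0, state + 1) if action == 1 else (0.0, state)


def toy_value(state: State) -> float:
    return float(state)


advantages = one_step_advantages(0, [0, 1], toy_model_step, toy_value, discount=0.9)
target = reweight_prior([0.5, 0.5], advantages)
```

The cost per state is one model call per action, which is the contrast with MCTS that the abstract draws: MCTS needs many simulations per state to produce a comparable improvement signal.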

Code & Models

Repositories

sail-sg/rosmo (JAX, official)

Videos

Efficient Offline Policy Optimization with a Learned Model · SlidesLive

Taxonomy

Topics: Machine Learning and Data Classification · Reinforcement Learning in Robotics · Advanced Neural Network Applications

Methods: Residual Connection · Batch Normalization · Convolution · Residual Block · Average Pooling · Prioritized Experience Replay · MuZero · Monte-Carlo Tree Search