COMBO: Conservative Offline Model-Based Policy Optimization

Tianhe Yu; Aviral Kumar; Rafael Rafailov; Aravind Rajeswaran; Sergey Levine; Chelsea Finn

arXiv:2102.08363·cs.LG·January 28, 2022·52 cites

TL;DR

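COMBO is a conservative offline model-based RL algorithm. It regularizes the value function on out-of-support state-action tuples generated by rollouts under a learned dynamics model, so it needs no explicit uncertainty estimation. It optimizes a lower bound on policy value that is tighter than those of prior methods and improves performance on standard benchmarks.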

Contribution

The paper proposes COMBO, an offline model-based RL method that enforces conservatism through value regularization instead of explicit uncertainty quantification, which is difficult and unreliable with complex models such as deep neural networks.
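A minimal sketch of what "conservatism through value regularization" can look like is below. It is not the authors' implementation. It assumes a tabular Q-function, a penalty weight `beta`, and a hypothetical `next_value` helper that stands in for the target value of the next state under the current policy. The paper's exact regularizer and sampling distributions may differ.

```python
from typing import Callable, Dict, List, Tuple

# (state, action, reward, next_state); integer states/actions keep the sketch tabular
Transition = Tuple[int, int, float, int]


def conservative_critic_loss(
    q: Dict[Tuple[int, int], float],
    next_value: Callable[[int], float],  # hypothetical: target value of s' under the policy
    data_batch: List[Transition],
    model_batch: List[Transition],
    beta: float = 1.0,
    gamma: float = 0.99,
) -> float:
    """Sketch of a value-regularized critic loss (illustrative, not the paper's code)."""
    # Conservatism term: penalize Q on model-generated (possibly out-of-support)
    # tuples and reward Q on tuples that actually appear in the offline dataset.
    model_q = sum(q[(s, a)] for s, a, _, _ in model_batch) / len(model_batch)
    data_q = sum(q[(s, a)] for s, a, _, _ in data_batch) / len(data_batch)
    regularizer = beta * (model_q - data_q)

    # Squared Bellman error over real and model transitions together.
    batch = data_batch + model_batch
    bellman = sum(
        0.5 * (q[(s, a)] - (r + gamma * next_value(s_next))) ** 2
        for s, a, r, s_next in batch
    ) / len(batch)
    return regularizer + bellman


q = {(0, 0): 1.0, (1, 1): 3.0}
data = [(0, 0, 1.0, 1)]   # tuple from the logged dataset
model = [(1, 1, 0.0, 0)]  # tuple from a learned-model rollout
print(conservative_critic_loss(q, lambda s: 0.0, data, model))  # → 4.25
```

The regularizer's gradient is positive for Q-values on model tuples and negative for Q-values on dataset tuples. Taken on its own, gradient descent therefore lowers value estimates on unsupported tuples and raises them on logged ones. This is how a conservative estimate can arise without an uncertainty model.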

Findings

01

COMBO outperforms prior offline RL methods on standard benchmarks.

02

It provides a tighter lower bound on policy value than previous approaches.

03

The method is effective on image-based offline RL tasks.

Abstract

Model-based algorithms, which learn a dynamics model from logged experience and perform some sort of pessimistic planning under the learned model, have emerged as a promising paradigm for offline reinforcement learning (offline RL). However, practical variants of such model-based algorithms rely on explicit uncertainty quantification for incorporating pessimism. Uncertainty estimation with complex models, such as deep neural networks, can be difficult and unreliable. We overcome this limitation by developing a new model-based offline RL algorithm, COMBO, that regularizes the value function on out-of-support state-action tuples generated via rollouts under the learned model. This results in a conservative estimate of the value function for out-of-support state-action tuples, without requiring explicit uncertainty estimation. We theoretically show that our method optimizes a lower bound…
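The abstract states that the regularized tuples are generated via rollouts under the learned model. Below is a minimal sketch of such rollouts. It assumes that rollouts branch from start states taken from the offline dataset, and it uses toy stand-ins for `policy` and `learned_model`. Both callables are hypothetical, and this example does not cover the paper's actual model architecture or rollout length.

```python
from typing import Callable, List, Tuple

Transition = Tuple[int, int, float, int]  # (state, action, reward, next_state)


def model_rollouts(
    start_states: List[int],
    policy: Callable[[int], int],                            # hypothetical policy
    learned_model: Callable[[int, int], Tuple[float, int]],  # hypothetical: (s, a) -> (r, s')
    horizon: int,
) -> List[Transition]:
    """Roll out `policy` for `horizon` steps in a learned model from each start state."""
    transitions: List[Transition] = []
    for s in start_states:
        for _ in range(horizon):
            a = policy(s)
            # The learned model predicts reward and next state; the resulting
            # (s, a) pairs may lie outside the support of the offline dataset.
            r, s_next = learned_model(s, a)
            transitions.append((s, a, r, s_next))
            s = s_next
    return transitions


# Toy stand-ins: the "model" moves by the chosen action and pays reward equal to it.
rollouts = model_rollouts([0, 10], policy=lambda s: 1,
                          learned_model=lambda s, a: (float(a), s + a), horizon=2)
print(rollouts)  # → [(0, 1, 1.0, 1), (1, 1, 1.0, 2), (10, 1, 1.0, 11), (11, 1, 1.0, 12)]
```

Value regularization would penalize Q-values on tuples like these, which come from the model rather than from the logged data.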

Code & Models

Videos

COMBO: Conservative Offline Model-Based Policy Optimization · SlidesLive

Taxonomy

Topics: Reinforcement Learning in Robotics · Adversarial Robustness in Machine Learning · Advanced Bandit Algorithms Research