OGBench: Benchmarking Offline Goal-Conditioned RL

Seohong Park; Kevin Frans; Benjamin Eysenbach; Sergey Levine

arXiv:2410.20092·cs.LG·February 14, 2025

TL;DR

OGBench is a comprehensive benchmark designed to evaluate offline goal-conditioned reinforcement learning algorithms across diverse environments and capabilities, addressing the lack of standardized evaluation tools in this domain.

Contribution

This work introduces OGBench, a new benchmark with environments, datasets, and reference algorithms to systematically assess offline GCRL methods.

Findings

01

Reveals strengths and weaknesses of algorithms in different capabilities

02

Provides a standardized platform for evaluating offline GCRL

03

Highlights the need for diverse evaluation environments

Abstract

Offline goal-conditioned reinforcement learning (GCRL) is a major problem in reinforcement learning (RL) because it provides a simple, unsupervised, and domain-agnostic way to acquire diverse behaviors and representations from unlabeled data without rewards. Despite the importance of this setting, we lack a standard benchmark that can systematically evaluate the capabilities of offline GCRL algorithms. In this work, we propose OGBench, a new, high-quality benchmark for algorithms research in offline goal-conditioned RL. OGBench consists of 8 types of environments, 85 datasets, and reference implementations of 6 representative offline GCRL algorithms. We have designed these challenging and realistic environments and datasets to directly probe different capabilities of algorithms, such as stitching, long-horizon reasoning, and the ability to handle high-dimensional inputs and…
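The abstract's claim that goal-conditioned RL can learn "from unlabeled data without rewards" is often realized through hindsight goal relabeling: states that a trajectory actually reached later are treated as goals, which yields a sparse reward for free. The sketch below illustrates that general idea only. It is not taken from the OGBench codebase, and the function name `relabel_with_future_goals`, the tuple layout, and the toy grid states are all hypothetical.

```python
import random


def relabel_with_future_goals(trajectory, rng):
    """Turn a reward-free trajectory into goal-conditioned transitions.

    `trajectory` is a list of hashable states (hypothetical toy format).
    For each step t, a goal is drawn uniformly from the states visited
    after t, and the sparse reward is 1.0 only if the next state is that goal.
    """
    transitions = []
    last = len(trajectory) - 1
    for t in range(last):
        # Sample a goal index from the future part of the same trajectory.
        goal_index = rng.randint(t + 1, last)
        goal = trajectory[goal_index]
        next_state = trajectory[t + 1]
        reward = 1.0 if next_state == goal else 0.0
        transitions.append((trajectory[t], next_state, goal, reward))
    return transitions


if __name__ == "__main__":
    rng = random.Random(0)
    # A toy 2D grid trajectory with no reward labels (assumed for illustration).
    states = [(0, 0), (0, 1), (1, 1), (2, 1)]
    for state, next_state, goal, reward in relabel_with_future_goals(states, rng):
        print(state, next_state, goal, reward)
```

Because every goal comes from the trajectory itself, each relabeled transition is guaranteed to have a reachable goal, which is what lets an offline method extract a learning signal from data collected without any task reward.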

Peer Reviews

Decision·ICLR 2025 Poster

Reviewer 01 · Rating 8 · Confidence 4

Strengths

- **Originality**: Recently, there has been an extensive push towards the development of benchmarks that focus on different aspects of reinforcement learning (e.g., for offline RL [1], [2]). However, as noted by the authors, there existed no centralized evaluation suite for algorithms in offline GCRL, and the community often resorted to different sets of tasks (not tailored for, making it difficult to assess the true performance of the proposed methods. As such, the development of a benchmark tha

Weaknesses

My concerns with the current version of the work are the following (none of them particularly major):

- Currently, it is challenging to assess the varying levels of novelty present in the environments available in OGBench. Throughout Section 7, I found myself questioning whether each environment was accessible elsewhere (in another benchmark or repository) or if it was unique to this benchmark. I do understand that some tasks (e.g., AntMaze) are heavily inspired by/available in other benchmarks.

Reviewer 02 · Rating 8 · Confidence 4

Strengths

* Well-motivated research efforts and a solid execution with many considerations customised to goal-conditioned offline RL.
* A good coverage of task variations and challenges to state-of-the-art algorithms based on different principles.
* Proper task difficulties demonstrated from the results. Could be promising to spur new algorithmic research.
* Decent writing with clear presentation and well organised flow of narratives.

Weaknesses

* The benchmark is motivated by the generality of goal-conditioned navigation tasks, aiming at learning transferable representations for downstream tasks. However, this is not reflected in the task design and experimental results. It would be better to include some functionality to facilitate analysing and transferring the learned latent representations, with a report on a few SotA algorithms' performance on this aspect.
* The tasks seem to only challenge policies with different initial st

Reviewer 03 · Rating 6 · Confidence 3

Strengths

* Offline GCRL is an important research direction, and this work addresses the current lack of a challenging and comprehensive benchmark.
* This benchmark evaluates GCRL algorithms across diverse aspects, including stitching, stochastic environments, high-dimensional control, and long-horizon control.
* The benchmark includes popular baseline results, highlighting some unsolved tasks and areas for improvement.

Weaknesses

Some benchmark designs need further discussion or improvement.

* The evaluation seems limited to five tasks. Does this mean there are only five evaluation goals? If so, this number is restricted compared to several prior goal-conditioned tasks such as the Fetch environments.
* The design of the transparent arm in pixel-based manipulation tasks seems unrealistic. It would be better to let users choose between a transparent or solid option. Additionally, demonstrating how transparency affects pe

Code & Models

Repositories

seohongpark/ogbench
jax · Official

Taxonomy

Topics: Model-Driven Software Engineering Techniques