OGBench: Benchmarking Offline Goal-Conditioned RL
Seohong Park, Kevin Frans, Benjamin Eysenbach, Sergey Levine

TL;DR
OGBench is a comprehensive benchmark designed to evaluate offline goal-conditioned reinforcement learning algorithms across diverse environments and capabilities, addressing the lack of standardized evaluation tools in this domain.
Contribution
This work introduces OGBench, a new benchmark with environments, datasets, and reference algorithms to systematically assess offline GCRL methods.
Findings
Reveals strengths and weaknesses of algorithms in different capabilities
Provides a standardized platform for evaluating offline GCRL
Highlights the need for diverse evaluation environments
Abstract
Offline goal-conditioned reinforcement learning (GCRL) is a major problem in reinforcement learning (RL) because it provides a simple, unsupervised, and domain-agnostic way to acquire diverse behaviors and representations from unlabeled data without rewards. Despite the importance of this setting, we lack a standard benchmark that can systematically evaluate the capabilities of offline GCRL algorithms. In this work, we propose OGBench, a new, high-quality benchmark for algorithms research in offline goal-conditioned RL. OGBench consists of 8 types of environments, 85 datasets, and reference implementations of 6 representative offline GCRL algorithms. We have designed these challenging and realistic environments and datasets to directly probe different capabilities of algorithms, such as stitching, long-horizon reasoning, and the ability to handle high-dimensional inputs and…
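To illustrate how goal-conditioned training data can come from unlabeled, reward-free trajectories, here is a minimal sketch of hindsight goal relabeling, a technique commonly used in offline GCRL. It is not taken from OGBench's reference implementations. The function name `hindsight_relabel`, the toy grid states, and the sparse 0/-1 reward convention are all assumptions made for illustration.

```python
import random


def hindsight_relabel(states, actions, rng):
    """Build goal-conditioned tuples from one reward-free trajectory.

    Hypothetical helper for illustration only (not OGBench's API).
    `states` has one more entry than `actions`. For each step t, a state
    reached later in the same trajectory is sampled and treated as the goal.
    """
    assert len(states) == len(actions) + 1
    last = len(states) - 1
    samples = []
    for t, action in enumerate(actions):
        # Sample a future index uniformly from t+1..last (randint is inclusive).
        goal_index = rng.randint(t + 1, last)
        goal = states[goal_index]
        next_state = states[t + 1]
        # Assumed sparse reward: 0 when the step reaches the goal, else -1.
        reward = 0.0 if next_state == goal else -1.0
        samples.append((states[t], action, next_state, goal, reward))
    return samples


# Toy trajectory on a 2D grid; no rewards or goals were recorded.
states = [(0, 0), (0, 1), (1, 1), (2, 1)]
actions = ["up", "right", "right"]
samples = hindsight_relabel(states, actions, random.Random(0))
```

Because every sampled goal was actually reached later in the same trajectory, each tuple is a valid example of goal-reaching behavior even though the original data carried no reward labels.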
Peer Reviews
Decision: ICLR 2025 Poster
- **Originality**: Recently, there has been an extensive push towards the development of benchmarks that focus on different aspects of reinforcement learning (e.g., for offline RL [1], [2]). However, as noted by the authors, there existed no centralized evaluation suite for algorithms in offline GCRL and the community often resorted to different sets of tasks (not tailored for, making it difficult to assess the true performance of the proposed methods. As such, the development of a benchmark tha
My concerns with the current version of the work are the following (none of them particularly major): - Currently, it is challenging to assess the varying levels of novelty present in the environments available in OGBench. Throughout Section 7, I found myself questioning whether each environment was accessible elsewhere (in another benchmark or repository) or if it was unique to this benchmark. I do understand that some tasks (e.g., AntMaze) are heavily inspired by/available in other benchmarks.
* Well-motivated research efforts and a solid execution with many considerations customised to goal-conditioned offline RL.
* A good coverage of task variations and challenges to state-of-the-art algorithms based on different principles.
* Proper task difficulties demonstrated from the results. Could be promising to spur new algorithmic research.
* Decent writing with clear presentation and well organised flow of narratives.
* The benchmark is motivated by the generality of goal-conditioned navigation tasks, aiming at learning transferable representations for downstream tasks. However, this is not reflected in the task design and experimental results. It would be better to include some functionalities to facilitate analysing and transferring the learned latent representations, with a report on a few SotA algorithms' performance on this aspect.
* The tasks seem to only challenge policies with different initial st
* Offline GCRL is an important research direction, and this work addresses the current lack of a challenging and comprehensive benchmark.
* This benchmark evaluates GCRL algorithms across diverse aspects, including stitching, stochastic environments, high-dimensional control, and long-horizon control.
* The benchmark includes popular baseline results, highlighting some unsolved tasks and areas for improvement.
Some benchmark designs need further discussion or improvement.
* The evaluation seems limited to five tasks. Does this mean there are only five evaluation goals? If so, this number is restricted compared to several prior goal-conditioned tasks such as the Fetch environments.
* The design of the transparent arm in pixel-based manipulation tasks seems unrealistic. It would be better to let users choose between a transparent or solid option. Additionally, demonstrating how transparency affects pe
