The Generalization Gap in Offline Reinforcement Learning

Ishita Mediratta; Qingfei You; Minqi Jiang; Roberta Raileanu

arXiv:2312.05742 · cs.LG · March 18, 2024 · 2 cites

TL;DR

This paper compares how well offline learning methods and online reinforcement learning generalize, introduces a benchmark for testing generalization in offline learning, and finds that offline methods currently underperform online RL in new environments. Data diversity turns out to be crucial.

Contribution

The paper introduces the first benchmark for evaluating generalization in offline learning and provides empirical evidence that current offline algorithms struggle to generalize, which underlines the importance of data diversity.

Findings

01. Offline algorithms perform worse than online RL on new environments.
02. Behavioral cloning outperforms other offline methods when trained on diverse data.
03. Increasing data diversity improves generalization across all offline algorithms.
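One way to probe the data-diversity finding is to hold the dataset size fixed and vary only the number of distinct levels the trajectories come from. The sketch below illustrates that idea under stated assumptions: the trajectory format (a dict with a `level` key) and the function name `subsample_fixed_size` are hypothetical and are not taken from the paper or its repository.

```python
import random


def subsample_fixed_size(trajectories, num_levels, total, seed=0):
    """Return `total` trajectories drawn from exactly `num_levels` candidate levels.

    trajectories: list of dicts, each with a "level" key (hypothetical format).
    num_levels:   how many distinct levels the sample may draw from.
    total:        number of trajectories to return, independent of num_levels.
    """
    rng = random.Random(seed)

    # Group trajectories by the level they were collected on.
    by_level = {}
    for traj in trajectories:
        by_level.setdefault(traj["level"], []).append(traj)

    if num_levels > len(by_level):
        raise ValueError("not enough distinct levels in the data")

    # Choose the levels first, then sample trajectories only from those levels.
    chosen_levels = rng.sample(sorted(by_level), num_levels)
    pool = [t for lvl in chosen_levels for t in by_level[lvl]]

    if total > len(pool):
        raise ValueError("chosen levels do not contain enough trajectories")
    return rng.sample(pool, total)


# Toy data for illustration only: 10 levels with 20 trajectories each.
toy = [{"level": lvl, "id": i} for lvl in range(10) for i in range(20)]
narrow = subsample_fixed_size(toy, num_levels=2, total=30)
wide = subsample_fixed_size(toy, num_levels=10, total=30)
```

Both `narrow` and `wide` contain 30 trajectories, so any difference in downstream test performance between agents trained on them would reflect level diversity rather than dataset size.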

Abstract

Despite recent progress in offline learning, these methods are still trained and tested on the same environment. In this paper, we compare the generalization abilities of widely used online and offline learning methods such as online reinforcement learning (RL), offline RL, sequence modeling, and behavioral cloning. Our experiments show that offline learning algorithms perform worse on new environments than online learning ones. We also introduce the first benchmark for evaluating generalization in offline learning, collecting datasets of varying sizes and skill-levels from Procgen (2D video games) and WebShop (e-commerce websites). The datasets contain trajectories for a limited number of game levels or natural language instructions and at test time, the agent has to generalize to new levels or instructions. Our experiments reveal that existing offline learning algorithms struggle to…
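The evaluation protocol described in the abstract (train on a limited set of levels, test on unseen ones) suggests a simple way to quantify a generalization gap: the mean return on levels seen in the dataset minus the mean return on unseen levels. The following is a minimal sketch under that assumption; the function name, the data layout, and the numbers are hypothetical and do not come from the paper.

```python
from statistics import mean


def generalization_gap(returns_by_level, train_levels):
    """Return (train_mean, test_mean, gap) from per-level episode returns.

    returns_by_level: dict mapping a level ID to a list of episode returns.
    train_levels:     set of level IDs that appear in the offline dataset.
    """
    # Episodes on levels the agent was trained on.
    train = [r for lvl, rs in returns_by_level.items()
             if lvl in train_levels for r in rs]
    # Episodes on levels the agent has never seen.
    test = [r for lvl, rs in returns_by_level.items()
            if lvl not in train_levels for r in rs]

    if not train or not test:
        raise ValueError("need episodes from both seen and unseen levels")

    train_mean, test_mean = mean(train), mean(test)
    return train_mean, test_mean, train_mean - test_mean


# Made-up returns for illustration only, not results from the paper.
example_returns = {0: [8.0, 10.0], 1: [9.0], 200: [3.0, 5.0], 201: [4.0]}
train_mean, test_mean, gap = generalization_gap(example_returns, train_levels={0, 1})
```

A positive `gap` means the agent does better on levels it was trained on than on new ones; comparing this quantity across algorithms is one way to express the kind of result the abstract reports.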

Peer Reviews

Decision · ICLR 2024 poster

Reviewer 01 · Rating 8 (accept, good paper) · Confidence 4

Strengths

1. This paper presents new results on the generalization of offline learning. Understanding the generalization ability of offline learning methods is important for applying them to real-world problems. Although not very surprising, this paper is the first to confirm that offline RL and sequence modeling approaches can struggle to generalize to new environments.
2. The results may have a broad impact on the community. Indeed, we can no longer ignore the generalization p…

Weaknesses

1. The paper does not discuss the root causes of the generalization problem in depth. It would strengthen the paper if the authors shared some thoughts on why.
2. There are minor problems:
   - The color of the lines in Figure 2(a) is wrong.
   - The results on Leaper are missing in Figure 11 and Figure 12.

Reviewer 02 · Rating 6 (marginally above the acceptance threshold) · Confidence 4

Strengths

- This paper investigates an important problem in offline RL.
- It introduced a collection of offline RL datasets of different sizes and skill-levels from the Procgen and WebShop environments.
- The experiments are thorough.
- The writing is clear and easy to follow.

Weaknesses

- The novelty of the study is somewhat restricted. Many existing offline RL algorithms do not inherently prioritize generalization in their design, so the current experimental results are mostly as expected.
- There are many duplicate references: "Leveraging procedural generation to benchmark reinforcement learning", "Offline q-learning on diverse multi-task data both scales and generalizes", "Deep residual learning for image recognition.", "The nethack learning…

Reviewer 03 · Rating 6 (marginally above the acceptance threshold) · Confidence 4

Strengths

1. The problem studied in this paper is important and interesting. Existing offline RL benchmarks indeed require more practical metrics for evaluation.
2. The experiments conducted in this paper seem solid and abundant. The conclusions made sound convincing.
3. Hyperparameters are provided and the results seem reproducible.

Weaknesses

1. The algorithms included are a bit limited. Methods such as model-based learning [1-2] and curriculum imitation [3], among others, are not covered. More useful conclusions could be drawn if the set of benchmarked algorithms were expanded.
2. The benchmarks included are a bit limited. Most of the results come from the simulated benchmark Procgen; only a small part of the experiments is conducted on the real-world dataset "WebShop". The authors may consider expanding their tested benchmarks to more real-wor…

Code & Models

Repositories

facebookresearch/gen_dgrl
pytorch · Official

Taxonomy

Topics: Reinforcement Learning in Robotics · Evolutionary Algorithms and Applications · Advanced Bandit Algorithms Research