Yes, Q-learning Helps Offline In-Context RL

Denis Tarasov; Alexander Nikulin; Ilya Zisman; Albina Klepach; Andrei Polubarov; Nikita Lyubaykin; Alexander Derevyagin; Igor Kiselev; Vladislav Kurenkov

arXiv:2502.17666·cs.LG·May 20, 2025

Open Access · 3 Reviews

TL;DR

This paper demonstrates that integrating RL objectives into offline in-context reinforcement learning significantly enhances performance across multiple environments, surpassing traditional supervised approaches and highlighting the potential of offline RL for ICRL.

Contribution

The study introduces the use of RL objectives in offline ICRL, showing substantial performance improvements over existing supervised methods across diverse datasets and environments.

Findings

01. RL objectives improve performance by approximately 30% on average over Algorithm Distillation (AD).

02. In XLand-MiniGrid, RL objectives doubled the performance of AD.

03. Adding conservatism during value learning brings further improvements in almost all settings tested.

Abstract

Existing offline in-context reinforcement learning (ICRL) methods have predominantly relied on supervised training objectives, which are known to have limitations in offline RL settings. In this study, we explore the integration of RL objectives within an offline ICRL framework. Through experiments on more than 150 GridWorld and MuJoCo environment-derived datasets, we demonstrate that optimizing RL objectives directly improves performance by approximately 30% on average compared to the widely adopted Algorithm Distillation (AD), across various dataset coverages, structures, expertise levels, and environmental complexities. Furthermore, in the challenging XLand-MiniGrid environment, RL objectives doubled the performance of AD. Our results also reveal that adding conservatism during value learning brings additional improvements in almost all settings tested. Our findings emphasize…
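To make the contrast in the abstract concrete, the following is a minimal sketch of the three kinds of objective it mentions: a supervised action-prediction loss (the kind AD uses), a one-step Q-learning loss, and a conservatism term in the style of CQL added to the value loss. This is not the authors' implementation. The function names, the default `gamma` and `alpha` values, and the use of plain per-sample lists in place of a Transformer's outputs are all assumptions made for illustration, and the code only evaluates loss values; it does not train a model.

```python
import math


def logsumexp(row):
    """Numerically stable log(sum(exp(row)))."""
    m = max(row)
    return m + math.log(sum(math.exp(v - m) for v in row))


def ad_loss(logits, actions):
    """Supervised objective: cross-entropy between predicted action
    logits and the actions stored in the dataset."""
    per_sample = [logsumexp(l) - l[a] for l, a in zip(logits, actions)]
    return sum(per_sample) / len(per_sample)


def td_loss(q, q_next, actions, rewards, dones, gamma=0.99):
    """One-step Q-learning objective: squared error between Q(s, a) and
    the bootstrapped target r + gamma * max_a' Q_target(s', a')."""
    total = 0.0
    for q_row, qn_row, a, r, d in zip(q, q_next, actions, rewards, dones):
        target = r + gamma * (1.0 - d) * max(qn_row)
        total += (q_row[a] - target) ** 2
    return total / len(actions)


def cql_penalty(q, actions):
    """CQL-style conservatism term: pushes Q-values down across all
    actions (via logsumexp) and up on the actions seen in the data."""
    per_sample = [logsumexp(row) - row[a] for row, a in zip(q, actions)]
    return sum(per_sample) / len(per_sample)


def conservative_q_loss(q, q_next, actions, rewards, dones,
                        gamma=0.99, alpha=1.0):
    """TD loss plus the weighted conservatism term (alpha is assumed)."""
    return (td_loss(q, q_next, actions, rewards, dones, gamma)
            + alpha * cql_penalty(q, actions))


if __name__ == "__main__":
    # Toy batch: 2 transitions, 3 discrete actions (hypothetical values).
    q = [[0.5, 1.0, -0.2], [0.1, 0.0, 0.3]]
    q_next = [[0.0, 0.4, 0.2], [0.0, 0.0, 0.0]]
    actions, rewards, dones = [1, 2], [1.0, 0.0], [0.0, 1.0]
    print("supervised loss:", ad_loss(q, actions))
    print("TD loss:", td_loss(q, q_next, actions, rewards, dones))
    print("conservative loss:",
          conservative_q_loss(q, q_next, actions, rewards, dones))
```

In an in-context setting, the rows of `q` or `logits` would come from a sequence model conditioned on the interaction history. The supervised loss only asks the model to reproduce dataset actions, whereas the TD loss uses rewards directly, which is one plausible reading of why the abstract reports gains on datasets with varying expertise levels.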

Peer Reviews

Decision·Submitted to ICLR 2026

Reviewer 01 · Rating 4 · Confidence 3

Strengths

S1. (Broad empirical evaluation) This work provides a broad empirical comparison between AD and RL-based ICRL methods (IC-DQN, IC-CQL, IC-IQL, IC-TD3(+BC)) across a wide variety of datasets, including discrete and continuous action spaces. The analyses cover factors such as data coverage, expertise levels, and unordered histories.

S2. (Robustness and Generalization) Empirical results consistently demonstrate that RL-based optimization yields higher NAUC scores, improved robustness to limited da

Weaknesses

W1. (Limited novelty) Although the proposed RL-based ICRL framework is empirically valuable, its architectural novelty is limited. The main contribution seems to involve replacing the supervised objective of AD with standard RL objectives, while adopting several design choices already introduced in prior works (e.g., AMAGO), such as incorporating done flags and timestep embeddings in the context representation.

W2. (Simplified continuous policy evaluation) Experiments in continuous domains rely

Reviewer 02 · Rating 4 · Confidence 4

Strengths

1. The problem is very clear: in offline ICRL, the supervised training objective does not optimize the return directly, making it hard to learn from suboptimal trajectories or to remain robust on low-quality datasets. This paper adopts a unified backbone and compares AD with various RL objectives in its experiments. The conclusion that RL consistently outperforms AD is straightforward and significant.

2. The experiments study a wide range of settings including training targets * number of histories

Weaknesses

1. This paper is more like an experimental report than a scientific research paper. It systematically studies the RL objectives and the supervised objective in a unified AD backbone, but it lacks algorithmic design on how to bridge the two objectives or theoretical insights on why RL-based objectives yield better generalization and robustness in offline in-context settings.

2. This paper primarily provides empirical observations but lacks in-depth analysis of some important findings. For example, t

Reviewer 03 · Rating 8 · Confidence 4

Strengths

**S1. Clear organization and presentation**
* The paper is well structured, allowing the reader to easily follow the main argument and its supporting evidence.
* Experimental results are presented clearly, and the setup is well designed to validate the central claim that “RL objectives improve AD.”

**S2. Useful insights in offline ICRL**
* Through its experimental findings, the paper provides valuable insights into the role of offline RL objectives in ICRL.
* For example, it highlights the bene

Weaknesses

**W1. (Slightly) limited novelty**
* The methodological novelty of the work is somewhat limited, as prior studies have explored similar applications of RL objectives in online ICRL settings, namely AMAGO-2 [2] and ReLIC [3] (note that AMAGO-2 also employs Transformers in a similar context).
* That said, I view this as a moderate rather than critical weakness; the offline RL setting poses unique challenges, and the paper’s analytical perspective and insights remain valuable even if the methodological novelty is

Taxonomy

Topics: Reinforcement Learning in Robotics · Stochastic Gradient Optimization Techniques · Domain Adaptation and Few-Shot Learning