In-context Exploration-Exploitation for Reinforcement Learning
Zhenwen Dai, Federico Tomasi, Sina Ghiassian

TL;DR
This paper introduces ICEE, an efficient in-context exploration-exploitation algorithm for reinforcement learning that reduces computational costs and learns new tasks with fewer episodes by performing inference-time exploration-exploitation within a Transformer.
Contribution
ICEE enables inference-time exploration-exploitation in RL without gradient optimization, significantly improving efficiency over existing in-context learning methods.
Findings
ICEE learns new RL tasks with only tens of episodes.
ICEE matches Gaussian process efficiency in Bayesian optimization.
ICEE reduces computational costs compared to prior in-context learning methods.
Abstract
In-context learning is a promising approach for online policy learning of offline reinforcement learning (RL) methods, which can be achieved at inference time without gradient optimization. However, this method is hindered by significant computational costs resulting from the gathering of large training trajectory sets and the need to train large Transformer models. We address this challenge by introducing an In-context Exploration-Exploitation (ICEE) algorithm, designed to optimize the efficiency of in-context policy learning. Unlike existing models, ICEE performs an exploration-exploitation trade-off at inference time within a Transformer model, without the need for explicit Bayesian inference. Consequently, ICEE can solve Bayesian optimization problems as efficiently as Gaussian process-based methods do, but in significantly less time. Through experiments in grid world environments,…
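For context on the Gaussian process baseline that the abstract compares against: a common acquisition function in GP-based Bayesian optimization is expected improvement (EI), which one of the reviews below also names. The following is a minimal sketch of the standard closed-form EI for maximization, not of ICEE itself. The function name `expected_improvement`, the margin parameter `xi`, and the candidate values are illustrative assumptions, and the posterior mean and standard deviation are assumed to come from an already fitted GP.

```python
import math


def expected_improvement(mu, sigma, best, xi=0.0):
    """Closed-form EI (maximization) at one candidate point.

    mu, sigma: GP posterior mean and standard deviation at the point
               (assumed to be supplied by some fitted GP model).
    best:      best objective value observed so far.
    xi:        optional exploration margin (hypothetical default of 0).
    """
    improvement = mu - best - xi
    if sigma <= 0.0:
        # No posterior uncertainty: the improvement is deterministic.
        return max(improvement, 0.0)
    z = improvement / sigma
    cdf = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))          # standard normal CDF
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)   # standard normal PDF
    return improvement * cdf + sigma * pdf


# Made-up candidates as (x, posterior mean, posterior std) for illustration.
candidates = [(0.2, 0.9, 0.05), (0.5, 1.0, 0.30), (0.8, 0.7, 0.40)]
best_so_far = 1.0
next_x = max(
    candidates,
    key=lambda c: expected_improvement(c[1], c[2], best_so_far),
)[0]
```

The first term rewards points whose mean already exceeds the incumbent (exploitation), and the second rewards points with high posterior uncertainty (exploration). Each such step requires an explicit posterior from the GP, which is the kind of explicit Bayesian inference the abstract says ICEE avoids.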
Peer Reviews
Decision: ICLR 2024 poster
- Simple and well-articulated but technically precise characterization of the issues surrounding in-context policy learning, especially related to epistemic uncertainty.
- The experimental results, while in simple domains at small scale, are nevertheless well motivated and designed to illustrate the promise of the proposed ideas.
- For the BO benchmark, the proposed approach is competitive with EI, but at a small fraction of the cost. For the sequential RL tasks, it achieves better performance c
The context length requires a sequence of episodes for in-context learning, which can make it fundamentally quite challenging in terms of scale to go beyond small dimensional problems.
Nits/minor typos:
- GP biased -> GP based
- pg 3: indefinite -> infinite
- Sec 6: wildly -> widely
+ In-context RL is a timely topic in the study of RL, and this work provides some interesting ideas about the design of in-context RL algorithms.
+ The design of the cross-episode reward successfully removes the requirement that the offline dataset be generated by RL learning algorithms. This design improves the applicability of in-context RL.
+ The experimental results show promising performance of ICEE in the early episodes.
- The experimental evaluation is not sufficient. There is no comparison between in-context RL algorithms and traditional offline RL algorithms, multi-task RL algorithms or posterior sampling based algorithms.
- The proposed ICEE algorithm lacks theoretical performance guarantees.
- The ICEE algorithm heavily relies on importance sampling and Monte Carlo approximation, which are widely used techniques in traditional RL and not new.
The experimental results are very promising. While the algorithm is an extension of Decision Transformer (DT), it certainly outperforms DT significantly. The application of in-context learning in RL is innovative and has the potential to advance learning.
The explanation provided in the paper on why the proposed algorithm works is not very convincing. There is no quantitative analysis provided for identifying the difference between ICEE and DT. As pointed out below, some of the explanations of the key ideas in the paper are not very clear.
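One review above notes that ICEE relies on importance sampling and Monte Carlo approximation. As a generic illustration of that technique (not the paper's specific estimator), the sketch below estimates the expected reward of a target action distribution from samples logged under a different behavior distribution. The function name `importance_sampling_estimate` and all probabilities and rewards are made-up assumptions.

```python
def importance_sampling_estimate(logged, behavior_probs, target_probs):
    """Ordinary importance sampling estimate of expected reward.

    logged:         list of (action, reward) pairs sampled under the
                    behavior distribution.
    behavior_probs: dict mapping action -> probability under the
                    distribution that generated the data.
    target_probs:   dict mapping action -> probability under the
                    distribution we want to evaluate.
    """
    total = 0.0
    for action, reward in logged:
        # Reweight each sample by how much more (or less) likely the
        # target distribution is to choose this action.
        weight = target_probs[action] / behavior_probs[action]
        total += weight * reward
    # Monte Carlo average over the logged samples.
    return total / len(logged)


# Hypothetical two-action example with deterministic rewards.
behavior = {0: 0.5, 1: 0.5}
target = {0: 0.2, 1: 0.8}
logged = [(0, 1.0), (1, 2.0), (1, 2.0), (0, 1.0)]
estimate = importance_sampling_estimate(logged, behavior, target)
```

In this toy case the reweighted average coincides with the true expected reward under the target distribution, 0.2 * 1.0 + 0.8 * 2.0, because the logged actions happen to match the behavior probabilities exactly. With real samples the estimate is unbiased but noisy.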
Taxonomy
Topics: Reinforcement Learning in Robotics
Methods: Attention Is All You Need · Linear Layer · Dropout · Multi-Head Attention · Position-Wise Feed-Forward Layer · Layer Normalization · Absolute Position Encodings · Softmax · Dense Connections · Label Smoothing
