AMAGO: Scalable In-Context Reinforcement Learning for Adaptive Agents
Jake Grigsby, Linxi Fan, Yuke Zhu

TL;DR
AMAGO introduces a scalable in-context reinforcement learning agent using sequence models, capable of handling long-term memory, generalization, and meta-learning in complex, goal-conditioned environments with sparse rewards.
Contribution
The paper presents a novel approach to train long-sequence Transformers for in-context RL, enabling scalable, end-to-end training over entire rollouts and broad applicability across diverse problems.
Findings
Strong empirical performance in meta-RL and memory tasks
Effective in goal-conditioned problems with sparse rewards
Solves complex open-world domains with procedural environments
Abstract
We introduce AMAGO, an in-context Reinforcement Learning (RL) agent that uses sequence models to tackle the challenges of generalization, long-term memory, and meta-learning. Recent works have shown that off-policy learning can make in-context RL with recurrent policies viable. Nonetheless, these approaches require extensive tuning and limit scalability by creating key bottlenecks in agents' memory capacity, planning horizon, and model size. AMAGO revisits and redesigns the off-policy in-context approach to successfully train long-sequence Transformers over entire rollouts in parallel with end-to-end RL. Our agent is scalable and applicable to a wide range of problems, and we demonstrate its strong performance empirically in meta-RL and long-term memory domains. AMAGO's focus on sparse rewards and off-policy data also allows in-context learning to extend to goal-conditioned problems…
Peer Reviews
Decision: ICLR 2024 spotlight
- By redesigning the off-policy in-context approach and using sequence models, AMAGO overcomes previous bottlenecks in memory capacity, planning horizon, and model size.
- The introduction of a hindsight relabeling scheme broadens the applicability and scalability of this approach to open-world environments.
- The evaluations illustrate capabilities of AMAGO in diverse large-scale, heavy-memory and meta-learning RL challenges.
- The structure and presentation of the paper are clear and well-orga
- The AMAGO architecture's complexity, particularly due to the use of transformers, can be computationally intense. This may limit its efficiency in resource-limited situations or possibly make it less suitable for complicated applications.
- Although AMAGO has made improvements in performance, it still has a success rate of 0 in collecting some key resources (e.g. iron), so AMAGO still faces limitations in exploration challenges.
- AMAGO improves over baselines on long-horizon tasks while still performing well on simpler tasks.
- The authors perform a series of ablations in the Appendix to isolate the effect of each design choice.
- The authors test AMAGO on a very thorough set of experiments and show good results.
- It is a bit hard to keep track of what exactly the different components of the proposed method are. I think it would greatly improve the paper's clarity if the authors could include a table outlining exactly what the contribution is (e.g. Transformer architecture change, multi-gamma update, relabeling, etc.). Generally, I think the main improvement the authors can make on this paper is clarity.
- Similarly, an algorithm box for the relabeling scheme would be nice for clarity. I know the author
- The paper is well written and easy to follow.
- The authors set a new record on a standard benchmark, and in general provide a large number of experiments across many environments.
- There are a large number of ablation studies.
- This is effectively a more useful version of the AdA paper that does not rely on closed-source benchmarks and does not require policy distillation or corporation-scale resources, making it much more useful to the academic community.
- The high-level concept is not novel -- applying transformers to POMDPs or for in-context RL has been heavily studied.
- This method still requires server-grade GPUs to run, limiting this approach to well-off labs or industry.
- For the POPGym benchmark, they are comparing their off-policy method to an on-policy method given the same number of sampled timesteps. This is not a fair comparison, but they do label the baselines as on-policy.
- They do not list any shortcomings.
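The reviews mention a hindsight relabeling scheme for goal-conditioned problems with sparse rewards, but this summary does not describe how AMAGO implements it. As a rough illustration of the general idea only, the sketch below shows generic final-state relabeling in the style of Hindsight Experience Replay, which is an assumption swapped in here and not AMAGO's actual scheme. All names (`Step`, `sparse_reward`, `relabel_with_final_state`) and the grid-position states are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Hypothetical state type: a 2D grid position.
State = Tuple[int, int]


@dataclass
class Step:
    state: State
    action: int
    next_state: State
    goal: State
    reward: float


def sparse_reward(next_state: State, goal: State) -> float:
    # Sparse signal: 1.0 only when the goal is reached, otherwise 0.0.
    return 1.0 if next_state == goal else 0.0


def relabel_with_final_state(trajectory: List[Step]) -> List[Step]:
    # Generic hindsight relabeling (assumption, not AMAGO's scheme):
    # pretend the state the agent actually ended in was the goal all along,
    # then recompute the sparse reward for every step under that goal.
    if not trajectory:
        return []
    new_goal = trajectory[-1].next_state
    return [
        Step(s.state, s.action, s.next_state, new_goal,
             sparse_reward(s.next_state, new_goal))
        for s in trajectory
    ]


# A rollout that never reaches its original goal (5, 5).
original_goal = (5, 5)
positions = [(0, 0), (0, 1), (0, 2)]
failed = [
    Step(a, 0, b, original_goal, sparse_reward(b, original_goal))
    for a, b in zip(positions[:-1], positions[1:])
]
relabeled = relabel_with_final_state(failed)
```

In this toy rollout every original reward is zero, while the relabeled copy ends with a reward of 1.0 for the goal `(0, 2)`. That is why relabeling pairs naturally with off-policy learning: the relabeled data was not generated by a policy pursuing the new goal, yet it can still be trained on.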
Taxonomy
Topics: Reinforcement Learning in Robotics
Methods: Focus
