Memo: Training Memory-Efficient Embodied Agents with Reinforcement Learning
Gunshi Gupta, Karmesh Yadav, Zsolt Kira, Yarin Gal, Rahaf Aljundi

TL;DR
Memo introduces a transformer-based architecture with memory compression for reinforcement learning, enabling embodied agents to handle long-horizon tasks efficiently by summarizing and retrieving relevant information, outperforming existing methods.
Contribution
The paper presents Memo, a novel transformer architecture with integrated memory summarization for RL, improving efficiency and long-term context handling in embodied agents.
Findings
Memo outperforms naive long-context transformers.
Memo is more compute and storage efficient.
Memo generalizes better to longer contexts.
Abstract
To enable embodied agents to operate effectively over extended timeframes, it is crucial to develop models that form and access memories to stay contextualized in their environment. In the current paradigm of training transformer-based policies for embodied sequential decision-making tasks, visual inputs often overwhelm the context limits of transformers, while humans can maintain and utilize a lifetime of experience compressed as memories. Significant compression is possible in principle, as much of the input is irrelevant and can be abstracted. However, existing approaches predominantly focus on either recurrent models with fixed-size memory or transformers with full-context reliance. In this work, we propose Memo, a transformer-based architecture and training recipe for reinforcement learning (RL) on memory-intensive, long-horizon tasks. Memo incorporates the creation and retrieval…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Code & Models
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsMultimodal Machine Learning Applications · Reinforcement Learning in Robotics · Human Pose and Action Recognition
