From Multimodal LLMs to Generalist Embodied Agents: Methods and Lessons
Andrew Szot, Bogdan Mazoure, Omar Attia, Aleksei Timofeev, Harsh Agrawal, Devon Hjelm, Zhe Gan, Zsolt Kira, Alexander Toshev

TL;DR
This paper explores adapting Multimodal Large Language Models (MLLMs) into a single generalist embodied agent that operates across diverse domains such as Embodied AI, games, UI control, and planning, trained with supervised learning and online reinforcement learning.
Contribution
Introduces a method for adapting an MLLM into a single Generalist Embodied Agent (GEA) that grounds itself across domains through a multi-embodiment action tokenizer and is trained with supervised learning and online RL.
Findings
GEA achieves strong generalization to unseen tasks.
Training with cross-domain data improves performance.
Online RL enhances the adaptability of the model.
Abstract
We examine the capability of Multimodal Large Language Models (MLLMs) to tackle diverse domains that extend beyond the traditional language and vision tasks these models are typically trained on. Specifically, our focus lies in areas such as Embodied AI, Games, UI Control, and Planning. To this end, we introduce a process of adapting an MLLM to a Generalist Embodied Agent (GEA). GEA is a single unified model capable of grounding itself across these varied domains through a multi-embodiment action tokenizer. GEA is trained with supervised learning on a large dataset of embodied experiences and with online RL in interactive simulators. We explore the data and algorithmic choices necessary to develop such a model. Our findings reveal the importance of training with cross-domain data and online RL for building generalist agents. The final GEA model achieves strong generalization performance…
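To make the idea of a multi-embodiment action tokenizer concrete, here is a minimal sketch of one way actions from different embodiments could share a single token vocabulary. It is an illustration only, not the tokenizer used in the paper: the class names (`ContinuousSpace`, `DiscreteSpace`, `ActionTokenizer`), the uniform binning scheme, and the bin and vocabulary sizes are all assumptions chosen for simplicity.

```python
from dataclasses import dataclass
from typing import List, Sequence, Union


@dataclass(frozen=True)
class ContinuousSpace:
    """Hypothetical continuous action space, e.g. a robot arm command."""
    low: float
    high: float
    dims: int


@dataclass(frozen=True)
class DiscreteSpace:
    """Hypothetical discrete action space, e.g. game buttons or UI commands."""
    names: Sequence[str]


Space = Union[ContinuousSpace, DiscreteSpace]


class ActionTokenizer:
    """Maps actions from several embodiments into one shared token-id range.

    Assumed layout (not from the paper):
      ids [0, num_bins)                        -> bins for continuous values
      ids [num_bins, num_bins + max_discrete)  -> discrete action choices
    """

    def __init__(self, num_bins: int = 256, max_discrete: int = 64):
        self.num_bins = num_bins
        self.max_discrete = max_discrete

    def encode(self, space: Space, action) -> List[int]:
        if isinstance(space, DiscreteSpace):
            idx = list(space.names).index(action)
            if idx >= self.max_discrete:
                raise ValueError("discrete action index exceeds vocabulary")
            return [self.num_bins + idx]
        if len(action) != space.dims:
            raise ValueError("action has the wrong number of dimensions")
        tokens = []
        for value in action:
            # Clip to the valid range, then map to a uniform bin index.
            clipped = min(max(value, space.low), space.high)
            frac = (clipped - space.low) / (space.high - space.low)
            tokens.append(min(int(frac * self.num_bins), self.num_bins - 1))
        return tokens

    def decode(self, space: Space, tokens: Sequence[int]):
        if isinstance(space, DiscreteSpace):
            return space.names[tokens[0] - self.num_bins]
        # Return the center of each bin as the reconstructed value.
        width = (space.high - space.low) / self.num_bins
        return [space.low + (t + 0.5) * width for t in tokens]


tok = ActionTokenizer(num_bins=256)
arm = ContinuousSpace(low=-1.0, high=1.0, dims=3)
game = DiscreteSpace(names=("noop", "left", "right", "fire"))

arm_tokens = tok.encode(arm, [0.0, -1.0, 1.0])
game_tokens = tok.encode(game, "right")
```

In this sketch a three-dimensional arm command and a single game button both become short sequences of integer ids that a language model could emit as ordinary output tokens. Decoding a continuous action recovers each value only up to half a bin width, which is the usual cost of uniform discretization.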
Taxonomy
Topics: Natural Language Processing Techniques · Semantic Web and Ontologies
