Grounding Multimodal Large Language Models in Actions
Andrew Szot, Bogdan Mazoure, Harsh Agrawal, Devon Hjelm, Zsolt Kira, Alexander Toshev

TL;DR
This paper studies how best to ground Multimodal Large Language Models (MLLMs) in different embodiments and their action spaces. It unifies prior methods under a single architecture and compares seven action space adapters across five environments.
Contribution
The paper introduces a unified architecture for grounding MLLMs in actions and shows that learned tokenization works best for continuous actions, while semantic alignment with the MLLM's native output tokens works best for discrete actions.
Findings
Learned tokenization improves continuous action modeling.
Semantic alignment enhances discrete action performance.
Unified approach outperforms previous methods across multiple environments.
Abstract
Multimodal Large Language Models (MLLMs) have demonstrated a wide range of capabilities across many domains, including Embodied AI. In this work, we study how to best ground an MLLM into different embodiments and their associated action spaces, with the goal of leveraging the multimodal world knowledge of the MLLM. We first generalize a number of methods through a unified architecture and the lens of action space adapters. For continuous actions, we show that a learned tokenization allows for sufficient modeling precision, yielding the best performance on downstream tasks. For discrete actions, we demonstrate that semantically aligning these actions with the native output token space of the MLLM leads to the strongest performance. We arrive at these lessons via a thorough study of seven action space adapters on five different environments, encompassing over 114 embodied tasks.
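To make the idea of a learned tokenization for continuous actions concrete, here is a minimal sketch. The abstract does not describe the tokenizer itself, so this example swaps in plain k-means vector quantization as a stand-in: a codebook is fit to example actions, each action is mapped to the index of its nearest codebook vector (the "token"), and the token is decoded back to that vector. All names (`fit_codebook`, `tokenize`, `detokenize`) and the synthetic 2-D action data are hypothetical and not taken from the paper.

```python
import random


def nearest(codebook, action):
    """Index of the codebook vector closest to `action` (squared L2 distance)."""
    return min(
        range(len(codebook)),
        key=lambda i: sum((c - a) ** 2 for c, a in zip(codebook[i], action)),
    )


def fit_codebook(actions, num_tokens, iters=20, seed=0):
    """Fit a codebook with Lloyd's k-means (an illustrative stand-in tokenizer)."""
    rng = random.Random(seed)
    # Initialise the codebook from randomly chosen example actions.
    codebook = [list(a) for a in rng.sample(actions, num_tokens)]
    for _ in range(iters):
        buckets = [[] for _ in range(num_tokens)]
        for a in actions:
            buckets[nearest(codebook, a)].append(a)
        for i, bucket in enumerate(buckets):
            if bucket:  # keep the old vector if no action was assigned to it
                dim = len(bucket[0])
                codebook[i] = [sum(a[d] for a in bucket) / len(bucket) for d in range(dim)]
    return codebook


def tokenize(codebook, action):
    """Continuous action -> discrete token id."""
    return nearest(codebook, action)


def detokenize(codebook, token):
    """Discrete token id -> continuous action (the codebook vector)."""
    return list(codebook[token])


# Synthetic 2-D actions clustered around two modes (assumed data, for illustration only).
data_rng = random.Random(1)
actions = [(data_rng.gauss(-0.5, 0.05), data_rng.gauss(0.5, 0.05)) for _ in range(100)]
actions += [(data_rng.gauss(0.5, 0.05), data_rng.gauss(-0.5, 0.05)) for _ in range(100)]

codebook = fit_codebook(actions, num_tokens=8)
token = tokenize(codebook, (0.48, -0.52))
recon = detokenize(codebook, token)
print(token, recon)
```

Because the codebook adapts to where the actions actually fall, a small vocabulary can represent them more precisely than evenly spaced bins over the whole range would; an MLLM would then predict the token id rather than the raw continuous value.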
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling
