Magma: A Foundation Model for Multimodal AI Agents
Jianwei Yang, Reuben Tan, Qianhui Wu, Ruijie Zheng, Baolin Peng, Yongyuan Liang, Yu Gu, Mu Cai, Seonghyeon Ye, Joel Jang, Yuquan Deng, Lars Liden, Jianfeng Gao

TL;DR
Magma is a versatile foundation model that integrates vision-language understanding with spatial-temporal reasoning, enabling advanced multimodal AI agents to perform tasks in digital and physical environments.
Contribution
Magma extends vision-language models by incorporating spatial-temporal reasoning and agentic capabilities through novel training with heterogeneous datasets and action grounding techniques.
Findings
Achieves state-of-the-art results in UI navigation and robotic manipulation.
Outperforms existing models on image and video multimodal tasks.
Demonstrates that Set-of-Mark (SoM) and Trace-of-Mark (ToM) labeling work together effectively for spatial-temporal intelligence.
Abstract
We present Magma, a foundation model that serves multimodal AI agentic tasks in both the digital and physical worlds. Magma is a significant extension of vision-language (VL) models in that it not only retains the VL understanding ability (verbal intelligence) of the latter, but is also equipped with the ability to plan and act in the visual-spatial world (spatial-temporal intelligence) and complete agentic tasks ranging from UI navigation to robot manipulation. To endow the agentic capabilities, Magma is pretrained on large amounts of heterogeneous datasets spanning from images, videos to robotics data, where the actionable visual objects (e.g., clickable buttons in GUI) in images are labeled by Set-of-Mark (SoM) for action grounding, and the object movements (e.g., the trace of human hands or robotic arms) in videos are labeled by Trace-of-Mark (ToM) for action planning. Extensive…
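The two labeling ideas named in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the `Box` type, the function names, and the input formats are all hypothetical, chosen only to show the general idea that SoM turns action grounding into picking a numbered mark, while ToM records where each mark moves over future frames.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Axis-aligned box in pixel coordinates (hypothetical helper type)."""
    x0: float
    y0: float
    x1: float
    y1: float

    def center(self):
        return ((self.x0 + self.x1) / 2, (self.y0 + self.y1) / 2)


def set_of_mark(boxes):
    """Assign a numeric mark (1, 2, ...) to each candidate actionable box."""
    return {mark: box for mark, box in enumerate(boxes, start=1)}


def ground_action(marks, chosen_mark):
    """Resolve a chosen mark id to a click point (here, the box center)."""
    return marks[chosen_mark].center()


def trace_of_mark(frames, start, horizon):
    """Collect each mark's positions over the `horizon` frames after `start`.

    `frames` is a list of {mark_id: (x, y)} dicts, one per video frame
    (an assumed format). Marks missing from a future frame are skipped.
    """
    traces = {mark: [] for mark in frames[start]}
    for frame in frames[start + 1 : start + 1 + horizon]:
        for mark in traces:
            if mark in frame:
                traces[mark].append(frame[mark])
    return traces


# SoM: two clickable buttons get marks 1 and 2; choosing mark 2 yields a click point.
buttons = [Box(10, 10, 50, 30), Box(60, 10, 120, 30)]
marks = set_of_mark(buttons)
click_point = ground_action(marks, 2)

# ToM: mark 1 tracked across four frames; take its next two positions from frame 0.
frames = [{1: (0, 0)}, {1: (1, 2)}, {1: (2, 4)}, {1: (3, 6)}]
future = trace_of_mark(frames, start=0, horizon=2)
```

In this toy setup, the model's output space for grounding is a small set of mark ids rather than raw pixel coordinates, and the planning target is a short sequence of future positions per mark.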
Videos
Magma: A foundation model for multimodal AI Agents | Microsoft Research Forum · YouTube
Taxonomy
Topics: Multi-Agent Systems and Negotiation · Semantic Web and Ontologies
Methods: Set-of-Mark (SoM)
