MageBench: Bridging Large Multimodal Models to Agents

Miaosen Zhang; Qi Dai; Yifan Yang; Jianmin Bao; Dongdong Chen; Kai Qiu; Chong Luo; Xin Geng; Baining Guo

arXiv:2412.04531·cs.CV·December 9, 2024

TL;DR

MageBench introduces a new multimodal agent benchmark with diverse environments to evaluate reasoning, visual understanding, and interaction skills, revealing current models' significant limitations compared to humans.

Contribution

This work presents MageBench, a novel benchmark for assessing multimodal reasoning and interaction in agents, emphasizing visual feedback and imagination capabilities.

Findings

01 Current models perform near random on the benchmark.

02 Models lack the ability to adapt plans based on visual feedback.

03 Models perform far below human level.

Abstract

LMMs have shown impressive visual understanding capabilities, with the potential to be applied in agents, which demand strong reasoning and planning abilities. Nevertheless, existing benchmarks mostly assess their reasoning abilities in the language part, where the chain-of-thought is entirely composed of text. We consider the scenario where visual signals are continuously updated and required along the decision-making process. Such a vision-in-the-chain reasoning paradigm is more aligned with the needs of multimodal agents, while being rarely evaluated. In this paper, we introduce MageBench, a reasoning-capability-oriented multimodal agent benchmark that, while having lightweight environments, poses significant reasoning challenges and holds substantial practical value. This benchmark currently includes three types of environments: WebUI, Sokoban, and Football, comprising a total of 483…
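The vision-in-the-chain idea described in the abstract can be illustrated with a minimal sketch. Everything here is an assumption made for illustration and is not taken from the MageBench code: `GridEnv` is a toy corridor standing in for a visual environment, its `render` string stands in for an image frame, and `toy_policy` is a hypothetical placeholder for a call to an LMM. The point is only that the agent receives a fresh observation before every action rather than committing to a plan from the first frame.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class GridEnv:
    """Toy stand-in for a visual environment: a 1-D corridor with a goal cell."""
    width: int = 5
    agent: int = 0
    goal: int = 4

    def render(self) -> str:
        # Stand-in for an image observation: '.' empty, 'G' goal, 'A' agent.
        cells = ["."] * self.width
        cells[self.goal] = "G"
        cells[self.agent] = "A"
        return "".join(cells)

    def step(self, action: str) -> bool:
        # Apply the action and report whether the goal has been reached.
        if action == "right":
            self.agent = min(self.agent + 1, self.width - 1)
        elif action == "left":
            self.agent = max(self.agent - 1, 0)
        return self.agent == self.goal


def toy_policy(observation: str, history: List[Tuple[str, str]]) -> str:
    """Hypothetical placeholder for an LMM call; it reads only the latest frame."""
    agent = observation.index("A")
    goal = observation.find("G")
    if goal == -1:
        return "stay"
    return "right" if goal > agent else "left"


def run_episode(
    env: GridEnv,
    policy: Callable[[str, List[Tuple[str, str]]], str],
    max_steps: int = 10,
) -> Tuple[List[Tuple[str, str]], bool]:
    """Interleave observation and action: a new frame enters the chain each step."""
    history: List[Tuple[str, str]] = []
    for _ in range(max_steps):
        observation = env.render()          # updated visual signal
        action = policy(observation, history)
        history.append((observation, action))
        if env.step(action):
            return history, True
    return history, False


if __name__ == "__main__":
    steps, solved = run_episode(GridEnv(), toy_policy)
    for observation, action in steps:
        print(observation, "->", action)
    print("solved:", solved)
```

A text-only chain-of-thought baseline would correspond to calling the policy once on the first frame and executing the resulting plan without looking again; the loop above differs only in that `env.render()` is consulted before every action.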

Code & Models

Repositories

microsoft/magebench (PyTorch, official)

Taxonomy

Topics: Multi-Agent Systems and Negotiation