MageBench: Bridging Large Multimodal Models to Agents
Miaosen Zhang, Qi Dai, Yifan Yang, Jianmin Bao, Dongdong Chen, Kai Qiu, Chong Luo, Xin Geng, Baining Guo

TL;DR
MageBench introduces a new multimodal agent benchmark with diverse environments to evaluate reasoning, visual understanding, and interaction skills, revealing current models' significant limitations compared to humans.
Contribution
This work presents MageBench, a novel benchmark for assessing multimodal reasoning and interaction in agents, emphasizing visual feedback and imagination capabilities.
Findings
Current models perform near random on the benchmark.
Models lack the ability to adapt plans based on visual feedback.
Models are far below human-level performance.
Abstract
Large multimodal models (LMMs) have shown impressive visual understanding capabilities, with the potential to be applied in agents, which demand strong reasoning and planning abilities. Nevertheless, existing benchmarks mostly assess their reasoning abilities in the language part, where the chain-of-thought is entirely composed of text. We consider the scenario where visual signals are continuously updated and required along the decision-making process. Such a vision-in-the-chain reasoning paradigm is more aligned with the needs of multimodal agents, yet it is rarely evaluated. In this paper, we introduce MageBench, a reasoning-capability-oriented multimodal agent benchmark that, while having lightweight environments, poses significant reasoning challenges and holds substantial practical value. This benchmark currently includes three types of environments: WebUI, Sokoban, and Football, comprising a total of 483…
Taxonomy
Topics: Multi-Agent Systems and Negotiation
