Agentic-MME: What Agentic Capability Really Brings to Multimodal Intelligence?

Qianshan Wei; Yishan Yang; Siyi Wang; Jinglin Chen; Binyu Wang; Jiaming Wang; Shuang Chen; Zechen Li; Yang Shi; Yuqi Tang; Weining Wang; Yi Yu; Chaoyou Fu; Qi Li; Yi-Fan Zhang

arXiv:2604.03016 · cs.AI · April 6, 2026



TL;DR

Agentic-MME introduces a comprehensive benchmark for evaluating multimodal models' ability to actively use visual and web tools, emphasizing process verification and efficiency measurement.

Contribution

Agentic-MME is a process-verified benchmark for multimodal agentic capabilities, built on real-world tasks with stepwise evaluation and fine-grained state auditing.

Findings

01. Gemini3-pro achieves 56.3% accuracy overall.
02. Performance drops to 23.0% on the hardest tasks.
03. The benchmark reveals the difficulty of real-world multimodal problem solving.

Abstract

Multimodal Large Language Models (MLLMs) are evolving from passive observers into active agents, solving problems through Visual Expansion (invoking visual tools) and Knowledge Expansion (open-web search). However, existing evaluations fall short: they lack flexible tool integration, test visual and search tools separately, and evaluate primarily by final answers. Consequently, they cannot verify if tools were actually invoked, applied correctly, or used efficiently. To address this, we introduce Agentic-MME, a process-verified benchmark for Multimodal Agentic Capabilities. It contains 418 real-world tasks across 6 domains and 3 difficulty levels to evaluate capability synergy, featuring over 2,000 stepwise checkpoints that average 10+ person-hours of manual annotation per task. Each task includes a unified evaluation framework supporting sandboxed code and APIs, alongside a human…
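To make the idea of process verification concrete, the following is a minimal sketch of how stepwise checkpoints could be compared against an agent's tool-call trace. It is not the benchmark's actual scoring code: the `Checkpoint` schema, the tool names, and both metrics (`checkpoint_score`, `overcall_ratio`) are hypothetical and only illustrate checking whether tools were invoked, in what order, and how economically.

```python
from dataclasses import dataclass


@dataclass
class Checkpoint:
    """One annotated step (hypothetical schema): the tool the step should use."""
    tool: str


def checkpoint_score(trace: list[str], checkpoints: list[Checkpoint]) -> float:
    """Fraction of checkpoints matched, in order, by the agent's tool calls."""
    if not checkpoints:
        return 0.0
    matched = 0
    for call in trace:
        # Advance only when the call matches the next unmet checkpoint.
        if matched < len(checkpoints) and call == checkpoints[matched].tool:
            matched += 1
    return matched / len(checkpoints)


def overcall_ratio(trace: list[str], checkpoints: list[Checkpoint]) -> float:
    """Tool calls made per annotated step; above 1.0 suggests redundant calls."""
    if not checkpoints:
        return 0.0
    return len(trace) / len(checkpoints)


# Hypothetical task: crop the image, search the web, then read text via OCR.
checkpoints = [Checkpoint("crop"), Checkpoint("web_search"), Checkpoint("ocr")]
# Hypothetical agent trace: repeats two calls and never reaches the OCR step.
trace = ["crop", "crop", "web_search", "web_search"]

print(checkpoint_score(trace, checkpoints))
print(overcall_ratio(trace, checkpoints))
```

Under these assumptions, a final-answer-only metric would miss both the skipped OCR step and the repeated calls, whereas the two numbers above expose them separately.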


Code & Models

Repositories

chos3ne11ven/Agentic-MME
github

Datasets

Agentic-MME/Agentic-MME
dataset · 414 downloads
