M^3-Bench: Multi-Modal, Multi-Hop, Multi-Threaded Tool-Using MLLM Agent Benchmark

Yang Zhou; Mingyu Zhao; Zhenting Wang; Difei Gu; Bangwei Guo; Ruosong Ye; Ligong Han; Can Jin; Dimitris N. Metaxas

arXiv:2511.17729·cs.AI·February 5, 2026

TL;DR

M^3-Bench is a comprehensive benchmark designed to evaluate multimodal, multi-hop, multi-threaded tool use in large language models, emphasizing realistic workflows, visual grounding, and reasoning across tools.

Contribution

It introduces a novel similarity-driven alignment method and standardized evaluation pipeline for assessing complex multimodal tool-using workflows in LLMs.

Findings

01

State-of-the-art models show gaps in argument fidelity.

02

Workflow structure consistency is often lacking.

03

Benchmark reveals areas for improvement in multimodal reasoning.

Abstract

We present M^3-Bench, the first benchmark for evaluating multimodal tool use under the Model Context Protocol. The benchmark targets realistic multi-hop and multi-threaded workflows that require visual grounding, textual reasoning, cross-tool dependencies, and persistence of intermediate resources across steps. We introduce a similarity-driven alignment that serializes each tool call, embeds the resulting signatures with a sentence encoder, and performs similarity-bucketed Hungarian matching to obtain auditable one-to-one correspondences. On top of this alignment, we report interpretable metrics that decouple semantic fidelity from workflow consistency. The benchmark spans 28 servers with 231 tools and provides standardized trajectories curated through an Executor & Judge pipeline with human verification; an auxiliary judge ensemble of four large language models (LLMs) reports end-task Task…
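The alignment described in the abstract (serialize, embed, match) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration, not the authors' implementation: the call format (`tool`, `args`), the function names, the character-trigram `embed` that stands in for a real sentence encoder, and the bucketing rule (similarities rounded down to ten integer buckets, with matches below `min_bucket` discarded). The abstract does not say how buckets are formed.

```python
import json
import math
from collections import Counter


def serialize(call):
    """Canonical signature string for a tool call (hypothetical format)."""
    return f"{call['tool']} {json.dumps(call['args'], sort_keys=True)}"


def embed(text):
    """Toy stand-in for a sentence encoder: bag of character trigrams."""
    t = text.lower()
    return Counter(t[i:i + 3] for i in range(len(t) - 2))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(v * b[k] for k, v in a.items() if k in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def bucket(sim, n_buckets=10):
    """Assumed bucketing rule: map a similarity in [0, 1] to an integer bucket."""
    return min(n_buckets, int(sim * n_buckets + 1e-9))


def hungarian(cost):
    """Minimum-cost perfect assignment on a square matrix (Hungarian method)."""
    n = len(cost)
    inf = float("inf")
    u, v = [0] * (n + 1), [0] * (n + 1)   # row and column potentials
    p, way = [0] * (n + 1), [0] * (n + 1)  # p[j] = row matched to column j
    for i in range(1, n + 1):
        p[0], j0 = i, 0
        minv, used = [inf] * (n + 1), [False] * (n + 1)
        while True:
            used[j0] = True
            i0, delta, j1 = p[j0], inf, 0
            for j in range(1, n + 1):
                if not used[j]:
                    cur = cost[i0 - 1][j - 1] - u[i0] - v[j]
                    if cur < minv[j]:
                        minv[j], way[j] = cur, j0
                    if minv[j] < delta:
                        delta, j1 = minv[j], j
            for j in range(n + 1):
                if used[j]:
                    u[p[j]] += delta
                    v[j] -= delta
                else:
                    minv[j] -= delta
            j0 = j1
            if p[j0] == 0:
                break
        while j0:  # augment along the alternating path
            j1 = way[j0]
            p[j0] = p[j1]
            j0 = j1
    return [(p[j] - 1, j - 1) for j in range(1, n + 1)]


def align(pred_calls, ref_calls, n_buckets=10, min_bucket=5):
    """One-to-one (predicted, reference) index pairs with bucket >= min_bucket."""
    pe = [embed(serialize(c)) for c in pred_calls]
    rf = [embed(serialize(c)) for c in ref_calls]
    size = max(len(pe), len(rf))
    # Pad to a square matrix; padded cells keep bucket 0 (no similarity).
    score = [[0] * size for _ in range(size)]
    for i, a in enumerate(pe):
        for j, b in enumerate(rf):
            score[i][j] = bucket(cosine(a, b), n_buckets)
    cost = [[n_buckets - s for s in row] for row in score]
    return sorted(
        (i, j) for i, j in hungarian(cost)
        if i < len(pe) and j < len(rf) and score[i][j] >= min_bucket
    )


pred = [
    {"tool": "image_crop", "args": {"path": "a.png", "box": [0, 0, 10, 10]}},
    {"tool": "web_search", "args": {"query": "eiffel tower height"}},
]
ref = [
    {"tool": "web_search", "args": {"query": "eiffel tower height"}},
    {"tool": "image_crop", "args": {"path": "a.png", "box": [0, 0, 10, 10]}},
    {"tool": "ocr", "args": {"path": "b.png"}},
]
print(align(pred, ref))  # [(0, 1), (1, 0)]
```

In this toy run the two predicted calls are paired with their identical reference calls regardless of order, and the unmatched `ocr` reference call is left without a partner. Integer buckets keep the assignment costs exact, which is one plausible reason to bucket similarities before matching; whether the paper does so for this reason is not stated in the abstract.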

Code & Models

Datasets

EtaYang10th/Open-M3-Bench (dataset, 11 downloads)

Taxonomy

Topics: Multimodal Machine Learning Applications · Topic Modeling · Natural Language Processing Techniques