TL;DR
This paper introduces MM-ToolBench, a comprehensive benchmark for evaluating task-oriented omni-modal tool-using agents in realistic workflows, emphasizing closed-loop verification and scalability.
Contribution
It presents a new benchmark with 100 tasks, grounded evaluators, and a semi-automated pipeline to assess and advance omni-modal tool-using agents in real-world scenarios.
Findings
Current models perform significantly below human baselines.
MM-ToolBench is highly challenging for contemporary agentic models.
Claude Opus 4.6 achieves only 32.0% task success.
Abstract
Tool-using agents are increasingly expected to operate across realistic professional workflows, where they must interpret multimodal inputs, coordinate external tools, inspect intermediate artifacts, and revise their actions before producing a final result. Existing benchmarks, however, often evaluate tool use, computer use, and multimodal reasoning in isolation, leaving a gap between benchmark settings and end-to-end omni-modal tool use in the real world. To address this gap, we introduce MM-ToolBench, a benchmark and evaluation harness for task-oriented omni-modal tool use. MM-ToolBench contains 100 executable tasks from two macro task families, Customer Service and Intelligent Creation, covering 20 subcategory slices and supported by 27 MCP servers with 324 tools. The central design of MM-ToolBench is closed-loop multimodal verification: agents must execute tools, inspect rendered or…
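The execute, inspect, and revise cycle that the abstract describes can be illustrated with a minimal sketch. This is not MM-ToolBench's harness or API: the `Attempt` record, the `closed_loop` function, its three callbacks, and the `max_rounds` limit are all hypothetical names introduced here, and the string-based toy task stands in for real tool calls and rendered artifacts.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Attempt:
    """One round of the loop (hypothetical record, not from the paper)."""
    action: str
    artifact: str
    passed: bool


def closed_loop(
    execute: Callable[[str], str],      # runs a tool call, returns an artifact
    inspect: Callable[[str], bool],     # checks the artifact against the goal
    revise: Callable[[str, str], str],  # proposes a new action after a failure
    first_action: str,
    max_rounds: int = 3,                # assumed budget, chosen for illustration
) -> List[Attempt]:
    """Execute, inspect the result, and revise until the check passes."""
    history: List[Attempt] = []
    action = first_action
    for _ in range(max_rounds):
        artifact = execute(action)
        passed = inspect(artifact)
        history.append(Attempt(action, artifact, passed))
        if passed:
            break
        action = revise(action, artifact)
    return history


# Toy stand-ins: the "tool" upper-cases text, and the check wants a trailing "!".
history = closed_loop(
    execute=lambda action: action.upper(),
    inspect=lambda artifact: artifact.endswith("!"),
    revise=lambda action, artifact: action + "!",
    first_action="draft",
)
for attempt in history:
    print(attempt)
```

In this toy run the first artifact fails inspection, the action is revised once, and the second artifact passes. An open-loop agent would have returned the first artifact unchecked, which is the behavior a closed-loop evaluation is meant to distinguish.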
