TOBench: A Task-Oriented Omni-Modal Benchmark for Real-World Tool-Using Agents

Zhiqiang Liu; Wenhui Dong; Yilang Tan; Yuwen Qu; Haochen Yin; Chenyang Si

arXiv:2605.16909 · cs.AI · May 19, 2026

TL;DR

This paper introduces MM-ToolBench, a comprehensive benchmark for evaluating task-oriented omni-modal tool-using agents in realistic workflows, emphasizing closed-loop verification and scalability.

Contribution

The paper presents a new 100-task benchmark with grounded evaluators and a semi-automated pipeline for assessing and advancing omni-modal tool-using agents in real-world scenarios.

Findings

01. Current models perform significantly below human benchmarks.

02. MM-ToolBench is highly challenging for contemporary agentic models.

03. Claude Opus 4.6 achieves only 32.0% task success.

Abstract

Tool-using agents are increasingly expected to operate across realistic professional workflows, where they must interpret multimodal inputs, coordinate external tools, inspect intermediate artifacts, and revise their actions before producing a final result. Existing benchmarks, however, often evaluate tool use, computer use, and multimodal reasoning in isolation, leaving a gap between benchmark settings and end-to-end omni-modal tool use in the real world. To address this gap, we introduce MM-ToolBench, a benchmark and evaluation harness for task-oriented omni-modal tool use. MM-ToolBench contains 100 executable tasks from two macro task families, Customer Service and Intelligent Creation, covering 20 subcategory slices and supported by 27 MCP servers with 324 tools. The central design of MM-ToolBench is closed-loop multimodal verification: agents must execute tools, inspect rendered or…

Code & Models

Repositories

pi3ai/TOBench (GitHub)
