ATP-Bench: Towards Agentic Tool Planning for MLLM Interleaved Generation

Yinuo Liu; Zi Qian; Heng Zhou; Jiahao Zhang; Yajie Zhang; Zhihang Li; Mengyu Zhou; Erchao Zhao; Xiaoxi Jiang; Guanjun Jiang

arXiv:2603.29902·cs.AI·April 1, 2026

TL;DR

ATP-Bench introduces a new benchmark and evaluation system for agentic tool planning in multimodal large language models, emphasizing autonomous tool invocation for interleaved text and image generation.

Contribution

The paper presents ATP-Bench, a comprehensive benchmark and a Multi-Agent MLLM-as-a-Judge system to evaluate and improve agentic tool planning in MLLMs.

Findings

01 Models struggle with coherent interleaved planning.

02 Tool-use behavior varies significantly across models.

03 There is still room for improvement in autonomous tool invocation.

Abstract

Interleaved text-and-image generation represents a significant frontier for Multimodal Large Language Models (MLLMs), offering a more intuitive way to convey complex information. Current paradigms rely on either image generation or retrieval augmentation, yet they typically treat the two as mutually exclusive paths, failing to unify factuality with creativity. We argue that the next milestone in this field is Agentic Tool Planning, where the model serves as a central controller that autonomously determines when, where, and which tools to invoke to produce interleaved responses for visual-critical queries. To systematically evaluate this paradigm, we introduce ATP-Bench, a novel benchmark comprising 7,702 QA pairs (including 1,592 VQA pairs) across eight categories and 25 visual-critical intents, featuring human-verified queries and ground truths. Furthermore, to evaluate agentic…
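To make the "when, where, and which tools" idea concrete, here is a minimal sketch of how an interleaved plan might be represented: an ordered list of text segments and tool calls. This is an illustration only, not the format used by ATP-Bench; the class names, the tool names `retrieve` and `generate`, and the bracketed placeholder syntax are all assumptions introduced for this example.

```python
from dataclasses import dataclass
from typing import List, Tuple, Union


@dataclass
class Text:
    """A plain text segment of the response."""
    content: str


@dataclass
class ToolCall:
    """A planned tool invocation (tool names here are hypothetical)."""
    tool: str   # "which": e.g. "retrieve" for factual images, "generate" for creative ones
    query: str  # the argument passed to the tool


Segment = Union[Text, ToolCall]


def tool_positions(plan: List[Segment]) -> List[Tuple[int, str]]:
    """Return (index, tool) pairs: "where" in the response each tool is invoked."""
    return [(i, seg.tool) for i, seg in enumerate(plan) if isinstance(seg, ToolCall)]


def render(plan: List[Segment]) -> str:
    """Render the plan as text, with tool calls shown as bracketed placeholders."""
    lines = []
    for seg in plan:
        if isinstance(seg, Text):
            lines.append(seg.content)
        else:
            lines.append(f"[{seg.tool}: {seg.query}]")
    return "\n".join(lines)


# A toy plan mixing a factual (retrieved) and a creative (generated) image.
plan: List[Segment] = [
    Text("The Eiffel Tower was completed in 1889."),
    ToolCall("retrieve", "photo of the Eiffel Tower"),
    Text("A poster for a fictional tower festival could look like this."),
    ToolCall("generate", "retro festival poster with a lattice tower"),
]

print(render(plan))
```

In this toy representation, a plan with no `ToolCall` entries corresponds to the model deciding that no image is needed, which covers the "when" part of the decision.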

Code & Models

Repositories

Qwen-Applications/ATP-Bench (GitHub)
