TL;DR
ATP-Bench introduces a new benchmark and evaluation system for agentic tool planning in multimodal large language models, emphasizing autonomous tool invocation for interleaved text and image generation.
Contribution
The paper presents ATP-Bench, a comprehensive benchmark and a Multi-Agent MLLM-as-a-Judge system to evaluate and improve agentic tool planning in MLLMs.
Findings
Models struggle with coherent interleaved planning.
Significant variation exists in tool-use behavior among models.
Autonomous tool invocation still leaves room for improvement.
Abstract
Interleaved text-and-image generation represents a significant frontier for Multimodal Large Language Models (MLLMs), offering a more intuitive way to convey complex information. Current paradigms rely on either image generation or retrieval augmentation, yet they typically treat the two as mutually exclusive paths, failing to unify factuality with creativity. We argue that the next milestone in this field is Agentic Tool Planning, where the model serves as a central controller that autonomously determines when, where, and which tools to invoke to produce interleaved responses for visual-critical queries. To systematically evaluate this paradigm, we introduce ATP-Bench, a novel benchmark comprising 7,702 QA pairs (including 1,592 VQA pairs) across eight categories and 25 visual-critical intents, featuring human-verified queries and ground truths. Furthermore, to evaluate agentic…
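To make the "central controller" idea in the abstract concrete, here is a minimal sketch in which a plan is an ordered list of text segments and tool calls, and a renderer fills in each tool call. This is not the paper's implementation: the names `Text`, `ToolCall`, `render`, and the two stub tools `retrieve` and `generate` are all hypothetical, chosen only to mirror the retrieval and image-generation paths the abstract mentions.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Union


@dataclass
class Text:
    """A plain text segment of the interleaved response."""
    content: str


@dataclass
class ToolCall:
    """A planned tool invocation (hypothetical structure, not from the paper)."""
    tool: str      # which tool to invoke
    argument: str  # search query or generation prompt


Segment = Union[Text, ToolCall]


def render(plan: List[Segment], tools: Dict[str, Callable[[str], str]]) -> str:
    """Walk the plan in order, emitting text as-is and resolving tool calls."""
    parts = []
    for seg in plan:
        if isinstance(seg, Text):
            parts.append(seg.content)
        else:
            if seg.tool not in tools:
                raise KeyError(f"unknown tool: {seg.tool}")
            parts.append(tools[seg.tool](seg.argument))
    return "\n".join(parts)


# Stub tools standing in for real retrieval and generation back ends (assumptions).
tools = {
    "retrieve": lambda query: f"[retrieved image: {query}]",
    "generate": lambda prompt: f"[generated image: {prompt}]",
}

# The position of each ToolCall encodes "where"; its tool field encodes "which".
plan = [
    Text("The Eiffel Tower looks like this:"),
    ToolCall("retrieve", "Eiffel Tower photo"),
    Text("A stylized poster version:"),
    ToolCall("generate", "Eiffel Tower, art deco poster"),
]

output = render(plan, tools)
print(output)
```

In this sketch the planning decision itself (when a visual is needed at all) is assumed to have been made by the model that produced `plan`; the renderer only executes it.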
