ToolScope: An Agentic Framework for Vision-Guided and Long-Horizon Tool Use
Mengjie Deng, Guanting Dong, Zhicheng Dou

TL;DR
ToolScope is a novel framework that enhances multimodal large language models' ability to perform long-horizon visual question answering by integrating global planning with local perception tools, leading to improved accuracy.
Contribution
It introduces ToolScope, a unified agentic framework combining global planning and local perception tools for better multimodal reasoning in long-horizon tasks.
Findings
Achieves a performance improvement of up to 6.69% across benchmarks.
Demonstrates strong generalization across diverse domains.
Effectively mitigates visual context degradation in VQA tasks.
Abstract
Recently, large language models (LLMs) have demonstrated remarkable problem-solving capabilities by autonomously integrating with external tools for collaborative reasoning. However, due to the inherently complex and diverse nature of multimodal information, enabling multimodal large language models (MLLMs) to flexibly and efficiently utilize external tools during reasoning remains an underexplored challenge. In this work, we introduce ToolScope, an agentic framework designed to unify global planning with local multimodal perception, adopting a specialized Perceive tool to mitigate visual context degradation in long-horizon VQA tasks. ToolScope comprises three primary components: the Global Navigator, the Agentic Executor, and the Response Synthesizer. The Global Navigator functions as a "telescope", offering high-level strategic guidance. The Agentic Executor operates iteratively to…
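The three-component structure named in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the abstract is truncated before it describes what the Agentic Executor does, so every function name, signature, and behavior below is a hypothetical stand-in. The "image" is a plain dictionary, and the Perceive tool is a simple lookup that only shows where a re-inspection of the visual input might sit in the loop.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Trace:
    """Record of one run: the plan, each tool observation, and the final answer."""
    plan: List[str] = field(default_factory=list)
    steps: List[str] = field(default_factory=list)
    answer: str = ""


def global_navigator(question: str) -> List[str]:
    # Hypothetical stand-in: a real navigator would presumably prompt an MLLM
    # for high-level guidance. Here the plan is fixed.
    return ["perceive", "finish"]


def perceive(image: Dict[str, str], query: str) -> str:
    # Hypothetical Perceive tool: looks the query up in a toy "image" dict,
    # standing in for re-examining the visual input at each step.
    return image.get(query, "unknown")


def agentic_executor(
    question: str,
    image: Dict[str, str],
    plan: List[str],
    tools: Dict[str, Callable[[Dict[str, str], str], str]],
    max_steps: int = 4,
) -> List[str]:
    # Assumed behavior: walk the plan one action at a time, call the matching
    # tool, and collect its observation until "finish" or the step budget.
    observations: List[str] = []
    for action in plan[:max_steps]:
        if action == "finish":
            break
        observations.append(f"{action}: {tools[action](image, question)}")
    return observations


def response_synthesizer(question: str, observations: List[str]) -> str:
    # Assumed behavior: turn the collected observations into one answer.
    # This toy version returns the payload of the last observation.
    if not observations:
        return "no evidence gathered"
    return observations[-1].split(": ", 1)[1]


def run(question: str, image: Dict[str, str]) -> Trace:
    trace = Trace(plan=global_navigator(question))
    trace.steps = agentic_executor(question, image, trace.plan, {"perceive": perceive})
    trace.answer = response_synthesizer(question, trace.steps)
    return trace


if __name__ == "__main__":
    toy_image = {"color of the car": "red"}
    print(run("color of the car", toy_image).answer)
```

Keeping the planner, the tool loop, and the answer composer as separate functions mirrors the division of roles the abstract names. In a real system each stub would be replaced by a model call.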
Taxonomy
Topics: Multimodal Machine Learning Applications · Topic Modeling · Advanced Graph Neural Networks
