ToolVQA: A Dataset for Multi-step Reasoning VQA with External Tools
Shaofeng Yin, Ting Lei, Yang Liu

TL;DR
ToolVQA introduces a large, real-world multimodal dataset with multi-step reasoning tasks and diverse tools, enabling better evaluation and training of foundation models for practical tool-augmented VQA.
Contribution
We present ToolVQA, a new dataset with real-world contexts and multi-step reasoning, along with ToolEngine, a data generation pipeline for simulating human-like tool use in multimodal tasks.
Findings
Fine-tuned 7B LFMs perform well on the ToolVQA test set.
The fine-tuned models also surpass GPT-3.5-turbo on out-of-distribution datasets.
ToolVQA bridges the gap between synthetic datasets and real-world applications.
Abstract
Integrating external tools into Large Foundation Models (LFMs) has emerged as a promising approach to enhance their problem-solving capabilities. While existing studies have demonstrated strong performance in tool-augmented Visual Question Answering (VQA), recent benchmarks reveal significant gaps in real-world tool-use proficiency, particularly in functionally diverse multimodal settings requiring multi-step reasoning. In this work, we introduce ToolVQA, a large-scale multimodal dataset comprising 23K instances, designed to bridge this gap. Unlike previous datasets that rely on synthetic scenarios and simplified queries, ToolVQA features real-world visual contexts and challenging implicit multi-step reasoning tasks, better aligning with real user interactions. To construct this dataset, we propose ToolEngine, a novel data generation pipeline that employs Depth-First Search (DFS) with a…
