ToolVQA: A Dataset for Multi-step Reasoning VQA with External Tools

Shaofeng Yin; Ting Lei; Yang Liu

arXiv:2508.03284 · cs.AI · March 5, 2026

TL;DR

ToolVQA introduces a large, real-world multimodal dataset with multi-step reasoning tasks and diverse tools, enabling better evaluation and training of foundation models for practical tool-augmented VQA.

Contribution

We present ToolVQA, a new dataset with real-world contexts and multi-step reasoning, along with ToolEngine, a data generation pipeline for simulating human-like tool use in multimodal tasks.

Findings

1. 7B LFMs fine-tuned on ToolVQA perform well on the ToolVQA test set.
2. The fine-tuned models also surpass GPT-3.5-turbo on out-of-distribution datasets.
3. ToolVQA bridges the gap between synthetic datasets and real-world applications.

Abstract

Integrating external tools into Large Foundation Models (LFMs) has emerged as a promising approach to enhance their problem-solving capabilities. While existing studies have demonstrated strong performance in tool-augmented Visual Question Answering (VQA), recent benchmarks reveal significant gaps in real-world tool-use proficiency, particularly in functionally diverse multimodal settings requiring multi-step reasoning. In this work, we introduce ToolVQA, a large-scale multimodal dataset comprising 23K instances, designed to bridge this gap. Unlike previous datasets that rely on synthetic scenarios and simplified queries, ToolVQA features real-world visual contexts and challenging implicit multi-step reasoning tasks, better aligning with real user interactions. To construct this dataset, we propose ToolEngine, a novel data generation pipeline that employs Depth-First Search (DFS) with a…

Code & Models

Datasets

DietCoke4671/ToolVQA (dataset, 1.7k downloads)
