FlowVQA: Mapping Multimodal Logic in Visual Question Answering with Flowcharts
Shubhankar Singh, Purvi Chaurasia, Yerram Varun, Pranshu Pandya, Vatsal Gupta, Vivek Gupta, Dan Roth

TL;DR
FlowVQA is a new benchmark designed to evaluate multimodal language models' reasoning abilities using flowcharts, addressing gaps in existing visual question answering benchmarks related to spatial reasoning and complexity.
Contribution
The paper introduces FlowVQA, a comprehensive dataset with flowchart images and questions, and provides baseline evaluations to advance multimodal reasoning research.
Findings
Baseline models show room for improvement in reasoning tasks.
FlowVQA effectively challenges models in spatial and logical reasoning.
The benchmark reveals directional biases in current models.
Abstract
Existing benchmarks for visual question answering fall short in visual grounding and complexity, particularly in evaluating spatial reasoning skills. We introduce FlowVQA, a novel benchmark aimed at assessing the capabilities of visual question-answering multimodal language models in reasoning with flowcharts as visual contexts. FlowVQA comprises 2,272 carefully generated and human-verified flowchart images from three distinct content sources, along with 22,413 diverse question-answer pairs, to test a spectrum of reasoning tasks, including information localization, decision-making, and logical progression. We conduct a thorough baseline evaluation on a suite of both open-source and proprietary multimodal language models using various strategies, followed by an analysis of directional bias. The results underscore the benchmark's potential as a vital tool for advancing the field of multimodal…
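To make the three reasoning tasks named in the abstract more concrete, the following is a hypothetical illustration, not the authors' data format or pipeline: FlowVQA presents flowcharts as images, and this summary does not say how their underlying logic is stored. The sketch assumes a flowchart can be modeled as a directed graph whose decision nodes have labeled branches ("Yes"/"No"). Under that assumption, information localization and decision-making roughly correspond to following a single edge, and logical progression to tracing a full path. The class `Flowchart` and its methods `add_edge`, `next_node`, and `walk` are invented names used only for this example.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Flowchart:
    """Hypothetical graph view of a flowchart (not FlowVQA's format).

    labels: node id -> text shown in the box
    edges:  node id -> {branch label -> next node id}; "" marks an
            unlabeled edge, "Yes"/"No" mark decision branches.
    """
    labels: dict[str, str]
    edges: dict[str, dict[str, str]] = field(default_factory=dict)

    def add_edge(self, src: str, dst: str, branch: str = "") -> None:
        # Register a directed edge src -> dst under the given branch label.
        self.edges.setdefault(src, {})[branch] = dst

    def next_node(self, node: str, branch: str = "") -> str | None:
        # One-step lookup, e.g. "What happens if the answer is Yes?"
        return self.edges.get(node, {}).get(branch)

    def walk(self, start: str, decisions: dict[str, str]) -> list[str]:
        # Trace a path from `start`, taking the branch listed in `decisions`
        # at each decision node (unlabeled edge otherwise). Stops at a node
        # with no matching outgoing edge or if a cycle would be re-entered.
        path = [start]
        seen = {start}
        node = start
        while node in self.edges:
            nxt = self.edges[node].get(decisions.get(node, ""))
            if nxt is None or nxt in seen:
                break
            path.append(nxt)
            seen.add(nxt)
            node = nxt
        return path


# Usage on a tiny, made-up flowchart.
fc = Flowchart(labels={
    "A": "Start",
    "B": "Is the input valid?",
    "C": "Process input",
    "D": "Show error",
    "E": "End",
})
fc.add_edge("A", "B")
fc.add_edge("B", "C", "Yes")
fc.add_edge("B", "D", "No")
fc.add_edge("C", "E")
fc.add_edge("D", "E")

path = fc.walk("A", {"B": "No"})
print([fc.labels[n] for n in path])
# ['Start', 'Is the input valid?', 'Show error', 'End']
```

A model answering FlowVQA questions must recover this kind of structure from pixels alone (box positions, arrow directions, branch labels). That is one plausible reason the benchmark stresses spatial reasoning and can expose directional bias.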
Taxonomy
Topics: Semantic Web and Ontologies · Natural Language Processing Techniques · Advanced Text Analysis Techniques
