What Lies Beneath: A Call for Distribution-based Visual Question & Answer Datasets
Jill P. Naiman, Daniel J. Evans, JooYoung Seo

TL;DR
This paper advocates for a new distribution-based VQA benchmark for scientific charts, one that tests understanding of the data underlying a chart rather than its surface-level visual features, and provides a synthetic dataset for research.
Contribution
The paper introduces a VQA dataset for scientific charts that includes the underlying data, addressing a limitation of existing chart datasets, which either omit that data or assume a 1-to-1 correspondence between chart marks and data.
Findings
Generated synthetic histogram charts with ground truth data
Human and model question answering on charts where the answers require access to the underlying data
Open-source dataset with figures, data, and annotations
Abstract
Visual Question Answering (VQA) has become an important benchmark for assessing how large multimodal models (LMMs) interpret images. However, most VQA datasets focus on real-world images or simple diagrammatic analysis, with few focused on interpreting complex scientific charts. Indeed, many VQA datasets that analyze charts do not contain the underlying data behind those charts or assume a 1-to-1 correspondence between chart marks and underlying data. In reality, charts are transformations (i.e. analysis, simplification, modification) of data. This distinction introduces a reasoning challenge in VQA that the current datasets do not capture. In this paper, we argue for a dedicated VQA benchmark for scientific charts where there is no 1-to-1 correspondence between chart marks and underlying data. To do so, we survey existing VQA datasets and highlight limitations of the current field. We…
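To make the "no 1-to-1 correspondence" point concrete, here is a minimal sketch in Python (standard library only). It is not the authors' generation pipeline: the Gaussian distribution, the 500 samples, the 8 bins, and the helper names `bin_index`, `make_histogram`, and `median_from_histogram` are all assumptions chosen for illustration. It shows that two different datasets can produce the same histogram bars, and that a statistic such as the median can only be approximated from the bars.

```python
import random
import statistics


def bin_index(x, lo, width, n_bins):
    """Return the index of the equal-width bin containing x (last bin is closed on the right)."""
    return min(int((x - lo) / width), n_bins - 1)


def make_histogram(samples, n_bins, lo, hi):
    """Bin samples into n_bins equal-width bins on [lo, hi]; return (edges, counts)."""
    width = (hi - lo) / n_bins
    edges = [lo + i * width for i in range(n_bins + 1)]
    counts = [0] * n_bins
    for x in samples:
        counts[bin_index(x, lo, width, n_bins)] += 1
    return edges, counts


def median_from_histogram(edges, counts):
    """Estimate the median from the bars alone, interpolating linearly inside the median bin."""
    half = sum(counts) / 2
    cumulative = 0
    for i, c in enumerate(counts):
        if c > 0 and cumulative + c >= half:
            frac = (half - cumulative) / c
            return edges[i] + frac * (edges[i + 1] - edges[i])
        cumulative += c
    raise ValueError("empty histogram")


# Hypothetical underlying data: 500 draws from a standard normal distribution.
rng = random.Random(0)
samples = [rng.gauss(0.0, 1.0) for _ in range(500)]
lo, hi = min(samples), max(samples)
n_bins = 8

edges, counts = make_histogram(samples, n_bins, lo, hi)
bin_width = edges[1] - edges[0]

# A second, different dataset: every sample moved to the center of its bin.
collapsed = [
    edges[bin_index(x, lo, bin_width, n_bins)] + bin_width / 2 for x in samples
]
_, collapsed_counts = make_histogram(collapsed, n_bins, lo, hi)

true_median = statistics.median(samples)
chart_median = median_from_histogram(edges, counts)

print("bars identical for both datasets:", collapsed_counts == counts)
print("median from data:", true_median)
print("median from bars:", chart_median)
```

The 8 bars summarize 500 values, so a question about the underlying distribution can be answered from the chart only to within roughly one bin width; a precise answer needs the data itself.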
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Topic Modeling
