VeriSciQA: An Auto-Verified Dataset for Scientific Visual Question Answering
Yuyi Li, Daoyuan Chen, Zhen Wang, Yutong Lu, and Yaliang Li

TL;DR
VeriSciQA is a large-scale dataset for scientific visual question answering (SVQA). Its question-answer pairs are generated from the paragraphs that cite each figure and then verified against the figure itself, which filters out erroneous pairs.
Contribution
The paper introduces VeriSciQA, an SVQA dataset built with a cross-modal verification framework that filters out erroneous QA pairs, improving data quality for scientific visual reasoning tasks.
Findings
Models fine-tuned on VeriSciQA outperform those trained on previous datasets.
There is a significant accuracy gap between open-source and proprietary models on SVQA.
Scaling data with VeriSciQA enhances model performance on SVQA benchmarks.
Abstract
Large Vision-Language Models (LVLMs) show promise for scientific applications, yet open-source models still struggle with Scientific Visual Question Answering (SVQA), namely answering questions about figures from scientific papers. A key bottleneck is the lack of public, large-scale, high-quality SVQA datasets. Although recent work uses LVLMs to synthesize data at scale, we identify systematic errors in their resulting QA pairs, stemming from LVLMs' inherent limitations and information asymmetry between figures and text. To address these challenges, we propose a Cross-Modal verification framework that generates questions and answers purely from figure-citing paragraphs, then verifies them against the figures themselves, leveraging the inherent text-figure alignment in scientific papers to filter out erroneous QA pairs. We instantiate this framework to curate VeriSciQA, a dataset of…
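The generate-then-verify idea in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `FigureRecord`, `QAPair`, and `curate_verified_pairs` are hypothetical, and the three callables stand in for model calls that the abstract does not specify. In particular, checking a pair by answering the question from the figure and comparing the two answers is an assumed form of verification; the paper's actual procedure may differ.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Tuple


@dataclass(frozen=True)
class QAPair:
    question: str
    answer: str


@dataclass(frozen=True)
class FigureRecord:
    figure_id: str         # hypothetical handle for the figure image
    citing_paragraph: str  # paragraph in the paper that cites the figure


def curate_verified_pairs(
    records: Iterable[FigureRecord],
    generate_from_text: Callable[[str], List[QAPair]],
    answer_from_figure: Callable[[str, str], str],
    answers_agree: Callable[[str, str], bool],
) -> List[Tuple[str, QAPair]]:
    """Keep only QA pairs whose text-derived answer is confirmed by the figure.

    All three callables are placeholders (assumptions) for model calls:
      generate_from_text: citing paragraph -> candidate QA pairs (text only)
      answer_from_figure: (figure_id, question) -> answer read from the figure
      answers_agree: decides whether two answers match
    """
    kept: List[Tuple[str, QAPair]] = []
    for record in records:
        # Step 1: generate candidates from the citing paragraph alone.
        for qa in generate_from_text(record.citing_paragraph):
            # Step 2: check the candidate against the figure itself.
            figure_answer = answer_from_figure(record.figure_id, qa.question)
            # Step 3: discard pairs the figure does not support.
            if answers_agree(figure_answer, qa.answer):
                kept.append((record.figure_id, qa))
    return kept
```

Passing the model calls in as plain functions keeps the sketch independent of any particular LVLM API and makes the filtering logic easy to exercise with stubs.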
Taxonomy
Topics: Multimodal Machine Learning Applications · Topic Modeling · Advanced Image and Video Retrieval Techniques
