Multimodal QUD: Inquisitive Questions from Scientific Figures
Yating Wu, William Rudman, Venkata S Govindarajan, Alexandros G. Dimakis, Junyi Jessy Li

TL;DR
This paper introduces MQUD, a dataset of scientific papers with multimodal inquisitive questions, and demonstrates that fine-tuning vision-language models on it enhances their ability to generate high-level, context-aware questions involving both figures and text.
Contribution
The paper extends the linguistic theory of Questions Under Discussion to multimodal scientific discourse and creates a dataset with author-annotated questions to improve multimodal reasoning in models.
Findings
Fine-tuning on MQUD shifts models from generic questions to content-specific, high-level reasoning.
Models trained on MQUD generate more visually grounded and context-aware questions.
The dataset enables better understanding of scientific figures in conjunction with text.
Abstract
Asking inquisitive questions while reading, and looking for their answers, is an important part of human discourse comprehension, curiosity, and creative ideation, and prior work has investigated this in text-only scenarios. However, in scientific or research papers, many of the critical takeaways are conveyed through both figures and the text that analyzes them. While scientific visualizations have been used to evaluate the capabilities of Vision-Language Models (VLMs), current benchmarks are limited to questions that focus simply on extracting information from the visualizations. Such questions require only lower-level reasoning, do not take into account the context in which a figure appears, and do not reflect the communicative goals the authors wish to achieve. We generate inquisitive questions that reach the depth of questions humans generate when engaging with scientific papers, conditioned on both…
