Enhancing Scientific Visual Question Answering through Multimodal Reasoning and Ensemble Modeling
Prahitha Movva, Naga Harshita Marupaka

TL;DR
This paper advances scientific visual question answering by developing models with multimodal reasoning, ensemble techniques, and prompt optimization, significantly improving accuracy on scholarly figures and data interpretation tasks.
Contribution
The paper introduces a novel ensemble approach with multimodal reasoning and prompt optimization for scientific VQA, achieving state-of-the-art results on the SciVQA 2025 shared task.
Findings
InternVL3, the strongest individual model, achieved ROUGE-1 and ROUGE-L F1 scores of 0.740, along with a high BERTScore.
Ensemble models improved performance over individual models.
Prompt optimization and chain-of-thought reasoning enhanced VQA accuracy.
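The summary above does not say how the ensemble combines the outputs of its member models. As one illustrative possibility only, the sketch below applies a simple majority vote over the answer strings returned by several models. The function names `normalize` and `majority_vote`, the normalization rule, and the tie-breaking policy are assumptions made for this example, not the authors' method.

```python
from collections import Counter


def normalize(answer: str) -> str:
    """Lowercase and collapse whitespace so trivially different strings match."""
    return " ".join(answer.lower().split())


def majority_vote(answers: list[str]) -> str:
    """Return the answer whose normalized form is most common.

    Hypothetical tie-break: the earliest answer in the list wins, so callers
    can order models from strongest to weakest.
    """
    if not answers:
        raise ValueError("need at least one answer")
    normalized = [normalize(a) for a in answers]
    counts = Counter(normalized)
    best = max(counts.values())
    # Walk the answers in their original order to apply the tie-break.
    for original, norm in zip(answers, normalized):
        if counts[norm] == best:
            return original
    raise AssertionError("unreachable: some answer always has the top count")


if __name__ == "__main__":
    # Three hypothetical model outputs for one question about a figure.
    print(majority_vote(["Blue line", "blue  line", "Red line"]))
```

Putting the strongest individual model first in the list means that when every model disagrees, the ensemble falls back to that model's answer.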
Abstract
Technical reports and articles often contain valuable information in the form of semi-structured data such as charts and figures. Interpreting these and using the information they contain is essential for downstream tasks such as question answering (QA). Current approaches to visual question answering often struggle with the precision required for scientific data interpretation, particularly in handling numerical values, multi-step reasoning over visual elements, and maintaining consistency between visual observation and textual reasoning. We present our approach to the SciVQA 2025 shared task, focusing on answering visual and non-visual questions grounded in scientific figures from scholarly articles. We conducted a series of experiments using models with 5B to 8B parameters. Our strongest individual model, InternVL3, achieved ROUGE-1 and ROUGE-L F1 scores of 0.740 and a…
Taxonomy
Topics: Multimodal Machine Learning Applications · Topic Modeling · Advanced Image and Video Retrieval Techniques
