Enhancing Scientific Visual Question Answering through Multimodal Reasoning and Ensemble Modeling

Prahitha Movva; Naga Harshita Marupaka

arXiv:2507.06183·cs.CV·July 9, 2025

Open Access

TL;DR

This paper advances scientific visual question answering by developing models with multimodal reasoning, ensemble techniques, and prompt optimization, significantly improving accuracy on scholarly figures and data interpretation tasks.

Contribution

The paper introduces a novel ensemble approach with multimodal reasoning and prompt optimization for scientific VQA, achieving state-of-the-art results on the SciVQA 2025 shared task.

Findings

01

InternVL3 achieved high ROUGE and BERTScore metrics.

02

Ensemble models improved performance over individual models.

03

Prompt optimization and chain-of-thought reasoning enhanced VQA accuracy.
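
This summary does not say how the ensemble combines the individual models' answers. As an illustration only, the sketch below uses plain majority voting over normalized answer strings, which is one common way to ensemble QA models and is an assumption here, not the authors' stated method. The function names `normalize` and `majority_vote` are hypothetical.

```python
from collections import Counter


def normalize(answer: str) -> str:
    """Lowercase, drop trailing periods, and collapse whitespace so that
    trivially different spellings of the same answer are counted together."""
    return " ".join(answer.lower().strip().rstrip(".").split())


def majority_vote(answers: list[str]) -> str:
    """Return the answer that most models agree on.

    `answers` holds one answer per model, in a fixed model order.
    Ties are broken in favor of the earliest model in the list.
    """
    if not answers:
        raise ValueError("need at least one answer")
    normalized = [normalize(a) for a in answers]
    counts = Counter(normalized)
    best = max(counts.values())
    # Walk the answers in model order and return the first one that
    # belongs to a most-frequent group (original spelling preserved).
    for original, norm in zip(answers, normalized):
        if counts[norm] == best:
            return original
    raise AssertionError("unreachable: some answer always has the top count")
```

With this tie-breaking rule, placing the strongest individual model first in the list makes it the fallback whenever the models disagree completely.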

Abstract

Technical reports and articles often contain valuable information in the form of semi-structured data such as charts and figures. Interpreting these and using the information from them is essential for downstream tasks such as question answering (QA). Current approaches to visual question answering often struggle with the precision required for scientific data interpretation, particularly in handling numerical values, multi-step reasoning over visual elements, and maintaining consistency between visual observation and textual reasoning. We present our approach to the SciVQA 2025 shared task, focusing on answering visual and non-visual questions grounded in scientific figures from scholarly articles. We conducted a series of experiments using models with 5B to 8B parameters. Our strongest individual model, InternVL3, achieved ROUGE-1 and ROUGE-L F1 scores of 0.740 and a…
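
For readers unfamiliar with the metrics named in the abstract, the sketch below shows how ROUGE-1 F1 (unigram overlap) and ROUGE-L F1 (longest common subsequence) can be computed for a single predicted answer against a single reference. It is a simplified illustration that assumes lowercase whitespace tokenization with no stemming; the shared task's official scorer may tokenize and normalize differently, and the function names are hypothetical.

```python
from collections import Counter


def _f1(overlap: int, pred_len: int, ref_len: int) -> float:
    """Harmonic mean of precision (overlap / pred_len) and recall (overlap / ref_len)."""
    if overlap == 0 or pred_len == 0 or ref_len == 0:
        return 0.0
    precision = overlap / pred_len
    recall = overlap / ref_len
    return 2 * precision * recall / (precision + recall)


def rouge1_f1(prediction: str, reference: str) -> float:
    """ROUGE-1 F1: clipped unigram overlap between prediction and reference."""
    pred = prediction.lower().split()
    ref = reference.lower().split()
    # Counter intersection keeps the minimum count of each shared token.
    overlap = sum((Counter(pred) & Counter(ref)).values())
    return _f1(overlap, len(pred), len(ref))


def lcs_length(a: list[str], b: list[str]) -> int:
    """Length of the longest common subsequence, via row-by-row dynamic programming."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rougeL_f1(prediction: str, reference: str) -> float:
    """ROUGE-L F1: like ROUGE-1, but the overlap must respect token order."""
    pred = prediction.lower().split()
    ref = reference.lower().split()
    return _f1(lcs_length(pred, ref), len(pred), len(ref))
```

The two metrics differ only in whether word order matters: for the prediction "line blue" against the reference "blue line", `rouge1_f1` returns 1.0 while `rougeL_f1` returns 0.5.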

Taxonomy

Topics: Multimodal Machine Learning Applications · Topic Modeling · Advanced Image and Video Retrieval Techniques