MuSciClaims: Multimodal Scientific Claim Verification

Yash Kumar Lal; Manikanta Bandham; Mohammad Saqib Hasan; Apoorva Kashi; Mahnaz Koupaee; Niranjan Balasubramanian

arXiv:2506.04585 · cs.CL · July 31, 2025

TL;DR

MuSciClaims introduces a new benchmark for scientific claim verification using multimodal data, revealing current models' poor performance and highlighting key challenges in evidence localization and multimodal reasoning.

Contribution

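The paper presents MuSciClaims, a benchmark for evaluating multimodal scientific claim verification. It pairs supported claims automatically extracted from scientific articles with contradicted claims produced by manual perturbation, and adds a suite of diagnostic tasks for understanding model failures.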

Findings

01 Most vision-language models perform poorly (~0.3-0.5 F1)

02 Even the best model achieves only 0.72 F1

03 Models struggle with evidence localization and multimodal reasoning
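
The F1 figures above summarize how well predicted claim labels match gold labels. The sketch below shows one common way such a score can be computed. It assumes macro-averaged F1 over two hypothetical label names, SUPPORT and CONTRADICT; the paper's actual label set and averaging scheme are not stated on this page, and the toy gold and predicted lists are invented for illustration only.

```python
def per_class_f1(gold, pred, label):
    """F1 for a single label, treating that label as the positive class."""
    tp = sum(1 for g, p in zip(gold, pred) if g == label and p == label)
    fp = sum(1 for g, p in zip(gold, pred) if g != label and p == label)
    fn = sum(1 for g, p in zip(gold, pred) if g == label and p != label)
    if tp == 0:
        # No true positives: precision and recall are zero or undefined.
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def macro_f1(gold, pred, labels):
    """Unweighted mean of per-class F1 (macro averaging is an assumption here)."""
    return sum(per_class_f1(gold, pred, label) for label in labels) / len(labels)


# Hypothetical label names and toy data, not taken from the benchmark.
LABELS = ["SUPPORT", "CONTRADICT"]
gold = ["SUPPORT", "SUPPORT", "CONTRADICT", "CONTRADICT"]
pred = ["SUPPORT", "CONTRADICT", "CONTRADICT", "CONTRADICT"]

score = macro_f1(gold, pred, LABELS)
print(f"macro F1 = {score:.3f}")
```

Macro averaging weights each label equally, so a model that always predicts one label is penalized even when that label dominates the data.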

Abstract

Assessing scientific claims requires identifying, extracting, and reasoning with multimodal data expressed in information-rich figures in scientific literature. Despite the large body of work in scientific QA, figure captioning, and other multimodal reasoning tasks over chart-based data, there are no readily usable multimodal benchmarks that directly test claim verification abilities. To remedy this gap, we introduce a new benchmark, MuSciClaims, accompanied by diagnostic tasks. We automatically extract supported claims from scientific articles, which we manually perturb to produce contradicted claims. The perturbations are designed to test for a specific set of claim verification capabilities. We also introduce a suite of diagnostic tasks that help understand model failures. Our results show most vision-language models are poor (~0.3-0.5 F1), with even the best model only achieving 0.72…

Code & Models

Datasets

StonyBrookNLP/MuSciClaims
dataset · 143 dl

Taxonomy

Topics: Multimodal Machine Learning Applications · Topic Modeling · Biomedical Text Mining and Ontologies

Methods: Sparse Evolutionary Training