SciVer: Evaluating Foundation Models for Multimodal Scientific Claim Verification

Chengye Wang; Yifei Shen; Zexi Kuang; Arman Cohan; Yilun Zhao

arXiv:2506.15569 · cs.CL · June 19, 2025

TL;DR

SciVer is a new benchmark for evaluating multimodal scientific claim verification, revealing significant gaps between current models and human experts and providing insights for future improvements.

Contribution

Introduces SciVer, the first specialized benchmark for multimodal scientific claim verification, with expert annotations and analysis of current model limitations.

Findings

01. Current models lag behind human performance on SciVer.

02. Expert-annotated evidence enables fine-grained evaluation.

03. Analysis highlights key limitations in existing multimodal models.

Abstract

We introduce SciVer, the first benchmark specifically designed to evaluate the ability of foundation models to verify claims within a multimodal scientific context. SciVer consists of 3,000 expert-annotated examples over 1,113 scientific papers, covering four subsets, each representing a common reasoning type in multimodal scientific claim verification. To enable fine-grained evaluation, each example includes expert-annotated supporting evidence. We assess the performance of 21 state-of-the-art multimodal foundation models, including o4-mini, Gemini-2.5-Flash, Llama-3.2-Vision, and Qwen2.5-VL. Our experiment reveals a substantial performance gap between these models and human experts on SciVer. Through an in-depth analysis of retrieval-augmented generation (RAG) and human-conducted error evaluations, we identify critical limitations in current open-source models, offering key insights…
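Because the benchmark is organized into four subsets, results are naturally reported per subset rather than as a single number. The following is a minimal sketch of that bookkeeping, not the authors' evaluation code. The field names `subset` and `label`, the subset names `subset_a` and `subset_b`, and the label strings `entailed` and `refuted` are hypothetical placeholders; the actual dataset schema may differ.

```python
from collections import defaultdict


def accuracy_by_subset(examples, predictions):
    """Return {subset: accuracy} for gold examples paired with predicted labels.

    `examples` is a list of dicts with hypothetical keys "subset" and "label";
    `predictions` is a list of predicted label strings in the same order.
    """
    if len(examples) != len(predictions):
        raise ValueError("examples and predictions must have the same length")

    correct = defaultdict(int)
    total = defaultdict(int)
    for example, predicted in zip(examples, predictions):
        total[example["subset"]] += 1
        if predicted == example["label"]:
            correct[example["subset"]] += 1

    # Only subsets that actually occur are reported, so no division by zero.
    return {subset: correct[subset] / total[subset] for subset in total}


# Toy data with placeholder subset names and labels (not taken from SciVer).
examples = [
    {"subset": "subset_a", "label": "entailed"},
    {"subset": "subset_a", "label": "refuted"},
    {"subset": "subset_b", "label": "refuted"},
]
predictions = ["entailed", "entailed", "refuted"]
scores = accuracy_by_subset(examples, predictions)
```

Keeping the per-subset counts separate makes it easy to see whether a model's errors concentrate in one reasoning type instead of being averaged away.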

Code & Models

Datasets

chengyewang/SciVer
dataset · 56 dl

Videos

SciVer: Evaluating Foundation Models for Multimodal Scientific Claim Verification

Taxonomy

Topics: Topic Modeling