Can AI Validate Science? Benchmarking LLMs for Accurate Scientific Claim $\rightarrow$ Evidence Reasoning

Shashidhar Reddy Javaji; Yupeng Cao; Haohang Li; Yangyang Yu; Nikhil Muralidhar; Zining Zhu

arXiv:2506.08235·cs.CL·June 11, 2025

TL;DR

This paper introduces CLAIM-BENCH, a benchmark for assessing large language models' ability to understand and validate scientific claims and evidence, revealing current limitations and strengths across different models and prompting strategies.

Contribution

The study presents CLAIM-BENCH, a novel benchmark for scientific claim-evidence reasoning in LLMs, and systematically evaluates multiple models and approaches, highlighting their capabilities and shortcomings.

Findings

01. GPT-4 and Claude outperform open-source models in accuracy.

02. Three-pass and one-by-one prompting improve claim-evidence linking.

03. Significant limitations remain in LLMs' understanding of complex scientific content.

Abstract

Large language models (LLMs) are increasingly used for complex research tasks such as literature review, idea generation, and scientific paper analysis, yet their ability to truly understand and process the intricate relationships within complex research papers, such as the logical links between claims and supporting evidence, remains largely unexplored. In this study, we present CLAIM-BENCH, a comprehensive benchmark for evaluating LLMs' capabilities in scientific claim-evidence extraction and validation, a task that reflects deeper comprehension of scientific argumentation. We systematically compare three approaches inspired by divide-and-conquer strategies across six diverse LLMs, highlighting model-specific strengths and weaknesses in scientific comprehension. Through evaluation involving over 300 claim-evidence pairs across multiple research domains, we reveal…
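
To make the claim-evidence linking task concrete, the following is a minimal sketch of how predicted links could be scored against annotated ones. It is an illustration only: the metric choice (set-based precision, recall, and F1), the `score_links` function, and the `c1`/`e1` style identifiers are all assumptions, and this page does not specify how CLAIM-BENCH itself computes its scores.

```python
from typing import Dict, Set, Tuple

# A link pairs a claim with one piece of supporting evidence.
# The (claim_id, evidence_id) representation is a hypothetical choice.
Link = Tuple[str, str]


def score_links(predicted: Set[Link], gold: Set[Link]) -> Dict[str, float]:
    """Set-based precision, recall, and F1 over claim-evidence links."""
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    denominator = precision + recall
    f1 = 2 * precision * recall / denominator if denominator else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


# Toy data, invented for illustration (not taken from the benchmark).
gold = {("c1", "e1"), ("c1", "e2"), ("c2", "e3")}
predicted = {("c1", "e1"), ("c2", "e3"), ("c2", "e4")}

scores = score_links(predicted, gold)
print(scores)
```

In this toy case two of the three predicted links match the annotations, and two of the three annotated links are recovered, so precision and recall are both 2/3. Exact-match scoring is the simplest option; a real evaluation might instead need fuzzy matching of extracted text spans.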

Code & Models

Repositories

shashidharjavaji/RC_BENCH (PyTorch, official)

Taxonomy

Topics: Artificial Intelligence in Healthcare and Education · Biomedical Text Mining and Ontologies · Topic Modeling

Methods: Absolute Position Encodings · Layer Normalization · Byte Pair Encoding · Label Smoothing · Softmax · Dropout · Dense Connections · Transformer · GPT-4