Can AI Validate Science? Benchmarking LLMs for Accurate Scientific Claim $\rightarrow$ Evidence Reasoning
Shashidhar Reddy Javaji, Yupeng Cao, Haohang Li, Yangyang Yu, Nikhil Muralidhar, Zining Zhu

TL;DR
This paper introduces CLAIM-BENCH, a benchmark for assessing large language models' ability to understand and validate scientific claims and evidence, revealing current limitations and strengths across different models and prompting strategies.
Contribution
The study presents CLAIM-BENCH, a novel benchmark for scientific claim-evidence reasoning in LLMs, and systematically evaluates multiple models and approaches, highlighting their capabilities and shortcomings.
Findings
GPT-4 and Claude outperform open-source models in accuracy.
Three-pass and one-by-one prompting improve claim-evidence linking.
Significant limitations remain in LLMs' understanding of complex scientific content.
Abstract
Large language models (LLMs) are increasingly used for complex research tasks such as literature review, idea generation, and scientific paper analysis, yet their ability to truly understand and process the intricate relationships within complex research papers, such as the logical links between claims and supporting evidence, remains largely unexplored. In this study, we present CLAIM-BENCH, a comprehensive benchmark for evaluating LLMs' capabilities in scientific claim-evidence extraction and validation, a task that reflects deeper comprehension of scientific argumentation. We systematically compare three approaches inspired by divide-and-conquer strategies across six diverse LLMs, highlighting model-specific strengths and weaknesses in scientific comprehension. Through evaluation involving over 300 claim-evidence pairs across multiple research domains, we reveal…
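To make the idea of scoring claim-evidence linking concrete, here is a minimal sketch of how predicted links could be compared against annotated ones. It assumes a set-based precision/recall/F1 over (claim, evidence) pairs; the truncated abstract does not state which metrics CLAIM-BENCH actually uses, and the function name `score_links` and the identifiers `c1`, `e1`, and so on are hypothetical.

```python
from typing import Dict, Set, Tuple

# A link is a (claim_id, evidence_id) pair. The identifiers are hypothetical
# placeholders, not taken from the benchmark's data format.
Pair = Tuple[str, str]


def score_links(predicted: Set[Pair], gold: Set[Pair]) -> Dict[str, float]:
    """Set-based precision, recall, and F1 over claim-evidence links.

    Assumption: a predicted link counts as correct only if the exact
    (claim, evidence) pair appears in the gold annotations.
    """
    true_pos = len(predicted & gold)
    precision = true_pos / len(predicted) if predicted else 0.0
    recall = true_pos / len(gold) if gold else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


# Toy example: two of three predicted links match the gold annotations.
gold = {("c1", "e1"), ("c1", "e2"), ("c2", "e3")}
predicted = {("c1", "e1"), ("c2", "e3"), ("c2", "e4")}
scores = score_links(predicted, gold)
print(scores)
```

Exact-match scoring is the simplest choice; a real evaluation of free-text extractions would likely need a softer matching rule for paraphrased claims or partially overlapping evidence spans.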
Taxonomy
Topics: Artificial Intelligence in Healthcare and Education · Biomedical Text Mining and Ontologies · Topic Modeling
Methods: Absolute Position Encodings · Layer Normalization · Byte Pair Encoding · Label Smoothing · Softmax · Dropout · Dense Connections · Transformer · GPT-4
