SciPredict: Can LLMs Predict the Outcomes of Scientific Experiments in Natural Sciences?

Udari Madhushani Sehwag; Elaine Lau; Haniyeh Ehsani Oskouie; Shayan Shabihi; Erich Liang; Andrea Toledo; Guillermo Mangialardi; Sergio Fonrouge; Ed-Yeremai Hernandez Cardona; Paula Vergara; Utkarsh Tyagi; Chen Bo Calvin Zhang; Pavi Bhatter; Nicholas Johnson; Furong Huang; Ernesto Gabriel Hernandez Montoya; Bing Liu

arXiv:2604.10718 · cs.AI · April 14, 2026

TL;DR

SciPredict is a benchmark for evaluating how well LLMs predict the outcomes of scientific experiments. It shows that current models are both inaccurate and unable to tell when their own predictions can be trusted.

Contribution

The paper presents SciPredict, a new benchmark with 405 tasks across physics, biology, and chemistry, to assess LLMs' predictive accuracy and reliability in scientific experiments.

Findings

01

Model accuracy ranges from 14% to 26%, comparable to the 20% achieved by human experts.

02

Models cannot reliably tell which of their own predictions are trustworthy and which are not.

03

Human experts' accuracy rises from 5% to 80% as they rate outcomes as more predictable.
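
The calibration pattern in the last finding can be checked by grouping predictions by how predictable the forecaster judged the outcome and computing accuracy within each group. The sketch below is a minimal illustration of that idea, not the paper's evaluation code: the function name `accuracy_by_rating`, the 1-to-5 rating scale, and the toy records are all assumptions made up for this example.

```python
from collections import defaultdict


def accuracy_by_rating(records):
    """Return accuracy per predictability rating.

    `records` is an iterable of (rating, is_correct) pairs, where `rating`
    is the forecaster's own judgment of how predictable the outcome is
    (hypothetical 1-5 scale) and `is_correct` is whether the prediction
    matched the experimental result.
    """
    totals = defaultdict(int)
    hits = defaultdict(int)
    for rating, is_correct in records:
        totals[rating] += 1
        hits[rating] += int(is_correct)
    # Sort by rating so the trend across ratings is easy to read.
    return {rating: hits[rating] / totals[rating] for rating in sorted(totals)}


# Toy, made-up records purely for illustration (NOT data from the paper).
toy_records = [
    (1, False), (1, False), (1, False), (1, True),
    (3, True), (3, False),
    (5, True), (5, True), (5, True), (5, False),
]

result = accuracy_by_rating(toy_records)
print(result)
```

A well-calibrated forecaster shows accuracy climbing with the rating, as in the toy data; a forecaster whose accuracy stays flat across ratings gives no usable signal about which predictions to trust.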

Abstract

Accelerating scientific discovery requires the identification of which experiments would yield the best outcomes before committing resources to costly physical validation. While existing benchmarks evaluate LLMs on scientific knowledge and reasoning, their ability to predict experimental outcomes - a task where AI could significantly exceed human capabilities - remains largely underexplored. We introduce SciPredict, a benchmark comprising 405 tasks derived from recent empirical studies in 33 specialized sub-fields of physics, biology, and chemistry. SciPredict addresses two critical questions: (a) can LLMs predict the outcome of scientific experiments with sufficient accuracy? and (b) can such predictions be reliably used in the scientific research process? Evaluations reveal fundamental limitations on both fronts. Model accuracies are 14-26% and human expert performance is…

Code & Models

Repositories

scaleapi/scipredict (GitHub)
