AI scientists produce results without reasoning scientifically

Martiño Ríos-García; Nawaf Alampara; Chandan Gupta; Indrajeet Mandal; Sajid Mannan; Ali Asghar Aghajani; N. M. Anoop Krishnan; Kevin Maik Jablonka

arXiv:2604.18805 · cs.AI · April 22, 2026

TL;DR

This study evaluates whether large language model-based scientific agents follow epistemic norms of scientific reasoning, finding they largely do not, despite executing workflows successfully.

Contribution

The paper analyzes more than 25,000 agent runs across eight domains and shows that LLM-based agents lack key epistemic reasoning patterns, which highlights the need to target reasoning itself in training.

Findings

01

The base model accounts for 41.4% of explained variance in performance and behavior, versus 1.5% for the agent scaffold.

02

Evidence is ignored in 68% of traces across all configurations.

03

Refutation-driven belief revision occurs in 26% of cases.

Abstract

Large language model (LLM)-based systems are increasingly deployed to conduct scientific research autonomously, yet whether their reasoning adheres to the epistemic norms that make scientific inquiry self-correcting is poorly understood. Here, we evaluate LLM-based scientific agents across eight domains, spanning workflow execution to hypothesis-driven inquiry, through more than 25,000 agent runs and two complementary lenses: (i) a systematic performance analysis that decomposes the contributions of the base model and the agent scaffold, and (ii) a behavioral analysis of the epistemological structure of agent reasoning. We observe that the base model is the primary determinant of both performance and behavior, accounting for 41.4% of explained variance versus 1.5% for the scaffold. Across all configurations, evidence is ignored in 68% of traces, refutation-driven belief revision occurs…
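
To make the "share of explained variance" figures in the abstract concrete, here is a minimal sketch of one common way to split variance between two factors: a balanced two-way sum-of-squares decomposition. This is an illustrative assumption, not necessarily the method the authors used. The score table, the names `model_a` and `scaffold_x`, and the function `variance_shares` are all hypothetical, and the numbers are made up rather than taken from the paper.

```python
from statistics import mean

# Hypothetical toy scores (made-up numbers, NOT results from the paper).
# Outer keys are base models, inner keys are agent scaffolds.
scores = {
    "model_a": {"scaffold_x": 0.80, "scaffold_y": 0.82},
    "model_b": {"scaffold_x": 0.50, "scaffold_y": 0.54},
    "model_c": {"scaffold_x": 0.30, "scaffold_y": 0.28},
}


def variance_shares(table):
    """Split total variance of a balanced model-by-scaffold table.

    Returns the fraction of the total sum of squares attributable to
    the base model, to the scaffold, and to what is left over
    (interaction plus noise in this one-score-per-cell layout).
    """
    models = list(table)
    scaffolds = list(next(iter(table.values())))
    cells = [table[m][s] for m in models for s in scaffolds]
    grand = mean(cells)

    # Total sum of squared deviations from the grand mean.
    ss_total = sum((v - grand) ** 2 for v in cells)

    # Between-model sum of squares: each model mean covers len(scaffolds) cells.
    ss_model = len(scaffolds) * sum(
        (mean(table[m].values()) - grand) ** 2 for m in models
    )

    # Between-scaffold sum of squares: each scaffold mean covers len(models) cells.
    ss_scaffold = len(models) * sum(
        (mean(table[m][s] for m in models) - grand) ** 2 for s in scaffolds
    )

    return {
        "base_model": ss_model / ss_total,
        "scaffold": ss_scaffold / ss_total,
        "residual": (ss_total - ss_model - ss_scaffold) / ss_total,
    }


shares = variance_shares(scores)
for factor, share in shares.items():
    print(f"{factor}: {share:.1%}")
```

In this toy table the scores change far more between models than between scaffolds, so the base-model share dominates. That is the qualitative pattern the abstract reports, although the paper's actual percentages come from its own data and analysis.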

Code & Models

Models

🤗 SofiTesfay2010/scientific-reasoning-training (model)
