AI scientists produce results without reasoning scientifically
Martiño Ríos-García, Nawaf Alampara, Chandan Gupta, Indrajeet Mandal, Sajid Mannan, Ali Asghar Aghajani, N. M. Anoop Krishnan, Kevin Maik Jablonka

TL;DR
This study evaluates whether large language model-based scientific agents follow epistemic norms of scientific reasoning, finding they largely do not, despite executing workflows successfully.
Contribution
It provides a comprehensive analysis showing LLM-based agents lack key epistemic reasoning patterns, highlighting the need to target reasoning in training.
Findings
The base model accounts for 41.4% of explained variance in performance and behavior, versus 1.5% for the agent scaffold.
Evidence is ignored in 68% of traces across all configurations.
Refutation-driven belief revision occurs in 26% of cases.
Abstract
Large language model (LLM)-based systems are increasingly deployed to conduct scientific research autonomously, yet whether their reasoning adheres to the epistemic norms that make scientific inquiry self-correcting is poorly understood. Here, we evaluate LLM-based scientific agents across eight domains, spanning workflow execution to hypothesis-driven inquiry, through more than 25,000 agent runs and two complementary lenses: (i) a systematic performance analysis that decomposes the contributions of the base model and the agent scaffold, and (ii) a behavioral analysis of the epistemological structure of agent reasoning. We observe that the base model is the primary determinant of both performance and behavior, accounting for 41.4% of explained variance versus 1.5% for the scaffold. Across all configurations, evidence is ignored in 68% of traces, refutation-driven belief revision occurs…
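The decomposition of base-model and scaffold contributions mentioned in the abstract can be illustrated with a simple two-way sum-of-squares split. The sketch below is not the authors' method, which the abstract does not specify. It assumes a balanced grid with one score per (model, scaffold) pair, and the function name `variance_shares`, the labels, and the toy scores are all hypothetical.

```python
from statistics import mean


def variance_shares(scores):
    """Split total variance into model, scaffold, and residual shares.

    `scores` maps (model, scaffold) -> score and must contain exactly one
    value for every model/scaffold combination (a balanced design).
    This is an illustrative assumption, not the paper's procedure.
    """
    models = sorted({m for m, _ in scores})
    scaffolds = sorted({s for _, s in scores})
    values = list(scores.values())
    grand = mean(values)

    # Total sum of squares around the grand mean.
    ss_total = sum((v - grand) ** 2 for v in values)

    # Main-effect sums of squares from the marginal means.
    model_means = [mean(scores[(m, s)] for s in scaffolds) for m in models]
    scaffold_means = [mean(scores[(m, s)] for m in models) for s in scaffolds]
    ss_model = len(scaffolds) * sum((mu - grand) ** 2 for mu in model_means)
    ss_scaffold = len(models) * sum((mu - grand) ** 2 for mu in scaffold_means)

    return {
        "model": ss_model / ss_total,
        "scaffold": ss_scaffold / ss_total,
        # With one value per cell, the remainder is interaction plus noise.
        "residual": (ss_total - ss_model - ss_scaffold) / ss_total,
    }


# Hypothetical toy scores, invented purely to exercise the function.
toy_scores = {
    ("model_a", "scaffold_x"): 0.2,
    ("model_a", "scaffold_y"): 0.3,
    ("model_b", "scaffold_x"): 0.6,
    ("model_b", "scaffold_y"): 0.7,
}
shares = variance_shares(toy_scores)
print(shares)
```

In this toy grid, switching the model moves the score far more than switching the scaffold, so the model share dominates. The numbers carry no relation to the 41.4% and 1.5% figures reported in the paper.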
Code & Models
- jablonkagroup/corral-oss-trace-logprobs (dataset, 1.3k downloads)
- jablonkagroup/corral-QAs (dataset, 684 downloads)
- jablonkagroup/corral-QAs-reports (dataset, 7.2k downloads)
- jablonkagroup/corral-QAs-topic_reports (dataset, 3.5k downloads)
- jablonkagroup/corral-traces (dataset, 6.3k downloads)
- jablonkagroup/corral_runs_reports (dataset, 1.2k downloads)
- jablonkagroup/corral-environment-tasks (dataset, 478 downloads)
- jablonkagroup/corral_lfm_binomial_results (dataset, 451 downloads)
