Why LLMs Aren't Scientists Yet: Lessons from Four Autonomous Research Attempts
Dhruv Trehan, Paras Chopra

TL;DR
This paper examines four end-to-end attempts to autonomously generate ML research papers with a pipeline of LLM agents, identifies the failure modes they share, and proposes design principles for more robust AI-scientist systems.
Contribution
It provides a detailed case study of autonomous research attempts, identifies key failure modes, and offers design principles to improve AI-driven scientific discovery.
Findings
Three attempts failed during implementation or evaluation.
One attempt completed the pipeline and was accepted to Agents4Science 2025, an experimental venue that required AI systems as first authors.
The authors document six recurring failure modes across the four attempts.
Abstract
We report a case study of four end-to-end attempts to autonomously generate ML research papers using a pipeline of six LLM agents mapped to stages of the scientific workflow. Of these four, three attempts failed during implementation or evaluation. One completed the pipeline and was accepted to Agents4Science 2025, an experimental inaugural venue that required AI systems as first authors, passing both human and multi-AI review. From these attempts, we document six recurring failure modes: bias toward training data defaults, implementation drift under execution pressure, memory and context degradation across long-horizon tasks, overexcitement that declares success despite obvious failures, insufficient domain intelligence, and weak scientific taste in experimental design. We conclude by discussing four design principles for more robust AI-scientist systems, implications for autonomous…
Taxonomy
Topics: Scientific Computing and Data Management · Ethics and Social Impacts of AI · Artificial Intelligence in Law
