FIRE-Bench: Evaluating Agents on the Rediscovery of Scientific Insights

Zhen Wang; Fan Bai; Zhongyan Luo; Jinyan Su; Kaiser Sun; Xinle Yu; Jieyuan Liu; Kun Zhou; Claire Cardie; Mark Dredze; Eric P. Xing; Zhiting Hu

arXiv:2602.02905·cs.AI·February 4, 2026

Open Access

TL;DR

FIRE-Bench is a new benchmark designed to evaluate autonomous agents' ability to rediscover scientific findings from recent research, highlighting current limitations and guiding future improvements in agent-driven scientific discovery.

Contribution

The paper introduces FIRE-Bench, a comprehensive benchmark for assessing autonomous agents' full-cycle scientific discovery capabilities using high-impact machine learning research.

Findings

01 Current agents achieve limited rediscovery success (<50 F1)

02 High variance observed across different runs

03 Recurring failure modes in experimental design and reasoning
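The first finding is reported as an F1 score. The page does not say how the benchmark matches an agent's conclusions against the published findings, so the following is only a minimal sketch of the standard F1 formula on a 0-100 scale. The function name `f1_score` and the example counts are hypothetical and are not taken from the paper.

```python
def f1_score(true_positives: int, false_positives: int, false_negatives: int) -> float:
    """Return the F1 score on a 0-100 scale from match counts.

    Assumption (not from the paper): an agent's conclusions have already been
    matched against reference findings, yielding these three counts.
    """
    predicted = true_positives + false_positives   # everything the agent claimed
    reference = true_positives + false_negatives   # everything it should have found
    if true_positives == 0 or predicted == 0 or reference == 0:
        return 0.0
    precision = true_positives / predicted
    recall = true_positives / reference
    # F1 is the harmonic mean of precision and recall.
    return 100.0 * 2 * precision * recall / (precision + recall)


# Hypothetical counts: 2 matched findings, 2 unsupported claims, 3 missed findings.
score = f1_score(2, 2, 3)
print(f"F1 = {score:.1f}")
```

With these made-up counts, precision is 0.5 and recall is 0.4, so the score falls below 50, the range reported for current agents.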

Abstract

Autonomous agents powered by large language models (LLMs) promise to accelerate scientific discovery end-to-end, but rigorously evaluating their capacity for verifiable discovery remains a central challenge. Existing benchmarks face a trade-off: they either heavily rely on LLM-as-judge evaluations of automatically generated research outputs or optimize convenient yet isolated performance metrics that provide coarse proxies for scientific insight. To address this gap, we introduce FIRE-Bench (Full-cycle Insight Rediscovery Evaluation), a benchmark that evaluates agents through the rediscovery of established findings from recent, high-impact machine learning research. Agents are given only a high-level research question extracted from a published, verified study and must autonomously explore ideas, design experiments, implement code, execute their plans, and derive conclusions supported…

Taxonomy

Topics: Machine Learning in Materials Science · Artificial Intelligence in Healthcare and Education · Multimodal Machine Learning Applications