FIRE-Bench: Evaluating Agents on the Rediscovery of Scientific Insights
Zhen Wang, Fan Bai, Zhongyan Luo, Jinyan Su, Kaiser Sun, Xinle Yu, Jieyuan Liu, Kun Zhou, Claire Cardie, Mark Dredze, Eric P. Xing, Zhiting Hu

TL;DR
FIRE-Bench is a new benchmark designed to evaluate autonomous agents' ability to rediscover scientific findings from recent research, highlighting current limitations and guiding future improvements in agent-driven scientific discovery.
Contribution
The paper introduces FIRE-Bench, a comprehensive benchmark for assessing autonomous agents' full-cycle scientific discovery capabilities using high-impact machine learning research.
Findings
Current agents achieve limited rediscovery success (F1 below 50)
High variance observed across different runs
Recurring failure modes in experimental design and reasoning
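To make the F1 and run-variance findings above concrete, here is a minimal sketch of how a claim-level F1 score and its spread across runs could be computed. The matching scheme (counting agent claims that agree with reference findings), the function name `f1_score`, and all of the counts are assumptions for illustration only; they are not the paper's scoring procedure or its results.

```python
from statistics import mean, pstdev


def f1_score(num_matched: int, num_agent_claims: int, num_reference_findings: int) -> float:
    """Return F1 on a 0-100 scale from claim-level counts.

    Hypothetical scheme: precision is the share of agent claims that match a
    reference finding, recall is the share of reference findings recovered.
    """
    if num_agent_claims == 0 or num_reference_findings == 0:
        return 0.0
    precision = num_matched / num_agent_claims
    recall = num_matched / num_reference_findings
    if precision + recall == 0:
        return 0.0
    return 100 * 2 * precision * recall / (precision + recall)


# Made-up counts for three runs of one task:
# (matched claims, agent claims, reference findings)
runs = [(2, 5, 4), (1, 4, 4), (3, 6, 4)]
scores = [f1_score(*run) for run in runs]

print("per-run F1:", [round(s, 1) for s in scores])
print("mean F1:", round(mean(scores), 1))
print("std across runs:", round(pstdev(scores), 1))
```

Reporting the spread alongside the mean matters here because a single run can land well above or below the average, which is what high run-to-run variance means in practice.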
Abstract
Autonomous agents powered by large language models (LLMs) promise to accelerate scientific discovery end-to-end, but rigorously evaluating their capacity for verifiable discovery remains a central challenge. Existing benchmarks face a trade-off: they either heavily rely on LLM-as-judge evaluations of automatically generated research outputs or optimize convenient yet isolated performance metrics that provide coarse proxies for scientific insight. To address this gap, we introduce FIRE-Bench (Full-cycle Insight Rediscovery Evaluation), a benchmark that evaluates agents through the rediscovery of established findings from recent, high-impact machine learning research. Agents are given only a high-level research question extracted from a published, verified study and must autonomously explore ideas, design experiments, implement code, execute their plans, and derive conclusions supported…
Taxonomy
Topics: Machine Learning in Materials Science · Artificial Intelligence in Healthcare and Education · Multimodal Machine Learning Applications
