Toward Evaluation Frameworks for Multi-Agent Scientific AI Systems
Marcin Abram

TL;DR
This paper discusses the challenges and strategies for benchmarking multi-agent scientific AI systems, emphasizing realistic evaluation methods and the importance of multi-turn interactions.
Contribution
It introduces approaches for constructing contamination-resistant problems and scalable task families, and presents an initial dataset for testing out-of-sample performance.
Findings
Constructed a dataset of novel research ideas for evaluation
Identified key challenges in benchmarking scientific AI systems
Gathered insights from quantum science researchers on AI interaction expectations
Abstract
We analyze the challenges of benchmarking scientific (multi-)agent systems, including the difficulty of distinguishing reasoning from retrieval, the risks of data and model contamination, the lack of reliable ground truth for novel research problems, the complications introduced by tool use, and the replication challenges posed by a continuously changing knowledge base. We discuss strategies for constructing contamination-resistant problems and generating scalable families of tasks, as well as the need to evaluate systems through multi-turn interactions that better reflect real scientific practice. As an early feasibility test, we demonstrate how to construct a dataset of novel research ideas to test the out-of-sample performance of our system. We also discuss the results of interviews with several researchers and engineers working in quantum science. Through those interviews, we…
