Benchmarking and Evaluation of AI Models in Biology: Outcomes and Recommendations from the CZI Virtual Cells Workshop

Elizabeth Fahsbender; Alma Andersson; Jeremy Ash; Polina Binder; Daniel Burkhardt; Benjamin Chang; Georg K. Gerber; Anthony Gitter; Patrick Godau; Ankit Gupta; Genevieve Haliburton; Siyu He; Trey Ideker; Ivana Jelic; Aly Khan; Yang-Joon Kim; Aditi Krishnapriyan; Jon M. Laurent; Tianyu Liu; Emma Lundberg; Shalin B. Mehta; Rob Moccia; Angela Oliveira Pisco; Katherine S. Pollard; Suresh Ramani; Julio Saez-Rodriguez; Yasin Senbabaoglu; Elana Simon; Srinivasan Sivanandan; Gustavo Stolovitzky; Marc Valer; Bo Wang; Xikun Zhang; James Zou; Katrina Kalantar

arXiv:2507.10502 · cs.LG · July 17, 2025

TL;DR

This paper discusses the importance of standardized benchmarking in AI for biology, highlighting challenges and proposing recommendations to develop robust, reproducible, and comprehensive evaluation frameworks to advance biological AI research.

Contribution

The paper introduces a set of recommendations for creating effective benchmarking frameworks in biological AI, addressing current systemic and technical bottlenecks.

Findings

01. Identification of key bottlenecks like data heterogeneity and noise

02. Proposal of standardized evaluation metrics and open platforms

03. Emphasis on high-quality data curation and collaborative tools

Abstract

Artificial intelligence holds immense promise for transforming biology, yet a lack of standardized, cross-domain benchmarks undermines our ability to build robust, trustworthy models. Here, we present insights from a recent workshop that convened machine learning and computational biology experts across imaging, transcriptomics, proteomics, and genomics to tackle this gap. We identify major technical and systemic bottlenecks, such as data heterogeneity and noise, reproducibility challenges, biases, and the fragmented ecosystem of publicly available resources, and propose a set of recommendations for building benchmarking frameworks that can efficiently compare ML models of biological systems across tasks and data modalities. By promoting high-quality data curation, standardized tooling, comprehensive evaluation metrics, and open, collaborative platforms, we aim to accelerate the…

Taxonomy

Methods: Sparse Evolutionary Training