TL;DR
BAISBench is a new benchmark for evaluating AI scientists on real single-cell transcriptomic data. It tests two abilities, cell type annotation and scientific discovery, and highlights both the current capabilities and the limitations of these systems.
Contribution
Introduces BAISBench, a benchmark that assesses AI scientists on real biological data through cell type annotation and scientific discovery tasks in single-cell transcriptomics.
Findings
AI scientists show potential but do not match human experts.
Current AI systems outperform baseline models in some tasks.
Benchmark provides a realistic evaluation of AI in biological research.
Abstract
Recent advances in large language models have enabled the emergence of AI scientists that aim to autonomously analyze biological data and assist scientific discovery. Despite rapid progress, it remains unclear to what extent these systems can extract meaningful biological insights from real experimental data. Existing benchmarks either evaluate reasoning in the absence of data or focus on predefined analytical outputs, failing to reflect realistic, data-driven biological research. Here, we introduce BAISBench (Biological AI Scientist Benchmark), a benchmark for evaluating AI scientists on real single-cell transcriptomic datasets. BAISBench comprises two tasks: cell type annotation across 15 expert-labeled datasets, and scientific discovery through 193 multiple-choice questions derived from biological conclusions reported in 41 published single-cell studies. We evaluated several…
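The two tasks described in the abstract both reduce to comparing a system's outputs against reference answers: predicted cell type labels against expert labels, and chosen options against the correct multiple-choice option. The sketch below illustrates that comparison with plain exact-match accuracy. This is an assumption for illustration only: the abstract does not state which metric BAISBench uses, and the function name `accuracy` and all toy data are hypothetical.

```python
from typing import Sequence


def accuracy(predicted: Sequence[str], reference: Sequence[str]) -> float:
    """Fraction of items where the prediction matches the reference.

    Matching is exact after trimming whitespace and lowercasing.
    This is an assumed metric, not necessarily the one BAISBench uses.
    """
    if len(predicted) != len(reference):
        raise ValueError("predicted and reference must have the same length")
    if not reference:
        raise ValueError("cannot score an empty set")
    hits = sum(
        p.strip().lower() == r.strip().lower()
        for p, r in zip(predicted, reference)
    )
    return hits / len(reference)


# Task 1 (cell type annotation): hypothetical per-cell labels.
annotation_pred = ["T cell", "B cell", "monocyte", "T cell"]
annotation_true = ["T cell", "B cell", "NK cell", "T cell"]
annotation_score = accuracy(annotation_pred, annotation_true)

# Task 2 (scientific discovery): hypothetical multiple-choice answers.
mcq_pred = ["A", "c", "B", "D", "A"]
mcq_true = ["A", "C", "D", "D", "B"]
mcq_score = accuracy(mcq_pred, mcq_true)

print(f"annotation accuracy: {annotation_score:.2f}")
print(f"multiple-choice accuracy: {mcq_score:.2f}")
```

In practice, cell type labels rarely match as exact strings (for example "CD4+ T cell" versus "T cell"), so a real evaluation would likely need label harmonization or an ontology-aware comparison rather than the exact match assumed here.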
