AssayBench: An Assay-Level Virtual Cell Benchmark for LLMs and Agents
Edward De Brouwer, Carl Edwards, Alexander Wu, Jenna Collier, Graham Heimberg, Xiner Li, Meena Subramaniam, Ehsan Hajiramezanali, David Richmond, Jan-Christian Hütter, Sara Mostafavi, Gabriele Scalia

TL;DR
AssayBench is a benchmark for evaluating large language models and agents on phenotypic screen prediction, aimed at advancing virtual cell modeling and drug discovery.
Contribution
This work introduces AssayBench, a comprehensive benchmark for phenotypic screen prediction using diverse CRISPR data, and evaluates LLMs' performance on this task.
Findings
Zero-shot LLMs outperform biology-specific models.
Fine-tuning and prompt optimization improve LLM performance.
Existing methods are far from performance ceilings.
Abstract
Recent advances in machine learning and large-scale biological data collections have revived the prospect of building a virtual cell, a computational model of cellular behavior that could accelerate biological discovery. One of the most compelling promises of this vision is the ability to perform in silico phenotypic screens, in which a model predicts the effects of cellular perturbations in unseen biological contexts. This task combines heterogeneous textual inputs with diverse phenotypic outputs, making it particularly well-suited to LLMs and agentic systems. Yet, no standard benchmark currently exists for this task, as existing efforts focus on narrower molecular readouts that are only indirectly aligned with the phenotypic endpoints driving many real-world drug discovery workflows. In this work, we present AssayBench, a benchmark for phenotypic screen prediction, built from 1,920…
