SC-Arena: A Natural Language Benchmark for Single-Cell Reasoning with Knowledge-Augmented Evaluation
Jiahao Zhao, Feng Jiang, Shaowei Qin, Zhonghui Zhang, Junhao Liu, Guibing Guo, Hamid Alinejad-Rokny, Min Yang

TL;DR
SC-Arena introduces a comprehensive, knowledge-augmented natural language evaluation framework for single-cell biology models, enabling more biologically faithful and interpretable assessment of reasoning capabilities.
Contribution
It unifies evaluation tasks through a virtual cell abstraction and incorporates external biological knowledge for more meaningful model assessment.
Findings
Models show uneven performance on complex biological tasks.
Knowledge-augmented evaluation improves interpretability and biological correctness.
The framework discriminates between model capabilities beyond what traditional metrics capture.
Abstract
Large language models (LLMs) are increasingly applied in scientific research, offering new capabilities for knowledge discovery and reasoning. In single-cell biology, however, evaluation practices for both general and specialized LLMs remain inadequate: existing benchmarks are fragmented across tasks, adopt formats such as multiple-choice classification that diverge from real-world usage, and rely on metrics lacking interpretability and biological grounding. We present SC-ARENA, a natural language evaluation framework tailored to single-cell foundation models. SC-ARENA formalizes a virtual cell abstraction that unifies evaluation targets by representing both intrinsic attributes and gene-level interactions. Within this paradigm, we define five natural language tasks (cell type annotation, captioning, generation, perturbation prediction, and scientific QA) that probe core reasoning…
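The virtual cell abstraction described in the abstract can be pictured as a small data structure. The following is a minimal sketch, not the authors' implementation: the class name `VirtualCell`, its field names, the `describe` and `build_prompt` helpers, and the prompt layout are all hypothetical, and the gene names are placeholders. Only the two kinds of content (intrinsic attributes and gene-level interactions) and the five task names come from the abstract.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Dict, List, Tuple


class Task(Enum):
    """The five natural language tasks named in the abstract."""
    CELL_TYPE_ANNOTATION = "cell type annotation"
    CAPTIONING = "captioning"
    GENERATION = "generation"
    PERTURBATION_PREDICTION = "perturbation prediction"
    SCIENTIFIC_QA = "scientific QA"


@dataclass
class VirtualCell:
    # Intrinsic attributes of the cell (hypothetical keys, e.g. tissue).
    attributes: Dict[str, str] = field(default_factory=dict)
    # Gene-level interactions as (source, relation, target) triples (assumed shape).
    interactions: List[Tuple[str, str, str]] = field(default_factory=list)

    def describe(self) -> str:
        """Render the cell as plain text, with attributes sorted by key."""
        attrs = "; ".join(f"{k}: {v}" for k, v in sorted(self.attributes.items()))
        edges = "; ".join(f"{s} {r} {t}" for s, r, t in self.interactions)
        return f"Attributes: {attrs}. Interactions: {edges}."


def build_prompt(cell: VirtualCell, task: Task) -> str:
    """Pair one task with one cell description (illustrative prompt format)."""
    return f"Task: {task.value}\n{cell.describe()}"


if __name__ == "__main__":
    # Placeholder values only; not taken from any dataset.
    cell = VirtualCell(
        attributes={"tissue": "blood"},
        interactions=[("GENE_A", "activates", "GENE_B")],
    )
    print(build_prompt(cell, Task.CAPTIONING))
```

Under these assumptions, a single representation feeds every task, which is one way to read the abstract's claim that the abstraction "unifies evaluation targets".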
Peer Reviews
Decision: ICLR 2026 Poster
1. Combining vast amounts of single-cell data with the available natural-language knowledge to gain insights into cellular function benefits the biology community, so this is a timely topic.
2. The authors have benchmarked their proposed framework for evaluating LLM performance across many LLMs and domain-specific models.
3. Definition of the knowledge cell class is well thought out in considering multiple sources of information available for analyzing cellular dynami
- The novelty and value of this evaluation framework are unclear: many current models, such as Cell2Sentence, already combine single-cell RNA data with text-based information and have shown use cases for downstream tasks such as cell type prediction and perturbation response prediction.
- Considering existing methods that can perform some of the tasks mentioned in the multi-task benchmark like Cell2Sentence, CellReasoning etc, to fully evaluate this work, performance of existing methods on individual t
• Very creative and well-motivated benchmark that treats LLMs as reasoning agents over biological cell states.
• The Eval-RAG strategy is an elegant idea that improves evaluation by incorporating biological context and semantic plausibility, moving beyond token-level correctness.
• The paper provides a valuable framework for comparing different LLMs under biologically grounded tasks.
• The paper does extensive evaluation of general and domain-specific models.
• Discussion includes relevant next
The paper would benefit from more detailed examples—for instance, elaborating on the process shown in Figure 2, panel B, to clearly explain how the biological plausibility scoring is computed step by step.
- The Virtual Cell abstraction and multi-task natural language evaluation is an interesting idea to more objectively test different models’ capacity to understand cellular processes.
- Integrating external biological knowledge (Cell Ontology, UniProt, GO, CellMarker, PubMed) into the evaluation pipeline is a major strength and a clever way to address the limitations of string-matching metrics.
- The paper benchmarks a wide range of models (Qwen, GPT-4o, DeepSeek-R1, Kimi-K2, C2S-Scale, scGen
- While the benchmark covers different tasks and the datasets are open-source, the paper does not address the risk of benchmark dataset leakage, that is, whether the datasets used to construct the SC-Arena benchmark were present in the pretraining or fine-tuning data of the evaluated models.
- The rationale for model selection and the fairness of comparisons (e.g., fine-tuning protocols, input formats) could be better discussed.
- The knowledge-augmented LLM-as-a-judge is promising, but its
Taxonomy
Topics: Single-cell and spatial transcriptomics · Biomedical Text Mining and Ontologies · Bioinformatics and Genomic Networks
