BioXArena: Benchmarking LLM Agents on Multi-Modal Biomedical Machine Learning Tasks
Loka Li, Duzhen Zhang, Xingbo Du, Leonard Song, Zixiao Wang, Assanali Aukenov, Noel Thomas, Shakhnazar Sailaukan, Yonghan Yang, Feilong Chen, Jiahua Dong, Kun Zhang, Bin Zhang, Le Song

TL;DR
BioXArena is a comprehensive biomedical machine learning benchmark that evaluates LLM agents on diverse multi-modal tasks, highlighting their capabilities and limitations in generating models across various biomedical domains.
Contribution
The paper introduces BioXArena, a new benchmark with 76 tasks across 9 biomedical domains to assess LLM agents' ability to generate task-specific ML pipelines.
Findings
MLEvolve with Gemini-3.1-Pro achieved the highest score, 0.666.
No single agent outperforms the others across all domains.
Extensive ablation and robustness studies were conducted.
Abstract
Large language model (LLM) agents are increasingly capable of automating components of machine learning development, yet existing biomedical benchmarks mainly focus on question answering, reasoning, and tool usage, or evaluate only narrow aspects of biomedical ML coding. We present BioXArena, a biomedical machine learning benchmark designed to evaluate whether agents can generate task-specific model training pipelines for heterogeneous and multi-modal biomedical datasets. BioXArena contains 76 end-to-end tasks across 9 domains, including sequence modeling, single-cell analysis, structural biology, network biology, chemical biology, perturbation dynamics, phenotype-disease modeling, biomedical imaging, and text-integrated learning. Each task is curated from primary biomedical sources into a unified evaluation framework with hidden labels, held-out graders, and biology-aware metrics…
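The hidden-label, held-out-grader setup mentioned in the abstract can be illustrated with a minimal sketch. This is not BioXArena's actual grader. The function names (`grade_submission`, `accuracy`), the dictionary-based submission format, and the choice of plain accuracy as the metric are all assumptions made for illustration. The idea shown is only that the agent submits predictions keyed by example ID, and scoring happens on a side that alone has access to the labels.

```python
from typing import Callable, Dict, List


def accuracy(y_true: List[str], y_pred: List[str]) -> float:
    """Fraction of exact matches; a stand-in for a task-specific metric."""
    if not y_true:
        raise ValueError("no labels to score against")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def grade_submission(
    predictions: Dict[str, str],
    hidden_labels: Dict[str, str],
    metric: Callable[[List[str], List[str]], float] = accuracy,
) -> float:
    """Score predictions against labels the agent never sees (hypothetical API).

    Both arguments map example IDs to labels. Every hidden example must have
    a prediction; extra predictions are ignored.
    """
    missing = [ex_id for ex_id in hidden_labels if ex_id not in predictions]
    if missing:
        raise ValueError(f"missing predictions for {len(missing)} example(s)")
    # Align both sides on a fixed ID order before applying the metric.
    ids = sorted(hidden_labels)
    y_true = [hidden_labels[i] for i in ids]
    y_pred = [predictions[i] for i in ids]
    return metric(y_true, y_pred)
```

Under these assumptions, a task with a different metric would pass its own `metric` callable, while the agent-facing side only ever receives the unlabeled inputs and example IDs.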
