BioXArena: Benchmarking LLM Agents on Multi-Modal Biomedical Machine Learning Tasks

Loka Li; Duzhen Zhang; Xingbo Du; Leonard Song; Zixiao Wang; Assanali Aukenov; Noel Thomas; Shakhnazar Sailaukan; Yonghan Yang; Feilong Chen; Jiahua Dong; Kun Zhang; Bin Zhang; Le Song

arXiv:2605.15766 · cs.CE · May 18, 2026

TL;DR

BioXArena is a comprehensive biomedical machine learning benchmark that evaluates LLM agents on diverse multi-modal tasks, highlighting their capabilities and limitations in generating models across various biomedical domains.

Contribution

The paper introduces BioXArena, a new benchmark with 76 tasks across 9 biomedical domains to assess LLM agents' ability to generate task-specific ML pipelines.

Findings

01. MLEvolve with Gemini-3.1-Pro achieved the highest score, 0.666.

02. No single agent outperforms the others across all domains.

03. The authors report extensive ablation and robustness studies.

Abstract

Large language model (LLM) agents are increasingly capable of automating components of machine learning development, yet existing biomedical benchmarks mainly focus on question answering, reasoning, and tool usage, or evaluate only narrow aspects of biomedical ML coding. We present BioXArena, a biomedical machine learning benchmark designed to evaluate whether agents can generate task-specific model training pipelines for heterogeneous and multi-modal biomedical datasets. BioXArena contains 76 end-to-end tasks across 9 domains, including sequence modeling, single-cell analysis, structural biology, network biology, chemical biology, perturbation dynamics, phenotype-disease modeling, biomedical imaging, and text-integrated learning. Each task is curated from primary biomedical sources into a unified evaluation framework with hidden labels, held-out graders, and biology-aware metrics…
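
The hidden-label evaluation described in the abstract can be illustrated with a minimal sketch. It is not the BioXArena grader: the function name `grade_submission`, the sample identifiers, and the toy data are all hypothetical, and plain accuracy only stands in for the biology-aware metrics the abstract mentions. The point it shows is that the agent submits predictions keyed by sample identifier, while the labels stay with the grader.

```python
from typing import Mapping


def grade_submission(
    predictions: Mapping[str, int],
    hidden_labels: Mapping[str, int],
) -> float:
    """Score predictions against labels the agent never sees.

    Hypothetical helper: plain accuracy stands in for a task-specific
    metric and is an assumption, not the benchmark's scoring rule.
    """
    if not hidden_labels:
        raise ValueError("hidden_labels must not be empty")

    # Every held-out sample must receive a prediction.
    missing = sorted(set(hidden_labels) - set(predictions))
    if missing:
        raise ValueError(f"missing predictions for {len(missing)} sample(s)")

    # Extra keys in the submission are ignored; only held-out samples count.
    correct = sum(predictions[key] == label for key, label in hidden_labels.items())
    return correct / len(hidden_labels)


# Toy data, invented for illustration only.
hidden = {"s1": 1, "s2": 0, "s3": 1, "s4": 0}
submitted = {"s1": 1, "s2": 0, "s3": 0, "s4": 0}
score = grade_submission(submitted, hidden)
```

Keeping the labels on the grader side means a generated pipeline can only be scored through its predictions, which is the property a hidden-label setup is meant to guarantee.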
