BioMedArena: An Open-source Toolkit for Building and Evaluating Biomedical Deep Research Agents

Jinge Wu; Hongjian Zhou; Mingde Zeng; Jiayuan Zhu; Junde Wu; Jiazhen Pan; Sean Wu; Honghan Wu; Fenglin Liu; David A. Clifton

arXiv:2605.06177·cs.AI·May 8, 2026

TL;DR

BioMedArena is an open-source toolkit that streamlines the development and evaluation of biomedical research agents, enables fair comparison between foundation models, and reports state-of-the-art results on 8 biomedical benchmarks.

Contribution

The paper introduces a modular framework that reduces the engineering effort of benchmarking biomedical models, tools, and agents, and reports improved performance with it.

Findings

01 Achieved state-of-the-art results on 8 biomedical benchmarks.

02 Provided 147 biomedical benchmarks and 75 tools within the toolkit.

03 Significantly improved performance with an average lift of +15.03 percentage points.

Abstract

Building a deep research agent today is an exercise in glue code: the same backbone evaluated on the same benchmark can report different accuracies in different papers because the harness and tool registry differ, and integrating a new foundation model into a comparable evaluation surface costs weeks of model-specific engineering. We call this the per-paper engineering tax and release BioMedArena, an open-source toolkit that not only alleviates it but also provides an arena for fair comparison of different foundation models when evaluating them as deep-research agents. BioMedArena decouples six layers of biomedical agent evaluation -- benchmark loading, tool exposure, tool selection, execution mode, context management, and scoring -- and exposes 147 biomedical benchmarks and 75 biomedical tools across 9 functional families. Adding a new model, benchmark, or tool reduces to registering a…
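To make the layering in the abstract concrete, here is a minimal sketch of how a registry-based harness could keep the six layers separate. It is an illustration under stated assumptions, not BioMedArena's actual API: every name in it (`BENCHMARKS`, `TOOLS`, `register`, `EvalConfig`, `evaluate`, and the toy benchmark, tool, and model) is hypothetical, and the data is invented for the example.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Hypothetical registries for illustration only; NOT the BioMedArena API.
BENCHMARKS: Dict[str, Callable[[], List[dict]]] = {}
TOOLS: Dict[str, Callable[[str], str]] = {}


def register(registry: Dict[str, Callable], name: str):
    """Return a decorator that stores a function in `registry` under `name`."""
    def decorator(fn):
        if name in registry:
            raise ValueError(f"{name!r} is already registered")
        registry[name] = fn
        return fn
    return decorator


@register(BENCHMARKS, "toy_qa")
def load_toy_qa() -> List[dict]:
    # Layer 1, benchmark loading: a made-up two-item benchmark.
    return [
        {"question": "gene for cystic fibrosis", "answer": "CFTR"},
        {"question": "gene for sickle cell disease", "answer": "HBB"},
    ]


@register(TOOLS, "toy_lookup")
def toy_lookup(query: str) -> str:
    # Layer 2, tool exposure: a stand-in for a real biomedical tool.
    table = {"cystic fibrosis": "CFTR", "sickle cell disease": "HBB"}
    return next((gene for disease, gene in table.items() if disease in query), "unknown")


@dataclass
class EvalConfig:
    benchmark: str
    tools: List[str]              # Layer 3, tool selection
    mode: str = "single_turn"     # Layer 4, execution mode (a label only in this sketch)
    max_context_chars: int = 200  # Layer 5, context management (simple truncation)


def exact_match(prediction: str, gold: str) -> bool:
    # Layer 6, scoring: case-insensitive exact match.
    return prediction.strip().lower() == gold.strip().lower()


def evaluate(model: Callable[[str, Dict[str, Callable[[str], str]]], str],
             cfg: EvalConfig) -> float:
    """Run `model` on every benchmark item and return its accuracy."""
    items = BENCHMARKS[cfg.benchmark]()
    tools = {name: TOOLS[name] for name in cfg.tools}
    correct = 0
    for item in items:
        context = item["question"][: cfg.max_context_chars]
        correct += exact_match(model(context, tools), item["answer"])
    return correct / len(items)


def toy_model(question: str, tools: Dict[str, Callable[[str], str]]) -> str:
    # A stand-in "agent" that forwards the question to its only tool.
    return tools["toy_lookup"](question)


accuracy = evaluate(toy_model, EvalConfig(benchmark="toy_qa", tools=["toy_lookup"]))
```

In this sketch, swapping the model, the benchmark, or the tool set changes only a registry entry or a config field, which is the kind of decoupling the abstract describes. How the toolkit itself implements registration is not stated in this excerpt.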

Code & Models

Repositories

AI-in-Health/BioMedArena
github
