TL;DR
BioMedArena is an open-source toolkit for building and evaluating biomedical research agents. It gives different foundation models a common evaluation surface for fair comparison and reports state-of-the-art results on 8 biomedical benchmarks.
Contribution
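BioMedArena introduces a modular framework that decouples six layers of biomedical agent evaluation: benchmark loading, tool exposure, tool selection, execution mode, context management, and scoring. This reduces the model-specific engineering needed to benchmark biomedical models, tools, and agents on comparable terms.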
Findings
Achieves state-of-the-art results on 8 biomedical benchmarks.
Provides 147 biomedical benchmarks and 75 biomedical tools across 9 functional families.
Reports an average performance lift of +15.03 percentage points.
Abstract
Building a deep research agent today is an exercise in glue code: the same backbone evaluated on the same benchmark can report different accuracies in different papers because the harness and tool registry differ, and integrating a new foundation model into a comparable evaluation surface costs weeks of model-specific engineering. We call this the per-paper engineering tax and release BioMedArena, an open-source toolkit that not only alleviates it but also provides an arena for fair comparison of different foundation models when evaluating them as deep-research agents. BioMedArena decouples six layers of biomedical agent evaluation -- benchmark loading, tool exposure, tool selection, execution mode, context management, and scoring -- and exposes 147 biomedical benchmarks and 75 biomedical tools across 9 functional families. Adding a new model, benchmark, or tool reduces to registering a…
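To make the idea of decoupled layers concrete, the sketch below shows one way a registry-based harness could keep benchmark loading, tool exposure, and scoring independent of the model under test. It is only an illustration under stated assumptions: every name in it (`BENCHMARKS`, `TOOLS`, `MODELS`, `register`, `evaluate`, and the toy entries) is hypothetical and is not taken from BioMedArena's actual API. It covers three of the six layers and leaves out tool selection, execution mode, and context management.

```python
from typing import Callable, Dict, List

# Hypothetical registries, one per layer. These names are illustrative
# assumptions and do not come from the BioMedArena codebase.
BENCHMARKS: Dict[str, Callable[[], List[dict]]] = {}
TOOLS: Dict[str, Callable[[str], str]] = {}
MODELS: Dict[str, Callable[[str, Dict[str, Callable[[str], str]]], str]] = {}


def register(registry: dict, name: str):
    """Return a decorator that stores the decorated function under `name`."""
    def decorator(fn):
        registry[name] = fn
        return fn
    return decorator


@register(BENCHMARKS, "toy_qa")
def load_toy_qa() -> List[dict]:
    # Made-up single-item benchmark, used only to exercise the harness.
    return [{"question": "Which gene symbol is mentioned here: BRCA1?",
             "answer": "BRCA1"}]


@register(TOOLS, "echo_lookup")
def echo_lookup(query: str) -> str:
    # Stand-in for a real biomedical tool such as a database lookup.
    return "BRCA1" if "BRCA1" in query else "unknown"


@register(MODELS, "toy_model")
def toy_model(question: str, tools: Dict[str, Callable[[str], str]]) -> str:
    # Stand-in for a foundation model: it just forwards the question to a tool.
    return tools["echo_lookup"](question)


def exact_match(prediction: str, gold: str) -> float:
    """Scoring layer: case-insensitive exact match."""
    return float(prediction.strip().lower() == gold.strip().lower())


def evaluate(model_name: str, benchmark_name: str, tool_names: List[str],
             scorer: Callable[[str, str], float] = exact_match) -> float:
    """Run one registered model on one registered benchmark and return mean score."""
    examples = BENCHMARKS[benchmark_name]()          # benchmark loading
    tools = {name: TOOLS[name] for name in tool_names}  # tool exposure
    model = MODELS[model_name]
    scores = [scorer(model(ex["question"], tools), ex["answer"])
              for ex in examples]
    return sum(scores) / len(scores)


accuracy = evaluate("toy_model", "toy_qa", ["echo_lookup"])
```

Because each layer is looked up by name, swapping in a different model or scorer in this sketch changes only the arguments to `evaluate`, while the benchmark and tool set stay fixed. That is the property a fair-comparison harness needs.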
