Medmarks: A Comprehensive Open-Source LLM Benchmark Suite for Medical Tasks

Benjamin Warner; Ratna Sagari Grandhi; Max Kieffer; Aymane Ouraq; Saurav Panigrahi; Geetu Ambwani; Kunal Bagga; Nikhil Khandekar; Arya Hariharan; Nishant Mishra; Manish Ram; Shamus Sim Zi Yang; Ahmed Essouaied; Adepoju Jeremiah Moyondafoluwa; Robert Scholz; Bofeng Huang; Molly Beavers; Srishti Gureja; Anish Mahishi; Sameed Khan; Maxime Griot; Hunar Batra; Jean-Benoit Delbrouck; Siddhant Bharadwaj; Ronald Clark; Ashish Vashist; Anas Zafar; Leema Krishna Murali; Harsh Deshpande; Ameen Patel; William Brown; Johannes Hagemann; Connor Lane; Paul Steven Scotti; Tanishq Mathew Abraham

arXiv:2605.01417 · cs.CL · May 5, 2026

TL;DR

Medmarks is an open-source, comprehensive benchmark suite for evaluating large language models on diverse medical tasks, built to address the limitations of existing benchmarks.

Contribution

The paper contributes a new open-source evaluation suite with 30 benchmarks, a systematic evaluation of 61 models, and insights into model performance and biases in medical AI.

Findings

01. Frontier reasoning models outperform others across benchmarks.

02. Medically fine-tuned models outperform generalist models.

03. Models, especially smaller ones, are susceptible to answer-order bias.
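
To make the answer-order bias finding concrete, here is a minimal sketch of one way such a bias could be probed on a multiple-choice question: present the same options in every cyclic rotation and check whether the model keeps choosing the same option text. This is an illustration, not the procedure used in the paper. The `AnswerFn` interface, `order_consistency`, and the two toy "models" are hypothetical names introduced only for this example.

```python
from typing import Callable, Sequence

# Hypothetical interface (an assumption for this sketch): a model receives a
# question and an ordered list of options, and returns the index it selects.
AnswerFn = Callable[[str, Sequence[str]], int]


def order_consistency(answer_fn: AnswerFn, question: str, options: Sequence[str]) -> float:
    """Fraction of cyclic rotations in which the model picks the same option
    text as it did for the original ordering (1.0 = fully order-invariant)."""
    options = list(options)
    if len(options) < 2:
        raise ValueError("need at least two options to test ordering")
    # The choice under the original ordering is the reference answer.
    baseline = options[answer_fn(question, options)]
    # Every non-trivial cyclic rotation moves each option to a new position.
    rotations = [options[k:] + options[:k] for k in range(1, len(options))]
    same = sum(rot[answer_fn(question, rot)] == baseline for rot in rotations)
    return same / len(rotations)


def always_first(question: str, options: Sequence[str]) -> int:
    """Toy model with maximal position bias: always picks the first slot."""
    return 0


def content_based(question: str, options: Sequence[str]) -> int:
    """Toy model that ignores position and always picks the text 'aspirin'."""
    return list(options).index("aspirin")


if __name__ == "__main__":
    q = "Which drug irreversibly inhibits cyclooxygenase?"
    opts = ["ibuprofen", "aspirin", "naproxen", "celecoxib"]
    print(order_consistency(always_first, q, opts))
    print(order_consistency(content_based, q, opts))
```

A score near 1.0 means the choice follows the option content, while a low score means the choice follows the position. Averaging the score over a benchmark would give a rough, model-level indicator of order sensitivity.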

Abstract

Evaluating large language models (LLMs) for medical applications remains challenging due to benchmark saturation, limited data accessibility, and insufficient coverage of relevant tasks. Existing suites are saturated, depend heavily on restricted datasets, or lack comprehensive model coverage. We introduce Medmarks, a fully open-source evaluation suite with 30 benchmarks spanning question answering, information extraction, medical calculations, and open-ended clinical reasoning. We perform a systematic evaluation of 61 models across 71 configurations using verifiable metrics and LLM-as-a-Judge. Our results show that frontier reasoning models (Gemini 3 Pro Preview, GPT-5.1, and GPT-5.2) achieve the highest performance across both benchmarks, most frontier proprietary models are significantly more token-efficient than open-weight alternatives, medically fine-tuned models outperform…

Code & Models

Repository: MedARC-AI/Medmarks (GitHub)