Scalable Stewardship of an LLM-Assisted Clinical Benchmark with Physician Oversight

Junze Ye; Daniel Tawfik; Alex J. Goodell; Nikhil V. Kotha; Mark K. Buyyounouski; Mohsen Bayati

arXiv:2512.19691 · cs.AI · April 14, 2026

TL;DR

This paper audits MedCalc-Bench, a clinical benchmark whose labels were partly derived with LLM assistance, finds that at least 27% of its test labels are likely erroneous or incomputable, and introduces a physician-in-the-loop pipeline that improves label accuracy and evaluation reliability.

Contribution

The paper presents a scalable physician-in-the-loop stewardship pipeline that reassesses LLM-assisted labels, improving benchmark reliability and model evaluation in medical AI.

Findings

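01

At least 27% of test labels are likely erroneous or incomputable.

02

On a 50-instance physician-validated subset, recomputed labels agree with physician ground truth 74% of the time, versus 20% for the original labels.

03

Evaluating frontier LLMs against the original labels underestimates their accuracy by 16-23 percentage points.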

Abstract

Reference labels for machine-learning benchmarks are increasingly synthesized with LLM assistance, but their reliability remains underexamined. We audit MedCalc-Bench, a clinical benchmark for medical score computation whose labels were partly derived with LLM assistance, and develop a scalable physician-in-the-loop stewardship pipeline to reassess them. At least 27% of test labels are likely erroneous or incomputable. On a 50-instance subset validated by physicians, our recomputed labels agree with physician ground truth 74% of the time (95% CI, 60-84%) versus 20% for the originals (95% CI, 11-33%). Using original labels to evaluate frontier LLMs underestimates accuracy by 16-23 percentage points. In a controlled reinforcement-learning experiment, a model trained on recomputed labels outperforms one trained on originals by 13.5 percentage points (95% CI, 10.6-16.6%) on…
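The agreement rates above are reported with 95% confidence intervals on a 50-instance subset. As a rough illustration of how such intervals can be computed for a small sample, the sketch below uses the Wilson score interval. The abstract does not say which interval method the authors used, so Wilson is an assumption here, and the counts 37/50 and 10/50 are inferred from the reported 74% and 20% rather than taken from the paper. The function name `wilson_interval` is hypothetical.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion.

    Assumption: the paper's interval method is not stated in the abstract;
    Wilson is used here only as a common choice for small samples.
    """
    if n <= 0:
        raise ValueError("n must be positive")
    if not 0 <= successes <= n:
        raise ValueError("successes must be between 0 and n")

    p_hat = successes / n
    z2 = z * z
    denom = 1.0 + z2 / n
    # Center of the interval is pulled toward 0.5 relative to p_hat.
    center = (p_hat + z2 / (2 * n)) / denom
    half_width = z * math.sqrt(p_hat * (1 - p_hat) / n + z2 / (4 * n * n)) / denom
    return center - half_width, center + half_width


if __name__ == "__main__":
    # Counts inferred from the reported percentages on 50 instances (assumption).
    for name, agree in [("recomputed labels", 37), ("original labels", 10)]:
        lo, hi = wilson_interval(agree, 50)
        print(f"{name}: {agree / 50:.0%} agreement, 95% CI {lo:.1%} to {hi:.1%}")
```

Under these assumed counts, the rounded bounds come out to 60-84% and 11-33%, which is consistent with the intervals quoted in the abstract. This does not confirm that the authors used this method.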
