A Scoping Review of LLM-as-a-Judge in Healthcare and the MedJUDGE Framework

Chenyu Li; Zohaib Akhtar; Mingu Kwak; Yuelyu Ji; Hang Zhang; Tracey Obi; Yufan Ren; Xizhi Wu; Sonish Sivarajkumar; Harold P. Lehmann; Shyam Visweswaran; Michael J. Becich; Danielle L. Mowery; Renxuan Liu; Haoyang Sun; Yanshan Wang

arXiv:2604.25933 · cs.CY · April 30, 2026

TL;DR

This scoping review examines the current state of LLM-as-a-Judge in healthcare, highlights limited validation rigor and safety concerns, and proposes the MedJUDGE framework for better evaluation and governance.

Contribution

The review maps existing LLM-as-a-Judge evaluation practices across 49 included studies and introduces the MedJUDGE framework to improve safety and accountability in healthcare LLM evaluation.

Findings

01

Most included studies focus on evaluation and benchmarking applications (n=37 of 49, 75.5%).

02

Validation rigor is limited: among the 36 studies with human involvement, the median number of expert validators was 3, and 13 studies (26.5%) used none.

03

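Few studies assess bias, fairness, or real-world deployment: risk of bias testing was absent in 36 studies (73.5%), and only 1 (2.0%) examined demographic fairness.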

Abstract

As large language models (LLMs) increasingly generate and process clinical text, scalable evaluation has become critical. LLM-as-a-Judge (LaaJ), which uses LLMs to evaluate model outputs, offers a scalable alternative to costly expert review, but its healthcare adoption raises safety and bias concerns. We conducted a PRISMA-ScR scoping review of six databases (January 2020-January 2026), screening 11,727 studies and including 49. The landscape was dominated by evaluation and benchmarking applications (n=37, 75.5%), pointwise scoring (n=42, 85.7%), and GPT-family judges (n=36, 73.5%). Despite growing adoption, validation rigor was limited: among 36 studies with human involvement, the median number of expert validators was 3, while 13 (26.5%) used none. Risk of bias testing was absent in 36 studies (73.5%), only 1 (2.0%) examined demographic fairness, and none assessed temporal stability…
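The percentages quoted in the abstract all appear to share one denominator, the 49 included studies. The short sketch below recomputes them from the reported counts as a consistency check. The dictionary labels and the `percent` helper are illustrative names chosen here, not terms from the paper, and the counts are copied only from the visible part of the abstract.

```python
# Counts and percentages as stated in the abstract.
# Assumption: every percentage uses the 49 included studies as its denominator.
INCLUDED = 49

reported = {
    "evaluation and benchmarking applications": (37, 75.5),
    "pointwise scoring": (42, 85.7),
    "GPT-family judges": (36, 73.5),
    "no human validators": (13, 26.5),
    "no risk of bias testing": (36, 73.5),
    "examined demographic fairness": (1, 2.0),
}


def percent(count, total=INCLUDED):
    """Return count/total as a percentage rounded to one decimal place."""
    return round(100 * count / total, 1)


# Recompute each percentage from its count.
recomputed = {label: percent(n) for label, (n, _) in reported.items()}

for label, (n, stated) in reported.items():
    print(f"{label}: {n}/{INCLUDED} = {recomputed[label]}% (stated {stated}%)")

# Studies with human involvement plus studies with none should cover all included studies.
human_involvement, no_human = 36, 13
print("human + none =", human_involvement + no_human)
```

Under that assumption, each recomputed value matches the stated one to one decimal place, and the 36 studies with human involvement plus the 13 without account for all 49.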
