MedVAL: Toward Expert-Level Medical Text Validation with Language Models
Asad Aali, Vasiliki Bikia, Maya Varma, Nicole Chiou, Sophie Ostmeier, Arnav Singhvi, Magdalini Paschali, Ashwin Kumar, Andrew Johnston, Karimar Amador-Martinez, Eduardo Juan Perez Guerrero, Paola Naovi Cruz Rivera, Sergios Gatidis, Christian Bluethgen, Eduardo Pontes Reis

TL;DR
MedVAL introduces a self-supervised distillation method to evaluate medical text accuracy, significantly improving the alignment of language models with physician judgments without requiring manual labels.
Contribution
The paper presents MedVAL, a novel data-efficient distillation approach for training evaluators that assess medical text factuality without needing physician-labeled data.
Findings
MedVAL improves evaluation F1 scores from 66% to 83%.
MedVAL enhances proprietary LM performance by 8%.
MedVAL achieves performance comparable to human experts on a subset.
Abstract
With the growing use of language models (LMs) in clinical environments, there is an immediate need to evaluate the accuracy and safety of LM-generated medical text. Currently, such evaluation relies solely on manual physician review. However, detecting errors in LM-generated text is challenging because 1) manual review is costly and 2) expert-composed reference outputs are often unavailable in real-world settings. While the "LLM-as-a-judge" paradigm offers scalable evaluation, even frontier LMs can miss subtle but clinically significant errors. We propose MedVAL, a novel, self-supervised, data-efficient distillation method that leverages synthetic data to train evaluator LMs to assess whether LM-generated medical outputs are factually consistent with inputs, without requiring physician labels or reference outputs. To evaluate LM performance, we introduce MedVAL-Bench, a dataset of 840…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Code & Models
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsArtificial Intelligence in Healthcare and Education · Machine Learning in Healthcare · Topic Modeling
