MedVAL: Toward Expert-Level Medical Text Validation with Language Models

Asad Aali; Vasiliki Bikia; Maya Varma; Nicole Chiou; Sophie Ostmeier; Arnav Singhvi; Magdalini Paschali; Ashwin Kumar; Andrew Johnston; Karimar Amador-Martinez; Eduardo Juan Perez Guerrero; Paola Naovi Cruz Rivera; Sergios Gatidis; Christian Bluethgen; Eduardo Pontes Reis; Eddy D. Zandee van Rilland; Poonam Laxmappa Hosamani; Kevin R Keet; Minjoung Go; Evelyn Ling; David B. Larson; Curtis Langlotz; Roxana Daneshjou; Jason Hom; Sanmi Koyejo; Emily Alsentzer; Akshay S. Chaudhari

arXiv:2507.03152·cs.CL·February 10, 2026

Open Access · 2 Models · 1 Dataset

TL;DR

MedVAL is a self-supervised distillation method for training language models to evaluate the factual consistency of medical text. It significantly improves the alignment of evaluator models with physician judgments without requiring physician labels.

Contribution

The paper presents MedVAL, a novel data-efficient distillation approach for training evaluators that assess medical text factuality without needing physician-labeled data.

Findings

1. MedVAL improves evaluation F1 scores from 66% to 83%.
2. MedVAL enhances proprietary LM performance by 8%.
3. MedVAL achieves performance comparable to human experts on a subset.
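
For readers unfamiliar with the metric behind the first finding, the following is a minimal sketch of how an F1 score between an evaluator's labels and physician labels could be computed. It assumes a binary labeling (1 = output flagged as containing an error, 0 = not flagged) with physician labels as ground truth. The label scheme, the function name `binary_f1`, and the example data are all hypothetical; the paper's actual label categories and averaging procedure are not given in this listing.

```python
from typing import Sequence


def binary_f1(predicted: Sequence[int], reference: Sequence[int]) -> float:
    """F1 for the positive class (1), treating `reference` as ground truth."""
    if len(predicted) != len(reference):
        raise ValueError("label sequences must have the same length")
    pairs = list(zip(predicted, reference))
    tp = sum(1 for p, r in pairs if p == 1 and r == 1)  # true positives
    fp = sum(1 for p, r in pairs if p == 1 and r == 0)  # false positives
    fn = sum(1 for p, r in pairs if p == 0 and r == 1)  # false negatives
    if tp == 0:
        # No true positives: precision or recall is zero or undefined
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical labels, for illustration only (not data from the paper)
physician = [1, 0, 1, 1, 0, 0, 1, 0]
evaluator = [1, 0, 0, 1, 0, 1, 1, 0]
score = binary_f1(evaluator, physician)
print(f"F1 = {score:.2f}")
```

Here the evaluator misses one physician-flagged error and raises one false alarm, so precision and recall are both 3/4. F1 penalizes both kinds of disagreement, which is why it is a common choice when missed errors and false alarms both matter.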

Abstract

With the growing use of language models (LMs) in clinical environments, there is an immediate need to evaluate the accuracy and safety of LM-generated medical text. Currently, such evaluation relies solely on manual physician review. However, detecting errors in LM-generated text is challenging because 1) manual review is costly and 2) expert-composed reference outputs are often unavailable in real-world settings. While the "LLM-as-a-judge" paradigm offers scalable evaluation, even frontier LMs can miss subtle but clinically significant errors. We propose MedVAL, a novel, self-supervised, data-efficient distillation method that leverages synthetic data to train evaluator LMs to assess whether LM-generated medical outputs are factually consistent with inputs, without requiring physician labels or reference outputs. To evaluate LM performance, we introduce MedVAL-Bench, a dataset of 840…

Code & Models

Datasets

stanfordmimi/MedVAL-Bench
dataset · 93 dl

Taxonomy

Topics: Artificial Intelligence in Healthcare and Education · Machine Learning in Healthcare · Topic Modeling