MedPAIR: Measuring Physicians and AI Relevance Alignment in Medical Question Answering

Yuexing Hao; Kumail Alhamoud; Hyewon Jeong; Haoran Zhang; Isha Puri; Philip Torr; Mike Schaekermann; Ariel D. Stern; Marzyeh Ghassemi

arXiv:2505.24040·cs.CL·June 2, 2025

TL;DR

This paper introduces MedPAIR, a dataset for evaluating how closely large language models and physician trainees agree on which information in a medical question is relevant. It finds that LLMs are frequently misaligned with physician trainees, and that accuracy improves when irrelevant content is filtered out.

Contribution

The study presents MedPAIR, a new dataset with sentence-level relevance annotations from 36 physician trainees on 1,300 QA pairs, and compares relevance estimation between physician trainees and LLMs in medical QA.

Findings

01

LLMs are frequently not aligned with physician trainees' relevance judgments.

02

Filtering out irrelevant sentences improves accuracy for both LLMs and physician trainees.

03

Relevance estimation impacts downstream medical QA performance.

Abstract

Large Language Models (LLMs) have demonstrated remarkable performance on various medical question-answering (QA) benchmarks, including standardized medical exams. However, correct answers alone do not ensure correct logic, and models may reach accurate conclusions through flawed processes. In this study, we introduce the MedPAIR (Medical Dataset Comparing Physicians and AI Relevance Estimation and Question Answering) dataset to evaluate how physician trainees and LLMs prioritize relevant information when answering QA questions. We obtain annotations on 1,300 QA pairs from 36 physician trainees, labeling each sentence within the question components for relevance. We compare these relevance estimates to those for LLMs, and further evaluate the impact of these "relevant" subsets on downstream task performance for both physician trainees and LLMs. We find that LLMs are frequently not…
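The sentence-level comparison described in the abstract can be illustrated with a small sketch. It assumes binary labels (1 = relevant, 0 = irrelevant) for each sentence of a question, which the text above does not specify. The function names, the example vignette, and the use of Cohen's kappa are illustrative choices, not necessarily the authors' metric or code.

```python
from typing import List


def agreement_rate(physician: List[int], model: List[int]) -> float:
    """Fraction of sentences where the two binary relevance labels match."""
    if len(physician) != len(model):
        raise ValueError("label lists must cover the same sentences")
    if not physician:
        raise ValueError("need at least one sentence")
    matches = sum(p == m for p, m in zip(physician, model))
    return matches / len(physician)


def cohen_kappa(physician: List[int], model: List[int]) -> float:
    """Chance-corrected agreement for two binary raters.

    Cohen's kappa is a common agreement statistic, used here as an
    assumption; the source does not say which metric the paper uses.
    """
    n = len(physician)
    observed = agreement_rate(physician, model)
    p_rel = sum(physician) / n  # share the physician marked relevant
    m_rel = sum(model) / n      # share the model marked relevant
    expected = p_rel * m_rel + (1 - p_rel) * (1 - m_rel)
    if expected == 1.0:
        # Both raters used a single identical label for every sentence.
        return 1.0
    return (observed - expected) / (1 - expected)


def filter_relevant(sentences: List[str], labels: List[int]) -> List[str]:
    """Keep only the sentences labeled relevant (label == 1)."""
    return [s for s, keep in zip(sentences, labels) if keep == 1]


# Hypothetical vignette split into sentences (not taken from MedPAIR).
sentences = [
    "A 54-year-old man presents with crushing chest pain.",
    "He recently returned from a fishing trip.",
    "ECG shows ST elevation in leads II, III, and aVF.",
    "His brother is an accountant.",
]
physician_labels = [1, 0, 1, 0]  # hypothetical physician-trainee labels
model_labels = [1, 1, 1, 0]      # hypothetical LLM labels

rate = agreement_rate(physician_labels, model_labels)
kappa = cohen_kappa(physician_labels, model_labels)
filtered_question = " ".join(filter_relevant(sentences, physician_labels))
```

Under these assumptions, `filtered_question` holds only the sentences the physician trainee marked relevant, which is the kind of reduced input that could then be passed to a model or a human to measure downstream accuracy.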
