MediEval: A Unified Medical Benchmark for Patient-Contextual and Knowledge-Grounded Reasoning in LLMs

Zhan Qu; Michael Färber

arXiv:2512.20822·cs.CL·May 8, 2026

TL;DR

MediEval is a benchmark that links EHRs to biomedical knowledge in order to evaluate LLMs' medical reasoning. The paper uses it to identify failure modes and proposes a fine-tuning method to improve safety and accuracy.

Contribution

This work presents MediEval, a novel medical benchmark combining patient data and knowledge grounding, and introduces CoRFu, a fine-tuning approach to improve LLM safety and correctness.

Findings

01. CoRFu improves macro-F1 by 16.4 points over the base model.

02. Hallucinated support and truth inversion are identified as common failure modes.

03. The proposed fine-tuning method eliminates truth inversion errors.

Abstract

Large Language Models (LLMs) are increasingly applied to medicine, yet their adoption is limited by concerns over reliability and safety. Existing evaluations either test factual medical knowledge in isolation or assess patient-level reasoning without verifying correctness, leaving a critical gap. We introduce MediEval, a benchmark that links MIMIC-IV electronic health records (EHRs) to a unified knowledge base built from UMLS and other biomedical vocabularies. MediEval generates diverse factual and counterfactual medical statements within real patient contexts, enabling systematic evaluation across a 4-quadrant framework that jointly considers knowledge grounding and contextual consistency. Using this framework, we identify critical failure modes, including hallucinated support and truth inversion, that current proprietary, open-source, and domain-specific LLMs frequently exhibit. To…
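
To make the 4-quadrant idea and the macro-F1 metric concrete, here is a minimal sketch, not the authors' implementation. It assumes each statement carries two boolean properties (whether it is knowledge-grounded and whether it is consistent with the patient context) and that a model's predicted quadrant is scored against the gold quadrant. The function names and the quadrant label strings are hypothetical; the source names the two axes but not the labels.

```python
def quadrant(knowledge_grounded: bool, context_consistent: bool) -> str:
    """Map the two axes to one of four labels (label names are hypothetical)."""
    kg = "grounded" if knowledge_grounded else "ungrounded"
    cc = "consistent" if context_consistent else "inconsistent"
    return f"{kg}/{cc}"


def macro_f1(gold: list[str], pred: list[str]) -> float:
    """Unweighted mean of per-class F1 over every label seen in gold or pred."""
    labels = sorted(set(gold) | set(pred))
    scores = []
    for label in labels:
        tp = sum(g == label and p == label for g, p in zip(gold, pred))
        fp = sum(g != label and p == label for g, p in zip(gold, pred))
        fn = sum(g == label and p != label for g, p in zip(gold, pred))
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); a class with no hits or errors scores 0.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy data: one statement per quadrant; the last prediction is wrong.
gold = [
    quadrant(True, True),
    quadrant(True, False),
    quadrant(False, True),
    quadrant(False, False),
]
pred = [gold[0], gold[1], gold[2], gold[0]]

score = macro_f1(gold, pred)
print(f"macro-F1 = {score:.3f}")
```

Because macro-F1 averages the per-class scores without weighting by class frequency, a model that never predicts one quadrant is penalized as heavily as one that fails on a common quadrant, which suits an evaluation where every quadrant matters.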
