MedicalBench: Evaluating Large Language Models Toward Improved Medical Concept Extraction

Zhichao Yang; Gregory D. Lyng; Sanjit Singh Batra; Robert E. Tillman

arXiv:2605.20197·cs.CL·May 21, 2026

TL;DR

MedicalBench is a new benchmark dataset designed to evaluate large language models on their ability to extract implicit medical concepts from electronic health records, emphasizing reasoning and evidence grounding.

Contribution

The paper introduces a systematic benchmark for implicit, evidence-grounded medical concept extraction, including a curated dataset and evaluation tasks for correctness and interpretability.

Findings

01. State-of-the-art LLMs perform modestly on MedicalBench.

02. Performance is largely invariant to note length.

03. MedicalBench isolates reasoning difficulty from superficial confounders.

Abstract

Medical concept extraction from electronic health records underpins many downstream applications, yet remains challenging because medically meaningful concepts are frequently implied rather than explicitly stated in medical narratives. Existing benchmarks with human-annotated evidence spans underscore the importance of grounding extracted concepts in medical text. However, they predominantly focus on explicitly stated concepts instead of implicit concepts. We present MedicalBench, a benchmark for medical concept extraction with evidence grounding that evaluates implicit medical reasoning. MedicalBench formulates medical concept extraction as a verification task over medical note-concept pairs, coupled with sentence-level evidence identification. Built from MIMIC-IV discharge summaries and human-verified ICD-10 codes, the dataset is curated through a multi-stage large language model…
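
The task formulation in the abstract (verification over note-concept pairs, plus sentence-level evidence identification) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the `Example` and `Prediction` schemas, the field names, and the choice of accuracy and set-based F1 as metrics are hypothetical and are not taken from the paper or its released data.

```python
from dataclasses import dataclass, field


@dataclass
class Example:
    """One note-concept pair (hypothetical schema, not the paper's format)."""
    sentences: list[str]   # note text, already split into sentences
    concept: str           # candidate concept, e.g. an ICD-10 code
    label: bool            # gold answer: does the note support the concept?
    evidence: set[int] = field(default_factory=set)  # gold supporting sentence indices


@dataclass
class Prediction:
    """A model's answer for one pair (hypothetical schema)."""
    label: bool
    evidence: set[int] = field(default_factory=set)


def verification_accuracy(golds: list[Example], preds: list[Prediction]) -> float:
    """Fraction of pairs where the predicted yes/no label matches the gold label."""
    if len(golds) != len(preds):
        raise ValueError("golds and preds must have the same length")
    if not golds:
        raise ValueError("cannot score an empty set of examples")
    correct = sum(g.label == p.label for g, p in zip(golds, preds))
    return correct / len(golds)


def evidence_f1(gold: set[int], pred: set[int]) -> float:
    """Set-based F1 between gold and predicted evidence sentence indices."""
    if not gold and not pred:
        return 1.0  # nothing to find and nothing predicted
    overlap = len(gold & pred)
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(gold)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Invented toy note: the concept is implied by treatment and lab values,
    # never named outright. This is not real patient data.
    gold = Example(
        sentences=[
            "Patient was started on an insulin sliding scale.",
            "Hemoglobin A1c was 9.2%.",
            "Discharged home in stable condition.",
        ],
        concept="E11.9",
        label=True,
        evidence={0, 1},
    )
    pred = Prediction(label=True, evidence={1, 2})
    print("accuracy:", verification_accuracy([gold], [pred]))
    print("evidence F1:", evidence_f1(gold.evidence, pred.evidence))
```

Scoring the label and the evidence separately keeps the two questions apart: a model can give the right yes/no answer while pointing at the wrong sentences, and only the evidence score exposes that.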
