MedicalBench: Evaluating Large Language Models Toward Improved Medical Concept Extraction
Zhichao Yang, Gregory D. Lyng, Sanjit Singh Batra, Robert E. Tillman

TL;DR
MedicalBench is a new benchmark dataset designed to evaluate large language models on their ability to extract implicit medical concepts from electronic health records, emphasizing reasoning and evidence grounding.
Contribution
The paper introduces a systematic benchmark for implicit, evidence-grounded medical concept extraction, comprising a curated dataset and evaluation tasks that assess both correctness and interpretability.
Findings
State-of-the-art LLMs perform modestly on MedicalBench.
Performance is largely invariant to note length.
MedicalBench isolates reasoning difficulty from superficial confounders.
Abstract
Medical concept extraction from electronic health records underpins many downstream applications, yet remains challenging because medically meaningful concepts are frequently implied rather than explicitly stated in medical narratives. Existing benchmarks with human-annotated evidence spans underscore the importance of grounding extracted concepts in medical text. However, they predominantly focus on explicitly stated concepts instead of implicit concepts. We present MedicalBench, a benchmark for medical concept extraction with evidence grounding that evaluates implicit medical reasoning. MedicalBench formulates medical concept extraction as a verification task over medical note-concept pairs, coupled with sentence-level evidence identification. Built from MIMIC-IV discharge summaries and human-verified ICD-10 codes, the dataset is curated through a multi-stage large language model…
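The task formulation in the abstract (verification over note-concept pairs, plus sentence-level evidence identification) can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration: the `NoteConceptPair` and `Prediction` types, their field names, and the two scoring functions are hypothetical and are not taken from the paper, which may define its data schema and metrics differently. The sketch scores the yes/no verification decision with plain accuracy and the evidence selection with set-overlap F1 over sentence indices.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class NoteConceptPair:
    """One hypothetical benchmark item: a note, a candidate concept, and gold labels."""
    note_sentences: list[str]   # the note, pre-split into sentences
    concept: str                # candidate concept, e.g. an ICD-10 code description
    label: bool                 # True if the note supports the concept
    evidence: set[int] = field(default_factory=set)  # gold evidence sentence indices


@dataclass
class Prediction:
    """A model's output for one item (hypothetical shape)."""
    label: bool
    evidence: set[int] = field(default_factory=set)


def verification_accuracy(gold: list[NoteConceptPair], pred: list[Prediction]) -> float:
    """Fraction of items where the predicted yes/no label matches the gold label."""
    if not gold:
        return 0.0
    correct = sum(g.label == p.label for g, p in zip(gold, pred))
    return correct / len(gold)


def evidence_f1(gold_evidence: set[int], pred_evidence: set[int]) -> float:
    """Set-overlap F1 between gold and predicted evidence sentence indices."""
    if not gold_evidence and not pred_evidence:
        return 1.0  # nothing to find and nothing predicted
    overlap = len(gold_evidence & pred_evidence)
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_evidence)
    recall = overlap / len(gold_evidence)
    return 2 * precision * recall / (precision + recall)


# Invented toy item (not from the dataset): the concept is implied, never stated.
example = NoteConceptPair(
    note_sentences=[
        "Patient admitted with chest pain.",
        "Home medications include metformin.",
        "HbA1c was elevated on admission.",
    ],
    concept="Type 2 diabetes mellitus",
    label=True,
    evidence={1, 2},
)
model_output = Prediction(label=True, evidence={2})

accuracy = verification_accuracy([example], [model_output])
f1 = evidence_f1(example.evidence, model_output.evidence)
```

Keeping the two scores separate reflects the split the abstract describes: a model can reach the right verification label while pointing at the wrong sentences, and only the evidence score would expose that.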
