ClinicalBench: Stress-Testing Assertion-Aware Retrieval for Cross-Admission Clinical QA on MIMIC-IV

Alex Stinard

arXiv:2605.11143 · cs.CL · May 13, 2026

TL;DR

ClinicalBench evaluates retrieval methods for clinical question answering on MIMIC-IV, focusing on questions whose answers hinge on assertion status (negation, family-versus-patient attribution) and temporality. Blinded physician adjudication reports a +22.0 percentage-point gain on the primary endpoint.

Contribution

Introduces EpiKG, an assertion-aware KG-RAG retrieval architecture, and ClinicalBench, a 400-question clinical QA benchmark over 43 MIMIC-IV patients with blinded physician adjudication.
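
To make the idea of assertion-aware retrieval concrete, here is a minimal Python sketch. It is not the paper's implementation: the class names, the label sets, the keyword router in classify_intent, and the admission_id field are all hypothetical, chosen only to show how a fact that carries an assertion label and a temporality tag can be filtered by question intent.

```python
import re
from dataclasses import dataclass
from enum import Enum


class Assertion(Enum):
    # Hypothetical label set; the paper's actual labels may differ.
    PRESENT = "present"
    NEGATED = "negated"
    FAMILY = "family"  # attributed to a relative, not the patient


class Temporality(Enum):
    # Hypothetical tag set.
    CURRENT = "current"
    HISTORICAL = "historical"


@dataclass(frozen=True)
class Fact:
    concept: str
    assertion: Assertion
    temporality: Temporality
    admission_id: str  # hypothetical field for cross-admission questions


def classify_intent(question: str) -> str:
    """Toy keyword router (an assumption; a real system would likely
    use a trained classifier or an LLM)."""
    q = question.lower()
    tokens = set(re.findall(r"[a-z]+", q))  # whole words, so "fever" != "ever"
    if "family" in tokens:
        return "family"
    if tokens & {"deny", "denies", "denied"} or "ruled out" in q:
        return "negated"
    if tokens & {"ever", "previously", "prior"} or "history of" in q:
        return "historical"
    return "current"


def retrieve(facts: list[Fact], question: str) -> list[Fact]:
    """Return only the facts whose labels match the question's intent."""
    intent = classify_intent(question)
    if intent == "family":
        return [f for f in facts if f.assertion is Assertion.FAMILY]
    if intent == "negated":
        return [f for f in facts if f.assertion is Assertion.NEGATED]
    wanted = (Temporality.HISTORICAL if intent == "historical"
              else Temporality.CURRENT)
    return [f for f in facts
            if f.assertion is Assertion.PRESENT and f.temporality is wanted]
```

The point of the sketch is the design choice, not the router: because the labels travel with each fact, a negated or family-attributed mention is excluded from a question about the patient's current state before any language model sees it.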

Findings

01 EpiKG improves retrieval accuracy with assertion and temporality tagging.

02 Architectural novelty yields +8.84 percentage points over baseline.

03 Physician adjudication reveals 56% of auto-generated answers are defective.

Abstract

Reasoning benchmarks measure clinical performance on clean inputs. We evaluate the step before reasoning: retrieval over real EHR notes, where negation, temporality, and family-versus-patient attribution can flip a correct answer to a wrong one. EpiKG carries an assertion label and a temporality tag with every fact in a patient knowledge graph, then routes retrieval by question intent. ClinicalBench is a 400-question test over 43 MIMIC-IV patients across 9 assertion-sensitive categories. A 7-condition ablation tests each piece of EpiKG across six LLMs (Claude Opus 4.6, GPT-OSS 20B, MedGemma 27B, Gemma 4 31B, MedGemma 1.5 4B, Qwen 3.5 35B). Three physicians blindly adjudicated 100 paired items. The author-blind primary endpoint, leave-author-out paired exact McNemar on 50 unanimous-strict items rated by two external physicians, yields +22.0 percentage points (95 percent Newcombe CI…
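
For readers unfamiliar with the paired exact McNemar test named in the abstract, the following Python sketch shows the standard two-sided exact version computed from the two discordant counts. The function names and the example counts used with them are illustrative assumptions and are not taken from the paper; the Newcombe interval is not reproduced here.

```python
from math import comb


def mcnemar_exact(b: int, c: int) -> float:
    """Two-sided exact McNemar p-value.

    b: pairs where only system A was judged correct
    c: pairs where only system B was judged correct
    Under the null hypothesis, b follows Binomial(b + c, 0.5).
    """
    n = b + c
    if n == 0:
        return 1.0  # no discordant pairs, so no evidence of a difference
    k = min(b, c)
    # Probability of a split at least as lopsided, in one tail.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


def paired_difference_pp(b: int, c: int, n_pairs: int) -> float:
    """Difference in accuracy (A minus B) in percentage points.

    Concordant pairs cancel, so only the discordant counts matter.
    """
    return 100 * (b - c) / n_pairs
```

With made-up counts of b = 9 and c = 2 among 40 paired items, paired_difference_pp(9, 2, 40) gives a 17.5 percentage-point difference, and mcnemar_exact(9, 2) gives the matching exact p-value. Only the discordant pairs enter the test, which is why a small adjudicated sample can still be informative when the two systems disagree often.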
