ClinicalBench: Stress-Testing Assertion-Aware Retrieval for Cross-Admission Clinical QA on MIMIC-IV
Alex Stinard

TL;DR
ClinicalBench evaluates retrieval methods for clinical question answering over MIMIC-IV notes, focusing on assertion status and temporality. Blinded physician adjudication shows that the assertion-aware EpiKG retriever improves on baseline retrieval.
Contribution
Introduces EpiKG, an assertion-aware KG-RAG retrieval architecture, and ClinicalBench, a clinical QA benchmark with physician adjudication, advancing real-world clinical NLP evaluation.
Findings
EpiKG improves retrieval accuracy with assertion and temporality tagging.
The architectural changes yield a gain of +8.84 percentage points over the baseline.
Physician adjudication reveals 56% of auto-generated answers are defective.
Abstract
Reasoning benchmarks measure clinical performance on clean inputs. We evaluate the step before reasoning: retrieval over real EHR notes, where negation, temporality, and family-versus-patient attribution can flip a correct answer to a wrong one. EpiKG carries an assertion label and a temporality tag with every fact in a patient knowledge graph, then routes retrieval by question intent. ClinicalBench is a 400-question test over 43 MIMIC-IV patients across 9 assertion-sensitive categories. A 7-condition ablation tests each piece of EpiKG across six LLMs (Claude Opus 4.6, GPT-OSS 20B, MedGemma 27B, Gemma 4 31B, MedGemma 1.5 4B, Qwen 3.5 35B). Three physicians blindly adjudicated 100 paired items. The author-blind primary endpoint, leave-author-out paired exact McNemar on 50 unanimous-strict items rated by two external physicians, yields +22.0 percentage points (95 percent Newcombe CI…
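The idea of carrying an assertion label and a temporality tag with every fact, then routing retrieval by question intent, can be illustrated with a minimal Python sketch. This is not the EpiKG implementation: the class names, the label sets, the intent names, and the `INTENT_FILTERS` table are all hypothetical, chosen only to show how a negated or family-attributed mention can be kept out of an answer about the patient's active problems.

```python
from dataclasses import dataclass
from enum import Enum


class Assertion(Enum):
    # Hypothetical label set; the paper's actual labels may differ.
    PRESENT = "present"
    NEGATED = "negated"
    FAMILY = "family"  # attributed to a relative, not the patient


class Temporality(Enum):
    # Hypothetical tag set.
    CURRENT = "current"
    HISTORICAL = "historical"


@dataclass(frozen=True)
class Fact:
    concept: str
    assertion: Assertion
    temporality: Temporality
    admission_id: str  # hypothetical identifier for the source admission


# Hypothetical routing table: question intent -> required (assertion, temporality).
# None means "any temporality".
INTENT_FILTERS = {
    "active_problem": (Assertion.PRESENT, Temporality.CURRENT),
    "past_history": (Assertion.PRESENT, Temporality.HISTORICAL),
    "family_history": (Assertion.FAMILY, None),
    "ruled_out": (Assertion.NEGATED, None),
}


def retrieve(facts, concept, intent):
    """Return facts about `concept` whose tags match the question intent."""
    assertion, temporality = INTENT_FILTERS[intent]
    return [
        f
        for f in facts
        if f.concept == concept
        and f.assertion is assertion
        and (temporality is None or f.temporality is temporality)
    ]


# Illustrative, made-up facts spanning two admissions.
EXAMPLE_FACTS = [
    Fact("diabetes", Assertion.FAMILY, Temporality.HISTORICAL, "adm-1"),
    Fact("diabetes", Assertion.NEGATED, Temporality.CURRENT, "adm-1"),
    Fact("pneumonia", Assertion.PRESENT, Temporality.HISTORICAL, "adm-1"),
    Fact("pneumonia", Assertion.PRESENT, Temporality.CURRENT, "adm-2"),
]

if __name__ == "__main__":
    # A plain keyword match on "diabetes" would return two hits;
    # intent routing returns none for an active-problem question.
    print(retrieve(EXAMPLE_FACTS, "diabetes", "active_problem"))
    print(retrieve(EXAMPLE_FACTS, "diabetes", "family_history"))
```

In this toy setting, a question about whether the patient currently has diabetes retrieves nothing, while the same concept is returned for a family-history question. That is the kind of answer flip the abstract attributes to negation, temporality, and family-versus-patient attribution.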
