A Multi-Stage Validation Framework for Trustworthy Large-scale Clinical Information Extraction using Large Language Models

Maria Mahbub; Gregory M. Dams; Josh Arnold; Caitlin Rizy; Sudarshan Srinivasan; Elliot M. Fielstein; Minu A. Aghevli; Kamonica L. Craig; Elizabeth M. Oliva; Joseph Erdos; Jodie Trafton; and Ioana Danciu

arXiv:2604.06028·cs.CL·April 8, 2026


TL;DR

This paper introduces a comprehensive multi-stage validation framework for trustworthy large-scale clinical information extraction using LLMs, enabling rigorous assessment without extensive manual annotation.

Contribution

It presents a novel multi-stage validation approach combining prompt calibration, plausibility filtering, semantic grounding, and expert review for scalable clinical data extraction.

Findings

01

Rule-based filtering removed 14.59% of unsupported extractions.

02

Judge LLM assessments agreed with expert review at Gwet's AC1 = 0.80.

03

Extracted substance use disorder (SUD) diagnoses predicted care engagement with AUC = 0.80.
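
For readers unfamiliar with the agreement statistic reported above, the following is a minimal sketch of Gwet's AC1 for two raters, using the standard definition: observed agreement corrected by a chance-agreement term built from the average category prevalence. It is not the authors' code. The function name `gwet_ac1` and the toy judge and expert labels are hypothetical and chosen only for illustration; they are not data from the paper.

```python
from collections import Counter


def gwet_ac1(ratings_a, ratings_b, categories=None):
    """Gwet's AC1 agreement coefficient for two raters (illustrative sketch).

    ratings_a, ratings_b: equal-length sequences of category labels.
    categories: optional full list of possible labels; inferred if omitted.
    """
    if len(ratings_a) != len(ratings_b) or len(ratings_a) == 0:
        raise ValueError("ratings must be non-empty and of equal length")

    n = len(ratings_a)
    if categories is None:
        categories = sorted(set(ratings_a) | set(ratings_b))
    k = len(categories)
    if k < 2:
        raise ValueError("AC1 needs at least two categories")

    # Observed agreement: share of items on which both raters agree.
    p_observed = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n

    # Chance agreement: based on the mean prevalence of each category
    # across the two raters, pi_k, as sum(pi_k * (1 - pi_k)) / (K - 1).
    count_a, count_b = Counter(ratings_a), Counter(ratings_b)
    p_chance = 0.0
    for c in categories:
        pi_c = (count_a[c] / n + count_b[c] / n) / 2
        p_chance += pi_c * (1 - pi_c)
    p_chance /= (k - 1)

    return (p_observed - p_chance) / (1 - p_chance)


# Hypothetical toy labels (1 = extraction supported, 0 = unsupported).
judge_labels = [1, 1, 1, 0, 1, 0, 1, 1, 0, 1]
expert_labels = [1, 1, 1, 0, 1, 1, 1, 1, 0, 1]
print(round(gwet_ac1(judge_labels, expert_labels), 2))  # 0.84
```

In this toy case the raters agree on 9 of 10 items (observed agreement 0.90) and the chance term is 0.375, giving AC1 = 0.84. AC1 is often preferred over Cohen's kappa when one category dominates, since its chance term stays small under skewed prevalence.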

Abstract

Large language models (LLMs) show promise for extracting clinically meaningful information from unstructured health records, yet their translation into real-world settings is constrained by the lack of scalable and trustworthy validation approaches. Conventional evaluation methods rely heavily on annotation-intensive reference standards or incomplete structured data, limiting feasibility at population scale. We propose a multi-stage validation framework for LLM-based clinical information extraction that enables rigorous assessment under weak supervision. The framework integrates prompt calibration, rule-based plausibility filtering, semantic grounding assessment, targeted confirmatory evaluation using an independent higher-capacity judge LLM, selective expert review, and external predictive validity analysis to quantify uncertainty and characterize error modes without exhaustive manual…
