Generating patient cohorts from electronic health records using two-step retrieval-augmented text-to-SQL generation
Angelo Ziletti, Leonardo D'Ambrosi

TL;DR
This paper introduces an automated system that uses large language models and retrieval-augmented generation to translate clinical criteria into SQL queries, enabling efficient patient cohort retrieval from electronic health records.
Contribution
The novel system combines criteria parsing, knowledge base retrieval, concept standardization, and SQL generation, advancing automated cohort identification from EHR data.
Findings
Achieves an F1-score of 0.75 in cohort identification on EHR data
Effectively captures complex temporal and logical relationships
Demonstrates feasibility of automated cohort generation
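For readers unfamiliar with the metric, the following is a minimal sketch of how an F1-score can be computed for cohort identification when both the generated and the reference cohort are treated as sets of patient IDs. The summary does not state the paper's exact evaluation protocol, so the set-based, patient-level definition, the function name `cohort_f1`, and the toy IDs are all assumptions made for illustration.

```python
def cohort_f1(predicted_ids, reference_ids):
    """Patient-level F1 between a predicted cohort and a reference cohort.

    Assumption: both cohorts are compared as sets of patient IDs.
    """
    predicted, reference = set(predicted_ids), set(reference_ids)
    true_positives = len(predicted & reference)
    if true_positives == 0:
        # Covers empty cohorts and fully disjoint cohorts.
        return 0.0
    precision = true_positives / len(predicted)  # share of retrieved patients that are correct
    recall = true_positives / len(reference)     # share of reference patients that were retrieved
    return 2 * precision * recall / (precision + recall)


# Toy example with made-up patient IDs (not data from the paper).
score = cohort_f1([1, 2, 3, 4], [2, 3, 4, 5, 6])
print(round(score, 3))
```

Here three of the four predicted patients are in the reference cohort (precision 0.75) and three of the five reference patients are retrieved (recall 0.6), so the harmonic mean is about 0.667.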
Abstract
Clinical cohort definition is crucial for patient recruitment and observational studies, yet translating inclusion/exclusion criteria into SQL queries remains a challenging, manual task. We present an automated system utilizing large language models that combines criteria parsing, two-level retrieval-augmented generation with specialized knowledge bases, medical concept standardization, and SQL generation to retrieve patient cohorts with patient funnels. The system achieves an F1-score of 0.75 in cohort identification on EHR data, effectively capturing complex temporal and logical relationships. These results demonstrate the feasibility of automated cohort generation for epidemiological research.
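The idea of a patient funnel, where each inclusion or exclusion criterion is applied in turn and the number of remaining patients is recorded, can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the two-table schema, the diagnosis codes, the `criteria` list, and the `build_funnel` helper are hypothetical, and the per-criterion SQL is hand-written here, whereas in the described system it would be produced by the LLM after criteria parsing, retrieval, and concept standardization.

```python
import sqlite3

# Hypothetical toy schema and data; real EHR schemas (and the paper's) will differ.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE patients (patient_id INTEGER PRIMARY KEY, birth_year INTEGER);
CREATE TABLE diagnoses (patient_id INTEGER, code TEXT, diagnosis_date TEXT);
INSERT INTO patients VALUES (1, 1950), (2, 1985), (3, 1960), (4, 1972);
INSERT INTO diagnoses VALUES
    (1, 'E11', '2020-03-01'),
    (2, 'E11', '2021-06-15'),
    (3, 'E11', '2019-01-10'),
    (3, 'I50', '2019-05-02'),
    (4, 'J45', '2020-09-09');
""")

# Each criterion: (label, "include" or "exclude", SQL returning matching patient_id values).
# Assumption: the SQL strings stand in for LLM-generated queries.
criteria = [
    ("Type 2 diabetes diagnosis (E11)", "include",
     "SELECT patient_id FROM diagnoses WHERE code = 'E11'"),
    ("Born before 1980", "include",
     "SELECT patient_id FROM patients WHERE birth_year < 1980"),
    ("Heart failure diagnosis (I50)", "exclude",
     "SELECT patient_id FROM diagnoses WHERE code = 'I50'"),
]


def build_funnel(conn, criteria):
    """Apply criteria in order and record how many patients remain after each step."""
    remaining = {row[0] for row in conn.execute("SELECT patient_id FROM patients")}
    funnel = [("All patients", len(remaining))]
    for label, kind, sql in criteria:
        matched = {row[0] for row in conn.execute(sql)}
        # Inclusion keeps only matching patients; exclusion removes them.
        remaining = remaining & matched if kind == "include" else remaining - matched
        funnel.append((label, len(remaining)))
    return sorted(remaining), funnel


cohort, funnel = build_funnel(conn, criteria)
for label, count in funnel:
    print(f"{count:>3}  {label}")
print("Final cohort:", cohort)
```

Keeping each criterion as a separate query makes the funnel easy to audit, since a reviewer can see which step removed which patients. A production system might instead compose the criteria into a single SQL statement.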
