Generating patient cohorts from electronic health records using two-step retrieval-augmented text-to-SQL generation

Angelo Ziletti; Leonardo D'Ambrosi

arXiv:2502.21107 · cs.CL · October 20, 2025

TL;DR

This paper introduces an automated system that uses large language models and retrieval-augmented generation to translate clinical criteria into SQL queries, enabling efficient patient cohort retrieval from electronic health records.

Contribution

The novel system combines criteria parsing, knowledge base retrieval, concept standardization, and SQL generation, advancing automated cohort identification from EHR data.

Findings

1. Achieved an F1-score of 0.75 in cohort identification

2. Effectively captures complex temporal and logical relationships

3. Demonstrates feasibility of automated cohort generation

Abstract

Clinical cohort definition is crucial for patient recruitment and observational studies, yet translating inclusion/exclusion criteria into SQL queries remains a challenging, manual task. We present an automated system utilizing large language models that combines criteria parsing, two-level retrieval-augmented generation with specialized knowledge bases, medical concept standardization, and SQL generation to retrieve patient cohorts with patient funnels. The system achieves an F1-score of 0.75 in cohort identification on EHR data, effectively capturing complex temporal and logical relationships. These results demonstrate the feasibility of automated cohort generation for epidemiological research.
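To make the reported metric concrete, here is a minimal sketch of how an F1-score for cohort identification could be computed. It assumes the score is taken at the patient level, comparing the set of patient IDs returned by a generated SQL query against a reference cohort. The abstract does not state the exact evaluation protocol, so the function name `cohort_f1` and the sample IDs are hypothetical and purely illustrative.

```python
from typing import Iterable


def cohort_f1(predicted_ids: Iterable[int], reference_ids: Iterable[int]) -> float:
    """Patient-level F1 between a predicted and a reference cohort.

    Assumption: each cohort is a set of unique patient identifiers.
    """
    predicted, reference = set(predicted_ids), set(reference_ids)
    true_positives = len(predicted & reference)
    if true_positives == 0:
        # Covers empty cohorts and cohorts with no overlap.
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(reference)
    return 2 * precision * recall / (precision + recall)


# Illustrative IDs only: 3 of 4 predicted patients are in the reference cohort,
# and 3 of 4 reference patients were retrieved.
predicted_cohort = [1, 2, 3, 4]
reference_cohort = [2, 3, 4, 5]
score = cohort_f1(predicted_cohort, reference_cohort)
print(f"F1 = {score:.2f}")
```

In this toy case precision and recall are both 0.75, so the F1-score is also 0.75. The match with the paper's headline number is a coincidence of the example data, not a reproduction of its result.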
