Leveraging Foundation Language Models (FLMs) for Automated Cohort Extraction from Large EHR Databases
Purity Mugambi, Alexandra Meliou, Madalina Fiterau

TL;DR
This paper introduces a method that uses foundation language models to partially automate cohort extraction from multiple large electronic health record (EHR) databases, reducing manual effort.
Contribution
The study presents a novel approach leveraging FLMs for automated column matching and query generation in multi-dataset cohort extraction, demonstrating high accuracy on MIMIC-III and eICU.
Findings
Achieves 92% top-three accuracy in column matching
Maintains accuracy as database size increases
Automates cohort extraction process effectively
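For readers unfamiliar with the metric, top-three accuracy counts a column as correctly matched when its true counterpart appears anywhere among the three highest-ranked candidates. The sketch below shows one way to compute it; the function name `top_k_accuracy` and all column names are hypothetical and chosen for illustration only, not taken from the paper.

```python
def top_k_accuracy(ranked_candidates, gold, k=3):
    """Fraction of source columns whose true match is among the first k candidates.

    ranked_candidates: dict mapping a source column to a best-first list of candidates.
    gold: dict mapping a source column to its true matching column.
    """
    if not gold:
        raise ValueError("gold mapping must not be empty")
    hits = sum(
        1
        for column, target in gold.items()
        if target in ranked_candidates.get(column, [])[:k]
    )
    return hits / len(gold)


# Hypothetical rankings and ground truth (illustrative names only).
ranked = {
    "heart_rate": ["heartrate", "hr_max", "resp_rate"],
    "admit_time": ["discharge_time", "unit_admit_time", "hospital_admit_offset", "admit_offset"],
}
gold = {"heart_rate": "heartrate", "admit_time": "admit_offset"}

score_at_3 = top_k_accuracy(ranked, gold, k=3)
score_at_4 = top_k_accuracy(ranked, gold, k=4)
```

In this toy input the true match for `admit_time` is ranked fourth, so it counts as a miss at k=3 and as a hit at k=4.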
Abstract
A crucial step in cohort studies is to extract the required cohort from one or more study datasets. This step is time-consuming, especially when a researcher is presented with a dataset that they have not previously worked with. When the cohort has to be extracted from multiple datasets, cohort extraction can be extremely laborious. In this study, we present an approach for partially automating cohort extraction from multiple electronic health record (EHR) databases. We formulate the guided multi-dataset cohort extraction problem in which selection criteria are first converted into queries, translating them from natural language text to language that maps to database entities. Then, using FLMs, columns of interest identified from the queries are automatically matched between the study databases. Finally, the generated queries are run across all databases to extract the study cohort. We…
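The column-matching step described in the abstract can be sketched as ranking candidate columns in one database by their similarity to a column of interest in another. The following is a minimal sketch under stated assumptions, not the authors' implementation: the paper uses FLMs, whereas `embed` here is a stand-in based on character-trigram counts so that the example runs without a model. The function names and column names are hypothetical.

```python
import math
from collections import Counter


def embed(text):
    """Stand-in for an FLM embedding (assumption): character-trigram counts."""
    padded = f"  {text.lower().replace('_', ' ')}  "
    return Counter(padded[i:i + 3] for i in range(len(padded) - 2))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[gram] for gram, count in a.items() if gram in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def match_columns(source_cols, target_cols, k=3):
    """Return, for each source column, the k most similar target columns."""
    target_vecs = {t: embed(t) for t in target_cols}
    matches = {}
    for source in source_cols:
        source_vec = embed(source)
        ranked = sorted(
            target_cols,
            key=lambda t: cosine(source_vec, target_vecs[t]),
            reverse=True,
        )
        matches[source] = ranked[:k]
    return matches


# Hypothetical column names, for illustration only.
candidates = match_columns(
    ["heart_rate"],
    ["heartrate", "admission_type", "gender", "age"],
    k=3,
)
```

Returning the top three candidates rather than a single best guess mirrors the "guided" framing of the problem: a researcher reviews a short list instead of searching the full schema. Swapping `embed` for a real language-model embedding would leave the ranking logic unchanged.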
Taxonomy
Topics: Machine Learning in Healthcare · Topic Modeling · Data Quality and Management
