Leveraging Foundation Language Models (FLMs) for Automated Cohort Extraction from Large EHR Databases

Purity Mugambi; Alexandra Meliou; Madalina Fiterau

arXiv:2412.11472·cs.LG·December 17, 2024

Open Access

TL;DR

This paper introduces a method that uses foundation language models (FLMs) to partially automate cohort extraction from multiple large electronic health record (EHR) databases, reducing the manual effort involved.

Contribution

The study presents an approach that uses FLMs for automated column matching and query generation in multi-dataset cohort extraction, and reports 92% top-three column-matching accuracy on MIMIC-III and eICU.

Findings

01 Achieves 92% top-three accuracy in column matching

02 Maintains accuracy as database size increases

03 Automates cohort extraction process effectively
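
For readers unfamiliar with the metric in the first finding, the sketch below shows one common way to compute top-k accuracy for column matching: a source column counts as correctly matched if its true counterpart appears among the first k ranked candidates. The column names and rankings are hypothetical and are not taken from the paper, MIMIC-III, or eICU; the paper's exact evaluation procedure may differ.

```python
from typing import Dict, List


def top_k_accuracy(ranked: Dict[str, List[str]], truth: Dict[str, str], k: int = 3) -> float:
    """Fraction of source columns whose true match is among the first k candidates."""
    if not truth:
        raise ValueError("truth must contain at least one labelled column")
    hits = sum(
        1 for col, target in truth.items()
        if target in ranked.get(col, [])[:k]
    )
    return hits / len(truth)


# Hypothetical ranked candidates for three source columns (illustrative names only).
ranked_candidates = {
    "patient_id": ["stay_id", "person_id", "visit_id"],
    "admit_time": ["discharge_year", "unit_time", "admit_clock", "admit_offset"],
    "sex": ["sex", "ethnicity", "age"],
}
# Hypothetical ground-truth matches in the second database.
ground_truth = {
    "patient_id": "person_id",
    "admit_time": "admit_offset",
    "sex": "sex",
}

# "admit_time" is a miss at k=3 because its true match is ranked fourth.
print(top_k_accuracy(ranked_candidates, ground_truth, k=3))
```

Raising k makes the metric more lenient, which suits a guided workflow in which a researcher reviews a short list of candidates rather than accepting a single automatic match.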

Abstract

A crucial step in cohort studies is to extract the required cohort from one or more study datasets. This step is time-consuming, especially when a researcher is presented with a dataset that they have not previously worked with. When the cohort has to be extracted from multiple datasets, cohort extraction can be extremely laborious. In this study, we present an approach for partially automating cohort extraction from multiple electronic health record (EHR) databases. We formulate the guided multi-dataset cohort extraction problem in which selection criteria are first converted into queries, translating them from natural language text to language that maps to database entities. Then, using FLMs, columns of interest identified from the queries are automatically matched between the study databases. Finally, the generated queries are run across all databases to extract the study cohort. We…
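
The three steps described in the abstract (criteria to queries, column matching, query execution across databases) can be illustrated with a toy sketch. This is not the authors' implementation: character-trigram cosine similarity stands in for FLM embeddings, the two in-memory SQLite databases, the `patients` table, and every column name are hypothetical, and the top-ranked candidate is accepted automatically where a guided workflow would presumably let a researcher choose from the top-k list.

```python
import math
import sqlite3
from collections import Counter
from typing import List


def trigram_vector(name: str) -> Counter:
    """Stand-in for an FLM embedding: counts of character trigrams in a column name."""
    padded = f"  {name.lower()} "
    return Counter(padded[i:i + 3] for i in range(len(padded) - 2))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[g] * b[g] for g in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def rank_candidates(source_col: str, target_cols: List[str], k: int = 3) -> List[str]:
    """Return the k target columns most similar to the source column."""
    s_vec = trigram_vector(source_col)
    ranked = sorted(target_cols, key=lambda t: cosine(s_vec, trigram_vector(t)), reverse=True)
    return ranked[:k]


def make_db(age_col: str, rows) -> sqlite3.Connection:
    """Build a toy in-memory database; only the age column name differs between schemas."""
    conn = sqlite3.connect(":memory:")
    # Identifiers cannot be bound as SQL parameters, so the column name is formatted in.
    conn.execute(f"CREATE TABLE patients (pid INTEGER, {age_col} INTEGER)")
    conn.executemany("INSERT INTO patients VALUES (?, ?)", rows)
    return conn


db_a = make_db("age", [(1, 70), (2, 40)])
db_b = make_db("patient_age", [(10, 66), (11, 30), (12, 81)])

# Step 1: the criterion "patients aged 65 or older" written as a query template.
template = "SELECT pid FROM patients WHERE {age} >= 65 ORDER BY pid"

# Step 2: match the column of interest ("age" in db_a) against db_b's columns.
b_columns = [row[1] for row in db_b.execute("PRAGMA table_info(patients)")]
candidates = rank_candidates("age", b_columns)
chosen = candidates[0]  # assumption: accept the top candidate without human review

# Step 3: run the query on both databases using each database's own column name.
cohort_a = [r[0] for r in db_a.execute(template.format(age="age"))]
cohort_b = [r[0] for r in db_b.execute(template.format(age=chosen))]
print(candidates, cohort_a, cohort_b)
```

Swapping `trigram_vector` for embeddings from a language model would leave the ranking and query steps unchanged, which is the main reason the similarity function is kept separate in this sketch.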

Taxonomy

Topics: Machine Learning in Healthcare · Topic Modeling · Data Quality and Management