# Development and Validation of a Generative Artificial Intelligence-Based Pipeline for Automated Clinical Data Extraction From Electronic Health Records: Technical Implementation Study

**Authors:** Marvin N Carlisle, William A Pace, Andrew W Liu, Robert Krumm, Janet E Cowan, Peter R Carroll, Matthew R Cooperberg, Anobel Y Odisho

PMC · DOI: 10.2196/70708 · JMIR Bioinformatics and Biotechnology · 2026-01-06

## TL;DR

This paper describes UODBLLM, an LLM-based pipeline that extracts structured clinical data from unstructured electronic health record text. In a pilot on 1800 MRI reports, it completed every report, averaging 8.90 seconds and US $0.009 per report.

## Contribution

A novel LLM-based pipeline (UODBLLM) for secure, scalable, and cost-effective clinical data extraction from EHRs.

## Key findings

- UODBLLM processed 1800 MRI reports with a 100% completion rate and an average processing time of 8.90 seconds per report.
- The system extracted 16 structured clinical elements per report at a processing cost of US $0.009 per report.
- UODBLLM showed consistent performance across 18 batches of 100 reports each, demonstrating scalability and reliability.
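The figures above can be cross-checked with simple arithmetic. The sketch below is illustrative only: it assumes the reports were processed sequentially, and the function and key names are hypothetical rather than taken from the paper. Under that assumption, the derived total matches the 4.45 hours reported in the abstract.

```python
# Figures quoted in the key findings above.
REPORTS = 1800
SECONDS_PER_REPORT = 8.90
COST_PER_REPORT_USD = 0.009
BATCHES = 18


def run_totals(reports, seconds_per_report, cost_per_report, batches):
    """Derive run-level totals, assuming strictly sequential processing."""
    total_seconds = reports * seconds_per_report
    return {
        "reports_per_batch": reports // batches,
        "total_hours": round(total_seconds / 3600, 2),
        "total_cost_usd": round(reports * cost_per_report, 2),
    }


totals = run_totals(REPORTS, SECONDS_PER_REPORT, COST_PER_REPORT_USD, BATCHES)
print(totals)
```

The total cost is a derived quantity (1800 reports at US $0.009 each), not a number stated in the paper.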

## Abstract

The manual abstraction of unstructured clinical data is often necessary for granular clinical outcomes research but is time-consuming and can be of variable quality. Large language models (LLMs) show promise in medical data extraction, yet integrating them into research workflows remains challenging and poorly described.

This study aimed to develop and integrate an LLM-based system for automated data extraction from unstructured electronic health record (EHR) text reports within an established clinical outcomes database.

We implemented a generative artificial intelligence pipeline (UODBLLM) built on a flexible language model interface that supports various LLM implementations, including Health Insurance Portability and Accountability Act (HIPAA)-compliant cloud services and local open-source models. We used Extensible Markup Language (XML)-structured prompts and integrated the pipeline through an Open Database Connectivity (ODBC) interface to generate structured data from clinical documentation in the EHR. We evaluated UODBLLM's completion rate, processing time, and extraction capabilities across multiple clinical data elements, including quantitative measurements, categorical assessments, and anatomical descriptions, using sample magnetic resonance imaging (MRI) reports as test cases. System reliability was tested across multiple batches to assess scalability and consistency.
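To make the two design ideas in this paragraph concrete, the following is a minimal sketch of a backend-agnostic model interface combined with an XML-structured prompt. It is not the authors' implementation: the `LanguageModel` protocol, the `build_prompt` and `extract` functions, the XML tag names, and the `StubModel` stand-in are all assumptions made for illustration, and the ODBC integration is left out.

```python
import xml.etree.ElementTree as ET
from typing import Protocol


class LanguageModel(Protocol):
    """Hypothetical interface: any backend (cloud or local) with complete()."""

    def complete(self, prompt: str) -> str: ...


def build_prompt(report_text: str, fields: list[str]) -> str:
    """Wrap instructions, requested fields, and the report in XML tags."""
    root = ET.Element("prompt")
    ET.SubElement(root, "instructions").text = (
        "Extract the listed fields from the report and reply with JSON only."
    )
    fields_el = ET.SubElement(root, "fields")
    for name in fields:
        ET.SubElement(fields_el, "field").text = name
    # ElementTree escapes characters such as '<' in the report text.
    ET.SubElement(root, "report").text = report_text
    return ET.tostring(root, encoding="unicode")


class StubModel:
    """Stand-in backend for demonstration; a real one would call an LLM."""

    def complete(self, prompt: str) -> str:
        return '{"prostate_volume_ml": null}'


def extract(model: LanguageModel, report_text: str, fields: list[str]) -> str:
    """Send the structured prompt to whichever backend was supplied."""
    return model.complete(build_prompt(report_text, fields))


prompt = build_prompt("PSA < 4 ng/mL.", ["psa_ng_ml"])
reply = extract(StubModel(), "PSA < 4 ng/mL.", ["psa_ng_ml"])
```

Because `extract` depends only on the `complete` method, swapping a cloud service for a local model would mean passing a different object, with no change to the prompt-building code.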

Piloted against MRI reports, UODBLLM processed 1800 clinical documents with a 100% completion rate and an average processing time of 8.90 seconds per report. The token utilization averaged 2692 tokens per report, with an input-to-output ratio of approximately 13:2, resulting in a processing cost of US $0.009 per report. UODBLLM had consistent performance across 18 batches of 100 reports each and completed all processing in 4.45 hours. From each report, UODBLLM extracted 16 structured clinical elements, including prostate volume, prostate-specific antigen values, Prostate Imaging Reporting and Data System scores, clinical staging, and anatomical assessments. All extracted data were automatically validated against predefined schemas and stored in standardized JSON format.
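The validation step mentioned above might look something like the following sketch, which checks a model's JSON reply against a small predefined schema using only the standard library. The paper's actual schema is not shown in this summary, so the three field names, the type rules, and the `validate_extraction` function are hypothetical; the 1 to 5 range reflects the standard Prostate Imaging Reporting and Data System assessment categories.

```python
import json

# Hypothetical schema: field name -> accepted Python types.
SCHEMA = {
    "prostate_volume_ml": (int, float),
    "psa_ng_ml": (int, float),
    "pirads_score": (int,),
}


def validate_extraction(raw: str) -> dict:
    """Parse a JSON reply and raise ValueError if it violates the schema."""
    data = json.loads(raw)
    if not isinstance(data, dict):
        raise ValueError("reply must be a JSON object")
    errors = []
    for name, types in SCHEMA.items():
        if name not in data:
            errors.append(f"missing field: {name}")
            continue
        value = data[name]
        if value is None:
            continue  # null is allowed when the report omits the value
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, types):
            errors.append(f"wrong type for {name}")
    score = data.get("pirads_score")
    if isinstance(score, int) and not isinstance(score, bool):
        if not 1 <= score <= 5:
            errors.append("pirads_score must be between 1 and 5")
    if errors:
        raise ValueError("; ".join(errors))
    return data


record = validate_extraction(
    '{"prostate_volume_ml": 42.5, "psa_ng_ml": 6.1, "pirads_score": 4}'
)
```

Rejecting a reply outright, rather than silently coercing it, keeps malformed extractions out of the database and leaves a clear error message for review.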

We demonstrated the successful integration of an LLM-based extraction system within an existing clinical outcomes database, achieving rapid, comprehensive data extraction at minimal cost. UODBLLM provides a scalable, efficient solution for automating clinical data extraction while maintaining protected health information security. This approach could significantly accelerate research timelines and expand feasible clinical studies, particularly for large-scale database projects.

## Full-text entities

- **Genes:** KLK3 (kallikrein related peptidase 3) [NCBI Gene 354] {aka APS, KLK2A1, PSA, hK3}
- **Diseases:** renal and bladder cancer (MESH:D001749), hepatocellular carcinoma (MESH:D006528), prostate cancer (MESH:D011471), urologic cancers (MESH:D014571), cancer (MESH:D009369)
- **Species:** Homo sapiens (human, species) [taxon 9606]

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/PMC12774397/full.md

## Figures

1 figure with captions in the complete paper: https://tomesphere.com/paper/PMC12774397/full.md

## References

23 references — full list in the complete paper: https://tomesphere.com/paper/PMC12774397/full.md

---
Source: https://tomesphere.com/paper/PMC12774397