# Enhanced Prediction of Atrial Fibrillation in Patients With Ischemic Stroke Through Electronic Medical Records and Text Mining: Algorithm Development and Validation

**Authors:** Yu-Wei Chen, Sheng-Feng Sung, Ya-Han Hu, Yu-Hsuan Yang

PMC · DOI: 10.2196/78117 · 2026-03-10

## TL;DR

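This study develops models to predict atrial fibrillation in patients with ischemic stroke by combining structured variables with features mined from unstructured electronic medical record text. The combination improved performance in selected model configurations, and all models were validated internally and externally on data from 2 hospitals.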

## Contribution

The study introduces an approach that integrates structured and text-mined features from electronic medical records to predict atrial fibrillation (AF) in patients with ischemic stroke.

## Key findings

- Combining structured and unstructured data improved predictive performance in selected models.
- Ensemble learning-based models outperformed alternative algorithms in AF risk prediction.
- Key predictors included E- to A-wave velocity ratio, left atrial size, and age.

## Abstract

Stroke remains one of the leading causes of mortality and long-term disability worldwide. Atrial fibrillation (AF) is a major and often underdiagnosed risk factor for ischemic stroke as it is frequently asymptomatic and may remain undetected until a catastrophic cerebrovascular event occurs. The lack of timely identification and preventive treatment for AF substantially increases stroke risk. Although previous studies have proposed various predictive models for AF detection, many rely primarily on structured clinical variables and are developed using data from a single institution, which limits their generalizability and real-world applicability across different health care settings.

The objective of this study was to develop a robust and generalizable AF risk prediction model for patients with stroke using electronic medical records. By integrating structured clinical variables with features derived from unstructured clinical text, this study aimed to construct a more comprehensive representation of patient health status. Furthermore, this study emphasized systematic internal and external validation, along with calibration assessment, to evaluate model stability and generalizability across multiple hospital datasets, thereby supporting its potential use in routine clinical practice.
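Calibration assessment, mentioned above, asks whether predicted risks match observed event rates. The abstract does not say which calibration measure the authors used, so the following is only a minimal sketch of two common choices, the Brier score and a binned reliability table. The function names and the toy labels and probabilities are hypothetical.

```python
def brier_score(y_true, y_prob):
    """Mean squared difference between predicted risk and the 0/1 outcome."""
    return sum((p - y) ** 2 for y, p in zip(y_true, y_prob)) / len(y_true)


def calibration_bins(y_true, y_prob, n_bins=5):
    """Group predictions into equal-width probability bins.

    Returns one (mean predicted risk, observed event rate, count) tuple
    per non-empty bin, ordered from lowest to highest predicted risk.
    """
    bins = [[] for _ in range(n_bins)]
    for y, p in zip(y_true, y_prob):
        # A probability of exactly 1.0 is placed in the last bin.
        idx = min(int(p * n_bins), n_bins - 1)
        bins[idx].append((y, p))

    rows = []
    for members in bins:
        if not members:
            continue
        mean_pred = sum(p for _, p in members) / len(members)
        observed = sum(y for y, _ in members) / len(members)
        rows.append((mean_pred, observed, len(members)))
    return rows


# Toy example (made-up values, not study data): 1 = AF, 0 = no AF.
labels = [0, 1, 1, 0]
risks = [0.1, 0.9, 0.8, 0.2]
score = brier_score(labels, risks)
table = calibration_bins(labels, risks, n_bins=2)
```

A well-calibrated model has a low Brier score and, in each bin, a mean predicted risk close to the observed event rate. Running the same check on the external hospital's data indicates whether calibration holds outside the development site.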

This study analyzed datasets from 2 hospitals in Taiwan: Landseed International Hospital (LIH), with 3988 patients, and Chia-Yi Christian Hospital (CYCH), with 5821 patients. We applied 5 feature engineering techniques to extract features from unstructured electronic medical record data, addressed data imbalance using 6 distinct resampling methods, and used 9 classification algorithms to compare model performance across both internal and external validation sets. This study identified the top 20 most important features from the best-performing models for both the LIH and CYCH datasets.
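To illustrate how text-derived features can sit alongside structured variables, the sketch below computes term frequency–inverse document frequency (TF-IDF) weights for a few made-up note snippets and appends them to structured feature rows. It is an assumption-laden illustration, not the authors' pipeline: the whitespace tokenizer, the smoothed IDF formula, the function names, and the example values are all hypothetical.

```python
import math
from collections import Counter


def tfidf_matrix(docs):
    """Return (vocabulary, rows) with one TF-IDF vector per document.

    Assumes simple lowercase whitespace tokenization and a smoothed IDF,
    log((1 + N) / (1 + df)) + 1, which is one common variant.
    """
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    n_docs = len(docs)
    # Document frequency: number of notes in which each term appears.
    df = Counter(tok for toks in tokenized for tok in set(toks))
    idf = {tok: math.log((1 + n_docs) / (1 + df[tok])) + 1 for tok in vocab}

    rows = []
    for toks in tokenized:
        counts = Counter(toks)
        total = len(toks) or 1  # guard against empty notes
        rows.append([counts[tok] / total * idf[tok] for tok in vocab])
    return vocab, rows


def combine(structured_rows, text_rows):
    """Concatenate structured and text-derived features per patient."""
    return [s + t for s, t in zip(structured_rows, text_rows)]


# Made-up examples, not study data.
notes = ["irregular pulse noted", "regular rhythm noted", "irregular rhythm"]
structured = [[78.0, 45.0], [61.0, 36.0], [70.0, 41.0]]  # e.g. age, left atrial size

vocab, text_rows = tfidf_matrix(notes)
features = combine(structured, text_rows)
```

Each row in `features` could then be passed to a resampling step and a classifier. Terms that appear in fewer notes (here `pulse`) receive a higher weight than terms that appear in many.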

The optimal predictive model for LIH was based solely on structured variables, whereas the model for CYCH achieved superior results by integrating structured variables with text-derived variables obtained from unstructured clinical notes using term frequency–inverse document frequency. Notably, feature importance analysis consistently identified the ratio of E- to A-wave velocities, left atrial size, and age as the top 3 predictive factors across both datasets, underscoring their critical role in AF risk assessment among patients with stroke.

This study demonstrated the development of predictive models for AF in patients with ischemic stroke. Notably, the integration of structured variables with variables derived from unstructured clinical text improved predictive performance in selected model configurations. Rigorous internal and external validation processes confirmed the superior performance of ensemble learning–based machine learning models compared with alternative algorithms, underscoring the potential of this approach for AF risk prediction.

## Linked entities

- **Diseases:** Atrial fibrillation (MONDO:0004981), ischemic stroke (MONDO:1060198)

## Full-text entities

- **Genes:** SHROOM4 (shroom family member 4) [NCBI Gene 57477] {aka MRXSSDS, SHAP, shrm4}, CCNH (cyclin H) [NCBI Gene 902] {aka CAK, CycH, p34, p37}, CRP (C-reactive protein) [NCBI Gene 1401] {aka PTX1}, F3 (coagulation factor III, tissue factor) [NCBI Gene 2152] {aka CD142, TF, TFA}
- **Diseases:** neurological deficits (MESH:D009461), AIS (MESH:D000083242), cardiac arrhythmia (MESH:D001145), bleeding (MESH:D006470), dysarthria (MESH:D004401), Stroke (MESH:D020521), diabetes mellitus (MESH:D003920), inflammatory (MESH:D007249), hyperlipidemia (MESH:D006949), NIHSS (MESH:C538175), coronary heart disease (MESH:D003327), disability (MESH:D009069), TIA (MESH:D002546), congestive heart failure (MESH:D006333), cardiovascular disease (MESH:D002318), AF (MESH:D001281), ischemic heart disease (MESH:D017202), Ischemic Stroke (MESH:D002544), cerebrovascular disease (MESH:D002561), LIH (MESH:D003428), long-term disability (MESH:D000088562), hypertension (MESH:D006973), death (MESH:D003643)
- **Chemicals:** edoxaban (MESH:C552171), apixaban (MESH:C522181), triglyceride (MESH:D014280), sugar (MESH:D000073893), warfarin (MESH:D014859), rivaroxaban (MESH:D000069552), lipid (MESH:D008055), dabigatran (MESH:D000069604), BERT (-), TG (MESH:D013866)
- **Species:** Homo sapiens (human, species) [taxon 9606]

## Figures

3 figures with captions in the complete paper: https://tomesphere.com/paper/PMC12975001/full.md

---
Source: https://tomesphere.com/paper/PMC12975001