# Feasibility of machine learning analysis for the identification of patients with possible primary ciliary dyskinesia

**Authors:** Gully Burns, Carey Kauffman, Michele Manion, Ruth-Anne Pai, Carlos Milla, Michael G. O’Connor, Adam J. Shapiro, Heidi Bjornson-Pennell

PMC · DOI: 10.1186/s13023-025-03966-z · 2025-10-14

## TL;DR

This paper explores whether machine learning applied to claims data can screen children for primary ciliary dyskinesia (PCD), a rare and frequently underdiagnosed disease.

## Contribution

The study shows that machine learning screening for PCD is feasible using claims data, even though PCD has no specific diagnostic code.

## Key findings

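- A random forest model trained on 82 confirmed pediatric cases and 4161 matched controls showed variable performance: sensitivity 0.75–0.94 and positive predictive value 0.45–0.73.
- Expanding the dataset to 319 Q34.8 + EM patients and 8214 controls gave sensitivity 0.82–0.90 and positive predictive value 0.51–0.54, which the authors consider suitable for screening. Synthetic data augmentation did not improve results.
- In a cohort of 1.32 million pediatric patients, the model classified 7705 as positive, which the authors report as consistent with PCD's estimated prevalence (1:7554).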

## Abstract

Long diagnostic delays are common in primary ciliary dyskinesia (PCD), a rare disease that is significantly underdiagnosed. Scalable screening methods could improve early identification and health outcomes.

Can machine learning (ML) be used to screen for PCD in pediatric patients?

We evaluated the feasibility of a random forest model to screen for PCD using data from the PCD Foundation Registry and a national claims database. We identified a cohort of pediatric patients (< 18 years of age) with diagnostic codes indicative of conditions potentially associated with PCD, and studied diagnostic, procedural, and pharmaceutical codes associated with PCD to develop ML features. Models were trained on composite claims data from confirmed patients with PCD, patients with Q34.8 (Other Specified Congenital Malformations of the Respiratory System) diagnosed within 6 months of an electron microscopy procedure (Q34.8 + EM), and a randomly selected, matched control group. Model performance was tested through fivefold cross-validation.
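The evaluation scheme described above (fivefold cross-validation, reported as positive predictive value and sensitivity) can be sketched in plain Python. This is a minimal illustration, not the authors' pipeline: the data are synthetic, the helper names (`stratified_folds`, `ppv_and_sensitivity`, `cross_validate`) are hypothetical, stratifying the folds by class is an assumption, and a one-feature threshold rule stands in for the random forest so the example has no dependencies.

```python
import random
from collections import defaultdict

def stratified_folds(labels, k=5, seed=0):
    """Assign every index to one of k folds, keeping class ratios roughly equal."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, label in enumerate(labels):
        by_class[label].append(i)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        for position, i in enumerate(indices):
            folds[position % k].append(i)
    return folds

def ppv_and_sensitivity(y_true, y_pred):
    """PPV = TP / (TP + FP); sensitivity = TP / (TP + FN)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    ppv = tp / (tp + fp) if tp + fp else float("nan")
    sensitivity = tp / (tp + fn) if tp + fn else float("nan")
    return ppv, sensitivity

def cross_validate(X, y, fit, predict, k=5, seed=0):
    """Train on k-1 folds, score on the held-out fold, repeat for each fold."""
    results = []
    for held_out in stratified_folds(y, k, seed):
        held = set(held_out)
        train = [i for i in range(len(y)) if i not in held]
        model = fit([X[i] for i in train], [y[i] for i in train])
        preds = [predict(model, X[i]) for i in held_out]
        results.append(ppv_and_sensitivity([y[i] for i in held_out], preds))
    return results

# Stand-in classifier (NOT a random forest): threshold halfway between class means.
def fit_threshold(X_train, y_train):
    pos = [x for x, t in zip(X_train, y_train) if t == 1]
    neg = [x for x, t in zip(X_train, y_train) if t == 0]
    return (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2

def predict_threshold(threshold, x):
    return 1 if x > threshold else 0

# Synthetic, imbalanced toy data: 50 "cases" and 500 "controls", one feature each.
rng = random.Random(42)
y = [1] * 50 + [0] * 500
X = [rng.gauss(2.0, 1.0) if t == 1 else rng.gauss(0.0, 1.0) for t in y]

results = cross_validate(X, y, fit_threshold, predict_threshold)
for fold, (ppv, sensitivity) in enumerate(results, start=1):
    print(f"fold {fold}: PPV={ppv:.2f} sensitivity={sensitivity:.2f}")
```

Reporting the minimum and maximum over `results` gives per-metric ranges of the kind quoted in the abstract. Any classifier with the same `fit`/`predict` shape could replace the threshold rule.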

Using 82 confirmed pediatric PCD cases and 4161 matched controls, the model demonstrated variable performance (positive predictive value 0.45–0.73, sensitivity 0.75–0.94). Synthetic data augmentation did not improve results (positive predictive value 0.45–0.67, sensitivity 0.71–1.00). Expanding the dataset to include 319 Q34.8 + EM patients and 8214 controls improved performance (positive predictive value 0.51–0.54, sensitivity 0.82–0.90) to a level suitable for screening. In a cohort of 1.32 million pediatric patients, 7705 were classified as positive, consistent with the estimated prevalence of PCD (1:7554).
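One plausible contributor to the wide ranges with 82 cases is simple arithmetic: each held-out fold contains only a handful of positives, so a single missed case moves the fold's sensitivity noticeably. The sketch below illustrates this under stated assumptions: positives are split as evenly as possible across five folds, and 319 is taken as the positive count in the expanded set (the abstract does not say whether the 82 confirmed cases are counted in addition). The function names are hypothetical.

```python
def fold_sizes(n, k=5):
    """Split n items across k folds as evenly as possible."""
    base, extra = divmod(n, k)
    return [base + 1 if i < extra else base for i in range(k)]

def sensitivity_step(n_positive, k=5):
    """Largest change in a fold's sensitivity caused by one additional missed case."""
    return 1 / min(fold_sizes(n_positive, k))

print(fold_sizes(82))    # [17, 17, 16, 16, 16]
print(fold_sizes(319))   # [64, 64, 64, 64, 63]
print(sensitivity_step(82))             # 0.0625
print(round(sensitivity_step(319), 4))  # 0.0159
```

With about 16 positives per fold, sensitivity can only take values in steps of roughly 0.06, so fold-to-fold spread is expected. With about 64 per fold the step shrinks to under 0.02, in line with the narrower ranges reported for the expanded dataset.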

This study demonstrates the feasibility of using ML to screen for PCD using claims data, even in the absence of a specific International Classification of Diseases (ICD) code. While unvalidated, this work may serve as the basis for future ML efforts in rare disease detection. Such screening approaches could aid in the identification of individuals who may benefit from timely diagnostic testing and targeted interventions.

The online version contains supplementary material available at 10.1186/s13023-025-03966-z.

## Linked entities

- **Diseases:** primary ciliary dyskinesia (MONDO:0016575)

## Full-text entities

- **Diseases:** Congenital Malformation of the Respiratory System (MESH:D015619), PCD (MESH:D002925)
- **Species:** Homo sapiens (human, species) [taxon 9606]

## Figures

4 figures with captions in the complete paper: https://tomesphere.com/paper/PMC12522419/full.md

---
Source: https://tomesphere.com/paper/PMC12522419