# Synthetic data-driven AI approach for fetal chromosomal aneuploidies detection

**Authors:** Changhoe Hwang, Krishna Prasad Adhikari, Gyeongin Oh, Sunshin Kim

PMC · DOI: 10.1093/bioadv/vbaf244 · 2025-10-06

## TL;DR

This paper introduces a synthetic data-driven AI method to detect fetal chromosomal abnormalities, overcoming the challenge of limited real positive data.

## Contribution

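A novel synthetic data generation approach for fetal aneuploidy detection that produces negative and positive datasets with >99.9% similarity to real data, and logistic regression models trained on those datasets that hold their sensitivity and PPV under low prevalence, where conventional z-score methods decline.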

## Key findings

- Synthetic data generation achieved >99.9% similarity to real data for fetal chromosomal aneuploidy detection.
- On synthetic evaluation datasets, logistic regression models maintained 100% sensitivity and PPV for autosomal aneuploidies and achieved ≥99.6% sensitivity and PPV for sex chromosome aneuploidies.
- The method demonstrated 100% accuracy on real positive fetal aneuploidy cases.

## Abstract

A major limitation in the development of fetal chromosomal aneuploidy detection technologies is the scarcity of real positive data. To address this issue, we propose a novel methodology to generate virtually unlimited synthetic negative and positive datasets with >99.9% similarity to real data, enabling accurate detection of both autosomal chromosome aneuploidies (ACA) and sex chromosome aneuploidies (SCA).

Blood samples from 15 999 pregnant women were analyzed, including 186 clinically confirmed positive cases. Using 701 high-confidence negatives as a reference, we designed algorithms for synthetic data generation. For negatives, multiple real FASTQ files were randomly merged, and fetal fraction (FF) was recalculated to reflect biological variability. For positives, chromosome-specific read counts were adjusted using numerical equations: ACAs were simulated by increasing the target chromosome reads, and SCAs were generated by adjusting sex chromosome read counts using regression models that account for FF and total read count, with the GC distribution preserved. Logistic regression (LR) models were then trained using features including FF, GC content, and chromosomal read counts. Performance was evaluated against conventional z-score methods and real positive cases.
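The ACA simulation step and the z-score baseline can be illustrated with a minimal sketch. The abstract does not give the numerical equations, so the code assumes the commonly used relationship that one extra fetal copy of a chromosome raises its read count by a factor of 1 + FF/2. The function names, the toy read counts, and the reference fractions are hypothetical and are not taken from the authors' repository.

```python
from statistics import mean, stdev


def simulate_trisomy_counts(counts, target, fetal_fraction):
    """Return a copy of per-chromosome read counts with `target` inflated.

    Assumption (not from the paper): a fetal trisomy adds half a copy's worth
    of reads in proportion to the fetal fraction, i.e. a factor of 1 + FF/2.
    """
    if not 0.0 <= fetal_fraction <= 1.0:
        raise ValueError("fetal_fraction must be between 0 and 1")
    simulated = dict(counts)  # leave the input sample untouched
    simulated[target] = round(counts[target] * (1.0 + fetal_fraction / 2.0))
    return simulated


def chromosome_fraction(counts, target):
    """Share of all reads that map to the target chromosome."""
    return counts[target] / sum(counts.values())


def z_score(sample_fraction, reference_fractions):
    """Conventional z-score of a sample against reference negative samples."""
    mu = mean(reference_fractions)
    sigma = stdev(reference_fractions)
    return (sample_fraction - mu) / sigma


if __name__ == "__main__":
    # Toy negative sample: counts are illustrative, not real NIPT data.
    negative = {"chr21": 10_000, "other": 990_000}
    positive = simulate_trisomy_counts(negative, "chr21", fetal_fraction=0.10)

    # Hypothetical chr21 read fractions from reference negatives.
    reference = [0.0100, 0.0101, 0.0099, 0.0100, 0.0100]
    print(z_score(chromosome_fraction(negative, "chr21"), reference))
    print(z_score(chromosome_fraction(positive, "chr21"), reference))
```

Under this assumption the size of the shift scales with FF, so low-FF samples sit closer to the negative distribution. That is one plausible reason for supplying FF to the classifier as a feature rather than relying on a fixed z-score cutoff.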

From high-confidence negative samples, ∼160 000 synthetic training datasets were generated for major ACA and ∼35 000 for each SCA. While z-score methods showed declines in sensitivity (T13) or positive predictive value (PPV) (T18, T21) under low prevalence, LR models consistently maintained 100% sensitivity and PPV for ACAs, achieved ≥99.6% sensitivity and PPV for SCAs on synthetic evaluation datasets, and demonstrated 100% accuracy on real positive samples.
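The drop in PPV under low prevalence follows directly from the definition of PPV, which depends on prevalence as well as on sensitivity and specificity. The sketch below shows that dependence. The sensitivity, specificity, and prevalence values are hypothetical and are not results from the paper.

```python
def positive_predictive_value(sensitivity, specificity, prevalence):
    """PPV = TP / (TP + FP), expressed as population rates."""
    true_positive_rate = sensitivity * prevalence
    false_positive_rate = (1.0 - specificity) * (1.0 - prevalence)
    total_positive_calls = true_positive_rate + false_positive_rate
    if total_positive_calls == 0.0:
        raise ValueError("no positive calls: PPV is undefined")
    return true_positive_rate / total_positive_calls


if __name__ == "__main__":
    # Hypothetical classifier: perfect sensitivity, 99.9% specificity.
    for prevalence in (0.1, 0.01, 0.001):
        ppv = positive_predictive_value(1.0, 0.999, prevalence)
        print(f"prevalence={prevalence}: PPV={ppv:.4f}")
```

With these illustrative numbers, a prevalence of 1 in 1000 leaves roughly half of all positive calls as false positives, even though sensitivity is perfect. PPV stays at 100% regardless of prevalence only when specificity is also 100%.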

All formulas and procedures required for synthetic data generation and model development are implemented in Python and are available at https://github.com/genomecare-rnd/SyntheticData-NIPT.

## Full-text entities

- **Diseases:** ACA (MESH:D000782), SCA (MESH:D025064)
- **Species:** Homo sapiens (human, species) [taxon 9606]

## Figures

7 figures with captions in the complete paper: https://tomesphere.com/paper/PMC12557104/full.md

---
Source: https://tomesphere.com/paper/PMC12557104