# A Multi-Stage Framework for Kawasaki Disease Prediction Using Clustering-Based Undersampling and Synthetic Data Augmentation: Cross-Institutional Validation with Dual-Center Clinical Data in Taiwan

**Authors:** Heng-Chih Huang, Chuan-Sheng Hung, Chun-Hung Richard Lin, Yi-Zhen Shie, Cheng-Han Yu, Ting-Hsin Huang

PMC · DOI: 10.3390/bioengineering12070742 · 2025-07-07

## TL;DR

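This paper introduces a multi-stage AI framework that predicts Kawasaki disease from routine blood and urine tests by combining clustering-based undersampling, synthetic data augmentation, and stacking ensemble learning, and validates it across two hospitals in Taiwan.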

## Contribution

The paper contributes a novel multi-stage AI framework that addresses severe class imbalance in Kawasaki disease prediction by combining clustering-based undersampling, synthetic data augmentation, and stacking ensemble learning.
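
To make the undersampling idea concrete, here is a minimal sketch of one common form of clustering-based undersampling: cluster the majority (non-KD) rows, then keep a proportional sample from each cluster so the reduced set still covers the majority distribution. This summary does not specify the paper's exact algorithm, so the tiny k-means, the proportional quota rule, and the names `kmeans` and `cluster_undersample` are all assumptions made for illustration.

```python
import random

def kmeans(points, k, iters=20, seed=0):
    """Tiny Lloyd's k-means on lists of floats; returns one cluster label per point."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    labels = [0] * len(points)
    for _ in range(iters):
        # Assign each point to its nearest center (squared Euclidean distance).
        labels = [
            min(range(k), key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])))
            for p in points
        ]
        # Move each center to the mean of its members; keep the old center if empty.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels

def cluster_undersample(majority, n_keep, k=3, seed=0):
    """Keep roughly n_keep majority rows, sampled proportionally from each cluster.

    Hypothetical helper: the quota rule is an assumption, not the paper's method.
    """
    rng = random.Random(seed)
    labels = kmeans(majority, k, seed=seed)
    kept = []
    for c in range(k):
        members = [p for p, lab in zip(majority, labels) if lab == c]
        # Per-cluster quota; rounding means the total may differ slightly from n_keep.
        quota = round(n_keep * len(members) / len(majority))
        kept.extend(rng.sample(members, min(quota, len(members))))
    return kept
```

Sampling per cluster rather than uniformly at random is meant to keep small but distinct subgroups of the majority class represented after the reduction.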

## Key findings

- The model achieved 97.5% specificity and 53.6% F1-score at 95% recall on the CGMH test set.
- At the same fixed 95% recall, it reached 74.7% specificity and a 23.4% F1-score on the external KMUH validation set.
- The authors present these results as support for cross-institutional generalizability and practical utility for KD screening.
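
The F1-scores above can be read alongside the fixed recall to back out the precision they imply, since F1 = 2PR / (P + R) rearranges to P = F1 · R / (2R − F1). The sketch below performs that arithmetic; the resulting precisions are derived here from the rounded figures in this summary, not reported by the authors, and assume recall is exactly 0.95.

```python
def implied_precision(f1, recall):
    """Solve F1 = 2PR / (P + R) for precision P, given F1 and recall R."""
    return f1 * recall / (2 * recall - f1)

# Reported (rounded) F1-scores at 95% recall; the precisions are back-calculated.
for name, f1 in [("CGMH test", 0.536), ("KMUH validation", 0.234)]:
    print(name, round(implied_precision(f1, 0.95), 3))
```

Under these assumptions the implied precision is roughly 0.37 on the CGMH test set and roughly 0.13 on the KMUH validation set, which is the expected pattern when a rare condition is screened at a high-recall operating point.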

## Abstract

Kawasaki disease (KD) is a rare yet potentially life-threatening pediatric vasculitis that, if left undiagnosed or untreated, can result in serious cardiovascular complications. Its heterogeneous clinical presentation poses diagnostic challenges, often failing to meet classical criteria and increasing the risk of oversight. Leveraging routine laboratory tests with AI offers a promising strategy for enhancing early detection. However, due to the extremely low prevalence of KD, conventional models often struggle with severe class imbalance, limiting their ability to achieve both high sensitivity and specificity in practice. To address this issue, we propose a multi-stage AI-based predictive framework that incorporates clustering-based undersampling, data augmentation, and stacking ensemble learning. The model was trained and internally tested on clinical blood and urine test data from Chang Gung Memorial Hospital (CGMH, n = 74,641; 2010–2019), and externally validated using an independent dataset from Kaohsiung Medical University Hospital (KMUH, n = 1582; 2012–2020), thereby supporting cross-institutional generalizability. At a fixed recall rate of 95%, the model achieved a specificity of 97.5% and an F1-score of 53.6% on the CGMH test set, and a specificity of 74.7% with an F1-score of 23.4% on the KMUH validation set. These results underscore the model’s ability to maintain high specificity even under sensitivity-focused constraints, while still delivering clinically meaningful predictive performance. This balance of sensitivity and specificity highlights the framework’s practical utility for real-world KD screening.
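
The "fixed recall rate of 95%" evaluation can be sketched as follows: choose the highest score threshold that still captures at least 95% of true KD cases, then report specificity and F1 at that threshold. This is a generic illustration in plain Python; the function names are hypothetical, and the abstract does not state how or on which data split the authors selected their threshold.

```python
import math

def threshold_at_recall(scores, labels, target_recall=0.95):
    """Highest threshold at which recall on the positive class is >= target_recall."""
    pos = sorted((s for s, y in zip(scores, labels) if y == 1), reverse=True)
    # Number of positives that must score at or above the threshold
    # (small tolerance guards against floating-point overshoot).
    need = max(1, math.ceil(target_recall * len(pos) - 1e-9))
    return pos[need - 1]

def metrics_at_threshold(scores, labels, thr):
    """Return (recall, specificity, F1) when predicting positive for score >= thr."""
    tp = sum(1 for s, y in zip(scores, labels) if s >= thr and y == 1)
    fp = sum(1 for s, y in zip(scores, labels) if s >= thr and y == 0)
    fn = sum(1 for s, y in zip(scores, labels) if s < thr and y == 1)
    tn = sum(1 for s, y in zip(scores, labels) if s < thr and y == 0)
    recall = tp / (tp + fn)
    specificity = tn / (tn + fp)
    f1 = 2 * tp / (2 * tp + fp + fn)
    return recall, specificity, f1

# Toy scores: 4 positives and 6 negatives (illustrative values only).
scores = [0.9, 0.8, 0.7, 0.2, 0.6, 0.5, 0.3, 0.1, 0.05, 0.01]
labels = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
thr = threshold_at_recall(scores, labels, target_recall=0.95)
recall, specificity, f1 = metrics_at_threshold(scores, labels, thr)
```

In practice the threshold would normally be fixed on a held-out validation split and then applied unchanged to test or external data, so that the reported specificity is not tuned on the evaluation set.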

## Linked entities

- **Diseases:** Kawasaki disease (MONDO:0012727)

## Full-text entities

- **Diseases:** vasculitis (MESH:D014657), KD (MESH:D009080), cardiovascular complications (MESH:D002318)

## Figures

12 figures with captions in the complete paper: https://tomesphere.com/paper/PMC12292631/full.md

---
Source: https://tomesphere.com/paper/PMC12292631