Synthetic Data Alone is Enough? Rethinking Data Scarcity in Pediatric Rare Disease Recognition
Ganlin Feng, Yuxi Long, Erin Lou, Lianghong Chen, Zihao Jing, Pingzhao Hu, Wei Xu

TL;DR
This study asks whether computer vision models for pediatric rare disease recognition can be trained on synthetic facial images alone, and reports performance comparable to real-data baselines at sufficient scale.
Contribution
The paper shows that high-fidelity synthetic data can reach performance comparable to real-data training in pediatric rare disease classification, which opens the way to privacy-preserving applications.
Findings
Synthetic-only training achieves comparable performance to real-data baselines.
Models trained on synthetic data perform well at sufficient scale across multiple architectures.
Synthetic data can serve as a privacy-preserving resource for genetic education and counseling.
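The synthetic-only protocol behind these findings (train on synthetic samples only, at increasing scales, then evaluate on held-out real samples) can be sketched in a few lines. This is a toy illustration, not the authors' pipeline: 2-D Gaussian points stand in for facial image features, a nearest-centroid classifier stands in for the vision models, and every name and number here (`synthetic_only_sweep`, the class centers, the noise level, the scales) is a hypothetical assumption chosen for the example.

```python
import random


def make_samples(n, centers, noise, rng):
    """Draw n labeled 2-D points around per-class centers (toy stand-in for image features)."""
    samples = []
    for i in range(n):
        label = i % len(centers)  # cycle through classes so each one is represented
        cx, cy = centers[label]
        samples.append(((rng.gauss(cx, noise), rng.gauss(cy, noise)), label))
    return samples


def fit_centroids(samples, num_classes):
    """Fit a nearest-centroid classifier: one mean point per class."""
    sums = [[0.0, 0.0, 0] for _ in range(num_classes)]
    for (x, y), label in samples:
        sums[label][0] += x
        sums[label][1] += y
        sums[label][2] += 1
    return [(sx / count, sy / count) for sx, sy, count in sums]


def predict(centroids, point):
    """Return the index of the centroid closest to the point."""
    return min(
        range(len(centroids)),
        key=lambda k: (point[0] - centroids[k][0]) ** 2 + (point[1] - centroids[k][1]) ** 2,
    )


def accuracy(centroids, samples):
    """Fraction of samples whose predicted label matches the true label."""
    correct = sum(predict(centroids, point) == label for point, label in samples)
    return correct / len(samples)


def synthetic_only_sweep(scales, seed=0):
    """Train on synthetic data only at each scale; always evaluate on the same real test set.

    Each scale must be at least the number of classes so every class has a sample.
    """
    rng = random.Random(seed)
    real_centers = [(0.0, 0.0), (3.0, 3.0)]     # assumed "real" class distributions
    synth_centers = [(0.3, -0.2), (2.8, 3.3)]   # slightly shifted: an imperfect generator
    real_test = make_samples(200, real_centers, 1.0, rng)  # real data is used for evaluation only
    results = {}
    for n in scales:
        synth_train = make_samples(n, synth_centers, 1.0, rng)
        centroids = fit_centroids(synth_train, len(real_centers))
        results[n] = accuracy(centroids, real_test)
    return results


if __name__ == "__main__":
    for scale, acc in synthetic_only_sweep([10, 100, 1000]).items():
        print(f"synthetic train size {scale}: accuracy on real test set {acc:.3f}")
```

Calling `synthetic_only_sweep([10, 100, 1000])` returns a dictionary mapping each synthetic training size to accuracy on the fixed toy "real" test set. Keeping the real data strictly on the evaluation side is the design point of the sketch: the numbers it prints say nothing about the paper's actual results.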
Abstract
Children with rare genetic diseases often exhibit distinctive facial phenotypes, yet developing computer vision systems for early diagnosis remains challenging due to extreme data scarcity, privacy constraints, and limited data sharing in pediatric settings. These challenges not only hinder automated diagnosis but also restrict the availability of visual resources for clinical genetic counseling. While prior work has shown that synthetic data can augment real datasets and preserve phenotype-level semantics, it remains unclear whether synthetic data alone is sufficient for learning in ultra-low-resource pediatric settings. In this work, we study the synthetic-only regime for pediatric rare disease recognition. Under a controlled experimental setup, models are trained exclusively on phenotype-aware synthetic facial images at increasing scales. We find that synthetic-only training achieves…
