Synthetic Simplicity: Unveiling Bias in Medical Data Augmentation
Krishan Agyakari Raja Babu, Rachana Sathish, Mrunal Pattanaik, and Rahul Venkataramani

TL;DR
This paper investigates how models trained on synthetic medical data can exploit superficial source-related features, leading to poor real-world performance, and highlights the importance of understanding bias introduced by synthetic data.
Contribution
It uncovers the phenomenon of simplicity bias in models trained on synthetic data and demonstrates its impact on medical imaging tasks, providing guidelines for synthetic data usage.
Findings
Models exploit the source of the data (real vs. synthetic) as a spurious feature.
Synthetic data can cause models to rely on superficial cues.
Performance drops when source-label correlation is absent.
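The findings above can be illustrated with a small toy simulation. This is a sketch under stated assumptions, not the paper's experimental setup: the two-feature data generator, the 95% source-label agreement rate, the noise levels, and every function name below are hypothetical choices made for illustration only. One feature carries a weak but genuine task signal; the other is a clean "source" cue (real vs. synthetic) that agrees with the label only in the biased training set.

```python
import numpy as np


def make_data(n, source_label_agreement, rng):
    """Toy data: a noisy task feature plus a clean 'source' feature.

    source_label_agreement is the probability that the source flag
    (0 = real, 1 = synthetic) equals the task label. All constants
    here are illustrative assumptions.
    """
    y = rng.integers(0, 2, size=n)
    agree = rng.random(n) < source_label_agreement
    source = np.where(agree, y, 1 - y)
    # Genuine task signal: weak and noisy, so it is hard to learn from.
    x_task = (2 * y - 1) * 0.5 + rng.normal(0.0, 1.0, size=n)
    # Source cue: nearly noise-free, so it is the "simple" feature.
    x_source = (2 * source - 1) + rng.normal(0.0, 0.1, size=n)
    return np.column_stack([x_task, x_source]), y


def fit_logistic(X, y, lr=0.5, steps=2000):
    """Plain gradient-descent logistic regression (no regularization)."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))
        grad = p - y
        w -= lr * (X.T @ grad) / len(y)
        b -= lr * grad.mean()
    return w, b


def accuracy(w, b, X, y):
    return float((((X @ w + b) > 0).astype(int) == y).mean())


rng = np.random.default_rng(0)

# Biased training set: source agrees with the label 95% of the time.
X_train_biased, y_train_biased = make_data(4000, 0.95, rng)
# Control training set: source is independent of the label.
X_train_control, y_train_control = make_data(4000, 0.50, rng)
# Deployment-like test set: source-label correlation is absent.
X_test, y_test = make_data(4000, 0.50, rng)

w_biased, b_biased = fit_logistic(X_train_biased, y_train_biased)
w_control, b_control = fit_logistic(X_train_control, y_train_control)

acc_train_biased = accuracy(w_biased, b_biased, X_train_biased, y_train_biased)
acc_test_biased = accuracy(w_biased, b_biased, X_test, y_test)
acc_test_control = accuracy(w_control, b_control, X_test, y_test)

print("weights [task, source], biased training:", w_biased)
print("train accuracy, biased model:", acc_train_biased)
print("test accuracy, biased model:", acc_test_biased)
print("test accuracy, control model:", acc_test_control)
```

In this toy setting the model trained on the biased set puts most of its weight on the source cue, scores well on its own training distribution, and falls toward chance once the source no longer tracks the label. The control model, which never saw a source-label correlation, has to rely on the weaker task feature and transfers better.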
Abstract
Synthetic data is becoming increasingly integral in data-scarce fields such as medical imaging, serving as a substitute for real data. However, its inherent statistical characteristics can significantly impact downstream tasks, potentially compromising deployment performance. In this study, we empirically investigate this issue and uncover a critical phenomenon: downstream neural networks often exploit spurious distinctions between real and synthetic data when there is a strong correlation between the data source and the task label. This exploitation manifests as simplicity bias, where models overly rely on superficial features rather than genuine task-related complexities. Through principled experiments, we demonstrate that the source of data (real vs. synthetic) can introduce spurious correlating factors leading to poor performance during deployment when the correlation is…
Taxonomy
Topics: Machine Learning in Healthcare
