Privacy in Image Datasets: A Case Study on Pregnancy Ultrasounds
Rawisara Lohanimit, Yankun Wu, Amelia Katirai, Yuta Nakashima, Noa Garcia

TL;DR
This paper investigates privacy risks in large-scale image datasets by analyzing pregnancy ultrasound images in LAION-400M. It shows that these images expose private information such as names and locations, and it proposes ethical data curation practices.
Contribution
It introduces a systematic method to detect private information in publicly shared ultrasound images using CLIP embeddings and highlights privacy concerns in large datasets.
Findings
Thousands of private entities detected in ultrasound images
High-risk information enables potential re-identification
Recommendations for dataset privacy and ethical use
Abstract
The rise of generative models has led to increased use of large-scale datasets collected from the internet, often with minimal or no data curation. This raises concerns about the inclusion of sensitive or private information. In this work, we explore the presence of pregnancy ultrasound images, which contain sensitive personal information and are often shared online. Through a systematic examination of the LAION-400M dataset using CLIP embedding similarity, we retrieve pregnancy ultrasound images and detect thousands of private-information entities, such as names and locations. Our findings reveal that multiple images contain high-risk information that could enable re-identification or impersonation. We conclude with recommended practices for dataset curation, data privacy, and ethical use of public image datasets.
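The retrieval step mentioned in the abstract (finding images by CLIP embedding similarity) can be illustrated with a minimal sketch. It assumes the embeddings have already been computed and are available as plain lists of floats. The function names, the 0.8 threshold, and the toy vectors are hypothetical and not taken from the paper, which does not state here whether its queries were text or image embeddings.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat a zero vector as matching nothing
    return dot / (norm_a * norm_b)


def retrieve_similar(query_emb, image_embs, threshold=0.8, top_k=None):
    """Return (index, similarity) pairs at or above `threshold`, best first.

    `threshold` is an illustrative assumption; a real cutoff would have to be
    chosen by inspecting retrieved samples.
    """
    scored = [(i, cosine_similarity(query_emb, emb))
              for i, emb in enumerate(image_embs)]
    hits = [(i, s) for i, s in scored if s >= threshold]
    hits.sort(key=lambda pair: pair[1], reverse=True)
    return hits if top_k is None else hits[:top_k]


# Toy 3-dimensional vectors standing in for real CLIP embeddings.
query = [1.0, 0.0, 0.0]
dataset = [
    [1.0, 0.0, 0.0],   # identical direction to the query
    [0.9, 0.1, 0.0],   # close to the query
    [0.0, 1.0, 0.0],   # orthogonal
    [-1.0, 0.0, 0.0],  # opposite
]
for index, score in retrieve_similar(query, dataset):
    print(index, round(score, 3))
```

At the scale of LAION-400M, a pure-Python loop like this would be far too slow; a vectorized or approximate nearest-neighbor index would be the practical choice. The sketch only shows the similarity-and-threshold logic.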
Taxonomy
Topics: Privacy-Preserving Technologies in Data · Face recognition and analysis · Fetal and Pediatric Neurological Disorders
