MammoClean: Toward Reproducible and Bias-Aware AI in Mammography through Dataset Harmonization
Yalda Zafari, Hongyi Pan, Gorkem Durak, Ulas Bagci, Essam A. Rashed, Mohamed Mabrok

TL;DR
MammoClean is a framework that standardizes mammography datasets to reduce bias and improve AI model generalization across diverse populations and clinical settings.
Contribution
MammoClean introduces a comprehensive standardization and bias-quantification pipeline for mammography datasets, enhancing reproducibility and cross-domain AI performance.
Findings
Significant distributional shifts in breast density and abnormalities across datasets.
Corrupted datasets lead to notable AI performance degradation.
Bias mitigation improves model robustness and generalization.
Abstract
The development of clinically reliable artificial intelligence (AI) systems for mammography is hindered by profound heterogeneity in data quality, metadata standards, and population distributions across public datasets. This heterogeneity introduces dataset-specific biases that severely compromise model generalizability, a fundamental barrier to clinical deployment. We present MammoClean, a public framework for standardization and bias quantification in mammography datasets. MammoClean standardizes case selection and image processing (including laterality and intensity correction) and unifies metadata into a consistent multi-view structure. We provide a comprehensive review of breast anatomy, imaging characteristics, and public mammography datasets to systematically identify key sources of bias. Applying MammoClean to three heterogeneous datasets (CBIS-DDSM, TOMPEI-CMMD,…
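The abstract names laterality and intensity correction as image-processing steps but does not say how they are implemented. The sketch below is one plausible reading, not MammoClean's actual code: the function names, the `inverted` flag (which would correspond to something like a DICOM MONOCHROME1 image, where low values are displayed bright), and the "compare left and right pixel mass" heuristic are all assumptions made for illustration.

```python
import numpy as np


def standardize_intensity(img: np.ndarray, inverted: bool) -> np.ndarray:
    """Rescale to [0, 1] so that tissue is bright on a dark background.

    `inverted` is a hypothetical flag: True means the source image stores
    tissue as dark pixels on a bright background and must be flipped.
    """
    img = img.astype(np.float64)
    lo, hi = img.min(), img.max()
    if hi == lo:
        # Constant image: nothing to rescale.
        return np.zeros_like(img)
    norm = (img - lo) / (hi - lo)
    return 1.0 - norm if inverted else norm


def standardize_laterality(img: np.ndarray) -> np.ndarray:
    """Flip horizontally, if needed, so the breast lies on the left side.

    Assumed heuristic: after intensity correction the half of the image
    with the larger pixel sum contains the breast.
    """
    half = img.shape[1] // 2
    left_mass = img[:, :half].sum()
    right_mass = img[:, half:].sum()
    return img if left_mass >= right_mass else np.fliplr(img)


# Toy example: an inverted image with "tissue" in the two rightmost columns.
raw = np.full((4, 6), 4095, dtype=np.uint16)
raw[:, 4:] = 1000
clean = standardize_laterality(standardize_intensity(raw, inverted=True))
```

Intensity is corrected first because the laterality heuristic relies on tissue being brighter than the background; in the toy example the tissue ends up in the two leftmost columns of `clean`.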
Taxonomy
Topics: AI in cancer detection · Digital Radiography and Breast Imaging · Artificial Intelligence in Healthcare and Education
