Independent evaluation of state-of-the-art deep networks for mammography
Osvaldo Matias Velarde, Lucas Parra

TL;DR
This study evaluates the robustness of state-of-the-art deep learning models for mammography across multiple datasets, revealing limited generalizability and emphasizing the need for larger, more diverse public datasets for reliable performance.
Contribution
It provides an independent assessment of top mammography models, highlighting their poor out-of-sample performance and the importance of dataset diversity for model robustness.
Findings
Models perform well on their original datasets.
Models perform poorly on out-of-sample data.
Using all four mammogram views improves robustness.
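The comparison behind these findings, scoring a fixed pre-trained model on its original test set and on external datasets, can be sketched as follows. This is a minimal illustration, not the authors' evaluation code. The choice of ROC AUC as the metric is an assumption (this summary does not name one), and the dataset names, labels, and scores are hypothetical toy values.

```python
from typing import Dict, Sequence, Tuple


def roc_auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """ROC AUC via the Mann-Whitney statistic.

    Equals the probability that a randomly chosen positive case receives a
    higher score than a randomly chosen negative case (ties count as half).
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative case")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def evaluate_across_datasets(
    datasets: Dict[str, Tuple[Sequence[int], Sequence[float]]]
) -> Dict[str, float]:
    """Compute one AUC per dataset for the scores of a single frozen model."""
    return {name: roc_auc(y, s) for name, (y, s) in datasets.items()}


# Hypothetical toy data: (ground-truth labels, model scores) per dataset.
toy = {
    "in_sample": ([1, 1, 0, 0], [0.9, 0.8, 0.2, 0.1]),
    "out_of_sample": ([1, 1, 0, 0], [0.6, 0.3, 0.5, 0.2]),
}
results = evaluate_across_datasets(toy)
for name, auc in results.items():
    print(f"{name}: AUC = {auc:.2f}")
```

The model parameters stay fixed throughout; only the evaluation data changes, so any drop in the metric reflects a difference between datasets and not a difference in training.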
Abstract
Deep neural models have shown remarkable performance in image recognition tasks whenever large datasets of labeled images are available. The largest datasets in radiology are available for screening mammography. Recent reports, including in high-impact journals, document performance of deep models at or above that of trained radiologists. What is not yet known is whether the performance of these trained models is robust and replicates across datasets. Here we evaluate the performance of five published state-of-the-art models on four publicly available mammography datasets. The limited size of public datasets precludes retraining the models, so we are limited to evaluating those models that have been made available with pre-trained parameters. Where test data was available, we replicated published results. However, the trained models performed poorly on out-of-sample data, except when based on…
Taxonomy
Topics: Radiomics and Machine Learning in Medical Imaging · AI in cancer detection · COVID-19 diagnosis using AI
Methods: Test
