Semantically Redundant Training Data Removal and Deep Model Classification Performance: A Study with Chest X-rays
Sivaramakrishnan Rajaraman, Ghada Zamzmi, Feng Yang, Zhaohui Liang, Zhiyun Xue, and Sameer Antani

TL;DR
This study introduces an entropy-based method to identify and remove semantically redundant chest X-ray images, improving deep learning model performance by focusing on more informative training data.
Contribution
It proposes a novel entropy-based sample scoring approach for selecting informative training data, demonstrating improved model performance on chest X-ray classification tasks.
Findings
A model trained on the selected data achieves higher recall than one trained on the full dataset.
Removing redundant data enhances generalizability to external datasets.
Entropy-based selection improves training efficiency and effectiveness.
Abstract
Deep learning (DL) has demonstrated its innate capacity to independently learn hierarchical features from complex and multi-dimensional data. A common understanding is that its performance scales up with the amount of training data. Another data attribute is the inherent variety. It follows, therefore, that semantic redundancy, which is the presence of similar or repetitive information, would tend to lower performance and limit generalizability to unseen data. In medical imaging data, semantic redundancy can occur due to the presence of multiple images that have highly similar presentations for the disease of interest. Further, the common use of augmentation methods to generate variety in DL training may be limiting performance when applied to semantically redundant data. We propose an entropy-based sample scoring approach to identify and remove semantically redundant training data. We…
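The idea of entropy-based sample scoring can be illustrated with a minimal sketch. The abstract as shown here does not specify how the score is computed or how the selection threshold is set, so everything below is an assumption made for illustration: the score is taken to be the Shannon entropy of a model's predicted class probabilities, low-entropy samples are treated as the redundant ones, and the names `prediction_entropy`, `select_informative`, and `keep_fraction` are hypothetical.

```python
import math


def prediction_entropy(probs):
    """Shannon entropy (in bits) of one predicted class distribution."""
    # Zero-probability classes contribute nothing to the sum.
    return -sum(p * math.log2(p) for p in probs if p > 0)


def select_informative(prob_rows, keep_fraction=0.5):
    """Return indices of the highest-entropy samples, in original order.

    Assumption: samples the model is already confident about (low entropy)
    are the redundant ones. The paper's actual criterion may differ.
    """
    scores = [prediction_entropy(row) for row in prob_rows]
    n_keep = max(1, round(keep_fraction * len(scores)))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:n_keep])


# Toy predicted probabilities for four training images (two classes).
probs = [
    [0.99, 0.01],  # confident prediction, low entropy
    [0.55, 0.45],  # uncertain prediction, high entropy
    [0.98, 0.02],
    [0.70, 0.30],
]
kept = select_informative(probs, keep_fraction=0.5)
print(kept)
```

In this toy setup the two most uncertain samples are retained and the two confidently predicted ones are dropped. A real pipeline would obtain the probabilities from a trained classifier and would need to validate the retained fraction empirically.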
Taxonomy
Topics: Radiomics and Machine Learning in Medical Imaging · COVID-19 diagnosis using AI · AI in cancer detection
