CE-Dedup: Cost-Effective Convolutional Neural Nets Training based on Image Deduplication
Xuan Li, Liqiong Chang, Xue Liu

TL;DR
This paper introduces CE-Dedup, a framework that uses image deduplication to reduce dataset size and training resource consumption for CNNs without significantly impacting accuracy.
Contribution
It proposes a hashing-based deduplication method that balances dataset reduction and accuracy stability for CNN training.
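To make the idea of hashing-based deduplication concrete, here is a minimal sketch. It is not the paper's implementation: the summary does not say which hash function or distance threshold CE-Dedup uses, so the average hash, the Hamming-distance comparison, the greedy first-seen policy, and the names `average_hash`, `hamming`, and `deduplicate` are all illustrative assumptions. Images are assumed to be small grayscale grids that were already resized to a common shape.

```python
from typing import List, Sequence

# Assumption: each image is a small grayscale grid (rows of 0-255 values),
# already resized to the same shape before hashing.
Image = Sequence[Sequence[int]]


def average_hash(img: Image) -> int:
    """Pack one bit per pixel: 1 if the pixel is brighter than the image mean."""
    pixels = [p for row in img for p in row]
    mean = sum(pixels) / len(pixels)
    bits = 0
    for p in pixels:
        bits = (bits << 1) | (1 if p > mean else 0)
    return bits


def hamming(a: int, b: int) -> int:
    """Number of bit positions in which two hashes differ."""
    return bin(a ^ b).count("1")


def deduplicate(images: List[Image], threshold: int) -> List[int]:
    """Return indices of images to keep.

    Greedy policy (an assumption): an image is kept only if its hash is more
    than `threshold` bits away from every hash kept so far.
    """
    kept: List[int] = []
    kept_hashes: List[int] = []
    for i, img in enumerate(images):
        h = average_hash(img)
        if all(hamming(h, k) > threshold for k in kept_hashes):
            kept.append(i)
            kept_hashes.append(h)
    return kept
```

In this sketch, raising `threshold` treats more images as near-duplicates and shrinks the dataset further, which is one way the trade-off between dataset reduction and accuracy could be tuned.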
Findings
Reduces dataset size by 23% without accuracy loss
Achieves 75% dataset reduction with only 5% accuracy drop
Validates effectiveness on well-known CNN benchmarks
Abstract
Driven by ever-larger image datasets, Convolutional Neural Networks (CNNs) have become popular for vision-based tasks. Larger datasets are generally desirable because they tend to yield higher training accuracy. However, the impact of dataset quality has received far less attention. It is reasonable to assume that near-duplicate images exist in these datasets. For instance, the Street View House Numbers (SVHN) dataset, which contains cropped house-plate digits from 0 to 9, is likely to include repeated digits from the same or similar house plates. Redundant images may take up a certain portion of a dataset without anyone noticing. While contributing little or nothing to CNN training accuracy, these duplicated images consume extra resources and computation. To this end, this paper proposes a framework to assess the impact of the near-duplicate images on CNN…
Taxonomy
Topics: Advanced Neural Network Applications · Generative Adversarial Networks and Image Synthesis · Brain Tumor Detection and Classification
