CE-Dedup: Cost-Effective Convolutional Neural Nets Training based on Image Deduplication

Xuan Li; Liqiong Chang; Xue Liu

arXiv:2109.00899·cs.CV·September 3, 2021·1 cites

TL;DR

This paper introduces CE-Dedup, a framework that uses image deduplication to reduce dataset size and training resource consumption for CNNs without significantly impacting accuracy.

Contribution

It proposes a hashing-based deduplication method that balances dataset reduction and accuracy stability for CNN training.
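
As a rough illustration of what hashing-based deduplication can look like, the sketch below pairs a difference hash (dHash) with a greedy Hamming-distance filter. This is an assumption made for illustration only: the summary here does not say which hash functions or thresholds CE-Dedup uses, and the names `dhash`, `deduplicate`, `hash_size`, and `max_distance` are hypothetical.

```python
import numpy as np


def dhash(image: np.ndarray, hash_size: int = 8) -> np.ndarray:
    """Return a difference hash of a 2-D grayscale image as a flat boolean array.

    Hypothetical helper: the image is shrunk to hash_size x (hash_size + 1)
    by block averaging, then each cell is compared with its right neighbour.
    """
    height, width = image.shape
    if height < hash_size or width < hash_size + 1:
        raise ValueError("image is too small for the requested hash size")
    # Integer block boundaries that cover the whole image.
    row_edges = np.linspace(0, height, hash_size + 1).astype(int)
    col_edges = np.linspace(0, width, hash_size + 2).astype(int)
    small = np.array([
        [image[row_edges[r]:row_edges[r + 1], col_edges[c]:col_edges[c + 1]].mean()
         for c in range(hash_size + 1)]
        for r in range(hash_size)
    ])
    # One bit per horizontal neighbour pair: True where brightness increases.
    return (small[:, 1:] > small[:, :-1]).ravel()


def deduplicate(images, max_distance: int = 0) -> list:
    """Return indices of images to keep, dropping near-duplicates greedily.

    An image is dropped when its hash is within max_distance bits (Hamming
    distance) of the hash of an image that was already kept.
    """
    kept_indices = []
    kept_hashes = []
    for index, image in enumerate(images):
        image_hash = dhash(image)
        is_new = all(
            np.count_nonzero(image_hash != kept) > max_distance
            for kept in kept_hashes
        )
        if is_new:
            kept_indices.append(index)
            kept_hashes.append(image_hash)
    return kept_indices
```

Calling `deduplicate(images, max_distance=0)` removes only images whose hashes match exactly. Raising `max_distance` treats more loosely similar images as duplicates, which is one plausible way to trade dataset size against accuracy. The greedy comparison is quadratic in the number of kept images, so a larger dataset would likely need an index structure instead.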

Findings

1. Reduces dataset size by 23% without accuracy loss
2. Achieves 75% dataset reduction with only 5% accuracy drop
3. Validates effectiveness on well-known CNN benchmarks

Abstract

Driven by ever-larger image datasets, Convolutional Neural Networks (CNNs) have become popular for vision-based tasks. Larger datasets are generally desirable because they tend to yield higher training accuracy. However, the impact of dataset quality has not been taken into account. It is reasonable to assume that near-duplicate images exist in these datasets. For instance, the Street View House Numbers (SVHN) dataset, which contains cropped house-plate digits from 0 to 9, is likely to contain repeated digits from the same or similar house plates. Redundant images may take up a certain portion of a dataset without anyone noticing. While contributing little to no accuracy improvement for CNN training, these duplicated images consume extra resources and computation. To this end, this paper proposes a framework to assess the impact of the near-duplicate images on CNN…

Taxonomy

Topics: Advanced Neural Network Applications · Generative Adversarial Networks and Image Synthesis · Brain Tumor Detection and Classification