Exploring Data Redundancy in Real-world Image Classification through Data Selection
Zhenyu Tang, Shaoting Zhang, Xiaosong Wang

TL;DR
This paper introduces novel data valuation metrics and algorithms for data selection in image classification, reducing data requirements and training time while maintaining accuracy, especially in real-world scenarios like medical imaging.
Contribution
It proposes two new data valuation metrics based on Synaptic Intelligence and gradient norms, along with online and offline data selection algorithms for real-world image datasets.
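To make the gradient-norm idea concrete, here is a minimal sketch of scoring one sample by the norm of its loss gradient. It is not the authors' implementation: it assumes a plain linear softmax classifier with cross-entropy loss, and the names `grad_norm_score` and `score_dataset` are hypothetical. For that model the per-sample gradient with respect to the class-c weights is (p_c - y_c) * x, so the full norm factorises as ||p - y|| * ||x||.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def grad_norm_score(x, label, weights):
    """Gradient-norm value of one sample (illustrative assumption, not the paper's code).

    weights[c] is the weight vector of class c in a linear softmax classifier.
    The cross-entropy gradient for class c is (p_c - y_c) * x, so the
    Frobenius norm over all classes equals ||p - y|| * ||x||.
    """
    logits = [sum(w_j * x_j for w_j, x_j in zip(w, x)) for w in weights]
    probs = softmax(logits)
    err = [p - (1.0 if c == label else 0.0) for c, p in enumerate(probs)]
    err_norm = math.sqrt(sum(e * e for e in err))
    x_norm = math.sqrt(sum(v * v for v in x))
    return err_norm * x_norm


def score_dataset(samples, labels, weights):
    """Score every (sample, label) pair with the current weights."""
    return [grad_norm_score(x, y, weights) for x, y in zip(samples, labels)]
```

Under this sketch, a sample the model already classifies confidently and correctly receives a small score, while a misclassified one receives a large score, which is the intuition behind treating low-scoring samples as candidates for redundancy.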
Findings
Online data selection accelerates training by using fewer epochs and training on subsets of the data.
Offline coreset construction reduces the data to 18-30% of the original dataset while preserving accuracy.
Methods are effective on various real-world datasets, including medical imaging.
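As a rough illustration of offline coreset construction from per-sample value scores, the sketch below groups samples by score and keeps a fixed fraction of each group. The grouping rule (equal-sized groups of sorted scores), the function name `select_coreset`, and its parameters are assumptions made for illustration; this summary does not specify the paper's actual clustering and grouping procedure. The default `keep_fraction=0.3` is simply a value inside the 18-30% range reported above.

```python
import random


def select_coreset(scores, keep_fraction=0.3, num_groups=5, seed=0):
    """Pick a subset of sample indices, stratified by value score.

    Hypothetical sketch: sort samples by score, cut the sorted order into
    `num_groups` contiguous groups, and draw `keep_fraction` of each group
    uniformly at random so every score range stays represented.
    """
    if not 0.0 < keep_fraction <= 1.0:
        raise ValueError("keep_fraction must be in (0, 1]")
    if num_groups < 1:
        raise ValueError("num_groups must be at least 1")

    rng = random.Random(seed)
    n = len(scores)
    order = sorted(range(n), key=lambda i: scores[i])

    selected = []
    for g in range(num_groups):
        group = order[g * n // num_groups:(g + 1) * n // num_groups]
        if not group:
            continue  # more groups than samples
        k = max(1, round(keep_fraction * len(group)))
        selected.extend(rng.sample(group, k))
    return sorted(selected)
```

Sampling within groups rather than keeping only the highest-scoring samples is a design choice of this sketch: it avoids discarding entire regions of the score distribution, at the cost of retaining some low-value samples.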
Abstract
Deep learning models often require large amounts of data for training, leading to increased costs. This is particularly challenging in medical imaging, where distributed data must be gathered for centralized training and obtaining quality labels remains a tedious job. Many methods have been proposed to address this issue in various training paradigms, e.g., continual learning, active learning, and federated learning, all of which involve some form of data valuation. However, existing methods are either overly intuitive or limited to common clean/toy datasets in their experiments. In this work, we present two data valuation metrics, based on Synaptic Intelligence and gradient norms respectively, to study the redundancy in real-world image data. Novel online and offline data selection algorithms are then proposed via clustering and grouping based on the examined data…
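For readers unfamiliar with Synaptic Intelligence, the sketch below shows the quantity as it was originally defined for continual learning: each parameter accumulates the product of its negative gradient and its update along the training path, and the total is normalised by the squared overall displacement. How the paper converts this into a per-sample value is not stated in this summary, so the code covers only the standard per-parameter form. The function name `si_importance`, the plain gradient-descent loop, and the damping term `xi` value are illustrative assumptions.

```python
def si_importance(grad_fn, theta, lr=0.1, steps=50, xi=1e-3):
    """Per-parameter Synaptic Intelligence importance under plain gradient descent.

    grad_fn(theta) returns the loss gradient as a list of floats.
    omega[k] accumulates -g_k * delta_theta_k over the trajectory, and the
    result is omega[k] / (total_displacement_k ** 2 + xi).
    """
    start = list(theta)
    theta = list(theta)
    omega = [0.0] * len(theta)

    for _ in range(steps):
        g = grad_fn(theta)
        new_theta = [t - lr * g_k for t, g_k in zip(theta, g)]
        for k in range(len(theta)):
            # Contribution of parameter k to the loss decrease at this step.
            omega[k] += -g[k] * (new_theta[k] - theta[k])
        theta = new_theta

    return [
        omega[k] / ((theta[k] - start[k]) ** 2 + xi)
        for k in range(len(theta))
    ]


# Usage: a quadratic loss 0.5 * (4 * t0**2 + t1**2), steeper along t0.
importance = si_importance(lambda th: [4.0 * th[0], 1.0 * th[1]], [1.0, 1.0])
```

In this toy run the parameter along the steeper direction contributes more to the loss decrease per unit of movement, so it ends up with the larger importance.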
Taxonomy
Topics: Medical Image Segmentation Techniques · Image Retrieval and Classification Techniques · COVID-19 diagnosis using AI
