A Survey of Dataset Refinement for Problems in Computer Vision Datasets
Zhijing Wan, Zhixiang Wang, CheukTing Chung, Zheng Wang

TL;DR
This survey reviews recent dataset refinement techniques in computer vision, addressing issues like class imbalance and noisy labels, and categorizes methods into sampling, subset selection, and active learning to guide future research.
Contribution
It provides a comprehensive classification and analysis of dataset refinement methods for computer vision datasets, highlighting their advantages and disadvantages.
Findings
Class imbalance and noisy labels are major dataset problems.
Refinement methods include sampling, subset selection, and active learning.
Different methods have distinct strengths and limitations.
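As a small illustration of the "sampling" family named above, here is a minimal sketch of random oversampling for class imbalance. It is an assumption chosen for illustration, not a method taken from the survey: the function name `random_oversample` and the toy data are hypothetical, and practical pipelines usually resample indices or use weighted samplers rather than copying items.

```python
import random
from collections import Counter


def random_oversample(samples, labels, seed=0):
    """Duplicate minority-class samples at random until every class
    has as many samples as the largest class (illustrative sketch)."""
    rng = random.Random(seed)

    # Group samples by their class label.
    by_class = {}
    for x, y in zip(samples, labels):
        by_class.setdefault(y, []).append(x)

    # Every class is grown to the size of the largest class.
    target = max(len(xs) for xs in by_class.values())

    out_samples, out_labels = [], []
    for y, xs in by_class.items():
        # Draw the missing samples with replacement from the same class.
        extra = [rng.choice(xs) for _ in range(target - len(xs))]
        for x in xs + extra:
            out_samples.append(x)
            out_labels.append(y)
    return out_samples, out_labels


# Hypothetical toy dataset: four "cat" items, one "dog" item.
samples = ["cat1", "cat2", "cat3", "cat4", "dog1"]
labels = ["cat", "cat", "cat", "cat", "dog"]

balanced_samples, balanced_labels = random_oversample(samples, labels)
class_counts = Counter(balanced_labels)
```

After the call, both classes have the same number of entries, and every original sample is still present. Oversampling of this kind only repeats existing data, so it balances class frequencies without adding new information.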
Abstract
Large-scale datasets have played a crucial role in the advancement of computer vision. However, they often suffer from problems such as class imbalance, noisy labels, dataset bias, or high resource costs, which can inhibit model performance and reduce trustworthiness. With the advocacy of data-centric research, various data-centric solutions have been proposed to solve the dataset problems mentioned above. They improve the quality of datasets by re-organizing them, which we call dataset refinement. In this survey, we provide a comprehensive and structured overview of recent advances in dataset refinement for problematic computer vision datasets. Firstly, we summarize and analyze the various problems encountered in large-scale computer vision datasets. Then, we classify the dataset refinement algorithms into three categories based on the refinement process: data sampling, data subset…
Taxonomy
Topics: Machine Learning and Data Classification · Advanced Neural Network Applications · Machine Learning and ELM
