Dataset Bias Mitigation Through Analysis of CNN Training Scores
Ekberjan Derman

TL;DR
This paper introduces a score-based resampling method to identify and augment under-represented samples in training datasets, reducing bias and improving CNN performance across diverse groups.
Contribution
The paper proposes a novel, domain-independent score-based resampling approach for mitigating dataset bias in CNN training datasets.
Findings
The method effectively identifies under-represented samples.
Resampling reduces categorical bias in gender classification.
Results outperform VAE-based bias mitigation techniques.
Abstract
Training datasets are crucial for convolutional neural network (CNN)-based algorithms because they directly affect overall performance. As such, using a well-structured dataset with a minimal level of bias is always desirable. In this paper, we propose a novel, domain-independent approach, called score-based resampling (SBR), to locate the under-represented samples of the original training dataset based on the model prediction scores obtained with that training set. In our method, once the CNN model is trained, we use it to run inference on its own training samples and obtain prediction scores; based on the distance between the predicted score and the ground truth, we identify samples that are far from their ground truth and augment them in the original training set. The temperature term of the sigmoid function is decreased to better differentiate scores. For experimental evaluation, we selected one Kaggle…
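The selection step described in the abstract can be illustrated with a minimal NumPy sketch. It is not the author's implementation. The function names, the `threshold` and `copies` parameters, and the `logits / temperature` form of the sigmoid are all assumptions made for illustration; the abstract does not state how the temperature is parametrized, how the distance cutoff is chosen, or how the selected samples are augmented. Here augmentation is simplified to repeating sample indices.

```python
import numpy as np

def sigmoid_with_temperature(logits, temperature=1.0):
    """Sigmoid of logits / temperature (assumed parametrization, not from the paper)."""
    logits = np.asarray(logits, dtype=float)
    return 1.0 / (1.0 + np.exp(-logits / temperature))

def select_under_represented(scores, labels, threshold=0.5):
    """Indices of samples whose score lies farther than `threshold` from the label.

    `threshold` is a hypothetical cutoff chosen for this sketch.
    """
    distance = np.abs(np.asarray(scores, dtype=float) - np.asarray(labels, dtype=float))
    return np.flatnonzero(distance > threshold)

def resample_indices(n_samples, hard_indices, copies=1):
    """All original indices plus `copies` extra repetitions of each selected index."""
    extra = np.repeat(hard_indices, copies)
    return np.concatenate([np.arange(n_samples), extra])

# Toy binary example: raw model outputs on the model's own training samples.
logits = np.array([4.0, -3.0, -0.5, 0.2])
labels = np.array([1, 0, 1, 0])

scores = sigmoid_with_temperature(logits, temperature=1.0)
hard = select_under_represented(scores, labels, threshold=0.5)
resampled = resample_indices(len(labels), hard, copies=2)

print(hard)       # indices of samples far from their ground truth
print(resampled)  # index list for the augmented training set
```

In this toy run, the third and fourth samples are scored on the wrong side of 0.5 relative to their labels, so they are the ones repeated in the resampled index list. A real pipeline would apply image augmentation to those samples rather than plain duplication.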
Taxonomy
Topics: Machine Learning and Data Classification · Adversarial Robustness in Machine Learning · Human Pose and Action Recognition
