Clustering and Learning from Imbalanced Data

Naman D. Singh; Abhinav Dhall

arXiv:1811.00972 · cs.LG · November 13, 2018 · 21 cites

Open Access

TL;DR

This paper introduces a clustering-based oversampling method to improve learning from imbalanced datasets by generating synthetic minority class samples that respect the data distribution, enhancing classifier performance.

Contribution

The paper proposes a novel resampling technique that uses clustering to generate synthetic minority samples, with little dependence on the technique used to find cluster centroids, while guarding against overfitting.

Findings

01 Improves classifier performance on imbalanced datasets.

02 Outperforms existing synthetic resampling techniques.

03 Effective across multiple datasets and evaluation metrics.

Abstract

A learning classifier must outperform a trivial solution; in the case of imbalanced data, this condition usually does not hold. To overcome this problem, we propose a novel data-level resampling method, Clustering Based Oversampling, for improved learning from class-imbalanced datasets. The essential idea behind the proposed method is to use the distance between a minority class sample and its respective cluster centroid to infer the number of new sample points to be generated for that minority class sample. The proposed algorithm depends very little on the technique used for finding cluster centroids and does not affect majority class learning in any way. It also improves learning from imbalanced data by incorporating the distribution structure of minority class samples into the generation of new data samples. The newly generated minority class data is handled in a way as to…
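
The core idea in the abstract (the distance from a minority sample to its cluster centroid decides how many new points that sample receives) can be illustrated with a minimal sketch. This is not the authors' algorithm. The abstract does not state the allocation rule or how new points are placed, so two assumptions are made here: counts are allocated in proportion to distance, and each new point is a random interpolation between the sample and its centroid. The function names `centroid`, `allocate_counts`, and `oversample` are hypothetical. Cluster labels are taken as input, since the abstract says the method depends little on how centroids are found.

```python
import math
import random


def centroid(points):
    """Coordinate-wise mean of a non-empty list of equal-length points."""
    dim = len(points[0])
    return [sum(p[d] for p in points) / len(points) for d in range(dim)]


def allocate_counts(distances, n_new):
    """Split n_new synthetic points across samples.

    ASSUMPTION: counts are proportional to distance from the centroid.
    The abstract only says the distance is used to infer the count.
    """
    total = sum(distances)
    if total == 0:
        # Every sample sits on its centroid: spread the budget evenly.
        shares = [n_new / len(distances)] * len(distances)
    else:
        shares = [n_new * d / total for d in distances]
    counts = [int(s) for s in shares]
    # Give the remaining points to the largest fractional shares.
    leftover = n_new - sum(counts)
    order = sorted(range(len(shares)),
                   key=lambda i: shares[i] - counts[i], reverse=True)
    for i in order[:leftover]:
        counts[i] += 1
    return counts


def oversample(minority, labels, n_new, seed=0):
    """Generate n_new synthetic minority points; majority data is untouched.

    minority: list of feature vectors for the minority class only.
    labels:   cluster label for each minority sample (from any clusterer).
    """
    rng = random.Random(seed)
    clusters = {}
    for p, c in zip(minority, labels):
        clusters.setdefault(c, []).append(p)
    centroids = {c: centroid(pts) for c, pts in clusters.items()}
    distances = [math.dist(p, centroids[c]) for p, c in zip(minority, labels)]
    counts = allocate_counts(distances, n_new)

    synthetic = []
    for p, c, k in zip(minority, labels, counts):
        for _ in range(k):
            # ASSUMPTION: place the new point on the segment between the
            # sample and its cluster centroid.
            t = rng.random()
            synthetic.append([a + t * (b - a) for a, b in zip(p, centroids[c])])
    return synthetic, counts


minority = [[0.0, 0.0], [4.0, 0.0], [1.0, 0.0], [10.0, 10.0], [12.0, 10.0]]
labels = [0, 0, 0, 1, 1]
synthetic, counts = oversample(minority, labels, n_new=10)
print(counts, len(synthetic))
```

Because every synthetic point lies between an existing minority sample and its own cluster centroid, the new data stays inside the region the minority clusters already occupy. That is one simple way to respect the distribution structure the abstract refers to, though the paper itself may do this differently.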

Taxonomy

Topics: Imbalanced Data Classification Techniques · Data Mining Algorithms and Applications · Spam and Phishing Detection