TL;DR
SMOTE-ENC introduces a new synthetic data generation technique that encodes nominal features as numeric values, improving over SMOTE-NC especially for datasets with many categorical features or purely nominal data.
Contribution
The paper proposes SMOTE-ENC, a novel over-sampling method that encodes nominal features numerically, addressing limitations of SMOTE-NC and applicable to both mixed and nominal-only datasets.
Findings
SMOTE-ENC outperforms SMOTE-NC in datasets with many nominal features.
The method effectively handles datasets with only nominal features.
SMOTE-ENC improves classification accuracy in imbalanced datasets.
Abstract
Real-world datasets are heavily skewed, with some classes significantly outnumbered by the others. In these situations, machine learning algorithms fail to achieve substantial efficacy when predicting these under-represented instances. To solve this problem, many variations of synthetic minority over-sampling methods (SMOTE) have been proposed to balance datasets that contain continuous features. However, for datasets with both nominal and continuous features, SMOTE-NC is the only SMOTE-based over-sampling technique to balance the data. In this paper, we present a novel minority over-sampling method, SMOTE-ENC (SMOTE - Encoded Nominal and Continuous), in which nominal features are encoded as numeric values and the difference between two such numeric values reflects the amount of change of association with the minority class. Our experiments show that the classification…
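The encoding idea in the abstract can be illustrated with a small sketch. The abstract excerpt does not give the formula, so the scoring rule below is an assumption made for illustration, not the paper's confirmed definition: each category is scored by how far its observed minority count departs from the count expected if the category were independent of the class label. The function name `encode_nominal` and the toy data are hypothetical.

```python
from collections import Counter


def encode_nominal(values, labels, minority_label):
    """Map each category to a number reflecting its association with the minority class.

    ASSUMPTION (not taken from the source text):
        score = (observed - expected) / expected
    where `expected` is the minority count a category would have if it were
    independent of the class label.
    """
    if len(values) != len(labels):
        raise ValueError("values and labels must have the same length")
    if not labels:
        raise ValueError("labels must not be empty")

    n = len(labels)
    minority_rate = sum(1 for y in labels if y == minority_label) / n
    if minority_rate == 0:
        raise ValueError("minority_label does not occur in labels")

    totals = Counter(values)  # rows per category
    observed = Counter(v for v, y in zip(values, labels) if y == minority_label)

    codes = {}
    for category, total in totals.items():
        expected = total * minority_rate
        codes[category] = (observed[category] - expected) / expected
    return codes


# Toy data: 12 rows, 3 of them in the minority class (label 1).
values = ["a"] * 4 + ["b"] * 4 + ["c"] * 4
labels = [1, 1, 0, 0,   1, 0, 0, 0,   0, 0, 0, 0]

codes = encode_nominal(values, labels, minority_label=1)
print(codes)
```

Under this assumed rule, a category over-represented in the minority class gets a positive code, one that matches the overall minority rate gets zero, and one that never appears in the minority class gets -1. The gap between two codes then acts as a distance between categories, which is the property the abstract describes.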
