Experimental Identification of Hard Data Sets for Classification and Feature Selection Methods with Insights on Method Selection
Cuiju Luan, Guozhu Dong

TL;DR
This study systematically evaluates 48 combinations of classification and feature selection methods on 129 UCI datasets to identify which datasets are hard for those methods, and it ranks the classification methods separately on the hard and the easy datasets.
Contribution
It introduces a systematic evaluation of dataset difficulty and ranks classification methods separately for hard and easy datasets, revealing new insights into method effectiveness.
Findings
15 of the 129 datasets identified as hard for classification
Random Forest remains the top-performing method
Method rankings differ from previous literature
Abstract
The paper reports an experimentally identified list of benchmark data sets that are hard for representative classification and feature selection methods. This was done after systematically evaluating a total of 48 combinations of methods, involving eight state-of-the-art classification algorithms and six commonly used feature selection methods, on 129 data sets from the UCI repository (some data sets with known high classification accuracy were excluded). In this paper, a data set for classification is called hard if none of the 48 combinations can achieve an AUC over 0.8 and none of them can achieve an F-Measure value over 0.8; it is called easy otherwise. A total of 15 out of the 129 data sets were found to be hard in that sense. This paper also compares the performance of different methods, and it produces rankings of classification methods, separately on the hard data sets and on…
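The hard/easy definition in the abstract can be illustrated with a minimal Python sketch. It assumes the per-combination results are already available as (AUC, F-Measure) pairs. The names `is_hard`, `Scores`, and the placeholder method labels are hypothetical and do not come from the paper, and reading "over 0.8" as strictly greater than 0.8 is an assumption.

```python
from typing import Dict, Tuple

# Hypothetical layout: (classifier, feature_selector) -> (auc, f_measure).
Scores = Dict[Tuple[str, str], Tuple[float, float]]

THRESHOLD = 0.8  # cut-off stated in the abstract for both AUC and F-Measure


def is_hard(scores: Scores, threshold: float = THRESHOLD) -> bool:
    """Return True if no combination exceeds the threshold on AUC
    and no combination exceeds it on F-Measure."""
    if not scores:
        raise ValueError("scores must contain at least one combination")
    best_auc = max(auc for auc, _ in scores.values())
    best_f = max(f for _, f in scores.values())
    # "Over 0.8" is read here as strictly greater than 0.8 (assumption).
    return best_auc <= threshold and best_f <= threshold


# Illustrative values only, not results from the paper.
example = {
    ("clf_1", "fs_1"): (0.74, 0.69),
    ("clf_2", "fs_1"): (0.79, 0.71),
}
label = "hard" if is_hard(example) else "easy"
```

Under this reading, a single combination that exceeds 0.8 on either metric is enough to make a data set easy, which is why the check uses the best score per metric across all combinations.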
Taxonomy
Topics: Machine Learning and Data Classification · Imbalanced Data Classification Techniques · Data Mining Algorithms and Applications
