Similarity Based Stratified Splitting: an approach to train better classifiers
Felipe Farias, Teresa Ludermir, Carmelo Bastos-Filho

TL;DR
This paper introduces a similarity-based stratified splitting method that improves data partitioning for training classifiers, leading to more realistic performance estimates across various datasets and classifiers.
Contribution
The paper presents Similarity-Based Stratified Splitting (SBSS), a technique that uses similarity functions among samples to create more representative data splits, yielding more realistic estimates of classifier performance.
Findings
Outperforms ordinary stratified 10-fold cross-validation in 75% of the assessed scenarios (Wilcoxon signed-rank test)
Effective across multiple classifiers and similarity functions
Provides more realistic performance estimates in real-world applications
Abstract
We propose a Similarity-Based Stratified Splitting (SBSS) technique, which uses both the output and input space information to split the data. The splits are generated using similarity functions among samples to place similar samples in different splits. This approach allows for a better representation of the data in the training phase and leads to a more realistic performance estimation when used in real-world applications. We evaluate our proposal on twenty-two benchmark datasets with classifiers such as Multi-Layer Perceptron, Support Vector Machine, Random Forest, and K-Nearest Neighbors, and five similarity functions: Cityblock, Chebyshev, Cosine, Correlation, and Euclidean. According to the Wilcoxon signed-rank test, our approach consistently outperformed ordinary stratified 10-fold cross-validation in 75% of the assessed scenarios.
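The abstract does not spell out the exact splitting procedure, so the following is only a minimal sketch of one plausible reading of the idea, not the authors' implementation. Within each class (output space), it takes a pivot sample, finds its nearest not-yet-assigned neighbours (input space), and places each member of that group in a different fold. The function name `sbss_folds`, its parameters, and the pivot-selection rule are assumptions made for illustration; the distance computation uses `scipy.spatial.distance.cdist`, whose metric names cover the five functions listed in the abstract.

```python
import numpy as np
from scipy.spatial.distance import cdist


def sbss_folds(X, y, n_splits=10, metric="euclidean", seed=0):
    """Return a fold index in [0, n_splits) for every sample.

    Hypothetical sketch: within each class, a pivot and its nearest
    unassigned same-class neighbours are spread over different folds.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    rng = np.random.default_rng(seed)
    folds = np.full(len(y), -1, dtype=int)
    offset = 0  # rotating start fold keeps fold sizes balanced

    for label in np.unique(y):
        idx = np.flatnonzero(y == label)
        rng.shuffle(idx)  # random pivot order (an assumption)
        remaining = list(idx)
        while remaining:
            pivot = remaining[0]
            # Distances from the pivot to every unassigned sample of this class.
            dists = cdist(X[[pivot]], X[remaining], metric=metric)[0]
            dists[0] = -np.inf  # make sure the pivot itself comes first
            order = np.argsort(dists, kind="stable")[:n_splits]
            group = [remaining[i] for i in order]
            # Similar samples go to different folds.
            for j, sample in enumerate(group):
                folds[sample] = (offset + j) % n_splits
            offset = (offset + len(group)) % n_splits
            chosen = set(group)
            remaining = [i for i in remaining if i not in chosen]
    return folds
```

For fold `k`, the test indices would be `np.flatnonzero(folds == k)` and the training indices `np.flatnonzero(folds != k)`. Passing `metric="cityblock"`, `"chebyshev"`, `"cosine"`, or `"correlation"` swaps in the other similarity functions named in the abstract. Because groups are formed per class and the starting fold rotates, each class is spread across folds with counts that differ by at most one, which keeps the split stratified.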
Taxonomy
Topics: Face and Expression Recognition · Time Series Analysis and Forecasting · Machine Learning and Data Classification
