TL;DR
This paper introduces a novel method combining hierarchical clustering of variables with random forest-based feature selection to improve high-dimensional classification, especially with mixed data types, enhancing interpretability and performance.
Contribution
The proposed approach automatically identifies variable groups and selects relevant synthetic variables without prior knowledge of group structure, handling mixed numerical and categorical data.
Findings
Improved classification accuracy over standard random forests.
Effective reduction of variable redundancy and dimensionality.
Enhanced interpretability through variable grouping.
Abstract
Standard approaches to high-dimensional supervised classification problems often include variable selection and dimension reduction procedures. The novel methodology proposed in this paper combines clustering of variables and feature selection. More precisely, a hierarchical clustering of variables builds groups of correlated variables in order to reduce the redundancy of information, and summarizes each group by a synthetic numerical variable. The originality is that the groups of variables (and the number of groups) are unknown a priori. Moreover, the clustering approach used can deal with both numerical and categorical variables (i.e. mixed datasets). Among all the possible partitions resulting from dendrogram cuts, the most relevant synthetic variables (i.e. groups of variables) are selected with a variable selection procedure using random forests. Numerical…
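The pipeline described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes numerical variables only (the paper also handles categorical ones), uses 1 - |correlation| with average linkage as a stand-in dissimilarity, takes the first principal component of each group as its synthetic variable, and replaces the random forest selection procedure with a crude rule (cross-validated accuracy to pick the dendrogram cut, then above-average importance to keep groups). The function names `synthetic_variables` and `select_partition` and the toy data are hypothetical.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score


def synthetic_variables(X, n_groups):
    """Cut a dendrogram of the columns of X and summarise each group.

    Assumption: numerical columns only, dissimilarity = 1 - |correlation|.
    """
    corr = np.corrcoef(X, rowvar=False)
    dist = 1.0 - np.abs(corr)
    dist = (dist + dist.T) / 2.0      # remove floating-point asymmetry
    np.fill_diagonal(dist, 0.0)
    tree = linkage(squareform(dist, checks=False), method="average")
    labels = fcluster(tree, t=n_groups, criterion="maxclust")

    # Standardise, then use the first principal component of each group
    # as that group's synthetic variable.
    Xs = (X - X.mean(axis=0)) / X.std(axis=0)
    synth = np.column_stack([
        PCA(n_components=1).fit_transform(Xs[:, labels == g])[:, 0]
        for g in np.unique(labels)
    ])
    return synth, labels


def select_partition(X, y, candidate_k):
    """Pick the cut with the best cross-validated forest accuracy, then keep
    the synthetic variables whose importance is at least the mean importance.
    This is a simplified stand-in for a dedicated selection procedure."""
    best = None
    for k in candidate_k:
        synth, labels = synthetic_variables(X, k)
        forest = RandomForestClassifier(n_estimators=100, random_state=0)
        score = cross_val_score(forest, synth, y, cv=3).mean()
        if best is None or score > best[0]:
            best = (score, k, synth, labels)
    _, k, synth, labels = best
    forest = RandomForestClassifier(n_estimators=100, random_state=0)
    forest.fit(synth, y)
    importances = forest.feature_importances_
    kept = np.flatnonzero(importances >= importances.mean())
    return k, labels, kept


# Toy data: 3 latent factors, each observed through 4 noisy copies.
rng = np.random.default_rng(0)
n = 300
latent = rng.normal(size=(n, 3))
X = np.repeat(latent, 4, axis=1) + 0.3 * rng.normal(size=(n, 12))
y = (latent[:, 0] > 0).astype(int)   # only the first factor drives the class

synth3, labels3 = synthetic_variables(X, 3)
best_k, best_labels, kept_groups = select_partition(X, y, range(2, 7))
```

Here `labels3` maps each original column to a group, and `kept_groups` holds the positions of the retained synthetic variables within the selected partition, which is what makes the final model readable in terms of variable groups rather than individual columns.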
