Adaptive MSD-Splitting: Enhancing C4.5 and Random Forests for Skewed Continuous Attributes
Jake Lee

TL;DR
This paper introduces Adaptive MSD-Splitting, a dynamic discretization method for decision trees that improves accuracy on skewed data while maintaining computational efficiency, and integrates it into Random Forests.
Contribution
The paper proposes AMSD, an adaptive discretization technique that adjusts its standard deviation multiplier to feature skewness, and demonstrates its effectiveness within Random Forests.
Findings
AMSD improves accuracy by 2-4% over standard MSD-Splitting.
RF-AMSD achieves state-of-the-art accuracy with reduced computational costs.
The method maintains near-linear time complexity despite adaptive adjustments.
Abstract
The discretization of continuous numerical attributes remains a persistent computational bottleneck in the induction of decision trees, particularly as dataset dimensions scale. Building upon the recently proposed MSD-Splitting technique -- which bins continuous data using the empirical mean and standard deviation to dramatically improve the efficiency and accuracy of the C4.5 algorithm -- we introduce Adaptive MSD-Splitting (AMSD). While standard MSD-Splitting is highly effective for approximately symmetric distributions, its rigid adherence to fixed one-standard-deviation cutoffs can lead to catastrophic information loss in highly skewed data, a common artifact in real-world biomedical and financial datasets. AMSD addresses this by dynamically adjusting the standard deviation multiplier based on feature skewness, narrowing intervals in dense regions to preserve discriminative…
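As a rough illustration of the idea in the abstract, the sketch below bins a feature around its mean using a standard deviation multiplier that shrinks as skewness grows. The abstract does not give the actual AMSD update rule, so the shrink formula `k = 1 / (1 + alpha * |skewness|)`, the `alpha` parameter, and all function names here are assumptions made for illustration only. With zero skewness the sketch reduces to fixed one-standard-deviation cutoffs, as in standard MSD-Splitting.

```python
import numpy as np


def sample_skewness(x):
    """Moment coefficient of skewness (population form): m3 / sd**3."""
    x = np.asarray(x, dtype=float)
    sd = x.std()
    if sd == 0.0:
        return 0.0
    return float(np.mean((x - x.mean()) ** 3) / sd ** 3)


def adaptive_multiplier(x, alpha=0.5):
    """ASSUMED rule, not taken from the paper: shrink the multiplier
    from 1 toward 0 as the absolute skewness increases."""
    return 1.0 / (1.0 + alpha * abs(sample_skewness(x)))


def amsd_cut_points(x, alpha=0.5):
    """Hypothetical cut points at mean - k*sd and mean + k*sd."""
    x = np.asarray(x, dtype=float)
    mu, sd = x.mean(), x.std()
    k = adaptive_multiplier(x, alpha)
    return mu - k * sd, mu + k * sd


def discretize(x, cuts):
    """Map each value to bin 0 (low), 1 (middle), or 2 (high)."""
    return np.digitize(np.asarray(x, dtype=float), list(cuts)).tolist()


symmetric = [1, 2, 3, 4, 5]
skewed = [1, 1, 1, 1, 10]

print(amsd_cut_points(symmetric), discretize(symmetric, amsd_cut_points(symmetric)))
print(amsd_cut_points(skewed), discretize(skewed, amsd_cut_points(skewed)))
```

On the skewed sample, a fixed one-standard-deviation lower cutoff falls below the smallest observed value, leaving the low bin empty; the narrower interval in this sketch is one way such a loss could be reduced, though the paper's own rule may differ.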
