The Interaction of Entropy-Based Discretization and Sample Size: An Empirical Study
Casey Bennett

TL;DR
This empirical study investigates how sample size interacts with entropy-based discretization (CAIM) and how that interaction can bias data mining performance metrics. It finds that smaller samples, and certain numbers and types of attributes, can introduce optimistic bias.
Contribution
It provides the first empirical evidence that discretizing outside cross-validation folds (that is, externally, on the full dataset before it is split) can bias performance metrics, especially with small sample sizes and high-dimensional data.
Findings
Discretizing outside the cross-validation folds causes optimistic bias.
Smaller sample sizes increase bias in discretized models.
The types and number of attributes influence the magnitude of the bias.
Abstract
An empirical investigation of the interaction of sample size and discretization (in this case the entropy-based method CAIM, Class-Attribute Interdependence Maximization) was undertaken to evaluate the impact and potential bias introduced into data mining performance metrics by variation in sample size as it affects the discretization process. Of particular interest was the effect of discretizing within cross-validation folds as opposed to outside them. Previous publications have suggested that discretizing externally can bias performance results; however, a thorough review of the literature found no empirical evidence to support such an assertion. This investigation involved construction of over 117,000 models on seven distinct datasets from the UCI (University of California-Irvine) Machine Learning Library and multiple modeling methods across a variety of…
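The distinction between discretizing within folds and discretizing externally can be illustrated with a minimal Python sketch. It is not the study's code and it does not implement CAIM: the hypothetical `best_cut` helper is a much simpler stand-in that picks a single entropy-minimizing cut point, and `k_folds` and `cuts_per_fold` are likewise illustrative names. The only point it makes is where the cut points are learned. Externally, they are learned once from every row, including rows later used for testing. Within folds, they are learned again from each fold's training rows only.

```python
import math
import random
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def best_cut(values, labels):
    """Return the single cut point that minimizes weighted class entropy.

    Assumption: a simple stand-in for a supervised discretizer, NOT CAIM.
    """
    pairs = sorted(zip(values, labels))
    best_score, best_point = float("inf"), None
    for i in range(1, len(pairs)):
        lo, hi = pairs[i - 1][0], pairs[i][0]
        if lo == hi:
            continue  # no boundary can fall between identical values
        left = [y for _, y in pairs[:i]]
        right = [y for _, y in pairs[i:]]
        score = (len(left) * entropy(left) + len(right) * entropy(right)) / len(pairs)
        if score < best_score:
            best_score, best_point = score, (lo + hi) / 2
    return best_point


def k_folds(n, k, seed=0):
    """Shuffle row indices and split them into k disjoint test folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


def cuts_per_fold(values, labels, k=5, external=False, seed=0):
    """Return the cut point that would be applied to each fold's test rows."""
    folds = k_folds(len(values), k, seed)
    # External protocol: one cut learned from all rows, test rows included.
    global_cut = best_cut(values, labels)
    cuts = []
    for test_idx in folds:
        if external:
            cuts.append(global_cut)
        else:
            # Within-fold protocol: the cut sees only this fold's training rows.
            held_out = set(test_idx)
            train = [i for i in range(len(values)) if i not in held_out]
            cuts.append(best_cut([values[i] for i in train],
                                 [labels[i] for i in train]))
    return cuts


if __name__ == "__main__":
    rng = random.Random(1)
    x = [rng.gauss(0, 1) for _ in range(30)]   # small sample
    y = [rng.randint(0, 1) for _ in range(30)]  # labels unrelated to x
    print("external:   ", cuts_per_fold(x, y, external=True))
    print("within-fold:", cuts_per_fold(x, y, external=False))
```

With the external protocol every fold receives the same cut point, which was chosen with knowledge of the test rows' class labels. With the within-fold protocol the cut points can differ from fold to fold, and this becomes more likely as the sample shrinks.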
Taxonomy
Topics: Machine Learning and Data Classification · Imbalanced Data Classification Techniques · Statistical and Computational Modeling
