Learning When Training Data are Costly: The Effect of Class Distribution on Tree Induction
F. Provost, G. M. Weiss

TL;DR
This paper studies how the class distribution of the training data affects the performance of classification trees when the number of training examples is limited by cost, and proposes a budget-sensitive sampling algorithm for choosing how many examples of each class to include.
Contribution
It analyzes, for a fixed training-set size, how class distribution affects tree-induction performance, and introduces a budget-sensitive sampling method for selecting training examples to improve classifier performance.
Findings
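A balanced class distribution generally performs well when classifiers are evaluated by area under the ROC curve (AUC).
The naturally occurring class distribution generally performs well when classifiers are evaluated by error rate.
The proposed sampling algorithm yields classifiers with near-optimal performance.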
Abstract
For large, real-world inductive learning problems, the number of training examples often must be limited due to the costs associated with procuring, preparing, and storing the training examples and/or the computational costs associated with learning from them. In such circumstances, one question of practical importance is: if only n training examples can be selected, in what proportion should the classes be represented? In this article we help to answer this question by analyzing, for a fixed training-set size, the relationship between the class distribution of the training data and the performance of classification trees induced from these data. We study twenty-six data sets and, for each, determine the best class distribution for learning. The naturally occurring class distribution is shown to generally perform well when classifier performance is evaluated using undifferentiated error…
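The question posed in the abstract (given a budget of n training examples, in what proportion should the classes be represented?) can be made concrete with a small sketch. The code below is not the paper's budget-sensitive sampling algorithm; it only illustrates the simpler building block of drawing a fixed-size training set with a chosen class distribution. The function name `sample_with_class_distribution`, its parameters, and the synthetic pool are all assumptions made for illustration.

```python
import random
from collections import Counter


def sample_with_class_distribution(examples, n, minority_fraction,
                                   minority_label=1, seed=0):
    """Draw n (features, label) pairs without replacement so that
    round(n * minority_fraction) of them carry `minority_label`.

    Hypothetical helper for illustration; not taken from the paper.
    """
    rng = random.Random(seed)
    minority = [ex for ex in examples if ex[1] == minority_label]
    majority = [ex for ex in examples if ex[1] != minority_label]

    n_min = round(n * minority_fraction)
    n_maj = n - n_min
    if n_min > len(minority) or n_maj > len(majority):
        raise ValueError("not enough examples of one class for this distribution")

    sample = rng.sample(minority, n_min) + rng.sample(majority, n_maj)
    rng.shuffle(sample)
    return sample


# Synthetic pool (an assumption): 1000 examples, 10% in the minority class.
pool = [((i,), 1 if i % 10 == 0 else 0) for i in range(1000)]

# Same budget (n = 100), two candidate class distributions:
# the natural one (10% minority) and a balanced one (50% minority).
for frac in (0.1, 0.5):
    train = sample_with_class_distribution(pool, n=100, minority_fraction=frac)
    counts = Counter(label for _, label in train)
    print(f"minority fraction {frac}: {counts[1]} minority, {counts[0]} majority")
```

In an experiment of the kind the abstract describes, each such fixed-size sample would be used to induce a classification tree, and the resulting trees would then be compared on a common test set under the chosen metric (error rate or AUC).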
