Random Forests, Decision Trees, and Categorical Predictors: The "Absent Levels" Problem
Timothy C. Au

TL;DR
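This paper identifies and explores the "absent levels" problem in decision tree based methods such as random forests. The problem arises when an observation reaches a categorical split that was determined while the observation's level was absent during training, so it is unclear which branch the observation should follow. Overlooking this indeterminacy can bias predictions and reduce model performance.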
Contribution
It introduces the absent levels problem in decision trees, demonstrates its impact using real data and case studies, and proposes simple heuristics to mitigate it.
Findings
Absent levels can bias random forest predictions systematically.
Overlooking absent levels can significantly reduce model accuracy.
Simple heuristics can help mitigate the absent levels problem.
Abstract
One advantage of decision tree based methods like random forests is their ability to natively handle categorical predictors without having to first transform them (e.g., by using feature engineering techniques). However, in this paper, we show how this capability can lead to an inherent "absent levels" problem for decision tree based methods that has never been thoroughly discussed, and whose consequences have never been carefully explored. This problem occurs whenever there is an indeterminacy over how to handle an observation that has reached a categorical split which was determined when the observation in question's level was absent during training. Although these incidents may appear to be innocuous, by using Leo Breiman and Adele Cutler's random forests FORTRAN code and the randomForest R package (Liaw and Wiener, 2002) as motivating case studies, we examine how overlooking the…
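The indeterminacy described in the abstract can be illustrated with a toy categorical split. The following is a minimal sketch, not the paper's code or the behavior of any particular library: the class `CategoricalSplit`, the `route` method, and the `absent_policy` options are hypothetical names introduced here for illustration only.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class CategoricalSplit:
    """Toy categorical split (hypothetical, for illustration only).

    Levels present when the split was determined are assigned
    to either the left or the right child.
    """
    left_levels: frozenset
    right_levels: frozenset

    def route(self, level, absent_policy="error", rng=None):
        """Return "left" or "right" for an observation's level.

        absent_policy is an assumption of this sketch; it decides what
        happens when the level was absent when the split was determined.
        """
        if level in self.left_levels:
            return "left"
        if level in self.right_levels:
            return "right"

        # The level is absent: the split itself gives no guidance,
        # so the outcome depends entirely on the chosen policy.
        if absent_policy == "left":
            return "left"
        if absent_policy == "right":
            return "right"
        if absent_policy == "random":
            rng = rng or random.Random()
            return rng.choice(["left", "right"])
        if absent_policy == "error":
            raise ValueError(f"level {level!r} was absent when the split was determined")
        raise ValueError(f"unknown absent_policy: {absent_policy!r}")


# Levels "A", "B", and "C" were present when the split was determined.
split = CategoricalSplit(frozenset({"A", "B"}), frozenset({"C"}))

# Level "D" was absent, so two equally arbitrary policies disagree.
always_left = split.route("D", absent_policy="left")
always_right = split.route("D", absent_policy="right")
```

In this sketch, a policy that always sends absent levels the same way routes every such observation to one child regardless of its response, which is one way a systematic bias of the kind listed under Findings could enter predictions.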
Taxonomy
Topics: Machine Learning and Data Classification · Data Mining Algorithms and Applications · Bayesian Modeling and Causal Inference
