Risk Bounds for Embedded Variable Selection in Classification Trees
Servane Gey (MAP5), Tristan Mary-Huard (AgroParisTech)

TL;DR
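This paper proposes a penalized criterion for classification trees that explicitly accounts for the number of variables, provides a risk bound for the resulting classifier, and compares the criterion with the one used in CART's pruning step. Simulations confirm that hold-out calibration of the CART tuning parameter mimics the form of the proposed penalty.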
Contribution
It proposes a penalized criterion for classification trees that explicitly takes the number of variables into account, and establishes a risk bound for the tree classifier minimizing this criterion.
Findings
The proposed criterion is similar to the one used in CART's pruning step under specific margin assumptions.
In practice, the tuning parameter of the CART penalty has to be calibrated by hold-out.
Simulation studies confirm that the hold-out procedure mimics the form of the proposed penalized criterion.
Abstract
The problems of model selection and variable selection for classification trees are considered jointly. A penalized criterion is proposed which explicitly takes into account the number of variables, and a risk bound inequality is provided for the tree classifier minimizing this criterion. This penalized criterion is compared to the one used during the pruning step of the CART algorithm. It is shown that the two criteria are similar under some specific margin assumptions. In practice, the tuning parameter of the CART penalty has to be calibrated by hold-out. Simulation studies are performed which confirm that the hold-out procedure mimics the form of the proposed penalized criterion.
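To make the hold-out calibration step concrete, here is a minimal Python sketch. It assumes the standard CART cost-complexity form, empirical error plus alpha times the number of leaves, and a fixed family of pruned subtrees summarized by their error rates. The `Subtree` class, the function names, and all numbers are hypothetical and chosen only for illustration; none come from the paper. The paper's own criterion additionally accounts for the number of variables; its exact form is not given in this summary, so it is not reproduced here.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Subtree:
    """Summary of one pruned subtree (hypothetical, for illustration only)."""
    n_leaves: int         # number of leaves of the subtree
    train_error: float    # misclassification rate on the training sample
    holdout_error: float  # misclassification rate on the hold-out sample


def cart_criterion(tree: Subtree, alpha: float) -> float:
    """CART cost-complexity criterion: training error + alpha * number of leaves."""
    return tree.train_error + alpha * tree.n_leaves


def select_subtree(trees: Sequence[Subtree], alpha: float) -> Subtree:
    """Return the subtree minimizing the penalized criterion for a given alpha.

    Ties are broken in favor of the smaller tree.
    """
    return min(trees, key=lambda t: (cart_criterion(t, alpha), t.n_leaves))


def calibrate_alpha(trees: Sequence[Subtree], alphas: Sequence[float]) -> float:
    """Pick the alpha whose selected subtree has the smallest hold-out error."""
    return min(alphas, key=lambda a: select_subtree(trees, a).holdout_error)


# Toy numbers, invented for this sketch (not results from the paper).
trees = [
    Subtree(n_leaves=1, train_error=0.40, holdout_error=0.41),
    Subtree(n_leaves=3, train_error=0.20, holdout_error=0.24),
    Subtree(n_leaves=6, train_error=0.12, holdout_error=0.19),
    Subtree(n_leaves=15, train_error=0.05, holdout_error=0.26),
]
alphas = [0.0, 0.01, 0.03, 0.15]

best_alpha = calibrate_alpha(trees, alphas)
chosen = select_subtree(trees, best_alpha)
print(best_alpha, chosen.n_leaves)
```

With alpha set to zero the largest tree always wins on training error, and a large alpha collapses the choice to the root; the hold-out sample is what arbitrates between these extremes.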
Taxonomy
Topics: Face and Expression Recognition · Gene expression and cancer classification · Data Mining Algorithms and Applications
