Regularisation of CART trees by summation of $p$-values
Nils Engler, Mathias Lindholm, Filip Lindskog, Taariq Nazar

TL;DR
This paper introduces a deterministic, p-value-based stopping rule for CART regression trees, improving efficiency and interpretability by avoiding cross-validation and enabling in-sample complexity control.
Contribution
It proposes an in-sample, $p$-value-based method for stopping CART tree growth, derived from a connection to change point detection and applicable to covariate vectors of arbitrary dimension.
Findings
Given a sufficiently large sample, the method detects a signal with high probability.
Summing the node-wise $p$-values gives an upper bound on the $p$-value of the entire tree.
The method is demonstrated on both simulated and real data.
Abstract
The standard procedure to decide on the complexity of a CART regression tree is to use cross-validation with the aim of obtaining a predictor that generalises well to unseen data. The randomness in the selection of folds implies that the selected CART regression tree is not a deterministic function of the data. Moreover, the cross-validation procedure may become time consuming and result in inefficient use of training data. We propose a simple deterministic in-sample method that can be used for stopping the growing of a CART regression tree based on node-wise statistical tests. This testing procedure is derived using a connection to change point detection, where the null hypothesis corresponds to no signal. The suggested $p$-value based procedure allows us to consider covariate vectors of arbitrary dimension and allows us to bound the $p$-value of an entire tree from above. Further, we…
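The idea of stopping tree growth with node-wise tests whose $p$-values are summed can be illustrated with a minimal Python sketch. This is not the authors' test: as a loudly labelled assumption, a Bonferroni-corrected two-sample z-test over candidate split points stands in for the paper's change point statistic. The function names `node_p_value` and `grow`, the budget `delta`, and the single-covariate setting are all hypothetical choices made for illustration.

```python
import math
import numpy as np


def node_p_value(x, y):
    """Upper bound on a p-value for 'no signal' within one node.

    ASSUMPTION: a two-sample z-test per candidate split, Bonferroni-corrected
    over the candidates, stands in for the paper's change point test.
    Returns (p_bound, threshold); threshold is None if no split is possible.
    """
    n = len(y)
    if n < 4:
        return 1.0, None
    order = np.argsort(x, kind="stable")
    xs, ys = x[order], y[order]
    sigma2 = np.var(ys, ddof=1)  # variance estimated under the null
    if sigma2 == 0:
        return 1.0, None
    best_p, best_thr, n_tests = 1.0, None, 0
    for k in range(2, n - 1):  # keep at least two observations per side
        if xs[k - 1] == xs[k]:
            continue  # cannot split between tied covariate values
        n_tests += 1
        diff = ys[:k].mean() - ys[k:].mean()
        z = diff / math.sqrt(sigma2 * (1.0 / k + 1.0 / (n - k)))
        p = math.erfc(abs(z) / math.sqrt(2.0))  # two-sided normal p-value
        if p < best_p:
            best_p, best_thr = p, 0.5 * (xs[k - 1] + xs[k])
    if best_thr is None:
        return 1.0, None
    return min(1.0, n_tests * best_p), best_thr


def grow(x, y, delta=0.05):
    """Grow a one-covariate tree breadth-first, deterministically.

    ASSUMPTION: a node is split only if the running sum of accepted node
    p-values stays at or below the budget `delta`.
    """
    splits, p_sum = [], 0.0
    queue = [np.arange(len(y))]
    while queue:
        idx = queue.pop(0)
        p, thr = node_p_value(x[idx], y[idx])
        if thr is None or p_sum + p > delta:
            continue  # this node stays a leaf
        p_sum += p
        splits.append(thr)
        queue.append(idx[x[idx] <= thr])
        queue.append(idx[x[idx] > thr])
    return splits, p_sum


# Deterministic toy data: a unit step at x = 0.5 plus a small perturbation.
x = np.linspace(0.0, 1.0, 200)
y = np.where(x > 0.5, 1.0, 0.0) + 0.05 * np.sin(37 * np.arange(200))
splits, p_sum = grow(x, y, delta=0.05)
```

Because no folds are drawn, repeated calls on the same data return the same tree, and by construction the summed $p$-values of the accepted splits never exceed `delta`.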
Taxonomy
Topics: Polynomial and algebraic computation
