Classification Trees for Imbalanced and Sparse Data: Surface-to-Volume Regularization

Yichen Zhu; Cheng Li; David B. Dunson

arXiv:2004.12293 · stat.ME · June 15, 2021 · 1 cites

Open Access

TL;DR

This paper introduces SVR-Tree, a novel classification tree method that penalizes the Surface-to-Volume Ratio to improve performance on imbalanced and sparse data, with theoretical guarantees and empirical validation.

Contribution

The paper proposes a new SVR-Tree algorithm that regularizes decision boundaries for better generalization on limited data, and proves estimation consistency for SVR-Tree along with a rate of convergence for an idealized empirical risk minimizer.

Findings

01. SVR-Tree outperforms existing methods on real imbalanced datasets.

02. The approach achieves estimation consistency and favorable convergence rates.

03. The authors develop a simple, computationally efficient implementation and apply it in the experiments.

Abstract

Classification algorithms face difficulties when one or more classes have limited training data. We are particularly interested in classification trees, due to their interpretability and flexibility. When data are limited in one or more of the classes, the estimated decision boundaries are often irregularly shaped due to the limited sample size, leading to poor generalization error. We propose a novel approach that penalizes the Surface-to-Volume Ratio (SVR) of the decision set, obtaining a new class of SVR-Tree algorithms. We develop a simple and computationally efficient implementation while proving estimation consistency for SVR-Tree and rate of convergence for an idealized empirical risk minimizer of SVR-Tree. SVR-Tree is compared with multiple algorithms that are designed to deal with imbalance through real data applications.
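
To make the penalty concrete, here is a minimal sketch of what a surface-to-volume ratio measures, assuming the simplest possible decision set: a single axis-aligned box. The function names `box_svr` and `penalized_score` and the weight `lam` are hypothetical and chosen for illustration only. The paper's actual objective and its handling of unions of tree leaves are not reproduced here.

```python
from math import prod


def box_svr(sides):
    """Surface-to-volume ratio of an axis-aligned box with the given side lengths."""
    volume = prod(sides)
    # Each axis contributes two opposite faces, each of area volume / side.
    surface = 2 * sum(volume / s for s in sides)
    return surface / volume


def penalized_score(error_rate, sides, lam):
    """Hypothetical illustration only: an error rate plus lam times the box's SVR."""
    return error_rate + lam * box_svr(sides)


square = box_svr([1.0, 1.0])    # compact region with area 1
sliver = box_svr([0.1, 10.0])   # thin region with the same area
print(square, sliver)
```

Both regions have the same area, but the thin sliver has a ratio of about 20.2 against 4 for the square, so a penalty on this ratio would favor the more regular shape when the two fit the data equally well.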

Taxonomy

Topics: Imbalanced Data Classification Techniques · Machine Learning and Data Classification · Anomaly Detection Techniques and Applications

Methods: Interpretability