Classifying extremely imbalanced data sets
Markward Britsch (1), Nikolai Gagunashvili (2), Michael Schmelling (1), ((1) Max-Planck-Institut für Kernphysik, (2) University of Akureyri)

TL;DR
This paper evaluates a multivariate rule growing algorithm with bagging and instance weighting on highly imbalanced particle physics data, proposing methods to optimize training set size and compare classifiers effectively.
Contribution
It extends previous work by applying the technique to more imbalanced datasets and introduces strategies to improve classifier performance and manage large training sets.
Findings
Classifier quality depends strongly on the number of background instances used for training
Exploiting the size of the available background sample improves results significantly
Large training sets can be handled and reduced in size without loss of result quality
Abstract
Imbalanced data sets containing many more background than signal instances are very common in particle physics, and will also be characteristic of the upcoming analyses of LHC data. Following up the work presented at ACAT 2008, we use the multivariate technique presented there (a rule growing algorithm with the meta-methods bagging and instance weighting) on much more imbalanced data sets, especially a selection of D0 decays without the use of particle identification. It turns out that the quality of the result strongly depends on the number of background instances used for training. We discuss methods to exploit this in order to improve the results significantly, and how to handle and reduce the size of large training sets without loss of result quality in general. We will also comment on how to take into account statistical fluctuation in receiver operating characteristic curves…
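The two meta-methods named in the abstract, bagging and instance weighting, can be illustrated with a minimal sketch. This is not the authors' implementation: a one-dimensional threshold cut (a decision stump) stands in for the rule growing algorithm, the weighting scheme (equal total weight per class) is an assumption, and all function names and the Gaussian toy data are hypothetical.

```python
import random


def make_toy(n_bkg, n_sig, seed=0):
    """Hypothetical toy sample: background ~ N(0,1), signal ~ N(3,1)."""
    rng = random.Random(seed)
    xs = [rng.gauss(0.0, 1.0) for _ in range(n_bkg)]
    xs += [rng.gauss(3.0, 1.0) for _ in range(n_sig)]
    labels = [0] * n_bkg + [1] * n_sig
    return xs, labels


def class_balancing_weights(labels):
    """Assumed weighting scheme: each class carries a total weight of 0.5."""
    n_sig = sum(1 for y in labels if y == 1)
    n_bkg = len(labels) - n_sig
    return [0.5 / n_sig if y == 1 else 0.5 / n_bkg for y in labels]


def fit_stump(xs, labels, weights):
    """Find the cut minimising the weighted error of 'signal if x > cut'."""
    best_cut, best_err = None, float("inf")
    for cut in sorted(set(xs)):
        err = sum(
            w for x, y, w in zip(xs, labels, weights) if (x > cut) != (y == 1)
        )
        if err < best_err:
            best_cut, best_err = cut, err
    return best_cut


def bagged_stumps(xs, labels, n_bags=15, seed=1):
    """Bagging: fit one weighted stump per bootstrap resample."""
    rng = random.Random(seed)
    n = len(xs)
    cuts = []
    for _ in range(n_bags):
        idx = [rng.randrange(n) for _ in range(n)]
        bag_x = [xs[i] for i in idx]
        bag_y = [labels[i] for i in idx]
        if 0 not in bag_y or 1 not in bag_y:
            continue  # skip resamples that lost one class entirely
        cuts.append(fit_stump(bag_x, bag_y, class_balancing_weights(bag_y)))
    return cuts


def score(cuts, x):
    """Fraction of bagged stumps that vote 'signal' for x."""
    return sum(1 for c in cuts if x > c) / len(cuts)


if __name__ == "__main__":
    xs, labels = make_toy(n_bkg=500, n_sig=10)
    cuts = bagged_stumps(xs, labels)
    for x in (0.0, 1.5, 3.0):
        print(x, score(cuts, x))
```

Without the weights, a sample with 50 times more background than signal would push the cut toward labelling nearly everything as background; balancing the class weights keeps the rare signal instances influential, and averaging over bootstrap resamples smooths the dependence on which few signal instances each resample happens to contain.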
