Classifying extremely imbalanced data sets

Markward Britsch (1); Nikolai Gagunashvili (2); Michael Schmelling (1); ((1) Max-Planck-Institut für Kernphysik; (2) University of Akureyri)

arXiv:1011.6224 · physics.data-an · August 11, 2011


TL;DR

This paper evaluates a multivariate rule-growing algorithm, combined with bagging and instance weighting, on highly imbalanced particle physics data. It proposes ways to exploit and reduce large training sets and to account for statistical fluctuations when comparing classifiers.

Contribution

The paper extends the authors' ACAT 2008 work by applying the same technique to much more imbalanced data sets, and discusses strategies to improve classifier performance and to handle large training sets.

Findings

01

Classifier quality depends strongly on the number of background instances used for training

02

Exploiting the size of the background sample improves the results significantly

03

Large training sets can be reduced in size without loss of result quality

Abstract

Imbalanced data sets containing many more background than signal instances are very common in particle physics, and will also be characteristic of the upcoming analyses of LHC data. Following up the work presented at ACAT 2008, we use the multivariate technique presented there (a rule-growing algorithm with the meta-methods bagging and instance weighting) on much more imbalanced data sets, especially a selection of D⁰ decays without the use of particle identification. It turns out that the quality of the result strongly depends on the number of background instances used for training. We discuss methods to exploit this in order to improve the results significantly, and how to handle and reduce the size of large training sets without loss of result quality in general. We will also comment on how to take into account statistical fluctuations in receiver operating characteristic (ROC) curves…
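As a rough illustration of the last point, the sketch below attaches simple binomial standard errors to the points of a ROC curve. It is a generic textbook estimate, not necessarily the procedure the paper describes (the abstract is cut off before giving details). The function name `roc_points_with_errors`, the toy Gaussian scores, and the thresholds are all assumptions made for this example. The binomial formula also assumes that the evaluation sample is independent of the training sample, and it becomes unreliable when a rate is very close to 0 or 1.

```python
import math
import random


def roc_points_with_errors(signal_scores, background_scores, thresholds):
    """Return one (threshold, eff, eff_err, bkg, bkg_err) tuple per threshold.

    An instance counts as selected when its classifier score is >= threshold.
    eff is the fraction of signal selected, bkg the fraction of background
    selected. Errors are binomial standard errors sqrt(p * (1 - p) / n).
    """
    n_sig = len(signal_scores)
    n_bkg = len(background_scores)
    points = []
    for t in thresholds:
        eff = sum(s >= t for s in signal_scores) / n_sig
        bkg = sum(b >= t for b in background_scores) / n_bkg
        eff_err = math.sqrt(eff * (1.0 - eff) / n_sig)
        bkg_err = math.sqrt(bkg * (1.0 - bkg) / n_bkg)
        points.append((t, eff, eff_err, bkg, bkg_err))
    return points


if __name__ == "__main__":
    rng = random.Random(42)
    # Hypothetical toy scores: few signal instances, many background instances.
    signal = [rng.gauss(0.7, 0.15) for _ in range(200)]
    background = [rng.gauss(0.3, 0.15) for _ in range(20000)]
    for t, eff, eff_err, bkg, bkg_err in roc_points_with_errors(
        signal, background, [0.4, 0.5, 0.6]
    ):
        print(
            f"cut {t:.1f}: signal eff = {eff:.3f} +/- {eff_err:.3f}, "
            f"background rate = {bkg:.4f} +/- {bkg_err:.4f}"
        )
```

With a strongly imbalanced sample the uncertainty on the signal efficiency is typically much larger than that on the background rate, simply because far fewer signal instances enter the estimate. Two ROC curves that differ by less than these errors should not be ranked against each other.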
