Tradeoffs in Resampling and Filtering for Imbalanced Classification
Ryan Muther, David Smith

TL;DR
This paper investigates the tradeoffs in resampling and filtering methods for imbalanced token classification, highlighting how data selection impacts model performance and how base rates influence these effects.
Contribution
It provides an empirical analysis of how different data selection strategies affect performance in imbalanced NLP tasks, emphasizing the importance of filtering test data.
Findings
Filtering test data is as crucial as training data selection in highly imbalanced scenarios.
The base rate of the positive class affects the magnitude of performance tradeoffs.
Different methods of selecting training data trade off effectiveness against efficiency.
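The interaction between base rate and filtering in the findings above can be illustrated with some simple arithmetic. The sketch below is not the authors' method or code. It assumes a tagger with a fixed true-positive rate and false-positive rate, and a hypothetical first-pass filter described only by its recall and the fraction of negatives it lets through. All function names and numbers are illustrative assumptions.

```python
def precision(base_rate: float, tpr: float, fpr: float) -> float:
    """Precision of a classifier with fixed TPR/FPR at a given positive base rate."""
    true_pos = base_rate * tpr
    false_pos = (1.0 - base_rate) * fpr
    total = true_pos + false_pos
    return true_pos / total if total > 0 else 0.0


def kept_fraction(base_rate: float, filter_recall: float, filter_fpr: float) -> float:
    """Fraction of all examples that survive the (hypothetical) first-pass filter."""
    return base_rate * filter_recall + (1.0 - base_rate) * filter_fpr


def filtered_base_rate(base_rate: float, filter_recall: float, filter_fpr: float) -> float:
    """Positive base rate among the examples that survive the filter."""
    kept = kept_fraction(base_rate, filter_recall, filter_fpr)
    return base_rate * filter_recall / kept if kept > 0 else 0.0


if __name__ == "__main__":
    # Illustrative numbers only: a rare phenomenon and a reasonably good tagger.
    base_rate, tpr, fpr = 0.01, 0.90, 0.10
    filter_recall, filter_fpr = 0.95, 0.05

    before = precision(base_rate, tpr, fpr)
    new_rate = filtered_base_rate(base_rate, filter_recall, filter_fpr)
    after = precision(new_rate, tpr, fpr)

    print(f"precision without filtering: {before:.3f}")
    print(f"base rate after filtering:   {new_rate:.3f}")
    print(f"precision after filtering:   {after:.3f}")
    # Efficiency gain: the tagger only has to label the surviving examples.
    print(f"fraction of data labeled:    {kept_fraction(base_rate, filter_recall, filter_fpr):.3f}")
    # Effectiveness cost: positives dropped by the filter can never be recovered.
    print(f"end-to-end recall:           {filter_recall * tpr:.3f}")
```

Under these assumptions, filtering raises the effective base rate seen by the tagger, which raises precision and shrinks the amount of data to label, at the cost of capping end-to-end recall at the filter's recall. The lower the original base rate, the larger this shift.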
Abstract
Imbalanced classification problems are extremely common in natural language processing and are addressed using a variety of resampling and filtering techniques, which often involve deciding how to select training data or which test examples the model should label. We examine the tradeoffs in model performance involved in choosing training samples and in filtering training and test data in heavily imbalanced token classification tasks, and we examine the relationship between the magnitude of these tradeoffs and the base rate of the phenomenon of interest. In experiments on sequence tagging to detect rare phenomena in English and Arabic texts, we find that different methods of selecting training data bring tradeoffs in effectiveness and efficiency. We also see that in highly imbalanced cases, filtering test data using first-pass retrieval models is as important for model…
Taxonomy
Topics: Natural Language Processing Techniques · Imbalanced Data Classification Techniques · Text and Document Classification Technologies
Methods: Test · Balanced Selection
