Active learning for imbalanced data under cold start
Ricardo Barata, Miguel Leite, Ricardo Pacheco, Marco O. P. Sampaio, João Tiago Ascensão, Pedro Bizarro

TL;DR
This paper introduces an active learning approach tailored for highly imbalanced datasets in cold-start streaming scenarios, significantly reducing labeling effort while achieving high model performance.
Contribution
The paper proposes a novel Outlier-based Discriminative Active Learning (ODAL) method and a 3-stage sequence of labeling policies for imbalanced, cold-start streaming data.
Findings
Achieves up to 80% improvement over random sampling.
Reaches high performance with only 2-10% of labels.
Outperforms standard AL policies that lack the ODAL warm-up.
Abstract
Modern systems that rely on Machine Learning (ML) for predictive modelling may suffer from the cold-start problem: supervised models work well but, initially, there are no labels, which are costly or slow to obtain. This problem is even worse in imbalanced data scenarios, where labels of the positive class take longer to accumulate. We propose an Active Learning (AL) system for datasets with orders of magnitude of class imbalance, in a cold-start streaming scenario. We present a computationally efficient Outlier-based Discriminative AL approach (ODAL) and design a novel 3-stage sequence of AL labeling policies where ODAL is used as warm-up. Then, we perform empirical studies on four real-world datasets with various magnitudes of class imbalance. The results show that our method reaches a high-performance model more quickly than standard AL policies without ODAL warm-up. Its observed…
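The abstract does not spell out ODAL's internals or name the three stages, so the following is only an illustrative sketch of how a staged labeling policy with an outlier-based warm-up might be wired together. Everything specific here is an assumption: the stage order (random, then outlier-based warm-up, then uncertainty sampling), the switching thresholds `min_labels` and `min_positives`, and the use of distance to the nearest labeled point as a stand-in outlier score. None of these are taken from the paper.

```python
import numpy as np

def outlier_scores(X_pool, X_labeled):
    """Stand-in outlier score (assumption): distance to the nearest labeled point."""
    # Pairwise Euclidean distances, shape (n_pool, n_labeled).
    d = np.linalg.norm(X_pool[:, None, :] - X_labeled[None, :, :], axis=2)
    return d.min(axis=1)

def choose_stage(y_labeled, min_labels=10, min_positives=1):
    """Pick a stage from the labeled pool; both thresholds are hypothetical."""
    if len(y_labeled) < min_labels:
        return "random"
    if int(np.sum(y_labeled)) < min_positives:
        return "outlier_warmup"
    return "uncertainty"

def select_batch(X_pool, X_labeled, y_labeled, k, rng, predict_proba=None):
    """Return (stage, indices into X_pool) of the k instances to send for labeling."""
    stage = choose_stage(y_labeled)
    if stage == "random":
        # Cold start: nothing to learn from yet, so sample uniformly.
        idx = rng.choice(len(X_pool), size=k, replace=False)
    elif stage == "outlier_warmup":
        # Prefer pool instances that look least like what is already labeled.
        idx = np.argsort(-outlier_scores(X_pool, X_labeled))[:k]
    else:
        # Standard uncertainty sampling: positive-class probability closest to 0.5.
        p = predict_proba(X_pool)
        idx = np.argsort(np.abs(p - 0.5))[:k]
    return stage, idx

# Toy usage: ten labeled negatives at the origin, one far-away pool point.
rng = np.random.default_rng(0)
X_pool = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [0.0, 0.2]])
X_labeled = np.zeros((10, 2))
y_labeled = np.zeros(10, dtype=int)
stage, idx = select_batch(X_pool, X_labeled, y_labeled, k=1, rng=rng)
```

In this toy run no positive label has been seen yet, so the warm-up stage is active and it picks the pool point farthest from the labeled negatives. Once at least one positive is labeled, the sketch switches to uncertainty sampling, which needs a `predict_proba` callable returning positive-class probabilities for the pool.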
Taxonomy
Topics: Imbalanced Data Classification Techniques · Machine Learning and Algorithms · Data Stream Mining Techniques
