Focused PU learning from imbalanced data
Elias Zavitsanos, Georgios Paliouras

TL;DR
This paper introduces a focused PU learning method tailored for highly imbalanced datasets, improving detection of positive instances in real-world applications like fraud detection and disease gene identification.
Contribution
It presents a novel focused empirical risk estimator for PU learning that effectively handles class imbalance and hard-to-detect positives.
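To make the idea of an empirical risk estimator built from positive and unlabeled examples concrete, here is a minimal sketch. It is not the paper's estimator, whose exact form is not given on this page. It assumes the well-known non-negative PU risk, with a focal-style loss swapped in as one plausible way to emphasize hard examples. The names `focal_loss`, `nn_pu_risk`, `prior`, and `gamma` are hypothetical.

```python
import numpy as np

def focal_loss(scores, target, gamma=2.0):
    """Focal-style loss on raw classifier scores; target is +1 or -1.

    Assumption: this loss is an illustrative stand-in, not the paper's loss.
    """
    # Probability the classifier assigns to the target class.
    p_t = 1.0 / (1.0 + np.exp(-target * scores))
    p_t = np.clip(p_t, 1e-12, 1.0)
    # Down-weight easy examples (p_t close to 1) by (1 - p_t) ** gamma.
    return -((1.0 - p_t) ** gamma) * np.log(p_t)

def nn_pu_risk(scores_pos, scores_unl, prior, gamma=2.0):
    """Non-negative PU risk estimate from positive and unlabeled scores.

    prior is the assumed fraction of positives in the unlabeled data.
    """
    # Risk of labeled positives when treated as positive.
    r_p_pos = focal_loss(scores_pos, +1, gamma).mean()
    # Risk of labeled positives when treated as negative.
    r_p_neg = focal_loss(scores_pos, -1, gamma).mean()
    # Risk of unlabeled examples when treated as negative.
    r_u_neg = focal_loss(scores_unl, -1, gamma).mean()
    # Estimated negative-class risk, clipped at zero so it cannot go negative.
    neg_risk = r_u_neg - prior * r_p_neg
    return prior * r_p_pos + max(0.0, neg_risk)

# Toy scores: the unlabeled set hides one positive (score 3.0).
scores_pos = np.array([3.0, 2.0, 4.0])
scores_unl = np.array([-3.0, -2.0, -4.0, 3.0])
risk_good = nn_pu_risk(scores_pos, scores_unl, prior=0.25)
risk_bad = nn_pu_risk(-scores_pos, -scores_unl, prior=0.25)
```

In this toy setup a classifier that ranks positives high gets a lower risk than one with the scores flipped. With `gamma=0` the loss reduces to the ordinary logistic loss.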
Findings
Achieves state-of-the-art performance on imbalanced datasets under both the SCAR (selected completely at random) and SAR (selected at random) labeling mechanisms.
Demonstrates effectiveness in real-world financial misstatement detection.
Outperforms existing PU learning methods in challenging scenarios.
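The two labeling mechanisms named in the findings can be illustrated with a small simulation. Under SCAR, every positive has the same chance of being labeled. Under SAR, that chance depends on the instance's features. This is only a sketch on synthetic data: the class ratio, the labeling frequency `c`, and the sigmoid propensity on the first feature are assumptions made for the example, not settings from the paper.

```python
import numpy as np

def label_scar(y, c, rng):
    """SCAR: each positive is labeled with the same probability c."""
    return (y == 1) & (rng.random(y.shape[0]) < c)

def label_sar(X, y, rng):
    """SAR: the labeling probability of a positive depends on its features."""
    # Hypothetical propensity: positives with a larger first feature
    # are more likely to be labeled.
    propensity = 1.0 / (1.0 + np.exp(-X[:, 0]))
    return (y == 1) & (rng.random(y.shape[0]) < propensity)

rng = np.random.default_rng(0)
n = 10_000
# Imbalanced synthetic data: roughly 5% positives.
y = (rng.random(n) < 0.05).astype(int)
# Positives are shifted by +1 in both features.
X = rng.normal(size=(n, 2)) + y[:, None]

s_scar = label_scar(y, 0.3, rng)  # boolean mask of labeled positives
s_sar = label_sar(X, y, rng)

# Everything not labeled (hidden positives plus all negatives) is "unlabeled".
unlabeled_scar = ~s_scar
unlabeled_sar = ~s_sar
```

In both cases no negative is ever labeled. Under SAR the labeled positives are a biased sample of all positives, which is what makes that setting harder than SCAR.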
Abstract
We propose a new method for learning from positive and unlabeled (PU) examples in highly imbalanced datasets. Many real-world problems, such as disease gene identification, targeted marketing, fraud detection, and recommender systems, are hard to address with machine learning methods because labeled data are limited. Training data often comprise positive and unlabeled instances, with the unlabeled set typically dominated by negative instances but also including several positive ones. While PU learning is well studied, few methods address imbalanced settings or hard-to-detect positive examples that resemble negative ones. Our approach uses a focused empirical risk estimator that incorporates both positive and unlabeled examples to train binary classifiers. Empirical evaluations demonstrate state-of-the-art performance on imbalanced datasets under two labeling mechanisms - selecting positives completely…
