FPR Estimation for Fraud Detection in the Presence of Class-Conditional Label Noise
Justin Tittelfitz

TL;DR
This paper addresses the challenge of accurately estimating false and true positive rates in binary classification models when validation data labels are noisy, especially in fraud detection scenarios with asymmetric label noise.
Contribution
It highlights the limitations of existing cleaning-based methods for FPR/TPR estimation and emphasizes the need for approaches that reduce correlation between cleaning errors and model scores.
Findings
Using the model to clean its own validation data leads to underestimates of FPR/TPR, even when the total cleaning error is low.
Existing methods minimize the total error of the cleaning process, which on its own does not guarantee accurate FPR/TPR estimates.
De-correlating cleaning errors from model scores is crucial for reliable FPR/TPR estimation.
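The first finding can be illustrated with a small synthetic simulation. This is a minimal sketch and not the paper's experiment: the Gaussian score distributions, the 5% fraud rate, the 20% rate at which fraud is mislabeled as legitimate, both thresholds, and the function name simulate_fpr are all assumptions chosen for illustration. It compares the FPR computed from true labels, from noisy labels, and from labels "cleaned" by relabeling every high-scoring example as fraud.

```python
import numpy as np


def simulate_fpr(n=200_000, fraud_rate=0.05, miss_rate=0.2,
                 decision_threshold=2.0, cleaning_threshold=2.5, seed=0):
    """Compare FPR estimates under true, noisy, and self-cleaned labels.

    All distributions and parameters are hypothetical, for illustration only.
    """
    rng = np.random.default_rng(seed)

    # True labels: True = fraud, False = legitimate.
    y_true = rng.random(n) < fraud_rate

    # Model scores: legitimate ~ N(0, 1), fraud ~ N(3, 1) (assumed).
    scores = rng.normal(loc=np.where(y_true, 3.0, 0.0), scale=1.0)

    # Asymmetric noise: some fraud is recorded as legitimate, never the reverse.
    flipped = y_true & (rng.random(n) < miss_rate)
    y_noisy = y_true & ~flipped

    # Examples the model would flag at the operating threshold.
    flagged = scores > decision_threshold

    def fpr(labels):
        # Fraction of examples labeled legitimate that the model flags.
        return float(flagged[~labels].mean())

    # Self-cleaning: relabel anything the model scores very highly as fraud.
    # This also removes genuine false positives from the negative class.
    y_cleaned = y_noisy | (scores > cleaning_threshold)

    return {
        "true": fpr(y_true),
        "noisy": fpr(y_noisy),
        "self_cleaned": fpr(y_cleaned),
    }


estimates = simulate_fpr()
for name, value in estimates.items():
    print(f"FPR ({name} labels): {value:.4f}")
```

In this setup the noisy labels overstate the FPR, because mislabeled fraud scores highly, while self-cleaning understates it, because the cleaning step removes exactly the legitimate examples that the model scores highest. The cleaning errors are correlated with the model score, which is the effect the third finding warns against.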
Abstract
We consider the problem of estimating the false-/true-positive rate (FPR/TPR) for a binary classification model when there are incorrect labels (label noise) in the validation set. Our motivating application is fraud prevention, where accurate estimates of FPR are critical to preserving the experience for good customers, and where label noise is highly asymmetric. Existing methods seek to minimize the total error in the cleaning process: to avoid cleaning examples that are not noise, and to ensure cleaning of examples that are. This is an important measure of accuracy but insufficient to guarantee good estimates of the true FPR or TPR for a model, and we show that using the model to directly clean its own validation data leads to underestimates even if total error is low. This indicates a need for researchers to pursue methods that not only reduce total error but also seek to…
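For reference, the quantities in the abstract can be written out as follows. The notation is an assumption introduced here, not taken from the paper: s(X) is the model score, t the decision threshold, Y the true label (1 = fraud), and \tilde{Y} the observed, possibly noisy label. The last line follows from the law of total probability, assuming class-conditional noise (the observed label depends on X only through Y).

```latex
\begin{align*}
\mathrm{FPR}(t) &= P\big(s(X) > t \mid Y = 0\big), \qquad
\mathrm{TPR}(t) = P\big(s(X) > t \mid Y = 1\big) \\
\widetilde{\mathrm{FPR}}(t) &= P\big(s(X) > t \mid \tilde{Y} = 0\big) \\
&= (1 - \alpha)\,\mathrm{FPR}(t) + \alpha\,\mathrm{TPR}(t),
\qquad \alpha = P\big(Y = 1 \mid \tilde{Y} = 0\big)
\end{align*}
```

Under these assumptions the estimate from uncleaned labels mixes the true FPR with the TPR. For a model with TPR(t) > FPR(t) it is therefore biased upward whenever \alpha > 0, which is why some form of cleaning or correction is needed in the first place.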
Taxonomy
Topics: Imbalanced Data Classification Techniques · Machine Learning and Data Classification · Advanced Statistical Process Monitoring
