Accurate, Data-Efficient Learning from Noisy, Choice-Based Labels for Inherent Risk Scoring

W. Ronny Huang; Miguel A. Perez

arXiv:1811.10791·cs.LG·December 2, 2018·1 cites



TL;DR

This paper introduces choice-based labeling combined with synthetic data collection for inherent risk scoring, improving data efficiency and labeling consistency in anti-money laundering applications.

Contribution

It proposes a new paradigm combining choice-based labeling with synthetic data generation to address data scarcity and inconsistency in risk scoring models.

Findings

01 Achieved 89% accuracy on test data

02 Attained 93% ROC-AUC

03 Demonstrated effectiveness with a small synthetic dataset

Abstract

Inherent risk scoring is an important function in anti-money laundering, used for determining the riskiness of an individual during onboarding, before fraudulent transactions occur. It is, however, often fraught with two challenges: (1) inconsistent notions among experts of what constitutes high or low risk and (2) the lack of labeled data. This paper explores a new paradigm of data labeling and data collection to tackle these issues. The data labeling is choice-based: the expert does not provide an absolute risk score but merely chooses the most or least risky example out of a small choice set, which reduces inconsistency because experts make only relative judgments of risk. The data collection is synthetic: examples are crafted using optimal experimental design methods, obviating the need for real data, which is often difficult to obtain due to regulatory concerns. We present…
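As a rough illustration of how choice-based labels can yield a risk score, the sketch below fits a linear scoring function to "most risky in the set" choices using a softmax (multinomial logit) likelihood. This is an assumption made for illustration: the abstract as shown here does not state which model the authors use. The names `fit_choice_model` and `simulate`, the feature dimensions, and the simulated expert are all hypothetical, and the random choice sets stand in for the optimally designed ones the abstract describes.

```python
import numpy as np

def fit_choice_model(choice_sets, chosen, n_features, lr=0.5, n_steps=500):
    """Fit linear risk weights from 'most risky' choices (hypothetical helper).

    choice_sets: array of shape (n_sets, set_size, n_features)
    chosen:      array of shape (n_sets,), index of the example picked per set
    Assumes a softmax choice model: P(pick i) is proportional to exp(w . x_i).
    """
    X = np.asarray(choice_sets, dtype=float)
    y = np.asarray(chosen)
    rows = np.arange(len(y))
    w = np.zeros(n_features)
    for _ in range(n_steps):
        scores = X @ w                                  # (n_sets, set_size)
        scores -= scores.max(axis=1, keepdims=True)     # numerical stability
        p = np.exp(scores)
        p /= p.sum(axis=1, keepdims=True)
        # Gradient of the mean log-likelihood: chosen features minus
        # the model's expected features within each choice set.
        expected = np.einsum("sk,skf->sf", p, X)
        grad = (X[rows, y] - expected).mean(axis=0)
        w += lr * grad                                  # gradient ascent step
    return w

def simulate(n_sets=400, set_size=4, seed=0):
    """Simulate a noisy expert with made-up weights; not data from the paper."""
    rng = np.random.default_rng(seed)
    true_w = np.array([2.0, -1.0, 0.5])                 # illustrative only
    X = rng.normal(size=(n_sets, set_size, true_w.size))
    # Gumbel noise makes the argmax follow a softmax choice rule.
    utilities = X @ true_w + rng.gumbel(size=(n_sets, set_size))
    chosen = utilities.argmax(axis=1)
    return true_w, fit_choice_model(X, chosen, true_w.size)

if __name__ == "__main__":
    true_w, w = simulate()
    print("true weights:     ", true_w)
    print("estimated weights:", np.round(w, 2))
```

Once fitted, `X_new @ w` gives a relative risk score for new examples. Only the ordering of scores is meaningful under this model, which matches the relative nature of the expert's judgments.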


Taxonomy

Topics: Imbalanced Data Classification Techniques · Crime, Illicit Activities, and Governance · Data Stream Mining Techniques