Intentional Control of Type I Error over Unconscious Data Distortion: a Neyman-Pearson Approach to Text Classification

Lucy Xia; Richard Zhao; Yanhui Wu; Xin Tong

arXiv:1802.02558·stat.ME·September 17, 2020

TL;DR

This paper introduces a Neyman-Pearson classification approach to control Type I error in text classification tasks affected by data distortion, ensuring reliable relevance detection despite censorship and data asymmetry.

Contribution

The paper applies the Neyman-Pearson (NP) paradigm to text classification, giving a method that keeps type I error below a user-specified level under data distortion, supported by a theoretical result and an empirical case study.
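
For reference, the NP paradigm can be written as a constrained optimization problem. The notation below (a classifier \phi, error rates R_0 and R_1, a level \alpha, and class priors \pi_0, \pi_1) is the usual textbook notation and is an assumption here, since this page does not define symbols. Class 0 denotes the relevant class, matching the abstract's definition of type I error.

```latex
\[
\phi^*_\alpha \in \operatorname*{arg\,min}_{\phi \,:\, R_0(\phi) \le \alpha} R_1(\phi),
\qquad
R_0(\phi) = \mathbb{P}\bigl(\phi(X) = 1 \mid Y = 0\bigr),
\quad
R_1(\phi) = \mathbb{P}\bigl(\phi(X) = 0 \mid Y = 1\bigr).
\]
% Classical paradigm, for comparison: minimize the overall risk
\[
R(\phi) = \pi_0 R_0(\phi) + \pi_1 R_1(\phi),
\qquad
\pi_y = \mathbb{P}(Y = y).
\]
```

Both R_0 and R_1 are defined through the class-conditional distributions only, whereas the overall risk R also involves the class priors. A distortion that changes the class proportions but leaves the class-conditional distributions intact therefore moves the classical oracle but not the NP oracle, which is the sense of the invariance claim in the abstract.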

Findings

01. NP classifier controls type I error under distortion

02. Method remains effective despite distributional differences

03. Applicable to various data distortion scenarios

Abstract

This paper addresses the challenges in classifying textual data obtained from open online platforms, which are vulnerable to distortion. Most existing classification methods minimize the overall classification error and may yield an undesirably large type I error (relevant textual messages are classified as irrelevant), particularly when available data exhibit an asymmetry between relevant and irrelevant information. Data distortion exacerbates this situation and often leads to fallacious prediction. To deal with inestimable data distortion, we propose the use of the Neyman-Pearson (NP) classification paradigm, which minimizes type II error under a user-specified type I error constraint. Theoretically, we show that the NP oracle is unaffected by data distortion when the class conditional distributions remain the same. Empirically, we study a case of classifying posts about worker…
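
One common way to enforce a type I error constraint in practice is to threshold a classification score at an order statistic of held-out class 0 scores, in the style of the NP umbrella algorithm. The sketch below illustrates that idea under stated assumptions: the function names, the tolerance `delta`, and the Gaussian scores are all hypothetical, and this page does not confirm that the paper or the linked repository uses exactly this procedure. Class 0 is the relevant class, and a larger score means "more likely irrelevant".

```python
import math
import random


def np_threshold_rank(n, alpha, delta):
    """Smallest 1-indexed rank k such that thresholding at the k-th smallest
    of n held-out class-0 scores gives P(type I error > alpha) <= delta.
    Returns None when n is too small for the requested (alpha, delta)."""
    for k in range(1, n + 1):
        # Binomial tail bound on the probability of violating the constraint.
        tail = sum(
            math.comb(n, j) * (1 - alpha) ** j * alpha ** (n - j)
            for j in range(k, n + 1)
        )
        if tail <= delta:
            return k
    return None


def np_threshold(class0_scores, alpha, delta):
    """Score threshold chosen from held-out class-0 scores only."""
    k = np_threshold_rank(len(class0_scores), alpha, delta)
    if k is None:
        raise ValueError("too few class-0 scores for this (alpha, delta)")
    return sorted(class0_scores)[k - 1]


def type_errors(threshold, class0_scores, class1_scores):
    """Empirical type I and type II errors when predicting 1 iff score > threshold."""
    type1 = sum(s > threshold for s in class0_scores) / len(class0_scores)
    type2 = sum(s <= threshold for s in class1_scores) / len(class1_scores)
    return type1, type2


if __name__ == "__main__":
    rng = random.Random(0)
    # Hypothetical scores: class 0 centered at 0, class 1 centered at 2.
    held_out0 = [rng.gauss(0.0, 1.0) for _ in range(500)]
    t = np_threshold(held_out0, alpha=0.1, delta=0.05)

    test0 = [rng.gauss(0.0, 1.0) for _ in range(5000)]
    test1 = [rng.gauss(2.0, 1.0) for _ in range(5000)]
    print("threshold:", t)
    print("(type I, type II):", type_errors(t, test0, test1))
```

The threshold is computed from class 0 scores alone, so removing or adding class 1 observations does not change it. That is a small-scale illustration of why the type I error constraint is insensitive to shifts in class proportions.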

Code & Models

Repositories

ZhaoRichard/nproc (Official)
