TL;DR
This paper introduces a Neyman-Pearson classification approach to control Type I error in text classification tasks affected by data distortion, ensuring reliable relevance detection despite censorship and data asymmetry.
Contribution
It applies the Neyman-Pearson paradigm to text classification, providing a method that controls Type I error under data distortion, with theoretical guarantees and practical validation.
Findings
NP classifier controls Type I error under distortion
Method remains effective despite distributional differences
Applicable to various data distortion scenarios
Abstract
This paper addresses the challenges in classifying textual data obtained from open online platforms, which are vulnerable to distortion. Most existing classification methods minimize the overall classification error and may yield an undesirably large type I error (relevant textual messages are classified as irrelevant), particularly when the available data exhibit an asymmetry between relevant and irrelevant information. Data distortion exacerbates this situation and often leads to fallacious predictions. To deal with inestimable data distortion, we propose the use of the Neyman-Pearson (NP) classification paradigm, which minimizes type II error under a user-specified type I error constraint. Theoretically, we show that the NP oracle is unaffected by data distortion when the class-conditional distributions remain the same. Empirically, we study a case of classifying posts about worker…
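To make the type I error constraint concrete, here is a minimal Python sketch of one common way to pick an NP decision threshold from held-out scores. It is an illustration under stated assumptions, not the paper's implementation: the abstract does not say how the threshold is chosen, so the sketch swaps in the order-statistic rule known from the NP umbrella algorithm literature. The names np_threshold_rank and np_classifier, the violation tolerance delta, and the convention that class 0 is the relevant class with higher scores pointing to class 1 (irrelevant) are all hypothetical choices made for this example.

```python
import math


def np_threshold_rank(n, alpha, delta):
    """Return the smallest 1-indexed rank k such that thresholding at the
    k-th smallest of n held-out class-0 scores keeps
    P(type I error > alpha) <= delta, or None if n is too small.

    Assumed rule (order-statistic bound, not taken from the source):
    the violation probability at rank k is at most
    sum_{j=k}^{n} C(n, j) (1 - alpha)^j alpha^(n - j).
    Intended for moderate n (up to roughly a thousand scores).
    """
    tail = 0.0
    best = None
    # The tail sum grows as k decreases, so walk down from k = n and stop
    # as soon as the bound exceeds delta.
    for k in range(n, 0, -1):
        tail += math.comb(n, k) * (1 - alpha) ** k * alpha ** (n - k)
        if tail > delta:
            break
        best = k
    return best


def np_classifier(class0_scores, alpha=0.05, delta=0.05):
    """Build a classifier from held-out class-0 (relevant) scores only.

    Because the threshold depends only on class-0 scores, changing the
    mix of relevant and irrelevant posts does not move it, as long as
    the class-conditional score distributions stay the same.
    """
    k = np_threshold_rank(len(class0_scores), alpha, delta)
    if k is None:
        raise ValueError("too few class-0 scores for this alpha and delta")
    threshold = sorted(class0_scores)[k - 1]
    # Predict 1 (irrelevant) only when the score exceeds the threshold.
    return lambda score: int(score > threshold)
```

A possible usage: score a held-out set of known relevant posts with any base classifier, pass those scores to np_classifier, and apply the returned function to new scores. With alpha = 0.05 and delta = 0.05, this rule needs at least 59 held-out class-0 scores, since (0.95)^59 is the first such power to drop below 0.05.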
