Simpson's Bias in NLP Training
Fei Yuan, Longtu Zhang, Huang Bojun, Yaobo Liang

TL;DR
This paper investigates Simpson's bias in NLP training, revealing that common sample-level loss functions can be inconsistent with true population metrics, leading to sub-optimal model performance.
Contribution
It provides a systematic theoretical and experimental analysis of Simpson's bias in NLP, highlighting its impact on model training and evaluation.
Findings
Popular sample-level loss functions may not align with the population-level metrics used for evaluation.
Models optimized with such losses can be substantially sub-optimal on the population-level metric.
The paper connects Simpson's bias with classical statistical paradoxes.
Abstract
In most machine learning tasks, we evaluate a model M on a given data population S by measuring a population-level metric F(S; M). Examples of such evaluation metrics F include precision/recall for (binary) recognition, the F1 score for multi-class classification, and the BLEU metric for language generation. On the other hand, the model M is trained by optimizing a sample-level loss G(S_t; M) at each learning step t, where S_t is a subset of S (a.k.a. the mini-batch). Popular choices of G include cross-entropy loss, the Dice loss, and sentence-level BLEU scores. A fundamental assumption behind this paradigm is that the mean value of the sample-level loss G, if averaged over all possible samples, should effectively represent the population-level metric F of the task, that is, E[G(S_t; M)] ≈ F(S; M). In this paper, we systematically investigate…
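The gap between a sample-level quantity and a population-level metric can be seen in a toy binary-recognition example. The sketch below is not taken from the paper: the data, the two hypothetical models (`MODEL_A`, `MODEL_B`), the helper names, and the convention for empty batches are all assumptions made for illustration. It treats F1 as a score (higher is better) and averages it over one fixed partition into two mini-batches, rather than over all possible samples.

```python
from fractions import Fraction


def f1(gold, pred):
    """F1 score for binary labels, as an exact fraction."""
    tp = sum(g == 1 and p == 1 for g, p in zip(gold, pred))
    fp = sum(g == 0 and p == 1 for g, p in zip(gold, pred))
    fn = sum(g == 1 and p == 0 for g, p in zip(gold, pred))
    denom = 2 * tp + fp + fn
    # Assumed convention: a batch with no gold or predicted positives scores 1.
    return Fraction(2 * tp, denom) if denom else Fraction(1)


def population_f1(gold_batches, pred_batches):
    """Population-level metric: pool every example, then compute F1 once."""
    gold = [g for batch in gold_batches for g in batch]
    pred = [p for batch in pred_batches for p in batch]
    return f1(gold, pred)


def mean_batch_f1(gold_batches, pred_batches):
    """Sample-level quantity: compute F1 per mini-batch, then average."""
    scores = [f1(g, p) for g, p in zip(gold_batches, pred_batches)]
    return sum(scores, Fraction(0)) / len(scores)


# Toy population of 8 examples, split into two mini-batches of 4 (hypothetical).
GOLD = [[1, 1, 0, 0], [1, 0, 0, 0]]
MODEL_A = [[1, 1, 1, 0], [0, 0, 0, 0]]  # hypothetical predictions
MODEL_B = [[1, 0, 0, 0], [1, 1, 1, 1]]  # hypothetical predictions

for name, pred in (("A", MODEL_A), ("B", MODEL_B)):
    print(name, "mean batch F1:", mean_batch_f1(GOLD, pred),
          "| population F1:", population_f1(GOLD, pred))
```

In this toy setup model B has the higher mean batch F1 (8/15 versus 2/5) while model A has the higher population F1 (2/3 versus 1/2), so selecting a model by the batch-averaged score would pick the one that is worse on the population-level metric.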
Taxonomy
Topics: Machine Learning and Data Classification · Imbalanced Data Classification Techniques · Explainable Artificial Intelligence (XAI)
