TL;DR
This paper investigates survivorship bias in MS MARCO, showing that many of the queries discarded as unanswerable are in fact valid questions. Discarding them distorts the query distribution and evaluation scores, which in turn affects model training and assessment.
Contribution
It identifies and analyzes survivorship bias in MS MARCO, demonstrating its effects on dataset quality, evaluation, and model performance, and provides insights for more complete annotation.
Findings
Discarded queries include many valid questions.
Survivorship bias distorts query distribution and evaluation scores.
Models trained on biased subsets perform worse on complete datasets.
Abstract
Survivorship bias is the tendency to concentrate on the positive outcomes of a selection process and overlook the results that generate negative outcomes. We observe that this bias could be present in the popular MS MARCO dataset, given that annotators could not find answers to 38–45% of the queries, leading to these queries being discarded in training and evaluation processes. Although we find that some discarded queries in MS MARCO are ill-defined or otherwise unanswerable, many are valid questions that could be answered had the collection been annotated more completely (around two thirds using modern ranking techniques). This survivability problem distorts the MS MARCO collection in several ways. We find that it affects the natural distribution of queries in terms of the type of information needed. When used for evaluation, we find that the bias likely yields a significant…
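To give a sense of scale, the two figures quoted in the abstract can be combined with simple arithmetic. The sketch below is illustrative only and is not taken from the paper: the function name `valid_but_discarded_share` is hypothetical, and "around two thirds" is assumed to mean exactly 2/3 of the discarded queries.

```python
def valid_but_discarded_share(discard_rate: float, answerable_fraction: float) -> float:
    """Share of ALL queries that were discarded yet could have been answered.

    discard_rate: fraction of queries for which annotators found no answer.
    answerable_fraction: fraction of those discarded queries that are
        answerable with more complete annotation.
    """
    if not (0.0 <= discard_rate <= 1.0 and 0.0 <= answerable_fraction <= 1.0):
        raise ValueError("both arguments must be fractions in [0, 1]")
    return discard_rate * answerable_fraction


# Figures quoted in the abstract: 38-45% discarded, "around two thirds" answerable.
ANSWERABLE_FRACTION = 2 / 3  # assumption: treat "around two thirds" as exactly 2/3

if __name__ == "__main__":
    for rate in (0.38, 0.45):
        share = valid_but_discarded_share(rate, ANSWERABLE_FRACTION)
        print(f"discard rate {rate:.0%}: about {share:.0%} of all queries are valid but discarded")
```

Under these assumptions, roughly a quarter to 30% of all queries would be valid questions that are missing from training and evaluation.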
