On Survivorship Bias in MS MARCO

Prashansa Gupta; Sean MacAvaney

arXiv:2204.12852·cs.IR·April 28, 2022

TL;DR

This paper investigates survivorship bias in MS MARCO. Many of the queries discarded because annotators could not find an answer are in fact valid questions, and excluding them distorts the dataset's query distribution and evaluation scores, which affects both model training and assessment.

Contribution

It identifies and analyzes survivorship bias in MS MARCO, demonstrating its effects on dataset quality, evaluation, and model performance, and provides insights for more complete annotation.

Findings

01

Discarded queries include many valid questions.

02

Survivorship bias distorts query distribution and evaluation scores.

03

Models trained on biased subsets perform worse on complete datasets.

Abstract

Survivorship bias is the tendency to concentrate on the positive outcomes of a selection process and overlook the results that generate negative outcomes. We observe that this bias could be present in the popular MS MARCO dataset, given that annotators could not find answers to 38–45% of the queries, leading to these queries being discarded in training and evaluation processes. Although we find that some discarded queries in MS MARCO are ill-defined or otherwise unanswerable, many are valid questions that could be answered had the collection been annotated more completely (around two thirds using modern ranking techniques). This survivability problem distorts the MS MARCO collection in several ways. We find that it affects the natural distribution of queries in terms of the type of information needed. When used for evaluation, we find that the bias likely yields a significant…

Code & Models

Repositories

seanmacavaney/sigir2022-survivor (Official)
