MS MARCO: A Human Generated MAchine Reading COmprehension Dataset

Payal Bajaj; Daniel Campos; Nick Craswell; Li Deng; Jianfeng Gao; Xiaodong Liu; Rangan Majumder; Andrew McNamara; Bhaskar Mitra; Tri Nguyen; Mir Rosenberg; Xia Song; Alina Stoica; Saurabh Tiwary; Tong Wang

arXiv:1611.09268·cs.CL·November 1, 2018

TL;DR

MS MARCO is a large-scale, real-world dataset derived from Bing search queries, designed to benchmark machine reading comprehension and question-answering models across multiple tasks.

Contribution

The paper introduces MS MARCO, a novel large-scale dataset with real user questions and web passages, enabling diverse machine reading comprehension tasks.

Findings

01 Dataset contains over 1 million questions and 8.8 million passages.

02 Supports three distinct tasks: answerability prediction, answer generation, passage ranking.

03 Facilitates benchmarking of MRC and QA models on real-world data.

Abstract

We introduce a large-scale MAchine Reading COmprehension dataset, which we name MS MARCO. The dataset comprises 1,010,916 anonymized questions, sampled from Bing's search query logs, each with a human-generated answer, and 182,669 completely human-rewritten generated answers. In addition, the dataset contains 8,841,823 passages, extracted from 3,563,535 web documents retrieved by Bing, that provide the information necessary for curating the natural language answers. A question in the MS MARCO dataset may have multiple answers or no answers at all. Using this dataset, we propose three different tasks with varying levels of difficulty: (i) predict if a question is answerable given a set of context passages, and extract and synthesize the answer as a human would; (ii) generate a well-formed answer (if possible) based on the context passages that can be understood with the question and…
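To give a rough sense of scale, the counts quoted in the abstract can be turned into a few aggregate ratios. The sketch below is plain arithmetic on those four numbers; the variable names are illustrative, and the results are averages over the whole collection only. The abstract does not say how passages are distributed across individual questions or documents, so these ratios should not be read as per-question guarantees.

```python
# Counts taken verbatim from the abstract.
NUM_QUESTIONS = 1_010_916
NUM_REWRITTEN_ANSWERS = 182_669
NUM_PASSAGES = 8_841_823
NUM_DOCUMENTS = 3_563_535


def ratio(numerator: int, denominator: int) -> float:
    """Return numerator / denominator as a float."""
    return numerator / denominator


# Aggregate ratios (illustrative names, not terms used in the paper).
passages_per_question = ratio(NUM_PASSAGES, NUM_QUESTIONS)
passages_per_document = ratio(NUM_PASSAGES, NUM_DOCUMENTS)
rewritten_per_question = ratio(NUM_REWRITTEN_ANSWERS, NUM_QUESTIONS)

print(f"passages per question (aggregate): {passages_per_question:.2f}")
print(f"passages per document (aggregate): {passages_per_document:.2f}")
print(f"rewritten answers per question (aggregate): {rewritten_per_question:.3f}")
```

Read this way, the collection holds somewhat under nine passages per question and roughly two and a half passages per source document on average, and the rewritten answers correspond to a bit under one fifth of the question count.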
