Who did What: A Large-Scale Person-Centered Cloze Dataset
Takeshi Onishi, Hai Wang, Mohit Bansal, Kevin Gimpel, David McAllester

TL;DR
This paper introduces the 'Who-did-What' dataset, a large-scale, challenging reading comprehension dataset based on newswire articles, designed to advance person-centered NLP tasks and benchmark system performance.
Contribution
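The paper presents a dataset in which each problem is built from two independent news articles rather than from an article and its summary. Person names are left un-anonymized, and a fraction of the problems that simple baselines solve easily is filtered out, while 84% of the remaining problems stay solvable by humans. The result is proposed as a new benchmark for NLP research.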
Findings
Standard systems perform significantly below human levels.
84% of questions are solvable by humans.
The dataset offers a challenging benchmark for future models.
Abstract
We have constructed a new "Who-did-What" dataset of over 200,000 fill-in-the-gap (cloze) multiple choice reading comprehension problems constructed from the LDC English Gigaword newswire corpus. The WDW dataset has a variety of novel features. First, in contrast with the CNN and Daily Mail datasets (Hermann et al., 2015) we avoid using article summaries for question formation. Instead, each problem is formed from two independent articles --- an article given as the passage to be read and a separate article on the same events used to form the question. Second, we avoid anonymization --- each choice is a person named entity. Third, the problems have been filtered to remove a fraction that are easily solved by simple baselines, while remaining 84% solvable by humans. We report performance benchmarks of standard systems and propose the WDW dataset as a challenge task for the community.
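The problem format described in the abstract can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the field names, the `XXX` blank token, the toy examples, and the choice of a "most frequent name in the passage" baseline are not taken from the paper's released data or code. The filter also drops every problem the baseline solves, whereas the abstract says only a fraction of such problems is removed.

```python
from dataclasses import dataclass
from typing import List

BLANK = "XXX"  # hypothetical placeholder marking the removed person name


@dataclass
class ClozeProblem:
    passage: str        # article to be read
    question: str       # sentence from a separate article, with one name blanked
    choices: List[str]  # candidate answers, each a person named entity
    answer: str         # the correct choice


def most_frequent_choice(problem: ClozeProblem) -> str:
    """Illustrative baseline: pick the choice mentioned most often in the passage.

    Ties go to the choice listed first, because max() keeps the first maximum.
    """
    return max(problem.choices, key=lambda name: problem.passage.count(name))


def filter_easy(problems: List[ClozeProblem]) -> List[ClozeProblem]:
    """Keep only problems the baseline gets wrong (a simplification of the paper's filtering)."""
    return [p for p in problems if most_frequent_choice(p) != p.answer]


# Toy examples, invented for this sketch.
hard = ClozeProblem(
    passage="Alice Smith met Bob Jones on Monday. Alice Smith later praised the agreement.",
    question=f"{BLANK} signed the agreement with Alice Smith.",
    choices=["Alice Smith", "Bob Jones"],
    answer="Bob Jones",
)
easy = ClozeProblem(
    passage="Carol White won the race. Carol White thanked Dan Green.",
    question=f"{BLANK} won the race on Sunday.",
    choices=["Carol White", "Dan Green"],
    answer="Carol White",
)

kept = filter_easy([hard, easy])
```

In this toy run the baseline answers `easy` correctly, so only `hard` survives the filter. That mirrors the idea of biasing the dataset toward problems that frequency cues alone do not solve.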
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Multimodal Machine Learning Applications
