Self-Teaching Machines to Read and Comprehend with Large-Scale Multi-Subject Question-Answering Data
Dian Yu, Kai Sun, Dong Yu, Claire Cardie

TL;DR
This paper introduces a large-scale multi-subject question-answering dataset and a self-teaching method that leverages weakly-labeled data to significantly improve machine reading comprehension performance.
Contribution
It presents ExamQA, a new multi-subject QA dataset, and a self-teaching paradigm that effectively utilizes noisy web snippets to enhance MRC models.
Findings
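+5.1% accuracy on C^3, a multiple-choice MRC dataset, over state-of-the-art MRC baselines
+3.8% exact match on CMRC 2018, an extractive MRC dataset, over state-of-the-art MRC baselines
Demonstrates that large-scale QA data benefits MRC tasks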
Abstract
In spite of much recent research in the area, it is still unclear whether subject-area question-answering data is useful for machine reading comprehension (MRC) tasks. In this paper, we investigate this question. We collect a large-scale multi-subject multiple-choice question-answering dataset, ExamQA, and use incomplete and noisy snippets returned by a web search engine as the relevant context for each question-answering instance to convert it into a weakly-labeled MRC instance. We then propose a self-teaching paradigm to better use the generated weakly-labeled MRC instances to improve a target MRC task. Experimental results show that we can obtain +5.1% in accuracy on a multiple-choice MRC dataset, C^3, and +3.8% in exact match on an extractive MRC dataset, CMRC 2018, over state-of-the-art MRC baselines, demonstrating the effectiveness of our framework and the usefulness of large-scale…
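The conversion step described in the abstract, pairing each question-answering instance with web search snippets to form a weakly-labeled MRC instance, might look roughly like the sketch below. This is an illustration only, not the authors' code: the class names, the `to_weak_mrc` helper, the `max_chars` limit, and the sample question are all hypothetical, and the snippets are passed in as plain strings rather than fetched from a search engine.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class QAInstance:
    """A multiple-choice question with no accompanying passage (hypothetical schema)."""
    question: str
    options: List[str]
    answer_index: int


@dataclass
class WeakMRCInstance:
    """A QA instance paired with retrieved text that serves as its context."""
    context: str
    question: str
    options: List[str]
    answer_index: int


def to_weak_mrc(qa: QAInstance, snippets: List[str], max_chars: int = 500) -> WeakMRCInstance:
    """Join non-empty snippets into one context string, truncated to max_chars.

    Snippets may be incomplete or irrelevant, so the answer label is only
    weakly supported by the resulting context.
    """
    cleaned = [s.strip() for s in snippets if s.strip()]
    context = " ".join(cleaned)[:max_chars]
    return WeakMRCInstance(context, qa.question, qa.options, qa.answer_index)


# Made-up example data, assumed for illustration.
qa = QAInstance(
    question="Which organ pumps blood through the body?",
    options=["Liver", "Heart", "Lung", "Kidney"],
    answer_index=1,
)
snippets = ["The heart is a muscular organ that pumps blood ...", "  ", "Blood vessels carry ..."]
example = to_weak_mrc(qa, snippets)
print(example.context)
```

The sketch covers only instance construction; the self-teaching paradigm that consumes these instances is not detailed in the abstract and is deliberately left out.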
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Multimodal Machine Learning Applications
