CCQA: A New Web-Scale Question Answering Dataset for Model Pre-Training
Patrick Huber, Armen Aghajanyan, Barlas Oğuz, Dmytro Okhonko, Wen-tau Yih, Sonal Gupta, Xilun Chen

TL;DR
This paper introduces CCQA, a large-scale multilingual question-answering dataset from Common Crawl, demonstrating its effectiveness for improving open-domain QA models through in-domain pre-training.
Contribution
The paper presents a novel, large-scale QA dataset extracted from Common Crawl. It enables in-domain pre-training for ODQA models at a scale that was not previously possible.
Findings
Pre-training on CCQA improves zero-shot QA performance.
CCQA enhances low-resource QA tasks.
Models pre-trained on CCQA outperform baselines on multiple benchmarks.
Abstract
With the rise of large-scale pre-trained language models, open-domain question-answering (ODQA) has become an important research topic in NLP. Based on the popular pre-training fine-tuning approach, we posit that an additional in-domain pre-training stage using a large-scale, natural, and diverse question-answering (QA) dataset can be beneficial for ODQA. Consequently, we propose a novel QA dataset based on the Common Crawl project in this paper. Using the readily available schema.org annotation, we extract around 130 million multilingual question-answer pairs, including about 60 million English data-points. With this previously unseen number of natural QA pairs, we pre-train popular language models to show the potential of large-scale in-domain pre-training for the task of question-answering. In our experiments, we find that pre-training question-answering models on our Common Crawl…
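The abstract says the question-answer pairs are extracted from Common Crawl pages using schema.org annotation. The sketch below illustrates that general idea on a single page; it is not the authors' pipeline. It assumes the markup is embedded as JSON-LD in a `<script type="application/ld+json">` block, whereas real pages may instead use microdata attributes, which this sketch does not handle. The names `JsonLdCollector`, `extract_qa_pairs`, and `SAMPLE_HTML` are hypothetical and chosen for illustration only.

```python
import json
from html.parser import HTMLParser


class JsonLdCollector(HTMLParser):
    """Collects the raw text of <script type="application/ld+json"> blocks."""

    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self._buffer = []
        self.blocks = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_jsonld = True
            self._buffer = []

    def handle_data(self, data):
        # Script content may arrive in several chunks, so buffer it.
        if self._in_jsonld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self.blocks.append("".join(self._buffer))
            self._in_jsonld = False


def _as_list(value):
    """Normalize a JSON-LD value that may be missing, single, or a list."""
    if value is None:
        return []
    return value if isinstance(value, list) else [value]


def _find_questions(node):
    """Yield every dict whose @type is (or includes) Question, at any depth."""
    if isinstance(node, dict):
        if "Question" in _as_list(node.get("@type")):
            yield node
        for child in node.values():
            yield from _find_questions(child)
    elif isinstance(node, list):
        for child in node:
            yield from _find_questions(child)


def extract_qa_pairs(html):
    """Return (question, answer) string pairs found in JSON-LD markup."""
    collector = JsonLdCollector()
    collector.feed(html)
    pairs = []
    for raw in collector.blocks:
        try:
            data = json.loads(raw)
        except json.JSONDecodeError:
            continue  # skip malformed annotation blocks
        for question in _find_questions(data):
            q_text = str(question.get("name") or question.get("text") or "").strip()
            answers = _as_list(question.get("acceptedAnswer"))
            answers += _as_list(question.get("suggestedAnswer"))
            for answer in answers:
                if isinstance(answer, dict) and answer.get("text"):
                    pairs.append((q_text, str(answer["text"]).strip()))
    return pairs


# Hypothetical page used only to exercise the sketch.
SAMPLE_HTML = """
<html><head>
<script type="application/ld+json">
{"@context": "https://schema.org", "@type": "QAPage",
 "mainEntity": {"@type": "Question", "name": "What is Common Crawl?",
   "acceptedAnswer": {"@type": "Answer",
                      "text": "An open repository of web crawl data."}}}
</script>
</head><body></body></html>
"""

if __name__ == "__main__":
    for q, a in extract_qa_pairs(SAMPLE_HTML):
        print(q, "->", a)
```

A web-scale version would also need deduplication, language identification, and text cleaning. Those steps are outside what the abstract describes, so they are left out here.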
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Multimodal Machine Learning Applications
