The Web Is Your Oyster - Knowledge-Intensive NLP against a Very Large Web Corpus
Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Dmytro Okhonko, Samuel Broscheit, Gautier Izacard, Patrick Lewis, Barlas Oğuz, Edouard Grave, Wen-tau Yih, Sebastian Riedel

TL;DR
This paper explores knowledge-intensive NLP in an open web environment using a large web snapshot called Sphere, demonstrating that retrieval from Sphere can outperform Wikipedia-based models on several tasks despite challenges of scale and quality.
Contribution
The paper introduces a new evaluation setup in which existing knowledge-intensive tasks use Sphere, a web-scale subset of CCNet, as the knowledge source instead of Wikipedia, and shows that retrieval from it can match or improve on Wikipedia-based approaches.
Findings
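Retrieval from Sphere lets a state-of-the-art system match or outperform Wikipedia-based models on several tasks.
A dense index can outperform a sparse BM25 baseline on Wikipedia, but not yet on Sphere.
The authors share their indices, evaluation metrics, and infrastructure to support further research in open-domain NLP.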
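For readers unfamiliar with the sparse versus dense distinction in the findings above, the following is a minimal sketch, not the paper's implementation. It scores a toy corpus with a textbook Okapi BM25 formula (sparse, term-matching) and with an inner product over embeddings (dense). The parameter values `k1=1.2` and `b=0.75` are common defaults assumed here, and the two-dimensional vectors are made up for illustration rather than produced by a trained encoder.

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Score each tokenized document against a tokenized query with Okapi BM25."""
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: number of documents that contain each term.
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query:
            if term not in tf:
                continue  # terms absent from the document contribute nothing
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores

def dense_scores(query_vec, doc_vecs):
    """Score documents by inner product with the query embedding."""
    return [sum(q * x for q, x in zip(query_vec, v)) for v in doc_vecs]

# Toy corpus (hypothetical passages, whitespace-tokenized).
docs = [
    "sphere is a web scale corpus".split(),
    "wikipedia is a common background corpus".split(),
    "oysters are bivalve molluscs".split(),
]
sparse = bm25_scores("web corpus".split(), docs)

# Made-up 2-d embeddings standing in for a learned dense encoder.
dense = dense_scores([1.0, 0.0], [[0.9, 0.1], [0.4, 0.6], [0.0, 1.0]])
```

In both cases the passages would be ranked by descending score. BM25 only rewards exact term overlap, while the dense score depends entirely on the quality of the encoder that produced the vectors.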
Abstract
In order to address increasing demands of real-world applications, the research for knowledge-intensive NLP (KI-NLP) should advance by capturing the challenges of a truly open-domain environment: web-scale knowledge, lack of structure, inconsistent quality and noise. To this end, we propose a new setup for evaluating existing knowledge intensive tasks in which we generalize the background corpus to a universal web snapshot. We investigate a slate of NLP tasks which rely on knowledge - either factual or common sense, and ask systems to use a subset of CCNet - the Sphere corpus - as a knowledge source. In contrast to Wikipedia, otherwise a common background corpus in KI-NLP, Sphere is orders of magnitude larger and better reflects the full diversity of knowledge on the web. Despite potential gaps in coverage, challenges of scale, lack of structure and lower quality, we find that retrieval…
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Wikis in Education and Collaboration
Methods: Criss-Cross Network
