WebBrain: Learning to Generate Factually Correct Articles for Queries by Grounding on Large Web Corpus
Hongjing Qian, Yutao Zhu, Zhicheng Dou, Haoqi Gu, Xinyu Zhang, Zheng Liu, Ruofei Lai, Zhao Cao, Jian-Yun Nie, Ji-Rong Wen

TL;DR
This paper introduces WebBrain, a new NLP task for generating factual short articles with references from web evidence, supported by a large-scale dataset and a novel framework that improves factual accuracy.
Contribution
The paper defines the WebBrain task, builds the large-scale WebBrain-Raw dataset, and proposes a new framework, ReGen, for generating factually correct articles grounded on web evidence.
Findings
ReGen outperforms baselines in automatic evaluations.
WebBrain-Raw is ten times larger than the previous biggest peer dataset.
Enhanced evidence retrieval improves factual correctness.
Abstract
In this paper, we introduce a new NLP task -- generating short factual articles with references for queries by mining supporting evidence from the Web. In this task, called WebBrain, the ultimate goal is to generate a fluent, informative, and factually-correct short article (e.g., a Wikipedia article) for a factual query unseen in Wikipedia. To enable experiments on WebBrain, we construct a large-scale dataset WebBrain-Raw by extracting English Wikipedia articles and their crawlable Wikipedia references. WebBrain-Raw is ten times larger than the previous biggest peer dataset, which can greatly benefit the research community. From WebBrain-Raw, we construct two task-specific datasets: WebBrain-R and WebBrain-G, which are used to train in-domain retriever and generator, respectively. Besides, we empirically analyze the performances of the current state-of-the-art NLP techniques on…
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Wikis in Education and Collaboration
