Knowledge Graph Based Synthetic Corpus Generation for Knowledge-Enhanced Language Model Pre-training
Oshin Agarwal, Heming Ge, Siamak Shakeri, Rami Al-Rfou

TL;DR
This paper proposes converting a large-scale knowledge graph into natural language text so that it can be used in language model pre-training. The authors report improved factual accuracy on knowledge-intensive tasks and reduced toxicity in the resulting model.
Contribution
It verbalizes the entire English Wikidata knowledge graph, rather than a domain-specific benchmark, so that the resulting text can be integrated into existing language models without a specialized architecture.
Findings
Significant improvements in open-domain QA accuracy
Enhanced performance on LAMA knowledge probe
Reduced toxicity in language model outputs
Abstract
Prior work on Data-To-Text Generation, the task of converting knowledge graph (KG) triples into natural text, focused on domain-specific benchmark datasets. In this paper, however, we verbalize the entire English Wikidata KG, and discuss the unique challenges associated with a broad, open-domain, large-scale verbalization. We further show that verbalizing a comprehensive, encyclopedic KG like Wikidata can be used to integrate structured KGs and natural language corpora. In contrast to the many architectures that have been developed to integrate these two sources, our approach converts the KG into natural text, allowing it to be seamlessly integrated into existing language models. It carries the further advantages of improved factual accuracy and reduced toxicity in the resulting language model. We evaluate this approach by augmenting the retrieval corpus in a retrieval language model…
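To make "verbalizing" KG triples concrete, here is a minimal sketch that groups triples by subject and renders each group as one line of text. It uses a naive fixed template, which is an assumption for illustration only and not the generation method used in the paper. The sample triples and the function names (`group_by_subject`, `verbalize`, `build_corpus`) are hypothetical.

```python
from collections import defaultdict

# Hypothetical toy triples (subject, relation, object); not taken from the paper.
TRIPLES = [
    ("Marie Curie", "occupation", "physicist"),
    ("Marie Curie", "place of birth", "Warsaw"),
    ("Warsaw", "country", "Poland"),
]


def group_by_subject(triples):
    """Collect the (relation, object) pairs that share a subject entity."""
    grouped = defaultdict(list)
    for subj, rel, obj in triples:
        grouped[subj].append((rel, obj))
    return dict(grouped)


def verbalize(subject, facts):
    """Render one subject's facts as text with a naive template (illustrative only)."""
    clauses = [f"{rel} is {obj}" for rel, obj in facts]
    return f"{subject}: " + "; ".join(clauses) + "."


def build_corpus(triples):
    """Produce one text line per subject, in first-seen order."""
    return [verbalize(s, facts) for s, facts in group_by_subject(triples).items()]


corpus = build_corpus(TRIPLES)
for line in corpus:
    print(line)
```

The resulting lines are ordinary text, so they could in principle be appended to a retrieval or pre-training corpus like any other document. A fixed template like this one produces stilted text, which gives one reason to use a learned generator for open-domain, large-scale verbalization.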
Taxonomy
Methods: Softmax · Tanh Activation · Low-Rank Factorization-based Multi-Head Attention
