Knowledge Graph Based Synthetic Corpus Generation for Knowledge-Enhanced Language Model Pre-training

Oshin Agarwal; Heming Ge; Siamak Shakeri; Rami Al-Rfou

arXiv:2010.12688·cs.CL·March 16, 2021

TL;DR

This paper proposes converting a large-scale knowledge graph into natural language text so that it can be integrated into language model pre-training. The stated benefits are improved factual accuracy on knowledge-intensive tasks and reduced toxicity in the resulting model.

Contribution

It introduces a method to verbalize the entire English Wikidata knowledge graph as natural text, so that the graph can be integrated into existing language models rather than requiring a dedicated architecture.

Findings

01 Significant improvements in open-domain QA accuracy

02 Enhanced performance on the LAMA knowledge probe

03 Reduced toxicity in language model outputs

Abstract

Prior work on Data-To-Text Generation, the task of converting knowledge graph (KG) triples into natural text, focused on domain-specific benchmark datasets. In this paper, however, we verbalize the entire English Wikidata KG, and discuss the unique challenges associated with a broad, open-domain, large-scale verbalization. We further show that verbalizing a comprehensive, encyclopedic KG like Wikidata can be used to integrate structured KGs and natural language corpora. In contrast to the many architectures that have been developed to integrate these two sources, our approach converts the KG into natural text, allowing it to be seamlessly integrated into existing language models. It carries the further advantages of improved factual accuracy and reduced toxicity in the resulting language model. We evaluate this approach by augmenting the retrieval corpus in a retrieval language model…
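
To make the task concrete, the sketch below shows what "converting KG triples into natural text" can look like in its simplest form. It is a toy, template-based illustration and not the method used in the paper; the triples, the `TEMPLATES` table, the `FALLBACK` pattern, and the `verbalize` function are all hypothetical names introduced here for illustration.

```python
from collections import defaultdict

# Hypothetical (subject, relation, object) triples; not taken from the paper.
TRIPLES = [
    ("Marie Curie", "occupation", "physicist"),
    ("Marie Curie", "place of birth", "Warsaw"),
    ("Warsaw", "country", "Poland"),
]

# Hypothetical hand-written templates, one per relation (an assumption of this sketch).
TEMPLATES = {
    "occupation": "{s} worked as a {o}.",
    "place of birth": "{s} was born in {o}.",
    "country": "{s} is located in {o}.",
}

# Generic pattern used when a relation has no dedicated template.
FALLBACK = "The {r} of {s} is {o}."


def verbalize(triples):
    """Group triples by subject and render each group as a short passage."""
    grouped = defaultdict(list)
    for s, r, o in triples:
        template = TEMPLATES.get(r, FALLBACK)
        grouped[s].append(template.format(s=s, r=r, o=o))
    # One text passage per subject entity, suitable for adding to a text corpus.
    return {s: " ".join(sentences) for s, sentences in grouped.items()}


corpus = verbalize(TRIPLES)
for entity, passage in corpus.items():
    print(f"{entity}: {passage}")
```

Hand-written templates like these do not scale to an open-domain graph with many relations, which is one way to see why verbalizing all of Wikidata raises challenges that domain-specific benchmarks do not.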

Code & Models

Repositories

google-research-datasets/KELM-corpus (Official)

Taxonomy

Methods: Softmax · Tanh Activation · Low-Rank Factorization-based Multi-Head Attention