Assessing The Factual Accuracy of Generated Text

Ben Goodrich; Vinay Rao; Mohammad Saleh; Peter J Liu

arXiv:1905.13322·cs.CL·May 27, 2021

Assessing The Factual Accuracy of Generated Text

Ben Goodrich, Vinay Rao, Mohammad Saleh, Peter J Liu

PDF

2 Repos

TL;DR

This paper introduces a model-based metric for assessing the factual accuracy of generated text, supported by a new dataset and models that outperform traditional scoring methods like ROUGE and BLEU.

Contribution

It presents a novel factual accuracy metric, a large-scale dataset for training relation classifiers, and end-to-end fact extraction models for improved evaluation.

Findings

01

Model-based metric outperforms ROUGE and BLEU in factual accuracy assessment.

02

New dataset enables training of relation classifiers and fact extraction models.

03

Human evaluation confirms the effectiveness of the proposed metric.

Abstract

We propose a model-based metric to estimate the factual accuracy of generated text that is complementary to typical scoring schemes like ROUGE (Recall-Oriented Understudy for Gisting Evaluation) and BLEU (Bilingual Evaluation Understudy). We introduce and release a new large-scale dataset based on Wikipedia and Wikidata to train relation classifiers and end-to-end fact extraction models. The end-to-end models are shown to be able to extract complete sets of facts from datasets with full pages of text. We then analyse multiple models that estimate factual accuracy on a Wikipedia text summarization task, and show their efficacy compared to ROUGE and other model-free variants by conducting a human evaluation study.

Figures4

Click any figure to enlarge with its caption.

Tables9

Table 1. Table 1 . Example of factual inaccuracy noted in a summarization model (Liu et al . , 2018 ) . In this example, the summarization model uses the subject (Peter Duryea)’s father, Dan Duryea’s birthdate.

Target	Peter Duryea (July 14, 1939 – March 24, 2013) was an American actor. He is best known for appearing in a pilot episode of Star Trek: The Original Series, “The Cage” (1964), most of which was reused in “The Menagerie” (1966), as Lieutenant Tyler. His father, Dan Duryea (1907 – 1968), was also an actor.
Output	Peter Duryea (April 23, 1907 – March 24, 2013) was an American actor. He is best known for his role as Lt. Jose Tyler in the original Star Trek pilot, “The Cage”

Table 2. Table 2 . Factual accuracy predicted by different metrics on synthesized samples. Binary Classifier is described in Sec 4.3 , Classifier in Sec 4.1 and E2E is the end-to-end model described in Sec 4.2 . E2E-Reduced* is a model where sentences where no entities are detected are filtered out from the input text. The Expected Accuracy** is calculated as the ratio of number of corrupted facts to the total number of facts in the article.

Model	Accuracy
ROUGE-1	97.08
ROUGE-2	94.06
ROUGE-L	96.02
OpenIE	87.26
Binary Relation Classifier	46.75
Relation Classifier	59.30
E2E	65.44
E2E-Reduced*	57.10
Expected Accuracy**	30.97

Table 3. Table 3 . Performance (precision(P), recall(R), F1) of models on our proposed dataset grouped by classifiers and then end-to-end models. Binary Classifier is described in Sec 4.3 , Classifier in Sec 4.1 and E2E is the end-to-end model described in Sec 4.2 . The Binary Classifier* only considers the existence of a relation, and might not be directly comparable to the other models’ performance. E2E-Reduced** is a model where sentences where no entities are detected are filtered out from the input text. The best model is marked in bold. We consider precision(P) as the measure that matches best with the definition of f a c t a c c 𝑓 𝑎 𝑐 subscript 𝑡 𝑎 𝑐 𝑐 fact_{acc}

Model	P	R	F1
Binary Classifier*	59.60	75.13	66.47
Relation Classifier	63.49	68.64	65.96
E2E	71.67	56.21	63.01
E2E-Reduced**	72.16	61.03	66.13

Table 4. Table 4 . Precision (P), Recall (R) and F1 measure of the relation classifier (Section 4.1 ) on our test sets on ten most frequent relations.

Relation	P	R	F1
No relation	0.9830	0.9817	0.9824
Country of citizenship	0.6446	0.9394	0.7646
Date of birth	0.9330	0.9850	0.9582
Country	0.6049	0.9484	0.7386
Located in territory	0.6260	0.8118	0.7069
Instance of	0.5097	0.7015	0.5904
Place of birth	0.6430	0.7436	0.6897
Member of sports team	0.5179	0.9248	0.6640
Occupation	0.5934	0.7770	0.6729
Date of death	0.9163	0.9875	0.9506

Table 5. Table 5 . Precision (P), Recall (R) and F1 measure of our end-to-end model (Section 4.2 ) on our test sets on ten most frequent relations.

Relation	P	R	F1
Country of citizenship	0.8247	0.8359	0.8302
Instance of	0.7212	0.6676	0.6934
Date of birth	0.9342	0.9798	0.9564
Country	0.8387	0.8267	0.8327
Cast member	0.5889	0.4910	0.5355
Place of birth	0.7012	0.7348	0.7176
Located in the administrative territorial entity	0.7293	0.7700	0.7491
Member of sports team	0.7045	0.7027	0.7036
Occupation	0.5911	0.5774	0.5842
Educated at	0.5432	0.7278	0.6221

Table 6. Table 6 . Percentage of true facts that were inaccurately labelled wrong by the distant supervisor. The End-to-end model is the best model from Section 4.2 and Classifier is the best from 4.1 (Transformer-Sigmoid). The End-to-end model (in bold) predicts facts that are likelier to be true.

Model	% True-positives
End-to-end	77.8
Relation Classifier	46.6

Table 7. Table 7 . Spearman correlation of different metrics with human evaluation of factual accuracy on the ‘Actors’ subset of summaries. ROUGE and OpenIE are described in Sec 5 , and the model-based f a c t a c c 𝑓 𝑎 𝑐 subscript 𝑡 𝑎 𝑐 𝑐 fact_{acc} metrics are described in Sec 4 . The best metric is shown in bold.

Metric	Correlation with human scores
ROUGE-1	0.583
ROUGE-2	0.639
ROUGE-L	0.634
OpenIE	0.258
$f a c t_{a c c}$ -Binary Classifier	0.596
$f a c t_{a c c}$ -Relation Classifier	0.523
$f a c t_{a c c}$ -E2E	0.645
$f a c t_{a c c}$ -E2E-Reduced	0.668

Table 8. Table 8 . Spearman correlation of different metrics with human evaluation of factual accuracy on a random subset of summaries. ROUGE and OpenIE are described in Sec 5 , and the model-based f a c t a c c 𝑓 𝑎 𝑐 subscript 𝑡 𝑎 𝑐 𝑐 fact_{acc} metrics are described in Sec 4 . The best metric is shown in bold.

Metric	Correlation with human scores
ROUGE-1	0.384
ROUGE-2	0.435
ROUGE-L	0.339
OpenIE	0.128
$f a c t_{a c c}$ -Binary Classifier	0.200
$f a c t_{a c c}$ -Relation Classifier	0.250
$f a c t_{a c c}$ -E2E	0.314
$f a c t_{a c c}$ -E2E-Reduced	0.453

Table 9. Table 9 . Comparison of fact tuples extracted from this example, using OpenIE, our end-to-end model (Section 4.2 ), and our classifier (Section 4.1 ). A triplet consists of ( s u b j e c t , r e l a t i o n , o b j e c t ) 𝑠 𝑢 𝑏 𝑗 𝑒 𝑐 𝑡 𝑟 𝑒 𝑙 𝑎 𝑡 𝑖 𝑜 𝑛 𝑜 𝑏 𝑗 𝑒 𝑐 𝑡 (subject,relation,object) .

Input	Christopher Simon (born 5 June 1963) is an Australian actor and producer. Born in Sydney, Australia. He produced the film Miss You Already directed by Catherine Hardwicke. Simon is also a producer of such films as The Sweeney (2012 film) directed by Nick Love, Pusher, I, Anna, Still Life, Me and Me Dad, Boogie Woogie, The Proposition, Beyond the Ocean, The Trouble with Men and Women. He also produced short films by Joe Wright such as The End and Nick Love’s Love Story. Simon’s various television acting roles include Eddie in The Long Firm, Pedro in Gimme Gimme Gimme, Michael Hassan in The Bill, Lee Andersen in Casualty, Abdel in Lovejoy Samir in Ultimate Force, Da Souza in Lynda La Plante’s Supply and Demand, Nathan Morgan in Wire In The Blood and he appeared in Lenny Henry in Pieces. Film acting roles include Room To Rent, The Delivery and O Jerusalem. Simon has acted in such plays as 12 Angry Men and Taking Sides both directed by Harold Pinter in London’s west end, The Kitchen directed by Stephen Daldry at the Royal Court, the Amnesty award winning one man show When The Bulbull Stopped Singing for which he was nominated for the Acting Excellence Award (Best Actor) at the Edinburgh Festival Fringe, which premiered at the Traverse theatre and toured to Iran, New York and Jordan. Other theatre roles include Welcome to Ramallah, which toured York and London, at the Arcola and the Theatre Royal York, The Present at the Royal Court and the Bush, and Poor Superman at the Hampstead and the Traverse.
Targets	(Christopher Simon, date of birth, June 5 1963), (Christopher Simon, country of citizenship, Australian), (Christopher Simon, place of birth, Sydney)
OpenIE	(Abdel, is in, Ultimate Force), (Casualty, Abdel in, Ultimate Force), (Nathan Morgan, is In, Blood), (Lee Andersen, is in, Casualty), (Da Souza, is in, Lynda La Plante ’s Supply), (Simon ’s various television acting roles, include, Eddie), (Simon, is producer of, films as Sweeney directed by Nick Love), (Simon, is also producer of, such films as Sweeney), (Simon ’s television roles, include, Eddie in Firm), (Simon, is producer of, such films), (Simon ’s various television roles, include, Eddie), (Eddie, is in, Long Firm), (Simon, is producer of, such films as Sweeney), (Simon, is producer of, such films as Sweeney directed by Nick Love), (Michael Hassan, is in, Bill), (Bill, Andersen in, Casualty), (You, Already directed by, Catherine Hardwicke), (Simon, is producer of, films), (Simon, has, various television acting roles), (Simon ’s television acting roles, include, Eddie), (Abdel, is in, Lovejoy Samir), (Simon ’s television roles, include, Eddie in Long Firm), (Simon ’s television acting roles, include, Eddie in Firm), (Simon, is producer of, films as Sweeney directed), (Simon ’s various television acting roles, include, Eddie in Firm), (Simon, is also producer of, films as Sweeney), (Simon ’s various television acting roles, include, Eddie in Long Firm), (Simon, is, producer), (Rent, To Room is, Delivery), (Simon ’s television roles, include, Eddie), (Simon, is also producer of, films as Sweeney directed), (Lynda La Plante, in, Supply), (Pedro, is in, Gim), …
Seq2Seq	(Christopher Simon, date of birth, June 5 1963), (Christopher Simon, country of citizenship, Australian), (Christopher Simon, place of birth, Sydney), (Christopher Simon, occupation, Actor)
Classifier	(Christopher Simon, date of birth, June 5 1963), (Christopher Simon, country of citizenship, Australian)

Equations15

F_{T}

F_{T}

F_{G}

F_{T^{'}}

F_{T^{'}}

F_{G^{'}}

f a c t_{a cc}

f a c t_{a cc}

w_{1 : n}

w_{1 : n}

h_{1 : n}

h_{i}

p (r^{k})

r e l = {1 : e_{i} and e_{j} are related 0 : otherwise

r e l = {1 : e_{i} and e_{j} are related 0 : otherwise

Peer Reviews

No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.

Code & Models

Repositories

Videos

No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.

Full text

Assessing The Factual Accuracy of Generated Text

Ben Goodrich

[email protected]

,

Vinay Rao

[email protected]

,

Peter J. Liu

[email protected]

and

Mohammad Saleh

[email protected]

Google Brain

(2019)

Abstract.

We propose a model-based metric to estimate the factual accuracy of generated text that is complementary to typical scoring schemes like ROUGE (Recall-Oriented Understudy for Gisting Evaluation) and BLEU (Bilingual Evaluation Understudy). We introduce and release a new large-scale dataset based on Wikipedia and Wikidata to train relation classifiers and end-to-end fact extraction models. The end-to-end models are shown to be able to extract complete sets of facts from datasets with full pages of text. We then analyse multiple models that estimate factual accuracy on a Wikipedia text summarization task, and show their efficacy compared to ROUGE and other model-free variants by conducting a human evaluation study.

datasets, neural networks, fact extraction, deep learning, metric, end-to-end

††journalyear: 2019††copyright: rightsretained††conference: The 25th ACM SIGKDD Conference on Knowledge Discovery and Data Mining; August 4–8, 2019; Anchorage, AK, USA††booktitle: The 25th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD ’19), August 4–8, 2019, Anchorage, AK, USA††doi: 10.1145/3292500.3330955††isbn: 978-1-4503-6201-6/19/08††submissionid: rt0615p

1. Introduction

Recently, there has been wide empirical success in text summarization (Rush et al., 2015; Nallapati et al., 2016; Liu et al., 2018), machine translation (Bahdanau et al., 2015; Wu et al., 2016; Vaswani et al., 2017), dialogue response generation (Li et al., 2017; Serban et al., 2017b; Serban et al., 2017a), and other text generation tasks. For evaluation, these models generally rely on metrics like ROUGE (Recall-Oriented Understudy for Gisting Evaluation) (Lin, 2004), BLEU (Bilingual Evaluation Understudy) (Papineni et al., 2002) and perplexity (Brown et al., 1992) that measure locally constrained n-gram overlap. In this paper, we propose an automatic metric for evaluating the factual accuracy of generated text.

A fact f is defined to be a relation tuple (subject, relation, object), where subject has a binary relation to object and can be assumed to have been inferred from text or a knowledge base, e.g. Barack Hussein Obama II (born August 4, 1961) is an American politician who served as the 44th President of the United States from January 20, 2009 to January 20, 2017 implies a set of facts such as (Barack Obama, president of, United States), (Barack Obama, born on, August 4 1961).

In this paper, we limit our scope to the task of evaluating text summarization. To evaluate a text summarization model, we compare the ground-truth summary text, T and the generated summary, G. Let $f_{t},f_{g}\in F$ , and $F_{T},F_{G}\subset F$ where $F$ is a set of relation tuples.

[TABLE]

The models used in the metric we propose do not make use of world knowledge (e.g. knowledge base) during inference, and to account for that we filter $F_{T}$ and $F_{G}$ by only considering claims made in $G$ that can either be verified or refuted by statements in $T$ . Concretely, if $f_{t}=(subj_{t},rel_{t},obj_{t})\in F_{T}$ and $f_{g}=(subj_{g},rel_{g},obj_{g})\in F_{G}$

[TABLE]

We can then define factual accuracy $fact_{acc}$ as the $precision$ between $F_{T^{\prime}}$ and $F_{G^{\prime}}$ .

[TABLE]

For example, consider ground-truth summary T: Brad Pitt was born in 1963 and generated summary G: Brad Pitt was born in 1961. Then, $F_{T}$ = {(Brad Pitt, born-in, 1963)}, $F_{G}$ = {(Brad Pitt, born-in, 1961)}. The metric $fact_{acc}$ = 0 indicates there is no factual consistency between the two summaries, whereas another metric like ROUGE-1 (1-gram overlap) measures 0.83. A real example is highlighted in Table 1 where the summarization model commits such a mistake. It is important to be able to measure these mistakes accurately to aid in training factually accurate summarization models.

Extracting fact tuples from text has been previously studied in methods like OpenIE (Open Information Extraction) (Banko et al., 2007). OpenIE extracts triplets with an unspecified schema, and the relation is usually the text linking the two entities. However, it does not leverage information from a knowledge base and leads to outputs that are hard to compare. For example, Person was born in that town $\Rightarrow$ (Person, born in, town). But That town is the birthplace of Person $\Rightarrow$ (Town, is the birthplace of, Person).

We standardize comparison by studying structured approaches to relation tuple extraction where the schema is fixed. We compare two approaches for fact extraction. One is a two-step process that first involves recognizing all the named entities in a sentence, and then classifying the relation for every pair of entities in the sentence (Sorokin and Gurevych, 2017; Lin et al., 2016). Our other approach is to use an end-to-end model with a Transformer-based architecture (Vaswani et al., 2017) that is trained to output structured fact tuples. These models are described in Section 4. We create a new dataset for fact extraction using distant supervision (Mintz et al., 2009) on Wikipedia text by cross-referencing facts from the Wikidata knowledge base (Vrandečić and Krötzsch, 2014). To the best of our knowledge, this dataset is bigger and contains more relations and domains than previously used datasets for relation or fact tuple extraction.

Our main contributions are:

(1)

We introduce model-based metrics to analyze the factual accuracy of generated text (Sec 4). We compare them against model-free metrics listed in Sec 5. 2. (2)

To train fact tuple extraction models, we release code (as part of the Tensor2Tensor111https://github.com/tensorflow/tensor2tensor framework along with the model weights222https://github.com/tensorflow/tensor2tensor/tree/master/tensor2tensor/data_generators/wikifact) and a large dataset (Sec 3) based on Wikidata and Wikipedia at https://github.com/google-research-datasets/wikifact. 3. (3)

We show that a Transformer-based end-to-end fact extraction model is able to perform structured prediction of relation tuples, avoiding the need to split the process into multiple steps (named entity recognition, coreference resolution and relation classification). It is able to extract complete sets of facts from full pages of text in one pass. 4. (4)

We conduct experiments to compare our proposed metric against human evaluation of factual accuracy of generated text (Sec 8.1) and show that model-based metrics are better correlated with human judgment when compared to traditional metrics like ROUGE.

Our models work under some limitations that are discussed in Sec 9.1, and then Sec 9.2 discusses future work and ways to make our models more robust.

2. Related Work and Motivation

Many evaluation metrics have been proposed for text generation tasks like BLEU (Papineni et al., 2002) and METEOR (Lavie and Agarwal, 2007) for machine translation and ROUGE (Lin, 2004), Basic Elements (Hovy et al., 2006) & Pyramid (Nenkova and Passonneau, 2004) for text summarization. In Steinberger and Jezek (2009), the authors explain the different kinds of evaluation we can perform for summarization. They are broadly classified as extrinsic metrics that are specific to tasks (e.g. in summarizing a person, whether the date of birth has been included) and intrinsic metrics like grammaticality, coherency and non-redundancy that are based on the analysis of the summary. ROUGE, BLEU, sentence level F1 measures, etc are intrinsic content based metrics. Zhang et al. (2018) and other related works study ways to estimate the trustworthiness of answers to a question. With the recent shift towards using neural abstractive methods for text summarization and other text generation tasks, we believe that it is important to assess the factual accuracy of generated text. Wiseman et al. (2017) have also studied some extractive evaluative methods to assess the quality of generated text. This includes a Relation Generator, which predicts the relation between entities to assess the factual correctness of generated records. However, we introduce a much larger dataset and enable training end-to-end models that can extract fact triplets from text. We additionally perform detailed analysis of the fact extraction models.

Typical fact extraction pipelines are a multistage process consisting of part-of-speech tagging, named entity recognition (Finkel et al., 2005; Lample et al., 2016; Chiu and Nichols, 2016) that produces entities $\{e_{i}\}$ and then relation classification that predicts a relation $r_{k}$ for every pair of entities $(e_{i},e_{j})$ . OpenIE (Banko et al., 2007) predicts a relation by linking the text connecting $e_{i}$ and $e_{j}$ . Because it does not have a fixed schema, logical reasoning on its outputs are not possible. Mohamed et al. (2011) extend this to start with a fixed schema that can grow with more training, yet retain a consistent output surface form.

In this paper, we consider fact classification models with fixed schema. This idea has been studied in many previous works including Surdeanu et al. (2012), which considered datasets that have multiple relation labels for an entity pair, which each may have multiple instances in the input text. This was modeled as a graphical model over latent variables. Riedel et al. (2013) treated relation extraction as reasoning with matrix-factorization, and could work with surface-form texts and knowledge-base embeddings simultaneously. However, both of these works had datasets with very few types of relations, and were shown to work over limited domains. Recently, neural networks have been used for classifying relations. Lin et al. (2016) used attention over multiple instances for the same entity pair to predict relations. Sorokin and Gurevych (2017) proposed to predict multiple relations in a sentence by using all the entity pairs and relation labels in the sentence as contextual input. We propose a simpler model where we classify relations between all the entity pairs in a sentence, without any additional context. We also make use of our proposed dataset that is bigger, more diverse and has more relation types. Our dataset also has article-level information that can be used to train models like in Section 4.2. Since using two-step processes may be affected by compounding of errors across the models, some end-to-end approaches (Miwa and Sasaki, 2014; Miwa and Bansal, 2016) have been proposed, where the models extract entities and relations in one pass through the model. However, the method used in Miwa and Sasaki (2014) required designing hand-crafted features and task-specific algorithms. Miwa and Bansal (2016) has a two-phase model that first extracts entity candidates and then predicts relations based on the parsed tree-structure of the sentence. We instead propose a sequence-to-sequence model that is able to output fact tuples directly, and does not require any feature engineering.

We found that the abstractive summarization models such as those described in Liu et al. (2018) may generate sentences with factual inaccuracies (e.g. incorrect month in date of birth, wrong city in the state, etc.). Cao et al. (2017) found that 30% of summaries generated by a state-of-the-art summarization model contained factual inaccuracies. We found by running a large-scale experiment as described in Section 8.1, that the summarization model had factual inaccuracy rate of approximately 17%. We believe that this is because such mistakes are not heavily penalized by cross-entropy or n-gram based model losses and metrics.

As further motivation, we synthesized factually inaccurate samples by making simple corruptions to Wikipedia lead sections. We replaced mentions of dates (day and month only), locations or people with other entities of the same type in the text. For example, Barack was born on August 4, 1961 in Honolulu. He married Michelle on October 3, 1992 in Chicago. becomes Barack was born on October 3, 1961 in Chicago. He married Michelle on August 4, 1992 in Honolulu.. Table 2 shows that model-free metrics such as ROUGE and OpenIE-based tuple comparison do not reflect the decline in factual accuracy due to such corruption as much as the model-based metrics do.

3. Dataset

We create a dataset for fact extraction using distant supervision that is based entirely on the English Wikipedia corpus and the Wikidata knowledge base $W_{KB}$ (Vrandečić and Krötzsch, 2014). Our distant supervisor is very similar to the one proposed by Mintz et al. (2009). Although the inputs and labels for the classifier and end-to-end model are slightly different, we start by running an NER and co-reference resolution system ${}^{\ref{footnote-ner-coref}}$ on each Wikipedia article. The topic of that article is considered as the subject $e_{s}$ . The other entities $e_{j}$ found in the article are considered objects. For every pair $(e_{s},e_{j})$ , we say they are related if there is a relation $r_{k}$ such that the triplet $(e_{s},r_{k},e_{j})$ is found in $W_{KB}$ . We add this triplet to a set of positive examples $E_{p}$ . If no such relation exists between $e_{s}$ and $e_{j}$ , we add the triplet $(e_{s},r_{0},e_{j})$ ( $r_{0}$ denotes no-relation) to a set of negative examples $E_{n}$ .

4. Model-based Metrics

In this section we describe models that can extract fact tuples from text and how we use them to define the factual accuracy metric as defined in Eq 1. Given some input text $X$ , we then extract claims made in $X$ as fact tuples.

4.1. Named Entity Recognition (NER) + Relation Classifier

This approach consists of two steps, where we first recognize all the named entities $e_{i}$ from $X$ and then classify relations between entity pairs $(e_{i},e_{j})$ .

4.1.1. Named Entity Recognition

Entities are real-world objects like people, locations, organizations etc that can be identified by a proper name333https://en.wikipedia.org/wiki/Named_entity. Entities can be identified with named-entity recognition (NER) systems like Chiu and Nichols (2016); Lample et al. (2016); Finkel et al. (2005) that take in $X$ and produce the set $\{e_{i}\}$ . NER is followed by co-reference resolution444While we use an NER and co-reference resolution system that is not available to the public, the dataset we release (Section 3) has the positions of all the recognized and resolved entities that we use for training our classifier. (Clark and Manning, 2016; Recasens et al., 2013; Lee et al., 2011; Raghunathan et al., 2010). Publicly available NER and co-reference systems include Stanford’s CoreNLP555http://stanfordnlp.github.io/CoreNLP/coref.html and NLTK666https://www.nltk.org/.

4.1.2. Relation Classifier

For every pair $(e_{i},e_{j}),\ e_{i}\not=e_{j}$ we consider all sentences $S_{l}$ in $X$ that contain both entities. The input to the classifier is then each of these sentences $S_{l}$ . Because a sentence may contain multiple entities, we also add a prefix $SUBJ$ for $e_{i}$ and $OBJ$ to $e_{j}$ as a hint. For example, $X$ = Person1 was born in City1 becomes $S_{l}$ = SUBJ{ Person1 } was born in OBJ{ City1 }. Unlike Sorokin and Gurevych (2017), our classifier does not require additional context. Let $s_{i}$ be a token in the input sentence $S_{l}$ after NER, and $r^{k}$ denote the $k$ th relation. Our classifier takes in input tokens $s_{i}$ that are first embedded onto a latent space, and then a stack of Transformer encoder-only layers process the whole sequence. A subsequent max-pooling layer selects one of these outputs that is then converted to a probability estimate of relations by a sigmoid operation. The exact series of operations can be viewed as:

[TABLE]

Figure 1(a) also shows the architecture of this model.

4.1.3. Dataset preparation

For every triplet $f$ in $E_{p}\cup E_{n}$ , we have sentence(s)777There may be more than one sentence in the article that have mentions of the subject and object entity pair. $S_{l}$ in the article that may describe the relation between $e_{s}$ and $e_{j}$ . $S_{l}$ is processed so that subject and object are prefixed with “SUBJ” and “OBJ” as a hint to the model (Section 4.1). This leads to a dataset with 2.9 million positive examples and 34 million negative examples totaling to 45GiB on disk.

4.1.4. $fact_{acc}$ with the Relation Classifier

The classifier predicts a relation $r_{k}$ for each entity pair $(e_{i},e_{j})$ . We extract such triplets from the ground-truth $T$ and generated text $G$ , and use the definition from eq 1 to calculate the factual accuracy.

4.2. End-to-End Extraction

We propose an end-to-end fact extraction model to avoid compounding of errors across components in multi-stage approaches like Section 4.1 (Mccallum and Jensen, 2003). This model also does not require any feature engineering or context. The input to the model is text $X$ of any length (sentence/paragraph/article) and the $subject$ entity $e_{s}$ prefixed to $X$ . All the inputs tokens in $[e_{s};X]$ are first embedded onto a latent space. A Transformer model consisting of a stack of encoder layers followed by decoder layers produces an output sequence of arbitrary length. A softmax operation is applied to every output token to define a distribution at every timestep. Figure 1(b) shows the architecture of this model. To encourage the model to have structured outputs, we train the model with labels that are a sequence of fact tuples. For example, if $X$ = “ Person1 was born in Country1. He was a painter”, then the label, $Y$ , for that input is “Person1 $\langle$ t $\rangle$ born in $\langle$ t $\rangle$ Country1 $\langle$ f $\rangle$ Person1 $\langle$ t $\rangle$ profession $\langle$ t $\rangle$ painter $\langle$ end $\rangle$ ”, where $\langle$ t $\rangle$ separates tokens within the fact $f_{i}$ and $\langle$ f $\rangle$ separates facts. For prediction, we perform a beam search over all the output timesteps, and continue decoding until $\langle$ end $\rangle$ is predicted. A length-penalty $\alpha$ controls the length of this prediction as in (Wu et al., 2016).

4.2.1. Dataset preparation

If the input article text is $X$ , every triplet $f_{p}$ in $E_{p}$ (we ignore the negative examples for end-to-end models because no relations between entity pairs is implied by no output by the model) is appended to the article’s label $L$ . $L$ will then contain a series of tokens that describe facts, with seperators between them. For example: $e_{s}\langle t\rangle r_{1}\langle t\rangle e_{1}\langle f\rangle e_{s}\langle t\rangle r_{2}\langle t\rangle e_{2}\langle f\rangle...$ We also prepend the input text $X$ with $e_{s}$ ( $[e_{s};X]$ ) as a hint to the model for generating facts about $e_{s}$ . This leads to a dataset with 2.5 million examples totaling to 1.5GiB on disk. 888This dataset is made available at https://github.com/tensorflow/tensor2tensor/tree/master/tensor2tensor/data_generators/wikifact

4.2.2. $fact_{acc}$ with the End-to-End model

The End-to-End model is able to produce a sequence of fact tuples in the form, $subj_{1}$ * $\langle$ t $\rangle$ $rel_{1}$ $\langle$ t $\rangle$ $obj_{1}$ $\langle$ f $\rangle$ $subj_{1}$ $\langle$ t $\rangle$ $rel_{2}$ $\langle$ t $\rangle$ $obj_{2}$ *. It is trained to output relations from a fixed schema based on WikiData. Consider an output from this model, $\textit{Barack Obama}\langle t\rangle P69\langle t\rangle Harvard$ . $P69$ denotes ‘educated at’999https://www.wikidata.org/wiki/Property:P69. These tuples are extracted from $T$ and $G$ to fit into the metric defined in eq 1.

4.3. NER + Binary Relation Classifier

Similar to the typical relation classifier detailed in Sec 4.1, we define a classifier that predicts whether a pair of entities $(e_{i},e_{j})$ are related to each other through any relation. This allows for verifying that entities are related in both the ground-truth $T$ and generated text $G$ , while being flexible enough to allow for any relation types. We also note that two entities can be related to each other in multiple ways. The inputs to this model are the same as Sec 4.1, but the model is expected to output $rel$ as

[TABLE]

4.3.1. Dataset preparation

Data for this model is generated with the same procedure detailed in Sec 4.1.3. The only difference is the way we define the label $rel$ . We consider entities $e_{i}$ and $e_{j}$ to be related if there is a relation $r_{k}$ such that $(e_{i},r_{k},e_{j})$ is found in $W_{KB}$ .

4.3.2. $fact_{acc}$ with the Binary Relation Classifier

The model predicts $rel$ for each entity pair $(e_{i},e_{j})$ , and we are able to extract a set of tuples of the form $(e_{i},rel,e_{j})$ from both $T$ and $G$ . To use eq 1 to define the factual accuracy, we filter the set by considering only entity pairs $(e_{i},e_{j})$ that are found in both $T$ and $G$ to then compare the predicted label $rel$ between them.

5. Model-free Metrics

We describe model-free automatic metrics in this section. Unlike model-based metrics, they are not susceptible to changes in training data, and might be considered easier to interpret or understand.

5.1. ROUGE

ROUGE (Lin, 2004) has been used as an automatic metric to judge the quality of generated text, and has shown to correlate well with human judgment of overall linguistic quality of the text.

5.2. OpenIE

OpenIE (Banko et al., 2007) is a tool that can extract relation tuples from text, without a specified schema. We use it to extract sets of relation tuples from $T$ and $G$ , and then compute the precision like in eq 1.

6. Model Experiments

In this section, we describe the methods we used to train and evaluate our relation extraction models. All of our proposed classifiers and end-to-end models have 6 Transformer layers and 1 embedding layer, with number of neurons (hidden layer size) set to 512. In the Transformer-based models, we use 8 attention heads. Our models are trained using the AdaFactor (Shazeer and Stern, 2018) optimizer. We use the publicly available Tensor2Tensor (Vaswani et al., 2018)101010https://github.com/tensorflow/tensor2tensor framework for our experiments and will be releasing our code extensions as part of that framework. On our proposed dataset, the classifiers are trained for 50,000 iterations with batch-size of 1024 and the end-to-end models are trained for 50,000 iterations with batch-size of 256.

We evaluate classifiers and end-to-end models on our dataset. These results are presented in Table 3. The end-to-end model is learning to recognize entities, resolving entity co-references, and reason about their relation in one pass through the model. To the best of our knowledge, we are not aware of other end-to-end structured relation extraction models and therefore do not include a comparison against other approaches. Some examples of extracting facts on our dataset are shown in A.2, where we include a comparison to OpenIE’s triplet extraction.

We calculate precision and recall in the above experiments by matching ground-truth fact tuples exactly. This implies that the end-to-end model is not only learning to identify entities and resolve co-references, but also predict structured output, and its outputs can be used for reasoning. Their performance is competitive against relation classifiers while having a simple training and inference routine.

For each model, we sort and select the ten most frequent relation types that appear in our test sets. The $F1$ measure on these relations for classifiers are shown in Table 4, and end-to-end models are shown in Table 5.

7. Error Analysis of Model Predictions

Distant supervision (Mintz et al., 2009) is a way to create training data by using weak signals. In our dataset, we assign a relation label $r_{k}$ for every entity pair $(e_{i},e_{j})$ in the input text $X$ if the relation tuple $(e_{i},r_{k},e_{j})$ exists in the Wikidata knowledge base $W_{KB}$ . However, the sentence $S_{l}$ containing $(e_{i},e_{j})$ may not necessarily entail $r_{k}$ . This leads to inaccurate estimates of the true-positive rate for our fact extraction models. We evaluate the effect of this distant supervision by gathering the set of facts extracted from our models that are marked false-positive by the distant supervision scheme. We present a pair of input text (Wikipedia articles) and facts extracted by our models to human evaluators, and ask them to mark a fact to be True only if the relation tuple $(subject,relation,object)$ is implied by the input text. We asked two evaluators to score facts marked false-positive from a random set of 30 Wikipedia articles. We consider the fact to be true if both evaluators agree. We present the results in Table 6, where we can see the rate of false-positive facts that were marked true by the evaluators. This suggests that the end-to-end models could benefit by a better labeling scheme.

8. Evaluation of $fact_{acc}$ as a Metric

In this section, we show the effectiveness of our proposed metric on judging the factual accuracy of generated text. We use the text summarization model proposed in (Liu et al., 2018) to generate lead sections of Wikipedia articles using the dataset and model in that paper, and compare the generated summary against the real lead section. In the following section, we describe the methodology used to compare human judgment of factual accuracy and how we compare our metric against that baseline.

8.1. Human Evaluation

Every claim made in the generated text $G$ can be considered to belong to one of three categories: supported by a sentence in ground-truth $T$ , refuted by $T$ or cannot be verified by $T$ . The evaluators were asked to only consider claims that are either supported or refuted by $T$ . This ensures that no external knowledge is used in comparing $T$ and $G$ , and ignores all claims that cannot be verified by $T$ . Four evaluators were asked to rate 30 examples of generated text $G$ and then give it a score of 1-5 with 5 being highest factual accuracy. A special case is where the generated text has no verifiable claims. In this case, they were asked to give it a score of 1. Figure 2 shows the interface a human evaluator uses in our experiment.

We conduct the same experiment on two sets of data: first is a random sampling from summaries generated for Actors. We consider this an easier subset because we expect our fact extraction models to do well on this subset due to the summaries and Wikipedia lead sections generally containing relationships our models perform well on (see tables 4 and 5). We present these results in Table 7. We analyzed the inter-rater agreement on the scores given to each example, and found that Krippendorff’s alpha (allows for ordinal rankings) was 0.6897. The second is a random sampling from all categories in Wikipedia. The results are presented in Table 8. The inter-rater agreement on this sample was found to be 0.7530.

We see that our end-to-end model (Section 4.2) has the best correlation on both subsets, indicating that it generalizes better to generated text. This may also be because the classifier suffers from a compounding of errors, where it is unable to predict relations if the NER system fails to recognize entities.

9. Conclusion

9.1. Limitations

The dataset we create only makes use of sentences found in Wikipedia, and facts found in WikiData. This means that our models are biased to sentences structured to the neutral tone set in Wikipedia, and towards popular types of facts expressed in WikiData such as date of birth, profession, etc. Other sources of text may have more complex structures and styles of writing that may make it hard for our models to adapt to easily. An simple example of this is negating a binary relationship with ‘not’, and different ways of expressing the same idea such as ‘wife/husband’ instead of ‘spouse’. WikiData is an incomplete knowledge base, and this also leads to many sentences that in reality imply a fact to be marked containing no facts. This is a very typical problem faced by any work using distant supervision, and is combated with methods like active learning (Shen et al., 2017).

It should be noted that ROUGE and to the best of our knowledge, most other automatic metrics, are also susceptible to changes in linguistic style and structure. However, elaborate labeling and bigger datasets will allow for our models to learn to overcome these challenges.

9.2. Discussion and future work

We have shown that our proposed metric is able to indicate the factual accuracy of generated text, and agrees with human judgment on our datasets. By leveraging a new dataset for both relation classification and end-to-end fact extraction, we also showed that classifiers and end-to-end models with straightforward architectures are able to perform competitive fact extraction.

Our end-to-end model avoids compounding of errors over sub-components typically used in other fact-extraction pipelines. We will release the code and datasets used to train this model, so that the proposed metric can be used to standardize comparison. We are in the process of building a bigger dataset that will contain multiple text domains, stronger human supervision and a larger collection of relation tuples that will help overcome many of the limitations discussed in the previous section (9.1). We encourage further development and use of this metric for automating the assessment of factual accuracy of generated text, and the development of better end-to-end models with structured outputs for fact extraction.

Appendix A Appendix

A.1. Reproducibility

We release code to train our fact extraction models as part of the Tensor2Tensor framework111111https://github.com/tensorflow/tensor2tensor along with trained model weights at https://github.com/tensorflow/tensor2tensor/tree/master/tensor2tensor/data_generators/wikifact. A large fact extraction dataset (Sec 3) based on Wikidata and Wikipedia is made available 121212https://github.com/tensorflow/tensor2tensor/tree/master/tensor2tensor/data_generators/wikifact. To train our end-to-end and classifier models for fact extraction, we use the hyper-parameter set “transformer_base” defined in the Tensor2Tensor framework131313https://github.com/tensorflow/tensor2tensor/blob/master/tensor2tensor/models/transformer.py. We further release code to use our end-to-end models as a fact extractor and calculate the factual accuracy metric at https://github.com/tensorflow/tensor2tensor/tree/master/tensor2tensor/data_generators/wikifact.

A.2. Fact extraction example

We include an example of facts extracted from text using our models where we compare it against OpenIE’s (Banko et al., 2007) triplet extraction in Table 9. This example illustrates the advantage of using structured approaches to fact extraction. OpenIE yields many triplets that mostly cannot be used for reasoning.

Bibliography41

The reference list from the paper itself. Each links out to its DOI / PubMed record.

1(1)
2Bahdanau et al . (2015) Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2015. Neural Machine Translation by Jointly Learning to Align and Translate. In International Conference on Learning Representations .
3Banko et al . (2007) Michele Banko, Michael J. Cafarella, Stephen Soderland, Matt Broadhead, and Oren Etzioni. 2007. Open Information Extraction from the Web. In Proceedings of the 20th International Joint Conference on Artificial Intelligence . Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 2670–2676.
4Brown et al . (1992) Peter F. Brown, Vincent J. Della Pietra, Robert L. Mercer, Stephen A. Della Pietra, and Jennifer C. Lai. 1992. An Estimate of an Upper Bound for the Entropy of English. Computational Linguistics 18, 1 (March 1992), 31–40.
5Cao et al . (2017) Ziqiang Cao, Furu Wei, Wenjie Li, and Sujian Li. 2017. Faithful to the Original: Fact Aware Neural Abstractive Summarization. Co RR abs/1711.04434 (2017). ar Xiv:1711.04434 http://arxiv.org/abs/1711.04434
6Chiu and Nichols (2016) Jason Chiu and Eric Nichols. 2016. Named Entity Recognition with Bidirectional LSTM-CN Ns. Transactions of the Association for Computational Linguistics 4 (2016), 357–370.
7Clark and Manning (2016) Kevin Clark and Christopher D. Manning. 2016. Improving Coreference Resolution by Learning Entity-Level Distributed Representations. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) . Association for Computational Linguistics, 643–653. https://doi.org/10.18653/v 1/P 16-1061 · doi ↗
8Finkel et al . (2005) Jenny Rose Finkel, Trond Grenager, and Christopher Manning. 2005. Incorporating Non-local Information into Information Extraction Systems by Gibbs Sampling. In Proceedings of the 43nd Annual Meeting of the Association for Computational Linguistics (ACL 2005) . 363–370.

TL;DR

Contribution

Findings

Abstract

Peer Reviews

Code & Models

Videos

Assessing The Factual Accuracy of Generated Text

Abstract.

1. Introduction

2. Related Work and Motivation

3. Dataset

4. Model-based Metrics

4.1. Named Entity Recognition (NER) + Relation Classifier

4.1.1. Named Entity Recognition

4.1.2. Relation Classifier

4.1.3. Dataset preparation

4.1.4. factaccfact_{acc}factacc​ with the Relation Classifier

4.2. End-to-End Extraction

4.2.1. Dataset preparation

4.2.2. factaccfact_{acc}factacc​ with the End-to-End model

4.3. NER + Binary Relation Classifier

4.3.1. Dataset preparation

4.3.2. factaccfact_{acc}factacc​ with the Binary Relation Classifier

5. Model-free Metrics

5.1. ROUGE

5.2. OpenIE

6. Model Experiments

7. Error Analysis of Model Predictions

8. Evaluation of factaccfact_{acc}factacc​ as a Metric

8.1. Human Evaluation

9. Conclusion

9.1. Limitations

9.2. Discussion and future work

Appendix A Appendix

A.1. Reproducibility

A.2. Fact extraction example

4.1.4. $fact_{acc}$ with the Relation Classifier

4.2.2. $fact_{acc}$ with the End-to-End model

4.3.2. $fact_{acc}$ with the Binary Relation Classifier

8. Evaluation of $fact_{acc}$ as a Metric