Discourse Understanding and Factual Consistency in Abstractive Summarization

Saadia Gabriel; Antoine Bosselut; Jeff Da; Ari Holtzman; Jan Buys; Kyle Lo; Asli Celikyilmaz; Yejin Choi

arXiv:1907.01272·cs.CL·April 12, 2021

TL;DR

This paper presents Co-opNet, a transformer-based framework that improves abstractive summarization by ensuring factual consistency and narrative coherence through a generator-discriminator architecture, evaluated on scientific papers.

Contribution

The paper introduces Co-opNet, a novel generator-discriminator model that enhances factual accuracy and coherence in abstractive summaries, addressing hallucination and coherence issues.

Findings

1. Co-opNet significantly improves global coherence over baselines.

2. Automatic and human evaluations confirm better factual consistency.

3. Discriminator objectives effectively capture coherence aspects.

Tables

Table 1: Discourse Role Ordering

$S_{i - 1}$	$S_{i}$
BACKGROUND	BACKGROUND
BACKGROUND $\lor$ METHOD $\lor$ OBJECTIVE	METHOD
BACKGROUND $\lor$ OBJECTIVE $\lor$ METHOD	OBJECTIVE
OBJECTIVE $\lor$ METHOD $\lor$ OTHER	RESULT

Table 2: Salient spans extracted using factuality discriminator.

Topic	Spans
NLP	existing semantic schema, annotation effort,
NLP	music knowledge representation, siri assistant
BIO	biological system, ptotic, cybernetics
BIO	entropy, shannon established fundamental limits

Table 3: Domain subset sizes

Split	CS	BIO	AAN
Train	44900	4104	10106
Validation	5622	555	892
Test	5670	522	892

Table 4: Automatic Evaluation of generative architectures and Co-opNet. For AAN, we provide results using the Factuality discriminator. For CS and Bio, we provide results using the Coverage discriminator.

Model	AAN			CS			Bio
Model	R-1	R-2	R-L	R-1	R-2	R-L	R-1	R-2	R-L
Lede-3	27.12	6.62	23.88	28.22	7.06	16.22	27.60	5.70	24.21
LexRank	36.03	10.14	31.37	36.53	10.41	32.09	35.32	8.84	30.76
LSTM	27.80	5.57	18.02	22.74	4.56	20.64	10.73	0.49	9.94
PGen	39.85	12.83	23.24	36.68	11.74	32.55	23.74	4.48	21.65
Generator (Our work)	41.31	12.97	37.05	38.01	10.95	34.46	34.86	8.45	31.38
Co-opNet (Our work)	41.67	12.65	37.23	38.57	10.81	35.11	35.86	8.41	32.56

Table 5: BERTScore results on AAN subset (F1)

Model	BERTScore	SciBERTScore
PGen	57.86	59.13
Generator	61.71	62.80
Co-opNet (Adj)	61.87	63.10
Co-opNet (Fact)	62.09	63.21

Table 6: Comparison of different Co-opNet discriminators

Model	AAN			CS			Bio
Model	R-1	R-2	R-L	R-1	R-2	R-L	R-1	R-2	R-L
Coverage (Cov)	41.29	12.09	37.14	38.57	10.81	35.11	35.86	8.41	32.56
Order	41.20	12.20	37.11	38.50	10.87	35.10	35.66	8.46	32.39
Adjacency (Adj)	40.97	12.46	36.70	37.44	10.67	33.86	34.89	8.45	31.57
Factuality	41.67	12.65	37.23	38.23	11.03	34.64	35.46	8.41	31.96

Table 7: Human Evaluation of Co-opNet Architectures (% of judgements for each model)

PGen vs. Co-opNet-Adj			Generator vs. Co-opNet-Adj
Criteria	PGen	Co-opNet	Criteria	Generator	Co-opNet
Abstractiveness	41.89	47.30	Abstractiveness	20.41	38.10
Coherence	42.57	50.00	Coherence	23.81	34.01
Factuality	39.86	45.95	Factuality	22.98	30.41
Overall	34.90	53.02	Overall	25.00	31.08
PGen vs. Co-opNet-Fact			Generator vs. Co-opNet-Fact
Criteria	PGen	Co-opNet	Criteria	Generator	Co-opNet
Abstractiveness	51.02	39.46	Abstractiveness	27.33	35.33
Coherence	43.92	50.00	Coherence	30.87	32.21
Factuality	43.84	48.63	Factuality	27.52	32.21
Overall	43.54	50.34	Overall	30.87	32.21

Table 8: Example of gold and generated abstracts from baseline Pointer Networks + Coverage See et al. (2017) (PGen) and two of our proposed models, Generator and Co-opNet, on the NLP scientific domain. Coherence issues and factual errors in generated abstracts are highlighted in italics. We highlight correct terminology and transitional phrases that contribute to coherent flow by properly delineating sections of abstracts in bold and italics.

Gold	We investigate mutual benefits between syntax and semantic roles using neural network models, by studying a parsing->SRL pipeline, a SRL->parsing pipeline, and a simple joint model by embedding sharing. The integration of syntactic and semantic features gives promising results in a Chinese Semantic Treebank…
PGen	In this paper, we propose a novel approach to learn syntactic and semantic role labeling models to semantic role labeling (wsd). In the first neural network models induce non-linear feature features from word and part-of-speech (pos) parsing. We show that semantic features can be used to learn…
Generator	Syntax-semantic relations play a crucial role in natural language processing. In contrast, semantic role labeling (srl) models typically rely on parser output features to improve accuracy. In this work, we propose a joint srl and syntactic parsing srl pipeline using the chinese treebank (qiu et al., 2016)…
Co-opNet (Adj)	In this paper, we explore the use of neural network models to jointly train semantic role labelers and parsers for semantic role labeling (srl). We first propose a simple neural srl model that uses a neural long shortterm memory (lstm)-based parser to represent the output of an srl system…

Table 9: Automatic Evaluation of discriminator architectures

Model	Training Data	Prec	Rec	F1	Acc
Discourse-Adj	ArXiv-AAN	86.05	85.25	85.65	86.81
Discourse-Adj	ArXiv-Bio	90.30	93.44	91.84	92.32
Discourse-Abs	CSAbstruct	88.99	89.09	89.04	89.00
Factuality	SciFact	73.70	70.50	72.10	75.70

Table 10: Statistics of gold summaries in different summarization datasets.

Dataset	Narrative Flow?	# Summaries	Avg # Sents	Avg # Words
XSum Narayan et al. (2018)	✗	226,711	1.00	23.26
Newsroom Grusky et al. (2019)	✗	1,321,995	1.45	26.70
CNN Hermann et al. (2015)	✗	92,579	3.59	45.70
DailyMail Hermann et al. (2015)	✗	219,506	3.86	54.65
ArXiv	✓	472,493	6.11	150.85
AAN	✓	11,890	5.03	106.76

Table 11: Human Evaluation of Co-opNet Architectures (% of judgements for each model)

Gold vs. Co-opNet-Adj
Criteria	Gold	Co-opNet
Abstractiveness	47.37	52.63
Coherence	66.67	31.06
Factuality	66.92	32.33
Overall	61.36	32.33

Equations

$\textbf{h}_{j}^{0}$

$\textbf{h}_{j}^{l}$

$P(w_{i} \mid w_{0}, \ldots, w_{i-1})$

$L_{cov}$

$L_{order}$

$P_{adj}(\textbf{s}) = \sigma(\textbf{w}_{disc}^{\top}\textbf{h}_{cls})$

$L_{disc} = -\Big(\delta_{adj}(\textbf{s})\cdot\log P_{adj}(\textbf{s}) + (1-\delta_{adj}(\textbf{s}))\cdot\log(1-P_{adj}(\textbf{s}))\Big)$

$L_{fact}$

$\mathrm{score}(g) = \ldots + \lambda_{disc}\frac{1}{|S(g)|-1}\sum_{u=2}^{|S(g)|}\log P_{adj}(s_{u}, s_{u-1})$

$L_{gen} = -\sum_{i=1}^{|a|+|s|}\log P(w_{i} \mid w_{0}, \ldots, w_{i-1})$

$f_{n}(S, |O|) = \frac{S - (-|O|)}{|O| - (-|O|)}$

Taxonomy

Topics: Topic Modeling · Advanced Text Analysis Techniques · Natural Language Processing Techniques

Full text

Discourse Understanding and Factual Consistency in Abstractive Summarization

Saadia Gabriel♠ Antoine Bosselut♦ Jeff Da♢ Ari Holtzman♠

Jan Buys♡ Kyle Lo♢ Asli Celikyilmaz♣ Yejin Choi♠♢

♠Paul G. Allen School of Computer Science & Engineering, University of Washington

♣Microsoft Research ♢Allen Institute for Artificial Intelligence

♦Stanford University ♡University of Cape Town

{skgabrie,ahai,yejin}@cs.washington.edu, {jeffd,kylel}@allenai.org

Abstract

We introduce a general framework for abstractive summarization with factual consistency and distinct modeling of the narrative flow in an output summary. Our work addresses current limitations of models for abstractive summarization that often hallucinate information or generate summaries with coherence issues.

To generate abstractive summaries with factual consistency and narrative flow, we propose Cooperative Generator – Discriminator Networks (Co-opNet), a novel transformer-based framework where a generator works with a discriminator architecture to compose coherent long-form summaries. We explore four different discriminator objectives which each capture a different aspect of coherence, including whether salient spans of generated abstracts are hallucinated or appear in the input context, and the likelihood of sentence adjacency in generated abstracts.

We measure the ability of Co-opNet to learn these objectives with arXiv scientific papers, using the abstracts as a proxy for gold long-form scientific article summaries. Empirical results from automatic and human evaluations demonstrate that Co-opNet learns to summarize with considerably improved global coherence compared to competitive baselines.

1 Introduction

Generating summaries with coherent discourse structure and domain knowledge awareness poses a challenge for current methods in summarization. Generative models can commonly produce high quality text (Figure 1), but fail to understand finer-grained details of coherence such as the structure and flow of a narrative. In addition, they often generate factually incorrect content. Prior work on factuality in abstractive summarization has found that current models can hallucinate information more than 70% of the time when generating summaries of news articles Maynez et al. (2020).

To address these issues, we focus our study on generating abstractive summaries with factuality and narrative flow. Given an input document, the goal is to generate a paragraph-length abstractive summary with proper discourse structure that contains factually correct claims. Our study builds on and extends previous work that focuses on either extractive document-level summarization Nenkova and McKeown (2012); Allahyari et al. (2017) or abstractive sentence-level summarization Rush et al. (2015); Grusky et al. (2019); Narayan et al. (2018).

In pursuit of this goal, we introduce Cooperative Generator-Discriminator Networks (Co-opNet), a framework for abstractive summarization that considers subtle aspects of fact-checking and discourse necessary for coherent text generation. In this framework, the generator, a transformer language model fine-tuned for abstractive summarization, proposes a pool of candidate summaries (§2). The discriminator, also transformer-based, scores the factuality or discourse quality of candidate summaries using one of four different objectives: the overlap between a scientific article introduction and predicted fact-checking evidence spans in generated summaries, the ordering of predicted discourse roles, the coverage of predicted discourse roles, or the likelihood of adjacency between generated sentences (§3). The best summary is chosen cooperatively by combining the generator and discriminator scores (§4).

Most previous works on abstractive document-level summarization have difficulty in directly modeling or evaluating narrative flow and factuality in generated summaries. This weakness is largely due to the inherent limitations of existing datasets, such as the CNN/DailyMail dataset Hermann et al. (2015). The reference summaries available in these commonly used resources are mainly headlines of news articles or stories. As a result, they are often sets of disconnected sentences that are highly extractive, leading to models that are also extractive (Hoang et al., 2019), rather than abstractive.

In order to address these data challenges, we test our summarization model on a set of arXiv scientific papers. Scientific abstracts are ideal for modeling narrative flow as they are structured with highly coherent discourse flow. They also maintain implicit abstractive alignments with respect to the introduction of the article, in contrast to the tight, extractive alignments of current models. Scientific article summarization is also a task where factuality is better defined than in other domains like story summarization, which leave more room for interpretation.

Comprehensive empirical results considering both automatic and human evaluations demonstrate that Co-opNet learns to summarize scientific articles from three domains with considerably improved global coherence compared to competitive baselines (§6). We also demonstrate that the framework is generalizable to multiple coherence objectives, and effective at generating scientific abstracts that are more factually consistent.

2 Generator Networks

We use the transformer architecture of Radford et al. (2019) as our generator’s architecture. Following the work of Liu et al. (2018), we adapt a language model to the task of abstractive summarization by concatenating the article $a$ , a delimiter token $[\mathrm{SEP}]$ , the summary $s$ , and an end token $[\mathrm{END}]$ into one input vector $X=(a_{1},...,a_{|a|},[\mathrm{SEP}],s_{1},...,s_{|s|},[\mathrm{END}])$ , where $|a|$ is the length of the gold article and $|s|$ is the length of the gold summary.
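The input format described above can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `build_generator_input` and the literal token strings are assumptions, and the default truncation limits simply mirror the 800 and 200 token limits reported in the experimental setup.

```python
from typing import List

SEP = "[SEP]"  # delimiter token between article and summary
END = "[END]"  # end-of-sequence token


def build_generator_input(article: List[str], summary: List[str],
                          max_article: int = 800,
                          max_summary: int = 200) -> List[str]:
    """Return X = (a_1, ..., a_|a|, [SEP], s_1, ..., s_|s|, [END])."""
    a = article[:max_article]   # truncate the input context
    s = summary[:max_summary]   # truncate the summary
    return a + [SEP] + s + [END]


x = build_generator_input(["we", "study", "parsing"], ["parsing", "helps"])
print(x)
```

At inference time the same layout would be used without the summary tokens, so that the model continues the sequence after the delimiter.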

At each time step $i$ , the model produces an output probability distribution over the vocabulary for the next token $w_{i}$ given all previous output tokens $w_{<i}$ . For any arbitrary token $w_{j}$ preceding $w_{i}$ , the per-layer representation of that token is computed in the following way:

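$$\textbf{h}_{j}^{0} = \textbf{W}_{e}[w_{j}] + \textbf{p}_{j} \qquad (1)$$

$$\textbf{h}_{j}^{l} = \mathrm{block}\big(\{\textbf{h}\}_{<j}^{l-1}\big) \qquad (2)$$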

where block refers to each transformer block composed of multi-headed attention, a feedforward network and layer normalization, $\textbf{W}_{e}$ is a word embedding matrix, $\textbf{p}_{j}$ is the position embedding, $\textbf{h}_{j}^{0}$ is the initial representation, $\{\textbf{h}\}_{j}^{l}$ is the block output for an arbitrary layer $l$ , and $\{\textbf{h}\}_{<j}^{l-1}$ is the set of all block outputs from the preceding layer for positions up to $j$ . Finally, for the current position $i$ in the sequence, we compute a distribution over the output vocabulary as follows:

$$P(w_{i} \mid w_{0}, \ldots, w_{i-1}) = \mathrm{softmax}\big(\textbf{W}_{e}\,\textbf{h}_{i-1}^{L}\big)$$

where $\textbf{W}_{e}$ is the same embedding matrix as in Equation 1 and $\textbf{h}_{i-1}^{L}$ is the final layer transformer block output.
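A small sketch of the output step may help. It assumes the distribution is a softmax over the dot products between the final-layer state and the rows of the shared embedding matrix (the usual weight-tying arrangement for this kind of language model); the toy numbers and the names `softmax` and `next_token_distribution` are illustrative only.

```python
import math
from typing import List


def softmax(logits: List[float]) -> List[float]:
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def next_token_distribution(h_final: List[float],
                            W_e: List[List[float]]) -> List[float]:
    """Score each vocabulary item by the dot product of its embedding row
    with the final-layer state, then normalize with softmax."""
    logits = [sum(w * h for w, h in zip(row, h_final)) for row in W_e]
    return softmax(logits)


# Toy vocabulary of 3 tokens with 2-dimensional embeddings (made-up numbers).
W_e = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
h = [0.5, 2.0]
probs = next_token_distribution(h, W_e)
print(probs)
```

Reusing the embedding matrix for the output projection means no separate output layer has to be learned.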

3 Discriminator Networks

Because summarization models are prone to narrative flow and factual consistency issues Kryściński et al. (2020); Xu et al. (2020), we use a discriminator to score generated summaries for discourse and factuality properties. Due to the challenge of explicitly defining discourse and factuality properties as scores, these properties are approximated using parameterized scoring functions.

These scoring functions determine if generated text demonstrates discourse and factuality properties in three ways: (1) predicting the discourse role of sentences within a full summary, (2) predicting the likelihood of adjacency given a sentence pair, and (3) measuring the presence of salient facts in the generated summary from the original input context. While our discriminators focus on these three properties, we note that this framework is generalizable and could be extended to include other discriminator models that encourage different communicative norms associated with high-quality language generation.

3.1 Discourse

We explore different discriminator architectures as additional discourse scoring functions during the generator’s decoding process. For these discriminators, we generally score discourse in two ways. First, we use inferred sentence-level scientific abstract discourse role labels (BACKGROUND, METHOD, OBJECTIVE, RESULT, and OTHER) defined by Cohan et al. (2019) and predict them using a sequence classifier based on SciBERT Beltagy et al. (2019); see Cohan et al. (2019) for model and training details. Using these predictions, we score the discourse properties of the abstract relative to their coverage (§3.1.1) or ordering (§3.1.2). Second, we learn a function that can score the likelihood that sentences within generated abstracts should be adjacent to one another (§3.1.3).

3.1.1 Coverage

We measure the completeness of the narrative structure within a scientific abstract by defining the following coverage score:

$$L_{cov} = \frac{D_{abs}}{D_{all}}$$

where $D_{abs}$ is the number of unique discourse roles appearing in an abstract and $D_{all}$ is the total number of possible discourse roles. This objective allows us to penalize abstracts that are missing discourse roles. For example, an abstract that fails to mention anything about the results of the study would be penalized.
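As a rough sketch, the coverage score can be computed from predicted sentence roles as the ratio of roles present to roles possible. The ratio form follows from the definitions of $D_{abs}$ and $D_{all}$; counting OTHER among the possible roles and the function name `coverage_score` are assumptions made for this illustration.

```python
from typing import List

# Discourse role inventory of Cohan et al. (2019), as listed in the text.
DISCOURSE_ROLES = ("BACKGROUND", "METHOD", "OBJECTIVE", "RESULT", "OTHER")


def coverage_score(predicted_roles: List[str]) -> float:
    """Fraction of possible discourse roles that appear at least once."""
    d_abs = len(set(predicted_roles) & set(DISCOURSE_ROLES))  # unique roles present
    d_all = len(DISCOURSE_ROLES)                              # all possible roles
    return d_abs / d_all


# An abstract with no RESULT (or OTHER) sentence covers 3 of 5 roles.
score = coverage_score(["BACKGROUND", "OBJECTIVE", "METHOD", "METHOD"])
print(score)
```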

3.1.2 Ordering

We also score the order in which discourse labels appear in generated abstracts. In Table 1, we hard-code valid orderings of discourse labels for generated sentences based on each of the abstract discourse roles of Cohan et al. (2019). If the ordering for two adjacent sentences in the abstract $O(s_{i-1},s_{i})$ is valid, the score for the ordering is 1 (-1 otherwise). We sum the scores for all the orderings within a particular abstract and normalize between 0 and 1 (as described by $f_{n}$; see the Appendix for a more detailed description of the $f_{n}$ function):

$$L_{order} = f_{n}\Big(\sum_{i=2}^{S} O(s_{i-1}, s_{i}),\ |O|\Big)$$

We also impose a rule for $s_{1}$ =‘BACKGROUND’ and a rule for $s_{S}$ =‘RESULT’ to encourage more natural orderings.
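The ordering score can be sketched as below, using the valid transitions transcribed from Table 1 and the normalization $f_{n}(S, |O|) = \frac{S - (-|O|)}{|O| - (-|O|)}$. The extra rules for the first and last sentence are not specified in enough detail here to implement, so this sketch omits them; the names `VALID_PREV` and `ordering_score`, and the handling of abstracts with fewer than two sentences, are assumptions.

```python
from typing import List

# Valid (previous role -> current role) transitions, transcribed from Table 1.
VALID_PREV = {
    "BACKGROUND": {"BACKGROUND"},
    "METHOD": {"BACKGROUND", "METHOD", "OBJECTIVE"},
    "OBJECTIVE": {"BACKGROUND", "OBJECTIVE", "METHOD"},
    "RESULT": {"OBJECTIVE", "METHOD", "OTHER"},
}


def f_n(raw_sum: float, num_orderings: int) -> float:
    """Min-max normalize a sum in [-|O|, |O|] to the range [0, 1]."""
    return (raw_sum - (-num_orderings)) / (num_orderings - (-num_orderings))


def ordering_score(roles: List[str]) -> float:
    pairs = list(zip(roles, roles[1:]))  # adjacent (s_{i-1}, s_i) role pairs
    if not pairs:
        return 0.0  # fewer than two sentences: nothing to score (assumption)
    raw = sum(1 if prev in VALID_PREV.get(cur, set()) else -1
              for prev, cur in pairs)
    return f_n(raw, len(pairs))


print(ordering_score(["BACKGROUND", "OBJECTIVE", "METHOD", "RESULT"]))
print(ordering_score(["RESULT", "BACKGROUND", "METHOD"]))
```

With this normalization an abstract whose transitions are all valid scores 1, one whose transitions are all invalid scores 0, and an even mix scores 0.5.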

3.1.3 Adjacency Classification

To model the likelihood of adjacency between two sentences $s_{u}$ and $s_{v}$ , we first compute a hidden representation of the sentence pair using SciBERT Beltagy et al. (2019). The encoder input is the concatenation of the sentences: $\textbf{s}=[\mathrm{CLS}]+s_{u}+[\mathrm{SEP}]+s_{v}+[\mathrm{SEP}]$ , where $[\mathrm{CLS}]$ is a special token associated with the task and $[\mathrm{SEP}]$ is a sentence delimiter token. Each word in the sequence is encoded by a word embedding $w_{i}$ and positional embedding $p_{i}$ and passed through the SciBERT model to yield $\textbf{h}_{cls}$ , the output state at the position of the $[\mathrm{CLS}]$ token. We then obtain the probability of adjacency between the sentences by a linear projection of $\textbf{h}_{cls}$ followed by a sigmoid activation:

$$P_{adj}(\textbf{s}) = \sigma(\textbf{w}_{disc}^{\top}\textbf{h}_{cls})$$

We define the training objective for the adjacency discriminator to minimize the negative log likelihood of predicting whether two sentences are adjacent or not:

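$$L_{disc} = -\Big(\delta_{adj}(\textbf{s})\cdot\log P_{adj}(\textbf{s}) + (1-\delta_{adj}(\textbf{s}))\cdot\log(1-P_{adj}(\textbf{s}))\Big),$$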

where $\delta_{adj}(\textbf{s})$ is an indicator function for whether the two sentences in s are adjacent. We note that while the discourse discriminators mainly focus on narrative structure, they may also capture context-aware aspects of factuality and content selection.
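To make the adjacency head concrete, the sketch below applies a linear projection and sigmoid to a stand-in for $\textbf{h}_{cls}$ and evaluates the negative log-likelihood of the adjacency label. It does not run SciBERT: the vectors are made-up numbers, and the function names and the clamping constant `eps` are assumptions for illustration.

```python
import math
from typing import List


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def adjacency_probability(h_cls: List[float], w_disc: List[float]) -> float:
    """P_adj(s) = sigmoid(w_disc^T h_cls)."""
    return sigmoid(sum(w * h for w, h in zip(w_disc, h_cls)))


def adjacency_loss(p_adj: float, is_adjacent: bool, eps: float = 1e-12) -> float:
    """Negative log-likelihood of the adjacency label (binary cross-entropy)."""
    p = min(max(p_adj, eps), 1.0 - eps)  # clamp to avoid log(0)
    return -math.log(p) if is_adjacent else -math.log(1.0 - p)


p = adjacency_probability(h_cls=[0.0, 0.0], w_disc=[1.0, 1.0])
loss = adjacency_loss(p, is_adjacent=True)
print(p, loss)
```

A zero projection gives a probability of 0.5, so the loss is log 2 whichever label is correct; the loss shrinks as the predicted probability moves toward the true label.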

3.2 Factuality and Faithfulness

To measure factuality of generated summaries, we predict which tokens in the summary are likely to belong to a fact-checking evidence span (i.e., a span of the text used to prove a scientific claim) using a fine-tuned BERT token classification model (see Appendix A.4 for details of the token classification model). Recent work has shown that inspecting attention weights alone is not necessarily a reliable metric for determining saliency of particular aspects in the input context to the output of neural models Serrano and Smith (2019). The saliency weights representing the likelihood of tokens belonging to evidence spans provide us with a more explicit representation of factual importance.

We obtain proxy saliency labels for the importance of a particular token $t$ appearing in an abstract using a BERT model trained on evidence spans annotated for scientific fact-checking Wadden et al. (2020). Specifically, if $t$ is not a stopword and $t\in E$, where $E$ is an evidence span used to check a scientific claim, then we assign a label of 1 to $t$. Otherwise, the label for $t$ is 0. Examples of extracted spans are given in Table 2.

We compare the predicted evidence spans against information presented in the original introduction to capture the degree to which generative models are hallucinating information.

Factuality Objective

At inference time, we compare the extracted salient spans, $F(g)$ , of the generated summary $g$ against the set of all ngrams in the article input context, $N(a)$ , measuring the degree to which salient spans are hallucinated:

[TABLE]
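One plausible way to realize this comparison is sketched below: it returns the fraction of salient spans $F(g)$ that occur verbatim among the article n-grams $N(a)$, so a lower value indicates more hallucinated spans. The ratio form, the treatment of an empty span set, and the function names are assumptions for illustration rather than the exact objective.

```python
from typing import List, Set, Tuple


def ngrams(tokens: List[str], max_n: int) -> Set[Tuple[str, ...]]:
    """All contiguous n-grams of length 1..max_n."""
    return {tuple(tokens[i:i + n])
            for n in range(1, max_n + 1)
            for i in range(len(tokens) - n + 1)}


def factuality_score(salient_spans: List[List[str]],
                     article_tokens: List[str]) -> float:
    """Fraction of salient spans found verbatim in the article (assumed form)."""
    if not salient_spans:
        return 1.0  # nothing salient to check (assumed convention)
    max_n = max(len(span) for span in salient_spans)
    article_ngrams = ngrams(article_tokens, max_n)
    supported = sum(tuple(span) in article_ngrams for span in salient_spans)
    return supported / len(salient_spans)


article = "we propose a neural srl model for chinese".split()
spans = [["neural", "srl", "model"], ["penn", "treebank"]]
print(factuality_score(spans, article))
```

In this toy case one of the two spans is supported by the article, so half of the salient content would count as hallucinated.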

4 Reranking with Discourse and Factuality Experts

To incorporate the discriminator objective into our summarization framework, we first generate a pool of candidate summaries from the base summarization model (§2) using any decoding strategy (e.g., beam search or top-$k$ sampling). Then, the discriminator is used to re-rank these candidates in conjunction with the original token-level generator scores. For example, in the case of the adjacency discriminator, we maximize the generator token-level probability of a candidate summary $g$, and the average of adjacency scores for the set of sentences composing $g$ (denoted $S(g)$), i.e., the probability of each sentence $s_{u}$ being adjacent to the previous sentence $s_{u-1}$ in $S(g)$:

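$$\mathrm{score}(g) = \lambda_{gen}\sum_{i=1}^{|g|}\log P(w_{i} \mid w_{0}, \ldots, w_{i-1}) + \lambda_{disc}\frac{1}{|S(g)|-1}\sum_{u=2}^{|S(g)|}\log P_{adj}(s_{u}, s_{u-1}),$$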

where $\lambda_{gen}$ and $\lambda_{disc}$ are hyper-parameters controlling the contribution of the generator and adjacency discriminator to the final predicted summary. The same procedure is followed for the other discourse and factuality objectives, replacing $P_{adj}(s_{u},s_{u-1})$ with the scores from these discriminators.
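The reranking step can be sketched as a weighted sum over precomputed scores. The discriminator term below is the mean log adjacency probability over consecutive sentence pairs; taking the generator term to be the summed token log-probabilities is an assumption about its exact form, and the candidate dictionaries and function names are hypothetical.

```python
import math
from typing import List, Sequence


def rerank_score(token_logprobs: Sequence[float],
                 adjacency_probs: Sequence[float],
                 lambda_gen: float = 0.5,
                 lambda_disc: float = 0.5) -> float:
    """Weighted sum of a generator term and the mean adjacency log-probability."""
    gen_term = sum(token_logprobs)  # assumed form of the generator score
    # adjacency_probs holds one P_adj per consecutive sentence pair, i.e. |S(g)| - 1 values
    disc_term = (sum(math.log(p) for p in adjacency_probs) / len(adjacency_probs)
                 if adjacency_probs else 0.0)
    return lambda_gen * gen_term + lambda_disc * disc_term


def pick_best(candidates: List[dict]) -> dict:
    """Return the candidate with the highest combined score."""
    return max(candidates,
               key=lambda c: rerank_score(c["token_logprobs"], c["adjacency_probs"]))


pool = [
    {"name": "A", "token_logprobs": [-1.0, -1.0], "adjacency_probs": [0.9]},
    {"name": "B", "token_logprobs": [-0.9, -0.9], "adjacency_probs": [0.2]},
]
print(pick_best(pool)["name"])
```

Candidate B is slightly more likely under the generator, but its sentences are judged unlikely to be adjacent, so the combined score prefers A. Swapping in coverage, ordering, or factuality scores would only change the discriminator term.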

5 Data

5.1 Datasets

Since the focus of this work is on generating summaries with more coherent narrative flow and greater factual consistency, we concentrate on datasets requiring discourse structure to generate good summaries. Particular attributes of the discourse structure of these datasets include:

• Length of summaries $\rightarrow$ Are the summaries long enough to clearly show narrative flow properties and factual correctness?

• Abstractiveness of gold summaries $\rightarrow$ Do the summaries exhibit particular sentence-level flow, or are the summary sentences extracted highlights from the context?

ArXiv

We crawled over 700K samples (472K abstracts) from scientific articles on arxiv.org. In our experiments we primarily focus on the CS (https://arxiv.org/corr) and Bio (https://arxiv.org/archive/q-bio) domain subsets. The task we define is to generate an abstract given an introduction, which presents a challenge to existing summarization models. This task also requires models to learn relevant domain knowledge for the scientific domain of interest and recognize common discourse structure for papers written in that domain.

AAN

Additionally, we include an existing dataset of scientific articles that focuses on papers in the NLP computer science domain. This dataset consists of a 12k paper subset from the ACL Anthology Network (AAN; Radev et al., 2009) with extracted introduction and abstract pairs.

Scientific abstracts in ArXiv and AAN have properties that are missing from existing summarization datasets based on newswire data. For example, XSum Narayan et al. (2018) and Newsroom Grusky et al. (2019) summaries are generally too short to exhibit cross-sentence narrative flow. Meanwhile, CNN/DailyMail Hermann et al. (2015) summaries are acquired by concatenating extracted highlights, which can be unrelated. Conversely, ArXiv and AAN abstracts are long enough to have multiple sentences (see Appendix A.6 for a comparison of datasets) and generally exhibit strong discourse patterns typical of scientific writing, making them ideal corpora for assessing discourse understanding in abstractive summarization. Table 3 provides details of dataset splits.

6 Experimental Setup

Our implementation is based on the Huggingface implementation (https://github.com/huggingface/transformers) of the BERT (Devlin et al., 2019) and GPT-2 language models (Radford et al., 2019).

Generator

We perform WordPiece tokenization for the input context and output summaries. Because of the fixed input size of the transformer language model, the input context is truncated to a maximum of 800 tokens, and summaries are truncated to a maximum of 200 tokens. We use a learning rate of 2e-5 and a batch size of 16 to finetune the generator. We train the base summarization transformer model for 12 epochs. All experiments are run on either a Titan-X or Quadro RTX 8000 GPU. Training time for the AAN and ArXiv Bio datasets is about 30 minutes per epoch. Training time for the ArXiv CS dataset is 2.5 hours per epoch. In our experiments we use top-$k$ sampling with $k=4$ Fan et al. (2018) to generate candidate summaries for each model.
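For readers unfamiliar with top-$k$ sampling, a single decoding step can be sketched as follows: keep the $k$ highest-scoring tokens, renormalize, and sample from them. The function name `top_k_sample` and the toy logits are assumptions; a real decoder would repeat this step token by token over the model's vocabulary.

```python
import math
import random
from typing import List, Optional


def top_k_sample(logits: List[float], k: int = 4,
                 rng: Optional[random.Random] = None) -> int:
    """Sample a token index from the k highest-scoring logits."""
    rng = rng or random.Random()
    # Indices of the k largest logits.
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    m = max(logits[i] for i in top)
    # Unnormalized softmax weights restricted to the top-k tokens.
    weights = [math.exp(logits[i] - m) for i in top]
    return rng.choices(top, weights=weights, k=1)[0]


rng = random.Random(0)
samples = [top_k_sample([5.0, 4.0, 3.0, 2.0, -10.0, -20.0], k=4, rng=rng)
           for _ in range(200)]
print(sorted(set(samples)))
```

Sampling several times from the same input yields the pool of distinct candidate summaries that the discriminator later reranks.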

Discriminator

At training time we use a maximum sentence length of 200 tokens to accommodate the fixed input size of BERT (512 tokens), reduce inference time, and discourage the model from generating abnormally long run-on sentences that indicate the presence of coherence issues. See the original papers for details of training the SciFact and abstract discourse models.

For the adjacency discourse models, we fine-tune the discriminator using a learning rate of 2e-5, a linear warmup learning rate schedule, and a batch size of 32. All adjacency discourse discriminator models are fine-tuned for 2 epochs on a Titan-X GPU. The adjacency discriminator models are adapted from the Huggingface implementation of the BERT next sentence prediction classifier. We initialize the 12-layer BERT-base discriminator model with the pretrained weights of the SciBERT-uncased model, which was originally trained on 1.14 million scientific papers Beltagy et al. (2019). Two discriminators are trained: one is fine-tuned on AAN for decoding both ArXiv CS and AAN, while the other discriminator is fine-tuned on ArXiv Bio and used exclusively for decoding that subset. We weigh the generation and discriminator models equally when decoding by setting $\lambda_{gen} = \lambda_{disc} = 0.5$. Additional implementation details are provided in Appendices A.3 and A.4. Our code and data are released at https://github.com/skgabriel/coopnet.

7 Experiments

We compare against extractive approaches using the Lede-3 and LexRank Erkan and Radev (2004) baselines. We also compare against two abstractive approaches: a 2-layer bi-LSTM sequence-to-sequence model with attention (LSTM), and a pointer-generator model (PGen; See et al., 2017). Training details of the supervised baselines can be found in Appendix A.2. In addition, we compare to a subset of our approach that only uses the generator to produce summaries, rather than the full framework.

7.1 Automatic Evaluation

Following previous work on summarization, we use the ROUGE metric Lin (2004) for automatic evaluation of generative models and Co-opNet. Specifically, we report ROUGE-1, ROUGE-2 and ROUGE-L F1 scores. To capture similarity in contextual meaning, we look at BERTScore F1 Zhang et al. (2020a), which has been shown to more closely correlate with human judgements than other generation metrics.
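As a reminder of what the ROUGE-N F1 numbers measure, the sketch below computes clipped n-gram overlap between a candidate and a reference. It is a simplified illustration under the assumption of pre-tokenized input; the official ROUGE toolkit adds options such as stemming, and ROUGE-L (longest common subsequence) is not covered here.

```python
from collections import Counter
from typing import List


def rouge_n_f1(candidate: List[str], reference: List[str], n: int = 1) -> float:
    """F1 of clipped n-gram overlap between candidate and reference."""
    def grams(tokens: List[str]) -> Counter:
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

    cand, ref = grams(candidate), grams(reference)
    overlap = sum((cand & ref).values())  # each n-gram counted at most as often as in either side
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


cand = "the cat sat".split()
ref = "the cat sat down".split()
print(rouge_n_f1(cand, ref, n=1), rouge_n_f1(cand, ref, n=2))
```

Because the score depends only on surface overlap, a faithful paraphrase can score lower than a copied sentence, which is one reason a contextual metric such as BERTScore is reported alongside it.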

Results on the AAN, CS and Bio subsets of ArXiv are shown in Table 4. Co-opNet outperforms all baselines on ROUGE-1 and ROUGE-L by a consistent margin. Notably, Co-opNet’s performance is superior to the generator-only model, illustrating the importance of the discriminators for generating more coherent summaries. Interestingly, on the more domain-specific AAN subset, our model is over 12% better on ROUGE-L compared to the PGen baseline and 5.86% better than the best extractive model. Our model also outperforms the strongest baselines on BERTScore.

When we break down results for various Co-opNet architectures (see Table 6), we find that the factuality and discourse role discriminators lead to the best performance in terms of ROUGE scores, while the adjacency discriminator scores lower on ROUGE than the base generator. However, as shown by Table 5, the adjacency discriminator outperforms the base generator when we consider BERTScore, a more contextual evaluation metric, indicating that this generator-discriminator combination selects summaries that capture the same linguistic patterns and meaning as reference summaries without directly copying.

7.2 Human Evaluation

Since coherence of generated text is difficult to measure with automatic metrics Kilickaya et al. (2017); Sun et al. (2019); Clark et al. (2019), we conduct human evaluations to assess how the discriminator affects generation quality using pairwise model comparisons.

Setup

We use four key criteria in all evaluations – abstractiveness, coherence, factuality and best overall quality, which we define as follows:

• Abstractiveness $\rightarrow$ Which abstract rewords information from the introduction instead of directly copying from the introduction?

• Coherence $\rightarrow$ Which abstract is more structured, and presents a complete and coherent story about the work done in the paper?

• Factuality $\rightarrow$ Which abstract is more factually consistent, presenting the same information that appears in the introduction and not producing hallucinated information?

• Overall $\rightarrow$ Which abstract is better overall?

We conduct human evaluations on Amazon Mechanical Turk (AMT) considering 4 different abstractive baseline model variants over 100 randomly sampled AAN test set examples. Given a gold introduction, AMT evaluators are asked to compare a corresponding abstract generated from Co-opNet against an abstract generated by a baseline or our generator model. To reduce bias, the ordering of generated abstracts is randomized and evaluators are not told that abstracts are machine-generated.

Each abstract pair is judged by three unique annotators. For each criterion, we filter to 50 abstracts based on the amount of time AMT workers spent ($\geq$ 20 seconds) and inter-annotator agreement (at least $\frac{2}{3}$ of annotators should agree on which abstract is best). We also prime annotators to consider subtler aspects of discourse coherence by providing examples that capture good or bad narrative flow without complete text degeneration.
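The filtering rule can be sketched as a small function over the three judgements for one abstract pair. Requiring every annotator to meet the time threshold (rather than, say, the average) is an assumption, as are the data layout and the name `majority_label`.

```python
from collections import Counter
from typing import List, Optional, Tuple

Judgement = Tuple[str, float]  # (preferred system or "tie", seconds spent)


def majority_label(judgements: List[Judgement],
                   min_seconds: float = 20.0,
                   min_agreement: float = 2 / 3) -> Optional[str]:
    """Return the agreed label for one abstract pair, or None if it is filtered out."""
    if not judgements:
        return None
    # Drop the pair if any annotator answered too quickly (assumed reading of the rule).
    if any(seconds < min_seconds for _, seconds in judgements):
        return None
    label, count = Counter(choice for choice, _ in judgements).most_common(1)[0]
    return label if count / len(judgements) >= min_agreement else None


print(majority_label([("coopnet", 35.0), ("coopnet", 41.0), ("pgen", 28.0)]))
print(majority_label([("coopnet", 35.0), ("pgen", 41.0), ("tie", 28.0)]))
```

The percentages reported for each criterion would then be computed over the pairs that survive this filter.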

We test the Co-opNet framework using both the factuality and adjacency discriminators, as these are the highest and lowest performing discriminator architectures in terms of automatic metrics on the AAN domain. We allow for ties, as Co-opNet and the generator baseline sometimes assign the highest probability to the same abstract, or generated abstracts in the candidate pool are high quality enough that there is little room for improvement.

Results

We find that Co-opNet is preferred across all criteria for all comparisons when we use the adjacency discriminator (see Table 7). When using the factuality discriminator, Co-opNet is superior to baselines in all cases except when compared on abstractiveness to the PGen model.

In particular, human evaluators prefer Co-opNet with the adjacency discriminator over baselines by over 8% on the coherence metric and 18.12% compared to PGen on overall quality. Notably, the adjacency discriminator encourages more abstractiveness in generated abstracts while still maintaining higher levels of factual consistency. We also find that Co-opNet with the factuality discriminator improves coherence and overall quality in addition to factuality. However, Co-opNet generations with the factuality discriminator were found to be more extractive than abstracts generated by PGen.

As shown in Table 8, generations selected by the adjacency discriminator more closely match the distribution of abstracts, while the generator sometimes favors copying from the introduction at the loss of narrative structure. For example, the generator will select a summary that opens with “we present a method for jointly solving penn treebank style empty category (e.g. figure 1)…”, while the adjacency discriminator selects a summary that opens with “we present a method to jointly solve the problem of empty categories…” and does not refer to a particular figure. Both summaries are faithful to the introduction, but the discriminator-selected summary makes more sense in the context of a paper abstract.

8 Related Work

Narrative Flow and Factuality

Modeling coherent narrative flow remains a major challenge in the field of text generation, due to the need for accurate understanding of narrative structure Christensen et al. (2013); Nikolov et al. (2018); Holtzman et al. (2018); Qin et al. (2019); Koncel-Kedziorski et al. (2019); Gabriel et al. (2021). Early approaches to incorporating structure include integration of explicit discourse markers into automatic summarization Alonso i Alemany and Fuentes Fort (2003). Recently proposed solutions include global tracking of entities Kiddon et al. (2016); Bosselut et al. (2018); Mei et al. (2016), as well as discourse-aware attention Cohan et al. (2018). While there has been prior work on factual consistency Cao et al. (2018); Gao et al. (2019); Kryściński et al. (2020); Zhang et al. (2020b), these works did not focus on scientific paper summarization.

Neural Abstractive Summarization

In the past, abstractive summarization models Rush et al. (2015); Gehrmann et al. (2018) have relied upon seq2seq encoder-decoder architectures Sutskever et al. (2014); Narayan et al. (2018); Celikyilmaz et al. (2018). Transformer models have emerged as a promising architecture for text generation and summarization Liu et al. (2018); Hoang et al. (2019); Khandelwal et al. (2019); Zhang et al. (2019). While our model builds upon this work, it is, to our knowledge, the first transformer summarization framework to explicitly model narrative flow and scientific fact-checking across domains.

9 Conclusion

In this work, we introduced Cooperative Generator-Discriminator Networks, a framework for more coherent natural language generation with transformer language models through the integration of discriminators that encourage proper narrative flow and factual consistency. Through our analyses over scientific papers from ArXiv and AAN, we empirically showed that our framework selects generations that are more relevant and narratively coherent than previous approaches.

Acknowledgments

We thank the anonymous reviewers, as well as Lianhui (Karen) Qin, Jungo Kasai, Rik Koncel-Kedziorski, Elizabeth Clark, Dave Wadden and Rowan Zellers for helpful feedback. We also thank the annotators who contributed to the human evaluations in this work. This research was supported in part by NSF (IIS-1524371), DARPA CwC through ARO (W911NF-15-1-0543), and Samsung AI Research.

Appendix A Appendices

A.1 Additional Implementation Details

A.2 Baselines

For the sequence-to-sequence RNN model, a bi-LSTM is used to encode a given source article $a$ and a separate decoder LSTM produces the generated summary $g$ . At each decoding time step, the decoder attends to all the context vectors produced by the encoder as well as the maintained state from the previous decoder tokens to produce the next token in the summary.

The Pointer-Generator (PGEN + Cov) model extends the base LSTM model (LSTM + Cov) to allow tokens to be copied from the input during generation. Baselines are trained for up to 40000 steps with a batch size of 16. Following previous work, we decode from these baselines using beam search with a beam size of 4.

A.3 Generator Model

We use the 345M parameter GPT-2 model. The model is trained to minimize the negative log likelihood of the next word $w_{i}$ given all preceding words:

$$\mathcal{L}_{gen}=-\sum_{i=1}^{|a|+|s|}\log P(w_{i}\mid w_{0},\ldots,w_{i-1})$$

where $w_{i}$ is the $i^{th}$ token of our full input vector $X$ , $a$ is our article and $s$ is our summary. At test time, $X$ only consists of the gold article and delimiter token $(a_{1},...,a_{|a|},[\mathrm{SEP}]$ ) and we decode generated summaries $g$ starting from this input.
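As a rough illustration of the objective and input format described above, the sketch below builds the input sequence and sums the negative log probabilities of each next token. The helper names `build_input` and `negative_log_likelihood` are hypothetical, and the training-time layout (article, delimiter, then summary) is an assumption inferred from the test-time description; the probabilities would come from the language model, which is not shown.

```python
import math

SEP = "[SEP]"  # delimiter token named in the text


def build_input(article_tokens, summary_tokens=None):
    """Build X. At test time only the article and the delimiter are given."""
    x = list(article_tokens) + [SEP]
    if summary_tokens is not None:
        # Assumption: during training the gold summary follows the delimiter.
        x += list(summary_tokens)
    return x


def negative_log_likelihood(next_token_probs):
    """Sum of -log P(w_i | preceding words) over the sequence.

    next_token_probs: model probability assigned to each observed token.
    """
    return -sum(math.log(p) for p in next_token_probs)
```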

During generation, we filter candidate summaries from the hypothesis generation pool that contain sentences longer than a fixed max length of 200 tokens, a clear sign of coherence deterioration. We use a candidate pool size of 30 for ATLAS and 20 for AAN.
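The length filter can be sketched as below. This is only an approximation of the described step: the regex sentence splitter and whitespace tokenization are stand-ins chosen for the example (the actual system presumably counts model tokens), and `filter_candidates` is a hypothetical name.

```python
import re

MAX_SENTENCE_TOKENS = 200  # fixed max sentence length, as stated in the text


def filter_candidates(candidates, max_len=MAX_SENTENCE_TOKENS):
    """Drop candidate summaries that contain any sentence longer than max_len tokens."""
    kept = []
    for summary in candidates:
        # Assumption: naive sentence split on ., ! or ? followed by whitespace.
        sentences = [s for s in re.split(r"(?<=[.!?])\s+", summary) if s.strip()]
        # Assumption: whitespace-separated words approximate tokens.
        if all(len(s.split()) <= max_len for s in sentences):
            kept.append(summary)
    return kept
```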

A.4 Discriminator Training

Factuality Discriminator Details

For the token-level classification model, we use the BERT base model with binary labels for whether or not a token should be included in a salient span. We predict for all spans in an abstract at once.

Order Discriminator Details

We set the max length of summaries considered by the order discriminator to be 10 sentences, truncating longer summaries. Given the max length of a summary, we have a fixed number of orderings $|O|$ that can be scored. We calculate the final score from the order discriminator based on the unnormalized sum of scores from these orderings, $S$, and the following function $f_{n}$:

[TABLE]

Sentence Selection for Discriminator Models

To train an adjacency discriminator model, we use a subset of adversarial and positive sentence pair examples extracted from the training set. The sentence pairs are extracted from gold abstracts containing at least five sentences using the following approach: For a randomly selected sentence $s_{u}$ from the abstract, we randomly select an adjacent sentence, $s_{u-1}$ or $s_{u+1}$ , as a positive example and any nonadjacent sentence $s_{v\notin[u-1,u,u+1]}$ as a negative example.
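The sampling procedure above can be sketched as follows. The function name `sample_adjacency_pairs`, the `(anchor, other, label)` tuple layout, and the optional `rng` argument are illustrative assumptions; the selection logic follows the description in the text.

```python
import random


def sample_adjacency_pairs(sentences, rng=None):
    """Sample one positive and one negative sentence pair from a gold abstract.

    Returns ((s_u, s_adjacent, 1), (s_u, s_nonadjacent, 0)),
    or None if the abstract has fewer than five sentences.
    """
    rng = rng or random.Random()
    n = len(sentences)
    if n < 5:
        return None
    u = rng.randrange(n)  # randomly selected sentence s_u
    # Positive example: s_{u-1} or s_{u+1}, whichever exists.
    neighbors = [i for i in (u - 1, u + 1) if 0 <= i < n]
    pos = rng.choice(neighbors)
    # Negative example: any sentence outside {u-1, u, u+1}.
    non_adjacent = [v for v in range(n) if abs(v - u) > 1]
    neg = rng.choice(non_adjacent)
    return (sentences[u], sentences[pos], 1), (sentences[u], sentences[neg], 0)
```

Requiring at least five sentences guarantees that a nonadjacent sentence exists for every choice of $s_{u}$, including a sentence in the middle of the abstract.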

Discriminator Performance

We measure the performance of discriminator models using recall, precision, accuracy and F1. Table 9 provides summary statistics of discriminator performance on the various discourse and factuality objectives. Discourse-Adj denotes the adjacency discriminators, while Discourse-Abs denotes the discourse role label prediction model Cohan et al. (2019) and Factuality denotes the token saliency prediction model.
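For reference, the four metrics can be computed from binary gold and predicted labels as in this minimal sketch. The function name `binary_metrics` and the convention of returning 0.0 when a denominator is zero are assumptions for the example, not details from the paper.

```python
def binary_metrics(gold, pred):
    """Compute precision, recall, F1 and accuracy for binary labels (1 = positive)."""
    pairs = list(zip(gold, pred))
    tp = sum(1 for g, p in pairs if g == 1 and p == 1)
    fp = sum(1 for g, p in pairs if g == 0 and p == 1)
    fn = sum(1 for g, p in pairs if g == 1 and p == 0)
    tn = sum(1 for g, p in pairs if g == 0 and p == 0)
    # Assumption: undefined ratios are reported as 0.0.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    return {"precision": precision, "recall": recall, "f1": f1, "accuracy": accuracy}
```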

A.5 Details on Model Performance

Automatic results for Co-opNet selection were given using a context size of 800 tokens for the input, while a context size of 800 characters was used to select Co-opNet summaries for the human evaluation. The automatic results for the summaries used in the human evaluation were lower than the ones using the longer context size. Using a smaller context size leads to faster and more memory-efficient Co-opNet selection, but slightly lower overall automatic performance (while maintaining the same ordering in terms of highest and lowest scores for Co-opNet variants on ROUGE).

A.6 Comparison of Datasets

We removed duplicates and articles without abstracts from AAN. From this subset, we extract introduction and abstract pairs.

A.7 Additional Analysis

Comparison with Gold Summaries

To obtain an upper-bound comparison for the human evaluation and verify the effectiveness of our human evaluation pipeline for judging the quality of abstracts, we used the same intro-abstract pairs and MTurk annotation framework as the model comparison to conduct a Turing-style evaluation. In this evaluation, we presented a Co-opNet (adj) generated abstract and a gold abstract to the annotators in a random ordering without noting whether either of the abstracts was human-written or machine-generated. We found that annotators consistently selected the gold abstract over the machine-generated abstract when considering factuality and coherence, though they found the machine-generated abstracts to be slightly more abstractive. We provide the results for this full evaluation in Table 11.

Bibliography

1. Alonso i Alemany and Fuentes Fort (2003): Laura Alonso i Alemany and Maria Fuentes Fort. 2003. Cohesion and coherence for automatic summarization. In Student Research Workshop.
2. Allahyari et al. (2017): Mehdi Allahyari, Seyedamin Pouriyeh, Mehdi Assefi, Saeid Safaei, Elizabeth D. Trippe, Juan B. Gutierrez, and Krys Kochut. 2017. Text summarization techniques: A brief survey. International Journal of Advanced Computer Science and Applications, 8(10).
3. Beltagy et al. (2019): Iz Beltagy, Arman Cohan, and Kyle Lo. 2019. Scibert: Pretrained contextualized embeddings for scientific text. In EMNLP.
4. Bosselut et al. (2018): Antoine Bosselut, Asli Çelikyilmaz, Xiaodong He, Jianfeng Gao, Po-Sen Huang, and Yejin Choi. 2018. Discourse-aware neural rewards for coherent text generation. In NAACL-HLT.
5. Cao et al. (2018): Ziqiang Cao, Furu Wei, Wenjie Li, and Sujian Li. 2018. Faithful to the original: Fact aware neural abstractive summarization. In AAAI.
6. Celikyilmaz et al. (2018): Asli Celikyilmaz, Antoine Bosselut, Xiaodong He, and Yejin Choi. 2018. Deep communicating agents for abstractive summarization. In NAACL-HLT.
7. Christensen et al. (2013): Janara Christensen, Mausam, Stephen Soderland, and Oren Etzioni. 2013. Towards coherent multi-document summarization. In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1163–1173, Atlanta, Georgia. Association for Computational Linguistics.
8. Clark et al. (2019): Elizabeth Clark, Asli Celikyilmaz, and Noah A. Smith. 2019. Sentence mover’s similarity: Automatic evaluation for multi-sentence texts. In ACL.