What is this Article about? Extreme Summarization with Topic-aware Convolutional Neural Networks

Shashi Narayan; Shay B. Cohen; Mirella Lapata

arXiv:1907.08722·cs.CL·July 23, 2019

TL;DR

This paper introduces a new extreme summarization task that generates concise, one-sentence news summaries using a novel CNN-based abstractive model conditioned on article topics, outperforming existing methods.

Contribution

It presents a large-scale BBC dataset for extreme summarization and a novel CNN-based model that effectively captures long-range dependencies for abstractive summarization.

Findings

01

The model outperforms extractive and state-of-the-art abstractive methods.

02

It effectively captures long-range dependencies in documents.

03

Human evaluations favor the proposed approach.

Abstract

We introduce “extreme summarization”, a new single-document summarization task which aims at creating a short, one-sentence news summary answering the question “What is the article about?”. We argue that extreme summarization, by nature, is not amenable to extractive strategies and requires an abstractive modeling approach. In the hope of driving research on this task further: (a) we collect a real-world, large scale dataset by harvesting online articles from the British Broadcasting Corporation (BBC); and (b) propose a novel abstractive model which is conditioned on the article's topics and based entirely on convolutional neural networks. We demonstrate experimentally that this architecture captures long-range dependencies in a document and recognizes pertinent content, outperforming an oracle extractive system and state-of-the-art abstractive approaches when evaluated automatically…

Tables (11)

Table 1: Comparison of XSum with benchmark summarization datasets: CNN and DailyMail datasets (?), NY Times (?), and Newsroom (?). We present the full Newsroom dataset (Newsroom) and its three subsets: mostly extractive (Newsroom-Ext), mostly abstractive (Newsroom-Abs), and mixed (Newsroom-Mixed). We report corpus size, i.e., the number of documents in the training, validation, and test sets.

Datasets	Corpus Size (# docs)
Datasets	training	validation	test
CNN	90,266	1,220	1,093
DailyMail	196,961	12,148	10,397
NY Times	589,284	32,736	32,739
Newsroom	992,966	108,591	108,650
Newsroom-Mixed	328,634	35,879	36,006
Newsroom-Ext	331,778	36,332	36,122
Newsroom-Abs	332,554	36,380	36,522
XSum	204,045	11,332	11,334

Table 2: We compare datasets with respect to average document (source) and summary (target) length (in terms of words and sentences), and vocabulary size on both source and target. See main text for steps taken to split and pre-process these datasets. For the vocabulary, we lower case tokens.

Datasets	avg. document length		avg. summary length		vocabulary size
Datasets	words	sentences	words	sentences	document	summary
CNN	760.50	33.98	45.70	3.59	343,516	89,051
DailyMail	653.33	29.33	54.65	3.86	563,663	179,966
NY Times	800.04	35.55	45.54	2.44	1,399,358	294,011
Newsroom	770.09	34.73	30.36	1.43	2,646,681	360,290
Newsroom-Mixed	830.58	36.63	23.78	1.17	1,271,435	169,875
Newsroom-Ext	706.06	31.65	45.78	1.88	1,214,748	243,062
Newsroom-Abs	774.17	35.92	21.49	1.25	1,385,205	157,939
XSum	431.07	19.77	23.26	1.00	399,147	81,092

Table 3: Proportion of novel $n$-grams in gold summaries for CNN, DailyMail, NY Times, Newsroom, and XSum datasets. All results are computed on the test set.

Datasets	% of novel n-grams in gold summary
Datasets	unigrams	bigrams	trigrams	4-grams
CNN	16.75	54.33	72.42	80.37
DailyMail	17.03	53.78	72.14	80.28
NY Times	22.64	55.59	71.93	80.16
Newsroom	18.31	46.80	58.06	62.72
Newsroom-Mixed	13.78	48.37	67.15	77.11
Newsroom-Ext	2.65	7.25	10.25	12.42
Newsroom-Abs	38.25	84.36	96.39	98.28
XSum	35.76	83.45	95.50	98.49

Table 4: Performance of extractive baselines on CNN, DailyMail, NY Times, Newsroom, and XSum datasets. We report ROUGE scores for the lead baseline and ext-oracle, the extractive oracle system. All results are computed on the test set.

Datasets	lead			ext-oracle
Datasets	R1	R2	RL	R1	R2	RL
CNN	29.15	11.13	25.95	50.38	28.55	46.58
DailyMail	40.68	18.36	37.25	55.12	30.55	51.24
NY Times	31.85	15.86	23.75	52.08	31.59	46.72
Newsroom	33.04	22.35	30.31	57.09	42.94	53.65
Newsroom-Mixed	27.95	13.87	23.97	51.98	34.04	46.96
Newsroom-Ext	55.87	50.60	54.76	89.63	87.20	89.32
Newsroom-Abs	15.44	2.72	12.32	29.85	7.82	24.86
XSum	16.30	1.61	11.95	29.79	8.81	22.65

Table 5: Example topics learned by an LDA model on XSum and Newsroom documents (training portion).

XSum documents
T1:	murder, charge, court, police, arrest, guilty, sentence, boy, bail, space, crown, trial
T2:	abuse, church, bishop, child, catholic, gay, pope, school, christian, priest, cardinal
T3:	council, people, government, local, housing, home, house, property, city, plan, authority
T4:	party, clinton, trump, climate, poll, vote, plaid, election, debate, change, candidate, campaign
T5:	country, growth, report, business, export, fall, bank, security, economy, rise, global, inflation
T6:	hospital, patient, trust, nhs, people, care, health, service, staff, report, review, system, child
Newsroom Abstractive documents
T1:	fund, investment, firm, asset, capital, financial, corporate, management, return, profit, equity
T2:	building, design, build, square, space, office, architect, center, architecture, interior, project
T3:	award, parade, beverly, actress, annual, hills, star, red, hollywood, carpet, premiere, pose
T4:	company, business, customer, industry, consumer, service, product, revenue, fortune, startup
T5:	military, force, afghanistan, government, security, troops, war, country, taliban, attack, army
T6:	party, government, minister, leader, political, prime, election, vote, power, country, parliament

Table 6: ROUGE results on XSum test set. We report ROUGE-1 (R1), ROUGE-2 (R2), and ROUGE-L (RL) F1 scores. Extractive systems are in the upper block, RNN-based abstractive models are in the middle block, and convolutional systems are in the bottom block.

Models	R1	R2	RL
Random	15.16	1.78	11.27
lead	16.30	1.60	11.95
ext-oracle	29.79	8.81	22.66
Seq2Seq	28.42	8.77	22.48
PtGen	29.70	9.21	23.24
PtGen+Covg	28.10	8.02	21.72
ConvS2S	31.27	11.07	25.23
ConvS2S+Copy	29.80	10.10	24.10
T-ConvS2S (enc $_{t^{'}}$ )	31.71	11.38	25.56
T-ConvS2S (enc $_{(t^{'}, t_{D})}$ )	31.61	11.30	25.51
T-ConvS2S (enc $_{t^{'}}$ , dec $_{t_{D}}$ )	31.71	11.34	25.61
T-ConvS2S (enc $_{(t^{'}, t_{D})}$ , dec $_{t_{D}}$ )	31.89	11.54	25.75

Table 7: Proportion of novel $n$-grams in summaries generated by various models on the XSum test set.

Models	% of novel n-grams in generated summaries
Models	unigrams	bigrams	trigrams	4-grams
Seq2Seq	36.66	82.17	95.58	98.63
PtGen	27.40	73.33	90.43	96.04
PtGen+Covg	25.71	70.76	88.87	95.24
ConvS2S	31.28	79.50	94.28	98.10
ConvS2S+Copy	32.30	79.56	94.46	98.21
T-ConvS2S	30.73	79.18	94.10	98.03
gold	35.76	83.45	95.50	98.49

Table 8: System ranking according to human judgments and QA-based evaluation for the XSum dataset.

Models	Score	QA
ext-oracle	-0.121	15.70
PtGen	-0.218	21.40
ConvS2S	-0.130	30.90
T-ConvS2S	0.037	46.05
gold	0.431	97.23

Table 9: Results on Newsroom-Abs test set. We report ROUGE-1 (R1), ROUGE-2 (R2), and ROUGE-L (RL) F1 scores. Extractive systems are in the upper block, RNN-based abstractive systems are in the second block, and our convolutional abstractive systems are in the third block.

Models	R1	R2	RL
random	13.02	1.51	10.46
lead (?)	13.76	2.42	11.29
lead	15.44	2.72	12.32
ext-oracle	29.85	7.82	24.86
PtGen (?)	14.71	2.27	11.48
Seq2Seq	15.23	4.21	12.88
PtGen	17.61	5.15	14.73
PtGen+Covg	16.13	4.33	13.47
ConvS2S	16.77	5.57	14.54
ConvS2S+Copy	16.31	5.34	14.24
T-ConvS2S	16.97	5.56	14.70

Table 10: System ranking according to human judgments and QA-based evaluation for the Newsroom-Abs dataset.

Models	Score	QA
ext-oracle	0.473	41.31
PtGen	-0.047	20.98
ConvS2S	-0.397	11.97
T-ConvS2S	-0.160	23.28
gold	0.130	90.98

Table 11: XSum and Newsroom-Abs summaries and their informativeness. The middle columns show the percentage of times a summary was judged “Informative”, “Partially Informative”, or “Uninformative.” The last column shows the informativeness score for each dataset (higher is better).

Dataset	Informative	Part. Informative	Uninformative	Score
XSum	68.00	26.00	6.00	2.62
Newsroom-Abs	48.67	33.33	18.00	2.30

Equations (16)

e_{i} = [(x_{i} + p_{i}); (t^{\prime}_{i} \otimes t_{D})] \in \mathbb{R}^{f + f^{\prime}},

g_{i} = [(x^{\prime}_{i} + p^{\prime}_{i}); t_{D}] \in \mathbb{R}^{f + f^{\prime}},

a_{ij}^{\ell} = \frac{\exp(d_{i}^{\ell} \cdot z_{j}^{u})}{\sum_{t=1}^{m} \exp(d_{i}^{\ell} \cdot z_{t}^{u})},

c_{i}^{\ell} = \sum_{j=1}^{m} a_{ij}^{\ell} (z_{j}^{u} + e_{j}).

p(y_{i+1} \mid y_{1}, \ldots, y_{i}, D, t_{D}, t^{\prime}) = \mathrm{softmax}(W_{o} h_{i}^{L} + b_{o}) \in \mathbb{R}^{T}

L(\theta) = \sum_{i=0}^{n-1} \log p(y_{i+1}^{*} \mid y_{1}^{*}, \ldots, y_{i}^{*}, D, t_{D}, t^{\prime}, \theta)

p_{\mathrm{gen}} = \sigma(w_{h} h_{i}^{L} + w_{c} c_{i}^{L} + w_{g} g_{i} + b_{\mathrm{gen}})

p^{\prime}(w) = p_{\mathrm{gen}} \, p(w)^{\alpha} + (1 - p_{\mathrm{gen}}) \sum_{j : w_{j} = w} (a_{ij}^{L})^{\beta},

Code & Models

Repositories

shashiongithub/XSum (PyTorch, official)

Full text

What is this Article about? Extreme Summarization with Topic-aware Convolutional Neural Networks

Shashi Narayan
Google Research

Shay B. Cohen, Mirella Lapata
Institute for Language, Cognition and Computation, School of Informatics, University of Edinburgh

The work was primarily done while Shashi Narayan was still at the School of Informatics, University of Edinburgh.

Abstract

We introduce extreme summarization, a new single-document summarization task which aims at creating a short, one-sentence news summary answering the question “What is the article about?”. We argue that extreme summarization, by nature, is not amenable to extractive strategies and requires an abstractive modeling approach. In the hope of driving research on this task further: (a) we collect a real-world, large scale dataset by harvesting online articles from the British Broadcasting Corporation (BBC); and (b) propose a novel abstractive model which is conditioned on the article’s topics and based entirely on convolutional neural networks. We demonstrate experimentally that this architecture captures long-range dependencies in a document and recognizes pertinent content, outperforming an oracle extractive system and state-of-the-art abstractive approaches when evaluated automatically and by humans on the extreme summarization dataset. (Footnote 1: Our dataset, code, and demo are available at: https://github.com/shashiongithub/XSum.)

1 Introduction

Automatic summarization is one of the central problems in Natural Language Processing (NLP) posing several challenges relating to understanding (i.e., identifying important content) and generation (i.e., aggregating and rewording the identified content into a summary). Of the many summarization paradigms that have been identified over the years (see ? and ? for comprehensive overviews) single-document summarization has consistently garnered attention.

Modern approaches to single document summarization are data-driven, taking advantage of the success of neural network architectures and their ability to learn continuous features without recourse to preprocessing tools or linguistic annotations (?, ?, ?, ?, ?, ?, ?, ?). The application of neural networks to the summarization task has motivated the development of large-scale datasets containing hundreds of thousands of (news) document-summary pairs (?, ?, ?). However, these datasets often favor extractive models which create a summary by identifying (and subsequently concatenating) the most important sentences in a document (?, ?, ?). Abstractive approaches are more faithful to the actual summarization task: professional editors employ various rewrite operations to transform article sentences into a summary, including compression, aggregation, and paraphrasing (?), aside from writing sentences from scratch. Despite this, abstractive systems either lag behind extractive ones or are mostly extractive, exhibiting a small degree of abstraction (?, ?, ?, ?, ?).

In this paper we introduce extreme summarization, a new single-document summarization task which is not amenable to extractive strategies and requires an abstractive modeling approach. The idea is to create a short, one-sentence news summary answering the question “What is this article about?”. Figure 1 shows an example of a document and its extreme summary. As can be seen, the summary is very different from a headline whose aim is to encourage readers to read the story; it draws on information interspersed in various parts of the document (not only the beginning) and displays multiple levels of abstraction including paraphrasing, fusion, synthesis, and inference.

To drive research on abstractive summarization forward, we build a dataset for the proposed task by harvesting online articles from the British Broadcasting Corporation (BBC) that often include a first-sentence summary. We further propose a novel deep learning model which is well-suited to extreme summarization. Unlike most recent abstractive approaches (?, ?, ?, ?, ?, ?, ?, ?) which rely on an encoder-decoder architecture modeled by recurrent neural networks (RNNs), we present a topic-conditioned neural model which is based entirely on convolutional neural networks (?). Convolution layers capture long-range dependencies between words in the document more effectively than RNNs, allowing the model to perform document-level inference, abstraction, and paraphrasing. Our convolutional encoder associates each word with a topic vector capturing whether it is representative of the document’s content, while our convolutional decoder conditions each word prediction on a document topic vector capturing whether it is in the theme of the document.

Experimental evaluation on the extreme summarization task shows that our topic-aware convolutional model outperforms an oracle extractive system (in terms of ROUGE) as well as state-of-the-art RNN-based abstractive systems, a vanilla convolutional model (?) and a convolutional model augmented with the pointer-generator mechanism (?). We also conduct two human evaluations in order to assess (a) which type of summary participants prefer and (b) how much key information from the document is preserved in the summary. Both evaluations overwhelmingly show that human subjects find our summaries more informative and complete. To further illustrate that the proposed model is generally applicable, we evaluate its performance on the Newsroom Abstractive dataset (?). Our experiments set a new state of the art and highlight interesting differences between our extreme summarization dataset and the Newsroom dataset.

Our contributions in this work are three-fold: we propose a new single-document summarization dataset which encourages the development of abstractive systems; we demonstrate through analysis and empirical results that extractive approaches are not well-suited to the extreme summarization task; and we propose a novel topic-aware convolutional sequence-to-sequence model for abstractive summarization. In the remainder, we present an overview of related work (Section 2) and then describe our extreme summarization dataset in more detail (Section 3). Section 4 presents our model while Section 6 discusses our results.

2 Related Work

Summarization Datasets

The summarization of news articles has enjoyed wide popularity in natural language processing due to its potential for various information access applications which allow readers to spot emerging trends, person mentions, the evolution of storylines, and so on. The news domain has been the main focus of several Document Understanding (DUC) and Text Analysis conferences (TAC) leading to the creation of various high-quality summarization datasets (?, ?). More recently, the training requirements of neural systems have led to the compilation of larger datasets based on New York Times (?), the Gigaword corpus (?), the CNN and DailyMail news outlets (?), or a combination of several major news publications (?). There has been some interest in summarizing texts from other domains, such as longer scientific articles (?, ?, ?), Wikipedia articles (?), live sport text commentary scripts (?), movie reviews (?, ?) or online discussion forums and blogs (?, ?, ?). In this paper, we focus on generating extreme (single line abstractive) summaries for BBC news articles.

The nature and quality of reference summaries vary for different datasets. DUC datasets contain multi-reference summaries that are manually written especially to evaluate summarization systems. Due to the effort and cost involved in creating multiple reference summaries, DUC datasets are rather small (a few hundred articles) and fall short of what is needed to train neural summarization systems. Gigaword summaries are short headlines (?). Systems trained on New York Times and CNN/DailyMail learn to generate multi-line abstracts or highlights; however, these summaries are mostly extractive and systems trained on them unavoidably learn to perform mainly copying operations even when capable of performing abstraction (?, ?). Newsroom (?) summaries are manual descriptions of news articles written by authors and editors in newsrooms of 38 major news publications. Coming from a variety of sources, these summaries exhibit different degrees of abstraction; they are not visible to readers but are often used to index the article (?). In contrast, our summaries are read together with the article; they are the first sentence readers see (often highlighted in boldface) prior to digesting the full article. The Newsroom dataset is fairly large, containing 1.3 million articles and their summaries, and goes some way towards addressing the concerns relating to biases towards extractive strategies in earlier datasets. We discuss the differences between Newsroom and our dataset in more detail in Section 3 and also present experimental results with our model in Section 6.2.

Summarization Approaches

Approaches to document summarization fall under two major paradigms: extractive systems select sentences from the document and assemble them together to generate a summary, while abstractive systems create a summary from scratch, possibly generating new words or phrases which are not in the document.

A great deal of previous work has focused on extractive summarization which is usually modeled as a sentence ranking or binary classification problem (i.e., sentences which are top ranked or predicted as True are selected as summaries). Early attempts mostly leverage human-engineered features including sentence position and length (?), keywords and the presence of proper nouns (?, ?, ?), information based on frequency (?) or events (?). These methods often learn to score each sentence independently (?, ?, ?, ?, ?, ?, ?, ?), however summary quality can be improved heuristically (?, ?), via max-margin methods (?, ?), or integer-linear programming (?, ?, ?, ?, ?, ?).

Modern extractive summarization models (?, ?, ?, ?) are data-driven and learn continuous features using neural network architectures without any linguistic preprocessing or reliance on expert feature design. The majority of them conceptualize extractive summarization as a sequence labeling task in which each label specifies whether each document sentence should be included in the summary (?, ?, ?, ?, ?, ?, ?, ?). These models often rely on recurrent neural networks to derive a meaning representation of the document which is then used to label each sentence, taking the previously labeled sentences into account.

There has also been a surge of interest in neural network models for abstractive summarization which is viewed as a sequence-to-sequence problem (?, ?, ?). Central in most approaches (?, ?, ?, ?, ?) is an encoder-decoder architecture modeled by recurrent neural networks. The encoder reads the source sequence into a list of continuous-space representations from which the decoder generates the target sequence. ? (?) refine this sequence-to-sequence architecture with a copy mechanism (?, ?) which allows the model to reuse sequences from the source document and with a coverage mechanism (?) which keeps track of what has been summarized, discouraging repetition. A few extractive (?, ?, ?) and abstractive (?, ?, ?, ?, ?, ?) approaches obtain performance improvements by combining the maximum-likelihood cross-entropy loss with rewards from policy gradient reinforcement learning (?) to directly optimize the evaluation metric relevant for the summarization task.

Our topic-conditioned convolutional model differs from earlier approaches both in application and formulation. Unlike abstractive models based on recurrent neural networks, we adopt a fully convolutional encoder-decoder architecture (?). Convolution layers capture long-range dependencies between words in the document more effectively than RNNs, allowing the model to perform document-level inference, abstraction, and paraphrasing. Our convolutional encoder associates each word with a topic vector capturing whether it is representative of the document’s content, while our convolutional decoder conditions each word prediction on a document topic vector. Convolutional alternatives to sequence modeling have been proposed for machine translation (?), headline generation (?), and story generation (?); however, we are not aware of any prior work targeting summarization. The Transformer architecture (?) presents an alternative to convolutions, also aiming at eliminating the fundamental constraint of sequential computation, and has been successfully applied to sentence and document summarization (?, ?, ?, ?). (Footnote 2: Experiments with Transformer architectures are outside the scope of this paper. Recent work (?) on multi-document summarization shows that Transformer-based models perform on par with their convolutional alternatives.)

Our convolutional model uses topic vectors to foreground salient words in the document. The idea is inspired by traditional summarization methods for content selection (?, ?, ?, ?); however, our topics are not manually crafted but automatically learned using an LDA model (?). Several recent summarization models have explored architectures dedicated to content selection; ? (?) extract a set of keywords from the document to guide the summarization process. ? (?) and ? (?) use dedicated gates to filter the representation of the source document, while others modulate the attention based on how likely it is for a word or a sentence to be included in a summary (?, ?, ?) or use reinforcement learning to optimize content selection objectives (?, ?). Document-level semantic information (as expressed via latent topics) has been previously integrated with recurrent neural networks (?, ?, ?); however, we are not aware of any existing convolutional models that do so.

3 The XSum Dataset

In this section we present XSum, our extreme summarization dataset, which consists of BBC articles and accompanying single-sentence summaries. We describe how XSum was obtained, provide comparisons with popular summarization benchmarks, and analyze how it differs from them both quantitatively and qualitatively.

3.1 Data Collection

Each BBC article is prefaced with an introductory sentence (aka summary) which is professionally written, typically by the author of the article. The summary bears the HTML class “story-body__introduction,” and can be easily identified and extracted from the main text body (see Figure 1 for an example summary-article pair).

To create a large-scale dataset for extreme summarization, we followed the methodology proposed in ? (?). Specifically, we collected 226,711 Wayback archived BBC articles ranging over almost a decade (2010 to 2017) and covering a wide variety of domains (e.g., News, Politics, Sports, Weather, Business, Technology, Science, Health, Family, Education, Entertainment, and Arts). Each article comes with a unique identifier in its URL, which we used to randomly split the dataset into training (90%, 204,045), validation (5%, 11,332), and test (5%, 11,334) sets.
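As an illustration only, an identifier-based split of this kind could be sketched as follows. The paper states that the URL identifier was used for the random split but does not describe the mechanism, so the hashing scheme, the function name `assign_split`, and the example identifiers below are all assumptions rather than the authors' procedure.

```python
import hashlib

def assign_split(article_id: str) -> str:
    """Deterministically map an identifier to a split (roughly 90/5/5).

    Assumption: hashing the identifier is a stand-in for whatever
    randomization the authors actually applied.
    """
    digest = hashlib.md5(article_id.encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % 100  # pseudo-random bucket in [0, 99]
    if bucket < 90:
        return "train"
    if bucket < 95:
        return "validation"
    return "test"

# Hypothetical identifiers, for illustration only.
ids = [str(10_000_000 + i) for i in range(20_000)]
counts = {"train": 0, "validation": 0, "test": 0}
for article_id in ids:
    counts[assign_split(article_id)] += 1

# The split sizes reported in the text add up to the collected total.
reported_total = 204_045 + 11_332 + 11_334
```

A hash-based assignment keeps the split reproducible without storing a random seed, since the same identifier always lands in the same set.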

Tables 1 and 2 compare XSum with the CNN, DailyMail, NY Times, and Newsroom benchmarks. For CNN and DailyMail, we used the original splits of ? (?) and followed ? (?) to preprocess them. For NY Times (?), we used the splits and pre-processing steps of ? (?). For the Newsroom dataset, we used the splits and pre-processing steps of ? (?). We present comparisons with the full Newsroom dataset (Newsroom) and its three subsets: mostly extractive (Newsroom-Ext), mostly abstractive (Newsroom-Abs), and mixed (Newsroom-Mixed). As can be seen in Table 1, XSum contains a substantial number of training instances, similar to DailyMail; documents and summaries in XSum are shorter in relation to most datasets (see Table 2) but the vocabulary size is sufficiently large, comparable to CNN.

3.2 How Abstractive is XSum?

To support the claim that XSum summaries are fairly abstractive and as a result systems trained on them could not resort to extractive strategies, we record the percentage of novel $n$-grams in the gold summaries that do not appear in their source documents. As shown in Table 3, there are 36% novel unigrams in the XSum reference summaries compared to 17% in CNN, 17% in DailyMail, 23% in NY Times, and 18% in Newsroom. This indicates that XSum summaries are more abstractive. The proportion of novel constructions grows for larger $n$-grams across datasets, however, it is much steeper in XSum whose summaries exhibit approximately 83% novel bigrams, 96% novel trigrams, and 98% novel 4-grams (comparison datasets display around 47–55% new bigrams, 58–72% new trigrams, and 63–80% novel 4-grams).
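The novelty measure can be sketched for a single document-summary pair as below. This is a minimal sketch: whitespace tokenization, lower-casing, and counting every $n$-gram occurrence in the summary are assumptions made here, since the text does not spell out those details, and the example sentences are invented.

```python
def ngrams(tokens, n):
    """Return the list of n-grams (as tuples) in a token sequence."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def novel_ngram_percentage(document: str, summary: str, n: int) -> float:
    """Percentage of summary n-grams that never occur in the document."""
    doc_ngrams = set(ngrams(document.lower().split(), n))
    summary_ngrams = ngrams(summary.lower().split(), n)
    if not summary_ngrams:
        return 0.0
    novel = sum(1 for gram in summary_ngrams if gram not in doc_ngrams)
    return 100.0 * novel / len(summary_ngrams)

# Invented example pair.
document = "the council approved the housing plan on monday"
summary = "council backs new housing plan"

novel_unigrams = novel_ngram_percentage(document, summary, 1)  # 2 of 5 are new
novel_bigrams = novel_ngram_percentage(document, summary, 2)   # 3 of 4 are new
```

Even in this toy pair the share of novel bigrams is higher than the share of novel unigrams, which is the pattern the corpus-level figures show.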

We further evaluate two extractive methods, lead and ext-oracle, on these datasets. lead is often used as a strong lower bound for news summarization (?) and creates a summary by selecting the first few sentences or words in the document. We extracted the first 3 sentences for CNN documents and the first 4 sentences for DailyMail (?). Following previous work (?, ?), we obtained lead summaries based on the first 100 words for NY Times documents. For Newsroom, we extracted the first 2 sentences to form the lead summaries. For XSum, we selected the first sentence in the document (excluding the one-line summary) to generate the lead. Our second method, ext-oracle, can be viewed as an upper bound for extractive models (?, ?). It creates an oracle summary by selecting the best possible set of sentences in the document that gives the highest ROUGE (?) with respect to the gold summary. For XSum, we simply selected the single-best sentence in the document as summary.
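For the single-sentence XSum setting, the two baselines can be sketched as follows. The scoring function here is a plain unigram-overlap F1 used as a stand-in for ROUGE; it is not the official ROUGE implementation, and the function names and example sentences are hypothetical.

```python
from collections import Counter

def unigram_f1(candidate: str, reference: str) -> float:
    """Unigram-overlap F1; a simplified stand-in for ROUGE-1, not the real metric."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())  # clipped count of shared tokens
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

def lead_summary(sentences):
    """lead for XSum: the first document sentence."""
    return sentences[0]

def ext_oracle_summary(sentences, gold: str):
    """ext-oracle for XSum: the single sentence scoring highest against the gold summary."""
    return max(sentences, key=lambda s: unigram_f1(s, gold))

# Invented example document (already split into sentences) and gold summary.
sentences = [
    "a meeting was held on monday",
    "the council approved a new housing plan",
    "residents were divided",
]
gold = "council backs new housing plan"

lead = lead_summary(sentences)
oracle = ext_oracle_summary(sentences, gold)
```

The oracle needs the gold summary to make its choice, which is why it serves as an upper bound for extractive systems and not as a usable system.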

Table 4 reports the performance of the two extractive methods using ROUGE-1 (R1), ROUGE-2 (R2), and ROUGE-L (RL) with the gold summaries as reference. The lead baseline performs extremely well on CNN, DailyMail, NY Times and Newsroom, confirming that they contain fairly extractive summaries. ext-oracle further shows that improved sentence selection would bring further performance gains to extractive approaches. Abstractive systems trained on these datasets often have a hard time beating the lead, let alone ext-oracle, or display a low degree of novelty in their summaries (?, ?, ?, ?, ?, ?). Interestingly, lead and ext-oracle perform poorly on XSum, underscoring the fact that it contains genuinely abstractive summaries.

? (?) also find that CNN / Daily Mail and New York Times are skewed towards extractive summaries (albeit following different analysis metrics). The abstractive subset of their Newsroom dataset (Newsroom-Abs) demonstrates similar patterns to XSum in terms of the percentage of novel $n$-grams in the gold summary and the performance of extractive methods (lead and ext-oracle). However, XSum differs from Newsroom in two key respects. Firstly, Newsroom is a fairly diverse dataset: it contains documents and summaries from multiple news outlets representing a large range of summarization styles from highly abstractive to highly extractive. XSum is not; it covers a single news outlet (i.e., BBC) and a uniform summarization style (i.e., a single sentence). Another difference comes from the way the reference summaries are extracted in these two datasets. Newsroom summaries are extracted using the HTML meta-tag “description,” and constitute descriptions of the document’s content which are often used for indexing but are not shown to the readers. In comparison, XSum summaries are aimed at the reader and meant to be read together with the article. Newsroom summaries are often indicative: they provide merely an indication of the subject matter of the document without giving away detail on its content. In contrast, XSum summaries are more informative; they contain pertinent information necessary to convey the gist of the document. We further explore these differences in our experimental evaluation (see Section 6.3).

4 Topic-Aware Convolutional Model for Summarization

Unlike tasks like machine translation and paraphrase generation where there is often a one-to-one semantic correspondence between source and target words, document summarization must distill the content of a document into a few important facts. This is even more challenging for our task, where the compression ratio is extremely high, and pertinent content can be easily missed.

Our model builds on the work of ? (?) who develop an encoder-decoder architecture with an attention mechanism (?) based exclusively on deep convolutional networks. Their convolutional alternative to sequence modeling has shown promise for machine translation (?, ?) and story generation (?). We believe that convolutional architectures are attractive for extreme summarization for at least two reasons. Firstly, contrary to recurrent networks which view the input as a chain structure, convolutional networks can be stacked to represent large context sizes. Secondly, hierarchical features can be extracted over larger and larger contents, allowing long-range dependencies to be represented efficiently through shorter paths.

We adapt this model to our task by allowing it to recognize pertinent content (i.e., by foregrounding salient words in the document). In particular, we improve the convolutional encoder by associating each word with a vector representing topic salience, and the convolutional decoder by conditioning each word prediction on the document topic vector. Our model aims to generate informative summaries that are grounded in the input document and its content.

4.1 Model Overview

At the core of our model is a simple convolutional block structure that computes intermediate states based on a fixed number of input elements. Our convolutional encoder (shown at the top of Figure 2) applies this unit across the document. We repeat these operations in a stacked fashion to get a multi-layer hierarchical representation over the input document where words at closer distances interact at lower layers while distant words interact at higher layers. The interaction between words through hierarchical layers effectively captures long-range dependencies.

Analogously, our convolutional decoder (shown at the bottom of Figure 2) uses the multi-layer convolutional structure to build a hierarchical representation over what has been predicted so far. Each layer on the decoder side determines useful source context by attending to the encoder representation before it passes its output to the next layer. This way the model remembers which words it previously attended to and applies multi-hop attention (shown in the middle of Figure 2) per time step. The output of the top layer is passed to a softmax classifier to predict a distribution over the target vocabulary.

Our model assumes access to word and document topic distributions. These can be obtained from any topic model; we use Latent Dirichlet Allocation (LDA; ? (?)) in our experiments and pass the distributions obtained from LDA directly to the network as additional input. This allows us to take advantage of topic modeling without interfering with the computational advantages of the convolutional architecture.

4.2 Topic Sensitive Embeddings

Let $D$ denote a document consisting of a sequence of words $(w_{1},\ldots,w_{m})$; we embed $D$ into a distributional space $\mathbf{x}=(x_{1},\ldots,x_{m})$ where $x_{i}\in\mathbb{R}^{f}$ is a row of the embedding matrix $M\in\mathbb{R}^{V\times f}$ (where $V$ is the vocabulary size). We also embed the absolute word positions in the document $\mathbf{p}=(p_{1},\ldots,p_{m})$ where $p_{i}\in\mathbb{R}^{f}$ is a row of the position matrix $P\in\mathbb{R}^{N\times f}$, and $N$ is the maximum number of positions; $p_{i}$ is the position embedding of word $w_{i}$ at position $i$ in the input sequence. Position embeddings have proved useful for convolutional sequence modeling (?), because, in contrast to RNNs, convolutional models do not observe the temporal positions of words (?). Let $t_{D}\in\mathbb{R}^{f^{\prime}}$ be the topic distribution of document $D$ and $\mathbf{t^{\prime}}=(t^{\prime}_{1},\ldots,t^{\prime}_{m})$ the topic distributions of words in the document (where $t^{\prime}_{i}\in\mathbb{R}^{f^{\prime}}$). During encoding, we represent document $D$ via $\mathbf{e}=(e_{1},\ldots,e_{m})$, where $e_{i}$ is:

$e_{i}=[(x_{i}+p_{i});(t^{\prime}_{i}\otimes t_{D})]\in\mathbb{R}^{f+f^{\prime}},$ (1)

and $\otimes$ denotes point-wise multiplication. The topic distribution $t^{\prime}_{i}$ of word $w_{i}$ essentially captures how topical the word is in itself (local context), whereas the topic distribution $t_{D}$ represents the overall theme of the document (global context). The encoder essentially enriches the context of the word with its topical relevance to the document.

For every output prediction, the decoder estimates representation $\mathbf{g}=(g_{1},\ldots,g_{n})$ for previously predicted words $(w^{\prime}_{1},\ldots,w^{\prime}_{n})$ where $g_{i}$ is:

$g_{i}=[(x^{\prime}_{i}+p^{\prime}_{i});t_{D}]\in\mathbb{R}^{f+f^{\prime}}.$ (2)

$x^{\prime}_{i}$ and $p^{\prime}_{i}$ are word and position embeddings of previously predicted word $w^{\prime}_{i}$, and $t_{D}$ is the topic distribution of the input document. Note that the decoder does not use the topic distribution of $w^{\prime}_{i}$ as computing it on the fly would be expensive. However, every word prediction is conditioned on the topic of the document, encouraging the summary to have the same theme as the document.
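The two input representations can be sketched in a few lines of Python. This is a minimal illustration, assuming (as the embedding size $f+f^{\prime}$ suggests) that the encoder input concatenates the sum of word and position embeddings with the point-wise product of the word and document topic distributions, and that the decoder input concatenates the same sum with the document topic distribution alone. The function names and toy vectors are hypothetical.

```python
from typing import List

Vector = List[float]


def encoder_input(x: Vector, p: Vector, t_word: Vector, t_doc: Vector) -> Vector:
    """Assumed form e_i = [(x_i + p_i); (t'_i * t_D)], of size f + f'."""
    if len(x) != len(p) or len(t_word) != len(t_doc):
        raise ValueError("dimension mismatch")
    summed = [a + b for a, b in zip(x, p)]               # word + position, size f
    topical = [a * b for a, b in zip(t_word, t_doc)]     # point-wise product, size f'
    return summed + topical                              # list concatenation


def decoder_input(x_prev: Vector, p_prev: Vector, t_doc: Vector) -> Vector:
    """Assumed form g_i = [(x'_i + p'_i); t_D]; no word-level topic term."""
    if len(x_prev) != len(p_prev):
        raise ValueError("dimension mismatch")
    return [a + b for a, b in zip(x_prev, p_prev)] + list(t_doc)


if __name__ == "__main__":
    # Toy example with f = 2 and f' = 3.
    print(encoder_input([0.5, -1.0], [0.25, 0.25], [0.5, 0.5, 0.0], [0.5, 0.25, 0.25]))
    print(decoder_input([1.0, 2.0], [0.5, 0.5], [0.25, 0.75]))
```

A word whose topic distribution has no mass on the document's dominant topics yields a near-zero topical part, which is one way to read "foregrounding salient words".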

4.3 Multi-layer Convolutional Structure

Each convolution block, parametrized by $W\in\mathbb{R}^{2d\times kd}$ and $b_{w}\in\mathbb{R}^{2d}$ , takes as input $X\in\mathbb{R}^{k\times d}$ which is the concatenation of $k$ adjacent elements embedded in a $d$ dimensional space, applies one dimensional convolution and returns an output element $Y\in\mathbb{R}^{2d}$ . We apply Gated Linear Units (GLU, $v:\mathbb{R}^{2d}\rightarrow\mathbb{R}^{d}$ , ?) on the output of convolution $Y$ . Subsequent layers operate over the $k$ output elements of the previous layer and are connected through residual connections (?) to allow for deeper hierarchical representation. We denote the output of the $\ell$ th layer as $\mathbf{h^{\ell}}=(h^{\ell}_{1},\ldots,h^{\ell}_{n})$ for the decoder network, and $\mathbf{z^{\ell}}=(z^{\ell}_{1},\ldots,z^{\ell}_{m})$ for the encoder network.
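As an illustration, the sketch below applies a single convolution block to one window of $k$ elements in plain Python. It assumes the usual gated linear unit definition, in which the $2d$-dimensional output is split into halves $(a, b)$ and mapped to $a\otimes\sigma(b)$; the text above only gives the signature of $v$, so this choice and all names are assumptions, as is adding the residual from a single input element.

```python
import math
from typing import List

Vector = List[float]
Matrix = List[Vector]


def glu(y: Vector) -> Vector:
    """Gated linear unit (assumed form): split y into (a, b), return a * sigmoid(b)."""
    d = len(y) // 2
    return [a / (1.0 + math.exp(-b)) for a, b in zip(y[:d], y[d:])]


def conv_block(window: Matrix, W: Matrix, b: Vector, residual: Vector) -> Vector:
    """Apply one block to k adjacent d-dimensional elements.

    window:   k vectors of size d (the input X)
    W, b:     parameters of shape (2d x kd) and (2d)
    residual: the d-dimensional input element added back after gating
    """
    flat = [v for element in window for v in element]        # length k * d
    y = [sum(w * v for w, v in zip(row, flat)) + bias        # length 2 * d
         for row, bias in zip(W, b)]
    return [g + r for g, r in zip(glu(y), residual)]         # length d


if __name__ == "__main__":
    # k = 2, d = 1: W has shape 2 x 2.
    print(conv_block([[1.0], [2.0]], [[1.0, 1.0], [0.0, 0.0]], [0.0, 0.0], [1.0]))
```

Stacking such blocks widens the receptive field with depth, which is what lets distant words interact at higher layers.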

4.4 Multi-hop Attention

Our encoder and decoder are tied via a multi-hop attention mechanism. For each decoder layer $\ell$ , we compute the attention $a^{\ell}_{ij}$ of state $i$ and source element $j$ as:

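$a^{\ell}_{ij}=\frac{\exp(d^{\ell}_{i}\cdot z^{u}_{j})}{\sum_{t=1}^{m}\exp(d^{\ell}_{i}\cdot z^{u}_{t})},$ (3)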

where $d^{\ell}_{i}=W^{\ell}_{d}h^{\ell}_{i}+b^{\ell}_{i}+g_{i}$ is the decoder state summary combining the current decoder state $h^{\ell}_{i}$ and the previous output element embedding $g_{i}$ . Vector $\mathbf{z^{u}}$ is the output from the last encoder layer $u$ . The conditional input $c^{\ell}_{i}$ to the current decoder layer is a weighted sum of the encoder outputs as well as the input element embeddings $e_{j}$ :

$c^{\ell}_{i}=\sum_{j=1}^{m}a^{\ell}_{ij}(z^{u}_{j}+e_{j}).$ (4)

The attention mechanism described here performs multiple attention “hops” per time step and considers which words have been previously attended to. It is therefore different from single-step attention in recurrent neural networks (?), where the attention and weighted sum are computed over $\mathbf{z^{u}}$ only.
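One attention hop can be sketched as follows, assuming the attention weights are a softmax over dot products between the decoder state summary and the final encoder outputs, and that the conditional input is the attention-weighted sum of encoder outputs plus input embeddings. Vectors are assumed to have already been projected to a common size; the function name is hypothetical.

```python
import math
from typing import List, Tuple

Vector = List[float]


def attend(d_i: Vector, z: List[Vector], e: List[Vector]) -> Tuple[Vector, Vector]:
    """Return (attention weights a_ij, conditional input c_i) for one decoder state."""
    scores = [sum(a * b for a, b in zip(d_i, z_j)) for z_j in z]
    top = max(scores)                                   # subtract max for stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [x / total for x in exps]
    dim = len(z[0])
    c_i = [sum(w * (z_j[n] + e_j[n]) for w, z_j, e_j in zip(weights, z, e))
           for n in range(dim)]
    return weights, c_i


if __name__ == "__main__":
    weights, context = attend([0.0, 0.0],
                              [[1.0, 0.0], [0.0, 1.0]],
                              [[1.0, 1.0], [1.0, 1.0]])
    print(weights, context)
```

In the multi-hop setting this function would be called once per decoder layer, with each layer's output (already informed by the previous hop) feeding the next.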

Our network uses multiple linear layers to project between the embedding size $(f+f^{\prime})$ and the convolution output size $2d$ . These are applied to $\mathbf{e}$ (before feeding it to the encoder), to the final encoder output $\mathbf{z^{u}}$ , to all decoder layers $\mathbf{h^{\ell}}$ (for the attention score computation), and to the final decoder output $\mathbf{h^{L}}$ (before the softmax). We pad the input with $k-1$ zero vectors on both left and right sides to ensure that the output of the convolutional layers matches the input length. During decoding, we ensure that the decoder does not have access to future information; we start with $k$ zero vectors and shift the convolutional block to the right after every prediction. The final decoder output $\mathbf{h^{L}}$ is used to compute the distribution over the target vocabulary as:

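$p(y_{i+1}\mid y_{1},\ldots,y_{i},D,t_{D},\mathbf{t^{\prime}})=\mathrm{softmax}(W_{o}h^{L}_{i}+b_{o})\in\mathbb{R}^{T},$ (5)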

where $W_{o}$ and $b_{o}$ are the parameters of the softmax layer and $T$ is the size of the target vocabulary. We also use layer normalization and weight initialization to stabilize learning. We use cross-entropy loss to maximize the likelihood of the ground-truth sequence $(y^{\ast}_{1},\ldots,y^{\ast}_{n})$:

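$\mathcal{L}=-\sum_{i=1}^{n}\log p(y^{\ast}_{i}\mid y^{\ast}_{1},\ldots,y^{\ast}_{i-1},D,t_{D},\mathbf{t^{\prime}}).$ (6)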
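The training objective can be illustrated with a small Python sketch: a softmax turns the final decoder scores at each step into a distribution over the target vocabulary, and the loss is the negative log-probability of the gold token summed over steps. The logits below are toy values rather than model outputs, and the function names are hypothetical.

```python
import math
from typing import List, Sequence


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over one vector of scores."""
    top = max(logits)
    exps = [math.exp(v - top) for v in logits]
    total = sum(exps)
    return [v / total for v in exps]


def sequence_nll(step_logits: Sequence[Sequence[float]],
                 gold_ids: Sequence[int]) -> float:
    """Cross-entropy of the gold sequence: minus the sum of log p(y*_i | ...)."""
    return -sum(math.log(softmax(logits)[y])
                for logits, y in zip(step_logits, gold_ids))


if __name__ == "__main__":
    # Two decoding steps over a three-word vocabulary.
    print(sequence_nll([[2.0, 0.0, 0.0], [0.0, 1.0, 0.0]], [0, 1]))
```

Minimizing this quantity is equivalent to maximizing the likelihood of the reference summary.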

Our topic-enhanced model calibrates long-range dependencies with globally salient content. As a result, it provides a better alternative to vanilla convolutional sequence models (?) and RNN-based summarization models (?) for capturing cross-document inferences and paraphrasing. At the same time it retains the computational advantages of convolutional models. Each convolution block operates over a fixed-size window of the input sequence, allowing for simultaneous encoding of the input and ease in learning due to the fixed number of non-linearities and transformations for words in the input sequence.

5 Experimental Setup

In this section we present our experimental setup for assessing the performance of our Topic-aware Convolutional Sequence to Sequence model which we abbreviate to T-ConvS2S. We evaluate our model on our newly collected XSum dataset and show that it is suitable for extreme summarization. We also report experiments on the abstractive subset of the Newsroom dataset (Newsroom-Abs; ?). In the following, we discuss implementation details, present the systems used for comparison with our approach, and explain how system output was evaluated.

5.1 Comparison Systems

We report results with various systems which were all trained on the XSum dataset to generate a one-line summary given an input news article. We compared T-ConvS2S against three extractive systems: a baseline which randomly selects a sentence from the input document (random), a baseline which simply selects the leading sentence from the document (lead), and an oracle which selects a single-best sentence in each document (ext-oracle). The latter is often used as an upper bound for extractive methods. We also compared our model against the RNN-based abstractive systems introduced in ? (?) (footnote 3). In particular, we experimented with an attention-based sequence-to-sequence model (Seq2Seq), a pointer-generator model which allows us to copy words from the source text (PtGen), and a pointer-generator model with a coverage mechanism to keep track of words that have been summarized (PtGen+Covg). Finally, we compared our model against two convolutional abstractive systems: the vanilla convolution sequence-to-sequence model of ? (?) (ConvS2S) and a variant thereof augmented with a copy mechanism (ConvS2S+Copy) which copies words from the input document via pointing (?, ?) while retaining the ability to produce novel words from a fixed vocabulary (footnote 4).

Footnote 3: State-of-the-art abstractive systems on the CNN/Daily mail and New York Times datasets (?, ?, ?) use reinforcement learning to directly optimize the evaluation metric relevant for the summarization task. Although our model could be optimized with reinforcement learning objectives, we leave this to future work and present comparisons with related models which are all trained with the maximum-likelihood objective.

Footnote 4: ConvS2S+Copy estimates the generation probability $p_{gen}\in[0,1]$ at each decoding step $i$ as:

$p_{gen}=\sigma(w_{h}h_{i}^{L}+w_{c}c_{i}^{L}+w_{g}g_{i}+b_{gen})$

using the final decoder state $h_{i}^{L}$ , the final context vector $c_{i}^{L}$ and the decoder input $g_{i}$ . $L$ is the final layer of the decoder. $w_{h}$ , $w_{c}$ , $w_{g}$ and $b_{gen}$ are model parameters and $\sigma$ is the non-linear sigmoid function. We estimate the final probability distribution $p^{\prime}(w)\in R^{T^{\prime}}$ over the extended vocabulary $T^{\prime}$ denoting the union of the target vocabulary $T$ and the words of the input document as

$p^{\prime}(w)=p_{gen}{p(w)}^{\alpha}+(1-p_{gen})\sum_{j:w_{j}=w}(a_{ij}^{L})^{\beta},$

where $p(w)$ is the target vocabulary distribution (estimated using Equation (5)) and $a_{ij}^{L}$ is the attention for the final decoder layer $L$ (estimated using Equation (3)). $\alpha$ and $\beta$ are scaling parameters (used to stabilize convolutional learning) estimated as $\frac{\log(|T^{\prime}|)}{\log(|T|)}$ and $\frac{\log(|T^{\prime}|)}{\log(\mathrm{EncLen})}$, respectively (where $\mathrm{EncLen}$ is the encoder length). ConvS2S+Copy uses $p_{gen}$ to switch between generating a novel word from a fixed vocabulary by sampling from $p(w)$ or copying a word from the source text by sampling from the attention distribution $a_{ij}^{L}$.
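The mixture above can be traced with a short Python sketch. It is only an illustration of the formula as written: `p_gen` is passed in rather than computed from decoder states, the vocabulary distribution and attention weights are toy values, and the function name is hypothetical. Both the vocabulary and the source are assumed to contain at least two entries, so that the logarithms in the scaling parameters are non-zero.

```python
import math
from typing import Dict, List


def copy_distribution(p_vocab: Dict[str, float], attention: List[float],
                      source: List[str], p_gen: float) -> Dict[str, float]:
    """Scores over the extended vocabulary T' (target vocabulary plus source words)."""
    extended = set(p_vocab) | set(source)
    alpha = math.log(len(extended)) / math.log(len(p_vocab))   # log|T'| / log|T|
    beta = math.log(len(extended)) / math.log(len(source))     # log|T'| / log(EncLen)
    scores = {w: p_gen * p_vocab.get(w, 0.0) ** alpha for w in extended}
    for a, w in zip(attention, source):
        scores[w] += (1.0 - p_gen) * a ** beta                 # sum over j with w_j = w
    return scores


if __name__ == "__main__":
    result = copy_distribution({"a": 0.5, "b": 0.5}, [0.5, 0.5], ["a", "c"], p_gen=0.5)
    for word in sorted(result):
        print(word, result[word])
```

A source word outside the target vocabulary (here `"c"`) receives mass only from the copy term, which is how the mechanism can emit out-of-vocabulary words.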

For our experiments on Newsroom-Abs, we again compared T-ConvS2S against the extractive systems random, lead and ext-oracle, the recurrent abstractive systems Seq2Seq, PtGen, and PtGen+Covg, and the convolutional systems ConvS2S and ConvS2S+Copy. All systems were retrained on the Newsroom-Abs training set. lead selects the first 2 sentences to form the summary while random selects 2 random sentences from the input document. ext-oracle creates an oracle summary by selecting the best possible set of sentences in the document that gives the highest ROUGE (?) with respect to the gold summary.
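The three extractive baselines are simple enough to sketch directly. In the Python below, a plain unigram F1 stands in for ROUGE (the actual evaluation uses the ROUGE toolkit, which this does not reproduce), the oracle shown is the single-sentence variant used for XSum, and all function names are hypothetical.

```python
import random
from collections import Counter
from typing import List


def unigram_f1(candidate: str, reference: str) -> float:
    """Stand-in for ROUGE-1 F1: whitespace tokens, no stemming."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


def lead(sentences: List[str], n: int = 1) -> List[str]:
    """First n sentences of the document."""
    return sentences[:n]


def random_baseline(sentences: List[str], n: int = 1, seed: int = 0) -> List[str]:
    """n sentences drawn at random without replacement."""
    rng = random.Random(seed)
    return rng.sample(sentences, min(n, len(sentences)))


def ext_oracle(sentences: List[str], reference: str) -> str:
    """Sentence with the highest overlap score against the gold summary."""
    return max(sentences, key=lambda s: unigram_f1(s, reference))


if __name__ == "__main__":
    doc = ["the cat sat", "a dog barked loudly", "stocks fell sharply today"]
    print(lead(doc), ext_oracle(doc, "stocks fell today"))
```

Because the oracle sees the gold summary, it is an upper bound for sentence selection rather than a usable system.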

5.2 Model Parameters and Optimization

We did not anonymize entities but worked with a lowercased version of the XSum and Newsroom-Abs datasets. During training and at test time input documents were truncated to 400 tokens and the length of the summary was limited to 90 tokens.

We trained two separate LDA models (?) on XSum and Newsroom documents (training portion). We therefore obtained for each word a probability distribution over topics which we used to estimate $\mathbf{t^{\prime}}$; the topic distribution $t_{D}$ can be inferred for any new document, at training and test time. We explored several LDA configurations on held-out data, and obtained best results with 512 topics for XSum and 256 topics for Newsroom. LDA models were trained with $\alpha$ (footnote 5) set to a fixed normalized asymmetric prior of $1/\mbox{number of topics}$; we let the model learn an asymmetric prior $\eta$ (footnote 6) from the data. Table 5 shows some of the topics learned by the LDA models (footnote 7).

Footnote 5: $\alpha$ controls the prior distribution over topics for individual documents.

Footnote 6: $\eta$ controls the prior distribution over words for individual topics.

Footnote 7: We used a multi-core implementation of LDA made available by gensim at https://radimrehurek.com/gensim/models/ldamulticore.html.

For all RNN-based models (Seq2Seq, PtGen, and PtGen+Covg) we used the best settings reported on the CNN and DailyMail data (?) (footnote 8). All models had 256 dimensional hidden states and 128 dimensional word embeddings. They were trained using Adagrad (?) with learning rate set to 0.15 and an initial accumulator value of 0.1. We used gradient clipping with a maximum gradient norm of 2, without any regularization and the loss on the validation set to implement early stopping. All models trained on the XSum and Newsroom datasets have the same settings.

Footnote 8: We used the code available at https://github.com/abisee/pointer-generator.

For all convolutional models (ConvS2S, ConvS2S+Copy, and T-ConvS2S) we used 512 dimensional hidden states, word embeddings and position embeddings for XSum and 256 dimensional hidden states, word embeddings and position embeddings for Newsroom (footnote 9). All models were trained with Nesterov’s accelerated gradient method (?) using a momentum value of 0.99 and renormalized gradients if their norm exceeded 0.1 (?). We used a learning rate of 0.10 for ConvS2S and T-ConvS2S, and 0.02 for ConvS2S+Copy (footnote 10). Once the validation perplexity stopped improving, we reduced the learning rate by an order of magnitude after each epoch until it fell below $10^{-4}$. We also applied a dropout of 0.2 to the embeddings, the decoder outputs and the input of the convolutional blocks. Gradients were normalized by the number of non-padding tokens per mini-batch. We also used layer normalization and weight normalization for all layers except for lookup tables to stabilize learning.

Footnote 9: We used the code available at https://github.com/facebookresearch/fairseq-py.

Footnote 10: ConvS2S+Copy failed to converge with learning rate greater than 0.02.
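The learning-rate annealing described above can be sketched as a small Python function. It is one reading of the text, assuming that annealing starts the first time validation perplexity fails to improve and then continues every epoch; the function name and the example perplexities are hypothetical.

```python
from typing import Iterable, List


def lr_schedule(val_perplexities: Iterable[float], lr: float = 0.10,
                min_lr: float = 1e-4) -> List[float]:
    """Return the learning rate used in each epoch; stop once it falls below min_lr."""
    used: List[float] = []
    best = float("inf")
    annealing = False
    for ppl in val_perplexities:
        used.append(lr)
        if ppl < best:
            best = ppl
        else:
            annealing = True          # validation perplexity stopped improving
        if annealing:
            lr /= 10.0                # shrink by an order of magnitude per epoch
            if lr < min_lr:
                break                 # training ends
    return used


if __name__ == "__main__":
    print(lr_schedule([50.0, 40.0, 41.0, 39.0, 38.0, 37.0, 36.0]))
```

With a starting rate of 0.10, only a handful of annealed epochs fit before the rate drops below $10^{-4}$, so most of training happens at the initial rate.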

All neural models, including ours and those based on RNNs (?), had a vocabulary of 50,000 words and were trained on a single Nvidia M40 GPU with a batch size of 32 sentences. Summaries at test time were obtained using beam search (with beam size 10) in all cases.

5.3 Evaluation

We evaluated summarization quality automatically using F1 ROUGE (?). Unigram and bigram overlap (ROUGE-1 and ROUGE-2) are a proxy for assessing informativeness and the longest common subsequence (ROUGE-L) represents fluency (footnote 11). In addition to ROUGE which can be misleading when used as the only means to assess the informativeness of summaries (?, ?), we also evaluated system output by eliciting human judgments in two ways.

Footnote 11: We used pyrouge to compute all ROUGE scores, with parameters “-a -c 95 -m -n 4 -w 1.2.”

In our first experiment, participants were asked to compare summaries produced by different systems. The study was conducted on the Amazon Mechanical Turk platform using Best-Worst Scaling (BWS; ?; ?), a less labor-intensive alternative to paired comparisons that has been shown to produce more reliable results than rating scales (?). Participants were presented with a document and summaries generated from three systems and were asked to decide which summary was the best and which one was the worst in order of informativeness (does the summary capture important information in the document?) and fluency (is the summary written in well-formed English?). In two separate studies, we randomly selected 50 documents from the XSum and Newsroom-Abs test sets. We compared all possible system pairs for each document and collected judgments from three different participants for each comparison. The order of summaries was randomized per document and the order of documents per participant. The score of a system was computed as the percentage of times it was chosen as best minus the percentage of times it was selected as worst. The scores range from -1 (worst) to 1 (best). Figures 3 and 4 show example summaries from the XSum and Newsroom datasets used for this study.
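The Best-Worst Scaling score is easy to compute from raw judgments. The Python sketch below assumes each judgment records which systems were shown and which were picked as best and worst; this record layout and the function name are hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple

Judgment = Tuple[Tuple[str, ...], str, str]   # (systems shown, best, worst)


def bws_scores(judgments: Iterable[Judgment]) -> Dict[str, float]:
    """Fraction of times chosen best minus fraction of times chosen worst."""
    shown: Counter = Counter()
    best: Counter = Counter()
    worst: Counter = Counter()
    for systems, b, w in judgments:
        shown.update(systems)
        best[b] += 1
        worst[w] += 1
    return {s: (best[s] - worst[s]) / shown[s] for s in shown}


if __name__ == "__main__":
    data = [(("A", "B", "C"), "A", "C"),
            (("A", "B", "C"), "A", "B")]
    print(bws_scores(data))
```

A system that is always chosen best scores 1, one that is always chosen worst scores -1, and a system picked equally often as best and worst scores 0.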

For our second experiment we used a question-answering (QA) paradigm (?, ?) to assess the degree to which the models retain key information from the document. We wrote fact-based questions for each document, just by reading the reference summary, under the assumption that it highlights the most important content of the news article. Questions were formulated so as not to reveal answers to subsequent questions. Participants read the output summaries and answered the questions as best they could without access to the document or the gold summary. The more questions can be answered, the better the corresponding system is at summarizing the document as a whole. Five participants answered questions for each summary. Answers again were elicited using Amazon’s Mechanical Turk crowdsourcing platform. We uploaded the data in batches (one system at a time) to ensure that the same participant does not evaluate summaries from different systems on the same set of questions. We followed the scoring mechanism introduced in ? (?). A correct answer was marked with a score of one, partially correct answers with a score of 0.5, and zero otherwise. The final score for a system is the average of all its question scores.
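The QA scoring rule amounts to a simple average, sketched below in Python. The label strings and function name are hypothetical; the weights (1, 0.5, 0) are the ones stated above.

```python
from typing import Iterable

ANSWER_SCORE = {"correct": 1.0, "partial": 0.5, "wrong": 0.0}


def qa_score(labels: Iterable[str]) -> float:
    """Average score over all judged answers for one system."""
    values = [ANSWER_SCORE[label] for label in labels]
    return sum(values) / len(values)


if __name__ == "__main__":
    print(qa_score(["correct", "partial", "wrong", "correct"]))
```

Multiplying the result by 100 gives a percentage comparable to the QA figures reported in the results tables.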

We used the same 100 documents (50 documents for XSum and 50 documents for Newsroom) as in our first elicitation study. For XSum, we created 100 questions in total; we wrote two fact-based questions per document. For Newsroom summaries, we were often not able to write more than one fact-based question per document. Consequently, we only have 61 questions in total. Figures 3 and 4 show example summaries and their corresponding questions for XSum and Newsroom, respectively.

6 Results

In this section we present results for our model and comparison systems on the XSum dataset; we also discuss experiments on Newsroom (?) and analyze quantitative and qualitative differences between the two datasets.

6.1 Results on the XSum Dataset

Automatic Evaluation

Table 6 summarizes our ROUGE-based results. As can be seen, Seq2Seq outperforms the lead and random baselines by a large margin. PtGen, a Seq2Seq model with a “copying” mechanism, outperforms ext-oracle, a “perfect” extractive system, on ROUGE-2 and ROUGE-L. This is in sharp contrast to the performance of these models on the CNN/DailyMail (?) and Newsroom datasets (?), where they fail to outperform the lead. The result provides further evidence that XSum is a good testbed for abstractive summarization. PtGen+Covg, the best performing abstractive system on the CNN/DailyMail datasets, does not do well. We believe that the coverage mechanism is more useful when generating multi-line summaries and is basically redundant for extreme summarization.

ConvS2S, the convolutional variant of Seq2Seq, significantly outperforms all RNN-based abstractive systems (footnote 12). We hypothesize that its superior performance stems from the ability to better represent document content (i.e., by capturing long-range dependencies). Surprisingly, ConvS2S+Copy, a ConvS2S enhanced with a “copying” mechanism, obtains performance inferior to ConvS2S. Our analysis revealed that the multi-hop attention mechanism of ConvS2S is very effective in resolving the unknown (UNK) words, by simply copying the most attended word $w_{j}$ (estimated using the average attention scores as $\mathrm{argmax}_{w_{j}}\sum_{\ell=1}^{L}a_{ij}^{\ell}/L$) from the source text to replace an UNK word. The copy mechanism unnecessarily over-parametrizes the ConvS2S model, leading to a drop in performance. For example, ConvS2S correctly resolves two subsequent UNK words to “Dick Advocaat” whereas ConvS2S+Copy incorrectly resolves them to “Dick Dick” as shown at the bottom of Figure 3. The Seq2Seq model without the copy mechanism is prone to generating random rare words (e.g., “Andre Mccormack” for the same example in Figure 3) or unresolved UNK words (see the top example in Figure 4), while PtGen guides the model towards sampling words from the source text (footnote 13).

Footnote 12: Statistical significance at the 95% confidence level is estimated using bootstrap resampling (?) with the official ROUGE script.

Footnote 13: None of the models of ? (?) resolves UNK by simply copying the most attended word from the source text. PtGen and PtGen+Covg rely on the copy mechanism to sample words from the extended target vocabulary, including source words.
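The UNK replacement heuristic can be sketched in Python as follows: average the attention that one decoding step places on each source position across all decoder layers, then copy the source word with the highest average. The attention values and the function name are hypothetical.

```python
from typing import List


def resolve_unk(layer_attention: List[List[float]], source: List[str]) -> str:
    """Return the source word with the highest attention averaged over layers.

    layer_attention[l][j] is the attention of decoder layer l on source position j
    at the decoding step that produced the UNK token.
    """
    num_layers = len(layer_attention)
    averaged = [sum(layer[j] for layer in layer_attention) / num_layers
                for j in range(len(source))]
    best_position = max(range(len(source)), key=averaged.__getitem__)
    return source[best_position]


if __name__ == "__main__":
    attention = [[0.1, 0.7, 0.2],    # layer 1
                 [0.2, 0.5, 0.3]]    # layer 2
    print(resolve_unk(attention, ["coach", "advocaat", "resigned"]))
```

No extra parameters are learned for this step, which is consistent with the observation that adding a trained copy mechanism brings no benefit.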

Table 6 also shows several variants of T-ConvS2S including an encoder network enriched with information about how topical a word is on its own ($\mathrm{enc}_{t^{\prime}}$) or in the document ($\mathrm{enc}_{(t^{\prime},t_{D})}$). We also experimented with various decoders by conditioning every prediction on the topic of the document, basically encouraging the summary to be in the same theme as the document ($\mathrm{dec}_{t_{D}}$) or letting the decoder decide the theme of the summary. Interestingly, all four T-ConvS2S variants outperform ConvS2S and ConvS2S+Copy. T-ConvS2S performs best when both encoder and decoder are constrained by the document topic ($\mathrm{enc}_{(t^{\prime},t_{D})}$, $\mathrm{dec}_{t_{D}}$). In the remainder of the paper, we refer to this variant as T-ConvS2S.

How Abstractive are the Generated Summaries?

We further assessed the extent to which various models are able to perform rewriting by generating genuinely abstractive summaries. Table 7 shows the proportion of novel $n$ -grams for abstractive systems based on RNNs (Seq2Seq, PtGen and PtGen+Covg) and our convolutional models (ConvS2S, ConvS2S+Copy and T-ConvS2S). We omit extractive systems (lead and ext-oracle) as they are not capable of generating summaries from scratch with novel $n$ -grams.

Overall, we observe that all abstractive models generate a fair amount of novel constructions that go beyond what is said in the source document. This result further supports our claim that XSum is an appropriate testbed for abstractive summarization. The three convolutional models show comparable proportions of novel $n$ -grams, while RNN-based models show greater variance with Seq2Seq generating the highest proportion of novel $n$ -grams. PtGen+Covg performs the least rewriting, followed by PtGen. Interestingly, PtGen trained on XSum only copies 4% of 4-grams from the source document, 10% of trigrams, 27% of bigrams, and 73% of unigrams. This is in sharp contrast to PtGen trained on CNN/DailyMail which copies more than 85% of 4-grams in the source document, 90% of trigrams, 95% of bigrams, and 99% of unigrams (?). We should point out that the summaries being evaluated have on average comparable lengths: summaries generated by Seq2Seq, PtGen, and PtGen+Covg contain 23.02, 22.86, and 22.52 words, respectively; those generated by ConvS2S, ConvS2S+Copy, and T-ConvS2S have 20.07, 19.82 and 20.22 words, respectively, while gold summaries are the longest with 23.26 words.
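The proportion of novel $n$-grams can be computed as in the Python sketch below. It assumes whitespace-tokenized, lowercased text and counts $n$-gram tokens (repeated $n$-grams count each time); whether the reported figures count tokens or types is not stated, so treat this as one plausible reading. Function names are hypothetical.

```python
from typing import List, Set, Tuple

NGram = Tuple[str, ...]


def ngrams(tokens: List[str], n: int) -> List[NGram]:
    """All contiguous n-grams of a token list, in order."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def novel_ngram_ratio(summary: List[str], document: List[str], n: int) -> float:
    """Fraction of summary n-grams that never occur in the source document."""
    summary_ngrams = ngrams(summary, n)
    if not summary_ngrams:
        return 0.0
    seen: Set[NGram] = set(ngrams(document, n))
    return sum(g not in seen for g in summary_ngrams) / len(summary_ngrams)


if __name__ == "__main__":
    doc = "the coach resigned on monday".split()
    summ = "the coach quit".split()
    for n in (1, 2):
        print(n, novel_ngram_ratio(summ, doc, n))
```

The copied proportion quoted above is simply one minus this ratio.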

Human Evaluation

Recall that system-generated summaries were evaluated in two studies, one aimed at eliciting judgments of summary quality and the other following a question-answering paradigm. In both studies, participants were asked to evaluate summaries produced from the ext-oracle baseline, PtGen, the best performing RNN-based system according to ROUGE (see Table 6), ConvS2S, our topic-aware model T-ConvS2S, and the human-authored gold summary (gold). We did not include summaries from the lead or ConvS2S+Copy as they were significantly inferior to other models. Table 8 presents our results.

Perhaps unsurprisingly, human-authored summaries were considered best, whereas T-ConvS2S was ranked 2nd followed by ext-oracle and ConvS2S. PtGen was ranked worst with the lowest score of $-0.218$. We carried out pairwise comparisons between all models to assess whether system differences are statistically significant. gold is significantly different from all other systems and T-ConvS2S is significantly different from ConvS2S and PtGen (using a one-way ANOVA with posthoc Tukey HSD tests; $p<0.01$). All other differences are not statistically significant.

The rightmost column in Table 8 shows the results of the QA evaluation. Based on the summaries generated by T-ConvS2S, participants can answer $46.05\%$ of the questions correctly. Summaries generated by ConvS2S, PtGen, and ext-oracle provide answers to $30.90\%$ , $21.40\%$ , and $15.70\%$ of the questions, respectively. Pairwise differences between all systems are statistically significant ( $p<0.01$ ) with the exception of PtGen and ext-oracle. ext-oracle performs poorly on both QA and rating evaluations. The examples in Figure 3 indicate that ext-oracle is often misled by selecting a sentence with the highest ROUGE (against the gold summary), but ROUGE itself does not ensure that the summary retains the most important information from the document. The QA evaluation further emphasizes that in order for the summary to be felicitous, information needs to be embedded in the appropriate context. For example, ConvS2S and PtGen will fail to answer the question “Who has resigned?” (see Figure 3 second block) despite containing the correct answer “Dick Advocaat” due to the wrong context. T-ConvS2S is able to extract important entities from the document with the right theme.

6.2 Results on the Newsroom-Abs Dataset

We next examine whether our approach extends to other datasets with similar characteristics. Specifically, we examine the performance of the proposed model and related models on the abstractive portion of the Newsroom dataset (Newsroom-Abs; ?).

Automatic Evaluation

Table 9 summarizes the results of our ROUGE-based evaluation. In addition to earlier discussed systems (see Section 5.1), we also include ?’s (?) versions of lead and PtGen. Our lead selects the first 2 sentences to form the summary compared to the lead reported in ? (?) which selects the first 3 sentences. Our lead is more in line with the average number of sentences observed in the reference summaries which is 1.25 (see Table 2), and as a result obtains better performance. We also found that PtGen (?) was trained on the whole Newsroom dataset. To make a fair comparison, we report results with models trained on the Newsroom-Abs portion of the dataset only. The discrepancy in the results between our PtGen and the PtGen model reported in ? (?) can be explained by the usage of different training sets.

Our convolutional models, ConvS2S, ConvS2S+Copy, and T-ConvS2S, significantly outperform the lead baselines (footnote 14). ConvS2S significantly outperforms Seq2Seq, its RNN counterpart, but lags behind PtGen. T-ConvS2S performs competitively against PtGen on R1 and RL scores and better on R2 (5.56 vs 5.15). The superior performance of T-ConvS2S over ConvS2S confirms our hypothesis that T-ConvS2S enhanced with topic information is better at identifying pertinent content and generating informative summaries. The worse performance of ConvS2S+Copy against ConvS2S further supports our claim that the multi-hop attention mechanism already in place in ConvS2S is very effective at resolving UNKs simply by copying the most attended words from the source. However, this is not the case with the RNN-based models (?). As previously discussed, Seq2Seq is prone to generating UNKs (see the top block in Figure 4), while PtGen corrects for this with the copy mechanism. The summaries generated by Seq2Seq have a total of 11,418 UNK words, while PtGen only generates 4,467 UNK words on the Newsroom-Abs test set (footnote 15).

Footnote 14: Again, we use the pyrouge script to estimate statistical significance.

Footnote 15: Surprisingly, none of the RNN-based abstractive models generates UNK words on XSum. We believe this is due to the smaller vocabulary size (81,092 for XSum vs. 157,939 for Newsroom-Abs; Table 2).

Interestingly, all abstractive summaries fall behind the extractive oracle (ext-oracle) on this dataset. In contrast, most abstractive models (PtGen, ConvS2S, ConvS2S+Copy and T-ConvS2S) were able to outperform ext-oracle on XSum. This suggests that Newsroom-Abs still has some bias towards extractive methods and improved sentence selection would bring further performance gains for extractive approaches on this dataset. Models trained on XSum are better at generating good quality abstracts, e.g., T-ConvS2S achieves ROUGE scores of 31.89/11.54/25.75 compared to 16.97/5.56/14.70 on Newsroom-Abs. There are two reasons for this. Firstly, Newsroom has a great variety of summarization styles due to its collection from multiple news outlets; it is hard for abstractive methods to effectively model this. Secondly, XSum reference summaries are more representative of document content, whereas Newsroom-Abs summaries are more indicative (see Section 6.3 for examples of reference summaries from XSum and Newsroom). It is probably harder for abstractive models to generate indicative summaries describing the source text rather than directly presenting its content.

Human Evaluation

For both evaluation protocols, participants were asked to assess summaries produced from the ext-oracle baseline, PtGen, ConvS2S, our topic-aware model T-ConvS2S, and the human-authored gold summary (gold). We did not include summaries from the lead or ConvS2S+Copy as they were significantly inferior to other models. Table 10 presents our results.

To our surprise, ext-oracle summaries were considered best, whereas the human-authored summaries were ranked 2nd followed by PtGen and T-ConvS2S. ConvS2S was ranked worst with the lowest score of $-0.397$ . In line with our findings in Section 6.3, participants found ext-oracle summaries to be more informative than human-authored summaries which are often indicative in nature. We carried out pairwise comparisons between all models to assess whether system differences are statistically significant. The difference between T-ConvS2S and PtGen is not statistically significant (using a one-way ANOVA with posthoc Tukey HSD tests; $p<0.01$ ), while all other differences are.

The rightmost column in Table 10 shows the results of the QA evaluation. Based on the oracle extracts, participants can answer $41.31\%$ of the questions correctly. Summaries generated by T-ConvS2S, PtGen, and ConvS2S provide answers to $23.28\%$, $20.98\%$, and $11.97\%$ of the questions, respectively. ext-oracle performs best on both QA and judgment elicitation evaluations. However, we should point out that ext-oracle has the advantage of selecting the best set of sentences (as determined by ROUGE) without any length constraints. Consequently, summaries generated by ext-oracle tend to be longer with 38.84 words on average (see Figure 4 second block). Summaries generated by PtGen, ConvS2S, and T-ConvS2S contain 23.72, 19.41 and 20.66 words, respectively, while gold summaries contain 23.26 words. Perhaps unsurprisingly T-ConvS2S, which was slightly lagging behind PtGen on R1 and RL scores, performs better than both PtGen and ConvS2S in terms of correctly answering questions. The QA evaluation shows that T-ConvS2S is able to generate informative summaries with pertinent information embedded in the appropriate context. Pairwise differences between systems are all statistically significant ($p<0.01$) with the exception of T-ConvS2S and PtGen, and of PtGen and ConvS2S.

6.3 Informative vs. Indicative Summaries

Our experiments have so far revealed differences in the nature of XSum and Newsroom summaries. XSum summaries are often informative of the document content while Newsroom summaries are indicative, i.e., they describe the source text rather than directly presenting the information it contains. In this section, we provide further empirical support for this claim.

We conducted a human evaluation where participants were asked to rate reference summaries from both datasets. Specifically, they were presented with a document and its gold summary and asked to decide whether it was informative (i.e., it relayed pertinent content from the document), partially informative, or uninformative. The study was conducted on Amazon Mechanical Turk with the same 100 test documents used in our judgment elicitation and QA studies on XSum and Newsroom-Abs. We collected judgments from three different participants for each document. The order of documents and systems was randomized.

Table 11 presents the results of this study. We measure the informativeness of each dataset as the average score assigned by crowdworkers across summaries; a summary receives a score of 3 if it is deemed informative, 2 if it is partially informative, and 1 if it is uninformative. Therefore, the informativeness score for a dataset as a whole may vary from 1 to 3, with 1 being least informative and 3 being most informative. We also report the proportion of times a dataset was considered informative, partially informative, and uninformative. XSum reference summaries were mostly considered informative (68%), around a quarter of them (26%) were deemed partially informative, and only 6% were found uninformative. In comparison, less than half (48.67%) of Newsroom-Abs summaries were found informative, the remaining being either partially informative or uninformative. XSum achieved an informativeness score of 2.62 compared to 2.30 for the Newsroom dataset. We carried out pairwise comparisons to assess whether informativeness differences are statistically significant. We found that XSum is significantly more informative than Newsroom (using a one-way ANOVA with posthoc Tukey HSD tests; $p<0.01$). Figure 5 shows the 15 best summaries from each dataset ranked from least to most informative.
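The scoring scheme described above can be illustrated with a minimal Python sketch. The function and label names are hypothetical and are not taken from the authors' code; the sketch only assumes the 3/2/1 scoring stated in the text. As a consistency check, plugging in the reported XSum proportions (68%, 26%, 6%) reproduces the reported score of 2.62.

```python
# Hypothetical label names; the 3/2/1 scoring follows the text above.
SCORES = {"informative": 3, "partial": 2, "uninformative": 1}


def informativeness_score(labels):
    """Average score over a list of crowdworker labels."""
    if not labels:
        raise ValueError("at least one label is required")
    return sum(SCORES[label] for label in labels) / len(labels)


def score_from_proportions(proportions):
    """Dataset-level score from the proportion of judgments per label.

    `proportions` maps each label to a fraction; fractions should sum to 1.
    """
    return sum(SCORES[label] * p for label, p in proportions.items())


# Proportions reported for XSum reference summaries.
xsum = {"informative": 0.68, "partial": 0.26, "uninformative": 0.06}
print(round(score_from_proportions(xsum), 2))  # 3*0.68 + 2*0.26 + 1*0.06 = 2.62
```

The same weighted sum is not recomputed here for Newsroom-Abs, because the text reports only its informative proportion (48.67%) and not the split between partially informative and uninformative judgments.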

Figure 6 shows the percentage of $n$-grams in test summaries which have already been seen in training summaries. For both datasets, as the size of $n$-grams increases, their chance of having been seen in the training summaries decreases rapidly. For XSum, this percentage drops to almost 0% for any $n$-grams larger than size 10. Interestingly, this is not the case for Newsroom-Abs, where more than 4% of $n$-grams (sizes 10 to 15) in test summaries have already been seen in training summaries. This result suggests that Newsroom-Abs summaries are somewhat formulaic, displaying a certain degree of repetition that goes beyond simple phrases. As an example, consider the summary from Figure 5, “Collection of all usatoday.com coverage of the Exorcist, including articles, videos, photos, and quotes.”, which is rather generic and would apply to any movie, not just the Exorcist. In fact, two crowdworkers labeled this summary as uninformative and one as partially informative (informativeness score is 1.33). Figure 7 shows the type-token ratio of different $n$-grams in gold summaries as a measure of how often constructions are being reused; a higher type-token ratio represents larger variation in terms of $n$-grams. XSum summaries exhibit more variation for $n$-grams of size larger than 5, corroborating our claim that Newsroom-Abs summaries are more repetitive.
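The two statistics discussed above can be sketched as follows. This is an illustrative approximation rather than the authors' implementation: it assumes whitespace tokenization and counts every $n$-gram occurrence (token) in the test summaries, neither of which is specified in the text. All function names and the toy summaries are hypothetical.

```python
def ngrams(tokens, n):
    """All contiguous n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def seen_ngram_percentage(train_summaries, test_summaries, n):
    """Percentage of test-summary n-gram occurrences also found in training summaries."""
    train_set = set()
    for summary in train_summaries:
        train_set.update(ngrams(summary.split(), n))  # assumed whitespace tokenization
    total = seen = 0
    for summary in test_summaries:
        for gram in ngrams(summary.split(), n):
            total += 1
            seen += gram in train_set
    return 100.0 * seen / total if total else 0.0


def type_token_ratio(summaries, n):
    """Distinct n-grams (types) divided by n-gram occurrences (tokens)."""
    grams = [g for summary in summaries for g in ngrams(summary.split(), n)]
    return len(set(grams)) / len(grams) if grams else 0.0


# Toy data for illustration only.
train = ["collection of all coverage of the film", "the cat sat on the mat"]
test = ["collection of all coverage of the play"]

for n in (1, 2, 7):
    print(n, seen_ngram_percentage(train, test, n), type_token_ratio(train, n))
```

On real data, a formulaic corpus would show a seen-percentage that stays above zero for long $n$-grams and a lower type-token ratio, which is the pattern the text reports for Newsroom-Abs.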

7 Conclusions

In this paper we introduced the task of “extreme summarization” together with a large-scale dataset which pushes the boundaries of abstractive methods. We further proposed a novel “topic-aware” fully-convolutional deep learning model which is well-suited to extreme summarization. We also designed a question-answering paradigm to assess the degree to which abstractive models retain key information from the document. Experimental evaluation revealed that models which have abstractive capabilities do better on this task and that high-level document knowledge in terms of topics and long-range dependencies is critical for recognizing pertinent content and generating informative summaries. Finally, experimental results support our claim that extreme summarization is a good testbed for abstractive summarization; the task, as operationalized via our dataset, encourages models to create informative summaries which promote novel constructions and are less skewed towards extractive mechanisms.

Extreme summarization revisits interesting problems in abstractive summarization with the relatively simpler objective of generating single sentence summaries rather than multi-line summaries (?, ?, ?). Models trained for extreme summarization require document-level inference, abstraction, and paraphrasing to generate summaries which are informative and consistent with the input document. Throughout this paper we have argued that our model is better suited for this task than recurrent abstractive models due to its ability to foreground pertinent content using topic vectors and model long-range dependencies using a multi-layer convolutional architecture. In the future, we would like to create more linguistically-aware encoders and decoders incorporating co-reference and entity linking. It would also be interesting to use contextualised word representations (?, ?, ?) to enhance modeling of long-range dependencies within our model.

Beyond generating single sentences, we would like to adapt our method to create multi-sentence summaries. For instance, this would allow us to assess whether our model’s ability to capture long-range dependencies translates to more readable and coherent summaries. Finally, our method might be relevant to summarizing texts from other domains, e.g., generating multi-line abstracts of scientific articles (?, ?), creating Wikipedia pages in a multi-document summarization setting (?), and aggregating product or movie reviews (?, ?).

Acknowledgments

We thank the reviewers for their enthusiastic feedback. We gratefully acknowledge the support of the European Research Council (Lapata; award number 681760), the European Union under the Horizon 2020 SUMMA project (Narayan, Cohen; grant agreement 688139) and Bloomberg (Cohen).

Bibliography


1. Almeida, M. B., and Martins, A. F. T. (2013). Fast and robust compressive summarization with dual decomposition and multi-task learning. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics, pp. 196–206, Sofia, Bulgaria.
2. Angelidis, S., and Lapata, M. (2018). Summarizing opinions: Aspect extraction meets sentiment prediction and they are both weakly supervised. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pp. 3675–3686.
3. Bahdanau, D., Cho, K., and Bengio, Y. (2015). Neural machine translation by jointly learning to align and translate. In Proceedings of the 3rd International Conference on Learning Representations, San Diego, California, USA.
4. Barzilay, R., and Elhadad, M. (1997). Using lexical chains for text summarization. In Proceedings of the ACL Workshop on Intelligent Scalable Text Summarization, pp. 10–17, Madrid, Spain.
5. Barzilay, R., Elhadad, N., and McKeown, K. R. (2002). Inferring strategies for sentence ordering in multidocument news summarization. Journal of Artificial Intelligence Research, 17 (1), 35–55.
6. Berg-Kirkpatrick, T., Gillick, D., and Klein, D. (2011). Jointly learning to extract and compress. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pp. 481–490, Portland, Oregon, USA.
7. Blei, D. M., Ng, A. Y., and Jordan, M. I. (2003). Latent Dirichlet allocation. The Journal of Machine Learning Research, 3, 993–1022.
8. Carbonell, J., and Goldstein, J. (1998). The use of MMR, diversity-based reranking for reordering documents and producing summaries. In Proceedings of the 21st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 335–336, Melbourne, Australia.