A Hierarchical Decoder with Three-level Hierarchical Attention to Generate Abstractive Summaries of Interleaved Texts

Sanjeev Kumar Karn, Francine Chen, Yan-Ying Chen, Ulli Waltinger, and Hinrich Schütze

arXiv:1906.01973 · cs.CL · April 10, 2020

Open Access

TL;DR

This paper introduces an end-to-end hierarchical encoder-decoder with a three-level attention mechanism for abstractive summarization of interleaved texts, effectively reducing error propagation and improving fluency.

Contribution

It presents a novel hierarchical attention mechanism and an integrated model that outperforms existing two-step systems in summarizing interleaved texts.

Findings

01

Outperforms state-of-the-art two-step systems by 20-40%.

02

Effectively disentangles threads without explicit separation.

03

Enhances summary fluency and coherence.

Abstract

Interleaved texts, where posts belonging to different threads occur in one sequence, are common, e.g., in online chat conversations. To quickly obtain an overview of such texts, existing systems first disentangle the posts by threads and then extract summaries from those threads. The major issues with such systems are error propagation and non-fluent summaries. To address those, we propose an end-to-end trainable hierarchical encoder-decoder system. We also introduce a novel hierarchical attention mechanism which combines three levels of information from an interleaved text, i.e., posts, phrases and words, and implicitly disentangles the threads. We evaluated the proposed system on multiple interleaved text datasets, and it outperforms a SOTA two-step system by 20-40%.

Tables (12)

Table 1: The left rows contain an interleaving of 3 articles with 2 to 5 sentences each, and the right rows contain their interleaved titles. Associated sentences and titles are marked with the same symbol.

✦ to assess the effect of a program of supervised fitness…
✓ this study was conducted to evaluate the influence of e…	✓ caffeine in sport . influence of endurance exercise on the urinary caffeine concentration .
✦ an 8-week randomized , controlled trial .…
✓ nine endurance-trained athletes participated in a randomised…
…	✦ supervised fitness walking in patients with osteoarthritis of the knee . a randomized , controlled trial .
✱ we examined the effects of intensity of training on ratings…
✱ subjects were recruited as sedentary controls or were randomly…
✱ the at lt group trained at velocity lt and the greater than…	✱ the effect of training intensity on ratings of perceived exertion .
✓ data were obtained on 47 of 51 intervention patients and 45…

Table 2: Summarization performance (Rouge Recall-Scores) comparing models when the threads are disentangled (top blue dotted section, upper bounds) and when the threads are entangled (bottom green dashed section, real-world) on the Easy, Medium and Hard Pubmed Corpora. ind = individual, dis = disentangled (ground-truth), kmn = K-means disentangled and ent = entangled. In the middle, the first row shows a seq2seq model trained on ground-truth disentangled texts and tested on unsupervised disentangled texts, and the second row shows a seq2seq model trained and tested on unsupervised disentangled texts. The best performance for the entangled threads and for the disentangled threads is in bold.

Input Text	Model	Easy			Medium			Hard
Input Text	Model	Rouge-1	Rouge-2	Rouge-L	Rouge-1	Rouge-2	Rouge-L	Rouge-1	Rouge-2	Rouge-L
ind	seq2seq	35.09	28.72	13.16	36.31	28.78	13.45	37.74	28.72	13.76
dis	seq2seq	36.38	29.90	14.78	35.63	28.45	13.98	37.87	28.85	14.77
dis	hier2hier	35.30	28.93	13.35	37.30	29.83	14.90	39.09	30.11	15.22
kmn	seq2seq(dis)	34.48	27.51	13.31	34.05	26.58	13.14	35.54	26.36	13.65
kmn	seq2seq	34.28	27.84	13.86	34.89	27.42	13.68	31.22	23.37	11.77
kmn	compress	30.04	19.83	10.75	29.37	17.54	10.43	29.11	15.76	10.13
ent	seq2seq	35.78	28.89	14.62	35.20	27.44	13.54	32.46	24.17	12.17
ent	hier2hier	35.88	28.47	13.33	37.29	29.63	14.95	37.11	27.97	14.26

Table 3: Rouge Recall-Scores on the Medium and Hard Corpora. The base Pubmed corpus has abstract-summary pairs of 10 MeSH types, while the base Stack Exchange corpus has posts-question pairs from 12 topics.

Corpus	Model	Pubmed			Stack Exchange
Difficulty	Model	Rouge-1	Rouge-2	Rouge-L	Rouge-1	Rouge-2	Rouge-L
Medium	seq2seq	30.67	11.71	23.80	18.78	03.52	14.73
Medium	hier2hier	32.78	12.36	25.33	24.34	05.07	18.63
Hard	seq2seq	29.07	10.96	21.76	20.21	04.03	14.93
Hard	hier2hier	33.36	12.69	24.72	24.96	05.56	17.95

Table 4: Rouge Recall-Scores of models on the Stack Exchange Medium and Hard Corpora.

	Medium Corpus
Model	Rouge-1	Rouge-2	Rouge-L
seq2seq	19.67	03.88	15.37
hier2hier	23.97	05.63	18.75
	Hard Corpus
seq2seq	19.62	03.71	14.90
hier2hier	24.14	05.00	17.25

Table 5: Rouge Recall-Scores of ablated models (encoder-decoder) on the Pubmed Hard Corpus.

	Pubmed Hard Corpus
Model	Rouge-1	Rouge-2	Rouge-L
seq2seq	29.07	10.96	21.76
seq2hier	32.92	11.87	24.43
hier2seq	31.86	11.9	23.57
hier2hier	33.36	12.69	24.72

Table 6: Rouge Recall-Scores of ablated models (attentions) on the Hard Pubmed Corpus.

Model	Rouge-1	Rouge-2	Rouge-L
hier2hier $(+ 𝜸 + 𝜷)$	33.36	12.69	24.72
hier2hier $(- 𝜸 + 𝜷)$	32.65	12.21	24.23
hier2hier $(+ 𝜸 - 𝜷)$	31.28	10.20	23.49
hier2hier(Li et al.)	29.83	09.80	22.17
hier2hier $(- 𝜸 - 𝜷)$	30.58	10.00	22.96
seq2seq	29.07	10.96	21.76

Table 7: Rouge F1 Scores of models on the AMI Corpus with summary size 150.

Model	Rouge-1	Rouge-2	Rouge-L
Shang et al.	29.00	-	-
seq2seq	31.60	10.60	25.03
hier2hier	39.75	12.75	25.41
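
As a rough cross-check of the reported 20-40% gain, the relative improvements implied by the Rouge-1 column of Table 7 can be computed directly. The sketch below is only an illustration: that the headline figure refers to this kind of relative (rather than absolute) difference is an assumption, and `relative_gain` is a hypothetical helper name.

```python
def relative_gain(new, old):
    """Relative improvement of `new` over `old`, in percent."""
    return 100.0 * (new - old) / old

# Rouge-1 F1 values copied from Table 7 (AMI Corpus).
ami_rouge1 = {"Shang et al.": 29.00, "seq2seq": 31.60, "hier2hier": 39.75}

# Gain of hier2hier over the two-step system and over the sequential baseline.
gain_vs_two_step = relative_gain(ami_rouge1["hier2hier"], ami_rouge1["Shang et al."])
gain_vs_seq2seq = relative_gain(ami_rouge1["hier2hier"], ami_rouge1["seq2seq"])

print(f"vs. two-step: {gain_vs_two_step:.2f}%  vs. seq2seq: {gain_vs_seq2seq:.2f}%")
```

Both values fall inside the 20-40% band stated in the abstract.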

Table 8: Interleaved sentences of 3 articles, and corresponding ground-truth and hier2hier generated summaries. The top 2 sentences that were attended ( $\boldsymbol{\gamma}$ ) for the generation are on the left. Additionally, top words ( $\boldsymbol{\beta}$ ) attended for the generation are colored accordingly.

	Interleaved Texts
$0$	this study was conducted to evaluate the influence of excessive sweating during long-distance running on the urinary concentration of caffeine…
$1$	to assess the effect of a program of supervised fitness walking and patient education on functional status , pain , and…
$\dots$	…
$5$	a total of 102 patients with a documented diagnosis of primary osteoarthritis of one or both knees participated…
$6$	we examined the effects of intensity of training on ratings of perceived exertion (…
$\dots$	…
	GroundTruth/Generation
	caffeine in sport . influence of endurance exercise on the urinary caffeine concentration .
0,2	effect of excessive [UNK] during [UNK] running on the urinary concentration of caffeine .
	supervised fitness walking in patients with osteoarthritis of the knee . a randomized , controlled trial .
1,4	effect of a physical fitness walking on functional status , pain , and pain
	the effect of training intensity on ratings of perceived exertion .
6,8	effects of intensity of training on perceived [UNK] in [UNK] athletes .

Table 9: The left rows contain an interleaving of 3 articles with 2 to 5 sentences each, and the right rows contain their interleaved titles. Associated sentences and titles are marked with the same symbol.

✓ botulinum toxin a is effective for treatment…	✓ prospective randomised controlled trial comparing trigone-sparing versus trigone-including intradetrusor injection of abobotulinumtoxina for refractory idiopathic detrusor overactivity.
✓ the trigone is generally spared because of the theoretical…
✓ evaluate efficacy and safety of trigone-including .…
✦ most methadone-maintained injection drug users …
✱ gender-related differences in the incidence of bleeding…
✱ we studied patients with stemi receiving fibrinolysis…
✦ physicians may be reluctant to treat hcv in idus because …	✦ rationale and design of a randomized controlled trial of directly observed hepatitis c treatment delivered in methadone clinics.
✱ outcomes included moderate or severe bleeding defined …
✦ optimal hcv management approaches for idus remain …
✱ moderate or severe bleeding was 1.9-fold higher …	✱ comparison of incidence of bleeding and mortality of men versus women with st-elevation myocardial infarction treated with fibrinolysis.
✦ we are conducting a randomized controlled trial in a network…
✱ bleeding remained higher in women even after adjustment …

Table 10: The left rows contain an interleaving of 4 articles with 2 to 5 sentences each, and the right rows contain their interleaved titles. Associated sentences and titles are marked with the same symbol.

✓ the effects of short-course antiretrovirals given to…	✓ hiv-1 persists in breast milk cells despite antiretroviral treatment to prevent mother-to-child transmission.
✱ good adherence is essential for successful antiretroviral…
✓ women in kenya received short-course zidovudine ( zdv )…
✓ breast milk samples were collected two to three times weekly.…
✦ the present primary analysis of antiretroviral therapy with…	✱ patterns of individual and population-level adherence to antiretroviral therapy and risk factors for poor adherence in the first year of the dart trial in uganda and zimbabwe.
✱ this was an observational analysis of an open multicenter…
✦ patients with hiv-1 rna at least 5000 copies/ml were…
✱ at 4-weekly clinic visits , art drugs were provided and …
✦ the primary objective was to demonstrate non-inferiority…
✱ viral load response was assessed in a subset of patients…	✦ efficacy and safety of once-daily darunavir/ritonavir versus lopinavir/ritonavir in treatment-naive hiv-1-infected patients at week 48.
♣ we explored the link between serum alpha-fetoprotein levels…
✱ drug possession ratio ( percentage of drugs taken between…
♣ a low alpha-fetoprotein level ( $<$ 5.0 ng/ml ) was an…
✦ six hundred and eighty-nine patients were randomized…
✦ at 48 weeks , 84 % of drv/r and 78 % of lpv/r patients…	♣ serum alpha-fetoprotein predicts virologic response to hepatitis c treatment in hiv coinfected patients.
✓ hiv-1 dna was quantified by real-time pcr .…
♣ serum alpha-fetoprotein measurement should be integrated …

Table 11: Interleaved sentences of 3 articles, and corresponding ground-truth and hier2hier generated summaries. The top 2 sentences that were attended ( $\boldsymbol{\gamma}$ ) for the generation are on the left. Additionally, top words ( $\boldsymbol{\beta}$ ) attended for the generation are colored accordingly.

	Interleaved Texts
$0$	botulinum toxin a is effective for treatment of idiopathic detrusor overactivity ( [UNK] )
$1$	the [UNK] is generally [UNK] because of the theoretical risk of [UNK] reflux ( [UNK] ) , although studies assessing…
$\dots$	…
$3$	most [UNK] injection drug users ( idus ) have been infected with hepatitis c virus ( hcv ) , but…
$4$	[UNK] differences in the incidence of bleeding and its relation to subsequent mortality in patients with st-segment elevation myocardial infarction…
$\dots$	…
$8$	optimal hcv management approaches for idus remain unknown .…
$\dots$	…
	GroundTruth/Generation
	prospective randomised controlled trial comparing trigone-sparing versus trigone-including intradetrusor injection of abobotulinumtoxina for refractory idiopathic detrusor overactivity.
0,1	efficacy of [UNK] [UNK] in patients with idiopathic detrusor overactivity : rationale , design
	rationale and design of a randomized controlled trial of directly observed hepatitis c treatment delivered in methadone clinics.
3,4	validation of a point-of-care hepatitis injection drug injection drug , hcv medication , and
	comparison of incidence of bleeding and mortality of men versus women with st-elevation myocardial infarction treated with fibrinolysis .
4,8	subgroup analysis of patients with st-elevation myocardial infarction with st-elevation myocardial infarction .

Table 12: Interleaved sentences of 4 articles, and corresponding ground-truth and hier2hier generated summaries. The top 2 sentences that were attended ( $\boldsymbol{\gamma}$ ) for the generation are on the left. Additionally, top words ( $\boldsymbol{\beta}$ ) attended for the generation are colored accordingly.

	Interleaved Texts
$0$	the effects of short-course antiretrovirals given to reduce mother-to-child transmission ( [UNK] ) on temporal patterns of [UNK] hiv-1 rna
$1$	good adherence is essential for successful antiretroviral therapy ( art ) provision , but simple measures have rarely been validated…
$2$	women in kenya received short-course zidovudine ( zdv ) , single-dose nevirapine ( [UNK] ) , combination [UNK] or short-course…
$3$	breast milk samples were collected two to three times weekly for 4-6 weeks .…
$4$	the present primary analysis of antiretroviral therapy with [UNK] examined in naive subjects ( [UNK] ) compares the efficacy and…
$\dots$	…
$10$	we explored the link between serum [UNK] levels and virologic response in [UNK] [UNK] c virus coinfected patients .…
$\dots$	…
	GroundTruth/Generation
	hiv-1 persists in breast milk cells despite antiretroviral treatment to prevent mother-to-child transmission .
0,2	impact of hiv-1 persists on hiv-1 rna in human immunodeficiency virus-infected individuals with hiv-1
	patterns of individual and population-level adherence to antiretroviral therapy and risk factors for poor adherence in the first year of the dart trial in uganda and zimbabwe .
1,3	impact of a antiretroviral treatment algorithm on adherence to antiretroviral therapy in [UNK] ,
	efficacy and safety of once-daily darunavir/ritonavir versus lopinavir/ritonavir in treatment-naive hiv-1-infected patients at week 48 .
4,2	a randomized trial of [UNK] versus [UNK] in treatment-naive hiv-1-infected patients with hiv-1 infection
	serum alpha-fetoprotein predicts virologic response to hepatitis c treatment in hiv coinfected patients .
10,12	predicting virologic response in [UNK] coinfected patients coinfected with hiv-1 : a [UNK] randomized

Equations

(1) $p_{k}^{STOP}=\sigma\left(\mbox{g}\left(\mathbf{h}^{D_{t2t}}_{k}\right)\right)$

(2) $\gamma^{k}_{i}=\sigma\left(\mbox{attn}^{\gamma}\left(\mathbf{h}^{D_{t2t}}_{k-1},\mathbf{P}_{i}\right)\right),\quad i\in\{1,\dots,n\}$

(3) $\beta^{k}_{i,j}=\sigma\left(\mbox{attn}^{\beta}\left(\mathbf{h}^{D_{t2t}}_{k-1},\mathbf{a}_{i,j}\right)\right)$ , where $\mathbf{a}_{i,j}=\mbox{add}\left(\mathbf{W}_{i,j},\mathbf{P}_{i}\right)$ , $i\in\{1,\dots,n\}$ , $j\in\{1,\dots,p\}$

(4) $\alpha^{k,l}_{i,j}=[\ldots]$ , where $e^{k,l}_{i,j}=\mbox{attn}^{\alpha}\left(\mathbf{h}^{D_{w2w}}_{k,l-1},\mathbf{a}_{i,j}\right)$

(5) $\sum_{k=1}^{m}\sum_{l=1}^{q}[\ldots]+\lambda\sum_{k=1}^{m}y_{k}^{STOP}\log\left(p_{k}^{STOP}\right)$


Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Advanced Text Analysis Techniques

Full text

A Hierarchical Decoder with Three-level Hierarchical Attention to Generate Abstractive Summaries of Interleaved Texts

Sanjeev Kumar Karn

Machine Intelligence, Siemens CT, Munich, Germany

Francine Chen

{chen,yanying}@fxpal.com

Yan-Ying Chen

{chen,yanying}@fxpal.com

Ulli Waltinger

Machine Intelligence, Siemens CT, Munich, Germany

Hinrich Schütze

1 Introduction

Interleaved texts, e.g., multi-author entries for activity reports and social media conversations such as those on Slack, are increasingly common. However, getting a quick sense of the different threads in an interleaved text is often difficult because the threads are entangled, i.e., posts belonging to different threads occur in one sequence; see a hypothetical example in Figure 1.

In conversation disentanglement, interleaved posts are grouped by thread. However, a reader still has to read all posts in all clustered threads to get the insights. To address this shortcoming, Shang et al. (2018) proposed a system that takes an interleaved text as input and provides the reader with its summaries. Their system is an unsupervised two-step system: first, a conversation disentanglement component clusters the posts thread-wise, and second, a multi-sentence compression component compresses the thread-wise posts to single-sentence summaries. However, this system has two major disadvantages. First, the disentanglement obtained through either supervised (Jiang et al., 2018) or unsupervised (Wang and Oard, 2009) methods propagates its errors to the downstream summarization task and therefore degrades the overall performance. Second, the compression component is restricted to formulating summaries out of the disentangled threads and therefore cannot bring in new words to improve fluency. We aim to tackle these issues through an end-to-end trainable encoder-decoder system that takes a variable-length input, e.g., interleaved texts, processes it, and generates a variable-length output, e.g., a multi-sentence summary. An end-to-end system eliminates the disentanglement component, and thus the error propagation. Furthermore, the corpus-level vocabulary of the decoder provides it with a greater selection of words, and thus a possibility to improve language fluency.

In the domain of text summarization, a hierarchical encoder, which encodes the words in a sentence (post) and then the sentences in a document (channel), is very commonly used (Nallapati et al., 2016; Hsu et al., 2018). However, hierarchical decoding is rare, as many works in the domain aim to comprehend an important fact from single or multiple documents. Summarizing interleaved texts provides us a unique opportunity to employ hierarchical decoding, as such texts comprise several facts from several threads. We also propose a novel hierarchical attention, which assists the decoder in its summary generation process with three levels of information from the interleaved text (posts, phrases, and words) rather than the traditional two levels, post and word (Nallapati et al., 2017, 2016; Tan et al., 2017; Cheng and Lapata, 2016).

As labeling of interleaved texts is a difficult and expensive task, we devised an algorithm that synthesizes corpora of interleaved text-summary pairs of different difficulty levels (in terms of entanglement) from a regular corpus of document-summary pairs. Using these corpora, we show that the encoder-decoder system not only obviates the disentanglement component but also enhances performance. Further, our hierarchical encoder-decoder system consistently outperforms traditional sequential ones.

To summarize, our contributions are:

• We propose an end-to-end encoder-decoder system, in place of a pipeline, to obtain a quick overview of interleaved texts.

• To the best of our knowledge, we are the first to use a hierarchical decoder to obtain multi-sentence abstractive summaries from texts.

• We propose a novel hierarchical attention that integrates information from three levels (posts, phrases, and words) and is trained end-to-end.

• We devise an algorithm that synthesizes interleaved text-summary corpora, on which we compare a pipeline system vs. an encoder-decoder, sequential vs. hierarchical decoding, and 2- vs. 3-level hierarchical attention. Overall, the proposed system attains 20-40% performance gains on both real-world (AMI) and synthetic datasets.

2 Related Work

Ma et al. (2012), Aker et al. (2016), and Shang et al. (2018) designed earlier systems that summarize posts in multi-party conversations in order to provide readers with an overview of the discussed matters. They broadly follow the same approach: cluster the posts and then extract a summary from each cluster.

There are two kinds of summarization: abstractive and extractive. In abstractive summarization, the model utilizes a corpus-level vocabulary and generates novel sentences as the summary, while extractive models extract or rearrange the source words as the summary. Abstractive models based on neural sequence-to-sequence (seq2seq) learning (Rush et al., 2015) were shown to generate summaries with higher ROUGE scores than feature-based abstractive models. Integration of attention into seq2seq (Bahdanau et al., 2014) led to further advancement of abstractive summarization (Nallapati et al., 2016; Chopra et al., 2016).

Li et al. (2015) proposed an encoder-decoder (auto-encoder) model that utilizes a hierarchy of networks: word-to-word followed by sentence-to-sentence. Their model is better at capturing the underlying structure than a vanilla sequential encoder-decoder model (seq2seq). Krause et al. (2017) and Jing et al. (2018) showed that multi-sentence captioning of an image through a hierarchical Recurrent Neural Network (RNN), topic-to-topic followed by word-to-word, is better than seq2seq.

These works suggest that a hierarchical encoder, with word-to-word encoding followed by post-to-post encoding, will better recognize the dispersed information in interleaved texts. Similarly, a hierarchical decoder, thread-to-thread followed by word-to-word, will intrinsically disentangle the posts and therefore generate more appropriate summaries.

Nallapati et al. (2016) devised a hierarchical attention mechanism for a seq2seq model, where two levels of attention distributions over the source, i.e., sentence and word, are computed at every step of the word decoding. Based on the sentence attentions, the word attentions are rescaled. Hsu et al. (2018) slightly simplified this mechanism and computed the sentence attention only at the first step. Our hierarchical attention is more intuitive and computes new sentence attentions for every new summary sentence, and unlike Hsu et al. (2018), is trained end-to-end.

3 Model

Problem Statement

We aim to design a system that when given a sequence of posts, $\mathit{C}=\langle\mathit{P}_{1},\ldots,\mathit{P}_{|\mathit{C}|}\rangle$ , produces a sequence of summaries, $\mathit{T}=\langle\mathit{S}_{1},\ldots,\mathit{S}_{|\mathit{T}|}\rangle$ . For simplicity and clarity, unless otherwise noted, we will use lowercase italics for variables, uppercase italics for sequences, lowercase bold for vectors and uppercase bold for matrices.

3.1 Encoder

The hierarchical encoder (see the left-hand section of Figure 2) is based on Nallapati et al. (2017), where the word-to-word and post-to-post encoders are bi-directional LSTMs. The word-to-word BiLSTM encoder ( $E_{w2w}$ ) runs over the word embeddings of post $\mathit{P}_{i}$ and generates a set of hidden representations, $\langle\mathbf{h}^{{E_{w2w}}}_{i,0},\ldots,\mathbf{h}^{{E_{w2w}}}_{i,p}\rangle$ , of $d$ dimensions. The average-pooled value of the word-to-word representations of post $\mathit{P}_{i}$ ( $\frac{1}{p}\sum_{j=0}^{p}\mathbf{h}^{{E_{w2w}}}_{i,j}$ ) is input to the post-to-post BiLSTM encoder ( $E_{p2p}$ ), which then generates a set of representations, $\langle\mathbf{h}^{E_{p2p}}_{0},\ldots,\mathbf{h}^{E_{p2p}}_{n}\rangle$ , corresponding to the posts. Overall, for a given channel $\mathit{C}$ , the output representations of word-to-word, $\mathbf{W}$ , and post-to-post, $\mathbf{P}$ , have $n\times p\times 2d$ and $n\times 2d$ dimensions, respectively.
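
The pooling step that connects the two encoder levels can be illustrated with a small sketch in plain Python. It only shows how the word-level representations of each post are averaged into one input vector for the post-to-post encoder; the BiLSTMs themselves are omitted, the numbers are made up, and `mean_pool` is a hypothetical helper name.

```python
def mean_pool(vectors):
    """Average a list of equal-length vectors element-wise."""
    dim = len(vectors[0])
    return [sum(v[d] for v in vectors) / len(vectors) for d in range(dim)]

# Toy word-level output W for a channel with n = 2 posts, p = 3 words per post
# and 2d = 4 features per word (values are illustrative only).
W = [
    [[1.0, 0.0, 2.0, 0.0], [3.0, 0.0, 2.0, 4.0], [2.0, 3.0, 2.0, 2.0]],
    [[0.0, 1.0, 0.0, 1.0], [0.0, 3.0, 0.0, 1.0], [3.0, 2.0, 0.0, 1.0]],
]

# One pooled vector per post; these would be fed to the post-to-post encoder.
post_inputs = [mean_pool(post) for post in W]
print(post_inputs)
```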

3.2 Decoder

The structure and arrangement of our hierarchical decoder are similar to the hierarchical auto-encoder of Li et al. (2015), with two uni-directional LSTM decoders, thread-to-thread and word-to-word (see the right-hand side of Figure 2). However, it differs considerably in its inputs, initial states, and attentions, which we explain in the next two sections.

The initial state $\mathbf{h}^{D_{t2t}}_{0}$ of the thread-to-thread LSTM decoder ( $f^{D_{t2t}}$ ) is set to a feedforward-mapped representation of the average-pooled post representations ( $\mathbf{c}^{\prime}=\frac{1}{n}\sum_{i=0}^{n}\mathbf{h}^{E_{p2p}}_{i}$ ). At each step $k$ of $f^{D_{t2t}}$ , a sequence of attention weights, $\langle\mathit{\hat{\beta}}^{k}_{0,0},\ldots,\mathit{\hat{\beta}}^{k}_{n,p}\rangle$ , corresponding to the set of encoded word representations, $\langle\mathbf{h}^{E_{w2w}}_{0,0},\ldots,\mathbf{h}^{E_{w2w}}_{n,p}\rangle$ , is computed using the previous state, $\mathbf{h}^{D_{t2t}}_{k-1}$ . We elaborate on the attention computation in the next section.

A weighted representation of the words (crossed blue circle) is then computed: $\sum_{i=1}^{n}\sum_{j=1}^{p}\hat{\beta}^{k}_{i,j}\mathbf{W}_{ij}$ . Additionally, we use the last hidden state $\mathbf{h}^{D_{w2w}}_{k-1,q}$ of the word-to-word decoder LSTM ( ${D_{w2w}}$ ) for the previously generated summary sentence as the second input to compute the next state of the thread-to-thread decoder, i.e., $\mathbf{h}^{D_{t2t}}_{k}$ . The motivation is to provide information about the previous sentence.
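
The weighted word representation is a plain attention-weighted sum over all words of all posts. A minimal sketch, assuming the rescaled weights are given as an $n\times p$ nested list and the word representations as an $n\times p\times 2d$ nested list (`weighted_word_context` is a hypothetical name, and the values are made up):

```python
def weighted_word_context(beta_hat, W):
    """Sum beta_hat[i][j] * W[i][j] over all posts i and words j."""
    dim = len(W[0][0])
    context = [0.0] * dim
    for weight_row, word_row in zip(beta_hat, W):
        for weight, word_vec in zip(weight_row, word_row):
            for d in range(dim):
                context[d] += weight * word_vec[d]
    return context

# Two posts with two words each; each word has a 2-dimensional representation.
beta_hat = [[0.5, 0.0], [0.0, 1.0]]
W = [[[2.0, 4.0], [1.0, 1.0]], [[9.0, 9.0], [1.0, 3.0]]]
context = weighted_word_context(beta_hat, W)
print(context)
```

Words with a weight of zero contribute nothing, so the context vector is dominated by the phrases the thread decoder currently attends to.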

The current state $\mathbf{h}^{D_{t2t}}_{k}$ is passed through a single layer feedforward network and a distribution over STOP=1 and CONTINUE=0 is computed:

$p_{k}^{STOP}=\sigma\left(\mbox{g}\left(\mathbf{h}^{D_{t2t}}_{k}\right)\right)$ (1)

where g is a feedforward network. In Figure 2, the process is depicted by a yellow circle. The thread-to-thread decoder keeps decoding until $\mathit{p}_{k}^{STOP}$ is greater than 0.5.
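
The stopping rule can be sketched as a simple loop. In the sketch below the scalar outputs of g are passed in as a list of made-up logits, `max_threads` is a hypothetical safety cap that is not part of the description above, and counting the step at which STOP fires as the last decoded step is an assumption.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def decode_threads(stop_logits, max_threads=10):
    """Count thread-decoder steps taken until p_STOP exceeds 0.5."""
    steps = 0
    for logit in stop_logits[:max_threads]:
        steps += 1
        p_stop = sigmoid(logit)  # p_k^STOP for this step
        if p_stop > 0.5:
            break
    return steps

# Illustrative logits: STOP first becomes more likely than CONTINUE at step 3.
num_threads = decode_threads([-2.0, -1.0, 0.5, 3.0])
print(num_threads)
```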

Additionally, the current state $\mathbf{h}^{D_{t2t}}_{k}$ and inputs to $D_{t2t}$ at that step are passed through a two-layer feedforward network r followed by a dropout layer to compute the thread representation $\mathbf{s}_{k}=\mbox{r}\left({\mathbf{h}^{D_{t2t}}_{k};\mathbf{h}^{D_{w2w}}_{k-1,q};\boldsymbol{\hat{\beta}}^{k}*\mathbf{W}}\right)$ .

Given a thread representation $\mathbf{s}_{k}$ , the word-to-word decoder generates a summary for the thread. Our word-to-word decoder is based on Bahdanau et al. (2014). It is a unidirectional attentional LSTM ( $f^{D_{w2w}}$ ); see the right-hand side of Figure 2. We refer to Bahdanau et al. (2014) for further details.

3.3 Hierarchical Attention

Our novel hierarchical attention works at three levels. The post-level attention (corresponding to posts), $\boldsymbol{\gamma}$ , and the phrase-level attention (corresponding to source tokens), $\boldsymbol{\beta}$ , are computed while obtaining a thread representation, $\mathbf{s}$ . The word-level attention (also corresponding to source tokens), $\boldsymbol{\alpha}$ , is computed while generating a word, $y$ , of a summary, $\mathit{S}$ ; see Figure 3.

We draw inspiration for the hierarchical attention from recent works in computer vision (Noh et al., 2017; Teichmann et al., 2019), which show that a convolutional neural network (CNN)-based local descriptor with attention is better at obtaining key points from an image than a CNN-based global descriptor. Phrases from posts of interleaved texts are equivalent to visual patterns in images, and thus extracting phrases is more relevant for thread recognition than extracting posts. Thus, contrary to popular hierarchical attention (Nallapati et al., 2016; Cheng and Lapata, 2016; Tan et al., 2017), we have an additional phrase-level attention that again focuses on words, but with a different responsibility. Further, the popularly held intuition of hierarchical attention, i.e., that sentence attention scales word attention, is still intact, as $\boldsymbol{\gamma}$ (post attention) scales $\boldsymbol{\beta}$ .

At step $k$ of thread decoding, we compute the elements of the post-level attention, i.e., $\boldsymbol{\gamma}^{k,\cdot}$ , as:

$\gamma^{k}_{i}=\sigma\left(\mbox{attn}^{\gamma}\left(\mathbf{h}^{D_{t2t}}_{k-1},\mathbf{P}_{i}\right)\right),\quad i\in\{1,\dots,n\}$ (2)

where $\mbox{attn}^{\gamma}$ aligns the current thread decoder state vector $\mathbf{h}^{D_{t2t}}_{k-1}$ to the vectors of matrix $\mathbf{P}_{i}$ and then maps the aligned vectors to scalar values through a feed-forward network. At the same step, we also compute the elements of the phrase-level attention, i.e., $\boldsymbol{\beta}^{k}_{i,j}$ , as:

$\beta^{k}_{i,j}=\sigma\left(\mbox{attn}^{\beta}\left(\mathbf{h}^{D_{t2t}}_{k-1},\mathbf{a}_{i,j}\right)\right)$ , where $\mathbf{a}_{i,j}=\mbox{add}\left(\mathbf{W}_{i,j},\mathbf{P}_{i}\right)$ , $i\in\{1,\dots,n\}$ , $j\in\{1,\dots,p\}$ (3)

Here ${add}$ aligns a post representation to its constituent word representations and performs element-wise addition, and $\mbox{attn}^{\beta}$ is a feedforward network that maps the current thread decoder state $\mathbf{h}^{D_{t2t}}_{k-1}$ and vector $\mathbf{a}_{i,j}$ to a scalar value. Importantly, $\sigma(\cdot)$ in $\gamma$ and $\beta$ allows a thread not to be associated with any relevant phrase, thereby indicating a halt in decoding.

Then, we use $\boldsymbol{\gamma}^{k}$ to rescale phrase-level attentions, $\boldsymbol{\beta}^{k}$ as $\hat{\beta}^{k}_{i,j}=\beta^{k}_{i,j}*\gamma^{k}_{i}$ .
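
The rescaling of phrase-level attention by post-level attention can be illustrated with a short sketch. The raw scores stand in for the outputs of $\mbox{attn}^{\gamma}$ and $\mbox{attn}^{\beta}$ and are made-up numbers; the function name `rescale_phrase_attention` is hypothetical.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def rescale_phrase_attention(post_scores, phrase_scores):
    """Return (gamma, beta, beta_hat) with beta_hat[i][j] = beta[i][j] * gamma[i]."""
    gamma = [sigmoid(s) for s in post_scores]                    # one weight per post
    beta = [[sigmoid(s) for s in row] for row in phrase_scores]  # one weight per word
    beta_hat = [[b * g for b in row] for row, g in zip(beta, gamma)]
    return gamma, beta, beta_hat

# Post 0 is moderately relevant; post 1 is judged irrelevant at the post level,
# so even its high phrase scores are suppressed after rescaling.
gamma, beta, beta_hat = rescale_phrase_attention(
    post_scores=[0.0, -20.0],
    phrase_scores=[[0.0, 0.0], [5.0, 5.0]],
)
print(beta_hat)
```

Because sigmoids are used instead of a softmax, all weights can be close to zero at the same time, which matches the remark that a thread need not be associated with any phrase.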

At step $l$ of word-to-word decoding of summary thread $k$ , we compute the elements of the word-level attention, i.e., $\boldsymbol{\alpha}^{k,l}_{i,\cdot}$ , as below.

$\alpha^{k,l}_{i,j}=[\ldots]$ , where $e^{k,l}_{i,j}=\mbox{attn}^{\alpha}\left(\mathbf{h}^{D_{w2w}}_{k,l-1},\mathbf{a}_{i,j}\right)$ (4)

Here $\mathbf{a}_{i,j}$ is the same as in Eq. 3, and $\mbox{attn}^{\alpha}$ is a feedforward network that maps the current word decoder state $\mathbf{h}^{D_{w2w}}_{k,l-1}$ and vector $\mathbf{a}_{i,j}$ to a scalar value.

Finally, we use the rescaled phrase-level word attentions, $\hat{\boldsymbol{\beta}}^{k}$ , to rescale the word-level attention, $\alpha^{k,l}$ , as $\hat{\alpha}^{k,l}_{i,j}=\hat{\beta}^{k}_{i,j}\times\alpha^{k,l}_{i,j}$ .
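
The final rescaling step can be sketched as follows. How the scores $e^{k,l}_{i,j}$ are normalized into $\alpha^{k,l}_{i,j}$ is not visible in this text, so the sketch assumes a softmax over all source words, as in standard Bahdanau-style attention; that choice, the function names, and the numbers are assumptions for illustration only.

```python
import math

def softmax_2d(scores):
    """Softmax over all entries of a nested list (assumed normalization)."""
    peak = max(s for row in scores for s in row)
    exps = [[math.exp(s - peak) for s in row] for row in scores]
    total = sum(sum(row) for row in exps)
    return [[e / total for e in row] for row in exps]

def rescale_word_attention(beta_hat, word_scores):
    """alpha_hat[i][j] = beta_hat[i][j] * alpha[i][j]."""
    alpha = softmax_2d(word_scores)
    return [
        [bh * a for bh, a in zip(bh_row, a_row)]
        for bh_row, a_row in zip(beta_hat, alpha)
    ]

# Uniform word scores, so alpha is 0.25 everywhere and alpha_hat mirrors beta_hat.
beta_hat = [[1.0, 0.0], [0.5, 0.5]]
alpha_hat = rescale_word_attention(beta_hat, [[0.0, 0.0], [0.0, 0.0]])
print(alpha_hat)
```

A word whose phrase-level weight is zero stays at zero regardless of its word-level score, so the word decoder only attends inside the phrases selected for the current thread.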

3.4 Training Objective

We train our hierarchical encoder-decoder network similarly to an attentive seq2seq model (Bahdanau et al., 2014), but with an additional weighted sum of sigmoid cross-entropy losses on the stopping distribution; see Eq. 1. Given a thread summary, $\mathit{Y}^{k}=\langle\mathit{w}^{k,0},\ldots,\mathit{w}^{k,q}\rangle$ , our word-to-word decoder generates a target $\hat{\mathit{Y}}^{k}=\langle\mathit{y}^{k,0},\ldots,\mathit{y}^{k,q}\rangle$ , with words from the same vocabulary $\mathit{U}$ . We train our model end-to-end by minimizing the objective given in Eq. 5.

$\sum_{k=1}^{m}\sum_{l=1}^{q}[\ldots]+\lambda\sum_{k=1}^{m}y_{k}^{STOP}\log\left(p_{k}^{STOP}\right)$ (5)
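
One way to read this objective is as a word-level negative log-likelihood plus a weighted stopping loss. The sketch below follows that reading under stated assumptions: the word term is taken to be the usual seq2seq negative log-likelihood of the reference words, the stopping term is written as a full binary cross-entropy, and `hier_loss` and `lam` are hypothetical names. It is not a transcription of Eq. 5.

```python
import math

def hier_loss(word_probs, stop_probs, stop_labels, lam=1.0):
    """Word-level NLL plus lam times binary cross-entropy on the STOP decisions.

    word_probs[k][l]: probability the model assigns to reference word l of summary k.
    stop_probs[k]:    predicted p_k^STOP at thread step k.
    stop_labels[k]:   1 if decoding should stop at step k, else 0.
    """
    word_nll = -sum(math.log(p) for sent in word_probs for p in sent)
    stop_bce = -sum(
        y * math.log(p) + (1 - y) * math.log(1 - p)
        for y, p in zip(stop_labels, stop_probs)
    )
    return word_nll + lam * stop_bce

# Two summaries of two words each; decoding should stop after the second one.
loss = hier_loss(
    word_probs=[[0.9, 0.8], [0.7, 0.6]],
    stop_probs=[0.1, 0.9],
    stop_labels=[0, 1],
    lam=0.5,
)
print(round(loss, 4))
```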

4 Dataset

Obtaining labeled training data for conversation summarization is challenging. The available datasets are either extractive (Verberne et al., 2018) or too small (Barker et al., 2016; Anguera et al., 2012) to train a neural model. To get around this issue and thoroughly verify the proposed architecture, we synthesized a dataset by utilizing a corpus of conventional texts for which summaries are available. We create two corpora of interleaved texts: one from the abstracts and titles of articles from the PubMed corpus and one from the questions and titles of Stack Exchange questions. A random interleaving of sentences from a few PubMed abstracts or Stack Exchange questions roughly resembles an interleaved text, and correspondingly, the interleaving of their titles resembles its multi-sentence summary.

The algorithm that we devised for creating synthetic interleaved texts is defined in detail in the Appendix. The number of abstracts to include in the interleaved texts is given as a range (from $a$ to $b$ ) and the number of sentences per abstract to include is given as a second range (from $m$ to $n$ ). We vary the parameters as below and create three different corpora for experiments: Easy ( $a$ =2, $b$ =2, $m$ =5 and $n$ =5), Medium ( $a$ =2, $b$ =3, $m$ =2 and $n$ =5) and Hard ( $a$ =2, $b$ =5, $m$ =2 and $n$ =5). Table 1 shows an example of a data instance in the Hard Interleaved RCT corpus.
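
Since the full algorithm is given only in the Appendix, the following is a minimal sketch of how such a synthesis could work, not the authors' procedure. It assumes that each abstract is a list of sentences, that the first sentences of each chosen abstract are kept, that sentence order within a thread is preserved, and that titles are ordered by the first appearance of their thread; the function name `interleave` and all of these choices are assumptions.

```python
import random

def interleave(abstracts, titles, a, b, m, n, rng):
    """Build one synthetic (interleaved sentences, interleaved titles) pair."""
    k = rng.randint(a, b)                          # number of threads, a..b
    chosen = rng.sample(range(len(abstracts)), k)  # which abstracts to mix
    threads = []
    for idx in chosen:
        size = min(rng.randint(m, n), len(abstracts[idx]))  # m..n sentences
        threads.append((idx, abstracts[idx][:size]))
    # One slot per sentence, labelled with its thread; shuffling the slots
    # interleaves the threads while each thread keeps its own sentence order.
    slots = [t for t, (_, sents) in enumerate(threads) for _ in sents]
    rng.shuffle(slots)
    cursors = [0] * k
    text, summary, seen = [], [], set()
    for t in slots:
        idx, sents = threads[t]
        text.append(sents[cursors[t]])
        cursors[t] += 1
        if t not in seen:                          # title order: first appearance
            seen.add(t)
            summary.append(titles[idx])
    return text, summary
```

With the Medium setting, a call would look like `interleave(abstracts, titles, a=2, b=3, m=2, n=5, rng=random.Random(0))`.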

5 Experiments

We report ROUGE-1, ROUGE-2, and ROUGE-L as the quantitative evaluation of the models. The hyper-parameters for experiments are described in detail in the Appendix and remain the same unless otherwise noted.
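
For orientation, the recall variant of ROUGE-N named in the table captions can be sketched as the fraction of reference n-grams that also occur in the generated summary. This is a simplified illustration with whitespace tokenization and no stemming, so it will not reproduce the numbers of the official ROUGE toolkit; `rouge_n_recall` is a hypothetical helper name.

```python
from collections import Counter

def ngrams(tokens, n):
    """Multiset of n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def rouge_n_recall(reference, candidate, n):
    """Clipped n-gram overlap divided by the number of reference n-grams."""
    ref = ngrams(reference.split(), n)
    cand = ngrams(candidate.split(), n)
    total = sum(ref.values())
    if total == 0:
        return 0.0
    overlap = sum((ref & cand).values())  # Counter intersection takes min counts
    return overlap / total

reference = "the effect of training intensity"
candidate = "effect of intensity of training"
r1 = rouge_n_recall(reference, candidate, 1)
r2 = rouge_n_recall(reference, candidate, 2)
print(r1, r2)
```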

5.1 Upper-bound

In upper-bound experiments, we check the impact of disentanglement on the abstractive summarization models, e.g., seq2seq and hier2hier. To do this, we first provide the ground-truth disentanglement (cluster) information and evaluate the performance of these models. Second, we let the models perform either end-to-end or two-step summarization. To perform these experiments, we compiled three corpora of different entanglement difficulty using the Pubmed corpus of MeSH types Disease and Chemical (interleaving is performed within a MeSH type). The training, evaluation and test sets are of sizes 170k, 4k and 4k, respectively.

The seq2seq model can use ground-truth disentanglement information in two ways, i.e., summarize threads individually or summarize concatenated threads. The first two rows in Table 2 compare the performance of those two sets of experiments. Clearly, the seq2seq model can easily detect thread boundaries in concatenated threads and performs as well as the individual model. However, hier2hier is better than seq2seq at detecting thread boundaries, as indicated by its performance gain on the Medium and Hard corpora (see row 3 in Table 2), and therefore sets the upper bound for interleaved text summarization.

Additionally, we utilize the unsupervised disentanglement component of Shang et al. (2018) to cluster the entangled threads. Importantly, their disentanglement component requires a fixed cluster size as input; however, our Medium and Hard corpora have a varying cluster size, and therefore we give their system the benefit of the doubt and input the maximum cluster size, i.e., 3 and 5 respectively. We sort the clusters by their association to a sequence of summaries, where the association is measured by Rouge-L between them. We then take the seq2seq model trained on ground-truth disentanglement and test it on these unsupervised-disentangled texts to understand the strength of unsupervised clustering. The performance of the pre-trained model remains somewhat similar (see rows 2 and 4), indicating a strong disentanglement component. We also train and test a seq2seq model on unsupervised-disentangled texts; however, its performance drops slightly (see row 5), which we believe is due to noise introduced by the heuristic sorting of clusters.

In real-world scenarios, i.e., without ground-truth disentanglement, the unsupervised two-step system of Shang et al. (2018) performs worse than seq2seq on unsupervised disentanglement (see rows 5 and 6), the reason being that a seq2seq model trained on a sufficiently large dataset is better at summarization than the unsupervised sentence compression (extractive) method. At the same time, a seq2seq model trained on entangled texts performs similarly to a seq2seq model trained on unsupervised-disentangled texts (see rows 5 and 7), showing that the disentanglement component is not necessary. Finally, a hier2hier model trained on entangled texts is the only model that comes close to the upper bound set by hier2hier on disentangled texts (see rows 3 and 8).

6 Seq2seq vs. hier2hier models

Further, we compare the proposed hierarchical approach against the seq2seq approach in summarizing interleaved texts by experimenting on Medium and Hard corpora obtained from much more varied base document-summary pairs. We interleave the Pubmed corpus of 10 MeSH types, e.g., anatomy and organism. Similarly, we interleave Stack Exchange posts-question pairs of 12 different categories with regular vocabularies, e.g., science fiction and travel. As before, the interleaving is performed within a type or category. The training, evaluation and test sets of Pubmed are of sizes 280k, 5k and 5k, and those of Stack Exchange are of sizes 140k, 4k and 4k respectively. Results in Table 3 show that changing the decoder to a hierarchical one yields a noticeable improvement, i.e., 1.5-3 Rouge points on Pubmed and 2-4.5 points on Stack Exchange.

Additionally, we evaluated the models' strength in recognizing threads when summaries are ordered by the location of each thread's greatest density. Here, density refers to the smallest window of posts containing over 50% of the posts belonging to a thread; e.g., post1-thread1, post1-thread2, post2-thread2, post2-thread1, post3-thread1, post4-thread1 $\rightarrow$ thread2-summary, thread1-summary. In this example, although thread1 occurs first, the majority of its posts occur later, and therefore its summary also occurs later. We create Medium and Hard corpora of Stack Exchange with summaries sorted by thread density and perform abstractive summarization studies. As seen in Table 4, both the seq2seq and hier2hier models perform similarly to how they perform on the corpora with summaries sorted by thread occurrence (see Table 3), which indicates strong disentanglement in such abstractive models irrespective of summary arrangement. In addition, the hier2hier model is still consistently better than the seq2seq model.
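The density-based ordering can be illustrated with a short sketch. It assumes one reading of the definition: for each thread, find the shortest contiguous span of posts that contains strictly more than half of that thread's posts, then order threads by where that span starts. The function names are hypothetical, and this reading is chosen because it reproduces the example given above.

```python
def density_window(thread_ids, thread):
    """Return (start, end) of the shortest span of posts that contains
    strictly more than half of the posts of `thread`."""
    positions = [i for i, t in enumerate(thread_ids) if t == thread]
    need = len(positions) // 2 + 1             # strictly more than 50%
    best = None
    for k in range(len(positions) - need + 1):
        start, end = positions[k], positions[k + need - 1]
        if best is None or end - start < best[1] - best[0]:
            best = (start, end)
    return best

def order_by_density(thread_ids):
    """Order threads by the start of their densest window."""
    threads = list(dict.fromkeys(thread_ids))  # unique, in order of occurrence
    return sorted(threads, key=lambda t: density_window(thread_ids, t)[0])

# The example from the text: thread1, thread2, thread2, thread1, thread1, thread1
posts = [1, 2, 2, 1, 1, 1]
print(order_by_density(posts))  # [2, 1]
```

Ordering by first occurrence would give thread1 before thread2 for the same input, which is the difference between the two summary arrangements compared in Tables 3 and 4.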

To understand the impact of hierarchy on the hier2hier model, we perform an ablation study on the Hard Pubmed corpus; Table 5 shows the results. Clearly, adding hierarchical decoding already provides a boost in performance. Hierarchical encoding also adds some improvement; however, the speed-up it brings to training and inference is much more valuable (see Figure 1 in Appendix C): hier2hier takes $\approx$ 1.5 days to train on a Tesla V100 GPU, while seq2seq takes $\approx$ 4.5 days. Thus, the hier2hier model not only achieves greater accuracy but also reduces training and inference time.

7 Hierarchical attention

To understand the impact of hierarchical attention on the hier2hier model, we perform an ablation study of post-level attentions ( $\boldsymbol{\gamma}$ ) and phrase-level attentions ( $\boldsymbol{\beta}$ ), using the Pubmed Hard corpus.

Table 6 shows the performance comparison. The post-level attention $\boldsymbol{\gamma}$ improves the performance of hierarchical decoding only slightly (by 0.5-1 Rouge points). The phrase-level attention $\boldsymbol{\beta}$ is very important, as without it the model performance is noticeably reduced (Rouge values decrease by 2-3 points). The hierarchical attentions closest to ours, i.e., Nallapati et al. (2017, 2016); Tan et al. (2017); Cheng and Lapata (2016), do not use $\boldsymbol{\beta}$ and are therefore equivalent to hier2hier $(+\boldsymbol{\gamma}-\boldsymbol{\beta})$ , which performs worse than hier2hier $(-\boldsymbol{\gamma}+\boldsymbol{\beta})$ and hier2hier $(+\boldsymbol{\gamma}+\boldsymbol{\beta})$ , thus signifying the importance of $\boldsymbol{\beta}$ . We also include a post-level attention technique of the type used by Li et al. (2015) in the comparison, where a softmax-based $\gamma$ is used to compute the thread representation instead of $\sigma(\cdot)$ -based $\gamma$ and $\beta$ . Results indicate that $\sigma(\cdot)$ fits better in this case. Lastly, removing both $\boldsymbol{\gamma}$ and $\boldsymbol{\beta}$ makes hier2hier similar to seq2seq, except for a few more parameters, i.e., two additional LSTMs, and the performance is also very similar.
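The difference between softmax-normalized and $\sigma(\cdot)$ -based attention weights can be seen on toy scores. The sketch below is only a generic illustration of the two normalizations, not the attention equations of the model; the scores are invented for the example.

```python
import math

def softmax(scores):
    """Normalize scores so that the weights compete and sum to 1."""
    mx = max(scores)
    exps = [math.exp(s - mx) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def sigmoid(scores):
    """Gate each score independently into (0, 1); weights need not sum to 1."""
    return [1.0 / (1.0 + math.exp(-s)) for s in scores]

# Toy post-level scores: two relevant posts, one irrelevant post.
scores = [4.0, 4.0, -4.0]
print(softmax(scores))  # the two relevant posts split the mass, about 0.5 each
print(sigmoid(scores))  # both relevant posts receive a weight close to 1
```

With softmax, several relevant posts of one thread must share a fixed budget of attention, whereas sigmoid gates can keep all of them active at once. This is one plausible reason why $\sigma(\cdot)$ might suit thread representations that draw on several posts, though the text itself reports only the empirical result.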

8 AMI Experiments

We also evaluated both abstractive models, seq2seq and hier2hier, on the popular AMI meeting corpus (McCowan et al., 2005) and compared them against the two-step system of Shang et al. (2018). We follow the standard train, eval and test split. Results in Table 7 show that hier2hier outperforms both systems by a large margin.

9 Discussion

Table 8 shows an output of our hierarchical abstractive system, with the interleaved texts at the top and the ground-truth and generated summaries at the bottom. Table 8 also shows the top two post indexes attended by the post-level attention ( $\boldsymbol{\gamma}$ ) while generating those summaries; they coincide with the relevant posts. Similarly, the top 10 indexes (words) of the phrase-level attention ( $\boldsymbol{\beta}$ ) are visualized directly in the table through color coding that matches the generation. The system not only manages to disentangle the interleaved texts but also generates appropriate abstractive summaries. Meanwhile, $\boldsymbol{\beta}$ provides explainability of the output.

The next step in this research is transfer learning of the hierarchical system trained on the synthetic corpus to real-world examples. Further, we aim to modify hier2hier to include some of the recent additions to seq2seq models, e.g., the pointer mechanism of See et al. (2017).

10 Conclusion

We presented an end-to-end trainable hierarchical encoder-decoder architecture with a novel hierarchical attention that implicitly disentangles interleaved texts and generates abstractive summaries covering the text threads. The architecture addresses the error propagation and fluency issues that occur in two-step architectures, thereby adding performance gains of 20-40% on both real-world and synthetic datasets.

Appendix

Appendix A Interleave Algorithm

In Algorithm 1, Interleave takes a set of concatenated abstracts and titles, $\textit{C}=\langle\textit{A}_{1};\textit{T}_{1},\ldots,\textit{A}_{|C|};\textit{T}_{|C|}\rangle$ , the minimum, $a$ , and maximum, $b$ , number of abstracts to interleave, and the minimum, $m$ , and maximum, $n$ , number of sentences in a source, and returns a set of concatenated interleaved texts and summaries. window takes a sequence of texts, X, and returns a window iterator of size $\frac{|\mathit{X}|-\textit{w}}{t}+1$ , where w and t are the window size and sliding step respectively. window reuses elements of X and therefore enlarges the corpus size. The notation $\mathcal{U}$ refers to uniform sampling, $\left[\cdot\right]$ to array indexing, and Reverse to reversing an array.
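A sliding-window iterator of the kind described for window can be sketched as below. The signature and the handling of a trailing partial window (it is dropped) are assumptions; the number of windows then matches the stated size, with the division rounded down when $|X| - w$ is not a multiple of $t$.

```python
def window(xs, w, t):
    """Yield overlapping windows of size w over xs, advancing by t.

    Produces floor((len(xs) - w) / t) + 1 windows when len(xs) >= w,
    and none otherwise. Elements are reused across windows when t < w.
    """
    for start in range(0, len(xs) - w + 1, t):
        yield xs[start:start + w]

items = list(range(10))
for win in window(items, w=4, t=2):
    print(win)
# [0, 1, 2, 3]
# [2, 3, 4, 5]
# [4, 5, 6, 7]
# [6, 7, 8, 9]
```

Here $|X| = 10$, $w = 4$ and $t = 2$ give $(10 - 4)/2 + 1 = 4$ windows, and every element except the first and last two appears in two windows, which is how reuse enlarges the corpus.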

Appendix B Parameters

For the word-to-word encoder, the steps are limited to 20, while the steps in the word-to-word decoder are limited to 15. The steps in the post-to-post encoder and thread-to-thread decoder depend on the corpus type, e.g., Medium has 15 steps in post-to-post and 3 steps in thread-to-thread. In seq2seq experiments, the source is flattened, and therefore the number of steps in the source is limited to 300. We initialized all weights, including word embeddings, with a random normal distribution with mean 0 and standard deviation 0.1. The embedding vectors and hidden states of the encoder and decoder in the models are set to dimension 100. Texts are lowercased. The vocabulary size is limited to 8000 and 15000 for the Pubmed and Stack Exchange corpora respectively. We pad short sequences with a special token, $\langle PAD\rangle$ . We use Adam (Kingma et al. 2014) with an initial learning rate of 0.0001 and a batch size of 64 for training.
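For reference, the settings above can be collected in a single configuration object. The class and field names are hypothetical; the values are those stated in this appendix, with the Medium corpus and Pubmed vocabulary used as defaults.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class TrainConfig:
    """Hypothetical container for the hyper-parameters listed above."""
    max_words_per_post: int = 20       # word-to-word encoder steps
    max_words_per_summary: int = 15    # word-to-word decoder steps
    max_posts: int = 15                # post-to-post encoder steps (Medium)
    max_threads: int = 3               # thread-to-thread decoder steps (Medium)
    seq2seq_max_source_len: int = 300  # flattened source limit for seq2seq
    embed_dim: int = 100
    hidden_dim: int = 100
    init_mean: float = 0.0             # normal initialization of all weights
    init_std: float = 0.1
    vocab_size: int = 8000             # Pubmed; 15000 for Stack Exchange
    pad_token: str = "<PAD>"
    learning_rate: float = 1e-4        # Adam, initial learning rate
    batch_size: int = 64

medium_pubmed = TrainConfig()
medium_stackexchange = TrainConfig(vocab_size=15000)
```

For the Medium values, 15 posts of at most 20 words each give 300 tokens, which matches the limit on the flattened seq2seq source.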

Appendix C Training Loss

Appendix D Examples

Bibliography

1. Aker et al. (2016) Ahmet Aker, Monica Paramita, Emina Kurtic, Adam Funk, Emma Barker, Mark Hepple, and Rob Gaizauskas. 2016. Automatic label generation for news comment clusters. In Proceedings of the 9th International Natural Language Generation Conference, pages 61–69.
2. Anguera et al. (2012) Xavier Anguera, Simon Bozonnet, Nicholas Evans, Corinne Fredouille, Gerald Friedland, and Oriol Vinyals. 2012. Speaker diarization: A review of recent research. IEEE Transactions on Audio, Speech, and Language Processing, 20(2):356–370.
3. Bahdanau et al. (2014) Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2014. Neural machine translation by jointly learning to align and translate. CoRR, abs/1409.0473.
4. Barker et al. (2016) Emma Barker, Monica Lestari Paramita, Ahmet Aker, Emina Kurtic, Mark Hepple, and Robert Gaizauskas. 2016. The sensei annotated corpus: Human summaries of reader comment conversations in on-line news. In Proceedings of the 17th annual meeting of the special interest group on discourse and dialogue, pages 42–52.
5. Cheng and Lapata (2016) Jianpeng Cheng and Mirella Lapata. 2016. Neural summarization by extracting sentences and words. arXiv preprint arXiv:1603.07252.
6. Chopra et al. (2016) Sumit Chopra, Michael Auli, and Alexander M. Rush. 2016. Abstractive sentence summarization with attentive recurrent neural networks. In NAACL HLT 2016, The 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, San Diego California, USA, June 12-17, 2016, pages 93–98. The Association for Computational Linguistics.
7. Hsu et al. (2018) Wan-Ting Hsu, Chieh-Kai Lin, Ming-Ying Lee, Kerui Min, Jing Tang, and Min Sun. 2018. A unified model for extractive and abstractive summarization using inconsistency loss. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 132–141. Association for Computational Linguistics.
8. Jiang et al. (2018) Jyun-Yu Jiang, Francine Chen, Yan-Ying Chen, and Wei Wang. 2018. Learning to disentangle interleaved conversational threads with a siamese hierarchical network and similarity ranking. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 1812–1822. Association for Computational Linguistics.