Abstractive Text Summarization using Attentive GRU based Encoder-Decoder

Tohida Rehman; Suchandan Das; Debarshi Kumar Sanyal; Samiran Chattopadhyay

arXiv:2302.13117·cs.CL·February 28, 2023


TL;DR

This paper presents an abstractive text summarization model using an attentive GRU encoder-decoder architecture, which effectively handles long sequences and outperforms existing models on news datasets.

Contribution

Introduces a novel attentive GRU-based encoder-decoder model for abstractive summarization with improved handling of long input sequences.

Findings

01 Outperforms existing models on news summarization datasets

02 Handles long sequences effectively with attention mechanism

03 Generates summaries comparable to newspaper headlines

Abstract

In today’s era, a huge volume of information exists everywhere. It is therefore crucial to evaluate that information and extract useful, and often summarized, information from it so that it can be used for relevant purposes. This extraction can be achieved through machine learning, a key technique of artificial intelligence. Indeed, automatic text summarization has emerged as an important application of machine learning in text processing. In this paper, an English text summarizer has been built with a GRU-based encoder and decoder. The Bahdanau attention mechanism has been added to overcome the problem of handling long sequences in the input text. A news-summary dataset has been used to train the model. The output is observed to outperform competitive models in the literature. The generated summary can be used as a newspaper headline.

Tables

Table 1: ROUGE scores (F1) of the model output.

ROUGE-1 (F1)	ROUGE-L (F1)
35.29	35.25

Table 2: Comparison of ROUGE scores (F1) with some existing models.

Model	ROUGE-1 (F1)	ROUGE-L (F1)
Words-lvt5k-1sent [2]	28.61	25.423
Words-lvt2k-temp-att [2]	35.46	32.65
ABS+ (Rush et al.) [15]	28.18	23.81
RAS-Elman (k=10) (Chopra et al.) [16]	33.78	31.15
Our Model	35.29	35.25

Equations

$$I = X_{1}, X_{2}, \dots, X_{d}$$

$$O = Y_{1}, Y_{2}, \dots, Y_{s}$$

$$h_{i} = [\overrightarrow{h}_{i}^{\top}, \overleftarrow{h}_{i}^{\top}]^{\top}$$

$$e_{ij} = \mathrm{att}(s_{i-1}, h_{j})$$

$$\mathrm{att}(s_{i-1}, h_{j}) = V^{\top} \tanh(W [s_{i-1}, h_{j}])$$

$$\alpha_{ij} = \frac{\exp(e_{ij})}{\sum_{k=1}^{T_{x}} \exp(e_{ik})}$$

$$c_{i} = \sum_{j=1}^{T_{x}} \alpha_{ij} h_{j}$$

$$s_{i} = f(s_{i-1}, y_{i-1}, c_{i})$$

$$P(y_{i} \mid y_{i-1}, y_{i-2}, \dots, y_{1}, X) = g(y_{i-1}, s_{i}, c_{i})$$


Full text

Tohida Rehman (1, corresponding author), Suchandan Das (1), Debarshi Kumar Sanyal (2), Samiran Chattopadhyay (3)

(1) Jadavpur University, Kolkata, India. Email: {tohida.rehman, suithit}@gmail.com
(2) Indian Association for the Cultivation of Science, Kolkata, India. Email: [email protected]
(3) TCG Crest; Jadavpur University, Kolkata, India. Email: [email protected]


Keywords:

Abstractive Text Summarization, GRU, Encoder, Decoder, Attention mechanism.

1 Introduction

The quantity of data around us is increasing at such a high velocity that we all need a mechanism to access correct and quick information that cuts through the noise and is brief enough to be assimilated yet not lacking in crucial content. We need a method to obtain a correct summary from an outsized volume of data. Automatic text summarization is such a technique through which a large chunk of information can be condensed into a meaningful summary. Extractive and abstractive summarization are two types of text summarization methods. A technique for extracting essential sentences or paragraphs from the source text and condensing them into a shorter text is known as extractive summarization. The statistical and linguistic properties of sentences, as well as their extraction and placement in the output text, are used to determine the relevance of sentences. An abstractive summarization technique tries to present the text’s primary idea in natural language without the verbatim use of terms from the text. The original text is transformed into a more comprehensible conceptual form in the abstractive summary approach, resulting in a shorter summary of the original text content.

In this paper, we present an encoder-decoder based model to summarize documents. A gated recurrent unit (GRU) is used to boost the memory capacity of the recurrent neural network and to make the model easier to train. It also helps to overcome the vanishing gradient problem. In the attention mechanism, the context vector is concatenated with the previous decoder output, and the result is fed, along with the previous decoder hidden state, into the decoder GRU at each time step to generate the output [1]. We have used the CNN/Daily Mail dataset [2, 3]. We obtained higher ROUGE-1 and ROUGE-L F1 scores than some other competitive baselines in the literature.

2 Related Works

Nallapati et al. [2] proposed a baseline encoder-decoder architecture using LSTM, with a bidirectional LSTM at the encoder and a unidirectional LSTM at the decoder. Word-level and sentence-level bidirectional GRUs were also used. Bahdanau et al. [1] improved the performance of the basic encoder-decoder model. See et al. [3] offered a detailed study of numerous abstractive text summarization models, including pointer-generator and RNN seq2seq models, that are based on the sequence-to-sequence encoder-decoder architecture. Sutskever et al. [4] proposed a multilayer LSTM-based end-to-end solution to sequence learning. The input for the encoder was a fixed length of text, and the output for the decoder was the same. Lin et al. [5] proposed a global encoding mechanism for abstractive text summarization. In this paper, we have designed a GRU-based encoder and decoder with one extra attention layer. Shi et al. [6] proposed to “improve seq2seq models, making them capable of handling different challenges, such as saliency, fluency and human readability, and generate high-quality summaries”. Generally speaking, most of these techniques differ in one of three categories: network structure, parameter inference, and decoding/generation. Luong et al. [7] examined two simple and effective classes of attention mechanism: a global approach, which always attends to all source words, and a local one, which only looks at a subset of source words at a time. Aksenov et al. [8] proposed “the encoder and decoder of a Transformer-based neural model on the BERT language model”. More recently, the model proposed in “BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension” [9] combines the simplicity of BERT (Devlin et al.) [10], GPT (Radford et al.) [11] and other pre-training schemes. BART opens up many possibilities for fine-tuning in text summarization applications.

3 Methodology

In this section, we describe the methodology used to design our abstractive text summarizer. The generic workflow of our model is shown in Fig. 1. We used the GRU [12] in a seq2seq model. The GRU has gating units that manage the flow of information inside the unit.

Several crucial steps were followed, such as data collection and pre-processing, tokenization, encoder and decoder model design, model training and model evaluation, to address the text generation problem of predicting a semantically meaningful summary.

Let us consider an input sequence

$$I = X_{1}, X_{2}, \dots, X_{d}$$

where $d$ is the length of the input sequence. For this input, the output sequence is

$$O = Y_{1}, Y_{2}, \dots, Y_{s}$$

where $s$ is the length of the output sequence. Here $s < d$, that is, the output sequence is shorter than the input sequence.

3.1 Data collection and pre-processing

The dataset plays a key role in every deep learning process, and a good dataset is essential for good results. Various types of datasets are available from different sources. We have used the CNN/Daily Mail dataset [2, 3]. The dataset contains several columns, of which we have used the news text and the summary description. Because of the limited configuration of our system, 10000 examples from the CNN/Daily Mail dataset were used.

Before building the model, we must complete some basic pre-processing tasks. A decision based on messy, uncleaned text could be disastrous. In this phase we have therefore removed all unneeded symbols, letters, and other elements of the text that do not affect the objective of our problem. We have removed HTML tags, parentheses, and special characters.

To begin with, we converted all of the text to lower case and then split it up [13]. English contains various contractions, for example, doesn’t, aren’t, etc., so we added a contraction mapping in the pre-processing phase. We removed unnecessary components from the raw text to get the cleaned text. We then lemmatized words that occur in different forms of the same term. At the beginning and end of the news and summary description, we included START and END tokens. Fig. 2 shows the steps used to clean the dataset and prepare news and summary pairs. Fig. 3 shows some cleaned data.
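The cleaning steps described above can be sketched roughly as follows. This is a minimal illustration and not the authors' pipeline: the contraction map is a tiny hypothetical stand-in (the paper does not list the mapping it used), dropping whole parenthesised spans and keeping letters only are assumptions about what "parentheses" and "special characters" mean here, and lemmatization is left out because it needs an external lexicon.

```python
import re

# Hypothetical, deliberately tiny contraction map (assumption: the paper's
# actual mapping is not given in the text).
CONTRACTIONS = {
    "doesn't": "does not",
    "aren't": "are not",
    "can't": "cannot",
    "won't": "will not",
}


def clean_text(text):
    """Lower-case, strip markup and symbols, and expand contractions."""
    text = text.lower()
    text = re.sub(r"<[^>]+>", " ", text)      # drop HTML tags
    text = re.sub(r"\([^)]*\)", " ", text)    # drop parenthesised spans (assumption)
    text = text.replace("\u2019", "'")        # normalise curly apostrophes
    for short, full in CONTRACTIONS.items():
        text = text.replace(short, full)
    text = re.sub(r"[^a-z\s]", " ", text)     # keep letters only (assumption)
    return re.sub(r"\s+", " ", text).strip()  # collapse whitespace


def add_boundary_tokens(text, start="START", end="END"):
    """Wrap a cleaned sequence in the START and END markers."""
    return f"{start} {text} {end}"


raw = "<p>Deepika (actress) doesn't plan to walk the Red Carpet!</p>"
print(add_boundary_tokens(clean_text(raw)))
# START deepika does not plan to walk the red carpet END
```

The boundary tokens are added after cleaning so that the lower-casing step does not alter them.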

3.2 GRU based encoder-decoder with attention

Cho et al. [14] introduced the RNN-based encoder-decoder, in which the encoder RNN encodes a sequence of words into a fixed-length vector representation and the decoder RNN decodes that representation into a sequence of words. We used a bidirectional GRU encoder and a unidirectional GRU decoder with an attention mechanism [2].
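To make the bidirectional GRU encoder more concrete, here is a small dependency-free sketch. It uses one common formulation of the GRU gates, which the paper itself does not spell out, omits bias terms, and uses arbitrary dimensions and random weights; the helper names (`init_gru`, `gru_step`, `bidirectional_encode`) are illustrative, not taken from the authors' code.

```python
import math
import random


def matvec(m, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def init_gru(input_dim, hidden_dim, rng):
    """Random weights for one GRU cell: W* act on the input, U* on the state."""
    def mat(rows, cols):
        return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]
    return {name: mat(hidden_dim, input_dim if name[0] == "W" else hidden_dim)
            for name in ("Wz", "Uz", "Wr", "Ur", "Wh", "Uh")}


def gru_step(p, x, h_prev):
    """One GRU update (biases omitted for brevity)."""
    z = [sigmoid(a + b) for a, b in zip(matvec(p["Wz"], x), matvec(p["Uz"], h_prev))]  # update gate
    r = [sigmoid(a + b) for a, b in zip(matvec(p["Wr"], x), matvec(p["Ur"], h_prev))]  # reset gate
    gated = [ri * hi for ri, hi in zip(r, h_prev)]
    cand = [math.tanh(a + b) for a, b in zip(matvec(p["Wh"], x), matvec(p["Uh"], gated))]
    return [(1.0 - zi) * hi + zi * ci for zi, hi, ci in zip(z, h_prev, cand)]


def run_gru(p, xs):
    """Run a GRU over a sequence from a zero state; return every hidden state."""
    h = [0.0] * len(p["Uz"])
    states = []
    for x in xs:
        h = gru_step(p, x, h)
        states.append(h)
    return states


def bidirectional_encode(p_fwd, p_bwd, xs):
    """Concatenate forward and backward states at each input position."""
    fwd = run_gru(p_fwd, xs)
    bwd = run_gru(p_bwd, xs[::-1])[::-1]  # read right-to-left, then realign
    return [f + b for f, b in zip(fwd, bwd)]


rng = random.Random(0)
T, emb_dim, hid_dim = 5, 8, 4  # toy sizes, not the paper's 256 / 1024
xs = [[rng.uniform(-1, 1) for _ in range(emb_dim)] for _ in range(T)]
p_fwd, p_bwd = init_gru(emb_dim, hid_dim, rng), init_gru(emb_dim, hid_dim, rng)
annotations = bidirectional_encode(p_fwd, p_bwd, xs)
print(len(annotations), len(annotations[0]))  # 5 8
```

Each annotation has twice the hidden size because it holds both the left-to-right and the right-to-left state for that word.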

Here, the seq2seq model with attention mechanism [1] builds a context vector using all hidden states of the encoder. This helps the model focus on the most important information in the source sequence. The decoder uses the context vector associated with the source positions and the previously generated target words to predict the target word at each time step. The steps below describe how the Bahdanau attention mechanism works [1].

1. The encoder produces an annotation $h_{i}$ for each word $x_{i}$ of an input sentence of $T_{x}$ words. The encoder is a bidirectional GRU that reads the input sentence in both the forward and the backward direction, and the annotation at each time step concatenates the forward and backward hidden states:

$$h_{i} = [\overrightarrow{h}_{i}^{\top}, \overleftarrow{h}_{i}^{\top}]^{\top}$$

In the steps below, encoder positions are indexed by $j$ and decoder time steps by $i$.

2. At each time step $i$, the decoder uses the annotations $h_{j}$ and its previous hidden state $s_{i-1}$ to calculate the attention scores $e_{ij}$:

$$e_{ij} = \mathrm{att}(s_{i-1}, h_{j})$$

The scoring function of Bahdanau et al. [1], referred to as additive attention, is defined as

$$\mathrm{att}(s_{i-1}, h_{j}) = V^{\top} \tanh(W [s_{i-1}, h_{j}])$$

where $W$ and $V$ are trainable weights.

3. The attention weights $\alpha_{ij}$ are computed as follows:

$$\alpha_{ij} = \frac{\exp(e_{ij})}{\sum_{k=1}^{T_{x}} \exp(e_{ik})}$$

4. The context vector is the weighted sum of the encoder hidden states, with the attention weights $\alpha_{ij}$ as coefficients:

$$c_{i} = \sum_{j=1}^{T_{x}} \alpha_{ij} h_{j}$$

5. At time step $i$, the decoder produces the hidden state $s_{i}$ from the previous hidden state $s_{i-1}$, the target word $y_{i-1}$ at time step $i-1$, and the context vector $c_{i}$:

$$s_{i} = f(s_{i-1}, y_{i-1}, c_{i})$$

6. Steps 2 to 5 are repeated until the end of the sentence or the maximum length of generated tokens is reached. Each word is predicted according to the following rule:

$$P(y_{i} \mid y_{i-1}, y_{i-2}, \dots, y_{1}, X) = g(y_{i-1}, s_{i}, c_{i})$$

Fig. 4 shows how attention works in the GRU-based sequence-to-sequence encoder-decoder model.
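The score, weight and context computations of the attention steps above can be sketched for a single decoder time step as follows. This is an illustrative sketch only: the function name `additive_attention`, the toy dimensions and the random weights are assumptions, and the softmax subtracts the maximum score purely for numerical stability.

```python
import math
import random


def additive_attention(s_prev, annotations, W, V):
    """Return (attention weights, context vector) for one decoder step.

    s_prev      -- previous decoder state s_{i-1}
    annotations -- encoder annotations h_1 .. h_{T_x}
    W, V        -- trainable weights of the additive scoring function
    """
    scores = []
    for h in annotations:
        concat = s_prev + h  # [s_{i-1}, h_j]
        hidden = [math.tanh(sum(w * x for w, x in zip(row, concat))) for row in W]
        scores.append(sum(v * a for v, a in zip(V, hidden)))  # e_ij
    top = max(scores)
    exps = [math.exp(e - top) for e in scores]
    total = sum(exps)
    weights = [x / total for x in exps]  # alpha_ij
    dim = len(annotations[0])
    context = [sum(a * h[k] for a, h in zip(weights, annotations))
               for k in range(dim)]  # c_i
    return weights, context


rng = random.Random(0)
T_x, enc_dim, dec_dim, att_dim = 6, 8, 4, 5  # toy sizes (assumptions)
annotations = [[rng.uniform(-1, 1) for _ in range(enc_dim)] for _ in range(T_x)]
s_prev = [rng.uniform(-1, 1) for _ in range(dec_dim)]
W = [[rng.uniform(-0.5, 0.5) for _ in range(dec_dim + enc_dim)] for _ in range(att_dim)]
V = [rng.uniform(-0.5, 0.5) for _ in range(att_dim)]

weights, context = additive_attention(s_prev, annotations, W, V)
print(round(sum(weights), 6), len(context))  # 1.0 8
```

The weights form a probability distribution over input positions, so the context vector is a convex combination of the annotations and has the same dimension as a single annotation.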

4 Experiment and Result analysis

As the computational power of our machines was low, a small dataset has been used. We used 10000 examples from the CNN/Daily Mail dataset [2, 3], the Adam optimizer, and a sparse categorical cross-entropy loss function, with batch size = 128, embedding dimension = 256 and hidden units = 1024. We used $80\%$ of the data for training and $20\%$ for testing. We trained the model for 100 epochs, which reduced the loss to 0.0480. Table 1 shows the ROUGE-1 and ROUGE-L F1 scores of the model output. We now provide some illustrative examples of the output of our model.
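As a rough aid to reading these settings, the sketch below derives the split sizes and the number of batches per epoch from the reported hyperparameters, and shows what a sparse categorical cross-entropy computes for a single token. The dictionary layout, the choice to keep the last partial batch and the padding mask (`pad_id`) are assumptions, not details given in the paper.

```python
import math

# Hyperparameters as reported in the text; the dictionary itself is illustrative.
config = {
    "examples": 10000,
    "train_fraction": 0.8,
    "batch_size": 128,
    "epochs": 100,
    "embedding_dim": 256,
    "hidden_units": 1024,
}

n_train = round(config["examples"] * config["train_fraction"])
n_test = config["examples"] - n_train
# Assumes the final, smaller batch is kept rather than dropped.
steps_per_epoch = math.ceil(n_train / config["batch_size"])


def sparse_categorical_crossentropy(logits, target_id):
    """Negative log-probability of an integer target under softmax(logits)."""
    top = max(logits)
    log_norm = top + math.log(sum(math.exp(x - top) for x in logits))
    return log_norm - logits[target_id]


def sequence_loss(step_logits, target_ids, pad_id=0):
    """Mean per-token loss over one summary, skipping padding (an assumption)."""
    losses = [sparse_categorical_crossentropy(logits, t)
              for logits, t in zip(step_logits, target_ids) if t != pad_id]
    return sum(losses) / len(losses) if losses else 0.0


print(n_train, n_test, steps_per_epoch)  # 8000 2000 63
print(round(sparse_categorical_crossentropy([0.0, 0.0, 0.0, 0.0], 2), 4))  # 1.3863
```

"Sparse" here means that targets are integer token ids rather than one-hot vectors; with uniform logits over four tokens the loss is log 4.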

4.1 Sample Output

Input: “actress deepika padukone has said that she will not be walking the red carpet at the cannes film festival deepika added right now all my energies are focused on padmavati earlier it was reported that deepika had been appointed the brand ambassador of oral and would represent the brand at the film festival”

Actual Summary: deepika padukone will not be walking the red carpet at the cannes film festival.

Predicted Summary: not walking red carpet at cannes film festival says deepika.

Input: “beverage giant pepsico ceo indra nooyi received million over crore in compensation for marking increase in her pay this was the fourth consecutive pay raise for nooyi who has been the ceo since the rise in compensation came as efforts to steer the companys portfolio away from sugary products helped earnings”

Actual Summary: pepsico ceo indra nooyi received million over crore.

Predicted Summary: pepsico ceo indra nooyi pay rises to crore in year.

4.1.1 Heatmap:

The attention heatmaps for the predicted outputs are shown in the figures below. In an attention heatmap, the x axis denotes the input text, the y axis denotes the generated summary, and the z axis indicates the attention weight. The main goal of the attention mechanism is to emphasize the important information. Fig. 5 shows which parts of the input sentence the model attends to while generating the summary.

With the daily news dataset, our proposed solution obtains F1 scores of 35.29 for ROUGE-1 and 35.25 for ROUGE-L, which is slightly better than some existing models. It generates a more semantically meaningful single-sentence summary. We have also compared the performance of the model with other existing models using the ROUGE score. Table 2 compares the ROUGE-1 and ROUGE-L scores with some existing models; k refers to the beam size used for generation.
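For readers unfamiliar with the metric, the following sketch computes sentence-level ROUGE-1 and ROUGE-L F1 for the first sample output shown earlier. It is a simplified illustration, using whitespace tokens with no stemming and plain F1 over unigram overlap and longest common subsequence. The paper does not say which ROUGE implementation was used, so this single pair is not expected to reproduce the corpus-level figures.

```python
from collections import Counter


def f1(overlap, cand_len, ref_len):
    """Harmonic mean of precision (overlap/cand_len) and recall (overlap/ref_len)."""
    if overlap == 0:
        return 0.0
    precision, recall = overlap / cand_len, overlap / ref_len
    return 2 * precision * recall / (precision + recall)


def rouge_1_f1(candidate, reference):
    """Unigram-overlap F1 with clipped counts."""
    c, r = candidate.split(), reference.split()
    overlap = sum((Counter(c) & Counter(r)).values())
    return f1(overlap, len(c), len(r))


def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, start=1):
            cur.append(prev[j - 1] + 1 if x == y else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def rouge_l_f1(candidate, reference):
    """Longest-common-subsequence F1."""
    c, r = candidate.split(), reference.split()
    return f1(lcs_length(c, r), len(c), len(r))


reference = ("deepika padukone will not be walking the red carpet "
             "at the cannes film festival")
predicted = "not walking red carpet at cannes film festival says deepika"

print(round(rouge_1_f1(predicted, reference), 4))  # 0.75
print(round(rouge_l_f1(predicted, reference), 4))  # 0.6667
```

ROUGE-L is lower here because "deepika" matches as a unigram but falls outside the longest in-order subsequence shared by the two sentences.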

5 Conclusion and Future Work

A GRU-based encoder-decoder model with the Bahdanau attention mechanism has been used to design an automatic text summarizer. The attention mechanism emphasizes the important words of the sequence and copies them into the output summary. The proposed method provides better results than several other approaches in the literature. A meaningful single-sentence summary is generated, which can be used for news headline generation. However, we also observed that our model does not always produce the best result. In future, we will use a BERT-based pre-trained model to enhance the model’s performance and to generate more meaningful summaries. We will also try to summarize Covid-19 related scientific articles, which can help the medical community by providing a clean, meaningful and high-quality knowledge base of the pandemic.

Bibliography


[1] D. Bahdanau, K. Cho, and Y. Bengio, “Neural machine translation by jointly learning to align and translate,” arXiv preprint arXiv:1409.0473, 2014.
[2] R. Nallapati, B. Zhou, C. Gulcehre, B. Xiang et al., “Abstractive text summarization using sequence-to-sequence rnns and beyond,” arXiv preprint arXiv:1602.06023, 2016.
[3] A. See, P. J. Liu, and C. D. Manning, “Get to the point: Summarization with pointer-generator networks,” arXiv preprint arXiv:1704.04368, 2017.
[4] I. Sutskever, O. Vinyals, and Q. V. Le, “Sequence to sequence learning with neural networks,” in Advances in neural information processing systems, 2014, pp. 3104–3112.
[5] J. Lin, X. Sun, S. Ma, and Q. Su, “Global encoding for abstractive summarization,” arXiv preprint arXiv:1805.03989, 2018.
[6] T. Shi, Y. Keneshloo, N. Ramakrishnan, and C. K. Reddy, “Neural abstractive text summarization with sequence-to-sequence models,” ACM Transactions on Data Science, vol. 2, no. 1, pp. 1–37, 2021.
[7] M.-T. Luong, H. Pham, and C. D. Manning, “Effective approaches to attention-based neural machine translation,” arXiv preprint arXiv:1508.04025, 2015.
[8] D. Aksenov, J. Moreno-Schneider, P. Bourgonje, R. Schwarzenberg, L. Hennig, and G. Rehm, “Abstractive text summarization based on language model conditioning and locality modeling,” arXiv preprint arXiv:2003.13027, 2020.