Hierarchical Attentional Hybrid Neural Networks for Document Classification

Jader Abreu; Luis Fred; David Macêdo; Cleber Zanchettin

arXiv:1901.06610·cs.CL·October 15, 2019

TL;DR

This paper introduces a hierarchical neural network model combining convolutional layers, gated recurrent units, and attention mechanisms to improve document classification by better capturing document structure and contextual importance.

Contribution

It presents a novel hierarchical model that effectively incorporates document structure and context, outperforming existing attention-based methods.

Findings

01 Improved classification accuracy over existing models

02 Effective hierarchical feature extraction

03 Enhanced understanding of document structure

Abstract

Document classification is a challenging task with important applications. Deep learning approaches to the problem have gained much attention recently. Despite the progress, the proposed models do not incorporate knowledge of the document structure into the architecture efficiently and do not take into account the contextual importance of words and sentences. In this paper, we propose a new approach based on a combination of convolutional neural networks, gated recurrent units, and attention mechanisms for document classification tasks. The main contribution of this work is the use of convolution layers to extract more meaningful, generalizable, and abstract features through the hierarchical representation. The proposed method improves on the results of current attention-based approaches for document classification.

Tables

Table 1: Classification accuracy (%) on the test set.

Method	Yelp 2018 (five classes)	IMDb (two classes)
VDNN [7]	62.14	79.47
HN-ATT [6]	72.73	89.02
CNN [5]	71.81	91.34
Our model with CNN	73.28	92.26
Our model with TCN	72.63	95.17

Equations

$$F(s) = (x *_{d} f)(s) = \sum_{i=0}^{k-1} f(i) \cdot x_{s - d \cdot i}$$

$$x_{it} = W_{e} w_{it}, \quad t \in [1, T],$$

$$\overrightarrow{h_{it}} = \overrightarrow{\mathrm{GRU}}(x_{it}), \quad t \in [1, T],$$

$$\overleftarrow{h_{it}} = \overleftarrow{\mathrm{GRU}}(x_{it}), \quad t \in [T, 1].$$

$$u_{it} = \tanh(W_{w} h_{it} + b_{w})$$

$$\alpha_{it} = \frac{\exp(u_{it}^{\top} u_{w})}{\sum_{t} \exp(u_{it}^{\top} u_{w})}$$

$$s_{i} = \sum_{t} \alpha_{it} h_{it}$$

$$\overrightarrow{h_{i}} = \overrightarrow{\mathrm{GRU}}(s_{i}), \quad i \in [1, L],$$

$$\overleftarrow{h_{i}} = \overleftarrow{\mathrm{GRU}}(s_{i}), \quad i \in [L, 1].$$

$$u_{i} = \tanh(W_{s} h_{i} + b_{s})$$

$$\alpha_{i} = \frac{\exp(u_{i}^{\top} u_{s})}{\sum_{i} \exp(u_{i}^{\top} u_{s})}$$

$$v = \sum_{i} \alpha_{i} h_{i}$$

Full text

Hierarchical Attentional Hybrid Neural Networks for Document Classification

Jader Abreu⋆, Luis Fred⋆, David Macêdo, Cleber Zanchettin

⋆ Authors contributed equally and are both first authors.

Centro de Informática, Universidade Federal de Pernambuco
50.740-560, Recife, PE, Brazil
Email: {jaoa,lfgs,dlm,cz}@cin.ufpe.br

Keywords: Text classification · Attention mechanisms · Document classification · Convolutional Neural Networks

1 Introduction

Text classification is one of the most classical and important tasks in machine learning. Document classification, which is essential for organizing documents for retrieval, analysis, and curation, is traditionally performed by classifiers such as Support Vector Machines or Random Forests. As in other areas, deep learning methods now perform considerably better than traditional approaches in this field [5]. Deep learning also plays a central role in Natural Language Processing (NLP) through learned word vector representations, which represent words as fixed-length, continuous, and dense feature vectors that capture semantic relations between words: similar words are close to each other in the vector space.

Most models proposed for document classification do not incorporate knowledge of the document structure into the architecture efficiently and do not take into account the contextual importance of words and sentences. Many of these approaches do not select the most informative words and sentences, even though some words are more informative than others in a document. Moreover, these models are frequently based on recurrent neural networks only [6]. Since CNNs have achieved strong performance in deep learning models by extracting richer features with fewer parameters, we hypothesize that they not only improve computational performance but also yield better generalization in neural models for document classification.

A recent trend in NLP is to use attention mechanisms to model dependencies between words regardless of their distance in the input sequence. Yang et al. [6] propose a hierarchical neural architecture for document classification that employs attention mechanisms and tries to mirror the hierarchical structure of the document. The intuition underlying the model is that not all parts of a text are equally relevant for representing it. Further, determining the relevant sections involves modeling the interactions and importance among the words, not just their presence in the text.

In this paper, we propose a new approach for document classification based on CNNs, GRU [4] hidden units, and attention mechanisms, which improves model performance by selectively focusing the network on the essential parts of the text during training. Inspired by [6], we use a hierarchical design to better represent the document structure. We call our model Hierarchical Attentional Hybrid Neural Networks (HAHNN). We also use temporal convolutions [2], which give us more flexible receptive field sizes. We evaluate the proposed approach by comparing its results with those of state-of-the-art models, and the model shows improved accuracy.

2 Hierarchical Attentional Hybrid Neural Networks

The HAHNN model combines convolutional layers, Gated Recurrent Units, and attention mechanisms. Figure 1 shows the proposed architecture. The first layer of HAHNN is a pre-processed word embedding layer (black circles in Figure 1). The second layer contains a stack of CNN layers consisting of convolutional layers with multiple filters (varying window sizes) and feature maps. We also performed some trials with temporal convolutional layers with dilated convolutions and obtained promising results. In addition, we use Dropout for regularization. In the next layers, a word encoder applies the attention mechanism with a word-level context vector, followed by a sentence encoder that applies attention with a sentence-level context vector. The last layer uses a softmax function to generate the output probability distribution over the classes.

We use CNNs to extract more meaningful, generalizable, and abstract features through the hierarchical representation. Combining convolutional layers with different filter sizes with both the word and sentence encoders in a hierarchical architecture lets our model extract richer features and improves generalization performance in document classification. To obtain representations of rarer words by taking subword information into account, we use FastText [3] to initialize the word embeddings.
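As an illustration of convolutions with varying window sizes over word embeddings, the following minimal sketch applies one filter per window size and concatenates the resulting feature maps for each word. The window sizes (3, 4, 5), the zero "same" padding, the ReLU activation, and all function names are assumptions made for this example, not details confirmed by the paper.

```python
def conv1d_same(X, kernel):
    """1-D convolution over a sequence of vectors with zero 'same' padding.

    X:      list of T embedding vectors, each of length D
    kernel: list of w weight vectors (one per window position), each of length D
    Returns a list of T scalars (one feature map).
    """
    T, D, w = len(X), len(X[0]), len(kernel)
    left = (w - 1) // 2
    out = []
    for t in range(T):
        acc = 0.0
        for j in range(w):
            pos = t + j - left
            if 0 <= pos < T:  # positions outside the sentence count as zeros
                acc += sum(kernel[j][c] * X[pos][c] for c in range(D))
        out.append(max(0.0, acc))  # ReLU activation (an assumption of this sketch)
    return out


def multi_window_features(X, kernels):
    """Apply one filter per window size and concatenate the maps per word."""
    maps = [conv1d_same(X, k) for k in kernels]
    return [[m[t] for m in maps] for t in range(len(X))]


# Toy sentence: 4 words with 2-dimensional embeddings (made-up values).
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]

# One all-ones filter for each hypothetical window size 3, 4, and 5.
kernels = [[[1.0, 1.0] for _ in range(w)] for w in (3, 4, 5)]

features = multi_window_features(X, kernels)
```

Each word ends up with one feature per window size, so the sequence length is preserved and the output can be passed on to a recurrent word encoder.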

We investigate two variants of the proposed architecture: a basic version, as described in Figure 1, and another that implements a TCN [2] layer. The goal of the TCN is to simulate RNNs with very long memory by combining dilated and regular convolutions with residual connections. Dilated convolutions are considered beneficial for longer sequences because they enable an exponentially larger receptive field in convolutional layers. More formally, for a 1-D sequence input $x \in \mathbb{R}^{n}$ and a filter $f:\{0,...,k-1\}\rightarrow\mathbb{R}$, the dilated convolution operation $F$ on element $s$ of the sequence is defined as

$$F(s) = (x *_{d} f)(s) = \sum_{i=0}^{k-1} f(i) \cdot x_{s - d \cdot i}$$

where $d$ is the dilation factor, $k$ is the filter size, and $s - d \cdot i$ accounts for the direction of the past. Dilation is thus equivalent to introducing a fixed step between every two adjacent filter taps. When $d = 1$, a dilated convolution reduces to a regular convolution. Using a larger dilation enables an output at the top level to represent a wider range of inputs, expanding the receptive field.
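The dilated convolution defined above can be sketched in a few lines of plain Python. This is an illustrative implementation only: treating positions before the start of the sequence as zeros is an assumption of the sketch, and `receptive_field` assumes one convolution of filter size $k$ per stacked layer.

```python
def dilated_conv1d(x, f, d=1):
    """Causal dilated convolution: F(s) = sum_{i=0}^{k-1} f[i] * x[s - d*i].

    Positions before the start of the sequence are treated as zeros
    (an assumption of this sketch, equivalent to left zero-padding).
    """
    k = len(f)
    out = []
    for s in range(len(x)):
        total = 0.0
        for i in range(k):
            j = s - d * i  # look back d*i steps into the past
            if j >= 0:
                total += f[i] * x[j]
        out.append(total)
    return out


def receptive_field(k, dilations):
    """Input positions seen by one output of stacked dilated layers."""
    return 1 + (k - 1) * sum(dilations)


x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
f = [1.0, 1.0]  # filter of size k = 2

regular = dilated_conv1d(x, f, d=1)  # x[s] + x[s-1]
dilated = dilated_conv1d(x, f, d=2)  # x[s] + x[s-2]

print(regular)                        # [1.0, 3.0, 5.0, 7.0, 9.0, 11.0]
print(dilated)                        # [1.0, 2.0, 4.0, 6.0, 8.0, 10.0]
print(receptive_field(2, [1, 2, 4]))  # 8
```

With $d = 1$ the function reduces to a regular causal convolution, and doubling the dilation at each layer makes the receptive field grow exponentially with depth while the number of weights per layer stays fixed.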

The proposed model takes into account that different parts of a document are not equally relevant. Moreover, determining the relevant sections involves modeling the interactions among the words, not just their isolated presence in the text. Therefore, the model includes two levels of attention mechanisms [1], one at the word level and the other at the sentence level, which let the model pay more or less attention to individual words and sentences when constructing the document representation.

The strategy consists of two parts: 1) a word sequence encoder and a word-level attention layer; and 2) a sentence encoder and a sentence-level attention layer. In the word encoder, the model uses a bidirectional GRU [1] to produce annotations of words by summarizing information from both directions, thereby incorporating contextual information into each annotation. The attention levels let the model pay more or less attention to individual words and sentences when constructing the representation of the document [6].

Given a sentence with words $w_{it}, t\in[1,T]$, and an embedding matrix $W_{e}$, a bidirectional GRU contains the forward GRU $\overrightarrow{f}$, which reads the sentence $s_{i}$ from $w_{i1}$ to $w_{iT}$, and a backward GRU $\overleftarrow{f}$, which reads from $w_{iT}$ to $w_{i1}$:

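$$x_{it} = W_{e} w_{it}, \quad t \in [1, T],$$

$$\overrightarrow{h_{it}} = \overrightarrow{\mathrm{GRU}}(x_{it}), \quad t \in [1, T],$$

$$\overleftarrow{h_{it}} = \overleftarrow{\mathrm{GRU}}(x_{it}), \quad t \in [T, 1].$$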

An annotation for a given word $w_{it}$ is obtained by concatenating the forward hidden state and the backward hidden state, i.e., $h_{it}=[\overrightarrow{h_{it}},\overleftarrow{h_{it}}]$, which summarizes the information of the whole sentence. We use the attention mechanism to identify the words that are important to the meaning of the sentence and to aggregate the representations of those informative words into a sentence vector. Specifically,

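$$u_{it} = \tanh(W_{w} h_{it} + b_{w})$$

$$\alpha_{it} = \frac{\exp(u_{it}^{\top} u_{w})}{\sum_{t} \exp(u_{it}^{\top} u_{w})}$$

$$s_{i} = \sum_{t} \alpha_{it} h_{it}$$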

The model measures the importance of a word as the similarity of $u_{it}$ with a word-level context vector $u_{w}$ and learns a normalized importance weight $\alpha_{it}$ through a softmax function. After that, the architecture computes the sentence vector $s_{i}$ as a weighted sum of the word annotations based on these weights. The word context vector $u_{w}$ is randomly initialized and jointly learned during the training process.
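The word-level attention described here can be sketched in plain Python as follows. The function name `attention_pool` and the toy values are assumptions for illustration; in the actual model $W_{w}$, $b_{w}$, and $u_{w}$ are learned, and the annotations $h_{it}$ come from the bidirectional GRU rather than being fixed by hand.

```python
import math


def attention_pool(H, W, b, u_ctx):
    """Attention pooling over a list of annotation vectors H.

    u_t     = tanh(W h_t + b)
    alpha_t = softmax over t of (u_t . u_ctx)
    pooled  = sum_t alpha_t * h_t
    """
    scores = []
    for h in H:
        u = [math.tanh(sum(w_rc * h_c for w_rc, h_c in zip(row, h)) + b_r)
             for row, b_r in zip(W, b)]
        scores.append(sum(u_j * c_j for u_j, c_j in zip(u, u_ctx)))

    m = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(sc - m) for sc in scores]
    z = sum(exps)
    alphas = [e / z for e in exps]

    dim = len(H[0])
    pooled = [sum(a * h[c] for a, h in zip(alphas, H)) for c in range(dim)]
    return pooled, alphas


# Toy annotations for a 3-word sentence (made-up values).
H = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
W = [[1.0, 0.0], [0.0, 1.0]]  # stands in for W_w
b = [0.0, 0.0]                # stands in for b_w
u_w = [1.0, 0.0]              # stands in for the word context vector

s_i, alphas = attention_pool(H, W, b, u_w)
```

The same pooling applies unchanged at the sentence level, with the sentence annotations $h_{i}$ in place of $h_{it}$ and the sentence context vector $u_{s}$ in place of $u_{w}$.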

Given the sentence vectors $s_{i}$, the document vector is obtained in a similar way. A bidirectional GRU first encodes the sentences:

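$$\overrightarrow{h_{i}} = \overrightarrow{\mathrm{GRU}}(s_{i}), \quad i \in [1, L],$$

$$\overleftarrow{h_{i}} = \overleftarrow{\mathrm{GRU}}(s_{i}), \quad i \in [L, 1].$$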

The model concatenates the forward and backward hidden states to obtain the annotation $h_{i}=[\overrightarrow{h_{i}},\overleftarrow{h_{i}}]$ of sentence $i$, which summarizes the neighboring sentences around sentence $i$ while still focusing on sentence $i$. To reward sentences that are relevant for correctly classifying a document, the model again uses an attention mechanism and introduces a sentence-level context vector $u_{s}$, which it uses to measure the importance of the sentences:

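$$u_{i} = \tanh(W_{s} h_{i} + b_{s})$$

$$\alpha_{i} = \frac{\exp(u_{i}^{\top} u_{s})}{\sum_{i} \exp(u_{i}^{\top} u_{s})}$$

$$v = \sum_{i} \alpha_{i} h_{i}$$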

In the above equation, $v$ is the document vector that summarizes all the information of the sentences in a document. Similarly, the sentence-level context vector $u_{s}$ is randomly initialized and jointly learned during the training process. The output of the sentence attention layer feeds a fully connected softmax layer, which gives a probability distribution over the classes. The proposed method is openly available in a GitHub repository: https://github.com/luisfredgs/cnn-hierarchical-network-for-document-classification.
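The final classification step can be illustrated with a small sketch of a fully connected softmax layer applied to the document vector $v$. The names `W_c` and `b_c` and all numeric values are hypothetical; the paper does not name the parameters of this layer.

```python
import math


def softmax(z):
    """Numerically stable softmax over a list of logits."""
    m = max(z)
    exps = [math.exp(value - m) for value in z]
    total = sum(exps)
    return [e / total for e in exps]


def classify(v, W_c, b_c):
    """Fully connected layer followed by softmax: p = softmax(W_c v + b_c)."""
    logits = [sum(w * x for w, x in zip(row, v)) + bias
              for row, bias in zip(W_c, b_c)]
    return softmax(logits)


v = [0.5, -1.0, 2.0]       # toy document vector
W_c = [[1.0, 0.0, 0.0],    # hypothetical weights, one row per class
       [0.0, 0.0, 1.0]]
b_c = [0.0, 0.0]           # hypothetical biases

probs = classify(v, W_c, b_c)
predicted = max(range(len(probs)), key=probs.__getitem__)
```

Here `probs` sums to one and `predicted` is the index of the most probable class, which for these toy values is class 1.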

3 Experiments and Results

We evaluate the proposed model on two document classification datasets, using 90% of the data for training and the remaining 10% for testing. We split documents into sentences and tokenize each sentence. The word embeddings have dimension 200, and we use the Adam optimizer with a learning rate of 0.001. The datasets used are IMDb Movie Reviews (http://ai.stanford.edu/~amaas/data/sentiment/) and Yelp 2018 (https://www.yelp.com/dataset/challenge). The former contains a set of 25k highly polar movie reviews for training and 25k for testing, and the classification task is to detect positive and negative reviews. The latter includes user ratings and written reviews about stores and services on Yelp, forming a dataset for multiclass classification (ratings from 1 to 5 stars). Yelp 2018 contains around 5M full-text reviews, but we limit the number of samples used to 500k for computational reasons.
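The preprocessing and the 90/10 split described above might look roughly like the following sketch. The regular expressions, the lowercasing, the fixed random seed, and the function names are assumptions chosen for illustration; the paper does not specify which sentence splitter or tokenizer was used.

```python
import random
import re


def split_document(text):
    """Split a document into sentences, then each sentence into lowercase tokens."""
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    return [re.findall(r"[a-z0-9']+", s.lower()) for s in sentences]


def split_train_test(samples, test_fraction=0.1, seed=0):
    """Shuffle the samples and hold out a fraction of them for testing."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


doc = "I loved this movie. The plot was great! Would watch again."
tokens = split_document(doc)
# tokens == [['i', 'loved', 'this', 'movie'],
#            ['the', 'plot', 'was', 'great'],
#            ['would', 'watch', 'again']]

train, test = split_train_test(range(100))  # 90 training and 10 test samples
```

The nested list returned by `split_document` mirrors the word and sentence hierarchy that the model consumes: the outer list indexes sentences and each inner list indexes the words of one sentence.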

Table 1 compares our results with those of related work. Note that HN-ATT [6] obtained an accuracy of 72.73% on the Yelp test set, whereas the proposed model obtained an accuracy of 73.28%. Our results also outperform CNN [5] and VDNN [7]. On Yelp, our approach improves the results when using CNN with varying filter window sizes. On IMDb, the model improves the results with both CNN and TCN.
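To make the size of these differences explicit, the accuracies in Table 1 can be compared directly. The snippet below only restates the reported numbers and computes differences in percentage points; the dictionary keys such as `HAHNN-CNN` are labels chosen for this example.

```python
# Test accuracies (%) as reported in Table 1.
accuracy = {
    "VDNN": {"yelp": 62.14, "imdb": 79.47},
    "HN-ATT": {"yelp": 72.73, "imdb": 89.02},
    "CNN": {"yelp": 71.81, "imdb": 91.34},
    "HAHNN-CNN": {"yelp": 73.28, "imdb": 92.26},
    "HAHNN-TCN": {"yelp": 72.63, "imdb": 95.17},
}


def gain(model, baseline, dataset):
    """Accuracy difference in percentage points, rounded to two decimals."""
    return round(accuracy[model][dataset] - accuracy[baseline][dataset], 2)


for model in ("HAHNN-CNN", "HAHNN-TCN"):
    for dataset in ("yelp", "imdb"):
        print(model, dataset, gain(model, "HN-ATT", dataset))
```

Relative to HN-ATT, the CNN variant gains 0.55 points on Yelp and 3.24 points on IMDb, while the TCN variant gains 6.15 points on IMDb but is 0.10 points lower on Yelp.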

3.1 Attention Weights Visualizations

To validate the model's ability to select informative words and sentences, we present visualizations of the attention weights in Figure 2, which shows examples for a positive and a negative test review. Every line is a sentence. Blue denotes the sentence weight, and red denotes the word weight in determining the sentence meaning. Despite some exceptions, there is a greater focus on the more important features, for example the words “loving” and “amazed” in Figure 2 (a) and “disappointment” in Figure 2 (b).

Occasionally, we found issues in some sentences, where less important words receive higher importance. For example, in Figure 2 (a) the word “translate” receives high importance even though it is a neutral word. These drawbacks will be addressed in future work.

4 Final Remarks

In this paper, we have presented the HAHNN architecture for document classification. The method combines CNNs with attention mechanisms at both the word and sentence levels. HAHNN improves accuracy in document classification by incorporating the document structure into the model and employing CNNs to extract richer features.

Bibliography

[1] BAHDANAU, Dzmitry; CHO, Kyunghyun; BENGIO, Yoshua. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473, 2014.
[2] BAI, Shaojie; KOLTER, J. Zico; KOLTUN, Vladlen. An empirical evaluation of generic convolutional and recurrent networks for sequence modeling. arXiv preprint arXiv:1803.01271, 2018.
[3] BOJANOWSKI, Piotr et al. Enriching word vectors with subword information. arXiv preprint arXiv:1607.04606, 2016.
[4] CHO, Kyunghyun et al. Learning phrase representations using RNN encoder-decoder for statistical machine translation. arXiv preprint arXiv:1406.1078, 2014.
[5] KIM, Yoon. Convolutional neural networks for sentence classification. arXiv preprint arXiv:1408.5882, 2014.
[6] YANG, Zichao et al. Hierarchical attention networks for document classification. In: Conf. North Am. Chapter of the Assoc. for Comp. Ling. 2016. p. 1480-1489, San Diego, CA, USA.
[7] CONNEAU, Alexis et al. Very deep convolutional networks for text classification. arXiv preprint arXiv:1606.01781, 2016.