Resilient Combination of Complementary CNN and RNN Features for Text Classification through Attention and Ensembling

Athanasios Giannakopoulos; Maxime Coriou; Andreea Hossmann; Michael Baeriswyl; Claudiu Musat

arXiv:1903.12157·cs.CL·March 29, 2019

TL;DR

This paper presents ECGA, an end-to-end neural architecture that combines CNN, RNN, and attention modules with ensembling to improve text classification performance across diverse datasets.

Contribution

The paper introduces ECGA, a novel architecture that effectively integrates multiple neural modules and ensembling for robust, high-performing text classification.

Findings

1. ECGA attains or surpasses the state-of-the-art on various datasets.
2. It is effective in both low and high data regimes.
3. The combination of modules is shown to be complementary.

Abstract

State-of-the-art methods for text classification include several distinct steps of pre-processing, feature extraction and post-processing. In this work, we focus on end-to-end neural architectures and show that the best performance in text classification is obtained by combining information from different neural modules. Concretely, we combine convolution, recurrent and attention modules with ensemble methods and show that they are complementary. We introduce ECGA, an end-to-end go-to architecture for novel text classification tasks. We prove that it is efficient and robust, as it attains or surpasses the state-of-the-art on varied datasets, including both low and high data regimes.

Tables

Table 1. Performance comparison on DBpedia.

	Error rate (%)
Johnson and Zhang (2016)	0.84
CNN	1.29
BiGRU+ATT	0.88
CNN+BiGRU	0.87
CNN+BiGRU+ATT	0.85
ECGA	0.84

Table 2(a). Argumentation mining: Task A. CGA stands for CNN + BiGRU + ATT.

	Task A (Accuracy %)
	V	R	D	I	Avg.
FastText	68.0	71.1	76.9	69.4	71.4
CNN	70.0	72.0	76.0	70.4	72.1
BiGRU+ATT	72.3	74.4	77.3	71.9	73.9
CNN+BiGRU	70.9	73.8	77.0	71.2	73.2
CGA	72.2	74.2	77.6	72.1	74.0
ECGA	72.5	75.0	78.2	72.4	74.5

Table 2(b). Argumentation mining: Task C.

	Task C (F-score %)
FastText	65.4
CNN	67.0
BiGRU+ATT	70.9
CNN+BiGRU	70.4
CNN+BiGRU+ATT	71.3
ECGA	71.6

Table 3. Model performance on churn detection.

	Macro F-score (%)
Gridach et al. (2017)	83.85
CNN	81.94
BiGRU+ATT	84.21
CNN+BiGRU	84.48
CNN+BiGRU+ATT	86.26
ECGA	87.00

Full text

Athanasios Giannakopoulos*, Maxime Coriou*¹, Andreea Hossmann, Michael Baeriswyl, Claudiu Musat

Data, Analytics & AI Group, Swisscom AG

[email protected]

* Equal contribution

¹ Work done during the master thesis of Maxime Coriou at Data, Analytics & AI, Swisscom AG.

1 Introduction

Text classification is among the most common natural language processing problems. Its applications vary from separating documents into classes (Yang et al., 2016) to finding argumentative phrases (Fierro et al., 2017) or detecting churny tweets (Amiri and Daumé III, 2015). Techniques range from traditional tf-idf methods to modern deep neural networks. Generally, traditional methods are used for simpler classification spaces and in low data regimes (Amiri and Daumé III, 2015). However, neural modules are becoming the norm for complex problems with higher data availability.

Modern text classifiers use mainly neural modules for feature extraction. Convolutional neural networks (CNN) (LeCun et al., 1998) or embedded convolution layers in larger networks can be used as feature extractors because of their location invariance property. Recurrent neural units such as Gated Recurrent Units (GRU) (Cho et al., 2014) are used (Socher et al., 2013) because of their sequence modelling capabilities. Finally, the use of attention (Li et al., 2017) is constantly growing since it can tackle the forgetfulness of recurrent cells on long sequences (Luong et al., 2015).

Dropout (Srivastava et al., 2014) and ensemble methods (e.g. Random Forests (RF) (Breiman, 2001)) are two popular countermeasures for overfitting, which is a constant risk for deep learning models trained in low data regimes. In the case of ensemble methods, predictions of multiple learners are combined to form the final prediction.

In this work, we combine all aforementioned components and introduce a widely applicable text classifier: an Ensemble of CNN-GRU-Attention, hereafter denoted as ECGA. ECGA benefits from the complementary feature representation capacities of the three neural modules it exploits. At the same time, it constitutes an efficient way of limiting overfitting since it is based on ensemble methods, i.e. the final prediction is made by averaging the predictions from multiple learners. We are aware that individual neural components are widely used in text classification, either individually or in pairs. However, combining all of them in one text classifier is novel.

We deploy and test ECGA in three different text classification tasks, namely (i) argumentation mining, (ii) topic classification and (iii) textual churn detection.

The first two tasks are complex multi-class classification problems with large datasets containing up to 44 classes. The third task is a binary classification problem; however, the nature of the text and the task makes it difficult even for human annotators, as confirmed by the low annotation confidence in Amiri and Daumé III (2015). The dataset for the third task is small, which forces ECGA to operate in a low data regime.

The first finding that emerges from our results is that ECGA exceeds or at least attains the state-of-the-art in all aforementioned classification tasks. It does so in an end-to-end way without any changes to its architecture, except for hyper-parameter tuning. This resilience makes ECGA a prime choice for new tasks, as it even outperforms architectures that were tailored for the studied tasks.

The second finding is that everything matters, i.e. ensemble methods combined with all neural modules lead to a performance increase. By gradually adding complementary neural components we obtain sustained performance increases.

2 Related Work

Fierro et al. (2017) have contributed the most to argumentation mining by releasing a dataset with more than $200{,}000$ arguments. The best performance on this dataset is based on the FastText classifier (Joulin et al., 2016).

With respect to topic classification, Zhang et al. (2015) created the DBpedia dataset for multi-class text classification. Numerous research teams (e.g. Johnson and Zhang (2016) and Johnson and Zhang (2017)) have worked on DBpedia by applying different models and feature extraction methods. Lately, Howard and Ruder (2018) employed transfer learning and achieved the state-of-the-art on this dataset.

Amiri and Daumé III (2016) performed textual churn detection using tweets about 3 mobile providers and obtained their best results by using recurrent cells. Later, Gridach et al. (2017) improved the performance by adding hand-crafted features based on logic rules to a CNN.

3 ECGA Architecture

ECGA orchestrates all types of feature extraction and text classification modules. We want to show that the techniques of (i) convolution, (ii) recurrence, (iii) attention and (iv) ensembles are complementary.

1. We employ CNNs, which are strong feature extractors for text classification (Yin et al., 2017). We create an $n\times m$ input matrix, where $n$ is the number of words of the input text and $m$ is the number of features per word, and apply convolution to it with $f$ filters of kernel size $k$. Each filter slides over $k$ words (i.e. $k$-grams) and creates a vector of size $n-k+1$. We concatenate the outputs of the $f$ filters without max pooling and create an $(n-k+1)\times f$ matrix. Hence, the $j^{th}$ row of this matrix is a feature vector of the $j^{th}$ $k$-gram of the input sentence.

2. We then feed the output of the CNN into a bidirectional GRU (BiGRU), i.e. the BiGRU network processes an input sequence of length $n-k+1$. The output vector of each state embeds information about the structure of the input text learned from the sequences of $k$-grams.

3. We incorporate and apply attention on the output states of the BiGRU network (Li et al., 2017). This allows us to construct a final feature vector $\boldsymbol{\alpha}$ of the input text using a weighted sum of all the output states of the BiGRU network. The final layer of ECGA passes $\boldsymbol{\alpha}$ through a softmax activation for the text classification.

4. Finally, we exploit multiple learners, i.e. ensemble methods, in order to combine diverse predictions and attain higher performance. We do so by performing convolutions with different kernel sizes $k_{i}$ on the input matrix. This allows us to extract features for 2-grams, 3-grams, etc. at the same time by choosing different values for $k_{i}$. We then fork the deeper layers of the network (i.e. BiGRU, attention and softmax) according to the number of different kernel sizes we use. In that way, we create multiple learners (similar to random forests) and train them using different features for the same task. The final prediction is made by averaging the predictions of all the learners. Figure 1 shows ECGA with two learners.
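The four steps above can be sketched in plain Python. This is only an illustrative sketch, not the authors' implementation: the BiGRU is omitted (the convolution output is passed straight to attention), the ReLU activation and the dot-product attention scoring against a vector `v` are assumptions, the weights are random, and all function names are hypothetical. It shows the $(n-k+1)\times f$ convolution output without pooling, the attention-weighted sum that yields the final feature vector, and the averaging of two learners with kernel sizes 2 and 3.

```python
import math
import random


def conv_kgrams(x, filters):
    """Slide each filter over k consecutive word vectors, without pooling.

    x: n rows of m features. filters: f filters, each with k rows of m weights.
    Returns an (n-k+1) x f matrix; row j describes the j-th k-gram.
    """
    k, m = len(filters[0]), len(x[0])
    out = []
    for j in range(len(x) - k + 1):
        row = []
        for w in filters:
            s = sum(w[a][b] * x[j + a][b] for a in range(k) for b in range(m))
            row.append(max(0.0, s))  # ReLU activation (an assumption)
        out.append(row)
    return out


def softmax(z):
    top = max(z)
    e = [math.exp(val - top) for val in z]
    total = sum(e)
    return [val / total for val in e]


def attention_pool(states, v):
    """Weighted sum of all states; weights come from a softmax over
    dot-product scores with v (the scoring function is an assumption)."""
    weights = softmax([sum(a * b for a, b in zip(h, v)) for h in states])
    dim = len(states[0])
    return [sum(wt * h[d] for wt, h in zip(weights, states)) for d in range(dim)]


def learner_predict(x, filters, v, w_out):
    feats = conv_kgrams(x, filters)   # (n-k+1) x f
    # A BiGRU would transform `feats` here; it is omitted in this sketch.
    alpha = attention_pool(feats, v)  # final feature vector of the text
    logits = [sum(a * b for a, b in zip(row, alpha)) for row in w_out]
    return softmax(logits)


def ensemble_predict(x, learners):
    """Average the class probabilities predicted by all learners."""
    preds = [learner_predict(x, *params) for params in learners]
    return [sum(col) / len(preds) for col in zip(*preds)]


def random_learner(rng, k, m, f, classes):
    filters = [[[rng.uniform(-1, 1) for _ in range(m)] for _ in range(k)]
               for _ in range(f)]
    v = [rng.uniform(-1, 1) for _ in range(f)]
    w_out = [[rng.uniform(-1, 1) for _ in range(f)] for _ in range(classes)]
    return filters, v, w_out


rng = random.Random(0)
n, m, f, classes = 6, 4, 3, 5  # toy sizes, chosen only for illustration
x = [[rng.uniform(-1, 1) for _ in range(m)] for _ in range(n)]
learners = [random_learner(rng, k, m, f, classes) for k in (2, 3)]
probs = ensemble_predict(x, learners)
print(len(probs), round(sum(probs), 6))
```

Because each learner sees a different kernel size, the two forks receive sequences of different lengths (here 5 and 4 rows) yet produce class distributions of the same size, which is what makes simple averaging possible.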

4 Experiments and Results

We wish to prove that all techniques of (i) convolutions, (ii) recurrent units, (iii) attention and (iv) ensembles contribute to the performance increase. We use models that exploit only a subset of the available neural modules as baselines and show that ECGA outperforms them, i.e. the best performance comes from combining ensemble methods with all available neural modules.

We chose three datasets that emphasize the diversity of situations in which ECGA can perform. We first aim for a large, well-studied dataset with a high number of classes. With its 14 classes, the DBpedia dataset overshadows others like AGNews, which contains only 4. We then focus on a different language (Spanish), in a classification setting with an even higher class count, concretely 44. Finally, we hypothesize that, despite its apparent size, ECGA can become the new state-of-the-art in a complex low data regime, represented by the textual churn detection dataset (Amiri and Daumé III, 2015). The complexity of the third task stems from two factors. First, the nature of the task is inherently difficult even for human annotators. Second, the available dataset is quite small and very unbalanced.

Finally, we did not focus on tasks where the state-of-the-art results are obtained mainly after heavy fine-tuning and pre-processing, a practice that does not generalize to new domains. Examples of this include sentiment analysis on datasets like IMDB (Maas et al., 2011) and YELP. ECGA achieves very good performance without any cumbersome data pre-processing. Moreover, we are not tackling a multi-label setting, therefore datasets like Reuters are not suitable for our analysis.

In all experiments the hyper-parameter tuning consists of grid search, with at least 5 experiments for each setting. We do not report confidence intervals, as conducting one experiment on the argumentation mining and DBpedia datasets takes more than 7 h and 12 h respectively. This is also the reason we do not perform experiments with more than two learners: adding extra learners increases the number of model parameters and thus also the training time of the model.

4.1 DBpedia

The DBpedia dataset is compiled for multi-class text classification using Wikipedia article titles and abstracts. It contains data points from $14$ classes with a pre-defined train and test set (Zhang et al., 2015). The state-of-the-art performance on DBpedia is achieved by Howard and Ruder (2018) through a system that is not end-to-end and uses transfer learning. However, this requires the availability of datasets from at least two similar domains, and therefore we do not compare ECGA against their system. The best comparable method that does not use transfer learning is that of Johnson and Zhang (2016), who reach an error rate of $0.84\%$.

For training, we use the FastText word embeddings (Bojanowski et al., 2016) and pad all sentences to a length of 60. For pure CNN, we use a kernel size of 2 with 256 filters. The number of units equals 128 whenever the model contains GRU cells. ECGA has two learners with kernel sizes of 2 and 3. The number of filters is 256 and the number of units equals 128 for both learners. We also apply dropout with a rate of 0.3 between all layers. Finally, we use the Adam optimizer with a learning rate of $10^{-4}$, $\beta_{1}=0.7$ and $\beta_{2}=0.99$.
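To make the DBpedia settings concrete, the following sketch records them in a small configuration object and derives the tensor shapes each learner would see. The class and function names are hypothetical, and the BiGRU output width of twice the number of units assumes that the forward and backward states are concatenated, which the text does not state.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LearnerConfig:
    kernel_size: int
    filters: int = 256    # number of convolution filters f
    gru_units: int = 128  # units per GRU direction


PADDED_LENGTH = 60  # all sentences are padded to 60 tokens
DROPOUT_RATE = 0.3
ADAM = {"learning_rate": 1e-4, "beta_1": 0.7, "beta_2": 0.99}


def learner_shapes(cfg, n=PADDED_LENGTH):
    """Shapes after the convolution and after the BiGRU for one learner."""
    steps = n - cfg.kernel_size + 1  # one row per k-gram
    return {
        "conv_output": (steps, cfg.filters),
        # Assumes both directions are concatenated at every step.
        "bigru_output": (steps, 2 * cfg.gru_units),
    }


dbpedia_learners = [LearnerConfig(kernel_size=2), LearnerConfig(kernel_size=3)]
for cfg in dbpedia_learners:
    print(cfg.kernel_size, learner_shapes(cfg))
```

With a padded length of 60, the two learners therefore operate on sequences of 59 bigrams and 58 trigrams respectively.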

We present the model comparison on DBpedia in Table 1. ECGA beats all baselines and attains the state-of-the-art in DBpedia. In addition, it does so without the need of training or fine tuning word embeddings while being an end-to-end model.

4.2 Argumentation Mining

We use the dataset released by Fierro et al. (2017) for argumentation mining, our second complex text classification task, in Spanish instead of English. It contains more than $200000$ data points and each one is labelled with a topic, concept and argument mode. The dataset can be used for two different classification tasks (Task A and Task C (Fierro et al., 2017)) with up to $44$ labels.

To ensure a fair comparison, we adopt exactly the experimental setup of Fierro et al. (2017). For Task A, we predict the concept of a given data point. To do so, we split the data points into four disjoint topic sets: Values (V), Rights (R), Duties (D) and Institutions (I). We then train different classifiers on the four subsets in order to predict the concept. For Task C, we predict the argumentation mode of a data point after removing those with a blank or undefined label.

Once again, we use a padding of 60 tokens and the FastText word embeddings. With respect to the hyper-parameters, we use the Adam optimizer with its default parameter values (a learning rate of $10^{-3}$, $\beta_{1}=0.9$, $\beta_{2}=0.999$). For CNN, we use a kernel size of 2 with 256 (for topics V and R) or 512 (for topics D and I) filters. The number of units in the GRU layer equals either 128 (for topics V and R) or 256 (for topics D and I). ECGA employs two learners with kernel sizes of 2 and 3. Both learners have 256 filters and 128 units independently of the topic. We apply dropout between all layers with a rate of 0.5. For Task C we use the same parameters as for Task A. The only difference is that ECGA uses two learners with kernel sizes of 2 and 3 with 512 filters and 256 units.

Experimental results for Task A and C are tabulated in Tables 2(a) and 2(b) respectively with the same layout as in (Fierro et al., 2017). Once again, the performance we attain on both tasks proves that ECGA surpasses significantly all baselines and the state-of-the-art.

4.3 Textual Churn Detection

We use the publicly available dataset of Amiri and Daumé III (2015) for textual churn detection. The authors use only tweets with annotation confidence larger than $0.7$ . We follow the same approach in order to have a fair comparison against their system. The resulting dataset contains 4728 tweets and only 900 out of them are churny.
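The class imbalance described above can be quantified with a short calculation. The sketch below is illustrative only: it assumes that the macro F-score is the unweighted mean of the per-class F1 scores, and it evaluates a hypothetical majority-class predictor (not a model from the paper) to show why accuracy alone would be misleading on this dataset.

```python
def f1(tp, fp, fn):
    """F1 score from counts; defined as 0 when the class is never involved."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def macro_f1(y_true, y_pred, labels=(0, 1)):
    """Unweighted mean of the per-class F1 scores."""
    scores = []
    for c in labels:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        scores.append(f1(tp, fp, fn))
    return sum(scores) / len(scores)


TOTAL, CHURNY = 4728, 900  # dataset sizes reported in the text
y_true = [1] * CHURNY + [0] * (TOTAL - CHURNY)
majority = [0] * TOTAL     # hypothetical baseline: always predict "not churny"

churn_share = CHURNY / TOTAL
accuracy = sum(t == p for t, p in zip(y_true, majority)) / TOTAL
macro = macro_f1(y_true, majority)
print(f"churny share {churn_share:.3f}, accuracy {accuracy:.3f}, macro F1 {macro:.3f}")
```

Roughly one tweet in five is churny, so a classifier that never predicts churn still reaches about 81% accuracy while its macro F-score stays below 0.5.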

Gridach et al. (2017) achieve state-of-the-art in textual churn detection by enriching the features extracted from a CNN with hand-crafted ones. This approach does not scale, as additional human knowledge is not readily available in all cases.

We use the Twitter GloVe word embeddings and perform some data cleaning, such as standardization of URLs, smileys, usernames and numbers. In addition, we restrict our vocabulary to 1000 tokens and pad each tweet to a length of 50. We evaluate our models by performing 10-fold cross validation, the same as Amiri and Daumé III (2016) and Gridach et al. (2017). The Adam optimizer again has the default parameter values. For CNN, we use a kernel size of 3 with 64 filters and 64 units for BiGRU. The kernel size equals 2, the number of filters 128 and the number of units 64 when CNN is combined with BiGRU (with or without attention). Finally, ECGA has two learners with kernel sizes of 1 and 2 with 128 filters and 64 units.
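A minimal sketch of the kind of tweet cleaning described above is given below. The exact rules used in the paper are not specified, so the regular expressions, the placeholder tokens (`<url>`, `<user>`, `<smile>`, `<number>`), the reserved `<pad>` and `<unk>` entries, and the example tweets are all assumptions made for illustration.

```python
import re
from collections import Counter

URL = re.compile(r"https?://\S+|www\.\S+")
USER = re.compile(r"@\w+")
SMILEY = re.compile(r"[:;=]-?[)(DPp]")
NUMBER = re.compile(r"\d+(?:[.,]\d+)*")


def clean(text):
    """Replace URLs, usernames, smileys and numbers with placeholder tokens."""
    text = URL.sub(" <url> ", text)
    text = USER.sub(" <user> ", text)
    text = SMILEY.sub(" <smile> ", text)
    text = NUMBER.sub(" <number> ", text)
    return text.lower().split()


def build_vocab(token_lists, max_size=1000):
    """Keep the most frequent tokens; two slots are reserved for padding
    and unknown words (whether the 1000 includes them is an assumption)."""
    counts = Counter(tok for toks in token_lists for tok in toks)
    vocab = {"<pad>": 0, "<unk>": 1}
    for tok, _ in counts.most_common(max_size - len(vocab)):
        vocab[tok] = len(vocab)
    return vocab


def encode(tokens, vocab, length=50):
    """Map tokens to ids, then truncate or pad to a fixed length."""
    ids = [vocab.get(t, vocab["<unk>"]) for t in tokens[:length]]
    return ids + [vocab["<pad>"]] * (length - len(ids))


# Hypothetical tweets, used only to exercise the functions.
tweets = [
    "@bob I'm leaving Verizon :( see http://t.co/x 2 bills",
    "Loving my new plan :) thanks @carrier",
]
tokenized = [clean(t) for t in tweets]
vocab = build_vocab(tokenized)
encoded = [encode(toks, vocab) for toks in tokenized]
print(tokenized[0])
```

Mapping such surface forms to shared placeholders keeps the vocabulary small, which matters when it is capped at 1000 tokens.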

The results in Table 3 show once again that the more neural modules we add, the more the performance increases. ECGA surpasses the state-of-the-art in textual churn detection by 3.15 percentage points of macro F-score (87.00 vs. 83.85).

5 Conclusion

We work towards creating a one-size-fits-all go-to model for any novel text classification task. Our effort originates from our belief that all neural components can gradually contribute to the performance increase of a classifier. We introduce ECGA, a universal text classifier that combines Ensembles, CNN, GRU and Attention. We perform extensive experiments for complex text classification tasks using diverse datasets for topic classification, argumentation mining and textual churn detection. Our experiments validate that ECGA is an end-to-end model that achieves or surpasses the existing state-of-the-art performance for manifold text classification tasks.

Bibliography

Hadi Amiri and Hal Daumé III. 2015. Target-dependent churn classification in microblogs. In AAAI, pages 2361–2367.
Hadi Amiri and Hal Daumé III. 2016. Short text representation for detecting churn in microblogs. In AAAI, pages 2566–2572.
Piotr Bojanowski, Edouard Grave, Armand Joulin, and Tomas Mikolov. 2016. Enriching word vectors with subword information.
Leo Breiman. 2001. Random forests. Machine Learning, 45(1):5–32.
Kyunghyun Cho, Bart van Merrienboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2014. Learning phrase representations using RNN encoder–decoder for statistical machine translation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1724–1734.
Constanza Fierro, Claudio Fuentes, Jorge Pérez, and Mauricio Quezada. 2017. 200k+ crowdsourced political arguments for a new Chilean constitution. In Proceedings of the 4th Workshop on Argument Mining, pages 1–10.
Mourad Gridach, Hatem Haddad, and Hala Mulki. 2017. Churn identification in microblogs using convolutional neural networks with structured logical knowledge. In Proceedings of the 3rd Workshop on Noisy User-generated Text, pages 21–30.
Jeremy Howard and Sebastian Ruder. 2018. Fine-tuned language models for text classification.