Zoho at SemEval-2019 Task 9: Semi-supervised Domain Adaptation using Tri-training for Suggestion Mining

Sai Prasanna; Sri Ananda Seelan

arXiv:1902.10623·cs.CL·April 9, 2019

TL;DR

This paper presents a semi-supervised domain adaptation method using tri-training with CNNs and contextual embeddings for suggestion mining, effectively handling domain shifts and improving cross-domain performance.

Contribution

It introduces a tri-training approach with CNNs and contextual embeddings for semi-supervised domain adaptation in suggestion mining, addressing domain shift without labeled data.

Findings

1. Achieved 68.07 F1-score in in-domain suggestion mining.

2. Achieved 81.94 F1-score in cross-domain suggestion mining.

3. Tri-training effectively improves domain adaptation performance.

Abstract

This paper describes our submission for the SemEval-2019 Suggestion Mining task. A simple Convolutional Neural Network (CNN) classifier with contextual word representations from a pre-trained language model was used for sentence classification. The model is trained using tri-training, a semi-supervised bootstrapping mechanism for labelling unseen data. Tri-training proved to be an effective technique to accommodate domain shift for cross-domain suggestion mining (Subtask B) where there is no hand labelled training data. For in-domain evaluation (Subtask A), we use the same technique to augment the training set. Our system ranks thirteenth in Subtask A with an $F_{1}$ -score of 68.07 and third in Subtask B with an $F_{1}$ -score of 81.94.

Tables

Table 1: Performance metrics of different models on the validation and test sets of both subtasks. Means and 95% confidence intervals (t-distribution) are reported over five runs with different random seeds. Upsampling is applied to the training dataset unless otherwise specified. A single model from the experiments marked with * was used for the final submission.

Subtask A - Technical Domain
Experiment	Validation Precision	Validation Recall	Validation $F_{1}$	Test Precision	Test Recall	Test $F_{1}$
Organizer Baseline	58.72	93.24	72.06	15.69	91.95	26.80
DAN +glove	68.51 $\pm$ 2.43	87.30 $\pm$ 5.00	76.69 $\pm$ 1.06	25.40 $\pm$ 3.56	84.60 $\pm$ 9.87	38.84 $\pm$ 3.10
DAN +bert	76.06 $\pm$ 1.31	90.27 $\pm$ 1.71	82.55 $\pm$ 0.50	45.80 $\pm$ 4.49	90.80 $\pm$ 1.75	60.82 $\pm$ 3.99
DAN +bert w/o upsampling	79.04 $\pm$ 2.67	83.38 $\pm$ 2.73	81.11 $\pm$ 0.68	55.06 $\pm$ 6.36	83.68 $\pm$ 2.75	66.28 $\pm$ 4.28
CNN +bert	80.34 $\pm$ 4.21	89.93 $\pm$ 4.23	84.76 $\pm$ 0.52	50.34 $\pm$ 6.70	91.72 $\pm$ 2.55	64.81 $\pm$ 4.86
CNN +bert w/o upsampling	83.22 $\pm$ 3.01	84.73 $\pm$ 3.86	83.90 $\pm$ 0.70	58.98 $\pm$ 5.41	88.05 $\pm$ 1.63	70.58 $\pm$ 4.24
CNN +bert +tritrain_Test*	83.06 $\pm$ 1.96	89.19 $\pm$ 1.88	86.00 $\pm$ 0.35	52.89 $\pm$ 2.69	90.80 $\pm$ 2.02	66.81 $\pm$ 1.90
Subtask B - Hotel Reviews Domain
Experiment	Validation Precision	Validation Recall	Validation $F_{1}$	Test Precision	Test Recall	Test $F_{1}$
Organizer Baseline	72.84	81.68	77.01	68.86	78.16	73.21
DAN +glove	82.00 $\pm$ 4.25	52.97 $\pm$ 9.25	64.01 $\pm$ 5.75	73.32 $\pm$ 3.50	46.09 $\pm$ 7.21	56.35 $\pm$ 4.71
DAN +bert	89.75 $\pm$ 2.79	65.74 $\pm$ 8.71	75.65 $\pm$ 5.10	78.90 $\pm$ 4.03	64.20 $\pm$ 8.77	70.49 $\pm$ 4.09
DAN +bert w/o upsampling	94.26 $\pm$ 1.87	31.73 $\pm$ 5.73	47.31 $\pm$ 6.27	87.98 $\pm$ 3.41	31.09 $\pm$ 7.17	45.62 $\pm$ 7.47
CNN +bert	93.77 $\pm$ 1.34	51.88 $\pm$ 6.88	66.65 $\pm$ 5.68	90.17 $\pm$ 2.45	50.34 $\pm$ 8.71	64.31 $\pm$ 6.72
CNN +bert w/o upsampling	93.94 $\pm$ 1.36	45.99 $\pm$ 7.59	61.53 $\pm$ 6.73	89.75 $\pm$ 4.41	44.08 $\pm$ 9.38	58.66 $\pm$ 7.79
CNN +bert +tritrain_Test*	91.91 $\pm$ 2.06	88.32 $\pm$ 2.05	90.05 $\pm$ 0.76	81.26 $\pm$ 1.63	83.16 $\pm$ 1.40	82.19 $\pm$ 1.03
CNN +bert +tritrain_Yelp	88.09 $\pm$ 0.62	87.13 $\pm$ 0.38	87.61 $\pm$ 0.42	78.01 $\pm$ 5.42	86.67 $\pm$ 3.96	81.98 $\pm$ 2.05

Table 2: Label distribution

Dataset	Suggestions (%)
Training	23
Subtask A validation	50
Subtask B validation	50
Subtask A Test	10
Subtask B Test	42

Table 3: Pairwise comparison of various models using McNemar's test. $p \leq 0.05$ indicates a significant disagreement between the model predictions.

Subtask	Model A	Model B	p-value
A	DAN +glove	DAN +bert	$\approx 0$
A	DAN +bert	CNN +bert	0.046
A	CNN +bert	CNN +bert +tritrain_Test	0.108
B	DAN +glove	DAN +bert	$1.419 \times 10^{-5}$
B	DAN +bert	CNN +bert	0.4208
B	CNN +bert	CNN +bert +tritrain_Test	$3.251 \times 10^{-8}$
B	CNN +bert +tritrain_Test	CNN +bert +tritrain_Yelp	0.5862

Code & Models

Repositories

sai-prasanna/suggestion-mining-semeval19
Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Multimodal Machine Learning Applications

Full text

Zoho at SemEval-2019 Task 9: Semi-supervised Domain Adaptation using Tri-training for Suggestion Mining

Sai Prasanna, Zoho

Sri Ananda Seelan, Zoho

1 Introduction

Task 9 of SemEval-2019 Negi et al. (2019) focuses on mining sentences that contain suggestions in online discussions and reviews. Suggestion Mining is modelled as a sentence classification task with two Subtasks:

• Subtask A evaluates the classifier performance on a technical domain specific setting.

• Subtask B evaluates the domain adaptability of a model by doing cross-domain suggestion classification on hotel reviews.

We approached this task as an opportunity to test the effectiveness of transfer learning and semi-supervised learning techniques. In Subtask A, the high class imbalance and relatively small size of the training data make it an ideal setup for evaluating the efficacy of recent transfer learning techniques. Using pre-trained language models for contextual word representations has been shown to improve many Natural Language Processing (NLP) tasks Peters et al. (2018); Ruder and Howard (2018); Radford (2018); Devlin et al. (2018). This transfer learning technique is also effective when little labelled data is available, as shown in Ruder and Howard (2018). In this work, we use the BERT model Devlin et al. (2018) to obtain contextual representations. This improves scores even for simple baseline classifiers.

Subtask B does not allow the system to use manually labelled data from the target domain, and hence it lends itself to a classic semi-supervised learning scenario. Many methods have been proposed for domain adaptation in NLP Blitzer et al. (2007); Chen et al. (2011); Chen and Cardie (2018); Zhou and Li (2005); Blum and Mitchell (1998). We use a label bootstrapping technique called tri-training Zhou and Li (2005), with which unlabelled samples are labelled iteratively with increasing confidence at each training iteration (explained in Section 2.4). Ruder and Plank (2018) show the effectiveness of tri-training for baseline deep neural models in text classification under domain shift. They also propose a multi-task approach to tri-training; however, we only adapt the classic tri-training procedure to the suggestion mining task.

The submitted system and experiments are explained in detail in the following sections. Section 2 describes the components of the system. Section 3 then details the experiments, results and ablation studies that were performed.

2 System Description

The models and the training procedures are built using the AllenNLP library Gardner et al. (2018). All the code needed to replicate our experiments is public and can be accessed at https://github.com/sai-prasanna/suggestion-mining-semeval19.

2.1 Data cleaning and pre-processing

Basic data pre-processing is done to normalize whitespace and to remove noisy symbols and accents. Very short sentences with fewer than four words are excluded from training.
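The cleaning steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the code from our repository: the function names `clean_sentence` and `keep_for_training` are hypothetical, and the regular expression that decides what counts as a "noisy symbol" is an assumption, since the exact symbol set is not listed here.

```python
import re
import unicodedata


def clean_sentence(text: str) -> str:
    """Strip accents, drop noisy symbols, and normalize whitespace."""
    # Decompose accented characters and drop the combining marks (e.g. é -> e).
    decomposed = unicodedata.normalize("NFKD", text)
    no_accents = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Assumption: a "noisy symbol" is anything other than word characters,
    # whitespace, and a few common punctuation marks.
    no_noise = re.sub(r"[^\w\s.,!?'-]", " ", no_accents)
    # Collapse runs of whitespace into single spaces.
    return re.sub(r"\s+", " ", no_noise).strip()


def keep_for_training(sentence: str, min_words: int = 4) -> bool:
    """Keep only sentences with at least `min_words` whitespace-separated words."""
    return len(sentence.split()) >= min_words


raw = ["  Please   add a café   locator ★ ", "Too short"]
cleaned = [clean_sentence(s) for s in raw]
train = [s for s in cleaned if keep_for_training(s)]
print(train)  # ['Please add a cafe locator']
```

The length filter is applied to training sentences only, so validation and test sentences of any length would still be scored.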

2.2 Word Representations

We use GloVe word representations Pennington et al. (2014) and compare the performance improvement that we obtain with pre-trained BERT representations Devlin et al. (2018).

2.3 Suggestion Classification

Our baseline classifier is the Deep Averaging Network (DAN) Iyyer et al. (2015). DAN is a neural bag-of-words model that is considered a strong baseline for text classification. In DAN, a sentence representation is obtained by averaging the word-level representations and is fed to a series of rectified linear unit (ReLU) layers followed by a final softmax layer.
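To make the DAN forward pass concrete, the following is a small pure-Python sketch: average the word vectors, apply ReLU layers, and finish with a softmax over the two classes. The dimensions are deliberately tiny, the weights are random, and the helper names (`average`, `linear`, `dan_forward`, `random_layer`) are hypothetical, so this only illustrates the computation and is not the AllenNLP model used in our experiments.

```python
import math
import random


def average(vectors):
    """Element-wise mean of the word vectors (the bag-of-words step)."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def linear(x, weights, bias):
    """Dense layer: one output per weight row."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, bias)]


def softmax(logits):
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def dan_forward(word_vectors, layers):
    """Average -> ReLU layers -> softmax over the final layer's logits."""
    hidden = average(word_vectors)
    for weights, bias in layers[:-1]:
        hidden = [max(0.0, z) for z in linear(hidden, weights, bias)]
    weights, bias = layers[-1]
    return softmax(linear(hidden, weights, bias))


def random_layer(rng, n_in, n_out):
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


rng = random.Random(0)
sizes = [8, 8, 4, 2, 2]  # toy stand-in for an input dimension followed by four layer sizes
layers = [random_layer(rng, a, b) for a, b in zip(sizes, sizes[1:])]
sentence = [[rng.gauss(0.0, 1.0) for _ in range(8)] for _ in range(5)]  # 5 words, 8d each
probs = dan_forward(sentence, layers)
print(len(probs))  # 2 class probabilities
```

Because the word vectors are averaged before any layer is applied, word order has no effect on the prediction, which is what makes DAN a bag-of-words baseline.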

A simple Convolutional Neural Network (CNN) text classifier Kim (2014) is used for the final submission.

2.4 Training

We use the classic tri-training procedure for label bootstrapping as described in Ruder and Plank (2018). Consider a labelled dataset $L$ from the source domain $S$ and an unlabelled dataset $U$ from the target domain $T$. The objective of tri-training is to label $U$ iteratively and augment $L$ with it. Three CNN +BERT classifiers $M_{1}$, $M_{2}$, $M_{3}$ are trained separately on subsets of $L$, namely $l_{1}$, $l_{2}$, $l_{3}$ respectively. These subsets are obtained from $L$ using bootstrap sampling with replacement.

These models are used to predict labels for the unlabelled set $U$. A prediction on which two models agree is treated as a new training example for the third model in the next iteration. For example, an unlabelled sentence $U_{1}\in U$ is added as a labelled example to $l_{1}$ if and only if $M_{2}$ and $M_{3}$ agree on the label for $U_{1}$. In the same way, $l_{2}$ is updated with newly labelled data whose labels are agreed upon by $M_{1}$ and $M_{3}$, and so on. This constitutes a single iteration of tri-training. The procedure used for training our models is given in Algorithm 1.

In this way, the original training data is extended with three newly labelled subsets, which are used again in the next training iteration. At the end of each iteration, the validation $F_{1}$-score is calculated from the predictions obtained through a majority vote of the three models. The procedure is continued until there is no improvement in the validation score.
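The loop described in this section can be sketched as follows. This is a simplified illustration and not a transcription of Algorithm 1: the classifier is abstracted as any `train_fn` that maps labelled examples to a prediction function, the toy word-vote trainer and the example sentences are invented for the demonstration, and retraining each model on its own bootstrap sample plus the pseudo-labelled sentences is our reading of the description above.

```python
import random
from collections import Counter
from typing import Callable, Sequence, Tuple

Example = Tuple[str, int]  # (sentence, label); label 1 marks a suggestion
Model = Callable[[str], int]


def majority_vote(models: Sequence[Model], text: str) -> int:
    return Counter(m(text) for m in models).most_common(1)[0][0]


def f1_score(models: Sequence[Model], data: Sequence[Example]) -> float:
    tp = fp = fn = 0
    for text, gold in data:
        pred = majority_vote(models, text)
        tp += int(pred == 1 and gold == 1)
        fp += int(pred == 1 and gold == 0)
        fn += int(pred == 0 and gold == 1)
    if tp == 0:
        return 0.0
    precision, recall = tp / (tp + fp), tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def tri_train(train_fn, labelled, unlabelled, validation, seed=0, max_iters=10):
    rng = random.Random(seed)
    # l_1, l_2, l_3: bootstrap samples of L, drawn with replacement.
    subsets = [[rng.choice(labelled) for _ in labelled] for _ in range(3)]
    models = [train_fn(s) for s in subsets]
    best_models, best_f1 = models, f1_score(models, validation)
    for _ in range(max_iters):
        retrained = []
        for i in range(3):
            j, k = [m for m in range(3) if m != i]
            # Sentences on which the other two models agree become
            # pseudo-labelled training examples for model i.
            pseudo = []
            for text in unlabelled:
                label = models[j](text)
                if label == models[k](text):
                    pseudo.append((text, label))
            retrained.append(train_fn(subsets[i] + pseudo))
        models = retrained
        score = f1_score(models, validation)
        if score <= best_f1:  # stop when validation F1 no longer improves
            break
        best_models, best_f1 = models, score
    return best_models, best_f1


def train_word_votes(examples: Sequence[Example]) -> Model:
    """Toy stand-in for the CNN: each word votes for the label it co-occurred with."""
    weights = Counter()
    for text, label in examples:
        for word in text.lower().split():
            weights[word] += 1 if label == 1 else -1
    return lambda text: int(sum(weights[w] for w in text.lower().split()) > 0)


labelled = [("please add a dark mode", 1), ("you should allow export", 1),
            ("the app crashed today", 0), ("login page is slow", 0)]
unlabelled = ["please offer free breakfast", "the room was dirty",
              "you should add more towels"]
validation = [("please add late checkout", 1), ("the pool was cold", 0)]
models, score = tri_train(train_word_votes, labelled, unlabelled, validation)
```

Only the models from the best-scoring iteration are returned, which mirrors stopping once the validation score no longer improves.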

3 Experiments and Results

This section details the various experiments that were performed using the above components for our submissions.

3.1 Data

The test set provided during the trial phase of the evaluation is used as the validation data for all our experiments. For those experiments that do not involve tri-training, we only use the provided labelled data from the technical domain for training.

In Subtask B, for those experiments that involve tri-training, $L$ is the same as mentioned above. $U$ here is obtained in two ways:

• Unlabelled data from final test set of Subtask B.

• Unlabelled data from Yelp hotel reviews Blomo et al. (2013).

The results reported are mean and confidence intervals of Precision, Recall and $F_{1}$ -score over five runs of the same experiments with different random seeds.
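A mean with a 95% confidence interval over five runs can be computed as below. This sketch assumes the usual two-sided interval $\bar{x} \pm t_{0.975,\,n-1}\, s/\sqrt{n}$ based on Student's t-distribution. The critical value 2.776 is the standard table value for four degrees of freedom, and the five scores are made-up numbers for illustration, not results from our runs.

```python
import math
import statistics

# Two-sided 95% critical value of Student's t with 4 degrees of freedom
# (five runs); taken from a standard t-table.
T_CRIT_DF4 = 2.776


def mean_and_half_width(scores, t_crit=T_CRIT_DF4):
    """Return (mean, half-width) so that the interval is mean +/- half-width.

    The default `t_crit` is only valid for exactly five scores.
    """
    n = len(scores)
    mean = statistics.mean(scores)
    # statistics.stdev is the sample standard deviation (n - 1 in the denominator).
    half_width = t_crit * statistics.stdev(scores) / math.sqrt(n)
    return mean, half_width


# Hypothetical F1-scores from five random seeds (illustrative values only).
f1_runs = [81.0, 82.0, 83.0, 82.5, 81.5]
mean, half_width = mean_and_half_width(f1_runs)
print(f"{mean:.2f} +/- {half_width:.2f}")  # 82.00 +/- 0.98
```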

3.2 Input

For input representations, we use 300d GloVe vectors with a dropout Srivastava et al. (2014) of 0.2 for regularization. We also experiment with the pre-trained BERT base uncased model. The BERT model is not fine-tuned during our training. A dropout of 0.5 is applied to the 768d representations obtained from BERT.

3.3 Baseline Deep Averaging Net

Our neural baseline is the Deep Averaging Net (DAN) (Section 2.3). When used with GloVe, the hidden sizes of DAN are 300, 150, 75, and 2 respectively. When BERT representations are used, the hidden sizes of the network are 768, 324, 162, and 2 respectively. We report an $F_{1}$-score of 60.82 when DAN is used with BERT in Subtask A and 70.49 in Subtask B. Both these scores are significant improvements over those obtained with GloVe representations (Table 1).

We retain the same configuration of BERT embedding layer for other experiments also. Training is performed with Adam Kingma and Ba (2015) optimizer with a learning rate of $1\mathrm{e}{-3}$ for all the models.

3.4 CNN Classifier

The CNN classifier is composed of four 1-D convolution layers with filter widths ranging from two to five. Each convolutional layer has 192 filters. The output from each layer is max-pooled over sequence (time) dimension. This results in four 192d vectors, which are concatenated to get a 768d output.

The max-pooled outputs are passed through four fully connected feed-forward layers with hidden dimensions of 768, 324, 162, and 2 respectively. The intermediate layers use ReLU activation and the final layer is a softmax layer. We use a dropout of 0.2 on all layers of the feed-forward network except the final layer.

Without tri-training, this model obtains an absolute improvement of $\approx$ 4% $F_{1}$-score over DAN in Subtask A. However, in Subtask B it performs worse than the baseline DAN model, with an $F_{1}$-score of 64.31. This decrease in performance could be due to overfitting on the source domain, since the CNN has a larger number of parameters than DAN.
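The dimension bookkeeping of the convolutional encoder (four filter widths, 192 filters each, max-pooling over time, concatenation to 768d) can be checked with a small pure-Python sketch. The weights are random, the token vectors are 8d instead of 768d to keep it fast, biases and nonlinearities are omitted for brevity, and the function names are hypothetical. It illustrates the shapes only and is not the AllenNLP encoder used in our system.

```python
import random
from typing import List

Vector = List[float]


def conv_max_over_time(tokens: List[Vector], filters: List[List[Vector]]) -> Vector:
    """Slide each filter over the sentence and keep its maximum response.

    Each filter is a list of `width` weight vectors, one per token in the window.
    The sentence must contain at least `width` tokens.
    """
    width = len(filters[0])
    pooled = []
    for filt in filters:
        responses = []
        for start in range(len(tokens) - width + 1):
            window = tokens[start:start + width]
            responses.append(
                sum(w * x for w_vec, x_vec in zip(filt, window) for w, x in zip(w_vec, x_vec))
            )
        pooled.append(max(responses))  # max-pool over the sequence (time) dimension
    return pooled


def cnn_encode(tokens, widths=(2, 3, 4, 5), num_filters=192, seed=0):
    """Concatenate the max-pooled outputs of one convolution per filter width."""
    rng = random.Random(seed)
    dim = len(tokens[0])
    features = []
    for width in widths:
        filters = [
            [[rng.uniform(-0.1, 0.1) for _ in range(dim)] for _ in range(width)]
            for _ in range(num_filters)
        ]
        features.extend(conv_max_over_time(tokens, filters))
    return features


rng = random.Random(1)
sentence = [[rng.gauss(0.0, 1.0) for _ in range(8)] for _ in range(6)]  # 6 tokens, 8d each
features = cnn_encode(sentence)
print(len(features))  # 768 = 4 filter widths * 192 filters
```

Max-pooling over time makes the output size independent of sentence length, so the same feed-forward layers can follow for any input with at least five tokens (shorter inputs would need padding).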

3.5 Tri-training

The aim of tri-training is domain adaptation: it labels unseen data from a new domain. For Subtask B, the CNN + BERT model achieves an $F_{1}$-score of 82.19 when trained with the tri-training procedure given in Algorithm 1. Tri-training is used to label the 824 unlabelled sentences from the test set of Subtask B, which are then added to the original training data. This score is a large improvement over the classifier trained only on the given data, which obtains an $F_{1}$-score of 64.31.

We also run the same experiment using 5000 unlabelled sentences from the Yelp hotel reviews dataset Blomo et al. (2013). The model obtains a similar score of 81.98, which underscores the importance of tri-training for domain adaptation.

For Subtask A, we get an improvement in the $F_{1}$-score using tri-training; however, the increase is not as pronounced as the one we observe for Subtask B. We compare the statistical significance of the different models and experiments in Section 3.7.

3.6 Upsampling

We also wanted to find out how the class balance in the dataset affected our model performance. The class distributions of the datasets, including the test set distribution that was obtained after the final evaluation phase, are given in Table 2.

The original training data has a class imbalance, with only 23% of the sentences labelled as suggestions. We tried to balance the labels by naive upsampling, i.e., adding duplicates of sentences that are labelled as suggestions. This gave us a balanced training dataset for our experiments and resulted in consistent gains over the original dataset during the trial evaluation phase.

However, during the final submission we found that in Subtask A the model's performance on the test set did not correlate well with that on the validation set, as shown in Table 1. This could be because the percentage of positive labels in the test set is only 10%, while the validation set has 50%.

Experiments without upsampling give better performance on the test set even though the validation score decreases. For Subtask B, however, upsampling actually increases model performance. In hindsight, this could be because the validation and test sets have similar class label distributions.
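Naive upsampling of the positive class can be sketched as follows. The helper name `upsample_positives` and the toy dataset are hypothetical. The 23/77 split only mimics the proportion reported in Table 2, and sampling the duplicates at random (rather than, say, repeating the whole positive set) is an assumption.

```python
import random
from typing import List, Tuple

Example = Tuple[str, int]  # (sentence, label); label 1 marks a suggestion


def upsample_positives(examples: List[Example], seed: int = 0) -> List[Example]:
    """Duplicate randomly chosen suggestions until both classes are the same size."""
    rng = random.Random(seed)
    positives = [e for e in examples if e[1] == 1]
    negatives = [e for e in examples if e[1] == 0]
    if not positives or len(positives) >= len(negatives):
        return list(examples)  # nothing to balance
    extra = [rng.choice(positives) for _ in range(len(negatives) - len(positives))]
    balanced = list(examples) + extra
    rng.shuffle(balanced)
    return balanced


# Toy dataset with 23% suggestions, mirroring the proportion in Table 2.
data = [(f"suggestion {i}", 1) for i in range(23)] + [(f"other {i}", 0) for i in range(77)]
balanced = upsample_positives(data)
n_pos = sum(label for _, label in balanced)
print(len(balanced), n_pos)  # 154 77
```

Upsampling changes only the training distribution, so a model trained this way is calibrated for a 50% positive rate, whatever the rate in the data it is later evaluated on.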

The submitted models received an $F_{1}$-score of 68.07 in Subtask A and 81.94 in Subtask B.

3.7 Statistical Significance Test

Reichart et al. (2018) suggest methods to measure whether two models have statistically significant differences in their predictions on a single dataset. We use McNemar's test, a non-parametric significance test that they recommend for binary classification. Pairwise comparisons of a few of our models are reported in Table 3. The table contains the p-values under the null hypothesis, which is that the two models do not differ significantly in their label predictions. In simpler words, a small p-value for a pair of experiments denotes a significant disagreement between the predictions of the two models. For example, in Table 3 the DAN + GloVe and DAN + BERT models have a p-value below 0.05 in both Subtasks. This indicates that there is significant disagreement between the predictions of the two models. Since DAN + BERT gets a better $F_{1}$-score and $p < 0.05$, we can confidently assert that the improvement is not obtained by chance.

We use majority voting from five random seeds to get the final predictions on the test set for doing the paired significance testing.
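The paired test can be illustrated with a short pure-Python sketch. It uses the exact (binomial) form of McNemar's test on the discordant pairs. Which variant of the test was used for Table 3 is not stated here, so treat that choice as an assumption. The function names and the toy labels are hypothetical.

```python
from math import comb
from typing import List, Sequence


def majority_vote(runs: Sequence[Sequence[int]]) -> List[int]:
    """Combine binary predictions from an odd number of seeds, example by example."""
    return [int(2 * sum(votes) > len(votes)) for votes in zip(*runs)]


def mcnemar_exact(y_true: Sequence[int], pred_a: Sequence[int], pred_b: Sequence[int]) -> float:
    """Two-sided exact McNemar p-value for two classifiers on the same examples."""
    # Discordant pairs: exactly one of the two models is correct.
    only_a = sum(a == y and b != y for y, a, b in zip(y_true, pred_a, pred_b))
    only_b = sum(a != y and b == y for y, a, b in zip(y_true, pred_a, pred_b))
    n = only_a + only_b
    if n == 0:
        return 1.0  # the models never disagree in correctness
    k = min(only_a, only_b)
    # Under the null hypothesis each discordant pair favours either model
    # with probability 0.5, so the smaller count follows Binomial(n, 0.5).
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


gold = [1] * 10
model_a = [1] * 10           # correct on every example
model_b = [1, 1] + [0] * 8   # wrong on eight examples
p_value = mcnemar_exact(gold, model_a, model_b)
print(p_value)  # 0.0078125
```

With eight discordant pairs that all favour one model, the p-value is $2 \cdot 0.5^{8} = 0.0078125$, well below 0.05.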

4 Conclusion

We discussed our experiments on suggestion mining using tri-training. Tri-training combined with BERT representations proved to be an effective technique for semi-supervised learning, especially in a cross-domain setting. Future work could explore better ways of doing tri-training, evaluate the effect of contextual representations on tri-training convergence, and try more sophisticated classification architectures that may include different attention mechanisms.

Bibliography

Blitzer et al. (2007) John Blitzer, Mark Dredze, and Fernando Pereira. 2007. Biographies, Bollywood, boom-boxes and blenders: Domain adaptation for sentiment classification. In ACL.
Blomo et al. (2013) Jim Blomo, Martin Ester, and Marty Field. 2013. RecSys challenge 2013. In RecSys.
Blum and Mitchell (1998) Avrim Blum and Tom M. Mitchell. 1998. Combining labeled and unlabeled data with co-training. In COLT.
Chen et al. (2011) Minmin Chen, Kilian Q. Weinberger, and John Blitzer. 2011. Co-training for domain adaptation. In NIPS.
Chen and Cardie (2018) Xilun Chen and Claire Cardie. 2018. Multinomial adversarial networks for multi-domain text classification. In NAACL-HLT.
Devlin et al. (2018) Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. CoRR, abs/1810.04805.
Gardner et al. (2018) Matt Gardner, Joel Grus, Mark Neumann, Oyvind Tafjord, Pradeep Dasigi, Nelson F. Liu, Matthew E. Peters, Michael Schmitz, and Luke S. Zettlemoyer. 2018. AllenNLP: A deep semantic natural language processing platform. CoRR, abs/1803.07640.
Iyyer et al. (2015) Mohit Iyyer, Varun Manjunatha, Jordan L. Boyd-Graber, and Hal Daumé. 2015. Deep unordered composition rivals syntactic methods for text classification. In ACL.