Supervised Sentiment Classification with CNNs for Diverse SE Datasets

Achyudh Ram; Meiyappan Nagappan

arXiv:1812.09653·cs.CL·December 27, 2018


TL;DR

This paper introduces a CNN-LSTM hierarchical model trained on pre-trained word vectors for sentiment analysis in software engineering, significantly improving accuracy over existing tools across multiple datasets.

Contribution

The study presents a novel supervised deep learning model tailored for SE sentiment analysis, outperforming existing methods and demonstrating the benefits of small-scale re-training.

Findings

1. Model achieves state-of-the-art accuracy on all datasets.

2. Supervised re-training with small labeled samples improves performance.

3. Deep learning models outperform traditional sentiment analysis tools in SE context.


Tables

Table I: Gold-standard datasets used for evaluation

Dataset	Number of samples	Positive (class share)	Neutral (class share)	Negative (class share)
App Reviews	341	54.5%	7.3%	38.2%
Jira	926	31.3%	N/A	68.7%
Gerrit	1600	75.0%	N/A	25.0%
SO Java Lib.	1500	8.7%	79.4%	11.9%
SO Sentiments	4423	34.5%	38.3%	27.2%

Table II: Performance of classifiers on gold-standard datasets (P = precision, R = recall, F1 = F-score, Acc. = accuracy)

Dataset	Classifier	Neg. P	Neg. R	Neg. F1	Pos. P	Pos. R	Pos. F1	Neu. P	Neu. R	Neu. F1	Acc.
Jira	Naive Bayes	0.94	0.94	0.94	0.88	0.86	0.87	-	-	-	0.92
	VADER	0.99	0.72	0.83	0.62	0.99	0.76	-	-	-	0.80
	SentiStrength	0.99	0.70	0.82	0.60	0.99	0.75	-	-	-	0.79
	SentiStrengthSE	0.99	0.70	0.82	0.61	0.99	0.75	-	-	-	0.80
	Senti4SD	0.86	0.96	0.91	0.93	0.92	0.92	-	-	-	0.95
	SentiCR	0.94	0.99	0.96	0.97	0.86	0.91	-	-	-	0.95
	Hi-CNN-LSTM	0.97	0.99	0.98	0.98	0.92	0.95	-	-	-	0.97
App Reviews	Naive Bayes	0.88	0.65	0.75	0.78	0.84	0.81	0.18	0.30	0.22	0.72
	VADER	0.87	0.44	0.58	0.66	0.92	0.77	0.21	0.16	0.18	0.68
	SentiStrength	0.81	0.34	0.48	0.74	0.86	0.80	0.11	0.32	0.16	0.62
	SentiStrengthSE	0.93	0.30	0.45	0.72	0.74	0.73	0.09	0.40	0.15	0.54
	Senti4SD	0.77	0.81	0.79	0.84	0.90	0.87	0.10	0.05	0.07	0.81
	SentiCR	0.82	0.75	0.78	0.83	0.91	0.87	0.12	0.13	0.12	0.79
	Hi-CNN-LSTM	0.86	0.87	0.86	0.85	0.94	0.89	0.15	0.08	0.10	0.85
Gerrit	Naive Bayes	0.52	0.43	0.47	0.82	0.86	0.84	-	-	-	0.75
	VADER	0.43	0.45	0.44	0.81	0.80	0.80	-	-	-	0.72
	SentiStrength	0.99	0.70	0.82	0.60	0.99	0.75	-	-	-	0.78
	SentiStrengthSE	0.54	0.30	0.39	0.80	0.92	0.86	-	-	-	0.76
	Senti4SD	0.75	0.49	0.59	0.85	0.93	0.89	-	-	-	0.82
	SentiCR	0.61	0.66	0.63	0.88	0.86	0.87	-	-	-	0.81
	Hi-CNN-LSTM	0.75	0.46	0.57	0.84	0.95	0.89	-	-	-	0.83
SO Sentiments	Naive Bayes	0.59	0.57	0.58	0.87	0.75	0.81	0.63	0.73	0.67	0.69
	VADER	0.67	0.79	0.73	0.69	0.94	0.80	0.85	0.47	0.61	0.72
	SentiStrength	0.67	0.93	0.78	0.89	0.92	0.90	0.92	0.63	0.75	0.81
	SentiStrengthSE	0.75	0.76	0.75	0.91	0.82	0.86	0.72	0.79	0.75	0.79
	Senti4SD	0.79	0.85	0.82	0.90	0.92	0.91	0.84	0.79	0.81	0.85
	SentiCR	0.80	0.73	0.76	0.89	0.91	0.90	0.79	0.82	0.80	0.83
	Hi-CNN-LSTM	0.84	0.81	0.82	0.90	0.92	0.91	0.83	0.83	0.83	0.86
SO Java Lib.	Naive Bayes	0.54	0.38	0.45	0.46	0.22	0.30	0.85	0.93	0.89	0.80
	VADER	0.47	0.51	0.49	0.19	0.64	0.29	0.89	0.74	0.81	0.63
	SentiStrength	0.39	0.43	0.41	0.20	0.36	0.26	0.86	0.76	0.81	0.69
	SentiStrengthSE	0.50	0.18	0.26	0.31	0.22	0.26	0.82	0.93	0.87	0.78
	Senti4SD	0.55	0.35	0.43	0.65	0.16	0.26	0.85	0.96	0.90	0.82
	SentiCR	0.50	0.67	0.57	0.48	0.36	0.41	0.90	0.87	0.88	0.80
	Hi-CNN-LSTM	0.40	0.22	0.28	0.21	0.07	0.11	0.83	0.99	0.90	0.82

Table III: Time (in seconds) for training and testing one fold in a 10-fold cross validation

Dataset	Senti4SD train	Senti4SD test	Hi-CNN-LSTM train	Hi-CNN-LSTM test
Jira	868.54	135.43	29.87	0.32
App Reviews	387.78	80.71	22.20	0.36
Gerrit	1443.03	199.68	359.02	1.34
SO Sentiments	3839.34	445.67	430.63	2.13
SO Java Lib.	1340.19	183.91	13.33	0.52

Equations

$s = x_{1} \oplus x_{2} \oplus \dots \oplus x_{n-1} \oplus x_{n}$


Code & Models

Repositories

achyudhk/SentiGH (TensorFlow, official)


Taxonomy

Topics: Sentiment Analysis and Opinion Mining · Topic Modeling · Software Engineering Research

Full text

Supervised Sentiment Classification with CNNs for Diverse SE Datasets

Achyudh Ram, Meiyappan Nagappan

University of Waterloo

{arkeshav, mei.nagappan}@uwaterloo.ca

Abstract

Sentiment analysis, a popular technique for opinion mining, has been used by the software engineering research community for tasks such as assessing app reviews, developer emotions in issue trackers and developer opinions on APIs. Past research indicates that state-of-the-art sentiment analysis techniques have poor performance on SE data. This is because sentiment analysis tools are often designed to work on non-technical documents such as movie reviews. In this study, we attempt to solve the issues with existing sentiment analysis techniques for SE texts by proposing a hierarchical model based on convolutional neural networks (CNN) and long short-term memory (LSTM) trained on top of pre-trained word vectors. We assessed our model’s performance and reliability by comparing it with a number of frequently used sentiment analysis tools on five gold standard datasets. Our results show that our model pushes the state of the art further on all datasets in terms of accuracy. We also show that it is possible to get better accuracy after labelling a small sample of the dataset and re-training our model rather than using an unsupervised classifier.

I Introduction

There are a number of studies that attempt to understand the role of sentiments in the software development process by classifying sentiments expressed by mobile users in app reviews, or by developers in issue trackers, code review tools and websites such as Stack Overflow [1, 2, 3, 4]. These studies show that the effectiveness of sentiment analysis varies considerably with the nature of the dataset and with the task for which the classification tools were originally trained. Many existing studies use out-of-the-box sentiment analysis tools that were designed to work on non-technical documents such as movie reviews. This practice has been frequently criticized and has led researchers to develop tools specifically for software engineering texts [5]. Tools frequently used by the SE community, such as SentiStrength and SentiStrengthSE, were meant to be used on short texts [6, 7]. However, common SE use cases for these tools, such as code reviews and discussions on issue trackers or on Stack Overflow, often comprise multiple long sentences.

In this study, we address the issues raised by Lin et al. on using opinion mining in SE research [5]. Lin et al. state that there is no tool currently available for identifying sentiments expressed in SE related discussions and that re-training existing models on SE datasets does not improve the accuracy enough to justify expensive re-training for different datasets. The authors raise another concern that existing tools don’t have acceptable precision and recall levels for tasks such as software library recommendations, and that blindly using the predicted sentiment would lead to wrong recommendations. We attempt to solve the issues with existing sentiment analysis techniques by proposing a hierarchical model based on convolutional neural networks (CNN) and long short-term memory (LSTM) networks trained on SE datasets that pushes the state of the art further. Even though convolutional neural networks were originally invented for analyzing visual imagery, existing studies have shown that they achieve state-of-the-art results for various sentence classification tasks, including single sentence sentiment prediction [8, 9]. LSTM is a popular neural network architecture composed of recurring units, or cells, that act as memory, thus enabling the network to learn long-term dependencies [10]. Because these networks are designed to either retain or forget information in the cell state, they have been frequently used in studies and have achieved state-of-the-art results in various NLP tasks such as language modeling and machine translation [11, 12]. Zhou et al. show that a unified model that uses a CNN for feature extraction from phrases and an LSTM for obtaining the sentence representation outperforms a network that uses only a CNN or an LSTM [13]. In this study, we propose a unified hierarchical model in which we adapt the approach taken by Zhou et al. by having the CNN extract a sequence of representations for entire sentences rather than phrases, and the LSTM encode this sequence into a paragraph representation.

The goal of this study is to answer the following research questions:

RQ1: How does a unified hierarchical model perform when compared with other sentiment analysis tools on SE datasets?

RQ2: How does a unified hierarchical model scale with the amount of training data available when compared with other sentiment analysis tools?

We assess our model’s performance by comparing it with a number of frequently used sentiment analysis tools on five datasets that represent the SE scenarios in which researchers usually use these tools. Even with just minimal tuning of hyperparameters, our simple model exceeds the current state of the art on all five datasets. It also performs better on smaller datasets compared to other supervised classifiers. Further, it is considerably faster than the existing state of the art, Senti4SD.

II Related work

In this section, we look at the existing sentiment analysis tools frequently used by the SE community and relevant studies that compare the state of the art in sentiment analysis for SE datasets.

SentiStrength is one of the most widely adopted tools in the software engineering community for extracting sentiment strength from informal English text [6]. Because a sentence can have mixed sentiment, SentiStrength outputs both a positive and a negative emotion score for each sentence. It was originally applied to social web texts but can be adjusted for other domains by adding new relevant words and sentiment strengths to the term list. SentiStrengthSE is built on top of the original SentiStrength specifically for sentiment analysis of software engineering texts. Islam et al. showed that with heuristic improvements and a lexicon adjusted for technical texts, SentiStrengthSE outperforms SentiStrength on the JIRA issue comment benchmark dataset [7].

Senti4SD was developed by Calefato et al. for the specific purpose of performing sentiment analysis in a supervised setting on developer communication channels [14]. A dataset of around 4,000 manually labelled questions, answers, and comments extracted from Stack Overflow was used for training and validating the classification algorithm. The authors claim that their classifier reduces the misclassifications of neutral and positive posts on their dataset when compared to SentiStrength. Senti4SD uses a rich feature space comprising word embeddings along with n-gram, lexicon and keyword-based features.

SentiCR, a supervised sentiment analysis tool, was trained and validated on manually annotated code review comments from Gerrit [3]. The tool is based on the Gradient Boosting Tree (GBT) algorithm, and utilizes the bag of words model with Term Frequency Inverse Document Frequency (TF-IDF) weights as features. Apart from standard preprocessing, it applies the synthetic minority over-sampling technique (SMOTE) to address class imbalance in the dataset [15].

VADER (Valence Aware Dictionary and sEntiment Reasoner) is a rule-based model for general sentiment analysis [16]. It is tuned for social media texts by incorporating an empirically validated sentiment lexicon with five rules that embody grammatical and syntactical conventions. The authors showed that VADER outperforms human raters and that it generalizes better than the other classifiers they used for comparison.

From an analysis of the Apache Software Foundation issue tracker, Murgia et al. showed that development artifacts carry emotional information [17] about the software development process. However, they state that the greater the amount of context provided about an issue, the greater the extent to which human raters doubt their interpretation of emotions. A number of studies have benchmarked SE specific and off-the-shelf sentiment analysis toolkits on SE datasets: Jongeling et al. studied the extent to which the sentiments predicted by tools such as SentiStrength, NLTK, Stanford NLP and Alchemy agree among themselves and with the sentiments recognized by human evaluators [18]; Novielli et al. performed a replication of this study to benchmark the performance of three SE specific sentiment analysis tools: SentiCR, Senti4SD and SentiStrengthSE [19]; Lin et al. discuss the negative results obtained after training a recurrent neural network on a manually annotated dataset comprising questions, answers and comments from Stack Overflow [5]. They further investigate the current state of the art by analyzing the performance of popular sentiment analysis toolkits on various SE datasets. The authors highlight the limitations of the sentiment analysis tools used by the SE community and state that efforts should be made to make sentiment analysis practical for SE research.

III Background

III-A Sentiment analysis

Sentiment analysis is the task of predicting the polarity of a phrase (such as positive or negative) often using natural language processing and machine learning techniques. A large part of sentiment analysis involves building predictive models that attempt to identify the emotional state (such as sadness or joy) or the nature of the opinion of a subject (such as positive or negative). Supervised techniques require labelled training data to train machine learning algorithms, whereas unsupervised methods can be applied in cases where labelling data is time consuming or expensive. Unsupervised classifiers use knowledge-based techniques or lexicon-based methods to perform the classification. Recently, there have been studies where supervised and unsupervised techniques are combined with a majority rule or voting classifier [20, 21].

There are a number of commercial and open-source sentiment analysis tools available. In this study, we only consider tools that are publicly available and free for academic use. An overview of these tools along with the relevant research studies can be found in Section II.

III-B Deep learning for sentence classification

With rapidly increasing accessibility to deep learning, it has been applied to a wide range of NLP tasks, including sentence classification. These techniques capture contextual information better and mitigate the problems with the traditional bag-of-words model such as the curse of dimensionality (various phenomena that arise when analyzing and organizing data in high-dimensional spaces, often with hundreds or thousands of dimensions, that do not occur in low-dimensional settings; source: https://en.wikipedia.org/wiki/Curse_of_dimensionality). Two of the most widely used neural network architectures for natural language processing tasks are convolutional neural networks (CNN) and recurrent neural networks (RNN). In this study, we use a variation of an RNN called the long short-term memory (LSTM).

III-B1 Convolutional Neural Networks (CNNs)

A CNN typically comprises multiple convolutional layers, each of which might be followed by a pooling layer, and finally a fully connected layer. A fully-connected neural network would not be practical for learning features from images due to the large number of neurons required for processing even relatively small images. A convolutional layer, with the help of learnable filters, provides a solution to this problem by reducing the number of parameters. Unlike in a fully-connected neural network, each neuron is connected only to a local region of the input. Apart from convolutional layers, these networks include pooling layers which essentially perform downsampling to reduce the spatial size of the representation. This reduces the number of parameters, thereby reducing the amount of computation required. Because this architecture preserves the spatial structure of an image, it has been successfully used for computer vision tasks with minimal pre-processing required.

Kim et al. showed that CNNs can achieve state-of-the-art results in single sentence sentiment prediction among other sentence classification tasks [8]. In this approach, the vector representations of the words in a sentence were concatenated vertically to create a two-dimensional matrix for each sentence. The resulting matrix was passed through a CNN to extract higher-level features for performing the classification.

III-B2 Long Short-Term Memory (LSTM)

Practical difficulties in training RNNs due to trade-offs between efficient learning and latching on to information for long periods have been observed and studied in detail [22]. LSTMs are a variation of RNNs designed to solve this long-term dependency problem [10]. Like RNNs, LSTMs have a chain-like structure of repeating units. Unlike RNNs, which have a single layer in each repeating unit, LSTMs have a complex module of four interacting layers. The cell state aids the flow of information across the chain of modules, and three gates control the addition or removal of information to the cell state. The ability of these networks to deal with variable-length input sequences and capture long-term dependencies has enabled them to achieve remarkable results in a range of NLP tasks.
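For reference, the four interacting layers and three gates described above are commonly written as follows. This is the standard textbook formulation rather than anything specific to this study; the symbols $W$, $U$ and $b$ (weights and biases), $\sigma$ (logistic sigmoid) and $\odot$ (element-wise product) are notation introduced here for illustration.

```latex
\begin{aligned}
f_t &= \sigma(W_f x_t + U_f h_{t-1} + b_f) && \text{(forget gate)} \\
i_t &= \sigma(W_i x_t + U_i h_{t-1} + b_i) && \text{(input gate)} \\
o_t &= \sigma(W_o x_t + U_o h_{t-1} + b_o) && \text{(output gate)} \\
\tilde{c}_t &= \tanh(W_c x_t + U_c h_{t-1} + b_c) && \text{(candidate cell state)} \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t && \text{(cell state update)} \\
h_t &= o_t \odot \tanh(c_t) && \text{(hidden state)}
\end{aligned}
```

The additive update of $c_t$ is what lets information persist across many time steps: when $f_t$ is close to one and $i_t$ is close to zero, the cell state is carried forward almost unchanged.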

A frequently used variant of RNNs called the Bidirectional-RNN was proposed by Schuster et al. in 1997 [23]. In this architecture, classification is done on the combined outputs of two RNNs which process the input sentence from left to right and right to left. This approach can be extended to LSTM networks.

III-B3 Word2Vec

Word embedding is a language modelling technique used to create a continuous higher dimensional vector space representation of words such that similar words are closer to each other in that space. This learned distributed representation of words alleviates the curse of dimensionality [24]. Word2Vec refers to a class of two layer neural network models that are trained on a large corpus of text to produce such word embeddings [25]. There are two model architectures to learn the distributed representation of words:

1. Continuous bag-of-words (CBOW): In this architecture, the model predicts the current word from a window of surrounding context words. The order of context words does not influence the prediction [25].

2. Skip-gram: In this architecture, the model uses the current word to predict the surrounding window of context words. Unlike CBOW, the skip-gram architecture weighs nearby context words more heavily than more distant context words [26].

According to Mikolov et al., the skip-gram architecture is slower but works better for infrequent words. In practice, a number of optimizations are made to increase the training speed such as the sub-sampling of high frequency words or negative sampling to ensure that each training sample only modifies the weights corresponding to a random selection of negative words.
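The difference between the two architectures is easiest to see in the training examples each one builds from the same text. The following is a minimal sketch in plain Python, not the Word2Vec implementation: the function names and the toy sentence are hypothetical, and the distance-based weighting of context words in skip-gram, sub-sampling and negative sampling are all left out.

```python
def context_window(tokens, i, window):
    """Return the tokens within `window` positions of index i, excluding tokens[i]."""
    left = tokens[max(0, i - window):i]
    right = tokens[i + 1:i + 1 + window]
    return left + right


def cbow_examples(tokens, window=2):
    """CBOW: predict the current word from its context (one example per position)."""
    return [(context_window(tokens, i, window), tokens[i])
            for i in range(len(tokens))]


def skipgram_examples(tokens, window=2):
    """Skip-gram: predict each context word from the current word (one example per pair)."""
    return [(tokens[i], ctx)
            for i in range(len(tokens))
            for ctx in context_window(tokens, i, window)]


# Toy sentence (made up for illustration).
tokens = "this library is really fast".split()
cbow = cbow_examples(tokens, window=1)
skipgram = skipgram_examples(tokens, window=1)

print(cbow[1])       # context words -> target word
print(skipgram[:3])  # (input word, context word) pairs
```

With the same window, skip-gram produces one example per (word, context word) pair and therefore more updates per sentence than CBOW, which is consistent with it being the slower of the two to train.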

IV Methodology

IV-A Model architecture

The architecture of the unified hierarchical model proposed in this study consists of 2 components: a CNN and an LSTM. The architecture of the proposed model can be seen in Figure 1. In this section, we explain how the CNN is used to extract a sequence of representations for a sentence and how it works in unison with the LSTM to encode this sequence into a document/paragraph representation.

IV-A1 Convolutional Neural Network (CNN)

The CNN architecture used in this study (as shown in Figure 1) is a variation of the single-channel architecture used by Kim et al. [8]. Let $n$ be the length of the longest sentence in the dataset. For a given sentence in the dataset, let $x_{i}$ be the k-dimensional word vector representing the $i^{th}$ word in that sentence. Every sentence $s$ in the dataset is represented as the concatenation (represented by the operator $\oplus$ ) of $x_{i}$ , where $1\leq i\leq n,i\in\mathbb{N}$ :

$s = x_{1} \oplus x_{2} \oplus \dots \oplus x_{n-1} \oplus x_{n}$

Let us denote the concatenation of $(j+1)$ word vectors starting with $i$ as $x_{i:i+j}$ . This resultant matrix is then passed through a temporal convolutional layer with a filter window of size $f$ to produce new features. These filters are applied to each window of words in the sentence $s$ to produce a number of feature maps. A temporal max-pooling operation is applied to these feature maps to retain the feature with the highest value in every map. Finally, these features are fed to a fully connected layer of rectified linear units (ReLU) to create an $m$-dimensional vector representation of the sentence, which is then passed on to an LSTM.
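The convolution and pooling steps can be traced by hand on a tiny input. The sketch below is a plain-Python illustration under stated assumptions, not the Keras code used in the study: the sentence, the two filters and the function names are made up, the filter size is $f = 2$ instead of 5, and the final fully connected layer is omitted.

```python
def relu(v):
    """Rectified linear unit."""
    return max(0.0, v)


def temporal_convolution(sentence, filt, bias=0.0):
    """Slide one filter over every window of f consecutive word vectors.

    sentence: list of n word vectors (each a list of k floats)
    filt:     list of f weight vectors (each a list of k floats)
    Returns one feature map with n - f + 1 activations.
    """
    f = len(filt)
    feature_map = []
    for i in range(len(sentence) - f + 1):
        window = sentence[i:i + f]  # the word window x_{i:i+f-1}
        total = bias
        for word_vec, weight_vec in zip(window, filt):
            total += sum(x * w for x, w in zip(word_vec, weight_vec))
        feature_map.append(relu(total))
    return feature_map


def max_over_time(feature_map):
    """Temporal max-pooling: keep the highest activation in the map."""
    return max(feature_map)


# Toy sentence with n = 4 words and k = 2 dimensions (values are made up).
sentence = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]
# Two hypothetical filters, each covering f = 2 consecutive words.
filters = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[-1.0, 0.0], [0.0, -1.0]],
]

feature_maps = [temporal_convolution(sentence, flt) for flt in filters]
pooled = [max_over_time(fm) for fm in feature_maps]
print(feature_maps)  # one list of n - f + 1 activations per filter
print(pooled)        # one value per filter
```

Because pooling keeps a single value per filter, the pooled vector has a fixed length (the number of filters) regardless of sentence length, which is what allows sentences of different lengths to be fed to the same dense layer.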

IV-A2 Long Short-Term Memory (LSTM)

Every sample belonging to the dataset is treated as a document and is tokenized into sentences. Each sentence is passed through the CNN for extracting features as described above. For encoding these features, we use a Bidirectional-LSTM. The hidden states of the LSTM cells at the last time step form the encoded document/paragraph representation. This representation is passed to the final softmax layer for predicting the sentiment polarity, similar to the approach adopted by Zhou et al. [13].
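One way to write this step down, as a sketch under assumptions rather than the exact formulation used in the study: let $v_{1},\dots,v_{T}$ be the CNN representations of the $T$ sentences in a document, let $\overrightarrow{h}_{T}$ be the hidden state of the forward LSTM after reading $v_{1},\dots,v_{T}$, and let $\overleftarrow{h}_{1}$ be the hidden state of the backward LSTM after reading $v_{T},\dots,v_{1}$. The symbols $T$, $d$, $W$, $b$ and $\hat{y}$ are introduced here for illustration only.

```latex
\begin{aligned}
d &= \overrightarrow{h}_{T} \oplus \overleftarrow{h}_{1} && \text{(document/paragraph representation)} \\
\hat{y} &= \operatorname{softmax}(W d + b) && \text{(predicted polarity distribution)}
\end{aligned}
```

Here $\oplus$ is the same concatenation operator used for the sentence matrix, and $\hat{y}$ has one entry per polarity class.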

IV-A3 Regularization

We employ dropout on the dense layers of our network for reducing overfitting [27]. Dropout prevents complex co-adaptations of hidden units on training data by randomly removing (i.e. dropping out) hidden units along with their connections during training. We also employ dropout in the layers of the LSTM on the linear transformation of the inputs and the recurrent state.
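The mechanism can be illustrated with a few lines of plain Python. This sketch assumes the "inverted" variant of dropout, in which the surviving units are rescaled during training so that nothing needs to change at inference time; the text above does not say which variant was used, and the function name and example values are hypothetical.

```python
import random


def dropout(values, rate, rng, training=True):
    """Zero each unit with probability `rate` and rescale the survivors.

    Assumption: inverted dropout, i.e. kept units are divided by (1 - rate)
    during training so that the expected activation is unchanged.
    """
    if not training or rate == 0.0:
        return list(values)  # inference: activations pass through untouched
    keep = 1.0 - rate
    return [v / keep if rng.random() < keep else 0.0 for v in values]


# Made-up activations of a dense layer, with the 0.4 rate used for dense layers.
hidden = [0.5, -1.2, 0.8, 2.0]
dropped = dropout(hidden, rate=0.4, rng=random.Random(42))
print(dropped)
```

A fresh random mask is drawn on every call, so each training step effectively updates a different thinned sub-network.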

IV-B Datasets

We evaluate our model on five publicly available SE benchmark datasets. Table I shows the number of samples and the class distribution of each dataset.

The mobile app reviews dataset was created as a part of a study on the release planning of mobile apps by Villarroel et al. [28]. Lin et al. randomly sampled and manually labelled 341 reviews from this dataset into positive, neutral and negative categories [5]. Two evaluators labelled the reviews, while a third resolved the conflicts. This dataset is considerably skewed with the minority (neutral) class consisting of only 25 samples.

The Jira issue comments dataset by Ortu et al. [29] has been used in various studies as one of the gold-standard datasets for sentiment analysis in software texts [5, 19]. The dataset was labelled with six emotions, of which love, joy, anger and sadness are the four most frequently expressed. As previously done by Lin et al. [5] and Jongeling et al. [18], the issue comments labelled love or joy were considered as positive training samples, and those labelled anger or sadness were considered as negative training samples. The dataset has an imbalanced class distribution with 68.7% of the issue comments belonging to the negative class.

The Gerrit code review dataset was used for training and validating the SentiCR classification tool [3]. It was annotated by three researchers, who classified review comments by following an ad-hoc approach into 3 classes: positive, neutral and negative. The former two were merged into one ’non-negative’ category to reduce the class imbalance.

The Stack Overflow Java Libraries dataset was collected by Lin et al. for the purpose of recommending libraries based on sentiments mined from Stack Overflow [5]. It contains 1,500 randomly extracted sentences that have been labelled by five evaluators. This dataset is considerably skewed considering that it is labelled into 3 classes, with around 80% of the samples belonging to the majority (neutral) class.

The Stack Overflow Sentiments dataset was created as a part of a study to create a classifier for the analysis of sentiments in developer communication channels [14]. This dataset, consisting of 4,423 questions, answers, and comments from Stack Overflow, was used for the training and validation of Senti4SD. Classes are relatively balanced in this dataset with 38% of the posts belonging to the majority (neutral) class, and 27% belonging to the minority (negative) class. This balance is the result of the authors performing sampling of posts based on the presence of affective lexicon as reflected by SentiStrength sentiment scores.

IV-C Experimental setup

The model was implemented with the Keras library (available at https://www.tensorflow.org/api_docs/python/tf/keras), which comes packaged with the core TensorFlow API. This allows us to seamlessly shift between the TensorFlow and Keras workflows based on the task at hand. We train our model by minimizing the categorical cross-entropy. Stochastic gradient descent is used for training with the Adaptive Moment Estimation (Adam) optimizer as it has been shown to work well in practice [30]. We use early stopping when the validation loss has stopped decreasing to avoid overfitting when training [31]. There are variations in the experimental setup across research questions. These variations are explained in Section V. The following six classifiers were used for comparing the performance of our model:

• Supervised: Naive Bayes (NLTK), Senti4SD, SentiCR

• Unsupervised: SentiStrength, SentiStrengthSE, VADER
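The training objective and the early-stopping rule mentioned in this subsection can be sketched in plain Python. This is an illustration only: the study relies on Keras for both, the function names are hypothetical, and the `patience` parameter is an assumption because the text does not state how many non-improving epochs were tolerated.

```python
import math


def softmax(logits):
    """Turn raw scores into a probability distribution over classes."""
    m = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def categorical_cross_entropy(one_hot, probs, eps=1e-12):
    """Negative log-probability assigned to the true class."""
    return -sum(t * math.log(max(p, eps)) for t, p in zip(one_hot, probs))


def stop_epoch(val_losses, patience=1):
    """Return the epoch at which training stops.

    Training stops once the validation loss has failed to improve on its
    best value for `patience` consecutive epochs (an assumed rule).
    """
    best = float("inf")
    waited = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best = loss
            waited = 0
        else:
            waited += 1
            if waited >= patience:
                return epoch
    return len(val_losses) - 1


probs = softmax([0.0, 0.0, 0.0])  # an untrained model: uniform over 3 classes
loss = categorical_cross_entropy([0, 1, 0], probs)
print(loss)
print(stop_epoch([0.9, 0.7, 0.6, 0.65, 0.5], patience=1))
```

A larger `patience` tolerates temporary increases in validation loss at the cost of a longer training run.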

IV-C1 Hyperparameters

We don’t perform any dataset specific tuning of hyperparameters and hence the following hyperparameters apply for all the datasets. The temporal convolution layer has a filter window of size 5 with 150 feature maps, and is activated by a rectified linear unit (ReLU). The dense layers have a dropout rate of 0.4. The layers of the LSTM have a dropout rate of 0.2 for both the inputs and the recurrent state.

IV-C2 Pre-trained word vectors

For all the datasets in this study, we use the 300-dimensional word vectors trained by Mikolov et al. on roughly 100 billion words from Google News (available at https://code.google.com/archive/p/word2vec/). We don’t fine-tune the word vectors and keep them static throughout the analysis.

IV-C3 Evaluation metrics

For each of the models considered in the analysis, we follow the standard evaluation methodology by measuring the performance with the overall accuracy, along with precision, recall and F-score for each of the classes in the labelled dataset. Classification accuracy is the ratio of the number of correct predictions to the total samples in the dataset. For a given class, precision is the ratio of true positives to the number of predicted items of that class, and recall is the ratio of true positives to all the items that belong to that class. F-score is defined as the harmonic mean of precision and recall. Perfect precision and recall result in an F-score of one, and zero precision or recall results in an F-score of zero. Precision, recall and F-score provide a complete picture of the model performance on datasets with unbalanced distribution of classes, whereas the overall accuracy allows a quick comparison of the performance of classifiers across all the classes.
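These definitions translate directly into code. The sketch below is a plain-Python illustration with made-up labels and hypothetical function names; it is not the evaluation script used in the study.

```python
def per_class_metrics(y_true, y_pred, label):
    """Precision, recall and F-score for a single class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
    predicted = sum(1 for p in y_pred if p == label)  # items predicted as `label`
    actual = sum(1 for t in y_true if t == label)     # items that truly are `label`
    precision = tp / predicted if predicted else 0.0
    recall = tp / actual if actual else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    f_score = 2 * precision * recall / (precision + recall)  # harmonic mean
    return precision, recall, f_score


def accuracy(y_true, y_pred):
    """Share of samples whose predicted label matches the true label."""
    return sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)


# Made-up gold labels and predictions for six samples.
y_true = ["neg", "neg", "neg", "pos", "pos", "neu"]
y_pred = ["neg", "neg", "pos", "pos", "neu", "neu"]

for label in ("neg", "pos", "neu"):
    print(label, per_class_metrics(y_true, y_pred, label))
print("accuracy", accuracy(y_true, y_pred))
```

In this toy example the negative class has perfect precision but misses one of its three items, which lowers its recall and therefore its F-score, even though the overall accuracy alone would not reveal which class is affected.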

V Results

V-A Model performance evaluation

RQ1: How does a unified hierarchical model perform when compared with other sentiment analysis tools on SE datasets?

To answer the first research question, we evaluate each classifier by performing a stratified 10-fold cross validation on the five datasets described in Section IV-B. In stratified k-fold cross validation, each fold contains approximately the same proportion of classes as the dataset, thereby making each fold a better representation of the dataset. We fixed the state of all random elements during the cross validation process so that each fold is identical for all the classifiers. The classification algorithm used for Senti4SD was L2 regularized logistic regression as it provided the best average performance across our datasets. For SentiCR, gradient boosted trees were used for classification based on the recommendation provided by the authors [3]. The accuracy, precision, recall and F-measure for each classifier are reported in Table II.
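The idea behind stratification with a fixed random state can be sketched as follows. This is a simplified plain-Python illustration with a hypothetical function name and toy labels; the study does not say which implementation was used, and library implementations balance fold sizes more carefully.

```python
import random
from collections import defaultdict


def stratified_folds(labels, k, seed=0):
    """Assign sample indices to k folds while preserving class proportions.

    Indices of each class are shuffled with a fixed seed and then dealt out
    to the folds in round-robin order, so every classifier can be evaluated
    on exactly the same folds.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    folds = [[] for _ in range(k)]
    for label in sorted(by_class):  # sorted for a deterministic class order
        indices = by_class[label]
        rng.shuffle(indices)
        for position, idx in enumerate(indices):
            folds[position % k].append(idx)
    return folds


# Toy dataset: 70 negative and 30 positive samples.
labels = ["neg"] * 70 + ["pos"] * 30
folds = stratified_folds(labels, k=10)
print([len(fold) for fold in folds])
```

Each fold in turn serves as the test set while the remaining nine are used for training, and reusing the same seed reproduces the same folds for every classifier being compared.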

For every dataset, our model offers the best accuracy, along with consistently high precision and recall over all the polarity classes except the underrepresented classes of Stack Overflow Java Libraries and App Reviews datasets. Both of these datasets have a sharply skewed distribution of classes, and the relatively small number of samples is not sufficient for training any of the supervised classifiers. Even the unsupervised classifiers don’t offer satisfactory performance in this regard.

V-B Model scalability with dataset size

RQ2: How does a unified hierarchical model scale with the amount of training data available when compared with other sentiment analysis tools?

The main goal of this research question was to observe the change in validation accuracy with the amount of data available to our model, thereby identifying the amount of data required to make the most of our model. We also wanted to identify the lowest amount of labelling to be done by researchers in order for a supervised classifier to exceed the performance of a state-of-the-art unsupervised classifier. Answering this question allows us to provide a ballpark dataset size below which it would be better to use an unsupervised algorithm rather than spend time on labelling. We also compare how the performance of other classifiers evolves with the amount of data available.

To answer this research question, we performed a train-test split of 70-30 percent. We then resampled the elements of the train dataset with replacement with increasing sizes starting at 20% of the train dataset size. We fixed the state of all random elements during the train-test split and resampling process in order to make a fair comparison across classifiers. We were unable to train SentiCR on the resampled splits of the App Reviews dataset due to the small size of the dataset. We present the results of our analysis in Figure 2. We denote the performance of VADER, an unsupervised classifier, on the test split by a horizontal line.
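The split-and-resample procedure can be sketched in plain Python. The function names, the toy dataset size and the specific list of fractions are assumptions made for illustration; the text only states that the sizes increase from 20% of the train split, and the rounding used to compute the split sizes is not specified.

```python
import random


def train_test_split(n_samples, train_fraction=0.7, seed=0):
    """Shuffle the sample indices with a fixed seed and cut them 70-30."""
    rng = random.Random(seed)
    indices = list(range(n_samples))
    rng.shuffle(indices)
    cut = round(n_samples * train_fraction)
    return indices[:cut], indices[cut:]


def resample_with_replacement(train_indices, fraction, seed=0):
    """Draw round(fraction * len(train)) indices, allowing repeats."""
    rng = random.Random(seed)
    size = round(len(train_indices) * fraction)
    return [rng.choice(train_indices) for _ in range(size)]


# Toy dataset of 100 samples; the test split stays fixed for every classifier.
train, test = train_test_split(100)

# Assumed schedule of training-set sizes, starting at 20% of the train split.
for fraction in (0.2, 0.4, 0.6, 0.8, 1.0):
    subset = resample_with_replacement(train, fraction)
    print(fraction, len(subset), len(set(subset)))
```

Because sampling is done with replacement, a resampled set can contain the same training item more than once, so the number of distinct items is usually smaller than the nominal size.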

The results from Figure 2 show that, depending on the dataset, having even a small number of labels would help train a supervised classifier that performs better than an unsupervised classifier. Naive Bayes, the baseline classifier, is the only supervised classifier to perform worse than VADER. This makes a case for labelling a small balanced dataset and trying out supervised classification algorithms before resorting to unsupervised tools. Depending on the results, more labelling can be done to improve the results further.

Even though the accuracy rises with the number of training samples available, the increase is very gradual. In a bi-polar dataset that is easy to classify such as Jira, quintupling the size of the training data from 130 to 649 leads to an increase in accuracy of only 2%. Even though most supervised tools struggle to outperform VADER on the App Reviews dataset due to its small size, our hierarchical model is more accurate than VADER after training on just 48 samples.

VI Discussion

From the above results, we can see that deep learning models have a lot to offer when it comes to building customized sentiment analysis tools for the SE community. We also make a case for why supervised techniques are better for sentiment analysis tasks despite requiring labelled datasets. Using state-of-the-art deep learning techniques, we were able to build a classifier that not only performs better than the existing tools, but also scales better when there aren’t a lot of training samples. Further, our model opens up the possibility of transfer learning by using our trained model for feature extraction.

One of the motivations behind developing this model was that the sentiment analysis tools frequently used by the SE community are not optimized for long texts. This often required researchers to label individual lines rather than a block of text like an entire comment or a bug report. Some of the datasets used in this study, such as App Reviews, Gerrit and SO Sentiment, contain labels for multiple sentences, and our model outperforms all of the existing tools on these datasets. However, it is not possible to conclusively measure the performance improvement obtained when using our classifier on long texts without performing an ablation study.

One of the concerns raised by Lin et al. on re-training existing models on SE datasets is that there isn’t enough improvement in accuracy to justify the expensive and time-consuming training process for each dataset [5]. While validating the performance of other supervised classifiers, we also noticed the considerably longer time required to train and test Senti4SD. Table II shows that our tool makes the most accurate predictions on all five datasets, followed by Senti4SD. We measured the time taken for training and inference during one fold of a 10-fold cross validation to compare the time performance of both of these models. Table III shows the time taken on a PC equipped with an Intel Core i7-6770HQ and 16 GB of RAM. These values take into account the time taken for feature extraction and parameter tuning for Senti4SD. It can be seen that our model is on average twice as fast for training and 200 times faster for inference compared to Senti4SD, despite providing equivalent or better predictions on all of these datasets.
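A timing measurement of this kind can be sketched as below. This is an illustrative harness under stated assumptions, not the benchmarking code behind Table III: the fold assignment (every k-th sample), the `fit`/`predict` callables and the majority-class placeholder classifier are all hypothetical stand-ins for the real models.

```python
import time
from collections import Counter


def kfold_split(n_samples, k=10, fold=0):
    """Return (train_idx, test_idx) for one fold; every k-th sample is held out."""
    test_idx = list(range(fold, n_samples, k))
    held_out = set(test_idx)
    train_idx = [i for i in range(n_samples) if i not in held_out]
    return train_idx, test_idx


def time_one_fold(fit, predict, texts, labels, k=10, fold=0):
    """Time training and inference separately on a single fold."""
    train_idx, test_idx = kfold_split(len(texts), k, fold)

    start = time.perf_counter()
    model = fit([texts[i] for i in train_idx], [labels[i] for i in train_idx])
    train_seconds = time.perf_counter() - start

    start = time.perf_counter()
    predictions = predict(model, [texts[i] for i in test_idx])
    inference_seconds = time.perf_counter() - start

    return train_seconds, inference_seconds, predictions


# Placeholder classifier: always predicts the most frequent training label.
def fit_majority(texts, labels):
    return Counter(labels).most_common(1)[0][0]


def predict_majority(model, texts):
    return [model for _ in texts]
```

For a fair comparison, any feature extraction or parameter tuning a tool needs should happen inside its `fit` callable, so that it is counted in the training time.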

Since we employ validation-based early stopping, training is stopped when there isn’t a steady decline in the validation loss. Since the network trains until it is stopped, the time taken for training is determined not solely by the size of the training set, but also by factors like the number of epochs of training that can be run before the network begins to overfit. Despite the Gerrit dataset being smaller than the Stack Overflow Sentiments dataset (around one-third the size), our model takes longer to train on the Gerrit dataset. For a given model, the time taken for inference, however, is largely dependent on the size of the test set. It should be noted that our model scales well on GPUs as it is based on the TensorFlow deep learning framework. With a mid-range GPU such as the Nvidia GTX 1050, we observed that our model ran more than 6 times faster than the numbers reported in Table III.
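The effect of early stopping on training time can be illustrated with a small sketch. The exact stopping criterion used in the study is not specified here, so the code assumes a common patience-based rule: stop once the validation loss has failed to improve for a fixed number of consecutive epochs. The function name, the `patience` parameter and the loss values in the usage note are hypothetical.

```python
def epochs_until_early_stop(validation_losses, patience=3):
    """Return (epochs_run, best_epoch) under a patience-based stopping rule.

    validation_losses holds one validation loss per epoch, in order.
    Training stops after `patience` consecutive epochs without a new
    best loss (an assumed criterion, not necessarily the paper's).
    """
    best_loss = float("inf")
    best_epoch = -1
    epochs_without_improvement = 0
    for epoch, loss in enumerate(validation_losses):
        if loss < best_loss:
            best_loss = loss
            best_epoch = epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch + 1, best_epoch
    return len(validation_losses), best_epoch
```

With `patience=2`, a loss curve that bottoms out early, such as `[0.9, 0.7, 0.65, 0.66, 0.67, 0.70]`, stops after 5 epochs, whereas a curve that keeps improving, such as `[0.9, 0.8, 0.7, 0.6, 0.5, 0.45, 0.44, 0.46, 0.47]`, runs for 9. The number of epochs, and hence the training time, depends on the shape of the loss curve and not only on the dataset size.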

VII Threats to Validity

VII-A Internal validity

Wherever we could, we used hyperparameters and dataset pre-processing methods recommended by the authors while training the classifiers from other studies. However, since we did not conduct an exhaustive grid search for optimizing hyperparameters on these classifiers due to time and resource constraints, we could be reporting sub-optimal performance metrics [32].

VII-B External validity

Sentiment analysis datasets in SE can be quite diverse based on the application. This makes it hard to make a model that generalizes universally across datasets even within the SE domain. In order to ensure that our model does not overfit, we employ dropout on the fully-connected and recurrent layers, along with validation-based early stopping. However, more datasets will be required to ascertain how well our model generalizes.

It is hard to explain the positive or negative results associated with deep learning, as the cause of a result could be impossible to pinpoint. The explanations given regarding the choice of hyperparameters or the network architecture are based purely on our intuition and the empirical results from existing literature. This also makes it hard to comment on whether the hyperparameters and the network architecture chosen based on the datasets in this study would generalize to other datasets in the future. In order to minimize this threat, we chose five sentiment analysis datasets from diverse SE domains.

VIII Conclusion

In this study, we described a novel unified hierarchical model benchmarked on five SE sentiment analysis datasets. Our results show that this model provides more accurate predictions than the existing state-of-the-art supervised and unsupervised classifiers, despite the lack of extensive hyperparameter tuning. Further, it scales better to small datasets compared to other supervised classifiers, and takes less time to train. We also discuss why it is better to label a small sample of the dataset and train a supervised classifier rather than using an unsupervised classifier.

We initially set out to address the issues raised by Lin et al. [5] on opinion mining in SE research and show that by either finding appropriate existing models and retraining them on the new dataset or developing models specifically for the SE application, it is possible to achieve satisfactory results that are usable for research purposes. When we identified existing datasets and validated the performance of existing classifiers, we noticed that Senti4SD and SentiCR provide usable results on most of the datasets considered, with minor exceptions. All of the supervised classifiers considered in this study perform poorly on the minority classes of the App Reviews and SO Java Lib. datasets due to the lack of enough training samples for these classes. While Lin et al. discuss that the nature of the data in certain SE applications (such as issue trackers) might make sentiment analysis easier, they don’t comprehensively consider a wide range of SE datasets. Of the three datasets they look at (App Reviews, Jira, and SO Java Lib.), two have a skewed class distribution. Further, the authors didn’t consider supervised tools built specifically for SE such as Senti4SD and SentiCR. In this paper, we make the case that opinion mining is feasible for SE research, with the caveat that existing models must at least be re-trained on the appropriate domain data, and that in some cases models must be built and hyperparameters tweaked based on the nature of the data.

The primary contributions of this study are:

1. A novel supervised sentiment classification model that provides state of the art performance on a diverse range of SE datasets. The classifier and the datasets used in this study are publicly available at https://github.com/achyudhk/SentiGH.

2. A comprehensive comparative analysis of existing sentiment analysis tools and how they scale with the amount of available training data.

Bibliography

[1] E. Guzman and W. Maalej, “How do users like this feature? a fine grained sentiment analysis of app reviews,” in Requirements Engineering Conference (RE), 2014 IEEE 22nd International. IEEE, 2014, pp. 153–162.
[2] M. Ortu, A. Murgia, G. Destefanis, P. Tourani, R. Tonelli, M. Marchesi, and B. Adams, “The emotional side of software developers in jira,” in Proceedings of the 13th International Conference on Mining Software Repositories. ACM, 2016, pp. 480–483.
[3] T. Ahmed, A. Bosu, A. Iqbal, and S. Rahimi, “SentiCR: A Customized Sentiment Analysis Tool for Code Review Interactions,” in 32nd IEEE/ACM International Conference on Automated Software Engineering (NIER track), ser. ASE ’17, 2017.
[4] D. Pletea, B. Vasilescu, and A. Serebrenik, “Security and emotion: sentiment analysis of security discussions on github,” in Proceedings of the 11th working conference on mining software repositories. ACM, 2014, pp. 348–351.
[5] B. Lin, F. Zampetti, G. Bavota, M. Di Penta, M. Lanza, and R. Oliveto, “Sentiment analysis for software engineering: How far can we go?” in IEEE/ACM 40th International Conference on Software Engineering (ICSE), 2018.
[6] M. Thelwall, K. Buckley, G. Paltoglou, D. Cai, and A. Kappas, “Sentiment strength detection in short informal text,” J. Am. Soc. Inf. Sci. Technol., vol. 61, no. 12, pp. 2544–2558, Dec. 2010. [Online]. Available: http://dx.doi.org/10.1002/asi.v61:12
[7] M. R. Islam and M. F. Zibran, “Leveraging automated sentiment analysis in software engineering,” in Proceedings of the 14th International Conference on Mining Software Repositories. IEEE Press, 2017, pp. 203–214.
[8] Y. Kim, “Convolutional neural networks for sentence classification,” in Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2014, pp. 1746–1751.