Easy over Hard: A Case Study on Deep Learning

Wei Fu; Tim Menzies

arXiv:1703.00133·cs.SE·June 27, 2017

TL;DR

This paper compares deep learning with simpler optimization methods, demonstrating that in some cases, traditional techniques can achieve similar results much faster, urging caution in adopting complex models without baseline comparisons.

Contribution

It shows that simple optimization methods like DE can outperform deep learning in training time while maintaining comparable accuracy, challenging the assumption that more complex models are always better.

Findings

01

DE achieved similar results to deep learning in 10 minutes

02

Deep learning took 14 hours for the same task

03

Simple optimizers can be effective alternatives to deep learning

Abstract

While deep learning is an exciting new technique, the benefits of this method need to be assessed with respect to its computational cost. This is particularly important for deep learning since these learners need hours (to weeks) to train the model. Such long training time limits the ability of (a) a researcher to test the stability of their conclusion via repeated runs with different random seeds; and (b) other researchers to repeat, improve, or even refute that original work. For example, recently, deep learning was used to find which questions in the Stack Overflow programmer discussion forum can be linked together. That deep learning system took 14 hours to execute. We show here that applying a very simple optimizer called DE to fine-tune SVM can achieve similar (and sometimes better) results. The DE approach terminated in 10 minutes; i.e. 84 times faster than deep…

Tables

Table 1. Classes of Knowledge Unit Pairs.

Class	Description
Duplicate	These two knowledge units are addressing the same question.
Direct link	One knowledge unit can help to answer the question in the other knowledge unit.
Indirect link	One knowledge unit provides similar information to solve the question in the other knowledge unit, but not a direct answer.
Isolated	These two knowledge units discuss unrelated questions.

Table 2. List of Parameters Tuned by This Paper.

Parameters	Default	Xu et al.	Tuning Range	Description
C	1.0	unknown	[1, 50]	Penalty parameter C of the error term.
kernel	‘rbf’	‘rbf’	[‘linear’, ‘poly’, ‘rbf’, ‘sigmoid’]	Specify the kernel type to be used in the algorithms.
gamma	1/n_features	1/200	[0, 1]	Kernel coefficient for ‘rbf’, ‘poly’ and ‘sigmoid’.
coef0	0	unknown	[0, 1]	Independent term in kernel function. It is only used in ‘poly’ and ‘sigmoid’.

Table 3. Confusion Matrix.

		Classified as
		$C_{1}$	$C_{2}$	$C_{3}$	$C_{4}$
Actual	$C_{1}$	$c_{11}$	$c_{12}$	$c_{13}$	$c_{14}$
	$C_{2}$	$c_{21}$	$c_{22}$	$c_{23}$	$c_{24}$
	$C_{3}$	$c_{31}$	$c_{32}$	$c_{33}$	$c_{34}$
	$C_{4}$	$c_{41}$	$c_{42}$	$c_{43}$	$c_{44}$

Table 4. Comparison of Our Baseline Method with XU’s. The Best Scores are Marked in Bold.

Metrics	Methods	Duplicate	Direct Link	Indirect Link	Isolated	Overall
Precision	Our SVM	0.724	0.514	0.779	0.601	0.655
Precision	XU’s SVM	0.611	0.560	0.787	0.676	0.659
Recall	Our SVM	0.525	0.492	0.970	0.645	0.658
Recall	XU’s SVM	0.725	0.433	0.980	0.538	0.669
F1-score	Our SVM	0.609	0.503	0.864	0.622	0.650
F1-score	XU’s SVM	0.663	0.488	0.873	0.600	0.656
Accuracy	Our SVM	0.525	0.493	0.970	0.645	0.658
Accuracy	XU’s SVM	-	-	-	-	0.669

Table 5. Comparison of Tuned SVM with XU’s CNN Method. The Best Scores are Marked in Bold.

Metrics	Methods	Duplicate	Direct Link	Indirect Link	Isolated	Overall
Precision	XU’s SVM	0.611	0.560	0.787	0.676	0.658
Precision	XU’s CNN	0.898	0.758	0.840	0.890	0.847
Precision	Tuned SVM	0.885	0.851	0.944	0.903	0.896
Recall	XU’s SVM	0.725	0.433	0.980	0.538	0.669
Recall	XU’s CNN	0.898	0.903	0.773	0.793	0.842
Recall	Tuned SVM	0.860	0.828	0.995	0.905	0.897
F1-score	XU’s SVM	0.663	0.488	0.873	0.600	0.656
F1-score	XU’s CNN	0.898	0.824	0.805	0.849	0.841
F1-score	Tuned SVM	0.878	0.841	0.969	0.909	0.899

Table 6. Comparison of Experimental Environment.

Methods	OS	CPU	RAM
Tuning SVM	MacOS 10.12	Intel Core i5 2.7 GHz	8 GB
CNN	Windows 7	Intel Core i7 2.5 GHz	16 GB

Equations

\frac{1}{n}\sum_{i=1}^{n}\sum_{-c \leq j \leq c,\, j \neq 0} \log p(w_{i+j} \mid w_{i})

p(w_{O} \mid w_{I}) = \frac{\exp(v_{w_{O}}^{T} v_{w_{I}})}{\sum_{w=1}^{|W|} \exp(v_{w}^{T} v_{w_{I}})}

accuracy = \frac{\sum_{i} c_{ii}}{\sum_{i}\sum_{j} c_{ij}}

\begin{array}{ll}prec_{j}&=precision_{j}=\frac{c_{jj}}{\sum_{i}c_{ij}}\\ pd_{j}&=recall_{j}=\frac{c_{jj}}{\sum_{i}c_{ji}}\\ F1_{j}&=2*pd_{j}*prec_{j}/(pd_{j}+prec_{j})\end{array}
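The accuracy, precision, recall and F1 definitions above can be checked on a small example. The following is a minimal sketch in plain Python, assuming the layout of Table 3 (rows are actual classes, columns are predicted classes); the function name `per_class_metrics` and the confusion matrix values are made up for illustration and are not results from the paper.

```python
from typing import Dict, List


def per_class_metrics(cm: List[List[int]]) -> Dict[str, object]:
    """Compute accuracy and per-class precision/recall/F1.

    cm[i][j] is the number of items whose actual class is i and that
    were classified as class j (same layout as Table 3).
    """
    k = len(cm)
    total = sum(sum(row) for row in cm)
    accuracy = sum(cm[i][i] for i in range(k)) / total

    precision, recall, f1 = [], [], []
    for j in range(k):
        classified_as_j = sum(cm[i][j] for i in range(k))  # column sum
        actually_j = sum(cm[j])                            # row sum
        p = cm[j][j] / classified_as_j if classified_as_j else 0.0
        r = cm[j][j] / actually_j if actually_j else 0.0
        f = 2 * p * r / (p + r) if (p + r) else 0.0
        precision.append(p)
        recall.append(r)
        f1.append(f)

    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Hypothetical 4-class confusion matrix (illustrative numbers only).
example_cm = [
    [5, 1, 0, 0],
    [1, 4, 1, 0],
    [0, 0, 6, 0],
    [0, 1, 0, 5],
]
metrics = per_class_metrics(example_cm)
```

Note that precision divides by a column sum (everything classified as class j), while recall divides by a row sum (everything that actually is class j).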

Code & Models

Repositories

WeiFoo/EasyOverHard (official)

Taxonomy

Methods: Support Vector Machine

Full text

Easy over Hard: A Case Study on Deep Learning

Wei Fu, Tim Menzies

Com.Sci., NC State, USA

Abstract.

While deep learning is an exciting new technique, the benefits of this method need to be assessed with respect to its computational cost. This is particularly important for deep learning since these learners need hours (to weeks) to train the model. Such long training time limits the ability of (a) a researcher to test the stability of their conclusion via repeated runs with different random seeds; and (b) other researchers to repeat, improve, or even refute that original work.

For example, recently, deep learning was used to find which questions in the Stack Overflow programmer discussion forum can be linked together. That deep learning system took 14 hours to execute. We show here that applying a very simple optimizer called DE to fine-tune SVM can achieve similar (and sometimes better) results. The DE approach terminated in 10 minutes; i.e. 84 times faster than the deep learning method.

We offer these results as a cautionary tale to the software analytics community and suggest that not every new innovation should be applied without critical analysis. If researchers deploy some new and expensive process, that work should be baselined against some simpler and faster alternatives.

Keywords: Search based software engineering, software analytics, parameter tuning, data analytics for software engineering, deep learning, SVM, differential evolution

Conference: 2017 11th Joint Meeting of the European Software Engineering Conference and the ACM SIGSOFT Symposium on the Foundations of Software Engineering; September 4-8, 2017; Paderborn, Germany. DOI: 10.1145/3106237.3106256. ISBN: 978-1-4503-5105-8/17/09.

1. Introduction

This paper extends a prior result from ASE’16 by Xu et al. (Xu et al., 2016) (hereafter, XU). XU described a method to explore large programmer discussion forums, then uncover related, but separate, entries. This is an important problem. Modern SE is evolving so fast that these forums contain more relevant and recent comments on current technologies than any textbook or research article.

In their work, XU predicted whether two questions posted on Stack Overflow are semantically linkable. Specifically, XU define a question along with its entire set of answers posted on Stack Overflow as a knowledge unit (KU). If two knowledge units are semantically related, they are considered as linkable knowledge units.

In their paper, they used a convolutional neural network (CNN), a kind of deep learning method (LeCun et al., 2015), to predict whether two KUs are linkable. Such CNNs are highly computationally expensive, often requiring networks composed of 10 to 20 layers, hundreds of millions of weights and billions of connections between units (LeCun et al., 2015). Even with advanced hardware and algorithm parallelization, training deep learning models still requires hours to weeks. For example:

• XU report that their analysis required 14 hours of CPU time.

• Le (Le, 2013) used a cluster with 1,000 machines (16,000 cores) for three days to train a deep learner.

This paper debates what methods should be recommended to those wishing to repeat the analysis of XU. We focus on whether using simple and faster methods can achieve the results that are currently achievable by the state-of-art deep learning method. Specifically, we repeat XU’s study using DE (differential evolution (Storn and Price, 1997)), which serves as a hyper-parameter optimizer to tune XU’s baseline method, which is a conventional machine learning algorithm, support vector machine (SVM). Our study asks:

RQ1: Can we reproduce XU’s baseline results (Word Embedding + SVM)? Using such a baseline, we can compare our methods to those of XU.

RQ2: Can DE tune a standard learner such that it outperforms XU’s deep learning method? We apply differential evolution to tune SVM. In terms of precision, recall and F1-score, we observe that the tuned SVM method outperforms CNN in most evaluation scores.

RQ3: Is tuning SVM with DE faster than XU’s deep learning method? Our DE method is 84 times faster than CNN.
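The factor of 84 follows directly from the two reported runtimes (14 hours for the CNN, 10 minutes for DE). As a worked check, taking the reported wall-clock figures at face value and ignoring the hardware differences listed in Table 6:

```latex
\text{speedup}
  = \frac{14\ \text{hours} \times 60\ \text{minutes/hour}}{10\ \text{minutes}}
  = \frac{840}{10}
  = 84
```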

We offer these results as a cautionary tale to the software analytics community. While deep learning is an exciting new technique, the benefits of this method need to be carefully assessed with respect to its computational cost. More generally, if researchers deploy some new and expensive process (like deep learning), that work should be baselined against some simpler and faster alternatives.

The rest of this paper is organized as follows. Section 2 describes the background and related work on deep learning and parameter tuning in SE. Section 3 explains the case study problem and the proposed tuning method investigated in this study, then Section 4 describes the experimental settings of our study, including research questions, data sets, evaluation measures and experimental design. Section 5 presents the results. Section 6 discusses implications from the results and the threats to the validity of our study. Section 7 concludes the paper and discusses the future work.

Before beginning, we digress to make two points. Firstly, just because “DE + SVM” beats deep learning in this application, this does not mean DE is always the superior method for all other software analytics applications. No learner works best over all problems (Wolpert, 1996); the trick is to try several approaches and select the one that works best on the local data. Given the low computational cost of DE (10 minutes vs 14 hours), DEs are an obvious and low-cost candidate for exploring such alternatives.

Secondly, to enable other researchers to repeat, improve, or refute our results, all our scripts and data are freely available online on GitHub (https://github.com/WeiFoo/EasyOverHard).

2. Background and Related Work

2.1. Why Explore Faster Software Analytics?

This section argues that avoiding slow methods for software analytics is an open and urgent issue.

Researchers and industrial practitioners now routinely make extensive use of software analytics to discover (e.g.) how long it will take to integrate the new code (Czerwonka et al., 2011), where bugs are most likely to occur (Ostrand et al., 2004), who should fix the bug (Anvik et al., 2006), or how long it will take to develop their code (Kocaguneli et al., 2012a; Kocaguneli et al., 2012b; Molokken and Jorgensen, 2003). Large organizations like Microsoft routinely practice data-driven policy development where organizational policies are learned from an extensive analysis of large data sets collected from developers (Begel and Zimmermann, 2014; Theisen et al., 2015).

But the more complex the method, the harder it is to apply the analysis. Fisher et al. (Fisher et al., 2012) characterize software analytics as a workflow that distills large quantities of low-value data down to smaller sets of higher-value data. Due to the complexities and computational cost of SE analytics, “the luxuries of interactivity, direct manipulation, and fast system response are gone” (Fisher et al., 2012). They characterize modern cloud-based analytics as a throwback to the batch-processing mainframes of the 1960s, where jobs are submitted and then analysts wait, wait, and wait for results with “little insight into what is really going on behind the scenes, how long it will take, or how much it is going to cost” (Fisher et al., 2012). Fisher et al. (Fisher et al., 2012) document the issues seen by 16 industrial data scientists, one of whom remarks:

“Fast iteration is key, but incompatible with the jobs are submitted and processed in the cloud. It is frustrating to wait for hours, only to realize you need a slight tweak to your feature set”.

Methods for improving the quality of modern software analytics have made this issue even more serious. There has been continuous development of new feature selection (Hall and Holmes, 2003) and feature discovery (Jiang et al., 2013) techniques for software analytics, with the most recent ones focused on deep learning methods. These are all exciting innovations with the potential to dramatically improve the quality of our software analytics tools. Yet these are all CPU/GPU-intensive methods. For instance:

• Learning control settings for learners can take days to weeks to years of CPU time (Fu et al., 2016b; Tantithamthavorn et al., 2016; Wang et al., 2013b).

• Lam et al. needed weeks of CPU time to combine deep learning and text mining to localize buggy files from bug reports (Lam et al., 2015).

• Gu et al. spent 240 hours of GPU time to train a deep learning based method to generate API usage sequences for a given natural language query (Gu et al., 2016).

Note that the above problem is not solvable by waiting for faster CPUs/GPUs. We can no longer rely on Moore’s Law (Moore et al., 1998) to double our computational power every 18 months. Power consumption and heat dissipation issues effectively block further exponential increases to CPU clock frequencies (Kumar et al., 2003). Cloud computing environments are extensively monetized, so the total financial cost of training models can be prohibitive, particularly for long-running tasks. For example, it would take 15 years of CPU time to learn the tuning parameters of the software clone detectors proposed in (Wang et al., 2013b). Much of that CPU time can be saved if there is a faster way.

2.2. What is Deep Learning?

Deep learning is a branch of machine learning built on multiple layers of neural networks that attempt to model high level abstractions in data. According to LeCun et al. (LeCun et al., 2015), deep learning methods are representation-learning methods with multiple levels of representation, obtained by composing simple but non-linear modules that each transforms the representation at one level (starting with the raw input) into a representation at a higher, slightly more abstract level. Compared to the conventional machine learning algorithms, deep learning methods are very good at exploring high-dimensional data.

By utilizing extensive computational power, deep learning has proven to be a very powerful method in many fields (LeCun et al., 2015), such as computer vision and natural language processing (Krizhevsky et al., 2012; Mikolov et al., 2013; Sutskever et al., 2014; Schmidhuber, 2015; Arel et al., 2010). In 2012, a convolutional neural network won the ImageNet competition (Krizhevsky et al., 2012), achieving half the error rate of the best competing approaches. After that, CNNs became the dominant approach for almost all recognition and detection tasks in the computer vision community. CNNs are designed to process data in the form of multiple arrays, e.g., image data. According to LeCun et al. (LeCun et al., 2015), recent CNNs are usually huge networks composed of 10 to 20 layers, hundreds of millions of weights and billions of connections between units. Even with advanced hardware and algorithm parallelization, training such a model still needs a few hours (LeCun et al., 2015). For tasks that deal with sequential data, like text and speech, recurrent neural networks (RNNs) have been shown to work well. RNNs are good at predicting the next character or word given the context. For example, Graves et al. (Graves et al., 2013) proposed using long short-term memory (LSTM) RNNs to perform speech recognition, achieving a test set error of $17.7\%$ on the benchmark testing data. Sutskever et al. (Sutskever et al., 2014) used two multilayered LSTM RNNs to translate sentences from English to French.

2.3. Deep Learning in SE

We study deep learning since, recently, it has attracted much attention from researchers and practitioners in the software community (Wang et al., 2016; Gu et al., 2016; Xu et al., 2016; White et al., 2016; White et al., 2015; Lam et al., 2015; Choetkiertikul et al., 2016; Yuan et al., 2014; Mou et al., 2016). These researchers applied deep learning techniques to solve various problems, including defect prediction, bug localization, code clone detection, malware detection, API recommendation, effort estimation and linkable knowledge prediction.

We find that this work can be divided into two categories:

• Treat deep learning as a feature extractor, and then apply other machine learning algorithms to do further work (Lam et al., 2015; Wang et al., 2016; Choetkiertikul et al., 2016).

• Solve problems directly with deep learning (Gu et al., 2016; Xu et al., 2016; White et al., 2016; White et al., 2015; Yuan et al., 2014; Mou et al., 2016).

2.3.1. Deep Learning as Pre-Processor

Lam et al. (Lam et al., 2015) proposed an approach that applies a deep neural network in combination with rVSM to automatically locate potentially buggy files for a given bug report. Comparing it to baseline methods (Naive Bayes (Kim et al., 2013), learn-to-rank (Ye et al., 2014), BugLocator (Zhou et al., 2012)), Lam et al. reported 16.2-46.4%, 8-20.8% and 2.7-20.7% higher top-1 accuracy than the baseline methods, respectively (Lam et al., 2015). The training time for the deep neural network was reported as 70 to 122 minutes for 6 projects on a computer with a 32-core 2.00 GHz CPU and 126 GB of memory. However, the runtime of the baseline methods was not reported.

Wang et al. (Wang et al., 2016) applied a deep belief network to automatically learn semantic features from token vectors extracted from the studied software program. After applying the deep belief network to generate features from software code, Naive Bayes, ADTree and Logistic Regression methods are used to evaluate the effectiveness of feature generation, which is compared to the same learners using traditional static code features (e.g. McCabe metrics, Halstead’s effort metrics and CK object-oriented code metrics (Kafura and Reddy, 1987; Chidamber and Kemerer, 1994; McCabe, 1976; Halstead, 1977)). In terms of runtime, Wang et al. only report the time for generating semantic features with the deep belief network, which ranged from 8 seconds to 32 seconds (Wang et al., 2016). However, the time for training and tuning the deep belief network is missing. Furthermore, to compare the time cost of generating features with a deep belief network against that of extracting traditional static code features, it would be preferable to include all the time spent on feature extraction, including parsing source code, token generation and token mapping, for both the deep belief network and the traditional methods (i.e., an end-to-end comparison).

Choetkiertikul et al. (Choetkiertikul et al., 2016) proposed applying deep learning techniques to solve effort estimation problems at the user story level. Specifically, Choetkiertikul et al. (Choetkiertikul et al., 2016) proposed leveraging long short-term memory (LSTM) to learn feature vectors from the title, description and comments associated with an issue report; after that, regular machine learning techniques, like CART, Random Forests, Linear Regression and Case-Based Reasoning, are applied to build the effort estimation models. Experimental results show that LSTM yields a significant improvement over the bag-of-words baseline method. However, no information regarding runtime or experimental hardware is reported for either method, and the cost of this deep learning method is not discussed at all.

2.3.2. Deep Learning as a Problem Solver

White et al. (White et al., 2015; White et al., 2016) applied recurrent neural networks, a type of deep learning technique, to address code clone detection and code suggestion. They reported that the average training time for 8 projects ranged from 34 seconds to 2977 seconds per epoch on a computer with two 3.3 GHz CPUs, and that each project required at least 30 epochs (White et al., 2016). Specifically, for the JDK project in their experiment, it would take 25 hours on the same computer to train the models before getting any predictions. For code suggestion, the authors did not report any time cost (White et al., 2015).
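The 25-hour figure is consistent with the per-epoch numbers quoted above, under the assumption (not stated explicitly here) that JDK is the project at the upper end of the reported range and that it needs the minimum of 30 epochs:

```latex
2977\ \tfrac{\text{s}}{\text{epoch}} \times 30\ \text{epochs}
  = 89{,}310\ \text{s}
  = \frac{89{,}310}{3600}\ \text{hours}
  \approx 24.8\ \text{hours}
```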

Gu et al. (Gu et al., 2016) proposed a recurrent neural network (RNN) based method, DEEPAPI, to generate API usage sequences for a given natural language query. Compared with the baseline methods SWIM (Raghothaman et al., 2016) and Lucene + UP-Miner (Wang et al., 2013a), DEEPAPI improved the performance significantly. However, that improvement came at a cost: the model was trained on an Nvidia K20 GPU for 240 hours (Gu et al., 2016).

XU (Xu et al., 2016) utilized a neural language model and a convolutional neural network (CNN) to learn word-level and document-level features to predict semantically linkable knowledge units on Stack Overflow. In terms of performance metrics such as precision, recall and F1-score, the CNN method scored much better than the baseline method, a support vector machine (SVM). However, once again, that performance improvement came at a cost: their deep learner required 14 hours to train the CNN model on a 2.5 GHz PC with 16 GB RAM (Xu et al., 2016).

Yuan et al. (Yuan et al., 2014) proposed a deep belief network based method for malware detection on Android apps. By training and testing the deep learning model with 200 features extracted by static and dynamic analysis from 500 sampled Android apps, they got $96.5\%$ accuracy for the deep learning method and $80\%$ for one baseline method, SVM (Yuan et al., 2014). However, they did not report any runtime comparison between the deep learning method and other classic machine learning methods.

Mou et al. (Mou et al., 2016) proposed a tree-based convolutional neural network for programming language processing, in which a convolution kernel is designed over programs’ abstract syntax trees to capture structural information. Results show that their method achieved $94\%$ accuracy on a program classification problem, which is better than the $88.2\%$ of the baseline method, RBF SVM (Mou et al., 2016). However, Mou et al. (Mou et al., 2016) did not discuss any runtime comparison between the proposed method and baseline methods.

2.3.3. Issues with Deep Learning

In summary, deep learning is used extensively in the software engineering community. A common pattern in that research is to:

• Report deep learning’s benefits, but not its CPU/GPU cost (White et al., 2015; Choetkiertikul et al., 2016; Yuan et al., 2014; Mou et al., 2016);

• Or simply show the cost, without further analysis (Wang et al., 2016; Lam et al., 2015; Gu et al., 2016; Xu et al., 2016; White et al., 2016).

Since deep learning techniques require a large amount of time and computational resources to train their models, one might question whether the improvements from deep learning are worth the costs. Are there any simple techniques that achieve similar improvements with lower resource costs? To investigate how simple methods could improve baseline methods, we select the XU (Xu et al., 2016) study as a case study. The reasons are as follows:

• The baseline methods of most deep learning papers in SE are either not publicly available or too complex to implement (White et al., 2016; Lam et al., 2015). XU define their baseline methods precisely enough that others can confidently reproduce them locally. XU’s baseline method is an SVM learner, which is available in many machine learning toolboxes.

• Further, it is not yet common practice for deep learning researchers in the SE community to share their implementations and data (White et al., 2016; White et al., 2015; Lam et al., 2015; Wang et al., 2016; Choetkiertikul et al., 2016; Gu et al., 2016), where a tiny difference may lead to a huge difference in the results. Even though XU do not share their CNN tool, their training and testing data are available online and can be used for our proposed method. Since the same training and testing data are used for XU’s CNN and our proposed method, we can compare the results of our method to their CNN results.

• Some studies do not report their runtime and experimental environment, which makes it harder for us to systematically compare our results with theirs in terms of computational costs (Choetkiertikul et al., 2016; Yuan et al., 2014; White et al., 2015; Mou et al., 2016). XU clearly report their experimental hardware and runtime, which makes it easier for us to compare our computational costs to theirs.

2.4. Parameter Tuning in SE

In this paper, we use DE as an optimizer to do parameter tuning for SVM, which achieves results that are competitive with deep learning. This section discusses related work on parameter tuning in SE community.

Machine learning algorithms are designed to explore the instances to learn the bias. However, most of these algorithms are controlled by parameters such as:

• The maximum allowed depth of the decision tree built by CART;

• The number of trees to be built within a Random Forest.

Adjusting these parameters is called hyperparameter optimization. It is a well-explored approach in other communities (Bergstra and Bengio, 2012; Li et al., 2016). However, in SE, such parameter optimization is not a common task (as shown in the following examples).

In the field of defect prediction, Fu et al. (Fu et al., 2016a) surveyed hundreds of highly cited software engineering papers about defect prediction. Their observation is that most software engineering researchers did not acknowledge the impact of tunings (exceptions: (Lessmann et al., 2008; Tantithamthavorn et al., 2016)) and used “off-the-shelf” data miners. For example, Elish et al. (Elish and Elish, 2008) compared support vector machines to other data miners for the purposes of defect prediction. However, the Elish et al. paper makes no mention of any SVM tuning study (Elish and Elish, 2008). For more details about their survey, see (Fu et al., 2016a).

In the field of topic modeling, Agrawal et al. (Agrawal et al., 2016) investigated the impact of parameter tuning on Latent Dirichlet Allocation (LDA). LDA is a widely used technique in the software engineering field to find related topics within unstructured text, for example in topic analytics on Stack Overflow (Barua et al., 2014) and source code analysis (Binkley et al., 2014). Agrawal et al. found that LDA suffers from conclusion instability (different input orderings can lead to very different results) that is a result of a poor choice of the LDA control parameters (Agrawal et al., 2016). Yet, in their survey of LDA use in SE, they found that very few researchers (4 out of 57 papers) explored the benefits of parameter tuning for LDA.

One troubling trend is that the few SE papers that perform tuning do so using methods heavily deprecated in the machine learning community. For example, two SE papers that use tuning (Lessmann et al., 2008; Tantithamthavorn et al., 2016) apply a simple grid search to explore the potential parameter space for optimal tunings (such grid searches run one for-loop for each parameter being optimized). However, Bergstra et al. (Bergstra and Bengio, 2012) and Fu et al. (Fu et al., 2016b) argue that random search methods (e.g. the differential evolution algorithm used here) are better than grid search in terms of efficiency and performance.
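To make the contrast concrete, the sketch below enumerates a small grid (one nested loop level per parameter, expressed with `itertools.product`) and draws a fixed budget of random candidates from continuous ranges. It is only an illustration: the grid values, the budget of 10 and the helper names are hypothetical, and no learner is actually trained.

```python
import itertools
import random
from typing import Dict, Iterator, List, Tuple


def grid_candidates(grid: Dict[str, List[float]]) -> Iterator[Dict[str, float]]:
    """Yield every combination in the grid (one 'for-loop' per parameter)."""
    names = list(grid)
    for combo in itertools.product(*(grid[name] for name in names)):
        yield dict(zip(names, combo))


def random_candidates(ranges: Dict[str, Tuple[float, float]],
                      budget: int, seed: int = 0) -> Iterator[Dict[str, float]]:
    """Yield `budget` candidates sampled uniformly from continuous ranges."""
    rng = random.Random(seed)
    for _ in range(budget):
        yield {name: rng.uniform(lo, hi) for name, (lo, hi) in ranges.items()}


# Hypothetical grid: 4 * 4 * 3 = 48 configurations to evaluate.
grid = {"C": [1, 10, 25, 50],
        "gamma": [0.001, 0.01, 0.1, 1.0],
        "coef0": [0.0, 0.5, 1.0]}
# Continuous ranges covering the same region, explored with a fixed budget.
ranges = {"C": (1.0, 50.0), "gamma": (0.0, 1.0), "coef0": (0.0, 1.0)}

grid_list = list(grid_candidates(grid))
random_list = list(random_candidates(ranges, budget=10))
print(len(grid_list), len(random_list))
```

The number of grid evaluations is the product of the per-parameter grid sizes, so it grows multiplicatively with every added parameter, whereas the cost of the random approach is set by the chosen budget.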

3. Method

3.1. Research Problem

This section is an overview of the task and methods used by XU. Their task was to predict relationships between two knowledge units (questions with answers) on Stack Overflow. Specifically, XU divided linkable knowledge unit pairs into 4 different categories, namely duplicate, direct link, indirect link and isolated, based on their relatedness. The definitions of these four categories are shown in Table 1 (Xu et al., 2016).

In that paper, XU provided the following two methods as baselines (Xu et al., 2016):

• TF-IDF + SVM: a multi-class SVM classifier with 36 textual features generated based on the TF and IDF values of the words in a pair of knowledge units.

• Word Embedding + SVM: a multi-class SVM classifier with word embeddings generated by the word2vec model (Mikolov et al., 2013).

Both of these two baseline methods are compared against their proposed method, Word Embedding + CNN.

In this study, we select Word Embedding + SVM as the baseline because it uses word embedding as the input, which is the same as the Word Embedding + CNN method by XU.

3.2. Learners and Their Parameters

SVM has proven to be a very successful method for text classification problems. An SVM seeks to minimize misclassification errors by selecting a boundary or hyperplane that leaves the maximum margin between positive and negative classes (where the margin is defined as the sum of the distances of the hyperplane from the closest point of the two classes (Joachims, 1998)).

Like most machine learning algorithms, SVM has some parameters that control how it learns. In XU’s experiment, they used a radial basis function (RBF) for their SVM kernel and set $\gamma$ to $1/k$, where $k$ is $36$ for the TF-IDF + SVM method and $200$ for the Word Embedding + SVM method. For other parameters, XU mentioned that grid search was applied to optimize the SVM parameters, but no further information was disclosed.

For our work, we used the SVM module from Scikit-learn (Pedregosa et al., 2011), a Python package for machine learning, where the parameters shown in Table 2 are selected for tuning. Parameter $\mathit{C}$ sets the amount of regularization, which controls the tradeoff between the errors on training data and the model complexity. A small value for C will generate a simple model with more training errors, while a large value will lead to a complicated model with fewer errors. Kernel introduces different nonlinearities into the SVM model by applying kernel functions to the input data. Gamma defines how far the influence of a single training example reaches, with low values meaning ‘far’ and high values meaning ‘close’. coef0 is an independent parameter used in the sigmoid and polynomial kernel functions.
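The effect of gamma can be seen directly in the RBF kernel, which scikit-learn defines as $\exp(-\gamma \lVert x - y \rVert^{2})$. The sketch below is plain Python rather than a call into scikit-learn, and the two points are made up; it only shows that a small gamma (such as XU’s 1/200) keeps the kernel value high for distant points, while a large gamma makes it drop off quickly.

```python
import math
from typing import Sequence


def rbf_kernel(x: Sequence[float], y: Sequence[float], gamma: float) -> float:
    """RBF kernel value exp(-gamma * ||x - y||^2)."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * squared_distance)


# Two hypothetical points at squared distance 4.
x, y = [0.0, 0.0], [2.0, 0.0]

far_reaching = rbf_kernel(x, y, gamma=1 / 200)  # low gamma: influence reaches 'far'
short_range = rbf_kernel(x, y, gamma=1.0)       # high gamma: influence stays 'close'
print(far_reaching, short_range)
```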

As to why we used the “Tuning Range” shown in Table 2, and not some other ranges, we note that (a) those ranges include the defaults and also XU’s values; (b) the results presented below show that by exploring those ranges, we achieved large gains in the performance of our baseline method. This is not to say that larger tuning ranges might not result in greater improvements. However, for the goals of this paper (to show that tuning baseline method does matter), exploring just these ranges shown in Table 2 will suffice.

3.3. Learning Word Embedding

Learning word embeddings refers to finding vector representations of words such that the similarities between words can be captured by the cosine similarity of the corresponding vector representations. It has been shown that words with similar semantics and syntax are found close to each other in the embedding space (Mikolov et al., 2013).

Several methods have been proposed to generate word embeddings, like skip-gram (Mikolov et al., 2013), GloVe (Pennington et al., 2014) and PCA on the word co-occurrence matrix (Lebret and Collobert, 2013). To replicate XU’s work, we used the continuous skip-gram model (word2vec), which is an unsupervised word representation learning method based on neural networks and is also used by XU (Xu et al., 2016).

The skip-gram model learns vector representations of words by predicting the surrounding words in a context window. Given a sentence of words $W=w_{1}, w_{2}, \ldots, w_{n}$ , the objective of the skip-gram model is to maximize the average log probability of the surrounding words:

$$\frac{1}{n}\sum_{i=1}^{n}\sum_{-c\leq j\leq c,\,j\neq 0}\log p(w_{i+j}|w_{i})$$

where $c$ is the context window size and $w_{i+j}$ and $w_{i}$ represent the surrounding words and the center word, respectively. The probability $p(w_{i+j}|w_{i})$ is computed according to the softmax function:

$$p(w_{i+j}|w_{i})=\frac{\exp(v_{w_{O}}^{T}v_{w_{I}})}{\sum_{w=1}^{|W|}\exp(v_{w}^{T}v_{w_{I}})}$$

where $v_{w_{I}}$ and $v_{w_{O}}$ are the input and output vector representations of $w$ , respectively. $\sum_{w=1}^{|W|}\exp(v_{w}^{T}v_{w_{I}})$ normalizes the inner product results across all the words. To improve the computation efficiency, Mikolov et al. (Mikolov et al., 2013) proposed hierarchical softmax and negative sampling techniques. More details can be found in Mikolov et al.’s study (Mikolov et al., 2013).
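The softmax normalization described above can be sketched in a few lines of plain Python. This is a toy illustration under stated assumptions: the three-word vocabulary, the 2-dimensional vectors and the function name are hypothetical, and real word2vec training avoids this full normalization by using hierarchical softmax or negative sampling.

```python
import math

def skipgram_softmax(center_vec, output_vecs):
    """Return p(w | center word) for every word w in a toy vocabulary.

    center_vec  : input vector of the center word
    output_vecs : one output vector per vocabulary word
    """
    scores = [sum(a * b for a, b in zip(vec, center_vec)) for vec in output_vecs]
    max_score = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - max_score) for s in scores]
    total = sum(exps)        # normalizes across all words in the vocabulary
    return [e / total for e in exps]

# Hypothetical 2-d vectors for a three-word vocabulary.
output_vecs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
center_vec = [0.5, 0.5]

probs = skipgram_softmax(center_vec, output_vecs)
print(probs)
```

The third word has the largest inner product with the center vector, so it receives the largest probability; the cost of summing over the whole vocabulary is what motivates the faster approximations mentioned above.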

Skip-gram’s parameters control how that algorithm learns word embeddings. Those parameters include window size, dimensionality of the embedding space, etc. Zuccon et al. (Zuccon et al., 2015) found that embedding dimensionality and context window size have no consistent impact on retrieval model performance. However, Yang et al. (Yang et al., 2016) showed that large context window and dimension sizes are preferable for improving performance when using CNN to solve classification tasks for Twitter. Since this work compares the performance of tuned SVM with CNN, where the skip-gram model is used to generate word vector representations for both methods, tuning the parameters of the skip-gram model is beyond the scope of this paper (but we will explore it in future work).

To train our word2vec model, $100,000$ knowledge units tagged with “java” from the Stack Overflow posts table (including titles, questions and answers) are randomly selected as a word corpus (footnote 2: unless otherwise stated, all the experiment settings, including learner algorithms, training/testing data split, etc., strictly follow XU’s work). After applying the data processing techniques proposed by XU, such as removing unnecessary HTML tags and keeping short code snippets in code tags, we fit the corpus into the gensim word2vec module (Rehurek and Sojka, 2010), which is a Python wrapper over the original word2vec package.

When converting knowledge units into vector representations, for each word $w_{i}$ in the post-processed knowledge unit (including title, question and answers), we query the trained word2vec model to get the corresponding word vector representation $v_{i}$ . Then the whole knowledge unit with $s$ words is converted to a vector representation by element-wise addition, $Uv=v_{1}\oplus v_{2}\oplus\ldots\oplus v_{s}$ . This vector representation is used as the input data to SVM.
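The element-wise addition step can be sketched as follows. This is a minimal illustration, not the study's code: a plain dictionary stands in for the trained word2vec model, the 3-dimensional vectors are made up, and skipping words that are missing from the model is our assumption (the text does not say how such words are handled).

```python
def knowledge_unit_vector(words, word_vectors, dim):
    """Sum the vectors of all words in a knowledge unit, element by element."""
    total = [0.0] * dim
    for word in words:
        vec = word_vectors.get(word)
        if vec is None:
            continue  # assumption: words unknown to the model are skipped
        total = [t + v for t, v in zip(total, vec)]
    return total

# Hypothetical stand-in for a trained word2vec lookup (3 dimensions, not 200).
word_vectors = {
    "java":   [0.1, 0.2, 0.3],
    "string": [0.4, 0.0, -0.1],
    "null":   [0.0, 0.5, 0.2],
}

ku = knowledge_unit_vector(["java", "string", "unknownword", "null"],
                           word_vectors, dim=3)
print(ku)
```

The resulting vector has the same dimensionality as the word vectors regardless of how many words the knowledge unit contains, which is what allows it to be fed directly to SVM.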

3.4. Tuning Algorithm

A tuning algorithm is an optimizer that drives the learner to explore the optimal parameters in a given search space. According to our literature review, several search algorithms are used in the SE community: simulated annealing (Feather and Menzies, 2002; Menzies et al., 2007); various genetic algorithms (Jones et al., 1996; Harman, 2007; Arcuri and Fraser, 2011) augmented by techniques such as differential evolution (Storn and Price, 1997; Fu et al., 2016a, b; Chaves-González and Pérez-Toledano, 2015; Agrawal et al., 2016), tabu search and scatter search (Beausoleil, 2006; Molina et al., 2007; Corazza et al., 2013); particle swarm optimization (Windisch et al., 2007); numerous decomposition approaches that use heuristics to decompose the total space into small problems and then apply response surface methods (Krall et al., 2015); NSGA-II (Zhang et al., 2007) and NSGA-III (Mkaouer et al., 2014).

Of all the mentioned algorithms, the simplest are simulated annealing (SA) and differential evolution (DE), each of which can be coded in less than a page of some high-level scripting language. Our reading of the current literature is that there are more advocates for differential evolution than SA. For example, Vesterstrom and Thomsen (Vesterstrom and Thomsen, 2004) found DE to be competitive with particle swarm optimization and other GAs. DE has already been applied to parameter tuning in the SE community (e.g. see (Omran et al., 2005; Chiha et al., 2012; Fu et al., 2016a, b; Agrawal et al., 2016)). Therefore, in this work, we adopt DE as our tuning algorithm; the main steps in DE are described in Figure 1.
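To illustrate the claim that DE fits in less than a page, here is a minimal sketch of a generic DE loop in the style of Storn and Price. It is not a transcription of Figure 1: the population size, `f`, `cr`, the number of generations and the toy scoring function are all illustrative assumptions, and only numeric parameters are handled (a categorical parameter such as the kernel would need extra treatment).

```python
import random

def differential_evolution(score, bounds, pop_size=10, f=0.75, cr=0.3,
                           generations=20, seed=1):
    """Maximize `score` over box `bounds` with a basic DE/rand/1/bin loop."""
    rng = random.Random(seed)
    dim = len(bounds)

    def clip(value, lo, hi):
        return max(lo, min(hi, value))

    # Start from random candidates drawn uniformly inside the bounds.
    population = [[rng.uniform(lo, hi) for lo, hi in bounds]
                  for _ in range(pop_size)]
    scores = [score(cand) for cand in population]

    for _ in range(generations):
        for i in range(pop_size):
            # Pick three distinct candidates other than the current one.
            others = [p for j, p in enumerate(population) if j != i]
            a, b, c = rng.sample(others, 3)
            forced = rng.randrange(dim)  # at least one dimension is mutated
            trial = []
            for d, (lo, hi) in enumerate(bounds):
                if d == forced or rng.random() < cr:
                    trial.append(clip(a[d] + f * (b[d] - c[d]), lo, hi))
                else:
                    trial.append(population[i][d])
            trial_score = score(trial)
            # Keep the trial only if it is at least as good as its parent.
            if trial_score >= scores[i]:
                population[i], scores[i] = trial, trial_score

    best = max(range(pop_size), key=lambda i: scores[i])
    return population[best], scores[best]


# Toy stand-in for "F1 of an SVM trained with these parameters":
# a smooth function peaking at C = 10, gamma = 0.5 (hypothetical values).
def toy_score(params):
    c, gamma = params
    return -((c - 10.0) ** 2) / 100.0 - (gamma - 0.5) ** 2

best_params, best_score = differential_evolution(
    toy_score, bounds=[(1.0, 50.0), (0.0, 1.0)])
print(best_params, best_score)
```

In an actual tuning run, `toy_score` would be replaced by a function that trains the learner with the candidate parameters and returns its score on the tuning data.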

4. Experimental Setup

4.1. Research Questions

To systematically investigate whether tuning can improve the performance of baseline methods compared with deep learning method, we set the following three research questions:

• RQ1: *Can we reproduce XU’s baseline results (Word Embedding + SVM)?*

• RQ2: *Can DE tune a standard learner such that it outperforms XU’s deep learning method?*

• RQ3: *Is tuning SVM with DE faster than XU’s deep learning method?*

RQ1 investigates whether our implementation of the Word Embedding + SVM method performs similarly to XU’s baseline, which ensures that our subsequent analysis applies to XU’s conclusions. RQ2 and RQ3 lead us to investigate whether tuned SVM is comparable with XU’s deep learning method in terms of both performance and cost.

4.2. Dataset and Experimental Design

Our experimental data comes from the Stack Overflow data dump of September 2016 (footnote 3: https://archive.org/details/stackexchange), where the posts table includes all the questions and answers posted on Stack Overflow up to that date and the postlinks table describes the relationships between posts, e.g., duplicate and linked. As mentioned in Section 3.1, we have four different types of relationships in knowledge unit pairs. Therefore, the linked type is further divided into indirectly linked and directly linked. Overall, four different types of data are generated according to the following rules (Xu et al., 2016):

• Randomly select a pair of posts from the postlinks table. If the value in the PostLinkTypeId field for this pair of posts is $3$ , then the pair is labeled as duplicate posts; otherwise, they are directly linked posts.

• Randomly select a pair of posts from the posts table. If the two posts are linkable from each other according to the postlinks table and the distance between them is greater than 2 (which means they are not duplicate or directly linked posts), then the pair is indirectly linked. If they are not linkable, then the pair is isolated.
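The two labeling rules above can be sketched as a pair of small functions. The function names and the representation of "distance" (an integer, or `None` when the posts are not linkable) are hypothetical; in particular, returning `None` for linkable pairs at distance 2 or less is our assumption, since the rules do not say what happens to such pairs.

```python
DUPLICATE_LINK_TYPE_ID = 3  # PostLinkTypeId value that marks duplicate posts

def label_postlinks_pair(post_link_type_id):
    """Label a pair sampled from the postlinks table."""
    if post_link_type_id == DUPLICATE_LINK_TYPE_ID:
        return "duplicate"
    return "direct link"

def label_posts_pair(distance):
    """Label a pair sampled from the posts table.

    distance: link distance between the two posts according to postlinks,
              or None if the posts are not linkable from each other.
    """
    if distance is None:
        return "isolated"
    if distance > 2:
        return "indirect link"
    return None  # assumption: closer pairs are not produced by this rule

print(label_postlinks_pair(3), label_posts_pair(4), label_posts_pair(None))
```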

In this work, we use the same training and testing knowledge unit pairs as XU (Xu et al., 2016) (footnote 4: https://github.com/XBWer/ASEDataset), with 6,400 pairs of knowledge units for training and 1,600 pairs for testing. Each type of linked knowledge units accounts for $1/4$ of both the training and testing data. We used the same training and testing data as XU for two reasons:

• It ensures that the performance of our baseline method is as close to XU’s as possible.

• Deep learning methods are far more complicated than SVM, and small differences in implementation might lead to different results. To compare fairly with XU’s result, we can use the performance scores of the CNN method from XU’s study (Xu et al., 2016) without introducing any implementation bias.

For training the word2vec model, we randomly select 100,000 knowledge units (title, question body and all the answers) from the posts table that are related to “java”. After that, all the training/tuning/testing knowledge units used in this paper are converted into word embedding representations by looking up each word in the word2vec model as described in Section 3.3.

As seen in Figure 2, instead of using all the 6,400 knowledge units as training data, we split the original training data into new training data and tuning data, which are used during parameter tuning procedure for training SVM and evaluating candidate parameters offered by DE. Afterwards, the new training data is again fitted into the SVM with the optimal parameters found by DE and finally the performance of the tuned SVM will be evaluated on the testing data.

To reduce the potential variance caused by how the original training data is divided, 10-fold cross-validation is performed. Specifically, each time one fold with $640$ knowledge unit pairs is used as the tuning data, and the remaining folds with $5760$ knowledge unit pairs are used as the new training data; the output SVM model is then evaluated on the testing data. Therefore, all the performance scores reported below are averaged values over 10 runs.
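The fold arithmetic can be checked with a short sketch. It assumes contiguous, unshuffled folds over the 6,400 training pairs, which is our simplification; the text does not state whether the folds were shuffled or stratified by class.

```python
def tuning_folds(n_pairs=6400, n_folds=10):
    """Yield (new_training_indices, tuning_indices) for each of the folds."""
    fold_size = n_pairs // n_folds
    indices = list(range(n_pairs))
    for k in range(n_folds):
        start, stop = k * fold_size, (k + 1) * fold_size
        tuning = indices[start:stop]                  # held out for scoring candidates
        training = indices[:start] + indices[stop:]   # used to fit the SVM
        yield training, tuning

for fold, (training, tuning) in enumerate(tuning_folds()):
    print(fold, len(training), len(tuning))
```

Each of the 10 iterations yields 5,760 pairs for the new training data and 640 pairs for the tuning data, and every pair serves as tuning data exactly once.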

In this study, we use the Wilcoxon signed-rank test to statistically compare the differences between tuned SVM and untuned SVM. Specifically, the Benjamini-Hochberg (BH) adjusted p-value is used to test whether a difference is statistically significant at the level of $0.05$ (Benjamini and Hochberg, 1995). To measure the effect size of performance scores between tuned SVM and untuned SVM, we compute Cliff’s $\delta$ , which is a non-parametric effect size measure (Romano et al., 2006). As Romano et al. suggested, we evaluate the magnitude of the effect size as follows: negligible ( $|\delta|<0.147$ ), small ( $0.147\leq|\delta|<0.33$ ), medium ( $0.33\leq|\delta|<0.474$ ), and large ( $0.474\leq|\delta|$ ) (Romano et al., 2006).
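For readers unfamiliar with Cliff’s $\delta$ , a minimal sketch of its computation and of the magnitude thresholds quoted above follows. The two score lists are made-up numbers for illustration only, not results from this study; values that fall exactly on a threshold are assigned to the higher category.

```python
def cliffs_delta(xs, ys):
    """Cliff's delta: (#pairs with x > y minus #pairs with x < y) / (|xs| * |ys|)."""
    greater = sum(1 for x in xs for y in ys if x > y)
    less = sum(1 for x in xs for y in ys if x < y)
    return (greater - less) / (len(xs) * len(ys))

def magnitude(delta):
    """Map |delta| to the categories of Romano et al. (2006)."""
    d = abs(delta)
    if d < 0.147:
        return "negligible"
    if d < 0.33:
        return "small"
    if d < 0.474:
        return "medium"
    return "large"

# Hypothetical F1 scores from repeated runs (illustrative values only).
tuned = [0.84, 0.85, 0.86, 0.83, 0.85]
untuned = [0.50, 0.52, 0.49, 0.51, 0.50]

delta = cliffs_delta(tuned, untuned)
print(delta, magnitude(delta))
```

Because every tuned score in this toy example exceeds every untuned score, $\delta$ reaches its maximum of 1, which falls in the "large" category.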

4.3. Evaluation Metrics

When evaluating the performance of tuning SVM on the multi-class linkable knowledge units prediction problem, consistent with XU (Xu et al., 2016), we use accuracy, precision, recall and F1-score as the evaluation metrics.

Given a multi-classification problem with true labels $C_{1}$ , $C_{2}$ , $C_{3}$ and $C_{4}$ , we can generate a confusion matrix like Table 3, where the value of $c_{ii}$ represents the number of instances that are correctly classified by the learner for class $C_{i}$ .

Accuracy of the learner is defined as the number of correctly classified knowledge units over the total number of knowledge units, i.e.,

$$\mathit{accuracy}=\frac{\sum_{i}c_{ii}}{\sum_{i}\sum_{j}c_{ij}}$$

where ${\sum_{i}\sum_{j}c_{ij}}$ is the total number of knowledge units. For a given type of knowledge units, $C_{j}$ , precision is defined as the number of knowledge unit pairs correctly classified as $C_{j}$ over the number of knowledge unit pairs classified as $C_{j}$ , and recall is defined as the percentage of all $C_{j}$ knowledge unit pairs that are correctly classified. F1-score is the harmonic mean of recall and precision. Mathematically, precision, recall and F1-score of the learner for class $C_{j}$ can be denoted as follows:

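$$\mathit{precision}_{j}=\frac{c_{jj}}{\sum_{i}c_{ij}},\qquad \mathit{recall}_{j}=\frac{c_{jj}}{\sum_{i}c_{ji}},\qquad \mathit{F1}_{j}=\frac{2\times \mathit{precision}_{j}\times \mathit{recall}_{j}}{\mathit{precision}_{j}+\mathit{recall}_{j}}$$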

where $\sum_{i}c_{ij}$ is the predicted number of knowledge units in class $C_{j}$ and ${\sum_{i}c_{ji}}$ is the actual number of knowledge units in class $C_{j}$ .

Recall from Algorithm 1 that we call differential evolution once for each optimization goal. Generally, this goal depends on which metric is most important for the business case. In this work, we use $F1$ to score the candidate parameters because it balances the trade-off between precision and recall; this choice is consistent with XU (Xu et al., 2016) and is also widely used in the software engineering community to evaluate classification results (Wang et al., 2016; Menzies et al., 2007; Fu et al., 2016a; Kim et al., 2008).
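The metric definitions above can be sketched directly from a confusion matrix. In this illustration `c[i][j]` counts pairs whose actual class is $C_{i}$ and whose predicted class is $C_{j}$ , which matches the column sum being the predicted count and the row sum being the actual count. The counts themselves are hypothetical, chosen only so that each class has 400 of the 1,600 testing pairs.

```python
def evaluate(c):
    """Return (accuracy, [(precision, recall, f1) per class]) for matrix c."""
    n = len(c)
    total = sum(sum(row) for row in c)
    accuracy = sum(c[i][i] for i in range(n)) / total
    per_class = []
    for j in range(n):
        predicted = sum(c[i][j] for i in range(n))  # column sum
        actual = sum(c[j][i] for i in range(n))     # row sum
        precision = c[j][j] / predicted if predicted else 0.0
        recall = c[j][j] / actual if actual else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        per_class.append((precision, recall, f1))
    return accuracy, per_class

# Hypothetical counts: rows are actual classes, columns are predicted classes.
confusion = [
    [300, 50, 30, 20],
    [40, 320, 30, 10],
    [20, 30, 330, 20],
    [10, 10, 20, 360],
]

accuracy, per_class = evaluate(confusion)
print(accuracy)
for precision, recall, f1 in per_class:
    print(round(precision, 3), round(recall, 3), round(f1, 3))
```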

5. Results

In this section, we present our experimental results. To answer research questions raised in Section 4.1, we conducted two experiments:

• Compare the performance of the Word Embedding + SVM method in XU (Xu et al., 2016) with that of our implementation;

• Compare the performance of our DE-tuned SVM with that of XU’s CNN deep learning method.

Since we used the same training and testing data sets provided by XU (Xu et al., 2016), followed the same experimental procedure and evaluated the methods using the same performance measures, we simply used the results reported by XU (Xu et al., 2016) for the performance comparison.

RQ1: Can we reproduce XU’s baseline results (Word Embedding + SVM)?

This first question is important to our work since, without the original tool released by XU, we need to ensure that our reimplementation of their baseline method (WordEmbedding + SVM) has a similar performance to their work. Accordingly, we carefully follow XU’s procedure (Xu et al., 2016). We use the SVM learner from scikit-learn with the settings $\gamma=\frac{1}{200}$ and $\mathit{kernel}=$ “rbf”, which are those used by XU. After that, the same training and testing knowledge unit pairs are applied to SVM.

Table 4 and Figure 3 show the performance scores and the corresponding score deltas between our implementation of WordEmbedding + SVM and XU’s in terms of accuracy (footnote 5: XU only reports overall accuracy, not accuracy for each class; hence it is missing in this table), precision, recall and F1-score. As we can see, when predicting these four different types of relatedness between knowledge unit pairs, our Word Embedding + SVM method has very similar performance scores to the baseline method reported by XU in (Xu et al., 2016), with the maximum difference less than $0.2$ . The exception is the Duplicate class, where our baseline has a higher precision (i.e., $0.724$ vs. $0.611$ ) but a lower recall (i.e., $0.525$ vs. $0.725$ ).

Figure 3 presents the same results in a graphical format. Any bar above zero means that our implementation has a better performance score than XU’s on predicting that specific knowledge unit relatedness class. As we can see, most of the differences ( $\frac{8}{12}$ ) are within 0.05 and the score delta of overall performance shows that our implementation is a little worse than XU’s implementation. From this chart we conclude that:


Overall, our reimplementation of WordEmbedding + SVM has very similar performance in all the evaluated metrics compared to the baseline method reported in XU’s study (Xu et al., 2016).

The significance of this conclusion is that, moving forward, we are confident that we can use our reimplementation of WordEmbedding+SVM as a valid surrogate for the baseline method of XU.

RQ2: Can DE tune a standard learner such that it outperforms XU’s deep learning method?

To answer this question, we run the workflow of Figure 2, where DE is applied to find the optimal parameters of SVM based on the training and tuning data. The optimal tunings are then applied to the SVM model and the built learner is evaluated on the testing data. Note that, in this study, since we mainly focus on precision, recall and F1-score measures where F1-score is the harmonic mean of precision and recall, we use F1-score as the tuning goal for DE. In other words, when tuning parameters, DE expects to find a set of candidate parameters that maximizes F1-score.

Table 5 presents the performance scores of XU’s baseline, XU’s CNN method and Tuned SVM for all metrics. The highest score for each relatedness class is marked in bold. Note that, without tuning, XU’s CNN method outperforms the baseline SVM in $\frac{10}{12}$ evaluation metrics across all four classes. The largest performance improvement is $0.47$ for recall on the Direct_Link class. This result is consistent with XU’s conclusion that their CNN method is superior to standard SVM. After tuning SVM, the deep learning method has no such advantage. Specifically, CNN has an advantage over tuned SVM in only $\frac{4}{12}$ evaluation metrics across all four classes. Even when CNN performs better than our tuned SVM method, the largest difference is $0.065$ for recall on the Direct_Link class, which is less than $0.1$ .

Figure 4 presents the same results in a graphical format. Any bar above zero indicates that tuned SVM has a better performance score than CNN. In this figure: CNN has a slightly better performance on Duplicate class for precision, recall and F1-score and a higher recall on Direct link class. Across all of Figure 4, in $\frac{8}{12}$ evaluation scores, Tuned SVM has better performance scores than CNN, with the largest delta of $0.222$ .

Figure 5 compares the performance delta of tuned SVM with XU’s untuned SVM. We note that DE-based parameter tuning never degrades SVM’s performance (since there are no negative values in that chart). Tuning dramatically improves scores on predicting some classes of KU relatedness. For example, the recall of predicting Direct_Link is increased from $0.433$ to $0.903$ , which is a $108\%$ improvement over XU’s untuned SVM (to be fair to XU, it is still an $84\%$ improvement over our untuned SVM). At the same time, the corresponding precision and F1 scores of predicting Direct_Link are increased from $0.560$ to $0.851$ and from $0.488$ to $0.841$ , which are $52\%$ and $72\%$ improvements over XU’s original report (Xu et al., 2016), respectively. A similar pattern can also be observed in the Isolated class. On average, tuning helps improve the performance of XU’s SVM by $0.238$ , $0.228$ and $0.227$ in terms of precision, recall and F1-score for all four KU relatedness classes. Figure 6 compares the tuned SVM with our untuned SVM. We observe patterns similar to those in Figure 5; for example, all the bars are above zero.
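The percentage improvements quoted for Direct_Link follow from the relative change $(\mathit{tuned}-\mathit{untuned})/\mathit{untuned}$ , using XU’s untuned scores as the reference. As a worked check:

$$\frac{0.903-0.433}{0.433}\approx 1.085,\qquad \frac{0.851-0.560}{0.560}\approx 0.520,\qquad \frac{0.841-0.488}{0.488}\approx 0.723$$

These ratios correspond to the reported improvements of about $108\%$ , $52\%$ and $72\%$ for recall, precision and F1-score, respectively.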

Based on the performance scores in Table 5 and score delta in Figure 4, Figure 5 and Figure 6, we can see that:

• Parameter tuning can dramatically improve the performance of Word Embedding + SVM (the baseline method) for the multi-class KU relatedness prediction task;

• With the optimal tunings, the traditional machine learning method, SVM, if not better, is at least comparable with deep learning methods (CNN).

When discussing this result with colleagues, we are sometimes asked for a statistical analysis that confirms the above finding. However, due to the lack of evaluation score distributions for the CNN method in (Xu et al., 2016), we cannot compare their single value with our results from 10 repeated runs. Nevertheless, according to the Wilcoxon signed-rank test over the results of 10 runs, tuned SVM performs statistically better than our untuned SVM in terms of all evaluation measures on all four classes ( $p<0.05$ ). According to Cliff’s $\delta$ values, the magnitude of the difference between tuned SVM and our untuned SVM is not trivial ( $|\delta|>0.147$ ) for all evaluation measures.

Overall, the experimental results and our analysis indicate that:


In the evaluation conducted here, the deep learning method, CNN, does not have any performance advantage over our tuning approach.

RQ3: Is tuning SVM with DE faster than XU’s deep learning method?

When comparing the runtime of two learning methods, the comparison obviously should be conducted under the same hardware settings. Since we adopt the CNN evaluation scores from (Xu et al., 2016), we cannot run our tuning SVM experiment under exactly the same system settings. To allow readers to make an objective comparison, we provide the experimental environment in Table 6. To obtain the runtime of tuning SVM, we recorded the start time and end time of the program execution, including parameter tuning, model training and model testing.

According to XU, it took $14$ hours to train their CNN model to a low-loss convergence ( $<e^{-3}$ ) (Xu et al., 2016). Our work, on the other hand, takes only $10$ minutes to run SVM with parameter tuning by DE in a similar environment. That is, the simple parameter tuning method on SVM is $84X$ faster than XU’s deep learning method ( $14$ hours is $840$ minutes).


Compared to CNN method, tuning SVM is about $84X$ faster in terms of model building.

The significance of this finding is that, in this case study, CNN was better neither in performance scores (see RQ2) nor in runtimes. CNN’s extra runtimes are a particular concern since (a) they are very long; and (b) they would be incurred any time researchers want to update the CNN model with new data or validate the XU result.

6. Discussion

6.1. Why Does DE+SVM Work?

Parameter tuning matters. As mentioned in Section 2.4, the default parameter values set by the algorithm designers could generate a good performance on average but may not guarantee the best performance for the local data (Bergstra and Bengio, 2012; Fu et al., 2016a). Given that, it is most strange to report that most SE researchers ignore the impacts of parameter tuning when they utilize various machine learning methods to conduct software analytics (evidence: see our reviews in (Fu et al., 2016a, b; Agrawal et al., 2016)). The conclusion of this work must be to stress the importance of this kind of tuning, using local data, for any future software analytics study.

Better explore the searching space. It turns out that one exception to our statement that “most researchers do not tune” is the XU study. In that work, they perform parameter tuning, albeit unsuccessfully, with grid search. In such a grid search, for $N$ parameters to be tuned, $N$ for loops are created to run over a range of settings for each parameter. While a widely used method, it is often deprecated. For example, Bergstra et al. (Bergstra and Bengio, 2012) note that grid search jumps through different parameter settings between some min and max values of a pre-defined tuning range. They warn that such jumps may actually skip over the critical tuning values. On the other hand, DE tuning values are adjusted based on better candidates from previous generations. Hence DE is more likely than grid search to “fill in the gaps” between the initialized values.

That said, although DE+SVM works in this study, it does not mean DE is the best parameter tuner for all SE tasks. We encourage more researchers to explore faster and more effective parameter tuners in this direction.
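The "$N$ for loops" structure of grid search can be sketched with `itertools.product`, which is equivalent to nesting one loop per parameter. The grid values below are hypothetical and are not XU's actual grid, which was not disclosed.

```python
import itertools

# Hypothetical grid over three SVM parameters (illustrative values only).
grid = {
    "C": [1, 10, 50],
    "gamma": [0.0, 0.5, 1.0],
    "coef0": [0.0, 1.0],
}

def grid_candidates(grid):
    """Yield every combination of the listed values, one dict per setting."""
    names = list(grid)
    for values in itertools.product(*(grid[name] for name in names)):
        yield dict(zip(names, values))

candidates = list(grid_candidates(grid))
print(len(candidates))  # 3 * 3 * 2 = 18 settings

# Grid search only ever evaluates the listed values: a gamma between
# 0.5 and 1.0 is never tried, whatever the learner's score in that gap.
tried_gammas = sorted({cand["gamma"] for cand in candidates})
print(tried_gammas)
```

A DE mutation, by contrast, combines existing candidates arithmetically, so it can propose values that lie between (or beyond) the initial ones.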

6.2. Implication

Beyond the specifics of this case study, what general principles can we take from the above work?

Understand the task. One reason to try different tools for the same task is to better understand the task. The more we understand a task, the better we can match tools to that task. Tools that are poorly matched to task are usually complex and/or slow to execute. In the case study of this paper, we would say that

• Deep learning is a poor match to the task of predicting whether two questions posted on Stack Overflow are semantically linkable since it is so slow;

• Differential evolution tuning SVM is a much better match since it is so fast and obtains competitive performance.

That said, it is important to stress that the point of this study is not to deprecate deep learning. There are many scenarios where we believe deep learning would be a natural choice (e.g. when analyzing complex speech or visual data). In SE, it is still an open research question in which scenarios deep learning is the best choice. Results from this paper show that, at least for classification tasks like knowledge unit relatedness classification on Stack Overflow, deep learning does not have much advantage over well-tuned conventional machine learning methods. However, as we better understand SE tasks, deep learning could be used to address more SE problems that require more advanced artificial intelligence.

Treat resource constraints as design challenges. As a general engineering principle, we think it insightful to consider the resource cost of a tool before applying it. It turns out that this is a design pattern used in contemporary industry. According to Calero and Piattini (Calero and Piattini, 2015), many current commercial redesigns are motivated (at least in part) by arguments based on sustainability (i.e. using fewer resources to achieve results). In fact, they say that managers used sustainability-based redesigns to motivate extensive cost-cutting opportunities.

6.3. Threats to Validity

Threats to internal validity concern the consistency of the results obtained from the study. In our study, to investigate how tuning can improve the performance of baseline methods and how well it performs compared with the deep learning method, we select XU’s Word Embedding + SVM baseline method as a case study. Since the original implementation of Word Embedding + SVM (baseline 2 method in (Xu et al., 2016)) is not publicly available, we had to reimplement our own version of Word Embedding + SVM as the baseline method in this study. As shown in RQ1, our implementation has quite similar results to XU’s on the same data sets. Hence, we believe that our implementation reflects the original baseline method in XU’s study (Xu et al., 2016).

Threats to external validity concern whether the results are relevant to other cases, i.e., the ability to generalize the observations in a study. In this study, we compare our tuned baseline method with the deep learning method, CNN, in terms of precision, recall, F1-score and accuracy. The experimental results are quite consistent for this knowledge unit relatedness prediction task. Nonetheless, we do not claim that our findings can be generalized to all software analytics tasks. However, those other software analytics tasks often apply deep learning methods to classification tasks (Choetkiertikul et al., 2016; Wang et al., 2016), so it is quite possible that the methods of this paper (i.e., DE-based parameter tuning) would be widely applicable elsewhere.

7. Conclusion

In this paper, we perform a comparative study to investigate how tuning can improve a baseline method compared with a state-of-the-art deep learning method for predicting knowledge unit relatedness on Stack Overflow. Our experimental results show that:

• Tuning improves the performance of baseline methods. At least for the Word Embedding + SVM method (the baseline in (Xu et al., 2016)), the tuned version performs as well as, if not better than, the proposed CNN method in (Xu et al., 2016).

• The baseline method with parameter tuning runs much faster than complicated deep learning. In this study, tuning SVM runs $84X$ faster than the CNN method.

8. Addendum

As this paper was going to press, we learned of a new deep learning method that, according to its creators, runs 20 times faster than standard deep learning (Spring and Shrivastava, 2017). Note that in that paper, the authors say their faster method does not produce better results; in fact, their method generated solutions that were a small fraction worse than “classic” deep learning. Hence, that paper does not invalidate our result since (a) our DE-based method sometimes produced better results than classic deep learning and (b) our DE runs 84 times faster (i.e. much faster runtimes than those reported in (Spring and Shrivastava, 2017)).

That said, this new fast deep learner deserves our close attention since, using it, we conjecture that our DE tools could solve an open problem in the deep learning community; i.e. how to find the best configurations inside a deep learner faster.

Based on the results of this study, we recommend that before applying deep learning methods to SE tasks, researchers implement simpler techniques. These simpler methods could be used, at the very least, for comparisons against a baseline. In this particular case of deep learning vs DE, the extra computational effort is so very minor (10 minutes on top of 14 hours) that such a “try-with-simpler” step should be standard practice.

As for future work, we will explore more simple techniques to solve SE tasks and also investigate how deep learning techniques could be applied effectively in the software engineering field.

ACKNOWLEDGEMENTS

The work is partially funded by an NSF award #1302169.

Bibliography

Agrawal et al. [2016] Amritanshu Agrawal, Wei Fu, and Tim Menzies. 2016. What is wrong with topic modeling? (and how to fix it using search-based SE). arXiv preprint arXiv:1608.08176 (2016).
Anvik et al. [2006] John Anvik, Lyndon Hiew, and Gail C Murphy. 2006. Who should fix this bug? In Proceedings of the 28th International Conference on Software Engineering. ACM, 361–370.
Arcuri and Fraser [2011] Andrea Arcuri and Gordon Fraser. 2011. On parameter tuning in search based software engineering. In International Symposium on Search Based Software Engineering. Springer, 33–47.
Arel et al. [2010] Itamar Arel, Derek C Rose, and Thomas P Karnowski. 2010. Research Frontier: Deep Machine Learning–a New Frontier in Artificial Intelligence Research. IEEE Computational Intelligence Magazine 5, 4 (2010), 13–18.
Barua et al. [2014] Anton Barua, Stephen W Thomas, and Ahmed E Hassan. 2014. What are developers talking about? An analysis of topics and trends in Stack Overflow. Empirical Software Engineering 19, 3 (2014), 619–654.
Beausoleil [2006] Ricardo P Beausoleil. 2006. “MOSS” multiobjective scatter search applied to non-linear multiple criteria optimization. European Journal of Operational Research 169, 2 (2006), 426–449.
Begel and Zimmermann [2014] Andrew Begel and Thomas Zimmermann. 2014. Analyze this! 145 questions for data scientists in software engineering. In Proceedings of the 36th International Conference on Software Engineering. ACM, 12–23.