Supervised Learning Based Algorithm Selection for Deep Neural Networks

Shaohuai Shi; Pengfei Xu; Xiaowen Chu

arXiv:1702.03192·cs.DC·March 20, 2017


TL;DR

This paper introduces MTNN, a supervised learning approach that intelligently selects the optimal NT operation implementation in deep learning, significantly improving performance and training speed on modern GPUs.

Contribution

The paper proposes MTNN, a novel supervised learning-based algorithm selection method for NT operations, enhancing deep learning platform efficiency.

Findings

1. MTNN achieves 96% prediction accuracy with low overhead.

2. Performance of NT operations improves by an average of 54%.

3. Revised Caffe outperforms original by 28% in training speed.

Tables

Table I: The experimental GPU hardware with CUDA-8.0

GPU Model	Cores	Memory	OS	Core frequency
GTX1080	2560	8 GB	Ubuntu 14.04	1607 MHz
Titan X	3584	10 GB	Ubuntu 14.04	1417 MHz

Table II: Sample distribution on tested GPUs

GPU Model	GTX1080	TitanX
# of $-1$	649	535
# of $1$	242	406
# of Samples	891	941
Total	1832

Table III: Characteristics of tested GPUs

GPU	GTX1080	TitanX
Compute Capability	6.1	6.1
Global Mem (GB)	8	10
# of SMs	20	28
Core Clock (MHz)	1607	1417
Mem Clock (MHz)	5005	5005
Mem Bus Width	256	384
L2 Cache (KB)	2048	3072

Table IV: Accuracies of the 5-fold cross-validation

Class	Minimum	Maximum	Average
Negative	91.36%	93.30%	92.05%
Positive	86.49%	92.31%	88.39%
Total	89.40%	91.94%	90.51%

Table V: The experimental environment for classifiers

CPU	Memory	OS	Frequency
Intel CPU i7-3820	64 GB	Ubuntu 14.04	3.6 GHz

Table VI: Comparison with SVM and DT

Classifier	Accuracy (%)	Train Time (ms)	Predict Time (ms)
GBDT	90.51	7	0.005
SVM-RBF	81.66	47	1.2
SVM-Poly	77.68	30	1.07
DT	87.84	1	0.004

Table VII: Metrics description

Metric	Description
MTNN vs NT	Average percent improvement of using MTNN versus always choosing NT
MTNN vs TNN	Average percent improvement of using MTNN versus always choosing TNN
$GOW_{avg}$	Average $GOW$ in all samples
$GOW_{max}$	Maximum $GOW$ in all samples
$LUB_{avg}$	Average $LUB$ in all samples
$LUB_{min}$	Minimum $LUB$ in all samples

Table VIII: Values of performance metrics of MTNN in %

Metric	GTX1080	TitanX	Total
MTNN vs NT	57.78	50.48	54.03
MTNN vs TNN	21.51	22.31	21.92
$GOW_{avg}$	79.44	73.20	76.23
$GOW_{max}$	1439.39	957.44	1439.39
$LUB_{avg}$	-0.15	-0.40	-0.28
$LUB_{min}$	-25.07	-71.62	-71.62

Table IX: Fully connected networks configuration for evaluation

Data set	MNIST	Synthetic
Input	784	26752
Output	10	26752
2 hidden layers	2048-1024	4096-4096
3 hidden layers	2048-2048-1024	4096-4096-4096
4 hidden layers	2048-2048-2048-1024	4096-4096-4096-4096

Table X: Breakdown of the average running time in milliseconds and speedups

Data set	GPU	Phase	CaffeNT	CaffeMTNN	Speedup
MNIST	GTX1080	Forward	11.15	10.39	1.07
MNIST	GTX1080	Backward	58.81	59.79	0.98
MNIST	GTX1080	Total	24.79	24.31	1.02
MNIST	TitanX	Forward	7.36	7.38	1.00
MNIST	TitanX	Backward	47.69	47.39	1.01
MNIST	TitanX	Total	18.22	18.25	1.00
Synthetic	GTX1080	Forward	320.83	131.62	2.44
Synthetic	GTX1080	Backward	1029.77	1033.04	1.00
Synthetic	GTX1080	Total	477.05	288.24	1.66
Synthetic	TitanX	Forward	200.54	93.12	2.15
Synthetic	TitanX	Backward	761.08	763.59	1.00
Synthetic	TitanX	Total	316.13	208.99	1.51

Equations

$C=A\times B$ (1)

$C=A\times B^{T}$ (2)

$T_{transpose}(n,k)<T_{NT}(m,n,k)-T_{NN}(m,n,k)$ (3)

$f:(G,m,n,k)\mapsto\{-1,1\}$

$\hat{f}=\arg\min\sum_{(G,m,n,k)\in\Omega}\|\hat{f}(G,m,n,k)-f(G,m,n,k)\|$

$\hat{f}(\textbf{x})=\begin{cases}-1,&P_{NT}(\textbf{x})<P_{TNN}(\textbf{x})\\+1,&P_{NT}(\textbf{x})\geq P_{TNN}(\textbf{x})\end{cases}$

$GOW=\frac{P_{MTNN}-\min(P_{NT},P_{TNN})}{\min(P_{NT},P_{TNN})}$

$LUB=\frac{P_{MTNN}-\max(P_{NT},P_{TNN})}{\max(P_{NT},P_{TNN})}$

Full text

Supervised Learning Based Algorithm Selection for Deep Neural Networks

Shaohuai Shi, Pengfei Xu, Xiaowen Chu

Department of Computer Science, Hong Kong Baptist University

{csshshi, pengfeixu, chxw}@comp.hkbu.edu.hk

Abstract

Many recent deep learning platforms rely on third-party libraries (such as cuBLAS) to utilize the computing power of modern hardware accelerators (such as GPUs). However, we observe that they may achieve suboptimal performance because the library functions are not used appropriately. In this paper, we target the optimization of multiplying a matrix with the transpose of another matrix (referred to as the NT operation hereafter), which contributes about half of the training time of fully connected deep neural networks. Rather than directly calling the library function, we propose a supervised learning based algorithm selection approach named MTNN, which uses a gradient boosted decision tree to select one of two alternative NT implementations intelligently: (1) calling the cuBLAS library function; (2) calling our proposed algorithm TNN that uses an efficient out-of-place matrix transpose. We evaluate the performance of MTNN on two modern GPUs: NVIDIA GTX 1080 and NVIDIA Titan X Pascal. MTNN achieves 96% prediction accuracy with very low computational overhead, which results in an average of 54% performance improvement on a range of NT operations. To further evaluate the impact of MTNN on the training process of deep neural networks, we have integrated MTNN into the popular deep learning platform Caffe. Our experimental results show that the revised Caffe can outperform the original one by an average of 28%. Both MTNN and the revised Caffe are open-source.

Index Terms:

Linear Algebra; Matrix Multiplication; Transpose; GPU; Deep Neural Networks

I Introduction

Deep neural networks have recently achieved great success in computer vision, speech recognition, and natural language processing [1][2]. The forward and backward phases of the backpropagation-based training process of a deep neural network require two different forms of matrix multiplication (i.e., Equation 1 and Equation 2), which dominate the training time. The regular form of matrix multiplication for two row-major matrices A and B can be represented as follows:

$C=A\times B$ (1)

where $\textit{A}\in R^{m\times k}$ , $\textit{B}\in R^{k\times n}$ and $\textit{C}\in R^{m\times n}$ . In this paper we call Equation 1 NN operation (N means no transpose). There is another form of matrix multiplication: A multiplied with the transpose of B, i.e.,

$C=A\times B^{T}$ (2)

where $\textit{B}^{T}$ is the transpose of B, $B^{T}_{ji}=B_{ij}$ and $\textit{B}\in R^{n\times k}$ . In this paper we call Equation 2 NT operation (T means transpose).

The time complexity of schoolbook matrix multiplication is $O(m\times k\times n)$ , which makes it very time-consuming for large matrices. Nowadays, there exist many optimized software libraries for matrix operations, including ATLAS, LAPACK, OpenBLAS, GotoBLAS, Intel MKL, Eigen, cuBLAS, etc. As GPUs have become mainstream hardware accelerators, the cuBLAS library from NVIDIA has become a major linear algebra library for state-of-the-art deep learning software tools [3]. For example, the SGEMM function in the cuBLAS library running on an NVIDIA K40M card can achieve about 3000 GFLOPS when performing single-precision floating-point matrix multiplication, which is up to 17x faster than the MKL library on an Intel IvyBridge E5-2697v2 CPU @ 2.70GHz [4].

Some recent work has sought to understand and improve the performance of NN operations on GPUs [5]. Considering the complexity of GPU architectures, it is very challenging to design a single algorithm or a single set of kernel configurations that is optimal for all cases; hence auto-tuning has become an attractive approach to choosing the best algorithms or kernel configurations for GPUs [6][7]. However, NT operations have not received much attention from the research community. Our previous work shows that many state-of-the-art deep learning software tools overlook the importance of NT operations and only achieve suboptimal performance for some deep neural networks [3]. In this paper, we first show that the performance of NT operations in cuBLAS is often much lower than that of NN operations on recent GPUs. We then propose a simple method called TNN, which implements the NT operation by first carrying out an efficient out-of-place matrix transpose and then performing an NN operation. In general, TNN outperforms cuBLAS for large matrices, but it is not as efficient as cuBLAS for small matrices. In order to achieve the best average performance, we design an algorithm selection method named MTNN, which can intelligently select the appropriate algorithm to carry out the NT operations based on some GPU architecture information and matrix sizes. Notice that the idea of algorithm selection dates back to 1976 [8] and has become very successful in recent years for choosing the optimal implementation from a set of algorithms [9][10][11]. In order to verify the effectiveness of MTNN, we integrate it into Caffe [12], a popular real-world deep learning platform that relies on cuBLAS to accelerate its NN and NT operations on GPUs. We evaluate the performance of MTNN and the revised Caffe on two modern GPUs: NVIDIA GeForce GTX1080 and Titan X Pascal. The experimental results show that (1) our MTNN solution achieves a 54.03% improvement on average over the NT operation of cuBLAS; and (2) the revised Caffe (source code: https://github.com/hclhkbu/caffe-optimized) achieves a 28% speedup over the original Caffe on the tested GPUs.

The rest of the paper is organized as follows. We present the motivation of this work in Section II, and then introduce the related work in Section III. The TNN method is described in Section IV, followed by our MTNN framework in Section V. Experimental results are presented in Section VI. We conclude the paper and discuss our future work in Section VII.

II Motivation

In deep neural networks, especially fully connected networks [13], matrix-matrix multiplication (i.e., NN operations) and matrix-matrix-transpose multiplication (i.e., NT operations) are the two major computational tasks of the training process. In practice, both types of matrix multiplication are commonly implemented by the SGEMM routine of a BLAS library. The standard SGEMM has the following form:

$C=\alpha\cdot op(A)\times op(B)+\beta\cdot C$

where $op$ represents whether the matrix is transposed or not, and $\alpha$ and $\beta$ are scalars. To simplify the calculation, we ignore the second term and set $\alpha$ to 1. In cuBLAS, the SGEMM API is “cublasSgemm”, in which the second and the third parameters are the values of $op$ for A and B respectively. The value of $op$ can be “CUBLAS_OP_T” (transpose) or “CUBLAS_OP_N” (no transpose). To understand the performance difference between NN and NT operations in cuBLAS, we conduct experiments to evaluate the running time performance of SGEMM for NN and NT operations with different sizes of input matrices. Table I shows the details of our two tested platforms.
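The SGEMM form above can be illustrated with a small reference sketch in plain Python. It only mirrors the arithmetic $C=\alpha\cdot op(A)\times op(B)+\beta\cdot C$ on row-major lists; the function names and the `trans_a`/`trans_b` flags are illustrative stand-ins for the $op$ parameters, not the cuBLAS API, and cuBLAS's column-major storage convention is not modeled.

```python
def transpose(M):
    """Return a new row-major matrix that is the transpose of M."""
    return [list(col) for col in zip(*M)]


def matmul(A, B):
    """Schoolbook product of A (m x k) and B (k x n), O(m*k*n) operations."""
    Bt = transpose(B)  # rows of Bt are the columns of B
    return [[sum(a * b for a, b in zip(row, col)) for col in Bt] for row in A]


def sgemm_reference(alpha, A, B, beta, C, trans_a=False, trans_b=False):
    """Compute alpha * op(A) x op(B) + beta * C, where op is transpose or identity."""
    op_a = transpose(A) if trans_a else A
    op_b = transpose(B) if trans_b else B
    P = matmul(op_a, op_b)
    return [[alpha * p + beta * c for p, c in zip(p_row, c_row)]
            for p_row, c_row in zip(P, C)]


A = [[1.0, 2.0], [3.0, 4.0]]
B = [[5.0, 6.0], [7.0, 8.0]]
Z = [[0.0, 0.0], [0.0, 0.0]]

# alpha = 1 and beta = 0, as in the simplification used in the text.
nn = sgemm_reference(1.0, A, B, 0.0, Z)                # NN: A x B
nt = sgemm_reference(1.0, A, B, 0.0, Z, trans_b=True)  # NT: A x B^T
```

Setting only the flag for B corresponds to the NT operation; leaving both flags unset corresponds to NN.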

We use $P_{algorithm}$ to denote the performance of a specific $algorithm$ in GFLOPS. To illustrate the difference between $P_{NN}$ and $P_{NT}$ , we run experiments for 1000 cases with different matrix sizes and show the distribution of the resulting $P_{NN}/P_{NT}$ in Fig. 1. In most cases, the performance of NN ( $P_{NN}$ ) is much better than that of NT ( $P_{NT}$ ) because NN incurs no matrix transpose overhead. $P_{NN}$ is higher than $P_{NT}$ in 71% and 62% of the cases on GTX1080 and Titan X respectively. More surprisingly, around 20% of the cases have $P_{NN}/P_{NT}\geq 2.0$ on both GPUs. The low performance of NT in cuBLAS may be caused by inefficient memory access to the elements of B. Another possible reason is that cuBLAS uses a slow in-place matrix transpose algorithm to reduce the memory footprint [14]. Motivated by this inefficiency, we propose a method (TNN) for NT operations which computes the transpose of B first and then calls the NN function of cuBLAS to finish the calculation of $A\times B^{T}$ on GPUs. TNN outperforms cuBLAS in most cases, but there still exist cases where cuBLAS outperforms TNN. To this end, we further design an algorithm selection approach that selects an appropriate algorithm from the set {TNN, NT} based on a supervised learning algorithm. Notice that TNN requires that the GPU memory is large enough to store the additional $B^{T}$ . If that is not the case, our framework will simply choose the original NT operation.
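As a rough illustration of how a GFLOPS figure such as $P_{NN}$ or $P_{NT}$ can be derived from a measured running time, the sketch below uses the common convention of counting $2\times m\times n\times k$ floating-point operations per matrix multiplication. This convention is an assumption here; the text does not state how its GFLOPS values were computed, and the function names are hypothetical.

```python
def gemm_gflops(m, n, k, seconds):
    """GFLOPS of an (m x k) by (k x n) product, assuming 2*m*n*k operations."""
    flops = 2.0 * m * n * k  # one multiply and one add per inner-product term
    return flops / seconds / 1e9


def performance_ratio(t_nn, t_nt):
    """P_NN / P_NT for the same (m, n, k); the operation count cancels out."""
    return t_nt / t_nn


# Hypothetical timings for one case: NN takes 10 ms, NT takes 25 ms.
ratio = performance_ratio(0.010, 0.025)
```

Because both operations perform the same arithmetic for a given $(m,n,k)$, the ratio $P_{NN}/P_{NT}$ equals the ratio of running times $T_{NT}/T_{NN}$.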

III Related Work

The SGEMM algorithm in cuBLAS has been intensively optimized on GPUs through kernel optimizations [5][15][16][17] and auto-tuning algorithms [18][6][7]. Information about GPU memory access latency at different levels [15] and about instruction computation [5] is extracted to help increase the parallelism of GPU kernels, which can achieve excellent performance close to the theoretical hardware capacity based on the block-based matrix-matrix multiplication algorithm. Targeting DGEMM (GEMM in double precision) on Fermi GPUs, R. Nath et al. [16] propose a double blocking algorithm to reduce the impact of latency in accessing registers and shared memory, which achieves up to 58% of the peak performance. Even with a well-designed GPU kernel, differences among GPUs can require different configurations to obtain the best performance. Instead of conducting detailed kernel analysis, auto-tuning methods have been investigated to select the optimal configuration to achieve better kernel performance [18][6][7].

However, little work has been done to evaluate the performance of the NT operations. Since $B_{ji}^{T}=B_{ij}$ , we can perform NT by changing the access of a row to the corresponding column of matrix B with SGEMM routine. However, it might cause extra latencies due to uncoalesced global memory access and conflicted shared memory access when fetching the column elements of matrix B. The kernel optimization of NT is challenging because its performance depends not only on the GPU architecture, but also on the input matrix size. Therefore, instead of optimizing the kernel algorithm, we first propose a simple approach called TNN as an alternative to SGEMM. We notice that TNN can significantly outperform SGEMM in many cases, but sometimes its performance could be worse than SGEMM. To this end, we formulate an algorithm selection problem in order to select the appropriate algorithm for each NT operation.

Machine learning approaches have become useful for choosing more efficient algorithms with high accuracy [9][11][10]. Spillinger et al. [9] use an SVM model [19] to predict, at runtime, the better of two matrix multiplication implementations (MKL and CARMA) on three different CPU platforms, which achieves about 26% performance improvement on average. Besides the SVM models that have been applied to algorithm selection problems [9][11], a decision tree classifier has also been used for the automatic selection of sparse matrix representations on GPUs, and it obtains no more than 1.05x average slowdown compared to the existing ideal approach [11]. In this paper, we make use of machine learning techniques to choose the more efficient algorithm between our proposed TNN and the original cuBLAS implementation to improve the average performance in calculating $C=A\times B^{T}$ .

IV TNN: transpose before multiply

As we already show in Fig. 1, directly calculating $C=A\times B^{T}$ is usually inefficient. We propose a simple TNN method which replaces the one-step NT operation by two-step operation, i.e., transpose B first and then make use of NN. The overall performance can be improved if $T_{TNN}=T_{transpose}+T_{NN}<T_{NT}$ , where $T_{algorithm}$ is the computation time of $algorithm$ . Note that $T_{transpose}$ includes the time of GPU memory allocation and release.

Matrix transpose is a memory-bound operation [20]. There are two very different ways to perform matrix transpose: in-place and out-of-place. The in-place matrix transpose algorithm does not require extra memory space. However, in-place matrix transposition can be factored as a product of disjoint cycles [21]; for rectangular matrices the number of cycles can be much lower and their lengths are not uniform, which makes parallelization difficult [14]. The state-of-the-art implementation of in-place matrix transposition achieves only 51.56 GB/s and 22.74 GB/s on GTX 980 (with a peak memory bandwidth of 224 GB/s) and Tesla K20 (with a peak memory bandwidth of 208 GB/s) respectively with single precision [14].

By contrast, out-of-place matrix transposition can exploit the GPU shared memory to achieve an efficient utilization of GPU memory bandwidth. In [20], the optimized transpose kernel achieves up to 80% of peak bandwidth on the tested GPUs, which is much higher than the in-place algorithm. Therefore, when the remaining GPU memory is large enough to store $\textit{B}^{T}$ , we choose the out-of-place transpose routine to implement our TNN algorithm. The pseudo-code of TNN is shown in Algorithm 1. Since TNN requires an additional transpose operation on the GPU, the time used by the transpose operation ( $T_{transpose}(n,k)$ ) must be smaller than the difference between NT ( $T_{NT}(m,n,k)$ ) and NN ( $T_{NN}(m,n,k)$ ). In other words, to guarantee that $T_{TNN}(m,n,k)$ is smaller than $T_{NT}(m,n,k)$ , we have:

$T_{transpose}(n,k)<T_{NT}(m,n,k)-T_{NN}(m,n,k)$ (3)
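The two-step TNN idea and the timing condition above can be sketched in plain Python. This is a CPU-side illustration of the data flow only (an explicit out-of-place transpose of B followed by an NN product), not the GPU kernel or the authors' Algorithm 1; all function names are hypothetical.

```python
def transpose_out_of_place(B):
    """Write B^T (k x n) into newly allocated storage, leaving B (n x k) intact."""
    return [list(col) for col in zip(*B)]


def nn_multiply(A, B):
    """NN operation: A (m x k) times B (k x n), no transposes involved."""
    k, n = len(B), len(B[0])
    return [[sum(row[p] * B[p][j] for p in range(k)) for j in range(n)]
            for row in A]


def tnn(A, B):
    """NT operation C = A x B^T computed as transpose-then-NN."""
    Bt = transpose_out_of_place(B)  # needs extra memory for n*k elements
    return nn_multiply(A, Bt)


def tnn_beats_nt(t_transpose, t_nn, t_nt):
    """Timing condition: TNN wins only if the transpose cost is below T_NT - T_NN."""
    return t_transpose < t_nt - t_nn


A = [[1, 2, 3]]             # m x k = 1 x 3
B = [[1, 0, 1], [0, 1, 0]]  # n x k = 2 x 3
C = tnn(A, B)               # m x n = 1 x 2
```

With measured times, `tnn_beats_nt` returns False whenever NT is already as fast as NN, since the right-hand side is then zero or negative.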

However, the performance of the transpose operation is highly affected by the hardware platform and the size of the matrix. It is difficult to guarantee Equation 3 in practice because there do exist cases in which the difference between $T_{NT}(m,n,k)$ and $T_{NN}(m,n,k)$ is small or even $T_{NT}<T_{NN}$ , such as the cases with $P_{NN}/P_{NT}=1.1$ . We show the experimental results of NT and TNN in Fig. 2 and Fig. 3. $M$ is the height of matrix A, $N$ is the height of matrix B, and $K$ is the width of A and B. In Fig. 2, both the x-axis and the y-axis use a $log_{2}$ scale, i.e., the values of $M$ and $N$ vary from $2^{7}$ to $2^{16}$ . The value of $K$ is also chosen from $2^{7}$ to $2^{16}$ , which forms a total of 1000 cases. To show the detailed visual results, we display all values of $K$ in Fig. 2 with various values of $M$ and $N$ . In this figure, a red rectangle indicates that the performance of NT is better than TNN; a green circle indicates that the performance of NT is worse than TNN; and a blue dash indicates that the performances of NT and TNN are equal. The size of a symbol is determined by the value of $P_{NT}/P_{TNN}$ or $P_{TNN}/P_{NT}$ : a larger symbol indicates a higher ratio.

From Fig. 2, it is noticed that there are some cases that NT outperforms the TNN method, especially when the value of $K$ is small (e.g., there are up to half of the cases that NT is better than TNN when $K$ is 128 on both GPUs). Among all the tested cases, the maximum speedup of TNN over NT is 4.7x, whilst the maximum speedup of NT over TNN is 15.39x. From Fig. 3, it is easy to see that there is a great portion of cases (about 41.5% on GTX1080 and 43% on TitanX) that are located in the left side of $P_{TNN}/P_{NT}=1.0$ .

Therefore, to perform faster calculations of $C=A\times B^{T}$ , we should choose the NT algorithm and the TNN algorithm appropriately.

V MTNN: a Supervised Learning Based Algorithm Selection Method

In this section, we first formulate the algorithm selection problem as a classification problem for two given input matrix sizes and a specific GPU platform. Let class $-1$ denote $P_{TNN}>P_{NT}$ and class $1$ denote $P_{TNN}\leq P_{NT}$ . Given a GPU platform $G$ , the size of matrix A ( $m\times k$ ) and the size of matrix B ( $n\times k$ ), there exists a function:

$f:(G,m,n,k)\mapsto\{-1,1\}$

We need to learn a function $\hat{f}$ such that:

$\hat{f}=\arg\min\sum_{(G,m,n,k)\in\Omega}\|\hat{f}(G,m,n,k)-f(G,m,n,k)\|$

The learning of function $\hat{f}$ can be regarded as a binary classification problem. There are 4 main steps of our supervised-learning based method MTNN. First, we need to construct the training and testing data sets with proper preprocessing of data by benchmarking the performance of NT and TNN. Second, we learn a decision model (i.e., $\hat{f})$ from training samples with supervised machine learning algorithms. Third, we evaluate the learned model on the testing data set. Lastly, we apply the trained model to predict the better implementation (i.e., NT or TNN) in calculating $C=A\times B^{T}$ .

V-A Data Collection

According to the results in Fig. 1, we choose a range of matrices with sizes in $S=\{2^{i}|i=7,8,...,16\}$ . In other words, for all $m$ , $n$ and $k$ ( $m\in S,n\in S,k\in S$ ), which gives 1000 combinations, we measure the performance of NT and TNN in calculating $C=A\times B^{T}$ . Let $P_{NT}(m,n,k)$ and $P_{TNN}(m,n,k)$ denote the performance of NT and TNN respectively with two matrices A and B, where $\textit{A}\in R^{m\times k}$ and $\textit{B}\in R^{n\times k}$ . The difference $P_{NT}(m,n,k)-P_{TNN}(m,n,k)$ is denoted by $D(m,n,k)$ . If $D(m,n,k)\geq 0$ , then $label=1$ , otherwise $label=-1$ . Each record has the following format:

( $m$ , $n$ , $k$ ), $label$

For each type of GPU, 1000 cases are tested, but samples that do not fit into GPU memory are excluded from the evaluation. The number of valid samples on each GPU is therefore less than 1000, and the sample distribution is shown in Table II.
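The enumeration and labeling steps described above can be sketched as follows. The labeling rule follows the definition of $D(m,n,k)$ in the text; the memory filter is an assumption made for illustration (it counts single-precision storage for A, B, $B^{T}$ and C), since the text does not give the exact criterion used to exclude samples. All names are hypothetical.

```python
from itertools import product

# S = {2^i | i = 7, ..., 16}
SIZES = [2 ** i for i in range(7, 17)]


def label_for(p_nt, p_tnn):
    """Return 1 if NT is at least as fast as TNN (D >= 0), otherwise -1."""
    d = p_nt - p_tnn
    return 1 if d >= 0 else -1


def fits_in_memory(m, n, k, mem_bytes, bytes_per_elem=4):
    """Assumed filter: A (m x k), B (n x k), B^T (k x n) and C (m x n) must fit."""
    needed = (m * k + 2 * n * k + m * n) * bytes_per_elem
    return needed <= mem_bytes


def candidate_cases(mem_bytes):
    """All (m, n, k) combinations from S that pass the assumed memory filter."""
    return [(m, n, k) for m, n, k in product(SIZES, repeat=3)
            if fits_in_memory(m, n, k, mem_bytes)]


all_cases = list(product(SIZES, repeat=3))  # 10 * 10 * 10 combinations
valid_8gb = candidate_cases(8 * 1024 ** 3)  # 8 GB card, as in Table III
```

Each valid case would then be benchmarked with both NT and TNN, and `label_for` applied to the two measured performances to produce the record label.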

Besides the variety of input size of matrices, the GPU platform can also be different. Thus, we need to extract the features to represent different GPUs. The details of tested GPUs are shown in Table III, which are used as input features of the GPU platform.

Combined with the GPU characteristics in Table III, the input sample x is formed as an 8-dimensional vector (5 dimensions from the GPU specification and 3 dimensions from the matrix sizes). The first 5 dimensions are global memory ( $gm$ ), the number of SMs ( $sm$ ), core clock ( $cc$ ), memory bus width ( $mbw$ ) and the size of the L2 cache ( $l2c$ ). Note that feature generation is an $O(1)$ computation, which is crucial for reducing the overhead of using the predictor at runtime. The final format of input sample x is as follows:

( $gm,sm,cc,mbw,l2c,m,n,k$ ), $label$
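A minimal sketch of assembling the 8-dimensional input from the values listed in Table III is shown below. The dictionary layout and function name are illustrative choices, not taken from the authors' code; the numbers are copied from Table III.

```python
# GPU characteristics from Table III, keyed by the feature names used in the text.
GPU_FEATURES = {
    "GTX1080": {"gm": 8, "sm": 20, "cc": 1607, "mbw": 256, "l2c": 2048},
    "TitanX": {"gm": 10, "sm": 28, "cc": 1417, "mbw": 384, "l2c": 3072},
}

FEATURE_ORDER = ("gm", "sm", "cc", "mbw", "l2c")


def make_sample(gpu, m, n, k):
    """Build (gm, sm, cc, mbw, l2c, m, n, k) with a constant number of lookups."""
    spec = GPU_FEATURES[gpu]
    return [spec[name] for name in FEATURE_ORDER] + [m, n, k]


x = make_sample("GTX1080", 4096, 4096, 128)
```

The work done per sample is a fixed number of dictionary lookups, which matches the $O(1)$ feature generation cost noted in the text.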

With decision tree learning algorithms, the input features do not need to be normalized. By contrast, each dimension of the input feature should be normalized to the range of (0, 1) when training SVMs.

V-B Model Training

Given the training set: $\textbf{S}=\{\textbf{x}|\textbf{x}=(G,m,n,k)\}$ , where $G$ is the feature combination in Table III, we want to learn function: $\hat{f}$ , where

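$\hat{f}(\textbf{x})=\begin{cases}-1,&P_{NT}(\textbf{x})<P_{TNN}(\textbf{x})\\+1,&P_{NT}(\textbf{x})\geq P_{TNN}(\textbf{x})\end{cases}$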

If $\hat{f}(\textbf{x})=-1$ , then we choose TNN, otherwise we choose NT.

Learning Algorithm. SVM [22] is a powerful learning algorithm for solving classification problems, and it has been successfully applied to algorithm selection problems related to matrix-matrix multiplication [9][10]. Another powerful learning algorithm, the decision tree (DT), is also widely used for automatic best-algorithm selection [11], and it has an extension named gradient boosted decision tree (GBDT) [23][24].

In this paper, we choose GBDT as our learning algorithm for three main reasons:

1. It does not require input feature normalization, since the decision tree is a recursive partitioning based algorithm, which reduces the overhead of feature preprocessing at runtime.

2. Among 10 popular supervised learning algorithms, boosted decision trees outperform the other algorithms, including SVM and the traditional decision tree, on a variety of tested data sets [25].

3. The prediction time complexity is acceptable, namely $O(h)$ , where $h$ is the depth of the trained decision tree and can be restricted to a fixed value.
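The $O(h)$ prediction cost can be seen in a toy sketch of boosted-tree inference: each tree is walked from the root to one leaf, and the leaf values are summed. The tree structure, thresholds and leaf values below are invented purely for illustration and are not the trained MTNN model; the sign-of-margin decision rule is likewise a simplifying assumption.

```python
def walk_tree(node, x):
    """Follow one root-to-leaf path; the number of comparisons is at most the depth."""
    while "leaf" not in node:
        branch = "left" if x[node["feature"]] < node["threshold"] else "right"
        node = node[branch]
    return node["leaf"]


def gbdt_predict(trees, x):
    """Sum the leaf values of all trees and map the margin to a class in {-1, 1}."""
    margin = sum(walk_tree(tree, x) for tree in trees)
    return 1 if margin > 0 else -1


# Two hypothetical depth-1 trees over the sample (gm, sm, cc, mbw, l2c, m, n, k).
TOY_TREES = [
    {"feature": 7, "threshold": 256,   # split on k
     "left": {"leaf": 0.8}, "right": {"leaf": -0.6}},
    {"feature": 5, "threshold": 1024,  # split on m
     "left": {"leaf": 0.3}, "right": {"leaf": -0.4}},
]

small_k = [8, 20, 1607, 256, 2048, 4096, 4096, 128]
large_k = [8, 20, 1607, 256, 2048, 4096, 4096, 4096]
```

No feature scaling appears anywhere in the walk, since each node only compares one raw feature against a threshold.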

There are several decision tree learning algorithms (e.g., ID3 [26], C4.5 [27] and CART [28]), and CART can be more competitive than the others in some cases [29]. So we choose CART as our model training algorithm, and we use XGBoost [24], an implementation of the gradient boosting framework that is flexible, portable and highly efficient.

Parameter Configuration. We need to balance two considerations when setting the parameters. On one hand, the decision tree should not be too deep, otherwise it will increase the overhead of the predictor at runtime. On the other hand, the parameters must be set such that the prediction accuracy is high enough. In this paper, we set the maximum depth of the decision tree to 8, and the number of estimators for boosting is also 8. We set the step size shrinkage ( $eta$ ) to 1 and the minimum loss reduction ( $gamma$ ) to 0, which makes the boosting algorithm more aggressive.
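The stated settings can be written down as an XGBoost-style parameter dictionary. The `max_depth`, `eta` and `gamma` keys are standard XGBoost parameter names; the `objective` entry is an assumption (the text does not name the objective), and with `binary:logistic` the labels $-1$ and $1$ would need to be mapped to 0 and 1 before training.

```python
# Parameter values stated in the text; "objective" is an assumed choice.
XGB_PARAMS = {
    "max_depth": 8,                  # maximum depth of each tree
    "eta": 1.0,                      # step size shrinkage (1 means no shrinkage)
    "gamma": 0.0,                    # minimum loss reduction required to split
    "objective": "binary:logistic",  # assumption, not stated in the text
}
NUM_BOOST_ROUND = 8                  # number of boosted trees (estimators)

# Upper bound on feature comparisons per prediction: one path per tree.
MAX_COMPARISONS = NUM_BOOST_ROUND * XGB_PARAMS["max_depth"]

# With the xgboost package installed, these could be passed as, for example:
#   booster = xgboost.train(XGB_PARAMS, dtrain, num_boost_round=NUM_BOOST_ROUND)
# where dtrain is an xgboost.DMatrix built from the 8-dimensional samples.
```

Bounding both the depth and the number of trees at 8 caps a prediction at 64 threshold comparisons, which is consistent with the low prediction overhead the text aims for.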

Training. Instead of training a separate model for each GPU, we want the model to be robust across different GPU hardware, so we feed all input features (an 8-dimensional vector, including 5 GPU characteristics) into a single model. We randomly split the data set into a training set (80%) and a testing set (20%). Note that the training set contains 80% of the samples from each GPU, and the remainder is used as the testing set. To validate whether the chosen model generalizes on our data set, we perform 5-fold cross-validation. After the cross-validation, the whole data set is used as training data to learn the final model that can be put into real-world applications.
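The per-GPU 80/20 split and the 5-fold partitioning described above can be sketched with the standard library alone. The sample tuple layout, seeds and function names are assumptions for illustration; the authors' exact splitting procedure is not given in the text.

```python
import random


def stratified_split(samples, train_fraction=0.8, seed=0):
    """Split so that each GPU contributes the same fraction to the training set.

    samples: list of (gpu_name, features, label) tuples.
    """
    rng = random.Random(seed)
    by_gpu = {}
    for sample in samples:
        by_gpu.setdefault(sample[0], []).append(sample)
    train, test = [], []
    for gpu in sorted(by_gpu):
        group = by_gpu[gpu][:]
        rng.shuffle(group)
        cut = int(round(train_fraction * len(group)))
        train.extend(group[:cut])
        test.extend(group[cut:])
    return train, test


def k_fold_indices(n, k=5, seed=0):
    """Return k (train_indices, test_indices) pairs that partition range(n)."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    return [(sorted(set(idx) - set(fold)), sorted(fold)) for fold in folds]


# Hypothetical toy data: 10 samples from one GPU, 20 from the other.
toy = [("GTX1080", i, 1) for i in range(10)] + [("TitanX", i, -1) for i in range(20)]
train, test = stratified_split(toy)
folds = k_fold_indices(len(toy), k=5)
```

Each fold holds out one fifth of the samples for testing, so every sample is used for testing exactly once across the five rounds.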

Integration. We use the learned model as our predictor of the selection system to choose the better algorithm between NT and TNN. After the model has been well trained, the final algorithm in calculating $C=A\times B^{T}$ is derived, and we call it MTNN. The pseudo-code of MTNN is shown in Algorithm 2.
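The selection logic can be sketched as a small dispatcher: fall back to NT when there is not enough free memory for $B^{T}$ , otherwise ask the predictor and run the chosen implementation. This is an illustrative CPU-side sketch, not the authors' Algorithm 2; the predictor is any callable returning $-1$ or $1$, and `free_bytes` is a hypothetical stand-in for a query of available GPU memory.

```python
def nt_direct(A, B):
    """C = A x B^T computed directly: C[i][j] = dot(row i of A, row j of B)."""
    return [[sum(a * b for a, b in zip(ra, rb)) for rb in B] for ra in A]


def tnn(A, B):
    """C = A x B^T via an explicit out-of-place transpose followed by NN."""
    Bt = [list(col) for col in zip(*B)]  # k x n copy of B^T
    k, n = len(Bt), len(Bt[0])
    return [[sum(ra[p] * Bt[p][j] for p in range(k)) for j in range(n)]
            for ra in A]


def mtnn(A, B, predictor, gpu_features, free_bytes,
         nt_impl=nt_direct, tnn_impl=tnn, bytes_per_elem=4):
    """Select NT or TNN for C = A x B^T using a learned predictor."""
    m, k = len(A), len(A[0])
    n = len(B)
    # TNN needs room for the extra n x k copy of B^T; otherwise use NT.
    if n * k * bytes_per_elem > free_bytes:
        return nt_impl(A, B)
    x = list(gpu_features) + [m, n, k]
    # Class -1 means TNN is predicted to be faster; class 1 means NT.
    return tnn_impl(A, B) if predictor(x) == -1 else nt_impl(A, B)


GTX1080 = [8, 20, 1607, 256, 2048]
A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
result = mtnn(A, B, lambda x: -1, GTX1080, free_bytes=10 ** 9)
```

Both branches return the same matrix, so a wrong prediction affects only running time, never correctness.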

VI Evaluation

We first evaluate the accuracy of the predictor, i.e., the performance of the classifier, and then present the overall performance improvement obtained with the trained predictor (i.e., the performance of MTNN), which shows how well the selection system works.

VI-A Performance of Classification

To evaluate the performance of the classification algorithm, we use classification accuracy as the metric. The average accuracy of our 5-fold cross-validation is 90.51%, which means that the predictor selects the faster algorithm for calculating $C=A\times B^{T}$ in 90.51% of cases. Since the testing data set is imbalanced, with more negative samples than positive samples, the accuracies of the negative and the positive classes are both recorded. Table IV shows the details of the accuracy of the 5-fold cross-validation.
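Per-class and total accuracy, as reported for the negative and positive classes, can be computed as in the sketch below. The function name and the toy label lists are hypothetical and are not the experimental data.

```python
def per_class_accuracy(y_true, y_pred):
    """Return accuracy for each true class plus the overall accuracy."""
    stats = {}
    for t, p in zip(y_true, y_pred):
        correct, total = stats.get(t, (0, 0))
        stats[t] = (correct + (t == p), total + 1)
    accuracy = {label: c / n for label, (c, n) in stats.items()}
    accuracy["total"] = sum(c for c, _ in stats.values()) / len(y_true)
    return accuracy


# Toy example: three negative samples and two positive samples.
y_true = [-1, -1, -1, 1, 1]
y_pred = [-1, -1, 1, 1, -1]
acc = per_class_accuracy(y_true, y_pred)
```

Reporting both classes separately guards against an imbalanced test set hiding poor accuracy on the minority class behind a high overall figure.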

We also make a comparison with SVM algorithms, including the radial basis function kernel (SVM-RBF) and the polynomial kernel (SVM-Poly), both of which are commonly used in supervised machine learning. We use libSVM [30], a widely used tool, as the SVM implementation. The parameters for SVM are $C=1000.0$ and $gamma=0.01$ , and the input feature is normalized to the range of (0, 1). The traditional decision tree (DT) learning algorithm is also included in the comparison to show that GBDT performs better in terms of accuracy and running efficiency. In the tested experimental environment (Table V) for learning algorithms, the performance of the classifiers is shown in Table VI.

From Table VI, in terms of prediction accuracy, GBDT (90.51%) is better than both SVMs (81.66% and 77.68%) and DT (87.84%). Regarding training and prediction efficiency, GBDT outperforms both types of SVMs. Even though the prediction time of GBDT is slightly longer than that of DT, it is negligible (only 0.005 ms) compared with the computation time of matrix-matrix multiplication.

Before putting the model into the MTNN algorithm, a high-accuracy model should be trained with suitable parameters and training samples. The 5-fold cross-validation has verified that our model is acceptable with 80% of the samples used for training, but the question remains how many training samples are needed for the model to converge well so that MTNN has a higher prediction accuracy. We vary the size of the training data set to determine how many samples are needed to train a high-accuracy predictor. From all the 1832 samples, $x$ percent are selected as the training data set, and all samples are used as the testing data set, where $x$ ranges from 10 to 100 with a step size of 5. The accuracy for different sizes of the training data set is shown in Fig. 4. It displays a trend of higher accuracy with a larger training data set.

VI-B Performance of Selection

In this section, we show how much performance is improved by using the MTNN algorithm, which integrates the trained predictor. In MTNN, the integrated predictor is trained on the whole data set rather than on just 80% of it, because more data generally yields higher accuracy. As we can see from Fig. 4, with 100% of the data as the training set, the GBDT predictor achieves 96.39% classification accuracy, which means the selection system correctly chooses the better algorithm between NT and TNN in 96.39% of cases.

Before presenting the statistical results of MTNN compared to NT and TNN, a visual comparison between MTNN and NT on our tested GPUs is shown in Fig. 5. Compared to Fig. 2, the red rectangles, which indicate that NT performs better, are reduced to a very small portion by the MTNN method. In other words, in most cases, the performance of MTNN is better than or equal to NT; and only in a minority of cases is the performance of MTNN worse than NT. The frequency distribution of the performance of MTNN over NT is shown in Fig. 6. The portion of the cases in which MTNN outperforms NT is 47.81% on GTX1080 and 43.35% on TitanX. It shows that there is further optimization space for the matrix-matrix-transpose multiplication algorithm on Pascal GPUs. In Fig. 2, the maximum value of $P_{NT}/P_{TNN}$ is 15.394, while Fig. 5 shows that the maximum of $P_{NT}/P_{MTNN}$ is only about 1.6.

Similar to the work in [9], and to make further comparisons in a statistical way, we use $GOW$ (Gain Over Worst) to denote the gain in performance of MTNN over the worst algorithm at each sample. $GOW$ is calculated by:

[TABLE]

Let $LUB$ (Loss Under Best) denote the percentage loss of MTNN under the best algorithm for each sample, which is calculated by:

[TABLE]
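One reading of the two definitions, assuming that $P$ denotes performance (higher is better) and that each metric is normalized by its reference algorithm, is $GOW = \frac{P_{MTNN} - \min(P_{NT}, P_{TNN})}{\min(P_{NT}, P_{TNN})} \times 100\%$ and $LUB = \frac{\max(P_{NT}, P_{TNN}) - P_{MTNN}}{\max(P_{NT}, P_{TNN})} \times 100\%$. The normalization is an assumption. For $GOW$ it agrees with the figures reported in this section: a performance ratio of 15.394 corresponds to a gain of about 1439.4%. The sketch below computes both metrics on hypothetical throughput values.

```python
def gow(p_mtnn, p_nt, p_tnn):
    """Percent gain of MTNN over the worse of NT and TNN (assumed form)."""
    worst = min(p_nt, p_tnn)
    return (p_mtnn - worst) / worst * 100.0

def lub(p_mtnn, p_nt, p_tnn):
    """Percent loss of MTNN under the better of NT and TNN (assumed form)."""
    best = max(p_nt, p_tnn)
    return (best - p_mtnn) / best * 100.0

# Hypothetical throughputs (for example in GFLOPS), for illustration only.
# Sample 1: the predictor picks the faster routine (TNN).
g1 = gow(p_mtnn=3000.0, p_nt=2000.0, p_tnn=3000.0)
l1 = lub(p_mtnn=3000.0, p_nt=2000.0, p_tnn=3000.0)
# Sample 2: the predictor picks the slower routine (NT).
g2 = gow(p_mtnn=2000.0, p_nt=2000.0, p_tnn=2500.0)
l2 = lub(p_mtnn=2000.0, p_nt=2000.0, p_tnn=2500.0)

print(g1, l1)  # gain over the worst, no loss under the best
print(g2, l2)  # no gain over the worst, some loss under the best
```

With these definitions a correct selection gives $LUB = 0$ and a wrong selection gives $GOW = 0$, so the two metrics separate the benefit of right decisions from the cost of wrong ones.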

We define several metrics to measure the performance of MTNN compared to NT and TNN. The metrics are described in Table VII, and the corresponding values are shown in Table VIII.

From Table VIII, we can see that MTNN achieves a 54.03% performance improvement on average compared to using the NT algorithm only, and 21.92% compared to TNN. Compared to the worse of NT and TNN, MTNN achieves a 76.23% performance improvement on average and up to 1439.39% in particular cases. There are some cases in which the predictor makes the wrong decision, but the resulting slowdown is only about 0.28%. In other words, compared to the better of NT and TNN, the performance of MTNN is only 0.28% worse when the predictor chooses the lower-performance algorithm. Between these two GPUs, the speedup on the GTX1080 card is slightly higher than that on the TitanX card.

VI-C Evaluation with Caffe

To test the performance of MTNN in a real-world application, we integrate the MTNN algorithm into Caffe [12], which is one of the most popular deep learning frameworks. We choose two types of fully connected networks: one uses the MNIST data set, whose input and output dimensions are small, and the other uses a synthetic data set, whose input and output dimensions are large. For each type of fully connected network, we configure 2, 3 and 4 hidden layers. The configuration details of the neural networks are shown in Table IX. The performance comparisons of these two types of networks running on the original version of Caffe (CaffeNT) and on Caffe with MTNN (CaffeMTNN) are displayed in Fig. 7 and Fig. 8, respectively.

By integrating our method into Caffe, the optimized Caffe achieves a slight improvement of 1.74% with the MNIST data set, while the improvement is as much as 28.2% with the synthetic data set.

On one hand, Fig. 7 shows that the training speeds of CaffeNT and CaffeMTNN are very close for all the mini-batch sizes. The main reason is that, given the numbers of neurons in two adjacent layers (e.g., $l1$ and $l2$ ) and the mini-batch size ( $mb$ ), the size of the matrix-matrix-transpose multiplication is determined by $l1$ , $l2$ and $mb$ . If the values of $l1$ , $l2$ and $mb$ are too small, TNN has no advantage over the original NT of cuBLAS, which can be explained with the performance comparison in Fig. 5 (there are many dash symbols on the bottom-left side of the figure, so MTNN can only be on par with NT of cuBLAS). There is one particular case in which MTNN is slightly worse than NT of cuBLAS: a mini-batch size of 4096 in the network with 3 hidden layers on TitanX. The reason for this minor slowdown is that the predictor makes a wrong prediction, which happens only with a very small probability since the accuracy of the predictor is up to 96%.
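To make the dependence on $l1$ , $l2$ and $mb$ concrete, the sketch below lists the shape of the forward product for each pair of adjacent layers. It assumes that the weight matrix between two layers is stored as $l2 \times l1$ , so the forward pass multiplies an $mb \times l1$ input by the transpose of the weights. The layer sizes are hypothetical and are not the configurations of Table IX.

```python
def nt_shapes(layer_sizes, mb):
    """For each adjacent layer pair (l1, l2), return the (m, n, k) shape of
    X (mb x l1) times W^T, where W is assumed to be stored as l2 x l1."""
    return [(mb, l2, l1) for l1, l2 in zip(layer_sizes, layer_sizes[1:])]

def flops(shape):
    """Multiply-add count of one product, counted as 2 * m * n * k."""
    m, n, k = shape
    return 2 * m * n * k

# Hypothetical layer sizes for illustration only.
small_net = [784, 512, 512, 10]
shapes = nt_shapes(small_net, mb=64)
for shape in shapes:
    print(shape, flops(shape))
```

Each tuple is one point in the $(m, n, k)$ space that the predictor classifies, which is why small layers and small mini-batches land in the region where NT and TNN perform alike.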

On the other hand, Fig. 8 shows that with the larger neural network (the input size and the output size are both 27652 in our tested case) and larger mini-batch sizes (larger than 512), the speedup of CaffeMTNN is significant. In this setting, the matrix-matrix-transpose multiplication maps to the cases in the top-right side of Fig. 5, which contains numerous green circles, meaning that the deep neural networks can benefit from the higher-performance algorithm selected by MTNN.

The matrix-matrix-transpose multiplication only impacts either the forward propagation or the backward propagation during the training of deep neural networks. To demonstrate which phase benefits from the MTNN method, we break down the running time of one mini-batch into the forward phase and the backward phase in the experimental results. Instead of showing all the tested cases separately, we show the statistical results with different data sets on different GPUs by averaging over all the mini-batch sizes and layers. The results are shown in Table X. The running time of the backward propagation is almost the same in all the cases, so the speedup of the training process comes mainly from the forward phase. With the MNIST data set, whose network size is small, CaffeMTNN is on par with CaffeNT. With the synthetic data set, whose network size is large, the speedup of the forward propagation of CaffeMTNN is significant: it obtains as much as 2.44x and 2.15x speedups compared to CaffeNT on GTX1080 and TitanX, respectively.
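Since only the forward phase gets faster, the overall gain per mini-batch is smaller than the forward speedup. The sketch below shows the arithmetic with hypothetical timings. The values are chosen only for illustration and are not taken from Table X.

```python
def overall_speedup(t_forward, t_backward, forward_speedup):
    """Speedup of one training iteration when only the forward phase is
    accelerated and the backward phase is unchanged."""
    old_time = t_forward + t_backward
    new_time = t_forward / forward_speedup + t_backward
    return old_time / new_time

# Hypothetical timings in milliseconds per mini-batch.
s = overall_speedup(t_forward=40.0, t_backward=60.0, forward_speedup=2.0)
print(f"overall speedup: {s:.2f}x")
```

In this example a 2x faster forward phase gives a 1.25x faster iteration, because the unchanged backward phase accounts for most of the remaining time.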

VII Conclusion and Future Work

In this paper, we first identify the low performance of cuBLAS in calculating the matrix-matrix-transpose multiplication compared to the matrix-matrix multiplication by benchmarking a variety of cases. To accelerate the matrix-matrix-transpose multiplication, we propose a simple solution (named TNN), which first carries out an efficient out-of-place transpose and then makes use of the high-performance matrix-matrix multiplication algorithm. TNN achieves some performance improvement, but in some cases it is even less efficient than NT. In order to obtain the best average performance, we design a supervised learning based algorithm (named MTNN), which makes an intelligent choice of the proper algorithm for the matrix-matrix-transpose multiplication. Using the gradient boosted decision tree learning algorithm, MTNN selects the faster routine for the matrix-matrix-transpose multiplication with an accuracy of 96%. We evaluate the performance of our algorithm on two modern GPUs (GTX1080 and Titan X Pascal). The experimental results show that the MTNN method achieves a 54.03% performance improvement compared to cuBLAS. To verify the effectiveness of MTNN in a real-world application, we integrate MTNN into a popular deep learning framework, Caffe, and the optimized Caffe obtains an average improvement of 28% on fully connected networks.

The transpose algorithm used in this paper is an out-of-place method, which requires extra memory to store the transpose of the matrix. The selection system cannot be used if the GPU card does not have enough memory. Therefore, we plan to explore in-place matrix transpose algorithms to find a good trade-off between memory overhead and throughput.
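The size of this extra buffer is easy to estimate. The sketch below assumes single-precision elements (4 bytes each) and uses a hypothetical 27652 x 27652 matrix as an example. The shape is chosen to match the network input size quoted earlier and is not a measured configuration.

```python
def transpose_overhead_bytes(rows, cols, bytes_per_element=4):
    """Extra memory needed to hold an out-of-place transpose."""
    return rows * cols * bytes_per_element

# Hypothetical matrix shape, single precision assumed.
overhead = transpose_overhead_bytes(27652, 27652)
overhead_gib = overhead / 2**30
print(f"{overhead} bytes = {overhead_gib:.2f} GiB")
```

A buffer of this size takes a considerable share of the memory of the GPUs listed in Table I, which illustrates why an in-place transpose is attractive for large layers.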

Bibliography

[1] Y. LeCun et al., “Lenet-5, convolutional neural networks,” URL: http://yann.lecun.com/exdb/lenet, 2015.
[2] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with deep convolutional neural networks,” in Advances in neural information processing systems, 2012, pp. 1097–1105.
[3] S. Shi, Q. Wang, P. Xu, and X. Chu, “Benchmarking state-of-the-art deep learning software tools,” arXiv preprint arXiv:1608.07249, 2016.
[4] NVIDIA, “cublas — nvidia,” https://developer.nvidia.com/cublas, 2017, accessed: 2017-02-20.
[5] J. Lai and A. Seznec, “Performance upper bound analysis and optimization of sgemm on fermi and kepler gpus,” in Code Generation and Optimization (CGO), 2013 IEEE/ACM International Symposium on. IEEE, 2013, pp. 1–10.
[6] J. Kurzak, S. Tomov, and J. Dongarra, “Autotuning gemm kernels for the fermi gpu,” IEEE Transactions on Parallel and Distributed Systems, vol. 23, no. 11, pp. 2045–2057, 2012.
[7] A. Abdelfattah, A. Haidar, S. Tomov, and J. Dongarra, “Performance, design, and autotuning of batched gemm for gpus,” in International Conference on High Performance Computing. Springer, 2016, pp. 21–38.
[8] J. R. Rice, “The algorithm selection problem,” Advances in computers, vol. 15, pp. 65–118, 1976.