ScriptNet: Neural Static Analysis for Malicious JavaScript Detection

Jack W. Stokes; Rakshit Agrawal; Geoff McDonald; Matthew Hausknecht

arXiv:1904.01126·cs.CR·April 3, 2019

ScriptNet: Neural Static Analysis for Malicious JavaScript Detection

Jack W. Stokes, Rakshit Agrawal, Geoff McDonald, Matthew Hausknecht

PDF

TL;DR

ScriptNet introduces a neural static analysis system using the CPoLS model to effectively detect malicious JavaScript files at web scale, achieving high accuracy and outperforming baselines.

Contribution

The paper presents a novel neural static analysis approach with the CPoLS model for JavaScript malware detection, trained end-to-end for improved performance.

Findings

01

97.20% true positive rate at 0.50% false positive rate

02

Outperforms baseline models significantly

03

Effective processing of large JavaScript datasets

Abstract

Malicious scripts are an important computer infection threat vector in the wild. For web-scale processing, static analysis offers substantial computing efficiencies. We propose the ScriptNet system for neural malicious JavaScript detection which is based on static analysis. We use the Convoluted Partitioning of Long Sequences (CPoLS) model, which processes Javascript files as byte sequences. Lower layers capture the sequential nature of these byte sequences while higher layers classify the resulting embedding as malicious or benign. Unlike previously proposed solutions, our model variants are trained in an end-to-end fashion allowing discriminative training even for the sequential processing layers. Evaluating this model on a large corpus of 212,408 JavaScript files indicates that the best performing CPoLS model offers a 97.20% true positive rate (TPR) for the first 60K byte subsequence…

Figures37

Click any figure to enlarge with its caption.

Tables3

Table 1. Table I: DataSet Statistics.

DataSet	Start	End	Total	Num	%	Num	%
	Date	Date		Malware	Malware	Benign	Benign
Training	09/14/2017	12/28/2017	151,840	126,505	83.31	25,335	16.69
Files
Validation	12/29/2017	02/01/2018	45,251	38,693	85.51	6,558	14.49
Files
Test	02/02/2018	03/03/2018	65,109	57,037	87.60	8,072	12.40
Files
Total	09/14/2017	03/03/2018	262,200	222,235	84.76	39,965	15.24
Files

Table 2. Table II: Hyperparameter settings for the various models. The hyperparameter settings for the CPoLS variant are identical to the CPoLS model.

Model	Parameter	Description	Value
CPoLS	$T_{C P o L S}$	Maximum Sequence Length	60,000
CPoLS	$B_{C P o L S}$	Minibatch Size	50
CPoLS	$H_{C P o L S}$	LSTM Hidden Layer Size	250
CPoLS	$E_{C P o L S}$	Embedding Layer Size	100
CPoLS	$W_{C P o L S}$	CNN Window Size	10
CPoLS	$S_{C P o L S}$	CNN Window Stride	5
CPoLS	$F_{C P o L S}$	Number of CNN Filters	100
CPoLS	$D_{C P o L S}$	Dropout Ratio	0.5
LaMP	$T_{L a M P}$	Maximum Sequence Length	200
LaMP	$B_{L a M P}$	Minibatch Size	200
LaMP	$H_{L a M P}$	LSTM Hidden Layer Size	1500
LaMP	$E_{L a M P}$	Embedding Layer Size	50
LaMP	$D_{L a M P}$	Dropout Ratio	0.5

Table 3. Table III: Performance of the various models which were evaluated for this study.

Model	Accuracy (%)	Precision (%)	Recall (%)	F1	AUC
CPoLS-LSTM-LR $(L = 1, C = 0, T = 60 K)$	98.8725	98.9990	99.7212	0.9936	0.9985
CPoLS-LSTM-NN $(L = 1, C = 1, T = 60 K)$	98.1997	98.2232	99.7492	0.9898	0.9937
CPoLS-LSTM-DNN $(L = 1, C = 2, T = 60 K)$	98.1966	98.1135	99.8615	0.9898	0.9964
CPoLS-LSTM-LR $(L = 2, C = 0, T = 60 K)$	99.0399	99.3888	99.5160	0.9945	0.9975
CPoLS-LSTM-NN $(L = 2, C = 1, T = 60 K)$	98.9708	99.1626	99.6668	0.9941	0.9984
CPoLS-LSTM-DNN $(L = 2, C = 2, T = 60 K)$	98.8771	99.1909	99.5301	0.9936	0.9981

Equations6

L = - (τ l o g (p_{m}) + (1 - τ) l o g (1 - p_{m}))

L = - (τ l o g (p_{m}) + (1 - τ) l o g (1 - p_{m}))

f (x) = ma x (0, x)

f (x) = ma x (0, x)

E = \textsc E mb e dd in g (B) H_{L} = \textsc L S T M (E) h_{C L} = \textsc M a x P oo l 1 d (H_{L})

E = \textsc E mb e dd in g (B) H_{L} = \textsc L S T M (E) h_{C L} = \textsc M a x P oo l 1 d (H_{L})

Peer Reviews

No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.

Videos

No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.

Full text

\newfloatcommand

capbtabboxtable[][\FBwidth]

ScriptNet: Neural Static Analysis for Malicious JavaScript Detection

Anonymous Submission to NDSS 2019

Jack W. Stokes

Microsoft Research

Rakshit Agrawal

University of California, Santa Cruz

Geoff McDonald

Microsoft Corporation

Matthew Hausknecht

Microsoft Research

Abstract

Malicious scripts are an important computer infection threat vector in the wild. For web-scale processing, static analysis offers substantial computing efficiencies. We propose the ScriptNet system for neural malicious JavaScript detection which is based on static analysis. We use the Convoluted Partitioning of Long Sequences (CPoLS) model, which processes Javascript files as byte sequences. Lower layers capture the sequential nature of these byte sequences while higher layers classify the resulting embedding as malicious or benign. Unlike previously proposed solutions, our model variants are trained in an end-to-end fashion allowing discriminative training even for the sequential processing layers. Evaluating this model on a large corpus of 212,408 JavaScript files indicates that the best performing CPoLS model offers a 97.20% true positive rate (TPR) for the first 60K byte subsequence at a false positive rate (FPR) of 0.50%. The best performing CPoLS model significantly outperform several baseline models.

I Introduction

The detection of malicious JavaScript (JS) is important for protecting users against modern malware attacks. Because of its richness and its ability to automatically run on most operating systems, malicious JavaScript is widely abused by malware authors to infect users’ computers and mobile devices. JavaScript is an interpreted scripting language developed by Netscape that is often included in webpages to provide additional dynamic functionality [28]. JavaScript is often included in malicious webpages, PDFs and email attachments. To combat this growing threat, we propose ScriptNet, a novel deep learning-based system for the detection of malicious JavaScript files.

There are numerous challenges posed by trying to detect malicious JavaScript. Malicious scripts often include obfuscation to hide the malicious content which unpacks or decrypts the underlying malicious script only upon execution. Complicating this is the fact that the obfuscators can be used by both benign and malware files. Curtsinger et al. [7] measured the distributions of malicious and benign JavaScript files containing obfuscation. The authors showed that these distributions are very similar if a file is obfuscated and concluded that the presence of obfuscation alone cannot be used to detect malicious JavaScript.

Another difficulty is that a large number of file encodings (e.g., UTF-8, UTF-16, ASCII) are automatically supported by JavaScript interpreters. Thus, individual characters in the script may be encoded by two or more bytes. As a result, malware script authors can use the embedding itself to attempt to hide malicious JavaScript code [39].

While a wide range of different systems have been proposed for detecting malicious executable files [9], there has been less work in investigating malicious JavaScript. Previous JavaScript solutions include those based purely on static analysis [25, 33]. To overcome the limitations imposed by obfuscation, other methods [7, 39, 5] include both static and some form of dynamic (i.e., runtime) analysis to unroll multiple obfuscation layers. In some cases, the solution is focused on the detection of JavaScript embedded in PDF documents [23, 5, 27]. In addition, deep learning models have recently been proposed for detecting system API calls in PE files [2, 21, 29], JavaScript [37], and Powershell [15].

Including dynamic analysis allows improved detection over the previous static analysis approaches. While the combined static and dynamic analysis approaches can help unroll multiple obfuscation layers, it can cause additional difficulties.

In some cases where latency or a computational resources are problematic we would like to have an effective, purely static analysis approach for predicting if an unknown JavaScript file is malicious. Three important applications of this work are large-scale, fast webpage, antimalware and email scanning services. Search engine companies often scan large numbers of webpages searching for drive-by downloads. Antimalware companies may scan hundreds of thousands or even millions of unknown files each day. Similarly, large-scale email hosting providers often scan email attachments to identify malicious content. To scan an individual webpage or file, a specially instrumented virtual machine (VM) must first be reset to a default configuration. The webpage or email attachment is then executed, and dynamic analysis is used to determine whether the unknown script makes any changes to the VM. This process is time consuming and can require vast amounts of computing resources for extremely large-scale email services. If a script classifier can be trained to accurately predict that a script attachment is benign based solely on fast static analysis, this could possibly allow search and email service providers to reduce the number of expensive dynamic analysis performed using full instrumented VMs in the cloud. In this study, we focus on identifying malicious JavaScript for the Microsoft Windows Defender antimalware service.

To address these challenges, ScriptNet employs a sequential, deep learning model for the detection of malicious JavaScript files based solely on static analysis. The deep learning system allows high accuracy even in the presence of obfuscation. We evaluate the CPoLS sequence learning model family in the context of fully static analysis which is capable of capturing malicious behavior in any kind of JavaScript file.

Since the system operates directly on the byte representation of characters instead of keywords, it is able to handle the extremely large vocabulary of the entire script instead of detecting only the key API calls [23, 27]. In ScriptNet, a Data Preprocessing module first translates the raw JavaScript files into a vector sequence representation. The Neural Sequential Learning module then applies deep learning methods on the vector sequence to derive a single vector representation of the entire file. In this module, we propose a novel deep learning model model called Convoluted Partitioning of Long Sequences (CPoLS). These models can operate on extremely long sequences and can learn a single vector representation of the input. The next module of ScriptNet, the Sequence Classification Framework, then performs binary classification on the derived vector and generates a probability $p_{m}$ of the input file being malicious. Unlike earlier sequential models that are proposed to detect malicious PE files [2], our models are trained with end-to-end learning where all the model parameters are learned simultaneously taking the JavaScript file directly as the input.

Evaluating the proposed models on a large corpus of 262,200 JavaScript files, we demonstrate that the best performing CPoLS model offers a true positive rate of 97.20% for the first 60K byte subsequences at a false positive rate of 0.50%. We summarize the primary contributions of this paper as follows:

•

A comprehensive definition of a modular system is provided for detecting the malicious nature of JavaScript files using only the raw file content.

•

A novel deep learning model is proposed for learning from extremely long sequences.

•

Strong malware detection results are demonstrated using ScriptNet on a large corpus of JavaScript files collected by hundreds of millions of computers running a production antimalware product. The results show the robustness of ScriptNet on predicting the malicious nature of JavaScript files that were obtained in the future and were not known at the time of training.

II Data Collection and Dataset Generation

Large labeled datasets are required to sufficiently train deep learning systems, and constructing a dataset of malicious and benign scripts for training ScriptNet’s models is a challenge. When unknown JavaScript is encountered by the user during normal activity, it is submitted to the antimalware engine for scanning. Our antivirus partner generated the datasets utilized in this study from JavaScript encountered by the Windows Defender antimalware engine which was submitted to their production file collection and processing pipeline.

Methodology: Entire JavaScript files are extracted from the incoming flow of files input to the production pipeline. The antimalware engine is the only source of these files in this study which are uploaded from hundreds of millions of end user computers. A user must provide consent (i.e., opt-in) before their file is transmitted to the production cloud environment. In many cases, JavaScript files may be extracted from installer packages or archives which are also processed by the antimalware engine and input to the production pipeline. Because we want to model how well ScriptNet would perform in a production setting, we do not artificially insert additional file into the datasets such as those collected from Alexa top 500 websites or by other data augmentation methods because the production pipeline does not currently spend computational resources doing so.

Labels: Similar to the raw script content, the labels are also provided by the antimalware company, and the labeling process which is used in production cannot be changed for our study. We now describe the labeling process that was used by the antimalware company to generate the labels.

A script is labeled as malware if it has been inspected by our AV partner’s analysts and determined to be malicious. In addition, the script is labeled as malicious if it has been detected by the company’s detection signatures. Finally, a JavaScript file is labeled as malware if eight or more other anti-virus vendors detect it as malware. Given the huge volumes of files that are scanned each day, it is possible that an individual anti-malware company may mispredict in an unknown file is malicious or benign. The threshold parameter of eight was empirically chosen by the company after carefully selecting it as a good tradeoff between identifying malicious scripts while minimizing false positives.

A script is labeled as benign by a number of methods. First, the script is considered benign if it has been labeled as benign by an analyst or has been previously collected by a trusted source such as being downloaded from a legitimate webpage or signed by a trusted signer. However, if this does not provide enough labeled benign scripts, the dataset is augmented with lower confidence benign scripts in the production pipeline which are not detected by the company’s scanners and cloud detections as well as by any other trusted anti-virus vendors for at least 30 days after our AV partner has first encountered it in the wild.

Datasets: As described in Table I, our anti-virus partner provided the full content of 262,200 JavaScript files which contained 222,235 malicious and 39,965 benign scripts. For this research, JavaScript files were subsampled from the production pipeline from September 2017 through March 2018. These JavaScript files were partitioned into training, validation, and test sets containing 151,840, 45,251, and 65,109 samples, respectively, based on the non-overlapping time periods denoted in the table.

For the learning phase, which is described later in the paper, we use the training and validation sets. The training set is the largest portion ( $\sim 60\%$ ) and is used to train the learning model and update its weights. The validation set is a small portion ( $\sim 15\%$ ), and is used for model parameter tuning during the learning phase. JavaScript files in the validation set are not present in the training set. The performance of the learning model on the validation set helps guide the selection of the best model. In the detection phase, we use the third partition, called test set ( $\sim 25\%$ ). JavaScript files in the test set are not present in either training or validation set. The test set, consisting of new files, helps perform true evaluation of a trained model on unseen data. All the evaluation metrics for our models use the test dataset only.

III ScriptNet System Design

ScriptNet is motivated by the objective of building a system, which can predict the malicious nature of a script by analyzing the file in the absence of any additional information. Additionally, we want this system to be able to learn features from the data itself without intervention from humans. In this section, we describe the architecture of ScriptNet in detail which is designed to achieve these objectives.

The ScriptNet system is comprised of multiple modules banded together in a specific order. These modules can be treated independently as black boxes that take a specific kind of input and can generate a certain output, which can be used by the following module. The high-level illustration of the ScriptNet system is shown in Figure 1. In this section we describe the modules, as well as the process of learning and detection used by this system.

Data Preprocessing: The first stage of ScriptNet is to process the raw file data and prepare it for utilization by a deep learning model. In their raw form, the script files are simply text files written using readable characters. The content inside script files is in the form of programming code. This means that the text includes operators, variable names, and other syntactical properties. In natural language, the semantic meaning of a certain word is limited to a small space. Whereas in programming code, words cannot be directly mapped to a limited semantic space. Moreover, the number of words and operators can grow infinitely as variable names do not need to follow any linguistic limitations. Therefore, we need to represent the scripts at a much finer level than using words.

We achieve this by interpreting the script files as byte sequences. Using this method to read the files, we limit the space of possible options to the number of different bytes, i.e., $(2^{8})=256$ . The complete process of sequence processing is also illustrated in Figure 2.

For the system to clearly identify the different bytes, we need to provide them unique identifiers. At this stage, therefore, we create an index to map each byte with a symbolic identifier. Following the convention in deep learning models, we refer to this index as the vocabulary $V$ for the model. A vocabulary helps maintain a one-to-one symbol mapping with the raw data and can also be used to tie any out-of-vocabulary items to a special symbol. For example, we introduce an extra symbol for padding sequences to a uniform length.

With the use of this vocabulary, we can now transform the input file into a sequence usable by the learning model. At first, by reading the text as bytes, we get a byte sequence. Next, we perform a lookup through the vocabulary index and represent each byte with its symbolic representation. We refer to the derived sequence as $B$ , where $B=[b_{1},b_{2},b_{3},\ldots]\hskip 5.0pt\forall b_{i}\in V$ denotes a sequence of symbols $b_{i}$ each of which is identified in our vocabulary $V$ .

For learning purposes, it is possible to directly use this symbolic representation. However, symbols serve information at a very low level of dimensionality. When represented as symbols, any similarity between two kinds of bytes cannot be directly identified. In neural networks, the concept of representations, or ’embeddings’ is extensively used for this purpose. By representing symbols with vectors, we can increase the dimensionality of the information associated with each element. These vectors can be learned from the data itself. The distance between these vectors also serves as a measure of semantic similarity.

In our case, we use this concept of representations and transform the symbolic sequence into a sequence of vectors. The initial value of these vectors is randomly selected using initialization methods by Glorot and Bengio [13]. During the training phase, the vectors are updated along with the model.

Neural Sequential Learning: In our preprocessing phase, we converted the input files into vector sequences. Since the lengths of these files can vary, the derived sequences are also of different length. General learning methods based on feature vectors read fixed length vectors as input and operate on them. For our case with more complex-shaped data, we need to use modules that can process two-dimensional input data. Therefore, to learn from sequences, we use advanced neural network architectures like Recurrent Neural Networks (RNNs) and Convolutional Neural Networks (CNNs) [24].

RNNs operate on each vector sequentially and use both the input at each time step of the sequence, along with the learned vector from the previous time step, to produce the next output. In our models, we use a memory-based variant of RNNs known as Long Short-Term Memory (LSTM) [17, 12] neural networks.

CNNs operate by performing convolutions over a sliding window of smaller partitions within the input. These can operate on both sequences, as well as images. For learning from sequences, we use one-dimensional CNNs. While mostly used in computer vision [22, 31], CNNs have also recently shown success in sequential learning [10, 11].

In our system, we construct Neural Sequential Learning modules based on both LSTMs and CNNs. The objective of this module is to capture variable-length vector sequences and learn a vector representation from them. Intuitively, this module is responsible for searching through the input sequence and extracting any relevant information that can be used for final detection. In this paper, we use the Convoluted Partitioning of Long Sequences (CPoLS) for sequence learning. We will describe this model in detail in the next section.

Sequence Classification Framework: Once the input file has been processed through our data processing and neural sequence learning modules, the input is available as a fixed-length vector $h_{CL}$ . The objective at this stage is to perform the final prediction in order to classify the input as being malicious or benign. There are several methods of classification in machine learning that can be used at this point. Since we use a sequence learning module based on neural networks, we use classification models that can also be trained using gradient descent-based methods. Using such models, we can train our entire learning system end-to-end. End-to-end learning means that every weight (or coefficient) in our model can be trained in a single process guided directly by the ground truth.

Learning Phase: The modules described above, combined together, create the complete ScriptNet system. Since these models depend on data for training, the system goes through a learning phase before it can be used on new files for malware detection. Figure 1a describes the learning phase of ScriptNet. In this phase, along with the JavaScript files, we also have labels associated with each file, specifying them as being either malicious or benign. The system, therefore, processes the input file into a vector sequence, as well as generates the associated label, to be used for training the system. The sequence vector-label pair is then passed to the learning model.

As mentioned above, we use gradient descent-based methods to train our models. In such methods, a loss $\mathcal{L}$ is measured by comparing the prediction $p_{m}$ generated by the learning model and the available ground truth label $\tau$ . This loss is then used to update the coefficients (i.e., weights) of the model. For our objective of binary classification, we use the binary cross-entropy loss function, which is defined as:

[TABLE]

where $\tau$ is the known ground truth, $p_{m}$ is the predicted probability of maliciousness, and $\log$ is the natural logarithmic function. This process of backpropagating the loss is repeated for the entire training dataset, for a specified number of iterations, until the model converges to the best weights.

Detection Phase: Once the model is trained completely, it is then available for performing detection on new, unknown files. In the learning phase, we use each sample to improve the learning model. Whereas in the detection phase, we use a well-trained model to perform inference in a real deployment setup. Figure 1b shows the detection phase version of ScriptNet.

Since the learning model takes a vector sequence as an input, the same data preprocessing module is used in the detection phase. However, the model is now used only to generate a probability of detection. In machine learning terminology, this pass through the model is known as a forward pass. During a forward pass, we only move through the model in one direction and generate a probability $p_{m}$ for the input file.

End-to-End Learning: Due to the modular nature of our system, we have the freedom to train it in different ways. While we present neural models in this paper, the system can also use different components from machine learning. For instance, in the Sequence Classification Framework, we can ideally use any classifier like an Support Vector Machine or Naive Bayes, which may or may not support gradient-based updates.

By keeping our models in the realm of neural networks, we are also able to utilize the concept of end-to-end learning. This means that our system can train itself completely by just using the input files and labels. We do not need to train different modules individually in such a setting. The results presented in this paper were trained using end-to-end models.

IV Models

ScriptNet uses end-to-end learning models based on neural networks. For our objectives, we need models for sequential learning and sequence classification. In this section, we present a detailed description of the Convoluted Partitioning of Long Sequences (CPoLS) model. We first briefly discuss the neural method of sequence learning and describe our motivation behind constructing these models.

IV-A Sequence Learning and Limitations

Learning from sequential data is a common use case in machine learning. Data in natural language, speech, time series, stock prediction, and many other domains can be of sequential nature. For sequences of very short fixed lengths, vector based learning models like logistic regression can often work well, by flattening the sequence into a longer single vector. For longer sequences, RNNs, specifically LSTMs have been popularly used and have shown exceptional results. Since CNNs can also capture multi-dimensional data, they are often used with sequential data.

As the length of the sequences keeps increasing, these models start experiencing different challenges. RNNs originally experienced the vanishing and exploding gradient [3, 16] problems when used on longer sequences. LSTMs, in particular help mitigate such problems. However, for extremely long sequences, models directly based on the LSTM can quickly become computationally expensive and are unable to process the complete input. Additionally, the objective of ScriptNet is to detect the malicious nature within a file. In natural language data, sequences often generate a context space, within which the semantic relationship between different objects is more clear. For instance, in a task of detecting abusive text, a sentence containing foul words is also semantically linked to an abusive sentiment. The mere presence of a foul word in a sentence cannot make it abusive. However, in malicious JavaScript files, a file can look completely normal except for a certain point in the file where malicious code is present. Therefore, not only do we need to process very long files, we also need to capture specific malicious nature hidden at any location within the file.

Due to the limitations mentioned above and problem specific requirements, we propose these new models. The CPoLS model can operate on extremely long sequences. We later show in the paper, that these models also perform exceedingly well over other proposed solutions in a similar problem space.

IV-B Convoluted Partitioning of Long Sequences

Convoluted Partitioning of Long Sequences (CPoLS) is a neural model architecture designed specifically to extract classification information hidden deep within long sequences. In this model, we process the input sequence by splitting it into smaller parts of fixed-length, processing them individually, and then combining them again for further learning. The process of CPoLS is as follows:

Step 1.: The model receives an input sequence $B$ as a sequence of bytes. Since the sequences are extremely long, we pass them to the model in their symbolic form and transform them to vectors later.

Step 2: The byte sequence $B$ is split into a list $C$ of small subsequences $c_{i}\in C$ where $i$ is the index of each partition in $C$ . During the split, the subsequences maintain their order.

Step 3: Next on the smaller subsequences, we perform the lookup for transforming symbolic sequence $c_{i}$ into vector sequences $e_{i}\in E$ where $E$ is the list of vector subsequences and $i$ is the index of each subsequence. Following the conventional neural network terminology, we refer to the layer for this lookup as the Embedding layer.

Step 4: Each of these partitions $e_{i}$ are now separately processed through a module called RecurrentConvolutions, while still maintaining their overall sequential order.

Step 4.1: In RecurrentConvolutions, we pass each partition $e_{i}$ through a one-dimensional CNN, Conv1D, which applies multiple filters on the input sequence and generates a tensor $e^{\chi}_{i}$ representing the convoluted output of vector sequence $e_{i}$ . The combined list of convolved partitions $e^{\chi}_{i}$ for each subsequence $e_{i}\in E$ is referred to as $E^{\chi}$ .

Step 4.2: We then reduce the dimensionality of $e^{\chi}_{i}$ by performing a temporal max pooling operation MaxPool1d on it. MaxPool1d takes a tensor input $e^{\chi}_{i}$ and extracts a vector $e^{\prime}_{i}$ from it corresponding to the maximum values across each dimension.

Step 5. As a result of RecurrentConvolutions, for each subsequence $e_{i}$ , we derive a vector representation $e^{\prime}_{i}$ . We finally combine these vectors in order to generate a new vector sequence $E^{\prime}$ where each vector $e^{\prime}_{i}\in E^{\prime}$ is a result of RecurrentConvolutions and, therefore, consists of the learned information from subsequence $e_{i}$ .

Step 6.: We now obtain a reduced-length sequence of vectors $E^{\prime}$ . This sequence can now be processed using a standard sequence learning approach. We, therefore, next pass this sequence through an LSTM. In place of LSTM, we can also use multiple stacked LSTMs, bi-directional LSTMs (BiLSTMs), or any other RNN variants like Gated Recurrent Units (GRUs), etc. For an input sequence $E^{\prime}$ of length $n$ , this layer produces a learned sequence $H_{L}$ of length $n$ but with a different fixed dimensionality.

Step 7.: For detecting malware, we want to obtain the important malicious signal information within the sequence $H_{L}$ . An effective method for such cases is the use of temporal max pooling, MaxPool1d, as proposed by Pascanu et al. [29].

Given an input vector sequence $S=[s_{0},s_{1},\ldots s_{M-1}]\in S$ of length $M$ , where each vector $s_{i}\in\mathbb{R}^{K}$ is a $K$ -dimensional vector, MaxPool1d computes an output vector $s_{MP}\in\mathbb{R}^{K}$ as $s_{MP}(k)=\max(s_{0}(k),s_{1}(k),\cdots s_{M-1}(k))\forall k\in K$ . The vector $s_{MP}$ , therefore, for each dimension, contains the maximum value observed in the sequence for that dimension.

At this stage, we pass the sequence $H_{L}$ through MaxPool1d to obtain the final vector $h_{CL}$ . The vector $h_{CL}$ is the derived vector representation of the entire sequence $B$ using the CPoLS model. This vector can now be used by the Sequence Classification Framework to perform the final binary classification.

The simplest such model can be a logistic regression model that uses $h_{CL}$ and derives a probability of maliciousness $p_{m}$ . We can even use complex models, such as feed-forward neural networks, or deeper neural networks with multiple layers for the same purpose. We can also use any non-linear activation functions in these networks. For our experiments, we use the Rectified Linear Unit (ReLU), which is defined as:

[TABLE]

where $f$ represents the ReLU function on an input $x$ .

Due to the modular nature of our system, the choice of the classifier is independent of the sequential learning method being used. For this reason, we evaluated our models with a large number of combinations. Along with CPoLS and CPoLS for sequential learning, we also used a simpler LSTM-based method. We refer to this method as using LSTM and Max-Pooling (LaMP) based on the malware detection model proposed by Athiwaratkun and Stokes [2]. As the name of the model suggests, this model directly takes the vector input sequence, and passes it through the LSTM. The sequence of learned vectors from the LSTM is then passed through a temporal max-pooling layer, MaxPool1d, in order to derive the final vector $h_{CL}$ . For an input byte sequence $B$ , LaMP can be summarized as:

[TABLE]

where Embedding is the embedding lookup layer, $E$ is the vector sequence derived using Embedding, and $H_{L}$ is the output vector sequence derived from the LSTM.

V Experimental Results

In this section, we evaluate the performance of the proposed ScriptNet classifier models on the JavaScript files described in Section II. We start by describing the experimental setup used to generate the results. We next investigate the performance of the different hyperparamer settings for the CPoLS variants.

Experimental Setup: All the experiments are written in the Python programming language using the Keras [4] deep learning library with TensorFlow [1] as the backend deep learning framework. The models are trained and evaluated on a cluster of NVIDIA P100 graphical processing unit (GPU) cards. The input vocabulary size is set to 257 since the sequential input consumed by each model is a byte stream, and an additional symbol is used for padding shorter sequences within each minibatch. All models are trained using a maximum of 15 epochs, but early stopping is employed if the model fully converges before reaching the maximum number of epochs. The Adam optimizer [20] is used to train all models.

We did hyperparameter tuning of the various input parameters for the JavaScript models, and the final settings are summarized in Table II. To do so, we first set the other hyperparameters to fixed values and then vary the hyperparameter under consideration. The best parameter setting is then set based on the validation error rate. For example, to evaluate different minibatch sizes for the JavaScript LaMP classifier, we first set the LSTM’s hidden layer size $H_{LaMP}=1500$ , the embedding dimension to $E_{LaMP}=50$ , the number of LSTM layers $L_{LaMP}=1$ and the number of hidden layers in the classifier $C_{LaMP}=1$ . With these settings, we evaluate the classification error rate on the validation set for the JavaScript dataset.

The CPoLS model is designed to operate on the full JavaScript sequences. However, training on the full length sequences exhausts the memory capacity of the NVIDIA P100s in our cluster, depending on the particular variant and parameter settings of the model. To overcome this limitation, we truncated the sequence length to $T$ = 60,000 bytes for all the CPoLS and CPoLS experiments. Similarly, we truncated the sequences to lengths of $T=200,1000$ bytes for the LaMP and Kolosnjaji CNN [21] baselines.

Bibliography40

The reference list from the paper itself. Each links out to its DOI / PubMed record.

1[1] M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S. Corrado, A. Davis, J. Dean, M. Devin, S. Ghemawat, I. Goodfellow, A. Harp, G. Irving, M. Isard, Y. Jia, R. Jozefowicz, L. Kaiser, M. Kudlur, J. Levenberg, D. Mané, R. Monga, S. Moore, D. Murray, C. Olah, M. Schuster, J. Shlens, B. Steiner, I. Sutskever, K. Talwar, P. Tucker, V. Vanhoucke, V. Vasudevan, F. Viégas, O. Vinyals, P. Warden, M. Wattenberg, M. Wicke, Y. Yu, and X. Zheng, “Tensor Flow: Large-scale machine learnin
2[2] B. Athiwaratkun and J. W. Stokes, “Malware classification with lstm and gru language models and a character-level cnn,” in 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) , March 2017, pp. 2482–2486.
3[3] Y. Bengio, P. Simard, and P. Frasconi, “Learning long-term dependencies with gradient descent is difficult,” IEEE Transactions on Neural Networks , vol. 5, no. 2, pp. 157–166, Mar 1994.
4[4] F. Chollet et al. , “Keras,” https://github.com/fchollet/keras , 2015.
5[5] I. Corona, D. Maiorca, D. Ariu, and G. Giacinto, “Lux 0r: Detection of malicious pdf-embedded javascript code through discriminant analysis of api references,” in Proceedings of the 2014 Workshop on Artificial Intelligent and Security Workshop , ser. AI Sec ’14. New York, NY, USA: ACM, 2014, pp. 47–57.
6[6] M. Cova, C. Kruegel, and G. Vigna, “Detection and analysis of drive-by-download attacks and malicious javascript code,” in Proceedings of the 19th International Conference on World Wide Web , ser. WWW ’10. New York, NY, USA: ACM, 2010, pp. 281–290.
7[7] C. Curtsinger, B. Livshits, B. Zorn, and C. Seifert, “Zozzle: Fast and precise in-browser javascript malware detection,” in Proceedings of Usenix Security , 2011.
8[8] G. E. Dahl, J. W. Stokes, L. Deng, and D. Yu, “Large-scale malware classification using random projections and neural networks,” in Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) , 2013.