TL;DR
This paper presents a novel approach for classifying ransomware infections using only a single screenshot per variant, leveraging augmented one-shot learning and Bayesian methods to achieve high accuracy and handle unseen cases.
Contribution
The work introduces a new post-infection ransomware classification method based on minimal data, combining data augmentation, one-shot learning, and Bayesian uncertainty for improved accuracy and robustness.
Findings
Achieved up to 93.6% classification accuracy.
Effectively identified unseen ransomware variants.
Handled unrelated images with Bayesian uncertainty.

Table I. Evaluation metrics (higher is better).

| Network | Pretrained (ImageNet) | Accuracy | F1 Score | AUC |
|---|---|---|---|---|
| SqueezeNet [8] | ✗ | 0.640 | 0.622 | 0.816 |
| SqueezeNet [8] | ✓ | 0.734 | 0.714 | 0.864 |
| VGG-19 [6] | ✗ | 0.670 | 0.661 | 0.832 |
| VGG-19 [6] | ✓ | 0.790 | 0.784 | 0.893 |
| ResNet-101 [7] | ✗ | 0.782 | 0.773 | 0.889 |
| ResNet-101 [7] | ✓ | 0.876 | 0.872 | 0.937 |
| MobileNet-V2 [12] | ✗ | 0.804 | 0.799 | 0.900 |
| MobileNet-V2 [12] | ✓ | 0.892 | 0.883 | 0.945 |
| ResNeXt-101 [13] | ✗ | 0.786 | 0.775 | 0.891 |
| ResNeXt-101 [13] | ✓ | 0.898 | 0.896 | 0.948 |
| Inception-V3 [10] | ✗ | 0.816 | 0.812 | 0.906 |
| Inception-V3 [10] | ✓ | 0.906 | 0.904 | 0.952 |
| ShuffleNet-V2 [11] | ✗ | 0.774 | 0.764 | 0.885 |
| ShuffleNet-V2 [11] | ✓ | 0.910 | 0.905 | 0.954 |
| DenseNet-161 [9] | ✗ | 0.816 | 0.806 | 0.906 |
| DenseNet-161 [9] | ✓ | 0.928 | 0.926 | 0.963 |
| DenseNet-201 [9] | ✗ | 0.848 | 0.837 | 0.917 |
| DenseNet-201 [9] | ✓ | 0.936 | 0.937 | 0.967 |

Table II. Evaluation metrics (higher is better).

| Network | # Parameters | Accuracy | F1 Score | AUC |
|---|---|---|---|---|
| Inception-V3 [10] | 25,214,714 | 0.626 | 0.591 | 0.809 |
| ShuffleNet-V2 [11] | 1,304,854 | 0.628 | 0.604 | 0.810 |
| VGG-19 [6] | 139,786,098 | 0.630 | 0.609 | 0.811 |
| SqueezeNet [8] | 748,146 | 0.634 | 0.613 | 0.813 |
| ResNet-101 [7] | 42,602,610 | 0.664 | 0.642 | 0.829 |
| MobileNet-V2 [12] | 2,287,922 | 0.666 | 0.648 | 0.830 |
| ResNeXt-101 [13] | 86,844,786 | 0.674 | 0.659 | 0.834 |
| DenseNet-201 [9] | 18,188,978 | 0.720 | 0.704 | 0.857 |
| DenseNet-161 [9] | 26,582,450 | 0.744 | 0.734 | 0.870 |
| Custom Network | 1,875,666 | 0.716 | 0.703 | 0.855 |

Table III. Evaluation metrics (higher is better).

| Augmentation Method | Accuracy | F1 Score | AUC |
|---|---|---|---|
| None | 0.252 | 0.258 | 0.618 |
| Contrast | 0.386 | 0.379 | 0.687 |
| Rotation | 0.440 | 0.414 | 0.714 |
| Brightness | 0.404 | 0.402 | 0.696 |
| Perspective | 0.524 | 0.500 | 0.757 |
| Motion Blur | 0.338 | 0.348 | 0.662 |
| Defocus Blur | 0.324 | 0.324 | 0.655 |
| Gaussian Blur | 0.312 | 0.289 | 0.649 |
| Random Noise | 0.344 | 0.343 | 0.665 |
| Random Occlusion | 0.344 | 0.339 | 0.665 |
| Colour Perturbations | 0.330 | 0.325 | 0.658 |
| All Augmentations | 0.716 | 0.703 | 0.855 |

Evaluation metrics (higher is better).

| Augmentation | Accuracy | F1 Score | AUC | Augmentation | Accuracy | F1 Score | AUC |
|---|---|---|---|---|---|---|---|
| P/R/B/C/N/O/M/CP/D/G | 0.716 | 0.703 | 0.855 | P/R/B/C/N | 0.616 | 0.609 | 0.782 |
| P/R/B/C/N/O/M/CP/D | 0.690 | 0.681 | 0.842 | P/R/B/C | 0.606 | 0.592 | 0.776 |
| P/R/B/C/N/O/M/CP | 0.674 | 0.658 | 0.821 | P/R/B | 0.592 | 0.580 | 0.771 |
| P/R/B/C/N/O/M | 0.648 | 0.632 | 0.805 | P/R | 0.586 | 0.569 | 0.762 |
| P/R/B/C/N/O | 0.634 | 0.628 | 0.797 | P | 0.524 | 0.500 | 0.757 |

Evaluation metrics (higher is better), with model uncertainty and mean confidence.

| Approach | Test Data | Accuracy | F1 Score | AUC | Model Uncertainty | Mean Confidence |
|---|---|---|---|---|---|---|
| Fixed Dropout [16] | Positive | 0.708 | 0.7011 | 0.8429 | 0.015 | 0.85 ± 0.21 |
| Fixed Dropout [16] | Negative | – | – | – | 0.330 | 0.66 ± 0.25 |
| Concrete Dropout [17] | Positive | 0.698 | 0.6771 | 0.8459 | 0.067 | 0.87 ± 0.19 |
| Concrete Dropout [17] | Negative | – | – | – | 0.218 | 0.72 ± 0.29 |
| Variational Dropout [18] | Positive | 0.6821 | 0.6593 | 0.8378 | 0.084 | 0.86 ± 0.22 |
| Variational Dropout [18] | Negative | – | – | – | 0.175 | 0.71 ± 0.23 |
A King’s Ransom for Encryption:
Ransomware Classification using Augmented One-Shot Learning and Bayesian Approximation
Amir Atapour-Abarghouei, Stephen Bonner and Andrew Stephen McGough
School of Computing, Newcastle University, Newcastle, UK
{amir.atapour-abarghouei, stephen.bonner3, stephen.mcgough}@newcastle.ac.uk
Abstract
Newly emerging variants of ransomware pose an ever-growing threat to computer systems governing every aspect of modern life through the handling and analysis of big data. While various recent security-based approaches have focused on detecting and classifying ransomware at the network or system level, easy-to-use post-infection ransomware classification for the lay user has not been attempted before. In this paper, we investigate the possibility of classifying the ransomware a system is infected with simply based on a screenshot of the splash screen or the ransom note captured using a consumer camera commonly found in any modern mobile device. To train and evaluate our system, we create a sample dataset of the splash screens of 50 well-known ransomware variants. In our dataset, only a single training image is available per ransomware. Instead of creating a large training dataset of ransomware screenshots, we simulate screenshot capture conditions via carefully designed data augmentation techniques, enabling simple and efficient one-shot learning. Moreover, using model uncertainty obtained via Bayesian approximation, we ensure special input cases such as unrelated non-ransomware images and previously-unseen ransomware variants are correctly identified for special handling and not mis-classified. Extensive experimental evaluation demonstrates the efficacy of our work, with accuracy levels of up to 93.6% for ransomware classification.
Index Terms:
Machine Learning, Ransomware Classification, Model Uncertainty, Bayesian Approximation, One-Shot Learning
I Introduction
Due to the increasingly prominent role of the internet in various facets of modern life, any malicious online activity has the potential to disrupt the social order, sometimes with dire repercussions. Of the numerous variants of malware often spread for economic gain, ransomware has recently received significant attention within the cybersecurity community [1] due to its wide range of targets, the significant harm it can inflict on the victims, the great financial incentive it provides for organised crime syndicates and its constant evolution, allowing its variants to regularly bypass state-of-the-art anti-virus and anti-malware [2].
There are, in essence, two types of ransomware: locker ransomware, which locks the targeted system and prevents or constrains user access, but is often easily resolvable for a technically-savvy user, and crypto-ransomware, which can be significantly more difficult to deal with and can lead to irreversible harm as it encrypts files within the targeted system. A third type of ransomware, called scareware, attempts to scare lay users into paying the ransom using only an intimidating splash screen, without actually damaging the computer in any way [3]. This substantial level of diversity among ransomware variants gives significant importance to a robust classification system that could easily identify the ransomware and guide the victims towards appropriate support.
Various classification and detection techniques within the existing literature [1, 4, 5] facilitate identifying and countering ransomware attacks for technically-adept individuals and organisations with a large security and IT infrastructure. However, ransomware classification methods tailored towards laypersons, who make up the majority of users and are often easy targets, are scarce. In this paper, we propose an image classification pipeline, which enables any individual to identify the variant of the ransomware they are infected with based on a screenshot of the splash screen or the ransom note casually captured using a consumer-grade camera, such as those commonly found in any modern mobile phone.
While a significant portion of the literature on classification has been dedicated to achieving consistent high-accuracy results using a variety of optimised deep neural networks [6, 7, 8, 9, 10, 11, 12, 13], most of these techniques require large quantities of accurately-labelled data, which for our task, translates to a large corpus of splash screen images captured from computer screens under different environmental conditions (lighting, field of view, camera angle, etc.) varied enough to simulate any future image capture and thus avoid over-fitting. A naïve solution to the data requirement problem would be to accept the considerable costs and resources required to create such a large dataset, but in this work, we attempt to circumvent the need for big data by recreating the conditions that lead to the appearance of a screenshot by means of carefully designed and tuned data augmentation techniques. In essence, our one-shot learning framework is capable of classifying any image of a ransomware splash screen captured using a camera by only ever seeing a single original image for each class of ransomware. This enables our approach to rapidly learn to classify new variants if the model is simply retrained or fine-tuned using a single training image. Consequently, our dataset consists of a single image per variant of ransom note or splash screen for training and ten screenshots of said ransom notes captured using a mobile phone camera for testing (Figure 1).
Additionally, modern neural-based classification approaches are notorious for attempting to classify inputs on which they have not been trained [14] or completely misclassifying images sampled from distributions with slight deviations from the training set [15]. This means an off-the-shelf approach will wrongly classify any unrelated input (e.g. non-ransomware images, images of new ransomware variants unknown to the existing model, carefully-designed adversarial examples), sometimes with a high degree of confidence. To remedy this, we turn towards the recent advances in variational inference and its implications in calculating model uncertainty in neural networks [16, 17, 18, 19, 20]. Not only does the integration of Bayesian inference into a neural network make it more robust against adversarial attacks, access to model uncertainty also enables the network to reject irrelevant inputs sampled from outside the distribution of the training data. The inclusion of model uncertainty calculations in our pipeline requires its own evaluation methodology, for which we also include a negative test set (Figure 1 – bottom) in our dataset to assess our uncertainty values. This dataset consists of unrelated input images which the model should be uncertain about as it has not been trained to classify such images. In short, the primary contributions of this work are as follows:
- Ransomware Classification: We provide a simple pipeline that enables any layperson to identify the variant of ransomware they have been infected with by casually taking a photograph of their computer screen displaying the ransom note or splash screen.
- One-Shot Learning through Data Augmentation: We investigate the possibility of using different data augmentation techniques to simulate the appearance of a screenshot given the original splash screen, thereby enabling training on a single data point per class with significant generalisation capabilities.
- Model Uncertainty via Bayesian Approximation: We explore the use of various forms of Bayesian inference to further improve generalisation and obtain model uncertainty to avoid classifying unrelated images and as-of-yet-unknown variants of ransomware.
To enable easier reproducibility, the source code, pre-trained models and the dataset will be publicly available at https://github.com/atapour/ransomware-classification.
II Related Work
We consider relevant prior work over three distinct areas: ransomware classification and detection (Section II-A), one-shot learning (Section II-B), and Bayesian approximation (Section II-C).
II-A Ransomware Classification and Detection
Traditionally, malware activities are either detected at the network level [21, 22], system level [23] or even both [24]. Andronio [25] identifies device-locking or encryption activities by finding code paths using static taint analysis along with symbolic execution. Anomalous file system activities have also been used to detect ransomware [26]. In another work, abnormal system behaviour is identified based on changes in file type, similarity measurements and entropy [27].
More recently, machine learning based approaches have become prevalent in detecting and/or classifying ransomware. Sgandurra et al. [28] detect and classify ransomware by dynamically analysing the behaviour of applications during the early phases of their installation. In another work [5], detection and classification of ransomware is made possible by combining a static detection phase based on the frequency of opcodes prior to installation and a dynamic method which investigates the use of CPU, memory and network as well as call statistics during run-time. Vinayakumar et al. [29] explore the use of neural networks with a focus on tuning the hyperparameters and the architecture of a very simple multilayer perceptron to detect and classify ransomware activities.
While the use of various machine learning techniques has led to significant improvements in the field of ransomware detection and classification, these approaches are mostly tailored towards technical users or potential integration into various anti-virus and anti-malware applications. The approach proposed here mainly focuses on classifying ransomware after the system has been infected based on an image of the splash screen or the ransom note casually taken by any layperson.
II-B One-Shot Learning
Recent advances in modern machine learning techniques have resulted in remarkable strides in various active areas of research, including image classification [6], semantic scene understanding [30, 31, 32], natural language processing [33] and graph representations [34, 35]. However, one of the main requirements of all such approaches is access to a large corpus of data for extensive iterative training, which is often not readily available or intractable to obtain.
This has led to the creation of an entire field of research with a focus on the daunting task of training machine learning algorithms using one data point. The seminal work by Fei et al. [36] popularised the idea of one-shot learning by proposing a variational Bayesian framework for image classification by leveraging previously-learned classes to aid in the classification of unseen ones. Their promising results inspired a slew of researchers to use novel one-shot learning techniques to tackle various other problems and applications. For instance, to address the problem of character classification, Lake et al. model the character drawing process to decompose the image into smaller chunks, and a structural explanation is subsequently given for the observed pixels. The same process has been used for speech primitives along with Bayesian inference to identify new words from unknown speakers [37].
Siamese neural networks have been used to rank similarity between inputs [38]. This similarity prediction is then utilised to classify not only new data but entirely new classes, by measuring the similarity between the new entries. A memory-augmented neural network is proposed by Santoro et al. [39] that learns how to store and retrieve memories to use for each classification task. Vinyals et al. [40] propose a network that maps a small labelled support set and an unlabelled example to its label, enabling adaptation to new data.
Chen et al. [41] attempt to learn a mapping between new data samples and concepts in a high-dimensional semantic space. The newly mapped concepts are subsequently matched against existing ones and new instance features are synthesised by interpolating among the concepts to facilitate better learning. More similar to our work, Zhao et al. [42] directly leverage data augmentation for one-shot learning. In this paper, we also utilise a series of carefully-selected data augmentation techniques to train a classification model based on a single data point per class. Whilst our pipeline is unable to generalise to entirely new classes, we rely on using Bayesian inference to identify previously unseen new classes.
II-C Model Uncertainty via Bayesian Approximation
In modern applied machine learning, uncertainty is gaining an ever-increasing level of importance, mainly due to the role it can play in detecting and averting adversarial attacks [43], ensuring system safety in critical infrastructure [44] and analysing and preventing failure in robotics and navigation applications [45], among others. Similarly, in our work, uncertainty estimates can be a valuable tool that can ensure new previously-unseen variants of ransomware or completely irrelevant inputs, such as those mistakenly selected by the user, are correctly identified, since explicit handling and treatment is required for these special cases.
A simple and efficient technique widely used in the literature to calculate model uncertainty is Bayesian inference, with dropout [46] used as a pragmatic approximation [16]. In a dropout inference approach, the neural network is trained with dropout applied before every weight layer and during testing, the output is obtained by randomly dropping neurons to generate samples from the model distributions. Gal et al. [16] demonstrate that the use of dropout inference is mathematically equivalent to the probabilistic deep Gaussian process approximation [47], with the approach effectively minimising the Kullback-Leibler divergence between the model distribution and the posterior of a deep Gaussian process, marginalised over its finite rank covariance function parameters [16].
While the use of such an approach [16] can yield a reasonable estimate of model uncertainty (as demonstrated in Section IV-C), to obtain better-calibrated uncertainty that fits the nature of the data at hand, the dropout rate at each layer must be adapted to the data as a variational parameter. This is often accomplished using an extensive grid-search [17] which is computationally-intensive, time-consuming, and completely intractable for certain tasks, which points to the importance of an adaptive dropout rate in a variational framework.
Kingma et al. [18] thus propose variational dropout, which attempts to model Bayesian inference using a posterior factorised over individual network weights for all individual mean parameters. The prior factorises similarly and is explicitly selected so the Kullback-Leibler divergence between the model distribution and the posterior is independent of the mean parameters. Additionally, Kingma et al. [18] claim that their reparametrisation technique maps uncertainty about the weights of the model into independent local noise. Subsequently, an extension to the conventional Gaussian multiplicative dropout [46] is proposed that allows for the dropout rate to be learned as a parameter. However, more recent studies [19, 20] have demonstrated that the log-uniform prior used for variational dropout [18] may not lead to a proper posterior, which means variational dropout is a non-Bayesian sparsification approach and the uncertainty it estimates may not follow the usual Bayesian interpretation.
Conversely, Gal et al. [17] resolve the issue of the improper prior and posterior and propose the use of learnable dropout rate parameters optimised towards obtaining better uncertainty rather than maximising model performance. By introducing a dropout regularisation term, which only depends on the dropout rate, the approach ensures that the posterior approximated by the dropout itself does not deviate too far from the model distribution. In this paper, we make use of all three approaches [16, 18, 17] to obtain uncertainty and assess the performance and efficacy of each using our data.
III Approach
The primary objective of this work is to investigate the possibility of classifying the variant of ransomware a system is infected with solely based on an image of the splash screen or the ransom note captured from a computer screen (or mobile device) using a consumer camera. This is accomplished by training a classifier on a single original image of the splash screen of each ransomware. In the following, we will outline the details of our dataset, data augmentation techniques and the different networks used to carry out the classification.
III-A Training Dataset
To explore the potential of our ransomware classification pipeline, we train our model on a dataset of splash screens and ransom notes of 50 different variants of ransomware. A single image of a splash screen variant is available for each of the ransomware classes in our dataset. However, certain ransomware classes are associated with more than one splash screen (i.e. certain classes contain more than one training image but those images depict different splash screens associated with the same class), which significantly adds to the difficulty of the problem as this leads to a training data imbalance and can lead to training instability.
To test the performance of the approach, a balanced test set of 500 images (10 images per class) is created by casually taking screenshots of the ransomware images using two different types of camera phones (Apple and Android) from 6 different computer screens (with varying specifications, e.g. size, resolution, aspect ratio, panel type, screen coating and colour depth). We call this the positive test dataset since all the images within this dataset need to be positively identified as ransomware and any model trained using our dataset should be certain about the predictions it makes with respect to the ransomware variants it has already observed.
An additional set of 50 unrelated and/or non-ransomware images are captured from the same computer screens (under the same conditions as our positive test images) to evaluate the uncertainty estimates acquired using our Bayesian networks. We refer to this portion of our dataset as the negative test dataset, as any model trained on our dataset should be very uncertain about this data since these screenshot images are not of, and therefore should not be classified as, any ransomware known to the model. Examples of the training and positive and negative test images are shown in Figure 1. Note that some of the images in our negative test set (Figure 1 – bottom) are very similar in appearance to what a ransomware splash screen could look like. This has been purposefully designed so the uncertainty values estimated by the model can be more rigorously assessed.
Using our carefully designed augmentation techniques, we train the models on our training dataset of 66 images in 50 classes. In the following, we will briefly outline the details of our data augmentation techniques.
III-B Data Augmentation
During training, the network can only see the single image available for each splash screen variant. This lack of training data can significantly hinder generalisation as the model would simply overfit to the training distribution or memorise the few training images it has seen. This means a model trained on our training dataset without any modification or augmentation would be incapable of classifying images captured under test conditions from a computer screen (Section IV-B).
To prevent this, a carefully designed and tuned set of augmentation techniques is applied to the training images on the fly to simulate the test conditions (images casually captured from a computer screen). The hyper-parameters associated with these augmentation techniques (e.g. thresholds, intensity) are determined using exhaustive grid-searches which are excluded here. Each of the following augmentation techniques is randomly applied (both in terms of application and severity):
(1) *rotation:* randomly rotating the image with the angle of rotation in the range [-90°, 90°],
(2) *contrast:* randomly changing the image contrast by up to a factor of 2,
(3) *brightness:* randomly changing the brightness by up to a factor of 3,
(4) *occlusion:* to primarily simulate distractors such as screen glare and reflection mostly in glossy screens (up to a quarter of the image size occluded with random elliptical shapes of randomly selected bright colours),
(5) *Gaussian blur:* with a radius of up to 5,
(6) *motion blur:* simulating blurring effects caused by the movement of the camera during image capture (up to a movement length of 9 pixels – see Figure 2 - bottom),
(7) *defocus blur:* simulating the camera being out of focus which is a common occurrence when a computer screen is being photographed (up to a kernel size of 9 – see Figure 2 - top),
(8) *noise:* Gaussian noise up to a level of 0.2,
(9) *colour perturbations:* randomising hue by a maximum of 5% and saturating colours by a factor of up to 2, and
(10) *perspective:* by up to 50% over each axis to simulate the varying camera angles when a screen is being photographed.
By using random combinations of all the different augmentation methods applied to our training set with varying probabilities, very high levels of accuracy can be achieved (see Section IV). In the following section, we will focus on the details of the classification models and the network architectures that take advantage of these data augmentation techniques used within our approach to classify ransomware based on our training dataset.
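The random application of the ten techniques above can be sketched as follows. This is a minimal illustration and not the released implementation: the upper bounds come from the list above, while the lower bounds, the parameter names, the uniform sampling and the single application probability `apply_prob` are all assumptions made for the example.

```python
import random

# Technique -> parameter -> (low, high). Upper bounds follow the list above;
# lower bounds and parameter names are illustrative assumptions.
AUGMENTATIONS = {
    "rotation": {"angle_deg": (-90.0, 90.0)},
    "contrast": {"factor": (1.0, 2.0)},
    "brightness": {"factor": (1.0, 3.0)},
    "occlusion": {"area_fraction": (0.0, 0.25)},
    "gaussian_blur": {"radius": (0.0, 5.0)},
    "motion_blur": {"length_px": (1.0, 9.0)},
    "defocus_blur": {"kernel_size": (1.0, 9.0)},
    "noise": {"level": (0.0, 0.2)},
    "colour_perturbation": {"hue_shift": (0.0, 0.05),
                            "saturation_factor": (1.0, 2.0)},
    "perspective": {"distortion": (0.0, 0.5)},
}


def sample_augmentation_plan(rng, apply_prob=0.5):
    """Decide which techniques to apply and how severely.

    Each technique is switched on independently with probability
    `apply_prob` (an assumed value), and its severity is drawn
    uniformly from the configured range.
    """
    plan = {}
    for name, params in AUGMENTATIONS.items():
        if rng.random() < apply_prob:
            plan[name] = {key: rng.uniform(low, high)
                          for key, (low, high) in params.items()}
    return plan


# A fresh plan would be drawn for every training image at every step.
plan = sample_augmentation_plan(random.Random(0))
```

Separating the sampling of a plan from the image operations themselves keeps the randomness easy to seed and inspect; the actual pixel transformations would be supplied by an imaging library.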
III-C Classification Model
A very effective way of solving the problem of ransomware classification is to use the augmentation methods outlined in Section III-B along with any of the many optimised classification networks in the literature [6, 7, 8, 9, 10, 11, 12, 13]. Most of these networks are capable of yielding very high-accuracy results, especially when taking advantage of the boosted features that can be obtained by pre-training the network on large datasets such as ImageNet (Table I). However, it is important to note that despite the recent introduction of more efficient light-weight networks [8, 11, 12], the majority of the state-of-the-art classification models make use of very deep architectures and contain an extremely large number of parameters (Table II).
An important part of this work is to enable an accurate measurement of model uncertainty via Bayesian approximation, and as explained in Section II-C, this can be accomplished with a reasonable degree of mathematical accuracy by applying a dropout layer before every weight layer within the model. This can be highly problematic for very deep networks [6, 10, 7] since the large number of dropout layers in such networks would make convergence intractable. While simply reducing the number of dropout layers in a very deep network can help with the convergence problem [16], it comes at a cost of the precision of the uncertainty values since it would not be possible to accurately calibrate the uncertainty estimation process if some layers contain neurons that cannot be dropped.
To remedy this issue and for the sake of experimental consistency, we propose a simplified custom architecture, seen in Figure 3. This light-weight network takes an image of size 128 × 128 as its input and, after six convolutional layers and three max-pooling operations, produces a feature vector of 4096 dimensions. This is subsequently passed into a fully-connected layer which classifies the input into one of 50 classes. Training is accomplished via a cross entropy loss function. No normalization is performed in the network. To approximate Bayesian inference, a dropout layer can be placed after every weight layer in the network. Figure 3 shows an outline of our custom network architecture, with the dropout layers optionally used to approximate Bayesian inference.
We utilise the Bayesian dropout techniques [16, 18, 17] to calculate model uncertainty via Monte Carlo sampling. After $T$ stochastic forward passes of the same input (images) $x$ through the network to produce the output (class labels) $y$, the predictive mean of the model is as follows:
[TABLE]
The predictive uncertainty is thus obtained as follows:
[TABLE]
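For orientation, a common form of the Monte Carlo dropout estimators in the literature [16] is the sample mean and sample variance over the stochastic forward passes. The notation below ($T$ passes, input $x$, per-pass output $\hat{y}_t$) is an assumption introduced here, and the expressions may differ in detail from the ones used in this work.

```latex
\mathbb{E}[y] \approx \bar{y}(x) = \frac{1}{T} \sum_{t=1}^{T} \hat{y}_t(x),
\qquad
\operatorname{Var}[y] \approx \frac{1}{T} \sum_{t=1}^{T} \bigl( \hat{y}_t(x) - \bar{y}(x) \bigr)^{2}
```

Here the square is taken element-wise over the class probabilities, so a class whose predicted probability changes substantially from one pass to the next receives a large variance.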
The dropout rate can be set as a fixed hyper-parameter tuned to the data via intensive grid-searches (0.05 in our case for all six dropout layers in the network) or learned as model parameters [18, 17]. In Section IV-C, we experiment with all these variations of Bayesian approximation through dropout to enable further insight into the functionality of our model and uncertainty measurements in general.
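The sampling procedure can be illustrated with a toy example. The sketch below is not the custom network: it uses a hypothetical linear classifier with dropout on its input features, keeps dropout active at test time, and summarises the passes by their mean and variance. The rejection threshold `max_uncertainty` and the helper names are assumptions made for illustration; only the fixed dropout rate of 0.05 is taken from the text.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def stochastic_forward(features, weights, drop_rate, rng):
    """One forward pass of a toy linear classifier with dropout left on."""
    keep = 1.0 - drop_rate  # drop_rate is expected to be below 1
    # Inverted dropout: keep each feature with probability `keep`.
    dropped = [f / keep if rng.random() < keep else 0.0 for f in features]
    logits = [sum(w * f for w, f in zip(row, dropped)) for row in weights]
    return softmax(logits)


def mc_dropout_predict(features, weights, num_passes=50, drop_rate=0.05, seed=0):
    """Return (predicted class, mean probabilities, variance of that class)."""
    rng = random.Random(seed)
    samples = [stochastic_forward(features, weights, drop_rate, rng)
               for _ in range(num_passes)]
    num_classes = len(weights)
    mean = [sum(s[c] for s in samples) / num_passes for c in range(num_classes)]
    variance = [sum((s[c] - mean[c]) ** 2 for s in samples) / num_passes
                for c in range(num_classes)]
    predicted = max(range(num_classes), key=mean.__getitem__)
    return predicted, mean, variance[predicted]


def classify_or_reject(features, weights, max_uncertainty=0.05):
    """Return (class, uncertainty), with class None when too uncertain."""
    predicted, _, uncertainty = mc_dropout_predict(features, weights)
    if uncertainty > max_uncertainty:
        return None, uncertainty  # flag the input for special handling
    return predicted, uncertainty
```

In a deployed pipeline, a rejected input would correspond to an unrelated image or an unseen variant; a suitable threshold would have to be chosen from data such as the positive and negative test sets.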
III-D Implementation Details
The image data in our training and test sets are all of different resolutions but for the sake of consistency, they are all cropped to a square with the length equal to the smaller dimension of the image (random cropping for training images and centre cropping for test images) and resized to 128 × 128 for our custom network architecture or 256 × 256 to achieve higher accuracy results using deeper convolutional networks. The non-linearity module used in our custom architecture is leaky ReLU (). The training data imbalance issue is handled by weighting the inputs in the loss function according to the frequency of their class within the overall dataset. All models are trained to 100,000 steps. The implementation is done in PyTorch [48], with Adam [49] empirically providing the best optimization (, , ).
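Two of the steps above, the square crop and the class weighting, can be sketched in a few lines. The inverse-frequency formula is an assumption (the text only states that weights follow class frequency), and both function names are hypothetical.

```python
import random
from collections import Counter


def inverse_frequency_weights(labels):
    """Per-class loss weights, larger for rarer classes.

    Uses weight_c = N / (K * n_c), an assumed but common choice, where N is
    the number of samples, K the number of classes and n_c the class count.
    """
    counts = Counter(labels)
    total = len(labels)
    num_classes = len(counts)
    return {cls: total / (num_classes * n) for cls, n in counts.items()}


def square_crop_box(width, height, training, rng=None):
    """Return a (left, top, right, bottom) box for a square crop.

    The side equals the smaller image dimension; the position is random
    for training images and centred for test images.
    """
    side = min(width, height)
    if training:
        if rng is None:
            rng = random.Random()
        left = rng.randint(0, width - side)
        top = rng.randint(0, height - side)
    else:
        left = (width - side) // 2
        top = (height - side) // 2
    return left, top, left + side, top + side
```

With this weighting, a class that has two training images contributes about as much to the loss as a class with one, which is the effect the balancing step is meant to achieve.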
IV Experimental Results
In this section, we evaluate our work using extensive experimental analysis. The results of various state-of-the-art classification approaches are evaluated and using ablation studies, we demonstrate the importance of our data augmentation approaches. Additionally, using our positive and negative test datasets, we investigate the effectiveness of model uncertainty values obtained through Bayesian approximation via dropout.
IV-A State-of-the-Art Classification
To achieve the highest possible levels of accuracy, we train various state-of-the-art image classification networks [6, 7, 8, 9, 10, 11, 12, 13]. With relatively high-resolution images (256 × 256) used as inputs, accuracy levels of up to 93.6% can be achieved using our full augmentation protocol and a DenseNet-201 network [9] pre-trained on ImageNet.
Table I contains the numerical results obtained from different architectures across various metrics with inputs of size 256 × 256. As seen in Table I, the features obtained by pre-training the network on ImageNet are an invaluable asset and can lead to accuracy gains of up to about 14 percentage points for some of the networks (e.g. ShuffleNet-V2 [11], from 0.774 to 0.910).
As indicated by the high F1 score, despite the uneven class distribution in our training dataset, our class balancing efforts (Section III-D) allow most networks to learn about all the classes in an evenly distributed manner. The high AUC (Area Under the Curve) values also demonstrate the strong learning capabilities of the models, which are able to easily distinguish between the classes with little confusion. The confusion matrices for some of the models [6, 7, 9, 10, 11, 12, 13] shown in Figure 4 further confirm these findings and point to the strong feature learning capabilities of the models.
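To make the link between a confusion matrix and the F1 score concrete, the sketch below computes per-class and macro-averaged F1 from a matrix of counts. The macro averaging is an assumption made for illustration, as the averaging scheme is not stated here, and the function names are hypothetical.

```python
def per_class_f1(confusion):
    """confusion[i][j] = number of samples of true class i predicted as class j."""
    n = len(confusion)
    scores = []
    for c in range(n):
        tp = confusion[c][c]
        fn = sum(confusion[c]) - tp                        # missed samples of class c
        fp = sum(confusion[r][c] for r in range(n)) - tp   # other classes predicted as c
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never appears.
        scores.append(2 * tp / denom if denom else 0.0)
    return scores


def macro_f1(confusion):
    """Unweighted mean of the per-class F1 scores."""
    scores = per_class_f1(confusion)
    return sum(scores) / len(scores)
```

Because every class contributes equally to the macro average, a high value under this scheme would indicate that rare classes are not being neglected in favour of frequent ones.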
An important aspect of our work, however, is training and inference efficiency. Convergence during training can become intractable in very deep models when dropout is used as a Bayesian approximation to obtain model uncertainty. Since our approach is specifically meant to accommodate lay users through a web server, a light-weight model that can be deployed efficiently is very important to reduce the chance of high load and hence denial of service.
To address the issues of efficiency and convergence rate, and to guarantee better experimental consistency, we compare our custom architecture, which takes smaller (128 × 128) images as its input, against state-of-the-art deep and light-weight networks commonly used within the literature when they receive the same small (128 × 128) images as input. As seen in Table II, our simpler network outperforms most deeper and light-weight networks [6, 7, 8, 10, 11, 12, 13] while remaining competitive with the rest [9]. The superior performance of our simple architecture is mainly due to the fact that the number of its layers and parameters is carefully tuned to the dataset (using preliminary architecture searches, which have been excluded for brevity).
IV-B Ablation Studies
One of our primary contributions is the ability to train an accurate ransomware screenshot classifier using a single training image for each variant of splash screen or ransom note. This is achieved using ten carefully-designed augmentation techniques (Section III-B), the combination of which will result in the simulation of a screenshot of a ransomware splash screen captured using a consumer-grade camera. Consequently, a substantial part of our experimental setup has been to demonstrate the importance of each of these augmentation techniques to ensure that they positively contribute to the improved performance of the model. To accomplish this, we train our custom network (with no dropout) using individual augmentation techniques to measure their effects on the results. Table III contains the results of our custom network when trained on individual augmentation methods.
As expected, not using any augmentation leads to a poor performance from the model, while significantly better results can be achieved when all the augmentation methods are combined. We also experimented with random combinations of the techniques to empirically investigate any incompatibility, but found that all augmentation techniques used here contribute to the improvement of the results, as seen in Table IV.
As seen in Tables III and IV, perspective and rotation have the greatest influence over the results. In our experiments with additional augmentation techniques, we found that horizontally flipping the images results in worse model performance since the test set does not contain any mirror images, as modern consumer cameras do not produce mirrored outputs. Interestingly, we also found that adding vertical flipping to the mix of our augmentation techniques had no impact on the results, as the effects of this augmentation method can be achieved through rotation. As a result, image flipping was removed from the list of augmentation techniques used in our approach.
IV-C Model Uncertainty
Another important component of this work is the ability of the model to calculate uncertainty, thereby enabling the identification of unrelated input images (e.g. non-ransomware inputs and new previously-unseen ransomware images). Our custom network (Figure 3) is consequently trained with the three different dropout modules [16, 17, 18] used for Bayesian approximation. Dropout layers are kept in place during inference and uncertainty is calculated as per Eqn. 2 via Monte Carlo sampling of the network weights. Recent work [19, 20] argues that the use of variational dropout [18] does not lead to proper Bayesian behaviour and can result in overfitting. This notion is somewhat confirmed by our experiments. As seen in Figure 5, our network trained with variational dropout is prone to overfitting and produces lower test accuracy levels.
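The Monte Carlo procedure can be illustrated with a toy example. The sketch below is not our network: it uses a hypothetical linear classifier with dropout applied to its inputs, and it assumes the variance of the softmax outputs across stochastic passes as the uncertainty measure, which may differ in detail from Eqn. 2. Only the fixed dropout rate of 0.05 is taken from our experiments.

```python
import math
import random


def softmax(logits):
    m = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def stochastic_forward(x, weights, p_drop, rng):
    """One forward pass of a toy linear classifier with dropout left on."""
    keep = 1.0 - p_drop
    # Inverted dropout: zero each input with probability p_drop, rescale the rest.
    dropped = [xi / keep if rng.random() < keep else 0.0 for xi in x]
    logits = [sum(w * xi for w, xi in zip(row, dropped)) for row in weights]
    return softmax(logits)


def mc_dropout_predict(x, weights, p_drop=0.05, n_samples=50, seed=0):
    """Return (predicted class, mean probabilities, uncertainty)."""
    rng = random.Random(seed)
    runs = [stochastic_forward(x, weights, p_drop, rng) for _ in range(n_samples)]
    n_classes = len(weights)
    mean = [sum(r[c] for r in runs) / n_samples for c in range(n_classes)]
    # Uncertainty proxy (assumption): variance of the softmax outputs across
    # passes, averaged over classes.
    var = [sum((r[c] - mean[c]) ** 2 for r in runs) / n_samples
           for c in range(n_classes)]
    uncertainty = sum(var) / n_classes
    predicted = max(range(n_classes), key=lambda c: mean[c])
    return predicted, mean, uncertainty
```

The design point is that the prediction is the average over many stochastic passes, while the spread between those passes, rather than the softmax confidence of any single pass, serves as the uncertainty signal.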
Moreover, by calculating model uncertainty when the model is evaluated using our positive and negative test data, we can assess the effectiveness of our uncertainty values. One would expect the model to be very uncertain when negative test images (unrelated images) are passed in as inputs, while the uncertainty values should be smaller when positive test data (ransomware screenshots) are seen by the network. As seen in Figure 6, our experiments point to the same conclusions with uncertainty values being significantly higher in the presence of negative data. Interestingly, as seen in Figure 6, a fixed dropout rate (FDO) [16] produces cleaner and more accurate uncertainty values despite the intensive computation required to determine the dropout rate (0.05 for all layers in our case).
Figure 7 shows the confidence and uncertainty values obtained for a small number of randomly-selected examples from our positive and negative test datasets. As expected, confidence values (softmax outputs) are essentially meaningless and contain very little information about how much the network actually knows about the image, while uncertainty values are a better indicator of whether the network has sufficient knowledge of the input image or not. For our best-performing model (fixed dropout), an uncertainty value of 0.12 seems to be a reasonable estimated threshold, beyond which the predictions of the model are not credible. Similar conclusions can be drawn from Table V, which contains the numerical results of the Bayesian approximation methods [16, 17, 18] applied to positive and negative test data. As seen in Table V, the mean uncertainty values are an order of magnitude higher for the negative test images than they are for the positive images, and the confidence values have such a high standard deviation that their use to measure how much the model knows about the input it has received can lead to very misleading results.
V Discussions and Future Work
As discussed in Section IV-A, we are able to achieve high accuracy results using our augmentation techniques and deep convolutional neural networks such as DenseNet [9]. However, since another important component of our work, model uncertainty, relies on introducing a dropout layer after every weight layer within the model, convergence for very deep models such as DenseNet [9] would be almost impossible, which is why we opt for using our own simplified network architecture.
While this can sufficiently meet the requirements of our application through a possible two-stage solution (the light-weight network measures the uncertainty of the model with respect to the input and if the value is low and special handling is not required, the deep network can be subsequently used to conduct the actual classification), future work can involve the use of Bayesian modules within each layer [50] or a Bayesian last layer in the network [51], thus enabling the optimisation of much deeper networks with plausible uncertainty calculation capabilities. Additionally, if the parameters of the augmentation techniques could be learned during training instead of being laboriously tuned through extensive grid searches, the resulting efficient and stable training procedure can lead to superior model performance.
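The two-stage solution mentioned above could be arranged along the following lines. This is only a sketch under stated assumptions: `light_model` and `deep_model` are hypothetical callables standing in for the uncertainty-aware custom network and the deep classifier respectively, and the default threshold of 0.12 is the estimate reported in Section IV-C for the fixed-dropout model.

```python
def two_stage_classify(image, light_model, deep_model, threshold=0.12):
    """Route an input through the light-weight network first.

    light_model(image) -> uncertainty value (float)   [hypothetical interface]
    deep_model(image)  -> predicted class label        [hypothetical interface]

    Returns (label, uncertainty); label is None when the input is set
    aside for special handling.
    """
    uncertainty = light_model(image)
    if uncertainty > threshold:
        # Likely an unrelated image or an unseen variant: do not classify.
        return None, uncertainty
    # Uncertainty is low enough: let the deeper network classify.
    return deep_model(image), uncertainty
```

With this arrangement the deep network is only invoked for inputs the light-weight network considers familiar, which also limits the load on a public-facing server.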
VI Conclusion
In this work, we explore the possibility of performing ransomware classification based on a simple screenshot of the splash screen or ransom note captured using a consumer camera found in any modern mobile phone. To make this possible, we create a sample dataset with only a single image available for every variant of ransomware splash screen. Instead of creating a large corpus of ransomware screenshot images for training, we opt for simulating the conditions that lead to the appearance of a screenshot image through carefully designed data augmentation techniques, resulting in a very simple one-shot learning procedure. Additionally, we employ various Bayesian approximation approaches [16, 17, 18] to obtain model uncertainty. Using uncertainty values, we are then able to identify special input cases such as unrelated non-ransomware images and new previously-unseen ransomware variants that our trained models are not able or expected to classify. These particular input cases can be set aside for special handling. Using extensive experimental evaluation, we have demonstrated that test accuracy levels of up to 93.6% can be achieved using our full augmentation protocol and a deep network such as DenseNet [9]. Assessments using our negative test dataset (images unknown to the model) also indicate that our custom architecture trained with [16, 17, 18] is capable of accurately estimating uncertainty values.
Acknowledgement
We would like to thank the Engineering and Physical Sciences Research Council (EPSRC) for funding this research project. This work in part made use of the Rocket High Performance Computing service at Newcastle University.
[1] H. Zhang, X. Xiao, F. Mercaldo, S. Ni, F. Martinelli, and A. K. Sangaiah, “Classification of ransomware families with machine learning based on N-gram of opcodes,” Future Generation Computer Systems, vol. 90, pp. 211–221, 2019.
[2] S. Kok, A. Abdullah, N. Jhanjhi, and M. Supramaniam, “Ransomware, threat and detection techniques: A review,” Int. J. Computer Science and Network Security, vol. 19, no. 2, p. 136, 2019.
[3] B. A. S. Al-rimy, M. A. Maarof, and S. Z. M. Shaid, “Ransomware threat success factors, taxonomy, and countermeasures: A survey and research directions,” Computers & Security, vol. 74, pp. 144–166, 2018.
[4] R. Vinayakumar, K. P. Soman, K. K. S. Velan, and S. Ganorkar, “Evaluating shallow and deep networks for ransomware detection and classification,” in Int. Conf. Advances in Computing, Communications and Informatics, 2017, pp. 259–265.
[5] A. Ferrante, M. Malek, F. Martinelli, F. Mercaldo, and J. Milosevic, “Extinguishing ransomware - A hybrid approach to Android ransomware detection,” in Int. Symp. Foundations and Practice of Security. Springer, 2017, pp. 242–258.
[6] K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” arXiv preprint arXiv:1409.1556, 2014.
[7] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in IEEE Conf. Computer Vision and Pattern Recognition, 2016, pp. 770–778.
[8] F. Iandola, S. Han, M. Moskewicz, K. Ashraf, W. Dally, and K. Keutzer, “SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5MB model size,” arXiv preprint arXiv:1602.07360, 2016.
