Malicious Software Detection and Classification utilizing Temporal-Graphs of System-call Group Relations
Anna Mpanti, Stavros D. Nikolopoulos, Iosif Polenakis

TL;DR
This paper introduces a novel graph-based approach for malware detection and classification that incorporates the temporal evolution of system-call relations to improve accuracy and mutation tolerance.
Contribution
The work's novelty lies in integrating temporal dynamics into graph representations of system-call relations for enhanced malware detection and classification.
Findings
Temporal graphs improve detection accuracy.
Graph-based models enhance mutation tolerance.
Effective classification of malware families.
| Group Name | Size | Group Name | Size |
|---|---|---|---|
| ACCESS_MASK | 1 | PHANDLE | 1 |
| Atom | 5 | PLARGE_INTEGER | 1 |
| BOOLEAN | 1 | Process | 49 |
| Debug | 17 | PULARGE_INTEGER | 1 |
| Device | 31 | PULONG | 1 |
| Environment | 12 | PUNICODE_STRING | 1 |
| File | 44 | PVOID_SIZEAFTER | 1 |
| HANDLE | 1 | PWSTR | 1 |
| Job | 9 | Registry | 40 |
| LONG | 1 | Security | 36 |
| LPC | 47 | Synchronization | 38 |
| Memory | 25 | Time | 5 |
| NTSTATUS | 1 | Transaction | 49 |
| Object | 19 | ULONG | 1 |
| Other | 36 | WOW64 | 19 |
Abstract
In this work we propose a graph-based model that, utilizing relations between groups of System-calls, distinguishes malicious from benign software samples and classifies the detected malicious samples to one of a set of known malware families. More precisely, given a System-call Dependency Graph (ScDG) that depicts the malware’s behavior, we first transform it to a more abstract representation, utilizing the indexing of System-calls to a set of groups of similar functionality, constructing thus an abstract and mutation-tolerant graph that we call Group Relation Graph (GrG); then, we construct another graph representation, which we call Coverage Graph (CvG), that depicts the dominating relations between the nodes of a GrG graph. Based on the research so far in the field, we pointed out that behavior-based graph representations had not leveraged the aspect of the temporal evolution of the graph. Hence, the novelty of our work is that, preserving the initial representations of GrG and CvG graphs, we focus on augmenting the potentials of theses graphs by adding further features that enhance its abilities on detecting and further classifying to a known malware family an unknown malware sample. To that end, we construct periodical instances of the graph that represent its temporal evolution concerning its structural modifications, creating another graph representation that we call Temporal Graphs. In this paper, we present the theoretical background behind our approach, discuss the current technological status on malware detection and classification and demonstrate the overall architecture of our proposed detection and classification model alongside with its underlying main principles and its structural key-components.
Keywords: Malicious Software, Detection, Security, Systems, Algorithms, Graphs.
Department of Computer Science & Engineering
University of Ioannina
GR-45110 Ioannina, Greece
{ampanti,stavros,ipolenak}@cs.uoi.gr
1 Introduction
Malware, or malicious software, is a type of software intended to cause harm to endpoint computers, systems or networks [46]. In this work we design and propose a graph-based model that develops an algorithmic technique for malware detection and classification. Our method is applied to unknown software samples in order to detect whether they are malicious or not, and to further classify them to one of a set of known malware families (i.e., sets of malware samples with similar functionality), as these families have been defined by various antivirus software vendors.
1.1 Malware and Mutations
Malware authors, for their part, have developed and deployed various techniques in order to avoid the traditional byte-level signature-based detection methods. Since such detection methods are significantly fragile against even the slightest (i.e., bit-level) mutation of the initial subject (i.e., ancestor malware sample), malware authors mutate their software products, creating structurally different but functionally similar copies of them. Apart from the mutation methods that leverage one or more levels of encryption, there also exist more advanced mutation methods. Some of the most widely applied malware mutations are oligomorphism, which is achieved through obfuscation techniques, polymorphism, where the code is modified through encryption techniques, and metamorphism, in which multiple structurally different copies of a malware sample are generated.
More precisely, an oligomorphic or semi-polymorphic malware is a specific category of obfuscated malware that carries an encryption/decryption module for multi-layer encryption in order to avoid detection of the decryption body. On the other hand, a polymorphic malware can create an endless number of new decryptors that use different encryption methods to encrypt the body of the malware [50]. As stated in [41], the main principle is to constantly modify the appearance of the code across the copies. Finally, a metamorphic malware changes its structure while keeping its functionality each time it replicates itself [34]. Polymorphic and metamorphic malware are the hardest types of malware to detect, since they are able to mutate into an infinite number of functionally equivalent copies of themselves, and thus there is no constant signature for virus scanning [34].
Hence, since the main functionality of a malware sample remains immutable during its mutations, malware samples can be merged into groups of samples with common functionality, the so-called malware families. So, in this work we developed an algorithmic technique that not only detects whether a program is malicious or not but, given a malicious sample, can also decide the malware family to which it belongs.
1.2 Protection against Malicious Software
Since malicious software poses a major threat, several protection approaches have been proposed and implemented in order to eliminate such threats. The main corpus of the defense line is mainly developed over three axes, namely malware analysis, malware detection and malware classification:
Malware Analysis. Malware analysis [7] is the process of determining the purpose and the functionality or, abstractly, the behavior of a given malicious code. Such a process is a necessary prerequisite in order to develop efficient and effective detection and also classification methods, and is mainly divided into two main categories, namely Static and Dynamic analysis [46].
Malware Detection. The term malware detection refers to the process of determining whether a given program is malicious or benign according to an a priori knowledge [1, 11, 31]. Specifically, with the term a priori knowledge we refer to something that is known to be malicious, or to a characteristic possessed by something that is malicious, at a given time. However, efficient malware detection is strongly related to malware analysis, during which the analyst collects all the required information.
Malware Classification. The term malware classification refers to the process of determining the malware family to which a particular malware sample belongs. Malware classification is a quite important procedure, since the indexing of malware samples into families provides the ability to generalize detection signatures from sample level to family level. Through the indexing of a malware sample to a malware family, the construction of a new sample-specific signature is omitted, since the sample can be detected by the signature of its family.
1.3 Our Approach
In our approach, we leverage behavioral graph representations of software samples in order to distinguish whether they are malicious or not and to further classify them to a malware family. More precisely, given a representation (behavioral graph) of the behavior of a malicious software, in our case a System-call Dependency Graph (ScDG), constructed by capturing the dependencies of the system-calls invoked during the execution of the software, we construct a directed edge-weighted graph, which we call Group Relation Graph (GrG), resulting from the ScDG after grouping disjoint subsets of its vertices. It has been proven [38] that such a graph abstraction, by generalizing the graph's structure, makes the detection and classification processes more resilient to known malware mutation procedures. Next, we construct an additional graph representation, the Coverage Graph (CvG), a vertex-weighted undirected graph which results from the GrG after computing the domination relations among its vertices (regarding their degree and weight) when representing them on the Cartesian plane. Throughout these processes, over specific time intervals, we preserve instances of the GrG and CvG graphs, creating hence Temporal Graphs that depict their structural evolution over time, namely the Group Relation Temporal Graph (GrTG) and the Coverage Temporal Graph (CvTG), respectively. Given a ScDG that represents a known malware sample and a ScDG that represents an unknown one, we utilize these instances (i.e., Temporal Graphs) of their corresponding GrG and CvG graphs, produced from each ScDG, in order to perform graph similarity towards the processes of malware detection and classification.
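The idea of preserving graph instances over time intervals can be sketched as follows. This is only a minimal illustration, assuming that every inter-group invocation carries a timestamp and that an instance is the cumulative edge-weighted graph observed up to the end of each interval. The function name `temporal_instances`, the layout of the event tuples and the fixed-length interval are assumptions made for this example and are not taken from the paper.

```python
from collections import Counter

def temporal_instances(events, interval):
    """Return cumulative edge-weight snapshots, one per time interval.

    events   -- iterable of (timestamp, src_group, dst_group) tuples
                (hypothetical layout, assumed for this sketch)
    interval -- length of a time interval, in the unit of the timestamps
    """
    events = sorted(events)
    if not events:
        return []
    snapshots = []
    weights = Counter()
    boundary = events[0][0] + interval
    for t, src, dst in events:
        # Close every interval that ended before this event occurred.
        while t >= boundary:
            snapshots.append(dict(weights))
            boundary += interval
        weights[(src, dst)] += 1
    snapshots.append(dict(weights))  # last (possibly partial) interval
    return snapshots

events = [
    (0, "File", "Memory"),
    (1, "File", "Memory"),
    (5, "Memory", "Process"),
    (11, "Process", "Registry"),
]
instances = temporal_instances(events, interval=5)
for index, instance in enumerate(instances):
    print(index, instance)
```

Each snapshot is a plain dictionary from a (source group, target group) pair to its weight, so two consecutive instances can be compared directly to see which edges appeared or grew during an interval.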
1.4 Contribution
In this work, we present our graph-based model for distinguishing graph representations of malicious software and further classifying them into sets of known malware families. Firstly, we discuss our proposed graph abstraction over the ScDG that represents the relations between system-call groups (i.e., the Group Relation Graph, GrG) and its corresponding graph that represents the dominating relations over the nodes of the GrG (i.e., the Coverage Graph, CvG). Moreover, we present another graph representation that describes the temporal evolution of the structure of the aforementioned graph abstractions (i.e., GrG and CvG), namely the Group Relation Temporal Graph (GrTG) and the Coverage Temporal Graph (CvTG), respectively, leveraging the temporal correlation between the structural modifications of two graphs in order to be utilized in graph similarity. Furthermore, we demonstrate the development of an integrated framework that implements graph similarity approaches over the deployed GrTG and CvTG graphs in order to perform the malware detection and classification processes. Finally, we discuss the potentials of our approach, set our further research landmarks for the extensions of our proposed model, and conclude our work.
1.5 Related Work
In malware detection, similar models utilizing different, non graph-based techniques have been proposed, like the one by Alazab et al. [1], who developed a fully automated system that disassembles executables, extracts API-call features from them and then, using n-gram statistical analysis, distinguishes malicious from benign executables. The mean detection rate exhibited was 89.74% with 9.72% false-positives when a Support Vector Machine (SVM) classifier was applied over n-grams. In [53], Ye et al. described an integrated system for malware detection based on API-sequences. This is also a different model from ours, since the detection process is based on matching the API-sequences against OOA rules (i.e., Objective-Oriented Association) in order to decide whether a test program is malicious or not. Finally, an important work of Christodorescu et al., presented in [11], proposes a malware detection algorithm based on instruction semantics. More precisely, templates of control flow graphs are built, and a program is deemed malicious when it satisfies them. Although their detection model exhibits better results than the ones produced by our model, since it exhibits 0 false-positives, it is a model based on static analysis and hence it would not be fair to compare two methods that operate on different objects. Kolbitsch et al. [25] proposed an effective and efficient approach for malware detection, based on behavioral graph matching by detecting string matches in system-call sequences, that is able to substitute the traditional anti-virus system at the end hosts. The main drawback of this approach is that, although no false-positives were exhibited, its detection rates are too low compared with other approaches. Luh and Tavolato [29] presented one more detection algorithm based on behavioral graphs that distinguishes malicious from benign programs by grading the sample based on reports generated from monitoring tools. While the produced false-positives are very close to ours, the corresponding detection ratio is even lower.
In malware classification, other non graph-based models have been proposed. Among them, a scalable automated approach for malware classification using pattern recognition algorithms and statistical methods is presented by Islam et al. in [21], utilizing the combination of static features extracted from function length and printable strings. While their evaluation results are very high (i.e., % classification accuracy), it is worth mentioning that their experiments include samples from malware families, while the classification accuracy of the model proposed in this paper has been evaluated over malware families. Hence, concerning the impact of phylogeny among different malware families, the comparative difference between the classification rates achieved by these two models is totally justified, since increasing the number of families in the training set increases the chances of misclassifications. Recently, Nataraj et al. [35] classified malware samples using image processing techniques. Visualizing the malware binaries as gray-scale images, they utilize the fact that, for many malware families, the images belonging to the same family appear very similar in layout and texture. Obviously the results are better than the ones produced by our model; however, they use at most malware families for their large-scale experiments, and the impact of phylogeny among different malware families decreases as the number of different malware families in the training set decreases. Finally, in [36] Nataraj et al. utilize a static analysis technique called binary texture analysis in order to classify malicious binary samples into malware families. They achieve a rate of consistent classification when performing their evaluation on a data set of to samples comparing their labels with those provided by AV vendors, proving both the accuracy and the scalability of their model.
In the most recent literature, Makandar and Patrot [30] focus on the detection and classification of Trojan viruses using image processing techniques. In their proposed algorithm the Gabor wavelet is used as the key feature extraction method, and their experimental results are analyzed and compared using two classifiers, KNN and SVM. In [19], Hassen and Chan investigate a linear time function call graph (FCG) vector representation based on function clustering that has significant performance gains in addition to improved classification accuracy. They also show how this representation can enable using graph features together with other non-graph features. Recently, Sikora and Zelinka [45] investigated how the behavior of malicious software can be connected with the evolution and visualization of its spreading as a network. Their approach is based on a hypothetical swarm virus and the dynamics of its spread in a PC, and their experiments suggest that these dynamics can be modeled as a network structure and thus likely controlled and stopped. Later, Souri and Hosseini [47] presented a systematic and detailed survey of the malware detection mechanisms using data mining techniques. Additionally, their survey classifies the malware detection approaches in two main categories, namely signature-based methods and behavior-based detection. Based on the dependency graphs of malware samples, Ding et al. [14] propose an algorithm to extract the common behavior graph for each known malware, which is used to represent the behavioral features of a malware family. In addition, a graph matching algorithm that is based on the maximum weight subgraph is used to detect malicious code. In [33], Mukesh et al. propose a machine learning based architecture to distinguish existing and recently developing malware by utilizing network and transport layer traffic features.
1.6 Road Map
In Section 2 we present the theoretical background required for the development of graph-based techniques for detection and classification. In Section 3 we demonstrate the key components of our model. In Section 4 we discuss extensively the main principles and design aspects of our detection and classification scheme concerning the two processes. Finally, in Section 5 we set our further research landmarks, discussing the potentials and limitations of our proposed model concerning the processes of malware detection and classification, and we present our concluding remarks.
2 Conceptual Framework
In this section we discuss the semantics behind the processes of malware analysis, detection and classification. We discuss the principles of utilizing behavior-based approaches towards the deployment of resilient detection and classification techniques. Firstly, we present the major process preceding the development of detection and classification methods, that is, malware analysis, and next we depict the state-of-the-art behavioral approaches applied to malware detection and classification.
2.1 Analyzing Susceptible Samples
Traditional signature-based malware detection, despite its fast real-time protection, is still not resilient against malware mutations. To be effective and efficient, robust detection techniques require the procedure of malware analysis, during which the analyst collects all the required information. Signature scanning, relying on pattern matching, fails to detect new malware strains or mutated variants of existing ones [12].
The procedure of malware analysis consists of collecting valuable information concerning either static artifacts or, more generally, behavioral patterns that could characterize a sample as malicious or not, and it is divided into two main categories, namely static analysis and dynamic analysis, respectively [44, 48]. At a more abstract level, in static analysis the specimen (i.e., test sample) is examined without being executed, performing the analysis on its source code and utilizing reverse engineering techniques when source code is unavailable, while in dynamic analysis the malware has to be executed in order to collect the required data concerning the behavior of the program [4].
Static Analysis. Static analysis of software is performed over the programming artifacts and structural characteristics of a software sample [13], without the need of its execution. The information obtained during static malware analysis may refer to opcode sequences, control flow graphs, etc., and can be used at will for malware detection [12]. Since static malware analysis does not execute the sample, it can easily be foiled by obfuscation and packing techniques (which change the sequence of instructions or the signatures of malware); however, its scalability constitutes one of its assets [44, 27].
Several approaches have been deployed on the implementation of static malware analysis, including control-flow graphs, function call graph, machine learning techniques, support vector machines, similarity between API call sequences and opcode sequences, hidden Markov models, and principal component analysis [12].
Dynamic Analysis. Dynamic malware analysis deals mostly with the extraction of behavioral features exhibited during the execution of a malicious software sample. Such behavioral features include, among others, environmental artifacts, timing, process introspection, network artifacts, etc. [9]. The behavioral features are mainly captured and depicted through API-call sequences and system-call dependencies [12].
For security reasons the whole execution takes place inside a virtual emulated environment (i.e., a Virtual Machine) [4]; however, the scalability of dynamic malware analysis may be reduced due to the demand for real-time execution [44]. Moreover, although obfuscation techniques can easily be defeated through dynamic analysis, the time needed for analysis is disproportionate to the rate of birth of mutated malware samples [27]. Hence, the need for automated dynamic analysis led to the development of integrated dynamic analysis systems that incorporate virtualized and supervised environments, i.e., virtualized analysis systems, which among others include emulators, hosted virtual machines, hypervisors, etc. [9].
A specific type of dynamic analysis, called taint analysis or DTA (which stands for dynamic taint analysis), traces data flows in programs or systems during execution time. Briefly speaking, taint analysis distinguishes three elements, namely taint sources, taint sinks, and propagation rules. Data flows are taint variables introduced by taint sources (i.e., the output parameters of system calls) and propagated according to the propagation rules to taint sinks (i.e., the input parameters of system calls) [4]. However, throughout the literature, several techniques have been proposed that combine characteristics from both static and dynamic approaches, synthesizing a hybrid analysis model [12].
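The interplay of taint sources, taint sinks and propagation rules can be illustrated with a small sketch. It assumes a heavily simplified trace format in which each system call lists the values it consumes and the values it produces, and it uses the simplest possible propagation rule (a value stays tainted by the call that last produced it). The function name `taint_dependencies`, the trace layout and the handle/buffer labels are hypothetical; real taint analysis tracks data at a much finer granularity.

```python
def taint_dependencies(trace):
    """Derive data-flow dependencies between system calls from a trace.

    trace -- list of (call_name, input_values, output_values) tuples in
             execution order (a simplified format assumed for this sketch)
    Returns a list of (producer_call, consumer_call) pairs.
    """
    taint_source = {}  # value -> name of the call that produced it
    dependencies = []
    for name, inputs, outputs in trace:
        # Taint sink: an input parameter that carries a tainted value.
        for value in inputs:
            if value in taint_source:
                dependencies.append((taint_source[value], name))
        # Taint source: every output parameter becomes tainted by this call.
        for value in outputs:
            taint_source[value] = name
    return dependencies

trace = [
    ("NtOpenFile", [], ["h1"]),
    ("NtReadFile", ["h1"], ["buf"]),
    ("NtWriteFile", ["buf"], []),
    ("NtClose", ["h1"], []),
]
deps = taint_dependencies(trace)
print(deps)
```

Each returned pair is a data-flow dependency from the call that produced a value to a call that later consumed it, which is the kind of relation a system-call dependency graph records as an edge.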
2.2 Detecting Malicious Behaviors
As we mentioned previously, malicious software samples are intended to compromise the privacy, the confidentiality or the integrity of a system, of data or of any other cyber-source, constituting hence an intrusion. To this end, Intrusion Detection Systems (IDS, for short) are deployed in order to monitor the execution of applications, the traffic of networks or whole systems, aiming at spotting malicious activity patterns [3]. The system supervision through an IDS can be performed through the application of malware detection techniques that involve file comparisons against signatures of malicious software [15], behavior monitoring of malicious patterns and system supervision [3].
However, the increasing birth-rate of new or mutated malware samples has raised the need for efficient and elaborate malware detection techniques that can effectively detect new malware strains in reasonable amounts of time. The detection approaches are strongly connected to the feature set provided by the previous stage of malware analysis, and their features are distinguished into static and dynamic ones, respectively. Static features may include statistical analysis on n-grams or opcodes and properties of control flow graphs, while dynamic features are obtained during the execution of a program and concern its general behavior (i.e., interaction with the host environment, the O.S.), access events or any other interconnection patterns [18].
Malware detection approaches are divided into two main categories, namely signature-based malware detection and behavior-based malware detection [47, 12, 2, 8, 20, 6, 52, 22]. Next we briefly discuss these two methods and present some of the approaches deployed in each one.
Signature-based Malware Detection. Signature-based malware detection is the dominant technique deployed by antivirus software products due to its time efficiency, which provides real-time protection against malicious threats [13]. A byte-level signature is a sequence (i.e., pattern) of bytes used to identify each newly discovered malware, using a scanning scheme of exact correlation and a repository of signatures in order to detect malicious software samples [12]. A signature may represent a byte-code sequence, a binary assembly instruction, an imported Dynamic Link Library (DLL), or function and system calls [2, 8]. Novel malware detection approaches using machine learning can be deployed through two methods, namely, assembly features and binary features [47]. However, signature-based detection techniques can easily be evaded through code obfuscation techniques, since even the slightest modification of the code sequence leads to a completely different byte-sequence [12]. A major characteristic of signature-based malware detection is the precision it exhibits, both in object scanning, which utilizes efficient meta-heuristic algorithms, and in the creation of unique signatures. This characteristic regarding its precision may turn into a drawback, since such methods cannot detect obfuscated or mutated (e.g., polymorphic) malware samples, as their signature does not match the stored one [47].
Behavior-based Malware Detection. Another approach deployed for malware detection, which has gained remarkable research interest during the last years, is behavioral detection or, more formally, behavior-based malware detection [23]. Behavior-based malware detection mainly focuses on capturing the interaction (in terms of interconnection, relations or dependencies between system elements, i.e., system-calls or API calls) between the executed software and the system (i.e., Operating System) [4, 10, 11, 16, 7, 8, 5, 42, 43]. From an abstract machine learning aspect, behavior-based systems are trained over a learning phase with behaviors exhibited during the execution of known malicious software samples, while in the monitoring phase the trained behavior-based system decides whether an unknown software sample is malicious or not [12]. Behavior-based detection systems, as expected, require the execution of the software sample in order to extract dynamically (see Dynamic Analysis) the exhibited behaviors. In order for these dynamic systems to perform the mining of the specified behaviors, they utilize software and hardware virtualization technologies alongside imitation conditions [47], providing the test sample with an environment as close to reality as possible, in order to evade the sandbox-detection mechanisms occasionally deployed by malicious software samples and to let the samples exhibit their intentions. Despite the fact that such techniques deploy quite elaborate algorithms in their implementations, the fact that malware families tend to evolve in order to avoid detection [2] results in the need for more elastic and mutation-resilient techniques like the one we propose in this work.
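The fragility of exact byte-level matching can be shown with a short sketch. It assumes the simplest possible scanner, one that reports a sample as malicious when any stored byte pattern occurs in it verbatim. The function name `matches_signature` and the byte values are made up for the illustration and do not correspond to any real signature or product.

```python
def matches_signature(sample: bytes, signatures: list) -> bool:
    """Exact-match scan: True if any stored byte pattern occurs in the sample."""
    return any(signature in sample for signature in signatures)

# Hypothetical 4-byte signature and a sample that contains it.
signature = bytes.fromhex("deadbeef")
original = b"\x90\x90" + signature + b"\xc3"

# A mutation that changes a single byte inside the matched region.
mutated = original.replace(b"\xbe", b"\xbf")

print(matches_signature(original, [signature]))  # True
print(matches_signature(mutated, [signature]))   # False
```

The mutated sample differs from the original in one byte only, yet the exact-match scan no longer recognizes it, which is the weakness that behavior-based and graph-based representations aim to mitigate.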
2.3 Classifying Malware Samples
Malware authors, in order to avoid traditional detection methods, produce new (mostly mutated) malware samples rapidly, utilizing existing ones in order for the new strains to preserve the functionality inherited from their ancestors. As reported in [8], mutated malware samples are generated from existing ones utilizing automated techniques [51, 54] or integrated tools, generating new samples from libraries and code parts obtained from code exchange networks.
Malware Classification. Throughout the literature, the term malware classification has been confused several times with malware detection. Distinguishing precisely these two procedures, it can be stated that malware detection is a binary classification, where a set of unknown samples is classified against a collection of malware and goodware samples, while malware classification is a multinomial classification on whether an already detected malware sample belongs to a particular family or type [17]. As described in [24], malicious software samples that belong to the same malware family tend to exhibit similar behavioral and structural profiles. Additionally, malware classification augments the analysis of new, or mutated, malicious samples whose signatures have not been constructed yet [40].
Malware Phylogeny. Another field of malware analysis applied in malware classification is malware phylogeny [32], which aims at inferring evolutionary relationships between instances of families. The major profit from creating a phylogeny model is the fact that newly developed, elaborate detection systems that deploy such techniques can detect that a sample that has not been previously seen is related to a malware family, when analyzed along an evolution path [26]. Throughout this process the main target is to reveal similarities and relations that coexist among, and are exhibited by, all the members of a specific set of malware samples (i.e., a malware family) [49], distinguishing its type or family. Such approaches can be utilized to identify evolution trends over a set of malware samples [8], constituting hence a valuable tool for more generalized signatures or, in general, more elaborate detection techniques. The models applied in phylogeny, using mostly phylogenetic networks, model evolutionary relations among malware families, describing the temporal ordering among samples and defining ancestor-descendant relations as well as relationships between families, augmenting hence malware classification and unveiling evolutionary trends [28].
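The distinction between the binary and the multinomial decision can be sketched in a few lines. The example below is only an illustration under strong assumptions: each sample is reduced to a set of observed group-to-group relations, similarity is measured with the Jaccard index as a stand-in (it is not the similarity metric of the proposed model), and the family names, the threshold and the function names are hypothetical.

```python
def jaccard(a: set, b: set) -> float:
    """Jaccard index of two sets (1.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0

def detect(sample, known_malware, threshold=0.5):
    """Binary decision: is the sample similar enough to any known malware?"""
    return any(jaccard(sample, profile) >= threshold
               for _, profile in known_malware)

def classify(sample, known_malware):
    """Multinomial decision: family of the most similar known sample."""
    family, _ = max(known_malware, key=lambda fp: jaccard(sample, fp[1]))
    return family

known = [
    ("FamilyA", {("File", "Memory"), ("Memory", "Process")}),
    ("FamilyB", {("Registry", "Security"), ("Security", "Process")}),
]
unknown = {("File", "Memory"), ("Memory", "Process"), ("Process", "Time")}

if detect(unknown, known):
    print("malicious, family:", classify(unknown, known))
else:
    print("benign")
```

Detection answers a yes/no question over the whole collection, whereas classification is only meaningful for a sample that has already been detected and returns one label out of several.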
3 Model Components
In this section, we first discuss the properties of our initial behavioral-graph representation, i.e., the System-call Dependency Graph (or, for short, ScDG), and the proposed structural components of our model, namely the Group Relation Graph (or, for short, GrG) and the Coverage Graph (or, for short, CvG). In order to capture the temporal evolution of these primordial graphs, we propose and present the derivative graphs that depict the temporal evolution of GrG and CvG, namely the Group Relation Temporal Graph (or, for short, GrTG) and the Coverage Temporal Graph (or, for short, CvTG), respectively. More precisely, we show how the system-calls invoked through the execution of a program, which constitute its ScDG, are merged into groups of similar functionality, constructing a directed edge-weighted graph called GrG in which each vertex refers to a system-call group and the edge set contains the interconnections between the system-calls of these groups, and how we construct its corresponding component, i.e., the CvG, which is a vertex-weighted undirected graph. Given such graph representations, we present the construction of the key components of our proposed detection and classification model, i.e., their corresponding derivative graphs GrTG and CvTG, depicting their temporal evolution.
3.1 System-call Dependency Graphs
The system-calls invoked during the execution of a program can be traced through taint analysis, and hence the behavior of a program can be represented with a directed acyclic graph (dag), the so-called System-call Dependency Graph; see Figure 1(a). The vertex set of a ScDG consists of all the system-calls invoked during the execution of a program, and its edge set represents the communication between system-calls, as described in [37, 4, 16].
Recall that the suspicious sample needs to be executed in a contained environment (i.e., a virtual machine), where, during its execution, dynamic taint analysis is performed in order to capture system-call traces. Next we illustrate a simple example that includes the system-call traces obtained, constructing the ScDG of a program. In Figure 1(a), it is easy to see that the vertex set of this graph consists of the system-calls invoked during the execution of the sample and its edge set consists of the data-flow dependencies between them, constructing a directed acyclic graph (dag).
3.2 Group Relation Graphs
Given a graph representation of malware behavior such as an ScDG, a more abstract graph representation of a program’s behavior can be constructed based on the fact that system-calls of similar functionality can be classified into the same group (see Table 1). The resulting representation is a directed weighted graph called Group Relation Graph; see Figure 1(b). The whole procedure for constructing the GrG graph from a given ScDG of a program is described in detail in [37, 38].
As described in [37], having the grouping of system-calls and a system-call dependency graph ScDG, the GrG graph is a directed edge-weighted graph on vertices , , , constructed as follows:
- (i)
for every pair , a directed edge is added in if the two system-calls communicating with each other, let , is an edge in and, belongs to the -th system-call group and belongs to the -th system-call group;
- (ii)
for each directed edge , a weight is assigned on it if there are invocations from a system-call in the -th group to a system-call in the -th group.
Having defined the GrG graph , we also define the underlying vertex-weighted graph of the graph having vertex-weights , for every .
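The two construction steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the ScDG is assumed to be available as a list of (source call, target call) pairs, the call-to-group mapping `group_of` is a hypothetical toy mapping, and edges between two calls of the same group are kept as self-loops, which the text does not specify.

```python
from collections import Counter

def build_grg(scdg_edges, group_of):
    """Collapse ScDG edges (source call, target call) into weighted group edges.

    Returns a dict mapping (source group, target group) to the number of
    invocations observed between the two groups (the edge weight).
    """
    weights = Counter()
    for src_call, dst_call in scdg_edges:
        # Replace each system-call by the functional group it belongs to.
        weights[(group_of[src_call], group_of[dst_call])] += 1
    return dict(weights)

# Hypothetical toy input: the mapping and the trace are illustrative only.
group_of = {
    "NtCreateFile": "File",
    "NtReadFile": "File",
    "NtOpenKey": "Registry",
    "NtCreateProcess": "Process",
}
scdg_edges = [
    ("NtCreateFile", "NtReadFile"),
    ("NtReadFile", "NtOpenKey"),
    ("NtCreateFile", "NtOpenKey"),
    ("NtOpenKey", "NtCreateProcess"),
]
grg = build_grg(scdg_edges, group_of)
```

In this toy trace two different File calls reach the same Registry call, so the File-to-Registry edge receives weight 2, which is the abstraction that makes the representation tolerant to swapping one call for another of the same group.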
3.3 Coverage Graphs
Another component of our detection model is the Coverage Graph (or, for short, CvG) [39]. As we mentioned above, a GrG graph is an edge-weighted directed graph which, in our approach, we transform to its underlying vertex-weighted undirected graph . We first define domination relations on the vertices of the graph and then, utilizing these relations, we construct the Coverage Graph of the GrG graph , denoted by .
We also present a 2D-representation of the underlying vertex-weighted graph of the graph utilizing the degrees and the vertex-weights of its vertices and show a different way to compute the domination relations on the graph .
Definition 3.1
Let be the underlying vertex-weighted graph of a GrG graph and let , . We say that dominates , denoted by , if and , where and denote the degree and the weight of the vertex , respectively.
The domination set of a vertex is the set of all the vertices . If we say that vertices and are in a domination relation.
Definition 3.2
Let be the underlying vertex-weighted graph of a GrG graph with vertices . The Coverage Graph (CvG) of the GrG graph , denoted also , is a directed graph defined as follows:
- (i)
* and ;*
- (ii)
* if , where and correspond to and , respectively.*
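As a rough illustration of Definitions 3.1 and 3.2, the sketch below derives coverage edges from per-vertex degrees and weights. Several details are assumptions rather than statements of the paper: the comparison is taken to be non-strict by default (a `strict` flag switches it), a vertex is not considered to dominate itself, and each edge is oriented from the dominating vertex to the dominated one. The degree and weight values are hypothetical.

```python
def dominates(u, v, degree, weight, strict=False):
    """True if u dominates v in both degree and weight (assumed comparison)."""
    if u == v:
        return False  # assumption: no vertex dominates itself
    if strict:
        return degree[u] > degree[v] and weight[u] > weight[v]
    return degree[u] >= degree[v] and weight[u] >= weight[v]

def coverage_graph(degree, weight, strict=False):
    """Directed edges (u, v) for every ordered pair in which u dominates v."""
    nodes = sorted(degree)
    return {(u, v) for u in nodes for v in nodes
            if dominates(u, v, degree, weight, strict)}

# Hypothetical degrees and vertex-weights of three system-call groups.
degree = {"File": 3, "Registry": 2, "Process": 1}
weight = {"File": 5, "Registry": 6, "Process": 1}
cvg = coverage_graph(degree, weight)
```

Here File and Registry are incomparable (File has the larger degree, Registry the larger weight), so neither dominates the other, while both dominate Process.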
In Figure 2(a) we show a GrG graph which is isomorphic to the GrG graph of Figure 1(b), in Figure 2(b) we depict its underlying vertex-weighted graph with , , where is the weight of the edge , while in Figure 2(c) we show the Coverage Graph constructed from the graph by utilizing its vertex domination relations. Note that the vertices of each graph in Figure 2 correspond to the System-call groups of Figure 1(b).
In Figure 3 we show a 2D-representation of the underlying vertex-weighted graph of a GrG graph . The 2D-representation in Figure 3(a) depicts the domination relations on the vertices of the test sample’s GrG, while the one in Figure 3(b) depicts the domination relations on the vertices of the malware sample’s GrG. As we can observe, according to the definition of the domination relation, in Figure 3(a) the domination sets of the vertices , , , and are , , , and , respectively, while in Figure 3(b) the vertex domination sets of the corresponding vertices , , , and are , , , and , respectively.
3.4 Temporal Graphs
Throughout the development of our research, we have noticed that, to the best of our knowledge, no approach in the literature references or leverages the temporal evolution of a behavior-based graph. Similarly to phylogeny, which examines the temporal evolution of malware families, the key component of our proposed detection and classification model leverages the temporal evolution of graphs (i.e., GrG and CvG graphs) in order to depict the structural modifications performed on the graph, which could distinguish a malware sample or, to a further extent, a malware family.
In our model, we define two types of graphs that depict the temporal evolution of our initial graph structures (i.e., Group Relation Graphs and Coverage Graphs), namely Group Relation Temporal Graphs (or, for short, GrTG) and Coverage Temporal Graphs (or, for short, CvTG), respectively. We implement these structures by creating instances of the initial GrG and CvG graphs during their construction. As we mentioned above, GrG graphs are constructed by summing the system-call invocations that interconnect pairs of system-call groups, and correspondingly CvG graphs are constructed from the respective domination relations (i.e., supremacy regarding degree and weight) between the system-call groups. Hence, since the system-call dependencies are given as a series that reflects their order in time (i.e., an edge sequence of the System-call Dependency Graph that shows the system-call invocations during execution time), such constructions can be obtained by creating an instance of the produced graphs (i.e., GrG, CvG) at specific steps.
Formalizing our previous claim, for a set of time-slots, let we can construct instances of the graphs GrG and CvG and denote them as and , respectively, that depict the structure, in terms of edges, vertex-degrees and vertex-weights, of the corresponding graphs at specific time slots. Through this approach we can maintain information about the temporal evolution of the graph throughout its construction procedure, and further leverage such information in order to perform more elaborate graph similarity techniques.
Partitioning Time. In our model, time does not represent actual run-time; instead, each time-quantum corresponds to one system-call dependency or, equivalently, one relation between two System-call Groups (i.e., an edge of the Group Relation Graph). Hence, the total time-line depicts the slots or time-partitions from the appearance of the first group relation to the last.
Additionally, in our model, we define as epochs the set of time-partitions, i.e., , and an epoch, let , contains the structural modifications (i.e., edges added on the corresponding GrG or CvG graph) from the beginning to the end of the epoch, where , and .
As described throughout the paper, the conceptual substance of Temporal Graphs is to depict the structural evolution of the GrG and CvG graphs over time. The structural modifications on the instances of the graph over time can, however, be described either discretely, as the addition of edges over the immediately preceding graph instance, or cumulatively, as successive additions of edges performed on all the previous graph instances. Next, we discuss the construction of the corresponding Temporal Graphs according to the two approaches.
Discrete Modification Temporal Graphs. In the first approach of our proposed scheme, the Temporal Graph that represents the evolution of the GrG or CvG graph over time is constructed as the induced subgraph of GrG or CvG, respectively, that includes only the edges that were added in a specific epoch. So, let the epoch we construct the Temporal Graphs , of the graphs GrG and CvG, denoting them with , , respectively, where denotes the cardinality of edges added on this epoch. In Figure 5 and Figure 5, we depict the discrete structural modification (i.e., temporal evolution) of graphs GrG and CvG over the construction of their corresponding Temporal Graphs and during epochs.
Cumulative Modification Temporal Graphs. In this type of Temporal Graphs, the evolution of the graph is represented as an additive procedure: once an edge has been created at a given time, let , between two system-call groups on the GrG graph, or a domination relation has resulted on the CvG graph, it remains permanently on the subsequent Temporal Graphs (i.e., if and if , ), since each instance is a predecessor of the temporal graphs that follow. In the second approach of our proposed scheme, the construction of the Temporal Graph that represents the evolution of the GrG or CvG graph over time thus extends the graphs GrG and CvG, respectively, by adding to them the edges that were added in a specific epoch. So, let the epoch we construct the Temporal Graphs , of the graphs GrG and CvG, denoting them with , , respectively, where denotes the cardinality of edges added from epoch until epoch . In Figure 7 and Figure 7, we depict the cumulative structural modification (i.e., temporal evolution) of graphs GrG and CvG over the construction of their corresponding Temporal Graphs and during epochs.
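The difference between the two constructions can be made concrete with a short Python sketch. It is only an illustration under assumptions: the time-ordered group relations are given as a list of (source group, target group) pairs, epochs are taken to be contiguous chunks of near-equal length (the paper leaves the quantization open), and each graph instance is reduced to its edge set, ignoring weights. All function names are hypothetical.

```python
def split_into_epochs(edge_sequence, n_epochs):
    """Partition a time-ordered edge sequence into n_epochs contiguous chunks."""
    n = len(edge_sequence)
    return [edge_sequence[k * n // n_epochs:(k + 1) * n // n_epochs]
            for k in range(n_epochs)]

def discrete_instances(epochs):
    """One edge set per epoch: only the edges that appeared in that epoch."""
    return [set(epoch) for epoch in epochs]

def cumulative_instances(epochs):
    """One edge set per epoch: every edge seen up to and including that epoch."""
    seen, instances = set(), []
    for epoch in epochs:
        seen |= set(epoch)
        instances.append(set(seen))  # copy, so later epochs do not alter it
    return instances

# Hypothetical time-ordered group relations (one time-quantum per entry).
sequence = [("File", "File"), ("File", "Registry"),
            ("Registry", "Process"), ("File", "Registry")]
epochs = split_into_epochs(sequence, 2)
discrete = discrete_instances(epochs)
cumulative = cumulative_instances(epochs)
```

With two epochs, the second discrete instance contains only the two relations observed in the second half of the trace, whereas the second cumulative instance contains all three distinct relations seen so far.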
4 System Architecture
In this section we present the key components of our detection and classification model, and describe the key insights that constitute the basis of the corresponding procedures. Discussing the design principles that rule the deployment of our model’s components, we present an overview of our detection and classification techniques.
4.1 Design Principles
Malware detection and classification are two interconnected procedures. In malware detection the main target is to determine whether a given program is malicious or benign according to something that is known to be malicious, whereas malware classification is the subsequent procedure, whose intent is to determine the malware family to which a sample that has been detected as malicious belongs. It follows that a priori knowledge of the characteristics of known malicious samples has to be stored in a knowledge database, and that a similarity measure is needed in order to compare two subjects. Moreover, the proposed theoretical approach specifies the form of the subjects, where graph-based models interact with similarity metrics that measure the qualitative characteristics that represent evolutionary commonalities between temporal graphs. Next, we present the architectural considerations, the design principles, the functionality, and the corresponding deployment of the key components of our proposed detection and classification model (see Figure 8).
Knowledge Database. The knowledge database consists of a set of known malicious samples that have been classified to malware families according to their functional, structural and mostly behavioral commonalities. More precisely, various anti-virus vendors have classified these samples to families based on their own heuristic rules concerning shared behavioral patterns and functionally similar execution profiles. Regarding the detection process, the knowledge database includes, apart from the known malicious samples, benign samples as well, in order to measure the false positive rate (i.e., benign samples detected as malicious) when evaluating the detection ability of our model. On the other hand, regarding the classification process, benign samples are not needed, since a classification model only has to decide the family to which a sample, already distinguished as malicious, belongs.
Graph Structures. The major target of our approach is to utilize the graphs that depict the temporal evolution of our produced graphs GrG and CvG (i.e., GrTG and CvTG, respectively) in order to measure the graph similarity between a test sample and samples that have already been detected as malicious, leveraging the structural modifications that take place during the execution time of the programs they represent. Our working hypothesis is that time, in the sense of the structural evolution of a graph, is a strong qualitative characteristic that can distinguish the behavior of a program and can further be utilized in the development of more elaborate detection and classification techniques for unknown samples.
To this end, we note that several different approaches could be applied to the time quantization procedure, i.e., the choice of the time slots at which graph instances are retained, and that this choice would affect the results. In other words, an implementation of our proposed model on a fine-grained time quantization scheme would be more precise than a more coarse-grained one; on the other hand, the trade-off between precision on temporal structural modifications and the construction of more distinguishing patterns raises further tuning issues.
Similarity Metrics. Concerning the type of the temporal graph utilized to model the temporal evolution of the corresponding GrG and CvG graphs (i.e., and , respectively), we could deploy two different similarity metrics, namely -similarity and Cover-similarity, respectively [38, 39], regardless of whether the discrete or the cumulative modification temporal graphs are constructed. Next we briefly discuss the two similarity metrics.
- i.
-similarity Metric. The -distance, that utilizing the Euclidean distance measures the structural likeness between two given Temporal Graphs, let and denoting with (resp. ) is the in-degree (resp. out-degree) of node , and (resp. ) is the the average weight of in-coming (resp. out-going) edges of node ; is defined as follows.
[TABLE]
where,
,
,
,
Since -similarity utilizes the -distance metric, we compute it as follows:
[TABLE]
where, and .
- ii.
Cover-similarity Metric. In order to perform the malware detection process, we actually compute graph similarity between pairs of Temporal Graphs, let and . The graph similarity is computed over the edge set of each pair of Temporal Graphs using the Jaccard similarity metric.
Let be the Temporal graph of an unknown sample and be the Temporal Graph of a known malware sample. The Cover-Similarity of the graphs and is computed as follows:
[TABLE]
where and are the edge sets of the two tested Temporal Graphs; note that, the vertices of the graph correspond to the vertices of the graph .
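Since the Cover-similarity is described as the Jaccard similarity of the two edge sets, it can be sketched directly in Python. The function name, the representation of a graph as a set of vertex pairs, and the convention of returning 0.0 when both edge sets are empty are assumptions made for this illustration.

```python
def cover_similarity(edges_a, edges_b):
    """Jaccard similarity |A ∩ B| / |A ∪ B| of two edge sets."""
    a, b = set(edges_a), set(edges_b)
    union = a | b
    if not union:
        return 0.0  # assumed convention for two empty graphs
    return len(a & b) / len(union)

# Hypothetical edge sets over the same vertex labelling.
unknown = {("File", "Process"), ("Registry", "Process")}
known = {("Registry", "Process"), ("Process", "Device")}
score = cover_similarity(unknown, known)
```

The two toy graphs share one of three distinct edges, so the score is 1/3. Because the comparison is made edge by edge, it is only meaningful when both graphs use the same vertex labelling, which matches the remark that the vertices of the two graphs correspond to each other.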
Component Deployment. The deployment model of the proposed components consists of the graph structures utilized to represent the software’s behavioral characteristics (i.e., the Temporal Graphs that represent their evolution through time), the knowledge base, which is a database storing Temporal Graphs that represent known malware samples, and the similarity metrics developed to capture structural and qualitative commonalities among such behavioral graphs. Our proposed graph-based malware detection and classification model is partitioned into two phases. The first phase concerns the detection procedure, where an unknown sample, let , needs to be detected as malicious or benign. Our model’s implementation utilizes the Temporal Graphs taken from a database of known malware samples and the Temporal Graph of the test sample in order to compute their structural similarities across their temporal evolution. The second phase concerns the classification procedure, where an unknown sample, let , that has already been detected as malicious needs to be classified to one of a set of known malware families. Our model’s implementation utilizes the Temporal Graphs taken from a database of known malware samples that have already been classified to malware families and the Temporal Graph of the test sample in order to compute their structural similarities across their temporal evolution and further classify to one malware family from our data-set. In Figure 9, we present an abstract overview of the deployment of our proposed graph-based model for malware detection and classification.
4.2 Malware Detection
Next we discuss the operation of our proposed graph-based malware detection model, and present an overview of its constructional principles along with a brief discussion of its implementation aspects.
Model Overview. We implement our malware detection model by first transforming the initial ScDG graphs into GrG and CvG graphs, respectively, and then computing for these graphs their corresponding Temporal Graphs (i.e., and ) as described in the previous section. For any given test sample we follow the same procedure, and conclude with the computation of the -similarity and Cover-similarity metrics in order to measure the structural similarities between the graphs of two samples.
Next, we describe the main process of determining if an unknown sample is malicious or benign based on the results of our similarity metrics when applied on the corresponding Temporal Graph of a test sample and a set of Temporal Graphs that represent known malicious software samples. In Figure 10 we depict the total architecture of our proposed model for detecting malicious software samples.
Implementation Aspects. In the example of Figure 10 we suppose that we are given an unknown test sample and are asked to decide whether is malicious or benign, having a database with the Temporal Graphs of known malware samples. Once the corresponding Temporal Graphs have been constructed, we compute our similarity metrics between the Temporal Graph of and each Temporal Graph that represents a malware sample in our database. So, let the total number of malware samples in our database, we obtain values from our similarity metrics (one per pair ), and if the maximum value exhibited is above a predefined threshold, it indicates that is malicious.
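The decision rule described above (take the maximum similarity against the database and compare it with a threshold) can be sketched as follows. This is a hedged illustration: each sample is reduced to a single edge set, a Jaccard function stands in for the paper's metrics, and the threshold value, the function names and the toy database are all hypothetical. How per-epoch similarities of two Temporal Graphs are aggregated is not specified in the text and is not modelled here.

```python
def jaccard(edges_a, edges_b):
    """Jaccard similarity of two edge sets (stand-in similarity metric)."""
    a, b = set(edges_a), set(edges_b)
    return len(a & b) / len(a | b) if (a or b) else 0.0

def detect(test_edges, known_malware, threshold, similarity=jaccard):
    """Return (is_malicious, best_score).

    The sample is flagged as malicious when its best match in the database
    scores strictly above the threshold.
    """
    best = max((similarity(test_edges, m) for m in known_malware), default=0.0)
    return best > threshold, best

# Hypothetical database of known malware graphs, each given as an edge set.
known_malware = [
    {("File", "Registry"), ("Registry", "Process")},
    {("File", "File"), ("File", "Process")},
]
test_edges = {("File", "Registry"), ("Registry", "Process"), ("File", "File")}
is_malicious, score = detect(test_edges, known_malware, threshold=0.5)
```

Passing the metric as a parameter keeps the decision rule independent of which similarity measure is eventually plugged in.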
4.3 Malware Classification
Next we discuss the operation of our proposed graph-based malware classification model, and present an overview of its constructional principles along with a brief discussion of its implementation aspects.
Model Overview. Our proposed method is based on the application of our proposed similarity metrics over the set of known malware families in order to classify an unclassified malware sample, let , to one of them. Our method selects the family that is most similar to according to the results of the -Similarity and Cover-Similarity metrics, and calls that family the dominant family. More precisely, using our proposed similarity metrics, we iterate over all the members of all the known malware families, measuring the similarity between each pair of , , where is the member of the malware family. Then, for each family we select the member that is most similar to , according to the -Similarity and Cover-Similarity metrics, and denote this member as the representative sample of this specific family. Finally, among the representative samples of all the known malware families, we classify the unclassified test sample to the malware family whose representative sample exhibits the maximum similarity with , denoting this family as the dominant family.
Implementation Aspects. In the example of Figure 11 we show a representation of the procedure for classifying an unknown test sample to a known malware family utilizing the aforementioned methods (i.e., -Similarity and Cover-Similarity metrics). More formally, our classification technique proceeds as follows: given a set of known malware families and an unclassified malware sample , we measure the -Similarity and Cover-Similarity metrics over all the members of each family, keeping the maximum result (i.e., representative sample) for each family, which yields results (i.e., representative samples), one per family. Then, we classify the test sample to the family that exhibited the maximum value among all results. In other words, we compute the aforementioned similarity metrics between and all the malware families of the data-set, selecting as the dominant family the one whose representative sample exhibits the maximum value in our similarity measurements.
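The representative-sample and dominant-family selection can be sketched as a maximum over per-family maxima. The code below is an illustration under assumptions: samples are reduced to edge sets, a single Jaccard function stands in for the two metrics (the text does not say how they are combined), and the family names and graphs are hypothetical.

```python
def jaccard(edges_a, edges_b):
    """Jaccard similarity of two edge sets (stand-in similarity metric)."""
    a, b = set(edges_a), set(edges_b)
    return len(a & b) / len(a | b) if (a or b) else 0.0

def classify(test_edges, families, similarity=jaccard):
    """Return (dominant_family, per-family representative scores).

    families maps a family name to a list of member graphs (edge sets).
    """
    representatives = {}
    for name, members in families.items():
        if members:  # skip families with no stored samples
            # Score of the representative sample: the family's best match.
            representatives[name] = max(similarity(test_edges, m) for m in members)
    dominant = max(representatives, key=representatives.get)
    return dominant, representatives

# Hypothetical families, each a list of member graphs.
families = {
    "family_a": [{("File", "Registry")},
                 {("File", "Registry"), ("Registry", "Process")}],
    "family_b": [{("Process", "File")},
                 {("File", "File"), ("File", "Registry")}],
}
test_edges = {("File", "Registry"), ("Registry", "Process")}
dominant, scores = classify(test_edges, families)
```

Returning the per-family scores alongside the dominant family makes it easy to inspect how close the runner-up families were.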
5 Conclusion
In this section we discuss our future work, concerning the implementation of our graph-based model for malware detection and classification and the performance of a series of experiments on our data-set in order to demonstrate the potential of our model regarding its detection ability and classification accuracy against malicious samples. Finally, we conclude the paper with remarks mostly regarding the implementation aspects of our work and the potential of our model, pending the experimental results.
5.1 Further Research
Next, we briefly discuss the potential of our model and the future work concerning its experimental evaluation, as well as the drawbacks and limitations that may arise during its implementation.
Potentials. Several modeling alternatives arose during the theoretical construction of our proposed graph-based model regarding the temporal evolution of the behavioral graphs that represent software samples, i.e., their structural modification over time. The approaches that we briefly discuss next mostly concern the representation of the structural modifications on the GrG and CvG graphs over time, and how they could also be represented with other structures that do not incorporate graphs and that consequently call for different manipulation methods.
In the first alternative approach, we could denote the structural evolution of a given graph by plotting a discrete distribution of the addition of edges over the graph in specific time buckets (i.e., similar to epochs), and create patterns that could be utilized to perform pattern-matching over the plots of any given pair of samples (i.e., a test and a known malware sample). These plots should be constructed for the temporal evolution of each corresponding edge pair of the two given graphs in order for the patterns to be comparable.
On the other hand, in the second alternative approach, we again need to capture the structural modification of a given graph over time (i.e., the temporal evolution of the graph). Rather than constructing as many graph instances as the number of defined epochs, structured according to the discrete or cumulative modification approach, we could represent these structural modifications (i.e., additions of edges) over time separately for each edge (i.e., edge of either GrG or CvG). More precisely, we could define a binary sequence for each edge, where 0 denotes absence and 1 denotes addition of this edge on the overall graph, and the length of the sequence equals the size of the ScDG (i.e., System-call Dependency Graph). Then, various alignment algorithms could be adopted in order to retrieve similarity patterns among any pair of such sequences that represent corresponding edges on the graphs of the test and the known malicious samples.
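The per-edge binary sequences of this second alternative can be illustrated with a short Python sketch. It assumes that the size of the ScDG is the number of its edges, taken in time order, and that position t of a sequence is 1 exactly when the corresponding group relation was added at step t; the function name and the toy trace are hypothetical. The alignment step itself is left out, since the text does not fix a particular algorithm.

```python
def edge_sequences(edge_sequence):
    """Map each distinct edge to a 0/1 list marking the steps where it was added."""
    n = len(edge_sequence)
    sequences = {}
    for t, edge in enumerate(edge_sequence):
        # Create an all-zero sequence on first sight, then mark step t.
        sequences.setdefault(edge, [0] * n)[t] = 1
    return sequences

# Hypothetical time-ordered group relations (one per ScDG edge).
sequence = [("File", "File"), ("File", "Registry"),
            ("Registry", "Process"), ("File", "Registry")]
seqs = edge_sequences(sequence)
```

In this toy trace the File-to-Registry relation is added at steps 1 and 3, so its sequence is 0, 1, 0, 1; sequences of corresponding edges from two samples could then be passed to whichever alignment method is chosen.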
Limitations. Our proposed graph-based model for malware detection and classification using temporal graphs, despite its theoretical basis, also has some limitations concerning implementation drawbacks that may arise. The main issue regarding the implementation design concerns the space complexity of our approach. More precisely, the choice of a fine-grained or a coarse-grained quantization of time (i.e., the number of epochs) would affect to a great extent the space required to store the corresponding Temporal Graph instances. As one can easily see, an implementation of our proposed model on a fine-grained time quantization scheme would be more precise than a more coarse-grained one. Additionally, further tuning issues arise over the trade-off between the precision on temporal structural modifications and the construction of more distinguishing patterns. However, more sophisticated approaches, such as an implementation that utilizes the maximum length of a binary tree in order to bound the quantization, could lead to a more stable, rational, effective and efficient approach.
5.2 Remarks
In this paper we designed and presented a graph-based model for malware detection and classification based on Group Relation Graphs and their structural evolution over time (i.e., temporal evolution). To this end, we proposed the construction of their corresponding Temporal Graphs. Temporal Graphs represent the structural evolution of a graph over the quantum parts of time (i.e., time-slots on the corresponding time-line that depicts execution-time), which we call epochs. The proposed graph-based model for malware detection and classification is organized into two phases, namely the detection phase and the classification phase. In order to distinguish malicious from benign samples, and further classify a sample that has been detected as malicious to a malware family from a set of known malware families, we propose the utilization of similarity metrics that measure graph similarity by taking into account structural commonalities among graphs (i.e., -Similarity and Cover-Similarity). Presenting the design principles of our model, we discussed its potential concerning its detection and classification abilities, as well as minor drawbacks that may be encountered during its implementation in the future.
- [1] Alazab, M., Layton, R., Venkataraman, S., Watters, P.: Malware detection based on structural and behavioural features of API calls. In: Proceedings of the 1st International Conference on Cyber Resilience (CR’10), pp. 1–10 (2010).
- [2] Alsulami, B., Srinivasan, A., Dong, H., Mancoridis, S.: Lightweight behavioral malware detection for Windows platforms. In: 12th International Conference on Malicious and Unwanted Software (MALWARE), pp. 75–81. IEEE (2017).
- [3] Aneja, L., Babbar, S.: Research trends in malware detection on Android devices. In: International Conference on Recent Developments in Science, Engineering and Technology, pp. 629–642. Springer (2017).
- [4] Babic, D., Reynaud, D., Song, D.: Malware analysis with tree automata inference. In: Proceedings of the 23rd International Conference on Computer Aided Verification (CAV’11), pp. 116–131 (2011).
- [5] Bayer, U., Comparetti, P.M., Hlauschek, C., Kruegel, C., Kirda, E.: Scalable behavior-based malware clustering. In: Proceedings of the 16th Annual Network and Distributed System Security Symposium (NDSS’09), pp. 8–11 (2009).
- [6] Bayer, U., Habibi, I., Balzarotti, D., Kirda, E., Kruegel, C.: A view on current malware behaviors. In: Proceedings of the 2nd USENIX Workshop on Large-scale Exploits and Emergent Threats (LEET’09), Boston, MA (2009).
- [7] Bayer, U., Moser, A., Kruegel, C., Kirda, E.: Dynamic analysis of malicious code. Journal in Computer Virology 2, 67–77 (2006).
- [8] Bernardi, M. L., Cimitile, M., Distante, D., Martinelli, F., Mercaldo, F.: Dynamic malware detection and phylogeny analysis using process mining. International Journal of Information Security, 1–28 (2018).
