SCGDet: Malware Detection using Semantic Features Based on Reachability Relation
Renjie Lu

TL;DR
SCGDet is a novel malware detection approach that leverages semantic features derived from system call reachability relations, outperforming traditional methods in accuracy and false positive reduction.
Contribution
The paper introduces SCGDet, a system call graph-based malware detection method that captures semantic features through reachability, with improved performance over n-gram techniques.
Findings
Achieves 97.78% accuracy in malware detection.
Reduces false positive rate by 14.75%.
Outperforms traditional n-gram methods.
Table 1: Security-critical system calls retained after pruning, grouped by resource type

| Resource Types | Related system calls |
|---|---|
| File System | open, openat, read, write, close, select, stat, fstat, lstat, statfs, stat64, fstat64, readlink, access, fcntl, ioctl, chdir, fchdir, getdents64, getcwd, lseek, utime, uname, unlink, umask, chmod, rename, execve |
| Process | rt_sigaction, rt_sigprocmask, fork, rt_sigsuspend, getpid, clone, waitpid, nanosleep, set_tid_address, prctl, getppid, pipe, kill |
| Network | socket, connect, setsocket, bind, getsockname, listen |
| Memory | brk, mmap, mmap2, munmap, mprotect |

Table 2: Feature vector of a sample

| Feature | … | Class |
|---|---|---|
| Frequency | … | 1/0 |

Table 3: Confusion matrix

| | Predicted: Malware | Predicted: Benign |
|---|---|---|
| Actual: Malware | True Positive (TP) | False Negative (FN) |
| Actual: Benign | False Positive (FP) | True Negative (TN) |
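
Table 4: The size of the feature space (five random selections for 25%, 50% and 75% of the samples; one selection for 100%)

| Technique | Samples | Run 1 | Run 2 | Run 3 | Run 4 | Run 5 |
|---|---|---|---|---|---|---|
| 4-gram | 25% | 4935 | 5940 | 5195 | 4770 | 5500 |
| 4-gram | 50% | 6195 | 5900 | 6040 | 6015 | 6405 |
| 4-gram | 75% | 6995 | 6895 | 6515 | 6560 | 6890 |
| 4-gram | 100% | 7305 | | | | |
| SCGM | 25% | 150 | 152 | 150 | 169 | 153 |
| SCGM | 50% | 178 | 171 | 179 | 174 | 178 |
| SCGM | 75% | 181 | 189 | 190 | 184 | 186 |
| SCGM | 100% | 191 | | | | |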

Table 5: The average sizes of the feature space

| Samples | 25% | 50% | 75% | 100% |
|---|---|---|---|---|
| 4-gram | 5268 | 6111 | 6771 | 7305 |
| SCGM | 155 | 176 | 186 | 191 |

Table 6: Features generated by the 4-gram model (15 features) and by SCGM (5 features) for the system call trace in Figure 2 [feature entries omitted]
University of Chinese Academy of Sciences, Beijing, China
Corresponding author: Renjie Lu
Abstract
Recently, with the rapid development of the software industry, more and more malware variants have been designed to perform malicious behaviors. The evolution of malware makes it difficult to detect using traditional signature-based methods, while malware detection remains important for system security. In this paper, we present SCGDet, a novel malware detection method based on the system call graph model (SCGM). We first develop a system call pruning method, which excludes system calls that have little impact on malware detection. Then we propose the SCGM, which captures the semantic features of a run-time program by grouping the system calls based on the reachability relation. We aim to obtain a generic representation of malicious behaviors with similar system call patterns. We evaluate the performance of SCGDet using different machine learning algorithms on a dataset of 854 malware samples and 740 benign samples. Compared with the traditional n-gram method, SCGDet has a smaller feature space, higher detection accuracy and fewer false positives. Experimental results show that SCGDet reduces the average FPR by 14.75% and improves the average Accuracy by 8.887%, and achieves a TPR of 97.44%, an FPR of 1.96% and an Accuracy of 97.78% in the best case.
Keywords: Malware detection · System call · Pruning · System call graph model (SCGM) · Semantic analysis
1 Introduction
Malicious software, referred to as malware, is designed to perform various malicious activities, such as leaking private information or disabling a targeted host. The exponential growth of malware is now a major threat to the software industry. McAfee, an anti-malware vendor [1], reported that the total number of malware samples has grown almost 34% over the past quarters to more than 774 million samples. Meanwhile, with the rapid development of the software industry, malware has evolved greatly and become very sophisticated.
Given the alarming growth of malware, a large body of research has focused on malware detection techniques. Roughly, these techniques can be divided into two broad categories: static malware analysis and dynamic malware analysis. Traditional anti-virus software usually depends on static signature-based methods to detect malware [9]. Although these static methods can detect malicious applications before they are executed, they can be evaded using encryption, obfuscation or packing [19]. Malware authors can write several malware variants that have similar functionality but different signatures to evade static malware detection methods, and zero-day malware can also evade static malware analysis.
In response, various dynamic malware detection approaches have been proposed [5, 7, 6, 8, 12], which focus on program behaviors during execution. In particular, these techniques inspect the run-time behaviors of a program by analyzing its system calls. The principal dynamic behavior-based malware detection methods include function call monitoring [5], information flow tracking [7], sequence modeling of system calls using the n-gram model [6], individual system call analysis [8] and behavioral graphs [12]. Although these approaches can detect obfuscated malware and malware variants, they require a large amount of time and resources to carry out the detection. Moreover, the number of features generated by these approaches is huge, which makes their scalability highly problematic.
In this paper, we propose SCGDet, which exploits the following two observations to perform malware detection. Experimental observation 1: Malware usually invokes security-critical system calls to carry out malicious activities. For example, the Brk exploit repeatedly uses the brk system call to increase the data segment of a user program and overload the main memory. Experimental observation 2: Malware and its variants belonging to the same family usually invoke similar system call patterns because they perform similar malicious behaviors. Based on these observations, we propose the system call graph model (SCGM) to capture the semantic features of malicious behaviors. Our work aims to obtain a generic feature representation of malicious behaviors with similar system call patterns. Therefore, our approach has the potential to detect malware variants. We apply different machine learning algorithms to perform malware detection, and evaluate the performance of SCGDet using a dataset of 854 malware samples and 740 benign programs. In summary, we make the following contributions in this paper:
1. We propose SCGDet, a novel malware detection method based on SCGM. The SCGM captures the semantic features of programs by grouping the system calls based on the reachability relation and considering the frequency of these security-critical system call patterns.
2. Based on statistical experiments, we propose and formalize a novel system call pruning method to exclude system calls that have little effect on malware detection.
3. We evaluate the performance of SCGDet using benign and malware samples. Experimental results show that SCGDet achieves a TPR of 97.44%, an FPR of 1.96% and an Accuracy of 97.78% in the best case. Compared with the traditional n-gram technique, our approach significantly reduces the feature space and has fewer false positives and higher detection accuracy. Specifically, SCGDet reduces the average FPR by 14.75% and improves the average Accuracy by 8.887%.
The rest of this paper is organized as follows. Section 2 introduces our proposed malware detection method in detail. Experiments and evaluation are presented in Section 3, and related work is discussed in Section 4. Section 5 concludes the paper and outlines future work.
2 Methodology of SCGDet
In this section, we introduce the proposed malware detection technique in detail. Figure 1 shows the overall architecture of SCGDet, which consists of a training stage and a testing stage. The training stage comprises system call tracing and pruning, feature construction and generation, and generation of the malware detection model. During the testing stage, we use this model to predict whether the program under detection is malware.
2.1 System Call Tracing and Pruning
2.1.1 System call tracing
To generate system call trace records, all malware samples were executed in a controlled, virtualized environment running 64-bit Ubuntu 18.04 with the Linux 4.15 kernel. We use strace, a Linux diagnostic tool that traces system calls and signals, to generate the system call trace records of each malware sample. In contrast, all benign samples were executed in a normal environment, and common operations were performed on the benign programs. We likewise use strace to generate the system call trace records of each benign sample.
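As an illustration of the tracing step, the sketch below extracts the ordered system call names from strace text output. The paper does not state which strace options were used or how the output was parsed, so the invocation in the comment, the regular expression and the sample trace are all assumptions made for this example.

```python
import re

# Assumed invocation (not specified in the paper): strace -f -o trace.txt ./sample
# The pattern matches the system call name at the start of a line, optionally
# preceded by a PID column such as "[pid 123] " or "123 ".
SYSCALL_RE = re.compile(r"^(?:\[pid\s+\d+\]\s+|\d+\s+)?([a-z_][a-z0-9_]*)\(")

def syscall_names(trace_text):
    """Return the ordered list of system call names found in strace output."""
    names = []
    for line in trace_text.splitlines():
        match = SYSCALL_RE.match(line.strip())
        if match:  # signal lines ("--- ... ---") and exit lines ("+++ ... +++") do not match
            names.append(match.group(1))
    return names

# Made-up trace excerpt in the usual strace layout.
sample_trace = """\
execve("/bin/ls", ["ls"], 0x7ffd0 /* 24 vars */) = 0
brk(NULL)                               = 0x55d0c8a3e000
openat(AT_FDCWD, "/etc/ld.so.cache", O_RDONLY|O_CLOEXEC) = 3
--- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_EXITED} ---
close(3)                                = 0
+++ exited with 0 +++
"""
print(syscall_names(sample_trace))
```

Arguments and return values are ignored here; a fuller parser would keep them, since the SCGM described later also uses file arguments.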
2.1.2 System call pruning
Generally, no program requests all the system calls. Moreover, when we analyze a large number of programs, the total number of system calls requested by all the programs is huge, which can result in considerable time overhead. Therefore, we first identify the valuable system calls so that we do not need to consider all of them. We propose the following two rules to filter out system calls that have little impact on malware detection.
Rule 1: Exclude system calls that are commonly requested by both malware and benign samples. Each system call describes a particular operation, and benign and malicious programs should request different system calls according to their operational needs. That is, we do not need to analyze all system calls to build the malware detection model. As a result, we exclude system calls that are commonly used by both benign and malicious programs. For example, the system call close is frequently requested by both malicious and benign programs.
For this purpose, we first create a list and two matrices: SC, M and B. SC contains all the different system calls that appear in the dataset. M is a matrix of system calls used by malware samples, and B is a matrix of system calls used by benign programs. More formally, SC, M and B are defined as follows, where l represents the number of different system calls, m (854) indicates the number of malware samples in the dataset and n (740) indicates the number of benign samples in the dataset. $M_{ij}$ represents whether the j-th system call is requested by the i-th malware sample, and $B_{ij}$ represents whether the j-th system call is requested by the i-th benign sample, where '1' indicates yes and '0' indicates no.

$$SC = (sc_1, sc_2, \ldots, sc_l)$$

$$M = \begin{pmatrix} M_{11} & \cdots & M_{1l} \\ \vdots & \ddots & \vdots \\ M_{m1} & \cdots & M_{ml} \end{pmatrix}, \qquad M_{ij} \in \{0, 1\}$$

$$B = \begin{pmatrix} B_{11} & \cdots & B_{1l} \\ \vdots & \ddots & \vdots \\ B_{n1} & \cdots & B_{nl} \end{pmatrix}, \qquad B_{ij} \in \{0, 1\}$$
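The construction of SC, M and B can be sketched as follows. The per-sample sets of system calls are toy data invented for this example, and `build_usage_matrix` is a hypothetical helper name, not code from the paper.

```python
def build_usage_matrix(samples, syscalls):
    """One row per sample, one column per system call; 1 if requested, else 0."""
    return [[1 if sc in used else 0 for sc in syscalls] for used in samples]

# Hypothetical sets of system calls requested by each sample (toy data).
malware_samples = [{"open", "write", "chmod", "execve"}, {"brk", "close"}]
benign_samples = [{"open", "read", "close"}, {"open", "close"}, {"read", "close"}]

# SC: every distinct system call in the dataset, in a fixed (sorted) order.
SC = sorted(set().union(*malware_samples, *benign_samples))
M = build_usage_matrix(malware_samples, SC)  # m rows, l columns
B = build_usage_matrix(benign_samples, SC)   # n rows, l columns

print(SC)
print(M)
print(B)
```

Column sums of M and B then give, for each system call, the number of malware and benign samples that request it.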
Then we calculate the distribution of each system call from M and B, using a scaling term to balance the effect of the different sizes of the malware dataset and the benign dataset. The distribution of the j-th system call ranges from -1 to 1. A value of 1 means that the system call is requested only by malware samples. A value close to 0 means that the system call has little impact on malware detection, because its usage among malware samples is nearly equal to its size-balanced usage among benign samples. A value of -1 means that the system call is requested only by benign samples. Finally, we filter out those system calls whose distribution value is close to 0.
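The exact formula is not reproduced here, so the following is only one possible form that matches the stated properties (range from -1 to 1, value 1 for malware-only calls, -1 for benign-only calls, near 0 when balanced usage is equal). The normalized difference, the cut-off `threshold=0.1` and the per-call counts are assumptions for illustration, not the paper's definition.

```python
def distribution(malware_count, benign_count, m, n):
    """Assumed form: normalized difference of malware usage and size-scaled benign usage."""
    scaled_benign = benign_count * m / n  # rescale the benign count to the malware dataset size
    total = malware_count + scaled_benign
    if total == 0:
        return 0.0  # never requested: treat as uninformative
    return (malware_count - scaled_benign) / total

def prune_rule1(counts, m, n, threshold=0.1):
    """Keep system calls whose absolute distribution is at least `threshold` (hypothetical cut-off)."""
    return {sc for sc, (mc, bc) in counts.items()
            if abs(distribution(mc, bc, m, n)) >= threshold}

# Hypothetical (malware, benign) sample counts per system call.
counts = {"close": (850, 736), "chmod": (300, 5), "getdents64": (10, 400)}
kept = prune_rule1(counts, m=854, n=740)
print(sorted(kept))
```

In this toy input, close is used by almost every sample in both classes, so its value is near 0 and it is dropped, which mirrors the close example given for Rule 1.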
Rule 2: Exclude system calls that appear only in a few samples. To further filter out system calls that are not helpful for malware detection, we do not use system calls that appear in only a few samples to build the malware detection model. We measure this rule by the number of malware and benign samples that use the j-th system call. This value is never negative, and we filter out the system calls for which it is close to 0. For example, the system call signal is requested by only one sample, so we filter it out according to Rule 2.
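Rule 2 amounts to a support count over both classes. A minimal sketch follows, assuming 0/1 usage matrices laid out with one row per sample and one column per system call; the cut-off `min_support=2` is a hypothetical choice, since the paper only says "close to 0".

```python
def support(M, B, j):
    """Number of malware and benign samples that request the j-th system call."""
    return sum(row[j] for row in M) + sum(row[j] for row in B)

def prune_rule2(SC, M, B, min_support=2):
    """Keep system calls requested by at least `min_support` samples (assumed cut-off)."""
    return [sc for j, sc in enumerate(SC) if support(M, B, j) >= min_support]

# Toy data: 'signal' is requested by a single sample, as in the example in the text.
SC_toy = ["close", "signal"]
M_toy = [[1, 1], [1, 0]]
B_toy = [[1, 0]]
print(prune_rule2(SC_toy, M_toy, B_toy))
```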
Finally, we ignore the system calls that have little impact on malware detection and consider only the meaningful, security-critical system calls listed in Table 1. Note that a file here is an instance of any operated file or I/O device. We observe that combinations of these system calls (in other words, system call patterns) can result in specific malicious activities. For example, a virus first copies its content into a temporary file, then changes the permission and the access time, and finally executes the modified file. Therefore, we can capture specific malicious behaviors of malware samples by analyzing the system call patterns.
2.2 Feature Construction: System Call Graph Model
Feature construction is a crucial step of our proposed malware detection approach. The performance of any malware detection model depends heavily on how precisely the features represent the characteristics of the samples. In this work, we propose a novel feature construction method, the system call graph model (SCGM), to capture the semantic features of typical malware behavior. We illustrate the SCGM in detail using the system call trace shown in Figure 2.
The SCGM is based on the high-level semantic features of malicious behaviors. We use security-critical system calls along with their arguments and return values to analyze malicious behaviors. We aim to obtain a generic representation of malicious behaviors in order to detect malware variants that have similar system call patterns. We also consider the frequency of security-critical system call patterns during feature construction, in order to capture malicious behaviors that repeatedly execute a specific system call pattern to overload the system. In short, we propose the following properties to capture the high-level semantic features of malicious behaviors.
Property 1: Group system calls based on reachability relation. The observation behind this property is that malware and its variants usually perform a series of similar operations on particular resources in order to accomplish similar malicious behaviors. For example, in the system call trace shown in Figure 2, the virus first copies its content into the temp0 file, then changes the permission and the access time, and finally executes the modified file. These specific system call patterns can reveal the malicious intent of malware. To group system calls based on the reachability relation, we first redefine the concept of the system call graph.
In our work, the system call graph (SCG) represents all relationships among system calls during program execution. The SCG is a directed graph, SCG = (V, Entity, E), where V is the set of system calls, Entity is the set of source and destination files, and E is the set of edges. Each edge in E is an out-edge of a vertex or an entity and an in-edge of a vertex.
For the system call trace example shown in Figure 2, the resulting sets V, Entity and E are shown in Figure 3.
Based on the SCG, vertices that satisfy the reachability relations shown in Figure 4 are grouped in the same set. Figure 4 covers four cases: 1) two elements that are directly connected in the SCG are grouped in the same set; 2) and 3) when one pair of elements is directly connected and a second pair is also directly connected, the three elements involved are grouped in the same set; 4) when one pair of elements is directly connected and a second pair is also directly connected, two of the elements are grouped in the same set. The specific vertices and entities in each case are those depicted in Figure 4.
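One way to approximate this grouping is to merge every pair of directly connected elements with a union-find structure, which yields the weakly connected components of the SCG. This is a simplification chosen for illustration: the four cases of Figure 4 may be more selective than plain connectivity, and the edges and names below (`temp0`, `sock`) are made up.

```python
def group_by_reachability(edges):
    """Group SCG elements into sets by merging every directly connected pair."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps the trees shallow
            x = parent[x]
        return x

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    for src, dst in edges:
        union(src, dst)

    groups = {}
    for node in list(parent):
        groups.setdefault(find(node), set()).add(node)
    return sorted(sorted(group) for group in groups.values())

# Hypothetical edges: (source vertex or entity, destination vertex).
edges = [("temp0", "write"), ("temp0", "chmod"), ("chmod", "execve"), ("sock", "connect")]
print(group_by_reachability(edges))
```

Each resulting group plays the role of one system call pattern; the file-related calls on `temp0` end up together, separate from the unrelated network call.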
Property 2: Count the frequency of security-critical system call patterns. The observation behind this property is that malware often repeatedly executes security-critical system call patterns in order to overload system resources (such as memory and CPU), which can result in denial of service, CPU starvation and even a crash of the OS in resource-limited systems. For example, the Brk.c exploit repeatedly invokes the brk system call to increase the data segment of a user program in order to overload the main memory. Therefore, recording the frequency of these security-critical system call patterns is important for identifying similar malicious behaviors.
Hence, for the system call trace example shown in Figure 2, the feature set generated by our proposed SCGM contains five system call patterns, two of which have a frequency of 2.
2.3 Feature Vector Generation
We use the system call patterns generated by the SCGM to construct a feature vector of size N for each sample, where N is the total number of different system call patterns (features) obtained from both malware samples and benign samples. As listed in Table 2, each entry f of the feature vector FV indicates the frequency of the corresponding unique system call pattern (feature) in the current sample. If f is equal to 0, the corresponding system call pattern does not appear in the sample. Note that the last entry of FV is the label of the sample: 1 for malware and 0 for benign.
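A small sketch of this step follows. Representing a pattern as a tuple of system call names is an assumption made for the example, and the patterns themselves are invented.

```python
from collections import Counter

def feature_vector(sample_patterns, all_patterns, label):
    """Frequency of each known pattern in the sample, followed by the class label."""
    counts = Counter(sample_patterns)
    # Counter returns 0 for patterns that do not occur in the sample.
    return [counts[pattern] for pattern in all_patterns] + [label]

# Hypothetical global pattern list (feature order) and one malware sample.
all_patterns = [("brk",), ("chmod", "execve", "write"), ("connect", "socket")]
sample = [("brk",), ("brk",), ("chmod", "execve", "write")]
print(feature_vector(sample, all_patterns, label=1))
```

The fixed order of `all_patterns` is what makes vectors from different samples comparable column by column.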
3 Experiments and Evaluations
In this section, we first introduce the dataset and the evaluation metrics used in our experiments. Then, we present the experimental results in detail. We conduct a series of experiments to evaluate the effectiveness of SCGDet in distinguishing between malware and benign programs. We implement system call tracing, pruning and feature construction in Python, and use the scikit-learn toolkit [16] to build the malware detection model.
3.1 Dataset and Metrics
We collect 854 malware samples from VirusShare [2], which include trojans, exploits, viruses and various malicious scripts. We use the VirusTotal [3] online tool to identify the types of these malware samples. In addition, we collect 740 benign programs on Ubuntu 18.04, covering internet, system tools, video, office and other categories. We split the entire dataset 70%-30%. 70% of the samples were used to generate the malware detection model using the standard 10-fold cross-validation approach. The remaining 30% of the samples were used to evaluate the performance of the malware detection model.
For malware detection, we first introduce the confusion matrix shown in Table 3. We then use the following evaluation metrics in our experiments: $TPR = \frac{TP}{TP + FN}$, $FPR = \frac{FP}{FP + TN}$ and $Accuracy = \frac{TP + TN}{TP + FN + FP + TN}$.
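The three metrics follow directly from the confusion-matrix counts, using the standard definitions of TPR, FPR and Accuracy. The counts in the example are hypothetical and are not results from the paper.

```python
def detection_metrics(tp, fn, fp, tn):
    """Return (TPR, FPR, Accuracy) computed from confusion-matrix counts."""
    tpr = tp / (tp + fn)                        # share of malware detected
    fpr = fp / (fp + tn)                        # share of benign programs flagged
    accuracy = (tp + tn) / (tp + fn + fp + tn)  # share of all samples classified correctly
    return tpr, fpr, accuracy

# Hypothetical counts for illustration only.
tpr, fpr, accuracy = detection_metrics(tp=95, fn=5, fp=2, tn=98)
print(f"TPR={tpr:.4f} FPR={fpr:.4f} Accuracy={accuracy:.4f}")
```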
3.2 Experimental results
We apply three widely used machine learning algorithms to build malware detection models: Logistic Regression (LR), Support Vector Machine (SVM) and Random Forest (RF). To compare the malware detection performance with other methods, we also implement the 4-gram model on our dataset, which was reported to perform better in [6] and [15].
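The evaluation protocol (70%-30% split, 10-fold cross-validation on the training part, three classifiers) might look like the following with scikit-learn. The data is synthetic and the hyperparameters are library defaults or arbitrary choices; the paper does not report its settings, so none of this reproduces the reported numbers.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# Synthetic stand-in for pattern-frequency vectors: 200 samples, 20 features.
X = rng.poisson(lam=1.0, size=(200, 20))
y = rng.integers(0, 2, size=200)
X[y == 1, :5] += 3  # make the toy "malware" class separable on a few features

# 70% for model building, 30% held out for the final evaluation.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)

models = {
    "LR": LogisticRegression(max_iter=1000),
    "SVM": SVC(),
    "RF": RandomForestClassifier(n_estimators=100, random_state=0),
}

results = {}
for name, model in models.items():
    cv_accuracy = cross_val_score(model, X_train, y_train, cv=10).mean()
    test_accuracy = model.fit(X_train, y_train).score(X_test, y_test)
    results[name] = (cv_accuracy, test_accuracy)
    print(f"{name}: 10-fold CV accuracy={cv_accuracy:.3f}, held-out accuracy={test_accuracy:.3f}")
```

Stratifying the split keeps the malware/benign ratio similar in both parts, which matters when the two classes differ in size, as they do in the paper's dataset.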
3.2.1 Feature Space Analysis
First, we analyze the difference between the feature space obtained by the 4-gram method and the one obtained by our proposed feature construction method, SCGM. For this purpose, we randomly select 25%, 50% and 75% of the samples in our dataset for feature construction using the 4-gram modeling technique and SCGM, respectively. We repeat this process five times, and we also select all samples (just once) in our dataset for feature construction. We record the sizes of the different feature spaces and their averages, as shown in Table 4 and Table 5. We visualize the difference in the average size of the feature space between the 4-gram model and SCGM in Figure 5. It can be seen that, for the 4-gram model, the number of features grows proportionally with the number of samples. For SCGM, however, the curve gradually flattens as the number of samples increases.
Because the SCGM groups system calls based on the reachability relation, it produces fewer features than the 4-gram model, which reduces malware detection overhead. For instance, for the system call trace shown in Figure 2, the number of features generated by the 4-gram model is 15, while the number of features generated by SCGM is 5, as listed in Table 6.
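For reference, an n-gram feature extractor is a sliding window over the ordered trace, so every distinct ordering of n consecutive calls becomes its own feature. The sketch below uses a made-up trace, not the one in Figure 2.

```python
def ngrams(calls, n=4):
    """All windows of n consecutive system calls, in trace order."""
    return [tuple(calls[i:i + n]) for i in range(len(calls) - n + 1)]

# Made-up trace: the same four calls executed twice.
trace = ["open", "read", "write", "close", "open", "read", "write", "close"]
grams = ngrams(trace)
print(len(grams), "4-grams,", len(set(grams)), "distinct")
```

Even this short repetitive trace yields several distinct 4-grams, because each window that straddles the repetition boundary is a new ordering.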
3.2.2 Malware Detection Performance
Moreover, SCGDet can capture the semantic features of a run-time program. For example, a single SCGM feature can represent the typical propagation behavior of a virus. In contrast, the n-gram model only considers the strict order of system calls. As a result, SCGDet has higher detection accuracy and fewer false positives. The detailed experimental results are shown in Table 7. Because RF is an ensemble learning algorithm, it achieves the best malware detection performance. With RF as the classifier, SCGDet achieves a TPR of 97.44%, an FPR of 1.96% and an Accuracy of 97.78% on our dataset. Overall, SCGDet improves the average TPR by 2.86%, reduces the average FPR by 14.75% and improves the average Accuracy by 8.887%, as shown in Figure 6.
4 Related works
In the past, a considerable number of malware detection studies have been published. Security researchers have proposed many feature-statistics based approaches to identify malware. The statistics of features include frequency, prior probability, entropy, information gain and so on. [11] proposed an integrated approach that uses static and dynamic features to classify benign and malware samples, applying four common machine learning classifiers. Ahmadi et al. [4] proposed a similar approach, which uses API calls for feature construction. In addition, they applied feature selection methods to remove redundant and irrelevant features. [17] proposed the visual analysis of malware behavior using treemaps and threaded graphs and used the temporal values of system objects for malware detection. However, this method can lead to high false positive rates, as the proposed statistics have a low signal-to-noise ratio.
The n-gram is a popular data-mining technique and has been applied in dynamic malware detection approaches. [13] presented an approach that identifies malicious behaviors using Application Programming Interface (API) calls and system calls, respectively. They applied a signature-like approach to match the n-gram features that are clearly present in malware but absent in benign programs. Wu and Yap [18] presented a visualization approach to cluster malware samples that have similar malicious behaviors. They constructed n-gram features from byte opcodes and applied a hash-based technique to reduce the feature space. Because system calls provide effective information about the run-time behaviors of a program, many malware detection techniques based on system calls have been proposed. Forrest et al. [10] first proposed anomaly detection using sequences of system calls. Then [14] used system call arguments and sequences to capture the interrelation among different arguments of a system call, which can detect abnormal behaviors of run-time programs. Canali et al. [6] applied common data-mining techniques, such as n-gram, m-bag and k-tuples, for feature construction and used machine learning methods to classify the program under detection.
However, these n-gram based malware detection methods can lead to substantial performance overhead due to the huge feature space. Moreover, these methods do not capture the high-level semantic information of malicious behaviors and do not consider the frequency of system call patterns, which can result in high false positive rates. For example, malware can repeatedly invoke security-critical system call patterns to perform malicious attacks such as denial of service and CPU starvation. In this paper, we propose a system call pruning method to filter out system calls that have little impact on malware detection, and propose the SCGM for feature construction. Our proposed approach, SCGDet, captures the semantic features of typical malware behaviors by grouping system calls based on the reachability relation and considering the frequency of system call patterns. Compared to traditional n-gram based methods, SCGDet significantly reduces the feature space and has a lower false positive rate.
5 Conclusion
In this paper, we first observed that malware usually invokes security-critical system calls to carry out malicious activities, and that malware and its variants belonging to the same family usually invoke similar system call patterns because they perform similar malicious behaviors. Based on these observations, we developed a system call pruning method to exclude system calls that have little impact on malware detection. We then proposed a novel feature construction method, the system call graph model (SCGM), to capture the semantic features of typical malware behaviors by grouping system calls based on the reachability relation and considering the frequency of system call patterns. Compared to the n-gram method, our proposed technique, SCGDet, has a smaller feature space, a lower false positive rate and higher detection accuracy. Experimental results show that SCGDet reduces the average FPR by 14.75% and improves the average Accuracy by 8.887%. In the future, we plan to study classification among different malware families to determine which family the malware sample under detection belongs to.
References
- [1] McAfee Labs threats report, September 2018. https://www.mcafee.com/enterprise/en-us/assets/reports/rp-quarterly-threats-sep-2018.pdf
- [2] VirusShare. https://virusshare.com/
- [3] VirusTotal. https://www.virustotal.com/
- [4] Ahmadi, M., Sami, A., Rahimi, H., Yadegari, B.: Malware detection by behavioural sequential patterns. Computer Fraud & Security 2013 (8), 11–19 (2013)
- [5] Bayer, U., Moser, A., Kruegel, C., Kirda, E.: Dynamic analysis of malicious code. Journal in Computer Virology 2 (1), 67–77 (2006)
- [6] Canali, D., Lanzi, A., Balzarotti, D., Kruegel, C., Christodorescu, M., Kirda, E.: A quantitative study of accuracy in system call-based malware detection. In: Proceedings of the 2012 International Symposium on Software Testing and Analysis. pp. 122–132. ACM (2012)
- [7] Christodorescu, M., Jha, S., Kruegel, C.: Mining specifications of malicious behavior. In: Proceedings of the 6th joint meeting of the European software engineering conference and the ACM SIGSOFT symposium on The foundations of software engineering. pp. 5–14. ACM (2007)
- [8] Egele, M., Kruegel, C., Kirda, E., Yin, H., Song, D.: Dynamic spyware analysis (2007)
