A baseline for unsupervised advanced persistent threat detection in system-level provenance

Ghita Berrada, Sidahmed Benabderrahmane, James Cheney, William Maxwell, Himan Mookherjee, Alec Theriault, and Ryan Wright

arXiv:1906.06940·cs.CR·March 6, 2020


TL;DR

This paper evaluates the effectiveness of unsupervised anomaly detection algorithms in identifying advanced persistent threats from large-scale system provenance data across multiple operating systems, addressing a critical cybersecurity challenge.

Contribution

It provides the first detailed assessment of generic unsupervised anomaly detection methods for APT detection using system-level provenance data.

Findings

01

Unsupervised algorithms can detect APT-like attacks with varying effectiveness.

02

Streaming detection methods show promise for real-time APT detection.

03

The study highlights challenges and potential improvements in anomaly detection for cybersecurity.

Abstract

Advanced persistent threats (APT) are stealthy, sophisticated, and unpredictable cyberattacks that can steal intellectual property, damage critical infrastructure, or cause millions of dollars in damage. Detecting APTs by monitoring system-level activity is difficult because manually inspecting the high volume of normal system activity is overwhelming for security analysts. We evaluate the effectiveness of unsupervised batch and streaming anomaly detection algorithms over multiple gigabytes of provenance traces recorded on four different operating systems to determine whether they can detect realistic APT-like attacks reliably and efficiently. This report is the first detailed study of the effectiveness of generic unsupervised anomaly detection techniques in this setting.

Tables

Table 1: Example of a context: process identifiers vs. types of system events, i.e. the ProcessEvent (PE) context (extracted from the Android provenance graph)

Object_ID	EVENT_CLONE	EVENT_CHECK_FILE_ATTRIBUTES	EVENT_OTHER	EVENT_MPROTECT	EVENT_CLOSE	EVENT_CREATE_OBJECT	EVENT_LSEEK	EVENT_UNLINK	EVENT_WAIT	EVENT_MODIFY_PROCESS	EVENT_RECVFROM	EVENT_MODIFY_FILE_ATTRIBUTES	EVENT_WRITE	EVENT_BIND	EVENT_READ	EVENT_RENAME	EVENT_OPEN	EVENT_LOADLIBRARY	EVENT_CONNECT	EVENT_SENDTO	EVENT_SENDMSG
285d5fed-06dc-32ae-a04a-13cc9426616b	0	1	1	0	1	1	0	1	0	0	0	1	1	0	1	1	1	1	0	0	0
1e3548c0-b030-3591-97ac-71b67bbcb305	0	1	1	0	1	1	0	0	0	0	0	1	1	0	1	0	1	0	0	0	0
b4f1724e-0ba1-316b-973f-69e5d5e3490c	0	1	1	0	1	1	0	1	0	0	0	1	1	0	1	1	1	0	0	0	0
e2a4e818-3ce2-3626-8e22-134b542d1d77	0	0	1	0	0	0	0	0	0	0	0	0	1	0	1	0	0	0	0	0	0
……

Table 2: Description of the datasets used during the experiments. For each context, the first row gives the number of rows (processes) and the second row the number of columns (attributes).

	Windows	Windows	BSD	BSD	Linux	Linux	Android	Android
	Scenario 1	Scenario 2	Scenario 1	Scenario 2	Scenario 1	Scenario 2	Scenario 1	Scenario 2
Size	743 MB	9.53 GB	288 MB	1.27 GB	2858 MB	25.9 GB	2688 MB	10.9 GB
ProcessEvent (PE), rows	17569	11151	76903	224624	247160	282087	102	12106
ProcessEvent (PE), columns	22	30	29	31	24	25	21	27
ProcessExec (PX), rows	17552	11077	76698	224246	186726	271088	102	12106
ProcessExec (PX), columns	215	388	107	135	154	140	42	44
ProcessParent (PP), rows	14007	10922	76455	223780	173211	263730	0	24
ProcessParent (PP), columns	77	84	24	37	40	45	0	11
ProcessNetflow (PN), rows	92	329	31	42888	3125	6589	8	4550
ProcessNetflow (PN), columns	13963	125	136	62	81	6225	17	213
ProcessAll (PA), rows	17569	11151	76903	224624	247160	282104	102	12106
ProcessAll (PA), columns	14431	606	296	265	299	6435	80	295
$nb\_attacks$	8	8	13	11	25	46	9	13
$\%\,\frac{nb\_attacks}{nb\_processes}$	0.04	0.07	0.02	0.004	0.01	0.01	8.8	0.10

Table 3: Evaluation of batch anomaly scoring in Scenario 1 (nDCG scores). The higher the score (i.e. the closer to 1), the better. The best score per OS (row) is highlighted in bold.

(a) ProcessEvent

	FPOF	OD	OC3	CompreX	AVF
Windows	0.20	0.20	0.30	0.60	0.60
BSD	0.20	0.19	0.43	0.54	0.51
Linux	0.18	0.18	0.38	0.30	0.27
Android	0.29	0.33	0.74	0.82	0.84

(b) ProcessExec

	FPOF	OD	OC3	CompreX	AVF
Windows	0.15	0.15	0.28	DNF	0.28
BSD	0.15	0.15	0.49	DNF	0.34
Linux	0.18	0.18	0.30	DNF	0.43
Android	0.22	0.22	0.39	DNF	0.39

(c) ProcessParent

	FPOF	OD	OC3	CompreX	AVF
Windows	0.10	0.10	0.21	DNF	0.21
BSD	0.13	0.13	0.43	DNF	0.30
Linux	0.17	0.17	0.24	DNF	0.20
Android	NA	NA	NA	NA	NA

(d) ProcessNetflow

	FPOF	OD	OC3	CompreX	AVF
Windows	0.36	0.36	0.71	DNF	0.58
BSD	0.13	0.14	0.32	DNF	0.26
Linux	0.23	0.23	0.48	DNF	0.31
Android	0.42	0.36	0.67	DNF	0.47

(e) ProcessAll

	FPOF	OD	OC3	CompreX	AVF
Windows	DNF	DNF	DNF	DNF	0.52
BSD	0.21	0.19	0.65	DNF	0.52
Linux	DNF	DNF	DNF	0.46	0.29
Android	0.31	0.34	0.64	DNF	0.83

Table 4: Evaluation of batch anomaly scoring in Scenario 2 (nDCG scores). The higher the score (i.e. the closer to 1), the better. The best score per OS (row) is highlighted in bold.

(a) ProcessEvent

	FPOF	OD	OC3	CompreX	AVF
Windows	DNF	DNF	0.23	0.23	0.21
BSD	0.13	0.17	0.24	0.21	0.19
Linux	0.22	0.21	0.38	0.46	0.29
Android	0.36	0.22	0.32	0.78	0.30

(b) ProcessExec

	FPOF	OD	OC3	CompreX	AVF
Windows	DNF	DNF	0.24	DNF	0.22
BSD	0.18	0.17	0.51	DNF	0.17
Linux	0.20	0.20	0.42	DNF	0.42
Android	0.29	0.29	0.39	DNF	0.38

(c) ProcessParent

	FPOF	OD	OC3	CompreX	AVF
Windows	DNF	DNF	0.22	DNF	0.22
BSD	0.10	0.09	0.29	DNF	0.17
Linux	0.20	0.20	0.42	DNF	0.25
Android	0.20	0.20	0.39	DNF	0.25

(d) ProcessNetflow

	FPOF	OD	OC3	CompreX	AVF
Windows	DNF	DNF	DNF	DNF	DNF
BSD	DNF	0.15	DNF	DNF	DNF
Linux	DNF	DNF	DNF	DNF	DNF
Android	0.37	0.20	0.40	DNF	0.35

(e) ProcessAll

	FPOF	OD	OC3	CompreX	AVF
Windows	DNF	DNF	DNF	DNF	DNF
BSD	0.21	0.19	0.38	DNF	DNF
Linux	DNF	DNF	0.41	DNF	DNF
Android	0.31	0.34	0.82	DNF	0.35

Table 5: Running time results (in seconds) for the ProcessEvent context in Scenario 1.

	FPOF	OD	OC3	CompreX	AVF
Windows	47.90	57.38	0.62	60.76	0.79
BSD	3418.85	3641.86	3.54	214.79	4.59
Linux	814.15	890.87	7.30	564.51	12.59
Android	0.44	0.46	0.01	13.22	0.01

Table 10: Running time results (in seconds) for the ProcessEvent context in Scenario 2.

	FPOF	OD	OC3	CompreX	AVF
Windows	DNF	DNF	0.18	46.67	0.89
BSD	1840.7	2692.35	533.96	2975.59	17.20
Linux	2768.14	6054.92	22.77	970.79	16.16
Android	1551.88	1551.88	0.71	45.81	0.80

Table 15: Summary of the detection performance of batch and streaming AVF on $\mathsf{PA}$ for each dataset, and for block sizes of 1%, 5%, 10%, and 25%. nDCG and AUC scores (higher is better).

	Windows		BSD		Linux		Android
	nDCG	AUC	nDCG	AUC	nDCG	AUC	nDCG	AUC
Stream 1%	0.518	0.993	0.524	0.984	0.298	0.927	0.832	0.872
Stream 5%	0.490	0.984	0.524	0.984	0.298	0.928	0.828	0.857
Stream 10%	0.522	0.994	0.524	0.984	0.298	0.927	0.826	0.849
Stream 25%	0.496	0.985	0.525	0.984	0.298	0.928	0.828	0.858
Batch	0.527	0.996	0.524	0.984	0.298	0.927	0.834	0.878

Equations

$$\mathrm{AVF}(x) = \frac{1}{m}\sum_{j=1}^{m}\left(x_{j} c_{j} + (1 - x_{j})(n - c_{j})\right)$$

$$\begin{array}{c|ccc} id & \texttt{abc.com} & \texttt{xyz.com} & \texttt{evil.com}\\ \hline P_{17} & 1 & 1 & 0\\ P_{42} & 1 & 1 & 0\\ P_{1337} & 0 & 0 & 1\\ P_{007} & 1 & 1 & 1 \end{array}$$

$\mathrm{AVF}(P_{17})$, $\mathrm{AVF}(P_{42})$, $\mathrm{AVF}(P_{1337})$, $\mathrm{AVF}(P_{007})$

$$\mathrm{AVF}_{naive}^{(i)}(x) = \frac{1}{m}\sum_{j=1}^{m}\left(x_{j} c_{j}^{(i)} + (1 - x_{j})(i - c_{j}^{(i)})\right)$$

$\mathrm{AVF}(P_{17})$, $\mathrm{AVF}(P_{42})$, $\mathrm{AVF}(P_{1337})$, $\mathrm{AVF}(P_{007})$

$$p_{j}^{(i+1)} = \frac{n \times p_{j}^{(i)} + x_{j}}{i + 1}$$

$$\mathrm{AVF}^{(i+1)}(x) = \frac{1}{m}\sum_{j=1}^{m}\left(x_{j} p_{j}^{(i+1)} + (1 - x_{j})(1 - p_{j}^{(i+1)})\right)$$

$\mathrm{AVF}(P_{17})$, $\mathrm{AVF}(P_{42})$, $\mathrm{AVF}(P_{1337})$, $\mathrm{AVF}(P_{007})$

$$DCG_{N} = \sum_{i=1}^{N} \frac{rel_{i}}{\log_{2}(i + 1)}$$

$iDCG_{N}$

$$\frac{1}{|A|\,|\overline{A}|}\left|\{(\alpha, \beta) : r(\alpha) < r(\beta),\ (\alpha, \beta) \in A \times \overline{A}\}\right|$$
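The DCG and pair-counting expressions above can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions: relevance is taken to be binary (1 for an attack process, 0 otherwise), nDCG is computed as DCG divided by the DCG of an ideal ranking, and the pair-counting score is normalized by the number of (attack, non-attack) pairs. The function names `ndcg` and `pairwise_auc` are hypothetical and not taken from the paper's code.

```python
import math

def ndcg(relevances):
    """nDCG of a ranking; relevances[i-1] is rel_i, most anomalous first."""
    dcg = sum(rel / math.log2(i + 1)
              for i, rel in enumerate(relevances, start=1))
    # Assumption: the ideal ranking places all relevant items first.
    ideal = sorted(relevances, reverse=True)
    idcg = sum(rel / math.log2(i + 1)
               for i, rel in enumerate(ideal, start=1))
    return dcg / idcg if idcg > 0 else 0.0

def pairwise_auc(relevances):
    """Fraction of (attack, benign) pairs in which the attack is ranked first."""
    attack_ranks = [i for i, rel in enumerate(relevances, start=1) if rel]
    benign_ranks = [i for i, rel in enumerate(relevances, start=1) if not rel]
    if not attack_ranks or not benign_ranks:
        return 0.0
    good = sum(1 for a in attack_ranks for b in benign_ranks if a < b)
    return good / (len(attack_ranks) * len(benign_ranks))

# Toy ranking: attacks at ranks 1 and 4, benign processes at ranks 2 and 3.
ranking = [1, 0, 0, 1]
print(ndcg(ranking), pairwise_auc(ranking))
```

Because of the logarithmic discount, nDCG rewards rankings that place attacks near the very top, whereas the pair-counting score weights all positions equally.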


Code & Models

Repositories

https://gitlab.com/adaptdata/e2


Full text

A baseline for unsupervised advanced persistent threat detection in system-level provenance

Ghita Berrada

James Cheney

Sidahmed Benabderrahmane

William Maxwell

Himan Mookherjee

Alec Theriault

Ryan Wright


1 Introduction

For the past few years, damaging security and data breaches have frequently made the headlines Gootman (2016); Silver-Greenberg et al. (2014); Lee et al. (2014); Karchefsky and Rao (2017). These breaches are all examples of “advanced persistent threats” (APTs): long-running, stealthy attacks designed to penetrate specific target systems, carry out either pre-determined or dynamically updated instructions from an adversary, and persist (while avoiding detection) for as long as required to accomplish the adversary’s goals, such as data theft Silver-Greenberg et al. (2014); Gootman (2016) or corruption of the target organization’s data and damage to critical systems.

Security experts warn that APTs are now “part and parcel of doing business” Auty (2015) and concede that it would be unrealistic to expect all such attacks to be prevented and blocked Smith (2013); Maisey (2014); Auty (2015). This is partly because even the best-designed security systems are bound to have flaws, and partly because the targeted nature of the attacks means that adversaries will persistently try to gain access to the target’s system, adapting and changing their approaches if need be, until they reach their goal or the cost of succeeding far outweighs the benefits to be gained. As a result, the experts consider that, while adopting state-of-the-art prevention techniques is a must, the focus should shift to continuously monitoring systems, detecting APTs in a timely fashion, and minimizing their damage.

Traditional security software and measures (e.g. anti-virus software, system security policies) generally fail to detect APTs since APTs tend to mimic normal business logic and rely on actions that respect social norms (e.g. work schedule of targeted users) or system security policies. Moreover, the fact that APTs are long-running campaigns that consist of multiple steps further complicates their detection, in particular when relying on event logs and audit trails that only provide partial information on temporally and spatially localized events.

Provenance-tracking has been proposed as a basis for security (e.g. provenance-based access control Park et al. (2012)). It has been suggested that mining provenance data to analyze and identify causal relationships among system activities could help identify security threats and malicious actions, such as data exfiltration, that might go undetected with policy-driven approaches and other classical perimeter defence-based methods Jewell and Beaver (2011); Zhang et al. (2012); Awad et al. (2016); Jenkinson et al. (2017).

As appealing as the idea of monitoring provenance-like records to aid security sounds, there are numerous challenges to making it a reality. Beyond the issues linked with recording the provenance itself (e.g. level of provenance granularity, fault tolerance, trustworthiness of the recorded trace Jenkinson et al. (2017)), the recorded provenance traces are expected to be large in volume, with anomalous system activity (if any) likely to constitute but a very small fraction of the recorded traces. Analyzing provenance traces to identify anomalous activity that would suggest an ongoing APT attack is a typical “needle in a haystack” problem, further compounded by the variety of possible APT patterns and the lack of available fully annotated data. Typical supervised learning techniques therefore cannot be used to detect (rare) APT patterns.¹ Furthermore, unsupervised anomaly detection over streaming graphs is challenging Akoglu et al. (2015). We know of only one paper on anomaly detection over streaming provenance graph data Manzoor et al. (2016), but this approach relies on an initial training stage over “normal” example graphs, i.e. it is semi-supervised.

¹ Training supervised learning models for the APT detection task would require having a corpus of provenance data with realistic APT attacks along with complete annotations indicating which parts of the provenance graphs are part of an attack. In an operational context, such annotations are not readily available, and generating annotations for the provenance graphs a posteriori is prohibitively labor-intensive and time-consuming. Since we are developing and evaluating APT detection techniques to be used in an operational setting, we cannot assume the existence of a fully annotated corpus, so this naturally precludes the use of supervised learning models. The high class imbalance inherent to this application also means supervised learning techniques are not necessarily the best candidates for the detection task.

In an operational security scenario, it is critical to be able to provide actionable information quickly. Security analysts can usually identify and forensically investigate suspicious behavior (such as processes that have been subverted or created by an attacker) once it is brought to their attention. However, in typical system traces, each day of activity may lead to a gigabyte or more of provenance trace information, corresponding to hundreds or thousands of processes, almost all of which are benign. In this paper, we consider the key subproblem of quickly identifying unusual process activity that warrants manual inspection. Our approach summarizes process activity using categorical or binary features such as the kinds of events performed by a process, the process executable name and parent executable name, and IP addresses and ports accessed. We focus on categorical data because attacks typically involve rare combinations of such attributes.

This article evaluates the effectiveness of several algorithms for unsupervised, categorical anomaly detection:

• FPOutlier (or Frequent Pattern Outlier Factor (FPOF)) He et al. (2005)

• Outlier Degree (or OD) Narita and Kitagawa (2008)

• One-Class Classification by Compression (or OC3) Smets and Vreeken (2011)

• CompreX Akoglu et al. (2012)

• Attribute Value Frequency (or AVF) Koufakou et al. (2007); Tan et al. (2013)

All of these algorithms except for AVF are based on mining frequent itemsets or association rules and using these results to assign anomaly scores. Moreover, these mining-based techniques are all batch algorithms: in a first pass, the data is mined and analyzed (sometimes taking a lengthy period) and in a second pass, the scores are assigned. AVF is, instead, based on a simple analysis of the frequencies of the attributes. The original paper proposing AVF also only considered a batch setting, but later work Tan et al. (2013) showed how to modify AVF to a one-pass, streaming algorithm. We therefore refer to batch and streaming AVF in this paper.

We apply our work to provenance traces containing example APT attacks (on several different host operating systems) produced as part of the DARPA Transparent Computing program, in which attacks constitute as little as 0.01% of the data. We evaluated all of the above algorithms in batch mode. Our experiments show that on our dataset, AVF has anomaly detection performance comparable to or better than the itemset mining-based techniques, typically finding at least some parts of the attack within the top 1% or even 0.1%.

We also conducted experiments comparing batch and streaming AVF, using a modified form of the one-pass algorithm of Tan et al. (2013) that allows blocks of different sizes, in order to study how detection performance is affected by streaming. Our experiments comparing batch and streaming AVF with different block sizes show that there is little degradation in anomaly detection performance. Although our work (like any anomaly-detection technique) does not guarantee to find all attacks, our contribution demonstrates that unsupervised anomaly detection can help find APT-style attacks that currently go unnoticed, enabling analysts to focus their efforts where they are most needed.

This article does not propose new anomaly detection algorithms, and does not evaluate all of the possible algorithms for unsupervised anomaly detection on categorical data. All of the algorithms evaluated either have publicly-available implementations, or were easy to re-implement. It is possible that better results could be obtained using other algorithms that we have not yet tried; nevertheless, our results do establish a baseline against which new approaches (or evaluation of other existing algorithms) can be measured. Such a baseline is essential as a basis for assessing the effectiveness of more sophisticated algorithms, and whether their complexity is justified by increases in effectiveness.

The main contributions of this paper are:

• Establishing baseline results for five categorical anomaly detection methods, i.e. FPOF, OD, OC3, CompreX and AVF (in both batch and streaming modes for AVF), for the task of detecting APT-like activity in system provenance traces

• Thoroughly evaluating and comparing the effectiveness of these five anomaly detection methods for the studied task

• Showing that some methods, namely OC3 and AVF, already produce useful detection results in reasonable times despite their relative simplicity (“naive” set of features requiring barely any domain knowledge or tweaking and/or very simple anomaly scoring strategy, e.g. AVF) and that these results can, in some cases (e.g. AVF), very easily be replicated in a streaming setting

• Discussing appropriate metrics for the detection task and proposing a metric from information retrieval (normalized discounted cumulative gain, nDCG) as a suitable metric

The structure of the rest of this paper is as follows. Section 2 presents the overall system architecture and outlines our approach. Section 3 reviews AVF and our variant of streaming AVF. Section 4 presents an experimental evaluation of the effectiveness of the different approaches, establishing a baseline for unsupervised anomaly detection on this data. Section 5 summarizes related work on APTs and anomaly detection. Section 6 concludes and suggests directions for future work.

A short glossary of acronyms used in the paper is included as an appendix.

2 Overview

2.1 Provenance trace analysis

In this section, we situate our work as part of a realistic provenance-based security scenario. Figure 1 outlines the architecture of our system, which is designed to interoperate with several different (provenance) recorders Gehani and Tariq (2012); Jenkinson et al. (2017), each running on a different operating system and generating different styles of provenance graphs recording system activity (albeit in a common format). In this paper, we consider four sources, running on Android, Linux, BSD and Windows operating systems.

Our system receives the provenance graph data from each recording system, as a stream of JSON records in a binary format, and ingests the data into a graph database, Neo4J. In addition, ingestion performs some additional data integration and deduplication steps to deal with some idiosyncrasies among the sources. The different systems use the shared data model in different ways, for example storing information in different places, at different levels of granularity, or just not populating some fields. We remove some information that is not consistently recorded and reorganize other information so that typical queries can be written portably across data sources. Deduplication is important because the recorders add their own unique identifiers for operating system processes and other objects. This is necessary to avoid ambiguity given that operating system-issued process identifiers or filenames are not unique over long periods of time (i.e. days). However, some recording systems create multiple records referring to the same process (or other object) with different unique identifiers. The ingester attempts to detect and merge these duplicates, using heuristics such as “two processes with the same process ID and started at the same time are identical”.
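The deduplication heuristic quoted above can be illustrated with a short sketch. It assumes, purely for illustration, that each process record is a dictionary with hypothetical `uuid`, `pid` and `start_time` fields; the actual record format and the other heuristics used by the ingester are not described in this section.

```python
from collections import defaultdict

def merge_duplicates(records):
    """Map each recorder-issued UUID to one canonical UUID per process.

    Heuristic (from the text): two records with the same process ID and the
    same start time refer to the same process.
    """
    groups = defaultdict(list)
    for rec in records:
        groups[(rec["pid"], rec["start_time"])].append(rec["uuid"])
    canonical = {}
    for uuids in groups.values():
        representative = min(uuids)  # arbitrary but deterministic choice
        for u in uuids:
            canonical[u] = representative
    return canonical

# Hypothetical records: "a1" and "b2" describe the same process;
# "c3" reuses the PID at a later time and is a different process.
records = [
    {"uuid": "a1", "pid": 100, "start_time": 1000},
    {"uuid": "b2", "pid": 100, "start_time": 1000},
    {"uuid": "c3", "pid": 100, "start_time": 2000},
]
canonical = merge_duplicates(records)
print(canonical)
```

Keying on the pair (process ID, start time) rather than the process ID alone reflects the point made above that operating-system process identifiers are reused over long periods.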

Once the graph data has been ingested, we extract Boolean-valued datasets called contexts from the graph (an example of context is provided in Table 1). Each context represents an aspect of process behavior as a Boolean-valued vector. As a simple example, we could use attributes corresponding to event types (read, write, etc.) with value ‘1’ meaning that the process performed at least one event of that type and ‘0’ otherwise; the exact number of such events is ignored. We discuss additional contexts later in this section. Contexts can be extracted using queries over the fully-ingested data, for forensic analysis, or by incrementally maintaining appropriate data structures and periodically emitting new records. Each context can then be run through the anomaly detection algorithms described in Section 3, yielding a score for each process.

These scores are provided to the user interface (UI) frontend, which allows analysts to explore the graph using queries, or search for anomalies based on the scores. Figure 2 shows a typical provenance graph created using the UI graph visualization system, as a result of a successful attack detection. This illustration highlights that even fairly simple activities can yield complex graphs involving multiple read/write or network access events.

Our system has participated in several DARPA exercises in concert with the recording systems, in which realistic background activity was simulated on each system, and realistic APT-style attacks were performed, yielding several gigabytes of raw trace data, corresponding to tens of millions of nodes and edges. We have manually annotated the data to indicate the processes constituting the attacks for each of these scenarios. Typically, the number of processes involved in an attack is very small: for example, in the largest dataset, there are over 282,000 processes (representing seven days of activity), and only 46 of them (i.e. around 0.016%) are involved in the attack. Even if we optimistically assume an analyst can recognize an attack process in just 10 seconds, screening 200,000 processes would take over 23 days. Thus, although attacks are often easy to recognize once brought to the attention of an analyst, the sheer volume of background activity makes it imperative to find ways to automatically direct attention to suspicious activity.
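The figures quoted above follow from simple arithmetic, using the 10 seconds per process and the process counts stated in the text:

$$200{,}000 \times 10\,\mathrm{s} = 2 \times 10^{6}\,\mathrm{s}, \qquad \frac{2 \times 10^{6}\,\mathrm{s}}{86{,}400\,\mathrm{s/day}} \approx 23.1\ \text{days}$$

$$\frac{46}{282{,}000} \approx 1.6 \times 10^{-4} \approx 0.016\%$$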

2.2 Contexts

We now give the details of the contexts that form the starting point for our proposed algorithms. In our approach, the context definitions are the only places where domain knowledge about the data is used. We consider the following contexts:

• ProcessEvent (PE): The integrated traces use event types such as open, close, exit, etc. to describe process activity in an OS-independent way. A process $p$ has attribute $ty$ if $p$ ever performs an event of type $ty$ (disregarding the exact number of events).

• ProcessExec (PX): The attributes are executable names $nm$, for example ls or sudo. A process $p$ has attribute $nm$ if $p$ is an instance of executable $nm$.

• ProcessParent (PP): The attributes are again executable names $nm$. A process $p$ has attribute $nm$ if $p$ is a child process of an executable named $nm$.

• ProcessNetflow (PN): The attributes are IP addresses $ip$ and port numbers $pn$. A process $p$ has attributes $ip$ and $pn$ if it ever communicates with IP address $ip$ at port $pn$.

• ProcessAll (PA): the combination of all of the above contexts, with attributes renamed to avoid any ambiguity (for example between $\mathsf{PX}$ and $\mathsf{PP}$).

These contexts may seem rather simplistic. For example, it seems intuitive to also consider files accessed by processes as attributes. Also, it would make sense to consider more complex attributes that look for patterns that are known to be suspicious, such as downloading a file, executing it, and then deleting it. However, our goal is to minimize the amount of fine-tuning needed to obtain useful results. There is also a trade-off between granularity of attributes and performance: the more attributes we track, the more work needs to be done at each step. Nevertheless, it would be worthwhile, in subsequent work, to consider richer contexts or well-chosen attributes that encode domain knowledge about what activities are suspicious. It might also be interesting to consider features that extend existing contexts, for example:

• the number of times each type of event is performed or the frequency of each type of event performed (as opposed to just whether particular types of event are performed as in $\mathsf{PE}$)

• Netflow properties not taken into account in $\mathsf{PN}$ such as total number of bytes transferred

Such features would require discretization if they are to be used with the categorical anomaly detection methods explored in this paper. Otherwise, they would have to be used with numerical anomaly detection methods yet to be explored, with the results of such methods then fused with those obtained from categorical anomaly detection methods. This is beyond the scope of the current paper and will be explored in future work.

Each of these contexts can also be extracted from the data incrementally, as the data is ingested. For each process encountered, we construct an attribute vector with value 1 for each attribute the process has (in a given context) and 0 otherwise. The resulting sequence of vectors constitutes a dataset $D=x^{(1)},\ldots,x^{(n)}$ which we use as the starting point for the algorithms in the next section.
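As a rough illustration of the construction just described, the following sketch builds a Boolean dataset from (process, attribute) observations. The input format and the name `build_context` are assumptions made for this example; for brevity the sketch works in batch over a list rather than incrementally, and it derives the attribute set from the data instead of fixing it in advance.

```python
def build_context(observations):
    """Build a Boolean context from a list of (process_id, attribute) pairs.

    Returns (process_ids, attributes, D), where D[i][j] is 1 if process i
    has attribute j and 0 otherwise. Repeated observations are ignored,
    since only the presence of an attribute matters.
    """
    attributes = sorted({attr for _, attr in observations})
    index = {attr: j for j, attr in enumerate(attributes)}
    rows = {}
    for pid, attr in observations:
        row = rows.setdefault(pid, [0] * len(attributes))
        row[index[attr]] = 1
    process_ids = list(rows)  # order of first appearance
    return process_ids, attributes, [rows[p] for p in process_ids]

# Hypothetical ProcessEvent-style observations.
observations = [
    ("p1", "EVENT_READ"),
    ("p1", "EVENT_WRITE"),
    ("p2", "EVENT_READ"),
    ("p1", "EVENT_READ"),  # duplicate event type: no effect on the vector
]
process_ids, attributes, D = build_context(observations)
print(attributes)
print(dict(zip(process_ids, D)))
```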

3 Algorithms

We consider datasets $D$ to be sequences of $m$-dimensional Boolean vectors, where there are $n>0$ vectors and $m>0$ attribute values. Likewise, we consider data sources to be streams of $m$-dimensional vectors. In either case, we consider a typical record $x^{(i)}$ at position $i$ and write $x^{(i)}_{j}$ for the value of attribute $j$ in $x^{(i)}$. We assume for simplicity that all attributes are Boolean-valued. It is not difficult to generalize to finite sets of attribute values. We also assume that the number of possible attributes $m$ is fixed.

We start by reviewing the various batch-only approaches, then describe both versions of the Attribute Value Frequency (AVF) algorithm: the original batch algorithm Koufakou et al. (2007) and its extension to a streaming setting. We present the original algorithm in a batch processing form, i.e. where we assume we have all of the data before computing scores. We then show how to modify it to obtain an online algorithm that gives a good approximation of the results of the batch algorithm and allows for a choice of different window sizes. This algorithm is a mild variation of the one-pass AVF algorithm Tan et al. (2013).

3.1 Batch-only anomaly detection techniques

In this section, we briefly review the batch-only algorithms for anomaly detection in the literature used in our evaluation. These descriptions are not exhaustive; the respective research papers should be consulted for full details.

3.1.1 FPOutlier (FPOF)

The FPOutlier algorithm He et al. (2005) starts by mining frequent itemsets according to a support parameter $minsupp$ (the algorithm only mines and considers itemsets that occur in a fraction of data transactions greater than or equal to $minsupp$). Then each object is assigned a score corresponding roughly to the number of frequent itemsets it contains. Thus, larger scores correspond to more occurrences of frequent itemsets, meaning that anomalous objects should have low scores. This approach seems well-suited to detecting anomalies corresponding to expected, but missing, activity. However, objects that have unusual activity but also display a large number of common patterns may have high scores and not be considered anomalous. In addition, the fact that this approach has a tunable parameter is problematic in an unsupervised setting, since it means that we need to guess an appropriate value for this parameter in advance. We reimplemented FPOutlier using standard itemset mining libraries.
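The scoring idea described above can be sketched as follows. This is a toy illustration rather than the reimplementation used in the paper: it assumes the commonly cited form of the score (sum of the supports of the frequent itemsets contained in an object, divided by the total number of frequent itemsets), and it replaces a real itemset mining library with a brute-force miner limited to `max_len` items, a parameter introduced here only to keep the example small.

```python
from itertools import combinations

def frequent_itemsets(transactions, minsupp, max_len=3):
    """Brute-force miner: map each itemset with support >= minsupp to its support."""
    n = len(transactions)
    items = sorted(set().union(*transactions))
    frequent = {}
    for k in range(1, max_len + 1):
        for cand in combinations(items, k):
            itemset = frozenset(cand)
            supp = sum(1 for t in transactions if itemset <= t) / n
            if supp >= minsupp:
                frequent[itemset] = supp
    return frequent

def fpof_scores(transactions, minsupp, max_len=3):
    """FPOF-style score per transaction; low scores suggest anomalies."""
    frequent = frequent_itemsets(transactions, minsupp, max_len)
    if not frequent:
        return [0.0] * len(transactions)
    return [
        sum(supp for itemset, supp in frequent.items() if itemset <= t) / len(frequent)
        for t in transactions
    ]

# Three similar processes and one that shares no frequent pattern with them.
transactions = [frozenset(t) for t in (
    {"read", "write", "open"},
    {"read", "write", "open"},
    {"read", "write", "open"},
    {"connect"},
)]
scores = fpof_scores(transactions, minsupp=0.5)
print(scores)
```

In this toy data the last process contains none of the frequent itemsets, so it receives the lowest score, which matches the convention that low FPOF scores indicate anomalies.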

3.1.2 Outlier Degree (OD)

The Outlier Degree algorithm Narita and Kitagawa (2008) also starts by mining frequent itemsets as well as high-confidence rules, so there are two parameters, $minsupp$ governing the minimum support of the itemsets and $minconf$ governing the minimum confidence of the rules. Then each object is scored by applying the high-confidence rules to it, and assigning a score corresponding roughly to the difference between the object’s actual behavior and expected behavior (according to the rules). For example, if $X\to Y$ is a high-confidence rule and object $O$ displays behavior $X$ but not $Y$ , this will contribute to the score. High scores correspond to larger differences between actual and expected behavior, so are more anomalous. Like FPOutlier, this approach seems more likely to consider missing, but expected, behaviors to be anomalous, and could miss anomalies that consist of rare behaviors that do not occur frequently enough to participate in rules. Also, the presence of two tunable parameters is even more problematic from the point of view of unsupervised anomaly detection. We reimplemented OD using standard itemset and rule mining libraries.
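The rule-based scoring described above can be sketched roughly as follows. The sketch makes several assumptions: the high-confidence rules are written by hand instead of being mined with $minsupp$ and $minconf$, and the score is taken to be the fraction of the rule-implied behavior that the object does not actually exhibit, which is one common way of formalizing "difference between actual and expected behavior". The function name `outlier_degree` is hypothetical.

```python
def outlier_degree(transaction, rules):
    """Score an object by how much rule-implied behavior it is missing.

    rules: list of (antecedent, consequent) pairs of frozensets, assumed to
    be high-confidence rules obtained elsewhere.
    """
    actual = set(transaction)
    expected = set(actual)
    changed = True
    while changed:  # apply rules until no new items are implied
        changed = False
        for antecedent, consequent in rules:
            if antecedent <= expected and not consequent <= expected:
                expected |= consequent
                changed = True
    if not expected:
        return 0.0
    return len(expected - actual) / len(expected)

# Hypothetical rule: processes that open something usually also close it.
rules = [(frozenset({"open"}), frozenset({"close"}))]
od_normal = outlier_degree({"open", "read", "close"}, rules)
od_odd = outlier_degree({"open", "read"}, rules)
print(od_normal, od_odd)
```

The second object displays `open` without the expected `close`, so it gets a higher score than the first, in line with the $X\to Y$ example given in the text.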

3.1.3 One-Class Classification by Compression (OC3)

OC3 Smets and Vreeken (2011) is based on a compression technique for identifying “interesting” itemsets, implemented using the Krimp algorithm Vreeken et al. (2011). Essentially, the idea is to first mine frequent itemsets from the data, and then identify a subset of the itemsets that help to compress the data well. Then, each object is assigned an anomaly score corresponding to its estimated compressed size. If the compression algorithm has done a good job, then objects exhibiting commonly occurring patterns will compress well, and anomalies will not. So high compression sizes (i.e. high scores) point to anomalies. OC3 can take a $minsupp$ support parameter, but parameter tuning is typically not necessary because the compression algorithm will filter out any non-useful itemsets; therefore we used the smallest possible $minsupp$ setting in our experiments. The implementation of Krimp is available and we modified it slightly to perform OC3-style anomaly scoring.
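To make the notion of "compressed size as anomaly score" concrete, here is a heavily simplified sketch. It does not implement Krimp's search for a good set of itemsets: the itemsets are simply supplied by hand (an assumption), each object is covered greedily with them, and each itemset is assigned a code length of $-\log_2$ of its relative usage. All function names are hypothetical.

```python
import math
from collections import Counter

def cover(transaction, code_table):
    """Greedily cover a transaction with disjoint itemsets, in code-table order."""
    remaining = set(transaction)
    used = []
    for itemset in code_table:
        if itemset and itemset <= remaining:
            used.append(itemset)
            remaining -= itemset
    return used

def compressed_sizes(transactions, candidate_itemsets):
    """Encoded size in bits of each transaction; large sizes suggest anomalies."""
    items = sorted(set().union(*transactions))
    singletons = [frozenset({i}) for i in items]
    # Longer itemsets first; singletons last so every item can be covered.
    multi = sorted((s for s in set(candidate_itemsets) if len(s) > 1),
                   key=lambda s: (-len(s), sorted(s)))
    code_table = multi + singletons
    covers = [cover(t, code_table) for t in transactions]
    usage = Counter(x for c in covers for x in c)
    total = sum(usage.values())
    # Frequently used itemsets get short codes, rarely used ones long codes.
    return [sum(-math.log2(usage[x] / total) for x in c) for c in covers]

transactions = [frozenset(t) for t in (
    {"read", "write", "open"},
    {"read", "write", "open"},
    {"read", "write", "open"},
    {"read", "connect"},
)]
# Hand-picked itemset standing in for the output of the mining step.
sizes = compressed_sizes(transactions, [frozenset({"read", "write", "open"})])
print(sizes)
```

The three common objects are each encoded with a single frequently used itemset, while the last object needs two rarely used singleton codes and therefore has the largest encoded size.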

3.1.4 CompreX

CompreX Akoglu et al. (2012) is perhaps the most sophisticated approach studied to date. It is based on compression, like OC3, but uses a different compression strategy. CompreX searches for a partition of the attributes such that each set of attributes in the partition has high mutual information. Since there are exponentially many partitions to consider, CompreX starts with the finest partition (all attributes are in their own class) and greedily searches for pairs of classes to merge. Each resulting partition is then compressed separately, to obtain an anomaly score for each record based on its compressed size, as in OC3. CompreX has no tuning parameters and was shown experimentally to be competitive or superior in anomaly detection performance to Krimp/OC3 on several datasets. However, CompreX’s default search strategy is quadratic in the number of attributes; therefore, it was not usable on contexts with over 20-30 attributes.

3.2 Attribute Value Frequency (AVF)

In this section, we describe the original batch Attribute Value Frequency (AVF) algorithm Koufakou et al. (2007) and then its modification to suit a streaming setting Tan et al. (2013). Unlike the algorithms mentioned earlier, AVF is rather simple and does not require additional background material to describe, both in the batch and streaming settings. Since we implemented both variants of AVF from scratch in a unified way, rather than reusing existing libraries or implementations as for the other approaches, we will spell out the details.

Attribute Value Frequency (AVF) Koufakou et al. (2007) is a non-parametric outlier detection technique appropriate for categorical data and was shown to be fast, scalable and accurate on a variety of standard data sets. The algorithm relies on the intuition that outliers in a dataset have values of attributes which occur infrequently. That the attribute values in a data point are infrequent can be determined simply by computing the frequencies of the respective attribute values across the data.

Given a dataset $D$ of $n$ records over $m$ binary attributes, we write $c_j$ for the number of occurrences of attribute value $1$ for attribute $j$, i.e. $c_j=|\{i\mid x^{(i)}_j=1\}|=\sum_{i=1}^{n}x^{(i)}_j$. Then, the AVF score of a data point $x$ is:

$$AVF(x)=\frac{1}{m}\sum_{j=1}^{m}\bigl(x_j\,c_j+(1-x_j)(n-c_j)\bigr)$$

That is, when $x_j=1$, the contribution to the score for attribute $j$ is $c_j$, the number of occurrences of a $j$-value of 1, and when $x_j=0$, the contribution is $n-c_j$, the number of occurrences of a $j$-value of 0. The initial multiplication by $1/m$ effectively averages the counts, so $0\leq AVF(x)\leq n$, but such scaling has no effect on the relative ordering among scores in the batch setting. Lower AVF scores indicate more unusual behavior.

Example 1 (Running example).

To illustrate AVF, we introduce a small running example with four processes $P_{17},P_{42},P_{1337},P_{007}$ and three attributes $\texttt{abc.com}$, $\texttt{xyz.com}$ and $\texttt{evil.com}$, corresponding to network addresses accessed by the processes. In this (extremely simplistic) example, $P_{17}$ and $P_{42}$ are innocuous activity and access both abc.com and xyz.com, while $P_{1337}$ is a naive attacker that only accesses evil.com and $P_{007}$ is a more sophisticated attacker that accesses all three in order to attempt to camouflage its behavior. This behavior corresponds to the following dataset:

[TABLE]

We calculate the frequencies of the three attributes as $c_{\texttt{abc.com}}=c_{\texttt{xyz.com}}=3$ and $c_{\texttt{evil.com}}=2$. Thus, the AVF scores are:

[TABLE]

The naive attacker’s isolated access of evil.com, together with failure to mask its activity with common behavior, results in a lower score, while the more sophisticated attacker’s score is the same as that of the first two processes.
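The running example can be reproduced with a short sketch of batch AVF for binary data, following the description above (one scan to count attribute values, one scan to score). The function name `avf_scores` and the use of exact fractions are choices made here for illustration; they are not taken from the implementation used in the experiments.

```python
from fractions import Fraction


def avf_scores(rows):
    """Batch AVF for binary rows: lower scores indicate more unusual records."""
    n = len(rows)
    m = len(rows[0])
    # First scan: c_j = number of records with value 1 for attribute j.
    counts = [sum(row[j] for row in rows) for j in range(m)]
    scores = []
    # Second scan: average the frequency of each record's attribute values.
    for row in rows:
        total = sum(counts[j] if row[j] == 1 else n - counts[j] for j in range(m))
        scores.append(Fraction(total, m))
    return scores


# Columns: abc.com, xyz.com, evil.com (as described in Example 1).
dataset = {
    "P17": (1, 1, 0),
    "P42": (1, 1, 0),
    "P1337": (0, 0, 1),
    "P007": (1, 1, 1),
}
scores = dict(zip(dataset, avf_scores(list(dataset.values()))))
for name, score in scores.items():
    print(name, score)
```

With these inputs the naive attacker $P_{1337}$ receives 4/3, while the other three processes, including the camouflaged attacker $P_{007}$, all receive 8/3.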

Streaming AVF: Naive approach

A simple, but unfortunately too naive, approach to streaming the AVF algorithm is to maintain the attribute value counts incrementally as data is processed, and use the current counts to score each new transaction. That is, if $c^{(i)}_j$ are the counts calculated for $x^{(1)},\ldots,x^{(i)}$, then to score a new record $x=x^{(i+1)}$ we proceed as follows:

[TABLE]

However, because the counts are monotonically increasing, this means that the scoring will be heavily biased towards considering records appearing early in the dataset to be anomalous. For example:

Example 2.

Continuing our running example, we need to update the counts after each step. Thus, the AVF scores are:

[TABLE]

In this (admittedly extreme) example, the first process $P_{17}$ is judged most anomalous, followed by $P_{1337}$, then $P_{42}$ and finally $P_{007}$.
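The bias towards early records can be observed with a minimal sketch of the naive streaming scheme. One detail is an assumption here: the counts are updated with the new record before it is scored, so the exact values may differ from those of Example 2 if the original formulation scores first and updates afterwards. The function name `naive_streaming_avf` is hypothetical.

```python
from fractions import Fraction


def naive_streaming_avf(rows):
    """Score each record with the raw counts accumulated so far."""
    m = len(rows[0])
    counts = [0] * m
    scores = []
    for seen, row in enumerate(rows, start=1):
        # Assumption: the new record is counted before it is scored.
        for j in range(m):
            counts[j] += row[j]
        total = sum(counts[j] if row[j] == 1 else seen - counts[j] for j in range(m))
        scores.append(Fraction(total, m))
    return scores


# Running example in arrival order: P17, P42, P1337, P007.
rows = [(1, 1, 0), (1, 1, 0), (0, 0, 1), (1, 1, 1)]
naive = naive_streaming_avf(rows)
print(naive)
```

Under this assumption the benign first record $P_{17}$ receives the lowest score (1, tied with the naive attacker), simply because all counts are still small when it arrives, and scores grow for later records regardless of their content.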

Streaming AVF

As observed by Tan et al. (2013), the problem is that the “scale” of the AVF scores is not fixed in the streaming setting, since seeing an attribute whose value has occurred only once means something very different for the 5th record in the dataset than for the 5000th record.

Instead, to compute AVF-like scores incrementally, we propose to use the frequency counts to estimate probabilities for each attribute. We initially take $p^{(0)}_j=0$ since the data is typically sparse (having relatively few attribute values $x_j=1$); however, any other initial probability distribution could be used based on domain knowledge. Next, for each new record $x^{(i+1)}$, we adjust the probability $p_j^{(i+1)}$ of each attribute $j$ being 1 after seeing $x^{(i+1)}$ as follows:

[TABLE]

We then calculate the AVF score for the $(i+1)$-st record $x=x^{(i+1)}$ as follows:

[TABLE]

Note that, in the batch setting, dividing the counts by $n$ and summing probabilities instead of counts would not affect the final results, because all the counts are divided by the same $n$ . However, for the streaming setting, we update the attribute value probabilities after each step, so the results of AVF scoring will be different in the streaming setting.

Example 3.

Continuing our running example, we now update the probabilities after each step. Thus, the AVF scores are:

[TABLE]

The naive attacker’s behavior results in a lower (more anomalous) score than the first process $P_{17}$.
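A probability-based streaming variant can be sketched as follows. The update rule used here, a running mean $p_j^{(i+1)}=(i\,p_j^{(i)}+x_j)/(i+1)$ starting from $p_j^{(0)}=0$, and the score $\frac{1}{m}\sum_j \bigl(x_j p_j+(1-x_j)(1-p_j)\bigr)$ are assumptions chosen to be consistent with the description above; the exact formulas of the original may differ. The function name `streaming_avf` is hypothetical.

```python
from fractions import Fraction


def streaming_avf(rows):
    """One pass: update per-attribute probabilities, then score the new record."""
    m = len(rows[0])
    probs = [Fraction(0)] * m  # p_j^(0) = 0
    scores = []
    for i, row in enumerate(rows):  # i = number of records seen before this one
        # Assumed update: running mean of the attribute values seen so far.
        probs = [(i * probs[j] + row[j]) / (i + 1) for j in range(m)]
        total = sum(probs[j] if row[j] == 1 else 1 - probs[j] for j in range(m))
        scores.append(total / m)
    return scores


# Running example in arrival order: P17, P42, P1337, P007.
rows = [(1, 1, 0), (1, 1, 0), (0, 0, 1), (1, 1, 1)]
streamed = streaming_avf(rows)
print(streamed)
```

Because the scores are now averages of probabilities, they stay in $[0,1]$ no matter how many records have been seen. Under the assumed update rule, the naive attacker $P_{1337}$ scores 1/3, below the score of 1 given to $P_{17}$.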

3.2.1 Analysis

As outlined already, the batch AVF approach is implementable as two scans over the data, and the online AVF approach can be implemented in a single, linear scan, where scoring each new record and updating the frequencies takes $O(m)$ time and space. Both algorithms just need to maintain the number of records $n$ and the $m$ counts or probabilities. Thus, the overall time complexity of each algorithm is $O(nm)$ and the space required is $O(m)$ . In our experiments, the number of attributes $m$ ranges from around 20 to over 14,000. Our approach may not scale well if the attributes are fine-grained and $m$ is much larger than $n$ .

Another concern the reader might have is regarding arithmetic precision and overflow. If fixed-size (say, 32-bit) integers are used, then whenever we are in danger of overflowing we can rescale by dividing all of the counts by 2; this is exactly what is done in arithmetic coding Witten et al. (1987). Our implementation uses arbitrary-precision arithmetic.
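The rescaling idea for fixed-size counters can be sketched as below. This is an illustration only: the helper name `update_counts`, the choice to halve the record count `n` together with the attribute counts, and the `limit` parameter are assumptions, and halving with integer division only approximately preserves the relative frequencies.

```python
def update_counts(counts, n, row, limit=2**31 - 1):
    """Add one binary record; halve all counters first if n would exceed limit."""
    if n + 1 > limit:
        counts = [c // 2 for c in counts]
        n //= 2
    counts = [c + x for c, x in zip(counts, row)]
    return counts, n + 1


# Tiny limit so that the rescaling path is exercised.
counts, n = update_counts([4, 2, 0], 4, (1, 0, 1), limit=4)
print(counts, n)
```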

4 Experimental evaluation

4.1 Experimental setup

The experiments were run on a desktop with an Intel Core i7-6700 CPU (3.4 GHz), 16 GB RAM, running Ubuntu 16.04. The raw provenance trace data was ingested on a variety of machines and the contexts used in the experiments were extracted and stored as CSV files (available at http://www.gitlab.com/adaptdata). We do not report the experimental setup for the ingestion stage here in detail; however, it is easily able to keep up with the data in real-time (that is, ingestion of data representing 7 days of system activity takes much less than 7 days). Our experiments focus on evaluating the detection effectiveness and runtime cost of the anomaly detection algorithms on the given context data.

4.2 Datasets

In our experiments, we use two data collections, described in Table 2, representing two attack scenarios, each consisting of several days’ worth of activity in a DARPA evaluation of provenance-tracking systems. These data collections result from two exercises for evaluating provenance recorders and anomaly detection techniques. The first data collection/scenario (a) consists of roughly 5 days of process and netflow activity, whereas the second data collection/scenario (b) corresponds to around 8 days of data generated in similar conditions to the previous scenario. The provenance graphs have been recorded on four different tracking systems, running on Windows, BSD, Linux and Android respectively, each of which was subject to (part of) an APT-style campaign. The main differences between scenarios 1 and 2 concern the background activity workload, the quality and the robustness of the attacks, and the size of the provenance graphs.

Table 2 records, for each triplet context (rows 3 to 7)/OS/scenario (OS and scenarios are columns), the number of transactions $n$ (top value per context row) and the number of attributes $m$ (bottom value per context row). The number of processes encountered in each system varies significantly: in particular, the Linux dataset records 3–10 times as many distinct processes as the Windows or BSD datasets and up to 2400 times as many processes as Android. Some contexts are empty, e.g. $\mathsf{PP}$ for Android in Scenario 1, where information about parent process relationships was unavailable. In general, among the base contexts, the $\mathsf{PE}$ context usually has the largest number of processes, followed by $\mathsf{PX}$ and $\mathsf{PP}$, while $\mathsf{PN}$ or $\mathsf{PX}$ have the largest number of attributes, followed by $\mathsf{PP}$. The number of attacks per OS/scenario is extremely low and ranges from 8 (Windows both scenarios) to 46 (Linux scenario 2). Note that the size of the original dataset does not directly correlate with the number of processes or attributes. For example, in scenario 1, the Android dataset is the largest but has the fewest processes and attributes, because the provenance recorder for Android records a great deal of low-level app activity and dynamic information flow tracking, which we do not analyze. The last row represents the percentage of attacks observed in each OS/context. For example, there are 8 attack processes in the Windows data (0.04%) in the first scenario, and 8 (0.07%) in the second one. The percentage of attacks per OS/scenario goes as low as 0.004% (BSD scenario 2) and as high as 8.8% (Android scenario 1).

4.3 Evaluation metrics

The anomaly detection methods that we evaluate output a ranking of processes according to their degree of suspiciousness/anomaly scores. These methods do not explicitly classify or label entities as anomalous or normal. Moreover, the data is unbalanced, with between 0.004% and 8.8% of the data belonging to attacks. A high accuracy could be obtained by simply classifying all processes as non-attacks, so accuracy would be a poor indicator of model quality: this is the accuracy paradox Thomas and Balakrishnan (2008). That being the case, it would not be appropriate to use metrics usually employed to evaluate classification methods.

4.3.1 Normalized discounted cumulative gain

To evaluate the anomaly detection algorithms described earlier, we propose using a metric called the normalized discounted cumulative gain metric (or nDCG for short). It is a metric often used in information retrieval to assess the quality of a ranking.

Given a typical document search application, Järvelin and Kekäläinen (2002) argued that, from a user’s perspective, highly relevant documents are more valuable than marginally relevant documents, and a relevant document ranked high in the returned list of results is more valuable than an equally relevant document ranked lower in the list. A user may reasonably be assumed to scan the list of returned results from the beginning and to stop at some point that depends on the time available, the effort required and the information accumulated from documents already seen. So it is safe to assume that relevant documents located further down the list of returned results are unlikely to be seen by the user, since reaching them requires more time and effort, and are therefore less valuable. Taking these facts into account, Järvelin and Kekäläinen (2002) introduced the nDCG measure.

We similarly argue that, in our application, processes that are part of an attack but are ranked very low by an anomaly detection technique are virtually useless to an analyst, since the analyst’s monitoring burden increases substantially with the number of processes to be checked (not to mention issues such as loss of trust in the automated monitoring system, discarding of its alerts, and the increased potential for misses and errors as the volume of data to monitor grows). Because of this, we believe nDCG to be an appropriate metric for our application.

To compute the nDCG, we start by computing a score called discounted cumulative gain or DCG. The basis of DCG is that each document/entity in the ranking is assigned a relevance score and is penalized by a value logarithmically proportional to its position/rank in the list of results. The DCG is therefore computed as follows:

[TABLE]

where $N$ is the number of entities/documents in the list and $rel_i$ is the relevance score of the $i$-th entity/document in the list.

Since the length of result lists can vary and the DCG score does not take that into account, it is common to normalize the DCG score by the ideal DCG score (iDCG), which is simply the best achievable DCG score, i.e. the score that would be achieved if all relevant entities were at the top of the list (and in the case of different degrees of relevance, with the highest values of relevance at the very top). Assuming we have $p$ relevant entities in the list, we have:

[TABLE]

In our case, we only consider entities to be either relevant (processes that are part of an attack) or irrelevant (processes with normal behavior) and assign a relevance score $rel_i$ of 1 to attack processes and of 0 to benign processes, and the idealized score results from ranking all $p$ attack processes at positions $1,\ldots,p$. The closer the nDCG score to 1, the better the ranking.
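For binary relevance, the computation can be sketched in a few lines. The sketch assumes the common form $DCG=\sum_{i=1}^{N} rel_i/\log_2(i+1)$; other variants of the discount exist, and the function names `dcg` and `ndcg` are chosen here for illustration.

```python
import math


def dcg(relevances):
    """Discounted cumulative gain of a ranked list of relevance scores."""
    return sum(rel / math.log2(i + 1) for i, rel in enumerate(relevances, start=1))


def ndcg(relevances):
    """DCG normalised by the DCG of the ideal (sorted) ranking."""
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0


# 1 = attack process, 0 = benign process, listed from most to least anomalous.
print(ndcg([1, 1, 0, 0]))  # both attacks ranked first
print(ndcg([0, 0, 1, 1]))  # both attacks ranked last
```

The input is the list of relevance labels in ranking order, so the same function applies to any detector that outputs a ranking of processes.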

4.3.2 Area under curve

The receiver operating characteristic curve (or ROC curve) for a given ranking of objects plots the fraction of true positives found against the fraction of false positives found. The area under the ROC curve (also called Area Under Curve, or AUC) is often used as a measure of anomaly detection performance.

In our case, the AUC would correspond to the proportion of processes with normal behavior ranked lower than processes that are part of an attack, computed as follows:

[TABLE]

where $A$ is the set of elements with a relevant label (i.e. elements that are part of an attack), $\overline{A}$ is the set of elements with an irrelevant label (i.e. elements that have a normal behavior), $r(\alpha)$ (resp. $r(\beta)$ ) is the rank assigned to $\alpha$ (resp. $\beta$ ) by the method to be evaluated. The best performance for a method under this metric (resp. the worst performance) is achieved with AUC of one (resp. of zero).

However, in the presence of sparse anomalies in large datasets, the AUC score’s usefulness is somewhat limited. The AUC can either overestimate the effectiveness of an algorithm (e.g. if all attacks are found at rank 900–1000 out of 200,000 then the AUC will be over 0.995 but the results are still nearly useless), or underestimate it (e.g. if half of the attacks are found in the top 10 and the other half at rank 100,000, then the AUC is at most around 0.75 even though these results might be very valuable). Berrada and Cheney (2019) reported some experiments on the same datasets including both AUC values and nDCG scores for the OC3 and AVF algorithms and found that the scores are loosely correlated, but that AUC scores are typically uniformly high and much higher than nDCG scores. AUC scores usually fell in the relatively narrow range 0.75–0.99 (which would seem to indicate that all algorithms perform well in the attack detection task), whereas nDCG scores typically ranged from 0.2 to 0.8 (suggesting more nuanced performance). Based on this, AUC values would not necessarily discriminate properly between well-performing and poorly performing algorithms. We will therefore present only the nDCG scores for the batch algorithms, but present both nDCG and AUC scores for the comparison of batch and streaming AVF in order to understand whether either metric is affected by stream processing or block size.
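The two numerical examples above can be checked with a small sketch that computes the AUC as the fraction of (attack, normal) pairs in which the attack is ranked ahead of the normal process. The sketch assumes a strict ranking without ties, with rank 1 being the most anomalous; the function name `auc_from_attack_ranks` and the specific attack counts used in the examples are illustrative assumptions.

```python
def auc_from_attack_ranks(attack_ranks, total):
    """AUC from the 1-based ranks of the attacks in a ranking of `total` items."""
    ranks = sorted(attack_ranks)
    n_attacks = len(ranks)
    n_normal = total - n_attacks
    if n_attacks == 0 or n_normal == 0:
        raise ValueError("need at least one attack and one normal item")
    pairs = 0
    for k, r in enumerate(ranks):
        normals_before = (r - 1) - k  # items ahead of this attack that are normal
        pairs += n_normal - normals_before  # normals ranked after this attack
    return pairs / (n_attacks * n_normal)


# All attacks at ranks 900..1000 out of 200,000: high AUC, yet hard to use.
late = auc_from_attack_ranks(range(900, 1001), 200_000)
# Half of the attacks in the top 10, the other half around rank 100,000.
split = auc_from_attack_ranks([1, 2, 3, 4, 5] + list(range(100_000, 100_005)), 200_000)
print(late, split)
```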

4.4 Forensic anomaly detection

In this section we consider the following empirical question:

•

Q1: Can the five batch methods (FPOF, OD, OC3, CompreX, AVF) detect APT-style attacks effectively?

We first evaluate the effectiveness and performance of the batch version of AVF compared with several other offline techniques, such as FPOutlier (FPOF) He et al. (2005), Outlier-degree (OD) Narita and Kitagawa (2008), OC3 Smets and Vreeken (2011), and CompreX Akoglu et al. (2012).

FPOF and OD were reimplemented in Python according to the descriptions of the algorithms. We reused publicly-available implementations of OC3 and CompreX (http://eda.mmci.uni-saarland.de/prj/), implemented in C++ and Matlab respectively. The FPOF, OD and OC3 methods require setting some parameters, which is not the case for AVF or CompreX. For OC3, we used the lowest possible support parameter and used closed itemset mining to reduce the total number of itemsets considered in the mining stage. For FPOF and OD, we considered a range of support and confidence parameter settings in the range 0.1–0.9 and 0.97 and report the best results obtained using any parameter setting.

We report the results of all algorithms running on the contexts described in Section 2.2 in Table 3e for the first scenario, and Table 4e for the second one. Some algorithms did not finish within a reasonable time (more than 3 hours) and when this is the case we write $DNF$ . This happens most often with CompreX on contexts where there are large numbers of attributes, because CompreX searches for a partition of the attributes into groups with high mutual information, which seems to exhibit quadratic running time in the number of attributes.

FPOF and OD were not competitive on any dataset, even after trying several possible support and confidence parameter values and taking the maximum nDCG score. The best two methods are AVF and OC3: in scenario 1, AVF produced the best (or tied) results in 8 out of 19 scenarios and OC3 produced the best (or tied) results in 12 out of 19 scenarios. In the second attack scenario, AVF produced the best results in 4 out of 20 scenarios and OC3 produced the best results in 12 out of 20 scenarios. AVF’s performance degrades significantly from scenario 1 to scenario 2 (the nDCG range goes from 0.20–0.84 in scenario 1 to 0.17–0.42 in scenario 2), in particular for the BSD and Android datasets, which might be due to both a large increase in the size of the BSD and Android contexts and a drop in the percentage of attacks present in the data. AVF performs best on small to medium datasets. In contrast, OC3’s performance is more stable between scenarios (the nDCG range goes from 0.21–0.74 in scenario 1 to 0.22–0.84 in scenario 2) and less affected by an increase in context size or a drop in attack percentage (the performance only really drops for BSD-related contexts, and by a smaller margin than in the case of AVF).

CompreX was often unable to complete within a reasonable time: for wider contexts such as $\mathsf{PX}$ or $\mathsf{PP}$, it usually did not terminate within a few minutes. Akoglu et al. (2012) mention that CompreX could be run as an anytime algorithm, but the available implementation does not support this. In the few cases where CompreX completed in a reasonable time (5 out of 20 scenarios in scenario 1 and 4 out of 20 scenarios in scenario 2), it frequently outperformed both OC3 and AVF and performed best in most cases (3 out of 5 times for scenario 1 and 3 out of 4 times for scenario 2).

In general, nDCG scores were highest for the Android dataset with the first attack scenario (between 0.83-0.84) and lowest for the Linux dataset, suggesting a rough (but unsurprising) correlation between the amount of data and difficulty of ranking attacks effectively. OC3 and AVF performed considerably better than any other technique on the different datasets. Likewise, no single context was consistently best, and considering all contexts joined together in $\mathsf{PA}$ was not always better than considering one of the base contexts.

To help build intuition regarding how the nDCG scores correspond to actual rankings, we visualize the results of AVF for Linux (first attack scenario) in Figure 3. This “band diagram” shows the positions of the attacks in the rankings obtained by AVF for the five contexts. The x-axis of the figure is on a logarithmic scale, so red lines far to the left represent attacks ranked within the top 10, then top 100, etc. As this figure illustrates, an nDCG score of 0.43 (obtained by AVF on the $\mathsf{PX}$ context in the 1st scenario) corresponds to two attacks found in the top 10, while scores of under 0.3 tend to correspond to the highest-ranked attacks occurring at rank 100–1000.

Overall, we can conclude that AVF and OC3 are competitive, since they generated the highest nDCG scores in both scenarios.

Tables 9-14 show the running times for the various algorithms (Tables 9 to 9 for Scenario 1 and Tables 14 to 14 for Scenario 2). Just as with detection performance, the best performing algorithms in terms of running time are OC3 followed by AVF: most scenarios complete in under 3 minutes (18 out of 20 for both OC3 and AVF for scenario 1 and 19 out of 20 for both OC3 and AVF for scenario 2). Runtime-wise, FPOF, CompreX and OD were significantly more expensive than OC3 or AVF (in the cases where they complete, they typically run in minutes rather than seconds). As mentioned previously, CompreX does not complete in a reasonable time in most cases (it only completes in 20% to 25% of the cases depending on the scenario). Both OD and FPOF take more than 3 minutes in a significant proportion of the cases (7 out of 20 cases for scenario 1 and 14 out of 20 cases for scenario 2), so they are not competitive in terms of either running time or detection performance: both algorithms start with frequent itemset/frequent rule mining, which is notoriously computationally expensive, particularly for low support and/or confidence thresholds.

4.5 Streaming anomaly detection

In this section, we consider the following empirical questions:

•

Q2: Is the detection performance of streaming AVF competitive with batch AVF in terms of nDCG and AUC?

•

Q3: Is the runtime performance of streaming AVF competitive with batch AVF?

4.5.1 Detection performance

To evaluate the streaming version of AVF, we generated 10 randomly-shuffled versions of each dataset from Scenario 1 and ran the streaming algorithm on each dataset. We consider different randomly-shuffled datasets in order to avoid any dependence on a particular order of processing the data; it could be that analyzing the data ordered by time could produce better (or worse) results. In practice, it is not guaranteed that we will see all processes in temporal order, because records for some long-lived processes may not become available until the process terminates. We divided the datasets into block sizes of various granularities (1%, 5%, 10%, 25% of the data) to investigate the effect of granularity on effectiveness and performance. For each dataset and block size, we computed the median ranking of each attack over the 10 shuffled runs. These median rankings are taken to be representative.
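The shuffle-and-take-the-median protocol can be sketched as follows. This is an illustration of the evaluation loop only: block sizes are not modelled, `score_stream` is a hypothetical callable standing in for the streaming detector (it takes the rows in arrival order and returns one score per row, lower meaning more anomalous), and the toy scorer and toy records used in the example are not AVF and not the paper's data.

```python
import random
import statistics


def median_attack_ranks(records, attack_ids, score_stream, runs=10, seed=0):
    """Median rank of each attack over several randomly shuffled runs."""
    rng = random.Random(seed)
    ids = list(records)
    ranks = {a: [] for a in attack_ids}
    for _ in range(runs):
        order = ids[:]
        rng.shuffle(order)
        scores = score_stream([records[i] for i in order])
        # Stream positions sorted from most to least anomalous (lowest score first).
        ranking = sorted(range(len(order)), key=lambda k: scores[k])
        for position, k in enumerate(ranking, start=1):
            if order[k] in ranks:
                ranks[order[k]].append(position)
    return {a: statistics.median(r) for a, r in ranks.items()}


# Toy, order-independent stand-in scorer (NOT AVF): score = number of 1s.
records = {"P17": (1, 1, 0), "P42": (1, 1, 1), "P1337": (0, 0, 1)}
result = median_attack_ranks(records, ["P1337"], lambda rows: [sum(r) for r in rows])
print(result)
```

With an order-dependent scorer such as a streaming detector, the ranks of an attack vary between runs, and the median summarises them in a way that is robust to an unlucky ordering.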

We present nDCG and AUC results for the $\mathsf{PA}$ context only; these results are representative of the base contexts. Table 15 summarizes the nDCG and AUC metrics for the streaming algorithm (with four different block sizes) and for the batch algorithm (at the bottom). These results show that the nDCG scores for all four datasets are fairly stable, with only the Windows dataset displaying degradation of nDCG score of more than 0.01. Likewise, the AUC scores of most streaming variants were close to those of the batch algorithm, with only the Windows and Android AUC scores changing by more than 0.01. Overall these results suggest that small block sizes do not significantly degrade the usefulness of the results of AVF scoring.

Figure 4 plots the ratio of true positives found vs. ranking position, for the four different $\mathsf{PA}$ datasets. The red lines are the performance of the batch AVF algorithm while the blue lines are the streaming versions. (For the BSD dataset, the differences are not visible.) We can also gain a stronger intuition regarding the usefulness of the results from these figures: for example, for the Linux $\mathsf{PA}$ context we can see that the nDCG score of 0.298 corresponds to finding about half of the attacks in the first 1% of the rankings, while others are not found until 40%. Figure 4 also shows that, for most datasets (except Android), at least 80% of true positives (i.e. attacks) are found in the top 5% of the data.

4.5.2 Analysis time

Figure 5 summarizes the time taken per run for both batch and streaming versions of AVF (the streaming times were obtained by taking the median of the times over the ten runs on shuffled inputs). Note that the y-axis is logarithmic scale. The running time is in general proportional to the amount of data in each context (number of rows $\times$ number of columns). In particular, the time needed for $\mathsf{PA}$ is often considerably longer than the times needed for the other contexts. The reason is that some contexts (such as $\mathsf{PE}$ ) have many rows and few columns, while others (such as $\mathsf{PN}$ ) have many columns and few rows. Combining them into $\mathsf{PA}$ yields a very sparse context with many zeros. We plan to investigate whether using a more succinct storage format for the contexts, or combining the scores of the subcontexts, might lead to better performance. The streaming execution times also increase, as expected, with the increase of streaming block size.

5 Related work

Prior work on APTs is mostly concerned with describing/modeling the characteristics of an APT and its attack model Sood and Enbody (2013); Virvilis et al. (2013); Chen et al. (2014), sometimes using case studies Karchefsky and Rao (2017). A few recent studies address the APT detection problem by constructing models of normal behavior against which incoming data is compared and flagged as anomalous if it deviates from the learned models. Friedberg et al. (2015) explain the shortcomings of current security solutions with regard to APT detection, in particular contending that preventive security mechanisms and signature-based methods are not enough to tackle the challenge of APTs, and propose an anomaly detection-based framework to detect APTs by learning a model of normal system behavior from host-based security logs and detecting deviations. Siddiqui et al. (2016) use the fractal dimension as a feature to classify TCP/IP session data patterns into anomalous (and part of an APT) or normal patterns. Moya et al. (2017) construct decision tree-based models of normal network activity based on features extracted from firewall logs, then use the learned models to classify incoming network traffic. Some work has also been done on the detection of specific patterns that might be part of an APT attack, e.g. detection of data leakage/data exfiltration Jewell and Beaver (2011); Awad et al. (2016) or detection of command and control (C&C) domains Niu et al. (2017). Another recent paper Lamprakis et al. (2017) reconstructs a Web requests dependencies graph from Web requests logs using domain knowledge and proposes an unsupervised approach relying on the reconstructed graph to identify APT C&C channels. In contrast, in this paper, we seek to evaluate APT detection approaches developed on host-based data (unlike Lamprakis et al. (2017); Niu et al. (2017); Siddiqui et al. (2016); Moya et al. (2017), which rely on datasets recording various aspects of network activity) that use as little domain knowledge as possible (the goal being to check the detection performance on datasets constructed to minimize the amount of pre-processing and fine-tuning) and try to detect traces of APT activity without targeting a specific type of APT pattern (unlike Jewell and Beaver (2011); Awad et al. (2016)).

There is a considerable literature on intrusion and malware detection, which is mainly split in two approaches: misuse detection (e.g. Kumar and Spafford (1994)) and anomaly detection (e.g. Ji et al. (2016)). The principle of misuse detection is to search for events (i.e. known attacks) that match predefined signatures and patterns. Methods relying on misuse detection can only detect attacks whose signature and patterns are known, which would be unsuitable for APT detection. By contrast, anomaly detection assumes abnormal behaviours can come in varied, potentially unknown, shapes and focuses on detecting activity that deviates from normal activity i.e. activity usually recorded on a particular host or network.

There are several comprehensive surveys of anomaly detection and outlier detection that consider categorical data, continuous data, and structured data (e.g. graphs) Chandola et al. (2009); Akoglu et al. (2015). Of these approaches, graph anomaly detection appears the most relevant for our problem, but most of this work has considered special cases of graphs (e.g. undirected or unlabeled), whereas provenance graph data has rich structure (labeled nodes, labeled edges, multiple properties on nodes and edges). Anomaly detection approaches for provenance graphs reported so far rely on training on benign traces Manzoor et al. (2016), require user-provided annotations Hossain et al. (2017), or assume that the background activity is highly regular Ul Hassan et al. (2018). Another recent contribution by Siddiqui et al. (2018) shows that human-in-the-loop feedback can be used in a semi-supervised way to improve detection results over baseline unsupervised detectors over numerical data. Berrada and Cheney (2019) investigated aggregation of anomaly scores/ranks from different contexts, and found that using AVF and OC3 as base detectors, simple score or rank aggregation techniques provide improved detection performance.

On the other hand, there are a number of generic approaches to anomaly detection for discrete (categorical) data He et al. (2005); Narita and Kitagawa (2008); Koufakou et al. (2007, 2011); Smets and Vreeken (2011); Akoglu et al. (2012); Bertens et al. (2017). Most of these approaches first mine the data for frequent itemsets or association rules, and all then perform anomaly scoring in a second pass over the data. A one-pass, streaming variant of AVF was presented by Tan et al. (2013). Some approaches, notably OC3 Smets and Vreeken (2011) and CompreX Akoglu et al. (2012), are based on the Minimum Description Length (MDL) principle Grünwald (2007). Both perform a preprocessing stage to find a compressed representation of the dataset, then consider the resulting compressed size of each record as its score. Since OC3 was often the most effective batch algorithm, we think it would be interesting to develop a streaming approach based on MDL, either by adapting the underlying Krimp compression algorithm Vreeken et al. (2011) to support streaming anomaly detection, or by building on streaming compression techniques such as adaptive arithmetic coding Witten et al. (1987). The UPC algorithm of Bertens et al. (2017) is also based on pattern mining and MDL, and is inherently a two-pass approach, but seeks a different kind of anomaly than AVF, OC3, and CompreX, consisting of unexpectedly rare combinations of frequent itemsets.

There are also some anomaly detection techniques for mixed categorical and numerical data Yamanishi et al. (2004); Koufakou and Georgiopoulos (2010) that could be applied to purely categorical data. The ODMAD algorithm Koufakou and Georgiopoulos (2010), like most categorical techniques, performs an initial off-line pattern mining stage. To the best of our knowledge, SmartSifter Yamanishi et al. (2004) is the only previous unsupervised online algorithm applicable to categorical data. SmartSifter incrementally maintains a histogram density model of the categorical data and, for each combination of attributes, a continuous distribution (such as a multivariate Gaussian mixture model) for the numerical attributes. SmartSifter’s running time is $O(2^{m}d^{2}k)$, where $m$ is the number of categorical attributes, $d$ the number of numerical attributes (i.e. the dimension), and $k$ the number of components of the mixture model. Their experiments considered datasets with $m\leq 1$ and $d\leq 7$, and it is unclear whether this approach can scale to large numbers ($m>100$) of categorical attributes. The techniques based on itemset mining are exponential in the number of attributes in the worst case, but have acceptable performance in practice, while the AVF approaches require only $O(m)$ time to process each input record.
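The $O(m)$ per-record cost of AVF can be seen in a minimal batch sketch. It follows the usual definition of the AVF score as the mean frequency of a record's attribute values, with low scores indicating outliers; it is an illustrative assumption and may differ in detail from the variant evaluated in this paper. The function name and the toy records are hypothetical.

```python
from collections import Counter

def avf_scores(records):
    """Return one AVF score per record (lower = more anomalous).

    Pass 1 counts how often each value occurs in each attribute.
    Pass 2 averages those counts over the m attributes of each record,
    which takes O(m) dictionary lookups per record.
    """
    m = len(records[0])
    freq = [Counter(r[i] for r in records) for i in range(m)]
    return [sum(freq[i][r[i]] for i in range(m)) / m for r in records]

# Hypothetical categorical records with m = 2 attributes.
records = [("a", "x"), ("a", "x"), ("a", "y"), ("b", "y")]
avf = avf_scores(records)
print(avf)
```

The last record contains the rare value `"b"`, so it receives the lowest score and would be ranked as the most anomalous.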

Relatively few publications making use of the DARPA Transparent Computing datasets have appeared; much of the data has not been made publicly available, and ground truth annotations are often not available in machine-readable form. In some cases, systems have been evaluated using these datasets but the raw data, or derived products, have not been made available, making it difficult to reproduce their results. Both Siddiqui et al. (2018) and Berrada and Cheney (2019) used datasets derived from Transparent Computing but the data were not made publicly available. We believe that this article is the first to evaluate anomaly detection algorithms on publicly available datasets derived from the Transparent Computing project.

6 Conclusion

Detecting APT-style attacks in real-world settings is extremely difficult in general. In this paper, we investigated the feasibility of finding processes that may be part of such attacks by analyzing their behavior. We considered five different batch algorithms, one of which can also be adapted easily to a streaming setting. Our experiments showed that both batch and online approaches are effective in finding attacks and can analyze several days’ worth of activity (tens or hundreds of thousands of process summaries, sometimes with over ten thousand attributes) in a few minutes, a negligible cost compared to the time and effort needed to record and store this data. Moreover, our results were validated on provenance traces gathered from four different operating systems, subject to several different kinds of attacks; many of the attacks were ranked among the top 0.1-1%.

We believe that this work represents a significant contribution, in that it can provide a low-cost, yet effective line of defense in a larger provenance-based monitoring system, and establishes a baseline for comparison of more sophisticated (and time-consuming) techniques. Nevertheless, there are a number of areas for improvement. First, interpreting and analyzing the processes flagged for investigation is still mostly a manual process, motivating further support for identifying connections between the most anomalous processes. Second, it is also important to consider the (common) case when there is no attack. Since attacks are rare and, in a given trace, there are typically hundreds or thousands of anomalous processes that are not part of the attack, more work is needed to identify suitable thresholds to limit effort in this case. Finally, our approach assumes that the attacker is not aware of or able to manipulate the detection system; sophisticated attackers will naturally seek to either evade observation entirely or modify their behavior so as to minimize anomaly scores. Further research is needed on how to make anomaly detection robust even if attackers know how their activity is being monitored.

Acknowledgements

This material is based upon work partially supported by the Defense Advanced Research Projects Agency (DARPA) under contract FA8650-15-C-7557. Mookherjee was partially supported by a grant from LogicBlox, Inc.

Glossary

Acronyms

APTs: Advanced Persistent Threats
AUC: Area Under Curve
AVF: Attribute Value Frequency
FPOF: Frequent Pattern Outlier Factor
nDCG: Normalized Discounted Cumulative Gain
OC3: One-Class Classification by Compression
OD: Outlier Degree
PA: Process All
PE: Process Event
PN: Process Network
PP: Process Parents
PX: Process Exec
ROC: Receiver Operator Characteristic
UI: User Interface

Bibliography

Akoglu et al. [2012] Leman Akoglu, Hanghang Tong, Jilles Vreeken, and Christos Faloutsos. Fast and reliable anomaly detection in categorical data. In CIKM, pages 415–424, 2012.
Akoglu et al. [2015] Leman Akoglu, Hanghang Tong, and Danai Koutra. Graph based anomaly detection and description: a survey. Data Min. Knowl. Discov., 29(3):626–688, 2015.
Auty [2015] Mike Auty. Anatomy of an advanced persistent threat. Network Security, 2015(4):13–16, 2015.
Awad et al. [2016] Abir Awad, Sara Kadry, Guraraj Maddodi, Saul Gill, and Brian Lee. Data leakage detection using system call provenance. In INCoS, pages 486–491. IEEE, 2016.
Berrada and Cheney [2019] Ghita Berrada and James Cheney. Aggregating unsupervised provenance anomaly detectors. In 11th International Workshop on Theory and Practice of Provenance (TaPP 2019), Philadelphia, PA, June 2019. USENIX Association. URL https://www.usenix.org/conference/tapp2019/presentation/berrada.
Bertens et al. [2017] Roel Bertens, Jilles Vreeken, and Arno Siebes. Efficiently discovering unexpected pattern-co-occurrences. In SDM, pages 126–134, 2017.
Chandola et al. [2009] Varun Chandola, Arindam Banerjee, and Vipin Kumar. Anomaly detection: A survey. ACM Comput. Surv., 41(3):15:1–15:58, July 2009. ISSN 0360-0300.
Chen et al. [2014] Ping Chen, Lieven Desmet, and Christophe Huygens. A study on advanced persistent threats. In IFIP International Conference on Communications and Multimedia Security, pages 63–72. Springer, 2014.
