Deep Graph Stream SVDD: Anomaly Detection in Cyber-Physical Systems

Ehtesamul Azim; Dongjie Wang; Yanjie Fu

arXiv:2302.12918·cs.LG·February 28, 2023

TL;DR

This paper introduces a deep graph stream SVDD method that leverages transformers and variational graph auto-encoders to improve anomaly detection in cyber-physical systems by capturing temporal patterns and dynamic sensor connections.

Contribution

It proposes a novel approach combining transformers, graph clustering, and auto-encoders to address limitations in existing anomaly detection methods for cyber-physical systems.

Findings

1. F1-score improved by 35.87% over the best baseline

2. AUC increased by 19.32% over the best baseline

3. Training and inference are 32 times faster than the best baseline

Abstract

Our work focuses on anomaly detection in cyber-physical systems. Prior literature has three limitations: (1) failing to capture long-delayed patterns in system anomalies; (2) ignoring dynamic changes in sensor connections; (3) the curse of high-dimensional data samples. These limit the detection performance and usefulness of existing works. To address them, we propose a new approach called deep graph stream support vector data description (SVDD) for anomaly detection. Specifically, we first use a transformer to preserve both short and long temporal patterns of monitoring data in temporal embeddings. Then we cluster these embeddings according to sensor type and utilize them to estimate the change in connectivity between various sensors to construct a new weighted graph. The temporal embeddings are mapped to the new graph as node attributes to form a weighted attributed graph. We input the…

Tables

Table 1: Statistics of SWaT Dataset

Data Type	Feature Number	Total Items	Anomaly Number	Normal/Anomaly
Normal	51	496800	0	-
Anomalous	51	449919	53900	7:1

Table 2: Experimental Results on SWaT dataset

Method	Precision (%)	Recall (%)	F1-score (%)	AUC (%)
OC-SVM	34.11	68.23	45.48	75
Isolation-Forest	35.42	81.67	49.42	80
LOF	15.81	93.88	27.06	63
KNN	15.24	96.77	26.37	61
ABOD	14.2	97.93	24.81	58
GANomaly	42.12	67.87	51.98	68.64
LODA	75.25	38.13	50.61	67.1
DGS-SVDD	94.17	82.33	87.85	87.96

Table 3: Ablation Study of DGS-SVDD

Transformer-based Temporal Embedding Module	Weighted Attributed Graph Generator	VGAE-based Spatiotemporal Embedding Module	Precision (%)	Recall (%)	F1-score (%)	AUC (%)
✗	✗	✗	4.61	12.45	6.74	18.55
✓	✗	✗	69.98	64.75	67.26	78.14
✗	✓	✓	12.16	99.99	21.68	18.22
✓	✗	✓	87.79	76.68	81.86	82.45
✓	✓	✓	94.17	82.33	87.85	87.96

Equations

$$\text{Attn}(\mathbf{T}_{t},\mathbf{T}_{t})=\text{softmax}\left(\frac{(\mathbf{T}_{t}\cdot\mathbf{W}^{Q}_{t})(\mathbf{T}_{t}\cdot\mathbf{W}^{K}_{t})^{\top}}{\sqrt{L_{x}}}\right)\cdot(\mathbf{T}_{t}\cdot\mathbf{W}^{V}_{t})$$

$$\mathbf{T}^{\prime}_{t}=\text{Concat}(\text{Attn}^{1}_{t},\text{Attn}^{2}_{t},\cdots,\text{Attn}^{h}_{t})\cdot\mathbf{W}^{O}_{t}$$

$$\mathbf{U}_{t}=\mathbf{T}^{\prime}_{t}+\text{Relu}(\mathbf{T}^{\prime}_{t}\cdot\mathbf{W}^{1}_{t}+\mathbf{b}^{1}_{t})\cdot\mathbf{W}^{2}_{t}+\mathbf{b}^{2}_{t}$$

$$\mathbf{\check{T}}_{t+1}=\mathbf{U}_{t}\cdot\mathbf{W}^{p}_{t}+\mathbf{b}^{p}_{t}$$

$$\min\sum_{t=1}^{L_{x}}||\mathbf{T}_{t+1}-\mathbf{\check{T}}_{t+1}||^{2}$$

$$\mathbf{\hat{U}}_{t}$$

$$\bm{\mu}_{t},\ \log(\bm{\delta}_{t}^{2})$$

$$\mathbf{r}_{t}=\bm{\mu}_{t}+\bm{\delta}_{t}\times\bm{\epsilon}_{t}$$

$$\mathbf{\hat{A}}_{t}=\sigma(\mathbf{r}_{t}\mathbf{r}_{t}^{\top})$$

$$\min\sum_{t=1}^{T}\underbrace{\text{KL}[q(\mathbf{r}_{t}|\mathbf{U}_{t},\mathbf{A}_{t})\,||\,p(\mathbf{r}_{t})]}_{\text{KL divergence between }q(.)\text{ and }p(.)}+\underbrace{||\mathbf{A}_{t}-\mathbf{\hat{A}}_{t}||^{2}}_{\text{Loss between }\mathbf{A}_{t}\text{ and }\mathbf{\hat{A}}_{t}}$$

$$\min_{\mathcal{W}}\underbrace{\frac{1}{n}\sum_{t=1}^{T}||\phi(\mathbf{r}_{t};\mathcal{W})-c||^{2}}_{\text{Average sum of weights, using squared error, for all normal training instances (from }T\text{ segments)}}+\underbrace{\frac{\lambda}{2}||\mathcal{W}||^{2}_{F}}_{\text{Regularization item}}$$

$$s(\mathbf{r}_{o})=||\phi(\mathbf{r}_{o};\mathcal{W}^{*})-c||^{2}$$

Code & Models

Repository: ehtesam3154/dgs_svdd (official)

Taxonomy

Topics: Anomaly Detection Techniques and Applications · Network Security and Intrusion Detection · Time Series Analysis and Forecasting

Full text

Department of Computer Science, University of Central Florida, Orlando, FL 32826, USA

Email: {azim.ehtesam,wangdongjie}@knights.ucf.edu, [email protected]

Deep Graph Stream SVDD: Anomaly Detection in Cyber-Physical Systems

Ehtesamul Azim

Dongjie Wang

Yanjie Fu

Abstract

Our work focuses on anomaly detection in cyber-physical systems. Prior literature has three limitations: (1) failing to capture long-delayed patterns in system anomalies; (2) ignoring dynamic changes in sensor connections; (3) the curse of high-dimensional data samples. These limit the detection performance and usefulness of existing works. To address them, we propose a new approach called deep graph stream support vector data description (SVDD) for anomaly detection. Specifically, we first use a transformer to preserve both short and long temporal patterns of monitoring data in temporal embeddings. Then we cluster these embeddings according to sensor type and utilize them to estimate the change in connectivity between various sensors to construct a new weighted graph. The temporal embeddings are mapped to the new graph as node attributes to form a weighted attributed graph. We input the graph into a variational graph auto-encoder model to learn the final spatio-temporal representation. Finally, we learn a hypersphere that encompasses normal embeddings and predict the system status by calculating the distances between the hypersphere and data samples. Extensive experiments validate the superiority of our model, which improves F1-score by 35.87% and AUC by 19.32%, while being 32 times faster than the best baseline at training and inference.

1 Introduction

Cyber-physical systems (CPS), such as smart grids, robotic systems, and water treatment networks, are widely deployed and play a significant role in the real world. Due to their complex dependencies and relationships, these systems are vulnerable to abnormal system events (e.g., cyberattacks, system exceptions), which can cause catastrophic failures and high costs. In 2021, hackers infiltrated Florida’s water treatment plants and boosted the sodium hydroxide level in the water supply to 100 times the normal level [3]. This may endanger the physical health of all Floridians. To maintain stable and safe CPS, considerable research effort has been devoted to effectively detecting anomalies in such systems using sensor monitoring data [19, 16].

Prior literature partially resolves this problem; however, three issues restrict its practicality and detection performance. Issue 1: long-delayed patterns. The malfunctioning effects of abnormal system events often do not manifest immediately. Kravchik et al. employed LSTM to predict future values based on past values and assessed the system status using prediction errors [5]. However, constrained by the capability of LSTM, such models struggle to capture long-delayed patterns, which may lead to suboptimal detection performance. How can we sufficiently capture such long-delayed patterns? Issue 2: dynamic changes in sensor-sensor influence. Besides long-delayed patterns, the malfunctioning effects may propagate to other sensors. Wang et al. captured such propagation patterns in water treatment networks by integrating the sensor-sensor connectivity graph for cyber-attack detection [17]. However, the sensor-sensor influence may shift as the time series changes due to system failures. Ignoring such dynamics may result in failing to identify propagation patterns and cause poor detection performance. How can we consider such dynamic sensor-sensor influence? Issue 3: high-dimensional data samples. Considering the sparsity of labeled data in CPS, existing works focus on unsupervised or semi-supervised settings [17, 10]. But traditional models like One-Class SVM are too shallow to fit high-dimensional data samples, and they incur substantial time costs for feature engineering and model learning. How can we improve the learning efficiency of anomaly detection in high-dimensional scenarios?

To address these issues, we aim to effectively capture spatial-temporal dynamics in high-dimensional sensor monitoring data. In CPS, sensors can be viewed as nodes, and their physical connections resemble a graph. Considering that the monitoring data of each sensor changes over time and that the monitoring data of various sensors influences one another, we model them using a graph stream structure. Based on that, we propose a new framework called Deep Graph Stream Support Vector Data Description (DGS-SVDD). Specifically, to capture long-delayed patterns, we first develop a temporal embedding module based on the transformer [15]. This module is used to extract these patterns from individual sensor monitoring data and embed them in low-dimensional vectors. Then, to comprehend dynamic changes in sensor-sensor connection, we estimate the influence between sensors using the previously learned temporal embedding of sensors. The estimated weight matrix is integrated with the sensor-sensor physically connected graph to produce an enhanced graph. We map the temporal embeddings to each node in the enhanced graph as its attributes to form a new attributed graph. After that, we input this graph into the variational graph auto-encoder (VGAE) [4] to preserve all information as final spatial-temporal embeddings. Moreover, to effectively detect anomalies in high-dimensional data, we adopt deep learning to learn the hypersphere that encompasses normal embeddings. The distances between the hypersphere and data samples are used as the criterion to predict the system status at each time segment. Finally, we conduct extensive experiments on a real-world dataset to validate the superiority of our work. In particular, compared to the best baseline model, DGS-SVDD improves F1-score by 35.87% and AUC by 19.32%, while accelerating model training and inference by 32 times.

2 Preliminaries

2.1 Definitions

Definition 1.

Graph Stream. A graph object $\mathcal{G}_{i}$ describes the monitoring values of the Cyber-Physical System at timestamp $i$ . It can be defined as $\mathcal{G}_{i}$ = ( $\mathcal{V}$ , $\mathcal{E}$ , $\mathbf{t}_{i}$ ) where $\mathcal{V}$ is the vertex (i.e., sensor) set with a size of $n$ ; $\mathcal{E}$ is the edge set with a size of $m$ , and each edge indicates the physical connectivity between any two sensors; $\mathbf{t}_{i}$ is a list that contains the monitoring value of $n$ sensors at the $i$ -th timestamp. A graph stream is a collection of graph objects over the temporal dimension. The graph stream with the length of $L_{x}$ at the $t$ -th time segment can be defined as $\mathbf{X}_{t}=[\mathcal{G}_{i},\mathcal{G}_{i+1},\cdots\mathcal{G}_{i+L_{x}-1}]$ .
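As a rough illustration of Definition 1, the sketch below slices a sensor log into fixed-length segments, each holding the $n\times L_{x}$ readings of one graph stream. The function name `make_segments`, the non-overlapping windowing, and the toy data are assumptions made for this example and are not taken from the paper; the edge set is omitted because it is shared by every graph object.

```python
import numpy as np

def make_segments(readings: np.ndarray, seg_len: int) -> np.ndarray:
    """Split a (num_timestamps, n_sensors) log into segments.

    Returns an array of shape (num_segments, n_sensors, seg_len), so that
    segments[t] plays the role of the temporal matrix of segment t.
    Non-overlapping windows are an assumption of this sketch.
    """
    num_timestamps, n_sensors = readings.shape
    num_segments = num_timestamps // seg_len
    trimmed = readings[: num_segments * seg_len]  # drop the incomplete tail
    # (num_segments, seg_len, n_sensors) -> (num_segments, n_sensors, seg_len)
    return trimmed.reshape(num_segments, seg_len, n_sensors).transpose(0, 2, 1)

# Toy log: 10 timestamps, 2 sensors; reading at (i, j) is 2*i + j.
readings = np.arange(20, dtype=float).reshape(10, 2)
segments = make_segments(readings, seg_len=4)
print(segments.shape)
print(segments[0])
```

Each `segments[t]` has one row per sensor and one column per timestamp, matching the $n\times L_{x}$ layout used for the temporal matrix later in the text.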

Definition 2.

Weighted Attributed Graph. The edge set $\mathcal{E}$ of each graph object in the graph stream $\mathbf{X}_{t}$ does not change over time, which is a binary edge set that reflects the physical connectivity between sensors. However, the correlations between different sensors may change as system failures happen. To capture such dynamics, we use $\mathcal{\tilde{G}}_{t}=(\mathcal{V},\mathcal{\tilde{E}}_{t},\mathbf{U}_{t})$ to denote the weighted attributed graph at the $t$ -th time segment. In the graph, $\mathcal{V}$ is the same as the graph object in the graph stream, which is the vertex (i.e., sensor) set with a size of $n$ ; $\mathcal{\tilde{E}}_{t}$ is the weighted edge set, in which each item indicates the weighted influence calculated from the temporal information between two sensors; $\mathbf{U}_{t}$ is the attributes of each vertex, which is also the temporal embedding of each node at the current time segment. Thus, $\mathcal{\tilde{G}}_{t}$ contains the spatial-temporal information of the system.

2.2 Problem Statement

Our goal is to detect anomalies in cyber-physical systems at each time segment. Formally, assuming that the graph stream data at the $t$ -th segment is $\mathbf{X}_{t}$ , the corresponding system status is $y_{t}$ . We aim to find an outlier detection function that learns the mapping relation between $\mathbf{X}_{t}$ and $y_{t}$ , denoted by $f(\mathbf{X}_{t})\rightarrow y_{t}$ . Here, $y_{t}$ is a binary constant whose value is 1 if the system status is abnormal and 0 otherwise.

3 Methodology

In this section, we give an overview of our framework and then describe each technical part in detail.

3.1 Framework Overview

Figure 1 shows an overview of our framework, named DGS-SVDD. Specifically, we start by feeding the DGS-SVDD model the graph stream data for one time segment. In the model, we first analyze the graph stream data by adopting the transformer-based temporal embedding module to extract temporal dependencies. Then, we use the learnt temporal embedding to estimate the dynamics of sensor-sensor influence and combine it with information about the topological structure of the graph stream data to generate weighted attributed graphs. We then input the graph into the variational graph autoencoder (VGAE)-based spatial embedding module to get the spatial-temporal embeddings. Finally, we estimate the boundary of the embeddings of normal data using deep learning and support vector data description (SVDD), and predict the system status by measuring how far away the embedding sample is from the boundary.

3.2 Embedding temporal patterns of the graph stream data

The temporal patterns of sensors may evolve over time if abnormal system events occur. We create a temporal embedding module that uses a transformer in a predictive manner to capture such patterns for accurate anomaly detection. To illustrate the following calculation process, we use the graph stream data $\mathbf{X}_{t}$ at the $t$ -th time segment as an example. We ignore the topological structure of the graph stream data at first during the temporal embedding learning process. Thus, we collect the time series data in $\mathbf{X}_{t}$ to form a temporal matrix $\mathbf{T}_{t}=[\mathbf{t}_{1},\mathbf{t}_{2},\cdots,\mathbf{t}_{L_{x}}]$ , such that $\mathbf{T}_{t}\in\mathbb{R}^{n\times L_{x}}$ , where $n$ is the number of sensors and $L_{x}$ is the length of the time segment.

The temporal embedding module consists of an encoder and a decoder. For the encoder part, we input $\mathbf{T}_{t}$ into it for learning enhanced temporal embedding $\mathbf{U}_{t}$ . Specifically, we first use the multi-head attention mechanism to calculate the attention matrices between $\mathbf{T}_{t}$ and itself for enhancing the temporal patterns among different sensors by information sharing. Considering that the calculation process in each head is the same, we take head1 as an example to illustrate. To obtain the self-attention matrix $\text{Attn}(\mathbf{T}_{t},\mathbf{T}_{t})$ , we input $\mathbf{T}_{t}$ into head1, which can be formulated as follows,

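$$\text{Attn}(\mathbf{T}_{t},\mathbf{T}_{t})=\text{softmax}\left(\frac{(\mathbf{T}_{t}\cdot\mathbf{W}^{Q}_{t})(\mathbf{T}_{t}\cdot\mathbf{W}^{K}_{t})^{\top}}{\sqrt{L_{x}}}\right)\cdot(\mathbf{T}_{t}\cdot\mathbf{W}^{V}_{t})$$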

where $\mathbf{W}^{K}_{t}\in\mathbb{R}^{L_{x}\times d}$, $\mathbf{W}^{Q}_{t}\in\mathbb{R}^{L_{x}\times d}$, and $\mathbf{W}^{V}_{t}\in\mathbb{R}^{L_{x}\times d}$ are the weight matrices for the “key”, “query”, and “value” embeddings, respectively; ${\sqrt{L_{x}}}$ is the scaling factor. Assuming that we have $h$ heads, we concatenate the learned attention matrices in order to capture the temporal patterns of monitoring data from different perspectives. The calculation process can be defined as follows:

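$$\mathbf{T}^{\prime}_{t}=\text{Concat}(\text{Attn}^{1}_{t},\text{Attn}^{2}_{t},\cdots,\text{Attn}^{h}_{t})\cdot\mathbf{W}^{O}_{t}$$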

where $\mathbf{W}^{O}_{t}\in\mathbb{R}^{hd\times d_{\text{model}}}$ is the weight matrix and $\mathbf{T}^{\prime}_{t}\in\mathbb{R}^{n\times d_{\text{model}}}$ . After that, we input $\mathbf{T}^{\prime}_{t}$ into a fully connected feed-forward network constructed by two linear layers to obtain the enhanced embedding $\mathbf{U}_{t}\in\mathbb{R}^{n\times d_{\text{model}}}$ . The calculation process can be defined as follows:

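$$\mathbf{U}_{t}=\mathbf{T}^{\prime}_{t}+\text{Relu}(\mathbf{T}^{\prime}_{t}\cdot\mathbf{W}^{1}_{t}+\mathbf{b}^{1}_{t})\cdot\mathbf{W}^{2}_{t}+\mathbf{b}^{2}_{t}$$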

where $\mathbf{W}_{t}^{1}$ and $\mathbf{W}_{t}^{2}$ are weight matrices of shape $\mathbb{R}^{d_{\text{model}}\times d_{\text{model}}}$; $\mathbf{b}^{1}_{t}$ and $\mathbf{b}^{2}_{t}$ are bias terms of shape $\mathbb{R}^{n\times d_{\text{model}}}$.

For the decoder part, we input the learned embedding $\mathbf{U}_{t}$ into a prediction layer to predict the monitoring value of the future time segment. The prediction process can be defined as follows:

$$\mathbf{\check{T}}_{t+1}=\mathbf{U}_{t}\cdot\mathbf{W}^{p}_{t}+\mathbf{b}^{p}_{t}$$

where $\mathbf{\check{T}}_{t+1}\in\mathbb{R}^{n\times L_{x}}$ is the prediction value of the next time segment; $\mathbf{W}_{t}^{p}\in\mathbb{R}^{d_{\text{model}}\times L_{x}}$ is the weight matrix and $\mathbf{b}_{t}^{p}\in\mathbb{R}^{n\times L_{x}}$ is the bias item. During the optimization process, we minimize the difference between the prediction $\mathbf{\check{T}}_{t+1}$ and the real monitoring value $\mathbf{T}_{t+1}$ . The optimization objective can be defined as follows

$$\min\sum_{t=1}^{L_{x}}||\mathbf{T}_{t+1}-\mathbf{\check{T}}_{t+1}||^{2}$$

When the model converges, we have preserved temporal patterns of monitoring data in the temporal embedding $\mathbf{U}_{t}$ .
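The encoder and prediction layer described in this subsection can be sketched in a few lines of NumPy. This is a minimal forward pass under stated assumptions: a single encoder block, random untrained weights, and toy sizes for $n$, $L_{x}$, $d$, $h$ and $d_{\text{model}}$. The function names are hypothetical and the code is not the authors' implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention_head(T, Wq, Wk, Wv):
    """One self-attention head over the n sensor rows of T (n x L_x)."""
    L_x = T.shape[1]
    scores = (T @ Wq) @ (T @ Wk).T / np.sqrt(L_x)  # (n, n) sensor-to-sensor scores
    return softmax(scores, axis=-1) @ (T @ Wv)     # (n, d)

def temporal_embedding(T, heads, Wo, W1, b1, W2, b2):
    """Multi-head attention, then a two-layer feed-forward net with a residual."""
    T_prime = np.concatenate([attention_head(T, *hd) for hd in heads], axis=1) @ Wo
    return T_prime + np.maximum(T_prime @ W1 + b1, 0.0) @ W2 + b2

# Toy sizes and random weights (assumptions for illustration only).
rng = np.random.default_rng(0)
n, L_x, d, h, d_model = 5, 8, 4, 2, 6
T = rng.normal(size=(n, L_x))
heads = [tuple(rng.normal(size=(L_x, d)) for _ in range(3)) for _ in range(h)]
Wo = rng.normal(size=(h * d, d_model))
W1 = rng.normal(size=(d_model, d_model))
W2 = rng.normal(size=(d_model, d_model))
b1, b2 = np.zeros((n, d_model)), np.zeros((n, d_model))

U = temporal_embedding(T, heads, Wo, W1, b1, W2, b2)  # (n, d_model)

# Prediction layer and squared-error loss against the next segment.
Wp, bp = rng.normal(size=(d_model, L_x)), np.zeros((n, L_x))
T_next_pred = U @ Wp + bp
T_next = rng.normal(size=(n, L_x))  # stand-in for the real next segment
loss = np.sum((T_next - T_next_pred) ** 2)
print(U.shape, T_next_pred.shape)
```

In practice the weights would be trained by minimizing `loss` over all segments with an automatic differentiation framework; the sketch only shows how the shapes fit together.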

3.3 Generating dynamic weighted attributed graphs

In CPS, different sensors connect with each other, which forms a sensor-sensor graph. As a result, the malfunctioning effects of abnormal system events may propagate over time following the graph structure. However, the sensor-sensor influence is not static and may vary as anomaly events change the monitoring data. To capture such dynamics, we build weighted attributed graphs using sensor-type information and learned temporal embeddings. For simplicity, we take the graph stream data of the $t$-th time segment $\mathbf{X}_{t}$ as an example to illustrate the following calculation process.

Specifically, the adjacency matrix of $\mathbf{X}_{t}$ is $\mathbf{A}\in\mathbb{R}^{n\times n}$ , which reflects the physical connectivity between different sensors. $\mathbf{A}[i,j]$ = 1 when sensor $i$ and $j$ are directly connected and $\mathbf{A}[i,j]$ = 0 otherwise. From section 3.2, we have obtained the temporal embedding $\mathbf{U}_{t}\in\mathbb{R}^{n\times d_{model}}$ , each row of which represents the temporal embedding for each sensor. We assume that the sensors belonging to the same type have similar changing patterns when confronted with system anomaly events. Thus, we want to capture this characteristic by integrating sensor type information into the adjacency matrix. We calculate the sensor type embedding by averaging the temporal embedding of sensors belonging to the type. After that, we construct a type-type similarity matrix $\mathbf{C}_{t}\in\mathbb{R}^{k\times k}$ by calculating the cosine similarity between each pair of sensor types, $k$ being the number of sensor types. Moreover, we construct the similarity matrix $\mathbf{\check{C}}_{t}\in\mathbb{R}^{n\times n}$ by mapping $\mathbf{C}_{t}$ to each element position of $\mathbf{A}$ . For instance, if sensor 1 belongs to type 2 and sensor 2 belongs to type 3, we update $\mathbf{\check{C}}_{t}[1,2]$ with $\mathbf{C}_{t}[2,3]$ . We then introduce the dynamic property to the adjacency matrix $\mathbf{A}$ through element-wise multiplication between $\mathbf{A}$ and $\mathbf{\check{C}}_{t}$ . Each temporal embedding of this time segment is mapped to the weighted graph as the node attributes according to sensor information. The obtained weighted attributed graph $\mathcal{\tilde{G}}_{t}$ contains all spatial-temporal information of CPS for the $t$ -th time segment. The topological influence of this graph may change over time.
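The graph generation steps above (type embeddings by averaging, type-type cosine similarity, expansion to sensor level, and element-wise multiplication with $\mathbf{A}$) can be sketched as follows. The function name `weighted_adjacency`, the integer encoding of sensor types, and the toy inputs are assumptions for illustration; the sketch also assumes every type has at least one sensor and a non-zero type embedding.

```python
import numpy as np

def weighted_adjacency(A, U, sensor_type, k):
    """Return the element-wise product of A and the sensor-level similarity matrix.

    A:           (n, n) binary physical adjacency matrix
    U:           (n, d_model) temporal embeddings, one row per sensor
    sensor_type: (n,) integer type id in [0, k) for each sensor (assumed encoding)
    """
    # Type embedding = mean temporal embedding of the sensors of that type.
    type_emb = np.stack([U[sensor_type == c].mean(axis=0) for c in range(k)])
    # Cosine similarity between every pair of types: the k x k matrix C_t.
    unit = type_emb / np.linalg.norm(type_emb, axis=1, keepdims=True)
    C = unit @ unit.T
    # Expand to sensor level: C_check[i, j] = C[type(i), type(j)].
    C_check = C[np.ix_(sensor_type, sensor_type)]
    return A * C_check

# Toy example: 4 sensors on a ring, two sensor types.
A = np.array([[0, 1, 0, 1],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [1, 0, 1, 0]], dtype=float)
U = np.array([[1.0, 0.0], [1.0, 0.0], [1.0, 1.0], [1.0, 1.0]])
sensor_type = np.array([0, 0, 1, 1])
A_w = weighted_adjacency(A, U, sensor_type, k=2)
print(np.round(A_w, 4))
```

Edges between sensors of the same type keep a weight close to 1, edges across types are scaled by the cosine similarity of the two type embeddings, and pairs that are not physically connected stay at 0.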

3.4 Representation learning for weighted attributed graph

To make the outlier detection model easily comprehend the information of $\mathcal{\tilde{G}}_{t}$, we develop a representation learning module based on variational graph autoencoder (VGAE). For simplicity, we use $\mathcal{\tilde{G}}_{t}$ to illustrate the representation learning process. For $\mathcal{\tilde{G}}_{t}=(\mathcal{V},\mathcal{\tilde{E}}_{t},\mathbf{U}_{t})$, the adjacency matrix is $\mathbf{\tilde{A}}_{t}$, made up of $\mathcal{V}$ and $\mathcal{\tilde{E}}_{t}$, and the feature matrix is $\mathbf{U}_{t}$.

Specifically, this module follows the encoder-decoder paradigm. The encoder includes two Graph Convolutional Network (GCN) layers. The first GCN layer takes $\mathbf{U}_{t}$ and $\mathbf{\tilde{A}}_{t}$ as inputs and outputs a lower dimensional feature matrix $\mathbf{\hat{U}}_{t}$. The calculation process can be represented as follows:

[TABLE]

where $\mathbf{\hat{D}}_{t}$ is the diagonal degree matrix of $\mathcal{\tilde{G}}_{t}$ and $\mathbf{\tilde{W}}_{0}$ is the weight matrix of the first GCN layer. The second GCN layer estimates the distribution of the graph embeddings. Assuming that such embeddings conform to the normal distribution $\mathcal{N}(\bm{\mu}_{t},\bm{\delta}_{t}^{2})$, we need to estimate the mean $\bm{\mu}_{t}$ and variance $\bm{\delta}_{t}^{2}$ of the distribution. Thus, the encoding process of the second GCN layer can be formulated as follows:

[TABLE]

where $\mathbf{\tilde{W}}_{1}$ is the weight matrix of the second GCN layer. Then, we use the reparameterization technique to mimic the sample operation to obtain the graph embedding $\mathbf{r}_{t}$ , which can be represented as follows:

$$\mathbf{r}_{t}=\bm{\mu}_{t}+\bm{\delta}_{t}\times\bm{\epsilon}_{t}$$

where $\bm{\epsilon}_{t}$ is the random variable vector, which is sampled from $\mathcal{N}(0,I)$ . Here, $\mathcal{N}(0,I)$ represents the high-dimensional standard normal distribution.
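The two GCN layers and the reparameterization step can be sketched as below. Several details are assumptions rather than statements from the paper: the symmetric normalization $\mathbf{\hat{D}}^{-1/2}\mathbf{\tilde{A}}\mathbf{\hat{D}}^{-1/2}$ with added self-loops (the usual GCN propagation rule), a ReLU after the first layer only, non-negative edge weights, and a single second-layer weight matrix whose output is split into $\bm{\mu}_{t}$ and $\log(\bm{\delta}_{t}^{2})$.

```python
import numpy as np

def normalize_adj(A_w):
    """Symmetric normalization D^-1/2 (A + I) D^-1/2.

    Adding self-loops and assuming non-negative weights are choices of this
    sketch; they keep every degree strictly positive.
    """
    A_hat = A_w + np.eye(A_w.shape[0])
    d_inv_sqrt = 1.0 / np.sqrt(A_hat.sum(axis=1))
    return A_hat * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]

def vgae_encode(A_w, U, W0, W1, rng):
    """Two GCN layers followed by the reparameterization r = mu + delta * eps."""
    A_norm = normalize_adj(A_w)
    U_hat = np.maximum(A_norm @ U @ W0, 0.0)   # first GCN layer (ReLU assumed)
    out = A_norm @ U_hat @ W1                  # second GCN layer
    mu, log_var = np.split(out, 2, axis=1)     # mean and log(delta^2)
    eps = rng.standard_normal(mu.shape)        # eps ~ N(0, I)
    r = mu + np.exp(0.5 * log_var) * eps       # delta = exp(0.5 * log(delta^2))
    return r, mu, log_var

# Toy graph: 3 sensors, 4-dimensional temporal embeddings, 2-dimensional output.
rng = np.random.default_rng(0)
A_w = np.array([[0.0, 1.0, 0.0],
                [1.0, 0.0, 0.5],
                [0.0, 0.5, 0.0]])
U = rng.normal(size=(3, 4))
W0 = rng.normal(size=(4, 3))
W1 = rng.normal(size=(3, 2 * 2))  # columns: [mu | log_var]
r, mu, log_var = vgae_encode(A_w, U, W0, W1, rng)
print(r.shape)
```

Predicting $\log(\bm{\delta}_{t}^{2})$ instead of $\bm{\delta}_{t}$ keeps the standard deviation positive without any constraint on the network output.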

The decoder part aims to reconstruct the adjacency matrix of the graph using $\mathbf{r}_{t}$ , which can be defined as follows:

$$\mathbf{\hat{A}}_{t}=\sigma(\mathbf{r}_{t}\mathbf{r}_{t}^{\top})$$

where $\mathbf{\hat{A}}_{t}$ is the reconstructed adjacency matrix and $\mathbf{r}_{t}\mathbf{r}_{t}^{\top}=||\mathbf{r}_{t}||\,||\mathbf{r}_{t}^{\top}||\cos\theta$.

During the optimization process, we aim to minimize two objectives: 1) the divergence between the prior embedding distribution $\mathcal{N}(0,I)$ and the estimated embedding distribution $\mathcal{N}(\bm{\mu}_{t},\bm{\delta}_{t}^{2})$; 2) the difference between the adjacency matrix $\mathbf{A}_{t}$ and the reconstructed adjacency matrix $\mathbf{\hat{A}}_{t}$. Thus, the optimization objective function is as follows:

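$$\min\sum_{t=1}^{T}\underbrace{\text{KL}[q(\mathbf{r}_{t}|\mathbf{U}_{t},\mathbf{A}_{t})\,||\,p(\mathbf{r}_{t})]}_{\text{KL divergence between }q(.)\text{ and }p(.)}+\underbrace{||\mathbf{A}_{t}-\mathbf{\hat{A}}_{t}||^{2}}_{\text{Loss between }\mathbf{A}_{t}\text{ and }\mathbf{\hat{A}}_{t}}$$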

where KL refers to the Kullback-Leibler divergence; $q(.|.)$ is the estimated embedding distribution and $p(.)$ is the prior embedding distribution. When the model converges, the graph embedding $\mathbf{r}_{t}\in\mathbb{R}^{n\times d_{\text{emb}}}$ contains spatiotemporal patterns of the monitoring data for the $t$ -th time segment.
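For a single time segment, the decoder and the two loss terms can be written compactly. The sketch assumes that $\sigma$ is the logistic sigmoid (as in the standard VGAE) and uses the closed-form KL divergence between a diagonal Gaussian and $\mathcal{N}(0,I)$; the function names and the toy inputs are hypothetical.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def vgae_loss(r, mu, log_var, A):
    """KL term plus squared reconstruction error for one time segment."""
    A_rec = sigmoid(r @ r.T)                       # inner-product decoder
    recon = np.sum((A - A_rec) ** 2)               # ||A - A_rec||^2
    # Closed-form KL[N(mu, diag(delta^2)) || N(0, I)], with log_var = log(delta^2).
    kl = -0.5 * np.sum(1.0 + log_var - mu ** 2 - np.exp(log_var))
    return kl + recon, A_rec

# Sanity check: zero embeddings with a unit-variance posterior.
n, d_emb = 2, 3
r = np.zeros((n, d_emb))
mu = np.zeros((n, d_emb))
log_var = np.zeros((n, d_emb))
A = np.zeros((n, n))
loss, A_rec = vgae_loss(r, mu, log_var, A)
print(loss)
```

In this sanity check the KL term vanishes because the posterior equals the prior, and every reconstructed entry is `sigmoid(0) = 0.5`, so the loss is the reconstruction error of four entries at 0.25 each.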

3.5 One-Class Detection with SVDD

Considering the sparsity issue of labeled anomaly data in CPS, anomaly detection is done in an unsupervised setting. Inspired by deep SVDD [14], we aim to learn a hypersphere that encircles most of the normal data, with data samples located beyond it being anomalous. Due to the complex nonlinear relations among the monitoring data, we use deep neural networks to approximate this hypersphere.

Specifically, through the above procedure, we collect the spatiotemporal embeddings of all time segments, denoted by $\left[\mathbf{r}_{1},\mathbf{r}_{2},\cdots,\mathbf{r}_{T}\right]$. We input them into multi-layer neural networks to estimate the non-linear hypersphere. Our goal is to minimize the volume of this data-enclosing hypersphere. The optimization objective can be defined as follows:

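$$\min_{\mathcal{W}}\underbrace{\frac{1}{n}\sum_{t=1}^{T}||\phi(\mathbf{r}_{t};\mathcal{W})-c||^{2}}_{\text{Average sum of weights, using squared error, for all normal training instances (from }T\text{ segments)}}+\underbrace{\frac{\lambda}{2}||\mathcal{W}||^{2}_{F}}_{\text{Regularization item}}$$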

where $\mathcal{W}$ is the set of weight matrices of the neural network layers; $\phi(\mathbf{r}_{t};\mathcal{W})$ maps $\mathbf{r}_{t}$ to the non-linear hidden representation space; $c$ is the predefined hypersphere center; $\lambda$ is the weight decay regularizer. The first term of the equation pulls the mapped embeddings of normal samples toward the center $c$, which shrinks the enclosing hypersphere. The second term reduces the complexity of $\mathcal{W}$, which avoids overfitting. As the model converges, we get the network parameters of the trained model, $\mathcal{W}^{*}$.
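To make the objective concrete, the sketch below evaluates it for a small network. It assumes a bias-free multi-layer perceptron applied to the flattened embedding of each segment and a center $c$ taken from the mean of an initial forward pass, both of which follow the usual Deep SVDD recipe rather than anything stated here; the average is taken over the training segments. All names are hypothetical.

```python
import numpy as np

def phi(R, weights):
    """Bias-free MLP applied to flattened embeddings (an assumption of this sketch).

    R: (num_segments, n, d_emb) spatiotemporal embeddings.
    """
    H = R.reshape(R.shape[0], -1)
    for W in weights[:-1]:
        H = np.maximum(H @ W, 0.0)  # hidden layers with ReLU
    return H @ weights[-1]          # linear output layer

def svdd_objective(R, weights, c, lam):
    """Mean squared distance to the center plus Frobenius-norm weight decay."""
    dist = np.sum((phi(R, weights) - c) ** 2, axis=1)
    reg = 0.5 * lam * sum(np.sum(W ** 2) for W in weights)
    return dist.mean() + reg

# Toy data: 10 normal segments, 3 sensors, 2-dimensional embeddings.
rng = np.random.default_rng(0)
R = rng.normal(size=(10, 3, 2))
weights = [rng.normal(size=(6, 4)), rng.normal(size=(4, 2))]
c = phi(R, weights).mean(axis=0)  # center fixed from an initial forward pass
print(svdd_objective(R, weights, c, lam=1e-3))
```

Training would minimize this value with respect to `weights` while keeping `c` fixed; omitting bias terms is a common precaution against the trivial solution in which every input maps to `c`.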

During the testing stage, given the embedding of a test sample $\mathbf{r}_{o}$ , we input it into the well-trained neural networks to get the new representation. Then, we calculate the anomaly score of the sample based on the distance between it and the center of the hypersphere. The process can be formulated as follows:

$$s(\mathbf{r}_{o})=||\phi(\mathbf{r}_{o};\mathcal{W}^{*})-c||^{2}$$

After that, we compare the score with our predefined threshold to assess the abnormal status of each time segment in CPS.
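The scoring and thresholding step can be sketched as follows. The text only says that the threshold is predefined, so choosing it as a quantile of the scores on normal training data is an assumption of this example, as are the function names and the toy network outputs.

```python
import numpy as np

def anomaly_scores(Z, c):
    """Squared distance to the center c.

    Z: (num_samples, d) outputs of the trained network for each sample.
    """
    return np.sum((Z - c) ** 2, axis=1)

def fit_threshold(train_scores, quantile=0.99):
    """Pick the threshold as a quantile of normal training scores (assumed rule)."""
    return np.quantile(train_scores, quantile)

def predict(scores, threshold):
    """Return 1 for abnormal segments and 0 for normal ones."""
    return (scores > threshold).astype(int)

# Toy network outputs around a center at the origin.
c = np.zeros(2)
Z_train = np.array([[0.1, 0.0], [0.0, 0.2], [0.1, 0.1]])
Z_test = np.array([[0.0, 0.1], [1.0, 1.0]])

threshold = fit_threshold(anomaly_scores(Z_train, c), quantile=1.0)
y_pred = predict(anomaly_scores(Z_test, c), threshold)
print(y_pred)
```

With `quantile=1.0` the threshold equals the largest training score, so only test samples farther from the center than every training sample are flagged; a lower quantile trades precision for recall.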

4 Experiments

We conduct extensive experiments to validate the efficacy and efficiency of our framework (DGS-SVDD) and the necessity of each technical component.

4.1 Experimental Settings

4.1.1 Data Description

We adopt the SWaT dataset [11], from the Singapore University of Technology and Design in our experiments. This dataset was collected from a water treatment testbed that contains 51 sensors and actuators. The collection process continued for 11 days. The system’s status was normal for the first 7 days and for the final 4 days, it was attacked by a cyber-attack model. The statistical information of the SWaT dataset is shown in Table 1. Our goal is to detect attack anomalies as precisely as feasible. We only use the normal data to train our model. After the training phase, we validate the capability of our model by detecting the status of the testing data that contains both normal and anomalous data.

4.1.2 Evaluation Metrics

We evaluate the model performance in terms of precision, recall, F1-score, and area under the receiver operating characteristic curve (AUC). We adopt the point-adjust approach to calculate these metrics. In particular, abnormal observations typically occur in succession and form anomaly segments, and an anomaly alert can be triggered at any point within a true anomaly window. Therefore, if at least one observation in an actual anomaly segment is detected as abnormal, we consider all time points of that segment to have been correctly detected.
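The point-adjust protocol described above can be sketched as follows: every ground-truth anomaly segment that contains at least one detection is marked as fully detected before precision, recall, and F1-score are computed. The function names and the toy labels are assumptions for illustration.

```python
import numpy as np

def point_adjust(y_true, y_pred):
    """Mark a whole true anomaly segment as detected if any point in it is flagged."""
    y_true = np.asarray(y_true).astype(bool)
    adjusted = np.asarray(y_pred).astype(bool).copy()
    n = len(y_true)
    i = 0
    while i < n:
        if y_true[i]:
            j = i
            while j < n and y_true[j]:  # find the end of this anomaly segment
                j += 1
            if adjusted[i:j].any():
                adjusted[i:j] = True
            i = j
        else:
            i += 1
    return adjusted.astype(int)

def precision_recall_f1(y_true, y_pred):
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    tp = int(np.sum((y_true == 1) & (y_pred == 1)))
    fp = int(np.sum((y_true == 0) & (y_pred == 1)))
    fn = int(np.sum((y_true == 1) & (y_pred == 0)))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Two true anomaly segments; only the first one is hit, plus one false alarm.
y_true = [0, 1, 1, 1, 0, 0, 1, 1, 0]
y_pred = [0, 0, 1, 0, 1, 0, 0, 0, 0]
adjusted = point_adjust(y_true, y_pred)
precision, recall, f1 = precision_recall_f1(y_true, adjusted)
print(adjusted.tolist(), precision, recall)
```

Because a single hit credits the whole segment, point-adjusted scores are generally higher than point-wise scores and should only be compared with results computed under the same protocol.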

4.1.3 Baseline Models

To make the comparison objective, we input the spatial-temporal embedding vector $\mathbf{r}_{t}$ into baseline models instead of the original data. There are seven baselines in our work:

KNN [12]: calculates the anomaly score of each sample according to the anomaly situation of its K nearest neighbors.

Isolation-Forest [8]: estimates the average path length (anomaly score) from the root node to the terminating node for isolating a data sample using a collection of trees.

LODA [13]: collects a list of weak anomaly detectors to produce a stronger one. LODA can process sequential data flow and is robust to missing data.

LOF [2]: measures the anomalous status of each sample based on its local density. If the density is low, the sample is abnormal; otherwise, it is normal.

ABOD [6]: is an angle-based outlier detector. If a data sample is located in the same direction of more than K data samples, it is an outlier; otherwise it is normal data.

OC-SVM [9]: finds a hyperplane to divide normal and abnormal data through kernel functions.

GANomaly [1]: utilizes an encoder-decoder-encoder architecture. It evaluates the anomaly status of each sample by calculating the difference between the output embeddings of the two encoders.

4.2 Experimental Results

4.2.1 Overall Performance

Table 2 shows experimental results on the SWaT dataset, with the best scores highlighted in bold. As can be seen, DGS-SVDD outperforms the baseline models on every metric except recall. Compared with the best baseline on each metric, DGS-SVDD improves precision by about 19 percentage points (over LODA), F1-score by about 36 points (over GANomaly), and AUC by about 8 points (over Isolation-Forest). This observation validates that DGS-SVDD detects anomalies accurately. The underlying driver for the success of our model is that DGS-SVDD can capture long-delayed temporal patterns and dynamic sensor-sensor influences in CPS. Another interesting observation is that the detection performance of distance-based or angle-based outlier detectors is poor. A possible reason is that these geometrical measurements are vulnerable to high-dimensional data samples.
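The gaps quoted in this comparison can be re-derived from Table 2. The short script below copies the table values and reports, for each metric, the strongest baseline and the difference in percentage points; the variable names are arbitrary and the script is only an arithmetic check, not part of the original experiments.

```python
# Rows copied from Table 2: (precision, recall, F1-score, AUC), all in percent.
baselines = {
    "OC-SVM": (34.11, 68.23, 45.48, 75.0),
    "Isolation-Forest": (35.42, 81.67, 49.42, 80.0),
    "LOF": (15.81, 93.88, 27.06, 63.0),
    "KNN": (15.24, 96.77, 26.37, 61.0),
    "ABOD": (14.2, 97.93, 24.81, 58.0),
    "GANomaly": (42.12, 67.87, 51.98, 68.64),
    "LODA": (75.25, 38.13, 50.61, 67.1),
}
ours = (94.17, 82.33, 87.85, 87.96)  # DGS-SVDD row
metrics = ("precision", "recall", "f1", "auc")

# Best baseline per metric and the gap to DGS-SVDD in percentage points.
gains = {}
for idx, name in enumerate(metrics):
    best = max(baselines, key=lambda m: baselines[m][idx])
    gains[name] = (best, round(ours[idx] - baselines[best][idx], 2))

# F1 recomputed from the reported precision and recall.
f1_check = 2 * ours[0] * ours[1] / (ours[0] + ours[1])
# AUC gap to the baseline with the highest F1-score (GANomaly).
auc_gain_vs_ganomaly = round(ours[3] - baselines["GANomaly"][3], 2)

for name, (best, gain) in gains.items():
    print(f"{name}: best baseline {best}, gap {gain:+.2f} points")
print(f"F1 from precision and recall: {f1_check:.2f}")
print(f"AUC gap to GANomaly: {auc_gain_vs_ganomaly:+.2f} points")
```

The 35.87 and 19.32 figures quoted in the abstract correspond to the F1-score and AUC gaps to GANomaly, the baseline with the highest F1-score, whereas the AUC gap to the best AUC baseline (Isolation-Forest) is smaller.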

4.2.2 Ablation Study

To study the individual contribution of each component of DGS-SVDD, we perform ablation studies, the findings of which are summarized in Table 3, where bold indicates the best score. We build four variations of the DGS-SVDD model: 1) We feed unprocessed raw data into SVDD; 2) We only capture temporal patterns; 3) We capture the dynamics of sensor-sensor impact and spatial patterns in CPS; 4) We capture spatial-temporal patterns in CPS but discard the dynamics of sensor-sensor influence. We can find that DGS-SVDD outperforms its variants by a significant margin. The observation validates that each technical component of our work is indispensable. Another interesting observation is that removing the temporal embedding module dramatically degrades the detection performance, which indicates that the temporal embedding module is the most important component. The comparison with the fourth variant shows that capturing the dynamics of sensor-sensor influence further improves model performance.

4.2.3 Robustness Check and Parameter Sensitivity

Figure 2 shows the experimental results for robustness check and parameter sensitivity analysis. To check the model’s robustness, we train DGS-SVDD on different percentages of the training data, from 10% to 100%. Figure 2(a) shows that DGS-SVDD is stable across different amounts of training data, although it achieves the best performance when trained on 50% of the training data. In addition, we vary the sliding window length and the dimension of the final spatial-temporal embedding in order to check their impact. From Figures 2(b) and 2(c), we can find that DGS-SVDD is barely sensitive to the sliding window length and the dimension of the spatiotemporal embeddings. This observation validates that DGS-SVDD is robust to these parameters. A possible reason is that our representation learning module has sufficiently captured spatial-temporal patterns of monitoring data for anomaly detection.

4.2.4 Study of Time Cost

We conduct six-fold cross-validation to evaluate the time costs of different models. Figure 3 illustrates the comparison results. We can find that the training time of DGS-SVDD is competitive with that of simple models like OC-SVM or LOF, while it outperforms them by a large margin, as seen in Table 2. This shows that DGS-SVDD effectively learns the representation of each time segment of the graph stream data. Another important observation is that the testing time of DGS-SVDD is comparable to that of the simpler baselines. A potential reason is that the network parameter $\mathcal{W^{*}}$, as discussed in Section 3.5, completely characterizes our one-class classifier. This allows fast testing by simply evaluating the network $\phi$ with learnt parameters $\mathcal{W^{*}}$.

5 Related Work

Anomaly Detection in Cyber-Physical Systems. Numerous existing studies have exploited temporal and spatial relationships in data streams from CPS to detect anomalous points [5]. For instance, [5, 7] adopt a convolutional layer as the first layer of a Convolutional Neural Network to obtain correlations of multiple sensors in a sliding time window. Further, the extracted features are fed to subsequent layers to generate output scores. [7] proposed a GAN-based framework to capture the spatial-temporal correlation in multidimensional data. Both generator and discriminator are utilized to detect anomalies by reconstruction and discrimination errors.

Outlier Detection with Deep SVDD. After being introduced in [14], deep SVDD and its many variants have been used for deep outlier detection. [18] designed deep structure preservation SVDD by integrating deep feature extraction with data structure preservation. [20] proposed Deep SVDD-VAE, where a VAE is used to reconstruct the input sequences while a spherical discriminative boundary is simultaneously learned on the latent representations, based on SVDD. Although these models have been successfully applied to detect anomalies in the domain of computer vision, this domain lacks the temporal and spatial dependencies prevalent in graph stream data generated from CPS.

6 Conclusion

We propose DGS-SVDD, a structured anomaly detection framework for cyber-physical systems using graph stream data. To this end, we integrate spatiotemporal patterns, modeling of dynamic characteristics, deep representation learning, and one-class detection with SVDD. A transformer-based encoder-decoder architecture is used to preserve the temporal dependencies within a time segment. The temporal embedding and the predefined connectivity of the CPS are then used to generate weighted attributed graphs, from which the fused spatiotemporal embedding is learned by a spatial embedding module. A deep neural network integrated with one-class SVDD is then used to group the normal data points in a hypersphere based on the learnt representations. Finally, we conduct extensive experiments on the SWaT dataset to illustrate the superiority of our method, which delivers 35.87% and 19.32% improvements in F1-score and AUC, respectively. For future work, we wish to integrate a connectivity learning policy into the transformer so that it not only learns the temporal representation but also models the dynamic influence among sensors. The code can be publicly accessed at https://github.com/ehtesam3154/dgs_svdd.

Bibliography

[1] Akcay, S., Atapour-Abarghouei, A., Breckon, T.P.: Ganomaly: Semi-supervised anomaly detection via adversarial training. In: Asian conference on computer vision. pp. 622–637. Springer (2018)
[2] Breunig, M.M., Kriegel, H.P., Ng, R.T., Sander, J.: Lof: identifying density-based local outliers. In: Proceedings of the 2000 ACM SIGMOD international conference on Management of data. pp. 93–104 (2000)
[3] Bergal, J.: Florida hack exposes danger to water systems (2021), https://www.pewtrusts.org/en/research-and-analysis/blogs/stateline/2021/03/10/florida-hack-exposes-danger-to-water-systems
[4] Kipf, T.N., Welling, M.: Variational graph auto-encoders. arXiv preprint arXiv:1611.07308 (2016)
[5] Kravchik, M., Shabtai, A.: Detecting cyber attacks in industrial control systems using convolutional neural networks. In: Proceedings of the 2018 workshop on cyber-physical systems security and privacy. pp. 72–83 (2018)
[6] Kriegel, H.P., Schubert, M., Zimek, A.: Angle-based outlier detection in high-dimensional data. In: Proceedings of the 14th ACM SIGKDD international conference on Knowledge discovery and data mining. pp. 444–452 (2008)
[7] Li, D., Chen, D., Jin, B., Shi, L., Goh, J., Ng, S.K.: Mad-gan: Multivariate anomaly detection for time series data with generative adversarial networks. In: International conference on artificial neural networks. pp. 703–716. Springer (2019)
[8] Liu, F.T., Ting, K.M., Zhou, Z.H.: Isolation forest. In: 2008 eighth IEEE international conference on data mining. pp. 413–422. IEEE (2008)