Evaluation of Machine Learning-based Anomaly Detection Algorithms on an Industrial Modbus/TCP Data Set

Simon Duque Anton, Suneetha Kanoor, Daniel Fraunholz, and Hans Dieter Schotten

arXiv:1905.11757·cs.CR·May 29, 2019


TL;DR

This paper evaluates machine learning algorithms for detecting malicious traffic in industrial Modbus/TCP data. On a synthetic data set, SVM and Random Forest prove effective, while k-nearest neighbour and k-means clustering do not perform satisfactorily.

Contribution

It provides an empirical comparison of several ML-based anomaly detection algorithms on industrial Modbus/TCP data, emphasizing the suitability of supervised methods.

Findings

01 SVM and Random Forest perform well on synthetic data

02 k-means clustering does not perform satisfactorily

03 Supervised learning enables effective anomaly detection


Tables

Table 1. An Overview of the Most Used Modbus Versions

Version	Description
Modbus RTU	Serial communication via RS-232 connector to connect PLCs with RTUs
Modbus ASCII	Same connector as above, but instead of binary coding, ASCII-encoded characters are used
Modbus TCP/IP	Communication based on the TCP/IP protocol stack
Modbus over TCP/IP	Same as above, but including a checksum in the payload, in addition to error correction mechanisms provided by layers 1 to 4 of the OSI model

Table 2. The Basic Features Considered in this Work

Feature	Description
frame.number	Sequential number of packet
frame.time	Arrival time of packet with millisecond accuracy
eth.src	Ethernet source address (MAC)
eth.dst	Ethernet destination address (MAC)
ip.src	IP source address
ip.dst	IP destination address
ip.proto	Transport Layer protocol
frame.len	Length of IP-packet in bytes
tcp.flags	Control bits of TCP-packet
tcp.srcport	Port number of source in TCP connection
tcp.dstport	Port number of destination TCP connection
udp.srcport	Port number of source UDP connection
udp.dstport	Port number of destination UDP connection
tcp.analysis.lost_segment	A label set if there is a lost segment

Table 3. The Derived Features Considered in this Work

Feature	Description
frame.time.min	Time of frame in minutes
packets_per_minute	Number of packets per minute
frame.time.sec	Time of frame in seconds
packets_per_sec	Number of packets per second
packets_per_ip.dst	Number of packets per unique destination-IP
stats.packets_per_proto	Number of packets per protocol
max_packets	Maximum number of packets per second
min_packets	Minimum number of packets per second
mean_packets	Mean number of packets per second

Table 4. Features Capable of Perfectly Splitting DS1

Feature	Norm. Values [Packets/s]	Anom. Values [Packets/s]
packets_per_sec	162, 164	3-9, 41
max_packets	72	208, 235, 401
mean_packets	58.28, 58.48	105-150

Table 5. Predictions and Correct Labels of DS1 by Using SVM

Label\Prediction	Normal	Anomalous
Normal	1 097	0
Anomalous	0	22

Table 6. Accuracy and F1-score of SVM

Dataset	Accuracy	F1-score
DS1	1.0	1.0
DS2	1.0	1.0
DS3	0.999936	0.999968

Table 7. Predictions and Correct Labels of DS2 by Using SVM

Label\Prediction	Normal	Anomalous
Normal	3 364	0
Anomalous	0	3

Table 8. Predictions and Correct Labels of DS3 by Using SVM

Label\Prediction	Normal	Anomalous
Normal	109 702	4
Anomalous	3	63

Table 9. Predictions and Correct Labels of DS1 by Using Random Forest

Label\Prediction	Normal	Anomalous
Normal	973	0
Anomalous	0	23

Table 10. Accuracy and F1-score of Random Forest

Dataset	Accuracy	F1-score
DS1	1.0	1.0
DS2	0.999701	0.999851
DS3	0.999973	0.999986

Table 11. Predictions and Correct Labels of DS2 by Using Random Forest

Label\Prediction	Normal	Anomalous
Normal	3 347	1
Anomalous	0	2

Table 12. Predictions and Correct Labels of DS3 by Using Random Forest

Label\Prediction	Normal	Anomalous
Normal	109 710	3
Anomalous	0	59

Table 13. Predictions and Correct Labels of DS1 by Using k-nearest Neighbour

Label\Prediction	Normal	Anomalous
Normal	678	0
Anomalous	2	9

Table 14. Accuracy and F1-score of k-nearest Neighbour

Dataset	Accuracy	F1-score
DS1	0.997097	0.998527
DS2	0.999118	0.999559
DS3	0.999412	0.999706

Table 15. Predictions and Correct Labels of DS2 by Using k-nearest Neighbour

Label\Prediction	Normal	Anomalous
Normal	2 265	0
Anomalous	2	0

Table 16. Predictions and Correct Labels of DS3 by Using k-nearest Neighbour

Label\Prediction	Normal	Anomalous
Normal	73 140	0
Anomalous	43	0

Table 17. Predictions and Clusters of DS1 by Using k-means Clustering

Cluster\Label	Normal	Anomalous
Cluster 1	0	12
Cluster 2	3 244	63

Table 18. Accuracy and F1-score of k-means Clustering

Dataset	Accuracy	F1-score
DS1	0.981018	0.990383
DS2	0.556242	0.714853
DS3	0.633624	0.775728

Table 19. Predictions and Clusters of DS2 by Using k-means Clustering

Cluster\Label	Normal	Anomalous
Cluster 1	4 945	0
Cluster 2	6 211	10

Table 20. Predictions and Clusters of DS3 by Using k-means Clustering

Cluster\Label	Normal	Anomalous
Cluster 1	231 847	206
Cluster 2	133 853	0

Equations

$$F_{1} = 2 \cdot \frac{\mathrm{precision} \cdot \mathrm{recall}}{\mathrm{precision} + \mathrm{recall}}$$

$$\mathrm{precision} = \frac{t_{p}}{t_{p} + f_{p}}$$

$$\mathrm{recall} = \frac{t_{p}}{t_{p} + f_{n}}$$

$$\mathrm{accuracy} = \frac{t_{p} + t_{n}}{t_{p} + f_{p} + t_{n} + f_{n}}$$

$$(x_{i}, y_{i}), \quad i = 1, \dots, m, \quad y_{i} \in \{-1, 1\}$$

$$y_{i} = \mathrm{sgn}(\langle w, x_{i} \rangle - b)$$

$$D = \sum_{i=1}^{n} (x_{i} - w_{i})^{2}$$

$$E = \sum_{j=1}^{k} \sum_{i_{l} \in C_{j}} | i_{l} - w_{j} |^{2}, \quad j \in \{1, \dots, k\}, \quad l \in \{1, \dots, n\}$$


Intelligent Networks Research Group

German Research Center for Artificial Intelligence, 67663 Kaiserslautern, Germany

simon.duque_anton, suneetha.kanoor, daniel.fraunholz, [email protected]

(2018)

Abstract.

In the context of the Industrial Internet of Things, communication technology, originally used in home and office environments, is introduced into industrial applications. Commercial off-the-shelf products, as well as unified and well-established communication protocols make this technology easy to integrate and use. Furthermore, productivity is increased in comparison to classic industrial control by making systems easier to manage, set up and configure. Unfortunately, most attack surfaces of home and office environments are introduced into industrial applications as well, which usually have very few security mechanisms in place. Over the last years, several technologies tackling that issue have been researched. In this work, machine learning-based anomaly detection algorithms are employed to find malicious traffic in a synthetically generated data set of Modbus/TCP communication of a fictitious industrial scenario. The applied algorithms are Support Vector Machine (SVM), Random Forest, k-nearest neighbour and k-means clustering. Due to the synthetic data set, supervised learning is possible. Support Vector Machine and Random Forest perform well with different data sets, while k-nearest neighbour and k-means clustering do not perform satisfactorily.

This is a preprint of a work published in the Proceedings of the 13th International Conference on Availability, Reliability and Security (ARES 2018). Please cite as follows:

S. D. Duque Anton, S. Kanoor, D. Fraunholz, and H. D. Schotten: “Evaluation of Machine Learning-based Anomaly Detection Algorithms on an Industrial Modbus/TCP Data Set.” In: Proceedings of the 13th International Conference on Availability, Reliability and Security (ARES 2018), ACM, 2018, pp. 41:1–41:9.

Keywords: Modbus, Machine Learning, Anomaly Detection, Industrial, IT-Security

Conference: ARES 2018: International Conference on Availability, Reliability and Security, August 27–30, 2018, Hamburg, Germany. DOI: 10.1145/3230833.3232818. ISBN: 978-1-4503-6448-5/18/08. CCS concepts: Security and privacy → Intrusion detection systems.

1. Introduction

Since the appearance of industrial control in the 1970s, industry has been looking for ways to improve production. At first, hardwired sensors and actuators were employed, followed by so-called Supervisory Control And Data Acquisition (SCADA) systems in the 1980s and 1990s. With the emergence of computational and communication technology, the automation pyramid, as depicted in figure 1, arose.

It categorised the industrial networks according to their function: Resource and production planning was done on the two topmost layers, four and five. SCADA systems are located on the third layer. Programmable Logic Controllers (PLCs) can be found on the second layer, while sensors and actuators are placed on the first layer. This has been possible mainly due to Commercial Off-The-Shelf (COTS) products that are interchangeable and configurable. Many of the newly introduced network protocols are based either on the Ethernet protocol on layer 2 of the Open Systems Interconnection (OSI) model, such as EtherCAT (EtherCAT Technology Group, 1991) and Modbus (MODICON Inc., 1996; Modbus-IDA, 2006), or even on the Transmission Control Protocol (TCP)/Internet Protocol (IP) stack on layers 3 and 4 of said OSI model. An abundance of proprietary and open-source communication protocols, tailored to the needs of industrial applications, was developed. Prominent TCP/IP-based examples are Modbus/TCP (Modbus-IDA, 2006; Modbus, 2012), ProfiNET (PROFIBUS, 2017) and OPC Unified Architecture (UA) (OPC Foundation, 2017). After integrating communication protocols based on the OSI model and introducing reprogrammable industrial computers, so-called PLCs, industrial and automation networks are being opened to insecure networks. The arising Industrial Internet of Things (IIoT) requires interconnectivity of networks, accessibility and availability of resources, also outside of trust boundaries. A common assumption of early SCADA implementations was that networks were physically separated from public networks (Igure et al., 2006). Breaking the trust boundaries creates a multitude of novel threats and attack vectors (Igure et al., 2006; Zhu et al., 2011; Duque Anton et al., 2017a). Attackers have identified industrial networks as valuable targets. Widespread botnets provide easy opportunities to probe and capture unprotected Internet of Things (IoT) devices (Fraunholz et al., 2017a).
This novel threat landscape necessitates new approaches for intrusion detection and attack prevention. Machine learning technologies and methods of artificial intelligence have proven capable of solving problems that were hard to solve otherwise.

The remainder of this work is structured as follows: Related work in artificial intelligence and machine learning for intrusion detection in industrial networks is presented in section 2. After that, the employed data set is presented in section 3. First, Modbus/TCP is presented as a protocol. Second, the data set is evaluated, and third, features are derived. The algorithms for anomaly detection and their application to the data set are described in section 4. Results are discussed in section 5. Finally, a conclusion is drawn in section 6.

2. Related Work

Intrusion detection in office Information Technology (IT) environments is a well-researched and well-established area. Tools, such as Bro (Project, [n. d.]) and Snort (Snort, [n. d.]), are commonly used and maintained by a widespread user base. They allow for easy integration of custom rules and make efficient firewalls and systems for detecting attackers and intrusions. The same holds for data sets of host and network traffic. There are numerous data sets to train and test Intrusion Detection System (IDS) appliances and machine learning methods, one of the most famous being the ‘99 KDD Cup data set (University of California, Irvine (UCI), 1999). As this work focuses on industrial applications of anomaly detection, IDSs for office applications, as well as data sets with home- and office-based network traffic are not considered here.

One of the most important aspects in the development of novel intrusion detection approaches is a sound data set to test the system and verify the findings. As mentioned, there is an abundance of such data sets for home and office network traffic, while data sets of industrial network traffic are still relatively rare. One recent data set is presented by Lemay and Fernandez (Lemay and Fernandez, 2016). They propose an architecture for a traffic simulation environment, based on commonly used Modbus/TCP tools and sandbox environments. They also published a data set into which malicious traffic has been introduced. This data set was analysed in this work in order to evaluate the effectiveness of machine learning for industrial intrusion detection. Other than that, Wang et al. propose a simulation environment for SCADA security analysis (Wang et al., 2010). Their framework allows setting up OPC UA components, including sensors and actuators, in a simulation in order to test and verify security solutions. Furthermore, Siaterlis et al. propose a testbed for the effects of cyber attacks on Cyber-Physical Systems (CPSs) (Siaterlis et al., 2013). The testbed is based on an Emulab (The University of Utah, [n. d.]) emulation environment and is capable of monitoring the impact of an attack on a production system. Genge et al. follow a similar approach (Genge et al., 2012). They present an adaptable testbed that is capable of emulating different industrial production scenarios. These scenarios can then be attacked with real malware and the effects can be evaluated. Their testbed is based on Emulab (The University of Utah, [n. d.]) as well, with a real-time connection simulator, Simulink (Mathworks, [n. d.]). Seidl designed VirtuaPlant, a Python (Foundation, [n. d.]) environment that simulates user-defined industrial behaviour (Seidl, [n. d.]). This simulation can then be exposed to attacks and malware.

In addition to simulation environments, there are also data sets available in order to train intrusion and anomaly detection algorithms. Morris and Gao present several files containing sets of industrial control system traffic (Morris and Gao, 2014; Morris, Thomas, [n. d.]). As malicious traffic is introduced into these data, algorithms can be trained to detect traffic of malware.

Apart from the issue of obtaining sound and plausible data, there is an abundance of algorithms for anomaly detection that could be employed in order to detect intrusions. Intrusion detection as a concept, including a formal model, was originally presented by Denning in 1987 in the context of the growing influence of computer systems and networks (Denning, 1987). The applications of anomaly detection mechanisms for network intrusion detection are discussed in several surveys (Igure et al., 2009; Bhuyan et al., 2014); Yang et al. give a brief introduction to these techniques for the domain of SCADA systems (Yang et al., 2006). Meshram and Haas published a roadmap of machine learning-based anomaly detection in industrial networks, containing a simulation environment, as well as a semantic description of content (Meshram and Haas, 2016). Kleinmann and Wool present a model of the Siemens S7 protocol for intrusion detection and forensics (Kleinmann and Wool, 2014). Critical infrastructures and industrial environments are considered in the work of Hadziosmanovic et al. (Hadziosmanovic et al., 2011), which presents a framework that detects malicious and undesired actions. Deriving features that can be used to distinguish valid from malicious traffic is the first step in applying an intrusion detection algorithm. Mantere et al. look into the derivation of features from IP traffic in an industrial environment (Mantere et al., 2013). The deterministic properties of industrial control systems, as well as the usability of this characteristic for anomaly detection in an industrial environment, are researched by Hadeli et al. (Hadeli et al., 2009).

3. Dissecting the Data set

In this section, the data set is described. First, a general introduction to the Modbus protocol is given in subsection 3.1. After that, the data set used in the course of this work, presented by Lemay and Fernandez (Lemay and Fernandez, 2016), is described in subsection 3.2. Finally, the features that have been extracted and derived are presented in subsection 3.3.

3.1. An Introduction to Modbus

Modbus is a communication protocol for serial communication among PLCs and Remote Terminal Units (RTUs). It was developed in 1979 by Schneider Electric, formerly known as Modicon (Schneider Electric, 2017), and has become a de facto standard for industrial communication (Drury, 2009). There are several versions of Modbus available; the most noteworthy are listed in table 1.

In Modbus/TCP, communication is encapsulated in TCP/IP packets, as shown in figures 2 and 3. They are transmitted via Ethernet, which follows the structure depicted in figure 4. All dark gray fields were employed as features in this work.

Most Modbus/TCP messages contain commands for reading and writing coils or registers. In analogy to analogue control automation, one-bit registers are called coils, while multi-bit registers are simply called registers. Modbus slaves poll their communication and either set the received data as new input for their registers or load information into a register for a master to read. They then respond to the request.
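To make the encapsulation concrete, the following sketch assembles a Modbus/TCP request that reads coils. It follows the publicly documented frame layout (a 7-byte MBAP header followed by function code and data). The helper name `build_read_coils_request` and the example values are illustrative assumptions and are not taken from the data set.

```python
import struct

READ_COILS = 0x01  # Modbus function code for "Read Coils"


def build_read_coils_request(transaction_id, unit_id, start_address, quantity):
    """Assemble a Modbus/TCP "Read Coils" request (hypothetical helper).

    MBAP header: transaction id (2 bytes), protocol id (2 bytes, 0 for Modbus),
    length of the remaining bytes (2 bytes), unit id (1 byte).
    PDU: function code (1 byte), start address (2 bytes), quantity (2 bytes).
    All fields are big-endian.
    """
    pdu = struct.pack(">BHH", READ_COILS, start_address, quantity)
    length = 1 + len(pdu)  # unit id plus PDU
    mbap = struct.pack(">HHHB", transaction_id, 0, length, unit_id)
    return mbap + pdu


# Example values chosen for illustration only.
request = build_read_coils_request(
    transaction_id=1, unit_id=1, start_address=0, quantity=8
)
```

The resulting byte string is what travels as TCP payload; the features used in this work are taken from the surrounding Ethernet, IP and TCP headers rather than from these Modbus fields.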

3.2. Description of the Data Set

Lemay and Fernandez simulated a controller network consisting of a number of Master Terminal Units (MTUs) and a number of controllers. The controllers control a simulated physical system with a 12 000 Volt power source, as well as main- and sub-branch cut-off breakers. In this scenario, different data collections have been performed. An exhaustive description can be found in their work (Lemay and Fernandez, 2016). Regular polling and manual operation are part of these data sets, as these actions occur in this form in productive systems. After collection, malicious activities, generated by state-of-the-art penetration testing tools such as Metasploit (Rapid7, [n. d.]), are introduced.

The nature of these malicious activities is one of the main drawbacks of the employed data set: According to Morris and Gao, there are several different groups of Modbus-based attacks (Morris and Gao, 2014), none of which is introduced into this data set. Instead, attacks that are also common in home and office-based penetration testing are introduced. This does not mimic the wide range of attacks that could be employed against industrial applications. It does, however, mimic the timing behaviour and the rate of packets per time unit, which is a good distinguishing factor for attacks.

In this work, three data sets, henceforth called DS1 to DS3, were used for testing the algorithms:

• DS1: Moving_two_files_Modbus_6RTU: Regular traffic between one MTU and six RTUs during a three-minute interval, 3 319 packets captured, contains 75 malicious instances

• DS2: Send_a_fake_command_Modbus_6RTU_with_operate: Regular traffic between one MTU and six RTUs during a 10-minute interval, 11 166 packets captured, contains 10 malicious instances

• DS3: A combination of eight data sets, four of which do and four of which do not contain malicious activity, 365 906 packets overall, contains 206 malicious instances
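The class imbalance implied by these counts can be made explicit with a few lines of arithmetic. The sketch below uses only the packet and instance counts listed above; the dictionary and function names are illustrative.

```python
# (captured packets, malicious instances) as stated for DS1 to DS3
DATA_SETS = {
    "DS1": (3319, 75),
    "DS2": (11166, 10),
    "DS3": (365906, 206),
}


def malicious_share(packets, malicious):
    """Fraction of instances in a data set that are labelled malicious."""
    return malicious / packets


shares = {name: malicious_share(*counts) for name, counts in DATA_SETS.items()}

for name, share in shares.items():
    print(f"{name}: {share:.4%} of all instances are malicious")
```

In DS2 and DS3, fewer than one in a thousand instances is malicious, which matters for interpreting accuracy values later on.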

DS3 addresses a common problem in real-world intrusion detection with machine learning: In order to train the algorithm, a normal condition of the system has to be derived, deviations from which have to be recognized as anomalies. A common practice is to monitor the behaviour of the productive system for a certain time under the assumption that it does not contain malicious traffic. There are two issues, however. First, in productive systems, one can never be sure that there is no malicious traffic; it is just highly unlikely. Second, the recognition of anomalies based on normal behaviour can be difficult, as the user usually does not know the characteristics that have the most impact on the algorithm. The effect of these limitations is evaluated in this work by mixing different kinds of traffic, including traces with no malicious content. Due to the synthetic nature of this traffic, one can be sure that these traces are non-malicious.

3.3. Feature Extraction

The first step in anomaly detection and data mining is the determination of relevant features. These features can be used to describe the data instances with respect to a given goal; the goal in the given case is to determine instances that differ significantly from the common, productive behaviour. Hence, features that are suited to describe the normal behaviour of the system are needed. In general, there are two different kinds of features: basic and derived features. Basic features are already present within the data. In the given case, they are contained within the protocol headers. Network traffic, for example, contains source and destination addresses, lengths, time stamps and other features. An exhaustive list containing the 14 basic features of this data set can be found in table 2. Two features are derived from the Ethernet header as shown in figure 4, four features each are obtained from the TCP and IP headers, as shown in figures 2 and 3, two features from the User Datagram Protocol (UDP) header, and two features from the capturing tool, namely arrival time and information about broken packets.

Derived features result from the combination of basic features and can often only be derived from sequences of packets, e.g. the number of packets per time unit. Given the time stamp and the number of bytes of each packet, for example, the transmitted amount of bytes per second can be calculated. A list of nine derived features generated from this data set can be found in table 3.
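As an illustration of how such rate features could be derived from the arrival times alone, the following minimal sketch bins packets by whole seconds and summarises the bins. The function names and the example time stamps are assumptions for illustration; seconds without any packet are ignored, which is a simplification and may differ from the original feature computation.

```python
from collections import Counter


def packets_per_second(arrival_times):
    """Count packets per one-second bin, keyed by the whole second."""
    return Counter(int(t) for t in arrival_times)


def rate_features(arrival_times):
    """Summarise per-second packet counts (bins without packets are ignored)."""
    counts = list(packets_per_second(arrival_times).values())
    return {
        "max_packets": max(counts),
        "min_packets": min(counts),
        "mean_packets": sum(counts) / len(counts),
    }


# Hypothetical arrival times in seconds since the start of the capture.
example_times = [0.1, 0.5, 0.9, 1.2, 2.0, 2.3, 2.7, 2.9]
features = rate_features(example_times)
```

Each packet can then be annotated with the count of its own bin (packets_per_sec) and with the summary values of the trace it belongs to.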

The impact of each feature on the prediction can be calculated. In order to do so, the decrease of accuracy of the prediction is evaluated. The higher the decrease, the more important the feature. Another metric for the importance of a feature is the decrease in Gini index. The Gini index describes the pureness of a data set, split according to a given feature (Rokach and Maimon, 2005). The higher the decrease in Gini index, the more a feature is suited to split a data set into anomalous and non-anomalous.
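The decrease in Gini index can be sketched for a single threshold split as follows. This uses the Gini impurity as it is commonly defined for decision trees (one minus the sum of squared class shares); whether the importance values in this work were computed in exactly this way is an assumption, and the toy values are hypothetical.

```python
from collections import Counter


def gini_impurity(labels):
    """Gini impurity: 0 for a pure set, 0.5 for a balanced two-class set."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def gini_decrease(values, labels, threshold):
    """Impurity of the whole set minus the weighted impurity after the split."""
    left = [lab for v, lab in zip(values, labels) if v <= threshold]
    right = [lab for v, lab in zip(values, labels) if v > threshold]
    n = len(labels)
    weighted = (len(left) / n) * gini_impurity(left) \
        + (len(right) / n) * gini_impurity(right)
    return gini_impurity(labels) - weighted


# Hypothetical feature values with their labels.
toy_values = [60, 62, 58, 210, 235]
toy_labels = ["normal", "normal", "normal", "anomalous", "anomalous"]
```

A threshold that separates the two classes completely removes all impurity, so its decrease equals the impurity of the unsplit set; a poorly placed threshold yields a much smaller decrease.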

The features packets_per_sec, mean_packets and max_packets have the highest impact on the result for data sets DS1 and DS2, as shown in figures 5 and 6. The fact that all of them are derived features underlines the importance of feature engineering.

Because DS3 consists of different kinds of traffic, a different feature importance ranking results, as depicted in figure 7.

The TCP destination port is important for differentiating, as are the number of packets per protocol and per destination IP. Furthermore, the TCP source port distribution and the mean number of packets per second are important. In this scenario, some basic features are of high importance for the anomaly detection. A sound understanding of the scenario and application area is therefore essential.

4. Anomaly Detection in Modbus Data

In this section, the application of four different machine learning algorithms, namely Support Vector Machines (SVM), Random Forest, k-nearest neighbour and k-means clustering, is described. Those algorithms are used to find outliers in the three data sets DS1, DS2 and DS3, described in subsection 3.2, using the features presented in subsection 3.3. At first, the data sets are split into 70% and 30%, as well as 80% and 20%, respectively, for cross-validation. The split used depends on the quality of the cross-validation; the one providing better results is chosen. In this work, an 80%/20% split has been chosen only for k-nearest neighbour, as described in subsection 4.4. The larger part is used to train the algorithm. Because the data is labelled, the prediction of an algorithm can be compared with the label in order to determine whether the prediction was correct. After training, the remaining part is used for testing. In this phase, the algorithm is no longer adjusted. Still, the predictions are compared to the labels in order to determine metrics that describe the quality of an outlier detection algorithm.
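A plain random hold-out split of this kind can be sketched as follows. The original experiments were carried out in R, and the exact splitting procedure is not given here, so the function name, the fixed seed and the rounding rule are assumptions for illustration.

```python
import random


def train_test_split(instances, train_fraction=0.7, seed=0):
    """Shuffle the instances and split them into a training and a test part."""
    shuffled = list(instances)
    random.Random(seed).shuffle(shuffled)  # seeded for reproducibility
    cut = int(round(train_fraction * len(shuffled)))
    return shuffled[:cut], shuffled[cut:]


# 100 hypothetical instance ids, split 70%/30%.
train, test = train_test_split(range(100), train_fraction=0.7)
```

With very few anomalous instances, as in DS2, a purely random split can leave the test part with only a handful of anomalies, which is worth keeping in mind when reading the confusion matrices.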

Outlier detection can be seen as binary classification: An instance is either normal or anomalous. There are several metrics available to determine the performance of a binary classifier. For intrusion detection in industrial, but also in home and office networks, not only the number of detected attacks is relevant. Due to the high amount of traffic, false positives have severe effects. For one, they take a lot of time to investigate. Furthermore, they can, on a psychological level, cause administrators to become careless when alarms occur, as they expect them to be false positives; this is the so-called alarm fatigue (Bliss et al., 1995). Finally, the amount of normal traffic in networks usually outnumbers the amount of malicious traffic by orders of magnitude. That means wrongly classifying 0.1% of malicious and of normal traffic still results in vastly different numbers of alarms. In this work, we used two metrics to describe the performance of the algorithms: the accuracy (Olson and Dursun, [n. d.]), as well as the f-measure (van Rijsbergen, 1979). The f-measure, or F1-score, is calculated as described in equation 1.

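$$F_{1} = 2 \cdot \frac{\mathrm{precision} \cdot \mathrm{recall}}{\mathrm{precision} + \mathrm{recall}} \tag{1}$$

$$\mathrm{precision} = \frac{t_{p}}{t_{p} + f_{p}} \tag{2}$$

$$\mathrm{recall} = \frac{t_{p}}{t_{p} + f_{n}} \tag{3}$$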

$t$ stands for a correct classification of the algorithm, $f$ for an incorrect one. An index $p$ indicates that the algorithm classified it as positive, an index of $n$ indicates a classification as negative. The F1-score provides information about the relation of precision and recall, as defined in equations 2 and 3. Precision and recall describe the relation of all true positive classifications to all that have been classified as positive, respectively to all events that are positive. If both values are perfect, the F1-score amounts to one; at worst, it reaches 0. The accuracy is calculated according to equation 4.

$$\mathrm{accuracy} = \frac{t_{p} + t_{n}}{t_{p} + f_{p} + t_{n} + f_{n}} \tag{4}$$

Accuracy describes the share of correct classifications among all classifications. First, a naive approach to find outliers is described in subsection 4.1. In subsections 4.2 to 4.5, the algorithms are applied to the three data sets DS1, DS2 and DS3 and their performance is evaluated with the given metrics. The results are then discussed in section 5.
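The metrics of equations 1 to 4 translate directly into code. The sketch below applies them to the confusion matrix of table 13 (k-nearest neighbour on DS1). Treating the normal class as the positive class is an assumption made here; with it, the values listed for DS1 in table 14 are reproduced.

```python
def precision(tp, fp):
    return tp / (tp + fp)


def recall(tp, fn):
    return tp / (tp + fn)


def f1_score(tp, fp, fn):
    p, r = precision(tp, fp), recall(tp, fn)
    return 2 * p * r / (p + r)


def accuracy(tp, fp, tn, fn):
    return (tp + tn) / (tp + fp + tn + fn)


# Table 13, with "normal" as the positive class (assumption):
tp, fn = 678, 0  # normal instances predicted as normal / as anomalous
fp, tn = 2, 9    # anomalous instances predicted as normal / as anomalous

knn_ds1_accuracy = accuracy(tp, fp, tn, fn)
knn_ds1_f1 = f1_score(tp, fp, fn)
```

Under this convention, missed anomalies count as false positives of the normal class. With anomalies this rare they barely lower either metric, which helps explain why the accuracy and F1 values stay close to one.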

4.1. Naive Approach

In some data sets, exploratory data analysis can lead to the discovery of a single feature or a group of features that can be used to distinguish between normal and anomalous data. In DS1, three of the derived features listed in table 3 are capable of splitting the data set perfectly. These features and the corresponding values are listed in table 4.

This makes the application of machine learning algorithms obsolete for DS1: all of the applied algorithms have to compete against a perfect score. It is noteworthy, however, that these are derived and not basic features. At least a thorough understanding and sound feature engineering are therefore necessary in order to make sense of the data.

For DS2 and DS3, no such features exist. DS2 is too large with too few anomalous instances, so that each feature of an anomalous event takes the same value on at least one other normal event. DS3 is even more difficult, as several data traces are mixed. This leads to more heterogeneous feature distributions, making it impossible to classify it by exploratory data analysis.
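The naive approach amounts to checking whether the value ranges of a feature overlap between the two classes. The following sketch performs that check and, if the ranges are disjoint, places a cut halfway between them. The helper name and the midpoint rule are assumptions for illustration; the example values are those listed for DS1 in table 4, with the range 3-9 represented by its end points.

```python
def separating_threshold(normal_values, anomalous_values):
    """Return (threshold, side), where side tells on which side of the
    threshold the anomalous values lie, or None if the ranges overlap."""
    if max(normal_values) < min(anomalous_values):
        return (max(normal_values) + min(anomalous_values)) / 2, "above"
    if max(anomalous_values) < min(normal_values):
        return (max(anomalous_values) + min(normal_values)) / 2, "below"
    return None


# Values from table 4 (DS1).
max_packets_rule = separating_threshold([72], [208, 235, 401])
packets_per_sec_rule = separating_threshold([162, 164], [3, 9, 41])

# A hypothetical feature whose ranges overlap cannot be split this way.
overlapping_rule = separating_threshold([10, 50], [40, 60])
```

For DS2 and DS3 such a check would return no rule for any single feature, which is where the learning algorithms of the following subsections come in.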

4.2. Support Vector Machines

SVMs were first introduced by Boser et al. in 1992 (Boser et al., 1992). The idea is to create a divider between two groups in such a way that each instance has the largest possible distance from the divider. This is called a large margin classifier. In SVM, data points are described by tuples as shown in equation 5 (Cortes and Vapnik, 1995).

$$(x_{i}, y_{i}), \quad i = 1, \dots, m, \quad y_{i} \in \{-1, 1\} \tag{5}$$

$x_{i}$ is a vector describing a data point in an $n$-dimensional feature space. $y_{i}$ describes the attribution to one of the two classes. $m$ is the number of data points. After training, the attribution is performed by the signum function, as shown in equation 6. $w$ is the normal vector of the separating hyperplane, $b$ is the offset of the hyperplane.

$$y_{i} = \mathrm{sgn}(\langle w, x_{i} \rangle - b) \tag{6}$$

Generally, when applying SVM, there are two different cases: Either the set of instances can be divided by a linear geometric figure, or it cannot. If no linear division of the set of instances is possible, the so-called kernel trick is applied (Cortes and Vapnik, 1995). With the kernel trick, the input space is mapped non-linearly into a higher-dimensional feature space, where the algorithm can create a linear divider. In this work, the e1071 library (TU Wien, 2017) of the R programming language (The R Foundation, [n. d.]) has been used with a linear kernel.
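For a linear kernel, classifying a new instance with a trained SVM reduces to the signum rule of equation 6. The following sketch shows only this decision step; the weight vector and offset are hypothetical stand-ins for trained parameters, and the training itself (done with e1071 in this work) is not reproduced.

```python
def svm_predict(w, b, x):
    """Classify x with a linear SVM: the sign of <w, x> - b.

    A score of exactly zero is mapped to +1 here, which is a convention
    chosen for this sketch.
    """
    score = sum(w_i * x_i for w_i, x_i in zip(w, x)) - b
    return 1 if score >= 0 else -1


# Hypothetical parameters of a trained model in a 2-dimensional feature space.
w = (0.5, -1.0)
b = 1.0

label_a = svm_predict(w, b, (4.0, 0.5))  # score  0.5
label_b = svm_predict(w, b, (1.0, 1.0))  # score -1.5
```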

DS1:

SVM performs exceedingly well with this data set. The relation of true and predicted labels can be found in table 5.

SVM is capable of predicting each instance of the test data set correctly, leading to an accuracy, as well as F1-score of 1, as shown in line 1 of table 6.

DS2:

SVM performs exceedingly well with this data set as well. The relation of true and predicted labels can be found in table 7.

SVM is capable of predicting each instance of the test data set correctly, leading to an accuracy, as well as F1-score of 1, as shown in line 2 of table 6.

DS3:

For this data set, SVM still performs relatively well. The relation of true and predicted labels can be found in table 8.

SVM is capable of predicting most instances of the data set correctly, indicated by accuracy and F1-score as shown in line 3 of table 6.

4.3. Random Forest

A Random Forest is a collection of Decision Trees (Breiman, 2001). Each tree consists of a root node, internal nodes (so-called split nodes) and leaf nodes. Each leaf node corresponds to a predicted class. All Decision Trees are grown during a training phase, and the final decision is made by majority voting. Random Forests are robust to noise and to overfitting, a common problem in machine learning. Overfitting happens when an algorithm puts too much importance on individual features, so that instances of a class in which these features are less pronounced are no longer classified correctly. Furthermore, Random Forests converge quickly. In this work, 2 000 trees were used, created by rpart (rpa, 2018) and randomForest (University of Berkeley, 2015) in R (The R Foundation, [n. d.]).
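The majority voting step can be sketched as follows. This Python fragment is only an illustration of the voting and not the randomForest implementation: the three "trees" are hypothetical hand-written decision stumps on two hypothetical features, whereas the trees of a real Random Forest are grown from training data.

```python
from collections import Counter
from typing import Callable, List, Sequence

# A tree is modelled as a function from a feature vector to a class label.
Tree = Callable[[Sequence[float]], str]


def forest_predict(trees: List[Tree], x: Sequence[float]) -> str:
    """Return the class that receives the most votes from the individual trees."""
    votes = Counter(tree(x) for tree in trees)
    return votes.most_common(1)[0][0]


# Hypothetical stumps; thresholds and features are chosen for illustration only.
trees: List[Tree] = [
    lambda x: "anomalous" if x[0] > 0.5 else "normal",
    lambda x: "anomalous" if x[1] > 10.0 else "normal",
    lambda x: "anomalous" if x[0] + x[1] > 12.0 else "normal",
]

# One tree votes "anomalous", two vote "normal".
first = forest_predict(trees, (0.9, 3.0))
# All three trees vote "anomalous".
second = forest_predict(trees, (0.9, 20.0))
```

With an odd number of trees and two classes, no tie can occur; with an even number, a tie-breaking rule would have to be chosen.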

DS1:

The Random Forest algorithm performs well on this data set, as shown in table 9. It reaches a perfect score, as depicted in line 1 of table 10.

DS2:

On this data set, the Random Forest algorithm obtains its worst results of the three data sets. The results are shown in table 11. However, since the number of anomalous instances is tiny in comparison to the size of the data set, the relatively poor results, shown in line 2 of table 10, derive from the metrics and the weighting of their factors.
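How strongly a small anomalous class weighs on the metrics can be seen in a short calculation. The Python sketch below uses hypothetical counts that are not taken from the tables of this work, and it assumes that "anomalous" is treated as the positive class when computing the F1-score.

```python
from typing import Tuple


def accuracy_and_f1(tp: int, fp: int, fn: int, tn: int) -> Tuple[float, float]:
    """Compute accuracy and F1-score from the four confusion matrix entries."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0.0:
        return accuracy, 0.0
    return accuracy, 2 * precision * recall / (precision + recall)


# Hypothetical data set: 10 000 events, of which only 10 are anomalous.
# The classifier misses 5 anomalous events and raises no false alarms.
accuracy, f1 = accuracy_and_f1(tp=5, fp=0, fn=5, tn=9990)
```

In this constructed case, five misclassified events out of 10 000 leave the accuracy at 0.9995 but reduce the F1-score to about 0.67, because the F1-score only considers the few events of the positive class.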

DS3:

On this data set, the Random Forest algorithm performs very well again; the results are shown in table 12. No false negatives occur. It even outperforms SVM, as shown in line 3 of table 10.

4.4. k-nearest Neighbour

This algorithm is a non-parametric classification and regression algorithm (Altman, 1992). In classification, the affiliation of an event to a group is determined from the set of its $k$ nearest neighbours, which are commonly found by calculating the Euclidean distance in an $n$-dimensional feature space, as shown in equation 7. The event under evaluation is assigned to the group to which most of its $k$ nearest neighbours belong.

$$d(p, q) = \sqrt{\sum_{i=1}^{n} (q_i - p_i)^2} \tag{7}$$
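The classification rule can be sketched in Python as follows. This is a minimal illustration with hypothetical two-dimensional training points and $k = 3$, not the implementation or the parameters used in this work.

```python
import math
from collections import Counter
from typing import List, Sequence, Tuple

Sample = Tuple[Sequence[float], str]


def euclidean(p: Sequence[float], q: Sequence[float]) -> float:
    """Euclidean distance between two points in an n-dimensional feature space."""
    return math.sqrt(sum((qi - pi) ** 2 for pi, qi in zip(p, q)))


def knn_classify(train: List[Sample], query: Sequence[float], k: int) -> str:
    """Assign the query to the class held by most of its k nearest neighbours."""
    neighbours = sorted(train, key=lambda sample: euclidean(sample[0], query))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]


# Hypothetical labelled events in a two-dimensional feature space.
train: List[Sample] = [
    ((0.0, 0.0), "normal"),
    ((0.1, 0.2), "normal"),
    ((0.2, 0.1), "normal"),
    ((5.0, 5.0), "anomalous"),
    ((5.2, 4.9), "anomalous"),
]

near_normal = knn_classify(train, (0.1, 0.1), k=3)
near_anomalous = knn_classify(train, (5.1, 5.0), k=3)
```

The second query has two anomalous and one normal event among its three nearest neighbours and is therefore assigned to the anomalous class.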

As discussed before, this is the only algorithm where the 80%/20% split led to an increased cross-validation result.

DS1:

The performance of the k-nearest neighbour algorithm on this data set is poor. The relatively high false positive rate, as shown in table 13, leads to bad overall performance, as shown in line 1 of table 14.

DS2:

Even though the k-nearest neighbour algorithm performs better on this data set, the result is still not satisfactory. The algorithm classifies every event as normal, as shown in table 15. Because the number of anomalous events is small, this still leads to a medium performance evaluation, as shown in line 2 of table 14.

DS3:

As with DS2, the k-nearest neighbour algorithm classifies each instance of DS3 as normal. This is shown in table 16. The corresponding metrics can be found in line 3 of table 14.

4.5. k-means Clustering

In k-means clustering (Alsabti et al., 1997), the affiliation of an object to a group is calculated. It is commonly determined by the Euclidean distance, as introduced in equation 7, of a point in an $n$-dimensional feature space from the center of a cluster. When applying the k-means algorithm, those distances are minimised with an error function, as shown in equation 8.

[TABLE]

$k$ is the number of clusters, which needs to be defined a priori. In this work, two clusters were used, one to describe normal and the other to describe anomalous behaviour. $n$ is the number of events or elements in the feature space and $C$ is the cluster. In contrast to the above algorithms, labels and predictions cannot be compared directly. Instead, each of the two clusters has to be given a label in order to determine the quality. In this work, clusters were labelled as if a user did not have labels to support decision making, which is also the choice that minimises the error. This means that the cluster containing the larger portion of elements is regarded as the cluster with the label “normal”. Furthermore, k-means clustering is the only algorithm considered in this work that is non-supervised, meaning it does not need labelled training data.
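The clustering with two clusters and the labelling rule described above can be sketched as follows. This Python fragment is a minimal illustration using the standard iterative assignment and update steps. The data points and the deterministic choice of initial centers are hypothetical and do not reproduce the setup of this work.

```python
from collections import Counter
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]


def squared_distance(p: Sequence[float], q: Sequence[float]) -> float:
    return sum((pi - qi) ** 2 for pi, qi in zip(p, q))


def k_means(points: List[Point], centers: List[Point], iterations: int = 20) -> List[int]:
    """Alternate between assigning points to the nearest center and updating centers."""
    centers = list(centers)
    assignment: List[int] = []
    for _ in range(iterations):
        assignment = [
            min(range(len(centers)), key=lambda j: squared_distance(p, centers[j]))
            for p in points
        ]
        for j in range(len(centers)):
            members = [p for p, a in zip(points, assignment) if a == j]
            if members:  # keep the old center if a cluster is empty
                centers[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return assignment


def label_clusters(assignment: List[int]) -> List[str]:
    """Label the larger cluster "normal" and every other cluster "anomalous"."""
    normal_cluster = Counter(assignment).most_common(1)[0][0]
    return ["normal" if a == normal_cluster else "anomalous" for a in assignment]


# Hypothetical events: five close to the origin and two far away from it.
points: List[Point] = [
    (0.0, 0.0), (0.1, 0.1), (0.2, 0.0), (0.0, 0.2), (0.1, 0.0),
    (4.0, 4.0), (4.1, 3.9),
]
# Assumption: the first and the last point serve as initial centers.
assignment = k_means(points, centers=[points[0], points[-1]])
labels = label_clusters(assignment)
```

The labelling rule only works if the anomalous events actually end up in the smaller cluster. If they are absorbed by the larger cluster, all of them are labelled "normal".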

The biggest advantage of non-supervised machine learning algorithms is omitting the need to find a valid training data set. On the other hand, if they are applied to unlabeled data, it is hard to determine their performance.

DS1:

When applying k-means clustering to this data set, all normal events are grouped in one cluster. Most of the anomalous events, however, are clustered there as well, as shown in table 17. The corresponding accuracy and F1-score can be found in line 1 of table 18.

DS2:

The k-means clustering algorithm distributes the events labeled “normal” across both clusters in comparable numbers (about 5 000 vs. 6 200), as shown in table 19. This leads to significantly reduced performance metrics, listed in line 2 of table 18. Furthermore, all anomalous events are grouped in the larger cluster, which classifies them as normal.

DS3:

As with DS2, all events labeled “anomalous” in this data set are grouped in the same cluster as most of the events labeled “normal”. This effect is depicted in table 20. Since there are about 2 000 times as many normal events as anomalous ones, the performance metrics are slightly improved in comparison to the above use case, as listed in line 3 of table 18.

5. Results and Discussion

DS1 can be seen as a sort of necessary condition: since it is perfectly separable based on three derived features, as described in subsection 4.1, the algorithms should lead to a perfect result as well. Only SVM and Random Forest did so. Both of them also performed very well on the other two data sets; SVM outperformed Random Forest on DS2, and vice versa on DS3. k-nearest neighbour and k-means clustering performed significantly worse. In machine learning, F1-scores and accuracy scores of around 0.999 9 are usually required in order to consider the performance of a given algorithm good. While k-nearest neighbour is sometimes close to these values, k-means clustering leads to results that are far from satisfactory. Optimising the number of clusters, e.g. by calculating and maximising the silhouette coefficients (Rousseeuw, 1987), might improve the performance.
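The silhouette coefficient mentioned above can be computed as sketched below. This Python fragment is an illustration on hypothetical one-dimensional data and was not part of the evaluation. For each point, $a$ is the mean distance to the other points of its own cluster, $b$ is the smallest mean distance to the points of any other cluster, and the coefficient is $(b - a) / \max(a, b)$.

```python
import math
from typing import Dict, List, Sequence


def euclidean(p: Sequence[float], q: Sequence[float]) -> float:
    return math.sqrt(sum((qi - pi) ** 2 for pi, qi in zip(p, q)))


def silhouette_score(points: List[Sequence[float]], assignment: List[int]) -> float:
    """Mean silhouette coefficient; requires at least two non-empty clusters."""
    clusters: Dict[int, List[Sequence[float]]] = {}
    for p, c in zip(points, assignment):
        clusters.setdefault(c, []).append(p)
    scores = []
    for p, c in zip(points, assignment):
        own = clusters[c]
        if len(own) == 1:
            scores.append(0.0)  # convention for clusters with a single point
            continue
        # The distance of p to itself is 0, hence the division by len(own) - 1.
        a = sum(euclidean(p, q) for q in own) / (len(own) - 1)
        b = min(
            sum(euclidean(p, q) for q in other) / len(other)
            for label, other in clusters.items()
            if label != c
        )
        denominator = max(a, b)
        scores.append((b - a) / denominator if denominator else 0.0)
    return sum(scores) / len(scores)


# Hypothetical data with three well separated groups.
points = [(0.0,), (1.0,), (10.0,), (11.0,), (20.0,), (21.0,)]
score_two = silhouette_score(points, [0, 0, 0, 0, 1, 1])    # k = 2
score_three = silhouette_score(points, [0, 0, 1, 1, 2, 2])  # k = 3
```

For this constructed data, the assignment with three clusters yields the higher score, so maximising the coefficient would select $k = 3$. Whether a larger $k$ would improve the detection results on the data sets of this work remains open.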

In their work, Lemay and Fernandez state that the regularity of their traffic would make it easy for machine learning-based anomaly detection algorithms to find the attacks. This is especially true for DS1. They also provide data sets with covert channel attacks that are more subtle (Lemay and Fernandez, 2016). To increase the difficulty for the algorithms, and to mimic the changing nature of real industrial applications, we mixed several data sets in DS3. Still, Random Forest and SVM were able to find an impressive number of attacks.

Furthermore, it should be noted that all of the features used for detection are Ethernet- and TCP/IP-based. The Modbus protocol-based characteristics did not have any direct influence on the detection mechanisms. However, the regularity and the structure of the traffic differ significantly from home- and office-based network traffic. This means that industrial traffic is different in character and therefore behaves differently under algorithmic detection, even if no protocol-specific attributes are employed.

6. Conclusion and Outlook

In this work, it is shown that some machine learning-based anomaly detection algorithms, namely SVM and Random Forest, perform well in detecting network traffic anomalies in industrial networks. Since both of them are supervised methods, however, training data is needed. This data can be provided by simulators, such as the one by Lemay and Fernandez that was analysed in this work. The difficulty lies in generating sound, valid data that matches the industrial environment in which the anomaly detection algorithm is to be applied.

There are several possibilities for extending the presented methods. Data from different sources can be gathered, combined and used to enhance the results (Duque Anton et al., 2017c). The introduction of context information into the anomaly detection process is promising and capable of increasing the performance (Duque Anton et al., 2017b). Furthermore, deception technologies could be employed as sensors for anomaly detection in order to enhance the insight into malicious behaviour (Fraunholz et al., 2017b).

One of the most pressing needs is the generation of data with attacks that are specific to industrial applications in general, and to Modbus in particular. The analysis performed in this work merely employs network-based features that exist, in the same form, in home and office appliances. The only major difference is the timing pattern, which is strongly correlated with attacks.

Acknowledgments

This work has been supported by the Federal Ministry of Education and Research of the Federal Republic of Germany (Foerderkennzeichen KIS4ITS0001, IUNO). The authors alone are responsible for the content of the paper.

Bibliography

rpa (2018) 2018. Package rpart.
Alsabti et al. (1997) Khaled Alsabti, Sanjay Ranka, and Vineet Singh. 1997. An efficient k-means clustering algorithm. Electrical Engineering and Computer Science (January 1997).
Altman (1992) N. S. Altman. 1992. An Introduction to Kernel and Nearest-Neighbor Nonparametric Regression. The American Statistician 46, 3 (August 1992), 175–185.
Bhuyan et al. (2014) Monowar H. Bhuyan, D. K. Bhattacharyya, and J. K. Kalita. 2014. Network Anomaly Detection: Methods, Systems and Tools. IEEE Communications Surveys & Tutorials 16, 1 (2014), 303–336. https://doi.org/10.1109/SURV.2013.052213.00046
Bliss et al. (1995) James Bliss, Richard D. Gilson, and John E. Deaton. 1995. Human probability matching behaviour in response to alarms of varying reliability. Ergonomics 38, 11 (December 1995). https://doi.org/10.1080/00140139508925269
Boser et al. (1992) Bernhard E. Boser, Isabelle M. Guyon, and Vladimir N. Vapnik. 1992. A Training Algorithm for Optimal Margin Classifiers. In Proceedings of the Fifth Annual Workshop on Computational Learning Theory (COLT ’92). ACM, New York, NY, USA, 144–152. https://doi.org/10.1145/130385.130401
Breiman (2001) Leo Breiman. 2001. Random Forests. Machine Learning 45, 1 (October 2001), 5–32. https://doi.org/10.1023/A:1010933404324