Machine Learning-based Link Fault Identification and Localization in Complex Networks

Srinikethan Madapuzi Srinivasan; Tram Truong-Huu; Mohan Gurusamy

arXiv:1812.03650·cs.NI·March 28, 2019


TL;DR

This paper presents ML-LFIL, a passive machine learning-based method for identifying and localizing link faults in complex networks by analyzing normal traffic measurements, offering high accuracy and lower delay than active probing.

Contribution

The paper introduces a novel three-stage ML-LFIL technique that uses passive traffic analysis for fault detection and localization in heterogeneous networks, reducing time and overhead.

Findings

01. High accuracy in fault detection and localization
02. Significantly lower fault localization time compared to active probing
03. Effective in complex, heterogeneous network environments


Tables

Table I: Summary of ML-LFIL performance with the test dataset for the three different topologies (in %)

Topology	Precision	Recall	$F_{1}$ -Score
$30$ nodes	97.52	96.46	97.00
$60$ nodes	94.56	92.40	93.47
$100$ nodes	93.20	91.22	92.20

Table II: Performance comparison (in %)

Topology	Algorithm	Precision	Recall	$F_{1}$-Score
$30$-node	ML-LFIL-S1	98.63	98.14	98.38
$30$-node	Ping-based approach	99.45	99.12	99.28
$60$-node	ML-LFIL-S1	97.71	95.89	96.79
$60$-node	Ping-based approach	99.38	98.91	99.14
$100$-node	ML-LFIL-S1	96.38	95.37	95.82
$100$-node	Ping-based approach	99.12	98.50	98.81

Table III: Comparison of fault localization time

Methods	Time (in $\mu$s)
ML-LFIL with SVM	$178.02$
ML-LFIL with MLP	$302.73$
ML-LFIL with RF	$286.41$
Ping-based approach ( $30$ -node network)	$2960.24$
Ping-based approach ( $60$ -node network)	$8266.40$
Ping-based approach ( $100$ -node network)	$56776575.82$

Equations

$$\mathcal{C} = [b_{1,2}, b_{1,3}, \dots, b_{i,j}, \dots, b_{V-1,V}, d_{1,2}, d_{1,3}, \dots, d_{i,j}, \dots, d_{V-1,V}, l_{1,2}, l_{1,3}, \dots, l_{i,j}, \dots, l_{V-1,V}]$$

$$\mathcal{R} = [b_{1,2}, b_{1,3}, \dots, b_{i,j}, \dots, b_{V-1,V}, s_{l}, d_{l}]$$

$$P = \frac{T_{P}}{T_{P} + F_{P}}$$

$$R = \frac{T_{P}}{T_{P} + F_{N}}$$

$$F_{1}\text{-Score} = \frac{2PR}{P + R}.$$


Srinikethan Madapuzi Srinivasan, Tram Truong-Huu, and Mohan Gurusamy

Manuscript received November 22, 2018; revised February 28, 2019; accepted March 22, 2019. Date of publication March day, 2019; date of current version Month day, 2019. This work was supported by Singapore MoE AcRF Tier 1 Grant, NUS WBS No. R-263-000-C04-112. *(Corresponding author: Tram Truong-Huu.)*

The authors are with the Department of Electrical and Computer Engineering, National University of Singapore, Singapore 117583 (e-mail: [email protected], [email protected], [email protected]).

Copyright (c) 2019 IEEE. Personal use of this material is permitted. However, permission to use this material for any other purposes must be obtained from the IEEE by sending a request to [email protected].

Abstract

With the proliferation of network devices and rapid development in information technology, networks such as the Internet of Things are increasing in size and becoming more complex with heterogeneous wired and wireless links. In such networks, link faults may result in a link disconnection without immediate replacement or a link reconnection, e.g., a wireless node changes its access point. Identifying whether a link disconnection or a link reconnection has occurred and localizing the failed link become a challenging problem. An active probing approach requires a long time to probe the network by sending signaling messages on different paths, thus incurring significant communication delay and overhead. In this paper, we adopt a passive approach and develop a three-stage machine learning-based technique, namely ML-LFIL, that identifies and localizes link faults by analyzing the measurements captured from the normal traffic flows, including aggregate flow rate, end-to-end delay and packet loss. ML-LFIL learns the traffic behavior in normal working conditions and in different link fault scenarios. We train the learning model using support vector machine, multi-layer perceptron and random forest. We implement ML-LFIL and carry out extensive experiments using the Mininet platform. Performance studies show that ML-LFIL achieves high accuracy while requiring much lower fault localization time compared to the active probing approach.

Index Terms:

Internet of Things, complex networks, machine learning, fault identification, fault localization.

I Introduction

The Internet of Things (IoT), comprising billions of intelligent devices with sensing and processing capabilities along with the ability to connect to the Internet through wired or wireless connections, is being increasingly used to monitor and respond to events in real time. By $2020$, the number of such Internet-connected devices is expected to exceed $50$ billion [1], raising new challenges in network management. For instance, the traffic generated by Internet-connected vehicles is expected to reach 300,000 exabytes by the year 2020 [2]. The large number of devices and high network complexity lead to a higher chance of link faults. Such link faults may lead to: (i) a link disconnection without an immediate replacement, e.g., a link cut or switch ports going down, or (ii) a link reconnection, e.g., a wireless node changes its access point due to poor wireless channel quality. A fault occurring on a crucial link, if not recovered in a timely manner, could lead to the disruption of services to customers. It can take hours or days to repair a link fault by using a manual recovery approach based on the combination of ping, traceroute and other functionalities for Ethernet and MPLS [3]. Thus, it is essential to have an efficient fault management system, which is able to diagnose faults as soon as possible, and quickly recover the network from such faults, e.g., using a proactive fault recovery mechanism [4].

The accuracy of fault identification and localization depends on the algorithm used and the correctness of the data (i.e., network state) captured from the network, which in turn depends on the frequency of data collection (i.e., data sampling frequency) and the method by which the data is captured. The network state data can be captured by two methods: active and passive measurements. Active measurements using signaling messages (also known as active probing) have been extensively studied in the literature [5, 6, 7] and are more commonly used in practice. In an active probing approach, several measurement endpoints (MEPs) or measurement intermediate points (MIPs) are deployed in the network, which inject and exchange additional control packets among themselves to identify and localize network failures. Thus, the accuracy of the active probing approach depends on the number of paths probed and the number of messages injected into the network. On one hand, this increases the communication overhead due to the additional traffic injected. On the other hand, it increases the latency in fault identification and localization due to the propagation delay of signaling messages on the probed paths. In contrast to active probing methods, passive monitoring does not inject additional traffic into the network to obtain traffic attributes upon a link fault. Instead, it leverages readily available metrics (traffic attributes) such as end-to-end delay and packet loss from the normal traffic generated by users and performs the necessary analysis to identify and localize link faults. Traffic attributes are continuously collected and analyzed by a network monitor and analyzer. Thus, the fault manager is able to react quickly to a change in traffic behavior.

The emergence of machine learning techniques such as deep learning is attracting a great deal of attention in many research and development efforts. Dealing with complex problems is one of the most important advantages of machine learning [8]. With a huge amount of traffic exchanged in the networks, applying machine learning techniques to analyze those traffic attributes could provide useful insights for traffic engineering and fault management. In this paper, we develop a traffic engineering (TE)-based machine learning technique that captures the network state and learns the traffic behavior by passively monitoring the network at egress and ingress nodes. The passive monitoring approach enables fast link fault identification and localization without additional communication overhead. The traffic features used for the machine learning model include the aggregate flow rate, packet loss and round-trip time delay of traffic packets among the monitored nodes. Upon a change in traffic behavior, a link fault can be quickly identified and localized. We consider IoT networks, which can be modeled as complex networks that are able to reflect the randomness and growth characteristics of the real world [9, 10, 11]. We develop a three-stage machine learning-based technique for link fault identification and localization (ML-LFIL). Given traffic information captured from the network, the first stage detects if a link fault has occurred. If a link fault has occurred, the second stage identifies whether a link reconnection has occurred along with a link disconnection or not. Finally, the third stage localizes the link fault, i.e., the location of the disconnected link and/or reconnected link. We train the machine learning model using support vector machine (SVM), random forest (RF) and multi-layer perceptron (MLP, one of the neural network architectures), which have been widely used in the literature for classification and regression problems.
We note that training of the learning model can be carried out offline or in a parallel process while fault identification and localization are done by applying the learning model with the data point collected at the time of failure, i.e., a data point is represented by a set of traffic features captured from the network at a time instant. Thus, ML-LFIL enables a fast link fault identification and localization even with large networks. We demonstrate the effectiveness of ML-LFIL in terms of identification and localization accuracy, and fault localization time on different networks, including two random complex networks and a network from the Internet Topology Zoo [12].

The contributions of the paper are summarized as follows:

• We adopt a passive monitoring approach to address the problem of link fault identification and localization in complex networks.

• We develop a machine learning model that can learn traffic behavior in normal working conditions and in different link fault scenarios.

• We develop a TE-based machine learning technique for link fault identification and localization.

• We carry out extensive experiments with two random complex networks and one network from the Internet Topology Zoo to demonstrate the effectiveness of the proposed approach.

The rest of the paper is organized as follows. We review the related work in Section II. We present our proposed approach in Section III. We carry out performance evaluation in Section IV before we conclude the paper in Section V.

II Related Work

II-A Network Fault Localization

In [13, 14], the authors presented a consolidated taxonomy of the different techniques that have been developed for localizing link faults. These techniques are broadly categorized as rule-based techniques [15, 16], case-based techniques [17, 18], probability-based techniques [19, 20, 21, 22] and model-based techniques [23, 24]. The rule-based techniques rely on a knowledge base developed by the system experts, which is essentially a set of if-then statements, i.e., the rules of the system. Nevertheless, these rule-based techniques can learn adaptively neither from past experience nor from network dynamics observed in previously unseen traffic behavior. Further, updating and enriching the knowledge base is complex. Similarly, fault diagnosis by case-based techniques depends on expert knowledge and on experience gained from past cases. Thus, these solutions can only be applied to faults that are similar to the ones that have occurred previously.

In [21, 22], the authors presented probability-based techniques for fault diagnosis. The location of the link faults in the network is indicated by the corresponding probability mass functions of the links. In comparison with [19, 20], these two works require less computational resources. Model-based techniques [23, 24] build a mathematical model from a knowledge base to describe the network behaviors. The newly-observed traffic behaviors from the network are compared to those predicted by the model. If the observed behaviors fail to conform to the predicted ones, faults are detected in the network. Thus, the model-based techniques require accurate information about the connections in the network to efficiently diagnose link faults. However, obtaining such information is not always feasible with complex networks owing to their highly-dynamic nature. Differing from all the above techniques, our proposed approach enables a fast fault localization without requiring knowledge about past failures.

II-B Machine Learning for Network Fault Management

Recently, several works propose to use machine learning techniques for network fault management [25, 26, 7]. In [25], the authors presented a usage-based failure (service disruption) detection method in mobile networks. The authors proposed to monitor aggregated customer usage data and derive a usage pattern for a given geographic region, device type and service. A drop in aggregated usage (lower than expected) will be interpreted as a sign of potential service disruption experienced in that region. This approach, however, requires the additional deployment of service monitoring on top of network monitoring. Further, it requires an accurate user grouping such that the users in the same group have the same usage pattern. The work presented in [27] uses an online fault detection model using support vector machine (SVM) to detect faults in clouds. Similarly, the work presented in [26] uses support vector machine (SVM) to classify the received sensor data to detect faults through abnormal behavior in data. This approach requires the traffic to be redirected to the server where the classifier is deployed, thus incurring a longer delay in data processing and additional communication overhead. Our approach requires only the traffic attributes whose size (usually only a few KB) is much smaller than the data traffic.

In [7], the authors proposed a machine learning-based link fault localization coupled with active probing by sending signaling messages to obtain data sets for the machine learning model. Upon a failure, signaling messages are injected into the network with different source and destination node pairs to capture the traffic information such as the number of hops, propagation delay, etc. This information is analyzed by a machine learning model to localize the link fault. On one hand, this incurs additional communication overhead due to signaling messages injected into the network. On the other hand, it incurs an additional delay in fault localization due to the propagation delay of signaling messages across the network. Our work differs from [7] in that we use a passive monitoring approach that identifies and localizes link fault by analyzing the information captured from the normal traffic in normal working conditions and failure scenarios. This enables a fast fault identification and localization without any communication overhead.

A recent work in the context of 5G networks uses deep learning for link failure mitigation [28]. The authors proposed to use deep learning to analyze the signal conditions of a handover when mobile devices move from one coverage area to another area under a different base station. Based on the signal conditions and the status of the handovers that happened in the past, the model can classify whether the handover will be successful or not in advance. Another work presented in [29] uses system logs as input for failure detection and diagnosis for solar-powered wireless mesh networks. The authors used the knowledge discovery in database methodology and a pre-defined dictionary of failures based on their previous experience with the deployment of wireless mesh networks. The fault detection and diagnosis are solved as a pattern classification problem. In [30], the authors described an online failure prediction system built over Apache Spark that takes a repository of network management events, trains a random forest model and uses this model to predict the appearance of future events in near real time. However, for some failures, e.g., silent failures, no event will happen in the network, making it hard for the system to detect the failure. Different from these works, we propose to use machine learning techniques to analyze the attributes of normal traffic generated by real users or applications.

II-C Our Work

Our work addresses the link fault identification and localization problem using passive monitoring. We consider two link fault scenarios: link disconnection and link reconnection. In our previous work [31], we focused on localizing only link disconnections. In this work, we extend our earlier work by considering both link disconnection and reconnection, which increases the complexity of the problem. Our proposed machine learning approach (ML-LFIL) can achieve high accuracy and localize link faults faster.

III Data Analytics for Link Fault Identification and Localization

In this section, we develop our machine learning-based technique for link fault identification and localization (ML-LFIL). In Fig. 1, we present the architecture of ML-LFIL. The network includes both wired and wireless (dashed lines) links. A wireless link can be replaced by a new one with different end nodes, i.e., a wireless-enabled node can change its existing connection to another node in the network with better wireless channel quality [32]. Due to a link reconnection, certain flows change their paths in the network, thus affecting the end-to-end traffic measurements between the nodes in the network. The network monitor periodically probes the network to capture end-to-end traffic information from flows traversing the network. Traffic information extracted from the flows will be analyzed by the fault manager using a machine learning technique. To avoid the waiting time between instants of traffic sampling, an event-driven approach can additionally be used to trigger link fault identification and localization. When a node experiences abnormal traffic behavior, it sends a request along with traffic measurements for link fault identification and localization. This means that while the monitor keeps probing the network periodically to obtain more data for training the learning model and improving its accuracy, every abnormal traffic behavior will be processed additionally so as to react to link faults in real time.

III-A Traffic Features

ML-LFIL carries out link fault identification and localization by analyzing the end-to-end traffic features captured from the network. The accuracy of our machine learning technique depends on the features that are used for training the model. Thus, it is imperative to identify the features to be extracted from network traffic measurements for localizing link disconnections in the network. At every source/destination node, the following traffic measurements are extracted:

• Aggregate transmission rate of flows destined for other nodes in the network. We denote by $b_{s,d}$ the aggregate rate of the flow that originates from node $s$ and is destined for node $d$.

• End-to-end delay, denoted as $d_{s,d}$, which is computed as the round-trip time (RTT) delay of a packet sent from node $s$ to node $d$.

• Packet loss rate, denoted as $l_{s,d}$, which is the ratio between the number of packets lost on the path between node $s$ and node $d$ and the total number of packets transmitted between two successive sampling instants.

While the aggregate rate between every node pair gives us information about the network load, the end-to-end delay and packet loss features provide indirect information on the path taken by each flow and the congestion status of that path. The end-to-end delay can be captured at the hosts in a real network with the timestamp information carried in the packet header. With the same aggregate rate of a flow between a source and destination, longer delay and higher packet loss implicitly indicate that a certain link has failed. These traffic measurements captured for different pairs of source and destination nodes help machine learning algorithms to learn the correlation between the learning features and different link fault scenarios. These features are fed to machine learning algorithms as a vector $\mathcal{C}$ as defined below:

$$\mathcal{C} = [b_{1,2}, b_{1,3}, \dots, b_{i,j}, \dots, b_{V-1,V}, d_{1,2}, d_{1,3}, \dots, d_{i,j}, \dots, d_{V-1,V}, l_{1,2}, l_{1,3}, \dots, l_{i,j}, \dots, l_{V-1,V}]$$

where $V$ is the number of nodes in the network while $b_{s,d}$, $d_{s,d}$ and $l_{s,d}$ are the values of the traffic features defined above. It is to be noted that the number of features depends on the number of aggregate flows in the network, which in turn depends on the number of nodes in the network. Given a network with $V$ nodes, the total number of aggregate flows, denoted as $\mathcal{N}$, is given by $V(V-1)$. We extract three traffic features for each aggregate flow as discussed above, and therefore, the size of vector $\mathcal{C}$ is three times the number of aggregate flows, i.e., $3\mathcal{N}=3V(V-1)$. At each sampling instant, a vector $\mathcal{C}$, also called a data point, is captured from the network and will be evaluated by ML-LFIL.
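To make the layout of a data point concrete, the following minimal sketch flattens per-flow measurements into one vector of length $3V(V-1)$. The function name `build_feature_vector`, the dictionary-based input layout and the default of `0.0` for flows without a measurement are assumptions made for illustration; they are not taken from the paper's implementation.

```python
from itertools import permutations


def build_feature_vector(rate, delay, loss, num_nodes):
    """Flatten per-flow measurements into one data point.

    rate, delay and loss are dicts keyed by ordered (s, d) node pairs
    (a hypothetical input layout). The output lists all rates first,
    then all delays, then all loss rates, mirroring the ordering of C.
    """
    # V(V-1) ordered source/destination pairs, nodes numbered from 1.
    pairs = list(permutations(range(1, num_nodes + 1), 2))
    vector = []
    for table in (rate, delay, loss):
        # Assumption: a flow without a measurement contributes 0.0.
        vector.extend(table.get(pair, 0.0) for pair in pairs)
    return vector


V = 4
pairs = list(permutations(range(1, V + 1), 2))
rate = {p: 10.0 for p in pairs}   # Mbps, illustrative values
delay = {p: 2.5 for p in pairs}   # ms, illustrative values
loss = {p: 0.0 for p in pairs}    # ratio, illustrative values
c = build_feature_vector(rate, delay, loss, V)
print(len(c))  # 3 * V * (V - 1) = 36
```

Because the vector grows quadratically with $V$, a $100$-node network already yields $29{,}700$ features per data point under this layout.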

III-B ML-LFIL

We now present the proposed three-stage machine learning technique for link fault identification and localization (ML-LFIL). In Fig. 2, we present the functional block diagram of ML-LFIL. As mentioned earlier, ML-LFIL is composed of three stages. Given traffic information captured from the network, the first stage detects if a link disconnection has occurred using a link disconnection classifier. Given that a link disconnection has occurred, the second stage uses a delay regressor to identify the link fault: either only a link disconnection has occurred or both a link disconnection and a link reconnection have occurred. Finally, the third stage uses a link reconnection classifier to localize the link reconnection and may correct the disconnected link output by the first stage. Below, we describe the details of each stage.
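Before the per-stage details, the control flow between the three stages can be sketched as follows. This is only an illustrative skeleton: the function `ml_lfil`, its callable arguments and the returned dictionary are hypothetical names, and the trained models are replaced by stub functions.

```python
def ml_lfil(data_point, disconnection_clf, is_reconnection, reconnection_clf):
    """Run the three stages on one data point.

    disconnection_clf(data_point) -> link tuple, or None for "no fault"
    is_reconnection(data_point, l1) -> bool (Stage 2 decision)
    reconnection_clf(data_point) -> (disconnected_link, reconnected_link)
    """
    # Stage 1: tentative disconnected link L1, or no fault at all.
    l1 = disconnection_clf(data_point)
    if l1 is None:
        return {"fault": False}
    # Stage 2: decide whether a reconnection accompanied the disconnection.
    if not is_reconnection(data_point, l1):
        return {"fault": True, "disconnected": l1, "reconnected": None}
    # Stage 3: localize both links; L2 may correct the tentative L1.
    l2, l3 = reconnection_clf(data_point)
    return {"fault": True, "disconnected": l2, "reconnected": l3}


# Stub stages standing in for trained models.
result = ml_lfil(
    [0.0],
    lambda x: (1, 2),
    lambda x, l1: True,
    lambda x: ((1, 2), (1, 9)),
)
print(result)
```

Passing the stages as callables keeps the sketch independent of whether SVM, MLP or RF is used inside each stage.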

III-B1 Stage 1 – Link Disconnection Classification

Since different link disconnections may cause different traffic behaviors represented by the traffic measurements, the problem can be considered as a multi-class machine learning classification problem. The first stage, denoted as ML-LFIL-S1, not only identifies whether a link fault has occurred but also identifies which link has failed. Given a network with $|\mathcal{E}|$ links and a data point, the first stage classifies the data point into one of the link fault classes ($|\mathcal{E}|$ classes) or the “no-link-fault” class, where $\mathcal{E}$ is the set of links in the network. Thus, the total number of classes required for training the machine learning algorithm to detect and localize a link fault in the network is $|\mathcal{E}|+1$. In our work, we consider a single link fault scenario, i.e., a link fault can be detected and recovered before another fault occurs. This is an acceptable assumption since protection and recovery of the network from multiple simultaneous faults require highly complex algorithms and a large amount of resources to be reserved even when such faults are not frequent. We also note that all the links in the network are treated equally without any priority.
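As a small illustration of the $|\mathcal{E}|+1$ class structure, the sketch below assigns one integer label to the no-fault case and one to each link. The name `build_class_labels`, the choice of index `0` for the no-fault class and the example link set are assumptions for illustration only.

```python
NO_FAULT = "no-link-fault"


def build_class_labels(links):
    """Map the no-fault case and each link to an integer class index."""
    # Assumption: class 0 is "no-link-fault"; links follow in sorted order.
    classes = [NO_FAULT] + sorted(links)
    return {label: index for index, label in enumerate(classes)}


links = [(1, 2), (2, 3), (1, 8)]  # hypothetical link set E
labels = build_class_labels(links)
print(len(labels))  # |E| + 1 = 4
```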

We use all three traffic features and train the learning model using one of the following machine learning algorithms:

• Support Vector Machine (SVM) [33] is a supervised machine learning technique that tries to separate data points into two different classes by identifying the best possible separating hyperplane. It can be extended to multi-class problems by constructing multiple hyperplanes.

• Multi-Layer Perceptron (MLP) [34] is a class of artificial neural networks. The number of layers and the number of neurons in each layer depend on the complexity of the machine learning problem. A back-propagation algorithm is used by MLP for training and obtaining the weights of the neurons in the neural network.

• Random Forest (RF) [35] is a classification algorithm that constructs multiple decision trees during the training phase and outputs the mode of the individual trees' predictions as the class label. It is well suited to multi-class classification problems such as link fault identification and localization.

We note that while there exist many machine learning algorithms that can be used for a classification problem, SVM, RF and MLP have demonstrated better performance than other algorithms [31]. Hence, in this work, we adopt only these three machine learning algorithms to train our learning model.

The output of the first stage is a tentative disconnected link ($L_{1}$) in case a link disconnection has occurred, or a message stating the normal working condition of the network if no link fault has occurred. The term “tentative” means that the disconnected link $L_{1}$ may not be accurately determined, because a link reconnection that has occurred along with the link disconnection can cause traffic behavior similar to that of a link disconnection alone. Even though $L_{1}$ may be wrongly classified, it shows that ML-LFIL is sensitive to link faults, and it triggers the other two stages for further analysis to identify and correctly localize the link fault.

III-B2 Stage 2 – Link Fault Identification (ML-LFIL-S2)

To identify the link fault, we estimate the end-to-end delay of the network traffic caused by the disconnection of the tentative link $L_{1}$, using the aggregate flow rates captured from the network. The estimated end-to-end delay is compared with the actual delay captured from the network. We use the mean square error to compute the difference between the estimated delay and the actual delay. If the difference is less than a threshold value (say $10\%$), we can confirm that a link reconnection has not occurred along with the link disconnection and that $L_{1}$ is the exact link that has been disconnected. The rationale is that if only a link disconnection has occurred, the flows affected by the disconnected link have to traverse a longer path, thus experiencing a longer delay, compared to the case where the disconnected link is replaced by a new link.
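The Stage 2 decision rule can be sketched as below. The text does not specify how a mean square error is compared against a percentage, so this sketch assumes the error is normalized by the mean squared actual delay; that normalization, the function names and the sample delays are all assumptions for illustration.

```python
def mean_squared_error(estimated, actual):
    """Plain mean square error between two equal-length delay lists."""
    return sum((e - a) ** 2 for e, a in zip(estimated, actual)) / len(actual)


def reconnection_occurred(estimated, actual, threshold=0.10):
    """Return True when the estimate deviates too much from the measurement.

    Assumption: the MSE is divided by the mean squared actual delay so that
    it can be compared with a relative threshold such as 10%. Measured
    delays are assumed to be positive, so the scale is never zero.
    """
    scale = sum(a ** 2 for a in actual) / len(actual)
    relative_error = mean_squared_error(estimated, actual) / scale
    return relative_error >= threshold


estimated = [10.0, 20.0, 30.0]  # delays predicted for "L1 disconnected only"
print(reconnection_occurred(estimated, [10.5, 19.0, 31.0]))  # False
print(reconnection_occurred(estimated, [5.0, 10.0, 30.0]))   # True
```

In the first call the measured delays closely match the disconnection-only estimate, so $L_{1}$ would be confirmed; in the second call the mismatch would trigger Stage 3.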

We develop a regression learning model to estimate the end-to-end delay of all the flows in the network. The model is trained with the aggregate flow rates and the tentative disconnected link $L_{1}$. These features are fed to machine learning algorithms as a vector $\mathcal{R}$ given by

$$\mathcal{R} = [b_{1,2}, b_{1,3}, \dots, b_{i,j}, \dots, b_{V-1,V}, s_{l}, d_{l}]$$

where $V$ is the number of nodes in the network, $b_{i,j}$ is the aggregate flow rate between node $i$ and node $j$, and $(s_{l},d_{l})$ are the source and sink of the tentative disconnected link $L_{1}$. We train the regression model using MLP. In this work, we use an MLP with $3$ hidden layers and $400$ neurons in each hidden layer with ReLU activation functions. We note that packet loss information can also be used as an input feature to estimate the end-to-end delay. However, this unnecessarily increases the complexity of the problem and the design of the MLP.
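The regressor input can be assembled in the same spirit as the data point $\mathcal{C}$. The sketch below is an assumption-laden illustration: `build_regression_input` and `MLP_CONFIG` are hypothetical names, flows are taken as ordered node pairs, and the link endpoints are appended as plain numbers. Only the layer count, layer width and activation in `MLP_CONFIG` come from the text.

```python
from itertools import permutations

# Architecture stated in the text; the dict itself is a hypothetical holder.
MLP_CONFIG = {"hidden_layers": (400, 400, 400), "activation": "relu"}


def build_regression_input(rate, num_nodes, tentative_link):
    """Concatenate aggregate flow rates with the endpoints of L1."""
    pairs = permutations(range(1, num_nodes + 1), 2)
    # Assumption: a flow without a measurement contributes 0.0.
    features = [rate.get(pair, 0.0) for pair in pairs]
    s_l, d_l = tentative_link
    return features + [float(s_l), float(d_l)]


V = 3
rate = {p: 50.0 for p in permutations(range(1, V + 1), 2)}
x = build_regression_input(rate, V, (1, 2))
print(len(x))  # V * (V - 1) + 2 = 8
```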

III-B3 Stage 3 – Link Reconnection Classification

Given that a link reconnection has been identified by the second stage, the third stage of ML-LFIL localizes both the disconnected link ($L_{2}$) and the reconnected link ($L_{3}$) using a link reconnection classifier. The disconnected link $L_{2}$ might differ from or coincide with the tentative disconnected link $L_{1}$, depending on the accuracy of the link disconnection classifier. Similar to link disconnection classification, link reconnection classification in the third stage of ML-LFIL (ML-LFIL-S3) is also a multi-class machine learning classification problem. However, each class in this problem consists of a pair of a disconnected link and a reconnected link. All three traffic features are used in the learning model, which is trained using SVM, MLP or RF.

III-C Illustrative Example

We illustrate the working of ML-LFIL with an example. Given a $10$ -node network as depicted in Fig. 3, all the flows are routed on the shortest paths. Consider three flows with source and destination as $\langle{}1,3\rangle$ , $\langle{}1,4\rangle$ and $\langle{}1,8\rangle$ . Fig. 3(a) depicts the paths traversed by the flows in normal working conditions. Upon disconnection of the link ( $1-2$ ), the affected flows are rerouted and the paths traversed by the affected flows are depicted in Fig. 3(b). We can see that the two affected flows $\langle{}1,3\rangle$ and $\langle{}1,4\rangle$ are rerouted through alternate paths, which are longer than the paths used in normal working conditions, leading to longer propagation delay. Further, all of the flows now traverse through the same link ( $1-8$ ) and share a limited amount of bandwidth. Thus, they experience additional delay due to congestion on link ( $1-8$ ), which is a sign of link fault. Similarly, packet loss is a useful measurement to identify link disconnections. Upon a link disconnection, all the packets sent through the disconnected link are dropped before an alternate path is found, thus leading to increased packet loss. The longer the time needed for the network to find an alternate path, the higher the packet loss. In the scenario shown in Fig. 3(b), upon disconnection of link ( $1-2$ ), the two flows $\langle{}1,3\rangle$ and $\langle{}1,4\rangle$ experience higher packet loss.

Fig. 3(c) depicts the routing solutions when link ($1-2$) is replaced by link ($1-9$). Except for flow $\langle{}1,8\rangle$, both flows $\langle{}1,3\rangle$ and $\langle{}1,4\rangle$ traverse different paths. This change in the routing paths of the flows affects their traffic measurements. However, the effect might not be significant when compared to the case of link disconnection alone. Indeed, even though link ($1-2$) is disconnected, the path length of flow $\langle{}1,4\rangle$ on the alternate path remains unchanged ($3$ hops) and the alternate path of flow $\langle 1,3\rangle$ is longer by only $1$ hop. Thus, the link disconnection classifier (the first stage of ML-LFIL) might not be able to correctly localize the disconnected link. Using the regression model to estimate the end-to-end delay of flows given a link fault allows us to infer whether a link reconnection has occurred, and thus to localize both the disconnected link and the reconnected link.
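The effect described in this example can be reproduced with a breadth-first search over hop counts. The $10$-node edge list below is a hypothetical topology chosen so that the hop counts behave as described in the text; it is not the topology of Fig. 3, and `hop_count` is an illustrative helper.

```python
from collections import deque


def hop_count(edges, source, target):
    """Shortest-path length in hops over undirected edges, or None."""
    adjacency = {}
    for u, v in edges:
        adjacency.setdefault(u, set()).add(v)
        adjacency.setdefault(v, set()).add(u)
    distance = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if node == target:
            return distance[node]
        for neighbor in adjacency.get(node, ()):
            if neighbor not in distance:
                distance[neighbor] = distance[node] + 1
                queue.append(neighbor)
    return None


# Hypothetical 10-node topology (NOT the one in Fig. 3).
normal = {(1, 2), (2, 3), (3, 4), (1, 8), (8, 7), (7, 6), (6, 4),
          (5, 6), (9, 10), (10, 3), (10, 4)}
disconnected = normal - {(1, 2)}         # link (1-2) fails
reconnected = disconnected | {(1, 9)}    # link (1-2) replaced by (1-9)

flows = [(1, 3), (1, 4), (1, 8)]
for name, topo in [("normal", normal), ("disconnected", disconnected),
                   ("reconnected", reconnected)]:
    print(name, [hop_count(topo, s, d) for s, d in flows])
# normal [2, 3, 1]
# disconnected [5, 4, 1]
# reconnected [3, 3, 1]
```

In this hypothetical topology, disconnection alone lengthens both affected paths noticeably, whereas reconnection leaves flow $\langle 1,4\rangle$ at $3$ hops and adds a single hop to flow $\langle 1,3\rangle$, which is why delay alone may not expose the fault to the first stage.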

IV Performance Study

IV-A Simulation Settings and Data Collection

We implement ML-LFIL and carry out experiments to evaluate its performance using the Mininet platform. We consider two complex network topologies: a $30$ -node network with $36$ links and a $60$ -node network with $68$ links. The two networks are created based on the small-world complex network model that emulates each network node with $4$ neighbors and the probability of adding another edge for each node is $0.35$ . The traffic between the nodes in the networks is generated using the iperf3 tool with the rates between each node pair randomly chosen in the range $[1,300]$ Mbps. We also evaluate our proposed ML-LFIL method on the Interroute topology, which has $100$ nodes and $120$ links, with a realistic traffic trace. We use Wireshark to capture the traffic measurements, i.e., aggregate flow rate, end-to-end delay and packet loss, in normal working conditions and in different link fault scenarios. To emulate link disconnections, we randomly remove a link from the network topology in Mininet while traffic flows are being forwarded across the network. All the affected flows are then rerouted through alternate paths. We collect the traffic measurements for multiple link disconnection scenarios to train the machine learning models in the first two stages of ML-LFIL (link disconnection classification and link fault identification). Similarly, to emulate link reconnections, we randomly remove one link and add a new link to the network that has the same source node as the removed link. Different link reconnection scenarios are generated by removing and adding different links between different nodes in the network. The traffic measurements captured in the different link reconnection scenarios are used to train the third stage of ML-LFIL (link reconnection classification). Following the data collection, the data is preprocessed using normalization techniques and Principal Component Analysis (PCA) to improve the performance of the machine learning algorithms.
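The way fault scenarios are generated can be sketched on a plain edge set, independently of Mininet. This is only an illustration under stated assumptions: links are stored as `(source, destination)` tuples, and the helper names `disconnect` and `reconnect` are hypothetical rather than part of the authors' tooling.

```python
import random


def disconnect(edges, rng):
    """Remove one randomly chosen link; return (remaining_edges, removed_link)."""
    removed = rng.choice(sorted(edges))
    return edges - {removed}, removed


def reconnect(edges, nodes, rng):
    """Remove a random link (u, v) and add a new link (u, w) with the same source u.

    Returns (new_edges, removed_link, added_link); added_link is None when
    node u has no other node left to connect to.
    """
    remaining, (u, v) = disconnect(edges, rng)
    candidates = [w for w in sorted(nodes)
                  if w != u and w != v
                  and (u, w) not in remaining and (w, u) not in remaining]
    if not candidates:
        return remaining, (u, v), None
    w = rng.choice(candidates)
    return remaining | {(u, w)}, (u, v), (u, w)


# Usage on a small, assumed 6-node ring.
rng = random.Random(0)
nodes = set(range(1, 7))
ring = {(i, i % 6 + 1) for i in range(1, 7)}
after_fault, removed = disconnect(ring, rng)
after_move, old_link, new_link = reconnect(ring, nodes, rng)
print(removed, old_link, new_link)
```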

For the $30$ -node topology, we train the link disconnection classifier in the first stage and the delay regressor in the second stage of ML-LFIL with $28,000$ data points for each class of link disconnections, i.e., we use about $800,000$ data points in the training data set. The test data set includes $200,000$ data points. The link reconnection classifier in the third stage of ML-LFIL is trained and tested using $500,000$ data points and $100,000$ data points, respectively. For the $60$ -node topology, we train the link disconnection classifier and the delay estimator with $60,000$ data points for each class, i.e., about $3,200,000$ data points in the training data set. The test data set has $800,000$ data points. Similarly, we train and test the link reconnection classifier with $800,000$ and $200,000$ data points, respectively. For the 100-node Interroute topology, we train the link disconnection classifier (ML-LFIL-S1) and the delay estimator (ML-LFIL-S2) with $8,000$ data points for each class, i.e., about $672,000$ data points in the training data set. The test data set has $160,000$ data points. We train and test the link reconnection classifier with $600,000$ and $100,000$ data points, respectively.

IV-B Performance Metrics

We use the following performance metrics to evaluate the performance of different machine learning algorithms:

• Precision: The ratio of the number of link faults correctly classified over the total number of data points classified as faults. The precision value is computed as follows:

$$\mathcal{P} = \frac{T_{P}}{T_{P} + F_{P}}$$

where $\mathcal{P}$ is the precision value, $T_{P}$ is the number of “true positives” and $F_{P}$ is the number of “false positives”.

• Recall: The ratio of the number of data points associated with link faults correctly classified over the total number of data points associated with link faults that have occurred. The recall value is given by:

$$\mathcal{R} = \frac{T_{P}}{T_{P} + F_{N}}$$

where $\mathcal{R}$ is the recall value and $F_{N}$ is the number of “false negatives”.

• $F_{1}$ -Score: The $F_{1}$ -Score is the harmonic average of the precision and recall values. It takes a value in the range $[0,1]$ . The higher the value of $F_{1}$ -Score, the better the performance of the machine learning technique, i.e., we obtain perfect precision and recall values when $F_{1}$ -Score reaches $1$ . It is computed as follows:

$$F_{1} = \frac{2\,\mathcal{P}\,\mathcal{R}}{\mathcal{P} + \mathcal{R}}$$

where $\mathcal{P}$ and $\mathcal{R}$ are the precision and recall values, respectively.

• $R^{2}$ -Score: The goodness of fit of the MLP used in the delay regressor (ML-LFIL-S2). It takes values in the range $[0,1]$ . The higher the value of the $R^{2}$ -Score, the better the performance of the learning model.

• Fault detection accuracy: The ratio of the number of data points associated with link faults detected (regardless of the correctness of the tentative disconnected link) over the total number of data points associated with link faults that have actually occurred.

• Fault localization time: The time taken to localize a link upon its fault. We compare ML-LFIL with a ping-based active probing approach that sends signaling messages to all the nodes in the network to obtain traffic information, which is then analyzed to localize the link fault.
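The classification and regression metrics listed above can be computed from raw counts and predictions as in the following minimal sketch. It uses the standard textbook definitions; the function names are illustrative, and returning 0 for an empty denominator is an assumed convention, not something stated in the paper.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def precision_recall_f1(tp, fp, fn):
    """Precision, recall and F1 from true-positive, false-positive and false-negative counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall, f1_score(precision, recall)


def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot (y_true must not be constant)."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


print(precision_recall_f1(tp=90, fp=10, fn=30))
print(f1_score(0.986, 0.981))
```

As a consistency check, `f1_score(0.986, 0.981)` evaluates to roughly 0.9835, which agrees with the $F_{1}$ -Score of about $98.4\%$ reported for RF on the $30$ -node network.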

IV-C Analysis of Results

IV-C1 Performance with the $30$ -node network

In this section, we evaluate the performance of ML-LFIL with the $30$ -node network. We first evaluate the performance of the first stage of ML-LFIL, denoted as ML-LFIL-S1, since its performance affects the overall performance of ML-LFIL. Indeed, if it cannot classify the traffic features associated with a link fault and returns a “no-link-fault” message, ML-LFIL will stop without further analysis. We first consider the fault scenarios where only link disconnections are present. Both the precision and recall values on the training data set are close to $100\%$ for all the machine learning algorithms (SVM, MLP and RF) used to train ML-LFIL-S1. This demonstrates that the link disconnection classifier in ML-LFIL-S1 has been well trained.

In Fig. 4, we plot the precision, recall and $F_{1}$ -Score values of all the algorithms used to train ML-LFIL-S1 on the test data set. The results show that ML-LFIL-S1 achieves high performance of at least $90\%$ of precision, recall and $F_{1}$ -Score values. The results also show that all the algorithms have a high precision value. This means that all the algorithms have a low false positive rate. Similarly, the high recall value implies a low false negative rate. We can observe from Fig. 4 that there is a lower false positive rate compared to the false negative rate. Among the machine learning algorithms, RF algorithm outperforms SVM and MLP algorithms with a precision of about $98.6\%$ , a recall of about $98.1\%$ and $F_{1}$ -Score of about $98.4\%$ . This shows that the learning model trained with RF algorithm classifies link disconnection with minimal misclassification or noise.

We now consider the fault scenarios where both link disconnections and link reconnections are present. We observe that the precision, recall and $F_{1}$ -Score values of all the algorithms decrease due to misclassification as shown in Fig. 5. This is because the link disconnection classifier in ML-LFIL-S1 is trained only with the traffic features captured from the link disconnection scenarios. Thus, it will misclassify a data point if the traffic behavior in a link reconnection scenario is similar to that in a link disconnection scenario. Nevertheless, the RF algorithm always outperforms the other algorithms with a precision of $88.1\%$ , a recall of $86.5\%$ and an $F_{1}$ -Score of $87.3\%$ . This performance drop explains why we develop the regression model in the second stage of ML-LFIL to identify whether or not a link reconnection has occurred. We note that since RF has the best performance among the machine learning algorithms used to train ML-LFIL-S1, we use RF for the first stage, ML-LFIL-S1, when evaluating the performance of the subsequent stages.

In Fig. 6, we present the accuracy for different MLP architectures, i.e., different numbers of hidden layers and numbers of units in each hidden layer. The MLP with 3 hidden layers and 400 neurons performs better than the other combinations, with an $R^{2}$ -Score of $98\%$ , $93\%$ and $87\%$ for the 30-node, 60-node and 100-node Interroute topologies, respectively. A deeper and larger neural network is possible, but it would yield similar accuracy at much higher complexity and would thus be overkill for our problem. Using the best MLP architecture with 3 hidden layers and 400 neurons, in Fig. 7, we present the accuracy of ML-LFIL-S2 in identifying the link fault, i.e., whether a link reconnection has occurred or not, for different threshold values. It can be seen that the $F_{1}$ -Score is maximum for the threshold value of $10\%$ in the threshold comparator module, with $98.5\%$ , $97.2\%$ and $96\%$ for the 30-node, 60-node and 100-node Interroute topologies, respectively. A high threshold value causes link reconnections to go undetected and to be identified as mere link disconnections, whereas a low threshold value causes link disconnections to be identified as link reconnections. Thus, the threshold value is crucial in distinguishing between link reconnections and link disconnections in the network. We use the threshold value of $10\%$ for the remaining experiments.
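The role of the threshold can be illustrated with a minimal sketch of a threshold comparator. The exact comparison rule is not restated in this section, so the sketch assumes that the comparator measures the relative deviation between the measured end-to-end delay and the delay estimated by the regressor; the function name and return labels are hypothetical.

```python
def identify_link_fault(measured_delay, estimated_delay, threshold=0.10):
    """Label a data point as 'reconnection' or 'disconnection'.

    Assumption: a reconnection is inferred when the measured delay deviates
    from the regressor's estimate (for the tentative disconnected link) by
    more than `threshold`, expressed as a fraction of the estimate.
    """
    if estimated_delay <= 0:
        raise ValueError("estimated_delay must be positive")
    deviation = abs(measured_delay - estimated_delay) / estimated_delay
    return "reconnection" if deviation > threshold else "disconnection"


# A 5% deviation stays below the 10% threshold, a 20% deviation exceeds it.
print(identify_link_fault(10.5, 10.0))
print(identify_link_fault(12.0, 10.0))
```

Under this assumption, raising `threshold` makes fewer data points cross it, so more reconnections are labeled as plain disconnections; lowering it has the opposite effect, which matches the trade-off described above.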

When we use ML-LFIL with all three stages on the test data set that includes both link disconnections and link reconnections, we obtain a significant performance improvement. In Fig. 8, we present the precision, recall and $F_{1}$ -Score of ML-LFIL. We note that, since we need to have high accuracy in the initial stages to achieve high accuracy of the overall model, we present the results of the overall ML-LFIL model, by evaluating the third stage with different classifiers in Fig. 8, with the first stage trained with RF and the second stage trained with MLP. The results show that RF algorithm always outperforms the other algorithms with $97.5\%$ of precision, $96.46\%$ of recall and $97\%$ of $F_{1}$ -Score. It is to be noted that we use the same algorithm for both link disconnection classifier in the first stage and link reconnection classifier in the third stage of ML-LFIL. SVM algorithm has the worst performance among the algorithms. Nevertheless, it achieves $87.8\%$ of precision, $85.5\%$ of recall and $86.7\%$ of $F_{1}$ -Score.

IV-C2 Performance with the $60$ -node network

In this section, we evaluate the performance of ML-LFIL with a $60$ -node network. Similar to the analysis with the $30$ -node topology, we evaluate the performance of ML-LFIL-S1 and then ML-LFIL as a whole. In Fig. 9, we present the precision, recall and $F_{1}$ -Score values of ML-LFIL-S1 for the test data set that includes only link disconnections. We observe a performance trend similar to that of the previous section. The RF algorithm performs the best with $97.7\%$ , $95.8\%$ and $96.7\%$ of precision, recall and $F_{1}$ -Score, respectively.

When considering both link disconnections and link reconnections, we also observe a performance degradation due to the misclassification. As shown in Fig. 10, RF attains only $86.3\%$ , $83.06\%$ and $84.65\%$ for precision, recall and $F_{1}$ -Score, respectively. In Fig. 11, we present the precision, recall and $F_{1}$ -Score of all the three algorithms used to train the third stage of ML-LFIL in the presence of both link disconnections and link reconnections. We obtain high precision and recall values of $94.56\%$ and $92.4\%$ for ML-LFIL with the RF algorithm. Similarly, we obtain high $F_{1}$ -Score of at least $82.8\%$ for SVM and $93.47\%$ for RF algorithm. This demonstrates the effectiveness of ML-LFIL in identification and localization of link faults even with large-scale complex networks.

IV-C3 Performance with the $100$ -node Interroute network

In this section, we evaluate the performance of ML-LFIL with the Interroute network with $100$ nodes. Similar to the analysis with the previous two topologies, we evaluate the performance of ML-LFIL-S1 and then ML-LFIL as a whole. In Fig. 12, we present the precision, recall and $F_{1}$ -Score values of ML-LFIL-S1 for the test data set that includes only link disconnections. We observe a performance trend similar to that of the previous sections. The RF algorithm performs the best with $96.37\%$ , $95.37\%$ and $95.87\%$ of precision, recall and $F_{1}$ -Score, respectively.

When considering both link disconnections and link reconnections, we also observe a performance degradation due to the misclassification. As shown in Fig. 13, RF attains only $84.3\%$ , $81.1\%$ and $82.7\%$ for precision, recall and $F_{1}$ -Score, respectively. With the output from ML-LFIL-S1 trained by RF and ML-LFIL-S2 trained by MLP, in Fig. 14, we present the precision, recall and $F_{1}$ -Score of all the three algorithms used to train the third stage of ML-LFIL in the presence of both link disconnections and link reconnections. We obtain high precision and recall values of $93.2\%$ and $91.22\%$ for ML-LFIL with the RF algorithm. We also obtain high $F_{1}$ -Score of at least $81.87\%$ for SVM and $92.2\%$ for RF algorithm. This demonstrates the effectiveness of ML-LFIL in identification and localization of link faults even with large networks and realistic traffic traces. A summary of precision, recall and $F_{1}$ -Score values with the test data sets for all the three networks, using the best algorithm at each stage of ML-LFIL is given in Table I.

IV-C4 Fault Detection Accuracy

As discussed earlier, the first stage of ML-LFIL identifies whether or not a link disconnection has occurred, using a link disconnection classifier. When the classifier returns a tentative disconnected link, $L_{1}$ , instead of a “no link fault” message, this output (regardless of its correctness) triggers the execution of the two subsequent stages for further analysis to identify and localize the link fault. The classifier should also not be too sensitive, since many false alarms could then occur and unnecessarily trigger the analysis. We define the fault detection accuracy of ML-LFIL as the ratio of the number of data points associated with link faults that have been detected and triggered the execution of the second and third stages of ML-LFIL over the total number of data points associated with link faults that have actually occurred. In Fig. 15, we present the fault detection accuracy of ML-LFIL for the three networks. It can be seen that the RF algorithm performs better than the other algorithms with fault detection accuracy of about $98.9\%$ , $97.2\%$ and $96.1\%$ for the $30$ -node network, $60$ -node network and the 100-node Interroute network, respectively. This high fault detection accuracy contributes to the superior performance in identifying and localizing link faults in the two subsequent stages of ML-LFIL.
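The definition of fault detection accuracy translates directly into a few lines of code. The sketch below assumes that every data point carries a ground-truth label and a first-stage prediction, each being either a link identifier or the string `"no-link-fault"`; the label encoding and function name are assumptions made for illustration.

```python
NO_FAULT = "no-link-fault"  # assumed label for "no link fault"


def fault_detection_accuracy(true_labels, stage1_predictions):
    """Fraction of truly faulty data points for which stage 1 reports any fault.

    The predicted link does not have to be the correct one: any prediction
    other than NO_FAULT triggers the second and third stages.
    """
    faulty = [(t, p) for t, p in zip(true_labels, stage1_predictions)
              if t != NO_FAULT]
    if not faulty:
        raise ValueError("no data points associated with link faults")
    detected = sum(1 for _, p in faulty if p != NO_FAULT)
    return detected / len(faulty)


truth = ["1-2", "1-2", "3-4", NO_FAULT]
predicted = ["1-2", "3-4", NO_FAULT, "1-2"]
print(fault_detection_accuracy(truth, predicted))
```

In the usage example, the second data point counts as detected even though the wrong link is named, while the false alarm on the fourth data point does not enter the ratio because it is not associated with an actual fault.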

IV-C5 Performance comparison with ping-based active probing

We note that the recent literature considers only link disconnections. Thus, we only compare the performance of the first stage of the proposed method (ML-LFIL-S1) with the existing work [7]. In [7], upon a link fault detection, the fault is localized by realizing two stages: (i) pinging all the nodes in the network to obtain sufficient data, and (ii) analyzing the obtained data. In Table II, we present the accuracy of ML-LFIL-S1 and the ping-based approach for different network topologies in localizing link disconnections in the network. We can see that for the 30-node network the ping-based approach and our ML-LFIL-S1 perform comparably, while for the 60-node and 100-node Interroute network, the ping-based approach performs slightly better than our ML-LFIL-S1. However, the ping-based approach requires significantly longer time than the proposed method to localize a link fault as discussed below.

IV-C6 Fault Localization Time

The time taken to localize a link fault in the network upon its occurrence is an important metric as this affects the failure recovery time. The faster the link fault localization, the less the impact of the fault on the network. In Table III, we present the time taken by ML-LFIL with different machine learning algorithms to localize a link fault in the network. Given a data point, the time taken by ML-LFIL is computed as the time to run all three stages to localize the disconnected and reconnected links. We compare the fault localization time with that of the ping-based active probing approach. Upon a link fault, we measure the time taken by the ping-based active probing approach to: (i) ping all the nodes in the network to obtain sufficient data, and (ii) analyze the obtained data to localize the fault. We note that the time taken to ping all the nodes depends on the propagation delay of the signalling messages on the links. In our experiments with two random complex networks, we consider short connections where the propagation delay of links is randomly chosen in the range $[0.1,0.5]$ $\mu$ s. This corresponds to the distance between nodes being in the range $[20,100]$ meters. For the 100-node network topology, we set the propagation delay based on the actual distance among nodes. The results show that ML-LFIL can localize a link fault on the order of microseconds. The worst-case time of ML-LFIL when using the MLP algorithm is $302.73\mu$ s, whereas the ping-based approach requires significantly longer time to localize a link fault. It is worth mentioning that the fault localization time incurred by ML-LFIL does not vary much with the size of the network, i.e., with the increase in the size of the feature vector evaluated by ML-LFIL. In contrast, the localization time of the ping-based approach increases from $2.9$ ms for the $30$ -node network to $8.2$ ms for the $60$ -node network and $57$ s for the $100$ -node Interroute network. It should also be noted that the fault localization time of the ping-based approach will increase with increasing link lengths. This shows that ML-LFIL enables fast link fault identification and localization.
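The correspondence between the delay range and the distance range can be checked with simple arithmetic. The two ranges quoted above imply a signal propagation speed of $20$ m / $0.1$ $\mu$ s $= 2\times 10^{8}$ m/s (about two thirds of the speed of light in vacuum); the constant and function names in this sketch are assumptions introduced for illustration.

```python
# Propagation speed implied by the quoted ranges: 20 m / 0.1 us = 2e8 m/s (assumed constant).
PROPAGATION_SPEED_M_PER_S = 2.0e8


def link_length_m(delay_s, speed=PROPAGATION_SPEED_M_PER_S):
    """Link length (meters) that yields the given one-way propagation delay (seconds)."""
    return delay_s * speed


def propagation_delay_s(length_m, speed=PROPAGATION_SPEED_M_PER_S):
    """One-way propagation delay (seconds) of a link of the given length (meters)."""
    return length_m / speed


for delay_us in (0.1, 0.5):
    print(delay_us, "us ->", link_length_m(delay_us * 1e-6), "m")
```

Because the propagation delay grows linearly with link length, the probing time of a ping-based approach scales with the physical extent of the network, whereas the inference time of a trained model does not depend on it.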

V Conclusion

In this paper, we developed a three-stage machine learning-based technique for link fault identification and localization (ML-LFIL) in complex networks. ML-LFIL learns the traffic behavior from the measurements captured from the network in normal working conditions and different fault scenarios that include link disconnections and link reconnections. The learning model takes into account the aggregate flow rate, end-to-end delay and packet loss captured at ingress and egress nodes. We trained ML-LFIL using different learning algorithms that include SVM, MLP and RF. We carried out comprehensive experiments on the Mininet platform with two small-world complex networks and the 100-node Interroute network from the Internet Topology Zoo to study the performance of ML-LFIL. The results show that ML-LFIL achieves high performance in identification and localization of link faults, with an accuracy of up to $97\%$ . We also compared ML-LFIL with a ping-based active probing approach. The results show that ML-LFIL requires significantly shorter time than the ping-based approach to achieve similar accuracy in link fault localization.

References

[1] D. Evans, “The Internet of Things: How the Next Evolution of the Internet Is Changing Everything,” White Paper, Cisco, Apr. 2011.
[2] W. Xu et al., “Internet of Vehicles in Big Data Era,” IEEE/CAA Journal of Automatica Sinica, vol. 5, no. 1, pp. 19–35, 2018.
[3] A. Banerjee, “Assurance of Real-Time Cloud Services Requires Insights from Correlated Content, Sessions and IP Topology Planes,” White Paper, Heavy Reading, Aug. 2012.
[4] P. Murali Mohan et al., “Fault tolerance in TCAM-limited software defined networks,” Computer Networks, vol. 116, pp. 47–62, 2017.
[5] D. Staessens et al., “Software Defined Networking: Meeting Carrier Grade Requirements,” in IEEE LANMAN 2011, Chapel Hill, USA, 2011.
[6] N. L. M. V. Adrichem, B. J. V. Asten, and F. A. Kuipers, “Fast Recovery in Software-Defined Networks,” in EWSDN 2014, London, Sep. 2014.
[7] M. X. Cheng and W. B. Wu, “Data Analytics for Fault Localization in Complex Networks,” IEEE Internet Things J., vol. 3, no. 5, Oct. 2016.
[8] M. Wang, Y. Cui, X. Wang, S. Xiao, and J. Jiang, “Machine Learning for Networking: Workflow, Advances and Opportunities,” IEEE Network, vol. 32, no. 2, pp. 92–99, Mar. 2018.
