Detecting Network Soft-failures with the Network Link Outlier Factor (NLOF)
Christopher Mendoza, Venkat Dasari, Michael P. McGarry

TL;DR
This paper introduces NLOF, a method using NetFlow data to detect soft-failures in communication networks by identifying outliers in flow throughput, enabling early failure detection across the network.
Contribution
The paper presents NLOF, a novel outlier detection approach based on clustering throughput data to identify soft-failures in network links.
Findings
NLOF scores correlate strongly with actual link failures.
NLOF effectively detects soft-failures across large networks.
Experimental evaluation confirms NLOF's accuracy and robustness.
Abstract
In this paper, we describe and experimentally evaluate the performance of our Network Link Outlier Factor (NLOF) for detecting soft-failures in communication networks. The NLOF is computed using the throughput values derived from NetFlow records. The flow throughput values are clustered in two stages, outlier values are determined within each cluster, and the flow outliers are used to compute the outlier factor or score for each network link. When sampling NetFlow records across the full span of a network, NLOF enables the detection of soft-failures across the span of the network; large NLOF scores correlate well with links experiencing failure.
Table I: Simulation configuration for each test.
| Test | Links with Errors | Error Rate | Topology | Throughput Classes |
|---|---|---|---|---|
| 1 | None | 0 | 1 | 100 Kbps, 1 Mbps |
| 2 | ('129.108.40.2', 'US1') | 0.1 | 1 | 100 Kbps, 1 Mbps |
| 3 | | 0.1 | 1 | 100 Kbps, 1 Mbps |
| 4 | None | 0 | 2 | 100 Kbps, 1 Mbps |
| 5 | ('BS', 'R4') | 0.1 | 2 | 100 Kbps, 1 Mbps |
| 6 | | | 2 | 10 Kbps, 100 Kbps, 1 Mbps, 2 Mbps |
Table II: NLOF results for topology 1 (Tests 1 to 3).
| Test 1 Link | Test 1 NLOF | Test 2 Link | Test 2 NLOF | Test 3 Link | Test 3 NLOF |
|---|---|---|---|---|---|
| ('128.163.217.2', 'UKS') | 0 | ('129.108.40.2', 'US1') | 0.084057971 | ('129.108.42.4', 'US3') | 0.462962963 |
| ('128.163.217.3', 'UKS') | 0 | ('US1', 'Router') | 0.014066496 | ('129.108.41.3', 'US2') | 0.324675325 |
| ('128.163.217.4', 'UKS') | 0 | ('129.108.40.4', 'US1') | 0.005957447 | ('US3', 'Router') | 0.293251534 |
| ('129.108.40.2', 'US1') | 0 | ('200.17.30.4', 'BS') | 0.005957447 | ('129.108.40.2', 'US1') | 0.259668508 |
| ('129.108.40.3', 'US1') | 0 | ('129.108.42.4', 'US3') | 0.005076142 | ('129.108.40.3', 'US1') | 0.12716763 |
Table III: NLOF results for topology 2 (Tests 4 to 6).
| Test 4 Link | Test 4 NLOF | Test 5 Link | Test 5 NLOF | Test 6 Link | Test 6 NLOF |
|---|---|---|---|---|---|
| ('128.163.217.2', 'UKS') | 0 | ('200.17.30.4', 'BS') | 0.383966245 | ('128.163.217.2', 'UKS') | 0.108504399 |
| ('128.163.217.3', 'UKS') | 0 | ('BS', 'R4') | 0.330143541 | ('129.108.40.2', 'US1') | 0.084057971 |
| ('128.163.217.4', 'UKS') | 0 | ('R1', 'R4') | 0.330143541 | ('129.108.42.4', 'US3') | 0.043147208 |
| ('129.108.40.2', 'US1') | 0 | ('R1', 'R2') | 0.179347826 | ('US3', 'R2') | 0.032704403 |
| ('129.108.40.3', 'US1') | 0 | ('129.108.40.3', 'US1') | 0.147058824 | ('UKS', 'R3') | 0.024746193 |
Christopher Mendoza, Department of Electrical and Computer Engineering, University of Texas at El Paso, El Paso, Texas 79968
Venkat Dasari, U.S. Army Research Laboratory, Aberdeen Proving Ground, MD
Michael P. McGarry, Department of Electrical and Computer Engineering, University of Texas at El Paso, El Paso, Texas 79968
I Introduction
The collection of network data and the application of data analytics (including machine learning) allow the development of technologies to automate network management. Network management is best characterized using the FCAPS model from the ITU:
- Fault detection and correction
- Configuration and operation
- Accounting and billing
- Performance assessment and optimization
- Security assurance and protection
In this work we seek to advance the automation of network fault detection, specifically for network soft-failures that result in diminished performance. The symptoms of soft-failures are subtle and are therefore difficult to diagnose manually: increased bit errors, occasional packet loss, unnecessarily long paths through the network, or congestion control mechanisms unnecessarily reducing throughput. We utilize a suite of data analytics (e.g., clustering and outlier detection) to detect the occurrence of network soft-failures such as bit errors and packet loss. The end result of these data analytics is an outlier score for each network link called the Network Link Outlier Factor (NLOF). The NLOF score is an indicator of how likely it is that a link is experiencing a network soft-failure.
I-A Related work
A survey [1] of recent fault localization techniques expands on the taxonomy presented in [2]. The taxonomy presented consists of three categories of fault localization: Artificial Intelligence, Model Traversing, and Graph-theoretic.
Much of the related work implements active probing techniques [3, 4, 5]. These techniques use probe messages to infer the state of links and require optimal probe placement [6, 7, 8] to trade off measurement against resource consumption. More recent work [9] uses passive data (e.g., number of flows, number of lost packets, and average packet delay) and compares the performance of several machine learning techniques (e.g., random forests and multi-layer perceptrons) to localize faults. Their results are compared to the active probing technique in [5]. Other recent work in the domain of optical networking uses passively collected physical layer data and machine learning to detect and/or localize faults [10, 11].
Some related work uses a hybrid approach [12, 13] that mixes active probing with passive data collection. In [12] the authors present a fault localization framework named Active Integrated fault Reasoning (AIR). AIR uses passive monitoring to compile a set of observed symptoms. The framework then generates sets of faults that may be causing the symptoms. Each of these fault sets is tested to determine whether it is credible. If none are credible, then there is likely to be a symptom that was not observed by the passive monitoring. Active probes are then used to check whether the likely symptom is present. Afterward, the sets of faults go through the credibility test again.
Software Defined Networks [14, 15], including OpenFlow [16], along with advances in machine learning, are sparking a resurgence in network fault localization. Software Defined Networking allows for a broader view of the network, providing a simpler way to obtain network topology information [17] for fault localization.
As far as we know, our work is the first to use passive NetFlow and topology data to detect/localize network faults.
I-B Outline
In Section II we describe our suite of data analytics resulting in the NLOF score for each network link. In Section III we describe our NS-3 experiments to evaluate the failure detection performance of NLOF and in Section IV we present and discuss the results of those experiments. Finally, we discuss our conclusions and outline paths for future work in Section V.
II Detecting Network Link Soft Failures
We propose a method to detect network link soft-failures using NetFlow data. Specifically, we use the average throughput of flows, computed from their collected NetFlow records. If the collected data consists of flows traversing the full span of the network, we believe it will allow us to detect soft-failures anywhere in the topology. We propose using flow throughput outlier detection to assist with the detection of network link soft-failures. Using topology and routing information, we can correlate flows with the network links they traverse. We hypothesize that a network link experiencing a soft-failure will cause the flows traversing that link to exhibit abnormal throughput. Therefore, a network link carrying many flows that are throughput outliers is likely to be one that is experiencing a soft-failure.
Using outlier detection directly on the average throughput of all flows requires the assumption that a majority of flows do not have abnormal throughput. Since this may not be a reasonable assumption, we first cluster the throughput of flows into the set of clusters we believe will naturally exist in a network and then identify the outliers within those throughput clusters. Our full technique to detect network link soft-failures consists of: 1) flow throughput clustering, 2) flow throughput outlier detection using an outlier score, 3) tracing flows on the network topology using routing information, and 4) network link outlier score computation from flow outlier scores. Figure 1 illustrates this 4-stage analytics pipeline for detecting network link soft-failures.
II-A Flow Throughput Clustering (DBSCAN and TPCluster)
Network link soft-failures could cause a majority of flows to exhibit reduced throughput. As a result, we cannot immediately apply outlier detection techniques to the average flow throughput values. We propose to first organize average flow throughput into clusters. Network topologies will generally employ several network link transmission rates and the average flow throughput values will be limited by those values. In isolation, the average throughput of a flow will be limited by the bottleneck network link that it traverses. Let $b$ be either the original generated throughput (or bitrate) of the flow or the bitrate at which the flow enters the network of interest, and let $r_i$ be the network link rate of the $i$th network link a flow traverses in the network of interest, for $i = 1, \ldots, m$. Then, in isolation, the average throughput $T$ of that flow will be:
$$T = \min\left(b, r_1, r_2, \ldots, r_m\right)$$
Flows will often share network links with other flows. Let $n_i$ be the number of flows sharing the $i$th network link a flow traverses. Let's assume that flows share network links equally. Then, while sharing network links with other flows, the average throughput of that flow will be:
$$T = \min\left(b, \frac{r_1}{n_1}, \frac{r_2}{n_2}, \ldots, \frac{r_m}{n_m}\right)$$
Suppose we had a network topology with two different link rates, 1 Gbps and 100 Mbps and a network link was never shared by more than 4 flows at a time. In this case we would have 8 different values for average flow throughput in descending order (1 Gbps, 500 Mbps, 333 Mbps, 250 Mbps, 100 Mbps, 50 Mbps, 33.3 Mbps, and 25 Mbps). Network links experiencing soft-failures will reduce these average flow throughput values for flows traversing those links. Therefore, any average flow throughput values deviating significantly from these 8 discrete values are likely affected by a network soft-failure.
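The eight values in this example follow directly from the equal-sharing assumption. The snippet below is a minimal sketch that enumerates them; the function name `expected_throughput_classes` and its parameters are illustrative and do not come from the paper.

```python
def expected_throughput_classes(link_rates_mbps, max_sharing):
    """Return the distinct per-flow throughput values (Mbps), highest first,
    assuming every link is shared equally by 1..max_sharing flows."""
    classes = {
        rate / n_flows
        for rate in link_rates_mbps
        for n_flows in range(1, max_sharing + 1)
    }
    return sorted(classes, reverse=True)


# Two link rates (1 Gbps and 100 Mbps), at most 4 flows per link.
classes = expected_throughput_classes([1000, 100], max_sharing=4)
print([round(c, 1) for c in classes])
# [1000.0, 500.0, 333.3, 250.0, 100.0, 50.0, 33.3, 25.0]
```

A flow whose measured throughput sits well below the nearest of these values is a candidate for having crossed a degraded link.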
Since we generally do not know the set of network link transmission rates nor the number of flows sharing network links, we let unsupervised machine learning (specifically, clustering) find these values for us. We select Density-Based Spatial Clustering of Applications with Noise (DBSCAN) [18] as our first stage of clustering since it naturally finds dense and potentially non-convex clusters without knowing the number of clusters a priori. DBSCAN produces a number of clusters as well as a set of data points labeled as "noise" that do not fit into any of the clusters.
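To make the roles of clusters and noise concrete for one-dimensional throughput data, the following is a dependency-free sketch of DBSCAN specialized to scalar values. It is a stand-in for illustration only: the experiments use scikit-learn's implementation, and the function name `dbscan_1d`, the sample values, and the rule that a border point joins its nearest core point are assumptions made here.

```python
from bisect import bisect_left, bisect_right

NOISE = -1


def dbscan_1d(values, eps, min_samples):
    """Label scalar points with DBSCAN semantics; returns one label per value."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    xs = [values[i] for i in order]
    # A core point has at least min_samples points (itself included) within eps.
    is_core = [
        bisect_right(xs, x + eps) - bisect_left(xs, x - eps) >= min_samples
        for x in xs
    ]
    sorted_labels = [NOISE] * len(xs)
    cluster, last_core = -1, None
    # In one dimension, consecutive core points share a cluster exactly when
    # the gap between them is at most eps.
    for i, x in enumerate(xs):
        if is_core[i]:
            if last_core is None or x - xs[last_core] > eps:
                cluster += 1
            sorted_labels[i] = cluster
            last_core = i
    # Border points join the nearest core point within eps; the rest stay noise.
    core_positions = [i for i, core in enumerate(is_core) if core]
    for i, x in enumerate(xs):
        if is_core[i]:
            continue
        reachable = [j for j in core_positions if abs(xs[j] - x) <= eps]
        if reachable:
            nearest = min(reachable, key=lambda j: abs(xs[j] - x))
            sorted_labels[i] = sorted_labels[nearest]
    labels = [NOISE] * len(values)
    for position, original_index in enumerate(order):
        labels[original_index] = sorted_labels[position]
    return labels


# Hypothetical flow throughputs in Kbps: two dense groups and one stray flow.
throughputs = [98, 99, 100, 101, 102, 995, 998, 1000, 1002, 1005, 600]
labels = dbscan_1d(throughputs, eps=10, min_samples=4)
print(labels)  # [0, 0, 0, 0, 0, 1, 1, 1, 1, 1, -1]
```

The stray 600 Kbps flow is labeled as noise (-1), which is the kind of point the second clustering stage later tries to place into a throughput class.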
A second stage of clustering is performed since the flows affected by soft-failures may begin to form their own dense cluster. In this second stage that we call TPCluster, we want to combine adjacent clusters if they are within a proximity to each other that suggests one may be a performance degraded set of the other. TPCluster provides a dynamic range based on the throughput context of the DBSCAN clusters. TPCluster uses two parameters, throughput ratio (tpr) and throughput deviation (tpdev). tpr should be set to the maximum reasonable performance degradation of a throughput class and tpdev should be set to the deviation that you might expect to see from a throughput class, to help cluster the flows labeled as “noise” into the appropriate cluster. Algorithm 1 shows how TPClusters are formed.
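Algorithm 1 is not reproduced in this text, so the following is only one plausible reading of how tpr and tpdev could be applied, not the authors' procedure. It assumes that a slower DBSCAN cluster is merged into a faster one when it has lost at most a fraction tpr of the faster cluster's throughput, and that a noise point joins a class when it lies between (1 - tpr) and (1 + tpdev) times the class reference. The function name, the data layout, and both rules are assumptions.

```python
def tpcluster(cluster_centers, noise_points, tpr, tpdev):
    """Group DBSCAN cluster centers into throughput classes and place noise points.

    Returns (groups, unassigned), where each group is a dict with the keys
    'reference', 'centers' and 'noise'.
    """
    groups = []
    for center in sorted(cluster_centers, reverse=True):
        # Assumed merge rule: treat a slower cluster as a degraded copy of the
        # current class if it lost at most a fraction tpr of the reference.
        if groups and center >= (1 - tpr) * groups[-1]["reference"]:
            groups[-1]["centers"].append(center)
        else:
            groups.append({"reference": center, "centers": [center], "noise": []})

    unassigned = []
    for x in noise_points:
        for group in groups:
            low = (1 - tpr) * group["reference"]
            high = (1 + tpdev) * group["reference"]
            # Assumed placement rule for points DBSCAN labeled as noise.
            if low <= x <= high:
                group["noise"].append(x)
                break
        else:
            unassigned.append(x)
    return groups, unassigned


# Hypothetical DBSCAN output in Kbps: the 920 cluster is a degraded 1000 class.
groups, unassigned = tpcluster(
    cluster_centers=[2000, 1000, 920, 100, 10],
    noise_points=[1500, 750, 105, 400],
    tpr=0.3,
    tpdev=0.1,
)
print([g["reference"] for g in groups])  # [2000, 1000, 100, 10]
print(unassigned)                        # [400]
```

Under this reading, tpr = 0.3 keeps the 1 Mbps and 2 Mbps classes apart (1000 is below 70% of 2000) while still absorbing a cluster that has degraded by 8%.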
II-B Flow Outlier Factor (FOF)
Now that TPClusters have been defined, we must choose a point in each cluster to be the representative "normal" point, i.e., the point with reasonably desirable performance. One method is simply to use the point in the cluster with the highest performance; however, we propose to apply k-means clustering and use a cluster center as the representative point. The larger k is, the more aggressive the representative point will be. Algorithm 2 shows how the Flow Outlier Factor (FOF) of each flow is computed.
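Since Algorithm 2 is not reproduced here, the sketch below illustrates the idea under explicit assumptions: the representative point is taken to be the highest k-means center within a TPCluster, and the FOF of a flow is taken to be its relative throughput shortfall from that point, floored at zero. Both choices, the helper names, and the sample values are illustrative and may differ from the authors' definitions. Throughputs are assumed to be positive.

```python
def kmeans_1d(values, k, iterations=50):
    """Plain Lloyd's algorithm on scalars with evenly spread initial centers."""
    lo, hi = min(values), max(values)
    centers = [lo + (hi - lo) * (i + 0.5) / k for i in range(k)]
    for _ in range(iterations):
        buckets = [[] for _ in range(k)]
        for v in values:
            nearest = min(range(k), key=lambda c: abs(v - centers[c]))
            buckets[nearest].append(v)
        # An empty bucket keeps its previous center.
        updated = [
            sum(bucket) / len(bucket) if bucket else centers[i]
            for i, bucket in enumerate(buckets)
        ]
        if updated == centers:
            break
        centers = updated
    return centers


def flow_outlier_factors(throughputs, k=2):
    """Assumed FOF: relative shortfall from the highest k-means center."""
    representative = max(kmeans_1d(throughputs, k))
    return [max(0.0, (representative - t) / representative) for t in throughputs]


# Hypothetical throughputs (Kbps) of the flows in one TPCluster.
cluster_throughputs = [1000, 990, 1010, 1000, 900, 850]
fofs = flow_outlier_factors(cluster_throughputs, k=2)
print([round(f, 2) for f in fofs])  # [0.0, 0.01, 0.0, 0.0, 0.1, 0.15]
```

With more centers (larger k), the highest center moves toward the best-performing flows, which is one way to interpret a more aggressive representative point.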
II-C Topology Flow Tracing
In this step we associate network links with the flows that traverse them. To make this association, flows are traced on the network topology using routing information. We use the NetworkX Python package to trace flows on the topology of a network assuming shortest path routing.
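The tracing step can be illustrated without external dependencies. The sketch below uses a breadth-first search as a stand-in for the NetworkX shortest path function used in the paper. The small topology is a hypothetical fragment that only borrows node names from the result tables, and the helper names are assumptions.

```python
from collections import deque


def shortest_path(adjacency, source, target):
    """Breadth-first search; returns the node list of a fewest-hop path or None."""
    parents = {source: None}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if node == target:
            path = []
            while node is not None:
                path.append(node)
                node = parents[node]
            return path[::-1]
        for neighbor in adjacency.get(node, ()):
            if neighbor not in parents:
                parents[neighbor] = node
                queue.append(neighbor)
    return None


def links_on_path(path):
    """Turn a node path into undirected links, each stored as a sorted tuple."""
    return [tuple(sorted(pair)) for pair in zip(path, path[1:])]


# Hypothetical fragment: two hosts behind two switches joined by one router.
adjacency = {
    "129.108.40.2": ["US1"],
    "US1": ["129.108.40.2", "Router"],
    "Router": ["US1", "US3"],
    "US3": ["Router", "129.108.42.4"],
    "129.108.42.4": ["US3"],
}
path = shortest_path(adjacency, "129.108.40.2", "129.108.42.4")
links = links_on_path(path)
print(links)
```

Storing each link as a sorted tuple makes the two directions of a link map to the same key, so flows in either direction count toward the same link.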
II-D Network Link Outlier Factor (NLOF)
In this final step, we compute the outlier score for each network link (i.e., the NLOF). The flow outlier scores (i.e., FOFs) are used to compute the NLOF for the network links the flows traverse. Outlier flows are determined by a threshold on their FOF and the NLOF is computed to be the ratio of outlier flows to total flows traversing the network link. The NLOF is computed using Algorithm 3.
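The ratio described above can be sketched as follows. Algorithm 3 is not reproduced in this text, so the input layout (a list of link lists paired with FOF values), the function name, and the use of a strict comparison against the threshold are assumptions.

```python
from collections import defaultdict


def network_link_outlier_factors(flows, fof_threshold):
    """Compute {link: NLOF} from (links_traversed, fof) pairs.

    The NLOF of a link is the number of outlier flows traversing it divided
    by the total number of flows traversing it.
    """
    total = defaultdict(int)
    outliers = defaultdict(int)
    for links, fof in flows:
        for link in links:
            total[link] += 1
            # Assumption: a flow is an outlier when its FOF exceeds the threshold.
            if fof > fof_threshold:
                outliers[link] += 1
    return {link: outliers[link] / total[link] for link in total}


# Hypothetical flows: the links each one traverses and its FOF.
flows = [
    (["A", "B"], 0.00),
    (["A", "C"], 0.25),
    (["A", "B"], 0.30),
    (["C"], 0.05),
]
nlof = network_link_outlier_factors(flows, fof_threshold=0.1)
for link, score in sorted(nlof.items(), key=lambda item: -item[1]):
    print(link, round(score, 3))
```

Sorting links by descending score gives a ranking of the same kind as the per-test columns in the result tables.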
III Experiments
To evaluate the performance of NLOF, we utilize NS-3 simulation experiments. Figure 2 shows the two topologies we simulated. For each topology, 3 experiments were run, for a total of 6 experiments. In each experiment, all links were set to a data rate of 10 Mbps. US1, US2, US3, UKS and BS are all OpenFlow switches implemented with the OpenFlow 1.3 module. The nodes and routers populate their routing tables using the Routing Information Protocol (RIP). During each simulation, 5000 On/Off flows were produced at a rate of either 1 Mbps or 100 Kbps between two randomly selected hosts in the network. The only exception is Test 6, which had 2 additional throughput classes, 10 Kbps and 2 Mbps, for a total of 4. To collect the data from the simulation, the built-in flow monitor model library was used. Probes were installed on all nodes to capture all the traffic in the network. Table I shows the configuration of the simulation for each test.
The flow monitor library outputs its results in XML format. We parsed the XML file to construct a pandas DataFrame that resembles flow records, in a format acceptable to scikit-learn's DBSCAN clustering method. The DBSCAN clustering was done using the parameters eps = 100 and min_samples = 50, which produced clusters that could then be combined to form TPClusters. TPClusters were formed using Algorithm 1 with parameter values of tpr = 0.3 and tpdev = 0.1, and k-means was run with k = 2. Each flow was then traced and assigned to every network link that it traversed, assuming that the flow takes the shortest path, which can be obtained using the NetworkX shortest path function. Finally, the NLOF for each network link was calculated using Algorithm 3 with an FOF threshold value of 0.1.
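The parsing step might look like the sketch below. The element and attribute names (`FlowStats`, `Ipv4FlowClassifier`, `rxBytes`, `timeFirstTxPacket`, `timeLastRxPacket`) follow the layout commonly seen in NS-3 flow monitor output and should be checked against the actual file. The sample document, the helper names, and the throughput definition (received bits divided by the time between the first transmitted and last received packet) are assumptions.

```python
import xml.etree.ElementTree as ET

# Hand-written sample in the assumed flow monitor layout (not simulation output).
SAMPLE_XML = """<FlowMonitor>
  <FlowStats>
    <Flow flowId="1" timeFirstTxPacket="+1000000000.0ns"
          timeLastRxPacket="+3000000000.0ns" rxBytes="250000"/>
    <Flow flowId="2" timeFirstTxPacket="+2000000000.0ns"
          timeLastRxPacket="+6000000000.0ns" rxBytes="50000"/>
  </FlowStats>
  <Ipv4FlowClassifier>
    <Flow flowId="1" sourceAddress="129.108.40.2" destinationAddress="129.108.42.4"/>
    <Flow flowId="2" sourceAddress="129.108.40.3" destinationAddress="128.163.217.2"/>
  </Ipv4FlowClassifier>
</FlowMonitor>"""


def ns_to_seconds(text):
    """Convert a time string such as '+1000000000.0ns' to seconds."""
    return float(text.rstrip("ns")) / 1e9


def flow_records(xml_text):
    """Return one dict per flow with its endpoints and average throughput."""
    root = ET.fromstring(xml_text)
    endpoints = {
        flow.get("flowId"): (flow.get("sourceAddress"), flow.get("destinationAddress"))
        for flow in root.find("Ipv4FlowClassifier").findall("Flow")
    }
    records = []
    for flow in root.find("FlowStats").findall("Flow"):
        duration = ns_to_seconds(flow.get("timeLastRxPacket")) - ns_to_seconds(
            flow.get("timeFirstTxPacket")
        )
        src, dst = endpoints[flow.get("flowId")]
        records.append({
            "flow_id": int(flow.get("flowId")),
            "src": src,
            "dst": dst,
            # Assumed definition: received bits over the flow's active time.
            "throughput_bps": int(flow.get("rxBytes")) * 8 / duration,
        })
    return records


records = flow_records(SAMPLE_XML)
print(records[0]["throughput_bps"], records[1]["throughput_bps"])  # 1000000.0 100000.0
```

A list of dictionaries like `records` can be passed directly to `pandas.DataFrame` to obtain the tabular flow records described above.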
IV Results
Figures 3 and 4 show the flow throughput distribution (left-side sub-plot) and the corresponding TPClusters in a violin plot (right-side sub-plot) for Tests 1 and 6, respectively. Figures 5 and 6 show the flow throughput distributions of each TPCluster for Tests 1 and 6, respectively. As shown in Figures 3 and 4, the TPClusters formed as expected, i.e., one cluster for every throughput class. For Test 1 we have the two throughput classes, 100 Kbps and 1 Mbps, with 2 corresponding clusters. For Test 6 we have 4 TPClusters, one for each of the throughput classes. More importantly, the points labeled as noise by DBSCAN are moved into their appropriate TPCluster. For Test 1, the cluster distributions have a small range, which clearly indicates that none of the flows have poor throughput performance within the context of their cluster. Figure 6 shows more interesting flow throughput distributions: this time there are four separate clusters, all of which have flows farther away from the representative point, as can be seen visually in the larger range of each cluster. The representative point in this case is located in the upper half of the cluster distribution. The large range of the clusters indicates that there are flows with poor throughput performance belonging to these clusters. Our two-step clustering organizes the flows to properly identify those experiencing poor throughput performance.
Tables II and III show the results obtained for each test for topologies 1 and 2, respectively. The links and NLOF scores that are in bold are the links that were set to have errors in the simulation, which correspond to Table I. The tables show that for Tests 1 and 4, neither of which has any poorly performing links, the NLOF score is 0 for all links. Test 2 has one link with a packet error rate of 0.1, and that link has the highest NLOF score by a wide margin. Test 5 also has one link with a packet error rate of 0.1. However, this time it is a link between two nodes with high centrality rather than a link near the edge of the network. A large portion of the network traffic goes through this errored link, which explains the much higher NLOF scores in general compared to Test 2. It is interesting to note that in Test 5 the link with a non-zero error rate does not have the highest NLOF score. This is likely because all the traffic to or from node "200.17.30.4" must go through the link with a non-zero error rate when communicating with a node not connected to the "BS" switch. Also in Test 5, the edges (BS, R4) and (R1, R4) have an identical NLOF score, which is easily explained since all the traffic that goes through one of those links must go through the other. Test 3 shows a different scenario, with 3 separate links all having a packet error rate of 0.1. Because there are more links with non-zero error rates, there is more poorly performing traffic and the NLOF scores take larger values. Even in this scenario, the NLOF score indicates which links have non-zero error rates, as 3 of the top 4 NLOF scores belong to the links we are looking for, as shown in bold in Table II. For Test 6, the packet error rates were lowered by a significant amount and the 3 links with non-zero error rates are all different. Test 6 also had the added change of 2 extra throughput classes. The results are as expected: the 3 links with the highest NLOF scores are the 3 errored links.
V Conclusion
Using multiple simulations in the NS-3 environment, we have shown that it is possible to detect and localize soft-failures in a network using the Network Link Outlier Factor (NLOF). The results in Tables II and III show that the links with failures consistently rank among the highest NLOF scores, which indicates the likely location of a fault in the network. Using a new clustering technique, named TPCluster, we are able to provide a context for the performance of each individual flow. Our simulation experiments show that TPCluster yields meaningful clusters for identifying faults using outlier detection techniques. For future work, we plan to study the thresholds on NLOF scores for declaring a link failure. We also plan to expand the types of soft-failures we can detect.
VI Acknowledgements
This material is based upon work supported by both the U.S. Army Research Laboratory (USARL) under Cooperative Agreement W911NF-18-2-0287 and the National Science Foundation under Grant No. OAC-1450997.
References
[1] A. Dusia and A. S. Sethi, "Recent advances in fault localization in computer networks," IEEE Communications Surveys & Tutorials, vol. 18, no. 4, pp. 3030–3051, May 2016.
[2] M. Steinder and A. S. Sethi, "A survey of fault localization techniques in computer networks," Science of Computer Programming, vol. 53, no. 2, pp. 165–194, July 2004.
[3] M. Mukamoto, T. Matsuda, S. Hara, K. Takizawa, F. Ono, and R. Miura, "Adaptive boolean network tomography for link failure detection," in 2015 IFIP/IEEE International Symposium on Integrated Network Management (IM), May 2015, pp. 646–651.
[4] N. Duffield, "Network tomography of binary network performance characteristics," IEEE Transactions on Information Theory, vol. 52, no. 12, pp. 5373–5388, Dec 2006.
[5] M. X. Cheng and W. B. Wu, "Data analytics for fault localization in complex networks," IEEE Internet of Things Journal, vol. 3, no. 5, pp. 701–708, Oct 2016.
[6] M. Natu, A. S. Sethi, and E. L. Lloyd, "Efficient probe selection algorithms for fault diagnosis," Telecommunication Systems, vol. 37, no. 1-3, pp. 109–125, March 2008.
[7] L. Cheng, X. Qiu, L. Meng, Y. Qiao, and R. Boutaba, "Efficient active probing for fault diagnosis in large scale and noisy networks," in 2010 Proceedings IEEE INFOCOM, March 2010, pp. 1–9.
[8] M. Natu and A. S. Sethi, "Probe station placement for fault diagnosis," in IEEE GLOBECOM 2007 - IEEE Global Telecommunications Conference, Nov 2007, pp. 113–117.
