Corrected overlap weight and clustering coefficient
Vladimir Batagelj

TL;DR
This paper identifies limitations in the traditional overlap weight and clustering coefficient measures for network analysis and proposes corrected definitions that better identify important network elements, demonstrated on the US Airports network.
Contribution
The author introduces corrected versions of the overlap weight and clustering coefficient measures to improve their usefulness in data analysis tasks.
Findings
Corrected measures provide more meaningful identification of important nodes and links.
Application on US Airports network demonstrates the effectiveness of the corrected measures.
Traditional measures tend to highlight small maximal subgraphs, which can be misleading.
Table 1: Links with the largest triangular weights in the US Airports 1997 network.

| Airport u | Airport v | t(e) | deg(u) | deg(v) | Corrected overlap weight |
|---|---|---|---|---|---|
| Chicago O’hare Intl | Pittsburgh Intll | 80 | 139 | 94 | 0.57971 |
| Chicago O’hare Intl | Lambert-St Louis Intl | 80 | 139 | 94 | 0.57971 |
| Chicago O’hare Intl | Dallas/Fort Worth Intl | 78 | 139 | 118 | 0.55714 |
| Chicago O’hare Intl | The W B Hartsfield Atlanta | 77 | 139 | 101 | 0.54610 |
| The W B Hartsfield Atlanta | Charlotte/Douglas Intl | 76 | 101 | 87 | 0.73077 |
| The W B Hartsfield Atlanta | Dallas/Fort Worth Intl | 73 | 101 | 118 | 0.58871 |
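
Table 2: Airports with the clustering coefficient equal to 1 and the degree at least 4.

| No. | deg | Airport | No. | deg | Airport |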
|---|---|---|---|---|---|
| 1 | 7 | Lehigh Valley Intll | 8 | 4 | Gunnison County |
| 2 | 5 | Evansville Regional | 9 | 4 | Aspen-Pitkin Co/Sardy Field |
| 3 | 5 | Stewart Int’l | 10 | 4 | Hector Intll |
| 4 | 5 | Rio Grande Valley Intl | 11 | 4 | Burlington Regional |
| 5 | 5 | Tallahassee Regional | 12 | 4 | Rafael Hernandez |
| 6 | 4 | Myrtle Beach Intl | 13 | 4 | Wilkes-Barre/Scranton Intl |
| 7 | 4 | Bishop Intll | 14 | 4 | Toledo Express |
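
Table 3: US Airports with the largest corrected clustering coefficient.

| Rank | Corrected clustering coefficient | deg | Airport |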
|---|---|---|---|
| 1 | 0.3739 | 45 | Cleveland-Hopkins Intl |
| 2 | 0.3700 | 50 | General Edward Lawrence Logan |
| 3 | 0.3688 | 56 | Orlando Intl |
| 4 | 0.3595 | 42 | Tampa Intl |
| 5 | 0.3488 | 61 | Cincinnati/Northern Kentucky Intl |
| 6 | 0.3457 | 70 | Detroit Metropolitan Wayne County |
| 7 | 0.3455 | 67 | Newark Intl |
| 8 | 0.3429 | 53 | Baltimore-Washington Intl |
| 9 | 0.3415 | 47 | Miami Intl |
| 10 | 0.3405 | 42 | Washington National |
| 11 | 0.3379 | 56 | Nashville Intll |
| 12 | 0.3359 | 46 | John F Kennedy Intl |
| 13 | 0.3347 | 62 | Philadelphia Intl |
| 14 | 0.3335 | 41 | Indianapolis Intl |
| 15 | 0.3335 | 50 | La Guardia |
Corrected overlap weight and clustering coefficient
Vladimir Batagelj
Institute of Mathematics, Physics and Mechanics,
Department of Theoretical Computer Science,
Jadranska 19, 1000 Ljubljana, Slovenia
and
University of Primorska, Andrej Marušič Institute,
Muzejski trg 2, Koper, Slovenia
and
National Research University Higher School of Economics,
Myasnitskaya, 20, 101000 Moscow, Russia
e-mail: [email protected]
ORCID: 0000-0002-0240-9446
Abstract
We discuss two well known network measures: the overlap weight of an edge and the clustering coefficient of a node. It turns out that neither of them is very useful for the data analytic task of identifying important elements (nodes or links) of a given network. The reason is that they attain their largest values on maximal subgraphs of relatively small size, which are more likely to appear in a network than larger ones. We show how the definitions of these measures can be corrected in such a way that they give the expected results. We illustrate the proposed corrected measures by applying them to the US Airports network using the program Pajek.
Keywords: social network analysis, importance measure, triangular weight, overlap weight, clustering coefficient.
Mathematics Subject Classification 2010: 91D30, 91C05, 05C85, 68R10, 05C42.
1 Introduction
1.1 Network element importance measures
To identify important / interesting elements (nodes, links) in a network we often try to express our intuition about their importance using an appropriate measure (node index, link weight) following the scheme
the larger the measure value of an element, the more important / interesting this element is.
Too often, in analysis of networks, researchers uncritically pick some measure from the literature (degrees, closeness, betweenness, hubs and authorities, clustering coefficient, etc. (Wasserman and Faust, 1995; Todeschini and Consonni, 2009)) and apply it to their network.
In this paper we discuss two well known network local density measures: the overlap weight of an edge (Onnela et al., 2007) and the clustering coefficient of a node (Holland and Leinhardt, 1971; Watts and Strogatz, 1998).
It turns out that neither of them is very useful for the data analytic task of identifying important elements of a given network. The reason is that they attain their largest values on maximal subgraphs of relatively small size, which are more likely to appear in a network than larger ones. We show how their definitions can be corrected in such a way that they give the expected results. We illustrate the proposed corrected measures by applying them to the US Airports network using the program Pajek. We will limit our attention to undirected simple graphs $G = (V, E)$.
Many similar indices and weights were proposed by the graph drawing community for disentanglement in the visualization of hairball networks (Melançon and Sallaberry, 2008; Nocaj et al., 2015, 2016).
When searching for important subnetworks in a given network we often assume a model in which, during the evolution of the network, increased activity in a part of the network creates new nodes and edges in that part, increasing its local density. From a local density measure $ld(x, G)$ for an element $x$ (node/link) of a network $G$ we expect the following properties:
- ld1. adding an edge $e$ to the local neighborhood of $x$ does not decrease the local density: $ld(x, G) \le ld(x, G \cup e)$.
- ld2. normalization: $0 \le ld(x, G) \le 1$.
- ld3. $ld(x, G)$ can attain the value 1 on the largest subnetwork of a certain type in the network.
2 Overlap weight
2.1 Overlap weight
A direct measure of the overlap of an edge $e = (u:v) \in E$ in an undirected simple graph $G = (V, E)$ is the number of common neighbors of its end nodes $u$ and $v$ (see Figure 1). It is equal to $t(e)$, the number of triangles (cycles of length 3) to which the edge $e$ belongs. The edge neighbors subgraph is labeled – the subgraph in Figure 1 is labeled . There are two problems with this measure:
- it is not normalized (bounded to $[0, 1]$);
- it does not consider the ‘potentiality’ of the nodes $u$ and $v$ to form triangles: there are $\min(\deg(u), \deg(v)) - 1 - t(e)$ nodes in the smaller set of neighbors that are not in the other set of neighbors.
Two simple normalizations are:
$$\frac{t(e)}{n - 2} \quad \text{or} \quad \frac{t(e)}{\mu},$$
where $n = |V|$ is the number of nodes, and $\mu = \max_{e \in E} t(e)$ is the maximum number of triangles on an edge in the graph $G$.
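The (topological) overlap weight of an edge $e = (u:v) \in E$ also considers the degrees of the edge’s end nodes and is defined as
$$o(e) = \frac{t(e)}{(\deg(u) - 1) + (\deg(v) - 1) - t(e)},$$
where $t(e)$ is the number of triangles to which the edge $e$ belongs. In the case $\deg(u) = \deg(v) = 1$ we set $o(e) = 0$. This definition addresses both problems to some extent.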
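The definition above can be sketched in Python. This is only an illustration, assuming the formula $o(e) = t(e) / ((\deg(u) - 1) + (\deg(v) - 1) - t(e))$ of Onnela et al. (2007); the adjacency-set representation, the helper names (`build_adjacency`, `triangles_on_edge`, `overlap_weight`) and the toy graph are hypothetical and are not taken from the paper or from Pajek.

```python
from collections import defaultdict


def build_adjacency(edges):
    """Return a dict mapping each node to the set of its neighbors."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    return dict(adj)


def triangles_on_edge(adj, u, v):
    """t(e): number of common neighbors of the end nodes of e = (u:v)."""
    return len(adj[u] & adj[v])


def overlap_weight(adj, u, v):
    """Overlap weight o(e); set to 0 when both end nodes have degree 1."""
    t = triangles_on_edge(adj, u, v)
    denom = (len(adj[u]) - 1) + (len(adj[v]) - 1) - t
    return t / denom if denom > 0 else 0.0


# Toy graph (hypothetical): a triangle a-b-c with a pendant node d attached to c.
toy_edges = [("a", "b"), ("b", "c"), ("a", "c"), ("c", "d")]
toy_adj = build_adjacency(toy_edges)

for u, v in toy_edges:
    print(u, v, overlap_weight(toy_adj, u, v))
```

In this toy graph the edge (a:b) already reaches the maximal weight although both of its end nodes have degree 2, which is the kind of small configuration the paper is concerned with.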
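The overlap weight is essentially a Jaccard similarity index (Wikipedia, 2018)
$$J(X, Y) = \frac{|X \cap Y|}{|X \cup Y|}$$
for $X = N(u) \setminus \{v\}$ and $Y = N(v) \setminus \{u\}$, where $N(z)$ is the set of neighbors of a node $z$. In this case we have $|X \cap Y| = t(e)$ and
$$|X \cup Y| = |X| + |Y| - |X \cap Y| = (\deg(u) - 1) + (\deg(v) - 1) - t(e).$$
Note also that $h(X, Y) = 1 - J(X, Y) = \frac{|X \oplus Y|}{|X \cup Y|}$ is the normalized Hamming distance (Wikipedia, 2018). The operation $\oplus$ denotes the symmetric difference $X \oplus Y = (X \cup Y) \setminus (X \cap Y)$.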
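The relation between the overlap weight and the Jaccard index can be checked numerically. The sketch below assumes that $X$ and $Y$ are the neighbor sets of the two end nodes with the other end node removed; the function names and the neighbor sets are hypothetical, and returning 0 for two empty sets is a convention chosen here.

```python
def jaccard(x, y):
    """Jaccard similarity |X & Y| / |X | Y|; 0 by convention for two empty sets."""
    union = x | y
    return len(x & y) / len(union) if union else 0.0


def normalized_hamming(x, y):
    """Normalized Hamming distance: size of the symmetric difference over |X | Y|."""
    union = x | y
    return len(x ^ y) / len(union) if union else 0.0


# Hypothetical neighbor sets of the end nodes of an edge (u:v).
neighbors_u = {"v", "a", "b", "c"}
neighbors_v = {"u", "b", "c", "d", "e"}

x = neighbors_u - {"v"}  # N(u) without the other end node
y = neighbors_v - {"u"}  # N(v) without the other end node

t = len(x & y)  # number of triangles on the edge
overlap = t / ((len(neighbors_u) - 1) + (len(neighbors_v) - 1) - t)

print("Jaccard:", jaccard(x, y), "overlap weight:", overlap)
print("Jaccard + Hamming:", jaccard(x, y) + normalized_hamming(x, y))
```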
Another normalized overlap measure is the overlap index (Wikipedia, 2018)
[TABLE]
Both measures and , applied to networks, have some nice properties. For example: a pair of nodes and are structurally equivalent iff . Therefore the overlap weight measures the substitutability of one edge’s end node by the other.
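Introducing two auxiliary quantities
$$m(e) = \min(\deg(u), \deg(v)) - 1 \quad \text{and} \quad M(e) = \max(\deg(u), \deg(v)) - 1$$
we can rewrite the definition of the overlap weight as
$$o(e) = \frac{t(e)}{m(e) + M(e) - t(e)}, \quad M(e) > 0,$$
and if $M(e) = 0$ then $o(e) = 0$.
For every edge $e \in E$ it holds $0 \le t(e) \le m(e) \le M(e)$. Therefore
$$m(e) + M(e) - t(e) \ge t(e) + t(e) - t(e) = t(e),$$
showing that $0 \le o(e) \le 1$.
The value $o(e) = 1$ is attained exactly in the case when $M(e) = t(e)$; and the value $o(e) = 0$ exactly when $t(e) = 0$.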
In simple directed graphs without loops different types of triangles exist over an arc . We can define overlap weights for each type. For example: the transitive overlap weight
[TABLE]
and the cyclic overlap weight
[TABLE]
where and are the number of transitive / cyclic triangles containing the arc . In this paper we will limit our discussion to overlap weights in undirected graphs.
2.2 US Airports links with the largest overlap weight
Let us apply the overlap weight to the network of US Airports 1997 (Batagelj and Mrvar, 2006). It consists of 332 airports and 2126 edges among them. There is an edge linking a pair of airports iff in the year 1997 there was a flight company providing flights between those two airports.
The size of a circle representing an airport in Figure 2 is proportional to its degree – the number of airports linked to it. The airports with the largest degree are:
[TABLE]
For the overlap weight the edge cut at level 0.8 (a subnetwork of all edges with overlap weight at least 0.8) is presented in Figure 3. It consists of two triangles, a path of length 2, and 17 separate edges.
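An edge cut at a given level, and the description of its pieces, can be sketched as follows. This is a minimal illustration and not the Pajek procedure used in the paper; the function names and the toy weights are hypothetical.

```python
def edge_cut(weights, level):
    """Keep only the edges whose weight is at least `level`."""
    return {edge: w for edge, w in weights.items() if w >= level}


def components(edges):
    """Connected components (as node sets) of the subnetwork spanned by `edges`."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    seen, result = set(), []
    for start in adj:
        if start in seen:
            continue
        stack, comp = [start], set()
        while stack:  # depth-first traversal from `start`
            node = stack.pop()
            if node in comp:
                continue
            comp.add(node)
            stack.extend(adj[node] - comp)
        seen |= comp
        result.append(comp)
    return result


# Hypothetical edge weights, for illustration only.
toy_weights = {
    ("a", "b"): 1.0, ("b", "c"): 0.9, ("a", "c"): 0.85,
    ("c", "d"): 0.3, ("e", "f"): 0.8,
}
cut = edge_cut(toy_weights, 0.8)
parts = components(cut)
print(len(cut), "edges in", len(parts), "components")
```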
A tetrahedron (Kwigillingok, Kongiganak, Tuntutuliak, Bethel), see Figure 4, gives the first triangle in Figure 3; it is attached to the rest of the network through the node Bethel.
From this example we see that in real-life networks edges with the largest overlap weight tend to be edges with relatively small degrees in their end nodes ( implies ), so the overlap weight does not satisfy the condition ld3. Because of this the overlap weight is not very useful for data analytic tasks in searching for important elements of a given network. We would like to emphasize here that there are many applications in which the overlap weight proves to be useful and appropriate; we question only its appropriateness for determining the most overlapped edges. We will try to improve the overlap weight definition to better suit the data analytic goals.
2.3 Corrected overlap weight
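We define a corrected overlap weight as
$$o'(e) = \frac{t(e)}{\mu + M(e) - t(e)},$$
where $\mu = \max_{e \in E} t(e)$ is the maximum number of triangles on an edge and $M(e) = \max(\deg(u), \deg(v)) - 1$.
By the definition of $\mu$, for every $e \in E$ it holds $t(e) \le \mu$. Since also $M(e) - t(e) \ge 0$, we get $\mu + M(e) - t(e) \ge t(e)$ and therefore ld2, $0 \le o'(e) \le 1$. $o'(e) = 0$ exactly when $t(e) = 0$, and $o'(e) = 1$ exactly when $\mu = M(e) = t(e)$. For ld3, the corresponding maximal edge neighbors subgraph contains . The end nodes of the edge $e$ are structurally equivalent.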
To show that ld1 also holds let denote the edge neighbors subgraph of the edge . Let be the edge added to . We can assume that , . Therefore . We have to consider some cases:
a. : then and .
b. :
b1. : then . It creates new triangle . We have and . We get
[TABLE]
b2. : then . It creates new triangle . We have and . We get
[TABLE]
b3. and : No new triangle on is created. We have and . Therefore .
The corrected overlap weight is a kind of local density measure, but it is primarily a substitutability measure. To get a better local density measure we have to consider, besides triangles, also quadrilaterals (4-cycles).
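The corrected overlap weight of this section can be sketched in Python. The code assumes the form $o'(e) = t(e) / (\mu + M(e) - t(e))$ with $\mu$ the largest number of triangles on an edge and $M(e)$ the larger end-node degree minus 1; this form reproduces the values in Table 1 but is a reconstruction. The function names and the toy graph (a complete graph on five nodes with a pendant node, plus a separate triangle) are hypothetical.

```python
from itertools import combinations


def build_adjacency(edges):
    """Return a dict mapping each node to the set of its neighbors."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def corrected_overlap_weights(edges):
    """Return (mu, weights), where weights[(u, v)] = t / (mu + M - t)."""
    adj = build_adjacency(edges)
    triangles = {(u, v): len(adj[u] & adj[v]) for u, v in edges}
    mu = max(triangles.values(), default=0)
    weights = {}
    for (u, v), t in triangles.items():
        big_m = max(len(adj[u]), len(adj[v])) - 1
        denom = mu + big_m - t
        weights[(u, v)] = t / denom if denom > 0 else 0.0
    return mu, weights


# Toy graph: K5 on a..e, a pendant node f attached to a, and a separate triangle x-y-z.
toy_edges = list(combinations("abcde", 2)) + [
    ("a", "f"), ("x", "y"), ("y", "z"), ("x", "z"),
]
mu, weights = corrected_overlap_weights(toy_edges)

print("mu =", mu)
for edge in [("x", "y"), ("b", "c"), ("a", "b"), ("a", "f")]:
    print(edge, round(weights[edge], 4))
```

The edges of the isolated triangle have overlap weight 1 under the uncorrected definition, but here they fall to 1/3, while the edges inside the larger complete subgraph keep high values.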
2.4 US Airports 1997 links with the largest corrected overlap weight
For the US Airports 1997 network we get $\mu = 80$. For the corrected overlap weight the edge cut at level 0.5 is presented in Figure 5. Six links with the largest triangular weights are given in Table 1.
In Figure 6 all the neighbors of the end nodes WB Hartsfield Atlanta and Charlotte/Douglas Intl of the link with the largest corrected overlap weight value are presented. They have 76 common (triangular) neighbors. The node Charlotte/Douglas Intl (degree 87) has 11 and the node WB Hartsfield Atlanta (degree 101) has 25 additional neighbors. Note (see Table 1) that there are some links with a higher triangular weight, but also with a much higher number of additional neighbors, and therefore with smaller corrected overlap weights.
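The values in Table 1 can be recomputed from the triangle counts and degrees listed there. The sketch assumes the corrected overlap weight has the form $t(e) / (\mu + M(e) - t(e))$ and that $\mu$ equals the largest triangle count in Table 1 (the table lists the links with the largest triangular weights); the function name `corrected_overlap` is hypothetical.

```python
# Rows copied from Table 1: (airport, airport, t(e), degree, degree, published value).
TABLE_1 = [
    ("Chicago O'hare Intl", "Pittsburgh Intll", 80, 139, 94, 0.57971),
    ("Chicago O'hare Intl", "Lambert-St Louis Intl", 80, 139, 94, 0.57971),
    ("Chicago O'hare Intl", "Dallas/Fort Worth Intl", 78, 118, 139, 0.55714),
    ("Chicago O'hare Intl", "The W B Hartsfield Atlanta", 77, 101, 139, 0.54610),
    ("The W B Hartsfield Atlanta", "Charlotte/Douglas Intl", 76, 101, 87, 0.73077),
    ("The W B Hartsfield Atlanta", "Dallas/Fort Worth Intl", 73, 101, 118, 0.58871),
]

# Assumption: mu is the largest t(e) in the network, which appears in Table 1.
MU = max(row[2] for row in TABLE_1)


def corrected_overlap(t, deg_a, deg_b, mu):
    """t / (mu + M - t), with M the larger of the two degrees minus 1."""
    big_m = max(deg_a, deg_b) - 1
    return t / (mu + big_m - t)


recomputed = [corrected_overlap(t, da, db, MU) for _, _, t, da, db, _ in TABLE_1]
for row, value in zip(TABLE_1, recomputed):
    print(f"{row[0]} - {row[1]}: recomputed {value:.5f}, published {row[5]:.5f}")
```

Because only the larger of the two degrees enters the formula, the order of the two degree columns does not affect the result.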
2.5 Comparisons
In Figure 7 the set is displayed for the US Airports 1997 network. For most edges it holds . It is easy to see that . Edges with the overlap value have the corrected overlap weight .
In Figure 8 the sets and are displayed for the US Airports 1997 network. With increasing of the corresponding overlap weight is decreasing; and the corresponding corrected overlap weight is also increasing.
We can observe similar tendencies if we compare both weights with respect to the number of triangles (see Figure 9).
3 Clustering coefficient
3.1 Clustering coefficient
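For a node $v \in V$ in an undirected simple graph $G = (V, E)$ its (local) clustering coefficient (Wikipedia, 2018) measures the local density at the node $v$. It is defined as the proportion of the number of existing edges between $v$’s neighbors to the number of all possible edges between $v$’s neighbors
$$cc(v) = \frac{2\,|E(N(v))|}{\deg(v)\,(\deg(v) - 1)},$$
where $N(v)$ is the set of neighbors of $v$, $E(N(v))$ is the set of edges between them, and $\deg(v) > 1$. If $\deg(v) \le 1$ then $cc(v) = 0$.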
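It is easy to see that
$$cc(v) = \frac{1}{\deg(v)\,(\deg(v) - 1)} \sum_{e \in S(v)} t(e),$$
where $t(e)$ is the number of triangles containing the edge $e$ and $S(v)$ is the star in node $v$ (the set of edges incident with $v$).
It holds $0 \le cc(v) \le 1$; $cc(v) = 1$ exactly when the subgraph induced by the neighbors of $v$ is isomorphic to $K_{\deg(v)}$, a complete graph on $\deg(v)$ nodes. Therefore it seems that the clustering coefficient could be used to identify nodes with the densest neighborhoods.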
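The local clustering coefficient can be sketched in a few lines of Python. The sketch assumes the standard form $2\,|E(N(v))| / (\deg(v)(\deg(v) - 1))$ and the convention that nodes of degree below 2 get the value 0; the function name and the toy graph are hypothetical.

```python
from itertools import combinations


def clustering_coefficient(adj, v):
    """Share of pairs of neighbors of v that are themselves linked."""
    neighbors = adj[v]
    k = len(neighbors)
    if k < 2:
        return 0.0  # convention assumed here for nodes of degree 0 or 1
    links = sum(1 for a, b in combinations(neighbors, 2) if b in adj[a])
    return 2 * links / (k * (k - 1))


# Toy graph (hypothetical): a triangle a-b-c with a pendant node d attached to c.
toy_adj = {
    "a": {"b", "c"},
    "b": {"a", "c"},
    "c": {"a", "b", "d"},
    "d": {"c"},
}

for node in sorted(toy_adj):
    print(node, round(clustering_coefficient(toy_adj, node), 4))
```

The nodes a and b reach the maximal value 1 with only two neighbors each.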
The notion of clustering coefficient can be extended also to simple directed graphs (with loops).
3.2 US Airports with the largest clustering coefficient
Let us apply also the clustering coefficient to the US Airports 1997 network.
In Table 2 airports with the clustering coefficient equal to 1 and the degree at least 4 are listed. There are 28 additional such airports with a degree 3, and 38 with a degree 2.
Again we see that the clustering coefficient attains its largest value in nodes with relatively small degree. The probability that the neighbors of a node induce a complete subgraph decreases very fast as the degree of the node increases. The clustering coefficient does not satisfy the condition ld3.
3.3 Corrected clustering coefficient
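To get a corrected version of the clustering coefficient we proposed in Pajek (De Nooy et al., 2018) to replace $\deg(v)$ in the denominator with $\Delta$, the maximum degree in the network. In this paper we propose another solution: we replace $\deg(v) - 1$ with $\mu$, the maximum number of triangles on an edge:
$$cc'(v) = \frac{2\,|E(N(v))|}{\mu \cdot \deg(v)},$$
where $|E(N(v))|$ is the number of edges between the neighbors of $v$.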
If then . Note that, if then .
To verify the property ld1 we add to a new edge with its end nodes in . Then and . Therefore
[TABLE]
To show the property ld2, , we have to consider two cases:
- a.
: then for we have and therefore
[TABLE]
- b.
: then and therefore
[TABLE]
For the property ld3, the value is attained in the case a on a -core, and in the case b on .
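The corrected clustering coefficient of this section can be compared with the uncorrected one on a small example. The sketch assumes the form $cc'(v) = 2\,|E(N(v))| / (\mu \cdot \deg(v))$, which is consistent with the values in Table 3 for $\mu = 80$ but is a reconstruction; the function names and the toy graph are hypothetical.

```python
from itertools import combinations


def build_adjacency(edges):
    """Return a dict mapping each node to the set of its neighbors."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def links_among_neighbors(adj, v):
    """Number of edges between the neighbors of v."""
    return sum(1 for a, b in combinations(adj[v], 2) if b in adj[a])


def clustering(adj, v):
    """Uncorrected clustering coefficient (0 for degree below 2)."""
    k = len(adj[v])
    return 2 * links_among_neighbors(adj, v) / (k * (k - 1)) if k > 1 else 0.0


def corrected_clustering(adj, v, mu):
    """Corrected coefficient: deg(v) - 1 in the denominator is replaced by mu."""
    k = len(adj[v])
    if k == 0 or mu == 0:
        return 0.0
    return 2 * links_among_neighbors(adj, v) / (mu * k)


# Toy graph: K5 on a..e, a pendant node f attached to a, and a separate triangle x-y-z.
toy_edges = list(combinations("abcde", 2)) + [
    ("a", "f"), ("x", "y"), ("y", "z"), ("x", "z"),
]
adj = build_adjacency(toy_edges)
mu = max(len(adj[u] & adj[v]) for u, v in toy_edges)  # largest t(e)

for node in ["x", "b", "a"]:
    print(node, round(clustering(adj, node), 4), round(corrected_clustering(adj, node, mu), 4))
```

The triangle node x has the maximal uncorrected value but a corrected value of only 1/3, while the nodes of the larger complete subgraph stay near the top.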
3.4 US Airports nodes with the largest corrected clustering coefficient
In Table 3 US Airports with the largest corrected clustering coefficient are listed. The largest value 0.3739 is attained for Cleveland-Hopkins Intl airport. In Figure 10 the adjacency matrix of a subnetwork on its 45 neighbors is presented. The subnetwork is relatively complete. A small value of corrected clustering coefficient is due to relatively small with respect to .
3.5 Comparisons
In Figure 11 the set is displayed for the US Airports 1997 network. The correlation between both coefficients is very small. An important observation is that nodes with the largest value of the clustering coefficient have relatively small values of the corrected clustering coefficient. We also see that the number of edges in a node’s neighborhood is almost functionally dependent on its degree.
From Figure 12 we see that the clustering coefficient is decreasing with the increasing degree. Nodes with large degree have small values of clustering coefficient. The values of corrected clustering coefficient are large for nodes of large degree.
4 Conclusions
In the paper we showed that two network measures, the overlap weight and clustering coefficient, are not suitable for the data analytic task of determining important elements in a given network. We proposed corrected versions of these two measures that give expected results.
Because $\mu < \Delta$, where $\Delta$ is the maximum degree, we can replace $\mu$ in the corrected measures with $\Delta$. Its advantage is that it is easier to compute, but the corresponding corrected index is less ‘sensitive’.
An interesting task for future research is a comparison of the proposed measures with measures from graph drawing (Melançon and Sallaberry, 2008; Nocaj et al., 2015, 2016).
Acknowledgments
The computations were done combining Pajek (De Nooy et al., 2018) with short programs in Python and R (Batagelj, 2016).
This work is supported in part by the Slovenian Research Agency (research program P1-0294 and research projects J1-9187 and J7-8279) and by the Russian Academic Excellence Project ’5-100’.
The paper is a detailed and extended version of the talk presented at the CMStatistics (ERCIM) 2015 Conference. The author’s attendance at the conference was partially supported by the COST Action IC1408 – CRoNoS.
References
- Batagelj, V., Mrvar, A. (2006). Pajek data sets: US Airports network. http://vlado.fmf.uni-lj.si/pub/networks/data/mix/US Air 97.net
- Batagelj, V. (2016). Corrected. https://github.com/bavla/corrected
- De Nooy, W., Mrvar, A., Batagelj, V. (2018). Exploratory Social Network Analysis with Pajek; Revised and Expanded Edition for Updated Software. Structural Analysis in the Social Sciences, Cambridge University Press.
- Holland, P.W. and Leinhardt, S. (1971). Transitivity in structural models of small groups. Comparative Group Studies 2: 107–124.
- Melançon, G. and Sallaberry, A. (2008). Edge Metrics for Visual Graph Analytics: A Comparative Study. 12th International Conference Information Visualisation, 610–615.
- Nocaj, A., Ortmann, M. and Brandes, U. (2015). Untangling the Hairballs of Multi-Centered, Small-World Online Social Media Networks. Journal of Graph Algorithms and Applications 19(2), 595–618.
- Nocaj, A., Ortmann, M. and Brandes, U. (2016). Adaptive Disentanglement Based on Local Clustering in Small-World Network Visualization. IEEE Transactions on Visualization and Computer Graphics 22(6), 1662–1671.
- Onnela, J.P., Saramaki, J., Hyvonen, J., Szabo, G., Lazer, D., Kaski, K., Kertesz, J., Barabasi, A.L. (2007). Structure and tie strengths in mobile communication networks. Proceedings of the National Academy of Sciences 104(18), 7332.
