
TL;DR
This paper introduces a new graph decomposition method based on local density, providing a polynomial-time exact algorithm and a linear-time approximation, which better captures dense subgraph structures than traditional $k$-core analysis.
Contribution
It defines locally-dense subgraphs, develops algorithms for their decomposition, and compares this approach to $k$-core analysis, highlighting improved density alignment.
Findings
Locally-dense decomposition can be computed in polynomial time.
A linear-time 2-approximation algorithm exists for locally-dense decomposition.
$k$-core decomposition is also a 2-approximation but less aligned with density in practice.
| Name | Vertices | Edges | Core (running time) | GreedyLD (running time) | ExactLD (running time) |
|---|---|---|---|---|---|
| dolphins | 62 | 159 | 1ms | 1ms | 2ms |
| karate | 34 | 78 | 1ms | 1ms | 2ms |
| lesmis | 77 | 254 | 2ms | 2ms | 3ms |
| astro | 18,772 | 396,160 | 0.4s | 0.4s | 2s |
| enron | 36,692 | 183,831 | 0.3s | 0.3s | 2s |
| fb1912 | 747 | 30,025 | 44ms | 44ms | 0.2s |
| hepph | 12,008 | 237,010 | 0.2s | 0.2s | 0.9s |
| dblp | 317,080 | 1,049,866 | 2s | 2s | 14s |
| gowalla | 196,591 | 950,327 | 2s | 2s | 9s |
| roadnet | 1,965,206 | 2,766,607 | 7s | 8s | 1m6s |
| skitter | 1,696,415 | 11,095,298 | 21s | 21s | 1m46s |
| airports | 294 | 3,995 | 11ms | 10ms | 27ms |
| trains | 363 | 1,357 | 7ms | 7ms | 23ms |
| Name | Core | GreedyLD | Core | GreedyLD | |
|---|---|---|---|---|---|
| dolphins | 0.94 | 0.83 | 0.98 | 0.98 | |
| karate | 0.95 | 0.99 | 0.95 | 0.99 | |
| lesmis | 0.86 | 0.87 | 0.96 | 1.00 | |
| astro | 0.85 | 0.85 | 0.87 | 0.92 | |
| enron | 0.83 | 0.82 | 0.94 | 1.00 | |
| fb1912 | 0.69 | 0.74 | 0.91 | 1.00 | |
| hepph | 0.74 | 0.75 | 1.00 | 1.00 | |
| dblp | 0.80 | 0.86 | 1.00 | 1.00 | |
| gowalla | 0.89 | 0.92 | 0.87 | 1.00 | |
| roadnet | 0.81 | 0.87 | 0.84 | 0.87 | |
| skitter | 0.73 | 0.84 | 0.84 | 1.00 | |
| airports | 0.75 | 0.90 | 0.93 | 1.00 | |
| trains | 0.60 | 0.84 | 0.82 | 0.96 | |
| Name | Core | GreedyLD | ExactLD | c-vs-e | g-vs-e | c-vs-g |
|---|---|---|---|---|---|---|
| dolphins | 4 | 6 | 7 | 0.76 | 0.77 | 0.99 |
| karate | 4 | 3 | 4 | 0.80 | 0.95 | 0.78 |
| lesmis | 8 | 8 | 9 | 0.94 | 0.99 | 0.95 |
| astro | 52 | 83 | 435 | 0.93 | 0.93 | 0.99 |
| enron | 43 | 162 | 357 | 0.92 | 0.92 | 0.99 |
| fb1912 | 87 | 55 | 75 | 0.95 | 0.98 | 0.97 |
| hepph | 64 | 63 | 283 | 0.93 | 0.93 | 0.98 |
| dblp | 47 | 97 | 1087 | 0.88 | 0.89 | 0.97 |
| gowalla | 51 | 161 | 899 | 0.97 | 0.96 | 0.98 |
| roadnet | 3 | 43 | 2710 | 0.57 | 0.80 | 0.68 |
| skitter | 111 | 266 | 3501 | 0.98 | 0.97 | 0.99 |
| airports | 221 | 200 | 219 | 0.99 | 0.99 | 0.996 |
| trains | 187 | 59 | 156 | 0.87 | 0.89 | 0.98 |
Density-friendly Graph Decomposition
Nikolaj Tatti
HIIT, University of Helsinki, Aalto University, Helsinki, Finland
(2009)
Abstract.
Decomposing a graph into a hierarchical structure via $k$-core analysis is a standard operation in any modern graph-mining toolkit. $k$-core decomposition is a simple and efficient method that allows one to analyze a graph beyond its mere degree distribution. More specifically, it is used to identify areas in the graph of increasing centrality and connectedness, and it reveals the structural organization of the graph.
Despite the fact that $k$-core analysis relies on vertex degrees, $k$-cores do not satisfy a certain, rather natural, density property. Simply put, the most central $k$-core is not necessarily the densest subgraph. This inconsistency between $k$-cores and graph density provides the basis of our study.
We start by defining what it means for a subgraph to be locally-dense, and we show that our definition entails a nested chain decomposition of the graph, similar to the one given by $k$-cores, but in this case the components are arranged in order of increasing density. We show that such a locally-dense decomposition for a graph can be computed in polynomial time. The running time of the exact decomposition algorithm is $O(n^2 m)$, but it is significantly faster in practice. In addition, we develop a linear-time algorithm that provides a factor-2 approximation to the optimal locally-dense decomposition. Furthermore, we show that the $k$-core decomposition is also a factor-2 approximation. However, as demonstrated by our experimental evaluation, in practice $k$-cores have a different structure than locally-dense subgraphs, and as predicted by the theory, $k$-cores are not always well-aligned with graph density.
The research described in this paper builds upon and extends the work appearing in WWW 2015 by Tatti and Gionis (2015).
Journal: TKDD; volume: 1; number: 1; article: 1; year: 2017; month: 1; copyright: acmlicensed; doi: 0000001.0000001
1. Introduction
Finding dense subgraphs and communities is one of the most well-studied problems in graph mining. Techniques for identifying dense subgraphs are used in a large number of application domains, from biology, to web mining, to analysis of social and information networks. Among the many concepts that have been proposed for discovering dense subgraphs, $k$-cores are particularly attractive for the simplicity of their definition and the fact that they can be identified in linear time.
The $k$-core of a graph is defined as a maximal subgraph in which every vertex is connected to at least $k$ other vertices within that subgraph. A $k$-core decomposition of a graph consists of finding the set of all $k$-cores. A nice property is that the set of all $k$-cores forms a nested sequence of subgraphs, one included in the next. This makes the $k$-core decomposition of a graph a useful tool in analyzing a graph by identifying areas of increasing centrality and connectedness, and revealing the structural organization of the graph. As a result, $k$-core decomposition has been applied to a number of different applications, such as modeling of random graphs (Bollobás, 1984), analysis of the internet topology (Carmi et al., 2007), social-network analysis (Seidman, 1983), bioinformatics (Bader and Hogue, 2003), analysis of connection matrices of the human brain (Hagmann et al., 2008), graph visualization (Alvarez-Hamelin et al., 2005), as well as influence analysis (Kitsak et al., 2010; Ugander et al., 2012) and team formation (Bonchi et al., 2014).
The fact that the $k$-core decomposition of a graph gives a chain of subgraphs where vertex degrees are higher in the inner cores suggests that we should expect the inner cores to be, in a certain sense, more dense or more connected than the outer cores. As we will show shortly, this statement is not true. Furthermore, in this paper we show how to obtain a graph decomposition for which the statement is true, namely, the inner subgraphs of the decomposition are denser than the outer ones. To quantify density, we adopt a classic notion used in the densest-subgraph problem (Charikar, 2000; Goldberg, 1984), where density is defined as the ratio between the number of edges and the number of vertices of a subgraph. This density definition can also be viewed as the average degree divided by 2.
Our motivating observation is that $k$-cores are not ordered according to this density definition. The next example demonstrates that the innermost core is not necessarily the densest subgraph, and in fact, we can increase the density by either adding or removing vertices.
Example 1.1.
Consider the graph shown in Figure 1, consisting of 6 vertices and 9 edges. The density of the whole graph is $9/6 = 3/2$. The graph has three $k$-cores: a -core marked as , a -core marked as , and a -core, corresponding to the whole graph and marked as . The core has density (it contains edges and vertices), while the core has density (it contains edges and vertices). In other words, has lower density than , despite being an inner core.
Let us now consider shown in Figure 1. This graph has a single core, namely a -core, containing the whole graph. The density of this core is equal to . However, a subgraph contains edges and vertices, giving us density , which is higher than the density of the only core.
This example motivates us to define an alternative, more density-friendly, graph decomposition, which we call locally-dense decomposition. We are interested in a decomposition such that (i) the density of the inner subgraphs is higher than the density of the outer subgraphs, (ii) the innermost subgraph corresponds to the densest subgraph, and (iii) we can compute or approximate the decomposition efficiently.
We achieve our goals by first defining a locally-dense subgraph, essentially a subgraph whose density cannot be improved by adding and deleting vertices. We show that these subgraphs are arranged into a hierarchy such that the density decreases as we go towards outer subgraphs and that the most inner subgraph is in fact the densest subgraph.
We provide two efficient algorithms to discover this hierarchy. The first algorithm extends the exact algorithm for discovering the densest subgraph given by Goldberg (1984). This algorithm is based on solving a minimum cut problem on a certain graph that depends on a parameter $\alpha$. Goldberg showed that for a certain value of $\alpha$ (which can be found by binary search), the minimum cut recovers the densest subgraph. One of our contributions is to shed more light on Goldberg's algorithm and show that the same construction allows us to discover all locally-dense subgraphs by varying $\alpha$.
Our second algorithm extends the linear-time algorithm by Charikar (2000) for approximating dense subgraphs. This algorithm first orders vertices by iteratively deleting a vertex with the smallest degree, and then selects the densest subgraph respecting the order. We extend this idea by using the same order, and finding first the densest subgraph respecting the order, and then iteratively finding the second densest subgraph containing the first subgraph, and so on. We show that this algorithm can be executed in linear time and it achieves a factor-2 approximation guarantee.
Charikar's algorithm and the algorithm for discovering a $k$-core decomposition are very similar: they both order vertices by deleting vertices with the smallest degree. We show that this connection runs deep and we demonstrate that a $k$-core decomposition provides a factor-2 approximation for locally-dense decomposition. On the other hand, our experimental evaluation shows that in practice $k$-cores have a different structure than locally-dense subgraphs, and as predicted by the theory, $k$-cores are not always well-aligned with graph density.
It is possible that the decomposition results in a large number of subgraphs. In such a case it may be useful to constrain the number of subgraphs. We approach this problem by defining an optimization criterion for a segmentation of nested subgraphs. The objective function will be based on a statistical model. We will show that to optimize this particular objective, we need to (i) find locally-dense subgraphs, and (ii) reduce the number with a dynamic program. We also show that if we replace the first step with the greedy algorithm, then the resulting algorithm yields a factor-2 approximation guarantee.
The remainder of the paper is organized as follows. We give preliminary notation in Section 2. We introduce the locally-dense subgraphs in Section 3, present algorithms for discovering the subgraphs in Section 4, and describe the connection to $k$-core decomposition in Section 5. We introduce the constrained version of the problem in Section 6. We present the related work in Section 7 and present the experiments in Section 8. Finally, we conclude the paper with discussion in Section 9.
2. Preliminaries
Graph density. Let $G = (V, E)$ be a graph with $n = |V|$ vertices and $m = |E|$ edges. Given a subset of vertices $X \subseteq V$, it is common to define $E(X) = \{(x, y) \in E \mid x, y \in X\}$, that is, the edges of $G$ that have both end-points in $X$. The density of the vertex set $X$ is then defined to be
$$d(X) = \frac{|E(X)|}{|X|},$$
that is, half of the average degree of the subgraph induced by $X$. The set of vertices $X \subseteq V$ that maximizes the density measure $d(X)$ is the densest subgraph of $G$. (Footnote 1: We should point out that density is also often defined as $|E(X)| / \binom{|X|}{2}$. This is not the case for this paper.)
The problem of finding the densest subgraph can be solved in polynomial time. A very elegant solution that involves a mapping to a series of minimum-cut problems was given by Goldberg (1984). As the fastest algorithm to solve the minimum-cut problem runs in $O(nm)$ time, this approach is not scalable to very large graphs. On the other hand, there exists a linear-time algorithm that provides a factor-2 approximation to the densest-subgraph problem (Asahiro et al., 1996; Charikar, 2000). This is a greedy algorithm, which starts with the input graph, and iteratively removes the vertex with the lowest degree, until left with an empty graph. Among all subgraphs considered during this vertex-removal process, the algorithm returns the densest.
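The greedy procedure described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it selects the minimum-degree vertex by a plain scan, so it runs in quadratic rather than linear time, and the function name and the small example graph (a 4-clique, a fifth vertex attached to two clique vertices, and a pendant sixth vertex) are assumptions made for the example.

```python
from fractions import Fraction

def greedy_densest_subgraph(vertices, edges):
    """Peel minimum-degree vertices; return the densest intermediate vertex set."""
    adj = {v: set() for v in vertices}
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    remaining = set(vertices)
    num_edges = len(edges)  # edges induced by the remaining vertices
    best_set, best_density = set(), Fraction(-1)
    while remaining:
        density = Fraction(num_edges, len(remaining))
        if density > best_density:
            best_set, best_density = set(remaining), density
        # Naive selection of a minimum-degree vertex (ties broken arbitrarily).
        v = min(remaining, key=lambda x: len(adj[x]))
        num_edges -= len(adj[v])
        for u in adj[v]:
            adj[u].discard(v)
        remaining.remove(v)
        del adj[v]
    return best_set, best_density

# Hypothetical example graph: 6 vertices, 9 edges.
EXAMPLE_VERTICES = range(6)
EXAMPLE_EDGES = [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3),
                 (0, 4), (1, 4), (4, 5)]
best_set, best_density = greedy_densest_subgraph(EXAMPLE_VERTICES, EXAMPLE_EDGES)
```

On this example the peeling visits vertex sets of density 9/6, 8/5, 6/4, and so on, so the returned set is the five-vertex subgraph rather than the 4-clique.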
Next we will provide graph-density definitions that relate pairs of vertex sets. Given two non-overlapping sets of vertices $X$ and $Y$ we first define the cross edges between $X$ and $Y$ as
$$E(X, Y) = \{(x, y) \in E \mid x \in X,\ y \in Y\}.$$
We then define the marginal edges from $X$ with respect to $Y$. Those are the edges that have one end-point in $X$ and the other end-point in either $X$ or $Y$, that is,
$$E_\Delta(X, Y) = E(X) \cup E(X, Y).$$
The set $E_\Delta(X, Y)$ represents the additional edges that will be included in the induced subgraph of $Y$ if we expand $Y$ by adding $X$.
Assume that $X$ and $Y$ are non-overlapping. Then, we define the outer density of $X$ with respect to $Y$ as
$$d(X, Y) = \frac{|E_\Delta(X, Y)|}{|X|}.$$
That is, these are the extra edges, on average, that we bring to $Y$ if we expand it by appending $X$.
Having defined the special case when $X$ and $Y$ are disjoint, we can now consider the more general case when $X$ and $Y$ overlap. Here we are interested in the outer density of the vertices in $X$ that are not already included in $Y$. Hence, we extend the definition of outer density to the general case by defining
$$d(X, Y) = d(X \setminus Y, Y).$$
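The density and outer-density definitions of this section can be made concrete with a short sketch. It assumes that density is the number of induced edges divided by the number of vertices, and that outer density counts the marginal edges per newly added vertex, with any overlap with $Y$ removed from $X$ first. The function names and the example graph are hypothetical.

```python
from fractions import Fraction

def density(edges, X):
    """d(X): number of edges with both end-points in X, divided by |X|."""
    X = set(X)
    induced = [(u, v) for u, v in edges if u in X and v in X]
    return Fraction(len(induced), len(X))

def outer_density(edges, X, Y):
    """d(X, Y): marginal edges gained per vertex when X \\ Y is added to Y."""
    Y = set(Y)
    new = set(X) - Y          # only vertices not already in Y count
    union = new | Y
    marginal = [(u, v) for u, v in edges
                if u in union and v in union and (u in new or v in new)]
    return Fraction(len(marginal), len(new))

# Hypothetical example: a 4-clique {0,1,2,3}, vertex 4 attached to 0 and 1,
# and a pendant vertex 5 attached to 4.
EXAMPLE_EDGES = [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3),
                 (0, 4), (1, 4), (4, 5)]
clique_density = density(EXAMPLE_EDGES, range(4))
five_density = density(EXAMPLE_EDGES, range(5))
gain_of_4 = outer_density(EXAMPLE_EDGES, {4}, range(4))
gain_of_5 = outer_density(EXAMPLE_EDGES, range(6), range(5))
```

Here adding vertex 4 to the clique contributes 2 edges for 1 vertex, which exceeds the clique's own density of 3/2, so the five-vertex set is denser than the clique.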
$k$-cores. We briefly review the basic background regarding $k$-cores. The concept was introduced by Seidman (1983).
Given a graph $G = (V, E)$, a set of vertices $X \subseteq V$ is a $k$-core if every vertex in the subgraph induced by $X$ has degree at least $k$, and $X$ is maximal with respect to this property. A $k$-core of $G$ can be obtained by recursively removing all the vertices of degree less than $k$, until all vertices in the remaining graph have degree at least $k$.
It is not hard to see that if $\{C_i\}$ is the set of all distinct $k$-cores of $G$ then $\{C_i\}$ forms a nested chain
$$C_1 \subsetneq C_2 \subsetneq \cdots \subsetneq C_\ell = V.$$
Furthermore, the set of vertices that belong in a $k$-core but not in a $(k+1)$-core is called the $k$-shell.
The $k$-core decomposition of $G$ is the process of identifying all $k$-cores (and all $k$-shells). Therefore, the $k$-core decomposition of a graph identifies progressively the internal cores and decomposes the graph shell by shell. A linear-time algorithm to obtain the $k$-core decomposition was given by Matula and Beck (1983). The algorithm starts by provisionally assigning each vertex $v$ to a core of index $\deg(v)$, an upper bound to the correct core of a vertex. It then repeatedly removes the vertex with the smallest degree, and updates the core index of the neighbors of the removed vertex. Note the similarity of this algorithm with the 2-approximation algorithm for the densest-subgraph problem (Charikar, 2000).
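The peeling procedure for $k$-cores can be sketched as below. This is an illustrative version only: it finds the minimum-degree vertex with a linear scan, so it is quadratic, whereas the linear-time algorithm keeps vertices in degree buckets. The function name and example graph are assumptions.

```python
def core_numbers(vertices, edges):
    """Return, for each vertex, the largest k such that it lies in the k-core."""
    adj = {v: set() for v in vertices}
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    degree = {v: len(adj[v]) for v in adj}   # provisional core index
    remaining = set(adj)
    core = {}
    k = 0
    while remaining:
        v = min(remaining, key=degree.__getitem__)
        k = max(k, degree[v])                # core index never decreases
        core[v] = k
        remaining.remove(v)
        for u in adj[v]:
            if u in remaining:
                degree[u] -= 1               # update neighbours of the removed vertex
    return core

# Hypothetical example: 4-clique {0,1,2,3}, vertex 4 attached to 0 and 1, pendant 5.
EXAMPLE_EDGES = [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3),
                 (0, 4), (1, 4), (4, 5)]
core = core_numbers(range(6), EXAMPLE_EDGES)
three_core = {v for v, c in core.items() if c >= 3}
```

The $k$-core is then the set of vertices with core number at least $k$, and the $k$-shell is the set with core number exactly $k$.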
3. Locally-dense graph decomposition
In this section we present the main concept introduced in this paper, the locally-dense decomposition of a graph. We also discuss the properties of this decomposition. We start by defining the concept of a locally-dense subgraph.
Definition 3.1.
A set of vertices $W$ is locally dense if there are no $X \subseteq W$ and $Y$ satisfying $Y \cap W = \emptyset$ such that
$$d(X, W \setminus X) \leq d(Y, W).$$
In other words, for $W$ to be locally dense there should not be an “inside” $X$ and an “outside” $Y$ so that the density that $Y$ brings to $W$ is at least as large as the density that $X$ brings.
For notational simplicity, we will often refer to these sets of vertices as subgraphs.
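For very small graphs the definition can be checked by brute force. The sketch below assumes the reading of Definition 3.1 in which $W$ fails to be locally dense exactly when some non-empty $X \subseteq W$ and some non-empty $Y$ disjoint from $W$ satisfy $d(X, W \setminus X) \leq d(Y, W)$. It enumerates all such pairs, so it is exponential and meant only as an illustration; all names and the example graph are hypothetical.

```python
from fractions import Fraction
from itertools import combinations

def nonempty_subsets(items):
    items = list(items)
    for size in range(1, len(items) + 1):
        for combo in combinations(items, size):
            yield set(combo)

def outer_density(edges, X, Y):
    """Marginal edges per vertex when X (assumed disjoint from Y) is added to Y."""
    union = X | Y
    marginal = sum(1 for u, v in edges
                   if u in union and v in union and (u in X or v in X))
    return Fraction(marginal, len(X))

def is_locally_dense(vertices, edges, W):
    """Brute-force test: no inside X and outside Y with d(X, W\\X) <= d(Y, W)."""
    W = set(W)
    outside = set(vertices) - W
    for X in nonempty_subsets(W):
        inner = outer_density(edges, X, W - X)
        for Y in nonempty_subsets(outside):
            if inner <= outer_density(edges, Y, W):
                return False
    return True

# Hypothetical example: 4-clique {0,1,2,3}, vertex 4 attached to 0 and 1, pendant 5.
V = range(6)
E = [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3), (0, 4), (1, 4), (4, 5)]
clique_ok = is_locally_dense(V, E, {0, 1, 2, 3})
five_ok = is_locally_dense(V, E, {0, 1, 2, 3, 4})
```

In this example the 4-clique is not locally dense, because vertex 4 brings 2 edges per vertex while the clique as a whole brings only 3/2, whereas the five-vertex set passes the test.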
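Interestingly, the property of being locally dense induces a nested chain of subgraphs in $G$.
Proposition 3.2.
Let $U$ and $W$ be locally-dense subgraphs. Then either $U \subseteq W$ or $W \subseteq U$.
Proof.
Assume otherwise. Define $X = U \setminus W$ and $Y = W \setminus U$. Both $X$ and $Y$ should be non-empty sets. Then either $d(X, U \cap W) \leq d(Y, U \cap W)$ or $d(X, U \cap W) > d(Y, U \cap W)$. Assume the former. This implies
$$d(X, U \setminus X) = d(X, U \cap W) \leq d(Y, U \cap W) \leq d(Y, U),$$
which contradicts the fact that $U$ is locally dense. For the first equality we used the fact that $U \setminus X = U \cap W$, while for the last inequality we used the fact that $U \cap W \subseteq U$.
The case $d(X, U \cap W) > d(Y, U \cap W)$ is similar. ∎
The proposition implies that the set of locally-dense subgraphs of a graph forms a nested chain, in the same way that the set of $k$-cores does.
Corollary 3.3.
A set of locally-dense subgraphs can be arranged into a sequence $B_0 \subsetneq B_1 \subsetneq \cdots \subsetneq B_k$, where $k \leq n$. Moreover, $d(B_i, B_{i-1}) > d(B_{i+1}, B_i)$ for $1 \leq i < k$.
The chain of locally-dense subgraphs of a graph $G$, as specified by Corollary 3.3, defines the locally-dense decomposition of $G$.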
Example 3.4.
The locally-dense decomposition of given in Figure 1 is . This is the $k$-core decomposition without . The locally-dense decomposition of given in Figure 1 is . Note that both and are the densest subgraphs in their respective graphs.
We proceed to characterize the locally-dense subgraphs of the decomposition with respect to their global density in the whole graph $G$. We want to characterize the global density of a subgraph $B_i$ of the decomposition. $B_i$ cannot be denser than the previous subgraph $B_{i-1}$ in the decomposition; however, we want to measure the density that the additional vertices $B_i \setminus B_{i-1}$ bring. This density involves edges among vertices of $B_i \setminus B_{i-1}$ and edges from $B_i \setminus B_{i-1}$ to the previous subgraph $B_{i-1}$. This is captured precisely by the concept of outer density defined in the previous section. As the following proposition shows, the outer density of $B_i$ with respect to $B_{i-1}$ is maximized over all subgraphs that contain $B_{i-1}$. In other words, $B_i$ is the densest subgraph we can choose after $B_{i-1}$, given the containment constraint.
Proposition 3.5.
Let $\{B_i\}$ be the chain of locally-dense subgraphs. Then $B_0 = \emptyset$, $B_k = V$, and $B_i$ is the densest subgraph properly containing $B_{i-1}$,
$$B_i = \arg\max_{W \supsetneq B_{i-1}} d(W, B_{i-1}).$$
To prove the proposition we will use the following three lemmas.
Lemma 3.6.
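Let $X$ and $Y$ be two sets of vertices with $X \subsetneq Y$. Assume a third non-empty set $Z$ with $Z \cap Y = \emptyset$. Then one of the following three cases follows:
- $d(Z, Y) < d(Y \cup Z, X) < d(Y, X)$, or
- $d(Z, Y) > d(Y \cup Z, X) > d(Y, X)$, or
- $d(Z, Y) = d(Y \cup Z, X) = d(Y, X)$.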
Proof.
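Write $c = |Y \setminus X| \, / \, |(Y \cup Z) \setminus X|$. We can rewrite $d(Y \cup Z, X)$ as
$$d(Y \cup Z, X) = c \, d(Y, X) + (1 - c) \, d(Z, Y).$$
This shows that either $d(Z, Y) \leq d(Y \cup Z, X) \leq d(Y, X)$ or $d(Y, X) \leq d(Y \cup Z, X) \leq d(Z, Y)$. Since $0 < c < 1$ it follows that $d(Y \cup Z, X) = d(Y, X)$ if and only if $d(Y \cup Z, X) = d(Z, Y)$. The three cases follow. ∎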
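As a quick numerical check of the weighted-average step, consider a hypothetical 6-vertex graph (not necessarily the one in Figure 1): a 4-clique, a fifth vertex adjacent to two clique vertices, and a sixth vertex adjacent only to the fifth. Take $X = \emptyset$, let $Y$ be the first five vertices, and let $Z$ contain the sixth vertex. Reading the rewriting step as $d(Y \cup Z, X) = c \, d(Y, X) + (1 - c) \, d(Z, Y)$ with $c = |Y \setminus X| / |(Y \cup Z) \setminus X|$, we get

$$c = \tfrac{5}{6}, \qquad d(Y, X) = \tfrac{8}{5}, \qquad d(Z, Y) = 1, \qquad d(Y \cup Z, X) = \tfrac{5}{6} \cdot \tfrac{8}{5} + \tfrac{1}{6} \cdot 1 = \tfrac{9}{6} = \tfrac{3}{2}.$$

The combined density $3/2$ lies strictly between $d(Z, Y) = 1$ and $d(Y, X) = 8/5$, which is the first of the three cases.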
Let be the sequence defined as , in case of a tie, select a larger graph, and .
Lemma 3.7.
* for .*
Proof.
We only need to show that the lemma holds . Assume otherwise: .
Write , , and . Since , Lemma 3.6 implies that
[TABLE]
violating the optimality of . ∎
Lemma 3.8.
If and , then .
Proof.
Assume otherwise: . Write , . Lemma 3.6 implies that
[TABLE]
violating the optimality of . ∎
Proof of Proposition 3.5.
We need to show that . Fix and assume inductively that for all .
We will first show that is locally dense: we argue that there are no sets and with and that can serve as certificates for being non locally-dense.
Fix any . Define and for .
We claim that . Let . If , then . Assume that . If , then . If , then . Thus, , which in turns implies that .
This leads to
[TABLE]
This inequality leads to
[TABLE]
Consider also any set with . Due to the optimality of and Lemma 3.6 we must have .
We conclude that for any and with and it is , which shows that is locally dense.
Now, we can safely assume for some . We need to show that . By induction we know that . This guarantees that . Assume . Since is maximal, we have .
Since is locally-dense, we have . Lemma 3.6 now implies that
[TABLE]
which contradicts the optimality of . Thus . ∎
As a consequence of the previous proposition we can characterize the first subgraph in the decomposition.
Corollary 3.9.
Let $\{B_i\}$ be a locally-dense decomposition of a graph $G$. Then $B_1$ is the densest subgraph of $G$.
The above discussion motivates the problem of locally-dense graph decomposition, which is the focus of this paper.
Problem 1.
Given a graph $G$ find a maximal sequence of locally-dense subgraphs
$$\emptyset = B_0 \subsetneq B_1 \subsetneq \cdots \subsetneq B_k = V.$$
4. Decomposition algorithms
In this section we propose two algorithms for the problem of locally-dense graph decomposition (Problem 1). The first algorithm gives an exact solution, and runs in worst-case time $O(n^2 m)$, but it is significantly faster in practice. The second algorithm is a linear-time algorithm that provides a factor-2 approximation guarantee.
Both algorithms are inspired by corresponding algorithms for the densest-subgraph problem. The first algorithm is inspired by the exact algorithm of Goldberg (1984), and the second algorithm by the greedy linear-time algorithm of Charikar (2000).
4.1. Exact algorithm
We start our discussion on the exact algorithm for locally-dense graph decomposition by reviewing Goldberg's algorithm (Goldberg, 1984) for the densest-subgraph problem.
Recall that the densest-subgraph problem asks to find the subset of vertices $W$ that maximizes $d(W)$. Given a graph $G = (V, E)$ and a positive number $\alpha$ define a function
$$f(\alpha) = \max_{W \subseteq V} \left\{ |E(W)| - \alpha |W| \right\}$$
and the maximizer
$$F(\alpha) = \arg\max_{W \subseteq V} \left\{ |E(W)| - \alpha |W| \right\},$$
where ties are resolved by picking the largest $W$. Note that the value of $f$ decreases as $\alpha$ increases, and as $\alpha$ exceeds a certain value, $f$ becomes $0$ by taking $W = \emptyset$. Goldberg observed that the densest-subgraph problem is equivalent to the problem of finding the largest value of $\alpha$ for which the maximizer set $F(\alpha)$ is non-empty. (Footnote 2: This observation is an instance of fractional programming (Dinkelbach, 1967).) The densest subgraph is precisely this maximizer set $F(\alpha)$. Furthermore, Goldberg showed how to find the vertex set $F(\alpha)$, for a given value of $\alpha$. This is done by mapping the problem to an instance of the min-cut problem, which can be solved in $O(nm)$ time, in a recent breakthrough by Orlin (2013). We will present an extension of this transformation in the next section, where we discuss how to speed up the algorithm.
Thus, Goldberg's algorithm uses binary search over $\alpha$ and finds the largest value of $\alpha$ for which the maximizer set $F(\alpha)$ is non-empty. Each iteration of the binary search involves a call to a min-cut instance for the current value of $\alpha$.
Our algorithm for finding the locally-dense decomposition of a graph builds on Goldberg's algorithm (Goldberg, 1984). We show that Goldberg's construction has the following, rather remarkable, property: there is a sequence of values , for , which gives all the distinct values of the function $F$. Furthermore, the corresponding set of subgraphs is exactly the set of all locally-dense subgraphs of $G$, and thus the solution to our decomposition problem.
Therefore, our algorithm is a simple extension of Goldberg's algorithm: instead of searching only for the optimal value , we find the whole sequence of $\alpha$'s and the corresponding subgraphs.
Next we prove the claimed properties and discuss the algorithm in more detail.
We first show that the distinct maximizers of the function $F$ correspond to the set of locally-dense subgraphs.
Proposition 4.1.
Let be the set of locally-dense subgraphs. Then
[TABLE]
Proof.
We first show that is a locally-dense subgraph, for any . Note that for any , we must have , otherwise we can delete from and obtain a better solution which violates the optimality of . This implies that . Similarly, for any such that , we have or, equivalently, . Thus, is locally-dense.
Fix and select such that . Let . If , then, due to Corollary 3.3, which we can rephrase as
[TABLE]
If we delete from , then we improve the quality exactly by , that is, we obtain a better solution which violates the optimality of . If , then Corollary 3.3 implies that , so we can add to obtain a better solution. It follows that . â
Next we need to show that it is possible to search efficiently for the sequence of $\alpha$'s that give the set of locally-dense subgraphs. To that end we will show that if we have obtained two subgraphs of the decomposition (corresponding to values ), it is possible to pick a new value $\alpha$ so that computing $F(\alpha)$ allows us to make progress in the search process: we either find a new locally-dense subgraph or we establish that no such subgraph exists between and , in other words, and are consecutive subgraphs in our decomposition.
Proposition 4.2.
Let be the set of locally-dense subgraphs. Let be two subgraphs. Set and let . If , then . If , then .
Lemma 4.3.
, for . The equality holds if and only if and .
Proof.
Corollary 3.3 states that $d(B_i, B_{i-1})$ is monotonically strictly decreasing as a function of $i$. Lemma 3.6, applied recursively, states that
[TABLE]
The inequality is strict if and only if or . ∎
Proof of Proposition 4.2.
Lemma 4.3 states that . Proposition 4.1 now implies that .
Assume that . Lemma 4.3 implies that . Write
[TABLE]
Let us now bound the difference between the densities as
[TABLE]
This implies that . Proposition 4.1 now implies that .
Assume that . Lemma 4.3 implies that , and the same argument as above shows that and, consequently, . This guarantees that . â
The exact decomposition algorithm uses Proposition 4.2 to guide the search process. Starting with the two extreme subgraphs of the decomposition, $B_0 = \emptyset$ and $B_k = V$, the algorithm maintains a sequence of locally-dense subgraphs. Recursively, for any two currently-adjacent subgraphs in the sequence, we use Proposition 4.2 to check whether the two subgraphs are consecutive or not in the decomposition. If they are consecutive, the recursion at that branch of the search is terminated. If they are not, a new subgraph between the two is discovered and it is added to the decomposition. The algorithm is named ExactLD and it is illustrated as Algorithm 1.
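The recursive search can be illustrated on tiny graphs with the sketch below. It is not Algorithm 1 itself: the maximizer of $|E(W)| - \alpha |W|$ is found by enumerating all vertex subsets instead of solving a min-cut, and the threshold $\alpha = d(Y, X) + 1/n^2$ is an assumption of this sketch, chosen because two distinct fractions with denominators at most $n$ differ by at least $1/n^2$. All function names and the example graph are hypothetical.

```python
from fractions import Fraction
from itertools import combinations

def maximizer(vertices, edges, alpha):
    """Brute-force argmax of |E(W)| - alpha*|W|, ties resolved to the largest W."""
    best, best_key = frozenset(), (Fraction(0), 0)   # the empty set scores 0
    for size in range(1, len(vertices) + 1):
        for combo in combinations(vertices, size):
            W = frozenset(combo)
            induced = sum(1 for u, v in edges if u in W and v in W)
            key = (induced - alpha * len(W), len(W))
            if key > best_key:
                best, best_key = W, key
    return best

def exact_ld(vertices, edges):
    """Return the chain of locally-dense subgraphs, smallest first."""
    vertices = list(vertices)
    n = len(vertices)
    chain = {frozenset(), frozenset(vertices)}

    def search(X, Y):
        new = Y - X
        marginal = sum(1 for u, v in edges
                       if u in Y and v in Y and (u in new or v in new))
        # Assumed threshold: slightly above the outer density d(Y, X).
        alpha = Fraction(marginal, len(new)) + Fraction(1, n * n)
        Z = maximizer(vertices, edges, alpha)
        if Z != X:
            assert X < Z < Y          # a new subgraph strictly between X and Y
            chain.add(Z)
            search(X, Z)
            search(Z, Y)
        # Otherwise X and Y are consecutive and this branch terminates.

    search(frozenset(), frozenset(vertices))
    return sorted(chain, key=len)

# Hypothetical example: 4-clique {0,1,2,3}, vertex 4 attached to 0 and 1, pendant 5.
E = [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3), (0, 4), (1, 4), (4, 5)]
decomposition = exact_ld(range(6), E)
```

On the example the search first discovers the five-vertex subgraph and then confirms that no further subgraph lies on either side of it. A practical implementation would replace `maximizer` with the min-cut construction discussed in the text.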
With the next propositions we prove the correctness of the algorithm and we bound its running time.
Proposition 4.4.
The algorithm ExactLD initiated with input $(\emptyset, V)$ visits all non-trivial locally-dense subgraphs of $G$.
Proof.
Let be the set of locally-dense subgraphs. We will prove the proposition by showing that for , the algorithm visits all monotonic subgraphs that are between and . We will prove this by induction over . The first step is trivial. Assume that . Then Proposition 4.2 implies that , where . The inductive assumption now guarantees that and will visit all monotonic subgraphs between and . â
Proposition 4.5.
The worst-case running time of algorithm ExactLD is $O(n^2 m)$.
Proof.
We will show that the algorithm ExactLD, initiated with input makes calls to the function , where is the number of locally-dense subgraphs.
Let be the number of calls of when the input parameter . Out of these calls one call will result in . There are such calls, since is never tested. Each of the remaining calls will discover a new locally-dense subgraph. Since there are new subgraphs to discover, it follows that calls to are needed.
Since a call to $F$ corresponds to a min-cut computation, which has running time $O(nm)$ (Orlin, 2013), and since $k \leq n$, the claimed running-time bound follows. ∎
4.2. Speeding up the exact algorithm
Our next step is to speed-up ExactLD. This speed-up does not improve the theoretical bound for the computational time but, in practice, it improves the performance of the algorithm dramatically.
The speed-up is based on the following observation. We know from Proposition 4.2 that visits only subgraphs with the property . This gives us immediately the first speed-up: we can safely ignore any vertex outside , that is, will yield the same output.
Our second observation is that any subgraph visited by must contain vertices . However, we cannot simply delete them because we need to take into account the edges between and . To address this let us consider the following maximizer
[TABLE]
We can replace the original in Algorithm 1 with . To compute we will use a straightforward extension of Goldberg's algorithm (Goldberg, 1984) and transform this problem into a problem of finding a minimum cut.
In order to do this, given a graph , let us define a weighted graph that consists of vertices and edges with weights of 1. Add two auxiliary vertices and into and connect these vertices to every vertex in . Given a vertex , assign a weight of to the edge and a weight of
[TABLE]
to the edge , where stands for the number of neighbors of in . We claim that solving a minimum cut such that and are in different cuts will solve . This cut can be obtained by constructing a maximum flow from to .
To prove this claim let be a subset of vertices containing and not containing . Let and also let . There are three types of cross-edges from to : (i) edges from to , (ii) edges from to , and (iii) edges from to . The total cost of is then
[TABLE]
We claim that the last two terms of the cost are equal to . To see this, consider an edge in . This implies that at least one of the end points, assume it is , has to be in . There are three different cases for : (i) if , then contributes 2 to the cost: 1 to and 1 to , (ii) if , then contributes to , and (iii) if , then contributes to and to the third term. Thus, we can write the cut as
[TABLE]
The first two terms on the right-hand side are constant, which implies that finding the minimum cut is equivalent to maximizing . Consequently, if is the min-cut solution, then .
Note that the graph does not have vertices included in . By combining both speed-ups we are able to reduce the running time of by considering only the vertices that are in .
4.3. Linear approximation algorithm
As we saw in the last section, the exact algorithm can be significantly accelerated, and indeed, our experimental evaluation shows that it is possible to run the exact algorithm for a graph of millions of vertices and edges within 2 minutes. Nevertheless, the worst-case complexity of the algorithm is cubic, and thus, it is not truly scalable for massive graphs.
Here we present a more lightweight algorithm for performing a locally-dense decomposition of a graph. The algorithm runs in linear time and offers a factor-2 approximation guarantee. As the exact algorithm builds on Goldberg's algorithm for the densest-subgraph problem, the linear-time algorithm builds on Charikar's approximation algorithm for the same problem (Charikar, 2000). As already explained in Section 2, Charikar's approximation algorithm iteratively removes the vertex with the lowest degree, until left with an empty graph, and returns the densest graph among all subgraphs considered during this process.
Our extension to this algorithm, called GreedyLD, is illustrated in Algorithm 2, and it operates in two phases. The first phase is identical to the one in Charikar's algorithm: all vertices of the graph are iteratively removed, in increasing order of their degree in the current graph. In the second phase, the algorithm proceeds to discover approximate locally-dense subgraphs, in an iterative manner, from to . The first subgraph is the approximate densest subgraph, the same one returned by Charikar's algorithm. In the -th step of the iteration, having discovered subgraphs the algorithm selects the subgraph that maximizes the density . To select the algorithm considers subsets of vertices only in the degree-based order that was produced in the first phase.
Discovering from the ordered vertices takes time, if done naively. However, it is possible to implement this step in time. In order to do this, sort vertices in the reverse visit order, and define to be the number of edges of from the earlier neighbors. Then, we can express the density as an average,
[TABLE]
Consequently, we can see that recovering is an instance of the following problem,
Problem 2.
Given a sequence , compute the maximal interval
[TABLE]
Luckily, Calders et al. (2014) demonstrated that we can use the classic PAVA algorithm by Ayer et al. (1955) to solve this problem for every value of in total time.
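To make the reduction concrete, the sketch below computes the back-edge counts for a given reverse visit order and then repeatedly picks the longest interval with the largest average, starting right after the previous interval. It is a hedged illustration only: the function names are hypothetical, and the interval search is the naive quadratic scan, not the PAVA-based procedure cited above. Averages are compared by cross-multiplication so that ties are resolved exactly.

```python
def back_edge_counts(adj, rev_order):
    """For each vertex, count its neighbours that appear earlier in rev_order."""
    pos = {v: i for i, v in enumerate(rev_order)}
    return [sum(1 for u in adj[v] if pos[u] < pos[v]) for v in rev_order]


def greedy_intervals(g):
    """Split g into consecutive intervals, each of maximal average.

    Returns the exclusive end index of every interval. The prefix of the
    vertex order ending at each index is one subgraph of the decomposition.
    """
    prefix = [0]
    for x in g:
        prefix.append(prefix[-1] + x)
    ends, start, n = [], 0, len(g)
    while start < n:
        best = start + 1
        for end in range(start + 2, n + 1):
            # avg(start, end) >= avg(start, best), without floating point
            lhs = (prefix[end] - prefix[start]) * (best - start)
            rhs = (prefix[best] - prefix[start]) * (end - start)
            if lhs >= rhs:  # ">=" keeps the longest interval on ties
                best = end
        ends.append(best)
        start = best
    return ends
```

On a 4-clique with a two-vertex tail attached, the first interval covers the clique and the second covers the rest of the graph.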
To quantify the approximation guarantee of GreedyLD, note that the sequence of approximate locally-dense subgraphs produced by the algorithm is not necessarily aligned with the locally-dense subgraphs of the optimal decomposition. In other words, to assess the quality of the density of an approximate locally-dense subgraph produced by GreedyLD, there is no direct counterpart in the optimal decomposition to compare against. To overcome this difficulty we develop a scheme of "vertex-wise" comparison, where for any size threshold, the density of the smallest approximate locally-dense subgraph of at least that size is compared with the density of the smallest optimal locally-dense subgraph of at least that size. This is defined below via the concept of profile.
Definition 4.6.
Let be a nested chain of subgraphs, the first subgraph being the empty graph and the last subgraph being the full graph. For an integer , define
[TABLE]
to be the index of the smallest subgraph in whose size is at least . We define a profile function to be
[TABLE]
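A profile can be sketched in a few lines. The sketch assumes that the value attached to a subgraph of the chain is its marginal density with respect to its predecessor, that is, the number of new edges divided by the number of new vertices. This reading is an assumption here, since the displayed formula is not reproduced above. The function names and the input format (a list of nested vertex sets, without the leading empty graph) are hypothetical.

```python
def induced_edges(adj, vertices):
    """Number of edges of the undirected graph adj with both ends in vertices."""
    vs = set(vertices)
    return sum(1 for v in vs for u in adj[v] if u in vs) // 2


def profile(adj, chain):
    """Return a list p where p[x - 1] is the profile value for size x.

    chain is a list of strictly nested vertex sets ending with all vertices.
    """
    values, prev_size, prev_edges = [], 0, 0
    for block in chain:
        size, edges = len(block), induced_edges(adj, block)
        marginal = (edges - prev_edges) / (size - prev_size)
        # every size x with prev_size < x <= size maps to this subgraph
        values.extend([marginal] * (size - prev_size))
        prev_size, prev_edges = size, edges
    return values
```

Two decompositions of the same graph can then be compared entry by entry, even when their subgraphs have different sizes.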
Our approximation guarantee is now expressed as a guarantee of the profile function of the approximate decomposition with respect to the optimal decomposition.
Proposition 4.7.
Let be the set of locally-dense subgraphs. Let be the subgraphs obtained by GreedyLD. Then
[TABLE]
First, we need the following lemma.
Lemma 4.8.
, for ,
Proof.
Assume otherwise. Lemma 3.6 now states that , which violates the optimality of as indicated by Proposition 3.5. ∎
Proof of Proposition 4.7.
Sort the set of vertices according to the reverse visiting order of GreedyLD and let be the number of edges of from earlier neighbors.
Fix to be an integer, and let be the smallest subgraph such that . Let be the last vertex occurring in . We must have , and, due to Lemma 4.8, . In summary, we have
[TABLE]
Let be the smallest subgraph such that . Let be the vertex with the smallest index that is still in and define . Let be the degree of right before is removed during GreedyLD. Note that, by definition, , and that
[TABLE]
This leads to
[TABLE]
where the optimality of implies the first inequality. ∎
We should point out that is equal to the density of the densest subgraph, while is equal to the density of the subgraph discovered by Charikar's algorithm. Consequently, Proposition 4.7 automatically provides the 2-approximation guarantee of Charikar's algorithm.
We should also point out that can be larger than . However, if is the first index, for which , then Proposition 3.5 guarantees that .
5. Locally-dense subgraphs and core decomposition
Here we study the connection of graph cores, obtained with the well-known $k$-core decomposition algorithms, with local density, as studied in this paper. We show that, from a theoretical point of view, graph cores approximate the optimal locally-dense graph decomposition as well as the subgraphs obtained by the GreedyLD algorithm do. In particular, we show a result similar to Proposition 4.7, namely, a factor-2 approximation on the profile function of the core decomposition.
However, as we will see in our empirical evaluation, the two algorithms, GreedyLD and $k$-core decomposition, behave differently in practice: GreedyLD generally yields denser subgraphs that are closer to the ones given by the exact locally-dense decomposition.
Before stating and proving the result regarding $k$-cores, recall that a set of vertices is a $k$-core if every vertex in the subgraph it induces has degree at least $k$, and the set is maximal with respect to this property. A linear-time algorithm for obtaining all $k$-cores is illustrated in Algorithm 3.
It is a well-known fact that the set of all $k$-cores of a graph forms a nested chain of subgraphs, in the same way that locally-dense subgraphs do.
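The standard peeling procedure for core numbers can be sketched as follows. This is not a transcription of Algorithm 3: the names are hypothetical, and the heap-based version shown here carries a logarithmic overhead compared to the linear-time bucket implementation. A vertex belongs to the $k$-core exactly when its core number is at least $k$, which also makes the nesting visible.

```python
import heapq


def core_numbers(adj):
    """Map each vertex to the largest k such that it lies in the k-core."""
    deg = {v: len(list(ns)) for v, ns in adj.items()}
    heap = [(d, v) for v, d in deg.items()]
    heapq.heapify(heap)
    core, k = {}, 0
    while heap:
        d, v = heapq.heappop(heap)
        if v in core or d != deg[v]:
            continue  # stale heap entry
        k = max(k, d)  # the core level never decreases while peeling
        core[v] = k
        for u in adj[v]:
            if u not in core:
                deg[u] -= 1
                heapq.heappush(heap, (deg[u], u))
    return core


def k_core(adj, k):
    """Vertices of the k-core, derived from the core numbers."""
    return {v for v, c in core_numbers(adj).items() if c >= k}
```

Since the cores are obtained by thresholding a single number per vertex, a larger threshold always yields a subset of the vertices kept by a smaller one.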
Proposition 5.1.
Let be the set of all -cores of a graph . Then forms a nested chain,
[TABLE]
Similar to Proposition 4.7, $k$-cores provide a factor-2 approximation with respect to the locally-dense subgraphs. The proof is in fact quite similar to that of Proposition 4.7.
Proposition 5.2.
Let be the set of locally-dense subgraphs. Let be the set of -cores. Then
[TABLE]
Proof.
Sort according to the reverse visiting order of Core and let be the number of edges of from earlier neighbors.
Fix to be an integer, and let be the smallest subgraph such that . Let be the last vertex occurring in . We must have , and, due to Lemma 4.8, . In summary, we have
[TABLE]
Let be the smallest core such that , and write . Let be the vertex with the smallest index that is still in , and let be the vertex with the largest index that is still in , that is, .
If , then , otherwise is not a core. If , then , otherwise , and since , then is not the smallest core with at least vertices, which is a contradiction. Hence, .
Let be the degree of right before is removed during Core. We now have
[TABLE]
which proves the proposition. ∎
6. Segmentation problem: constraining the number of subgraphs
It is possible that the decomposition yields a large number of subgraphs. In such a case it may be useful to constrain the number of subgraphs. In order to do so we need to define an optimization criterion, which will be our first step. We then demonstrate how to solve the problem exactly, and how to estimate the solution efficiently.
6.1. Problem definition
Our goal is to discover nested subgraphs that minimize a certain cost. We base the cost on the degree of a node, relative to the subgraph. A natural approach here is to model the degree, that is, our goal is to maximize the log-likelihood , where is the smallest subgraph containing and is a parameter of the distribution. Unfortunately, this is problematic due to the following reason: an edge , where increases the degrees of both and , whereas an edge , with and increases the degrees only for and not for . The distribution we will consider favors small degrees, so this leads to a scenario where the cost function implicitly favors having a lot of cross-edges. To rectify this problem we introduce the notion of adjusted degree, where we count each cross-edge twice.
Definition 6.1.
Assume a sequence of nested subgraphs . Let be a vertex and let be the smallest set containing . Define the adjusted degree as
[TABLE]
To reduce the clutter, we typically omit from the notation and write .
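The following sketch illustrates one reading of the adjusted degree: for a vertex, edges to vertices in the same segment count once, edges to vertices in strictly inner subgraphs (the cross-edges that the plain degree would miss on the inner side) count twice, and edges to outer vertices are not counted. The displayed definition is not reproduced above, so this interpretation, the function name, and the input format are assumptions.

```python
def adjusted_degrees(adj, chain):
    """Adjusted degree of every vertex for a chain of nested vertex sets.

    chain lists the nested subgraphs from innermost to outermost,
    and the last set must contain every vertex of adj.
    """
    level = {}
    for i, block in enumerate(chain):
        for v in block:
            level.setdefault(v, i)  # index of the smallest set containing v
    result = {}
    for v in adj:
        same = sum(1 for u in adj[v] if level[u] == level[v])
        inner = sum(1 for u in adj[v] if level[u] < level[v])
        result[v] = same + 2 * inner  # each cross-edge is counted twice
    return result
```

Under this reading, the adjusted degrees within one segment sum to twice the number of edges that the segment adds to the subgraph, which ties the cost directly to density.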
Next we give a formal definition of the problem.
Definition 6.2.
Assume that we are given a distribution for the adjusted degree. This distribution has one parameter ; small values indicate the likelihood of high degrees. Given a graph and an integer , find a -segmentation, a sequence of nested subgraphs and parameters , minimizing the negative log-likelihood
[TABLE]
where is the index of the smallest containing .
We write this problem as a minimization problem because the log-likelihood is typically negative, and establishing approximation guarantees requires the cost function to be positive.
We are specifically interested in geometric and exponential distributions. Both distributions can be written as , where is the normalization constant. (The geometric distribution is defined over the integers whereas the exponential distribution is defined over the real domain. This results in different normalization constants.) Moreover, smaller values of will result in a distribution favoring larger degrees, that is, inner subgraphs should be denser.
6.2. Exact algorithm
In this section we demonstrate how to find an optimal segmentation using locally-dense subgraphs. First we prove the key proposition that states that it is enough to use locally-dense subgraphs when looking for the optimal segmentation.
Proposition 6.3.
Assume that is either exponential or geometric distribution. Then there is an optimal segmentation such that each is locally-dense.
To prove the proposition, we need the following technical lemma.
Lemma 6.4.
Let be the optimal solution, and assume some of the subgraphs are not locally-dense. Then there is that is not locally-dense along with the violating sets and such that and .
Proof.
Let be a set that is not locally-dense, and let and be the violating sets. Next we argue that we can safely assume that and . We will split the argument in two cases: Case (): and Case (): .
Assume Case (). If , then redefine as . In such case, and are still violating the local density but now we can use Case (). Assume that . Define and . Note that . Assume that . Then
[TABLE]
Redefine as , as , and increase by 1. The previous arguments show that new and violate the local density of , so we repeat our argument with either Case () or Case ().
Assume now that . This forces . Since is a weighted average of and , we have . Redefine as , and apply Case ().
Assume Case (). Write and . If , then we are done; assume otherwise. If , then we can replace with to complete the argument. Assume that .
Assume . If , then we can replace with to complete the argument. Assume . Note that is a weighted average of and . This implies that .
On the other hand, if , then , and .
Combining everything gives us
[TABLE]
Redefine as , as and decrease by one, and repeat Case ().
Note that we first do at most repetitions of Case (), and then at most repetitions of Case (). After a finite number of repetitions we end up with that satisfies the conditions. This completes the proof. ∎
Proof of Proposition 6.3.
Both geometric and exponential distributions can be written as , where is the normalization constant (depending on ).
Write . We can write the optimization function as
[TABLE]
where is the normalization constant for the parameter .
Assume that is not locally-dense, that is, there is and that violate the local density. Lemma 6.4 states that we can safely assume that and . This allows us to either remove from or add to without changing the other sets.
The cost of the th and the th segment is equal to
[TABLE]
Let us define . Due to the equality
[TABLE]
the cost can be rewritten as
[TABLE]
by setting , and . We would like to vary while keeping the remaining variables constant; let us define
[TABLE]
Note that the last two terms do not depend on . Due to optimality of , we have , or
[TABLE]
where the last equality is due to Eq. 1. We can rewrite the inequality as
[TABLE]
where the last inequality follows from the fact that and violate the local density of , and since . We can rewrite the left-hand side and the right-hand side as
[TABLE]
or .
We have shown that if there is that is not locally-dense, we can delete some vertices from without sacrificing the quality. We continue this until all are locally-dense; the process must end because at each step we reduce the size of some . ∎
The proposition gives us means to compute the optimal segmentation. First we discover locally-dense decomposition, say, . If the number of subgraphs is less than or equal to , we are done. Otherwise, we group subgraphs until we reach . The optimal grouping can be done with a dynamic program. Write to be the cost of partial -segmentation using only . We have the identity
[TABLE]
and is the optimal parameter for modeling . This identity allows us to compute recursively with a dynamic program. Note that the monotonicity of the segmentation (that is, the inner subgraphs should be more dense) is automatically guaranteed. We will refer to this algorithm as .
Computing can be done in constant time. To see this, let be the number of nodes in . Let also
[TABLE]
be the sum of all adjusted degrees in . Note and can be maintained in constant time. Then the corresponding costs for the geometric and exponential distributions are
[TABLE]
Let us consider computational complexity. Discovering locally-dense decomposition can be done in time, whereas the actual segmentation can be done in time, where is the number of subgraphs in locally-dense decomposition. In practice, so the segmentation step is relatively cheap. However, if is large, it is possible to achieve approximation for the segmentation in linear time (Guha et al., 2006; Tatti, 2019).
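The grouping step can be sketched as a textbook dynamic program over the subgraphs of the decomposition. The sketch makes several assumptions that are not taken from the text above: the names are hypothetical, the input is given as cumulative vertex and edge counts of the nested subgraphs, the sum of adjusted degrees of a merged segment is taken to be twice the number of edges it adds, and the segment cost is the negative log-likelihood of a plain exponential density with its rate set to the maximum-likelihood value. The cost requires every candidate segment to add at least one edge.

```python
import math


def exponential_cost(count, total):
    """Negative log-likelihood of count values summing to total under an
    exponential density whose rate is the maximum-likelihood value count/total."""
    return count * (1.0 + math.log(total / count))


def segment_chain(sizes, edges, k, cost=exponential_cost):
    """Group a nested chain into at most k segments of minimum total cost.

    sizes[i] and edges[i] are the vertex and edge counts of the i-th subgraph.
    Returns (cuts, total_cost), where cuts are 1-based indices of the
    subgraphs that close a segment.
    """
    levels = len(sizes)
    k = min(k, levels)
    size, edge = [0] + list(sizes), [0] + list(edges)
    inf = float("inf")
    opt = [[inf] * (levels + 1) for _ in range(k + 1)]
    back = [[0] * (levels + 1) for _ in range(k + 1)]
    opt[0][0] = 0.0
    for q in range(1, k + 1):              # number of segments used so far
        for j in range(q, levels + 1):     # last subgraph covered
            for i in range(q - 1, j):      # previous segment ends at i
                if opt[q - 1][i] == inf:
                    continue
                c = opt[q - 1][i] + cost(size[j] - size[i],
                                         2 * (edge[j] - edge[i]))
                if c < opt[q][j]:
                    opt[q][j], back[q][j] = c, i
    cuts, j = [], levels
    for q in range(k, 0, -1):              # walk the back-pointers
        cuts.append(j)
        j = back[q][j]
    return cuts[::-1], opt[k][levels]
```

The triple loop is quadratic in the number of subgraphs for each value of the segment count, which is cheap whenever the decomposition is much smaller than the graph.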
6.3. Approximation algorithm
As pointed out above, the bottleneck of the exact algorithm is the locally-dense decomposition step. For large graphs we can significantly speed up this step by using the faster algorithm GreedyLD. The next proposition shows that this yields a 2-approximation guarantee, if we use the geometric distribution.
Proposition 6.5.
Let be the geometric distribution. Let be the optimal segmentation, and let be the optimal segmentation using the sets obtained from GreedyLD. Then .
Before proving the result, we need to introduce some notation. The geometric distribution can be written as
[TABLE]
where is the normalization constant.
To prove the result let us enumerate the vertices, that is, , and assume that this order respects the optimal segmentation , implies that . Let be the optimal parameters for . We write to be the parameter that is used to model , where is the smallest subgraph containing . Write to be the sum of normalization constants. Note that . Given a sequence , we define
[TABLE]
Define with . Note that .
Define an order for vertex indices , vertices with high degree first, that is, . Define a sequence with .
Lemma 6.6.
**
Proof.
Define as . We argue first that . We can rewrite
[TABLE]
Since , we have . To prove , note that
[TABLE]
That is, let be any vertex order, and let be the degree sequence . Then sorting the vertices with bubble sort from to will not increase the sum in at any step. Consequently, . Since this holds for any order, , which proves the lemma. ∎
Let be the reverse order of indices in which GreedyLD removes the vertices, and let be the degree of during its removal.
Lemma 6.7.
.
Proof.
Consider two sets and . Assume that when treated as sets, that is, there are indices and with such that . Let be the degree of when deleting . Since GreedyLD deletes the vertex with the smallest degree, . Consequently, .
Assume the opposite case: . Due to the pigeonhole principle, there is such that . Thus, . ∎
Proof of Proposition 6.5.
Define as . Note that . Thus,
[TABLE]
Consider a segmentation respecting the order and having the same sizes as , . The value corresponds to the log-likelihood of and the parameters , and corresponds to the log-likelihood of and the optimized parameters. Thus, .
We have shown that there is a segmentation respecting the order chosen by GreedyLD that is at most . Thus, the optimal segmentation respecting the order is also at most . The argument in the proof of Proposition 6.3 can now be used to show that we can safely assume that the segmentation uses sets returned by GreedyLD. ∎
We can show a similar result for the exponential distribution as long as the original graph does not have any singletons.
Proposition 6.8.
Let be the exponential distribution. Assume that has no singletons. Let be the optimal segmentation, and let be the optimal segmentation using the sets obtained from GreedyLD. Then .
Proof.
Similarly to the geometric distribution, exponential distribution can be written as
[TABLE]
Let be as defined in proof of Proposition 6.5, that is, it is total sum of the normalization constants. To prove the result we only need to show that , and we can use the proof of Proposition 6.5. Note that , and the optimal for a segment is . This leads to
[TABLE]
To prove the result we will show that . It is enough to prove the case as due to Proposition 6.3 the densities are monotonic.
Let be any subset of vertices. As there are no singletons, . This leads to
[TABLE]
Set to complete the proof. ∎
We should point out that these results also work if the graph has weights on the edges. However, in such a case, Proposition 6.8 requires weights to be larger than or equal to 1.
7. Related work
This paper is an extension of previously published work (Tatti and Gionis, 2015), and in this extension we introduce the segmentation problem, where we constrain the number of subgraphs. Danisch et al. (2017) introduced an alternative iterative technique for computing locally-dense decomposition that scales well in practice.
Our paper is related to previous work on discovering dense subgraphs, clique-like structures, and hierarchical communities. We review some representative work on these topics.
Clique relaxations. The densest possible subgraph is a clique. Unfortunately, finding large cliques is computationally intractable (Håstad, 1996). Additionally, the notion of clique does not provide a robust definition for practical situations, as a few absent edges may completely destroy the clique. To address these issues, researchers have come up with relaxed clique definitions. A relaxation, the $k$-plex, was suggested by Seidman and Foster (2010). In a $k$-plex a vertex can have at most $k-1$ absent edges. Unfortunately, discovering maximal $k$-plexes is also an NP-hard problem (Balasundaram et al., 2011). An alternative relaxation for a clique is the one of an $n$-clique, a maximal subgraph where each vertex is connected to every other vertex with a path, possibly outside of the subgraph, of length at most $n$ (Bron and Kerbosch, 1973). So, according to this definition a clique is a $1$-clique. As maximal $n$-cliques may produce sparse graphs, the concept of $n$-clans was also proposed by limiting the diameter of the subgraph to be at most $n$ (Mokken, 1979). Since a $1$-clan corresponds to a maximal clique, discovering $n$-clans is a computationally intractable problem.
Quasi-cliques. For the definition of graph density we have chosen to work with the average degree of the induced subgraph. While this is a popular density definition, there are other alternatives. One such alternative would be to divide the number of edges present in the subgraph by the total number of possible edges, that is, by the number of vertex pairs in the subgraph. This would give us a normalized density score that is between 0 and 1. Subgraphs that maximize this density definition are called quasi-cliques, and algorithms for enumerating all quasi-cliques, which can be exponentially many, have been proposed by Abello et al. (2002) and Uno (2010). However, the definition of quasi-cliques is problematic. Note that a single edge already provides maximal density. Consequently, additional objectives are needed. One natural objective is to maximize the size of a graph with density of 1; however, this makes the problem equivalent to finding a maximal clique which, as mentioned above, is a computationally intractable problem (Håstad, 1996).
Alternative definitions for density. Other definitions of graph density have been proposed. Recently, Tsourakakis proposed to measure density by counting triangles, instead of counting edges (Tsourakakis, 2015). Interestingly enough, it is possible to find an approximate densest subgraph under this definition. An interesting future direction for our work is to study if the decomposition proposed in this paper can be extended for the triangle-density definition. Density definitions of the form , where and are some increasing functions were studied by Tsourakakis et al. (2013), with specific focus on . It is not known whether the densest-subgraph problem according to this definition is polynomial-time solvable or NP-hard. Finally, a variant adapted for directed graphs, along with a polynomial-time discovery algorithm, was suggested by Khuller and Saha (2009). Such a definition could serve for defining decompositions of directed graphs, which is also left for future work.
Hierarchical communities. A classic technique for modelling the hierarchical nature of communities is a hierarchical blockmodel (Clauset et al., 2008). Here we are given a tree, where the leaves are the vertices of the original graph and each vertex in the tree is given a probability. We then model an edge with a probability given to the lowest common ancestor of and . Tatti and Gionis (2013) studied a restricted version of this problem where the tree yields a nested structure, with inner communities being denser. Unfortunately, no exact polynomial-time algorithm is known for the restricted or general problem. On the other hand, in the segmentation problem we based the model on degrees and not individual edges. This allowed us to solve the problem exactly.
8. Experimental evaluation
We will now present our experimental evaluation. We tested the two proposed algorithms, ExactLD and GreedyLD, for decomposing a graph into locally-dense subgraphs, and we contrast the resulting decompositions against $k$-cores, obtained with the Core algorithm. We compare the three algorithms in terms of running time, decomposition size (number of subgraphs they provide), and relative density of the subgraphs they return. We also use the Kendall-$\tau$ statistic to measure how similar the decompositions are in terms of the order they induce on the graph vertices.
8.1. Experimental setup
We performed our evaluation on 13 graphs of different sizes and densities. A short description of the graphs is given below, and their basic characteristics can be found in Table 1.
- dolphins: an undirected social network of frequent associations between dolphins in a community living off Doubtful Sound in New Zealand.
- karate: the social network of friendships between members of a karate club at a US university in the 1970s.
- lesmis: co-appearance of characters in the novel Les Miserables by Victor Hugo.
- astro: a co-authorship network among arXiv Astro Physics publications.
- enron: an e-mail communication network by Enron employees.
- fb1912: an ego-network obtained from Facebook.
- hepph: a co-authorship network among arXiv High Energy Physics publications.
- dblp: a co-authorship network among computer science researchers.
- gowalla: a friendship network of gowalla.com.
- roadnet: a road network of California, where vertices represent intersections and edges represent road segments.
- skitter: an internet topology graph, obtained from traceroutes run daily in 2005.
- airports: US flight traffic in January 2016 (http://www.transtats.bts.gov/), where vertices represent airports and weighted edges flight routes. The weights represent the number of flights between two airports.
- trains: UK train routes (http://data.atoc.org/). The vertices represent medium or large exchange points (stations), while the weighted edges represent scheduled routes. The weights represent the number of routes in a single week.
The first three datasets were obtained from the UC Irvine Network Data Repository (http://networkdata.ics.uci.edu/index.php), and the remaining datasets, except for airports and trains, were obtained from the Stanford SNAP Repository (http://snap.stanford.edu/data).
We applied Core, GreedyLD, and ExactLD to every dataset. We used a computer equipped with a 3GHz Intel Core i7 and 8GB of RAM. The implementation is available at https://version.helsinki.fi/dacs
8.2. Results
We begin by reporting the running times of the three algorithms for all of our datasets. They are shown in Table 1. As expected, the linear-time algorithms Core and GreedyLD are both very fast; the largest graph with 11 million edges and 1.7 million vertices is processed in 21 seconds. However, we are also able to run the exact decomposition for all the graphs in reasonable time, despite its running-time complexity of . It takes less than 2 minutes for ExactLD to process the largest graph. There are three reasons that contribute to achieving this performance. First, we need to compute the minimum cut only times, where is the number of locally-dense graphs. In practice, is much smaller than the number of vertices. Second, computing minimum cut in practice is faster than the theoretical bound. Third, as described in Section 4, most of the minimum cuts are computed using subgraphs. While in theory these subgraphs can be as large as the original graph, in practice these subgraphs are significantly smaller.
Next, we compare how well Core and GreedyLD approximate the exact locally-dense decomposition. In order to do that we compute the ratio
[TABLE]
where is the locally-dense decomposition and is obtained from either GreedyLD or Core. These ratios are shown in Table 2. We also compare , that is, the ratio of the density of the innermost subgraph in against the density of , the densest subgraph. Propositions 4.7 and 5.2 guarantee that these ratios are at least $1/2$. In practice, the ratios are larger, typically over . In most cases, but not always, GreedyLD obtains better ratios than Core. When comparing the ratio for the innermost subgraph, GreedyLD, by design, will always be better than or equal to Core. We see that in only three datasets Core is able to find the same subgraph as GreedyLD.
Let us now compare the different solutions found by the three algorithms. In Table 3 we report the sizes of discovered communities and their Kendall-$\tau$ statistics, which compare the ordering of the vertices induced by the decompositions. In particular, the Kendall-$\tau$ statistic is computed by assigning each vertex an index based on which subgraph the vertex belongs to. To handle ties, we use the $b$-version of Kendall-$\tau$, as given by Agresti (2010). If the statistic is 1, the decompositions are equal.
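As an illustration of the comparison, the sketch below computes a tie-corrected Kendall statistic between two decompositions, each given as a list that assigns every vertex the index of the subgraph it first appears in. It assumes that the tie-handling variant referred to above is the commonly used tau-b. The function name is hypothetical, and the quadratic pair enumeration is chosen for clarity, not speed. The statistic is undefined when one of the lists is constant.

```python
import math
from itertools import combinations


def kendall_tau_b(x, y):
    """Tie-corrected Kendall rank correlation between two index lists."""
    concordant = discordant = ties_x = ties_y = 0
    for (a1, b1), (a2, b2) in combinations(zip(x, y), 2):
        dx, dy = a1 - a2, b1 - b2
        if dx == 0:
            ties_x += 1  # pair tied in the first decomposition
        if dy == 0:
            ties_y += 1  # pair tied in the second decomposition
        if dx * dy > 0:
            concordant += 1
        elif dx * dy < 0:
            discordant += 1
    pairs = len(x) * (len(x) - 1) // 2
    denominator = math.sqrt((pairs - ties_x) * (pairs - ties_y))
    return (concordant - discordant) / denominator
```

For example, a decomposition that places four vertices in the first subgraph and two in the second can be encoded as `[0, 0, 0, 0, 1, 1]` and compared against a finer decomposition such as `[0, 0, 0, 1, 2, 2]`.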
Our first observation is that typically the locally-dense decomposition algorithms return more subgraphs than the $k$-core decomposition. As an extreme example, roadnet contains only 3 $k$-cores while GreedyLD finds 43 subgraphs and ExactLD finds 2710. This can be explained by the fact that the vertices in the graph have low degrees, which results in a very coarse $k$-core decomposition. On the other hand, ExactLD and GreedyLD exploit density to discover more fine-grained decompositions. This result is similar to what we presented in Example 1.1 in the introduction.
The Kendall-$\tau$ statistics are typically close to 1, especially for large datasets, suggesting that all 3 methods result in similar decompositions. The statistic between Core and GreedyLD is typically larger than the statistic of either method with respect to the exact solution. This is expected since Core and GreedyLD use the exact same order for vertices; the only difference between these two methods is how they partition the vertex order. In addition, decompositions produced by GreedyLD are closer to the exact solution than the decompositions produced by Core, which is also a natural result.
Let us now compare the solutions in terms of profile functions as defined in Definition 4.6. We illustrate several prototypical examples of such profile functions in Figure 2. We see that GreedyLD produces similar profiles as the exact locally-dense decomposition. We also see that Core does not respect the local density constraint. In fb1912, astro, and hepph there exist $k$-shells that are denser than their inner shells, that is, joining these shells would increase the density of the inner shell. GreedyLD does not have this problem since by definition it will have a monotonically decreasing profile.
In Figure 3 we present the decompositions obtained by the three algorithms for the lesmis graph. We see that GreedyLD obtains very similar result to the exact solution, the only difference is the second subgraph and the third subgraph are merged and the th subgraph (in ExactLD) lends vertices to the 8th last subgraph. While GreedyLD has the same first subgraph as the exact solution, which is the densest subgraph, Core breaks this subgraph into 3 subgraphs. Interestingly enough, the protagonist of the book, Jean Valjean, is not placed into the first shell by Core.
Next, we present our result with segmentation. First we computed the cost of optimal segmentation as a function of the number of segments . Here, we used exponential distribution as the underlying model. The normalized scores are shown in left plot of Figure 4. The scores behave similarly for all datasets: they improve quickly at the very beginning (for ), after which they settle to a relatively stable value. This value depends on the dataset.
Next, we study how well we can approximate the segmentation by using GreedyLD instead of the exact solution. The results are shown in the right plot of Figure 4. Here, we plot the relative difference between the approximate solution and the optimal solution. Ideally, the difference should be 0, and Proposition 6.8 states that it is at most 1. We see that in practice the estimates are really close to each other: all differences are within . The approximation is better for smaller . This is a natural result as there is less room for disagreement in coarser segmentations.
Finally, let us look at segmentations obtained from the trains and airports data. Our goal is to discover which locations, that is, train stations or airports, are central. Here, by centrality we mean that a central location is well-connected with other central locations. To quantify this notion we use locally-dense subgraphs. Note that the number of locally-dense subgraphs is relatively large in these graphs; this is due to the fact that the graphs are weighted. We were interested in grouping the locations into 4 categories. So, to reduce the size of the decomposition, we solved the segmentation problem with 4 segments and the exponential model. The results are shown in Figures 5 to 7.
The discovered trains segmentation shows that the densest segment occurs in the vicinity of London, as expected. There is also a strong concentration of the second densest segment around the Manchester/Liverpool area, while the stations in Scotland, apart from the capital Edinburgh, are in outer segments. For airports, we see that the inner segments consist of large well-connected airports, such as JFK, DFW, ATL, or ORD, while the smaller, regional airports are assigned to the outer segments.
9. Conclusions
Inspired by $k$-core analysis and density-based graph mining, we propose density-friendly graph decomposition, a new tool for analyzing graphs. Like $k$-core decomposition, our approach decomposes a given graph into a nested sequence of subgraphs. These subgraphs have the property that the inner subgraphs are always denser than the outer ones; additionally, the innermost subgraph is the densest one. These are properties that the $k$-cores do not satisfy.
We provide two efficient algorithms to discover such a decomposition. The first algorithm is based on minimum cut and it extends the exact algorithm of Goldberg for the densest-subgraph problem. The second algorithm extends a linear-time algorithm by Charikar for approximating the same problem. The second algorithm runs in linear time, and thus, in addition to finding subgraphs that better respect the density structure of the graph, it is as efficient as the $k$-core decomposition algorithm.
In addition to offering a new alternative for decomposing a graph into dense subgraphs, we significantly extend the analysis, the understanding, and the applicability of previous well-known graph algorithms: Goldberg's exact algorithm and Charikar's approximation algorithm for finding the densest subgraph, as well as the $k$-core decomposition algorithm itself.
Finally, we considered a constrained version of the problem, where we restrict the number of subgraphs. We do this by designing a model based on segmentation. The likelihood of this model is then optimized, and we show that we can do this either exactly or estimate this efficiently by a factor of 2.
References
- Abello et al. (2002). James Abello, Mauricio G.C. Resende, and Sandra Sudarsky. 2002. Massive Quasi-Clique Detection. In LATIN 2002: Theoretical Informatics. 598–612.
- Agresti (2010). Alan Agresti. 2010. Analysis of Ordinal Categorical Data (2nd ed.). John Wiley & Sons.
- Alvarez-Hamelin et al. (2005). J. Ignacio Alvarez-Hamelin, Luca Dall'Asta, Alain Barrat, and Alessandro Vespignani. 2005. $k$-core decomposition: a tool for the visualization of large scale networks. CoRR abs/cs/0504107 (2005).
- Asahiro et al. (1996). Yuichi Asahiro, Kazuo Iwama, Hisao Tamaki, and Takeshi Tokuyama. 1996. Greedily finding a dense subgraph. Scandinavian Workshop on Algorithm Theory (SWAT) (1996), 136–148.
- Ayer et al. (1955). M. Ayer, H. Brunk, G. Ewing, and W. Reid. 1955. An empirical distribution function for sampling with incomplete information. The Annals of Mathematical Statistics 26, 4 (1955), 641–647.
- Bader and Hogue (2003). Gary Bader and Christopher Hogue. 2003. An automated method for finding molecular complexes in large protein interaction networks. BMC Bioinformatics 4, 1 (2003).
- Balasundaram et al. (2011). Balabhaskar Balasundaram, Sergiy Butenko, and Illya V. Hicks. 2011. Clique Relaxations in Social Network Analysis: The Maximum $k$-Plex Problem. Operations Research 59, 1 (2011), 133–142.
