Density-friendly Graph Decomposition

Nikolaj Tatti

arXiv:1904.03467·cs.DS·January 16, 2020


TL;DR

This paper introduces a new graph decomposition method based on local density, providing a polynomial-time exact algorithm and a linear-time approximation, which better captures dense subgraph structures than traditional $k$-core analysis.

Contribution

It defines locally-dense subgraphs, develops algorithms for their decomposition, and compares this approach to $k$-core analysis, highlighting improved density alignment.

Findings

01. A locally-dense decomposition can be computed in polynomial time.

02. A linear-time algorithm gives a factor-2 approximation of the locally-dense decomposition.

03. The $k$-core decomposition is also a factor-2 approximation, but it is less well aligned with density in practice.

Abstract

Decomposing a graph into a hierarchical structure via $k$-core analysis is a standard operation in any modern graph-mining toolkit. $k$-core decomposition is a simple and efficient method that makes it possible to analyze a graph beyond its mere degree distribution. More specifically, it is used to identify areas in the graph of increasing centrality and connectedness, and it reveals the structural organization of the graph. Despite the fact that $k$-core analysis relies on vertex degrees, $k$-cores do not satisfy a certain, rather natural, density property. Simply put, the most central $k$-core is not necessarily the densest subgraph. This inconsistency between $k$-cores and graph density provides the basis of our study. We start by defining what it means for a subgraph to be locally-dense, and we show that our definition entails a nested chain decomposition of the graph, similar to…

Tables

Table 1. Basic characteristics of the datasets and the running times of the algorithms.

Name	$n$	$m$	Core (running time)	GreedyLD (running time)	ExactLD (running time)
dolphins	62	159	1ms	1ms	2ms
karate	34	78	1ms	1ms	2ms
lesmis	77	254	2ms	2ms	3ms
astro	18 772	396 160	0.4s	0.4s	2s
enron	36 692	183 831	0.3s	0.3s	2s
fb1912	747	30 025	44ms	44ms	0.2s
hepph	12 008	237 010	0.2s	0.2s	0.9s
dblp	317 080	1 049 866	2s	2s	14s
gowalla	196 591	950 327	2s	2s	9s
roadnet	1 965 206	2 766 607	7s	8s	1m6s
skitter	1 696 415	11 095 298	21s	21s	1m46s
airports	294	3 995	11ms	10ms	27ms
trains	363	1 357	7ms	7ms	23ms

Table 2. Smallest ratio between the profile function of the discovered decomposition and the profile function of the exact solution, as defined in Equation (2), and the ratio of the density of the innermost discovered subgraph to the density of the actual densest subgraph.

Name	$r(\mathcal{C}, \mathcal{B})$, Core	$r(\mathcal{C}, \mathcal{B})$, GreedyLD	$d(C_{1}) / d(B_{1})$, Core	$d(C_{1}) / d(B_{1})$, GreedyLD
dolphins	0.94	0.83	0.98	0.98
karate	0.95	0.99	0.95	0.99
lesmis	0.86	0.87	0.96	1.00
astro	0.85	0.85	0.87	0.92
enron	0.83	0.82	0.94	1.00
fb1912	0.69	0.74	0.91	1.00
hepph	0.74	0.75	1.00	1.00
dblp	0.80	0.86	1.00	1.00
gowalla	0.89	0.92	0.87	1.00
roadnet	0.81	0.87	0.84	0.87
skitter	0.73	0.84	0.84	1.00
airports	0.75	0.90	0.93	1.00
trains	0.60	0.84	0.82	0.96

Table 3. Sizes of the discovered decompositions and Kendall-$\tau$ statistics between the decompositions. E stands for ExactLD, G for GreedyLD, and C for Core.

Name	Core (size)	GreedyLD (size)	ExactLD (size)	C-vs-E	G-vs-E	C-vs-G
dolphins	4	6	7	0.76	0.77	0.99
karate	4	3	4	0.80	0.95	0.78
lesmis	8	8	9	0.94	0.99	0.95
astro	52	83	435	0.93	0.93	0.99
enron	43	162	357	0.92	0.92	0.99
fb1912	87	55	75	0.95	0.98	0.97
hepph	64	63	283	0.93	0.93	0.98
dblp	47	97	1087	0.88	0.89	0.97
gowalla	51	161	899	0.97	0.96	0.98
roadnet	3	43	2710	0.57	0.80	0.68
skitter	111	266	3501	0.98	0.97	0.99
airports	221	200	219	0.99	0.99	0.996
trains	187	59	156	0.87	0.89	0.98

Equations

$$d(X) = \frac{|E(X)|}{|X|},$$

$$E_{\times}(X, Y) = \{(x, y) \in E \mid x \in X,\ y \in Y\}.$$

$$E_{\Delta}(X, Y) = E(X) \cup E_{\times}(X, Y).$$

$$d(X, Y) = \frac{|E_{\Delta}(X, Y)|}{|X|}.$$

$$d(X, Y) = d(X \setminus Y, Y) = \frac{|E_{\Delta}(X \setminus Y, Y)|}{|X \setminus Y|}.$$

$$\emptyset = C_{0} \subsetneq C_{1} \subsetneq \cdots \subsetneq C_{\ell} = V.$$

$$d(X, W \setminus X) \leq d(Y, W).$$

$$d(X, U \setminus X) = d(X, U \cap W) \leq d(Y, U \cap W) \leq d(Y, U),$$

$$B_{i} = \arg\max_{W \supsetneq B_{i-1}} d(W, B_{i-1}).$$

$$d(Y \cup Z, X) = \frac{|E_{\Delta}(Y \cup Z, X)|}{|Y| + |Z|} = \frac{|E_{\Delta}(Y, X)| + |E_{\Delta}(Z, Y)|}{|Y| + |Z|} = \alpha\, d(Y, X) + (1 - \alpha)\, d(Z, Y).$$

$$d(C_{i}, C_{i-2}) = d(Y \cup Z, X) \geq d(Y, X) = d(C_{i-1}, C_{i-2}),$$

$$d(C_{j}, C_{j-1}) = d(Z \cup Y, X) < d(Y, X) = d(C_{j} \setminus Z, C_{j-1}),$$

$$d(X_{j}, U_{j} \setminus X_{j}) \geq d(C_{j}, C_{j-1}) > d(C_{i}, C_{i-1}).$$

$$d(X, C_{i} \setminus X)$$

$$d(B_{j}, B_{i-1}) < d(B_{i}, B_{i-1})$$

$$\emptyset = B_{0} \subsetneq B_{1} \subsetneq \cdots \subsetneq B_{k} = V.$$

$$f(\alpha) = \max_{W \subseteq V} \{|E(W)| - \alpha |W|\},$$

$$F(\alpha) = \arg\max_{W \subseteq V} \{|E(W)| - \alpha |W|\},$$

$$B_{i} = F(\alpha), \quad \text{for } d(B_{i+1}, B_{i}) < \alpha \leq d(B_{i}, B_{i-1}).$$

$$c = |E_{\Delta}(B_{j} \setminus B_{j-1}, B_{j-1})| - \alpha |B_{j} \setminus B_{j-1}| < 0.$$

$$d(B_{k}, B_{i}) \geq d(B_{k}, B_{j}) \geq d(B_{\ell}, B_{j}).$$

$$d(B_{x+1}, B_{x}) - d(B_{y}, B_{x}) = \frac{a}{b} - \frac{c}{d} = \frac{ad - bc}{bd} \geq \frac{1}{bd} > \frac{1}{n^{2}} = \alpha - d(B_{y}, B_{x}).$$

$$F(\alpha; X) = \arg\max_{X \subseteq W \subseteq V} \{|E(W)| - \alpha |W|\}.$$

$$w(y) = \deg(y; V \setminus X) + 2 \deg(y; X)$$

$$2 |Z| \alpha + \sum_{y \in W} w(y) + |E_{\times}(Z, W)|.$$

$$2 |Z| \alpha + 2 |E| - 2 |E(X \cup Z)| = 2 |E| - 2 |X| \alpha - 2 \left(|E(X \cup Z)| - \alpha |Z \cup X|\right).$$

$$d(\{w_{1}, \dots, w_{i}\}, \{w_{1}, \dots, w_{j}\}) = \frac{1}{i - j} \sum_{k = j + 1}^{i} \mathit{in}(v_{k}).$$

$$m(j) = \arg\max_{j \leq i \leq n} \frac{1}{i - j + 1} \sum_{k = j}^{i} y_{k}, \quad \text{for every } 1 \leq j \leq n.$$

Full text

Density-friendly Graph Decomposition

Nikolaj Tatti

HIIT, University of Helsinki, Aalto University, Helsinki, Finland


Abstract.

Decomposing a graph into a hierarchical structure via $k$-core analysis is a standard operation in any modern graph-mining toolkit. $k$-core decomposition is a simple and efficient method that makes it possible to analyze a graph beyond its mere degree distribution. More specifically, it is used to identify areas in the graph of increasing centrality and connectedness, and it reveals the structural organization of the graph.

Despite the fact that $k$ -core analysis relies on vertex degrees, $k$ -cores do not satisfy a certain, rather natural, density property. Simply put, the most central $k$ -core is not necessarily the densest subgraph. This inconsistency between $k$ -cores and graph density provides the basis of our study.

We start by defining what it means for a subgraph to be locally-dense, and we show that our definition entails a nested chain decomposition of the graph, similar to the one given by $k$-cores, but in this case the components are arranged in order of increasing density. We show that such a locally-dense decomposition for a graph $G=(V,E)$ can be computed in polynomial time. The running time of the exact decomposition algorithm is $\mathcal{O}(|V|^{2}|E|)$, but it is significantly faster in practice. In addition, we develop a linear-time algorithm that provides a factor-2 approximation to the optimal locally-dense decomposition. Furthermore, we show that the $k$-core decomposition is also a factor-2 approximation. However, as demonstrated by our experimental evaluation, in practice $k$-cores have a different structure than locally-dense subgraphs, and, as predicted by the theory, $k$-cores are not always well-aligned with graph density.

The research described in this paper builds upon and extends the work appearing in WWW 2015 by Tatti and Gionis (2015).

Journal: TKDD. Volume: 1. Number: 1. Article: 1. Year: 2017. Publication month: 1. Copyright: ACM licensed. DOI: 0000001.0000001

1. Introduction

Finding dense subgraphs and communities is one of the most well-studied problems in graph mining. Techniques for identifying dense subgraphs are used in a large number of application domains, from biology, to web mining, to analysis of social and information networks. Among the many concepts that have been proposed for discovering dense subgraphs, $k$ -cores are particularly attractive for the simplicity of their definition and the fact that they can be identified in linear time.

The $k$ -core of a graph is defined as a maximal subgraph in which every vertex is connected to at least $k$ other vertices within that subgraph. A $k$ -core decomposition of a graph consists of finding the set of all $k$ -cores. A nice property is that the set of all $k$ -cores forms a nested sequence of subgraphs, one included in the next. This makes the $k$ -core decomposition of a graph a useful tool in analyzing a graph by identifying areas of increasing centrality and connectedness, and revealing the structural organization of the graph. As a result, $k$ -core decomposition has been applied to a number of different applications, such as modeling of random graphs (Bollobás, 1984), analysis of the internet topology (Carmi et al., 2007), social-network analysis (Seidman, 1983), bioinformatics (Bader and Hogue, 2003), analysis of connection matrices of the human brain (Hagmann et al., 2008), graph visualization (Alvarez-Hamelin et al., 2005), as well as influence analysis (Kitsak et al., 2010; Ugander et al., 2012) and team formation (Bonchi et al., 2014).

The fact that the $k$-core decomposition of a graph gives a chain of subgraphs where vertex degrees are higher in the inner cores suggests that the inner cores should be, in a certain sense, denser or more connected than the outer cores. As we will show shortly, this statement is not true. Furthermore, in this paper we show how to obtain a graph decomposition for which the statement is true, namely, the inner subgraphs of the decomposition are denser than the outer ones. To quantify density, we adopt a classic notion used in the densest-subgraph problem (Charikar, 2000; Goldberg, 1984), where density is defined as the ratio between the number of edges and the number of vertices of a subgraph. This density definition can also be viewed as the average degree divided by 2.

Our motivating observation is that $k$-cores are not ordered according to this density definition. The next example demonstrates that the innermost core is not necessarily the densest subgraph, and in fact, we can increase the density by either adding or removing vertices.

Example 1.1.

Consider the graph $G_{1}$ shown in Figure 1, consisting of 6 vertices and 9 edges. The density of the whole graph is $9/6=1.5$. The graph has three $k$-cores: a $3$-core marked as $C_{1}$, a $2$-core marked as $C_{2}$, and a $1$-core, corresponding to the whole graph and marked as $C_{3}$. The core $C_{1}$ has density $6/4=1.5$ (it contains $6$ edges and $4$ vertices), while the core $C_{2}$ has density $8/5=1.6$ (it contains $8$ edges and $5$ vertices). In other words, $C_{1}$ has lower density than $C_{2}$, despite being an inner core.

Let us now consider $G_{2}$ shown in Figure 1. This graph has a single core, namely a $2$ -core, containing the whole graph. The density of this core is equal to $11/8=1.375$ . However, a subgraph $B_{1}$ contains $7$ edges and $5$ vertices, giving us density $7/5=1.4$ , which is higher than the density of the only core.
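The density figures quoted for $G_{1}$ in Example 1.1 can be checked mechanically. The sketch below is only an illustration: Figure 1 is not reproduced in this text, so the edge list `G1_EDGES` is a hypothetical graph chosen to match the stated counts (6 vertices, 9 edges, a 3-core with 6 edges, a 2-core with 8 edges), and the helper names `density` and `min_induced_degree` are not from the paper.

```python
# Hypothetical edge list for G1; Figure 1 is not reproduced in this text, so
# this is one graph consistent with the counts quoted in Example 1.1.
G1_EDGES = [
    (1, 2), (1, 3), (1, 4), (2, 3), (2, 4), (3, 4),  # 4-clique on {1, 2, 3, 4}
    (1, 5), (2, 5),                                  # vertex 5 joins two clique vertices
    (5, 6),                                          # vertex 6 hangs off vertex 5
]


def density(vertices, edges):
    """Return |E(X)| / |X| for a non-empty vertex set X."""
    vs = set(vertices)
    induced = sum(1 for u, v in edges if u in vs and v in vs)
    return induced / len(vs)


def min_induced_degree(vertices, edges):
    """Smallest degree inside the subgraph induced by the given vertices."""
    vs = set(vertices)
    degree = {v: 0 for v in vs}
    for u, v in edges:
        if u in vs and v in vs:
            degree[u] += 1
            degree[v] += 1
    return min(degree.values())


C1, C2, C3 = {1, 2, 3, 4}, {1, 2, 3, 4, 5}, {1, 2, 3, 4, 5, 6}
for name, core in (("C1", C1), ("C2", C2), ("C3", C3)):
    # Prints the core name, its minimum induced degree, and its density.
    print(name, min_induced_degree(core, G1_EDGES), density(core, G1_EDGES))
```

Under this assumed edge list the three cores have minimum degrees 3, 2, and 1 and densities 1.5, 1.6, and 1.5, which matches the observation that the inner core $C_{1}$ is less dense than $C_{2}$.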

This example motivates us to define an alternative, more density-friendly, graph decomposition, which we call locally-dense decomposition. We are interested in a decomposition such that ($i$) the density of the inner subgraphs is higher than the density of the outer subgraphs, ($ii$) the innermost subgraph corresponds to the densest subgraph, and ($iii$) we can compute or approximate the decomposition efficiently.

We achieve our goals by first defining a locally-dense subgraph, essentially a subgraph whose density cannot be improved by adding and deleting vertices. We show that these subgraphs are arranged into a hierarchy such that the density decreases as we go towards outer subgraphs and that the innermost subgraph is in fact the densest subgraph.

We provide two efficient algorithms to discover this hierarchy. The first algorithm extends the exact algorithm for discovering the densest subgraph given by Goldberg (1984). This algorithm is based on solving a minimum cut problem on a certain graph that depends on a parameter $\alpha$. Goldberg showed that for a certain value $\alpha$ (which can be found by binary search), the minimum cut recovers the densest subgraph. One of our contributions is to shed more light on Goldberg’s algorithm and show that the same construction makes it possible to discover all locally-dense subgraphs by varying $\alpha$.

Our second algorithm extends the linear-time algorithm by Charikar (2000) for approximating dense subgraphs. This algorithm first orders vertices by iteratively deleting a vertex with the smallest degree, and then selects the densest subgraph respecting the order. We extend this idea by using the same order, first finding the densest subgraph respecting the order, then iteratively finding the second densest subgraph containing the first subgraph, and so on. We show that this algorithm can be executed in linear time and that it achieves a factor-$2$ approximation guarantee.

Charikar’s algorithm and the algorithm for discovering a $k$ -core decomposition are very similar: they both order vertices by deleting vertices with the smallest degree. We show that this connection is profoundly deep and we demonstrate that a $k$ -core decomposition provides a factor- $2$ approximation for locally-dense decomposition. On the other hand, our experimental evaluation shows that in practice $k$ -cores have different structure than locally-dense subgraphs, and as predicted by the theory, $k$ -cores are not always well-aligned with graph density.

It is possible that the decomposition results in a large number of subgraphs. In such a case it may be useful to constrain the number of subgraphs. We approach this problem by defining an optimization criterion for a segmentation into $k$ nested subgraphs. The objective function is based on a statistical model. We show that to optimize this particular objective, we need to (i) find locally-dense subgraphs, and (ii) reduce their number with a dynamic program. We also show that if we replace the first step with the greedy algorithm, then the resulting algorithm yields a factor-2 approximation guarantee.

The remainder of the paper is organized as follows. We give preliminary notation in Section 2. We introduce the locally-dense subgraphs in Section 3, present algorithms for discovering the subgraphs in Section 4, and describe the connection to $k$-core decomposition in Section 5. We introduce the constrained version of the problem in Section 6. We present the related work in Section 7 and the experiments in Section 8. Finally, we conclude the paper with a discussion in Section 9.

2. Preliminaries

Graph density. Let $G=(V,E)$ be a graph with $|V|=n$ vertices and $|E|=m$ edges. Given a subset of vertices $X\subseteq V$, it is common to define $E(X)=\{(x,y)\in E\mid x,y\in X\}$, that is, the edges of $G$ that have both end-points in $X$. The density of the vertex set $X$ is then defined to be

$$d(X) = \frac{|E(X)|}{|X|},$$

that is, half of the average degree of the subgraph induced by $X$. The set of vertices $X\subseteq V$ that maximizes the density measure $d(X)$ is the densest subgraph of $G$. (Footnote 1: Density is also often defined as $|E(X)|/\binom{|X|}{2}$. That definition is not used in this paper.)

The problem of finding the densest subgraph can be solved in polynomial time. A very elegant solution that involves a mapping to a series of minimum-cut problems was given by Goldberg (1984). As the fastest algorithm to solve the minimum-cut problem runs in $\mathit{\mathcal{O}}\mathopen{}\left(mn\right)$ time, this approach is not scalable to very large graphs. On the other hand, there exists a linear-time algorithm that provides a factor- $2$ approximation to the densest-subgraph problem (Asahiro et al., 1996; Charikar, 2000). This is a greedy algorithm, which starts with the input graph, and iteratively removes the vertex with the lowest degree, until left with an empty graph. Among all subgraphs considered during this vertex-removal process, the algorithm returns the densest.
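The greedy procedure just described can be sketched in a few lines. This is a minimal illustration rather than the implementation used in the paper: it finds the minimum-degree vertex by a plain scan, so it runs in quadratic time instead of the linear time achievable with bucketed degrees, and the function name `greedy_densest_subgraph` and the sample edge list `EDGES` are assumptions made for the example. The input is assumed to be a simple undirected graph with each edge listed once.

```python
# Hypothetical sample graph: a 4-clique, a vertex attached to two clique
# vertices, and a pendant vertex. Not taken from the paper.
EDGES = [(1, 2), (1, 3), (1, 4), (2, 3), (2, 4), (3, 4),
         (1, 5), (2, 5), (5, 6)]


def greedy_densest_subgraph(edges):
    """Peel minimum-degree vertices and return the densest set seen.

    Returns (vertex_set, density). Assumes a simple undirected graph
    with every edge listed exactly once.
    """
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)

    remaining = set(adj)
    m = len(edges)  # number of edges among the remaining vertices
    best_set, best_density = set(), 0.0

    while remaining:
        current = m / len(remaining)
        if current > best_density:
            best_set, best_density = set(remaining), current
        # Plain scan for a minimum-degree vertex (ties broken by label).
        v = min(remaining, key=lambda x: (len(adj[x]), x))
        for u in adj[v]:
            adj[u].discard(v)
        m -= len(adj[v])
        remaining.remove(v)

    return best_set, best_density


best, dens = greedy_densest_subgraph(EDGES)
print(sorted(best), dens)
```

On this sample graph the peeling order removes the pendant vertex first, and the densest intermediate subgraph is the five-vertex set with density 1.6.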

Next we will provide graph-density definitions that relate pairs of vertex sets. Given two non-overlapping sets of vertices $X$ and $Y$ we first define the cross edges between $X$ and $Y$ as

$$E_{\times}(X, Y) = \{(x, y) \in E \mid x \in X,\ y \in Y\}.$$

We then define the marginal edges from $X$ with respect to $Y$ . Those are the edges that have one end-point in $X$ and the other end-point in either $X$ or $Y$ , that is,

$$E_{\Delta}(X, Y) = E(X) \cup E_{\times}(X, Y).$$

The set $\mathit{E_{\Delta}}\mathopen{}\left(X,Y\right)$ represents the additional edges that will be included in the induced subgraph of $Y$ if we expand $Y$ by adding $X$ .

Assume that $X$ and $Y$ are non-overlapping. Then, we define the outer density of $X$ with respect to $Y$ as

$$d(X, Y) = \frac{|E_{\Delta}(X, Y)|}{|X|}.$$

That is, these are the extra edges, on average, that we bring to $Y$ if we expand it by appending $X$ .

Now that we have defined a special case when $X$ and $Y$ are disjoint, we can now consider a more general case, that is, when $X$ and $Y$ are overlapping. Here we would be interested in the outer density of vertices in $X$ that are not already included in $Y$ . Hence, we will expand the definition of outer density to a more general case by defining

$$d(X, Y) = d(X \setminus Y, Y) = \frac{|E_{\Delta}(X \setminus Y, Y)|}{|X \setminus Y|}.$$
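The edge sets and the outer density defined above translate directly into code. The following sketch assumes an undirected graph stored as a list of vertex pairs, each edge listed once; the function names and the sample edge list `EDGES` are illustrative and not taken from the paper.

```python
# Hypothetical sample graph, not taken from the paper.
EDGES = [(1, 2), (1, 3), (1, 4), (2, 3), (2, 4), (3, 4),
         (1, 5), (2, 5), (5, 6)]


def induced_edges(X, edges):
    """E(X): edges with both end-points in X."""
    return {(u, v) for u, v in edges if u in X and v in X}


def cross_edges(X, Y, edges):
    """Cross edges between disjoint sets X and Y (undirected)."""
    return {(u, v) for u, v in edges
            if (u in X and v in Y) or (u in Y and v in X)}


def marginal_edges(X, Y, edges):
    """Marginal edges: induced edges of X plus cross edges between X and Y."""
    return induced_edges(X, edges) | cross_edges(X, Y, edges)


def outer_density(X, Y, edges):
    """d(X, Y), using the general definition d(X, Y) = d(X \\ Y, Y)."""
    Y = set(Y)
    new = set(X) - Y  # vertices of X not already in Y
    return len(marginal_edges(new, Y, edges)) / len(new)


clique = {1, 2, 3, 4}
print(outer_density({5}, clique, EDGES))              # vertex 5 brings 2 edges
print(outer_density({1, 2, 3, 4, 5}, clique, EDGES))  # same value, overlapping form
print(outer_density({5, 6}, clique, EDGES))           # 3 edges over 2 vertices
```

Note that `outer_density(X, set(), edges)` reduces to the ordinary density $d(X)$, since no cross edges exist towards the empty set.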

$\mathbf{k}$ -cores. We briefly review the basic background regarding $k$ -cores. The concept was introduced by Seidman (1983).

Given a graph $G=(V,E)$ , a set of vertices $X\subseteq V$ is a $k$ -core if every vertex in the subgraph induced by $X$ has degree at least $k$ , and $X$ is maximal with respect to this property. A $k$ -core of $G$ can be obtained by recursively removing all the vertices of degree less than $k$ , until all vertices in the remaining graph have degree at least $k$ .

It is not hard to see that if $\left\{C_{i}\right\}$ is the set of all distinct $k$ -cores of $G$ then $\left\{C_{i}\right\}$ forms a nested chain

$$\emptyset = C_{0} \subsetneq C_{1} \subsetneq \cdots \subsetneq C_{\ell} = V.$$

Furthermore, the set of vertices $S_{k}$ that belong to the $k$-core but not to the $(k+1)$-core is called the $k$-shell.

The $k$ -core decomposition of $G$ is the process of identifying all $k$ -cores (and all $k$ -shells). Therefore, the $k$ -core decomposition of a graph identifies progressively the internal cores and decomposes the graph shell by shell. A linear-time algorithm to obtain the $k$ -core decomposition was given by Matula and Beck (1983). The algorithm starts by provisionally assigning each vertex $v$ to a core of index $\mathrm{deg}({v})$ , an upper bound to the correct core of a vertex. It then repeatedly removes the vertex with the smallest degree, and updates the core index of the neighbors of the removed vertex. Note the similarity of this algorithm, with the $2$ -approximation algorithm for the densest-subgraph problem (Charikar, 2000).
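A compact way to see the peeling idea is to compute, for every vertex, the largest $k$ such that the vertex belongs to the $k$-core. The sketch below uses the equivalent "remove a minimum-degree vertex and track the running maximum" formulation with a plain scan, so it is quadratic rather than linear; the name `core_numbers` and the sample edge list `EDGES` are assumptions for illustration, not the implementation of Matula and Beck.

```python
# Hypothetical sample graph, not taken from the paper.
EDGES = [(1, 2), (1, 3), (1, 4), (2, 3), (2, 4), (3, 4),
         (1, 5), (2, 5), (5, 6)]


def core_numbers(edges):
    """Map each vertex to the largest k such that it lies in the k-core."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)

    degree = {v: len(neighbors) for v, neighbors in adj.items()}
    remaining = set(adj)
    core = {}
    k = 0
    while remaining:
        # Remove a vertex of smallest current degree (ties broken by label).
        v = min(remaining, key=lambda x: (degree[x], x))
        k = max(k, degree[v])  # core index never decreases along the peeling
        core[v] = k
        remaining.remove(v)
        for u in adj[v]:
            if u in remaining:
                degree[u] -= 1
    return core


core = core_numbers(EDGES)
three_core = {v for v, c in core.items() if c >= 3}
print(core, sorted(three_core))
```

The $k$-core is then the set of vertices whose core number is at least $k$, and the $k$-shell is the set whose core number equals $k$.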

3. Locally-dense graph decomposition

In this section we present the main concept introduced in this paper, the locally-dense decomposition of a graph. We also discuss the properties of this decomposition. We start by defining the concept of a locally-dense subgraph.

Definition 3.1.

A set of vertices $W$ is locally dense if there are no $X\subseteq W$ and $Y$ satisfying $Y\cap W=\emptyset$ such that

$$d(X, W \setminus X) \leq d(Y, W).$$

In other words, for $W$ to be locally dense there should not be an $X$ “inside” $W$ and a $Y$ “outside” $W$ so that the density that $Y$ brings to $W$ is larger than the density that $X$ brings.
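For very small graphs, Definition 3.1 can be tested by brute force. The sketch below enumerates every candidate pair and is exponential in the number of vertices, so it is meant only to make the definition concrete. It assumes that both certificates $X$ and $Y$ must be non-empty (the definition leaves this implicit), uses exact rational arithmetic to avoid floating-point ties, and relies on a hypothetical sample edge list `EDGES`; none of the function names come from the paper.

```python
from fractions import Fraction
from itertools import combinations

# Hypothetical sample graph, not taken from the paper.
EDGES = [(1, 2), (1, 3), (1, 4), (2, 3), (2, 4), (3, 4),
         (1, 5), (2, 5), (5, 6)]
VERTICES = {1, 2, 3, 4, 5, 6}


def nonempty_subsets(items):
    """Yield every non-empty subset of items as a set."""
    items = list(items)
    for size in range(1, len(items) + 1):
        for combo in combinations(items, size):
            yield set(combo)


def outer_density(X, Y, edges):
    """d(X, Y) for disjoint X and Y, as an exact fraction."""
    both = X | Y
    count = sum(1 for u, v in edges
                if (u in X and v in both) or (v in X and u in both))
    return Fraction(count, len(X))


def is_locally_dense(W, vertices, edges):
    """Check Definition 3.1 by enumerating all non-empty X and Y."""
    W = set(W)
    outside = set(vertices) - W
    for X in nonempty_subsets(W):
        dx = outer_density(X, W - X, edges)
        for Y in nonempty_subsets(outside):
            if dx <= outer_density(Y, W, edges):
                return False  # (X, Y) certifies that W is not locally dense
    return True


print(is_locally_dense({1, 2, 3, 4}, VERTICES, EDGES))     # the 4-clique
print(is_locally_dense({1, 2, 3, 4, 5}, VERTICES, EDGES))  # clique plus vertex 5
```

In this sample graph the 4-clique fails the test (take $X$ to be the clique itself and $Y=\{5\}$), while the five-vertex set and the full vertex set pass.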

Due to the notational simplicity, we will often refer to these sets of vertices as subgraphs.

Interestingly, the property of being locally dense induces a nested chain of subgraphs in $G$ .

Proposition 3.2.

Let $U$ and $W$ be locally-dense subgraphs. Then either $U\subseteq W$ or $W\subseteq U$ .

Proof.

Assume otherwise. Define $X=U\setminus W$ and $Y=W\setminus U$ . Both $X$ and $Y$ should be non-empty sets. Then either $\mathit{d}\mathopen{}\left(X,U\cap W\right)\leq\mathit{d}\mathopen{}\left(Y,U\cap W\right)$ or $\mathit{d}\mathopen{}\left(X,U\cap W\right)>\mathit{d}\mathopen{}\left(Y,U\cap W\right)$ . Assume the former. This implies

$$d(X, U \setminus X) = d(X, U \cap W) \leq d(Y, U \cap W) \leq d(Y, U),$$

which contradicts the fact that $U$ is locally dense. For the first equality we used the fact that $U\setminus X=U\cap W$, while for the last inequality we used the fact that $E_{\times}(Y,U\cap W)\subseteq E_{\times}(Y,U)$.

The case $\mathit{d}\mathopen{}\left(X,U\cap W\right)>\mathit{d}\mathopen{}\left(Y,U\cap W\right)$ is similar. ∎

The proposition implies that the set of locally-dense subgraphs of a graph forms a nested chain, in the same way that the set of $k$ -cores does.

Corollary 3.3.

A set of locally-dense subgraphs can be arranged into a sequence $B_{0}\subsetneq B_{1}\subsetneq\cdots\subsetneq B_{k}$ , where $k\leq{\left|V\right|}$ . Moreover, $\mathit{d}\mathopen{}\left(B_{i},B_{i-1}\right)>\mathit{d}\mathopen{}\left(B_{i+1},B_{i}\right)$ for $1\leq i<k$ .

The chain of locally-dense subgraphs of a graph $G$ , as specified by Corollary 3.3, defines the locally-dense decomposition of $G$ .

Example 3.4.

The locally-dense decomposition of $G_{1}$ given in Figure 1 is $\emptyset\subsetneq C_{2}\subsetneq C_{3}=V$. This is the $k$-core decomposition without $C_{1}$. The locally-dense decomposition of $G_{2}$ given in Figure 1 is $\emptyset\subsetneq B_{1}\subsetneq V$. Note that both $C_{2}$ and $B_{1}$ are the densest subgraphs in their respective graphs.

We proceed to characterize the locally-dense subgraphs of the decomposition with respect to their global density in the whole graph $G$. The subgraph $B_{i}$ cannot be denser than the previous subgraph $B_{i-1}$ in the decomposition; however, we want to measure the density that the additional vertices $S_{i}=B_{i}\setminus B_{i-1}$ bring. This density involves edges among vertices of $S_{i}$ and edges from $S_{i}$ to the previous subgraph $B_{i-1}$. This is captured precisely by the concept of outer density $d(B_{i},B_{i-1})$ defined in the previous section. As the following proposition shows, the outer density of $B_{i}$ with respect to $B_{i-1}$ is maximized over all subgraphs that contain $B_{i-1}$. In other words, $B_{i}$ is the densest subgraph we can choose after $B_{i-1}$, given the containment constraint.

Proposition 3.5.

Let $\left\{B_{i}\right\}$ be the chain of locally-dense subgraphs. Then $B_{0}=\emptyset$ , $B_{k}=V$ , and $B_{i}$ is the densest subgraph properly containing $B_{i-1}$ ,

$$B_{i} = \arg\max_{W \supsetneq B_{i-1}} d(W, B_{i-1}).$$
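Read as a procedure, Proposition 3.5 says the chain can be grown one step at a time by maximizing the outer density over proper supersets. The brute-force sketch below does exactly that on a tiny graph. It is exponential and purely illustrative: ties are broken towards the larger set as an assumption of this sketch, the sample edge list `EDGES` is hypothetical, and the helper names are not from the paper.

```python
from fractions import Fraction
from itertools import combinations

# Hypothetical sample graph, not taken from the paper.
EDGES = [(1, 2), (1, 3), (1, 4), (2, 3), (2, 4), (3, 4),
         (1, 5), (2, 5), (5, 6)]
VERTICES = {1, 2, 3, 4, 5, 6}


def nonempty_subsets(items):
    """Yield every non-empty subset of items as a set."""
    items = list(items)
    for size in range(1, len(items) + 1):
        for combo in combinations(items, size):
            yield set(combo)


def outer_density(S, B, edges):
    """d(S, B) for disjoint S and B, as an exact fraction."""
    both = S | B
    count = sum(1 for u, v in edges
                if (u in S and v in both) or (v in S and u in both))
    return Fraction(count, len(S))


def locally_dense_chain(vertices, edges):
    """Build B_0, B_1, ..., B_k by repeated brute-force maximization."""
    V = set(vertices)
    chain = [set()]
    while chain[-1] != V:
        B = chain[-1]
        best_key, best = None, None
        for S in nonempty_subsets(V - B):
            # Maximize outer density; prefer the larger set on ties.
            key = (outer_density(S, B, edges), len(S))
            if best_key is None or key > best_key:
                best_key, best = key, B | S
        chain.append(best)
    return chain


chain = locally_dense_chain(VERTICES, EDGES)
print([sorted(B) for B in chain])
```

On this sample graph the chain is the empty set, then the five-vertex set of density $8/5$, then the whole vertex set, so the first non-empty element is the densest subgraph.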

To prove the proposition we will use the following three lemmas.

Lemma 3.6.

Let $X\subseteq Y$ be two sets of vertices with $Y\neq\emptyset$. Assume a third non-empty set $Z$ with $Z\cap Y=\emptyset$. Then one of the following three cases holds:

• $d(Z,Y)>d(Y\cup Z,X)>d(Y,X)$, or

• $d(Z,Y)<d(Y\cup Z,X)<d(Y,X)$, or

• $d(Z,Y)=d(Y\cup Z,X)=d(Y,X)$.

Proof.

Write $\alpha=\frac{{\left|Y\right|}}{{\left|Y\right|}+{\left|Z\right|}}$ . We can rewrite $\mathit{d}\mathopen{}\left(Y\cup Z,X\right)$ as

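$$d(Y \cup Z, X) = \frac{|E_{\Delta}(Y \cup Z, X)|}{|Y| + |Z|} = \frac{|E_{\Delta}(Y, X)| + |E_{\Delta}(Z, Y)|}{|Y| + |Z|} = \alpha\, d(Y, X) + (1 - \alpha)\, d(Z, Y).$$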

This shows that either $d(Z,Y)\geq d(Y\cup Z,X)\geq d(Y,X)$ or $d(Z,Y)\leq d(Y\cup Z,X)\leq d(Y,X)$. Since $0<\alpha<1$, it follows that $d(Z,Y)=d(Y\cup Z,X)$ if and only if $d(Y\cup Z,X)=d(Y,X)$. The three cases follow. ∎

Let $C_{i}$ be the sequence defined by $C_{0}=\emptyset$ and $C_{i}=\arg\max_{W\supsetneq C_{i-1}}d(W,C_{i-1})$, where ties are broken by selecting the larger subgraph.

Lemma 3.7.

$d(C_{j},C_{j-1})>d(C_{i},C_{i-1})$ for $j<i$.

Proof.

We only need to show that the lemma holds for $j=i-1$. Assume otherwise: $d(C_{i},C_{i-1})\geq d(C_{i-1},C_{i-2})$.

Write $Z=C_{i}\setminus C_{i-1}$ , $Y=C_{i-1}$ , and $X=C_{i-2}$ . Since $\mathit{d}\mathopen{}\left(Z,Y\right)=\mathit{d}\mathopen{}\left(C_{i},C_{i-1}\right)$ , Lemma 3.6 implies that

$$d(C_{i}, C_{i-2}) = d(Y \cup Z, X) \geq d(Y, X) = d(C_{i-1}, C_{i-2}),$$

violating the optimality of $C_{i-1}$ . ∎

Lemma 3.8.

If $Z\subseteq C_{j}\setminus C_{j-1}$ and $Z\neq\emptyset$ , then $\mathit{d}\mathopen{}\left(Z,C_{j}\setminus Z\right)\geq\mathit{d}\mathopen{}\left(C_{j},C_{j-1}\right)$ .

Proof.

Assume otherwise: $\mathit{d}\mathopen{}\left(Z,C_{j}\setminus Z\right)<\mathit{d}\mathopen{}\left(C_{j},C_{j-1}\right)$ . Write $X=C_{j-1}$ , $Y=C_{j}\setminus Z$ . Lemma 3.6 implies that

$$d(C_{j}, C_{j-1}) = d(Z \cup Y, X) < d(Y, X) = d(C_{j} \setminus Z, C_{j-1}),$$

violating the optimality of $C_{j}$ . ∎

Proof of Proposition 3.5.

We need to show that $C_{i}=B_{i}$ . Fix $i$ and assume inductively that $C_{j}=B_{j}$ for all $j<i$ .

We will first show that $C_{i}$ is locally dense: we argue that there are no sets $X$ and $Y$ with $X\subseteq C_{i}$ and $Y\cap C_{i}=\emptyset$ that can serve as certificates for $C_{i}$ being non locally-dense.

Fix any $X\subseteq C_{i}$ . Define $X_{j}=X\cap(C_{j}\setminus C_{j-1})$ and $U_{j}=(C_{i}\setminus X)\cup C_{j-1}$ for $j\leq i$ .

We claim that $C_{j}\subseteq U_{j}\cup X_{j}$. Let $x\in C_{j}$. If $x\in C_{j-1}$, then $x\in U_{j}$. Assume that $x\in C_{j}\setminus C_{j-1}$. If $x\in X$, then $x\in X_{j}$. If $x\notin X$, then $x\in C_{j}\setminus X\subseteq C_{i}\setminus X\subseteq U_{j}$. Thus, $C_{j}\subseteq U_{j}\cup X_{j}$, which in turn implies that $C_{j}\setminus X_{j}\subseteq U_{j}\setminus X_{j}$.

This leads to

[TABLE]

This inequality leads to

[TABLE]

Consider also any set $Y$ with $Y\cap C_{i}=\emptyset$ . Due to the optimality of $C_{i}$ and Lemma 3.6 we must have $\mathit{d}\mathopen{}\left(Y,C_{i}\right)<\mathit{d}\mathopen{}\left(C_{i},C_{i-1}\right)$ .

We conclude that for any $X$ and $Y$ with $X\subseteq C_{i}$ and $Y\cap C_{i}=\emptyset$ we have $d(X,C_{i}\setminus X)>d(Y,C_{i})$, which shows that $C_{i}$ is locally dense.

Now, we can safely assume $C_{i}=B_{j}$ for some $j$ . We need to show that $j=i$ . By induction we know that $C_{i-1}=B_{i-1}$ . This guarantees that $j\geq i$ . Assume $j>i$ . Since $C_{i}$ is maximal, we have $\mathit{d}\mathopen{}\left(B_{j}\setminus B_{i},B_{i}\right)<\mathit{d}\mathopen{}\left(B_{i},B_{i-1}\right)$ .

Since $B_{i}$ is locally-dense, we have $\mathit{d}\mathopen{}\left(B_{j}\setminus B_{i},B_{i}\right)<\mathit{d}\mathopen{}\left(B_{i},B_{i-1}\right)$ . Lemma 3.6 now implies that

$$d(B_{j}, B_{i-1}) < d(B_{i}, B_{i-1}),$$

which contradicts the optimality of $C_{i}=B_{j}$ . Thus $i=j$ . ∎

As a consequence of the previous proposition we can characterize the first subgraph in the decomposition.

Corollary 3.9.

Let $\left\{B_{i}\right\}$ be a locally-dense decomposition of a graph $G$ . Then $B_{1}$ is the densest subgraph of $G$ .

The above discussion motivates the problem of locally-dense graph decomposition, which is the focus of this paper.

Problem 1.

Given a graph $G=(V,E)$ find a maximal sequence of locally-dense subgraphs

$$\emptyset = B_{0} \subsetneq B_{1} \subsetneq \cdots \subsetneq B_{k} = V.$$

4. Decomposition algorithms

In this section we propose two algorithms for the problem of locally-dense graph decomposition (Problem 1). The first algorithm gives an exact solution, and runs in worst-case time $\mathit{\mathcal{O}}\mathopen{}\left(n^{2}m\right)$ , but it is significantly faster in practice. The second algorithm is a linear-time algorithm that provides a factor- $2$ approximation guarantee.

Both algorithms are inspired by corresponding algorithms for the densest-subgraph problem: the first by the exact algorithm of Goldberg (1984), and the second by the greedy linear-time algorithm of Charikar (2000).

4.1. Exact algorithm

We start our discussion on the exact algorithm for locally-dense graph decomposition by reviewing Goldberg’s algorithm (Goldberg, 1984) for the densest-subgraph problem.

Recall that the densest-subgraph problem asks to find the subset of vertices $W$ that maximizes $d(W)=|E(W)|/|W|$. Given a graph $G=(V,E)$ and a number $\alpha\geq 0$, define a function

$$f(\alpha) = \max_{W \subseteq V} \{|E(W)| - \alpha |W|\},$$

and the maximizer

$$F(\alpha) = \arg\max_{W \subseteq V} \{|E(W)| - \alpha |W|\},$$

where ties are resolved by picking the largest $W$. Note that $f$ decreases as $\alpha$ increases, and as $\alpha$ exceeds a certain value, $f$ becomes $0$ by taking $W=\emptyset$. Goldberg observed that the densest-subgraph problem is equivalent to the problem of finding the largest value $\alpha^{*}$ for which the maximizer set $F(\alpha^{*})=W^{*}$ is non-empty. (Footnote 2: This observation is an instance of fractional programming (Dinkelbach, 1967).) The densest subgraph is precisely this maximizer set $W^{*}$. Furthermore, Goldberg showed how to find the vertex set $W=F(\alpha)$ for a given value of $\alpha$. This is done by mapping the problem to an instance of the min-cut problem, which can be solved in $\mathcal{O}(nm)$ time using the algorithm of Orlin (2013). We will present an extension of this transformation in the next section, where we discuss how to speed up the algorithm.

Thus, Goldberg’s algorithm uses binary search over $\alpha$ and finds the largest value of $\alpha^{*}$ for which the maximizer set $W^{*}$ is non empty. Each iteration of the binary search involves a call to a min-cut instance for the current value of $\alpha$ .

Our algorithm for finding the locally-dense decomposition of a graph builds on Goldberg’s algorithm (Goldberg, 1984). We show that Goldberg’s construction has the following, rather remarkable, property: there is a sequence of values $\alpha^{*}=\alpha_{1}>\cdots>\alpha_{k}$, for $k\leq n$, which gives all the distinct values of the maximizer $\mathit{F}$. Furthermore, the corresponding set of subgraphs $\{\mathit{F}\mathopen{}\left(\alpha_{1}\right),\ldots,\mathit{F}\mathopen{}\left(\alpha_{k}\right)\}$ is exactly the set of all locally-dense subgraphs of $G$, and thus the solution to our decomposition problem.

Therefore, our algorithm is a simple extension of Goldberg’s algorithm: instead of searching only for the optimal value $\alpha_{1}=\alpha^{*}$ , we find the whole sequence of $\alpha_{i}$ ’s and the corresponding subgraphs.

Next we prove the claimed properties and discuss the algorithm in more detail.

We first show that the distinct maximizers of the function $\mathit{F}$ correspond to the set of locally-dense subgraphs.

Proposition 4.1.

Let $\left\{B_{i}\right\}$ be the set of locally-dense subgraphs. Then

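$$B_{i}=\mathit{F}\mathopen{}\left(\alpha\right),\quad\text{for}\quad\mathit{d}\mathopen{}\left(B_{i+1},B_{i}\right)<\alpha\leq\mathit{d}\mathopen{}\left(B_{i},B_{i-1}\right).$$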

Proof.

We first show that $U=\mathit{F}\mathopen{}\left(\beta\right)$ is a locally-dense subgraph, for any $\beta$. Note that for any $X\subseteq U$, we must have ${\left|\mathit{E_{\Delta}}\mathopen{}\left(X,U\setminus X\right)\right|}-\beta{\left|X\right|}\geq 0$, otherwise we can delete $X$ from $U$ and obtain a better solution, which violates the optimality of $U=\mathit{F}\mathopen{}\left(\beta\right)$. This implies that $\mathit{d}\mathopen{}\left(X,U\setminus X\right)={\left|\mathit{E_{\Delta}}\mathopen{}\left(X,U\setminus X\right)\right|}/{\left|X\right|}\geq\beta$. Similarly, for any $Y$ such that $Y\cap U=\emptyset$, we have ${\left|\mathit{E_{\Delta}}\mathopen{}\left(Y,U\right)\right|}-\beta{\left|Y\right|}<0$ or, equivalently, $\mathit{d}\mathopen{}\left(Y,U\right)<\beta$. Thus, $U$ is locally-dense.

Fix $i$ and select $\alpha$ such that $\mathit{d}\mathopen{}\left(B_{i+1},B_{i}\right)<\alpha\leq\mathit{d}\mathopen{}\left(B_{i},B_{i-1}\right)$ . Let $B_{j}=\mathit{F}\mathopen{}\left(\alpha\right)$ . If $j>i$ , then, due to Corollary 3.3, $\mathit{d}\mathopen{}\left(B_{j},B_{j-1}\right)\leq\mathit{d}\mathopen{}\left(B_{i+1},B_{i}\right)<\alpha$ which we can rephrase as

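$$c={\left|\mathit{E_{\Delta}}\mathopen{}\left(B_{j}\setminus B_{j-1},B_{j-1}\right)\right|}-\alpha{\left|B_{j}\setminus B_{j-1}\right|}<0.$$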

If we delete $B_{j}\setminus B_{j-1}$ from $U=B_{j}$, then we improve the quality exactly by $-c$, that is, we obtain a better solution, which violates the optimality of $U$. If $j<i$, then Corollary 3.3 implies that $\mathit{d}\mathopen{}\left(B_{j+1},B_{j}\right)\geq\alpha$, so we can add $B_{j+1}\setminus B_{j}$ to obtain a larger solution that is at least as good, which contradicts the choice of $\mathit{F}\mathopen{}\left(\alpha\right)$ as the largest maximizer. It follows that $B_{i}=\mathit{F}\mathopen{}\left(\alpha\right)$. ∎

Next we need to show that it is possible to search efficiently for the sequence of $\alpha$ ’s that give the set of locally-dense subgraphs. To that end we will show that if we have obtained two subgraphs $B_{x}\subsetneq B_{y}$ of the decomposition (corresponding to values $\alpha_{x}>\alpha_{y}$ ), it is possible to pick a new value $\alpha$ so that computing $\mathit{F}\mathopen{}\left(\alpha\right)$ allows us to make progress in the search process: we either find a new locally-dense subgraph $B_{x}\subsetneq B_{z}\subsetneq B_{y}$ or we establish that no such subgraph exists between $B_{x}$ and $B_{y}$ , in other words, $B_{x}$ and $B_{y}$ are consecutive subgraphs in our decomposition.

Proposition 4.2.

Let $\left\{B_{i}\right\}$ be the set of locally-dense subgraphs. Let $B_{x}\subsetneq B_{y}$ be two subgraphs. Set $\alpha=\mathit{d}\mathopen{}\left(B_{y},B_{x}\right)+n^{-2}$ and let $B_{z}=\mathit{F}\mathopen{}\left(\alpha\right)$ . If $x+1<y$ , then $x<z<y$ . If $x+1=y$ , then $z=x$ .

Lemma 4.3.

$\mathit{d}\mathopen{}\left(B_{k},B_{i}\right)\geq\mathit{d}\mathopen{}\left(B_{\ell},B_{j}\right)$, for $i\leq j<k\leq\ell$. The equality holds if and only if $i=j$ and $k=\ell$.

Proof.

Corollary 3.3 states that $\mathit{d}\mathopen{}\left(B_{o},B_{o-1}\right)$ is monotonically strictly decreasing as a function of $o$. Lemma 3.6, applied recursively, states that

[TABLE]

The inequality is strict if and only if $i\neq j$ or $k\neq\ell$. ∎

Proof of Proposition 4.2.

Lemma 4.3 states that $\mathit{d}\mathopen{}\left(B_{y},B_{y-1}\right)\leq\mathit{d}\mathopen{}\left(B_{y},B_{x}\right)<\alpha$ . Proposition 4.1 now implies that $z<y$ .

Assume that $x+1<y$ . Lemma 4.3 implies that $\mathit{d}\mathopen{}\left(B_{y},B_{x}\right)<\mathit{d}\mathopen{}\left(B_{x+1},B_{x}\right)$ . Write

[TABLE]

Let us now bound the difference between the densities as

[TABLE]

This implies that $\alpha\leq\mathit{d}\mathopen{}\left(B_{x+1},B_{x}\right)$ . Proposition 4.1 now implies that $z\geq x+1>x$ .

Assume that $x+1=y$. Lemma 4.3 implies that $\mathit{d}\mathopen{}\left(B_{y},B_{y-1}\right)<\mathit{d}\mathopen{}\left(B_{x},B_{x-1}\right)$, and the same argument as above shows that $\alpha\leq\mathit{d}\mathopen{}\left(B_{x},B_{x-1}\right)$ and, consequently, $z\geq x$. This guarantees that $x=z$. ∎
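The role of the offset $n^{-2}$ can be made explicit with a short calculation. This is a sketch under the assumption, taken from the definition of density in Section 3 and not repeated here, that every density is a ratio of two integers whose denominator is a number of vertices and is therefore at most $n$. The symbols $a$, $b$, $c$, $e$ are introduced only for this illustration.

$$\mathit{d}\mathopen{}\left(B_{x+1},B_{x}\right)=\frac{a}{b},\qquad\mathit{d}\mathopen{}\left(B_{y},B_{x}\right)=\frac{c}{e},\qquad 1\leq b,e\leq n,$$

$$\frac{a}{b}-\frac{c}{e}=\frac{ae-bc}{be}\geq\frac{1}{be}\geq\frac{1}{n^{2}},$$

where the first inequality holds because $ae-bc$ is a positive integer whenever $a/b>c/e$. Hence $\alpha=c/e+n^{-2}\leq a/b$: the shifted value lies above $\mathit{d}\mathopen{}\left(B_{y},B_{x}\right)$ without overshooting the next larger density.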

The exact decomposition algorithm uses Proposition 4.2 to guide the search process. Starting with the two extreme subgraphs of the decomposition, $\emptyset$ and $V$, the algorithm maintains a sequence of locally-dense subgraphs. Recursively, for any two currently-adjacent subgraphs in the sequence, we use Proposition 4.2 to check whether the two subgraphs are consecutive or not in the decomposition. If they are consecutive, the recursion at that branch of the search is terminated. If they are not, a new subgraph between the two is discovered and it is added to the decomposition. The algorithm is named ExactLD and it is illustrated as Algorithm 1.
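The recursion described above can be sketched in a few lines. This is an illustration rather than a transcription of Algorithm 1: it assumes that $\mathit{d}(Y,X)=(|E(Y)|-|E(X)|)/(|Y|-|X|)$ for nested sets $X\subsetneq Y$, and it replaces the min-cut computation of $\mathit{F}(\alpha)$ with a brute-force search that is only feasible for tiny graphs. All function names are hypothetical.

```python
from fractions import Fraction
from itertools import combinations

def induced_edges(edges, W):
    """Number of edges with both endpoints in W."""
    return sum(1 for u, v in edges if u in W and v in W)

def maximizer(vertices, edges, alpha):
    """Brute-force stand-in for F(alpha); ties go to the largest set."""
    best_val, best_set = Fraction(0), frozenset()
    for size in range(1, len(vertices) + 1):
        for W in combinations(vertices, size):
            val = induced_edges(edges, W) - alpha * size
            if val >= best_val:
                best_val, best_set = val, frozenset(W)
    return best_set

def exact_ld(vertices, edges, X, Y, found):
    """Append to `found` every locally-dense subgraph strictly between X and Y."""
    n = len(vertices)
    density = Fraction(induced_edges(edges, Y) - induced_edges(edges, X),
                       len(Y) - len(X))
    # Probe slightly above d(Y, X), as in Proposition 4.2.
    Z = maximizer(vertices, edges, density + Fraction(1, n * n))
    if Z == X:
        return                      # X and Y are consecutive: stop this branch
    exact_ld(vertices, edges, X, Z, found)
    found.append(Z)                 # Z is a new subgraph between X and Y
    exact_ld(vertices, edges, Z, Y, found)

def decompose(vertices, edges):
    """Return the chain B_1, ..., B_k = V (the empty set B_0 is omitted)."""
    found = []
    exact_ld(vertices, edges, frozenset(), frozenset(vertices), found)
    return found + [frozenset(vertices)]
```

For a graph consisting of a 4-clique on $\{0,1,2,3\}$, a path $3$-$4$-$5$ and an isolated vertex $6$, the sketch returns the chain $\{0,1,2,3\}\subsetneq\{0,\ldots,5\}\subsetneq V$, with relative densities $3/2$, $1$ and $0$.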

With the next propositions we prove the correctness of the algorithm and we bound its running time.

Proposition 4.4.

The algorithm ExactLD initiated with input $(G,\emptyset,V)$ visits all non-trivial locally-dense subgraphs of $G$ .

Proof.

Let $\left\{B_{i}\right\}$ be the set of locally-dense subgraphs. We will prove the proposition by showing that for $i<j$, the algorithm $\textsc{ExactLD}(G,B_{i},B_{j})$ visits all locally-dense subgraphs that are between $B_{i}$ and $B_{j}$. We will prove this by induction over $j-i$. The first step $j=i+1$ is trivial. Assume that $j>i+1$. Then Proposition 4.2 implies that $B_{k}=\mathit{F}\mathopen{}\left(\alpha\right)$, where $i<k<j$. The inductive assumption now guarantees that $\textsc{ExactLD}(G,B_{i},B_{k})$ and $\textsc{ExactLD}(G,B_{k},B_{j})$ will visit all locally-dense subgraphs between $B_{i}$ and $B_{j}$. ∎

Proposition 4.5.

The worst-case running time of algorithm ExactLD is $\mathit{\mathcal{O}}\mathopen{}\left(n^{2}m\right)$ .

Proof.

We will show that the algorithm ExactLD, initiated with input $(G,\emptyset,V)$ makes $2k-3$ calls to the function $\mathit{F}$ , where $k$ is the number of locally-dense subgraphs.

Let $k_{i}$ be the number of calls of $\mathit{F}$ when the input parameter $Y=B_{i}$ . Out of these $k_{i}$ calls one call will result in $\mathit{F}\mathopen{}\left(\alpha\right)=X$ . There are $k-1$ such calls, since $Y=\emptyset$ is never tested. Each of the remaining calls will discover a new locally-dense subgraph. Since there are $k-2$ new subgraphs to discover, it follows that $2k-3$ calls to $\mathit{F}$ are needed.

Since a call to $\mathit{F}$ corresponds to a min-cut computation, which has running time $\mathit{\mathcal{O}}\mathopen{}\left(nm\right)$ (Orlin, 2013), and since $k\in\mathit{\mathcal{O}}\mathopen{}\left(n\right)$ , the claimed running-time bound follows. ∎

4.2. Speeding up the exact algorithm

Our next step is to speed up ExactLD. This speed-up does not improve the theoretical bound on the running time but, in practice, it improves the performance of the algorithm dramatically.

The speed-up is based on the following observation. We know from Proposition 4.2 that $\textsc{ExactLD}(G,X,Y)$ visits only subgraphs $Z$ with the property $X\subseteq Z\subseteq Y$ . This gives us immediately the first speed-up: we can safely ignore any vertex outside $Y$ , that is, $\textsc{ExactLD}(G(Y),X,Y)$ will yield the same output.

Our second observation is that any subgraph $Z$ visited by $\textsc{ExactLD}(G,X,Y)$ must contain vertices $X$ . However, we cannot simply delete them because we need to take into account the edges between $X$ and $Z$ . To address this let us consider the following maximizer

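$$\mathit{F}\mathopen{}\left(\alpha;X\right)=\arg\max_{X\subseteq W\subseteq V}\left\{{\left|E(W)\right|}-\alpha{\left|W\right|}\right\}.$$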

We can replace the original $\mathit{F}\mathopen{}\left(\alpha\right)$ in Algorithm 1 with $\mathit{F}\mathopen{}\left(\alpha;X\right)$. To compute $\mathit{F}\mathopen{}\left(\alpha;X\right)$ we will use a straightforward extension of Goldberg’s algorithm (Goldberg, 1984) and transform this problem into a problem of finding a minimum cut.

In order to do this, given a graph $G=(V,E)$ , let us define a weighted graph $H$ that consists of vertices $V\setminus X$ and edges $E(V\setminus X)$ with weights of 1. Add two auxiliary vertices $s$ and $t$ into $H$ and connect these vertices to every vertex in $V\setminus X$ . Given a vertex $y\in V\setminus X$ , assign a weight of $2\alpha$ to the edge $(y,t)$ and a weight of

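$$w(y)=\mathit{\operatorname{deg}}\mathopen{}\left(y;V\setminus X\right)+2\mathit{\operatorname{deg}}\mathopen{}\left(y;X\right)$$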

to the edge $(s,y)$, where $\mathit{\operatorname{deg}}\mathopen{}\left(y;U\right)$ stands for the number of neighbors of $y$ in $U$. We claim that solving the minimum-cut problem in which $s$ and $t$ lie on different sides of the cut yields $\mathit{F}\mathopen{}\left(\alpha;X\right)$. This cut can be obtained by computing a maximum flow from $s$ to $t$.

To prove this claim let $C\subsetneq V(H)$ be a subset of vertices containing $s$ and not containing $t$ . Let $Z=C\setminus\left\{s\right\}$ and also let $W=V\setminus(Z\cup X)$ . There are three types of cross-edges from $C$ to $V(H)\setminus C$ : (i) edges from $x\in Z$ to $t$ , (ii) edges from $s$ to $x\in W$ , and (iii) edges from $x\in Z$ to $y\in W$ . The total cost of $C$ is then

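$$2\alpha{\left|Z\right|}+\sum_{x\in W}w(x)+{\left|\left\{(x,y)\in E\mid x\in Z,\ y\in W\right\}\right|}.$$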

We claim that the last two terms of the cost are equal to $2{\left|E\right|}-2{\left|E(X\cup Z)\right|}$ . To see this, consider an edge $e=(x,y)$ in $E\setminus E(X\cup Z)$ . This implies that at least one of the end points, assume it is $y$ , has to be in $W$ . There are three different cases for $x$ : (i) if $x\in W$ , then $e$ contributes 2 to the cost: 1 to $w(x)$ and 1 to $w(y)$ , (ii) if $x\in X$ , then $e$ contributes $2$ to $w(y)$ , and (iii) if $x\in Z$ , then $e$ contributes $1$ to $w(y)$ and $1$ to the third term. Thus, we can write the cut as

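$$2\alpha{\left|Z\right|}+2{\left|E\right|}-2{\left|E(X\cup Z)\right|}=2{\left|E\right|}-2\alpha{\left|X\right|}-2\left({\left|E(X\cup Z)\right|}-\alpha{\left|Z\cup X\right|}\right).$$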

The first two terms on the right-hand side are constant, which implies that finding the minimum cut is equivalent to maximizing ${\left|E(X\cup Z)\right|}-\alpha{\left|Z\cup X\right|}$. Consequently, if $Z^{*}$ is the min-cut solution, then $\mathit{F}\mathopen{}\left(\alpha;X\right)=X\cup Z^{*}$.

Note that the graph $H$ does not have vertices included in $X$ . By combining both speed-ups we are able to reduce the running time of $\textsc{ExactLD}(X,Y)$ by considering only the vertices that are in $Y\setminus X$ .
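The correspondence between cut cost and the objective can be checked numerically on a small instance. The sketch below does not run a max-flow algorithm; it assumes the edge weights described above (weight $2\alpha$ towards $t$, unit weight inside $V\setminus X$, and a source weight that counts neighbors in $X$ twice), evaluates the cost of the cut $\{s\}\cup Z$ directly, and picks the best $Z$ by enumeration. The function names are hypothetical.

```python
from fractions import Fraction
from itertools import combinations

def cut_cost(vertices, edges, X, Z, alpha):
    """Cost of the s-t cut {s} + Z in the auxiliary graph H built on V minus X."""
    X, Z = set(X), set(Z)
    W = set(vertices) - X - Z          # vertices of H left on the sink side

    def source_weight(y):
        # Weight of edge (s, y): neighbours in X count twice, others once.
        total = 0
        for u, v in edges:
            if y == u or y == v:
                other = v if y == u else u
                total += 2 if other in X else 1
        return total

    to_sink = 2 * alpha * len(Z)                     # edges (x, t) with x in Z
    from_source = sum(source_weight(y) for y in W)   # edges (s, y) with y in W
    crossing = sum(1 for u, v in edges               # unit edges between Z and W
                   if (u in Z and v in W) or (u in W and v in Z))
    return to_sink + from_source + crossing

def objective(edges, S, alpha):
    """|E(S)| - alpha * |S|, the quantity maximized by F(alpha; X)."""
    inside = sum(1 for u, v in edges if u in S and v in S)
    return inside - alpha * len(S)

def best_extension(vertices, edges, X, alpha):
    """Enumerate all Z in V minus X and return X + Z for the cheapest cut.

    Ties are broken towards the larger Z, mirroring the tie rule for F.
    """
    rest = [v for v in vertices if v not in X]
    candidates = [frozenset(Z) for r in range(len(rest) + 1)
                  for Z in combinations(rest, r)]
    best = min(candidates,
               key=lambda Z: (cut_cost(vertices, edges, X, Z, alpha), -len(Z)))
    return frozenset(X) | best
```

For every choice of $Z$, `cut_cost` should equal $2|E|-2\alpha|X|-2\,(|E(X\cup Z)|-\alpha|X\cup Z|)$, so minimizing the cut and maximizing the objective select the same set.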

4.3. Linear approximation algorithm

As we saw in the last section, the exact algorithm can be significantly accelerated, and indeed, our experimental evaluation shows that it is possible to run the exact algorithm for a graph of millions of vertices and edges within 2 minutes. Nevertheless, the worst-case complexity of the algorithm is cubic, and thus, it is not truly scalable for massive graphs.

Here we present a more lightweight algorithm for performing a locally-dense decomposition of a graph. The algorithm runs in linear time and offers a factor- $2$ approximation guarantee. As the exact algorithm builds on Goldberg’s algorithm for the densest-subgraph problem, the linear-time algorithm builds on Charikar’s approximation algorithm for the same problem (Charikar, 2000). As already explained in Section 2, Charikar’s approximation algorithm iteratively removes the vertex with the lowest degree, until left with an empty graph, and returns the densest graph among all subgraphs considered during this process.

Our extension to this algorithm, called GreedyLD, is illustrated in Algorithm 2, and it operates in two phases. The first phase is identical to the one in Charikar’s algorithm: all vertices of the graph are iteratively removed, in increasing order of their degree in the current graph. In the second phase, the algorithm proceeds to discover approximate locally-dense subgraphs, in an iterative manner, from $B_{1}$ to $B_{k}$. The first subgraph $B_{1}$ is the approximate densest subgraph, the same one returned by Charikar’s algorithm. In the $j$-th step of the iteration, having discovered subgraphs $B_{1},\ldots,B_{j-1}$, the algorithm selects the subgraph $B_{j}$ that maximizes the density $\mathit{d}\mathopen{}\left(B_{j},B_{j-1}\right)$. To select $B_{j}$ the algorithm considers subsets of vertices only in the degree-based order that was produced in the first phase.

Discovering $\mathcal{C}$ from the ordered vertices takes $\mathit{\mathcal{O}}\mathopen{}\left(n^{2}\right)$ time, if done naively. However, it is possible to implement this step in $\mathit{\mathcal{O}}\mathopen{}\left(n\right)$ time. In order to do this, sort vertices in the reverse visit order, and define $\mathit{\mathrm{in}}\mathopen{}\left(v\right)$ to be the number of edges of $v$ from the earlier neighbors. Then we can express the density as an average,

[TABLE]

Consequently, we can see that recovering $\mathcal{C}$ is an instance of the following problem,

Problem 2.

Given a sequence $y_{1},\ldots,y_{n}$ , compute the maximal interval

[TABLE]

Luckily, Calders et al. (2014) demonstrated that we can use the classic PAVA algorithm by Ayer et al. (1955) to solve this problem for every value of $j$ in total $\mathit{\mathcal{O}}\mathopen{}\left(n\right)$ time.
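The two phases can be sketched as follows. This is an illustration and not Algorithm 2 itself: the second phase uses the naive quadratic scan over prefixes rather than PAVA, the first phase picks the minimum-degree vertex with a linear scan instead of bucket queues, and ties between equally dense prefixes are broken towards the longer one (an assumption). The function name `greedy_ld` is hypothetical.

```python
from fractions import Fraction

def greedy_ld(n, edges):
    """Approximate locally-dense chain for a graph on vertices 0..n-1."""
    adj = {v: set() for v in range(n)}
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)

    # Phase 1: peel vertices in increasing order of current degree.
    deg = {v: len(adj[v]) for v in adj}
    alive, removal = set(adj), []
    while alive:
        v = min(alive, key=lambda x: (deg[x], x))
        alive.remove(v)
        removal.append(v)
        for u in adj[v]:
            if u in alive:
                deg[u] -= 1

    # Reverse visit order; inn[i] counts neighbours of order[i] that come earlier.
    order = removal[::-1]
    pos = {v: i for i, v in enumerate(order)}
    inn = [sum(1 for u in adj[v] if pos[u] < pos[v]) for v in order]

    # Phase 2: repeatedly extend the chain by the prefix with the largest
    # average of inn over the newly added vertices.
    chain, start = [], 0
    while start < n:
        best_end, best_avg, total = start + 1, None, 0
        for end in range(start + 1, n + 1):
            total += inn[end - 1]
            avg = Fraction(total, end - start)
            if best_avg is None or avg >= best_avg:
                best_avg, best_end = avg, end
        chain.append(frozenset(order[:best_end]))
        start = best_end
    return chain
```

The sketch relies on the observation that, for a prefix of the reverse visit order, the number of induced edges equals the sum of the `inn` values of its vertices, so the density of a block relative to the preceding prefix is the average of `inn` over that block.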

To quantify the approximation guarantee of GreedyLD, note that the sequence of approximate locally-dense subgraphs produced by the algorithm are not necessarily aligned with the locally-dense subgraphs of the optimal decomposition. In other words, to assess the quality of the density of an approximate locally-dense subgraph $B_{j}$ produced by GreedyLD, there is no direct counterpart in the optimal decomposition to compare. To overcome this difficulty we develop a scheme of “vertex-wise” comparison, where for any $1\leq i\leq n$ , the density of the smallest approximate locally-dense subgraph of size at least $i$ is compared with the density of the smallest optimal locally-dense subgraph of size at least $i$ . This is defined below via the concept of profile.

Definition 4.6.

Let $\mathcal{B}=(\emptyset=B_{0}\subsetneq B_{1}\subsetneq\cdots\subsetneq B_{k}=V)$ be a nested chain of subgraphs, the first subgraph being the empty graph and the last subgraph being the full graph. For an integer $i$ , $1\leq i\leq n$ define

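$$j=\min\left\{x\mid{\left|B_{x}\right|}\geq i\right\}$$

to be the index of the smallest subgraph in $\mathcal{B}$ whose size is at least $i$. We define a profile function ${\mathit{p}}:{\left\{1,\ldots,n\right\}}\to\mathbb{R}$ to be

$$\mathit{p}\mathopen{}\left(i,\mathcal{B}\right)=\mathit{d}\mathopen{}\left(B_{j},B_{j-1}\right).$$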
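A profile is straightforward to tabulate once a chain is given. The sketch below assumes that the profile value at $i$ is the density $\mathit{d}(B_{j},B_{j-1})$ of the smallest subgraph $B_{j}$ of size at least $i$, and that this density equals $(|E(B_{j})|-|E(B_{j-1})|)/(|B_{j}|-|B_{j-1}|)$, following the definition in Section 3 that is not repeated here. The function name `profile` is hypothetical.

```python
from fractions import Fraction

def profile(chain, edges):
    """Return [p(1), ..., p(n)] for a nested chain B_1, ..., B_k = V.

    `chain` omits the empty set B_0; each entry is a set of vertices.
    """
    def induced(S):
        return sum(1 for u, v in edges if u in S and v in S)

    values, prev = [], frozenset()
    for B in chain:
        grown = len(B) - len(prev)
        density = Fraction(induced(B) - induced(prev), grown)
        # Every size i in (|B_{j-1}|, |B_j|] maps to the same subgraph B_j.
        values.extend([density] * grown)
        prev = B
    return values
```

Two chains over the same vertex set can then be compared position by position, which is how the approximation guarantee below is stated.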

Our approximation guarantee is now expressed as a guarantee of the profile function of the approximate decomposition with respect to the optimal decomposition.

Proposition 4.7.

Let $\mathcal{B}=\left\{B_{i}\right\}$ be the set of locally-dense subgraphs. Let $\mathcal{C}=\left\{C_{i}\right\}$ be the subgraphs obtained by GreedyLD. Then

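$$\mathit{p}\mathopen{}\left(i,\mathcal{C}\right)\geq\mathit{p}\mathopen{}\left(i,\mathcal{B}\right)/2,\quad\text{for}\quad i=1,\ldots,n.$$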

First, we need the following lemma.

Lemma 4.8.

$\mathit{d}\mathopen{}\left(v,B_{i}\setminus\left\{v\right\}\right)\geq\mathit{d}\mathopen{}\left(B_{i},B_{i-1}\right)$, for $v\in B_{i}\setminus B_{i-1}$.

Proof.

Assume otherwise. Lemma 3.6 now states that $\mathit{d}\mathopen{}\left(B_{i}\setminus\left\{v\right\},B_{i-1}\right)<\mathit{d}\mathopen{}\left(B_{i},B_{i-1}\right)$ , which violates the optimality of $B_{i}$ as indicated by Proposition 3.5. ∎

Proof of Proposition 4.7.

Sort the set of vertices $V$ according to the reverse visiting order of GreedyLD and let $\mathit{\mathrm{in}}\mathopen{}\left(v\right)$ be the number of edges of $v$ from earlier neighbors.

Fix $k$ to be an integer, $1\leq k\leq n$ and let $B_{i}$ be the smallest subgraph such that ${\left|B_{i}\right|}\geq k$ . Let $v_{j}$ be the last vertex occurring in $B_{i}$ . We must have $\mathit{\mathrm{in}}\mathopen{}\left(v_{j}\right)\geq\mathit{d}\mathopen{}\left(v_{j},B_{i}\setminus\left\{v_{j}\right\}\right)$ , and, due to Lemma 4.8, $\mathit{d}\mathopen{}\left(v_{j},B_{i}\setminus\left\{v_{j}\right\}\right)\geq\mathit{d}\mathopen{}\left(B_{i},B_{i-1}\right)$ . In summary, we have

$$\mathit{\mathrm{in}}\mathopen{}\left(v_{j}\right)\geq\mathit{d}\mathopen{}\left(B_{i},B_{i-1}\right).$$

Let $C_{x}$ be the smallest subgraph such that ${\left|C_{x}\right|}\geq k$ . Let $v_{z}$ be the vertex with the smallest index that is still in $C_{x}\setminus C_{x-1}$ and define $A=\left\{v_{z},\ldots,v_{j}\right\}$ . Let $g(v)$ be the degree of $v\in A$ right before $v_{j}$ is removed during GreedyLD. Note that, by definition, $\mathit{\mathrm{in}}\mathopen{}\left(v_{j}\right)\leq g(v)$ , and that

[TABLE]

This leads to

[TABLE]

where the optimality of $C_{x}$ implies the first inequality. ∎

We should point out that $\mathit{p}\mathopen{}\left(1,\mathcal{B}\right)$ is equal to the density of the densest subgraph, while $\mathit{p}\mathopen{}\left(1,\mathcal{C}\right)$ is equal to the density of the subgraph discovered by Charikar’s algorithm. Consequently, Proposition 4.7 automatically provides the 2-approximation guarantee of Charikar’s algorithm.

We should also point out that $\mathit{p}\mathopen{}\left(i,\mathcal{C}\right)$ can be larger than $\mathit{p}\mathopen{}\left(i,\mathcal{B}\right)$ . However, if $j$ is the first index, for which $\mathit{p}\mathopen{}\left(j,\mathcal{C}\right)\neq\mathit{p}\mathopen{}\left(j,\mathcal{B}\right)$ , then Proposition 3.5 guarantees that $\mathit{p}\mathopen{}\left(j,\mathcal{C}\right)<\mathit{p}\mathopen{}\left(j,\mathcal{B}\right)$ .

5. Locally-dense subgraphs and core decomposition

Here we study the connection between graph cores, obtained with the well-known $k$-core decomposition algorithms, and local density, studied in this paper. We are able to show that, from a theoretical point of view, graph cores are as good an approximation to the optimal locally-dense graph decomposition as the subgraphs obtained by the GreedyLD algorithm. In particular we show a result similar to Proposition 4.7, namely, a factor-$2$ approximation on the profile function of the core decomposition.

However, as we will see in our empirical evaluation, the behavior of the two algorithms, GreedyLD and $k$-core decomposition, differs in practice, with GreedyLD giving in general denser subgraphs that are closer to the ones given by the exact locally-dense decomposition.

Before stating and proving the result regarding $k$ -cores, recall that a set of vertices $X\subseteq V$ is a $k$ -core if every vertex in the subgraph induced by $X$ has degree at least $k$ , and $X$ is maximal with respect to this property. A linear-time algorithm for obtaining all $k$ -cores is illustrated in Algorithm 3.
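The peeling procedure behind core decomposition can be sketched as follows. This is the standard minimum-degree peeling written for clarity rather than speed, not a transcription of Algorithm 3: a linear-time version would replace the `min` scan with bucket queues. The function names are hypothetical.

```python
def core_numbers(n, edges):
    """Core number of every vertex of a graph on vertices 0..n-1."""
    adj = {v: set() for v in range(n)}
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    deg = {v: len(adj[v]) for v in adj}
    alive, core, k = set(adj), {}, 0
    while alive:
        v = min(alive, key=lambda x: deg[x])   # vertex of smallest current degree
        k = max(k, deg[v])                     # core levels never decrease
        core[v] = k
        alive.remove(v)
        for u in adj[v]:
            if u in alive:
                deg[u] -= 1
    return core

def core_chain(n, edges):
    """Distinct k-cores as a nested chain, innermost core first."""
    core = core_numbers(n, edges)
    levels = sorted(set(core.values()), reverse=True)
    return [frozenset(v for v in core if core[v] >= k) for k in levels]
```

Here the $k$-core is recovered as the set of vertices whose core number is at least $k$; values of $k$ that yield the same vertex set as a larger $k$ are merged into a single entry of the chain.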

It is a well-known fact that the set of all $k$ -cores of a graph forms a nested chain of subgraphs, in the same way that locally-dense subgraphs do.

Proposition 5.1.

Let $\left\{C_{i}\right\}$ be the set of all $k$ -cores of a graph $G=(V,E)$ . Then $\left\{C_{i}\right\}$ forms a nested chain,

[TABLE]

Similar to Proposition 4.7, $k$ -cores provide a factor- $2$ approximation with respect to the locally-dense subgraphs. The proof is in fact quite similar to that of Proposition 4.7.

Proposition 5.2.

Let $\mathcal{B}=\left\{B_{i}\right\}$ be the set of locally-dense subgraphs. Let $\mathcal{C}=\left\{C_{i}\right\}$ be the set of $k$ -cores. Then

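$$\mathit{p}\mathopen{}\left(i,\mathcal{C}\right)\geq\mathit{p}\mathopen{}\left(i,\mathcal{B}\right)/2,\quad\text{for}\quad i=1,\ldots,n.$$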

Proof.

Sort $V$ according to the reverse visiting order of Core and let $\mathit{\mathrm{in}}\mathopen{}\left(v\right)$ be the number of edges of $v$ from earlier neighbors.

Fix $k$ to be an integer, $1\leq k\leq n$ and let $B_{i}$ be the smallest subgraph such that ${\left|B_{i}\right|}\geq k$ . Let $v_{j}$ be the last vertex occurring in $B_{i}$ . We must have $\mathit{\mathrm{in}}\mathopen{}\left(v_{j}\right)\geq\mathit{d}\mathopen{}\left(v_{j},B_{i}\setminus\left\{v_{j}\right\}\right)$ , and, due to Lemma 4.8, $\mathit{d}\mathopen{}\left(v_{j},B_{i}\setminus\left\{v_{j}\right\}\right)\geq\mathit{d}\mathopen{}\left(B_{i},B_{i-1}\right)$ . In summary, we have

$$\mathit{\mathrm{in}}\mathopen{}\left(v_{j}\right)\geq\mathit{d}\mathopen{}\left(B_{i},B_{i-1}\right).$$

Let $C_{x}$ be the smallest core such that ${\left|C_{x}\right|}\geq k$ , and write $A=C_{x}\setminus C_{x-1}$ . Let $v_{s}$ be the vertex with the smallest index that is still in $A$ , and let $v_{l}$ be the vertex with the largest index that is still in $A$ , that is, $\left\{v_{s},\ldots,v_{l}\right\}=A$ .

If $j>l$, then $\mathit{\mathrm{in}}\mathopen{}\left(v_{j}\right)<\mathit{\mathrm{in}}\mathopen{}\left(v_{l}\right)$, otherwise $C_{x}$ is not a core. If $j<l$, then $\mathit{\mathrm{in}}\mathopen{}\left(v_{j}\right)\leq\mathit{\mathrm{in}}\mathopen{}\left(v_{l}\right)$, otherwise $v_{j}\notin C_{x}$, and since $j\geq k$, then $C_{x}$ is not the smallest core with at least $k$ vertices, which is a contradiction. Hence, $\mathit{\mathrm{in}}\mathopen{}\left(v_{j}\right)\leq\mathit{\mathrm{in}}\mathopen{}\left(v_{l}\right)$.

Let $g(v)$ be the degree of $v\in A$ right before $v_{l}$ is removed during Core. We now have

[TABLE]

which proves the proposition. ∎

6. Segmentation problem: constraining the number of subgraphs

It is possible that the decomposition yields a large number of subgraphs. In such a case it may be useful to constrain the number of subgraphs. In order to do so we need to define an optimization criterion, which will be our first step. We then demonstrate how to solve the problem exactly, and how to estimate the solution efficiently.

6.1. Problem definition

Our goal is to discover $k$ nested subgraphs that minimize a certain cost. We base the cost on the degree of a node, relative to the subgraph. A natural approach here is to model the degree, that is, our goal is to maximize the log-likelihood $\sum_{v}\log p(\mathit{\operatorname{deg}}\mathopen{}\left(v;C_{i}\right);\lambda_{i})$ , where $C_{i}$ is the smallest subgraph containing $v$ and $\lambda_{i}$ is a parameter of the distribution. Unfortunately, this is problematic due to the following reason: an edge $(x,y)$ , where $x,y\in C_{i}\setminus C_{i-1}$ increases the degrees of both $x$ and $y$ , whereas an edge $(x,y)$ , with $x\in C_{i}$ and $y\in C_{i-1}$ increases the degrees only for $x$ and not for $y$ . The distribution we will consider favors small degrees, so this leads to a scenario where the cost function implicitly favors having a lot of cross-edges. To rectify this problem we introduce the notion of adjusted degree, where we count each cross-edge twice.

Definition 6.1.

Assume a sequence of nested subgraphs $\mathcal{C}=\left(\emptyset=C_{0}\subsetneq C_{1}\subsetneq\cdots\subsetneq C_{k}=V\right)$ . Let $v$ be a vertex and let $C_{i}$ be the smallest set containing $v$ . Define the adjusted degree as

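$$\mathit{\operatorname{adg}}\mathopen{}\left(v;\mathcal{C}\right)=\mathit{\operatorname{deg}}\mathopen{}\left(v;C_{i}\right)+\mathit{\operatorname{deg}}\mathopen{}\left(v;C_{i-1}\right).$$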

To reduce the clutter, we typically omit $\mathcal{C}$ from the notation and write $\mathit{\operatorname{adg}}\mathopen{}\left(v\right)$ .
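Adjusted degrees can be computed in one pass over the edges. The sketch below assumes the reading given in the text: an edge between two vertices of the same layer $C_{i}\setminus C_{i-1}$ adds one to both endpoints, while a cross-edge adds two to the endpoint in the outer layer and nothing to the endpoint in the inner layer. The function name `adjusted_degrees` is hypothetical.

```python
def adjusted_degrees(chain, edges):
    """Adjusted degree of every vertex for a nested chain C_1, ..., C_k = V."""
    # level[v] = index of the smallest subgraph in the chain containing v.
    level = {}
    for i, C in enumerate(chain):
        for v in C:
            level.setdefault(v, i)

    adg = {v: 0 for v in level}
    for u, v in edges:
        if level[u] == level[v]:
            adg[u] += 1                 # edge inside one layer: once per endpoint
            adg[v] += 1
        else:
            outer = u if level[u] > level[v] else v
            adg[outer] += 2             # cross-edge: counted twice for the outer vertex
    return adg
```

Under this reading every edge contributes exactly two units in total, so the adjusted degrees of a layer sum to twice the number of edges that the layer adds on top of the inner subgraph.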

Next we give a formal definition of the problem.

Definition 6.2.

Assume that we are given a distribution $p(\cdot;\lambda)$ for the adjusted degree. This distribution has one parameter $\lambda$; small values of $\lambda$ make high degrees more likely. Given a graph $G$ and an integer $k$, find a $k$-segmentation, a sequence of nested subgraphs $\mathcal{C}=\left(\emptyset=C_{0}\subsetneq C_{1}\subsetneq\cdots\subsetneq C_{k}=V\right)$ and parameters $\lambda_{1}\leq\cdots\leq\lambda_{k}$, minimizing the negative log-likelihood

$$-\sum_{v\in V}\log p\left(\mathit{\operatorname{adg}}\mathopen{}\left(v\right);\lambda_{i}\right),$$

where $i$ is the index of the smallest $C_{i}$ containing $v$ .

The reason why we write this problem as a minimization problem is because typically the log-likelihood is negative, and in order to establish approximation guarantees we need to have the cost function to be positive.

We are specifically interested in geometric and exponential distributions. Both distributions can be written as $p(x;\lambda)=\exp(-\lambda x-Z(\lambda))$, where $Z(\lambda)$ is the normalization constant. (The geometric distribution is defined over the integers whereas the exponential distribution is defined over the real domain. This results in different normalization constants.) Moreover, smaller values of $\lambda$ will result in a distribution favoring larger degrees, that is, inner subgraphs should be denser.

6.2. Exact algorithm

In this section we demonstrate how to find an optimal segmentation using locally-dense subgraphs. First we prove the key proposition that states that it is enough to use locally-dense subgraphs when looking for the optimal segmentation.

Proposition 6.3.

Assume that $p$ is either exponential or geometric distribution. Then there is an optimal segmentation $\mathcal{C}=\left(\emptyset=C_{0}\subsetneq C_{1}\subsetneq\cdots\subsetneq C_{k}=V\right)$ such that each $C_{i}$ is locally-dense.

To prove the proposition, we need the following technical lemma.

Lemma 6.4.

Let $C_{1},\ldots,C_{k}$ be the optimal solution, and assume some of the subgraphs are not locally-dense. Then there is $C_{i}$ that is not locally-dense along with the violating sets $X$ and $Y$ such that $Y\subseteq C_{i+1}$ and $X\cap C_{i-1}=\emptyset$ .

Proof.

Let $C_{i}$ be a set that is not locally-dense, and let $X$ and $Y$ be the violating sets. Next we argue that we can safely assume that $Y\subseteq C_{i+1}$ and $X\cap C_{i-1}=\emptyset$ . We will split the argument in two cases: Case ( $i$ ): $Y\nsubseteq C_{i+1}$ and Case ( $ii$ ): $Y\subseteq C_{i+1}$ .

Assume Case ( $i$ ). If $\mathit{d}\mathopen{}\left(C_{i+1}\setminus C_{i},C_{i}\right)\geq\mathit{d}\mathopen{}\left(X,C_{i}\setminus X\right)$ , then redefine $Y$ as $C_{i+1}\setminus C_{i}$ . In such case, $X$ and $Y$ are still violating the local density but now we can use Case ( $ii$ ). Assume that $\mathit{d}\mathopen{}\left(C_{i+1}\setminus C_{i},C_{i}\right)<\mathit{d}\mathopen{}\left(X,C_{i}\setminus X\right)$ . Define $Y_{1}=Y\cap C_{i+1}$ and $Y_{2}=Y\setminus Y_{1}$ . Note that $Y_{2}\neq\emptyset$ . Assume that $\mathit{d}\mathopen{}\left(Y_{2},C_{i}\cup Y_{1}\right)\geq\mathit{d}\mathopen{}\left(Y,C_{i}\right)$ . Then

[TABLE]

Redefine $Y$ as $Y_{2}$ , $X$ as $C_{i+1}\setminus C_{i}$ , and increase $i$ by 1. The previous arguments show that new $Y$ and $X$ violate the local density of $C_{i+1}$ , so we repeat our argument with either Case ( $i$ ) or Case ( $ii$ ).

Assume now that $\mathit{d}\mathopen{}\left(Y_{2},C_{i}\cup Y_{1}\right)<\mathit{d}\mathopen{}\left(Y,C_{i}\right)$ . This forces $Y_{1}\neq\emptyset$ . Since $\mathit{d}\mathopen{}\left(Y,C_{i}\right)$ is a weighted average of $\mathit{d}\mathopen{}\left(Y_{2},C_{i}\cup Y_{1}\right)$ and $\mathit{d}\mathopen{}\left(Y_{1},C_{i}\right)$ , we have $\mathit{d}\mathopen{}\left(Y,C_{i}\right)\leq\mathit{d}\mathopen{}\left(Y_{1},C_{i}\right)$ . Redefine $Y$ as $Y_{1}$ , and apply Case ( $ii$ ).

Assume Case ( $ii$ ). Write $X_{1}=X\cap C_{i-1}$ and $X_{2}=X\setminus X_{1}$ . If $X_{1}=\emptyset$ , then we are done; assume otherwise. If $\mathit{d}\mathopen{}\left(C_{i}\setminus C_{i-1},C_{i-1}\right)\leq\mathit{d}\mathopen{}\left(Y,C_{i}\right)$ , then we can replace $X$ with $C_{i}\setminus C_{i-1}$ to complete the argument. Assume that $\mathit{d}\mathopen{}\left(C_{i}\setminus C_{i-1},C_{i-1}\right)>\mathit{d}\mathopen{}\left(Y,C_{i}\right)$ .

Assume $X_{2}\neq\emptyset$ . If $\mathit{d}\mathopen{}\left(X_{2},C_{i}\setminus X_{2}\right)\leq\mathit{d}\mathopen{}\left(X,C_{i}\setminus X\right)$ , then we can replace $X$ with $X_{2}$ to complete the argument. Assume $\mathit{d}\mathopen{}\left(X_{2},C_{i}\setminus X_{2}\right)>\mathit{d}\mathopen{}\left(X,C_{i}\setminus X\right)$ . Note that $\mathit{d}\mathopen{}\left(X,C_{i}\setminus X\right)$ is a weighted average of $\mathit{d}\mathopen{}\left(X_{2},C_{i}\setminus X_{2}\right)$ and $\mathit{d}\mathopen{}\left(X_{1},C_{i}\setminus X\right)$ . This implies that $\mathit{d}\mathopen{}\left(X_{1},C_{i}\setminus X\right)<\mathit{d}\mathopen{}\left(X,C_{i}\setminus X\right)$ .

On the other hand, if $X_{2}=\emptyset$ , then $X_{1}=X$ , and $\mathit{d}\mathopen{}\left(X_{1},C_{i}\setminus X\right)=\mathit{d}\mathopen{}\left(X,C_{i}\setminus X\right)$ .

Combining everything gives us

[TABLE]

Redefine $X$ as $X_{1}$ , $Y$ as $C_{i}\setminus C_{i-1}$ and decrease $i$ by one, and repeat Case ( $ii$ ).

Note that we first do at most $k$ repetitions of Case ($i$), and then at most $k$ repetitions of Case ($ii$). After a finite number of repetitions we end up with $C_{i}$ that satisfies the conditions. This completes the proof. ∎

Proof of Proposition 6.3.

Both geometric and exponential distributions can be written as $p(x;\lambda)=\exp(-\lambda x-Z)$ , where $Z$ is the normalization constant (depending on $\lambda$ ).

Write $B_{i}=C_{i}\setminus C_{i-1}$ . We can write the optimization function as

[TABLE]

where $Z_{i}$ is the normalization constant for the parameter $\lambda_{i}$ .

Assume that $C_{i}$ is not locally-dense, that is, there is $X$ and $Y$ that violate the local density. Lemma 6.4 states that we can safely assume that $Y\subseteq C_{i+1}$ and $X\cap C_{i-1}=\emptyset$ . This allows us to either remove $X$ from $C_{i}$ or add $Y$ to $C_{i}$ without changing the other sets.

The cost of the $i$th and the $(i+1)$th segments is equal to

[TABLE]

Let us define $W=B_{i}\cup B_{i+1}$ . Due to the equality

[TABLE]

the cost can be rewritten as

[TABLE]

by setting $I=W$ , $J=B_{i}$ and $A=C_{i-1}$ . We would like to vary $B_{i}$ while keeping the remaining variables constant; let us define

[TABLE]

Note that the last two terms do not depend on $U$ . Due to optimality of $C_{i}$ , we have $g(B_{i})\leq g(B_{i}\cup Y)$ , or

[TABLE]

where the last equality is due to Eq. 1. We can rewrite the inequality as

[TABLE]

where the last inequality follows from the fact that $X$ and $Y$ violate the local density of $C_{i}$ , and since $\lambda_{i+1}\geq\lambda_{i}$ . We can rewrite the left-hand side and the right-hand side as

[TABLE]

or $g(B_{i}\setminus X)\leq g(B_{i})$ .

We have shown that if there is $C_{i}$ that is not locally-dense, we can delete some vertices from $C_{i}$ without sacrificing the quality. We continue this until all $C_{i}$ are locally-dense; the process must end because at each step we reduce the size of some $C_{i}$ . ∎

The proposition gives us the means to compute the optimal segmentation. First we discover the locally-dense decomposition, say, $\mathcal{L}$. If the number of subgraphs is less than or equal to $k$, we are done. Otherwise, we group subgraphs until we reach $k$. The optimal grouping can be done with a dynamic program. Write $o[i,j]$ to be the cost of a partial $j$-segmentation using only $L_{0},\ldots,L_{i}$. We have the identity

[TABLE]

and $\lambda$ is the optimal parameter for modeling $L_{i}\setminus L_{\ell}$ . This identity allows us to compute $o[n,k]$ recursively with a dynamic program. Note that the monotonicity of the segmentation—that is, the inner subgraphs should be more dense—is automatically guaranteed. We will refer to this algorithm as $\textsc{Segment}(\mathcal{L},k)$ .
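The dynamic program can be sketched generically. The code below assumes the usual segmentation recurrence, in which the best $j$-segmentation ending at $L_{i}$ extends the best $(j-1)$-segmentation ending at some earlier $L_{\ell}$ by one merged group; the callback `cost(l, i)` is a hypothetical stand-in for $c[\ell,i]$, and the function name `segment` is likewise illustrative.

```python
import math

def segment(num_sets, k, cost):
    """Group L_1..L_num_sets into k consecutive segments of minimum total cost.

    cost(l, i) must return the cost of merging L_{l+1}, ..., L_i into a single
    segment. Returns (total cost, list of segment end indices). Assumes
    1 <= k <= num_sets.
    """
    INF = math.inf
    o = [[INF] * (k + 1) for _ in range(num_sets + 1)]
    back = [[None] * (k + 1) for _ in range(num_sets + 1)]
    o[0][0] = 0.0
    for j in range(1, k + 1):
        for i in range(j, num_sets + 1):
            for l in range(j - 1, i):          # l is where the last segment starts
                cand = o[l][j - 1] + cost(l, i)
                if cand < o[i][j]:
                    o[i][j], back[i][j] = cand, l

    # Walk the back-pointers to recover the segment boundaries.
    ends, i = [], num_sets
    for j in range(k, 0, -1):
        ends.append(i)
        i = back[i][j]
    return o[num_sets][k], ends[::-1]
```

The triple loop evaluates the cost callback a quadratic number of times for each of the $k$ levels, which is why a constant-time evaluation of $c[\ell,i]$ matters for the overall running time.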

Computing $c[\ell,i]$ can be done in constant time. To see this, let $r={\left|L_{i}\setminus L_{\ell}\right|}$ be the number of nodes in $L_{i}\setminus L_{\ell}$ . Let also

$$q=\sum_{v\in L_{i}\setminus L_{\ell}}\mathit{\operatorname{adg}}\mathopen{}\left(v\right)$$

be the sum of all adjusted degrees in $L_{i}\setminus L_{\ell}$ . Note $r$ and $q$ can be maintained in constant time. Then the corresponding costs for the geometric and exponential distributions are

[TABLE]
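Closed forms of this kind follow from a short maximum-likelihood calculation. The derivation below is a sketch under stated assumptions: the geometric distribution is taken over the non-negative integers, the parameter is fitted without the monotonicity constraint, and $q>0$. The resulting expressions may be arranged differently from the ones used in the paper. With $p(x;\lambda)=\exp(-\lambda x-Z(\lambda))$, the negative log-likelihood of $r$ nodes whose adjusted degrees sum to $q$ is $\lambda q+rZ(\lambda)$.

For the geometric distribution, $Z(\lambda)=-\log\left(1-e^{-\lambda}\right)$, and setting the derivative to zero gives

$$q-r\,\frac{e^{-\lambda}}{1-e^{-\lambda}}=0\quad\Longrightarrow\quad e^{-\lambda}=\frac{q}{q+r},$$

$$\lambda q+rZ(\lambda)=-q\log\frac{q}{q+r}-r\log\frac{r}{q+r}.$$

For the exponential distribution, $Z(\lambda)=-\log\lambda$, and

$$q-\frac{r}{\lambda}=0\quad\Longrightarrow\quad\lambda=\frac{r}{q},\qquad\lambda q+rZ(\lambda)=r+r\log\frac{q}{r}.$$

Both costs depend on the segment only through $r$ and $q$, which is what allows the constant-time evaluation.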

Let us consider computational complexity. Discovering locally-dense decomposition can be done in $\mathit{\mathcal{O}}\mathopen{}\left(n^{2}m\right)$ time, whereas the actual segmentation can be done in $\mathit{\mathcal{O}}\mathopen{}\left(\ell^{2}k\right)\subseteq\mathit{\mathcal{O}}\mathopen{}\left(n^{2}k\right)$ time, where $\ell$ is the number of subgraphs in locally-dense decomposition. In practice, $\ell\ll n$ so the segmentation step is relatively cheap. However, if $\ell$ is large, it is possible to achieve $(1+\epsilon)$ approximation for the segmentation in linear time (Guha et al., 2006; Tatti, 2019).

6.3. Approximation algorithm

As pointed out above, the bottleneck of the exact algorithm is the locally-dense decomposition step. For large graphs we can significantly speed up this step by using the faster algorithm GreedyLD. The next proposition shows that this yields a 2-approximation guarantee if we use the geometric distribution.

Proposition 6.5.

Let $p$ be the geometric distribution. Let $\mathcal{C}$ be the optimal segmentation, and let $\mathcal{O}=\textsc{Segment}(\textsc{GreedyLD}(G),k)$ be the optimal segmentation using the sets obtained from GreedyLD. Then $\mathit{cost}\mathopen{}\left(\mathcal{O}\right)\leq 2\mathit{cost}\mathopen{}\left(\mathcal{C}\right)$ .

Before proving the result, we need to introduce some notation. The geometric distribution can be written as

[TABLE]

where $Z(\lambda)\geq 0$ is the normalization constant.

To prove the result let us enumerate the vertices, that is, $V=\left\{v_{i}\right\}_{i=1}^{n}$ , and assume that this order respects the optimal segmentation $\left\{C_{i}\right\}$ , that is, $v_{i}\in C_{j}$ implies that $v_{i-1}\in C_{j}$ . Let $\left\{\lambda_{i}\right\}$ be the optimal parameters for $\left\{C_{i}\right\}$ . We write $\eta_{i}$ to be the parameter $\lambda_{j}$ that is used to model $\mathit{\operatorname{adg}}\mathopen{}\left(v_{i};C_{j}\right)$ , where $C_{j}$ is the smallest subgraph containing $v_{i}$ . Write $Z=\sum Z(\eta_{i})$ to be the sum of normalization constants. Note that $Z\geq 0$ . Given a sequence $X=x_{1},\ldots,x_{n}$ , we define

[TABLE]

Define $A$ with $a_{i}=\mathit{\operatorname{adg}}\mathopen{}\left(v_{i}\right)$ . Note that $f(A)=\mathit{cost}\mathopen{}\left(\mathcal{C}\right)$ .

Define an order for vertex indices $\left(o_{i}\right)_{i=1}^{n}$ , vertices with high degree first, that is, $\mathit{\operatorname{deg}}\mathopen{}\left(v_{o_{i}}\right)\geq\mathit{\operatorname{deg}}\mathopen{}\left(v_{o_{i+1}}\right)$ . Define a sequence $T$ with $t_{i}=\mathit{\operatorname{deg}}\mathopen{}\left(v_{o_{i}}\right)$ .

Lemma 6.6.

$f(T)\leq\mathit{cost}\mathopen{}\left(\mathcal{C}\right).$

Proof.

Define $T^{\prime}$ as $t^{\prime}_{i}=\mathit{\operatorname{deg}}\mathopen{}\left(v_{i}\right)$ . We argue first that $f(T^{\prime})\leq\mathit{cost}\mathopen{}\left(\mathcal{C}\right)=f(A)$ . We can rewrite

[TABLE]

Since $\eta_{i}\leq\eta_{i+1}$ , we have $f(T^{\prime})\leq f(A)$ . To prove $f(T)\leq f(T^{\prime})$ , note that

[TABLE]

That is, let $\left(q_{i}\right)$ be any vertex order, and let $X$ be the degree sequence $x_{i}=\mathit{\operatorname{deg}}\mathopen{}\left(v_{q_{i}}\right)$ . Then sorting the vertices with bubble sort from $\left(q_{i}\right)$ to $\left(o_{i}\right)$ will not increase the sum in $f$ at any step. Consequently, $f(T)\leq f(X)$ . Since this holds for any order, $f(T)\leq f(T^{\prime})$ , which proves the lemma. ∎

Let $\left(g_{i}\right)_{i=1}^{n}$ be the reverse order of indices in which GreedyLD removes the vertices, and let $s_{i}$ be the degree of $v_{g_{i}}$ during its removal.

Lemma 6.7.

$s_{i}\leq t_{i}$ .

Proof.

Consider two sets $P=\left\{v_{g_{1}},\ldots,v_{g_{i-1}}\right\}$ and $Q=\left\{v_{o_{1}},\ldots,v_{o_{i-1}}\right\}$ . Assume that $P\neq Q$ when treated as sets, that is, there are indices $j$ and $\ell$ with $j<i\leq\ell$ such that $g_{j}=o_{\ell}$ . Let $h$ be the degree of $v_{g_{j}}$ when deleting $v_{g_{i}}$ . Since GreedyLD deletes the vertex with the smallest degree, $s_{i}\leq h$ . Consequently, $s_{i}\leq h\leq\mathit{\operatorname{deg}}\mathopen{}\left(v_{g_{j}}\right)=t_{\ell}\leq t_{i}$ .

Assume the opposite case: $P=Q$ . By the pigeonhole principle, there is $\ell\geq i$ such that $o_{\ell}=g_{i}$ . Thus, $s_{i}\leq\mathit{\operatorname{deg}}\mathopen{}\left(v_{g_{i}}\right)=t_{\ell}\leq t_{i}$ . ∎
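The inequality of Lemma 6.7 can be checked numerically on a small graph. The sketch below is an illustration only: it peels minimum-degree vertices in the manner the proof attributes to GreedyLD, using a simple quadratic-time selection rather than a linear-time bucket structure, and the helper names, the tie-breaking rule, and the toy graph are assumptions.

```python
from typing import Dict, List, Set, Tuple

def greedy_peel(adj: Dict[int, Set[int]]) -> List[Tuple[int, int]]:
    """Repeatedly delete a minimum-degree vertex; return (vertex, degree at removal) pairs."""
    deg = {v: len(ns) for v, ns in adj.items()}
    alive = set(adj)
    removed = []
    while alive:
        v = min(alive, key=lambda u: (deg[u], u))  # ties broken by vertex id (an assumption)
        removed.append((v, deg[v]))
        alive.remove(v)
        for u in adj[v]:
            if u in alive:
                deg[u] -= 1
    return removed

def check_lemma(adj: Dict[int, Set[int]]):
    """Return (s, t, holds): s in reverse removal order, t the degrees sorted descending."""
    s = [d for _, d in reversed(greedy_peel(adj))]
    t = sorted((len(ns) for ns in adj.values()), reverse=True)
    return s, t, all(si <= ti for si, ti in zip(s, t))

# Toy graph: a triangle 0-1-2 with a pendant vertex 3 attached to 2.
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}}
s, t, holds = check_lemma(adj)
```

Here `s` plays the role of the sequence $s_{i}$ and `t` the role of $t_{i}$, and `holds` reports whether $s_{i}\leq t_{i}$ for every index.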

Proof of Proposition 6.5.

Define $B$ as $b_{i}=\mathit{\operatorname{adg}}\mathopen{}\left(v_{g_{i}}\right)$ . Note that $\mathit{\operatorname{adg}}\mathopen{}\left(v_{g_{i}}\right)\leq 2s_{i}$ . Thus,

[TABLE]

Consider a segmentation $\mathcal{G}$ respecting the order $g_{i}$ and having the same sizes as $\mathcal{C}$ , ${\left|C_{i}\right|}={\left|G_{i}\right|}$ . The value $f(B)$ corresponds to the log-likelihood of $\mathcal{G}$ and the parameters $\lambda_{1},\ldots,\lambda_{k}$ , and $\mathit{cost}\mathopen{}\left(\mathcal{G}\right)$ corresponds to the log-likelihood of $\mathcal{G}$ and the optimized parameters. Thus, $\mathit{cost}\mathopen{}\left(\mathcal{G}\right)\leq f(B)$ .

We have shown that there is a segmentation respecting the order chosen by GreedyLD that is at most $2\mathit{cost}\mathopen{}\left(\mathcal{C}\right)$ . Thus, the optimal segmentation respecting the order is also at most $2\mathit{cost}\mathopen{}\left(\mathcal{C}\right)$ . The argument in the proof of Proposition 6.3 can be now used to show that we can safely assume that the segmentation uses sets returned by GreedyLD. ∎

We can show a similar result for the exponential distribution as long as the original graph does not have any singletons.

Proposition 6.8.

Let $p$ be the exponential distribution. Assume that $G$ has no singletons. Let $\mathcal{C}$ be the optimal segmentation, and let $\mathcal{O}=\textsc{Segment}(\textsc{GreedyLD}(G),k)$ be the optimal segmentation using the sets obtained from GreedyLD. Then $\mathit{cost}\mathopen{}\left(\mathcal{O}\right)\leq 2\mathit{cost}\mathopen{}\left(\mathcal{C}\right)$ .

Proof.

Similarly to the geometric distribution, the exponential distribution can be written as

[TABLE]

Let $Z$ be as defined in the proof of Proposition 6.5, that is, it is the total sum of the normalization constants. To prove the result we only need to show that $Z\geq 0$ , and we can use the proof of Proposition 6.5. Note that $Z(\lambda)=-\log\lambda$ , and the optimal $\lambda$ for a segment $C_{i}$ is $1/[2\mathit{d}\mathopen{}\left(C_{i}\setminus C_{i-1},C_{i-1}\right)]$ . This leads to

[TABLE]

To prove the result we will show that $\mathit{d}\mathopen{}\left(C_{i}\setminus C_{i-1},C_{i-1}\right)\geq 1/2$ . It is enough to prove the case $i=k$ as due to Proposition 6.3 the densities are monotonic.

Let $X$ be any subset of vertices. As there are no singletons, $\mathit{\operatorname{deg}}\mathopen{}\left(v\right)\geq 1$ . This leads to

[TABLE]

Set $X=C_{k}\setminus C_{k-1}$ to complete the proof. ∎

We should point out that these results also work if the graph has weights on the edges. However, in such a case, Proposition 6.8 requires weights to be larger than or equal to 1.

7. Related work

This paper is an extension of previously published work (Tatti and Gionis, 2015), and in this extension we introduce the segmentation problem, where we constrain the number of subgraphs. Danisch et al. (2017) introduced an alternative iterative technique for computing locally-dense decomposition that scales well in practice.

Our paper is related to previous work on discovering dense subgraphs, clique-like structures, and hierarchical communities. We review some representative work on these topics.

Clique relaxations. The densest possible subgraph is a clique. Unfortunately, finding large cliques is computationally intractable (Håstad, 1996). Additionally, the notion of clique does not provide a robust definition for practical situations, as a few absent edges may completely destroy the clique. To address these issues, researchers have come up with relaxed clique definitions. One relaxation, the $k$ -plex, was suggested by Seidman and Foster (2010). In a $k$ -plex a vertex can have at most $k-1$ absent edges. Unfortunately, discovering maximal $k$ -plexes is also an NP-hard problem (Balasundaram et al., 2011). An alternative relaxation for a clique is the one of an $n$ -clique, a maximal subgraph where each vertex is connected to every other vertex with a path, possibly outside of the subgraph, of length at most $n$ (Bron and Kerbosch, 1973). So, according to this definition a clique is a $1$ -clique. As maximal $n$ -cliques may produce sparse graphs, the concept of $n$ -clans was also proposed by limiting the diameter of the subgraph to be at most $n$ (Mokken, 1979). Since a $1$ -clan corresponds to a maximal clique, discovering $n$ -clans is a computationally intractable problem.

Quasi-cliques. For the definition of graph density we have chosen to work with $\mathit{d}\mathopen{}\left(X\right)$ , the average degree of the subgraph induced by $X$ . While this is a popular density definition, there are other alternatives. One such alternative would be to divide the number of edges present in the subgraph by the total number of possible edges, that is, divide by ${n\choose 2}$ . This would give us a normalized density score that is between $0$ and $1$ . Subgraphs that maximize this density definition are called quasi-cliques, and algorithms for enumerating all quasi-cliques, which can be exponentially many, have been proposed by Abello et al. (2002) and Uno (2010). However, the definition of quasi-cliques is problematic. Note that a single edge already provides maximal density. Consequently, additional objectives are needed. One natural objective is to maximize the size of a graph with density of $1$ ; however, this makes the problem equivalent to finding a maximal clique which, as mentioned above, is a computationally intractable problem (Håstad, 1996).

Alternative definitions for density. Other definitions of graph density have been proposed. Recently, Tsourakakis proposed to measure density by counting triangles, instead of counting edges (Tsourakakis, 2015). Interestingly enough, it is possible to find an approximate densest subgraph under this definition. An interesting future direction for our work is to study if the decomposition proposed in this paper can be extended for the triangle-density definition. Density definitions of the form $g({\left|E\right|})-\alpha h({\left|V\right|})$ , where $g$ and $h$ are some increasing functions, were studied by Tsourakakis et al. (2013), with specific focus on $h(x)={x\choose 2}$ . It is not known whether the densest-subgraph problem according to this definition is polynomial-time solvable or NP-hard. Finally, a variant of $\mathit{d}\mathopen{}\left(X\right)$ adapted for directed graphs, along with a polynomial-time discovery algorithm, was suggested by Khuller and Saha (2009). Such a definition could serve for defining decompositions of directed graphs, which is also left for future work.

Hierarchical communities. A classic technique for modelling the hierarchical nature of communities is a hierarchical blockmodel (Clauset et al., 2008). Here we are given a tree, where the leaves are the vertices of the original graph and each vertex in the tree is given a probability. We then model an edge $(u,v)$ with the probability given to the lowest common ancestor of $u$ and $v$ . Tatti and Gionis (2013) studied a restricted version of this problem where the tree yields a nested structure, with inner communities being denser. Unfortunately, no exact polynomial-time algorithm is known for the restricted or general problem. On the other hand, in the segmentation problem we based the model on degrees and not individual edges. This allowed us to solve the problem exactly.

8. Experimental evaluation

We will now present our experimental evaluation. We tested the two proposed algorithms, ExactLD and GreedyLD, for decomposing a graph into locally-dense subgraphs, and we contrast the resulting decompositions against $k$ -cores, obtained with the Core algorithm. We compare the three algorithms in terms of running time, decomposition size (number of subgraphs they provide), and relative density of the subgraphs they return. We also use the Kendall- $\tau$ statistic to measure how similar the decompositions are in terms of the order they induce on the graph vertices.

8.1. Experimental setup

We performed our evaluation on 13 graphs of different sizes and densities. A short description of the graphs is given below, and their basic characteristics can be found in Table 1.

• dolphins: an undirected social network of frequent associations between dolphins in a community living off Doubtful Sound in New Zealand.
• karate: the social network of friendships between members of a karate club at a US university in the 1970s.
• lesmis: co-appearance of characters in the novel Les Misérables by Victor Hugo.
• astro: a co-authorship network among arXiv Astro Physics publications.
• enron: an e-mail communication network by Enron employees.
• fb1912: an ego-network obtained from Facebook.
• hepph: a co-authorship network among arXiv High Energy Physics publications.
• dblp: a co-authorship network among computer science researchers.
• gowalla: a friendship network of gowalla.com.
• roadnet: a road network of California, where vertices represent intersections and edges represent road segments.
• skitter: an internet topology graph, obtained from traceroutes run daily in 2005.
• airports: US flight traffic in January 2016 (http://www.transtats.bts.gov/), where vertices represent airports and weighted edges represent flight routes. The weights represent the number of flights between two airports.
• trains: UK train routes (http://data.atoc.org/). The vertices represent medium or large exchange points (stations), while the weighted edges represent scheduled routes. The weights represent the number of routes in a single week.

The first three datasets were obtained from the UCIrvine Network Data Repository (http://networkdata.ics.uci.edu/index.php), and the remaining datasets, except for airports and trains, were obtained from the Stanford SNAP Repository (http://snap.stanford.edu/data).

We applied Core, GreedyLD, and ExactLD to every dataset. We used a computer equipped with 3GHz Intel Core i7 and 8GB of RAM. The implementation is available at https://version.helsinki.fi/dacs

8.2. Results

We begin by reporting the running times of the three algorithms for all of our datasets. They are shown in Table 1. As expected, the linear-time algorithms Core and GreedyLD are both very fast; the largest graph with 11 million edges and 1.7 million vertices is processed in 21 seconds. However, we are also able to run the exact decomposition for all the graphs in reasonable time, despite its running-time complexity of $\mathit{\mathcal{O}}\mathopen{}\left(n^{2}m\right)$ . It takes less than 2 minutes for ExactLD to process the largest graph. There are three reasons that contribute to achieving this performance. First, we need to compute the minimum cut only $\mathit{\mathcal{O}}\mathopen{}\left(k\right)$ times, where $k$ is the number of locally-dense graphs. In practice, $k$ is much smaller than the number of vertices. Second, computing minimum cut in practice is faster than the theoretical $\mathit{\mathcal{O}}\mathopen{}\left(nm\right)$ bound. Third, as described in Section 4, most of the minimum cuts are computed using subgraphs. While in theory these subgraphs can be as large as the original graph, in practice these subgraphs are significantly smaller.

Next, we compare how well Core and GreedyLD approximate the exact locally-dense decomposition. In order to do that we compute the ratio

[TABLE]

where $\mathcal{B}$ is the locally-dense decomposition and $\mathcal{C}$ is obtained from either GreedyLD or Core. These ratios are shown in Table 2. We also compare $\mathit{p}\mathopen{}\left(1;\mathcal{C}\right)/\mathit{p}\mathopen{}\left(1;\mathcal{B}\right)$ , that is, the ratio of density for the innermost subgraph in $\mathcal{C}$ against the density of $\mathcal{B}_{1}$ , the densest subgraph. Propositions 4.7 and 5.1 guarantee that these ratios are at least $1/2$ . In practice, the ratios are larger, typically over $0.8$ . In most cases, but not always, GreedyLD obtains better ratios than Core. When comparing the ratio for the innermost subgraph, GreedyLD, by design, will always be better than or equal to Core. We see that Core is able to find the same subgraph as GreedyLD in only three datasets.

Let us now compare the different solutions found by the three algorithms. In Table 3 we report the sizes of discovered communities and their Kendall- $\tau$ statistics, which compare the ordering of the vertices induced by the decompositions. In particular, the Kendall- $\tau$ statistic is computed by assigning each vertex an index based on which subgraph the vertex belongs to. To handle ties, we use the $b$ -version of Kendall- $\tau$ , as given by Agresti (2010). If the statistic is 1, the decompositions are equal.
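For readers unfamiliar with the tie-corrected statistic, a minimal quadratic-time sketch of Kendall- $\tau_{b}$ on two label vectors is given below. It uses the standard formula $(C-D)/\sqrt{(n_{0}-n_{1})(n_{0}-n_{2})}$, where $C$ and $D$ count concordant and discordant pairs, $n_{0}$ is the number of vertex pairs, and $n_{1}$, $n_{2}$ count pairs tied in each vector. The function name and the example labels are hypothetical; this is not the code used in the experiments.

```python
import math
from itertools import combinations
from typing import Sequence

def kendall_tau_b(x: Sequence[int], y: Sequence[int]) -> float:
    """Kendall tau-b between two equally long label vectors (e.g. subgraph indices)."""
    concordant = discordant = ties_x = ties_y = 0
    for (a1, b1), (a2, b2) in combinations(zip(x, y), 2):
        dx, dy = a1 - a2, b1 - b2
        if dx == 0:
            ties_x += 1
        if dy == 0:
            ties_y += 1
        if dx * dy > 0:
            concordant += 1
        elif dx * dy < 0:
            discordant += 1
    n0 = len(x) * (len(x) - 1) // 2
    denom = math.sqrt((n0 - ties_x) * (n0 - ties_y))
    if denom == 0:
        raise ValueError("tau-b is undefined when one vector is constant")
    return (concordant - discordant) / denom
```

Here each position holds the index of the subgraph to which a vertex is assigned under one decomposition, so two decompositions that assign identical indices yield a value of 1.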

Our first observation is that typically the locally-dense decomposition algorithms return more subgraphs than the $k$ -core decomposition. As an extreme example, roadnet contains only 3 $k$ -cores while GreedyLD finds 43 subgraphs and ExactLD finds 2710. This can be explained by the fact that the vertices in the graph have low degrees, which results in a very coarse $k$ -core decomposition. On the other hand, ExactLD and GreedyLD exploit density to discover more fine-grained decompositions. This result is similar to what we presented in Example 1.1 in the introduction.

The Kendall- $\tau$ statistics are typically close to $1$ , especially for large datasets, suggesting that all 3 methods result in similar decompositions. The statistic between Core and GreedyLD is typically larger than the statistic between either method and the exact solution. This is expected since Core and GreedyLD use the exact same order for vertices; the only difference between these two methods is how they partition the vertex order. In addition, decompositions produced by GreedyLD are closer to the exact solution than the decompositions produced by Core, which is also a natural result.

Let us now compare the solutions in terms of profile functions as defined in Definition 4.6. We illustrate several prototypical examples of such profile functions in Figure 2. We see that GreedyLD produces similar profiles as the exact locally-dense decomposition. We also see that Core does not respect the local density constraint. In fb1912, astro, and hepph there exist $k$ -shells that are denser than their inner shells, that is, joining these shells would increase the density of the inner shell. GreedyLD does not have this problem since by definition it will have a monotonically decreasing profile.

In Figure 3 we present the decompositions obtained by the three algorithms for the lesmis graph. We see that GreedyLD obtains a result very similar to the exact solution; the only differences are that the second and third subgraphs are merged and that the $7$ th subgraph (in ExactLD) lends vertices to the 8th last subgraph. While GreedyLD has the same first subgraph as the exact solution, which is the densest subgraph, Core breaks this subgraph into 3 subgraphs. Interestingly enough, the protagonist of the book, Jean Valjean, is not placed into the first shell by Core.

Next, we present our results on segmentation. First we computed the cost of the optimal segmentation as a function of the number of segments $k$ . Here, we used the exponential distribution as the underlying model. The normalized scores are shown in the left plot of Figure 4. The scores behave similarly for all datasets: they improve quickly at the very beginning (for $k=1,\ldots,10$ ), after which they settle to a relatively stable value. This value depends on the dataset.

Next, we study how well we can approximate the segmentation by using GreedyLD instead of the exact solution. The results are shown in the right plot of Figure 4. Here, we plot the relative difference between the approximate solution and the optimal solution. Ideally, the difference should be 0, and Proposition 6.8 states that it is at most 1. We see that in practice the estimates are really close to each other: all differences are within $0.006$ . The approximation is better for smaller $k$ . This is a natural result as there is less room for disagreement in coarser segmentations.

Finally, let us look at the segmentations obtained from the trains and airports data. Our goal is to discover which locations, that is, train stations or airports, are central. Here, by centrality we mean that a central location is well-connected with other central locations. To quantify this notion we use locally-dense subgraphs. Note that the number of locally-dense subgraphs is relatively large in these graphs; this is due to the fact that the graphs are weighted. We were interested in grouping the locations into 4 categories. So, to reduce the size of the decomposition, we solved the segmentation problem with $k=4$ and the exponential model. The results are shown in Figures 5–7.

The discovered trains segmentation shows that the densest segment occurs in the vicinity of London, as expected. There is also a strong concentration of the second densest segment around the Manchester/Liverpool area, while the stations in Scotland, apart from the capital Edinburgh, are in outer segments. For airports, we see that the inner segments consist of large well-connected airports, such as JFK, DFW, ATL, or ORD, while the smaller, regional airports are assigned to the outer segments.

9. Conclusions

Inspired by $k$ -core analysis and density-based graph mining, we propose density-friendly graph decomposition, a new tool for analyzing graphs. Like $k$ -core decomposition, our approach decomposes a given graph into a nested sequence of subgraphs. These subgraphs have the property that the inner subgraphs are always denser than the outer ones; additionally, the innermost subgraph is the densest one. These are properties that the $k$ -cores do not satisfy.

We provide two efficient algorithms to discover such a decomposition. The first algorithm is based on minimum cut and it extends the exact algorithm of Goldberg for the densest-subgraph problem. The second algorithm extends a linear-time algorithm by Charikar for approximating the same problem. The second algorithm runs in linear time, and thus, in addition to finding subgraphs that better respect the density structure of the graph, it is as efficient as the $k$ -core decomposition algorithm.

In addition to offering a new alternative for decomposing a graph into dense subgraphs, we significantly extend the analysis, the understanding, and the applicability of previous well-known graph algorithms: Goldberg’s exact algorithm and Charikar’s approximation algorithm for finding the densest subgraph, as well as the $k$ -core decomposition algorithm itself.

Finally, we considered a constrained version of the problem, where we restrict the number of subgraphs. We do this by designing a model based on segmentation. The likelihood of this model is then optimized, and we show that we can do this either exactly or approximately, in an efficient manner, within a factor of 2.

Bibliography

Abello et al. (2002) James Abello, Mauricio G.C. Resende, and Sandra Sudarsky. 2002. Massive Quasi-Clique Detection. In LATIN 2002: Theoretical Informatics. 598–612.
Agresti (2010) Alan Agresti. 2010. Analysis of Ordinal Categorical Data (2nd ed.). John Wiley & Sons.
Alvarez-Hamelin et al. (2005) J. Ignacio Alvarez-Hamelin, Luca Dall’Asta, Alain Barrat, and Alessandro Vespignani. 2005. $k$ -core decomposition: a tool for the visualization of large scale networks. CoRR abs/cs/0504107 (2005).
Asahiro et al. (1996) Yuichi Asahiro, Kazuo Iwama, Hisao Tamaki, and Takeshi Tokuyama. 1996. Greedily finding a dense subgraph. Scandinavian Workshop on Algorithm Theory (SWAT) (1996), 136–148.
Ayer et al. (1955) M. Ayer, H. Brunk, G. Ewing, and W. Reid. 1955. An empirical distribution function for sampling with incomplete information. The Annals of Mathematical Statistics 26, 4 (1955), 641–647.
Bader and Hogue (2003) Gary Bader and Christopher Hogue. 2003. An automated method for finding molecular complexes in large protein interaction networks. BMC Bioinformatics 4, 1 (2003).
Balasundaram et al. (2011) Balabhaskar Balasundaram, Sergiy Butenko, and Illya V. Hicks. 2011. Clique Relaxations in Social Network Analysis: The Maximum $k$ -Plex Problem. Operations Research 59, 1 (2011), 133–142.