Direction Matters: On Influence-Preserving Graph Summarization and Max-cut Principle for Directed Graphs

Wenkai Xu; Gang Niu; Aapo Hyvärinen; Masashi Sugiyama

arXiv:1907.09588·stat.ML·July 24, 2019

TL;DR

This paper introduces a novel graph summarization method for directed graphs that preserves edge directionality and minimizes reconstruction error, enabling more effective analysis of large-scale directed networks.

Contribution

The paper proposes a new model based on Max-Cut principles and non-negative constraints for directed graph summarization, with a multiplicative update algorithm and theoretical guarantees.

Findings

01 The method accurately preserves directed edge information.

02 It demonstrates robustness across various experiments.

03 The approach effectively captures group-level features.

Abstract

Summarizing large-scale directed graphs into small-scale representations is a useful but less studied problem setting. Conventional clustering approaches, which are based on "Min-Cut"-style criteria, compress both the vertices and edges of the graph into communities, which leads to a loss of directed edge information. On the other hand, compressing the vertices while preserving the directed edge information provides a way to learn the small-scale representation of a directed graph. The reconstruction error, which measures the edge information preserved by the summarized graph, can be used to learn such a representation. Compared to the original graphs, the summarized graphs are easier to analyze and are capable of extracting group-level features, which are useful for efficient interventions on population behavior. In this paper, we present a model, based on minimizing reconstruction error…

Tables

Table 1: Term Comparison between Original Graph and Summarized Graph

original graph: $G$	vertices: $x_{i} \in V$	edges: $e_{i j} \in E$
summarized graph: $H$	compressed nodes: $c_{I} \in C$	compressed relations: $r_{I J} \in R$

Equations

$$L_{0}(G,\phi_{V},\phi_{E})=\sum_{I,J}\sum_{x_{i}\in c_{I},\,x_{j}\in c_{J}}\ell(e_{ij},r_{IJ}),$$

$$L_{1}(A,U,R)=\sum_{I,J}\sum_{i,j}(A_{ij}-r_{IJ})^{2}u_{iI}u_{jJ}.$$

$$L_{2}(A,U,R)=\sum_{I,J}\frac{1}{|C_{I}||C_{J}|}\sum_{i,j}(A_{ij}-r_{IJ})^{2}u_{iI}u_{jJ}$$

$$L_{2}(A,U,R)=\|A-URU^{\top}\|_{F}^{2},\quad\text{s.t.}\quad U^{\top}U=I_{k};\;R_{IJ}R_{JI}=0,\forall I,J.$$

$$L_{3}(A,U,R)=\|A-URU^{\top}\|_{F}^{2}\quad\text{s.t.}\quad U^{\top}U=I_{k};\;R_{IJ}R_{JI}=0,\forall I,J,$$

$$L_{4}(T;U,S)=\|T-USU^{\top}\|_{F}^{2}\quad\text{s.t.}\quad U^{\top}U=I_{k},$$

$$L_{5}(T;U,S)=\|T-USU^{\top}\|_{F}^{2}\quad\text{s.t.}\quad U\geq 0,\;U^{\top}U=I_{k}.$$

$$L_{6}(T;U,S,\Lambda)=\|T-USU^{\top}\|^{2}+\mathrm{tr}(\Lambda(U^{\top}U-I))\quad\text{s.t.}\quad U\geq 0,$$

$$\min_{U\geq 0}L_{6}(T;U,S,\Lambda)=\min_{U\geq 0}\mathrm{tr}(-2U^{\top}T^{\top}US+US^{\top}U^{\top}USU^{\top}+U\Lambda U^{\top})$$

$$Z(U,U^{\prime})=\mathrm{tr}(-2Q_{+}U^{\top}-UP_{-}U^{\top})+\sum_{ij}\left(\frac{[U^{\prime}(P_{+}+\Lambda)]_{ij}U_{ij}^{2}}{U_{ij}^{\prime}}+2[Q_{-}]_{ij}\frac{U_{ij}^{2}+U_{ij}^{\prime 2}}{2U_{ij}^{\prime}}\right)$$

$$\min_{U\geq 0}L_{7}(T;U,S,\Lambda)=\min_{U\geq 0}\mathrm{tr}(-2U^{\top}Q_{+}-U^{\top}UP_{-}+U^{\top}U(U^{\top}Q_{+}+P_{-}-U^{\top}Q_{-})+2U^{\top}Q_{-})$$

$$Z(U,U^{\prime})=\mathrm{tr}(-2Q_{+}U^{\top}-U^{\top}UP_{-})+\sum_{ij}\frac{[U^{\prime}(U^{\prime\top}Q_{+}+P_{-})]_{ij}U_{ij}^{2}+[Q_{-}U^{\prime}U^{\prime\top}]U_{ij}^{\prime 2}}{U_{ij}^{\prime}}$$

$$\mathrm{LHS}=\sum_{I,J}\sum_{i,j}A_{ij}^{2}u_{iI}u_{jJ}-2A_{ij}r_{IJ}u_{iI}u_{jJ}+r_{IJ}^{2}u_{iI}u_{jJ}$$

$$\mathrm{RHS}=\mathrm{tr}(-2A^{\top}URU^{\top}+R^{\top}R),$$

$$L(z)=\max\{s:sz\leq Tz\}=\min_{1\leq i\leq n,\,z_{i}>0}\frac{(Tz)_{i}}{z_{i}}$$

$$\frac{d}{d\lambda}\det(\lambda I-T)=\sum_{i}\det(\lambda I-T(i))$$

$$\|T-USU^{\top}\|^{2}=\|A-URU^{\top}-(A-URU^{\top})^{\top}\|^{2}=2\|A-URU^{\top}\|+0$$

$$-2\sum_{I^{\prime},J^{\prime}}r_{I^{\prime}J^{\prime}}u_{:I^{\prime}}^{\top}\Big(\sum_{I,J}\bar{A}_{IJ}\Big)u_{:J^{\prime}}+\sum_{I^{\prime},J^{\prime}}r_{I^{\prime}J^{\prime}}^{2}$$

$$\sum_{I^{\prime},J^{\prime}}\Big(u_{:I^{\prime}}^{\top}\Big(\sum_{I,J}\bar{A}_{IJ}\Big)u_{:J^{\prime}}\Big)^{2}\leq\sum_{IJ}\lambda_{IJ}^{2}$$

$$\sum_{i,k,l,p}A_{ik}S_{kl}^{\prime}B_{lp}S_{ip}^{\prime}(a_{ip}^{2}-a_{ip}a_{kl})=\sum_{i,k,l,p}\frac{1}{2}A_{ik}S_{kl}^{\prime}B_{lp}S_{ip}^{\prime}(a_{ip}^{2}+a_{kl}^{2}-2a_{ip}a_{kl})\geq 0$$

$$\sum_{i,p}\frac{(BS^{\prime\top})_{ip}S_{ip}^{2}}{S_{ip}^{\prime}}-\mathrm{tr}(SBS^{\top})=\sum_{i,k,l,p}B_{ik}S_{pk}^{\prime}S_{ip}^{\prime}(a_{ip}^{2}-a_{ip}a_{kp})=\sum_{i,k,l,p}B_{ik}S_{pk}^{\prime}S_{ip}^{\prime}(a_{ip}^{2}+a_{kp}^{2}-2a_{ip}a_{kp})\geq 0$$

$$L_{6}(T;U,S,\Lambda)=\mathrm{tr}(-2U^{\top}Q_{+}-U^{\top}UP_{-}+U^{\top}U(P_{+}+\Lambda)+2U^{\top}Q_{-})$$

$$\frac{\partial Z(U,U^{\prime})}{\partial U_{ij}}=2\left([-Q_{+}-U^{\prime}P_{-}]_{ij}+\frac{[U^{\prime}(P_{+}+\Lambda)+Q_{-}]_{ij}U_{ij}}{U_{ij}^{\prime}}\right)=0$$

$$U_{ij}=U_{ij}^{\prime}\frac{[Q_{+}+U^{\prime}P_{-}]_{ij}}{[U^{\prime}[P_{+}+\Lambda]+Q_{-}]_{ij}}.$$

$$\mathrm{tr}(-2U^{\top}Q_{+}-U^{\top}UP_{-}+U^{\top}U(U^{\top}Q_{+}+P_{-}-U^{\top}Q_{-})+2U^{\prime\top}U^{\prime}U^{\top}Q_{-})$$

$$\mathrm{tr}(U^{\top}U(U^{\top}Q_{+}+P_{-}-U^{\top}Q_{-}))\leq\sum_{ij}\frac{[U^{\prime}(U^{\prime\top}Q_{+}+P_{-}-U^{\prime\top}Q_{-})]U_{ij}^{2}}{U_{ij}^{\prime}}$$

$$\mathrm{tr}(U^{\top}Q_{-}U^{\prime\top}U^{\prime})\leq\sum_{ij}\frac{U_{ij}^{2}+U_{ij}^{\prime 2}}{2U_{ij}^{\prime}}(U^{\prime}U^{\prime\top}Q_{-})_{ij}$$

$$2[-Q_{+}-UP_{-}+UP_{+}+Q_{-}+U\Lambda]_{ij}U_{ij}=0.$$

$$\Lambda=U^{\top}Q-P=U^{\top}Q_{+}+P_{-}-U^{\top}Q_{-}-P_{+}$$

$$2\left([-Q_{+}-U^{\prime}P_{-}]_{ij}+\frac{[U^{\prime}(P_{-}+U^{\prime\top}Q_{-})]_{ij}U_{ij}}{U_{ij}^{\prime}}\right)=0$$

Taxonomy

Topics: Complex Network Analysis Techniques · Advanced Graph Neural Networks · Bioinformatics and Genomic Networks

Full text

Direction Matters: On Influence-Preserving Graph Summarization and Max-cut Principle for Directed Graphs

Wenkai Xu (Contact at [email protected])

Gatsby Unit of Computational Neuroscience

Gang Niu

RIKEN AIP

Aapo Hyvärinen

INRIA-Saclay

University of Helsinki

Masashi Sugiyama

RIKEN AIP

The University of Tokyo

Abstract

Summarizing large-scale directed graphs into small-scale representations is a useful but less studied problem setting. Conventional clustering approaches, which are based on “Min-Cut”-style criteria, compress both the vertices and edges of the graph into communities, which leads to a loss of directed edge information. On the other hand, compressing the vertices while preserving the directed edge information provides a way to learn the small-scale representation of a directed graph. The reconstruction error, which measures the edge information preserved by the summarized graph, can be used to learn such a representation. Compared to the original graphs, the summarized graphs are easier to analyze and are capable of extracting group-level features, which are useful for efficient interventions on population behavior. In this paper, we present a model, based on minimizing reconstruction error with non-negative constraints, which relates to a “Max-Cut” criterion that simultaneously identifies the compressed nodes and the directed compressed relations between these nodes. A multiplicative update algorithm with column-wise normalization is proposed. We further provide theoretical results on the identifiability of the model and on the convergence of the proposed algorithms. Experiments are conducted to demonstrate the accuracy and robustness of the proposed method.

1 Introduction

In directed graphs, it is important to understand the influence between vertices, which is represented by the directed edges. Investigating the influence structure in graphs has become an evolving research field that attracts wide attention from scientific communities including social sciences (Tang et al., 2009; Li et al., 2018a; Mehmood et al., 2013), economics (Spirtes, 2005; Jackson, 2011), ecological sciences (Pavlopoulos et al., 2011; Delmas et al., 2019) and more. In large-scale, densely connected directed graphs, an efficient way to compress vertices and summarize the directed influence between them is not only useful for visualizing complicated networks but also crucial for extracting group-level features for further analysis such as profiling or intervention.

Conventional graph clustering methods group the densely connected vertices into the same community on undirected graphs (Fortunato, 2010; Schaeffer, 2007; Shi and Malik, 2000). Directed graph clustering is commonly based on symmetrized undirected graphs (Malliaros and Vazirgiannis, 2013). However, the recovered communities do not preserve much of the edge information since the communities themselves are sparsely connected. Hence, effective reconstruction of the original graph from the summarized graph is a meaningful task that enjoys applications in graph compression (Dhabu et al., 2013; Dhulipala et al., 2016), graph sampling (Orbanz, 2017; Leskovec and Faloutsos, 2006) and so on.

For example, in a large-scale social network, individual-level connections are hard to analyze and contain a fair amount of noise. It is complicated to directly extract group-level features and interpret the influence structure of the graph. For instance, Key Opinion Leaders (KOL) (Valente and Pumpuang, 2007; Nisbet and Kotcher, 2009) with common features may also share a similar influence structure. Such information is important for understanding opinion diffusion within the network, as well as for implementing interventions for various purposes such as marketing (Chaney, 2001) or pooling (Zhou et al., 2009; Thomson et al., 1998). Moreover, extracting these features from the KOL within a group may also enable us to analyze the fairness of a certain process and perform de-biasing actions when necessary.

Previous works have considered related problems in undirected graph settings (Shahaf et al., 2013; Navlakha et al., 2008), which aim to define compressed nodes by preserving particular structures. The graph compression literature (Maneth and Peternek, 2015; Fan et al., 2012; Dhulipala et al., 2016) is also related, although its goal is to minimize the storage space, irrespective of whether feature patterns of the graph are preserved. In addition, another line of related work, under the theme of influence maximization (Li et al., 2018b), studies the directed influence of a set of vertices on the rest of the network.

In our setting, we would like to extract sets of vertices, each becoming a compressed node, such that the influence between vertices is maximally preserved by the directed summarized graph. Previous works such as flow-based graph summarization (Shi et al., 2016) or graph de-densification (Maccioni and Abadi, 2016) addressed a similar problem based on directed influence. Though these works deal with directed graphs, the directions of summarized nodes are defined from different domains so that the algorithms essentially apply to symmetrized undirected graphs. In this work, we present a novel criterion that is applicable to directed graphs, exploiting the asymmetric information of the directed edges and preserving the influence as much as possible.

In directed networks, summarization is harder because both the edge weights and the edge directions have to be summarized. To effectively summarize directed graphs, we focus on the reconstruction error from the summarized graph to the original graph. Directed graph summarization is also more useful than the undirected case. For instance, with well-defined directed causal edges, the summarization can help to approximate causal information between compressed nodes. Conventional clustering or dimensionality reduction methods based on “Max-Flow, Min-Cut”-style criteria compress vertices without attempting to preserve the edge information. These methods are unable to perform such summarization, as illustrated in Figure 1, since their objective is to minimize the connections between compressed nodes, which results in a large reconstruction error and thus an undesired grouping. Our proposed objective is closely related to, but essentially different from, such a scheme: we try to maximize the “Cut” in order to preserve the directed edge information. Various discrete optimization schemes such as the Dulmage-Mendelsohn decomposition (Dulmage and Mendelsohn, 1958) can also find a good summarization in the noiseless case, but they are less accurate and harder to implement when the noise level is high. In contrast, our proposed model not only works well in the noiseless case but is also more robust in the presence of noise.

This paper is organized as follows. In Section 2, we introduce notations and the problem setting. In Section 3, we present our learning objective and propose the Structured Non-negative Matrix Factorization (StNMF) algorithm to solve the problem. In Section 4, we provide theoretical results for reconstruction error, identifiability, and convergence of the algorithm. In Section 5, we experimentally demonstrate the usefulness of the proposed method and conclude in Section 6.

2 Preliminaries and Problem Formulation

In this work, we focus on simple directed graphs, which exclude self-loops and multiple edges. We use “graph” to refer to a directed graph when there is no ambiguity. A positive value in the adjacency matrix represents an out-edge. We may use a negative value to represent an in-edge. The inhibition type of directed relation, where an out-edge has a negative influence, is out of the scope of this paper. We continue by defining some preliminary concepts.

2.1 Notations and Definitions

Denote a directed graph with node set $V$ and directed edge set $E$ by $G=(V,E)$. Denote a summarized directed graph with compressed node set $C$ and directed relation set $R$ by $H=(C,R)$. In this work, both $G$ and $H$ are simple. The corresponding terms in the two graphs are distinguished in Table 1. A node-compression is a function $\phi_{V}:V\to C$ that assigns a vertex $x_{i}\in V$ to a compressed node $c_{I}\in C$. In this work, $\phi_{V}$ is surjective. (Footnote 2: We do not require that every vertex belongs to a compressed node, as opposed to the graph partition problem.) An edge-compression is a function $\phi_{E}:E\to R$. We say an edge-compression $\phi_{E}$ is induced from a node-compression $\phi_{V}$ if $\phi_{V}(x_{i})=\phi_{V}(x_{i}^{\prime})$, $\phi_{V}(x_{j})=\phi_{V}(x_{j}^{\prime})$ implies $\phi_{E}(e_{ij})=\phi_{E}(e_{i^{\prime}j^{\prime}})$, $\forall i,j,i^{\prime},j^{\prime}$, i.e., vertices assigned to the same compressed node admit the same compressed relation. Hence, we can write $\phi_{E}(e_{ij})=r_{IJ}$, $\forall\phi_{V}(x_{i})=c_{I},\phi_{V}(x_{j})=c_{J}$. In this work, we only consider the edge-compression induced from the node-compression.

A graph summarization, based on the compressions $\phi_{V}$ and $\phi_{E}$, refers to the map $\Phi$ from the original directed graph $G$ to the summarized graph $H$, such that $\Phi(G,\phi_{V},\phi_{E})=H$ (illustrated in Figure 2). A graph summarization of size $k$, $\Phi_{k}$, is a constrained mapping where $|C|=k\leq|V|$. In practice, we would like $k\ll|V|$. When $k=|V|$, the summarization is trivial as the original graph always gives zero reconstruction error.

Denote by $A\in\mathbb{R}^{n\times n}$ the asymmetric adjacency matrix of a directed graph, such that $A_{ij}=1$ if there is a directed edge from $x_{i}$ to $x_{j}$, and $A_{ij}=0$ otherwise. Denote by $T\in\mathbb{R}^{n\times n}$ the skew-symmetric adjacency matrix of a directed graph, where $T_{ij}=1$ and $T_{ji}=-1$ if there is a directed edge from $x_{i}$ to $x_{j}$, and $T_{ij}=0$ if there is no edge between $x_{i}$ and $x_{j}$. We say a directed graph is connected if its undirected skeleton is connected.
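As a concrete illustration of the two matrices just defined, the following minimal sketch builds $A$ and $T$ from an edge list. The function name `adjacency_matrices`, the toy graph, and the decision to reject reciprocal edges are assumptions made for this example, not part of the paper.

```python
import numpy as np

def adjacency_matrices(n, edges):
    """Build the asymmetric matrix A and the skew-symmetric matrix T.

    `edges` is an iterable of (i, j) index pairs, each meaning x_i -> x_j.
    """
    A = np.zeros((n, n))
    for i, j in edges:
        if i == j:
            raise ValueError("self-loops are excluded in a simple graph")
        if A[j, i] != 0:
            # Assumption of this sketch: reciprocal edges are rejected,
            # because T cannot encode both directions for one pair.
            raise ValueError("reciprocal edges are not representable in T")
        A[i, j] = 1.0
    T = A - A.T  # T[i, j] = 1 and T[j, i] = -1 when x_i -> x_j
    return A, T

# Hypothetical toy graph: vertices 0 and 1 both point to vertices 2 and 3.
A, T = adjacency_matrices(4, [(0, 2), (0, 3), (1, 2), (1, 3)])
```

Under these assumptions $T=A-A^{\top}$, so $T$ carries the same information as $A$ while making the reverse direction of every edge explicit.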

2.2 Influence Preserving Criteria

Consider the performance measure of our graph summarization problem. The quality of a summarization can be measured by how much the directed edge information can be recovered from the summarized graph, via the reconstruction error:

$$L_{0}(G,\phi_{V},\phi_{E})=\sum_{I,J}\sum_{x_{i}\in c_{I},\,x_{j}\in c_{J}}\ell(e_{ij},r_{IJ}),\tag{1}$$

where $\ell$ is some non-negative loss measure. We use the term influence to describe the information in directed edges. By choosing different loss $\ell$ , the reconstruction error describes different types of influence-preserving criteria. We say a graph has an exact Influence Preserving Structure (IPS) if the relevant reconstruction error $L_{0}=0$ .

Choosing $\ell_{d}(e_{ij},r_{IJ})=1-\mathbb{1}_{\mathrm{sign}(e_{ij})=\mathrm{sign}(r_{IJ})},\forall i\neq j$, we describe the reconstruction by recovering the directed edge direction. (Footnote 3: An absent edge does not have the same sign as a directed edge, i.e., $\mathrm{sign}(0)\neq\mathrm{sign}(p),\forall p\neq 0$.) We say a graph summarization has an exact Directions Influence Preserving Structure (D-IPS) if there exists a summarization such that the reconstruction error based on $\ell_{d}$ is $0$.

In a weighted graph, we may preserve not only the directional information of edges but also the weight information. Hence, we may choose the square loss between edges and compressed relations: $\ell_{w}(e_{ij},r_{IJ})=(e_{ij}-r_{IJ})^{2},\forall i\neq j$. We say a graph summarization has an exact Weights Influence Preserving Structure (W-IPS) if there exists a summarization such that the reconstruction error based on $\ell_{w}$ is $0$. D-IPS is a special case of W-IPS. For a uniformly weighted graph, an exact W-IPS is equivalent to an exact D-IPS.

When a graph does not have an exact IPS, which is commonly observed in practice, we would like to simultaneously learn a node-compression, $\phi_{V}$ , and an edge-compression, $\phi_{E}$ , such that the corresponding summarization minimizes the relevant reconstruction error $L_{0}$ .
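The two losses can be evaluated directly for a given assignment. The sketch below sums $\ell_{d}$ or $\ell_{w}$ over ordered pairs of assigned vertices; it assumes that edges are read from a signed matrix such as $T$, that the node-compression is given as an integer label vector (with `-1` for an unassigned vertex), and that the names `reconstruction_error`, `T_exact` and `T_flipped` are hypothetical.

```python
import numpy as np

def reconstruction_error(E, labels, R, loss="w"):
    """Sum the chosen loss over ordered pairs i != j of assigned vertices.

    E      : n x n signed edge matrix (here: a skew-symmetric T)
    labels : labels[i] = I if x_i is in c_I, or -1 if x_i is unassigned
    R      : k x k matrix of compressed relations r_IJ
    loss   : "d" for the direction loss, "w" for the squared loss
    """
    n = len(labels)
    total = 0.0
    for i in range(n):
        for j in range(n):
            if i == j or labels[i] < 0 or labels[j] < 0:
                continue  # skip the diagonal and unassigned vertices
            r = R[labels[i], labels[j]]
            if loss == "d":
                # np.sign(0) == 0, so an absent edge never matches a
                # non-zero relation, in line with footnote 3.
                total += float(np.sign(E[i, j]) != np.sign(r))
            else:
                total += float((E[i, j] - r) ** 2)
    return total

labels = np.array([0, 0, 1, 1])
S = np.array([[0.0, 1.0], [-1.0, 0.0]])  # relation c_0 -> c_1

# Toy graph with an exact structure: {0, 1} -> {2, 3}.
T_exact = np.zeros((4, 4))
for i, j in [(0, 2), (0, 3), (1, 2), (1, 3)]:
    T_exact[i, j], T_exact[j, i] = 1.0, -1.0

# Same graph with the edge 1 -> 3 reversed to 3 -> 1.
T_flipped = T_exact.copy()
T_flipped[1, 3], T_flipped[3, 1] = -1.0, 1.0

err_d = reconstruction_error(T_flipped, labels, S, loss="d")
err_w = reconstruction_error(T_flipped, labels, S, loss="w")
```

In this toy case the exact graph has zero error under both losses, while the single reversed edge contributes to both ordered pairs $(1,3)$ and $(3,1)$.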

3 Learning the Influence-Preserving Summarization

In this section, we present the formulation of influence-preserving summarization as a constrained supervised learning objective based on the reconstruction error. Our “labels” can be seen as the compressed relations. We then present the algorithm to solve the constrained optimization problem.

3.1 The Constrained Supervised Learning Objective

We start by defining our learning objective based on the reconstruction loss with $\ell_{w}$ and derive the factorization model as our constrained optimization objective.

The IPS-based Objective

Our objective is to seek a graph summarization of size $k$ , that minimizes the reconstruction error (which corresponds to $\ell_{w}$ for the rest of the paper). Denote node-compression $\phi_{V}$ by assignment matrix $U\in\{0,1\}^{n\times k}$ where the $I^{th}$ column vector $u_{:I}\in\{0,1\}^{n\times 1}$ represents the elements in compressed node $c_{I}$ , i.e., $u_{iI}=\mathbb{1}_{x_{i}\in C_{I}}$ . Denote the edge-compression by relationship matrix $R\in\mathbb{R}^{k\times k}$ where $r_{IJ}$ represents the compressed relation from $c_{I}$ to $c_{J}$ . Since the summarized graph is assumed to be simple, $R$ is an asymmetric adjacency matrix. Given weighted asymmetric adjacency matrix $A$ and a graph summarization represented by $U$ and $R$ , the objective based on loss measure $\ell_{w}(A_{ij},r_{IJ})=(A_{ij}-r_{IJ})^{2}\mathbb{1}_{x_{i}\in C_{I}}\mathbb{1}_{x_{j}\in C_{J}}$ can be written as:

$$L_{1}(A,U,R)=\sum_{I,J}\sum_{i,j}(A_{ij}-r_{IJ})^{2}u_{iI}u_{jJ}.\tag{2}$$

However, without a limit on the number of compressed nodes, the objective in Eq. (2) is minimized by taking $k=|V|$, for which zero reconstruction error can always be achieved. To avoid this, we impose a constraint on the size of the summarized graph so that $k\ll|V|$. Even with such a constraint, this objective may still identify a compressed node containing less relevant elements. To address this problem, we propose a normalized version of the objective in Eq. (2):

$$L_{2}(A,U,R)=\sum_{I,J}\frac{1}{|C_{I}||C_{J}|}\sum_{i,j}(A_{ij}-r_{IJ})^{2}u_{iI}u_{jJ}\tag{3}$$

which corresponds to a normalized loss measure: $\ell_{w}(A_{ij},r_{IJ})=\frac{(A_{ij}-r_{IJ})^{2}\mathbb{1}_{x_{i}\in C_{I}}\mathbb{1}_{x_{j}\in C_{J}}}{|C_{I}||C_{J}|}$. We further assume that the compressed nodes do not overlap, which corresponds to the orthogonality constraints, i.e., $u_{:I}^{\top}u_{:J}=0,\forall I\neq J$.

Lemma 1

The objective in Eq. (3) has the factorization form

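$$L_{2}(A,U,R)=\|A-URU^{\top}\|_{F}^{2},\quad\text{s.t.}\quad U^{\top}U=I_{k};\;R_{IJ}R_{JI}=0,\forall I,J.\tag{4}$$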

The proof proceeds by basic linear algebra, which can be found in Appendix A. Note that $R$ is an asymmetric adjacency matrix representing the compressed relations in the summarized graph. Since the summarized graph is assumed to be simple, $r_{IJ}$ and $r_{JI}$ have at most one non-zero $\forall I\neq J$ and $R_{II}=0,\forall I$ , which imply the constraint $R_{IJ}R_{JI}=0,\forall I,J$ .

Continuous Relaxation

The normalized objective in Eq. (3) is an NP-hard discrete problem, which is similar to the discrete cluster assignment problem. The continuous relaxation proposed in Shi and Malik (2000) and Meilă and Pentney (2007) is one way to approximately solve such a problem. Here, we propose a continuous relaxation for the factorization model in Eq. (3):

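$$L_{3}(A,U,R)=\|A-URU^{\top}\|_{F}^{2}\quad\text{s.t.}\quad U^{\top}U=I_{k};\;R_{IJ}R_{JI}=0,\forall I,J,\tag{5}$$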

where $U\in\mathbb{R}^{n\times k}$ and $R\in\mathbb{R}^{k\times k}$ .
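To make the factorization form more tangible, the sketch below uses a standard least-squares fact rather than anything specific to the paper: for a fixed $U$ with orthonormal columns, and ignoring the constraint on $R$, the Frobenius objective is minimized by $R=U^{\top}AU$. With column-normalized indicator vectors, $URU^{\top}$ then replaces each block of $A$ by its mean. The helper name `normalized_indicator` and the toy labels are assumptions, and every compressed node is assumed non-empty.

```python
import numpy as np

def normalized_indicator(labels, k):
    """Indicator matrix with unit-norm columns, so that U.T @ U = I_k.

    Assumes every compressed node 0..k-1 contains at least one vertex.
    """
    n = len(labels)
    U = np.zeros((n, k))
    U[np.arange(n), labels] = 1.0
    return U / np.linalg.norm(U, axis=0)

labels = np.array([0, 0, 1, 1])
U = normalized_indicator(labels, 2)

# Toy adjacency matrix: every vertex of c_0 points to every vertex of c_1.
A = np.zeros((4, 4))
A[:2, 2:] = 1.0

R = U.T @ A @ U      # unconstrained least-squares optimum for this fixed U
A_hat = U @ R @ U.T  # each block of A replaced by its mean
```

Here $R_{01}=2$ (the block sum $4$ divided by $\sqrt{|C_{0}||C_{1}|}=2$) and $R_{10}=0$, so the constraint $R_{IJ}R_{JI}=0$ happens to hold without being enforced; for noisy graphs that need not be the case.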

Due to the constraint on $R$, it is not easy to solve such a constrained objective, as we do not assume any structure on the summarized graph. This issue can be alleviated by modeling the structure via the skew-symmetric adjacency matrix $T$. The corresponding factorization becomes:

$$L_{4}(T;U,S)=\|T-USU^{\top}\|_{F}^{2}\quad\text{s.t.}\quad U^{\top}U=I_{k},\tag{6}$$

where $U\in\mathbb{R}^{n\times k}$ and $S\in\mathbb{R}^{k\times k}$ is skew-symmetric. Eq. (6) exploits the skew-symmetric structure and is easier to solve. We show in Theorem 2 below that the objectives in Eq. (5) and Eq. (6) admit the same solution, up to permutation, in the exact D-IPS case. Although the asymmetric matrix $A$ is useful for deriving the identifiability result, the formulation with the skew-symmetric matrix $T$ is easier to solve and more robust in noisy cases, as the model explicitly penalizes reversely directed noise edges. In the rest of the paper, we will use the skew-symmetric matrix $T$ to represent the graphs.
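The link between the two factorizations can be checked numerically in the exact case: if $A=URU^{\top}$, then $T=A-A^{\top}=U(R-R^{\top})U^{\top}$, so $S=R-R^{\top}$ reproduces $T$ with the same $U$. The sketch below verifies this algebraic identity on random matrices; it is not a proof of Theorem 2, and the matrix `A` here is a generic real matrix rather than a 0/1 adjacency matrix.

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 6, 3

# Random U with orthonormal columns (reduced QR of a Gaussian matrix).
U, _ = np.linalg.qr(rng.standard_normal((n, k)))

# Strictly upper-triangular R, so that R_IJ * R_JI = 0 for all I, J.
R = np.triu(rng.random((k, k)), 1)

A = U @ R @ U.T  # exact factorization of the asymmetric matrix
T = A - A.T      # skew-symmetric counterpart
S = R - R.T      # skew-symmetric relation matrix

residual = np.linalg.norm(T - U @ S @ U.T)
```

The residual is zero up to floating-point error, which is the sense in which an exact solution for $A$ carries over to $T$.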

Positive Values Identify the Compressed Nodes

With the factorization model in Eq. (6), we further show in Theorem 1 of Section 4.1, that under the exact D-IPS, the factor $U$ is non-negative and positive entries correctly identify the compressed nodes. Hence, we propose a non-negative constrained factorization model for better identification in the presence of noise:

$$L_{5}(T;U,S)=\|T-USU^{\top}\|_{F}^{2}\quad\text{s.t.}\quad U\geq 0,\;U^{\top}U=I_{k}.\tag{7}$$

3.2 Learning Algorithms

With the non-negativity and orthogonality constraints on $U$, the model in Eq. (7) can be written as a regularized version of the orthogonality-constrained non-negative matrix tri-factorization:

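$$L_{6}(T;U,S,\Lambda)=\|T-USU^{\top}\|^{2}+\mathrm{tr}(\Lambda(U^{\top}U-I))\quad\text{s.t.}\quad U\geq 0,\tag{8}$$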

where the regularization parameter $\Lambda$ is a symmetric matrix. It is also related to Semi Non-negative Matrix Factorization since $T$ itself is not a non-negative matrix.

This optimization objective can be solved by gradient methods with projection onto the Stiefel manifold, as discussed in Hirayama et al. (2016) and Edelman et al. (1998). However, the projection-based algorithm is very sensitive to initialization. Instead, we propose a multiplicative update scheme: $U\leftarrow U\odot\frac{[\nabla_{U}L]_{+}}{[\nabla_{U}L]_{-}}$, modified from Ding et al. (2010), Lee and Seung (2001) and Ding et al. (2006). We use $X_{+}$ and $X_{-}$ to denote the positive and negative parts of a matrix $X$, respectively. The modification not only allows the imposition of the specific skew-symmetric structure of $S$ and the orthogonality constraint but also gives more stable results. This leads to our proposed Structured Non-negative Matrix Factorization (StNMF) in Algorithm 1.
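Algorithm 1 itself is not reproduced in this text, so the following is only a sketch of a single multiplicative step for $U$ with $S$ and $\Lambda$ held fixed. It follows the closed-form rule $U_{ij}=U^{\prime}_{ij}[Q_{+}+U^{\prime}P_{-}]_{ij}/[U^{\prime}(P_{+}+\Lambda)+Q_{-}]_{ij}$ from the equation list, with $Q=T^{\top}US$ and $P=S^{\top}U^{\top}US$. The part convention $X_{+}=(|X|+X)/2$, $X_{-}=(|X|-X)/2$, the small constant `eps`, the entrywise non-negative $\Lambda$, and all function names are assumptions of this example; the update of $S$ is omitted.

```python
import numpy as np

def pos(X):
    """Positive part of X: keeps positive entries, zero elsewhere."""
    return (np.abs(X) + X) / 2.0

def neg(X):
    """Negative part of X, stored as non-negative magnitudes."""
    return (np.abs(X) - X) / 2.0

def update_U(T, U, S, Lam, eps=1e-12):
    """One multiplicative update of U for fixed S and fixed Lam (sketch)."""
    Q = T.T @ U @ S            # n x k
    P = S.T @ U.T @ U @ S      # k x k
    numer = pos(Q) + U @ neg(P)
    denom = U @ (pos(P) + Lam) + neg(Q) + eps  # eps avoids division by zero
    return U * numer / denom

# Toy skew-symmetric graph: {0, 1} -> {2, 3}.
T = np.zeros((4, 4))
T[:2, 2:] = 1.0
T[2:, :2] = -1.0

S = np.array([[0.0, 1.0], [-1.0, 0.0]])
Lam = 0.1 * np.ones((2, 2))    # hypothetical fixed regularizer

rng = np.random.default_rng(0)
U0 = rng.random((4, 2))
U0[0, 1] = 0.0                 # an entry that starts at zero
U1 = update_U(T, U0, S, Lam)
```

Because the step multiplies each entry by a non-negative ratio, $U$ stays non-negative and entries that start at zero remain zero, so in practice one would avoid initializing entries of $U$ at exactly zero unless that is intended.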

The non-negative matrix $U$ can be effectively initialized via non-negative SVD (Boutsidis and Gallopoulos, 2008). $S$ can be initialized by any $k\times k$ skew-symmetric matrix. For instance, when $k=2$, we set the initial $S=\begin{pmatrix}0&1\\ -1&0\end{pmatrix}$. The algorithm exploits the zero-locking property of the multiplicative update scheme, so that the desired structure of $S$ is preserved throughout the updates with such an initialization. In the fixed regularization scheme, different choices of $\Lambda$ result in different locally optimal solutions, and the choice depends on the user's preference. For instance, if the user would like to have a strictly non-overlapping compressed node set, the magnitude of the off-diagonal terms may be set large to emphasize orthogonality; if the user is interested in the weight assignment between compressed nodes, the diagonal terms may be set relatively larger to enforce unit-length vectors. However, a detailed discussion of this topic is out of the scope of this paper.
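One simple way to produce a skew-symmetric initial $S$ for any $k$ is sketched below. The particular pattern (ones above the diagonal, minus ones below) and the name `init_skew` are assumptions chosen so that the $k=2$ case matches the matrix given above; the text allows any skew-symmetric matrix.

```python
import numpy as np

def init_skew(k):
    """Skew-symmetric k x k matrix: +1 above the diagonal, -1 below."""
    upper = np.triu(np.ones((k, k)), 1)
    return upper - upper.T

S0 = init_skew(2)  # equals [[0, 1], [-1, 0]]
S4 = init_skew(4)
```

With this choice only the diagonal starts at zero, which is the part of the structure that zero locking is meant to keep fixed.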

It is important to note that the directed compressed relations can be read off from $S$ , which represents the skew-symmetric (weighted) adjacency matrix of the summarized graph $H=(C,R)$ . Hence, our model is able to simultaneously identify the node-compression and the edge-compression, thus the summarized graph $H$ .

We can also optimize $\Lambda$ using the Karush-Kuhn-Tucker (KKT) complementarity condition (Kuhn and Tucker, 1951) and set $\Lambda=U^{\top}Q-P=U^{\top}Q_{+}+P_{-}-U^{\top}Q_{-}-P_{+}$, where $P=S^{\top}U^{\top}US$ and $Q=T^{\top}US$. The derivation can be found in Section 4.2. In addition, Algorithm 1 does not guarantee a tightly bounded norm of the column vectors in $U$. When the adaptive regularizer is used, the objective is not guaranteed to be monotonically non-increasing along the optimization trajectory. Hence, we impose a column-wise normalization step in Algorithm 2 to alleviate this problem.
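The adaptive choice of $\Lambda$ and the column-wise normalization can be sketched as two small helpers. This is not Algorithm 2, which is not reproduced here; the function names, the part convention $X_{+}=(|X|+X)/2$, $X_{-}=(|X|-X)/2$, and the guard `eps` are assumptions. The example also checks numerically that the two expressions for $\Lambda$ given above agree.

```python
import numpy as np

def pos(X):
    return (np.abs(X) + X) / 2.0

def neg(X):
    return (np.abs(X) - X) / 2.0

def adaptive_lambda(T, U, S):
    """Lambda = U^T Q - P with Q = T^T U S and P = S^T U^T U S."""
    Q = T.T @ U @ S
    P = S.T @ U.T @ U @ S
    return U.T @ Q - P

def normalize_columns(U, eps=1e-12):
    """Rescale every column of U to unit Euclidean norm."""
    norms = np.linalg.norm(U, axis=0)
    return U / np.maximum(norms, eps)  # eps guards against empty columns

# Toy skew-symmetric graph: {0, 1} -> {2, 3}.
T = np.zeros((4, 4))
T[:2, 2:] = 1.0
T[2:, :2] = -1.0
S = np.array([[0.0, 1.0], [-1.0, 0.0]])

rng = np.random.default_rng(1)
U = normalize_columns(rng.random((4, 2)))

Q = T.T @ U @ S
P = S.T @ U.T @ U @ S
Lam = adaptive_lambda(T, U, S)
Lam_split = U.T @ pos(Q) + neg(P) - U.T @ neg(Q) - pos(P)
```

The split form is simply $Q=Q_{+}-Q_{-}$ and $P=P_{+}-P_{-}$ substituted into $U^{\top}Q-P$, which is why the two agree.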

4 Theoretical Analysis

In this section, we present theoretical results for identification of non-negative models and analysis of Structured Non-negative Matrix Factorization (StNMF).

4.1 Identifiability Analysis

Theorem 1

(exact D-IPS Identification) Let $A$ be an asymmetric adjacency matrix of a directed graph with the exact D-IPS with $k$ compressed nodes. Assume that each submatrix between compressed nodes has distinct leading singular values with geometric multiplicity one. The optimization problem in Eq. (5) has a unique solution $U\in\mathbb{R}^{n\times k}$ such that $U\geq 0$ and the positive part of each column vector in $U$ identifies a compressed node.

The proof technique extends the $k=2$ case in Theorem 5, which applies the Perron-Frobenius Theorem to a rearranged block matrix. Details can be found in Appendix A. For graphs with more than one connected component to be determined, the most strongly connected component will be identified first and the subsequent components can be identified via the deflation methods discussed in Hyvärinen et al. (2016) and Hirayama et al. (2016), which is out of the scope of this paper.
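The role of the Perron-Frobenius Theorem can be illustrated on a single block. For a strictly positive matrix $B$, the matrices $BB^{\top}$ and $B^{\top}B$ are strictly positive, so their leading eigenvectors, which are the leading singular vectors of $B$, can be chosen entrywise positive and the leading singular value is simple. The sketch below checks this well-known consequence numerically; it is not the paper's proof, and the random block `B` is a hypothetical stand-in for a block between two compressed nodes.

```python
import numpy as np

rng = np.random.default_rng(0)
B = rng.random((3, 4)) + 0.1  # strictly positive block (assumption)

W, sigma, Vt = np.linalg.svd(B)
u, v = W[:, 0], Vt[0, :]      # leading left and right singular vectors

# Singular vectors are defined up to a joint sign; fix it so u sums positive.
if u.sum() < 0:
    u, v = -u, -v
```

After the sign fix both vectors are entrywise positive, which is the mechanism by which positive entries of a factor can mark the members of a compressed node.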

Lemma 2

If a directed graph, with asymmetric adjacency matrix $A$, has the exact IPS, $A$ can be divided into block submatrices according to the compressed nodes, such that: 1) if a block $\tilde{A}_{IJ}$ is non-zero, its block-wise transpose $\tilde{A}_{JI}$ is a zero matrix; 2) the diagonal blocks are zero matrices.

Proof 1

By definition of the exact D-IPS, the directions of the edges between two compressed nodes are the same and there are no links within a compressed node. The result follows since the summarized graph is simple.
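The two block conditions of Lemma 2 are easy to test for a given assignment. The following sketch is a necessary-condition check only (it does not verify that a graph has an exact IPS); the function name `satisfies_lemma2_blocks` and the toy matrices are assumptions.

```python
import numpy as np

def satisfies_lemma2_blocks(A, labels, k):
    """Check: diagonal blocks are zero, and A_IJ, A_JI are not both non-zero."""
    for I in range(k):
        rows = labels == I
        if np.any(A[np.ix_(rows, rows)]):
            return False          # an edge inside compressed node I
        for J in range(I + 1, k):
            cols = labels == J
            forward = np.any(A[np.ix_(rows, cols)])
            backward = np.any(A[np.ix_(cols, rows)])
            if forward and backward:
                return False      # edges in both directions between I and J
    return True

labels = np.array([0, 0, 1, 1])

A_exact = np.zeros((4, 4))
A_exact[:2, 2:] = 1.0             # {0, 1} -> {2, 3}

A_reverse = A_exact.copy()
A_reverse[0, 2] = 0.0
A_reverse[2, 0] = 1.0             # one edge reversed: 2 -> 0

A_inner = A_exact.copy()
A_inner[0, 1] = 1.0               # one edge inside c_0
```

Of the three toy matrices only `A_exact` passes; the other two violate condition 1) and condition 2), respectively.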

Theorem 2

(equivalence decomposition of $A$ and $T$) Let $A$ and $T$ be the asymmetric and skew-symmetric adjacency matrices of a directed graph with the exact D-IPS, respectively. The optimal solutions $U$ for $L_{3}$ and $L_{4}$ are the same up to permutation.

Detailed proofs can be found in Appendix A.

Corollary 1

Let $T$ be the skew-symmetric adjacency matrix of a directed graph with the exact D-IPS. Assume that each submatrix between compressed nodes has geometric multiplicity two. The optimization problem with loss $L_{4}$ has a unique solution such that $U\geq 0$ and the positive part of each column vector in $U$ identifies a compressed node.

The corollary follows from combining Theorem 1 and Theorem 2.

4.2 Convergence Analysis

In this section, we analyze the convergence of the fixed regularization scheme in Algorithm 1 and the adaptive regularization scheme with column-wise normalization in Algorithm 2. For fixed $\Lambda$, optimizing the objective in Eq. (8) can be written as

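$$\min_{U\geq 0}L_{6}(T;U,S,\Lambda)=\min_{U\geq 0}\mathrm{tr}(-2U^{\top}T^{\top}US+US^{\top}U^{\top}USU^{\top}+U\Lambda U^{\top})\tag{9}$$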

Lemma 3

Let $Q=T^{\top}US$ and $P=S^{\top}U^{\top}US$ .

[TABLE]

is an auxiliary function of Eq. (9).

The proof is based on pairing symmetric terms and details can be found in Appendix A.

Lemma 4

Choosing $\Lambda=U^{\top}Q_{+}+P_{-}-U^{\top}Q_{-}-P_{+}$ , the objective in Eq. (9) becomes:

[TABLE]

then

[TABLE]

is an auxiliary function of Eq. (10).

Theorem 3

Under the update rule described in Algorithm 1, the objective in Eq. (8) is non-increasing and the iterates converge to a stationary point of the objective.

The proof of Theorem 3 can be found in Appendix A. The proof technique is based on some carefully chosen symmetric matrices, and it applies for fixed $\Lambda$ because $\Lambda$ is then a symmetric matrix. With the adaptively chosen $\Lambda$ in each step, as in Eq. (13), the columns of $U$ do not have a fixed norm and $P_{+}+\Lambda$ is no longer symmetric, so the proof technique for Theorem 3 does not apply. However, with the proposed column-wise normalization scheme, the algorithm is shown to be monotonically non-increasing and convergent to a stationary point.

Theorem 4

The update rule described in Algorithm 2 is non-increasing and convergent.

The proof is by constructing a symmetric matrix based on the unit-normed column vectors and applying the proof techniques in Theorem 3. Details can be found in Appendix A.

5 Experiments and Results

In this section, we apply the graph summarization model to synthetically generated directed graphs and compare it with summarization methods for the undirected case as well as conventional clustering methods such as spectral methods and normalized cut methods. In the synthetic examples, we simulate graphs of different sizes at different noise levels with known compressed nodes. The background noise is the ratio $\gamma_{b}=\frac{\sum_{i,j\notin\mathrm{D-IPS}}|e_{i^{\prime}j^{\prime}}|}{\sum_{i,j\in\mathrm{D-IPS}}|e_{ij}|}$, and the direction noise is the ratio $\gamma_{d}=\frac{\sum_{i,j\in\mathrm{D-IPS},e_{ij}<0}e_{ij}}{\sum_{i,j\in\mathrm{D-IPS},e_{ij}\geq 0}e_{ij}}<0.5$.

We compare the following algorithms: Fix-StNMF is the fixed regularization scheme in Algorithm 1, with $\Lambda$ chosen as a scalar times the all-one matrix; Adaptive-StNMF is the adaptive scheme described in Algorithm 2; Undirected is the graph summarization scheme using the undirected skeleton, similar to Hirayama et al. (2016); WNCut is the weighted normalized cut scheme (Meilă and Pentney, 2007) for directed graphs; Spectral is the clustering method using the normalized Laplacian (Shi and Malik, 2000).

From the results, we see that at low noise levels, Adaptive-StNMF correctly finds the compressed node assignment, as the theory predicts. When the noise level is higher, it still performs best among the competitors. The low accuracies of the “clustering methods” are expected, as they do not maximize the desired objective. Moreover, Fix-StNMF performs worse than the adaptive version because we deliberately chose $\Lambda$ to be an all-one matrix, for which the algorithm does not necessarily converge to the most useful local optimum. This shows that the learning accuracy is also sensitive to the choice of regularization parameters.

6 Conclusion and Future Work

We propose a new problem setting for summarizing directed graphs. Our key contribution is a novel learning criterion that preserves the directed edge information of the original graph. The criterion is related to the reconstruction error from the summarized graph to the original graph. We propose a non-negative algorithm to learn such a graph summarization. We provide theoretical analysis of identifiability and convergence, together with an experimental demonstration of the usefulness of our method.

Appendix A Additional Theorems and Proofs

Proof of Lemma 1

Proof 2

Normalized by the size of the compressed node, each assignment vector has unit length. Expanding each term in Eq. (2), we have

[TABLE]

Summing over the indices, the first term is $\sum_{i,j}A_{ij}^{2}=\mathrm{tr}(A^{\top}A)$, which is a constant independent of $R$ and $U$. Summing over the indices $I,J$, the second term is $\sum_{i,j}A_{ij}(URU^{\top})_{ij}=\mathrm{tr}(A^{\top}URU^{\top})$. For the third term, since $\sum_{i,j}u_{iI}^{2}u_{jJ}^{2}=1$ in the normalized setting, summing over the indices $i,j$ gives $\sum_{I,J}r_{IJ}^{2}=\mathrm{tr}(R^{\top}R)$. Writing

[TABLE]

the result follows. As the summarized graph is simple, the constraint on $R$ in the factorization model is imposed.
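The three-term expansion used in this proof can be sanity-checked numerically. The sketch below assumes the matrix form $\|A-URU^{\top}\|_{F}^{2}$ of the reconstruction error and an assignment matrix $U$ whose columns are unit-length indicators of disjoint compressed nodes, so that $U^{\top}U=I$; the sizes and random weights are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)
sizes = [2, 3, 2]
n, k = sum(sizes), len(sizes)

# Normalized assignment matrix: column I is the indicator of compressed node I,
# scaled to unit length, so that U.T @ U is the identity.
U = np.zeros((n, k))
start = 0
for I, size in enumerate(sizes):
    U[start:start + size, I] = 1.0 / np.sqrt(size)
    start += size

A = rng.uniform(size=(n, n))   # arbitrary weights; the identity needs no block structure
R = rng.uniform(size=(k, k))

lhs = np.linalg.norm(A - U @ R @ U.T, "fro") ** 2
rhs = np.trace(A.T @ A) - 2.0 * np.trace(A.T @ U @ R @ U.T) + np.trace(R.T @ R)
print(np.isclose(lhs, rhs))
```

Only the middle term depends jointly on $U$ and $R$ once the columns are normalized, which is why the minimization can be rephrased in terms of $\mathrm{tr}(A^{\top}URU^{\top})$ and $\mathrm{tr}(R^{\top}R)$.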

Proposition 1

(Perron-Frobenius) Suppose $M\in\mathbb{R}^{n\times n}$ is a non-negative square matrix that is irreducible, then:

1. $M$ has a positive real eigenvalue $\lambda_{\max}$ such that all other eigenvalues of $M$ satisfy $|\lambda|\leq\lambda_{\max}$ (if $M$ is primitive, $|\lambda|<\lambda_{\max}$).

2. $\lambda_{\max}$ has algebraic and geometric multiplicity $1$ and has a positive eigenvector $x>0$ (called the Perron vector).

3. Any non-negative eigenvector is a multiple of $x$.

Proof 3

Since $M$ is an irreducible non-negative square matrix, $\exists k\in\mathbb{N}^{+}$ such that $P=(I+M)^{k}>0$. Indeed, $(I+M)^{k}=\sum_{j=0}^{k}\binom{k}{j}M^{j}$, and by irreducibility and non-negativity, for large enough $k$, the expansion fills in all $n^{2}$ entries with positive numbers. Hence $P$ is primitive. In the remainder of the proof we write $T$ for $M$; since $P$ is a polynomial in $T$, we also have $TP=PT$.

Let $Q$ be the positive orthant and $C$ be the intersection of the surface of the unit sphere and positive orthant. $\forall z\in Q$ , define a function:

[TABLE]

For $\forall r>0$ , we have $L(rz)=L(z)$ by definition, so $L(z)$ depends only on the ray along $z$ .

We write $\leq$ sign between vectors, $v\leq w$ to imply $v_{i}\leq w_{i},\forall i$ . Similar definition applies for $<$ . For $v\leq w$ and $v\neq w$ , we have $Pv<Pw$ , since $P(w-v)\geq 0$ and $P(w-v)\neq 0$ .

If for scalar $s$ , $sz\leq Tz$ , then $Psz\leq PTz=TPz$ , which implies $s(Pz)\leq T(Pz)$ . Thus, $L(Pz)\geq L(z)$ .

If $L(z)z\neq Tz$, then $L(z)Pz<TPz$. This implies $L(z)<L(Pz)$ unless $z$ is an eigenvector ($Tz=L(z)z$). Hence, a positive $z$ is an eigenvector when $L(z)$ is maximised.

Consider the image of $C$ under $P$. It is compact as it is the image of a compact set under a continuous map. All of the elements of $P(C)$ have all their components strictly positive, as $P>0$. Hence $L$ is continuous on $P(C)$. Thus $L$ achieves a maximum value on $P(C)$. Since $L(z)\leq L(Pz)$, this is, in fact, the maximum value of $L$ on all of $Q$, which implies the existence of a maximum eigenvalue. Since $L(Pz)>L(z)$ unless $z$ is an eigenvector of $T$, $L_{max}$ is achieved at an eigenvector of $T$, call it $x$, with $x>0$ and $L_{max}$ as the eigenvalue. Since $Tx>0$ and $Tx=L_{max}x$, we have $L_{max}>0$.

Let $y$ be any other eigenvector of $T$ with eigenvalue $\lambda$, so that $\lambda y_{i}=\sum_{j}T_{ij}y_{j}$. As $T\geq 0$, the triangle inequality gives $|\lambda y_{i}|\leq\sum_{j}T_{ij}|y_{j}|$, thus we write $|\lambda||y|\leq T|y|$. Hence $|\lambda|\leq L(|y|)\leq L_{max}$ by definition of $L$, and writing $\lambda_{max}=L_{max}$, we obtain $|\lambda|\leq\lambda_{max}$. Note that if $\lambda_{max}=0$, then $T$ is nilpotent, contradicting irreducibility. Thus we have $\lambda_{max}>0$.

Consider the rate of change in characteristic polynomial of matrix $T$ :

[TABLE]

where $T(i)$ is matrix $T$ deleting $i^{th}$ row and column. Each of the matrices $\lambda_{max}I-T(i)$ has strictly positive determinant, which shows that the derivative of the characteristic polynomial of $T$ is not zero at $\lambda_{max}$ , and therefore the algebraic multiplicity and hence the geometric multiplicity of $\lambda_{max}$ is one.

If there exists any other nontrivial non-negative eigenvector $y\geq 0$ , such that $y$ is not a multiple of $x$ , since $\lambda_{max}$ has geometric multiplicity $1$ , $y^{\top}x=0$ . However, $x>0$ and $y^{\top}x=0$ implies $y=0$ , a contradiction.
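The role of the function $L$ in the proof above can be illustrated numerically. This is a sketch under an assumption: $L$ is taken to be the Collatz-Wielandt function $L(z)=\max\{s: sz\leq Tz\}=\min_{i}(Tz)_{i}/z_{i}$ for $z>0$, which is consistent with how $L$ is used in the argument. The matrix and the helper name `collatz_wielandt` are hypothetical.

```python
import numpy as np

def collatz_wielandt(T, z):
    """L(z) = max{s : s*z <= T z} = min_i (T z)_i / z_i for strictly positive z."""
    return np.min((T @ z) / z)

# Irreducible non-negative matrix: the cycle 0 -> 1 -> 2 -> 0 plus one extra edge
T = np.array([[0.0, 1.0, 0.0],
              [0.0, 0.0, 1.0],
              [1.0, 1.0, 0.0]])
n = T.shape[0]
P = np.linalg.matrix_power(np.eye(n) + T, n - 1)   # strictly positive, commutes with T

rng = np.random.default_rng(0)
z = rng.uniform(0.1, 1.0, size=n)

eigvals, eigvecs = np.linalg.eig(T)
top = np.argmax(eigvals.real)
lam_max = eigvals[top].real
x = eigvecs[:, top].real
x = x / x.sum()                                    # rescale so the Perron vector is positive

# Applying P never decreases L, and L is bounded above by the leading eigenvalue
print(collatz_wielandt(T, z) <= collatz_wielandt(T, P @ z) <= lam_max + 1e-12)
# The bound is attained at the Perron vector
print(np.isclose(collatz_wielandt(T, x), lam_max))
```

Repeatedly replacing `z` by `P @ z` pushes `collatz_wielandt(T, z)` up towards `lam_max`, which mirrors the step in the proof where the maximum of $L$ is located on $P(C)$.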

Proof of Theorem 2

Proof 4

By Lemma 2, we write $A$ in block form, where the blocks are grouped by compressed node assignment. Using the fact that $\mathrm{tr}(AA)=0=\mathrm{tr}(RR)$ for a simple graph and that $U^{\top}AU$ has the same zero/non-zero positions as $R$ for the exact D-IPS, we have $\mathrm{tr}((A-URU^{\top})(A-URU^{\top}))=\mathrm{tr}(AA-2U^{\top}AUR+RR)=0$ and

[TABLE]

Hence, both objectives are solving the same problem.
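A small numerical check may help to see why the vanishing cross term links the two objectives. The sketch rests on assumptions that are not stated in this section: the skew-symmetric adjacency matrix is taken to be $T=A-A^{\top}$, and the skew-symmetric summary is taken to be $R-R^{\top}$. Under these assumptions, the identity $\|E-E^{\top}\|_{F}^{2}=2\|E\|_{F}^{2}-2\,\mathrm{tr}(EE)$ with $E=A-URU^{\top}$ gives a loss on $T$ that is exactly twice the loss on $A$.

```python
import numpy as np

rng = np.random.default_rng(2)
sizes = [2, 3, 2]
edges = [(0, 1), (1, 2), (0, 2)]   # hypothetical simple summarized graph
off = np.concatenate(([0], np.cumsum(sizes)))
n, k = off[-1], len(sizes)

A = np.zeros((n, n))
U = np.zeros((n, k))
for I, size in enumerate(sizes):
    U[off[I]:off[I + 1], I] = 1.0 / np.sqrt(size)   # unit-length indicator columns
for I, J in edges:
    A[off[I]:off[I + 1], off[J]:off[J + 1]] = rng.uniform(0.5, 1.5, (sizes[I], sizes[J]))

R = U.T @ A @ U            # same zero pattern as the summarized graph
T = A - A.T                # assumed skew-symmetric adjacency matrix
E = A - U @ R @ U.T        # residual of the asymmetric model

cross = np.trace(E @ E)    # vanishes because a non-zero block E_IJ forces E_JI = 0
loss_A = np.linalg.norm(E, "fro") ** 2
loss_T = np.linalg.norm(T - U @ (R - R.T) @ U.T, "fro") ** 2
print(np.isclose(cross, 0.0), np.isclose(loss_T, 2.0 * loss_A))
```

Since the two losses differ only by a constant factor in this setting, they share the same minimizers, which is the sense in which both objectives solve the same problem.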

Theorem 5

(Bipartite Identification) Let $A$ be an asymmetric adjacency matrix of the exact D-IPS of two compressed nodes. The SVD of $A$ has unique leading left and right singular vectors $v,w\geq 0$, and the positive parts of $v,w$ identify the two compressed nodes.

Proof 5

For the exact D-IPS with two compressed nodes, we can always rearrange the vertices such that $A=\begin{pmatrix}0&0\\ \tilde{A}&0\end{pmatrix}$. The matrices $\tilde{A}^{\top}\tilde{A}$ and $\tilde{A}\tilde{A}^{\top}$ represent the "in-out" and "out-in" two-step transitions. As the two compressed nodes are connected, any vertex can reach any other vertex in the same compressed node through a two-step transition. Hence, $\tilde{A}^{\top}\tilde{A}$ and $\tilde{A}\tilde{A}^{\top}$ are both primitive. Using the Perron-Frobenius theorem, each has a unique real positive leading eigenvector. Padded with zeros, the leading eigenvectors of $\tilde{A}^{\top}\tilde{A}$ and $\tilde{A}\tilde{A}^{\top}$ are unique and non-negative, and their non-zero entries correspond to the compressed node assignment.
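The statement of Theorem 5 can be illustrated with a few lines of NumPy. This is a minimal sketch with hypothetical sizes and a strictly positive random block $\tilde{A}$ (a convenient special case in which $\tilde{A}^{\top}\tilde{A}$ and $\tilde{A}\tilde{A}^{\top}$ are certainly primitive); it is not the experimental setup of the paper.

```python
import numpy as np

rng = np.random.default_rng(3)
n1, n2 = 3, 4
A_tilde = rng.uniform(0.5, 1.5, size=(n2, n1))   # strictly positive block
A = np.zeros((n1 + n2, n1 + n2))
A[n1:, :n1] = A_tilde                            # A = [[0, 0], [A_tilde, 0]]

Uf, s, Vt = np.linalg.svd(A)
v, w = Uf[:, 0], Vt[0]                           # leading left / right singular vectors
if v.sum() < 0:                                  # resolve the joint sign ambiguity of the SVD
    v, w = -v, -w

tol = 1e-8
left_support = np.flatnonzero(v > tol)           # vertices picked out by v
right_support = np.flatnonzero(w > tol)          # vertices picked out by w
print(left_support, right_support)
```

The supports of the two vectors are disjoint and together cover all vertices, so thresholding the leading singular vectors at zero recovers the two compressed nodes in this toy case.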

Proof of Theorem 1

Proof 6

The proof of Theorem 1 is based on Proposition 1 and Lemma 2. Re-arrange the indices according to the compressed nodes and denote the block submatrix between $C_{I}$ and $C_{J}$ as $\tilde{A}_{IJ}\in\mathbb{R}^{|C_{I}|\times|C_{J}|}$. Write $\bar{A}_{IJ}\in\mathbb{R}^{n\times n}$ for the zero-padded matrix of $\tilde{A}_{IJ}$. The zero-padded vector for compressed node $C_{I}$, denoted by $u^{I}$, is the vector with non-zero $i^{th}$ entries for $x_{i}\in C_{I}$ and zeros otherwise. Write each column of $U$, $u_{:I^{\prime}}$, as a linear combination of zero-padded vectors: $u_{:I^{\prime}}=\sum_{I}\eta_{II^{\prime}}u^{I}_{:I^{\prime}}$, where $\sum_{I}\eta_{II^{\prime}}^{2}=1$. We write the non-zero part of $u^{I}\in\mathbb{R}^{n}$ as $u^{I}_{I}\in\mathbb{R}^{|C_{I}|}$, which is a unit vector. The optimization objective in Eq. (5) can be written as:

[TABLE]

Differentiating w.r.t. $r_{I^{\prime}J^{\prime}}$ gives the optimal $r_{I^{\prime}J^{\prime}}=u_{:I^{\prime}}^{\top}(\sum_{I,J}\bar{A}_{IJ})u_{:J^{\prime}}$, and the optimization objective becomes $\max_{u}\sum_{I^{\prime},J^{\prime}}(u_{:I^{\prime}}^{\top}(\sum_{I,J}\bar{A}_{IJ})u_{:J^{\prime}})^{2}$, which can be simplified as $\sum_{I^{\prime},J^{\prime}}(\sum_{I,J}\eta_{II^{\prime}}\eta_{JJ^{\prime}}w^{IJ}_{I^{\prime}J^{\prime}})^{2}$, where $w^{IJ}_{I^{\prime}J^{\prime}}={u^{I}_{II^{\prime}}}^{\top}\tilde{A}_{IJ}{u^{J}_{JJ^{\prime}}}$. Since $u^{I}_{II^{\prime}},u^{J}_{JJ^{\prime}}$ are unit vectors, $\max_{I^{\prime},J^{\prime}}w^{IJ}_{I^{\prime}J^{\prime}}\leq\lambda_{IJ}$, where $\lambda_{IJ}$ is the leading singular value of $\tilde{A}_{IJ}$. Due to the unit norm constraint, we have the objective

[TABLE]

where the equality holds when $u^{I}_{II}$, $u^{J}_{JJ}$ are the left and right singular vectors of $\tilde{A}_{IJ}$ and $\eta_{II^{\prime}}=\mathbb{1}_{I=I^{\prime}}$. By Theorem 5, we know that $\tilde{A}_{IJ}$ are primitive for all $I,J\in[k]$. Applying the Perron-Frobenius theorem (Proposition 1), $u^{I}_{II}>0$ and $u_{:I}\geq 0$, where the positive part identifies some compressed node $C_{I}$. As the compressed node blocks do not need to have an order, the solution is unique only up to permutation of blocks.
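The step in which $R$ is eliminated can be illustrated numerically. The sketch below makes two simplifying assumptions: $U$ has orthonormal columns, and the constraint on $R$ is ignored, so that the minimizer over $R$ is simply $R^{\star}=U^{\top}AU$. The names `loss` and `R_star` and the random data are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(4)
sizes = [2, 3, 2]
n, k = sum(sizes), len(sizes)

# Non-negative assignment matrix with orthonormal columns
U = np.zeros((n, k))
start = 0
for I, size in enumerate(sizes):
    U[start:start + size, I] = 1.0 / np.sqrt(size)
    start += size

A = rng.uniform(size=(n, n))

def loss(R):
    """Reconstruction error ||A - U R U^T||_F^2 for the fixed U above."""
    return np.linalg.norm(A - U @ R @ U.T, "fro") ** 2

R_star = U.T @ A @ U                      # entries r_IJ = u_I^T A u_J
best = loss(R_star)
perturbed = [loss(R_star + 0.1 * rng.standard_normal((k, k))) for _ in range(20)]

# At the optimum the loss equals ||A||_F^2 minus the sum of squared r_IJ
gap = np.linalg.norm(A, "fro") ** 2 - np.linalg.norm(R_star, "fro") ** 2
print(best <= min(perturbed), np.isclose(best, gap))
```

Because $\|A\|_{F}^{2}$ is constant, minimizing the loss over $U$ is equivalent to maximizing $\sum_{I,J}(u_{:I}^{\top}Au_{:J})^{2}$, which is the form of the objective used in the proof.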

Proposition 2

(Proposition 6 in Ding et al. [2006]) For any symmetric matrices $A\in\mathbb{R}_{\geq 0}^{n\times n}$, $B\in\mathbb{R}_{\geq 0}^{k\times k}$ and any matrices $S,S^{\prime}\in\mathbb{R}_{\geq 0}^{n\times k}$, the following inequality holds: $\sum_{i,p}\frac{(AS^{\prime}B)_{ip}S_{ip}^{2}}{S^{\prime}_{ip}}\geq\mathrm{tr}(S^{\top}ASB)$

Proof 7

Write $S_{ip}=S^{\prime}_{ip}a_{ip}$ . Then $\sum_{i,p}\frac{(AS^{\prime}B)_{ip}S_{ip}^{2}}{S^{\prime}_{ip}}-\mathrm{tr}(S^{\top}ASB)=$

[TABLE]

as $A$ and $B$ are symmetric and non-negative.

Proposition 3

For any matrices $B\in\mathbb{R}_{\geq 0}^{k\times k}$ and $S,S^{\prime}\in\mathbb{R}_{\geq 0}^{n\times k}$, where $B$ is symmetric, the following inequality holds: $\sum_{i,p}\frac{(S^{\prime}B)_{ip}S_{ip}^{2}}{S^{\prime}_{ip}}\geq\mathrm{tr}(SBS^{\top})$

Proof 8

Similar to the proof above, we write $S_{ip}=S^{\prime}_{ip}a_{ip}$. Then

[TABLE]

as $B$ is symmetric and non-negative.
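Both inequalities above are easy to probe numerically. The following sketch draws random symmetric non-negative $A$ and $B$ and strictly positive $S$, $S^{\prime}$ (strict positivity is an assumption made here to avoid division by zero), and evaluates both sides. The helper names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(5)
n, k = 6, 3

def random_symmetric(m):
    """Random symmetric matrix with non-negative entries."""
    M = rng.uniform(size=(m, m))
    return (M + M.T) / 2.0

def upper(A, B, S, S_prime):
    """Left-hand side: sum_ip (A S' B)_ip * S_ip^2 / S'_ip."""
    return np.sum((A @ S_prime @ B) * S**2 / S_prime)

A, B = random_symmetric(n), random_symmetric(k)
S = rng.uniform(0.1, 1.0, size=(n, k))
S_prime = rng.uniform(0.1, 1.0, size=(n, k))

lhs = upper(A, B, S, S_prime)
rhs = np.trace(S.T @ A @ S @ B)
tight = upper(A, B, S, S)                        # S' = S turns the bound into an equality

# Special case A = I, which bounds tr(S B S^T)
lhs_identity = upper(np.eye(n), B, S, S_prime)
rhs_identity = np.trace(S @ B @ S.T)
print(lhs >= rhs, np.isclose(tight, rhs), lhs_identity >= rhs_identity)
```

The equality at $S=S^{\prime}$ is what makes the left-hand side usable as an auxiliary function: it upper-bounds the trace term everywhere and touches it at the current iterate.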

Proof of Lemma 3

Proof 9

Write $Q=T^{\top}US$ and $P=S^{\top}U^{\top}US$. Since neither $Q$ nor $P$ is a non-negative matrix in general, the optimization objective $L_{6}$ in Eq. (9) can be written as:

[TABLE]

for $U\geq 0$ . From Proposition 2, we have $\mathrm{tr}(U(P_{+}+\Lambda)U^{\top})\leq\sum_{ij}\frac{[U^{\prime}(P_{+}+\Lambda)]_{ij}U_{ij}^{2}}{U_{ij}^{\prime}}$ since $P_{+}$ and $\Lambda$ are both symmetric matrices. Using $a\leq\frac{a^{2}+b^{2}}{2b}$ , we have $\mathrm{tr}(Q_{-}U^{\top})\leq\sum_{ij}[Q_{-}]_{ij}\frac{U_{ij}^{2}+U_{ij}^{\prime 2}}{2U_{ij}^{\prime}}$ . $Z(U,U^{\prime})$ reaches lower bound $L_{3}$ when $U=U^{\prime}$ . Hence, $Z(U,U^{\prime})$ is an auxiliary function.

Proof of Theorem 3

Proof 10

Using the auxiliary function in Lemma 3, we take the derivative of $Z(U,U^{\prime})$ w.r.t. $U_{ij}$ :

[TABLE]

Solving the stationary point, we have the update rule for $U$ as stated in Algorithm 1:

[TABLE]

Since the update of $S$ is independent of $\Lambda$, the update can be readily adapted from Theorem 8 of Ding et al. [2006]. As the objective is bounded below and the iterative procedure is monotonically non-increasing, the algorithm converges to a local minimum of the objective function.
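The way an auxiliary function yields a monotone multiplicative update can be seen on a simpler, well-known problem. The sketch below is a stand-in and not the update rule of Algorithm 1: it uses the classical multiplicative updates for plain non-negative matrix factorization, $\min_{W,H\geq 0}\|V-WH\|_{F}^{2}$, whose derivation follows the same auxiliary-function pattern. All data are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(6)
V = rng.uniform(0.1, 1.0, size=(8, 6))   # strictly positive toy data
W = rng.uniform(0.1, 1.0, size=(8, 3))
H = rng.uniform(0.1, 1.0, size=(3, 6))

losses = []
for _ in range(50):
    # Each multiplicative step minimizes an auxiliary function of the loss,
    # so the loss cannot increase and the factors stay non-negative.
    H *= (W.T @ V) / (W.T @ W @ H)
    W *= (V @ H.T) / (W @ H @ H.T)
    losses.append(np.linalg.norm(V - W @ H, "fro") ** 2)

monotone = all(b <= a + 1e-10 for a, b in zip(losses, losses[1:]))
print(monotone)
```

The same two ingredients appear in the proof above: an upper bound that is tight at the current iterate, and an update obtained by setting the derivative of that bound to zero.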

Lemma 5

Let $U\in\mathbb{R}^{n\times k}$ be an orthogonal matrix such that $U^{\top}U=I_{k}$ and $U^{\prime}\in\mathbb{R}^{n\times k}$ be a matrix of unit column vectors. Let $G\in\mathbb{R}^{k\times k}$ be a non-negative matrix. Then $\mathrm{tr}(U^{\top}UG)\leq\mathrm{tr}(U^{\prime\top}U^{\prime}G)$

Proof 11

Write $U^{\prime\top}U^{\prime}=I_{k}+E$ for some non-negative matrix $E$ . Since $E$ and $G$ are non-negative, then $\mathrm{tr}(U^{\prime\top}U^{\prime}G)=\mathrm{tr}(I_{k}G+EG)\geq\mathrm{tr}(I_{k}G)$
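Lemma 5 can be checked on random data. The sketch assumes that $U^{\prime}$ is non-negative, which is what makes $E=U^{\prime\top}U^{\prime}-I_{k}$ non-negative in the proof; the dimensions and random matrices are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(7)
n, k = 7, 3

# U: non-negative with orthonormal columns (disjoint supports, unit length)
U = np.zeros((n, k))
U[0:2, 0] = 1.0 / np.sqrt(2)
U[2:5, 1] = 1.0 / np.sqrt(3)
U[5:7, 2] = 1.0 / np.sqrt(2)

# U_prime: non-negative with unit-norm columns, but overlapping supports
U_prime = rng.uniform(size=(n, k))
U_prime /= np.linalg.norm(U_prime, axis=0)

G = rng.uniform(size=(k, k))                 # non-negative weights
E = U_prime.T @ U_prime - np.eye(k)          # zero diagonal, non-negative off-diagonal

lhs = np.trace(U.T @ U @ G)                  # equals tr(G)
rhs = np.trace(U_prime.T @ U_prime @ G)      # equals tr(G) + tr(E G)
print(lhs <= rhs)
```

The gap between the two sides is $\mathrm{tr}(EG)$, which vanishes exactly when the columns of $U^{\prime}$ are mutually orthogonal.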

Lemma 6

$U^{\top}Q=U^{\top}T^{\top}US$ is symmetric under the update rule of Algorithm 2.

Proof 12

Under the update rule in Algorithm 2, as $U$ is column-wise normalized, $U^{\top}T^{\top}U=S^{\top}$ . Hence, $U^{\top}Q=S^{\top}S$ is symmetric.

It is worth noting that the original scheme proposed in Ding et al. [2006], without normalization, does not have this property. Assume the norms of the columns of $U$ are collected in $D$, where the normalized $\tilde{U}$ satisfies $\tilde{U}D=U$. Then the update is $\tilde{S}=\tilde{U}^{\top}T\tilde{U}$, where $S=U^{\top}TU=D\tilde{S}D$. Hence, $S^{\top}S=D\tilde{S}^{\top}D\tilde{S}$ is not necessarily symmetric, which violates the auxiliary function formulation.
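The contrast drawn in this remark can be reproduced numerically under one possible reading of it. The sketch assumes $T=A-A^{\top}$, a diagonal matrix $D$ of hypothetical column norms, and a "mixed" scheme in which $S$ is computed from the normalized factor $\tilde{U}$ while $Q$ is formed with the unnormalized $U=\tilde{U}D$. These choices are illustrative assumptions and are not taken from Algorithm 2.

```python
import numpy as np

rng = np.random.default_rng(8)
n, k = 8, 3
A = rng.uniform(size=(n, n))
T = A - A.T                                   # assumed skew-symmetric adjacency matrix

U_tilde = rng.uniform(0.1, 1.0, size=(n, k))
U_tilde /= np.linalg.norm(U_tilde, axis=0)    # column-wise normalization
D = np.diag([1.0, 2.0, 3.0])                  # hypothetical column norms
U = U_tilde @ D                               # unnormalized factor

def is_symmetric(M):
    return np.allclose(M, M.T)

# Normalized scheme: S and Q are built from the same normalized factor
S_tilde = U_tilde.T @ T @ U_tilde
normalized = U_tilde.T @ (T.T @ U_tilde @ S_tilde)   # equals S_tilde.T @ S_tilde

# Mixed scheme: S from the normalized factor, Q from the unnormalized one
mixed = U.T @ (T.T @ U @ S_tilde)                    # equals D S_tilde.T D S_tilde
print(is_symmetric(normalized), is_symmetric(mixed))
```

When all column norms are equal, $D$ is a multiple of the identity and the mixed product is symmetric again, so the asymmetry is driven purely by unequal column scales.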

Proof of Lemma 4

Proof 13

The proof uses Lemma 5. Due to the normalization step, the factor $U$ has unit-norm column vectors. Hence, $\mathrm{tr}(U^{\prime\top}U^{\prime}U^{\top}Q_{-})\geq\mathrm{tr}(U^{\top}Q_{-})$ and

[TABLE]

is an upper bound for Eq. (10), where equality holds when $U$ is an orthogonal matrix. As $U^{\top}Q$ is symmetric by Lemma 6, we can apply Proposition 2 and have

[TABLE]

We also have

[TABLE]

Combining both terms, the result follows.

We assume $\Lambda+P_{+}\geq 0$ . The KKT condition on the orthogonal constraint case can be applied to choose the optimum regularization term $\Lambda$ . The KKT condition reads:

[TABLE]

For diagonal terms, we sum over $j$ in Eq. (12) to obtain $[-U^{\top}Q_{+}-U^{\top}UP_{-}+U^{\top}UP_{+}+U^{\top}Q_{-}+U^{\top}U\Lambda]_{ii}=0$, which implies $\Lambda_{kk}=[U^{\top}Q_{+}+P_{-}-P_{+}-U^{\top}Q_{-}]_{kk}$. For off-diagonal terms $j\neq p$, we have $\sum_{k}[\Lambda+P]_{ik}U_{jk}=Q_{ij}$; multiplying by $U_{ip}$ and summing over $p$ on both sides, we get $\sum_{k}[\Lambda+P]_{pk}=[\Lambda+P]_{jp}=[U^{\top}Q]_{jp}$. Hence we have:

[TABLE]

with $\Lambda+P_{+}\geq 0$ .

Proof of Theorem 4

Proof 14

Applying the KKT condition and choosing the adaptive $\Lambda=U^{\top}Q-P$, the objective has the form in Eq. (10), which is bounded by Eq. (11) in Lemma 4. Differentiating Eq. (11) w.r.t. $U_{ij}$ gives:

[TABLE]

Solving for the stationary point, we obtain the update rule for $U$ as stated in Algorithm 2:

[TABLE]

Since the $U$ factor here does not have unit norm for each column, we explicitly normalize $U$ and update $S=U^{\top}TU$ after normalization. With the normalization step, the optimization scheme in Algorithm 2 is non-increasing even for the adaptive regularization scheme. Since the objective is bounded below, it converges to a stationary point.

Bibliography


Boutsidis and Gallopoulos [2008] Christos Boutsidis and Efstratios Gallopoulos. SVD based initialization: A head start for nonnegative matrix factorization. Pattern Recognition, 41(4):1350–1362, 2008.
Chaney [2001] Isabella M Chaney. Opinion leaders as a segment for marketing communications. Marketing Intelligence & Planning, 19(5):302–308, 2001.
Delmas et al. [2019] Eva Delmas, Mathilde Besson, Marie-Hélène Brice, Laura A Burkle, Giulio V Dalla Riva, Marie-Josée Fortin, Dominique Gravel, Paulo R Guimarães Jr, David H Hembry, Erica A Newman, et al. Analysing ecological networks of species interactions. Biological Reviews, 94(1):16–36, 2019.
Dhabu et al. [2013] Meera Dhabu, P Deshpande, and Siyaram Vishwakarma. Partition based graph compression. Editorial Preface, 4(9), 2013.
Dhulipala et al. [2016] Laxman Dhulipala, Igor Kabiljo, Brian Karrer, Giuseppe Ottaviano, Sergey Pupyrev, and Alon Shalita. Compressing graphs and indexes with recursive graph bisection. arXiv preprint arXiv:1602.08820, 2016.
Ding et al. [2006] Chris Ding, Tao Li, Wei Peng, and Haesun Park. Orthogonal nonnegative matrix t-factorizations for clustering. In Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 126–135. ACM, 2006.
Ding et al. [2010] Chris HQ Ding, Tao Li, and Michael I Jordan. Convex and semi-nonnegative matrix factorizations. IEEE transactions on pattern analysis and machine intelligence, 32(1):45–55, 2010.
Dulmage and Mendelsohn [1958] Andrew L Dulmage and Nathan S Mendelsohn. Coverings of bipartite graphs. Canadian Journal of Mathematics, 10:517–534, 1958.