Graph Sampling for Matrix Completion Using Recurrent Gershgorin Disc Shift

Fen Wang; Yongchao Wang; Gene Cheung; Cheng Yang

arXiv:1906.01087·eess.SP·June 24, 2020·IEEE Trans. Signal Process.

TL;DR

This paper introduces a novel graph signal processing-based sampling strategy for matrix completion, optimizing sample selection to improve stability and accuracy while maintaining computational efficiency.

Contribution

It proposes a greedy sampling method leveraging Gershgorin circle theorem and eigenvector analysis to select informative samples for matrix completion.

Findings

01

Outperforms existing sampling schemes in reducing completion error.

02

Efficiently scales to large matrices through block-diagonal decomposition.

03

Enhances stability of the linear system via eigenvalue maximization.

Abstract

Matrix completion algorithms fill missing entries in a large matrix given a subset of observed samples. However, how to best pre-select informative matrix entries given a sampling budget is largely unaddressed. In this paper, we propose a fast sample selection strategy for matrix completion from a graph signal processing perspective. Specifically, we first regularize the matrix reconstruction objective using a dual graph signal smoothness prior, resulting in a system of linear equations for solution. We then select appropriate samples to maximize the smallest eigenvalue $\lambda_{\min}$ of the coefficient matrix, thus maximizing the stability of the linear system. To efficiently solve this combinatorial problem, we derive a greedy sampling strategy, leveraging the Gershgorin circle theorem, that iteratively selects one sample (equivalent to shifting one Gershgorin disc) at a time…

Tables (4)

Table 1. TABLE I: Profiles of experimented datasets.

dataset	Exp.	users	items	features	entries	density	entry levels
Synthetic Netflix [3]	Fig. 3	200	100	-	20,000	100%	1,2,…,5
ML100K [52]	Fig. 3	100	200	-	12,566	62.83%	1,2,…,5
ML100K [52]	Tab. II/III	943	1682	✓	100,000	6.3%	1,2,…,5
ML10M [52]	Fig. 3	100	200	-	18,119	90.6%	0.5,1,…,5
ML10M [52]	Tab. IV	1000	500	-	345,904	69.18%	0.5,1,…,5
Douban [26]	Tab. IV	3000	3000	-	136,891	1.52%	1,2,…,5
Flixster [26]	Tab. IV	3000	3000	-	26,173	0.29%	0.5,1,…,5
YahooMusic [26]	Tab. IV	3000	3000	-	5,335	0.06%	1,2,…,100
Book-Crossing [53]	Tab. IV	1000	1000	-	3,166	0.32%	1,2,…,10
Jester [54]	Tab. IV	1000	100	-	73,320	73.32%	(0,1)
ML1M [52]	Tab. IV	6040	3706	✓	1,000,209	4.47%	1,2,…,5
FilmTrust [55]	Tab. IV	1000	1000	-	31,880	3.19%	0.5,1,…,4

Table 2. TABLE II: RMSE for ML100K using random / IGCS sampling combined with different MC methods. Graph-based MC strategies are marked with ✓.

MC methods	$\mathcal{G}$?	G1 (random - IGCS)	G2 (random - IGCS)
IMC [58]	-	1.590 - 1.507	1.590 - 1.600
SVT [57]	-	1.021 - 1.031	1.021 - 0.983
GRALS [25]	✓	0.947 - 0.931	0.945 - 0.893
GMC [3]	✓	1.036 - 1.037	1.118 - 1.054
GC-MC [26]	✓	0.898 - 0.891	0.899 - 0.858
NMC [27]	-	0.892 - 0.887	0.892 - 0.861

Table 3. TABLE III: RMSE and sampling time for IGCS with different $\zeta$’s on ML100K, along with LSS as comparison.

	MC	LSS	eigs	$ζ = 1$	$ζ = 3$	$ζ = 5$	$ζ = 7$
G1	GRALS	0.962	0.931	0.927	0.935	0.934	0.931
	GC-MC	0.910	0.896	0.889	0.895	0.897	0.891
	NMC	0.891	0.907	0.880	0.888	0.889	0.886
	Time ( $10^{3}$ s)	-	1.975	1.104	0.503	0.375	0.320
G2	GRALS	0.958	0.889	0.871	0.870	0.882	0.882
	GC-MC	0.909	0.860	0.839	0.840	0.847	0.851
	NMC	0.907	0.858	0.840	0.845	0.843	0.852
	Time ( $10^{3}$ s)	-	1.278	1.216	0.573	0.441	0.388

Table 4. TABLE IV: RMSE of the proposed IGCS on different datasets with different $\zeta$’s, along with random sampling and LSS for comparison. The MC method is GRALS.

dataset	random	LSS	$ζ = 1$	$ζ = 3$	$ζ = 5$	$ζ = 7$
Flixster	1.029	1.207	0.932	1.057	1.046	1.045
Douban	0.744	0.750	0.715	0.720	0.736	0.730
YahooMusic	96.987	125.0	59.172	44.546	52.391	47.082
ML1M	0.905	0.930	0.829	0.833	0.835	0.838
Book-Crossing	3.987	5.095	3.578	3.704	3.804	4.185
ML10M	0.706	0.777	0.655	0.656	0.656	0.656
Jester	0.214	0.217	0.160	0.162	0.162	0.165
FilmTrust	0.820	0.941	0.668	0.735	0.711	0.742

Equations (102)

$$\mathbf{A}_{\Omega}(i,j)=\begin{cases}1, & \text{if}\;(i,j)\in\Omega;\\ 0, & \text{otherwise}.\end{cases}$$

$$\min_{\mathbf{X}}\;\operatorname{rank}(\mathbf{X})\quad\text{s.t.}\;\;\|\mathbf{A}_{\Omega}\circ\mathbf{X}-\mathbf{Y}\|_{F}<\sigma$$

$$\mathbf{x}_{j}^{\top}\mathbf{L}_{r}\mathbf{x}_{j}=\sum_{(k,l)\in\mathcal{E}_{r}}\mathbf{W}_{r}(k,l)\left(\mathbf{x}_{j}(k)-\mathbf{x}_{j}(l)\right)^{2}.$$

$$\min_{\mathbf{X}}\;f(\mathbf{X})=\frac{1}{2}\|\mathbf{A}_{\Omega}\circ(\mathbf{X}-\mathbf{Y})\|_{F}^{2}+\frac{\alpha}{2}\operatorname{Tr}(\mathbf{X}^{\top}\mathbf{L}_{r}\mathbf{X})+\frac{\beta}{2}\operatorname{Tr}(\mathbf{X}\mathbf{L}_{c}\mathbf{X}^{\top}),$$

$$\left(\tilde{\mathbf{A}}_{\Omega}+\alpha\mathbf{I}_{n}\otimes\mathbf{L}_{r}+\beta\mathbf{L}_{c}\otimes\mathbf{I}_{m}\right)\text{vec}(\mathbf{X}^{*})=\text{vec}(\mathbf{Y})$$

$$\max_{\Omega}\;g(\Omega)=\lambda_{\min}\left(\tilde{\mathbf{A}}_{\Omega}+\alpha\mathbf{I}_{n}\otimes\mathbf{L}_{r}+\beta\mathbf{L}_{c}\otimes\mathbf{I}_{m}\right)$$

$$\|\text{vec}(\mathbf{X}^{*})-\text{vec}(\mathbf{X})\|_{2}\leq\frac{\rho}{\lambda_{\min}(\mathbf{Q})}+\|\text{vec}(\mathbf{N})\|_{2}$$

$$\begin{aligned}\text{vec}(\mathbf{X}^{*})&=\mathbf{Q}^{-1}\text{vec}(\mathbf{Y})=\mathbf{Q}^{-1}\tilde{\mathbf{A}}_{\Omega}\left[\text{vec}(\mathbf{X}+\mathbf{N})\right]\\&=\mathbf{Q}^{-1}\left(\mathbf{Q}-\alpha\mathbf{I}_{n}\otimes\mathbf{L}_{r}-\beta\mathbf{L}_{c}\otimes\mathbf{I}_{m}\right)\left[\text{vec}(\mathbf{X}+\mathbf{N})\right]\\&=\text{vec}(\mathbf{X})+\text{vec}(\mathbf{N})-\mathbf{Q}^{-1}\mathbf{L}\left[\text{vec}(\mathbf{X}+\mathbf{N})\right],\end{aligned}$$

$$\begin{aligned}\|\text{vec}(\mathbf{X}^{*})-\text{vec}(\mathbf{X})\|_{2}&=\|\text{vec}(\mathbf{N})-\mathbf{Q}^{-1}\mathbf{L}\,\text{vec}(\mathbf{X}+\mathbf{N})\|_{2}\\&\leq\|\mathbf{Q}^{-1}\mathbf{L}\,\text{vec}(\mathbf{X}+\mathbf{N})\|_{2}+\|\text{vec}(\mathbf{N})\|_{2}\\&\leq\|\mathbf{Q}^{-1}\|_{2}\,\|\mathbf{L}\,\text{vec}(\mathbf{X}+\mathbf{N})\|_{2}+\|\text{vec}(\mathbf{N})\|_{2}\\&=\rho\,\|\mathbf{Q}^{-1}\|_{2}+\|\text{vec}(\mathbf{N})\|_{2}\end{aligned}$$

$$\|\mathbf{Q}^{-1}\|_{2}=\lambda_{\max}(\mathbf{Q}^{-1})=\frac{1}{\lambda_{\min}(\mathbf{Q})}.$$

$$\tilde{\mathbf{A}}_{\Omega}(l,l)=\begin{cases}1, & \text{if}\;l\in\mathcal{S};\\ 0, & \text{otherwise}.\end{cases}$$

$$\mathbf{Q}=\mathbf{L}+\tilde{\mathbf{A}}_{\Omega}=\mathbf{L}+\sum_{t=1}^{K}\mathbf{e}_{k_{t}}\mathbf{e}_{k_{t}}^{\top},$$

$$k_{t}^{*}=\operatorname*{argmax}_{k_{t}\in\mathcal{S}_{t-1}^{c}}\lambda_{\min}\left(\mathbf{L}_{t-1}+\mathbf{e}_{k_{t}}\mathbf{e}_{k_{t}}^{\top}\right),$$

$$\exists\,i\;|\;a_{ii}-R_{i}\leq\lambda\leq a_{ii}+R_{i}.$$

$$k_{t}^{*}=\operatorname*{argmax}_{k_{t}\in\mathcal{S}_{t-1}^{c}}|\boldsymbol{\phi}(k_{t})|,\quad\text{s.t.}\;\;\mathbf{L}_{t-1}\boldsymbol{\phi}=\lambda_{\min}(\mathbf{L}_{t-1})\,\boldsymbol{\phi}$$

$$\mathbf{L}_{t-1}=\mathbf{L}+\sum_{i\in\mathcal{S}_{t-1}}\mathbf{e}_{i}\mathbf{e}_{i}^{\top}.$$

$$\begin{aligned}\lambda_{\min}(\mathbf{L}_{t-1})&=\min_{\|\mathbf{x}\|_{2}=1}\mathbf{x}^{\top}\Big(\mathbf{L}+\sum_{i\in\mathcal{S}_{t-1}}\mathbf{e}_{i}\mathbf{e}_{i}^{\top}\Big)\mathbf{x}\\&\leq\mathbf{c}^{\top}\mathbf{L}\mathbf{c}+\mathbf{c}^{\top}\sum_{i\in\mathcal{S}_{t-1}}\mathbf{e}_{i}\mathbf{e}_{i}^{\top}\mathbf{c}\\&=\|\mathbf{c}(\mathcal{S}_{t-1})\|_{2}^{2}<1\end{aligned}$$

$$\operatorname*{argmax}_{k_{t}\in\mathcal{S}_{t-1}^{c}}\;\lim_{\delta\to 0}\lambda_{\min}\left(\mathbf{L}_{t-1}+\delta\,\mathbf{e}_{k_{t}}\mathbf{e}_{k_{t}}^{\top}\right)$$

$$\begin{aligned}\lambda_{\min}(\tilde{\mathbf{L}}_{t})&=\min_{\mathbf{x}}\frac{\mathbf{x}^{\top}\mathbf{L}_{t-1}\mathbf{x}+\delta\,\mathbf{x}^{\top}\mathbf{e}_{k_{t}}\mathbf{e}_{k_{t}}^{\top}\mathbf{x}}{\mathbf{x}^{\top}\mathbf{x}}\\&=\min_{\mathbf{x}}\frac{\mathbf{x}^{\top}\mathbf{L}_{t-1}\mathbf{x}+\delta\,\mathbf{x}(k_{t})^{2}}{\mathbf{x}^{\top}\mathbf{x}},\end{aligned}$$

$$\lim_{\delta\to 0}\lambda_{\min}(\tilde{\mathbf{L}}_{t})=\lambda_{\min}(\mathbf{L}_{t-1})+\delta\,\boldsymbol{\phi}(k_{t})^{2},$$

$$\operatorname*{argmax}_{k_{t}\in\mathcal{S}_{t-1}^{c}}\;\lim_{\delta\to 0}\lambda_{\min}(\tilde{\mathbf{L}}_{t})=\operatorname*{argmax}_{k_{t}\in\mathcal{S}_{t-1}^{c}}\;\delta\,\boldsymbol{\phi}(k_{t})^{2}=k_{t}^{*}$$

$$\lambda_{0}+\boldsymbol{\psi}(i)^{2}\leq\beta_{0}\leq\lambda_{0}+\boldsymbol{\phi}(i)^{2}$$

$$\lambda_{0}=\min_{\|\mathbf{x}\|_{2}=1}\mathbf{x}^{\top}\mathbf{L}_{t}\mathbf{x}=\boldsymbol{\phi}^{\top}\mathbf{L}_{t}\boldsymbol{\phi}.$$

$$\beta_{0}=\min_{\|\mathbf{y}\|_{2}=1}\mathbf{y}^{\top}\left(\mathbf{L}_{t}+\mathbf{e}_{i}\mathbf{e}_{i}^{\top}\right)\mathbf{y}$$

Full text

Graph Sampling for Matrix Completion Using Recurrent Gershgorin Disc Shift

Fen Wang, Yongchao Wang, Member, IEEE, Gene Cheung, Senior Member, IEEE, and Cheng Yang, Member, IEEE

F. Wang conducted this work during her visit to York University under a scholarship from the China Scholarship Council. *(Corresponding authors: Yongchao Wang and Gene Cheung.)*

F. Wang and Y. Wang are with the State Key Laboratory of ISN, Xidian University, Xi’an 710071, Shaanxi, China (e-mail: [email protected]; [email protected]).

G. Cheung and C. Yang are with the Department of EECS, York University, 4700 Keele Street, Toronto, M3J 1P3, Canada (e-mail: [email protected]; [email protected]).

Abstract

Matrix completion algorithms fill missing entries in a large matrix given a subset of observed samples. However, how to best pre-select informative matrix entries given a sampling budget is largely unaddressed. In this paper, we propose a fast sample selection strategy for matrix completion from a graph signal processing perspective. Specifically, we first regularize the matrix reconstruction objective using a dual graph signal smoothness prior, resulting in a system of linear equations for solution. We then select appropriate samples to maximize the smallest eigenvalue $\lambda_{\min}$ of the coefficient matrix, thus maximizing the stability of the linear system. To efficiently solve this combinatorial problem, we derive a greedy sampling strategy, leveraging the Gershgorin circle theorem, that iteratively selects one sample (equivalent to shifting one Gershgorin disc) at a time corresponding to the largest magnitude entry in the first eigenvector of a modified graph Laplacian matrix. Our algorithm benefits computationally from a warm start, as the first eigenvectors of incremented Laplacian matrices are computed recurrently for more samples. To achieve computational scalability when sampling large matrices, we further rewrite the coefficient matrix as a sum of two separate components, each of which exhibits block-diagonal structure that we exploit for alternating block-wise sampling. Extensive experiments on both synthetic and real-world datasets show that our graph sampling algorithm substantially outperforms existing sampling schemes for matrix completion and reduces the completion error when combined with a range of modern matrix completion algorithms.

Index Terms:

Graph sampling, matrix completion, Gershgorin circle theorem, graph Laplacian regularization

I Introduction

Big data means not only that the volume of acquired data is large, but the dimensionality of the dataset is also considerable. Matrix completion (MC) [1] is an example of this “curse of dimensionality” problem, where two large dimensional item sets (e.g., viewers and movies in the famed Netflix challenge) are correlated within and across sets. Specifically, given a small subset of pairwise observations (e.g., viewers’ ratings on movies), an MC algorithm reconstructs missing entries in the target matrix signal. Many MC algorithms have been devised using different priors to regularize the under-determined inverse problem, such as low rank of the target matrix [2] and graph signal smoothness priors [3]. See [4] for an introductory exposition.

While MC has been investigated intensively, how to pre-select matrix entries to collect informative samples given a sampling budget is largely unaddressed. This sampling problem is of practical concern for applications where sampling is expensive and/or time-consuming [5] (e.g., requesting viewers to fill out movie surveys is cumbersome and costly). Conventionally, matrix entries were selected based on their uncertainty, computed using different methods [6, 7, 8] that are typically computation-intensive. There exist fast sampling strategies for coherent matrices that assign each entry a probability for non-uniform random sampling [9]. However, the performance of random selection schemes is in general inferior to that of their deterministic counterparts.

Recently, the rating matrix in a recommendation system was investigated from a graph signal processing (GSP) perspective [10], where the target matrix signal was assumed to be bandlimited / smooth with respect to both the row (movies) and column (viewers) graphs, called factor graphs [11, 3]. Under this assumption, [12, 13] identified a structured set for MC by sampling entries that are intersections of greedily selected rows and columns. However, the imposed structure severely limits the possible sampling patterns and thus is too restrictive to achieve a general sampling budget. More general sampling methods for single graphs are not applicable to MC due to their high complexity on the product graph (one large graph containing all matrix entries as nodes) [14].

In contrast, in this paper we propose a fast unstructured graph sampling method for MC. We first regularize the sampling objective with a dual graph smoothness prior, a generalization of the well-known Tikhonov regularizer [15] to the graph signal domain, which was previously shown to be effective in completing missing matrix entries [3, 16]. This formulation leads to a system of linear equations for solution, which can be computed efficiently using known numerical linear algebra algorithms such as conjugate gradient (CG) [17]. To maximize the stability of the linear system, we select samples to maximize the smallest eigenvalue $\lambda_{\min}$ of the coefficient matrix, which we show to also mean minimizing the upper bound of the reconstructed matrix signal’s squared error. (Maximizing $\lambda_{\min}$ of a matrix is also known as the E-optimality criterion in optimal design of experiments [18, 19, 20].)

We propose to optimize the formulated objective greedily: select one node at a time such that the current sample set results in the largest $\lambda_{\min}$. However, in each greedy step, computing $\lambda_{\min}$ for all candidates and choosing the largest one would still be expensive. Instead, leveraging an insightful corollary of the Gershgorin circle theorem [21], we greedily select the sample corresponding to the largest magnitude entry in the first eigenvector of an augmented coefficient matrix, which also minimizes a related objective. Our algorithm benefits from a warm start, as the first eigenvectors of incrementally updated Laplacian matrices are computed recurrently during sampling using the well-known locally optimal block preconditioned conjugate gradient (LOBPCG) method [22].

To achieve computation scalability when sampling large matrices, we further partition the coefficient matrix into two matrices, each exhibiting attractive block-diagonal structure after permutation. We then propose an iterative sampling strategy that efficiently collects a pre-determined number of samples block-wise on smaller blocks alternately. Extensive experiments on synthetic and real-world datasets show that our proposed graph sampling methods achieve much smaller RMSE than competing sampling schemes for MC [12, 23, 24], when combined with a variety of state-of-the-art MC methods [25, 26, 27].

The outline of the paper is as follows. We first overview related works in graph sampling and active matrix completion in Section II. We then derive our graph sampling objective for MC using the dual graph smoothness prior in Section III. In Section IV, we describe our sampling strategy via the Gershgorin circle theorem, and then we propose an iterative block-wise sampling scheme for large matrices in Section V. Finally, extensive experiments and conclusions are presented in Sections VII and VIII, respectively.

II Related works

We first discuss related works in graph sampling, and then review the literature on conventional active matrix completion.

II-A Subset Sampling of Graph Signals

Subset sampling of graph signals is a fundamental problem in GSP [10]: how to select a node subset in a graph for sampling such that the remaining samples in the signal can be reconstructed with high accuracy. Most existing works [28, 29, 18, 14, 30, 31, 24, 12] extended the notion of critical sampling (also known as Nyquist sampling) in regular data kernels to bandlimited / smooth signals on graphs, where graph frequencies are defined as eigenvalues of a graph variation operator like the graph Laplacian or adjacency matrix. Sampling methods in GSP can be broadly divided into two categories: i) deterministic schemes [28, 29, 18, 14, 30, 32], and ii) random schemes [31, 24].

Most deterministic schemes [18, 33, 32] assumed that the graph signal is bandlimited: its spectral coefficients are concentrated on a set of extreme eigenvectors. [14] proposed a lightweight sampling method using the notion of spectral proxies, which collected samples based on the first eigenvector of a submatrix in each greedy step. One recent work [30] avoided eigenvector computation via Neumann series expansion, but required a large number of matrix series multiplications for accurate approximation. Recently, [34] proposed a sampling method based on Gershgorin disc alignment without any explicit eigen-decompositions. [23] also proposed an eigen-decomposition-free sampling method based on localization operator’s coverage surface, but had no notion of global errors in its optimization objective. However, those sampling methods cannot be directly applied to real-world MC problem because of their high complexities on the corresponding product graph.

In parallel, [24] proposed a non-uniform random graph sampling scheme to select nodes based on the notion of graph coherence, such that each node was sampled with a designed probability. However, the performance of random sampling is generally inferior compared to its deterministic competitors.

To the best of our knowledge, we are the first to propose an unstructured and deterministic graph sampling strategy specifically for MC in the literature, with complexity roughly linear in the size of the factor graphs.

II-B Active Matrix Completion

In the active learning literature, strategically selecting matrix entries is also called active matrix completion [35]. Active matrix completion approaches can be categorized into two types: i) statistical approach [6, 35], and ii) GSP approach [12, 13].

Among works pursuing a statistical approach, [35] formulated an active learning objective for MC, which was tackled using collaborative filtering. In [6], three different active querying strategies were proposed based on the reconstruction uncertainty of each entry; this querying idea was further investigated in [36, 37, 8]. However, those methods are generally computation-expensive since they must evaluate all candidates based on expected reconstructed error. Separately, adaptive sensing was proposed in [7] to select a subset of informative columns for MC with bounded complexity. Using the coherent property of matrix signals, [9] proposed a fast leveraged score-based sampling (LSS) to assign each matrix entry a sampling probability for non-uniform random sampling. Nevertheless, the performance of random sampling is not comparable to the deterministic strategies.

Recently, from a GSP viewpoint, the target matrix was interpreted as a bandlimited signal on the two factor graphs [11]. Based on such bandlimited model, [12, 13] proposed an efficient structured graph sampling strategy for MC, and then extended it to multidimensional tensor graph signals, whose effectiveness has been validated in recommendation system and point cloud sampling. However, structured sampling—selected samples must correspond to matrix entries that are intersections of chosen rows and columns of the matrix—is too restrictive to achieve arbitrary sampling budgets. In this paper, we propose a fast unstructured graph sampling strategy for MC, with comparable complexity to the structured counterpart [12, 13].

III Problem Formulation

We derive an objective function for matrix sampling using a dual graph signal smoothness prior [3, 16]. We first define the graph-based MC problem in Section III-A, and then formulate the graph spectral matrix sampling problem in Section III-B.

III-A Dual Graph Smoothness based Matrix Completion

Denote the original matrix signal and additive noise by ${\mathbf{X}}$ and ${\mathbf{N}}$ respectively, where ${\mathbf{X}},{\mathbf{N}}\in\mathbb{R}^{m\times n}$. Given a sampling set $\Omega\subseteq\{(i,j)\;|\;i\in\{1,\ldots,m\},\;j\in\{1,\ldots,n\}\}$, its corresponding sampling operator $\mathbf{A}_{\Omega}\in\{0,1\}^{m\times n}$ can be defined as

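$$\mathbf{A}_{\Omega}(i,j)=\begin{cases}1, & \text{if}\;(i,j)\in\Omega;\\ 0, & \text{otherwise}.\end{cases}\tag{1}$$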

With the above notation, the sampled noise-corrupted observation is ${\mathbf{Y}}={\mathbf{A}}_{\Omega}\circ({\mathbf{X}}+{\mathbf{N}})\in\mathbb{R}^{m\times n}$, where $\circ$ denotes the element-wise matrix multiplication operator. We assume that the entries of the noise $\mathbf{N}$ are zero-mean and independent and identically distributed (i.i.d.), with a common variance.

MC methods attempt to reconstruct the original matrix ${\mathbf{X}}$ from the partial noisy observations ${\mathbf{Y}}$ , under an assumed prior for matrix ${\mathbf{X}}$ , like low-rank [2]:

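$$\min_{\mathbf{X}}\;\operatorname{rank}(\mathbf{X})\quad\text{s.t.}\;\;\|\mathbf{A}_{\Omega}\circ\mathbf{X}-\mathbf{Y}\|_{F}<\sigma\tag{2}$$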

where $\sigma$ is set sufficiently small to enforce similar reconstruction of the observed samples ${\mathbf{Y}}$ in signal ${\mathbf{X}}$ .

Recently, [3] introduced a dual graph smoothness prior to promote low rank matrix reconstruction. Specifically, columns of ${\mathbf{X}}$ are assumed to be smooth with respect to an undirected weighted row graph ${\mathcal{G}}_{r}=\{{\mathcal{V}}_{r},{\mathcal{E}}_{r},{\mathbf{W}}_{r}\}$ with vertices ${\mathcal{V}}_{r}=\{1,\dots,m\}$ and edges ${\mathcal{E}}_{r}\subseteq{\mathcal{V}}_{r}\times{\mathcal{V}}_{r}$ . Weight matrix ${\mathbf{W}}_{r}$ specifies pairwise similarities among vertices in ${\mathcal{G}}_{r}$ . The combinatorial graph Laplacian matrix of row graph ${\mathcal{G}}_{r}$ is ${\mathbf{L}}_{r}={\mathbf{D}}_{r}-{\mathbf{W}}_{r}$ , where the degree matrix ${\mathbf{D}}_{r}$ is a diagonal matrix with entries ${\mathbf{D}}_{r}(i,i)=\sum_{j}{\mathbf{W}}_{r}(i,j)$ . Taking the $j$ -th column of ${\mathbf{X}}$ , denoted by ${\mathbf{x}}_{j}$ , as an example, the total graph variation of ${\mathbf{x}}_{j}$ on graph ${\mathcal{G}}_{r}$ is defined as [38]:

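$$\mathbf{x}_{j}^{\top}\mathbf{L}_{r}\mathbf{x}_{j}=\sum_{(k,l)\in\mathcal{E}_{r}}\mathbf{W}_{r}(k,l)\left(\mathbf{x}_{j}(k)-\mathbf{x}_{j}(l)\right)^{2}.\tag{3}$$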

Thus a smaller variation value would mean similar sample reconstructions between strongly connected nodes.
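As a small numerical illustration of the graph variation above, the following sketch compares the quadratic form ${\mathbf{x}}_{j}^{\top}{\mathbf{L}}_{r}{\mathbf{x}}_{j}$ with the weighted sum of squared differences over edges. The 4-node graph, its weights, and the signal values are made-up assumptions chosen only for illustration; they do not come from the paper.

```python
import numpy as np

# Hypothetical 4-node weighted row graph (weights are illustrative only).
W_r = np.array([[0.0, 1.0, 0.0, 0.5],
                [1.0, 0.0, 2.0, 0.0],
                [0.0, 2.0, 0.0, 1.0],
                [0.5, 0.0, 1.0, 0.0]])

# Combinatorial Laplacian L_r = D_r - W_r.
L_r = np.diag(W_r.sum(axis=1)) - W_r

# One column x_j of the matrix signal X (made-up values).
x_j = np.array([1.0, 1.2, 3.0, 0.5])

# Left-hand side: the quadratic form x_j^T L_r x_j.
quadratic_form = x_j @ L_r @ x_j

# Right-hand side: sum over undirected edges, each counted once (k < l).
num_nodes = W_r.shape[0]
edge_sum = sum(W_r[k, l] * (x_j[k] - x_j[l]) ** 2
               for k in range(num_nodes)
               for l in range(k + 1, num_nodes))

print(quadratic_form, edge_sum)
```

The two printed values agree up to floating-point rounding, and large differences across heavily weighted edges dominate the total.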

Similarly, the rows of ${\mathbf{X}}$ are assumed smooth with respect to a column graph ${\mathcal{G}}_{c}=\{{\mathcal{V}}_{c},{\mathcal{E}}_{c},{\mathbf{W}}_{c}\}$ with vertices ${\mathcal{V}}_{c}=\{1,\dots,n\}$ , edges ${\mathcal{E}}_{c}\subseteq{\mathcal{V}}_{c}\times{\mathcal{V}}_{c}$ and weight matrix ${\mathbf{W}}_{c}$ . Corresponding graph Laplacian matrix for the column graph ${\mathcal{G}}_{c}$ is ${\mathbf{L}}_{c}={\mathbf{D}}_{c}-{\mathbf{W}}_{c}$ . Using the movie recommendation systems as an example, the row graph is a similarity graph among movies, and the column graph is a social relationship graph among viewers. Row and column graphs can be constructed from observed data using different methods [3, 39, 40]; we describe our adopted graph construction schemes in Section VI.

We now formulate the MC problem with dual graph Laplacian regularization (DGLR) [41] as follows:

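$$\min_{\mathbf{X}}\;f(\mathbf{X})=\frac{1}{2}\|\mathbf{A}_{\Omega}\circ(\mathbf{X}-\mathbf{Y})\|_{F}^{2}+\frac{\alpha}{2}\operatorname{Tr}(\mathbf{X}^{\top}\mathbf{L}_{r}\mathbf{X})+\frac{\beta}{2}\operatorname{Tr}(\mathbf{X}\mathbf{L}_{c}\mathbf{X}^{\top}),\tag{4}$$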

where $\alpha$ and $\beta$ are parameters trading off the first fidelity term with the two signal smoothness priors.

It has been shown through extensive experiments that the dual graph signal smoothness prior enables good MC performance [3]. More generally, a graph smoothness prior is a generalization of the well-known Tikhonov regularizer—popular regularization for ill-posed problems—to graph data kernels [10]. In the fast growing field of GSP [38], the graph smoothness prior has already been shown effective empirically for a wide range of inverse problems (e.g., image denoising / deblurring [42, 41], point cloud denoising [43, 44]), and recently is successfully applied to MC also [11, 3, 16, 13]. Next, we derive our sampling algorithm based on this well-accepted prior in the GSP community.

To solve the unconstrained QP problem (4), we take the derivative of $f({\mathbf{X}})$ with respect to ${\mathbf{X}}$, set it to $\mathbf{0}$ and solve for ${\mathbf{X}}$, resulting in a system of linear equations for unknown $\text{vec}({\mathbf{X}}^{*})$:

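$$\left(\tilde{\mathbf{A}}_{\Omega}+\alpha\mathbf{I}_{n}\otimes\mathbf{L}_{r}+\beta\mathbf{L}_{c}\otimes\mathbf{I}_{m}\right)\text{vec}(\mathbf{X}^{*})=\text{vec}(\mathbf{Y})\tag{5}$$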

where $\tilde{\mathbf{A}}_{\Omega}=\text{diag}(\text{vec}({\mathbf{A}}_{\Omega}))$ , $\textrm{vec}(\cdot)$ means a vector form of a matrix by stacking its columns, and $\textrm{diag}(\cdot)$ creates a diagonal matrix with input vector as its diagonal elements. See Appendix A for a detailed derivation.

Since the coefficient matrix ${\mathbf{Q}}=\tilde{\mathbf{A}}_{\Omega}+\alpha\mathbf{I}_{n}\otimes\mathbf{L}_{r}+\beta\mathbf{L}_{c}\otimes\mathbf{I}_{m}$ is in general symmetric, sparse and positive definite (PD; see Appendix B for a detailed description of ${\mathbf{Q}}$), (5) can be solved efficiently using a plethora of mature numerical linear algebra methods such as conjugate gradient (CG) [45]. This is one notable appeal of formulating the MC problem using the dual graph signal smoothness prior in (4), where computing its solution requires only solving a system of linear equations.
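To make the linear system (5) concrete, here is a minimal sketch that assembles ${\mathbf{Q}}$ with Kronecker products and solves for $\text{vec}({\mathbf{X}}^{*})$ using a textbook conjugate gradient loop. Everything specific is an assumption for illustration: the path-graph Laplacians, the values of $\alpha$ and $\beta$, the random sampling mask, and the helper names `path_laplacian` and `conjugate_gradient`. Dense NumPy arrays are used only because the example is tiny; a practical implementation would presumably use sparse matrices.

```python
import numpy as np

def path_laplacian(k):
    """Combinatorial Laplacian D - W of an unweighted k-node path graph."""
    W = np.diag(np.ones(k - 1), 1)
    W = W + W.T
    return np.diag(W.sum(axis=1)) - W

def conjugate_gradient(Q, b, tol=1e-10, max_iter=1000):
    """Textbook CG for a symmetric positive definite matrix Q."""
    x = np.zeros_like(b)
    r = b - Q @ x                  # initial residual
    p = r.copy()                   # initial search direction
    rs_old = r @ r
    for _ in range(max_iter):
        Qp = Q @ p
        step = rs_old / (p @ Qp)
        x = x + step * p
        r = r - step * Qp
        rs_new = r @ r
        if np.sqrt(rs_new) < tol:  # stop once the residual norm is tiny
            break
        p = r + (rs_new / rs_old) * p
        rs_old = rs_new
    return x

rng = np.random.default_rng(0)
m, n = 5, 4                        # toy matrix size
alpha, beta = 0.5, 0.5             # hypothetical trade-off parameters
L_r, L_c = path_laplacian(m), path_laplacian(n)

# Smooth toy signal and a random 0/1 sampling mask A_Omega (noise-free here).
X_true = np.outer(np.linspace(1.0, 2.0, m), np.linspace(1.0, 3.0, n))
A = (rng.random((m, n)) < 0.5).astype(float)
A[0, 0] = 1.0                      # keep at least one sample so Q is PD
Y = A * X_true

# vec(.) stacks columns, which is Fortran ("F") order in NumPy.
A_tilde = np.diag(A.flatten(order="F"))
Q = A_tilde + alpha * np.kron(np.eye(n), L_r) + beta * np.kron(L_c, np.eye(m))

x_star = conjugate_gradient(Q, Y.flatten(order="F"))
X_star = x_star.reshape((m, n), order="F")
```

The column-stacking order matters: with `order="F"`, the term `np.kron(np.eye(n), L_r)` acts on the columns of the matrix signal and `np.kron(L_c, np.eye(m))` acts on its rows, matching the two smoothness terms in (4).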

III-B Graph Sampling for Matrix Completion based on DGLR Formulation

The stability of the linear system in (5) is determined by the condition number of coefficient matrix ${\mathbf{Q}}$ , which is the ratio of the largest eigenvalue $\lambda_{\max}$ of ${\mathbf{Q}}$ to its smallest eigenvalue $\lambda_{\min}$ . Given that $\lambda_{\max}({\mathbf{Q}})$ is upper-bounded for a degree-constrained graph (see Appendix C for a proof), to maximize stability, we seek to maximize $\lambda_{\min}({\mathbf{Q}})$ through sampling, i.e.,

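$$\max_{\Omega}\;g(\Omega)=\lambda_{\min}\left(\tilde{\mathbf{A}}_{\Omega}+\alpha\mathbf{I}_{n}\otimes\mathbf{L}_{r}+\beta\mathbf{L}_{c}\otimes\mathbf{I}_{m}\right)\tag{6}$$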

Maximizing $\lambda_{\min}$ of a coefficient matrix is also known as the E-optimality criterion in optimal design [18, 19, 20], and is a common objective for many well-known linear system optimizations, e.g., active learning [46], sensor placement [47] and polynomial regression [48]. In our sampling scenario, we show further that maximizing (6) also means minimizing the MSE upper bound, as stated formally in the lemma below.

Lemma 1.

Given dual graph Laplacians ${\mathbf{L}}_{r}$ and ${\mathbf{L}}_{c}$ , assuming ground truth signal ${\mathbf{X}}$ is corrupted by independent additive noise ${\mathbf{N}}$ , MSE of the reconstructed signal $\mathbf{X}^{*}$ with respect to the original signal $\mathbf{X}$ is upper-bounded by

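$$\|\text{vec}(\mathbf{X}^{*})-\text{vec}(\mathbf{X})\|_{2}\leq\frac{\rho}{\lambda_{\min}(\mathbf{Q})}+\|\text{vec}(\mathbf{N})\|_{2}\tag{7}$$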

where $\rho=\|\left(\alpha\mathbf{I}_{n}\otimes\mathbf{L}_{r}+\beta\mathbf{L}_{c}\otimes\mathbf{I}_{m}\right)\left[\text{vec}(\mathbf{X+N})\right]\|_{2}$ .

Proof.

In vector form, $\text{vec}(\mathbf{Y})=\tilde{\mathbf{A}}_{\Omega}\left[\text{vec}(\mathbf{X+N})\right]$ . Thus, the solution to the system of linear equations (5) is

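$$\begin{aligned}\text{vec}(\mathbf{X}^{*})&=\mathbf{Q}^{-1}\text{vec}(\mathbf{Y})=\mathbf{Q}^{-1}\tilde{\mathbf{A}}_{\Omega}\left[\text{vec}(\mathbf{X}+\mathbf{N})\right]\\&=\mathbf{Q}^{-1}\left(\mathbf{Q}-\alpha\mathbf{I}_{n}\otimes\mathbf{L}_{r}-\beta\mathbf{L}_{c}\otimes\mathbf{I}_{m}\right)\left[\text{vec}(\mathbf{X}+\mathbf{N})\right]\\&=\text{vec}(\mathbf{X})+\text{vec}(\mathbf{N})-\mathbf{Q}^{-1}\mathbf{L}\left[\text{vec}(\mathbf{X}+\mathbf{N})\right],\end{aligned}\tag{8}$$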

where ${\mathbf{L}}=\alpha{\mathbf{I}}_{n}\otimes{\mathbf{L}}_{r}+\beta{\mathbf{L}}_{c}\otimes{\mathbf{I}}_{m}$ .

Thus the squared error of estimator $\text{vec}({\mathbf{X}}^{*})$ with respect to $\text{vec}({\mathbf{X}})$ is

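$$\begin{aligned}\|\text{vec}(\mathbf{X}^{*})-\text{vec}(\mathbf{X})\|_{2}&=\|\text{vec}(\mathbf{N})-\mathbf{Q}^{-1}\mathbf{L}\,\text{vec}(\mathbf{X}+\mathbf{N})\|_{2}\\&\leq\|\mathbf{Q}^{-1}\mathbf{L}\,\text{vec}(\mathbf{X}+\mathbf{N})\|_{2}+\|\text{vec}(\mathbf{N})\|_{2}\\&\leq\|\mathbf{Q}^{-1}\|_{2}\,\|\mathbf{L}\,\text{vec}(\mathbf{X}+\mathbf{N})\|_{2}+\|\text{vec}(\mathbf{N})\|_{2}\\&=\rho\,\|\mathbf{Q}^{-1}\|_{2}+\|\text{vec}(\mathbf{N})\|_{2}\end{aligned}\tag{9}$$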

where $\rho=\|\left(\alpha\mathbf{I}_{n}\otimes\mathbf{L}_{r}+\beta\mathbf{L}_{c}\otimes\mathbf{I}_{m}\right)\left[\text{vec}(\mathbf{X+N})\right]\|_{2}$ .

From inequality (9), we see that sampling set $\Omega$ only influences the MSE upper bound by manipulating $\|{\mathbf{Q}}^{-1}\|_{2}$ . Moreover, for symmetric and positive definite matrix $\mathbf{Q}$ , we know

$$\|\mathbf{Q}^{-1}\|_{2}=\lambda_{\max}(\mathbf{Q}^{-1})=\frac{1}{\lambda_{\min}(\mathbf{Q})}.\tag{10}$$

We complete this proof by substituting (10) into equation (9). ∎
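The bound in Lemma 1 can be checked numerically on a toy instance. The sketch below is only a sanity check under assumed settings (path-graph Laplacians, $\alpha=\beta=0.3$, Gaussian noise, a random sampling mask, and a hypothetical helper `path_laplacian`); it compares the $\ell_2$ reconstruction error with $\rho/\lambda_{\min}({\mathbf{Q}})+\|\text{vec}({\mathbf{N}})\|_{2}$ and also confirms identity (10).

```python
import numpy as np

def path_laplacian(k):
    """Combinatorial Laplacian D - W of an unweighted k-node path graph."""
    W = np.diag(np.ones(k - 1), 1)
    W = W + W.T
    return np.diag(W.sum(axis=1)) - W

rng = np.random.default_rng(1)
m, n = 6, 5
alpha, beta = 0.3, 0.3             # hypothetical parameters

# Scaled product-graph Laplacian L = alpha I_n (x) L_r + beta L_c (x) I_m.
L = (alpha * np.kron(np.eye(n), path_laplacian(m))
     + beta * np.kron(path_laplacian(n), np.eye(m)))

# vec(X): smooth toy signal; vec(N): small Gaussian noise.
x = np.outer(np.linspace(0.0, 1.0, m), np.linspace(1.0, 2.0, n)).flatten(order="F")
noise = 0.05 * rng.standard_normal(m * n)

# Diagonal of A_tilde: random 0/1 mask with at least one sample.
mask = (rng.random(m * n) < 0.4).astype(float)
mask[0] = 1.0

Q = np.diag(mask) + L
y = mask * (x + noise)             # vec(Y) = A_tilde vec(X + N)
x_star = np.linalg.solve(Q, y)     # direct solve is fine at this size

lam_min = np.linalg.eigvalsh(Q)[0]             # eigvalsh sorts ascending
rho = np.linalg.norm(L @ (x + noise))
error = np.linalg.norm(x_star - x)
bound = rho / lam_min + np.linalg.norm(noise)
inv_norm = np.linalg.norm(np.linalg.inv(Q), 2)  # spectral norm of Q^{-1}

print(error, bound, inv_norm, 1.0 / lam_min)
```

Because the bound follows from the triangle inequality and the sub-multiplicativity of the spectral norm, it is often loose in practice; its value here is that the sampling set enters only through $\lambda_{\min}({\mathbf{Q}})$.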

In the next section, we will present a fast graph sampling strategy to solve optimization problem (6).

IV Fast Sampling on Product Graph via Gershgorin Circle Theorem

For brevity, we interpret ${\mathbf{L}}=\alpha\mathbf{I}_{n}\otimes\mathbf{L}_{r}+\beta\mathbf{L}_{c}\otimes\mathbf{I}_{m}$ as the Laplacian of a scaled product graph. (The Cartesian product between two matrices ${\mathbf{L}}_{r}$ and ${\mathbf{L}}_{c}$ is defined by ${\mathbf{L}}_{r}\odot{\mathbf{L}}_{c}=\mathbf{I}_{n}\otimes\mathbf{L}_{r}+\mathbf{L}_{c}\otimes\mathbf{I}_{m}$.) Thus optimization (6) becomes the maximization of $\lambda_{\min}$ for matrix $\tilde{{\mathbf{A}}}_{\Omega}+{\mathbf{L}}$. By definition, $\tilde{{\mathbf{A}}}_{\Omega}$ is a diagonal matrix:

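$$\tilde{\mathbf{A}}_{\Omega}(l,l)=\begin{cases}1, & \text{if}\;l\in\mathcal{S};\\ 0, & \text{otherwise}.\end{cases}\tag{11}$$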

where ${\mathcal{S}}=\left\{l|l=i+m\times(j-1),\forall(i,j)\in\Omega\right\}$ .

Thus, coefficient matrix ${\mathbf{Q}}$ can be rewritten as:

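$$\mathbf{Q}=\mathbf{L}+\tilde{\mathbf{A}}_{\Omega}=\mathbf{L}+\sum_{t=1}^{K}\mathbf{e}_{k_{t}}\mathbf{e}_{k_{t}}^{\top},\tag{12}$$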

where $K=|{\mathcal{S}}|$ , $k_{t}={\mathcal{S}}(t)$ and ${\mathbf{e}}_{k_{t}}$ is an indicator vector with ${\mathbf{e}}_{k_{t}}({k_{t}})=1$ and ${\mathbf{e}}_{k_{t}}(q)=0$ for $q\neq k_{t}$ .
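The index map $l=i+m\times(j-1)$ and the rank-one decomposition of $\tilde{{\mathbf{A}}}_{\Omega}$ can be illustrated with a short sketch. The matrix size and the sampled entries below are hypothetical; the only point is that column-stacking $\text{vec}(\cdot)$ and the sum of outer products ${\mathbf{e}}_{k_{t}}{\mathbf{e}}_{k_{t}}^{\top}$ produce the same diagonal matrix.

```python
import numpy as np

m, n = 3, 4
# Hypothetical sampled entries, written 1-based as (i, j) like in the text.
Omega = [(1, 1), (3, 2), (2, 4)]

# l = i + m (j - 1): position of entry (i, j) in vec(X) under column stacking.
S = [i + m * (j - 1) for (i, j) in Omega]   # S == [1, 6, 11]

# Route 1: build A_Omega, then take diag(vec(A_Omega)).
A = np.zeros((m, n))
for i, j in Omega:
    A[i - 1, j - 1] = 1.0                   # 1-based (i, j) -> 0-based index
A_tilde_from_vec = np.diag(A.flatten(order="F"))

# Route 2: sum of rank-one indicator outer products e_l e_l^T.
A_tilde_from_sum = np.zeros((m * n, m * n))
for l in S:
    e = np.zeros(m * n)
    e[l - 1] = 1.0                          # 1-based l -> 0-based index
    A_tilde_from_sum += np.outer(e, e)

print(np.array_equal(A_tilde_from_vec, A_tilde_from_sum))
```

Each selected sample therefore adds exactly 1 to one diagonal entry of ${\mathbf{L}}$ and changes nothing else, which is what makes a one-sample-at-a-time greedy update natural.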

Finding an optimal $\Omega$ (or ${\mathcal{S}}$ ) to maximize $\lambda_{\min}({\mathbf{Q}})$ is combinatorial in nature. Towards a low-complexity sampling strategy, we take a greedy approach, where we iteratively add a locally optimal sample to a selected sample set until the sample budget is exhausted. Hence, assuming we have collected $t-1$ samples in ${\mathcal{S}}_{t-1}$ , at the $t$ -th iteration, we solve the following local optimization problem:

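$$k_{t}^{*}=\operatorname*{argmax}_{k_{t}\in\mathcal{S}_{t-1}^{c}}\lambda_{\min}\left(\mathbf{L}_{t-1}+\mathbf{e}_{k_{t}}\mathbf{e}_{k_{t}}^{\top}\right),\tag{13}$$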

where $t\in\{1,\dots,K\}$, ${\mathcal{S}}_{t}={\mathcal{S}}_{t-1}\cup\{k^{*}_{t}\}$ with ${\mathcal{S}}_{0}=\emptyset$, and ${\mathbf{L}}_{t}={\mathbf{L}}_{t-1}+{\mathbf{e}}_{k^{*}_{t}}{\mathbf{e}}^{\top}_{k^{*}_{t}}$ with ${\mathbf{L}}_{0}={\mathbf{L}}$.

To find an optimal solution $k^{*}_{t}$ in (13) for each new sample, one can compute $\lambda_{\min}$ of the incremented Laplacian ${\mathbf{L}}_{t-1}+{\mathbf{e}}_{k_{t}}{\mathbf{e}}^{\top}_{k_{t}}$ corresponding to all candidate nodes $k_{t}\in{\mathcal{S}}^{c}_{t-1}$ and identify the largest one, which is computation-intensive. (We use “increment” here to mean increasing one diagonal element of a matrix by 1, equivalently shifting the center of one Gershgorin disc right by 1, while other matrix entries remain unchanged.) Instead, we circumvent multiple computations of the smallest eigenvalue for candidates using a strategy based on the Gershgorin circle theorem (GCT).
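For reference, the exhaustive evaluation of (13) that this section sets out to avoid can be sketched as follows. This is a brute-force illustration only, with assumed ingredients: a small grid-like product graph built from path graphs, $\alpha=\beta=1$, and hypothetical helper names. It performs one eigenvalue computation per candidate, which is exactly the cost the GCT-based strategy circumvents.

```python
import numpy as np

def path_laplacian(k):
    """Combinatorial Laplacian D - W of an unweighted k-node path graph."""
    W = np.diag(np.ones(k - 1), 1)
    W = W + W.T
    return np.diag(W.sum(axis=1)) - W

def exhaustive_greedy_step(L_prev, sampled):
    """Score every unsampled node k by lambda_min(L_prev + e_k e_k^T)."""
    scores = {}
    for k in range(L_prev.shape[0]):
        if k in sampled:
            continue
        L_candidate = L_prev.copy()
        L_candidate[k, k] += 1.0            # increment one diagonal entry
        scores[k] = np.linalg.eigvalsh(L_candidate)[0]
    best = max(scores, key=scores.get)      # candidate with largest lambda_min
    return best, scores

m, n = 4, 3
# Product-graph Laplacian with alpha = beta = 1 (an assumption for the demo).
L = np.kron(np.eye(n), path_laplacian(m)) + np.kron(path_laplacian(n), np.eye(m))

sampled = set()                             # S_0 is empty
best, scores = exhaustive_greedy_step(L, sampled)
print(best, scores[best])
```

With $mn$ nodes, each greedy step here needs up to $mn$ eigenvalue computations, so collecting $K$ samples costs on the order of $K\,mn$ of them.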

IV-A Gershgorin Disc Shift based Graph Sampling

We first review GCT and its corollary [21], which will lead to a lightweight sampling method later.

Theorem 1.

Given an $n\times n$ matrix ${\mathbf{A}}$ with entries $a_{ij}$ , define the $i$ -th Gershgorin disc $D(a_{ii},R_{i})$ , corresponding to the $i$ -th row of ${\mathbf{A}}$ , with center $a_{ii}$ and radius $R_{i}=\sum_{j\neq i}|a_{ij}|$ . Each eigenvalue $\lambda$ of ${\mathbf{A}}$ lies within at least one Gershgorin disc, i.e.,

$$\exists\,i\;|\;a_{ii}-R_{i}\leq\lambda\leq a_{ii}+R_{i}.\tag{14}$$

Corollary 1.

If the largest magnitude component of an eigenvector ${\mathbf{x}}$ is at index $i$ , then its corresponding eigenvalue $\lambda$ must be within the $i$ -th Gershgorin disc $D(a_{ii},R_{i})$ .
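Corollary 1 is easy to verify numerically. The following sketch, using an arbitrary random symmetric matrix as an assumed test case, checks for every eigenpair that the eigenvalue lies in the Gershgorin disc indexed by the largest-magnitude component of its eigenvector.

```python
import numpy as np

rng = np.random.default_rng(0)
B = rng.standard_normal((6, 6))
A = (B + B.T) / 2.0                         # arbitrary symmetric test matrix

eigenvalues, eigenvectors = np.linalg.eigh(A)

# Gershgorin radii R_i = sum_{j != i} |a_ij|.
radii = np.abs(A).sum(axis=1) - np.abs(np.diag(A))

checks = []
for lam, vec in zip(eigenvalues, eigenvectors.T):
    i = int(np.argmax(np.abs(vec)))         # index of largest-magnitude entry
    # Corollary 1: |lam - a_ii| <= R_i (small slack for rounding).
    checks.append(abs(lam - A[i, i]) <= radii[i] + 1e-9)

print(all(checks))
```

The reason is visible in row $i$ of ${\mathbf{A}}{\mathbf{x}}=\lambda{\mathbf{x}}$: dividing $(\lambda-a_{ii})\,x_{i}=\sum_{j\neq i}a_{ij}x_{j}$ by $x_{i}$ and using $|x_{j}|\leq|x_{i}|$ gives $|\lambda-a_{ii}|\leq R_{i}$.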

This corollary implies that $\lambda_{\min}$ of matrix ${\mathbf{L}}_{t-1}$ must reside in the $j^{*}$-th Gershgorin disc, where $j^{*}=\operatorname*{argmax}_{j}|{\boldsymbol{\phi}}(j)|$ and ${\boldsymbol{\phi}}$ is the first eigenvector of ${\mathbf{L}}_{t-1}$ corresponding to $\lambda_{\min}$. (In this paper, eigenvectors are all normalized, i.e., $\|{\boldsymbol{\phi}}\|_{2}=1$.) By (13), ${\mathbf{e}}_{k_{t}}{\mathbf{e}}^{\top}_{k_{t}}$ shifts the center of the $k_{t}$-th Gershgorin disc of ${\mathbf{L}}_{t-1}$ to the right by 1. *Our strategy is then to right-shift the Gershgorin disc corresponding to the largest magnitude entry $k_{t}^{*}\in\mathcal{S}_{t-1}^{c}$ in ${\boldsymbol{\phi}}$, which contains $\lambda_{\min}$, thus promoting a larger $\lambda_{\min}$ in ${\mathbf{L}}_{t}$; i.e., select sample $k^{*}_{t}$ where*

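$$k^{*}_{t}=\operatorname*{argmax}_{k_{t}\in{\mathcal{S}}^{c}_{t-1}}~|{\boldsymbol{\phi}}(k_{t})|. \tag{15}$$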

Remark: To choose one sample, our strategy requires computation of only the first eigenvector of a sparse matrix ${\mathbf{L}}_{t-1}$ once, without multiple evaluations for all candidates.
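The greedy rule can be sketched in a few lines. The sketch below is an illustration under stated assumptions, not the paper's Algorithm 1: it uses a dense `numpy.linalg.eigh` call in place of LOBPCG, and the path-graph factor graphs, the sizes $m=4$, $n=3$, and the function names are hypothetical.

```python
import numpy as np

def path_laplacian(num_nodes):
    """Combinatorial Laplacian of an unweighted path graph (illustrative choice)."""
    W = np.zeros((num_nodes, num_nodes))
    idx = np.arange(num_nodes - 1)
    W[idx, idx + 1] = W[idx + 1, idx] = 1.0
    return np.diag(W.sum(axis=1)) - W

def gcs_sampling(L, K):
    """Greedy disc-shift sampling: return K selected node indices and L_K."""
    L_t = np.array(L, dtype=float)
    unsampled = np.ones(L_t.shape[0], dtype=bool)
    samples = []
    for _ in range(K):
        # First eigenvector (smallest eigenvalue) of the current matrix.
        _, vecs = np.linalg.eigh(L_t)
        phi = vecs[:, 0]
        # Largest-magnitude entry restricted to the unsampled set.
        k = int(np.argmax(np.where(unsampled, np.abs(phi), -np.inf)))
        samples.append(k)
        unsampled[k] = False
        L_t[k, k] += 1.0  # shift the k-th Gershgorin disc right by 1
    return samples, L_t

m, n, alpha, beta = 4, 3, 0.1, 0.1
L = (alpha * np.kron(np.eye(n), path_laplacian(m))
     + beta * np.kron(path_laplacian(n), np.eye(m)))
samples, L_K = gcs_sampling(L, K=4)
lam_min = float(np.linalg.eigvalsh(L_K)[0])
print(samples, lam_min)
```

Only one eigenvector computation is needed per selected sample, which is the point of the remark above; a full search would instead require one eigenvalue computation per candidate node.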

Note that in (15) we select the index $k_{t}^{*}$ with the largest magnitude $|{\boldsymbol{\phi}}(k_{t}^{*})|$ only among entries in the unsampled set $\mathcal{S}_{t-1}^{c}$ instead of the entire vector, as specified in Corollary 1. However, one can guarantee that the largest magnitude index $k_{t}^{*}$ in ${\boldsymbol{\phi}}$ , in fact, only resides in $\mathcal{S}_{t-1}^{c}$ , and thus (15) is consistent with Corollary 1. We state this formally in the following Proposition:

Proposition 1.

$k^{*}_{t}$ computed from (15) is also the index with the largest magnitude in ${\boldsymbol{\phi}}$ , i.e., $k_{t}^{*}={\operatorname*{argmax}}_{j\in{\mathcal{V}}}~~|{\boldsymbol{\phi}}(j)|$ .

Proof.

Based on the definition below equation (13), we can deduce that

[TABLE]

Hence, in matrix ${\mathbf{L}}_{t-1}$ , left-ends of Gershgorin discs corresponding to indices $i\in{\mathcal{S}}_{t-1}$ are at 1, and the other discs’ left-ends are at 0. Suppose now that $j^{*}={\operatorname*{argmax}}_{j\in{\mathcal{V}}}~~|{\boldsymbol{\phi}}(j)|$ and $j^{*}\in{\mathcal{S}}_{t-1}$ . According to Corollary 1, the smallest eigenvalue $\lambda_{\min}({\mathbf{L}}_{t-1})$ must lie within the $j^{*}$ -th Gershgorin disc, whose left-end is at 1, i.e., $\lambda_{\min}({\mathbf{L}}_{t-1})\geq 1$ .

We also know the first eigenvector of matrix ${\mathbf{L}}$ is a constant vector ${\mathbf{c}}=\frac{1}{\sqrt{mn}}[1,\dots,1]$ with eigenvalue 0. This yields:

[TABLE]

where the last inequality holds since ${\mathcal{S}}_{t-1}\subset{\mathcal{V}}$ .

This contradicts the earlier result that $\lambda_{\min}({\mathbf{L}}_{t-1})\geq 1$ . Thus $j^{*}\in{\mathcal{S}}^{c}_{t-1}$ and $j^{*}=k^{*}_{t}$ . ∎

Thus, we conclude that entries within set ${\mathcal{S}}_{t-1}$ cannot have the largest energy in first eigenvector ${\boldsymbol{\phi}}$ of ${\mathbf{L}}_{t-1}$ .
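Proposition 1 can also be checked numerically. The sketch below runs the greedy selection on a small, hypothetical product graph (path graphs with $m=4$, $n=3$, an assumption for illustration) and verifies at every step that the global largest-magnitude entry of ${\boldsymbol{\phi}}$ is unsampled and that $\lambda_{\min}({\mathbf{L}}_{t-1})\leq|{\mathcal{S}}_{t-1}|/(mn)$, the Rayleigh quotient of the constant vector.

```python
import numpy as np

def path_laplacian(num_nodes):
    """Combinatorial Laplacian of an unweighted path graph (illustrative choice)."""
    W = np.zeros((num_nodes, num_nodes))
    idx = np.arange(num_nodes - 1)
    W[idx, idx + 1] = W[idx + 1, idx] = 1.0
    return np.diag(W.sum(axis=1)) - W

m, n, alpha, beta = 4, 3, 0.1, 0.1
N = m * n
L_t = (alpha * np.kron(np.eye(n), path_laplacian(m))
       + beta * np.kron(path_laplacian(n), np.eye(m)))
unsampled = np.ones(N, dtype=bool)

argmax_is_unsampled, bound_holds = [], []
for _ in range(N - 1):  # stop before every node has been sampled
    vals, vecs = np.linalg.eigh(L_t)
    energy = np.abs(vecs[:, 0])
    num_sampled = N - int(unsampled.sum())
    # lambda_min <= c^T L_{t-1} c = |S_{t-1}| / (mn) < 1 for the constant vector c.
    bound_holds.append(bool(vals[0] <= num_sampled / N + 1e-9))
    # The largest-magnitude entry over all nodes must be an unsampled node.
    argmax_is_unsampled.append(bool(unsampled[int(np.argmax(energy))]))
    k = int(np.argmax(np.where(unsampled, energy, -np.inf)))
    unsampled[k] = False
    L_t[k, k] += 1.0

print(all(argmax_is_unsampled), all(bound_holds))
```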

We can alternatively justify our strategy by showing that the index chosen by our strategy optimizes a related objective to (13). We state this formally in the following lemma.

Lemma 2.

An optimal solution to the problem

[TABLE]

is $k^{*}_{t}=\mathop{\operatorname*{argmax}}\limits_{k_{t}\in{\mathcal{S}}^{c}_{t-1}}~{}~{}|{\boldsymbol{\phi}}(k_{t})|$ , where ${\boldsymbol{\phi}}$ is the first eigenvector of ${\mathbf{L}}_{t-1}$ corresponding to smallest eigenvalue $\lambda_{\min}({\mathbf{L}}_{t-1})$ .

Proof.

Since matrix ${\mathbf{L}}_{t-1}$ is augmented by a small matrix $\delta{\mathbf{e}}_{k_{t}}{\mathbf{e}}^{\top}_{k_{t}}$ , the resulting matrix is ${\tilde{{\mathbf{L}}}_{t}}={\mathbf{L}}_{t-1}+\delta{\mathbf{e}}_{k_{t}}{\mathbf{e}}^{\top}_{k_{t}}$ , where $k_{t}\in{\mathcal{S}}_{t-1}^{c}$ . Using the Rayleigh quotient theorem [49], we can write $\lambda_{\min}$ of matrix $\tilde{{\mathbf{L}}}_{t}$ as

[TABLE]

where the minimizer ${\mathbf{x}}^{*}$ is the first eigenvector of ${\mathbf{L}}_{t-1}$ when $\delta\rightarrow 0$ , i.e., $\lim_{\delta\rightarrow 0}{\mathbf{x}}^{*}={\boldsymbol{\phi}}$ . Therefore,

[TABLE]

where ${\boldsymbol{\phi}}^{\top}{\boldsymbol{\phi}}$ is omitted since $\|{\boldsymbol{\phi}}\|_{2}=1$ .

Given collected ${\mathcal{S}}_{t-1}$ , $\lambda_{\min}({{\mathbf{L}}_{t-1}})$ does not depend on $k_{t}$ . Hence,

[TABLE]

∎

Thus, by computing $k^{*}_{t}$ using (15), we are optimally solving problem (18), which is a proxy approximating original (13).
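The first-order argument in the proof of Lemma 2 says that, for small $\delta$, $\lambda_{\min}({\mathbf{L}}_{t-1}+\delta{\mathbf{e}}_{k}{\mathbf{e}}^{\top}_{k})-\lambda_{\min}({\mathbf{L}}_{t-1})\approx\delta\,{\boldsymbol{\phi}}(k)^{2}$. A minimal finite-difference check is sketched below; the test graph, the choice $\alpha=\beta=1$, and $\delta=10^{-6}$ are assumptions chosen only to keep the example well conditioned.

```python
import numpy as np

def path_laplacian(num_nodes):
    """Combinatorial Laplacian of an unweighted path graph (illustrative choice)."""
    W = np.zeros((num_nodes, num_nodes))
    idx = np.arange(num_nodes - 1)
    W[idx, idx + 1] = W[idx + 1, idx] = 1.0
    return np.diag(W.sum(axis=1)) - W

m, n = 4, 3
N = m * n
L_prev = np.kron(np.eye(n), path_laplacian(m)) + np.kron(path_laplacian(n), np.eye(m))
L_prev[0, 0] += 1.0  # one node already sampled, so lambda_min > 0 and is simple

vals, vecs = np.linalg.eigh(L_prev)
lam0, phi = vals[0], vecs[:, 0]

delta = 1e-6
slopes = np.empty(N)
for k in range(N):
    L_pert = L_prev.copy()
    L_pert[k, k] += delta
    # Finite-difference slope of lambda_min with respect to the k-th diagonal entry.
    slopes[k] = (np.linalg.eigvalsh(L_pert)[0] - lam0) / delta

max_gap = float(np.max(np.abs(slopes - phi ** 2)))
print(max_gap)
```

The slopes agree with ${\boldsymbol{\phi}}(k)^{2}$ up to second-order terms, so maximizing the slope over candidates is the same as maximizing $|{\boldsymbol{\phi}}(k)|$.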

IV-B Fast Repeated Eigenvector Computation with Warm Start

Our proposed formulation (15) requires computing the first eigenvector of ${\mathbf{L}}_{t-1}$ in each greedy step. In this paper, we will adopt the state-of-the-art LOBPCG method [22] to compute the first eigenvector, which has been proved very efficient for large sparse matrices [14]. With an initial input ${\mathbf{x}}_{0}$ , in each iteration, LOBPCG works as follows:

1. Multiply $\mathbf{L}_{t-1}$ with ${\mathbf{x}}_{i}\in\mathbb{R}^{mn}$ (guess of the first eigenvector in $i$ -th iteration) with complexity ${\mathcal{O}}(E_{t-1})$ , where $E_{t-1}$ is the number of non-zero entries in ${\mathbf{L}}_{t-1}$ ;

2. Perform a Rayleigh-Ritz step [22] to compute the combination coefficients, solving an eigenvalue problem with complexity ${\mathcal{O}}(r^{3})$ . $r$ is the number of computed eigenvectors and in our problem, $r=1$ ;

3. Update ${\mathbf{x}}_{i}$ based on Rayleigh-Ritz coefficients, and go to step (1) until convergence.

Our proposed algorithm can benefit computationally from warm start when deploying LOBPCG: we use the estimated first eigenvector ${\boldsymbol{\phi}}$ of ${\mathbf{L}}_{t-1}$ from the previous greedy step as the initial guess ${\mathbf{x}}_{0}$ for ${\mathbf{L}}_{t}$ . The small change between ${\mathbf{L}}_{t}$ and ${\mathbf{L}}_{t-1}$ (the Frobenius norm of the difference is only 1) ensures a good initial guess, reducing the number of iterations for LOBPCG to converge. Simulation results in Section VII show that warm start does reduce sampling time noticeably.
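The warm-start idea can be illustrated without LOBPCG itself. The sketch below deliberately swaps in a much simpler iterative eigensolver, a shifted power iteration on $c{\mathbf{I}}-{\mathbf{L}}_{t}$, as a stand-in; it is not the solver used in the paper. The function name, the test graph with $\alpha=\beta=1$, and the tolerance are assumptions. It counts the iterations needed when starting from the previous first eigenvector versus a random vector.

```python
import numpy as np

def path_laplacian(num_nodes):
    """Combinatorial Laplacian of an unweighted path graph (illustrative choice)."""
    W = np.zeros((num_nodes, num_nodes))
    idx = np.arange(num_nodes - 1)
    W[idx, idx + 1] = W[idx + 1, idx] = 1.0
    return np.diag(W.sum(axis=1)) - W

def first_eigvec_power(L, x0, tol=1e-9, max_iter=100000):
    """Shifted power iteration (stand-in for LOBPCG): (lambda_min, phi, iterations)."""
    c = float(np.abs(L).sum(axis=1).max())  # Gershgorin bound, c >= lambda_max
    x = x0 / np.linalg.norm(x0)
    lam = float(x @ L @ x)
    for it in range(max_iter):
        if np.linalg.norm(L @ x - lam * x) < tol:
            return lam, x, it
        y = c * x - L @ x  # one matrix-vector product per iteration
        x = y / np.linalg.norm(y)
        lam = float(x @ L @ x)
    return lam, x, max_iter

m, n = 4, 3
N = m * n
L_t = np.kron(np.eye(n), path_laplacian(m)) + np.kron(path_laplacian(n), np.eye(m))
rng = np.random.default_rng(0)
unsampled = np.ones(N, dtype=bool)

phi = np.ones(N)  # first eigenvector of L_0 is the constant vector
warm_iters = cold_iters = 0
max_err = 0.0
for _ in range(5):
    lam, phi, it_warm = first_eigvec_power(L_t, phi)  # warm start
    _, _, it_cold = first_eigvec_power(L_t, rng.standard_normal(N))  # cold start
    warm_iters += it_warm
    cold_iters += it_cold
    max_err = max(max_err, abs(lam - float(np.linalg.eigvalsh(L_t)[0])))
    k = int(np.argmax(np.where(unsampled, np.abs(phi), -np.inf)))
    unsampled[k] = False
    L_t[k, k] += 1.0

print(warm_iters, cold_iters, max_err)
```

The printed iteration totals depend on the graph and the random seed, so no particular ratio should be read into them; the point is only that the previous eigenvector is a natural initial guess when one diagonal entry changes.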

We write the pseudo code of our sampling strategy in Algorithm 1, called Gershgorin circle shift (GCS)-based sampling.

IV-C Complexity Analysis

The complexity of our proposed GCS method is dominated by three components: i) $K$ times greedy search, ii) first eigenvector computation, and iii) finding the largest element’s location in each greedy search.

Identifying the largest energy index in a vector with length $mn$ has complexity ${\mathcal{O}}(mn)$ . The complexity of using LOBPCG to compute the first eigenvector of ${\mathbf{L}}_{t}$ is ${\mathcal{O}}(E_{t}F_{t})$ , where $F_{t}$ is the number of iterations till convergence in LOBPCG. Note that $E_{0}=E_{1}=\dots=E_{K-1}$ since ${\mathbf{L}}_{t}={\mathbf{L}}_{t-1}+{\mathbf{e}}_{k^{*}_{t}}{\mathbf{e}}^{\top}_{k^{*}_{t}}$ , and the diagonal terms of ${\mathbf{L}}$ are all non-zero for a connected graph. Because ${\mathbf{L}}_{0}={\mathbf{L}}=\alpha{\mathbf{I}}_{n}\otimes{\mathbf{L}}_{r}+\beta{\mathbf{L}}_{c}\otimes{\mathbf{I}}_{m}$ , i.e., matrix ${\mathbf{L}}$ consists of $n$ copies of ${\mathbf{L}}_{r}$ and $m$ copies of ${\mathbf{L}}_{c}$ , the number of nonzero entries in ${\mathbf{L}}$ is at most $E_{0}={\mathcal{O}}(n|{\mathcal{E}}_{r}|+m|{\mathcal{E}}_{c}|)$ , where $|{\mathcal{E}}_{r}|$ and $|{\mathcal{E}}_{c}|$ are the numbers of edges in the graphs with Laplacians ${\mathbf{L}}_{r}$ and ${\mathbf{L}}_{c}$ , respectively.

Therefore, denoting by $F=\max\{F_{0},\dots,F_{K-1}\}$ , in each greedy step, the complexity of LOBPCG is ${\mathcal{O}}((n|{\mathcal{E}}_{r}|+m|{\mathcal{E}}_{c}|)F)$ ; combined with $K$ times greedy search and signal sorting, our GCS method has the complexity ${\mathcal{O}}(K(n|{\mathcal{E}}_{r}|+m|{\mathcal{E}}_{c}|)F+Kmn)$ . If the row graph and column graph are both sparse such that $|{\mathcal{E}}_{r}|={\mathcal{O}}(m)$ and $|{\mathcal{E}}_{c}|={\mathcal{O}}(n)$ , then the complexity of GCS simplifies to ${\mathcal{O}}(KFmn)$ . Though the spectral proxy based sampling method in [14] also computes the first eigenvector of a submatrix of ${\mathbf{L}}^{p}$ via LOBPCG, it does not benefit from warm start, and the complexity of computing ${\mathbf{L}}^{p}{\mathbf{x}}$ will be higher than that of ${\mathbf{L}}{\mathbf{x}}$ by at least a factor of $p$ .
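The claim that the number of nonzero entries stays fixed across greedy steps is easy to check on a toy product graph. In the sketch below, the path graphs and sizes are assumptions, and the exact count $2n|{\mathcal{E}}_{r}|+2m|{\mathcal{E}}_{c}|+mn$ is our own bookkeeping for connected factor graphs (off-diagonal entries of the two Kronecker terms never overlap, and all $mn$ diagonal entries are nonzero); it is consistent with the ${\mathcal{O}}(n|{\mathcal{E}}_{r}|+m|{\mathcal{E}}_{c}|)$ bound when the graphs have no isolated nodes.

```python
import numpy as np

def path_laplacian(num_nodes):
    """Combinatorial Laplacian of an unweighted path graph (illustrative choice)."""
    W = np.zeros((num_nodes, num_nodes))
    idx = np.arange(num_nodes - 1)
    W[idx, idx + 1] = W[idx + 1, idx] = 1.0
    return np.diag(W.sum(axis=1)) - W

m, n, alpha, beta = 5, 4, 0.1, 0.1
L_r, L_c = path_laplacian(m), path_laplacian(n)

# Edge counts read off the strictly upper-triangular parts.
num_edges_r = int(np.count_nonzero(np.triu(L_r, k=1)))
num_edges_c = int(np.count_nonzero(np.triu(L_c, k=1)))

L0 = alpha * np.kron(np.eye(n), L_r) + beta * np.kron(L_c, np.eye(m))
E0 = int(np.count_nonzero(L0))

# One greedy increment only changes an already nonzero diagonal entry.
L1 = L0.copy()
L1[7, 7] += 1.0
E1 = int(np.count_nonzero(L1))

predicted = 2 * n * num_edges_r + 2 * m * num_edges_c + m * n
print(E0, E1, predicted)
```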

IV-D Explanation from Graph Spectral Energy Perspective

We now interpret GCS sampling from an energy spreading perspective. First, we define the absolute value of the first eigenvector of the incremented Laplacian as graph spectral energy. Our proposed GCS method selects the node with the largest energy at each greedy step. Once node $i$ is sampled, intuitively, the energy of this node and nodes near $i$ should decrease such that in the next step, the proposed strategy will not sample those nodes. This is formally stated in the next lemma:

Lemma 3.

Denote by $\lambda_{0}=\lambda_{\min}({\mathbf{L}}_{t})$ , $\beta_{0}=\lambda_{\min}({\mathbf{L}}_{t}+{\mathbf{e}}_{i}{\mathbf{e}}^{\top}_{i})$ and ${\mathbf{L}}_{t}{\boldsymbol{\phi}}=\lambda_{0}{\boldsymbol{\phi}}$ , $({\mathbf{L}}_{t}+{\mathbf{e}}_{i}{\mathbf{e}}^{\top}_{i}){\boldsymbol{\psi}}=\beta_{0}{\boldsymbol{\psi}}$ . Then

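$$\lambda_{0}+{\boldsymbol{\psi}}(i)^{2}\leq\beta_{0}\leq\lambda_{0}+{\boldsymbol{\phi}}(i)^{2}. \tag{22}$$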

In our problem, ${\mathbf{L}}_{t}\in\{{\mathbf{L}}_{0},{\mathbf{L}}_{1},\dots,{\mathbf{L}}_{K-1}\}$ .

Proof.

From the Rayleigh quotient theorem [49],

[TABLE]

and,

[TABLE]

From equation (23), we know ${\boldsymbol{\psi}}^{\top}{\mathbf{L}}_{t}{\boldsymbol{\psi}}\geq{\boldsymbol{\phi}}^{\top}{\mathbf{L}}_{t}{\boldsymbol{\phi}}=\lambda_{0}$ , which implies $\beta_{0}\geq\lambda_{0}+{\boldsymbol{\psi}}(i)^{2}$ . From equation (24), we can derive that $\beta_{0}\leq{\boldsymbol{\phi}}^{\top}{\mathbf{L}}_{s}{\boldsymbol{\phi}}={\boldsymbol{\phi}}^{\top}{\mathbf{L}}_{t}{\boldsymbol{\phi}}+{\boldsymbol{\phi}}(i)^{2}=\lambda_{0}+{\boldsymbol{\phi}}(i)^{2}$ , which is exactly the right part of the lemma. ∎

This lemma states that sampling node $i$ will reduce the spectral energy at node $i$ . Moreover, sampling node $i$ with the largest $|{\boldsymbol{\phi}}(i)|$ actually maximizes the upper-bound of $\lambda_{\min}({\mathbf{L}}_{t}+{\mathbf{e}}_{i}{\mathbf{e}}^{\top}_{i})$ . We know that the value of $|{\boldsymbol{\psi}}(i)|$ is penalized from $|{\boldsymbol{\phi}}(i)|$ , so selecting the node with largest $|{\boldsymbol{\phi}}(i)|$ will also promote a reasonably large $|{\boldsymbol{\psi}}(i)|$ , thus providing a large lower-bound of $\lambda_{\min}({\mathbf{L}}_{t}+{\mathbf{e}}_{i}{\mathbf{e}}^{\top}_{i})$ .
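The two-sided bound of Lemma 3, and the resulting energy drop $|{\boldsymbol{\psi}}(i)|\leq|{\boldsymbol{\phi}}(i)|$, can be verified numerically for every candidate node. The following is a minimal sketch on a hypothetical product of path graphs; the sizes, weights, and variable names (`beta0` for $\beta_{0}$, to avoid a clash with the weight $\beta$) are assumptions.

```python
import numpy as np

def path_laplacian(num_nodes):
    """Combinatorial Laplacian of an unweighted path graph (illustrative choice)."""
    W = np.zeros((num_nodes, num_nodes))
    idx = np.arange(num_nodes - 1)
    W[idx, idx + 1] = W[idx + 1, idx] = 1.0
    return np.diag(W.sum(axis=1)) - W

m, n, alpha, beta = 4, 3, 0.1, 0.1
N = m * n
L_t = (alpha * np.kron(np.eye(n), path_laplacian(m))
       + beta * np.kron(path_laplacian(n), np.eye(m)))
L_t[0, 0] += 1.0  # one node already sampled

vals, vecs = np.linalg.eigh(L_t)
lam0, phi = vals[0], vecs[:, 0]

tol = 1e-10
lower_ok, upper_ok, energy_drops = [], [], []
for i in range(N):
    L_inc = L_t.copy()
    L_inc[i, i] += 1.0  # sample node i
    v, V = np.linalg.eigh(L_inc)
    beta0, psi = v[0], V[:, 0]
    lower_ok.append(bool(lam0 + psi[i] ** 2 <= beta0 + tol))
    upper_ok.append(bool(beta0 <= lam0 + phi[i] ** 2 + tol))
    energy_drops.append(bool(abs(psi[i]) <= abs(phi[i]) + tol))

print(all(lower_ok), all(upper_ok), all(energy_drops))
```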

Equation (3) tells us that strongly connected nodes have similar signal values, thus the energy of nodes near $i$ is also decreased when node $i$ is sampled. Therefore, with high probability, nodes close to $i$ will not be sampled by the proposed GCS method. This agrees with our intuition: node $i$ carries information of its local neighborhood; after sampling it, there is no need to sample connected nodes in its neighborhood.

We conduct toy experiments on a community graph with 100 nodes using the proposed GCS sampling, whose results are shown in Fig. 1. As depicted in this experiment, the first four samples lie in four different communities. From the graph energy perspective, sampling one node in one community will decrease the energy of nodes within this community, thus leading to sampling the next node from other communities.

V Iterative Graph Spectral Sampling for Matrix Completion

Though we identify samples using LOBPCG to compute first eigenvectors repeatedly with warm start, sampling on a very large product graph (matrix ${\mathbf{L}}$ ) with $mn$ nodes is still expensive for large real-world MC datasets. We thus propose an efficient block-wise sampling method for MC problem operating on two corresponding row and column graphs, while retaining the same sampling idea in GCS. Towards a simpler presentation, we omit $\Omega$ in $\tilde{{\mathbf{A}}}_{\Omega}$ in the sequel.

Since coefficient matrix ${\mathbf{Q}}$ is a combination of row and column from ${\mathbf{L}}_{r}$ and ${\mathbf{L}}_{c}$ , it does not exhibit any structure that one can exploit for optimization. We thus split ${\mathbf{Q}}$ into two separate matrices ${\mathbf{Q}}_{1}$ and ${\mathbf{Q}}_{2}$ as follows:

[TABLE]

where $0<q<1$ is a split parameter.

Since ${\mathbf{Q}}_{1}$ and ${\mathbf{Q}}_{2}$ are both Hermitian, by Weyl’s inequality [49]

[TABLE]

which indicates that each selected sample affects respective $\lambda_{\min}$ ’s of ${\mathbf{Q}}_{1}$ and ${\mathbf{Q}}_{2}$ and the lower bound of $\lambda_{\min}({\mathbf{Q}})$ .

From a Gershgorin circle perspective, each selected sample shifts one disc in $\alpha{\mathbf{I}}_{n}\otimes{\mathbf{L}}_{r}$ and $\beta{\mathbf{L}}_{c}\otimes{\mathbf{I}}_{m}$ by $q$ and $1-q$ , respectively. Next, we will exploit the block diagonal property of matrices ${\mathbf{Q}}_{1}$ and ${\mathbf{Q}}_{2}$ to develop an efficient sampling framework.
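The Weyl lower bound can be checked on a toy instance. The sketch below assumes the split ${\mathbf{Q}}_{1}=q\tilde{{\mathbf{A}}}+\alpha{\mathbf{I}}_{n}\otimes{\mathbf{L}}_{r}$ and ${\mathbf{Q}}_{2}=(1-q)\tilde{{\mathbf{A}}}+\beta{\mathbf{L}}_{c}\otimes{\mathbf{I}}_{m}$, which is inferred from the disc-shift description above; the path graphs and the sampling pattern are hypothetical.

```python
import numpy as np

def path_laplacian(num_nodes):
    """Combinatorial Laplacian of an unweighted path graph (illustrative choice)."""
    W = np.zeros((num_nodes, num_nodes))
    idx = np.arange(num_nodes - 1)
    W[idx, idx + 1] = W[idx + 1, idx] = 1.0
    return np.diag(W.sum(axis=1)) - W

m, n, alpha, beta, q = 4, 3, 0.1, 0.1, 0.5
L_r, L_c = path_laplacian(m), path_laplacian(n)

# Hypothetical sampling pattern: one observed entry per column.
A = np.zeros((m, n))
A[0, 0] = A[2, 1] = A[3, 2] = 1.0
A_tilde = np.diag(A.flatten(order="F"))  # diag(vec(A)), column-major vec

Q1 = q * A_tilde + alpha * np.kron(np.eye(n), L_r)          # assumed split
Q2 = (1.0 - q) * A_tilde + beta * np.kron(L_c, np.eye(m))   # assumed split
Q = Q1 + Q2

def lam_min(M):
    """Smallest eigenvalue of a symmetric matrix."""
    return float(np.linalg.eigvalsh(M)[0])

lhs, rhs = lam_min(Q), lam_min(Q1) + lam_min(Q2)
print(lhs, rhs)
```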

V-A Block Diagonal Structure and Inner Connections

${\mathbf{Q}}_{1}\in\mathbb{R}^{mn\times mn}$ has block-diagonal structure, i.e.,

[TABLE]

where $\tilde{{\mathbf{A}}}_{j}\in\mathbb{R}^{m\times m}$ is the $j$ -th diagonal block of $\tilde{{\mathbf{A}}}$ .

When entry $(i,j)$ of the target matrix ${\mathbf{X}}$ is sampled, $\tilde{{\mathbf{A}}}_{j}(i,i)=1$ since $\mathbf{A}(i,j)=1$ and $\tilde{\mathbf{A}}=\text{diag}(\text{vec}({\mathbf{A}}))$ . Equivalently, if $\tilde{{\mathbf{A}}}_{j}(i,i)=1$ , we know that the information from $i$ -th row (movie) and $j$ -th column (customer) is collected.

In contrast, matrix ${\mathbf{Q}}_{2}\in\mathbb{R}^{mn\times mn}$ is:

[TABLE]

It is known that matrices ${\mathbf{L}}_{c}\otimes{\mathbf{I}}_{m}$ and ${\mathbf{I}}_{m}\otimes{\mathbf{L}}_{c}$ are permutation similar, i.e., there exists a permutation matrix ${\mathbf{P}}$ such that [50]:

[TABLE]

Thus the permuted sampling matrix $\hat{{\mathbf{A}}}$ for ${\mathbf{I}}_{m}\otimes{\mathbf{L}}_{c}$ is also a block-diagonal matrix.

[TABLE]

where $\hat{{\mathbf{A}}}_{i}\in\mathbb{R}^{n\times n}$ is the $i$ -th diagonal block of $\hat{{\mathbf{A}}}$ .

Combined with the block diagonal property of ${\mathbf{I}}_{m}\otimes{\mathbf{L}}_{c}$ , we will write the permuted form of ${\mathbf{Q}}_{2}$ as follows:

[TABLE]

where $\hat{q}=1-q$ for brevity.

From the property of similarity transform, we know that $\lambda({\mathbf{Q}}_{2})=\lambda(\hat{{\mathbf{Q}}}_{2})$ since ${\mathbf{P}}^{-1}={\mathbf{P}}^{\top}$ for any permutation matrix. From (45), we see that when $\hat{{\mathbf{A}}}_{i}(j,j)=1$ , the $j$ -th disc in $i$ -th block in matrix ${\mathbf{I}}_{m}\otimes{\mathbf{L}}_{c}$ is shifted.

We illustrate the relationship between matrix $\hat{{\mathbf{A}}}$ and $\tilde{{\mathbf{A}}}$ via a simple example. Assuming $m=3,{\mathbf{L}}_{c}\in\mathbb{R}^{2\times 2}$ , we have the following two matrices:

[TABLE]

where $l_{ij}$ are the elements in matrix ${\mathbf{L}}_{c}$ .

We first assign the Gershgorin discs in matrices ${\mathbf{L}}_{c}\otimes{\mathbf{I}}_{3}$ and ${\mathbf{I}}_{3}\otimes{\mathbf{L}}_{c}$ with indices, as illustrated in Fig. 2. Assuming $\tilde{{\mathbf{A}}}_{1}(3,3)=1$ , the disc $\{1,3\}$ in matrix ${\mathbf{L}}_{c}\otimes{\mathbf{I}}_{3}$ is shifted by $\hat{q}$ . After permutation, this means that the disc $\{3,1\}$ in matrix ${\mathbf{I}}_{3}\otimes{\mathbf{L}}_{c}$ is shifted, i.e., $\hat{{\mathbf{A}}}_{3}(1,1)=1$ . Thus, $\tilde{{\mathbf{A}}}_{j}(i,i)=1$ is equivalent to $\hat{{\mathbf{A}}}_{i}(j,j)=1$ .
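The same example can be reproduced numerically. The sketch below builds a permutation that maps the column-major index $i+m(j-1)$ to the row-major index $j+n(i-1)$ (written 0-based in the code), and checks both the permutation similarity and the correspondence between $\tilde{{\mathbf{A}}}_{1}(3,3)$ and $\hat{{\mathbf{A}}}_{3}(1,1)$. The specific 2-node ${\mathbf{L}}_{c}$ and the convention ${\mathbf{P}}(\cdot){\mathbf{P}}^{\top}$ are assumptions; the paper's ${\mathbf{P}}$ may be defined as the transpose of this one.

```python
import numpy as np

m, n = 3, 2
L_c = np.array([[1.0, -1.0],
                [-1.0, 1.0]])  # hypothetical 2-node column graph

# Permutation sending column-major index k = i + m*j to k' = j + n*i (0-based).
P = np.zeros((m * n, m * n))
for i in range(m):
    for j in range(n):
        P[j + n * i, i + m * j] = 1.0

lhs = P @ np.kron(L_c, np.eye(m)) @ P.T
rhs = np.kron(np.eye(m), L_c)

# Sample entry (3, 1) in 1-based indexing, i.e. A_tilde_1(3, 3) = 1.
A = np.zeros((m, n))
A[2, 0] = 1.0
A_tilde = np.diag(A.flatten(order="F"))
A_hat = P @ A_tilde @ P.T

# Locate the shifted disc in the permuted matrix: (block, position), 0-based.
block, pos = divmod(int(np.argmax(np.diag(A_hat))), n)
print(np.allclose(lhs, rhs), block, pos)
```

Here `block` and `pos` are 0-based, so the values 2 and 0 correspond to the third block and its first diagonal entry, matching $\hat{{\mathbf{A}}}_{3}(1,1)=1$.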

Therefore, sampling data ${\mathbf{X}}$ at $(i,j)$ will promote the smallest eigenvalue of $j$ -th ( $i$ -th) diagonal block of ${\mathbf{Q}}_{1}$ ( $\hat{{\mathbf{Q}}}_{2}$ ) by making $\tilde{{\mathbf{A}}}_{j}(i,i)=1$ ( $\hat{{\mathbf{A}}}_{i}(j,j)=1$ ). Given this connection, we next propose an iterative sampling strategy. Block matrices in ${\mathbf{Q}}_{1}$ and $\hat{{\mathbf{Q}}}_{2}$ will be called ‘clusters’ and ‘groups’ respectively.

V-B Iterative Sampling between Clusters and Groups

We here propose to alternately collect samples based on one cluster in ${\mathbf{Q}}_{1}$ only or one group in $\hat{{\mathbf{Q}}}_{2}$ only. Specifically, we start sampling the matrix signal from the first column ( $j=1$ ) based on the Laplacian matrix of the first cluster in ${\mathbf{Q}}_{1}$ . If its first eigenvector has the largest energy at the $i$ -th index, we will sample data at $(i,1)$ and then proceed to sample based on the corresponding incremented Laplacian matrix of the $i$ -th group in $\hat{{\mathbf{Q}}}_{2}$ . We continue to choose samples alternating between clusters and groups until the sampling budget is exhausted.

Since our GCS sampling strategy using LOBPCG benefits from warm start, the computation complexity of this iterative scheme can be further reduced if we choose more than one sample from the same cluster (or group). We thus introduce a warm start parameter $\zeta$ to trade off sampling performance and computation complexity; its sensitivity will be examined in Section VII. Detailed iterative sampling pseudo-code is shown in Algorithm 2, called iterative Gershgorin circle shift (IGCS)-based sampling. As we analyzed in Section IV, our proposed GCS has complexity ${\mathcal{O}}(KFmn)$ , while the complexity of IGCS is just ${\mathcal{O}}(K\hat{F}c)$ , where $\hat{F}$ is the number of LOBPCG iterations until convergence in IGCS, and $c=\max\{m,n\}$ . Therefore, the complexity is reduced by at least a factor $\min\{m,n\}$ using this iterative sampling framework, i.e., the complexity is roughly linear in the size of the factor graphs.
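The alternating scheme can be sketched as follows. This is a hedged illustration rather than a transcription of Algorithm 2: it assumes that the $j$ -th cluster is $q\tilde{{\mathbf{A}}}_{j}+\alpha{\mathbf{L}}_{r}$ and the $i$ -th group is $\hat{q}\hat{{\mathbf{A}}}_{i}+\beta{\mathbf{L}}_{c}$, uses a dense eigensolver instead of LOBPCG, moves to the block indexed by the last selected entry after $\zeta$ picks, and restarts from the first non-full column when a block is exhausted. These rules and all function names are assumptions.

```python
import numpy as np

def path_laplacian(num_nodes):
    """Combinatorial Laplacian of an unweighted path graph (illustrative choice)."""
    W = np.zeros((num_nodes, num_nodes))
    idx = np.arange(num_nodes - 1)
    W[idx, idx + 1] = W[idx + 1, idx] = 1.0
    return np.diag(W.sum(axis=1)) - W

def first_eigvec(M):
    """Eigenvector of the smallest eigenvalue (dense stand-in for LOBPCG)."""
    _, vecs = np.linalg.eigh(M)
    return vecs[:, 0]

def igcs_sampling(L_r, L_c, K, alpha=0.1, beta=0.1, q=0.5, zeta=1):
    """Alternate between clusters (columns) and groups (rows); return samples and mask."""
    m, n = L_r.shape[0], L_c.shape[0]
    if K > m * n:
        raise ValueError("sampling budget exceeds the number of matrix entries")
    A = np.zeros((m, n))
    samples = []
    use_cluster, idx = True, 0  # start from the cluster of the first column
    while len(samples) < K:
        line = A[:, idx] if use_cluster else A[idx, :]  # view into A
        if line.all():
            # Block exhausted: restart from the first column with a free entry.
            use_cluster, idx = True, int(np.argmax((A == 0).any(axis=0)))
            continue
        L_blk, shift = (alpha * L_r, q) if use_cluster else (beta * L_c, 1.0 - q)
        last = None
        for _ in range(zeta):
            if len(samples) == K or line.all():
                break
            phi = first_eigvec(L_blk + shift * np.diag(line))
            p = int(np.argmax(np.where(line == 0, np.abs(phi), -np.inf)))
            line[p] = 1.0  # writes through to A
            samples.append((p, idx) if use_cluster else (idx, p))
            last = p
        use_cluster, idx = (not use_cluster), last
    return samples, A

m, n, K = 6, 5, 10
samples, A = igcs_sampling(path_laplacian(m), path_laplacian(n), K, zeta=2)
print(samples)
```

Each eigenvector computation here involves only an $m\times m$ or $n\times n$ block, which is where the ${\mathcal{O}}(K\hat{F}\max\{m,n\})$ cost comes from for sparse factor graphs.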

VI Graph Construction

Before using graph sampling for MC via the proposed IGCS, one has to first acquire the row graph ${\mathbf{L}}_{r}$ and the column graph ${\mathbf{L}}_{c}$ . There exist many methods to construct finite row / column graphs from data, so that the observed signal(s) are smooth (low-pass) with respect to the constructed graphs [3, 11]. For completeness, we overview methods we chose to construct row and column graphs using which we select samples. We stress that our work focuses on sampling; the discussion here merely demonstrates that our graph sampling schemes can be practically realized in combination with existing graph learning methods.

$\bullet$ G1: Feature-based graph

As done in graph-based MC methods [16, 26], when the user / item profiles (e.g., age, gender and occupation of users and genre of the items) are available, we construct a weighted 10-nearest neighbor graph using GSPBox [51] based on feature vectors of each node.

$\bullet$ G2: Content-based graph from observed information

When features of data points are not available, we construct row and column graphs only from partial matrix entries, extending the method used in [3]. Specifically, the observed matrix is ${\mathbf{Z}}={\mathbf{A}}_{\Gamma}\circ{\mathbf{X}}$ for a given random initial set $\Gamma$ . Then, for each pair of users $\{i,j\}$ , their partial ratings are in the $i$ - and $j$ -th rows of matrix ${\mathbf{Z}}$ , denoted by ${\mathbf{z}}_{i}$ and ${\mathbf{z}}_{j}$ . We then compute the inter-node distance as

[TABLE]

where $\mathcal{R}_{ij}=\mathcal{R}_{i}\cap\mathcal{R}_{j}$ , and ${\mathcal{R}}_{i}$ is the set of items rated by user $i$ .

If $|{\mathcal{R}}_{ij}|=0$ , we set $d_{ij}=\infty$ . We then compute the edge weight between users $i$ and $j$ as

[TABLE]

where $d_{\min}=\min_{\{i,j\}}d_{ij}$ and $d_{s}$ is the threshold of user $i$ for sparsifying ${\mathbf{W}}_{r}$ ; $\gamma$ is a factor to control function shape for weight computation.

Likewise, the item graph is constructed using column ratings in observed matrix ${\mathbf{Z}}$ . In our experiments, for real-world datasets, we first assume partial ground-truth data is known, and then use them to construct the factor graphs based on the above method. With the constructed graphs, we then select samples using the different schemes and compute the completion error on the unobserved entries.

VII Experimentation

In this section, we present experimental results of our proposed sampling methods and other competing schemes, combined with several state-of-the-art MC strategies. We list profiles of the simulated datasets in Table I.

VII-A Experimental Setup

In all experiments, we set $\alpha=\beta=0.1$ for GCS and IGCS, and set $q=0.5$ for IGCS. We implement five sampling methods for comparison, whose specific settings are as follows:

• Graph weight coherence (GWC-random) [24]: the ‘estimated’ setting was used for computing the probability for random sampling without replacement. The bandwidth information was set to be 1000.

• Localized operator coverage (LOC) [23]: the bandwidth prior was set to be 1000.

• Product Graph-based sampling (PG) [12]: the dual graph bandwidth $(\eta_{1},\eta_{2})$ was set to be $(11,9)$ or $(9,11)$ for conducting structured sampling.

• LSS [9]: we set rank to be 5 for implementing this method.

The last competing method is uniform random sampling. In the following, we list the details for all simulated MC methods:

• IMC [56] and SVT [57] were simulated with the same settings as reproducible codes.

• GMC [3]: we set $\gamma_{n}=3$ , $\gamma_{r}=\alpha=0.1$ and $\gamma_{c}=\beta=0.1$ . For completion method in equation (4), we set $\gamma_{n}=0$ and keep other parameters the same.

• GRALS [25]: we set the rank to be 5 for GRALS, and its Laplacian matrix inputs were computed from ${\mathbf{L}}_{h}={\mathbf{L}}_{c}+0.1{\mathbf{I}}_{n}$ and ${\mathbf{L}}_{w}={\mathbf{L}}_{r}+0.1{\mathbf{I}}_{m}$ , as used in [12].

• GCMC [26]: the number of training epochs was set to be 1000.

• NMC [27]: the number of training epochs was set to be 10000.

VII-B Performance on Small Datasets

We first conduct experiments on small-size Synthetic Netflix datasets for performance comparison, where the matrix is completed by dual smoothness based method (4). For our proposed GCS, LOBPCG is employed with warm start to compute the first eigenvector of matrix ${\mathbf{L}}\in\mathbb{R}^{20000\times 20000}$ , while IGCS uses MATLAB’s built-in function (Krylov-Schur method) for eigen-decomposition since the factor graphs are small. Two classical graph sampling methods GWC-random [24] and LOC [23] are implemented on the product graph ${\mathbf{L}}_{p}=\mathbf{I}_{n}\otimes\mathbf{L}_{r}+\mathbf{L}_{c}\otimes\mathbf{I}_{m}$ directly. For structured method PG, with bandwidth input $(11,9)$ or $(9,11)$ , we artificially increase the parameter $L=|{\mathcal{L}}_{1}|+|{\mathcal{L}}_{2}|$ to get rectangular output, where ${\mathcal{L}}_{1}$ and ${\mathcal{L}}_{2}$ are its selected row and column indices, respectively. After sampling, we record its sample size $|{\mathcal{L}}_{1}|\times|{\mathcal{L}}_{2}|$ and corresponding reconstruction error. The root mean square error (RMSE) is computed on unobserved entries with respect to the ground-truth values for evaluation, as done in [12, 29, 13].
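For concreteness, the evaluation metric can be written as a short function. This is a minimal sketch; the function name `rmse_unobserved` and the convention that `mask` equals 1 on sampled entries are assumptions made for illustration.

```python
import numpy as np

def rmse_unobserved(X_true, X_hat, mask):
    """RMSE between X_hat and X_true over entries where mask == 0 (unobserved)."""
    unobserved = ~np.asarray(mask, dtype=bool)
    diff = (np.asarray(X_hat, dtype=float) - np.asarray(X_true, dtype=float))[unobserved]
    return float(np.sqrt(np.mean(diff ** 2)))

# Tiny hypothetical example: the first row is observed, the second is not.
X_true = np.array([[1.0, 2.0], [3.0, 4.0]])
X_hat = np.array([[1.0, 2.0], [5.0, 4.0]])
mask = np.array([[1, 1], [0, 0]])
err = rmse_unobserved(X_true, X_hat, mask)
print(err)
```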

Specific experimental results on noiseless and noisy synthetic Netflix dataset (noisy one is depicted in Fig. 3 (a)) in terms of sample size are shown in Fig. 3 (b) and (c). We observe that our proposed GCS outperforms all competitors especially when the sampling budget is small and the matrix signal is noisy. Though with bandwidth $(11,9)$ , PG has comparable RMSE value, its performance deteriorates drastically by just changing the bandwidth to $(9,11)$ . This means that PG is very sensitive to bandwidth settings. Further, PG cannot achieve arbitrary sample size due to its rigid sampling structure.

Our proposed iterative sampling method IGCS also achieves good performance when sample size is relatively large with complexity ${\mathcal{O}}(K\hat{F}\max\{m,n\})$ . Recall that the complexity of GCS is ${\mathcal{O}}(KFmn)$ and LOC is ${\mathcal{O}}(KJ)$ , where $J$ is the number of non-zero entries in matrix ${\mathbf{L}}^{d}$ . We know that $J={\mathcal{O}}(d_{\max}mn)$ , where $d_{\max}$ is the largest degree in product graph ${\mathbf{L}}$ . Hence IGCS has much lower complexity. Fig. 3 (d) further illustrates GCS’s superiority for different noise levels, where we remove some inferior competitors for better visualization.

We also test sampling methods on the dense submatrix ( $100\times 200$ ) from two real-world datasets ML100K and ML10M. As done in [3], by assuming that the information outside this submatrix is given as prior, we construct content-based graph G2 via strategy described in Section VI. Since the ground truth submatrix is sparse, the sampling method must be constrained to sample on the sparse entries. PG, being a structured sampling strategy, cannot satisfy this requirement. Further, LOC requires computation of a Chebyshev polynomial graph filter before sampling, which always results in out-of-memory error using our constructed graph. Thus, we show only executable sampling schemes for comparison. The resulting RMSE is shown in Fig. 3 (e) and (f), which illustrate IGCS’s superiority over GWC, GCS and random sampling for those constructed small real-world datasets.

VII-C Performance on Real-world Large Datasets

To actively sample entries on large real-world datasets, neither GCS nor GWC is applicable, since the product graph with Laplacian ${\mathbf{L}}\in\mathbb{R}^{mn\times mn}$ can contain millions of nodes. For the following real-world large datasets, we only test uniform random sampling and LSS [9] for performance comparison to our proposed IGCS.

$\bullet$ Simulations on Movielens 100K

We first deploy our proposed IGCS on a large real-world dataset ML100K [52] to collect samples. Since the typically used ratio between training and testing in ML100K is 80% to 20%, in our experiments, we first randomly select 60K samples from the 100K ratings as the initial available data, and then proceed to sample 20K from the remaining 40K data pool based on our proposed IGCS or random sampling. The final unselected 20K samples are used for computing RMSE. For this experiment, we use MATLAB’s built-in function for eigen-decomposition in IGCS.

We create the feature-based graph G1 from ML100K’s features and content-based graph G2 from 60K initial samples using the second method in Section VI. Average RMSE on different MC methods and graphs are listed in Table II, where the best performance number for each MC method is marked in boldface. In the widely used feature-based graph (G1), our proposed IGCS (right side) achieves better performance than random sampling (left side) for almost all popular MC methods. Further, when we select entries on the content-based graph (G2), IGCS substantially outperforms random sampling. Note that when G2 is used instead of G1, RMSE for random sampling is almost the same for every graph-based MC method. Hence, we can conclude that the performance improvement using G2 is due to the more informative samples chosen using our proposed IGCS.

$\bullet$ Warm start parameter’s effect on sampling time

In this experiment, we deploy IGCS on ML100K using different graphs and in combination with different MC methods. The resulting RMSE values and sampling times are shown in Table III, along with LSS for comparison. “eigs” means the eigen-decomposition in IGCS is computed using the Krylov-Schur method, while LOBPCG is used for different $\zeta$ . Note that LSS is essentially random sampling with a specified selection probability for each entry; thus we did not record its running time. For timing, all experiments are performed on a laptop with Intel Core i7-8750H and 16GB of RAM running Windows 10. In Table III, the best performance numbers for each method are marked in boldface. Table III shows that IGCS is always superior to LSS using different state-of-the-art MC methods under different graphs. It also shows that with increasing $\zeta$ , execution time of IGCS with LOBPCG decreases substantially, while the performance becomes slightly worse. Note that when $\zeta=1$ , there is no warm start in LOBPCG. Simulation results show that LOBPCG is more efficient than the Krylov-Schur method for computing the first eigenvector and achieves better performance for MC.

$\bullet$ Simulations on other popular real-world datasets

We next evaluate IGCS on various well-known real-world datasets, combined with GRALS MC method. Random sampling and LSS are simulated for comparison. Since features for constructing G1 are not available for most datasets, we use G2 as the underlying graph for sampling. For datasets Flixter, YahooMusic, Douban, random 90/10 training/test splits are used for simulations. Specifically, we first choose 80% entries in the given 90% training set as the initial samples to construct G2 and then use our IGCS method (or competing schemes) to sample entries to form a new 90% training set. RMSE is computed on final un-selected 10% entries. For other datasets, we first randomly generate 90/10 training/test split and then use the above-mentioned procedure to collect samples and compute RMSE. (For datasets ML1M and ML10M, the percentage of initial samples for constructing G2 is changed from 80% to 90%.)

Experimental RMSEs of different sampling methods are shown in Table IV. Table IV shows that IGCS outperforms random sampling and LSS on all datasets, which vary in data size, density and rating levels. Moreover, when the warm start parameter $\zeta$ in IGCS becomes larger, RMSE of IGCS deteriorates only slightly, but still significantly outperforms the competitors for almost all datasets.

VIII Conclusion

Pre-selection of entries for matrix completion is an important but under-addressed problem. In this paper, we propose a graph sampling strategy for matrix completion based on recurrent Gershgorin disc shift. Specifically, assuming that the target matrix signal is smooth with respect to dual graphs, we can complete the matrix via partial observations by solving a system of linear equations. To maximize the stability of the linear system, we select samples to maximize the smallest eigenvalue $\lambda_{\min}$ of the coefficient matrix, which is equivalent to minimizing the upper-bound of the reconstruction error. We tackle the formulated sampling objective with a greedy scheme to select one sample at a time (equivalent to shifting one Gershgorin disc). To achieve fast sampling, inspired by one corollary of the Gershgorin circle theorem, we select the node corresponding to the largest energy in the first eigenvector of the incremented Laplacian matrix. We employ LOBPCG to compute the first eigenvector of an incremented Laplacian matrix, which benefits from warm start as the first eigenvectors are computed repeatedly. To efficiently sample large real-world datasets, we further devise a block-wise graph sampling scheme, where the samples are collected alternately between blocks in two separate block-diagonal matrices. Extensive experiments have validated the superiority of our proposed graph sampling method for matrix completion, compared with other graph sampling and active matrix completion methods, in different datasets, under different graphs, and in combination with different popular matrix completion methods.

Acknowledgement

The authors would like to thank Wes Eardley for the great help in investigating available datasets for simulations.

Appendix A Derivations of linear equation

We first compute the derivative of $f({\mathbf{X}})$ with respect to the optimization variable $\mathbf{X}$ :

[TABLE]

whose vector form by using the property $\text{vec}({\mathbf{A}}{\mathbf{B}})=({\mathbf{I}}_{n}\otimes{\mathbf{A}})\text{vec}({\mathbf{B}})=({\mathbf{B}}^{\top}\otimes{\mathbf{I}}_{m})\text{vec}({\mathbf{A}})$ is:

[TABLE]

Note that $\mathbf{A}_{\Omega}\circ\mathbf{A}_{\Omega}=\mathbf{A}_{\Omega}$ and ${\mathbf{Y}}={\mathbf{A}}_{\Omega}\circ({\mathbf{X}}+{\mathbf{N}})$ , so $\text{vec}(\mathbf{A}_{\Omega}\circ\mathbf{Y})=\text{vec}({\mathbf{Y}})$ . To obtain an optimal solution, we set ${\partial f(\mathbf{X})}/{\partial\mathbf{X}}={\mathbf{0}}$ , which in vector form leads to

[TABLE]

Appendix B The property of matrix ${\mathbf{Q}}$

Note that $\lambda_{\min}(\mathbf{I}_{n}\otimes\mathbf{L}_{r})=0$ since $\lambda_{\min}({\mathbf{L}}_{r})=0$ and $\lambda_{\min}(\mathbf{L}_{c}\otimes\mathbf{I}_{m})=\lambda_{\min}(\mathbf{I}_{m}\otimes\mathbf{L}_{c})=0$ , so $\lambda_{\min}$ of $\alpha\mathbf{I}_{n}\otimes\mathbf{L}_{r}+\beta\mathbf{L}_{c}\otimes\mathbf{I}_{m}$ is at least 0 based on Weyl’s inequality on eigenvalues that $\lambda_{\min}({\mathbf{A}}+{\mathbf{B}})\geq\lambda_{\min}({\mathbf{A}})+\lambda_{\min}({\mathbf{B}})$ [49]. Moreover, the vectorized sampling operator $\tilde{{\mathbf{A}}}_{\Omega}$ is positive semi-definite (PSD). If matrix ${\mathbf{Q}}$ is invertible with enough samples in matrix $\tilde{{\mathbf{A}}}_{\Omega}$ , we know that ${\mathbf{Q}}=\tilde{{\mathbf{A}}}_{\Omega}+\alpha\mathbf{I}_{n}\otimes\mathbf{L}_{r}+\beta\mathbf{L}_{c}\otimes\mathbf{I}_{m}$ is sparse, symmetric and positive definite (PD), and the optimal solution to problem (73) in closed form is:

$$\text{vec}({\mathbf{X}}^{*})={\mathbf{Q}}^{-1}\text{vec}({\mathbf{Y}}).$$
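A small numerical illustration of these properties follows. It is a sketch under assumptions: the objective is taken to have the form $f({\mathbf{X}})=\frac{1}{2}\|{\mathbf{A}}_{\Omega}\circ({\mathbf{X}}-{\mathbf{Y}})\|_{F}^{2}+\frac{\alpha}{2}\text{Tr}({\mathbf{X}}^{\top}{\mathbf{L}}_{r}{\mathbf{X}})+\frac{\beta}{2}\text{Tr}({\mathbf{X}}{\mathbf{L}}_{c}{\mathbf{X}}^{\top})$, which is consistent with the definition of ${\mathbf{Q}}$ above but is not restated in this appendix; the path graphs, the random mask, and the toy signal are hypothetical, and a dense solver stands in for a sparse one.

```python
import numpy as np

def path_laplacian(k):
    """Combinatorial Laplacian D - W of an unweighted path graph on k nodes."""
    W = np.zeros((k, k))
    idx = np.arange(k - 1)
    W[idx, idx + 1] = 1.0
    W[idx + 1, idx] = 1.0
    return np.diag(W.sum(axis=1)) - W

def vec(M):
    """Stack the columns of M into one vector (column-major order)."""
    return M.reshape(-1, order="F")

m, n, alpha, beta = 5, 4, 0.5, 0.5
L_r, L_c = path_laplacian(m), path_laplacian(n)
L = alpha * np.kron(np.eye(n), L_r) + beta * np.kron(L_c, np.eye(m))

rng = np.random.default_rng(1)
X_true = np.outer(np.linspace(1, 2, m), np.linspace(1, 3, n))  # smooth toy signal
mask = (rng.random((m, n)) < 0.4).astype(float)
mask[0, 0] = 1.0                  # guarantee at least one observed entry
Y = mask * X_true                 # noiseless observations, zero where unobserved

Q = np.diag(vec(mask)) + L
lam_min_L = np.linalg.eigvalsh(L)[0]   # 0: the Laplacian part alone is singular
lam_min_Q = np.linalg.eigvalsh(Q)[0]   # > 0 once samples are added

# Closed-form solution vec(X*) = Q^{-1} vec(Y), computed with a linear solve.
X_star = np.linalg.solve(Q, vec(Y)).reshape((m, n), order="F")

# Gradient of the assumed objective; it should vanish at X*.
grad = mask * (X_star - Y) + alpha * L_r @ X_star + beta * X_star @ L_c
print(lam_min_L, lam_min_Q, np.abs(grad).max())
```

For real datasets ${\mathbf{Q}}$ is large and sparse, so an iterative sparse solver would be the natural replacement for the dense `numpy.linalg.solve` call.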

Appendix C The upper-bound of $\lambda_{\max}({\mathbf{Q}})$

Reusing the notations in Section IV, let ${\mathbf{L}}=\alpha{\mathbf{I}}_{n}\otimes{\mathbf{L}}_{r}+\beta{\mathbf{L}}_{c}\otimes{\mathbf{I}}_{m}$ and ${\mathbf{Q}}=\tilde{{\mathbf{A}}}_{\Omega}+{\mathbf{L}}$. Specifically,

[TABLE]

For $k=i+m\times(j-1)$ with $i\in{\mathcal{V}}_{r}$ and $j\in{\mathcal{V}}_{c}$, the $k$-th row of matrix ${\mathbf{L}}$ (denoted by ${\mathbf{s}}_{k}$) is a combination of the $i$-th row of ${\mathbf{L}}_{r}$ (denoted by ${\mathbf{v}}_{i}$) and the $j$-th row of ${\mathbf{L}}_{c}$ (denoted by ${\mathbf{t}}_{j}$). It is easy to see that ${\mathbf{s}}_{k}(k)=\alpha{\mathbf{v}}_{i}(i)+\beta{\mathbf{t}}_{j}(j)$ and $\sum^{mn}_{l=1;l\neq k}|{\mathbf{s}}_{k}(l)|=\alpha\sum^{m}_{l=1;l\neq i}|{\mathbf{v}}_{i}(l)|+\beta\sum^{n}_{l=1;l\neq j}|{\mathbf{t}}_{j}(l)|$. Since ${\mathbf{v}}_{i}(i)=\sum^{m}_{l=1;l\neq i}|{\mathbf{v}}_{i}(l)|$ and ${\mathbf{t}}_{j}(j)=\sum^{n}_{l=1;l\neq j}|{\mathbf{t}}_{j}(l)|$ by the definitions of the combinatorial Laplacian matrices ${\mathbf{L}}_{r}$ and ${\mathbf{L}}_{c}$, we know that ${\mathbf{s}}_{k}(k)=\sum^{mn}_{l=1;l\neq k}|{\mathbf{s}}_{k}(l)|,\forall k\in\{1,2,\dots,mn\}$. That is, every Gershgorin disc of ${\mathbf{L}}$ has a radius equal to its center, so each disc spans $[0,2{\mathbf{s}}_{k}(k)]$. Based on the Gershgorin circle theorem presented in Section IV-A, the eigenvalues of matrix ${\mathbf{L}}$ are therefore all bounded in $[0,2\max_{k}\{{\mathbf{s}}_{k}(k)\}]$, where 0 is the lower bound of all left ends of the Gershgorin discs, and $2\max_{k}\{{\mathbf{s}}_{k}(k)\}$ is the upper bound of all right ends of those discs. Note that ${\mathbf{v}}_{i}(i)={\mathbf{D}}_{r}(i,i)$ and ${\mathbf{t}}_{j}(j)={\mathbf{D}}_{c}(j,j)$ for connected graphs without self-loops. Assuming $\max_{i}\{{\mathbf{D}}_{r}(i,i)\}\leq d_{r}$ and $\max_{j}\{{\mathbf{D}}_{c}(j,j)\}\leq d_{c}$ (degree-constrained graphs), we have

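$$\max_{k}\{{\mathbf{s}}_{k}(k)\}=\max_{i,j}\{\alpha{\mathbf{D}}_{r}(i,i)+\beta{\mathbf{D}}_{c}(j,j)\}\leq\alpha d_{r}+\beta d_{c}.$$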

Therefore, $\lambda_{\max}({\mathbf{L}})$ is upper-bounded by $2\alpha d_{r}+2\beta d_{c}$. Because $\tilde{{\mathbf{A}}}_{\Omega}$ is a diagonal matrix with diagonal entries 0 or 1, adding it shifts each disc center to the right by at most 1, so $\lambda_{\max}({\mathbf{Q}})$ is upper-bounded by $2\alpha d_{r}+2\beta d_{c}+1$ if the two factor graphs are degree-bounded by $d_{r}$ and $d_{c}$ respectively.
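The bound can be verified numerically on a toy example. The following sketch assumes two unweighted ring graphs as factor graphs (so that $d_{r}=d_{c}=2$) and a random 0/1 sampling pattern; the helper names `gershgorin_interval` and `ring_laplacian` and all sizes are hypothetical.

```python
import numpy as np

def gershgorin_interval(M):
    """Return (lowest left end, highest right end) over all Gershgorin discs of M."""
    centers = np.diag(M)
    radii = np.abs(M).sum(axis=1) - np.abs(centers)
    return (centers - radii).min(), (centers + radii).max()

def ring_laplacian(k):
    """Combinatorial Laplacian of an unweighted ring graph (k >= 3, all degrees 2)."""
    W = np.zeros((k, k))
    idx = np.arange(k)
    W[idx, (idx + 1) % k] = 1.0
    W[(idx + 1) % k, idx] = 1.0
    return np.diag(W.sum(axis=1)) - W

m, n, alpha, beta = 6, 5, 0.7, 0.3
d_r = d_c = 2.0                                   # maximum degrees of the rings
L = alpha * np.kron(np.eye(n), ring_laplacian(m)) \
    + beta * np.kron(ring_laplacian(n), np.eye(m))

rng = np.random.default_rng(2)
a = (rng.random(m * n) < 0.3).astype(float)       # diagonal of the sampling matrix
Q = L + np.diag(a)

left_L, right_L = gershgorin_interval(L)
bound_L = 2 * alpha * d_r + 2 * beta * d_c        # bound on lambda_max(L)
bound_Q = bound_L + 1                             # bound on lambda_max(Q)
lam_max_L = np.linalg.eigvalsh(L)[-1]
lam_max_Q = np.linalg.eigvalsh(Q)[-1]
print(left_L, right_L, lam_max_L, lam_max_Q, bound_Q)
```

The left end of every disc of ${\mathbf{L}}$ sits at 0 and the right ends never exceed $2\alpha d_{r}+2\beta d_{c}$, in line with the argument above.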

Bibliography

[1] G. Liu, Q. Liu, and X. Yuan, “A new theory for matrix completion,” in Advances in Neural Information Processing Systems, 2017, pp. 785–794.
[2] E. J. Candès and B. Recht, “Exact matrix completion via convex optimization,” Foundations of Computational Mathematics, vol. 9, no. 6, p. 717, 2009.
[3] V. Kalofolias, X. Bresson, M. Bronstein, and P. Vandergheynst, “Matrix completion on graphs,” arXiv preprint arXiv:1408.1717, 2014.
[4] E. Candes and Y. Plan, “Matrix completion with noise,” in Proceedings of the IEEE, vol. 98, no. 6, April 2010, pp. 925–936.
[5] N. Rubens, M. Elahi, M. Sugiyama, and D. Kaplan, “Active learning in recommender systems,” in Recommender systems handbook. Springer, 2015, pp. 809–846.
[6] S. Chakraborty, J. Zhou, V. Balasubramanian, S. Panchanathan, I. Davidson, and J. Ye, “Active matrix completion,” in 2013 IEEE 13th International Conference on Data Mining. IEEE, 2013, pp. 81–90.
[7] A. Krishnamurthy and A. Singh, “Low-rank matrix and tensor completion via adaptive sampling,” in Advances in Neural Information Processing Systems, 2013, pp. 836–844.
[8] S.-J. Huang, M. Xu, M.-K. Xie, M. Sugiyama, G. Niu, and S. Chen, “Active feature acquisition with supervised matrix completion,” in Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. ACM, 2018, pp. 1571–1579.
