Generalization error bounds for kernel matrix completion and extrapolation

Pere Giménez-Febrer, Alba Pagès-Zamora, and Georgios B. Giannakis

arXiv:1906.08770·stat.ML·April 22, 2020


TL;DR

This paper analyzes the generalization error bounds for kernel matrix completion methods that incorporate prior information via reproducing kernel Hilbert spaces, supported by numerical experiments.

Contribution

It provides theoretical error bounds for kernel-based matrix completion and extrapolation methods, enhancing understanding of their reliability.

Findings

01. Theoretical error bounds are derived for kernel matrix completion.

02. Numerical tests confirm the accuracy of the theoretical bounds.

03. Incorporating prior information improves matrix completion performance.

Abstract

Prior information can be incorporated in matrix completion to improve estimation accuracy and extrapolate the missing entries. Reproducing kernel Hilbert spaces provide tools to leverage the said prior information, and derive more reliable algorithms. This paper analyzes the generalization error of such approaches, and presents numerical tests confirming the theoretical results.

Equations

$$\{\hat{\bm{W}},\hat{\bm{H}}\}=\arg\min_{\bm{W}\in\mathbb{R}^{N\times p},\,\bm{H}\in\mathbb{R}^{L\times p}}\|P_{\mathcal{S}_{m}}(\bm{M}-\bm{W}\bm{H}^{T})\|_{\text{F}}^{2}+\mu\left(\|\bm{W}\|_{\text{F}}^{2}+\|\bm{H}\|_{\text{F}}^{2}\right)\tag{1}$$

$$\{\hat{\bm{W}},\hat{\bm{H}}\}=\arg\min_{\bm{W}\in\mathbb{R}^{N\times p},\,\bm{H}\in\mathbb{R}^{L\times p}}\|P_{\mathcal{S}_{m}}(\bm{M}-\bm{W}\bm{H}^{T})\|_{\text{F}}^{2}+\mu\left(\text{Tr}(\bm{W}^{T}\bm{K}_{w}^{-1}\bm{W})+\text{Tr}(\bm{H}^{T}\bm{K}_{h}^{-1}\bm{H})\right)\tag{2}$$

$$\mathcal{H}_{f}:=\left\{f:\>f(x,y)=\sum_{n=1}^{N}\sum_{l=1}^{L}d_{n,l}\,\kappa_{f}((x,x_{n}),(y,y_{l})),\>d_{n,l}\in\mathbb{R}\right\}$$

$$\overline{\bm{m}}=\bm{S}\bm{f}+\bar{\bm{e}}=\bm{S}\bm{K}_{f}\bm{d}+\bar{\bm{e}}$$

$$\hat{\bm{d}}=\arg\min_{\bm{d}\in\mathbb{R}^{NL}}\|\overline{\bm{m}}-\bm{S}\bm{K}_{f}\bm{d}\|_{2}^{2}+\mu\,\bm{d}^{T}\bm{K}_{f}\bm{d}\tag{4}$$

$$\hat{\bar{\bm{d}}}=(\bm{S}\bm{K}_{f}\bm{S}^{T}+\mu\bm{I})^{-1}\overline{\bm{m}}\tag{5}$$

$$\hat{\bar{\bm{d}}}=\arg\min_{\bar{\bm{d}}\in\mathbb{R}^{m}}\|\overline{\bm{m}}-\bar{\bm{K}}_{f}\bar{\bm{d}}\|_{2}^{2}+\mu\,\bar{\bm{d}}^{T}\bar{\bm{K}}_{f}\bar{\bm{d}}\tag{6}$$

$$\hat{\bm{F}}=\arg\min_{\bm{F}\in\mathcal{F}}\frac{1}{m}\sum_{(i,j)\in\mathcal{S}_{m}}l(\bm{M}_{i,j},\bm{F}_{i,j})\tag{7}$$

$$\frac{1}{u}\sum_{(i,j)\in\mathcal{S}_{u}}l(\bm{M}_{i,j},\hat{\bm{F}}_{i,j})-\frac{1}{m}\sum_{(i,j)\in\mathcal{S}_{m}}l(\bm{M}_{i,j},\hat{\bm{F}}_{i,j})\tag{8}$$

$$R_{n}(\mathcal{F})=q\,\mathbb{E}_{\sigma}\left\{\sup_{\bm{F}\in\mathcal{F}}\sum_{(i,j)\in\mathcal{S}_{n}}\sigma_{i,j}\bm{F}_{i,j}\right\}\tag{9}$$

$$\frac{1}{u}\sum_{(i,j)\in\mathcal{S}_{u}}l(\bm{M}_{i,j},\bm{F}_{i,j})-\frac{1}{m}\sum_{(i,j)\in\mathcal{S}_{m}}l(\bm{M}_{i,j},\bm{F}_{i,j})\leq R_{n}(l\circ\mathcal{F})+5.05\,q\sqrt{\min(m,u)}+\sqrt{2q\ln(1/\delta)}\tag{10}$$

$$R_{n}(\mathcal{F}_{MC})\leq q\,\mathbb{E}_{\sigma}\left\{\sup_{\bm{F}\in\mathcal{F}_{MC}}\|\bm{\Sigma}\|_{2}\|\bm{F}\|_{*}\right\}\leq Gqt\left(\sqrt{N}+\sqrt{L}\right)\tag{11}$$

$$R_{n}(\mathcal{F}_{K})\leq\lambda_{\max}Gqt_{B}\left(\sqrt{N}+\sqrt{L}\right)\tag{12}$$

$$\begin{aligned}\|\bm{F}\|_{*}&=\frac{1}{2}\left(\|\bm{W}\|_{\text{F}}^{2}+\|\bm{H}\|_{\text{F}}^{2}\right)=\frac{1}{2}\left(\text{Tr}(\bm{B}^{T}\bm{K}_{w}^{2}\bm{B})+\text{Tr}(\bm{C}^{T}\bm{K}_{h}^{2}\bm{C})\right)\\&\leq\frac{\lambda_{\max}}{2}\left[\text{Tr}(\bm{B}^{T}\bm{K}_{w}\bm{B})+\text{Tr}(\bm{C}^{T}\bm{K}_{h}\bm{C})\right]\leq\frac{\lambda_{\max}t_{B}}{2}\end{aligned}\tag{13}$$

$$\|P_{\mathcal{S}_{m}}(\bm{M}-\bm{\Phi}_{w}\bm{\Phi}_{w}^{T}\bm{B}\bm{C}^{T}\bm{\Phi}_{h}\bm{\Phi}_{h}^{T})\|_{\text{F}}^{2}+\mu\left(\text{Tr}(\bm{B}^{T}\bm{\Phi}_{w}\bm{\Phi}_{w}^{T}\bm{B})+\text{Tr}(\bm{C}^{T}\bm{\Phi}_{h}\bm{\Phi}_{h}^{T}\bm{C})\right)\tag{14}$$

$$=\|P_{\mathcal{S}_{m}}(\bm{M}-\bm{\Phi}_{w}\bm{A}_{w}\bm{A}_{h}^{T}\bm{\Phi}_{h}^{T})\|_{\text{F}}^{2}+\mu\left(\|\bm{A}_{w}\|_{\text{F}}^{2}+\|\bm{A}_{h}\|_{\text{F}}^{2}\right)\tag{15}$$

$$R_{n}(\mathcal{F}_{I})\leq q\sqrt{t_{w}t_{h}\text{Tr}(\bm{S}_{n}\bm{K}\bm{S}_{n}^{T})}\tag{16}$$

$$\begin{aligned}R_{n}(\mathcal{F}_{I})&=q\,\mathbb{E}_{\sigma}\left\{\sup_{b_{w}\leq t_{w},\,b_{h}\leq t_{h}}\bm{\sigma}^{T}\text{vec}(\bm{\Phi}_{w}\bm{A}_{w}\bm{A}_{h}^{T}\bm{\Phi}_{h}^{T})\right\}\\&=q\,\mathbb{E}_{\sigma}\left\{\sup_{b_{w}\leq t_{w},\,b_{h}\leq t_{h}}\bm{\sigma}^{T}(\bm{\Phi}_{h}\otimes\bm{\Phi}_{w})\text{vec}(\bm{A}_{w}\bm{A}_{h}^{T})\right\}\\&\leq q\,\mathbb{E}_{\sigma}\left\{\sup_{b_{w}\leq t_{w},\,b_{h}\leq t_{h}}\|\bm{\sigma}^{T}(\bm{\Phi}_{h}\otimes\bm{\Phi}_{w})\|_{2}\,\|\text{vec}(\bm{A}_{w}\bm{A}_{h}^{T})\|_{2}\right\}\\&=q\,\mathbb{E}_{\sigma}\left\{\sup_{b_{w}\leq t_{w},\,b_{h}\leq t_{h}}\sqrt{\bm{\sigma}^{T}\bm{K}\bm{\sigma}}\,\|\bm{A}_{w}\bm{A}_{h}^{T}\|_{\text{F}}\right\}\\&\leq q\,\mathbb{E}_{\sigma}\left\{\sup_{b_{w}\leq t_{w},\,b_{h}\leq t_{h}}\sqrt{\bm{\sigma}^{T}\bm{K}\bm{\sigma}}\,\|\bm{A}_{w}\|_{\text{F}}\|\bm{A}_{h}^{T}\|_{\text{F}}\right\}\\&\leq q\sqrt{t_{w}t_{h}\,\mathbb{E}_{\sigma}\{\bm{\sigma}^{T}\bm{K}\bm{\sigma}\}}=q\sqrt{t_{w}t_{h}\text{Tr}(\bm{S}_{n}\bm{K}\bm{S}_{n}^{T})}\end{aligned}$$

$$R_{n}(\mathcal{F}_{R})\leq qb\sqrt{\text{Tr}(\bm{S}_{n}\bm{K}_{f}\bm{S}^{T}\bar{\bm{K}}_{f}^{-1}\bm{S}\bm{K}_{f}\bm{S}_{n}^{T})}\tag{17}$$

$$\begin{aligned}R_{n}(\mathcal{F}_{R})&=q\,\mathbb{E}_{\sigma}\left\{\sup_{\bar{\bm{d}}^{T}\bar{\bm{K}}_{f}\bar{\bm{d}}\leq b^{2}}\bm{\sigma}^{T}\bm{K}_{f}\bm{S}^{T}\bar{\bm{d}}\right\}\\&=q\,\mathbb{E}_{\sigma}\left\{\sup_{\bar{\bm{d}}^{T}\bar{\bm{K}}_{f}\bar{\bm{d}}\leq b^{2}}\bm{\sigma}^{T}\bm{K}_{f}\bm{S}^{T}\bar{\bm{K}}_{f}^{-\frac{1}{2}}\bar{\bm{K}}_{f}^{\frac{1}{2}}\bar{\bm{d}}\right\}\\&\leq q\,\mathbb{E}_{\sigma}\left\{\sup_{\bar{\bm{d}}^{T}\bar{\bm{K}}_{f}\bar{\bm{d}}\leq b^{2}}\|\bm{\sigma}^{T}\bm{K}_{f}\bm{S}^{T}\bar{\bm{K}}_{f}^{-\frac{1}{2}}\|_{2}\,\|\bar{\bm{K}}_{f}^{\frac{1}{2}}\bar{\bm{d}}\|_{2}\right\}\\&\leq qb\,\mathbb{E}_{\sigma}\left\{\|\bm{\sigma}^{T}\bm{K}_{f}\bm{S}^{T}\bar{\bm{K}}_{f}^{-\frac{1}{2}}\|_{2}\right\}\\&\leq qb\sqrt{\text{Tr}(\bm{S}_{n}\bm{K}_{f}\bm{S}^{T}\bar{\bm{K}}_{f}^{-1}\bm{S}\bm{K}_{f}\bm{S}_{n}^{T})}\end{aligned}$$


Full text


P. Giménez-Febrer and A. Pagès-Zamora are with the SPCOM Group, Universitat Politècnica de Catalunya-Barcelona Tech, Spain.

G. B. Giannakis is with the Dept. of ECE and Digital Technology Center, University of Minnesota, USA.

This work is supported by ERDF funds (TEC2013-41315-R and TEC2016-75067-C4-2), the Catalan Government (2017 SGR 578), and NSF grants (1500713, 1514056, 1711471 and 1509040).


I Introduction

Matrix completion (MC) deals with the recovery of missing entries in a matrix – a task emerging in several applications such as image restoration [1], collaborative filtering [2] or positioning [3]. MC relies on the low rank of data matrices to enable reliable, even exact [4], recovery of the full unknown matrix. Exploiting this property, mainstream approaches to MC involve the minimization of the nuclear norm [5, 6] or a surrogate involving the data matrix factorization into a product of two low-rank matrices [7, 8].

One main assumption in the aforementioned approaches to MC is that the unknown matrix is incoherent, meaning that the entries of its singular vectors are uniformly distributed, which rules out matrices with structured form. For instance, data matrices with clustered form lead to segmented singular vectors that violate the incoherence assumption. Such structures may be induced by prior information embedded in, e.g., graphs [9], dictionaries [10], or heuristic assumptions [11]. The main approaches to MC with prior information either leverage it through proper regularization [12, 13, 14, 15] or restrict the solution space [16, 17, 18, 19]. Most of these approaches can be unified using a reproducing kernel Hilbert space (RKHS) framework [17, 18], which provides theoretical tools to exploit prior information.

When analyzing the performance of MC algorithms, several works, e.g. [20, 5, 2, 16], focus on the derivation of sample complexity bounds; that is, the evolution of the distance to the optimum across the number of samples and iterations. Other analyses are based on the generalization error (GE) [21, 22, 23], a metric that measures the difference between the value of the loss function applied to a training dataset, and its expected value [24]. When the probability distribution of the data is unknown, the expected value is replaced by the average loss on a testing dataset [25]. Due to the potentially large matrix sizes and the small size of the training dataset, it is important that the estimated matrix exhibits low GE in order to prevent overfitting.

In [18], we introduced a novel Kronecker kernel matrix completion and extrapolation (KKMCEX) algorithm for MC. This algorithm relies on kernel ridge regression with the number of coefficients equal to the number of observations, which makes it attractive for imputing matrices from a minimal number of observations. The present paper provides a GE analysis for MC with prior information, and establishes that, unlike other MC approaches, the GE of KKMCEX does not depend on the matrix size, which makes it more reliable when dealing with few observations.

II MC with prior information

Consider a matrix $\bm{M}=\bm{F}+\bm{E}$ , where $\bm{F}\in\mathbb{R}^{N\times L}$ denotes an unknown rank- $r$ matrix, and $\bm{E}$ is a noise matrix. We can only observe a subset of the entries in $\bm{M}$ whose indices are given by the sampling set $\mathcal{S}_{m}\subseteq\{1,\ldots,N\}\times\{1,\ldots,L\}$ of cardinality $m=|\mathcal{S}_{m}|$ . Factorizing the unknown matrix as $\bm{F}=\bm{W}\bm{H}^{T}$ , where $\bm{W}\in\mathbb{R}^{N\times p}$ , $\bm{H}\in\mathbb{R}^{L\times p}$ and $p\geq r$ , the unknown entries can be recovered by estimating

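$$\{\hat{\bm{W}},\hat{\bm{H}}\}=\arg\min_{\bm{W}\in\mathbb{R}^{N\times p},\,\bm{H}\in\mathbb{R}^{L\times p}}\|P_{\mathcal{S}_{m}}(\bm{M}-\bm{W}\bm{H}^{T})\|_{\text{F}}^{2}+\mu\left(\|\bm{W}\|_{\text{F}}^{2}+\|\bm{H}\|_{\text{F}}^{2}\right)\tag{1}$$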

where $P_{\mathcal{S}_{m}}(\cdot)$ denotes an operator that sets to zero the entries with index $(i,j)\notin\mathcal{S}_{m}$ and leaves the rest unchanged, while $\mu$ is a regularization scalar. Hereafter we refer to (1) as the base MC formulation, which can also be written with the nuclear norm as a regularizer through the property $\left|\left|\bm{F}\right|\right|_{*}=\min_{\bm{F}=\bm{W}\bm{H}^{T}}{1\over 2}\left(\left|\left|\bm{W}\right|\right|_{\text{F}}^{2}+\left|\left|\bm{H}\right|\right|_{\text{F}}^{2}\right)$ [22].
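The nuclear-norm property quoted above can be checked numerically. The following is a minimal sketch, not part of the paper: it assumes NumPy, and the helper names `balanced_factors` and `masked_objective` are hypothetical. It builds the factorization $\bm{W}=\bm{U}\bm{\Sigma}^{1/2}$, $\bm{H}=\bm{V}\bm{\Sigma}^{1/2}$ from the SVD, for which the factor penalty equals the nuclear norm, and evaluates the base MC objective on a boolean observation mask.

```python
import numpy as np

def balanced_factors(F):
    """Return W, H with F = W @ H.T, built from the SVD of F."""
    U, s, Vt = np.linalg.svd(F, full_matrices=False)
    root = np.sqrt(s)
    return U * root, Vt.T * root          # scale columns by sqrt(singular values)

def masked_objective(M, mask, W, H, mu):
    """Squared error on observed entries plus the Frobenius-norm regularizer."""
    residual = np.where(mask, M - W @ H.T, 0.0)   # plays the role of P_{S_m}
    return np.sum(residual**2) + mu * (np.sum(W**2) + np.sum(H**2))

rng = np.random.default_rng(0)
F = rng.standard_normal((8, 3)) @ rng.standard_normal((3, 6))   # rank-3 example
W, H = balanced_factors(F)
nuclear_norm = np.linalg.norm(F, ord="nuc")
factor_penalty = 0.5 * (np.sum(W**2) + np.sum(H**2))
```

For this particular factorization `factor_penalty` matches `nuclear_norm` up to rounding; other factorizations of the same matrix can only give a larger penalty.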

While the base MC formulation makes no use of prior information, kernel (K)MC incorporates such knowledge by means of kernel functions that measure similarities between points in their input spaces. Let $\mathcal{X}:=\{x_{1},\ldots,x_{N}\}$ and $\mathcal{Y}:=\{y_{1},\ldots,y_{L}\}$ be spaces of entities with one-to-one correspondence with the rows and columns of $\bm{F}$ , respectively. Given the input spaces $\mathcal{X}$ and $\mathcal{Y}$ , KMC defines the pair of RKHSs $\mathcal{H}_{w}:=\left\{w:\>w(x)=\sum\nolimits_{n=1}^{N}b_{n}\kappa_{w}(x,x_{n}),\>b_{n}\in\mathbb{R}\right\}$ and $\mathcal{H}_{h}:=\left\{h:\>h(y)=\sum\nolimits_{l=1}^{L}c_{l}\kappa_{h}(y,y_{l}),\>c_{l}\in\mathbb{R}\right\}$ , where $\kappa_{w}:\mathcal{X}\times\mathcal{X}\rightarrow\mathbb{R}$ and $\kappa_{h}:\mathcal{Y}\times\mathcal{Y}\rightarrow\mathbb{R}$ are kernel functions. Then, KMC postulates that the columns of the factor matrices in (1) are functions in $\mathcal{H}_{w}$ and $\mathcal{H}_{h}$ . Thus, we write $\bm{W}=\bm{K}_{w}\bm{B}$ and $\bm{H}=\bm{K}_{h}\bm{C}$ , where $\bm{B}$ and $\bm{C}$ are coefficient matrices, while $\bm{K}_{w}\in\mathbb{R}^{N\times N}$ and $\bm{K}_{h}\in\mathbb{R}^{L\times L}$ are the kernel matrices with entries $(\bm{K}_{w})_{i,j}=\kappa_{w}(x_{i},x_{j})$ and $(\bm{K}_{h})_{i,j}=\kappa_{h}(y_{i},y_{j})$ . The KMC formulations proposed in [17, 14] recover the factor matrices as

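$$\{\hat{\bm{W}},\hat{\bm{H}}\}=\arg\min_{\bm{W}\in\mathbb{R}^{N\times p},\,\bm{H}\in\mathbb{R}^{L\times p}}\|P_{\mathcal{S}_{m}}(\bm{M}-\bm{W}\bm{H}^{T})\|_{\text{F}}^{2}+\mu\left(\text{Tr}(\bm{W}^{T}\bm{K}_{w}^{-1}\bm{W})+\text{Tr}(\bm{H}^{T}\bm{K}_{h}^{-1}\bm{H})\right).\tag{2}$$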

The coefficient matrices are obtained as $\hat{\bm{B}}\!=\!\bm{K}_{w}^{-1}\hat{\bm{W}}$ and $\hat{\bm{C}}\!=\!\bm{K}_{h}^{-1}\hat{\bm{H}}$ , although this step is usually omitted [17, 14].
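To illustrate how the KMC regularizer relates the factor $\bm{W}$ to its coefficients $\bm{B}$, the sketch below checks that $\text{Tr}(\bm{W}^{T}\bm{K}_{w}^{-1}\bm{W})=\text{Tr}(\bm{B}^{T}\bm{K}_{w}\bm{B})$ when $\bm{W}=\bm{K}_{w}\bm{B}$. The Gaussian kernel, its bandwidth, and the random feature vectors are assumptions made only for this example; the paper does not prescribe them here.

```python
import numpy as np

def gaussian_kernel(X, bandwidth=1.0):
    """Gaussian kernel matrix for the rows of X (hypothetical kernel choice)."""
    sq = np.sum(X**2, axis=1)
    dist2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T
    return np.exp(-np.maximum(dist2, 0.0) / (2.0 * bandwidth**2))

rng = np.random.default_rng(1)
N, p = 6, 2
K_w = gaussian_kernel(rng.standard_normal((N, 3)))   # N x N kernel matrix
B = rng.standard_normal((N, p))                      # coefficient matrix
W = K_w @ B                                          # factor matrix in the RKHS

penalty_from_W = np.trace(W.T @ np.linalg.solve(K_w, W))   # Tr(W^T K_w^{-1} W)
penalty_from_B = np.trace(B.T @ K_w @ B)                   # Tr(B^T K_w B)
B_recovered = np.linalg.solve(K_w, W)                      # B = K_w^{-1} W
```

Using `np.linalg.solve` instead of an explicit inverse is a design choice for numerical stability; it requires $\bm{K}_{w}$ to be invertible, as the formula for $\hat{\bm{B}}$ already does.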

Algorithms solving (1) and (2) rely on alternating minimization schemes that do not converge to the optimum in a finite number of iterations [26]. To overcome this limitation and obtain a closed-form solution, we introduced the Kronecker kernel MC and extrapolation (KKMCEX) method [18]. Associated with entries of $\bm{F}$ , consider the two-dimensional function $f:\mathcal{X}\times\mathcal{Y}\rightarrow\mathbb{R}$ with $f(x_{i},y_{j})=\bm{F}_{i,j}$ , and the RKHS it belongs to

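$$\mathcal{H}_{f}:=\left\{f:\>f(x,y)=\sum_{n=1}^{N}\sum_{l=1}^{L}d_{n,l}\,\kappa_{f}((x,x_{n}),(y,y_{l})),\>d_{n,l}\in\mathbb{R}\right\}.$$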

Upon vectorizing $\bm{F}$ , we obtain $\bm{f}=\text{vec}(\bm{F})=$ $\bm{K}_{f}\bm{d}$ , where $\bm{K}_{f}$ has entries $\kappa_{f}$ and $\bm{d}:=[d_{1,1},\ldots,d_{N,1},\ldots,d_{N,L}]^{T}$ . Accordingly, the data matrix is vectorized as $\overline{\bm{m}}=\bm{S}\text{vec}(\bm{M})$ , where $\bm{S}$ is an $m\times NL$ binary sampling matrix with a single nonzero entry per row, and $\bar{\bm{e}}=\bm{S}\text{vec}(\bm{E})$ denotes the noise vector. With these definitions, the signal model for the observed entries becomes

$$\overline{\bm{m}}=\bm{S}\bm{f}+\bar{\bm{e}}=\bm{S}\bm{K}_{f}\bm{d}+\bar{\bm{e}}.$$

Recovery of the vectorized matrix is then performed using the kernel ridge regression estimate of $\bm{d}$ given by

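$$\hat{\bm{d}}=\arg\min_{\bm{d}\in\mathbb{R}^{NL}}\|\overline{\bm{m}}-\bm{S}\bm{K}_{f}\bm{d}\|_{2}^{2}+\mu\,\bm{d}^{T}\bm{K}_{f}\bm{d}.\tag{4}$$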

The closed-form solution to (4) satisfies $\hat{\bm{d}}=\bm{S}^{T}\hat{\bar{\bm{d}}}$ , where

$$\hat{\bar{\bm{d}}}=(\bm{S}\bm{K}_{f}\bm{S}^{T}+\mu\bm{I})^{-1}\overline{\bm{m}}\tag{5}$$

is the result of using the matrix inversion lemma on the solution to (4). Since (5) only depends on the observations in $\mathcal{S}_{m}$ , KKMCEX can be equivalently rewritten as

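$$\hat{\bar{\bm{d}}}=\arg\min_{\bar{\bm{d}}\in\mathbb{R}^{m}}\|\overline{\bm{m}}-\bar{\bm{K}}_{f}\bar{\bm{d}}\|_{2}^{2}+\mu\,\bar{\bm{d}}^{T}\bar{\bm{K}}_{f}\bar{\bm{d}}\tag{6}$$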

where $\bar{\bm{K}}_{f}=\bm{S}\bm{K}_{f}\bm{S}^{T}$ . Given $\kappa_{w}$ and $\kappa_{h}$ , it becomes possible to use $\kappa_{f}((x,x_{n}),(y,y_{l}))=\kappa_{w}(x,x_{n})\kappa_{h}(y,y_{l})$ as a kernel, which corresponds to a kernel matrix $\bm{K}_{f}=\bm{K}_{h}\otimes\bm{K}_{w}$ [18].
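A minimal NumPy sketch of the closed-form KKMCEX estimate is given below, under the assumption that the observed index pairs are distinct and that the vectorization is column-major. It computes $\hat{\bar{\bm{d}}}=(\bm{S}\bm{K}_{f}\bm{S}^{T}+\mu\bm{I})^{-1}\overline{\bm{m}}$ and $\text{vec}(\hat{\bm{F}})=\bm{K}_{f}\bm{S}^{T}\hat{\bar{\bm{d}}}$ with $\bm{K}_{f}=\bm{K}_{h}\otimes\bm{K}_{w}$, without forming the $NL\times NL$ matrix. The function name `kkmcex` and the random stand-in kernels are hypothetical.

```python
import numpy as np

def kkmcex(M, rows, cols, K_w, K_h, mu):
    """Estimate the full N x L matrix from the observed entries M[rows, cols]."""
    m_bar = M[rows, cols]
    # Entries of S K_f S^T for K_f = kron(K_h, K_w): products of kernel entries.
    K_bar = K_h[np.ix_(cols, cols)] * K_w[np.ix_(rows, rows)]
    d_bar = np.linalg.solve(K_bar + mu * np.eye(len(rows)), m_bar)
    # K_f S^T d_bar equals vec(K_w D K_h), with D holding d_bar at observed positions.
    D = np.zeros(M.shape)
    D[rows, cols] = d_bar
    return K_w @ D @ K_h, d_bar

rng = np.random.default_rng(2)
N, L, m, mu = 5, 4, 8, 0.1
A_w = rng.standard_normal((N, N)); K_w = A_w @ A_w.T   # stand-in PSD kernel matrix
A_h = rng.standard_normal((L, L)); K_h = A_h @ A_h.T
M = rng.standard_normal((N, L))
picked = rng.choice(N * L, size=m, replace=False)       # sampling without repetition
rows, cols = picked % N, picked // N                    # vec index k = i + N * j
F_hat, d_bar = kkmcex(M, rows, cols, K_w, K_h, mu)
```

Only an $m\times m$ linear system is solved, which reflects the point made in the text that the number of coefficients equals the number of observations.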

III Generalization error in MC

In this section, we derive bounds for the GE of the MC in (1), KMC in (2) and KKMCEX in (4) algorithms. There are two approaches to GE analysis, namely the inductive [24] and the transductive one in [25]. In the inductive one GE measures the difference between the expected value of a loss function and the empirical loss over a finite number of samples. Consider rewriting MC in the general form

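$$\hat{\bm{F}}=\arg\min_{\bm{F}\in\mathcal{F}}\frac{1}{m}\sum_{(i,j)\in\mathcal{S}_{m}}l(\bm{M}_{i,j},\bm{F}_{i,j})\tag{7}$$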

where $l:\mathbb{R}\times\mathbb{R}\rightarrow\mathbb{R}$ denotes the loss, and $\mathcal{F}$ is the hypothesis class. For instance, choosing the square loss and setting the class to the set of matrices with a nuclear norm smaller than a constant $t$ results in the base MC formulation (1). Assuming a sampling distribution $\mathcal{D}$ over $\{1,\ldots,N\}\times\{1,\ldots,L\}$ for the observed indices in $\mathcal{S}_{m}$ , the GE for a specific estimate $\hat{\bm{F}}$ is given by the expected difference $\mathbb{E}_{\mathcal{D}}\{l(\bm{M}_{i,j},\hat{\bm{F}}_{i,j})\}-{(1/m)}\sum_{(i,j)\in\mathcal{S}_{m}}l(\bm{M}_{i,j},\hat{\bm{F}}_{i,j})$ . However, this definition of GE does not fit the MC framework because it assumes that: i) the data distribution is known; and, ii) the entries are sampled with repetition. In order to come up with distribution-free claims for MC, one may resort to the transductive GE analysis [25]. In this scenario, we are given $\mathcal{S}_{n}=\mathcal{S}_{m}\cup\mathcal{S}_{u}$ of $n$ data comprising the union of the training set $\mathcal{S}_{m}$ and the testing set $\mathcal{S}_{u}$ , where $|\mathcal{S}_{u}|=u$ . These data are taken without repetition, and the objective is to minimize the loss on the testing set. Thus, the GE is the difference between the testing and training loss functions

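$$\frac{1}{u}\sum_{(i,j)\in\mathcal{S}_{u}}l(\bm{M}_{i,j},\hat{\bm{F}}_{i,j})-\frac{1}{m}\sum_{(i,j)\in\mathcal{S}_{m}}l(\bm{M}_{i,j},\hat{\bm{F}}_{i,j}).\tag{8}$$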

By making this difference as small as possible, we ensure that the chosen $\hat{\bm{F}}$ has good generalization properties, meaning we expect to obtain a similar empirical loss when we choose a different testing set of samples. Since MC algorithms find their solution among a class of matrices under different restrictions or hypotheses, we are interested in bounding (8) for any matrix in the solution space. Before we present such bounds, we need to introduce the notion of transductive Rademacher complexity (TRC) as follows.

Definition 1.

Transductive Rademacher complexity [25]. Given a set $\mathcal{S}_{n}=\mathcal{S}_{m}\cup\mathcal{S}_{u}$ with $q:={1\over u}+{1\over m}$ , the transductive Rademacher complexity (TRC) of a matrix class $\mathcal{F}$ is

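$$R_{n}(\mathcal{F})=q\,\mathbb{E}_{\sigma}\left\{\sup_{\bm{F}\in\mathcal{F}}\sum_{(i,j)\in\mathcal{S}_{n}}\sigma_{i,j}\bm{F}_{i,j}\right\}\tag{9}$$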

where $\sigma_{i,j}$ is a Rademacher random variable that takes the values $-1$ and $1$ , each with probability $0.5$ . We may also write (9) in vectorized form as $R_{n}(\mathcal{F})=q\mathbb{E}_{\sigma}\left\{\sup_{\bm{F}\in\mathcal{F}}\bm{\sigma}^{T}\text{vec}(\bm{F})\right\}$ , where $\bm{\sigma}=\text{vec}(\bm{\Sigma})$ , and $\bm{\Sigma}\in\mathbb{R}^{N\times L}$ has entries $\bm{\Sigma}_{i,j}=\sigma_{i,j}$ if $(i,j)\in\mathcal{S}_{n}$ , and $\bm{\Sigma}_{i,j}=0$ otherwise.

TRC measures the expected maximum correlation between any function in the class and the random vector $\bm{\sigma}$ . Intuitively, the greater this correlation is, the larger is the chance of finding a solution in the hypothesis class that will fit any observation draw, that is, $\hat{\bm{F}}_{i,j}\!\simeq\!\bm{M}_{i,j}\forall\>(i,j)\in\mathcal{S}_{n}$ . Although TRC measures the ability to fit both the testing and training data at once, a model for $\bm{F}$ is learnt using only the training data. While having a small loss across all entries in $\mathcal{S}_{n}$ is desirable, making it too small can lead to overfitting, and an increased error when predicting entries outside $\mathcal{S}_{n}$ . Using the TRC, the GE is bounded as follows.

Theorem 1.

[25] Let $\mathcal{F}$ be a matrix hypothesis class. For a loss function $l$ with Lipschitz constant $\gamma$ , and any $\bm{F}\in\mathcal{F}$ , it holds with probability $1-\delta$ that

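$$\frac{1}{u}\sum_{(i,j)\in\mathcal{S}_{u}}l(\bm{M}_{i,j},\bm{F}_{i,j})-\frac{1}{m}\sum_{(i,j)\in\mathcal{S}_{m}}l(\bm{M}_{i,j},\bm{F}_{i,j})\leq R_{n}(l\circ\mathcal{F})+5.05\,q\sqrt{\min(m,u)}+\sqrt{2q\ln(1/\delta)}.\tag{10}$$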

Theorem 1 asserts that in order to bound the GE, it suffices to bound the TRC. Moreover, using the contraction property, which states that $R_{n}(l\circ\mathcal{F})\leq{1\over\gamma}R_{n}(\mathcal{F})$ [25], we only need to calculate the TRC of $\mathcal{F}$ . Given that the same loss function is used in MC, KMC and KKMCEX, comparing the GE upper bounds of the three methods reduces to deriving the TRC of the hypothesis class of each algorithm.
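As an illustration, the transductive GE (testing loss minus training loss) and the two TRC-independent terms of Theorem 1, $5.05\,q\sqrt{\min(m,u)}+\sqrt{2q\ln(1/\delta)}$ with $q=1/u+1/m$, can be evaluated as sketched below. The square loss and the function names are assumptions made for the example.

```python
import numpy as np

def transductive_ge(M, F_hat, train_mask, test_mask):
    """Testing loss minus training loss under the square loss (assumed here)."""
    sq_err = (M - F_hat) ** 2
    return sq_err[test_mask].mean() - sq_err[train_mask].mean()

def slack_terms(m, u, delta):
    """Second and third terms on the right-hand side of the bound in Theorem 1."""
    q = 1.0 / u + 1.0 / m
    return 5.05 * q * np.sqrt(min(m, u)) + np.sqrt(2.0 * q * np.log(1.0 / delta))

# Toy example: the estimate fits the training entries and predicts zero elsewhere.
M = np.ones((4, 4))
train_mask = np.zeros((4, 4), dtype=bool)
train_mask[:2] = True
test_mask = ~train_mask
F_hat = np.where(train_mask, M, 0.0)
ge = transductive_ge(M, F_hat, train_mask, test_mask)
slack = slack_terms(m=1000, u=1000, delta=0.05)
```

For $m=u$ the slack equals $(10.1+2\sqrt{\ln(1/\delta)})/\sqrt{m}$, so it shrinks at the rate $1/\sqrt{m}$ regardless of the hypothesis class.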

III-A Rademacher complexity for base MC

In the base MC formulation (1), the hypothesis class is $\mathcal{F}_{MC}:=\{\bm{F}:\left|\left|\bm{F}\right|\right|_{*}\leq t,\>t\in\mathbb{R}\}$ , where the value of $t$ is regulated by $\mu$ . As derived in [21], the TRC for this class of matrices is bounded as

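$$R_{n}(\mathcal{F}_{MC})\leq q\,\mathbb{E}_{\sigma}\left\{\sup_{\bm{F}\in\mathcal{F}_{MC}}\|\bm{\Sigma}\|_{2}\|\bm{F}\|_{*}\right\}\leq Gqt\left(\sqrt{N}+\sqrt{L}\right)\tag{11}$$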

where $G$ is a universal constant. The bound in (11) decays as $\mathcal{O}({1\over m}+{1\over u})\subseteq\mathcal{O}\left(1/\min(m,u)\right)$ for fixed $N$ and $L$ . However, the GE does not since the sum of the second and third terms on the right-hand side of (10) decays as $\mathcal{O}(1/\sqrt{\min{(m,u)}}\,)$ . Ideally, the sizes of the training and testing datasets should be comparable for the TRC to scale well with $n$ . Concerning the matrix size, the bound shows that increasing $N$ or $L$ results in a larger TRC bound regardless of the number of data points $n$ . Moreover, the nuclear norm of a matrix is $\mathcal{O}(\sqrt{NL})$ since $\left|\left|\bm{F}\right|\right|_{\text{F}}\leq\left|\left|\bm{F}\right|\right|_{*}\leq\sqrt{r}\left|\left|\bm{F}\right|\right|_{\text{F}}$ . Therefore, $t$ should also scale with $N$ and $L$ in order to match the hypothesis class, and obtain a good estimate of $\bm{F}$ .
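For the nuclear-norm ball, the supremum inside the TRC has a closed form: by duality between the nuclear and spectral norms, $\sup_{\|\bm{F}\|_{*}\leq t}\bm{\sigma}^{T}\text{vec}(\bm{F})=t\|\bm{\Sigma}\|_{2}$. The sketch below uses this to estimate the TRC by Monte Carlo for one fixed index set $\mathcal{S}_{n}$. The sizes, the number of trials, and the function name are hypothetical choices for illustration only.

```python
import numpy as np

def trc_nuclear_ball(N, L, n, m, t, trials=200, seed=0):
    """Monte Carlo estimate of q * t * E||Sigma||_2 for a fixed index set S_n."""
    rng = np.random.default_rng(seed)
    q = 1.0 / (n - m) + 1.0 / m
    support = rng.choice(N * L, size=n, replace=False)   # fixed index set S_n
    total = 0.0
    for _ in range(trials):
        sigma = np.zeros(N * L)
        sigma[support] = rng.choice([-1.0, 1.0], size=n)  # Rademacher signs on S_n
        Sigma = sigma.reshape((N, L), order="F")
        total += np.linalg.norm(Sigma, ord=2)             # spectral norm
    return q * t * total / trials

estimate = trc_nuclear_ball(N=20, L=15, n=60, m=30, t=2.0)
```

Since $1\leq\|\bm{\Sigma}\|_{2}\leq\|\bm{\Sigma}\|_{\text{F}}=\sqrt{n}$ for any sign pattern on $n$ entries, the estimate always lies between $qt$ and $qt\sqrt{n}$.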

III-B Rademacher complexity for KMC

Unlike base MC, which minimizes the nuclear norm of the estimated matrix as a surrogate for its rank, KMC does not directly employ the rank in its objective function. Instead, it imposes constraints on the maximum norm of the factor matrices in their respective RKHSs. Similar to [21], the TRC for KMC is bounded as follows.

Theorem 2.

If the KMC hypothesis class is $\mathcal{F}_{K}:=\left\{\bm{F}\right.:$ $\left.\bm{F}=\bm{K}_{w}\bm{B}\bm{C}^{T}\bm{K}_{h},\text{Tr}(\bm{B}^{T}\bm{K}_{w}\bm{B})\!+\!\text{Tr}(\bm{C}^{T}\bm{K}_{h}\bm{C})\!<\!t_{B}\right\}$ , then

$$R_{n}(\mathcal{F}_{K})\leq\lambda_{\max}Gqt_{B}\left(\sqrt{N}+\sqrt{L}\right)\tag{12}$$

where $\lambda_{\max}$ is the largest among the eigenvalues of $\bm{K}_{w}$ and $\bm{K}_{h}$ .

Proof.

Rewrite the nuclear norm in (11) in terms of the KMC constraint as

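$$\begin{aligned}\|\bm{F}\|_{*}&=\frac{1}{2}\left(\|\bm{W}\|_{\text{F}}^{2}+\|\bm{H}\|_{\text{F}}^{2}\right)=\frac{1}{2}\left(\text{Tr}(\bm{B}^{T}\bm{K}_{w}^{2}\bm{B})+\text{Tr}(\bm{C}^{T}\bm{K}_{h}^{2}\bm{C})\right)\\&\leq\frac{\lambda_{\max}}{2}\left[\text{Tr}(\bm{B}^{T}\bm{K}_{w}\bm{B})+\text{Tr}(\bm{C}^{T}\bm{K}_{h}\bm{C})\right]\leq\frac{\lambda_{\max}t_{B}}{2}\end{aligned}\tag{13}$$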

where we used that $\text{Tr}(\bm{B}^{T}\bm{K}_{w}^{2}\bm{B})=\sum_{i=1}^{p}\bm{b}_{i}^{T}\bm{K}_{w}^{2}\bm{b}_{i}$ with $\bm{b}_{i}$ denoting the $i^{th}$ column of $\bm{B}$ , and $\bm{b}_{i}^{T}\bm{K}_{w}^{1\over 2}\bm{K}_{w}\bm{K}_{w}^{1\over 2}\bm{b}_{i}\leq\lambda_{\max}\bm{b}_{i}^{T}\bm{K}_{w}\bm{b}_{i}$ . ∎

Theorem 2 establishes that the TRC bound expressions of KMC and MC are identical up to a scaling factor. With $t_{B}=t$ , $\lambda_{\max}$ controls whether KMC has a larger or smaller TRC bound than MC. Thus, according to Theorem 2, the GE bound for KMC shrinks with $n$ and grows with $N$ , $L$ and $\lambda_{\max}$ .
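The key inequality in the proof, $\text{Tr}(\bm{B}^{T}\bm{K}_{w}^{2}\bm{B})\leq\lambda_{\max}\text{Tr}(\bm{B}^{T}\bm{K}_{w}\bm{B})$, can be verified numerically. The sketch below does so for a randomly generated positive semidefinite matrix that stands in for a kernel matrix; the sizes and the random construction are assumptions of the example.

```python
import numpy as np

rng = np.random.default_rng(3)
N, p = 7, 3
Phi = rng.standard_normal((N, N))
K_w = Phi @ Phi.T                       # symmetric PSD stand-in for a kernel matrix
B = rng.standard_normal((N, p))         # coefficient matrix with p columns

lam_max = np.linalg.eigvalsh(K_w)[-1]   # eigvalsh returns eigenvalues in ascending order
lhs = np.trace(B.T @ K_w @ K_w @ B)     # Tr(B^T K_w^2 B) = ||K_w B||_F^2
rhs = lam_max * np.trace(B.T @ K_w @ B) # lambda_max * Tr(B^T K_w B)
```

Here `lhs` is the squared Frobenius norm of $\bm{W}=\bm{K}_{w}\bm{B}$, so the check shows how a constraint on the RKHS norm of the coefficients translates into a constraint on the factor norm, scaled by $\lambda_{\max}$.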

Interestingly, we will show next that it is possible to have a TRC bound that does not depend on the matrix size.

Consider the factorizations $\bm{K}_{w}=\bm{\Phi}_{w}\bm{\Phi}_{w}^{T}$ and $\bm{K}_{h}=\bm{\Phi}_{h}\bm{\Phi}_{h}^{T}$ , where $\bm{\Phi}_{w}\in\mathbb{R}^{N\times d_{w}}$ and $\bm{\Phi}_{h}\in\mathbb{R}^{L\times d_{h}}$ . Plugging the latter into the objective of (2) and substituting $\bm{W}=\bm{K}_{w}\bm{B}$ and $\bm{H}=\bm{K}_{h}\bm{C}$ yields

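$$\|P_{\mathcal{S}_{m}}(\bm{M}-\bm{\Phi}_{w}\bm{\Phi}_{w}^{T}\bm{B}\bm{C}^{T}\bm{\Phi}_{h}\bm{\Phi}_{h}^{T})\|_{\text{F}}^{2}+\mu\left(\text{Tr}(\bm{B}^{T}\bm{\Phi}_{w}\bm{\Phi}_{w}^{T}\bm{B})+\text{Tr}(\bm{C}^{T}\bm{\Phi}_{h}\bm{\Phi}_{h}^{T}\bm{C})\right)\tag{14}$$

$$=\|P_{\mathcal{S}_{m}}(\bm{M}-\bm{\Phi}_{w}\bm{A}_{w}\bm{A}_{h}^{T}\bm{\Phi}_{h}^{T})\|_{\text{F}}^{2}+\mu\left(\|\bm{A}_{w}\|_{\text{F}}^{2}+\|\bm{A}_{h}\|_{\text{F}}^{2}\right)\tag{15}$$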

where $\bm{A}_{w}=\bm{\Phi}_{w}^{T}\bm{B}$ and $\bm{A}_{h}=\bm{\Phi}_{h}^{T}\bm{C}$ are coefficient matrices of size $d_{w}\times p$ and $d_{h}\times p$ , respectively. Optimizing for $\{\bm{B},\bm{C}\}$ in (14) or for $\{\bm{A}_{w},\bm{A}_{h}\}$ in (15) yields the same $\hat{\bm{F}}$ provided that $\{\bm{\Phi}_{w}^{T},\bm{\Phi}_{h}^{T}\}$ have full column rank. Under this assumption, we consider the hypothesis class $\mathcal{F}_{I}:=\left\{\bm{F}:\bm{F}=\bm{\Phi}_{w}\bm{A}_{w}\bm{A}_{h}^{T}\bm{\Phi}_{h}^{T},\left|\left|\bm{A}_{w}\right|\right|_{\text{F}}^{2}\leq t_{w},\left|\left|\bm{A}_{h}\right|\right|_{\text{F}}^{2}\leq t_{h}\right\}$ , which satisfies $\mathcal{F}_{I}=\mathcal{F}_{K}$ . Since (15) is the objective used by inductive MC [16], this shows that inductive MC is a special case of KMC. This leads to the following result.

Theorem 3.

If $\bm{K}=(\bm{\Phi}_{h}\otimes\bm{\Phi}_{w})(\bm{\Phi}_{h}\otimes\bm{\Phi}_{w})^{T}$ , and $\bm{S}_{n}$ is a binary sampling matrix that selects the entries in $\mathcal{S}_{n}$ , then

$$R_{n}(\mathcal{F}_{I})\leq q\sqrt{t_{w}t_{h}\text{Tr}(\bm{S}_{n}\bm{K}\bm{S}_{n}^{T})}.\tag{16}$$

Proof.

With $\bm{\sigma}:=\text{vec}(\bm{\Sigma})$ , $b_{w}:=\left|\left|\bm{A}_{w}\right|\right|_{\text{F}}^{2}$ , and $b_{h}:=\left|\left|\bm{A}_{h}\right|\right|_{\text{F}}^{2}$ , we have that

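$$\begin{aligned}R_{n}(\mathcal{F}_{I})&=q\,\mathbb{E}_{\sigma}\left\{\sup_{b_{w}\leq t_{w},\,b_{h}\leq t_{h}}\bm{\sigma}^{T}\text{vec}(\bm{\Phi}_{w}\bm{A}_{w}\bm{A}_{h}^{T}\bm{\Phi}_{h}^{T})\right\}\\&=q\,\mathbb{E}_{\sigma}\left\{\sup_{b_{w}\leq t_{w},\,b_{h}\leq t_{h}}\bm{\sigma}^{T}(\bm{\Phi}_{h}\otimes\bm{\Phi}_{w})\text{vec}(\bm{A}_{w}\bm{A}_{h}^{T})\right\}\\&\leq q\,\mathbb{E}_{\sigma}\left\{\sup_{b_{w}\leq t_{w},\,b_{h}\leq t_{h}}\|\bm{\sigma}^{T}(\bm{\Phi}_{h}\otimes\bm{\Phi}_{w})\|_{2}\,\|\text{vec}(\bm{A}_{w}\bm{A}_{h}^{T})\|_{2}\right\}\\&=q\,\mathbb{E}_{\sigma}\left\{\sup_{b_{w}\leq t_{w},\,b_{h}\leq t_{h}}\sqrt{\bm{\sigma}^{T}\bm{K}\bm{\sigma}}\,\|\bm{A}_{w}\bm{A}_{h}^{T}\|_{\text{F}}\right\}\\&\leq q\,\mathbb{E}_{\sigma}\left\{\sup_{b_{w}\leq t_{w},\,b_{h}\leq t_{h}}\sqrt{\bm{\sigma}^{T}\bm{K}\bm{\sigma}}\,\|\bm{A}_{w}\|_{\text{F}}\|\bm{A}_{h}^{T}\|_{\text{F}}\right\}\\&\leq q\sqrt{t_{w}t_{h}\,\mathbb{E}_{\sigma}\{\bm{\sigma}^{T}\bm{K}\bm{\sigma}\}}=q\sqrt{t_{w}t_{h}\text{Tr}(\bm{S}_{n}\bm{K}\bm{S}_{n}^{T})}\end{aligned}$$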

where we have successively used the Cauchy-Schwarz inequality, the sub-multiplicative property of the Frobenius norm, and Jensen’s inequality in the first, second and third inequalities, respectively. ∎

If the entries in the diagonal of $\bm{K}$ are bounded by a constant, and $m=u$ , Theorem 3 provides a bound that decays as $\mathcal{O}({\sqrt{t_{w}t_{h}\over m}})$ . Thus, if $t_{w}$ and $t_{h}$ are constant, the bound does not grow with $N$ or $L$ . These values can reasonably be kept constant when the coefficients in $\{\bm{A}_{w},\bm{A}_{h}\}$ are not expected to change much as new rows or columns are added to $\bm{F}$ , e.g., when the existing entries in the kernel matrices are largely unchanged as the matrices grow. For instance, let us rewrite the loss in (15) as $\left|\left|\overline{\bm{m}}-\bm{S}(\bm{\Phi}_{h}\otimes\bm{\Phi}_{w})\text{vec}(\bm{A}_{w}\bm{A}_{h}^{T})\right|\right|^{2}_{2}$ . If increasing $N$ or $L$ only adds a few rows to $\bm{\Phi}_{w}$ or $\bm{\Phi}_{h}$ , as would happen with a linear kernel, optimizing for $\{\bm{A}_{w},\bm{A}_{h}\}$ in (15) should yield results similar to those obtained with smaller $N$ and $L$ , as long as the space spanned by $\bm{S}(\bm{\Phi}_{h}\otimes\bm{\Phi}_{w})$ is not significantly altered.
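The bound of Theorem 3, $q\sqrt{t_{w}t_{h}\text{Tr}(\bm{S}_{n}\bm{K}\bm{S}_{n}^{T})}$, only involves the diagonal entries of $\bm{K}=\bm{K}_{h}\otimes\bm{K}_{w}$ at the sampled positions, which are products of diagonal entries of the two kernel matrices. The following sketch evaluates it that way. The function name, the feature dimensions, and the budgets `t_w`, `t_h` are illustrative assumptions.

```python
import numpy as np

def trc_bound_inductive(K_w, K_h, rows, cols, m, t_w, t_h):
    """q * sqrt(t_w * t_h * Tr(S_n K S_n^T)) with K = kron(K_h, K_w)."""
    n = len(rows)
    q = 1.0 / (n - m) + 1.0 / m
    # Diagonal entry of K at the position of entry (i, j) is K_w[i, i] * K_h[j, j].
    trace = np.sum(np.diag(K_w)[rows] * np.diag(K_h)[cols])
    return q * np.sqrt(t_w * t_h * trace)

rng = np.random.default_rng(4)
N, L, n, m = 6, 5, 12, 6
Phi_w = rng.standard_normal((N, 3))      # hypothetical feature matrices
Phi_h = rng.standard_normal((L, 2))
K_w, K_h = Phi_w @ Phi_w.T, Phi_h @ Phi_h.T
picked = rng.choice(N * L, size=n, replace=False)
rows, cols = picked % N, picked // N     # column-major vec index k = i + N * j
bound = trc_bound_inductive(K_w, K_h, rows, cols, m, t_w=2.0, t_h=3.0)
```

When both kernels have unit diagonal the trace equals $n$, and for $m=u$ the bound becomes $2\sqrt{2t_{w}t_{h}/m}$, with no dependence on $N$ or $L$.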

III-C Rademacher complexity for KKMCEX

In KKMCEX, the restriction is set on the magnitude of $\bar{\bm{d}}^{T}\bar{\bm{K}}_{f}\bar{\bm{d}}$ , which depends on $\bm{S}$ . Therefore, the hypothesis class for (6) is not altered by changes in the matrix size. The TRC bound is then given by the next theorem.

Theorem 4.

If $\mathcal{F}_{R}:=\{\bm{F}\!:\!\bm{F}=\text{unvec}(\bm{K}_{f}\bm{S}^{T}\bar{\bm{d}}),\bar{\bm{d}}^{T}\bar{\bm{K}}_{f}\bar{\bm{d}}\leq b^{2},\>b\in\mathbb{R}\}$ is the hypothesis class for KKMCEX, it holds that

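$$R_{n}(\mathcal{F}_{R})\leq qb\sqrt{\text{Tr}(\bm{S}_{n}\bm{K}_{f}\bm{S}^{T}\bar{\bm{K}}_{f}^{-1}\bm{S}\bm{K}_{f}\bm{S}_{n}^{T})}.\tag{17}$$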

Proof.

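$$\begin{aligned}R_{n}(\mathcal{F}_{R})&=q\,\mathbb{E}_{\sigma}\left\{\sup_{\bar{\bm{d}}^{T}\bar{\bm{K}}_{f}\bar{\bm{d}}\leq b^{2}}\bm{\sigma}^{T}\bm{K}_{f}\bm{S}^{T}\bar{\bm{d}}\right\}\\&=q\,\mathbb{E}_{\sigma}\left\{\sup_{\bar{\bm{d}}^{T}\bar{\bm{K}}_{f}\bar{\bm{d}}\leq b^{2}}\bm{\sigma}^{T}\bm{K}_{f}\bm{S}^{T}\bar{\bm{K}}_{f}^{-\frac{1}{2}}\bar{\bm{K}}_{f}^{\frac{1}{2}}\bar{\bm{d}}\right\}\\&\leq q\,\mathbb{E}_{\sigma}\left\{\sup_{\bar{\bm{d}}^{T}\bar{\bm{K}}_{f}\bar{\bm{d}}\leq b^{2}}\|\bm{\sigma}^{T}\bm{K}_{f}\bm{S}^{T}\bar{\bm{K}}_{f}^{-\frac{1}{2}}\|_{2}\,\|\bar{\bm{K}}_{f}^{\frac{1}{2}}\bar{\bm{d}}\|_{2}\right\}\\&\leq qb\,\mathbb{E}_{\sigma}\left\{\|\bm{\sigma}^{T}\bm{K}_{f}\bm{S}^{T}\bar{\bm{K}}_{f}^{-\frac{1}{2}}\|_{2}\right\}\\&\leq qb\sqrt{\text{Tr}(\bm{S}_{n}\bm{K}_{f}\bm{S}^{T}\bar{\bm{K}}_{f}^{-1}\bm{S}\bm{K}_{f}\bm{S}_{n}^{T})}\end{aligned}$$

where the first inequality is the Cauchy-Schwarz inequality, the second uses the constraint $\|\bar{\bm{K}}_{f}^{1\over 2}\bar{\bm{d}}\|_{2}\leq b$ , and the last follows from Jensen's inequality.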

∎

Supposing that the diagonal entries of $\bm{K}_{f}$ are bounded by a constant, the bound in (17) decays as $\mathcal{O}(\sqrt{n}/\min(m,u))$ . For $m=u$ , this yields a rate $\mathcal{O}({1\over\sqrt{m}})$ . Thus, the GE bound induced by (17) only scales with the number of samples. As a result, we can expect the same performance on the testing dataset regardless of the data matrix size. Moreover, thanks to its simplicity and speed [18], KKMCEX can be used to confidently initialize other algorithms when needed, e.g., when the prior information is not accurate enough to provide a reliable hypothesis space.
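The bound of Theorem 4, $qb\sqrt{\text{Tr}(\bm{S}_{n}\bm{K}_{f}\bm{S}^{T}\bar{\bm{K}}_{f}^{-1}\bm{S}\bm{K}_{f}\bm{S}_{n}^{T})}$, can be evaluated from blocks of $\bm{K}_{f}=\bm{K}_{h}\otimes\bm{K}_{w}$ alone. The sketch below is one possible way to do it, assuming an invertible $\bar{\bm{K}}_{f}$ and the convention (chosen for this example only) that the first $m$ index pairs form the training set. All names are hypothetical.

```python
import numpy as np

def trc_bound_kkmcex(K_w, K_h, rows_n, cols_n, m, b):
    """Bound of Theorem 4; the first m index pairs are taken as the training set."""
    n = len(rows_n)
    q = 1.0 / (n - m) + 1.0 / m
    rm, cm = rows_n[:m], cols_n[:m]
    K_bar = K_h[np.ix_(cm, cm)] * K_w[np.ix_(rm, rm)]        # S K_f S^T   (m x m)
    C = K_h[np.ix_(cols_n, cm)] * K_w[np.ix_(rows_n, rm)]    # S_n K_f S^T (n x m)
    trace = np.sum(C * np.linalg.solve(K_bar, C.T).T)        # Tr(C K_bar^{-1} C^T)
    return q * b * np.sqrt(trace)

rng = np.random.default_rng(5)
N, L, n, m = 6, 5, 14, 8
A_w = rng.standard_normal((N, N)); K_w = A_w @ A_w.T + np.eye(N)   # positive definite
A_h = rng.standard_normal((L, L)); K_h = A_h @ A_h.T + np.eye(L)
picked = rng.choice(N * L, size=n, replace=False)
rows_n, cols_n = picked % N, picked // N
bound = trc_bound_kkmcex(K_w, K_h, rows_n, cols_n, m, b=1.5)
```

Because $\bm{K}_{f}$ is positive semidefinite, the trace in the bound never exceeds $\text{Tr}(\bm{S}_{n}\bm{K}_{f}\bm{S}_{n}^{T})$, which is at most $n$ times the largest diagonal entry; this is consistent with the $\mathcal{O}(\sqrt{n}/\min(m,u))$ rate stated above.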

IV Numerical tests

This section compares the GE of MC and KMC, solved via alternating least-squares (ALS) [26], with that of KKMCEX, solved with (5). Besides comparing the GE of these algorithms, we also assess how the matrix size impacts the GE. To this end, we use a fixed-rank synthetic data matrix with $N=L$ generated as $\bm{M}=\bm{F}+\bm{E}$ with $\bm{F}=\bm{K}_{w}\bm{B}\bm{C}^{T}\bm{K}_{h}$ . The kernel matrices are $\bm{K}_{w}=\bm{K}_{h}=\text{abs}(\bm{R}\bm{D}\bm{R}^{T})$ , where $\bm{R}\in\mathbb{C}^{N\times N}$ is the DFT basis and $\bm{D}\in\mathbb{R}^{N\times N}$ is a diagonal matrix with decreasing weights on its diagonal. The coefficient matrices $\{\bm{B},\bm{C}\}$ have $p=30$ columns, with entries drawn from a zero-mean Gaussian distribution with variance 1. The entries of $\bm{E}\in\mathbb{R}^{N\times N}$ are drawn from a zero-mean Gaussian distribution with variance set according to the signal-to-noise ratio $snr=\left|\left|\bm{F}\right|\right|_{\text{F}}^{2}/\left|\left|\bm{E}\right|\right|_{\text{F}}^{2}$ .

The tests are run over 1,000 realizations. A new matrix $\bm{F}$ is generated per realization with $m\!=\!1,000$ entries drawn uniformly at random, followed by a run of each algorithm. Then, the loss on the testing set, which consists of the remaining $u=N^{2}-m$ entries, is measured. A single value of $\mu$ chosen by cross-validation is used for all realizations. For KMC and KKMCEX, $\mu$ is scaled with the matrix size to compensate for the trace growth of the kernel matrices, and thus keep the loss and regularization terms balanced.
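A rough sketch of how one realization of this setup could be generated is shown below. Several details are assumptions, since the text does not specify them: the DFT basis is normalized to be unitary, the decreasing weights in $\bm{D}$ are taken linearly spaced between 1 and 0.1, and the noise is rescaled so that the target $snr$ is met exactly. The function names are hypothetical.

```python
import numpy as np

def dft_kernel(N):
    """abs(R D R^T) with R the DFT basis and D a decreasing diagonal (assumed weights)."""
    R = np.fft.fft(np.eye(N)) / np.sqrt(N)        # unitary DFT matrix (assumed scaling)
    D = np.diag(np.linspace(1.0, 0.1, N))         # assumed decreasing weights
    return np.abs(R @ D @ R.T)

def make_problem(N, p, m, snr, rng):
    """Return the noisy matrix M, the noise-free F, and a boolean training mask."""
    K = dft_kernel(N)
    B = rng.standard_normal((N, p))
    C = rng.standard_normal((N, p))
    F = K @ B @ C.T @ K
    if np.isinf(snr):
        E = np.zeros((N, N))
    else:
        E = rng.standard_normal((N, N))
        E *= np.linalg.norm(F) / (np.sqrt(snr) * np.linalg.norm(E))  # ||F||^2/||E||^2 = snr
    picked = rng.choice(N * N, size=m, replace=False)   # uniform, without repetition
    train_mask = np.zeros(N * N, dtype=bool)
    train_mask[picked] = True
    return F + E, F, train_mask.reshape(N, N)

rng = np.random.default_rng(6)
M, F, train_mask = make_problem(N=16, p=3, m=40, snr=4.0, rng=rng)
test_mask = ~train_mask                                  # remaining u = N^2 - m entries
```

The small sizes are chosen only to keep the example fast; the tests described in the text use $p=30$, $m=1{,}000$ and $N$ between 100 and 3,200.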

Fig. 1a shows the training loss, testing loss, and GE for square matrices with size ranging from $N=100$ to $N=3,200$ , and $snr=\infty$ . We observe for base MC that the training loss is small, whereas the testing loss is much larger and grows with $N$ . Moreover, since the training loss is minimal, the GE coincides with the testing loss. Clearly, the MC solution (1) is not able to predict the unobserved entries due to the lack of prior information that would allow for extrapolation. In addition, the GE approaches saturation for large matrix sizes since most entries in the estimated matrix are [math], and the testing loss tends to the average ${1\over u}\sum_{(i,j)\in\mathcal{S}_{u}}\bm{M}_{i,j}$ . Regarding the performance of KMC and KKMCEX, we observe that both algorithms achieve a constant training loss. Although not visible on the plot, the training loss of KKMCEX is one order of magnitude smaller than that of KMC. On the other hand, the testing loss and GE of KKMCEX are constant, whereas for KMC both are higher and grow with $N$ . These results confirm what was asserted by the TRC bounds in Section III.

Fig. 1b shows the same simulation results as Fig. 1a, but with noisy data at $snr=4$ . We observe that MC overfits the noisy observations since the training loss is, again, very small, while the testing loss is much larger. For KMC and KKMCEX, the presence of noise increases the training and testing losses. Due to the noise, a larger $\mu$ is selected to prevent overfitting at the cost of a higher training loss. Nevertheless, the testing loss of KMC slightly grows with $N$ . In terms of GE, KKMCEX outperforms KMC with a lower value that tends to a constant.

Bibliography

[1] H. Ji, C. Liu, Z. Shen, and Y. Xu, “Robust video denoising using low rank matrix completion,” in Proc. of Computer Vision and Pattern Recognition Conf., San Francisco, USA, Jun. 2010, pp. 1791–1798.
[2] N. Rao, H.-F. Yu, P. K. Ravikumar, and I. S. Dhillon, “Collaborative filtering with graph information: Consistency and scalable methods,” in Advances in Neural Information Processing Systems, Montreal, Canada, Dec. 2015, pp. 2107–2115.
[3] T. L. Nguyen and Y. Shin, “Matrix completion optimization for localization in wireless sensor networks for intelligent IoT,” Sensors (Switzerland), vol. 16, no. 5, pp. 1–11, 2016.
[4] E. J. Candès and B. Recht, “Exact matrix completion via convex optimization,” Foundations of Computational Mathematics, vol. 9, no. 6, pp. 717–772, Dec. 2009.
[5] J. F. Cai, E. J. Candès, and Z. Shen, “A singular value thresholding algorithm for matrix completion,” SIAM Journal on Optimization, vol. 20, no. 4, pp. 1956–1982, Jan. 2010.
[6] S. Ma, D. Goldfarb, and L. Chen, “Fixed point and Bregman iterative methods for matrix rank minimization,” Mathematical Programming, vol. 128, no. 1-2, pp. 321–353, Jun. 2011.
[7] Y. Koren, R. Bell, and C. Volinsky, “Matrix factorization techniques for recommender systems,” Computer, vol. 42, no. 8, pp. 30–37, Aug. 2009.
[8] R. Sun, “Matrix Completion via Nonconvex Factorization: Algorithms and Theory,” Ph.D. dissertation, University of Minnesota, 2015.
