Approximate matrix completion based on cavity method

Chihiro Noguchi; Yoshiyuki Kabashima

arXiv:1907.00138·math.NA·January 8, 2020

TL;DR

This paper introduces cavity-based matrix factorization algorithms derived from statistical mechanics, which outperform traditional methods like ALS and SGD in convergence speed and efficiency, especially with fewer observed entries.

Contribution

The paper proposes novel cavity-based algorithms (CBMF and ACBMF) for matrix completion, offering faster convergence and lower computational costs compared to existing methods.

Findings

1. CBMF has lower computational cost than ALS.

2. ACBMF reduces memory usage compared to CBMF.

3. Proposed methods outperform ALS and SGD in convergence speed.

Abstract

In order to solve large matrix completion problems with practical computational cost, an approximate approach based on matrix factorization has been widely used. Alternating least squares (ALS) and stochastic gradient descent (SGD) are the two major algorithms to this end. In this study, we propose two new algorithms, namely cavity-based matrix factorization (CBMF) and approximate cavity-based matrix factorization (ACBMF), which are developed based on the cavity method from statistical mechanics. ALS yields solutions in fewer iterations than SGD because its update rules are described in a closed form, although it entails a higher computational cost. The update rules of CBMF can also be written in a closed form, and its computational cost is lower than that of ALS. ACBMF is proposed to compensate for the relatively high memory cost of CBMF. We…

Tables

Table 1: Comparison of computational costs to update all variables at least once. Specifically, $|\Omega|$ denotes the number of observed entries, and this is assumed to exceed or be equal to the number of variables to be determined, $(N+M)R$.

	CBMF	ACBMF	ALS	SGD
Computational costs	$O(|\Omega|R)$	$O(|\Omega|R)$	$O(|\Omega|R^{2}+(N+M)R^{3})$	$O((N+M)R)$
Memory costs	$O(|\Omega|R)$	$O((N+M)R+|\Omega|)$	$O((N+M)R+|\Omega|)$	$O((N+M)R+|\Omega|)$

Table 2: The details of the datasets used in this study. MovieLens is a dataset that consists of ratings of movies given by the users who watched them. The ratings in the 1M dataset take integer values from 1 to 5, and those in the 10M and 20M datasets take values from 0.5 to 5 in steps of 0.5. A user who likes a movie very much rates it as 5. #Users and #Items correspond to the row and column sizes of the observed matrix, respectively, and #Ratings denotes the number of observations.

Dataset	Rating set	#Users	#Items	#Ratings
MovieLens 1M	{1,2,3,4,5}	6,040	3,900	1,000,209
MovieLens 10M	{0.5,1,1.5,2,2.5,3,3.5,4,4.5,5}	71,567	10,681	10,000,054
MovieLens 20M	{0.5,1,1.5,2,2.5,3,3.5,4,4.5,5}	138,493	27,278	20,000,263

Equations

$$\min_{X}\ \mathrm{rank}(X)\quad\text{subject to}\quad X_{\mu i}=Y_{\mu i},\ (\mu,i)\in\Omega, \tag{1}$$

$$\|X\|_{*}=\sum_{k=1}^{\min\{N,M\}}\sigma_{k}, \tag{2}$$

$$\min_{X\in\mathbb{R}^{N\times M}}\ \frac{1}{2}\sum_{(\mu,i)\in\Omega}(Y_{\mu i}-X_{\mu i})^{2}+\lambda\|X\|_{*}. \tag{3}$$

$$\|X\|_{*}=\inf\left\{\frac{1}{2}\|U\|_{F}^{2}+\frac{1}{2}\|V\|_{F}^{2}\ :\ X=UV^{T}\right\}, \tag{4}$$

$$\min_{U\in\mathbb{R}^{N\times R},\,V\in\mathbb{R}^{M\times R}}\ \frac{1}{2}\sum_{(\mu,i)\in\Omega}\left(Y_{\mu i}-\sum_{r=1}^{R}U_{\mu r}V_{ir}\right)^{2}+\frac{1}{2}\lambda\|U\|_{F}^{2}+\frac{1}{2}\lambda\|V\|_{F}^{2}. \tag{5}$$

$$\min_{{\bf u}_{\mu}}\ \frac{1}{2}\sum_{i\in\partial\mu}(y_{\mu i}-{\bf u}_{\mu}^{T}{\bf v}_{i})^{2}+\frac{1}{2}\lambda\|{\bf u}_{\mu}\|^{2}, \tag{6}$$

$${\bf u}_{\mu}^{*}=\left(\sum_{i\in\partial\mu}{\bf v}_{i}{\bf v}_{i}^{T}+\lambda{\bf I}_{R}\right)^{-1}\sum_{i\in\partial\mu}y_{\mu i}{\bf v}_{i}, \tag{7}$$

Inline symbols: ${\bf u}_{\mu}$, ${\bf v}_{i}$

$$\hat{f}_{(\mu i)\rightarrow\mu r}(u_{\mu r})=\min_{\{{\bf u}_{\mu},{\bf v}_{i}\}\backslash u_{\mu r}}\left\{\frac{1}{2}\left(y_{\mu i}-\sum_{s}u_{\mu s}v_{is}\right)^{2}+\sum_{s\neq r}f_{\mu s\rightarrow(\mu i)}(u_{\mu s})+\sum_{s}g_{is\rightarrow(\mu i)}(v_{is})\right\}, \tag{10}$$

$$\hat{g}_{(\mu i)\rightarrow ir}(v_{ir})=\min_{\{{\bf u}_{\mu},{\bf v}_{i}\}\backslash v_{ir}}\left\{\frac{1}{2}\left(y_{\mu i}-\sum_{s}u_{\mu s}v_{is}\right)^{2}+\sum_{s}f_{\mu s\rightarrow(\mu i)}(u_{\mu s})+\sum_{s\neq r}g_{is\rightarrow(\mu i)}(v_{is})\right\}, \tag{11}$$

$$f_{\mu r\rightarrow(\mu i)}(u_{\mu r})=\sum_{(\mu j)\in\partial\mu r\backslash(\mu i)}\hat{f}_{(\mu j)\rightarrow\mu r}(u_{\mu r})+\frac{1}{2}\lambda u_{\mu r}^{2}, \tag{12}$$

$$g_{ir\rightarrow(\mu i)}(v_{ir})=\sum_{(\nu i)\in\partial ir\backslash(\mu i)}\hat{g}_{(\nu i)\rightarrow ir}(v_{ir})+\frac{1}{2}\lambda v_{ir}^{2}, \tag{13}$$

$$f_{\mu}(u_{\mu r})=\sum_{(\mu i)\in\partial\mu r}\hat{f}_{(\mu i)\rightarrow\mu r}(u_{\mu r})+\frac{1}{2}\lambda u_{\mu r}^{2}, \tag{14}$$

$$g_{i}(v_{ir})=\sum_{(\mu i)\in\partial ir}\hat{g}_{(\mu i)\rightarrow ir}(v_{ir})+\frac{1}{2}\lambda v_{ir}^{2}. \tag{15}$$

$$u_{\mu r}^{*}=\mathop{\rm arg\,min}_{u_{\mu r}}\left\{f_{\mu}(u_{\mu r})\right\}, \tag{16}$$

$$v_{ir}^{*}=\mathop{\rm arg\,min}_{v_{ir}}\left\{g_{i}(v_{ir})\right\}. \tag{17}$$

Inline symbols: $\hat{f}_{(\mu i)\rightarrow\mu r}(u_{\mu r})$, $\hat{g}_{(\mu i)\rightarrow ir}(v_{ir})$, $f_{\mu r\rightarrow(\mu i)}(u_{\mu r})$, $g_{ir\rightarrow(\mu i)}(v_{ir})$

$$\hat{f}_{(\mu i)\rightarrow\mu r}(u_{\mu r})=\min_{\{{\bf u}_{\mu}\}\backslash u_{\mu r}}\left\{\frac{1}{2}({\bf u}_{\mu}^{r})^{T}\left(\Gamma_{{\bf a}^{r}_{\mu\rightarrow(\mu i)}}+{\bf v}_{i}^{r}({\bf v}_{i}^{r})^{T}\right){\bf u}_{\mu}^{r}-\left\{{\bf b}^{r}_{\mu\rightarrow(\mu i)}+(y_{\mu i}-u_{\mu r}v_{ir}){\bf v}_{i}^{r}\right\}^{T}{\bf u}_{\mu}^{r}+\frac{1}{2}(y_{\mu i}-u_{\mu r}v_{ir})^{2}\right\}, \tag{22}$$

$$({\bf u}_{\mu}^{r})^{*}=\left(\Gamma_{{\bf a}^{r}_{\mu\rightarrow(\mu i)}}+{\bf v}_{i}^{r}({\bf v}_{i}^{r})^{T}\right)^{-1}\left\{{\bf b}^{r}_{\mu\rightarrow(\mu i)}+(y_{\mu i}-u_{\mu r}v_{ir}){\bf v}_{i}^{r}\right\}. \tag{23}$$

$$\left(\Gamma_{{\bf a}^{r}_{\mu\rightarrow(\mu i)}}+{\bf v}_{i}^{r}({\bf v}_{i}^{r})^{T}\right)^{-1}=\Gamma_{{\bf a}^{r}_{\mu\rightarrow(\mu i)}}^{-1}-\frac{\Gamma_{{\bf a}^{r}_{\mu\rightarrow(\mu i)}}^{-1}{\bf v}_{i}^{r}({\bf v}_{i}^{r})^{T}\Gamma_{{\bf a}^{r}_{\mu\rightarrow(\mu i)}}^{-1}}{1+({\bf v}_{i}^{r})^{T}\Gamma_{{\bf a}^{r}_{\mu\rightarrow(\mu i)}}^{-1}{\bf v}_{i}^{r}}. \tag{24}$$

$$\hat{f}_{(\mu i)\rightarrow\mu r}(u_{\mu r})=\frac{1}{2}\,\frac{v_{ir}^{2}}{1+\chi_{(\mu i)}-\frac{v_{ir}^{2}}{a_{\mu r\rightarrow(\mu i)}+\lambda}}\,u_{\mu r}^{2}-\frac{y_{\mu i}-\Delta_{(\mu i)}+u_{\mu r\rightarrow(\mu i)}v_{ir}}{1+\chi_{(\mu i)}-\frac{v_{ir}^{2}}{a_{\mu r\rightarrow(\mu i)}+\lambda}}\,v_{ir}u_{\mu r}, \tag{25}$$

Inline symbols: $\chi_{(\mu i)}$, $\Delta_{(\mu i)}$, $u_{\mu r\rightarrow(\mu i)}$, $\hat{a}_{(\mu i)\rightarrow\mu r}$, $\hat{b}_{(\mu i)\rightarrow\mu r}$, $a_{\mu r\rightarrow(\mu i)}$, $b_{\mu r\rightarrow(\mu i)}$, $a_{\mu r}$, $b_{\mu r}$

Full text

Approximate matrix completion based on cavity method

Chihiro Noguchi and Yoshiyuki Kabashima

Department of Mathematical and Computing Science, Tokyo Institute of Technology, 2-12-1, Ookayama, Meguro-ku, Tokyo, Japan

Abstract

In order to solve large matrix completion problems with practical computational cost, an approximate approach based on matrix factorization has been widely used. Alternating least squares (ALS) and stochastic gradient descent (SGD) are the two major algorithms to this end. In this study, we propose two new algorithms, namely cavity-based matrix factorization (CBMF) and approximate cavity-based matrix factorization (ACBMF), which are developed based on the cavity method from statistical mechanics. ALS yields solutions in fewer iterations than SGD because its update rules are described in a closed form, although it entails a higher computational cost. The update rules of CBMF can also be written in a closed form, and its computational cost is lower than that of ALS. ACBMF is proposed to compensate for the relatively high memory cost of CBMF. We experimentally illustrate that the proposed methods outperform the two existing algorithms in terms of convergence speed per iteration, and that they work even when the observed entries are relatively few. Additionally, in contrast to SGD, (A)CBMF does not require scheduling of the learning rate.

1 Introduction

Recent technological advances have triggered the generation and accumulation of large amounts of data. In response to this trend, several methods have been proposed to extract useful information from such data, with significant results in various fields of science and engineering. A typical example is collaborative filtering, a methodology used in recommender systems [1]. As a concrete example, we consider a user-movie matrix $Y\in\mathbb{R}^{N\times M}$, where $N$ and $M$ denote the number of users and movies, respectively, and an entry $Y_{ij}$ of $Y$ denotes the rating given by user $i$ to movie $j$. Users normally evaluate only a small fraction of the movies, and thus most entries of $Y$ are missing. In this setting, the primary objective of matrix completion is to predict the missing entries.

A natural approach for this involves minimizing the rank of the matrix under constraints yielded by observed entries, and this is generally referred to as “low-rank matrix completion”. Unfortunately, it is NP-hard to literally solve the rank minimization problem. In order to practically overcome the difficulty, relaxation of matrix rank to nuclear norm was proposed [2]. Interestingly, it is guaranteed that the solution of the nuclear norm minimization is exactly in agreement with that of the original rank minimization if certain conditions are satisfied [3, 4, 5, 6]. The minimization of nuclear norm belongs to the class of convex optimization problems, and thus the optimal solution is determined via versatile semidefinite programming solvers when the matrix size is relatively small. However, in several realistic problems, matrix sizes are not so small, and computational and memory costs required by the nuclear norm minimization often exceed practically acceptable levels.

In order to deal with such situations, a non-convex approach using matrix factorization was proposed more recently [1]. When the objective matrix is factorized into two matrices of lower rank, the nuclear norm is evaluated as half the sum of their squared Frobenius norms, minimized over all such factorizations. The non-convex formulation significantly reduces the necessary computational and memory costs, although in general only local minima can be found. However, a recent study [7] indicated that, under a certain condition, the objective function of matrix factorization does not exhibit spurious local minima: with high probability, each local minimum is transformed to another via trivial operations such as permutations of columns/rows.

Two major algorithms, alternating least squares (ALS) [8, 9, 10] and stochastic gradient descent (SGD) [11, 12, 13], have been proposed for matrix factorization to date. The main objective of this study is to develop a new algorithm by borrowing an idea from the cavity method of statistical mechanics. Even if the absence of spurious local minima is guaranteed, the performance of the solution search is determined by the dynamical properties of the algorithm used. We experimentally illustrate that the proposed cavity-based algorithms exhibit better performance than the two existing algorithms, without delicate tuning of control parameters, when the number of observed data is relatively small.

Several extant studies apply the cavity method for the matrix factorization problems. An approximate message passing (AMP) based approach to generalized bilinear inference problem including the matrix completion was proposed in [14, 15]. A detailed derivation of AMP-type algorithms and performance analysis for the Bayes optimal cases are provided in [16]. Reference [17] presents an AMP based algorithm for low-rank matrix reconstruction and its application to K-means type clustering. All of these methods follow the Bayesian framework. The differences of the present study from these are as follows. We do not employ the Bayesian approach, and thus it is not necessary to select a prior distribution. Additionally, we focus on the matrix completion as a particular application of matrix factorization, and aim to develop efficient algorithms exploiting the properties of the specific problem.

The remainder is organized as follows. In section 2, the problem setting is detailed. In section 3, we explain the details of the proposed algorithm. In section 4, the performance of the proposed algorithms is illustrated via applications for synthetic and realistic data. The final section presents the summary.

2 Problem Setting

In the simplest case, low rank matrix completion is defined as follows:

$$\min_{X}\ \mathrm{rank}(X)\quad\text{subject to}\quad X_{\mu i}=Y_{\mu i},\ (\mu,i)\in\Omega, \tag{1}$$

where $X$ and $Y$ denote the decision variables and the observed entries, respectively, and $\Omega$ stands for the set of indices of the observed entries of $Y$. The problem is guaranteed to exhibit a unique solution with high probability when the size of $\Omega$ is sufficiently large. However, there is no known algorithm that solves (1) in practical time. Hence, relaxation of the matrix rank to the nuclear norm is typically employed, where the nuclear norm is defined as follows:

$$\|X\|_{*}=\sum_{k=1}^{\min\{N,M\}}\sigma_{k}, \tag{2}$$

where $\sigma_{k}$ denotes the $k$-th largest singular value of $X$. In Lagrange form, the nuclear norm relaxation converts (1) as follows:

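$$\min_{X\in\mathbb{R}^{N\times M}}\ \frac{1}{2}\sum_{(\mu,i)\in\Omega}(Y_{\mu i}-X_{\mu i})^{2}+\lambda\|X\|_{*}. \tag{3}$$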

The solution of (3) is determined in a polynomial time via versatile solvers of semi-definite programming. However, such solvers require singular value decomposition per iteration, and their computational and memory costs easily exceed practically permissible levels when the system size increases.

A popular approach to overcome this disadvantage involves using non-convex relaxation. Let us assume that rank of $X$ is $R$ , and this means that $X$ is expressed as $X=UV^{T}$ by using two smaller matrices as $U\in\mathbb{R}^{N\times R},V\in\mathbb{R}^{M\times R}$ . An attractive property of the nuclear norm is that it is evaluated by another norm as follows:

$$\|X\|_{*}=\inf\left\{\frac{1}{2}\|U\|_{F}^{2}+\frac{1}{2}\|V\|_{F}^{2}\ :\ X=UV^{T}\right\}, \tag{4}$$

where $\|A\|_{F}=\sqrt{\sum_{ij}A_{ij}^{2}}$ denotes the Frobenius norm of matrix $A$ [18]. We insert (4) into (3) to yield a non-convex version of (3) as

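$$\min_{U\in\mathbb{R}^{N\times R},\,V\in\mathbb{R}^{M\times R}}\ \frac{1}{2}\sum_{(\mu,i)\in\Omega}\left(Y_{\mu i}-\sum_{r=1}^{R}U_{\mu r}V_{ir}\right)^{2}+\frac{1}{2}\lambda\|U\|_{F}^{2}+\frac{1}{2}\lambda\|V\|_{F}^{2}. \tag{5}$$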

In contrast to (3), (5) ceases to be convex, and this implies that multiple local minima can exist. However, it was recently illustrated that a spurious local minimum is absent with a high probability if a few conditions are satisfied [7].
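The factorized form of the nuclear norm used to pass from (3) to (5) can be checked numerically. The sketch below assumes the squared-Frobenius form of the identity and a balanced factorization built from the singular value decomposition; the matrix sizes and variable names are illustrative choices, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
N, M, R = 6, 5, 2
X = rng.standard_normal((N, R)) @ rng.standard_normal((R, M))  # a rank-R matrix

# Nuclear norm: sum of the singular values of X.
A, s, Bt = np.linalg.svd(X, full_matrices=False)
nuclear = s.sum()

# Balanced factorization X = U V^T with U = A sqrt(S), V = B sqrt(S),
# keeping only the R leading singular triplets.
U = A[:, :R] * np.sqrt(s[:R])
V = Bt[:R].T * np.sqrt(s[:R])
bound = 0.5 * np.linalg.norm(U, "fro") ** 2 + 0.5 * np.linalg.norm(V, "fro") ** 2

# An unbalanced factorization of the same X gives a strictly larger value.
U2, V2 = 3.0 * U, V / 3.0
bound2 = 0.5 * np.linalg.norm(U2, "fro") ** 2 + 0.5 * np.linalg.norm(V2, "fro") ** 2

print(nuclear, bound, bound2)
```

The balanced factorization attains the nuclear norm, while rescaling the factors in opposite directions keeps $X$ unchanged but increases the penalty, which is why the infimum appears in the identity.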

Without any constraints, the degree of freedom of this problem is given as $R(N+M)$ . This implies that the number of observations $|\Omega|$ must not be less than $R(N+M)$ to determine a solution. In the following, we assume that this condition is satisfied.

Two major algorithms, ALS and SGD, are known to solve (5). ALS is widely known as a standard approach to non-convex optimization problems due to its simplicity. When $V$ is fixed, each row of $U$ is independently calculated, and the objective function (5) is then expressed as follows:

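$$\min_{{\bf u}_{\mu}}\ \frac{1}{2}\sum_{i\in\partial\mu}(y_{\mu i}-{\bf u}_{\mu}^{T}{\bf v}_{i})^{2}+\frac{1}{2}\lambda\|{\bf u}_{\mu}\|^{2}, \tag{6}$$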

where ${\bf u}_{\mu}$ and ${\bf v}_{i}$ denote the $\mu$ -th and $i$ -th rows of $U$ and $V$ respectively, and $\partial\mu$ denotes a set of observed indices of $\mu$ -th row of $Y$ . Thus, (6) leads to the following closed form solution:

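$${\bf u}_{\mu}^{*}=\left(\sum_{i\in\partial\mu}{\bf v}_{i}{\bf v}_{i}^{T}+\lambda{\bf I}_{R}\right)^{-1}\sum_{i\in\partial\mu}y_{\mu i}{\bf v}_{i}, \tag{7}$$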

where ${\bf I}_{R}$ denotes $R\times R$ unit matrix. Subsequently, we fix $U$ and solve $V$ in turn, and ALS repeats this operation until convergence. The main advantage of ALS is the ease of parallelization although the computational cost per iteration exceeds that of SGD.
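A minimal sketch of one ALS sweep, assuming a dense array `Y` with a boolean `mask` marking the observed entries. The function names, toy sizes, and the value of `lam` are illustrative assumptions; this is not the authors' C++ implementation.

```python
import numpy as np

def als_update_rows(Y, mask, V, lam):
    """Solve the regularized least-squares problem for every row of U with V fixed."""
    N, R = Y.shape[0], V.shape[1]
    U = np.zeros((N, R))
    for mu in range(N):
        obs = np.flatnonzero(mask[mu])       # observed columns of row mu
        Vo = V[obs]                          # rows v_i for the observed columns
        A = Vo.T @ Vo + lam * np.eye(R)      # sum_i v_i v_i^T + lam * I_R
        b = Vo.T @ Y[mu, obs]                # sum_i y_{mu i} v_i
        U[mu] = np.linalg.solve(A, b)
    return U

def objective(Y, mask, U, V, lam):
    """Regularized squared error over the observed entries only."""
    resid = mask * (Y - U @ V.T)
    return 0.5 * np.sum(resid ** 2) + 0.5 * lam * (np.sum(U ** 2) + np.sum(V ** 2))

rng = np.random.default_rng(0)
N, M, R, lam = 30, 20, 3, 0.1
Y = rng.standard_normal((N, R)) @ rng.standard_normal((R, M))
mask = rng.random((N, M)) < 0.5
U = rng.standard_normal((N, R))
V = rng.standard_normal((M, R))

before = objective(Y, mask, U, V, lam)
U = als_update_rows(Y, mask, V, lam)
middle = objective(Y, mask, U, V, lam)
V = als_update_rows(Y.T, mask.T, U, lam)     # same routine with the roles swapped
after = objective(Y, mask, U, V, lam)
print(before, middle, after)
```

Each row requires forming and solving an $R\times R$ system, which is where the $R^{2}$ and $R^{3}$ factors in the ALS cost arise. Because every half-sweep is an exact minimization over one factor, the objective cannot increase.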

The other algorithm, SGD, is also widely known as a standard algorithm for continuous optimization problems. Specifically, SGD computes a gradient only with respect to pairwise indices $(\mu,i)\in\Omega$ selected at random per iteration, and the gradient updates the corresponding variables based on the given learning rate $\eta$ as follows:

[TABLE]

The advantage of this algorithm is the low computational cost of each elemental update. However, it has two major disadvantages. The first is that an overwriting issue can arise when several updates are conducted in parallel. The second is that it is highly sensitive to the learning rate. Distributed SGD (DSGD) [13] (called Jellyfish in [13]) overcomes the first disadvantage by dividing the observed matrix into a few blocks, considering a set of independent blocks, and updating a pair of indices from each block in the set. However, the second disadvantage remains: the learning rate must be carefully tuned and scheduled, and its adjustment significantly affects the convergence of the algorithm.
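A minimal sketch of the elemental SGD update, assuming the usual per-entry gradient of (5) in which the regularization term is applied to the two rows touched at each step. How the paper weights the regularization per update is not assumed here; the function name, the constant learning rate, and the toy data are illustrative.

```python
import numpy as np

def sgd_step(U, V, mu, i, y, lam, eta):
    """One stochastic update on the observed pair (mu, i); modifies U and V in place."""
    u, v = U[mu].copy(), V[i].copy()
    err = y - u @ v
    U[mu] = u + eta * (err * v - lam * u)
    V[i] = v + eta * (err * u - lam * v)

rng = np.random.default_rng(0)
N, M, R = 20, 15, 2
Y = rng.standard_normal((N, R)) @ rng.standard_normal((R, M))
omega = [(mu, i) for mu in range(N) for i in range(M) if rng.random() < 0.6]
U = 0.1 * rng.standard_normal((N, R))
V = 0.1 * rng.standard_normal((M, R))

def mse(U, V):
    """Mean squared error over the observed pairs."""
    return np.mean([(Y[mu, i] - U[mu] @ V[i]) ** 2 for mu, i in omega])

mse_start = mse(U, V)
for epoch in range(150):
    for k in rng.permutation(len(omega)):    # visit observed pairs in random order
        mu, i = omega[k]
        sgd_step(U, V, mu, i, Y[mu, i], lam=0.01, eta=0.02)
mse_end = mse(U, V)
print(mse_start, mse_end)
```

Two parallel workers that pick pairs sharing a row or a column would write to the same `U[mu]` or `V[i]`, which is the overwriting issue mentioned above. The constant `eta` used here is the quantity that normally has to be tuned and scheduled.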

3 A Cavity-Based Approach

In order to explore the possibility of achieving a better performance, we develop an algorithm for the matrix factorization based on the cavity method [19]. Thus, we first express the variable dependence of (5) by a factor graph (Figure 1). The variable nodes are expressed by circles and denote entries of two matrices $U$ and $V$ while the factor nodes are represented by squares and stand for factors constituting (5), namely, $(1/2)\left(Y_{\mu i}-\sum_{r=1}^{R}U_{\mu r}V_{ir}\right)^{2}$ , $(\lambda/2)U_{\mu r}^{2}$ and $(\lambda/2)V_{ir}^{2}$ . An edge for a pair of variable and factor nodes is provided if and only if the variable and factor nodes are directly related.
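As a rough illustration of the factor graph described above, the following sketch builds the neighbourhoods of the data factors from a toy observation set and counts the edges between data factors and variable nodes. The helper names and the toy data are hypothetical, and the regularization factors are not counted.

```python
from collections import defaultdict

def build_neighbourhoods(omega):
    """Return the observed columns of each row and the observed rows of each column."""
    row_nbrs = defaultdict(list)   # mu -> [i, ...]
    col_nbrs = defaultdict(list)   # i  -> [mu, ...]
    for mu, i in omega:
        row_nbrs[mu].append(i)
        col_nbrs[i].append(mu)
    return dict(row_nbrs), dict(col_nbrs)

def count_data_edges(omega, rank):
    """Each data factor (mu, i) touches u_{mu,1..R} and v_{i,1..R}: 2R edges per observation."""
    return 2 * rank * len(omega)

# Toy example: 3 users, 3 movies, 4 observed ratings, R = 2.
omega = [(0, 0), (0, 2), (1, 1), (2, 2)]
row_nbrs, col_nbrs = build_neighbourhoods(omega)
n_edges = count_data_edges(omega, rank=2)
print(row_nbrs, col_nbrs, n_edges)
```

The edge count grows in proportion to $|\Omega|R$, so any scheme that stores one quantity per edge needs memory of that order.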

The basic idea of the cavity method is to approximate the multivariate minimization problem (5) via a set of minimization problems with respect to single variables. Hence, we introduce “cavity objective functions” $f_{\mu r\rightarrow(\mu i)}(u_{\mu r})$ and $g_{ir\rightarrow(\mu i)}(v_{ir})$. The function $f_{\mu r\rightarrow(\mu i)}(u_{\mu r})$ denotes the objective function after the minimization with respect to all variables other than $u_{\mu r}$ is performed in the “$(\mu i)$-cavity system” that is defined by removing $(1/2)\left(Y_{\mu i}-\sum_{r=1}^{R}U_{\mu r}V_{ir}\right)^{2}$ from (5), and similarly for $g_{ir\rightarrow(\mu i)}(v_{ir})$. The summation of the cavity objective functions and $(1/2)\left(Y_{\mu i}-\sum_{r=1}^{R}U_{\mu r}V_{ir}\right)^{2}$ approximates the full objective function of (5). Conversely, we remove the contribution of $f_{\mu r\rightarrow(\mu i)}(u_{\mu r})$ from the full summation and minimize the resulting function with respect to all variables except for $u_{\mu r}$. This yields the “cavity bias function” $\hat{f}_{(\mu i)\rightarrow\mu r}(u_{\mu r})$, which denotes the effective influence of the factor $(1/2)\left(Y_{\mu i}-\sum_{r=1}^{R}U_{\mu r}V_{ir}\right)^{2}$ on the variable $u_{\mu r}$, and similarly for $\hat{g}_{(\mu i)\rightarrow ir}(v_{ir})$. The summation of the cavity bias functions with the exception of $\hat{f}_{(\mu i)\rightarrow\mu r}(u_{\mu r})$, together with $(\lambda/2)u_{\mu r}^{2}$, yields $f_{\mu r\rightarrow(\mu i)}(u_{\mu r})$, and similarly for $g_{ir\rightarrow(\mu i)}(v_{ir})$. They constitute a closed set of functional equations to determine the cavity objective and bias functions as follows:

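$$\hat{f}_{(\mu i)\rightarrow\mu r}(u_{\mu r})=\min_{\{{\bf u}_{\mu},{\bf v}_{i}\}\backslash u_{\mu r}}\left\{\frac{1}{2}\left(y_{\mu i}-\sum_{s}u_{\mu s}v_{is}\right)^{2}+\sum_{s\neq r}f_{\mu s\rightarrow(\mu i)}(u_{\mu s})+\sum_{s}g_{is\rightarrow(\mu i)}(v_{is})\right\}, \tag{10}$$

$$\hat{g}_{(\mu i)\rightarrow ir}(v_{ir})=\min_{\{{\bf u}_{\mu},{\bf v}_{i}\}\backslash v_{ir}}\left\{\frac{1}{2}\left(y_{\mu i}-\sum_{s}u_{\mu s}v_{is}\right)^{2}+\sum_{s}f_{\mu s\rightarrow(\mu i)}(u_{\mu s})+\sum_{s\neq r}g_{is\rightarrow(\mu i)}(v_{is})\right\}, \tag{11}$$

$$f_{\mu r\rightarrow(\mu i)}(u_{\mu r})=\sum_{(\mu j)\in\partial\mu r\backslash(\mu i)}\hat{f}_{(\mu j)\rightarrow\mu r}(u_{\mu r})+\frac{1}{2}\lambda u_{\mu r}^{2}, \tag{12}$$

$$g_{ir\rightarrow(\mu i)}(v_{ir})=\sum_{(\nu i)\in\partial ir\backslash(\mu i)}\hat{g}_{(\nu i)\rightarrow ir}(v_{ir})+\frac{1}{2}\lambda v_{ir}^{2}, \tag{13}$$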

where ${\bf u}_{\mu}$ and ${\bf v}_{i}$ denote the $\mu$ -th and $i$ -th rows of $U$ and $V$ , respectively, and $A\backslash a$ generally indicates a set that is defined via eliminating an element $a$ from a set $A$ . The indices of factor nodes are denoted with parentheses while those of variable nodes are not. The notation $\partial\mu r$ stands for the set of factor nodes that directly connect variable node indexed by $\mu r$ . After determining the cavity objective and bias functions from (10)-(13), “marginal” objective functions for each variable are provided as follows:

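$$f_{\mu}(u_{\mu r})=\sum_{(\mu i)\in\partial\mu r}\hat{f}_{(\mu i)\rightarrow\mu r}(u_{\mu r})+\frac{1}{2}\lambda u_{\mu r}^{2}, \tag{14}$$

$$g_{i}(v_{ir})=\sum_{(\mu i)\in\partial ir}\hat{g}_{(\mu i)\rightarrow ir}(v_{ir})+\frac{1}{2}\lambda v_{ir}^{2}. \tag{15}$$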

Thus, entries of the factorized matrices are evaluated as follows:

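$$u_{\mu r}^{*}=\mathop{\rm arg\,min}_{u_{\mu r}}\left\{f_{\mu}(u_{\mu r})\right\}, \tag{16}$$

$$v_{ir}^{*}=\mathop{\rm arg\,min}_{v_{ir}}\left\{g_{i}(v_{ir})\right\}. \tag{17}$$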

3.1 Derivation of the algorithm

Two issues are emphasized here. First, when the factor graph does not contain any cycles, the solution given by the cavity method is exact, whereas cycles generally exist in the matrix factorization problem. However, if the positions of the observed entries are randomly selected and their number is limited to $O(N)$, as assumed in the following, then the resulting factor graph can be regarded as a sparse random graph, and the lengths of the cycles typically scale as $O(\ln N)$ as the system size $N$ increases. Therefore, it is reasonable to expect that the cavity method yields reasonably accurate approximations for large $N$, as the effect of the cycles becomes negligible. Second, solving (10)-(13) is technically difficult since they are functional equations. In order to overcome this difficulty, we parameterize the cavity objective and bias functions in the form of quadratic functions as follows:

[TABLE]

However, the insertion of (18)-(21) into (10)-(13) does not yield a closed set of equations to determine the parameters, which indicates that a further approximation is required. Hence, we assign ${\bf v}_{i}$ its value in the previous step to solve the minimization problem of (10), and treat (11) similarly. This leads to quadratic forms with respect to ${\bf u}^{r}_{\mu}$ and ${\bf v}^{r}_{i}$ from (10) and (11), respectively. Here, ${\bf u}^{r}_{\mu}$ denotes the vector obtained by excluding $u_{\mu r}$ from ${\bf u}_{\mu}$, and ${\bf v}^{r}_{i}$ is defined similarly. Accordingly, when ${\bf v}_{i}$ is fixed, equation (10) is re-expressed as follows:

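$$\hat{f}_{(\mu i)\rightarrow\mu r}(u_{\mu r})=\min_{\{{\bf u}_{\mu}\}\backslash u_{\mu r}}\left\{\frac{1}{2}({\bf u}_{\mu}^{r})^{T}\left(\Gamma_{{\bf a}^{r}_{\mu\rightarrow(\mu i)}}+{\bf v}_{i}^{r}({\bf v}_{i}^{r})^{T}\right){\bf u}_{\mu}^{r}-\left\{{\bf b}^{r}_{\mu\rightarrow(\mu i)}+(y_{\mu i}-u_{\mu r}v_{ir}){\bf v}_{i}^{r}\right\}^{T}{\bf u}_{\mu}^{r}+\frac{1}{2}(y_{\mu i}-u_{\mu r}v_{ir})^{2}\right\}, \tag{22}$$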

where ${\bf a}^{r}_{\mu\rightarrow(\mu i)}$ denotes the vector obtained by excluding $a_{\mu r\rightarrow(\mu i)}$ from ${\bf a}_{\mu\rightarrow(\mu i)}=(a_{\mu 1\rightarrow(\mu i)},...,a_{\mu R\rightarrow(\mu i)})$, and similarly for ${\bf b}^{r}_{\mu\rightarrow(\mu i)}$. The matrices $\Gamma_{{\bf a}^{r}_{\mu\rightarrow(\mu i)}}$ and $\Gamma_{{\bf b}^{r}_{\mu\rightarrow(\mu i)}}$ denote ${\rm diag}({\bf a}^{r}_{\mu\rightarrow(\mu i)}+\lambda{\bf 1})$ and ${\rm diag}({\bf b}^{r}_{\mu\rightarrow(\mu i)})$, respectively.

The minimization problem in (22) is solved as follows:

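$$({\bf u}_{\mu}^{r})^{*}=\left(\Gamma_{{\bf a}^{r}_{\mu\rightarrow(\mu i)}}+{\bf v}_{i}^{r}({\bf v}_{i}^{r})^{T}\right)^{-1}\left\{{\bf b}^{r}_{\mu\rightarrow(\mu i)}+(y_{\mu i}-u_{\mu r}v_{ir}){\bf v}_{i}^{r}\right\}. \tag{23}$$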

Based on Sherman–Morrison formula, the inverse matrix in (23) is re-expressed as follows:

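$$\left(\Gamma_{{\bf a}^{r}_{\mu\rightarrow(\mu i)}}+{\bf v}_{i}^{r}({\bf v}_{i}^{r})^{T}\right)^{-1}=\Gamma_{{\bf a}^{r}_{\mu\rightarrow(\mu i)}}^{-1}-\frac{\Gamma_{{\bf a}^{r}_{\mu\rightarrow(\mu i)}}^{-1}{\bf v}_{i}^{r}({\bf v}_{i}^{r})^{T}\Gamma_{{\bf a}^{r}_{\mu\rightarrow(\mu i)}}^{-1}}{1+({\bf v}_{i}^{r})^{T}\Gamma_{{\bf a}^{r}_{\mu\rightarrow(\mu i)}}^{-1}{\bf v}_{i}^{r}}. \tag{24}$$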
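The Sherman–Morrison step is what removes the matrix inversion: because the matrix being updated is diagonal, the inverse of the rank-one update needs only element-wise operations. The sketch below checks this numerically with random stand-ins for the cavity parameters. The variable names and values are illustrative, and the parameters are assumed positive.

```python
import numpy as np

rng = np.random.default_rng(0)
dim, lam = 4, 0.5                      # dim plays the role of R - 1
a = rng.random(dim)                    # stand-in for the entries of a^r (assumed positive)
v = rng.standard_normal(dim)           # stand-in for v_i^r

gamma_inv = 1.0 / (a + lam)            # diagonal of Gamma^{-1}, formed element-wise
chi = np.sum(v ** 2 * gamma_inv)       # (v^r)^T Gamma^{-1} v^r
w = gamma_inv * v                      # Gamma^{-1} v^r
sm_inverse = np.diag(gamma_inv) - np.outer(w, w) / (1.0 + chi)

# Reference: invert the rank-one update directly.
direct_inverse = np.linalg.inv(np.diag(a + lam) + np.outer(v, v))
max_gap = np.max(np.abs(sm_inverse - direct_inverse))
print(max_gap)
```

Only the scalar `chi` and the vector `w` are needed to apply the inverse to a vector, so the cost per factor is linear in $R$ rather than cubic.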

We insert (23) and (24) into (22) to yield the following expression:

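$$\hat{f}_{(\mu i)\rightarrow\mu r}(u_{\mu r})=\frac{1}{2}\,\frac{v_{ir}^{2}}{1+\chi_{(\mu i)}-\frac{v_{ir}^{2}}{a_{\mu r\rightarrow(\mu i)}+\lambda}}\,u_{\mu r}^{2}-\frac{y_{\mu i}-\Delta_{(\mu i)}+u_{\mu r\rightarrow(\mu i)}v_{ir}}{1+\chi_{(\mu i)}-\frac{v_{ir}^{2}}{a_{\mu r\rightarrow(\mu i)}+\lambda}}\,v_{ir}u_{\mu r}, \tag{25}$$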

where $\chi_{(\mu i)}$ , $\Delta_{(\mu i)}$ and $u_{\mu r\rightarrow(\mu i)}$ are defined as follows:

[TABLE]
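The reduction to a one-variable quadratic can be checked numerically. The sketch below rests on assumptions that are not quoted from the paper. The cavity objective is taken to be $f_{\mu s\rightarrow(\mu i)}(u)=\frac{1}{2}(a_{\mu s\rightarrow(\mu i)}+\lambda)u^{2}-b_{\mu s\rightarrow(\mu i)}u$. The three quantities are taken to be $u_{\mu s\rightarrow(\mu i)}=b_{\mu s\rightarrow(\mu i)}/(a_{\mu s\rightarrow(\mu i)}+\lambda)$, $\chi_{(\mu i)}=\sum_{s}v_{is}^{2}/(a_{\mu s\rightarrow(\mu i)}+\lambda)$ and $\Delta_{(\mu i)}=\sum_{s}u_{\mu s\rightarrow(\mu i)}v_{is}$, which is what the algebra gives under that parameterization. All numerical values are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(1)
R, lam, r = 5, 0.3, 2
a = rng.random(R) + 0.1            # assumed cavity curvatures a_{mu s -> (mu i)}
b = rng.standard_normal(R)         # assumed cavity linear terms b_{mu s -> (mu i)}
v = rng.standard_normal(R)         # v_i, held at its previous value
y = 1.7                            # observed entry y_{mu i}

def direct(u_r):
    """Minimize (1/2)(y - u.v)^2 + sum_{s != r} f_s(u_s) over all u_s with s != r."""
    keep = np.arange(R) != r
    G = np.diag(a[keep] + lam) + np.outer(v[keep], v[keep])
    e = y - u_r * v[r]
    c = b[keep] + e * v[keep]
    u_rest = np.linalg.solve(G, c)
    return 0.5 * e ** 2 + 0.5 * u_rest @ G @ u_rest - c @ u_rest

# Assumed definitions (derived under the parameterization stated above).
u_cav = b / (a + lam)
chi = np.sum(v ** 2 / (a + lam))
delta = np.sum(u_cav * v)
denom = 1.0 + chi - v[r] ** 2 / (a[r] + lam)
a_hat = v[r] ** 2 / denom
b_hat = (y - delta + u_cav[r] * v[r]) * v[r] / denom

def closed_form(u_r):
    """Quadratic in u_r with the coefficients of the reduced cavity bias function."""
    return 0.5 * a_hat * u_r ** 2 - b_hat * u_r

# The two expressions should differ only by a constant independent of u_r.
grid = np.linspace(-2.0, 2.0, 9)
gaps = np.array([direct(u) - closed_form(u) for u in grid])
spread = gaps.max() - gaps.min()
print(spread)
```

A vanishing `spread` means the direct minimization and the closed-form quadratic agree up to an additive constant, which does not affect the location of the minimum.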

From the equations (18) and (25), we obtain the following:

[TABLE]

Further, we insert (18) and (20) into (12) to yield the following expression:

[TABLE]

where $a_{\mu r}$ and $b_{\mu r}$ are defined as follows:

[TABLE]

Finally, entries of the factorized matrices $u_{\mu r}^{*}$ are re-expressed from the equation (16) as follows:

[TABLE]

Similarly, we can re-express equations with respect to $\hat{c}_{(\mu i)\rightarrow ir},\hat{d}_{(\mu i)\rightarrow ir}$ and $c_{ir\rightarrow(\mu i)},d_{ir\rightarrow(\mu i)}$ based on (11),(13) and (19),(21).

In summary, the resulting equations are expressed as follows:

• Update equations for $U$:

[TABLE]

• Update equations for $V$:

[TABLE]

Here, $t$ denotes the counter index for the update. It should be noted that in order to update variables for $V$ at time $t$ , $u_{\mu r}^{t+1}$ is used instead of $u_{\mu r}^{t}$ . We term the algorithm composed of (36)-(55) as cavity-based matrix factorization (CBMF).

The computational cost per update of each equation is $O(|\Omega|R)$ and the necessary memory cost corresponds to $O(|\Omega|R)$ . The computational cost is competitive, and this is discussed later. Conversely, the necessary memory cost of CBMF exceeds those of ALS and SGD (Table 1). Although this is a disadvantage of CBMF, its necessary memory size is reduced to that of ALS and SGD by utilizing an approximation that is similar to that for deriving AMP from belief propagation [20] as shown below.

3.2 Derivation of the approximate algorithm

CBMF entails $O(|\Omega|R)$ memory cost, and this is equivalent to the number of edges in the factor graph. When $R$ and $c$ are sufficiently large, where $c$ denotes the average number of observed entries per column (see section 4.1), the effect caused by omitting a single variable node is expected to be negligible. Thus, the variables corresponding to the edges can be replaced by those corresponding to nodes. The goal of this subsection is to derive update equations with respect to the variables corresponding to the nodes. In the following, $R$ and $c$ are assumed to be sufficiently large.

The equation (38) is approximately re-expressed as follows:

[TABLE]

where $\chi_{(\mu i)}$ is also approximated by ignoring one of $c$ terms as follows:

[TABLE]

Operating $\sum_{(\mu i)\in\partial\mu r}$ on both sides of (56) yields

[TABLE]

Similarly, the equation (39) is re-expressed as follows:

[TABLE]

where $u_{\mu s\rightarrow(\mu i)}$ is also approximated by ignoring one of $R$ or $c$ terms as follows:

[TABLE]

where $\phi_{(\mu i)}$ is defined as follows:

[TABLE]

The second line is derived from (57) and (60). We insert (60) into (59) and operate $\sum_{(\mu i)\in\partial\mu r}$ on both sides to yield the following expression:

[TABLE]

Similarly, the update equations (46)-(55) are re-expressed by the same procedure. Finally, the approximate update equations are summarized as follows:

• Update equations for $U$:

[TABLE]

• Update equations for $V$:

[TABLE]

We term the algorithm composed of (64)-(74) as the approximate cavity-based matrix factorization (ACBMF). The necessary memory cost to execute the algorithm is $O((N+M)R+|\Omega|)$ , which is equivalent to the number of nodes in the factor graph. When compared to CBMF, ACBMF significantly reduces the required memory cost while the necessary computational cost is unchanged.

Additionally, one can illustrate that the fixed point of ACBMF is in agreement with that of ALS. Equation (62) is solved with respect to $\phi_{(\mu i)}$ , and we obtain the following expression:

[TABLE]

We insert (58) and (75) into (63) to yield the following expression:

[TABLE]

From equation (35), we obtain the following expression:

[TABLE]

We solve (77) with respect to ${\bf u}_{r}$ to yield the following expression:

[TABLE]

and this is equivalent to (7). Similarly for ${\bf v}^{*}_{r}$ .

In contrast to ALS, ACBMF does not completely optimize $U$ ($V$) for a given $V$ ($U$) in each step, and thus the necessary computation is reduced. This may appear to decrease the convergence speed. However, complete optimization does not necessarily bring $U$ ($V$) to a better state when $V$ ($U$) is far from the convergent solution, so it is not advisable to spend significant computational cost on it. Additionally, complete optimization in each step tends to strengthen time correlations of the variables, which may make the cavity treatment inappropriate. The results of the experiments shown below indicate that this concern is justified.

3.3 Comparison with ALS and SGD

We briefly compare (A)CBMF with ALS and SGD. ALS and SGD are algorithms that attempt to iteratively minimize the multivariate objective function (5). Although their working principle is natural, the performance of these algorithms can be negatively affected by the self-feedback effect caused by cycles from the graph. Conversely, (A)CBMF reduces such effect by introducing the seemingly artificial cavity functions, and this may lead to the performance improvement. In a manner similar to ALS, (A)CBMF can also be easily parallelized, and is free from learning parameters unlike SGD.

The computational and memory costs of the four algorithms are summarized in Table 1. The computational cost is defined as that necessary to update all variables at least once. Under this definition, the computational cost of SGD appears the lowest, but SGD only updates the variables along the gradients. Conversely, CBMF and ALS update them with closed-form expressions, and thus they can be expected to converge in fewer iterations. A comparison of (A)CBMF and ALS indicates that the computational cost of the former is lower. Conversely, the memory cost of CBMF is the highest, while that of ACBMF is identical to that of ALS and SGD.
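To get a feel for the orders of magnitude in Table 1, the sketch below evaluates the leading-order expressions for the MovieLens 1M sizes listed in Table 2. Constant factors are ignored, the rank of 10 is an illustrative choice, and the function names are hypothetical, so the ratios are only indicative.

```python
def cost_per_sweep(n_obs, n_rows, n_cols, rank):
    """Leading-order operation counts from Table 1, ignoring constant factors."""
    cavity = n_obs * rank                                          # CBMF and ACBMF
    als = n_obs * rank ** 2 + (n_rows + n_cols) * rank ** 3        # ALS
    return cavity, als

def memory_footprint(n_obs, n_rows, n_cols, rank):
    """Leading-order counts of stored numbers from Table 1."""
    cbmf = n_obs * rank                                            # one value per edge
    node_based = (n_rows + n_cols) * rank + n_obs                  # ACBMF, ALS, SGD
    return cbmf, node_based

# MovieLens 1M sizes from Table 2; rank = 10 is an illustrative choice.
cavity, als = cost_per_sweep(1_000_209, 6_040, 3_900, 10)
cbmf_mem, node_mem = memory_footprint(1_000_209, 6_040, 3_900, 10)
print(als / cavity)         # ratio of ALS to (A)CBMF operation counts
print(cbmf_mem / node_mem)  # ratio of CBMF to node-based memory
```

Under these assumptions both ratios are of the order of the rank, which is consistent with the extra factor of $R$ in the ALS cost and in the CBMF memory cost.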

4 Numerical Experiments

4.1 Synthetic Data Analysis

In order to systematically compare the performance of the four algorithms, namely ALS, SGD, and (A)CBMF (C++ implementation is available at [21]), we performed extensive numerical experiments using synthetic datasets. A dataset for the experiment was prepared as follows: The original matrix $Y^{0}\in\mathbb{R}^{N\times M}$ is provided from $U^{0}\in\mathbb{R}^{N\times R},V^{0}\in\mathbb{R}^{M\times R}$ , and $Z\in\mathbb{R}^{N\times M}$ as $Y^{0}=U^{0}(V^{0})^{T}+Z$ , where entries of $U^{0}$ and $V^{0}$ are independently sampled from the standard Gaussian distribution while those of $Z$ are independently and identically distributed based on a Gaussian of zero mean and variance 0.09. We randomly select “observed entries” out of $Y^{0}$ with probability of $c/N$ where $c\sim O(1)$ denotes the average number of the observed entries per column. The collection of the observed entries constitutes the observed matrix $Y$ . We assume that true rank $R$ is known in advance.
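The data-generation procedure described above can be sketched as follows. The function name, the matrix sizes, and the choice to store unobserved entries as zeros are illustrative assumptions; the noise variance of 0.09 and the observation probability $c/N$ follow the text.

```python
import numpy as np

def make_synthetic(N, M, R, c, noise_var=0.09, seed=0):
    """Generate Y0 = U0 V0^T + Z and a random observation mask with density c/N."""
    rng = np.random.default_rng(seed)
    U0 = rng.standard_normal((N, R))
    V0 = rng.standard_normal((M, R))
    Z = np.sqrt(noise_var) * rng.standard_normal((N, M))
    Y0 = U0 @ V0.T + Z
    mask = rng.random((N, M)) < c / N      # each entry observed with probability c/N
    Y = np.where(mask, Y0, 0.0)            # unobserved entries stored as 0 (placeholder)
    return Y0, Y, mask

Y0, Y, mask = make_synthetic(N=2000, M=500, R=10, c=40)
avg_per_column = mask.sum(axis=0).mean()   # should be close to c
print(avg_per_column)
```

Since every column has $N$ entries, observing each with probability $c/N$ gives on average $c$ observations per column, independently of $N$.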

We evaluate the performance of the algorithms via the relative root mean square error (rRMSE) and the reconstruction rate. Given the effect of the noise $Z$, it is impossible to perfectly reconstruct $Y^{0}$ in the current setting. Therefore, we consider the estimated factorized matrices $U$ and $V$ as successful if $\sqrt{\sum_{\mu i}(y_{\mu i}^{0}-\sum_{r}u_{\mu r}v_{ir})^{2}}/\sqrt{\sum_{\mu i}(y_{\mu i}^{0})^{2}}\leq 0.15$ holds. The convergence of these algorithms is not guaranteed, and thus we attempt ten random initial conditions for each sample and algorithm and count a “success” if at least one of the ten initial conditions leads to successful reconstruction. Additionally, rRMSE is evaluated as the mean, over 50 samples, of the minimum value of $\sqrt{\sum_{\mu i}(y_{\mu i}^{0}-\sum_{r}u_{\mu r}v_{ir})^{2}}/\sqrt{\sum_{\mu i}(y_{\mu i}^{0})^{2}}$ among the ten initial conditions. The reconstruction rate denotes the fraction of successful reconstructions over the 50 samples.
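The two evaluation measures can be sketched as follows for a single sample, assuming the error is taken over all entries of $Y^{0}$ as the formula above indicates. The function names and the toy candidates are hypothetical.

```python
import numpy as np

def rrmse(Y0, U, V):
    """Relative root mean square error of the reconstruction U V^T against Y0."""
    return np.linalg.norm(Y0 - U @ V.T) / np.linalg.norm(Y0)

def reconstruction_stats(Y0, candidates, threshold=0.15):
    """Best rRMSE over several initial conditions and whether it counts as a success."""
    best = min(rrmse(Y0, U, V) for U, V in candidates)
    return best, bool(best <= threshold)

# Toy check: one candidate reproduces Y0 exactly, the other is unrelated noise.
rng = np.random.default_rng(0)
U0 = rng.standard_normal((8, 2))
V0 = rng.standard_normal((6, 2))
Y0 = U0 @ V0.T
candidates = [(rng.standard_normal((8, 2)), rng.standard_normal((6, 2))), (U0, V0)]
best, success = reconstruction_stats(Y0, candidates)
print(best, success)
```

Averaging `best` over the 50 samples gives the reported rRMSE, and averaging `success` gives the reconstruction rate.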

Figure 3 plots the experimental results as a function of the average number $c$ of observations per column for $R=10$. The figure indicates that (A)CBMF outperforms the other algorithms. Notably, (A)CBMF maintains a high reconstruction rate down to smaller values of $c$ than ALS, although the two are theoretically guaranteed to share the same fixed point. We speculate that this is because (A)CBMF weakens the self-feedback effect via the cavity treatment and by not performing a full optimization in each step. In order to verify this speculation, we examine how the reconstruction rate changes when the number of iterations of ACBMF in each step increases, and this is plotted in figure 3. When the iteration is repeated until convergence in each step, $U$ ($V$) is optimized for a given $V$ ($U$). If the speculation is correct, the performance should therefore deteriorate as the number of iterations increases, even though more computation is spent. The figure shows that this is actually the case and supports our speculation.

Figure 4 shows the results for rRMSE. The performance of SGD is significantly worse than that of (A)CBMF and ALS. This is potentially because the scheduling of the learning rate used in the SGD experiments is not optimally tuned. The default scheduling provided in a code distribution [22] leads to a very poor result, and thus we selected a better scheduling, although it is non-trivial to determine the optimal one. In contrast, (A)CBMF and ALS are free from such issues as they involve no scheduling of parameters. (A)CBMF exhibits slightly better performance than ALS. As with the reconstruction rate, the performance of ACBMF approaches that of ALS when the number of iterations per update increases (figure 4).

4.2 Real Data Analysis

We also examined the usefulness of the proposed algorithms by applying them to three benchmark datasets for recommender systems, namely MovieLens 1M, 10M, and 20M [23]. Specifically, the 1M dataset is composed of rating values $s$ from 1 to 5 with step 1, and the 10M and 20M datasets use values from 0.5 to 5 with step 0.5. Higher values correspond to higher evaluations for movies or music provided by users. Details of the datasets are summarized in Table 2.

The performance of each algorithm on each rating matrix is evaluated as follows. We randomly split the matrix entries into 10 groups, perform matrix factorization using the data of 9 of the 10 groups, and measure the performance of the obtained factorization using the data of the remaining group. We employ the root mean square error (RMSE) as the performance measure, and it is averaged over 50 samples of the experiment. In all the experiments, we set $R=10$.
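The evaluation protocol above can be sketched as follows. This is only an illustrative outline under stated assumptions: the names `holdout_split` and `rmse` are hypothetical, ratings are represented as `(user, item, rating)` tuples, and `predict` stands for whatever prediction rule the fitted factorization provides.

```python
import math
import random


def holdout_split(entries, n_groups=10, held_out=0, seed=0):
    """Randomly split entries into n_groups groups and hold one out.

    entries: list of (user, item, rating) tuples.
    Returns (train, test), where test is the held-out group and train is the
    union of the remaining n_groups - 1 groups.
    """
    rng = random.Random(seed)
    shuffled = list(entries)
    rng.shuffle(shuffled)
    # Striding through the shuffled list gives groups of near-equal size.
    groups = [shuffled[k::n_groups] for k in range(n_groups)]
    test = groups[held_out]
    train = [e for k, g in enumerate(groups) if k != held_out for e in g]
    return train, test


def rmse(test, predict):
    """Root mean square error of predict(user, item) on held-out ratings."""
    squared_error = sum((s - predict(mu, i)) ** 2 for mu, i, s in test)
    return math.sqrt(squared_error / len(test))
```

Repeating the split with different seeds and averaging the resulting RMSE values would correspond to the averaging over samples described in the text.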

Figures 5-7 show the performance of (A)CBMF, ALS, and SGD on the three datasets, plotting RMSE against the number of iterations. The figures indicate that all the algorithms finally achieve similar performance, although ALS requires the fewest iterations to converge. However, it should be noted that ALS requires a significantly higher computational cost per iteration than (A)CBMF and SGD (Table 1). Thus, (A)CBMF converges faster than the other algorithms in terms of actual time when $R$ is relatively large.

5 Summary

In summary, we developed matrix factorization algorithms based on the cavity method, abbreviated as CBMF and ACBMF. A comparison of the computational cost necessary to update all variables at least once indicates that the cost of SGD is the smallest of the three. Nevertheless, CBMF is competitive with SGD in terms of computational cost because CBMF updates variables in closed form, which generally reduces the number of iterations necessary for convergence. ALS also updates variables in closed form, but its computational cost exceeds that of CBMF because ALS requires a matrix inversion, which CBMF does not. In terms of memory cost, on the other hand, CBMF requires more capacity than the others, and thus we developed ACBMF by utilizing an approximation similar to that used for deriving AMP from belief propagation. The memory cost of ACBMF is identical to that of SGD and ALS.

Experiments involving synthetic data indicated that (A)CBMF exhibits better performance, without the need for parameter tuning, when the number of observed entries is not sufficiently large. The superior performance presumably stems from the reduction of self-feedback effects via the cavity treatment and from avoiding complete optimization in each update. Experiments using real-world datasets indicated that all algorithms achieve similar performance, although (A)CBMF converges faster than the other two in actual time when the rank $R$ is relatively large.

Future work includes generalization of CBMF to matrix factorization problems with additional constraints such as non-negative matrix factorization [24].

Acknowledgements

Useful discussion with Tomoyuki Obuchi is acknowledged. This study was partially supported by KAKENHI No.17H00764.

Appendix A Benchmark datasets

We performed numerical experiments on three benchmark datasets: the MovieLens 1M, 10M, and 20M datasets (https://movielens.org/). The characteristics of each dataset are summarized in Table 2.

Bibliography


[1] Yehuda Koren, Robert Bell, and Chris Volinsky. Matrix factorization techniques for recommender systems. Computer, (8):30–37, 2009.
[2] Maryam Fazel. Matrix rank minimization with applications. PhD thesis, Stanford University, 2002.
[3] Emmanuel J Candès and Benjamin Recht. Exact matrix completion via convex optimization. Foundations of Computational Mathematics, 9(6):717, 2009.
[4] Emmanuel J Candès and Terence Tao. The power of convex relaxation: Near-optimal matrix completion. IEEE Transactions on Information Theory, 56(5):2053–2080, 2010.
[5] Benjamin Recht. A simpler approach to matrix completion. Journal of Machine Learning Research, 12(Dec):3413–3430, 2011.
[6] Raghunandan H Keshavan, Andrea Montanari, and Sewoong Oh. Matrix completion from a few entries. IEEE Transactions on Information Theory, 56(6):2980–2998, 2010.
[7] Rong Ge, Jason D Lee, and Tengyu Ma. Matrix completion has no spurious local minimum. In Advances in Neural Information Processing Systems, pages 2973–2981, 2016.
[8] Yunhong Zhou, Dennis Wilkinson, Robert Schreiber, and Rong Pan. Large-scale parallel collaborative filtering for the Netflix prize. In International Conference on Algorithmic Applications in Management, pages 337–348. Springer, 2008.