Convex Coupled Matrix and Tensor Completion

Kishan Wimalawarne; Makoto Yamada; Hiroshi Mamitsuka

arXiv:1705.05197·stat.ML·June 15, 2018·Neural Comput.

TL;DR

This paper introduces convex low-rank norms for coupled matrix and tensor completion, enabling globally optimal solutions and improved theoretical bounds, with demonstrated effectiveness on synthetic and real data.

Contribution

It proposes a novel convex norm for coupled tensors that combines overlapped and latent norms, providing a globally optimal completion algorithm with better risk bounds.

Findings

01 The proposed norms outperform existing methods in synthetic data experiments.

02 The completion algorithm achieves superior accuracy on real-world datasets.

03 Theoretical analysis shows improved excess risk bounds for coupled tensor completion.

Abstract

We propose a set of convex low-rank inducing norms for coupled matrices and tensors (hereafter coupled tensors), which share information between matrices and tensors through common modes. More specifically, we propose a mixture of the overlapped trace norm and the latent norms with the matrix trace norm, and then we propose a new completion algorithm based on the proposed norms. A key advantage of the proposed norms is that they are convex and can be used to find a globally optimal solution, while existing methods for coupled learning are non-convex. Furthermore, we analyze the excess risk bounds of the completion model regularized by our proposed norms, which show that our proposed norms can exploit the low-rankness of coupled tensors, leading to better bounds compared to uncoupled norms. Through synthetic and real-world data experiments, we show that the proposed completion algorithm compares…

Equations

\underset{\mathcal{T},M}{\min}\frac{1}{2}\|\Omega_{M}(M-\hat{M})\|_{\mathrm{F}}^{2}+\frac{1}{2}\|\Omega_{\mathcal{T}}(\mathcal{T}-\hat{\mathcal{T}})\|_{\mathrm{F}}^{2}+\lambda\|\mathcal{T},M\|_{\mathrm{cn}},

\|M\|_{\mathrm{tr}}=\sum_{j=1}^{J}\sigma_{j},

\|\mathcal{T}\|_{\mathrm{overlap}}=\sum_{k=1}^{K}\|T_{(k)}\|_{\mathrm{tr}}.

\|\mathcal{T}\|_{\mathrm{latent}}=\underset{\mathcal{T}^{(1)}+\cdots+\mathcal{T}^{(K)}=\mathcal{T}}{\inf}\sum_{k=1}^{K}\|T_{(k)}^{(k)}\|_{\mathrm{tr}}.

\|\mathcal{T}\|_{\mathrm{scaled}}=\underset{\mathcal{T}^{(1)}+\cdots+\mathcal{T}^{(K)}=\mathcal{T}}{\inf}\sum_{k=1}^{K}\frac{1}{\sqrt{n_{k}}}\|T_{(k)}^{(k)}\|_{\mathrm{tr}}.

\|\mathcal{T},M\|^{1}_{(\mathrm{O},\mathrm{O},\mathrm{O})}:=\|[T_{(1)};M]\|_{\mathrm{tr}}+\sum_{k=2}^{3}\|T_{(k)}\|_{\mathrm{tr}}.

\|\mathcal{T},M\|^{1}_{(\mathrm{L},\mathrm{L},\mathrm{L})}=\underset{\mathcal{T}^{(1)}+\mathcal{T}^{(2)}+\mathcal{T}^{(3)}=\mathcal{T}}{\inf}\bigg{(}\|[T_{(1)}^{(1)};M]\|_{\mathrm{tr}}+\sum_{k=2}^{3}\|T_{(k)}^{(k)}\|_{\mathrm{tr}}\bigg{)}.

\|\mathcal{T},M\|^{1}_{(\mathrm{S},\mathrm{S},\mathrm{S})}=\underset{\mathcal{T}^{(1)}+\mathcal{T}^{(2)}+\mathcal{T}^{(3)}=\mathcal{T}}{\inf}\bigg{(}\frac{1}{\sqrt{n_{1}}}\|[T_{(1)}^{(1)};M]\|_{\mathrm{tr}}+\sum_{k=2}^{3}\frac{1}{\sqrt{n_{k}}}\|T_{(k)}^{(k)}\|_{\mathrm{tr}}\bigg{)}.

\|\mathcal{T},M\|^{1}_{(\mathrm{O},\mathrm{S},\mathrm{O})}=\underset{\mathcal{T}^{(1)}+\mathcal{T}^{(2)}=\mathcal{T}}{\inf}\bigg{(}\|[T_{(1)}^{(1)};M]\|_{\mathrm{tr}}+\frac{1}{\sqrt{n_{2}}}\|T_{(2)}^{(2)}\|_{\mathrm{tr}}\\ +\|T_{(3)}^{(1)}\|_{\mathrm{tr}}\bigg{)}.

\|\mathcal{T},M_{1},M_{2}\|^{1,3}_{(\mathrm{O},\mathrm{S},\mathrm{O})}=\underset{\mathcal{T}^{(1)}+\mathcal{T}^{(2)}=\mathcal{T}}{\inf}\Big{(}\|[T_{(1)}^{(1)};M_{1}]\|_{\mathrm{tr}}+\frac{1}{\sqrt{n_{2}}}\|T_{(2)}^{(2)}\|_{\mathrm{tr}}\\ +\|[T_{(3)}^{(1)};M_{2}]\|_{\mathrm{tr}}\Big{)}.

\|\mathcal{T},M\|_{(\mathrm{O},\mathrm{O},\mathrm{O}),\underline{S_{p}/q}}^{1}:=\Bigg{(}\bigg{(}\sum_{i}^{r_{1}}\sigma_{i}\big{(}[T_{(1)};M]\big{)}^{p}\bigg{)}^{\frac{q}{p}}+\bigg{(}\sum_{j}^{r_{2}}\sigma_{j}\big{(}T_{(2)}\big{)}^{p}\bigg{)}^{\frac{q}{p}}\\ +\bigg{(}\sum_{k}^{r_{3}}\sigma_{k}\big{(}T_{(3)}\big{)}^{p}\bigg{)}^{\frac{q}{p}}\Bigg{)}^{\frac{1}{q}},

\|\mathcal{T},M\|_{(\mathrm{O},\mathrm{O},\mathrm{O}),{\overline{S_{p^{*}}/q^{*}}}}^{1}=\underset{\mathcal{T}^{(1)}+\mathcal{T}^{(2)}+\mathcal{T}^{(3)}=\mathcal{T}}{\inf}\Bigg{(}\bigg{(}\sum_{i}^{r_{1}}\sigma_{i}\big{(}[T_{(1)}^{(1)};M]\big{)}^{p^{*}}\bigg{)}^{\frac{q^{*}}{p^{*}}}\\ +\bigg{(}\sum_{j}^{r_{2}}\sigma_{j}\big{(}T_{(2)}^{(2)}\big{)}^{p^{*}}\bigg{)}^{\frac{q^{*}}{p^{*}}}+\bigg{(}\sum_{k}^{r_{3}}\sigma_{k}\big{(}T_{(3)}^{(3)}\big{)}^{p^{*}}\bigg{)}^{\frac{q^{*}}{p^{*}}}\Bigg{)}^{\frac{1}{q^{*}}},

\|\mathcal{T},M\|_{(\mathrm{O},\mathrm{O},\mathrm{O}),{\overline{S_{\infty}/\infty}}}^{1}=\\ \underset{\mathcal{T}^{(1)}+\mathcal{T}^{(2)}+\mathcal{T}^{(3)}=\mathcal{T}}{\inf}\max\Big{(}\|[T_{(1)}^{(1)};M]\|_{\mathrm{op}},\|T_{(2)}^{(2)}\|_{\mathrm{op}},\|T_{(3)}^{(3)}\|_{\mathrm{op}}\Big{)}.

\|\mathcal{T},M\|_{(\mathrm{L},\mathrm{O},\mathrm{O}),\underline{S_{p}/q}}^{1}=\underset{\mathcal{T}^{(1)}+\mathcal{T}^{(2)}=\mathcal{T}}{\inf}\Bigg{(}\bigg{(}\sum_{i}^{r_{1}}\sigma_{i}\big{(}[T_{(1)}^{(1)};M]\big{)}^{p}\bigg{)}^{\frac{q}{p}}\\ +\bigg{(}\sum_{j}^{r_{2}}\sigma_{j}\big{(}T_{(2)}^{(2)}\big{)}^{p}\bigg{)}^{\frac{q}{p}}+\bigg{(}\sum_{k}^{r_{3}}\sigma_{k}\big{(}T_{(3)}^{(2)}\big{)}^{p}\Big{)}^{\frac{q}{p}}\Bigg{)}^{\frac{1}{q}}.

\|\mathcal{T},M\|_{(\mathrm{L},\mathrm{O},\mathrm{O}),\overline{S_{p^{*}}/q^{*}}}^{1}=\Bigg{(}\Big{(}\sum_{i}^{r_{1}}\sigma_{i}\big{(}[T_{(1)};M]\big{)}^{p^{*}}\Big{)}^{\frac{q^{*}}{p^{*}}}+\\ \underset{\mathcal{\hat{T}}^{(1)}+\mathcal{\hat{T}}^{(2)}=\mathcal{T}}{\inf}\Bigg{(}\bigg{(}\sum_{j}^{r_{2}}\sigma_{j}\big{(}\hat{T}_{(2)}^{(1)}\big{)}^{p^{*}}\bigg{)}^{\frac{q^{*}}{p^{*}}}+\bigg{(}\sum_{k}^{r_{3}}\sigma_{k}\big{(}\hat{T}_{(3)}^{(2)}\big{)}^{p^{*}}\bigg{)}^{\frac{q^{*}}{p^{*}}}\Bigg{)}\Bigg{)}^{\frac{1}{q^{*}}},

\underset{\mathcal{T}^{(1)},\mathcal{T}^{(2)},M}{\min}\frac{1}{2}\|\Omega_{M}(M-\hat{M})\|_{\mathrm{F}}^{2}+\frac{1}{2}\|\Omega_{\mathcal{T}}(\mathcal{T}^{(1)}+\mathcal{T}^{(2)}-\hat{\mathcal{T}})\|_{\mathrm{F}}^{2}\\ +\lambda\Bigg{(}\frac{1}{\sqrt{n_{1}}}\|[T_{(1)}^{(1)};M]\|_{\mathrm{tr}}+\|T_{(2)}^{(2)}\|_{\mathrm{tr}}+\|T_{(3)}^{(2)}\|_{\mathrm{tr}}\Bigg{)}

\underset{\mathcal{T}^{(1)},\mathcal{T}^{(2)},M}{\min}\frac{1}{2}\|\Omega_{M}(M-\hat{M})\|_{\mathrm{F}}^{2}+\frac{1}{2}\|\Omega_{\mathcal{T}}(\mathcal{T}^{(1)}+\mathcal{T}^{(2)}-\hat{\mathcal{T}})\|_{\mathrm{F}}^{2}\\ +\lambda\Bigg{(}\frac{1}{\sqrt{n_{1}}}\|[Y_{(1)}^{(1)};X]\|_{\mathrm{tr}}+\|Y_{(2)}^{(2)}\|_{\mathrm{tr}}+\|Y_{(3)}^{(2)}\|_{\mathrm{tr}}\Bigg{)}

\mathrm{s.t.}\quad X=M,\quad\mathcal{Y}^{(1)}=\mathcal{T}^{(1)},\quad\mathcal{Y}^{(k)}=\mathcal{T}^{(2)},\quad k=2,3.

\underset{\mathcal{T}^{(1)},\mathcal{T}^{(2)},M}{\min}\frac{1}{2}\|\Omega_{M}(M-\hat{M})\|_{\mathrm{F}}^{2}+\frac{1}{2}\|\Omega_{\mathcal{T}}(\mathcal{T}^{(1)}+\mathcal{T}^{(2)}-\hat{\mathcal{T}})\|_{\mathrm{F}}^{2}\\ +\lambda\Bigg{(}\frac{1}{\sqrt{n_{1}}}\|[Y_{(1)}^{(1)};X]\|_{\mathrm{tr}}+\|Y_{(2)}^{(2)}\|_{\mathrm{tr}}+\|Y_{(3)}^{(2)}\|_{\mathrm{tr}}\Bigg{)}+\big{\langle}W^{M},M-X\big{\rangle}\\ +\big{\langle}\mathcal{W}^{\mathcal{T}^{(1)}},\mathcal{T}^{(1)}-\mathcal{Y}^{(1)}\big{\rangle}+\sum_{k=2}^{3}\big{\langle}\mathcal{W}^{\mathcal{T}^{(k)}},\mathcal{T}^{(2)}-\mathcal{Y}^{(k)}\big{\rangle}+\frac{\beta}{2}\|M-X\|_{F}^{2}\\ +\frac{\beta}{2}\|\mathcal{T}^{(1)}-\mathcal{Y}^{(1)}\|_{F}^{2}+\frac{\beta}{2}\sum_{k=2}^{3}\|\mathcal{T}^{(2)}-\mathcal{Y}^{(k)}\|_{F}^{2}

M^{[t]}=unvec\bigg{(}(\Omega_{M}^{\top}\Omega_{M}+\beta I_{M})^{-1}vec\big{(}\Omega_{M}(\hat{M})-W^{M[t-1]}+\beta X^{[t-1]}\big{)}\bigg{)}.

\begin{bmatrix}\Omega_{\mathcal{T}}^{\top}\Omega_{\mathcal{T}}+2\beta I_{\mathcal{T}}&I_{\mathcal{T}}\\ I_{\mathcal{T}}&\Omega_{\mathcal{T}}^{\top}\Omega_{\mathcal{T}}+2\beta I_{\mathcal{T}}\end{bmatrix}\left[\begin{array}[]{c}vec(\mathcal{T}^{(1)[t]})\\ vec(\mathcal{T}^{(2)[t]})\end{array}\right]=\\ \\ \left[\begin{array}[]{c}vec\bigg{(}\Omega_{\hat{\mathcal{T}}}(\hat{\mathcal{T}})-\sum_{k=2}^{3}\mathcal{W}^{\mathcal{T}^{(k)}[t-1]}+\beta\sum_{k=2}^{3}\mathcal{Y}^{(k)[t-1]}\bigg{)}\\ vec\bigg{(}\Omega_{\hat{\mathcal{T}}}(\hat{\mathcal{T}})-\sum_{k=2}^{3}\mathcal{W}^{\mathcal{T}^{(k)}[t-1]}+\beta\sum_{k=2}^{3}\mathcal{Y}^{(k)[t-1]}\bigg{)}\end{array}\right],

[Y_{(1)}^{(1)[t]};X^{[t]}]={\rm{prox}}_{{\lambda/(\sqrt{n_{1}}\beta)}}\bigg{(}[\frac{W_{(1)}^{\mathcal{T}^{(1)}[t-1]}}{\beta};\frac{W^{M[t-1]}}{\beta}]+[T_{(1)}^{(1)[t]};M^{[t]}]\bigg{)},

Y_{(k)}^{(k)[t]}={\rm{prox}}_{{\lambda/\beta}}\bigg{(}\frac{W_{(k)}^{\mathcal{T}^{(k)}[t-1]}}{\beta}+T_{(k)}^{(2)[t]}\bigg{)},\quad k=2,3,

W^{M[t]}=W^{M[t-1]}+\beta(M^{[t]}-X^{[t]}),

\mathcal{W}^{\mathcal{T}^{(1)}[t]}=\mathcal{W}^{\mathcal{T}^{(1)}[t-1]}+\beta(\mathcal{T}^{(1)[t]}-\mathcal{Y}^{(1)[t]}),

\mathcal{W}^{\mathcal{T}^{(k)}[t]}=\mathcal{W}^{\mathcal{T}^{(k)}[t-1]}+\beta(\mathcal{T}^{(2)[t]}-\mathcal{Y}^{(k)[t]}),\quad k=2,3.

\big{(}\Omega_{\mathcal{T}}^{\top}\Omega_{\mathcal{T}}+3\beta I_{\mathcal{T}}\big{)}vec(\mathcal{T}^{[t]})=vec\bigg{(}\Omega_{\hat{\mathcal{T}}}(\hat{\mathcal{T}})-\sum_{k=1}^{3}\mathcal{W}^{\mathcal{T}^{(k)}[t-1]}+\beta\sum_{k=1}^{3}\mathcal{Y}^{(k)[t-1]}\bigg{)},

[Y_{(1)}^{(1)[t]};X^{[t]}]={\rm{prox}}_{{\lambda/\beta}}\bigg{(}[\frac{W_{(1)}^{\mathcal{T}^{(1)}[t-1]}}{\beta};\frac{W^{M[t-1]}}{\beta}]+[T_{(1)}^{[t]};M^{[t]}]\bigg{)},

Y_{(k)}^{(k)[t]}={\rm{prox}}_{{\lambda/\beta}}\bigg{(}\frac{W_{(k)}^{\mathcal{T}^{(k)}[t-1]}}{\beta}+T_{(k)}^{[t]}\bigg{)},\quad k=2,3.

\mathcal{W}^{\mathcal{T}^{(k)}[t]}=\mathcal{W}^{\mathcal{T}^{(k)}[t-1]}+\beta(\mathcal{T}^{[t]}-\mathcal{Y}^{(k)[t]}),\quad k=1,2,3.
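
The update equations above rely on two building blocks that can be sketched in NumPy. The proximal operator of the trace norm is the standard singular value thresholding step, and the $M$ update reduces to an elementwise division if $\Omega_{M}$ is assumed to act as a 0/1 mask on the entries of $M$. The function names `prox_trace_norm` and `update_M` are hypothetical, chosen only for this illustration.

```python
import numpy as np

def prox_trace_norm(A, tau):
    """Proximal operator of tau * (trace norm): soft-threshold the singular values."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return (U * np.maximum(s - tau, 0.0)) @ Vt

def update_M(M_hat, mask_M, W, X, beta):
    """M update, assuming Omega_M is a 0/1 mask so that
    (Omega^T Omega + beta I) is diagonal and can be inverted elementwise."""
    return (mask_M * M_hat - W + beta * X) / (mask_M + beta)

# Small illustration with random data (all values are made up).
rng = np.random.default_rng(0)
M_hat = rng.standard_normal((4, 5))
mask_M = rng.random((4, 5)) < 0.5
W = rng.standard_normal((4, 5))
X = rng.standard_normal((4, 5))
beta = 0.7

M_new = update_M(M_hat, mask_M, W, X, beta)
# Stationarity residual of the M subproblem; it should be numerically zero.
residual = mask_M * (M_new - M_hat) + W + beta * (M_new - X)

thresholded = prox_trace_norm(np.diag([3.0, 1.0]), 2.0)
```

In the coupled proximal steps, the same thresholding would be applied to the concatenated matrix and the result split back into its tensor-unfolding and matrix parts.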

Full text

Convex Coupled Matrix and Tensor Completion

Kishan Wimalawarne

Bioinformatics Center,

Institute for Chemical Research,

Kyoto University,

Gokasho, Uji, Japan.

Makoto Yamada

RIKEN, Center for Advanced Intelligence Project,

Nihonbashi 1-chome Mitsui Building, 15th floor,

1-4-1 Nihonbashi, Chuo-ku,

Tokyo 103-0027, Japan.

The Institute of Statistical Mathematics,

10-3 Midori-cho, Tachikawa,

Tokyo 190-8562, Japan.

PRESTO, Japan Science and Technological Agency (JST), Japan.

Hiroshi Mamitsuka

Bioinformatics Center,

Institute for Chemical Research,

Kyoto University,

Gokasho, Uji, Japan.

Department of Computer Science,

Aalto University,

Espoo 02150 Finland.

Abstract

We propose a set of convex low-rank inducing norms for coupled matrices and tensors (hereafter coupled tensors), in which information is shared between the matrices and tensors through common modes. More specifically, we first propose a mixture of the overlapped trace norm and the latent norms with the matrix trace norm, and then, propose a completion model regularized using these norms to impute coupled tensors. A key advantage of the proposed norms is that they are convex and can be used to find a globally optimal solution, whereas existing methods for coupled learning are non-convex. We also analyze the excess risk bounds of the completion model regularized using our proposed norms and show that they can exploit the low-rankness of coupled tensors, leading to better bounds compared to those obtained using uncoupled norms. Through synthetic and real-data experiments, we show that the proposed completion model compares favorably with existing ones.

1 Introduction

Learning from a matrix or a tensor has long been an important problem in machine learning. In particular, matrix and tensor factorization using low-rank inducing norms has been studied extensively, and many applications have been considered, such as missing value imputation (Signoretto et al., 2013; Liu et al., 2009), multi-task learning (Argyriou et al., 2006; Romera-Paredes et al., 2013; Wimalawarne et al., 2014), subspace clustering (Liu et al., 2010), and inductive learning (Signoretto et al., 2013; Wimalawarne et al., 2016). Though useful in many applications, factorization based on an individual matrix or tensor tends to perform poorly in the cold-start setting (Singh and Gordon, 2008), for example, when no click information can be observed for new users in collaborative filtering. It therefore cannot be used to recommend items to new users. Potential ways to address this issue are matrix and tensor factorization with side information (Narita et al., 2011). Both have been applied to recommendation systems (Singh and Gordon, 2008; Gunasekar et al., 2015) and personalized medicine (Khan and Kaski, 2014).

Both matrix and tensor factorization with side information can be regarded as the joint factorization of coupled matrices and tensors (hereafter coupled tensors) (see Figure 1). Acar et al. (2011) introduced a coupled factorization method based on CP decomposition that simultaneously factorizes matrices and tensors by sharing the low-rank structures in the matrices and tensors. The coupled factorization approach has been applied to joint analysis of fluorescence and proton nuclear magnetic resonance (NMR) measurements (Acar et al., 2014a) and joint NMR and liquid chromatography-mass spectrometry (LC–MS) (Acar et al., 2015). More recently, a Bayesian approach proposed by Ermis et al. (2015) was applied to link prediction problems. However, existing coupled factorization methods are non-convex and may converge only to a poor local optimum. Moreover, the ranks of the coupled tensors need to be determined beforehand. In practice, it is difficult to specify the true ranks of the tensor and the matrix without prior knowledge. Furthermore, existing algorithms lack theoretical guarantees.

In this paper, we propose convex norms for coupled tensors that overcome the non-convexity problem. The norms are mixtures of tensor norms: the overlapped trace norm (Tomioka et al., 2011), the latent trace norm (Tomioka and Suzuki, 2013), the scaled latent norm (Wimalawarne et al., 2014), and the matrix trace norm (Argyriou et al., 2006). A key advantage of the proposed norms is that they are convex and thus can be used to find a globally optimal solution, whereas existing coupled factorization approaches are non-convex. Furthermore, we analyze the excess risk bounds of the completion model regularized using our proposed norms. Through synthetic and real-data experiments, we show that the proposed completion model compares favorably with existing ones.

Our contributions in this paper are to

• Propose a set of convex coupled norms for matrices and tensors that extend low-rank tensor and matrix norms.

• Propose mixed norms that combine features from both the overlapped norm and latent norms.

• Propose a convex completion model regularized using the proposed coupled norms.

• Analyze the excess risk bounds for the proposed completion model with respect to the proposed norms and show that it leads to lower excess risk.

• Show through synthetic and real-data experiments that our norms lead to performance comparable to that of existing non-convex methods.

• Show that our norms are applicable to coupled tensors based on both the CP rank and the multilinear rank without prior assumptions about their low-rankness.

• Show that convexity of the proposed norms leads to global solutions, eliminating the need to deal with local optimal solutions as is necessary with non-convex methods.

The remainder of the paper is organized as follows. In Section 2, we discuss related work on coupled tensor completion. In Section 3, we present our proposed method, first introducing a coupled completion model and then proposing a set of norms called coupled norms. In Section 4, we give optimization methods for solving the coupled completion model. In Section 5, we theoretically analyze it using excess risk bounds for the proposed coupled norms. In Section 6, we present the results of our evaluation using synthetic and real-world data experiments. Finally, in Section 7, we summarize the key points and suggest future work.

2 Related Work

Most of the models proposed for learning with multiple matrices or tensors use joint factorization of matrices and tensors. The regularization-based model for completion of coupled tensors proposed by Acar et al. (2011), and studied further in (Acar et al., 2014a; Acar et al., 2014b; Acar et al., 2015), uses CANDECOMP/PARAFAC (CP) decomposition (Carroll and Chang, 1970; Harshman, 1970; Hitchcock, 1927; Kolda and Bader, 2009) to factorize the tensor and assumes that the factorized components of its coupled mode are shared with the factorized components of the matrix on the same mode. Bayesian models that use similar factorization models have also been proposed for imputing missing values, with applications in link prediction (Ermis et al., 2015) and non-negative factorization (Takeuchi et al., 2013). Applications that have used collective factorization of tensors include multi-view factorization (Khan and Kaski, 2014) and multi-way clustering (Banerjee et al., 2007). Due to their use of factorization-based learning, all of these models are non-convex.

The use of common adjacency graphs has more recently been proposed for incorporating similarities among heterogeneous tensor data (Li et al., 2015). Though this method does not require assumptions about rank for explicit factorization of tensors, it depends on the modeling of the common adjacency graph and does not incorporate the low-rankness created by the coupling of tensors.

3 Proposed Method

We investigate the coupling between a matrix and a tensor that arises when they share a common mode (Acar et al., 2015, 2014a; Acar et al., 2014b). An example of the most basic coupling is shown in Figure 1, where a $3$-way (third-order) tensor is attached to a matrix on a specific mode. As depicted, we may have a problem of predicting recommendations for customers on the basis of their preferences for restaurants in different locations, and we may also have side information about the characteristics of each customer. We can utilize this side information by coupling the customer-characteristic matrix with the sparse customer-restaurant-location tensor on the customer mode and then imputing the missing values in the tensor.

Let us consider a partially observed matrix $\hat{M}\in\mathbb{R}^{n_{1}\times m}$ and a partially observed $3$-way tensor $\hat{\mathcal{T}}\in\mathbb{R}^{n_{1}\times n_{2}\times n_{3}}$ with mappings to observed elements indexed by $\Omega_{M}$ and $\Omega_{\mathcal{T}}$, respectively, and let us assume that they are coupled on the first mode. Our ultimate goal in this paper is to introduce convex coupled norms $\|\mathcal{T},M\|_{\mathrm{cn}}$ for use in solving

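$$\min_{\mathcal{T},M}\ \frac{1}{2}\|\Omega_{M}(M-\hat{M})\|_{\mathrm{F}}^{2}+\frac{1}{2}\|\Omega_{\mathcal{T}}(\mathcal{T}-\hat{\mathcal{T}})\|_{\mathrm{F}}^{2}+\lambda\|\mathcal{T},M\|_{\mathrm{cn}},\tag{1}$$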

where $\lambda\geq 0$ is the regularization parameter. We also investigate the theoretical properties of problem (1).
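
As a rough illustration of problem (1), the sketch below evaluates the objective for a candidate pair $(\mathcal{T},M)$, assuming that $\Omega_{M}$ and $\Omega_{\mathcal{T}}$ are represented as boolean masks of observed entries. The function `completion_objective` and the regularizer `frobenius_stub` are hypothetical names; the stub is only a stand-in and is not one of the coupled norms proposed in the paper.

```python
import numpy as np

def completion_objective(T, M, T_hat, M_hat, mask_T, mask_M, lam, coupled_norm):
    """Value of the coupled completion objective for a candidate (T, M)."""
    fit_M = 0.5 * np.sum((mask_M * (M - M_hat)) ** 2)   # observed matrix entries only
    fit_T = 0.5 * np.sum((mask_T * (T - T_hat)) ** 2)   # observed tensor entries only
    return fit_M + fit_T + lam * coupled_norm(T, M)

rng = np.random.default_rng(0)
T_hat = rng.standard_normal((4, 3, 2))   # tensor and matrix share the first mode (size 4)
M_hat = rng.standard_normal((4, 5))
mask_T = rng.random(T_hat.shape) < 0.5
mask_M = rng.random(M_hat.shape) < 0.5

# Placeholder regularizer for illustration only (an assumption, not a coupled norm).
def frobenius_stub(T, M):
    return np.sqrt(np.sum(T ** 2) + np.sum(M ** 2))

value_at_data = completion_objective(T_hat, M_hat, T_hat, M_hat,
                                     mask_T, mask_M, 0.1, frobenius_stub)
value_at_zero = completion_objective(np.zeros_like(T_hat), np.zeros_like(M_hat),
                                     T_hat, M_hat, mask_T, mask_M, 0.1, frobenius_stub)
```

Passing the regularizer as a function keeps the data-fit terms separate from the choice of coupled norm, which is the part the following sections vary.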

Notations: The mode-$k$ unfolding of a tensor $\mathcal{T}\in\mathbb{R}^{n_{1}\times\cdots\times n_{K}}$ is represented as $T_{(k)}\in\mathbb{R}^{n_{k}\times\prod_{j\neq k}^{K}n_{j}}$, which is obtained by concatenating along its columns all the $\prod_{j\neq k}^{K}n_{j}$ vectors of dimension $n_{k}$ obtained by fixing all indices except the $k$th. We use $vec()$ to indicate the conversion of a matrix or a tensor into a vector and $unvec()$ to represent the reverse operation. The spectral norm (operator norm) of a matrix $X$, denoted $\|X\|_{{\mathrm{op}}}$, is the largest singular value of $X$. The Frobenius norm of a tensor $\mathcal{T}$ is defined as $\|\mathcal{T}\|_{{\mathrm{F}}}=\sqrt{\left\langle\mathcal{T},\mathcal{T}\right\rangle}=\sqrt{vec(\mathcal{T})^{\top}vec(\mathcal{T})}$. We use $[M;N]$ to denote the concatenation of matrices $M\in\mathbb{R}^{m_{1}\times m_{2}}$ and $N\in\mathbb{R}^{m_{1}\times m_{3}}$ along their shared mode $1$, which gives a matrix in $\mathbb{R}^{m_{1}\times(m_{2}+m_{3})}$.
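
The unfolding and concatenation operations can be sketched in NumPy as follows. The helper names `unfold` and `fold` are hypothetical, modes are 0-indexed in the code, and the column ordering of this particular unfolding is one of several possible conventions; the singular values of the unfolding, which are all the norms below depend on, do not change with the column order.

```python
import numpy as np

def unfold(tensor, mode):
    """Mode-`mode` unfolding (0-indexed): rows are indexed by the chosen mode."""
    return np.moveaxis(tensor, mode, 0).reshape(tensor.shape[mode], -1)

def fold(matrix, mode, shape):
    """Inverse of `unfold` for a tensor with the given full shape."""
    rest = [s for i, s in enumerate(shape) if i != mode]
    return np.moveaxis(matrix.reshape([shape[mode]] + rest), 0, mode)

T = np.arange(24, dtype=float).reshape(2, 3, 4)   # n1=2, n2=3, n3=4
M = np.ones((2, 5))                               # shares the first mode with T

T1 = unfold(T, 0)              # T_(1), shape (2, 12)
T2 = unfold(T, 1)              # T_(2), shape (3, 8)
coupled = np.hstack([T1, M])   # [T_(1); M], shape (2, 17)
```

Horizontal stacking is used for $[T_{(1)};M]$ because both blocks have $n_{1}$ rows, matching the definition of the concatenation given above.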

3.1 Existing Matrix and Tensor Norms

Before we introduce our new norms, let us first briefly review the existing low-rank inducing matrix and tensor norms. Among matrices, the matrix trace norm (Argyriou et al., 2006) is a commonly used convex relaxation for the minimization of the rank of a matrix. For a given matrix $M\in\mathbb{R}^{n_{1}\times m}$ with rank $J$ , we can define its trace norm as

$$\|M\|_{\mathrm{tr}}=\sum_{j=1}^{J}\sigma_{j},$$

where $\sigma_{j}$ is the $j$ th non-zero singular value of the matrix.
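
A minimal NumPy sketch of this definition sums the singular values returned by an SVD. The helper name `trace_norm` is hypothetical; NumPy's built-in nuclear norm, `np.linalg.norm(M, 'nuc')`, computes the same quantity.

```python
import numpy as np

def trace_norm(M):
    """Trace (nuclear) norm: the sum of the singular values of M."""
    return np.linalg.svd(M, compute_uv=False).sum()

# A rank-2 example whose singular values are 3 and 2.
M = np.diag([3.0, 2.0, 0.0])
value = trace_norm(M)

# A rank-1 matrix u v^T has a single non-zero singular value ||u|| * ||v||.
u = np.array([1.0, 2.0, 2.0])   # ||u|| = 3
v = np.array([3.0, 4.0])        # ||v|| = 5
rank_one_value = trace_norm(np.outer(u, v))
```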

Low-rank inducing norms for tensors have received renewed attention in recent years. One of the earliest low-rank inducing tensor norms is the tensor nuclear norm (Liu et al., 2009) (also known as the overlapped trace norm (Tomioka and Suzuki, 2013)), which can be expressed for a tensor $\mathcal{T}\in\mathbb{R}^{n_{1}\times\cdots\times n_{K}}$ as

$$\|\mathcal{T}\|_{\mathrm{overlap}}=\sum_{k=1}^{K}\|T_{(k)}\|_{\mathrm{tr}}.$$

Tomioka and Suzuki (2013) proposed the latent trace norm:

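$$\|\mathcal{T}\|_{\mathrm{latent}}=\inf_{\mathcal{T}^{(1)}+\cdots+\mathcal{T}^{(K)}=\mathcal{T}}\sum_{k=1}^{K}\|T_{(k)}^{(k)}\|_{\mathrm{tr}}.$$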

The scaled latent trace norm was proposed as an extension of the latent trace norm (Wimalawarne et al., 2014):

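$$\|\mathcal{T}\|_{\mathrm{scaled}}=\inf_{\mathcal{T}^{(1)}+\cdots+\mathcal{T}^{(K)}=\mathcal{T}}\sum_{k=1}^{K}\frac{1}{\sqrt{n_{k}}}\|T_{(k)}^{(k)}\|_{\mathrm{tr}}.$$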

The behaviors of these two tensor norms have been studied on the basis of multitask learning (Wimalawarne et al., 2014) and inductive learning (Wimalawarne et al., 2016). The results show that for a tensor $\mathcal{T}\in\mathbb{R}^{n_{1}\times\cdots\times n_{K}}$ with multilinear rank $(r_{1},\ldots,r_{K})$ , the excess risk is bounded above with respect to regularization with the overlapped trace norm by $\mathcal{O}(\sum_{k=1}^{K}\sqrt{r_{k}})$ , the latent trace norm by $\mathcal{O}(\min_{k}\sqrt{r_{k}})$ , and the scaled latent trace norm by $\mathcal{O}\Big{(}\min_{k}\sqrt{\frac{r_{k}}{n_{k}}}\Big{)}$ .
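
To make the difference between the overlapped and latent constructions concrete, the sketch below computes the overlapped trace norm directly and a simple upper bound on the latent trace norm. The bound follows from one feasible decomposition: placing all of $\mathcal{T}$ in a single latent tensor $\mathcal{T}^{(k)}$ gives the value $\|T_{(k)}\|_{\mathrm{tr}}$, so the infimum is at most $\min_{k}\|T_{(k)}\|_{\mathrm{tr}}$. Computing the latent norm exactly requires solving an optimization problem and is not attempted here; all function names are hypothetical.

```python
import numpy as np

def unfold(tensor, mode):
    """Mode-`mode` unfolding (0-indexed)."""
    return np.moveaxis(tensor, mode, 0).reshape(tensor.shape[mode], -1)

def trace_norm(M):
    return np.linalg.svd(M, compute_uv=False).sum()

def overlapped_trace_norm(T):
    """Sum of the trace norms of all mode unfoldings."""
    return sum(trace_norm(unfold(T, k)) for k in range(T.ndim))

def latent_trace_norm_upper_bound(T):
    """Upper bound on the latent trace norm from single-component decompositions."""
    return min(trace_norm(unfold(T, k)) for k in range(T.ndim))

# Rank-1 tensor a o b o c: every unfolding has rank 1 and trace norm ||T||_F.
rng = np.random.default_rng(0)
a, b, c = rng.standard_normal(4), rng.standard_normal(5), rng.standard_normal(6)
T = np.einsum('i,j,k->ijk', a, b, c)

overlap_value = overlapped_trace_norm(T)
latent_bound = latent_trace_norm_upper_bound(T)
```

For this rank-1 example the overlapped norm is three times the Frobenius norm, while the latent bound equals the Frobenius norm, which mirrors the sum-versus-minimum pattern in the excess risk bounds quoted above.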

3.2 Coupled Tensor Norms

As with individual matrices and tensors, having convex and low-rank inducing norms for coupled tensors would be useful in achieving global solutions for coupled tensor completion with theoretical guarantees. To achieve this, we propose a set of norms for coupled tensors that are coupled on specific modes using existing matrix and tensor trace norms. Let us first define a new coupled norm with the format $\|.\|^{a}_{(b,c,d)}$ , where the superscript $a$ specifies the mode in which the tensor and matrix are coupled and the subscripts $b,c,d\in\{\mathrm{O},\mathrm{L},\mathrm{S},-\}$ indicate how the modes are regularized. The notations for $b,c,d$ are defined as

$\mathrm{O}$ : The mode is regularized with the trace norm. The same tensor is regularized on other modes similarly to the overlapped trace norm.

$\mathrm{L}$ : The mode is considered to be a latent tensor that is regularized using the trace norm only with respect to that mode.

$\mathrm{S}$ : The mode is regularized as a latent tensor but it is scaled similarly to the scaled latent trace norm.

$-$ : The mode is not regularized.

Given a matrix $M\in\mathbb{R}^{n_{1}\times m}$ and a tensor $\mathcal{T}\in\mathbb{R}^{n_{1}\times n_{2}\times n_{3}}$ , we introduce three norms that are coupled extensions of the overlapped trace norm, the latent trace norm, and the scaled latent trace norm, respectively.

Coupled overlapped trace norm:

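$$\|\mathcal{T},M\|^{1}_{(\mathrm{O},\mathrm{O},\mathrm{O})}:=\|[T_{(1)};M]\|_{\mathrm{tr}}+\sum_{k=2}^{3}\|T_{(k)}\|_{\mathrm{tr}}.$$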

Coupled latent trace norm:

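$$\|\mathcal{T},M\|^{1}_{(\mathrm{L},\mathrm{L},\mathrm{L})}=\inf_{\mathcal{T}^{(1)}+\mathcal{T}^{(2)}+\mathcal{T}^{(3)}=\mathcal{T}}\bigg(\|[T_{(1)}^{(1)};M]\|_{\mathrm{tr}}+\sum_{k=2}^{3}\|T_{(k)}^{(k)}\|_{\mathrm{tr}}\bigg).$$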

Coupled scaled latent trace norm:

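$$\|\mathcal{T},M\|^{1}_{(\mathrm{S},\mathrm{S},\mathrm{S})}=\inf_{\mathcal{T}^{(1)}+\mathcal{T}^{(2)}+\mathcal{T}^{(3)}=\mathcal{T}}\bigg(\frac{1}{\sqrt{n_{1}}}\|[T_{(1)}^{(1)};M]\|_{\mathrm{tr}}+\sum_{k=2}^{3}\frac{1}{\sqrt{n_{k}}}\|T_{(k)}^{(k)}\|_{\mathrm{tr}}\bigg).$$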
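
Of the three coupled norms, the coupled overlapped trace norm can be evaluated directly because it involves no infimum. The sketch below computes it for a tensor and a matrix coupled on the first mode and compares it with the uncoupled sum of the overlapped trace norm of the tensor and the trace norm of the matrix. The function names are hypothetical, and the example data (a rank-1 tensor and a rank-1 matrix sharing the same mode-1 factor) are an assumption chosen to make the effect of coupling visible.

```python
import numpy as np

def unfold(tensor, mode):
    """Mode-`mode` unfolding (0-indexed)."""
    return np.moveaxis(tensor, mode, 0).reshape(tensor.shape[mode], -1)

def trace_norm(M):
    return np.linalg.svd(M, compute_uv=False).sum()

def coupled_overlapped_norm(T, M):
    """||[T_(1); M]||_tr + ||T_(2)||_tr + ||T_(3)||_tr for a 3-way tensor T."""
    coupled = np.hstack([unfold(T, 0), M])   # concatenate on the shared first mode
    return trace_norm(coupled) + trace_norm(unfold(T, 1)) + trace_norm(unfold(T, 2))

def uncoupled_norm(T, M):
    """Overlapped trace norm of T plus the trace norm of M, with no coupling."""
    return sum(trace_norm(unfold(T, k)) for k in range(3)) + trace_norm(M)

rng = np.random.default_rng(1)
a, b, c, d = (rng.standard_normal(n) for n in (4, 5, 6, 7))
T = np.einsum('i,j,k->ijk', a, b, c)   # rank-1 tensor with mode-1 factor a
M = np.outer(a, d)                     # rank-1 matrix with the same factor a

coupled_value = coupled_overlapped_norm(T, M)
uncoupled_value = uncoupled_norm(T, M)
```

By the triangle inequality, $\|[T_{(1)};M]\|_{\mathrm{tr}}\leq\|T_{(1)}\|_{\mathrm{tr}}+\|M\|_{\mathrm{tr}}$, so the coupled value never exceeds the uncoupled one; when the two blocks share a column space, as in this example, it is strictly smaller.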

In addition to these norms, we can also create norms as mixtures of overlapped and latent/scaled latent norms. For example, if we want to create a norm that is regularized using the scaled latent trace norm on the second mode while the other modes are regularized using the overlapped trace norm, we can define it as

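$$\|\mathcal{T},M\|^{1}_{(\mathrm{O},\mathrm{S},\mathrm{O})}=\inf_{\mathcal{T}^{(1)}+\mathcal{T}^{(2)}=\mathcal{T}}\bigg(\|[T_{(1)}^{(1)};M]\|_{\mathrm{tr}}+\frac{1}{\sqrt{n_{2}}}\|T_{(2)}^{(2)}\|_{\mathrm{tr}}+\|T_{(3)}^{(1)}\|_{\mathrm{tr}}\bigg).$$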

This norm has two latent tensors, $\mathcal{T}^{(1)}$ and $\mathcal{T}^{(2)}$ . Tensor $\mathcal{T}^{(1)}$ is regularized using the overlapped method for modes $1$ and $3$ while the tensor $\mathcal{T}^{(2)}$ is regularized as a scaled latent tensor on mode $2$ . Given this use of a mixture of regularization methods, we call the resulting norm a mixed norm.

In a similar manner, we can create other mixed norms distinguished by their subscripts: $(\mathrm{L},\mathrm{O},\mathrm{O})$, $(\mathrm{O},\mathrm{L},\mathrm{O})$, $(\mathrm{O},\mathrm{O},\mathrm{L})$, $(\mathrm{S},\mathrm{O},\mathrm{O})$, $(\mathrm{O},\mathrm{S},\mathrm{O})$, and $(\mathrm{O},\mathrm{O},\mathrm{S})$. The main advantage gained by using these mixed norms is the additional freedom to regularize low-rank constraints among coupled tensors. Other combinations of norms in which two modes are latent tensors, such as $(\mathrm{L},\mathrm{L},\mathrm{O})$, will make the third mode also a latent tensor, since overlapped regularization requires that more than one mode of the same tensor be regularized. Though we have considered using the latent trace norm, in practice it has been shown to be weaker in performance than the scaled latent trace norm (Wimalawarne et al., 2014; Wimalawarne et al., 2016). Therefore, in our experiments, we considered only mixed norms based on the scaled latent trace norm.

3.2.1 Extensions for Multiple Matrices and Tensors

Our newly defined norms can be extended to multiple matrices coupled to a tensor on different modes. For instance, we can couple two matrices $M_{1}\in\mathbb{R}^{n_{1}\times m_{1}}$ and $M_{2}\in\mathbb{R}^{n_{3}\times m_{2}}$ to a $3$ -way tensor $\mathcal{T}$ on its first and third modes. If we regularize the coupled tensor with the overlapped trace norm on mode $1$ and mode $3$ and the scaled latent trace norm on mode $2$ , we obtain a mixed norm,

[TABLE]

Coupled norms for multiple $3$ -mode or higher-dimensional tensors could also be designed using our proposed method. However, such extensions may require extending the coupled norms further. Extensions to coupled norms for multiple tensors are a promising area for future research.

3.3 Dual Norms

Let us now briefly look at dual norms for the coupled norms defined above. Dual norms are useful in deriving excess risk bounds, as discussed in Section 5. Due to space limitations, we derive dual norms for only two coupled norms to better understand their nature. To derive them, we first need to know the Schatten norm (Tomioka and Suzuki, 2013) for the coupled tensor norms. Let us first define the Schatten- $(p,q)$ norm for the coupled norm $\|\mathcal{T},M\|^{1}_{(\mathrm{O},\mathrm{O},\mathrm{O})}$ with an additional subscript notation $\underline{S_{p}/q}$ :

[TABLE]

where $p$ and $q$ are constants, $r_{1}$ , $r_{2}$ and $r_{3}$ are the ranks and $\sigma_{i}$ , $\sigma_{j}$ and $\sigma_{k}$ are the singular values for each unfolding.

The following theorem presents the dual norm of $\|\mathcal{T},M\|_{(\mathrm{O},\mathrm{O},\mathrm{O}),\underline{S_{p}/q}}^{1}$ (see Appendix A for proof).

Theorem 1.

Let a matrix $M\in\mathbb{R}^{n_{1}\times m}$ and a tensor $\mathcal{T}\in\mathbb{R}^{n_{1}\times n_{2}\times n_{3}}$ be coupled on their first modes. The dual norm of $\|\mathcal{T},M\|_{(\mathrm{O},\mathrm{O},\mathrm{O}),\underline{S_{p}/q}}^{1}$ with $1/p+1/p^{*}=1$ and $1/q+1/q^{*}=1$ is

[TABLE]

where $r_{1}$ , $r_{2}$ , and $r_{3}$ are the ranks for each mode and $\sigma_{i}$ , $\sigma_{j}$ , and $\sigma_{k}$ are the singular values for each unfolding of the coupled tensor.

In the special case of $p=1$ and $q=1$ , we see that $\|\mathcal{T},M\|_{(\mathrm{O},\mathrm{O},\mathrm{O}),\underline{S_{1}/1}}^{1}=\|\mathcal{T},M\|_{(\mathrm{O},\mathrm{O},\mathrm{O})}^{1}$ . Its dual norm is the spectral norm, as shown in the following corollary.

Corollary 1.

Let a matrix $M\in\mathbb{R}^{n_{1}\times m}$ and a tensor $\mathcal{T}\in\mathbb{R}^{n_{1}\times n_{2}\times n_{3}}$ be coupled on their first mode. The dual norm of $\|\mathcal{T},M\|_{(\mathrm{O},\mathrm{O},\mathrm{O}),\underline{S_{1}/1}}^{1}$ is

[TABLE]
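The duality between the trace norm and the spectral norm that underlies Corollary 1 can be checked numerically in the plain matrix case. The sketch below is only an illustration for a single uncoupled matrix, not the coupled norm itself: it verifies that $\langle X,Y\rangle\leq\|X\|_{\mathrm{tr}}$ whenever $\|Y\|_{\mathrm{op}}\leq 1$, with equality at $Y=UV^{\top}$ for $X=USV^{\top}$. All variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((6, 4))
U, s, Vt = np.linalg.svd(X, full_matrices=False)

# Trace norm of X: the sum of its singular values.
x_trace_norm = s.sum()

# Y_star = U V^T has spectral norm 1 and attains the supremum of <X, Y>.
Y_star = U @ Vt
attained = np.sum(X * Y_star)

# Any other Y with spectral norm at most 1 gives a smaller inner product.
Y = rng.standard_normal((6, 4))
Y /= np.linalg.norm(Y, 2)
other = np.sum(X * Y)
```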

The Schatten- $(p,q)$ norm for the mixed norm $\|\cdot\|^{1}_{(\mathrm{L},\mathrm{O},\mathrm{O})}$ is defined as

[TABLE]

Its dual norm is defined by the following theorem (see Appendix A for proof).

Theorem 2.

Let a matrix $M\in\mathbb{R}^{n_{1}\times m}$ and a tensor $\mathcal{T}\in\mathbb{R}^{n_{1}\times n_{2}\times n_{3}}$ be coupled on their first mode. The dual norm of the mixed coupled norm $\|\mathcal{T},M\|_{(\mathrm{L},\mathrm{O},\mathrm{O}),\underline{S_{p}/q}}^{1}$ with $1/p+1/p^{*}=1$ and $1/q+1/q^{*}=1$ is

[TABLE]

where $r_{1}$ , $r_{2}$ , and $r_{3}$ are the ranks of $T_{(1)}$ , ${\hat{T}}^{(1)}_{(2)}$ and ${\hat{T}}^{(2)}_{(3)}$ , respectively, and $\sigma_{i}$ , $\sigma_{j}$ , and $\sigma_{k}$ are their singular values.

The dual norms of other mixed norms can be similarly derived.

4 Optimization

In this section, we discuss optimization of the proposed completion model (1). The model can be solved for each coupled norm using a standard optimization method such as the alternating direction method of multipliers (ADMM) (Boyd et al., 2011). Below we derive the ADMM optimization steps for the coupled norm $\|\mathcal{T},M\|^{1}_{(\mathrm{S},\mathrm{O},\mathrm{O})}$ . The optimization steps for the other norms can be derived similarly.

We express (1) using the $\|\mathcal{T},M\|^{1}_{(\mathrm{S},\mathrm{O},\mathrm{O})}$ norm

[TABLE]

By introducing auxiliary variables $X\in\mathbb{R}^{n_{1}\times m}$ and $\mathcal{Y}\in\mathbb{R}^{n_{1}\times n_{2}\times n_{3}}$ , we can formulate the objective function of ADMM for (10)

[TABLE]

We introduce Lagrangian multipliers $W^{M}\in\mathbb{R}^{n_{1}\times m}$ and $\mathcal{W}^{\mathcal{T}^{(k)}}\in\mathbb{R}^{n_{1}\times n_{2}\times n_{3}}$ , $(k=1,2,3)$ and formulate the Lagrangian as

[TABLE]

where $\beta$ is a proximity parameter. Using this Lagrangian formulation, we can obtain solutions for unknown variables $M$ , $\mathcal{T}^{(1)}$ , $\mathcal{T}^{(2)}$ , $W^{M}$ , $\mathcal{W}^{\mathcal{T}^{(k)}}\;(k=1,2,3)$ , $X$ , and $\mathcal{Y}^{(k)}\;(k=1,2,3)$ iteratively. We use superscripts $[t]$ and $[t-1]$ to represent the variables at iteration steps $t$ and $t-1$ , respectively.

The solutions for $M$ at each iteration can be obtained by solving the following sub-problem:

[TABLE]

Solutions for $\mathcal{T}^{(1)}$ and $\mathcal{T}^{(2)}$ at iteration step $t$ can be obtained from the following sub-problem:

[TABLE]

where $I_{M}$ and $I_{\mathcal{T}}$ are unit diagonal matrices with dimensions $n_{1}m\times n_{1}m$ and $n_{1}n_{2}n_{3}\times n_{1}n_{2}n_{3}$ , respectively.

The updates for $X$ and $\mathcal{Y}^{(k)}$ , $(k=1,2,3)$ at iteration step $t$ are given as

[TABLE]

and

[TABLE]

where $\mathrm{prox}_{\lambda}(X)=U(S-\lambda)_{+}V^{\top}$ for $X=USV^{\top}$ .
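The proximal operator above is singular value soft-thresholding. A minimal NumPy sketch follows; the function name `prox_trace_norm` is hypothetical.

```python
import numpy as np

def prox_trace_norm(x, lam):
    """Singular value soft-thresholding: U (S - lam)_+ V^T for x = U S V^T."""
    u, s, vt = np.linalg.svd(x, full_matrices=False)
    shrunk = np.maximum(s - lam, 0.0)   # (S - lam)_+ applied to each singular value
    return (u * shrunk) @ vt            # scale the columns of U, then multiply by V^T
```

For example, applying it to a diagonal matrix with entries $3$, $1$, and $0.5$ and $\lambda=1$ leaves a single nonzero singular value equal to $2$.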

The update rules for the dual variables are

[TABLE]
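The pattern of these updates (a closed-form update for the primal variable, a proximal step for the auxiliary variable, and an ascent step for the multiplier) can be illustrated on a much simpler problem. The sketch below applies ADMM to plain matrix completion with the trace norm. It is a simplification under stated assumptions, not the coupled $(\mathrm{S},\mathrm{O},\mathrm{O})$ solver: there is a single matrix $M$, a single auxiliary variable $X$ with constraint $X=M$, and the function names and default parameters are hypothetical.

```python
import numpy as np

def prox_trace_norm(x, lam):
    """Singular value soft-thresholding."""
    u, s, vt = np.linalg.svd(x, full_matrices=False)
    return (u * np.maximum(s - lam, 0.0)) @ vt

def admm_matrix_completion(m_hat, mask, lam=0.1, beta=1.0, n_iter=300):
    """ADMM for min_M 0.5*||mask*(M - m_hat)||_F^2 + lam*||X||_tr  s.t. X = M.

    mask is a 0/1 float array marking the observed entries of m_hat.
    """
    m = np.zeros_like(m_hat)
    x = np.zeros_like(m_hat)
    w = np.zeros_like(m_hat)   # Lagrangian multiplier for the constraint M = X
    for _ in range(n_iter):
        # Primal update: elementwise minimizer of the quadratic sub-problem.
        m = (mask * m_hat - w + beta * x) / (mask + beta)
        # Auxiliary update: proximal step on the trace norm.
        x = prox_trace_norm(m + w / beta, lam / beta)
        # Dual update: ascent step on the multiplier.
        w = w + beta * (m - x)
    return x
```

In the coupled setting, the same three steps are repeated with one auxiliary variable and one multiplier for each regularized unfolding.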

We can modify the above optimization procedures by replacing the variables in (10) in accordance with the norm that is used to regularize the tensor and by adjusting operations in (11), (13), (14), and (15). For example, for the norm $\|\cdot\|^{1}_{(\mathrm{O},\mathrm{O},\mathrm{O})}$ , there is only a single $\mathcal{T}$ , so the sub-problem for (13) becomes

[TABLE]

and that for (14) becomes

[TABLE]

and

[TABLE]

Additionally, the dual update rule with $\mathcal{T}$ becomes

[TABLE]

The optimization procedures for the other norms can be similarly derived.

5 Theoretical Analysis

In this section, we analyze the excess risk bounds of the completion model introduced in (1) for the coupled norms defined in Section 3 using transductive Rademacher complexity (El-Yaniv and Pechyony, 2007; Shamir and Shalev-Shwartz, 2014). Let us again consider matrix $M$ and tensor $\mathcal{T}$ and use them as a single structure $\mathbb{X}=\mathcal{T}\cup M$ with a training sample index set $\mathrm{S}_{\mathrm{Train}}$ and a testing sample index set $\mathrm{S}_{\mathrm{Test}}$ with the total set of observed samples $\mathrm{S}=\mathrm{S}_{\mathrm{Train}}\cup\mathrm{S}_{\mathrm{Test}}$ . We rewrite (1) with our new notations as an equivalent model:

[TABLE]

where $l(a,b)=(a-b)^{2}$ , $\mathbb{W}=\mathcal{W}\cup W_{M}$ is the learned coupled structure consisting of components $\mathcal{W}$ and $W_{M}$ of the tensor and matrix, respectively, $B$ is a constant, and $\|\cdot\|_{\mathrm{cn}}$ is any norm defined in Section 3.2.

Given that $l(\cdot,\cdot)$ is a $\Lambda$ -Lipschitz loss function bounded by $\sup_{i_{1},i_{2},i_{3}}|l(\mathbb{X}_{i_{1},i_{2},i_{3}},\mathbb{W}_{i_{1},i_{2},i_{3}})|\leq b_{l}$ and assuming that $|\mathrm{S}_{\mathrm{Train}}|=|\mathrm{S}_{\mathrm{Test}}|=|S|/2$ , we can obtain the following excess risk bound based on transductive Rademacher complexity theory (El-Yaniv and Pechyony, 2007; Shamir and Shalev-Shwartz, 2014) with probability $1-\delta$ ,

[TABLE]

where $R(\mathbb{W})$ is the transductive Rademacher complexity defined as

[TABLE]

where $\sigma_{i_{1},i_{2},i_{3}}\in\{-1,1\}$ with probability $0.5$ if $(i_{1},i_{2},i_{3})\in\mathrm{S}$ , or $0$ otherwise (see Appendix B for the derivation).

Next we give the bounds for (18) with respect to different coupled norms. We assume that $|\mathrm{S}_{\mathrm{Train}}|=|\mathrm{S}_{\mathrm{Test}}|$ , as in (Shamir and Shalev-Shwartz, 2014), but our theorem can be extended to more general cases. Detailed proofs of the theorems in this section are given in Appendix B.

The following two theorems give the Rademacher complexities for coupled completion regularized using the coupled norms $\|\cdot\|^{1}_{(\mathrm{O},\mathrm{O},\mathrm{O})}$ and $\|\cdot\|^{1}_{(\mathrm{S},\mathrm{S},\mathrm{S})}$ .

Theorem 3.

Let $\|\cdot\|_{\mathrm{cn}}=\|\cdot\|^{1}_{(\mathrm{O},\mathrm{O},\mathrm{O})}$ ; then, with probability $1-\delta$ ,

[TABLE]

where $(r_{1},r_{2},r_{3})$ is the multilinear rank of $\mathcal{W}$ , $r_{(1)}$ is the rank of the coupled unfolding on mode $1$ , and $B_{M}$ , $B_{T}$ , $C_{1}$ , and $C_{2}$ are constants.

Theorem 4.

Let $\|\cdot\|_{\mathrm{cn}}=\|\cdot\|^{1}_{(\mathrm{S},\mathrm{S},\mathrm{S})}$ , then, with probability $1-\delta$ ,

[TABLE]

where $(r_{1},r_{2},r_{3})$ is the multilinear rank of $\mathcal{W}$ , $r_{(1)}$ is the rank of the coupled unfolding on mode $1$ , and $B_{M}$ , $B_{T}$ , $C_{1}$ , and $C_{2}$ are constants.

We can see that in both of these theorems, the Rademacher complexity of the coupled tensor is divided by the total number of observed samples of both the matrix and the tensor. If the tensor or the matrix is completed separately, then the Rademacher complexity is divided only by the number of its own observed samples (see Theorems 7–9 in Appendix B and the discussion in Shamir and Shalev-Shwartz, 2014). This means that coupled tensor learning can lead to better performance than separate matrix or tensor learning. We can also see that, due to coupling, the excess risks are bounded by the ranks of both the tensors and the concatenated matrix of the unfolded tensors on the coupled mode. Additionally, the maximum term on the right takes the combinations of both the tensor and the concatenated matrix of the unfolded tensors on the coupled mode.

Finally, we consider the Rademacher complexity of the mixed norm $\|\cdot\|_{\mathrm{cn}}=\|\cdot\|^{1}_{(\mathrm{S},\mathrm{O},\mathrm{O})}$ .

Theorem 5.

Let $\|\cdot\|_{\mathrm{cn}}=\|\cdot\|^{1}_{(\mathrm{S},\mathrm{O},\mathrm{O})}$ ; then, with probability $1-\delta$ ,

[TABLE]

where $(r_{1},r_{2},r_{3})$ is the multilinear rank of $\mathcal{W}$ , $r_{(1)}$ is the rank of the coupled unfolding on mode $1$ , and $B_{M}$ , $B_{\mathcal{T}}$ , $C_{1}$ , and $C_{2}$ are constants.

We see that, for the mixed norm $\|\cdot\|_{\mathrm{cn}}=\|\cdot\|^{1}_{(\mathrm{S},\mathrm{O},\mathrm{O})}$ , the excess risk is bounded by the scaled rank of the coupled unfolding along the first mode. For this norm, we can see that the terms related to ranks are smaller in Theorem 3 and that the maximum term could be smaller than in Theorem 4. This means that this norm can perform better than $\|\cdot\|^{1}_{(\mathrm{O},\mathrm{O},\mathrm{O})}$ and $\|\cdot\|^{1}_{(\mathrm{S},\mathrm{S},\mathrm{S})}$ depending on the ranks and mode dimensions of the coupled tensor. The bounds of the other two mixed norms can also be derived and explained in a manner similar to Theorem 5.

6 Evaluation

We evaluated our proposed method experimentally using synthetic and real-world data.

6.1 Synthetic Data

Our main objectives were to evaluate how the proposed norms perform depending on the ranks and dimensions of the coupled tensors. We used simulation data based on CP rank and Tucker rank in these experiments.

6.1.1 Experiments Using CP Rank

To create coupled tensors with the CP rank, we first generated a $3$ -mode tensor $\mathcal{T}\in\mathbb{R}^{n_{1}\times n_{2}\times n_{3}}$ with CP rank $r$ using CP decomposition (Kolda and Bader, 2009) as $\mathcal{T}=\sum_{i=1}^{r}c_{i}u_{i}\circ v_{i}\circ w_{i}$ where $u_{i}\in\mathbb{R}^{n_{1}}$ , $v_{i}\in\mathbb{R}^{n_{2}}$ , $w_{i}\in\mathbb{R}^{n_{3}}$ , and $c_{i}\in\mathbb{R}^{+}$ . For our experiments, we used two approaches to create CP-rank-based tensors, in which the component vectors $u_{i},v_{i}$ , and $w_{i}$ were either all nonorthogonal or all orthogonal. We coupled a matrix $X\in\mathbb{R}^{n_{1}\times m}$ with rank $r$ to $\mathcal{T}$ on mode $1$ by generating $X=USV^{\top}$ , where $U(1:r,:)=[u_{1},\ldots,u_{r}]$ , $S\in\mathbb{R}^{r\times r}$ , and $V\in\mathbb{R}^{m\times r}$ is an orthogonal matrix. We also added noise sampled from a Gaussian distribution with mean zero and variance of $0.01$ to the elements of the matrix and the tensor.
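The data generation described above can be sketched in NumPy as follows for the nonorthogonal case. Several details are assumptions made for illustration only: the weights $c_{i}$ and the diagonal of $S$ are drawn uniformly from $[0.5,1.5]$, the shared vectors $u_{i}$ are taken as the columns of the left factor of $X$, and the function name `generate_cp_coupled` is hypothetical. A variance of $0.01$ corresponds to a noise standard deviation of $0.1$.

```python
import numpy as np

def generate_cp_coupled(n=(20, 20, 20), m=30, r=5, noise_std=0.1, seed=0):
    """Generate a CP-rank-r tensor and a rank-r matrix sharing the mode-1 factors."""
    rng = np.random.default_rng(seed)
    n1, n2, n3 = n
    # Nonorthogonal component vectors drawn from a normal distribution.
    u = rng.standard_normal((n1, r))
    v = rng.standard_normal((n2, r))
    w = rng.standard_normal((n3, r))
    c = rng.uniform(0.5, 1.5, size=r)            # positive weights c_i (assumed range)
    tensor = np.einsum("i,ai,bi,ci->abc", c, u, v, w)
    # Coupled matrix X = U S V^T reusing the mode-1 vectors u_i.
    s = np.diag(rng.uniform(0.5, 1.5, size=r))   # assumed singular-value range
    v_mat, _ = np.linalg.qr(rng.standard_normal((m, r)))  # orthonormal columns
    matrix = u @ s @ v_mat.T
    # Additive Gaussian noise with variance noise_std**2.
    tensor = tensor + noise_std * rng.standard_normal(tensor.shape)
    matrix = matrix + noise_std * rng.standard_normal(matrix.shape)
    return tensor, matrix
```

With `noise_std=0`, the concatenation of the mode-1 unfolding and the matrix has rank $r$, which is the low-rank structure the coupled norms are designed to exploit.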

In our experiments using synthetic data, we considered coupled structures of tensors with dimension $20\times 20\times 20$ and matrices with dimension $20\times 30$ coupled on their first modes. To simulate completion, we randomly selected observed samples with percentages of $30$ , $50$ , and $70$ of the total number of elements in both the matrix and the tensor, selected a validation set with a percentage of $10$ and took the remainder as test samples. We performed coupled completion using the proposed coupled norms of $\|\cdot\|_{(\mathrm{O},\mathrm{O},\mathrm{O})}^{1}$ , $\|\cdot\|_{(\mathrm{S},\mathrm{S},\mathrm{S})}^{1}$ , $\|\cdot\|_{(\mathrm{S},\mathrm{O},\mathrm{O})}^{1}$ , $\|\cdot\|_{(\mathrm{O},\mathrm{S},\mathrm{O})}^{1}$ , and $\|\cdot\|_{(\mathrm{O},\mathrm{O},\mathrm{S})}^{1}$ . For all the learning models with these norms, we cross-validated their regularization parameters ranging from $0.01$ to $5.0$ with intervals of $0.05$ . We ran our experiments with $10$ random selections and plotted the mean square error (MSE) for the test samples.
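The sampling protocol can be sketched as a random partition of the element indices of the coupled structure. This sketch assumes that both the training and the validation percentages are taken with respect to the total number of elements; the function name `split_indices` is hypothetical.

```python
import numpy as np

def split_indices(n_elements, train_frac, val_frac=0.1, seed=0):
    """Randomly partition element indices into training, validation, and test sets."""
    rng = np.random.default_rng(seed)
    perm = rng.permutation(n_elements)
    n_train = int(round(train_frac * n_elements))
    n_val = int(round(val_frac * n_elements))
    train = perm[:n_train]
    val = perm[n_train:n_train + n_val]
    test = perm[n_train + n_val:]   # the remainder is used for testing
    return train, val, test

# A 20 x 20 x 20 tensor and a 20 x 30 matrix have 8000 + 600 elements in total.
train, val, test = split_indices(20 * 20 * 20 + 20 * 30, train_frac=0.3)
```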

As benchmark methods, we used the overlapped trace norm (OTN) and the scaled latent trace norm (SLTN) for individual tensors and the matrix trace norm (MTN) for individual matrices. For all these norms, we cross-validated the regularization parameters ranging from $0.01$ to $5.0$ with intervals of $0.05$ . We compared our results with those of advanced coupled matrix-tensor factorization (ACMTF) (Acar et al., 2014b), for which the regularization parameters were selected using cross-validation in the range $0,0.0001,0.001,\ldots,1$ . To select ranks to use with the ACMTF method, we first ran experiments using ranks of $1,3,5,\dots,19$ and selected the rank that gave the best performance. Due to the non-convex nature of ACMTF, we ran experiments with $5$ random initializations to select the best local optimal solution.

We first ran experiments on coupled tensor completion based on CP rank in different settings. In the first experiment, we considered coupled tensors with no shared components. In this experiment, we created a tensor with CP rank $5$ in which the component vectors were nonorthogonal and generated from a normal distribution. We also created a matrix of rank $5$ and without any components in common with the tensor. Figure 2 shows that the coupled norms did not perform better than individual matrix completion using the matrix trace norm. However, for tensor completion, the coupled norm $\|\cdot\|_{(\mathrm{O},\mathrm{O},\mathrm{O})}^{1}$ had performance comparable to that of the overlapped trace norm.

We next ran experiments on coupled tensors with some components in common and with both orthogonal and nonorthogonal component vectors. We created coupled tensors with CP rank of $5$ , and both the tensor and matrix shared all components along mode $1$ . We generated the tensor with orthogonal component vectors. As shown in Figure 3, the coupled norm $\|\cdot\|_{(\mathrm{O},\mathrm{O},\mathrm{O})}^{1}$ had good performance for both the matrix and tensor.

Figure 4 shows the performance for coupled tensors with the same rank as in the previous experiment but with tensors created from nonorthogonal component vectors. Again the coupled norm $\|\cdot\|_{(\mathrm{O},\mathrm{O},\mathrm{O})}^{1}$ had better performance than individual matrix and tensor completions.

In our final experiment, we created tensors with CP rank $5$ and coupled them with a matrix of rank $10$ sharing all $5$ component vectors along mode $1$ . Figures 5 and 6 show the results for tensors created with orthogonal and nonorthogonal component vectors, respectively. In both cases, the coupled norms $\|\cdot\|_{(\mathrm{O},\mathrm{O},\mathrm{O})}^{1}$ , $\|\cdot\|_{(\mathrm{S},\mathrm{S},\mathrm{S})}^{1}$ , and $\|\cdot\|_{(\mathrm{S},\mathrm{O},\mathrm{O})}^{1}$ had better matrix completion performance than individual completion by the matrix trace norm. Similarly, as in the previous experiments, both the overlapped trace norm and the coupled norm $\|\cdot\|_{(\mathrm{O},\mathrm{O},\mathrm{O})}^{1}$ had comparable performances.

6.1.2 Simulations Using Tucker Rank

To create coupled tensors with the Tucker rank, we first generated a tensor $\mathcal{T}\in\mathbb{R}^{n_{1}\times n_{2}\times n_{3}}$ using Tucker decomposition (Kolda and Bader, 2009) as $\mathcal{T}=\mathcal{C}\times_{1}U_{1}\times_{2}U_{2}\times_{3}U_{3}$ , where $\mathcal{C}\in\mathbb{R}^{r_{1}\times r_{2}\times r_{3}}$ was the core tensor generated from a normal distribution specifying multilinear rank $(r_{1},r_{2},r_{3})$ and component matrices $U_{1}\in\mathbb{R}^{r_{1}\times p_{1}}$ , $U_{2}\in\mathbb{R}^{r_{2}\times p_{2}}$ , and $U_{3}\in\mathbb{R}^{r_{3}\times p_{3}}$ were orthogonal matrices. Next we generated a matrix that was coupled with mode $1$ of the tensor using singular value decomposition $X=USV^{\top}$ , where we specified its rank $r$ using diagonal matrix $S$ and generated matrices $U$ and $V$ as orthogonal matrices. For sharing between the matrix and the tensor, we computed $T_{(1)}=U_{n}S_{n}V^{\top}_{n}$ , replaced the first $s$ singular values of $S$ with the first $s$ singular values of $S_{n}$ , replaced the first $s$ basis vectors of $U$ with the first $s$ basis vectors of $U_{n}$ , and computed $X=USV^{\top}$ such that the coupled structure shared $s$ common components. We also added noise sampled from a Gaussian distribution with mean zero and variance $0.01$ to the elements of the coupled tensor.
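A minimal NumPy sketch of this construction is given below. It assumes factor matrices of size $n_{k}\times r_{k}$ with orthonormal columns and draws the initial singular values of $X$ uniformly from $[0.5,1.5]$; these choices and the function names are assumptions made for illustration.

```python
import numpy as np

def orthonormal(rows, cols, rng):
    """Random matrix with orthonormal columns, obtained by QR decomposition."""
    q, _ = np.linalg.qr(rng.standard_normal((rows, cols)))
    return q

def generate_tucker_coupled(n=(20, 20, 20), ranks=(5, 5, 5), m=30, r=5, s=5,
                            noise_std=0.1, seed=0):
    """Tucker-rank tensor and rank-r matrix sharing s mode-1 singular components."""
    rng = np.random.default_rng(seed)
    core = rng.standard_normal(ranks)
    u1, u2, u3 = (orthonormal(n[k], ranks[k], rng) for k in range(3))
    tensor = np.einsum("abc,ia,jb,kc->ijk", core, u1, u2, u3)
    # Matrix X = U S V^T with orthonormal U and V.
    u = orthonormal(n[0], r, rng)
    v = orthonormal(m, r, rng)
    sing = rng.uniform(0.5, 1.5, size=r)          # assumed singular-value range
    # Share components: copy the first s singular pairs of the mode-1 unfolding.
    u_n, s_n, _ = np.linalg.svd(tensor.reshape(n[0], -1), full_matrices=False)
    u[:, :s] = u_n[:, :s]
    sing[:s] = s_n[:s]
    matrix = (u * sing) @ v.T
    # Additive Gaussian noise with variance noise_std**2.
    tensor = tensor + noise_std * rng.standard_normal(tensor.shape)
    matrix = matrix + noise_std * rng.standard_normal(matrix.shape)
    return tensor, matrix
```

Without noise, the concatenation of the mode-1 unfolding and the matrix generically has rank $r_{1}+r-s$, so a larger $s$ means a lower-rank coupled unfolding.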

As in the synthetic experiments using the CP rank, we considered coupled structures with tensors with dimension $20\times 20\times 20$ and matrices with dimension $20\times 30$ coupled on their mode $1$ . We considered different multilinear ranks of tensors, ranks of matrices, and degrees of sharing among them. We used the same percentages in selecting the training, testing, and validation sets as we did in the CP rank experiments. We again compared our results with those of ACMTF.

We also used an additional non-convex coupled learning model to incorporate multilinear ranks of the coupled tensor by considering Tucker decomposition under the assumption that the components of the coupled mode were shared between both the matrix and tensor. We used the Tensorlab framework (Vervliet et al., 2016) to implement this model. We regularized the factorized components of the tensor (including the core tensor) and the matrix using the Frobenius norm. We used a regularization parameter selected from the range $0.01$ to $50$ at $5$ points spaced linearly on a logarithmic scale (in Matlab syntax exp(linspace(log(0.01), log(50), 5))). We refer to this benchmark method as NC-Tucker. Due to the non-convex nature of the model, we ran $5$ to $10$ simulations with different random initializations and selected the best local optimal solution. Specifying the multilinear rank a priori for this model would be challenging in real applications, but since we knew the rank in our simulations, we could specify the multilinear ranks to be used to create the tensors.
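For readers working in Python, the Matlab expression above for the regularization grid has a direct NumPy equivalent; the variable name is illustrative.

```python
import numpy as np

# Five values between 0.01 and 50, equally spaced on a logarithmic scale.
nc_tucker_grid = np.exp(np.linspace(np.log(0.01), np.log(50), 5))

# np.geomspace produces the same grid directly.
same_grid = np.geomspace(0.01, 50, 5)
```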

In our first simulations, we considered a coupled tensor with a matrix rank of $5$ and a tensor multilinear rank $(5,5,5)$ with no shared components. Figure 7 shows that, with this setting, individual matrix and tensor completion had better performance than that of the coupled norms. The non-convex NC-Tucker benchmark method had the best performance for the tensor but performed poorly in matrix completion compared to the coupled norms.

In our next simulation, we considered coupling of tensors and matrices with some degree of sharing among them. We created a matrix of rank $5$ and a tensor of multilinear rank $(5,5,5)$ and let them share all $5$ singular components along mode $1$ . Figure 8 shows that the coupled norm $\|\cdot\|_{(\mathrm{O},\mathrm{O},\mathrm{O})}^{1}$ had the best performance among the coupled norms for both matrix and tensor completion. Individual tensor completion with the overlapped trace norm had the same performance as $\|\cdot\|_{(\mathrm{O},\mathrm{O},\mathrm{O})}^{1}$ . The NC-Tucker method performed better than the coupled norms for tensor and matrix completion.

In our next simulation, we considered a matrix of rank $5$ and a tensor of multilinear rank $(5,15,5)$ that shared all $5$ singular components along mode $1$ . Figure 9 shows that, with this setting, although the coupled norm $\|\cdot\|_{(\mathrm{O},\mathrm{O},\mathrm{S})}^{1}$ had the best performance among the coupled norms and individual tensor completion, it was outperformed by the NC-Tucker method. However, the NC-Tucker method performed poorly in matrix completion compared to the coupled norms. For the matrix completion, individual matrix completion by the matrix trace norm had the best performance while coupled norms $\|\cdot\|_{(\mathrm{O},\mathrm{O},\mathrm{S})}^{1}$ and $\|\cdot\|_{(\mathrm{S},\mathrm{O},\mathrm{O})}^{1}$ had the next best performance.

For our final simulation, we created a coupled matrix with rank $5$ and a tensor with multilinear rank $(15,5,5)$ , all sharing $5$ singular components along mode $1$ . Figure 10 shows that the mixed coupled norms $\|\cdot\|_{(\mathrm{O},\mathrm{S},\mathrm{O})}^{1}$ and $\|\cdot\|_{(\mathrm{O},\mathrm{O},\mathrm{S})}^{1}$ performed equally and had better performance for tensor completion than the individual tensor completion. The NC-Tucker method had better performance than the coupled norms for tensor completion, while the performance was comparable for matrix completion. For matrix completion when the percentage of training samples was small, coupled norms $\|\cdot\|_{(\mathrm{O},\mathrm{O},\mathrm{O})}^{1}$ and $\|\cdot\|_{(\mathrm{S},\mathrm{O},\mathrm{O})}^{1}$ had better performance. As the percentage of training samples was increased, the performance of individual matrix completion improved while those of $\|\cdot\|_{(\mathrm{O},\mathrm{S},\mathrm{O})}^{1}$ and $\|\cdot\|_{(\mathrm{O},\mathrm{O},\mathrm{S})}^{1}$ were close but second best.

The results of these simulations show that the ACMTF performed poorly compared to our proposed methods.

6.2 Real-World Data

As a real-world data experiment, we applied our proposed method to the UCLAF dataset (Zheng et al., 2010), which consists of GPS data for $164$ users in $168$ locations performing $5$ activities, resulting in a sparse user-location-activity tensor $\mathcal{T}\in\mathbb{R}^{164\times 168\times 5}$ . This dataset also has a user-location matrix $X\in\mathbb{R}^{164\times 168}$ , which we used as side information coupled to the user mode of $\mathcal{T}$ . Using observed element percentages similar to those in the synthetic data simulations, we performed completion experiments on $\mathcal{T}$ . We considered all the elements of the user-location matrix as observed elements and used them as training data. We repeated the evaluation for $10$ random sample selections. We cross-validated the regularization parameters over $50$ values from $0.01$ to $500$ spaced linearly on a logarithmic scale. As a baseline method, we again used the ACMTF method (Acar et al., 2014b) with CP rank $5$ . Additionally, we used the coupled (Tucker) method (Ermis et al., 2015) and the NC-Tucker method with multilinear rank $(3,3,3)$ , where we selected the best performances among $5$ random initializations. Figure 11 shows the completion performances for the coupled tensor.

We can see that the best performance among coupled norms was that of mixed coupled norm $\|\cdot\|_{(\mathrm{S},\mathrm{O},\mathrm{O})}^{1}$ , indicating that learning with side information as a coupled structure improves tensor completion performance compared to completion using only tensor norms. This also indicates that mode $1$ may have a lower rank than the other modes and that modes $2$ and $3$ may have ranks closer to each other. The non-convex coupled (Tucker) method and the NC-Tucker method had better performance than $\|\cdot\|_{(\mathrm{S},\mathrm{O},\mathrm{O})}^{1}$ when the number of observed samples was less than $70$ percent of the total elements.

7 Conclusion and Future Work

We have proposed a new set of convex norms for the completion problem of coupled tensors. We restricted our study to coupling a $3$ -way tensor with a matrix and defined low-rank inducing norms by extending trace norms such as the overlapped trace norm and scaled latent trace norm of tensors and the matrix trace norm. We also introduced the concept of mixed norms, which combines the features of both overlapped and latent trace norms. We looked at the theoretical properties of our convex completion model and evaluated it using synthetic and real-world data. We found that the proposed coupled norms perform comparably with existing non-convex ones. However, our norms lead to globally optimal solutions and eliminate the need for specifying the ranks of the coupled tensors beforehand. While there are still many aspects to be studied, we believe that our work is the first step in modeling convex norms for coupled tensors.

Although coupling can occur among many tensors with different dimensions and multiple matrices on different modes, this study focused on a $3$ -mode tensor and a single matrix. The methodology used to create coupled norms can be extended to any of those settings, but mere extensions may not lead to the optimal design of norms for those settings. In particular, the square tensor norm (Mu et al., 2014) has been shown to be better suited to tensors beyond three modes and thus could also be used to model novel coupled norms in the future. Furthermore, theoretical analysis using methods such as the Gaussian width (Amelunxen et al., 2014) may provide a deeper understanding of coupled tensors, which should enable the design of better norms. Such studies could be interesting directions for future research.

Acknowledgment

MY was supported by the JST PRESTO program JPMJPR165A. HM has been partially supported by JST ACCEL Grant Number JPMJAC1503 (Japan), MEXT Kakenhi 16H02868 (Japan), FiDiPro by Tekes (currently Business Finland) and AIPSE programme by Academy of Finland.

Appendices

Appendix A Proofs of Dual Norms

We first provide the proofs of the dual norms of Theorems 1 and 2.

Proof of Theorem 1. We use Lemma 3 of (Tomioka and Suzuki, 2013) to prove the duality. Consider a linear operator $\Phi$ such that $\Phi(\mathcal{T},M)=[\mathrm{vec}(M);\mathrm{vec}(T_{(1)});\mathrm{vec}(T_{(2)});\mathrm{vec}(T_{(3)})]\in\mathbb{R}^{d_{1}+3d_{2}}$ , where $d_{1}=n_{1}m$ and $d_{2}=n_{1}n_{2}n_{3}$ . We define

[TABLE]

where $\mathcal{Z}^{(k)}$ is the inverse vectorization of elements $z_{(d_{1}+(k-1)d_{2}+1):(d_{1}+kd_{2})}$ and $X$ is the inverse vectorization of $z_{1:d_{1}}$ . The dual of the above norm is expressed as

[TABLE]

Let

[TABLE]

then, following Lemma 3 of (Tomioka and Suzuki, 2013), we write

[TABLE]

Given that

[TABLE]

and following Lemma 3 in (Tomioka and Suzuki, 2013) we obtain the dual of $\|[\mathcal{T},M]\|_{(\mathrm{O},\mathrm{O},\mathrm{O}),\underline{S_{p}/q}}^{1}$ as $\|[\mathcal{T},M]\|_{(\mathrm{L},\mathrm{L},\mathrm{L}),\overline{S_{p^{*}}/q^{*}}}^{1}$ . $\square$

Proof of Theorem 2. We can apply Theorem 1 to latent tensors $\mathcal{T}^{(1)}$ and $\mathcal{T}^{(2)}$ as well as the dual of the overlapping norm to $\mathcal{T}$ . First consider the dual with respect to $\mathcal{T}^{(1)}$ and $\mathcal{T}^{(2)}$ ; by applying Theorem 1, we obtain

[TABLE]

Next, by applying Lemma 1 of (Tomioka and Suzuki, 2013) to $\|\mathcal{T}\|_{(-,\mathrm{O},\mathrm{O})}$ , we obtain

[TABLE]

This completes the proof. $\square$

Appendix B Proofs of Excess Risk Bounds

Here we derive the excess risk bounds for the coupled completion problem.

From previous work (El-Yaniv and Pechyony, 2007; Shamir and Shalev-Shwartz, 2014), we know that for a loss function $l(\cdot,\cdot)$ that is a $\Lambda$ -Lipschitz loss function and bounded as $\sup_{i_{1},i_{2},i_{3}}|l(\mathbb{X}_{i_{1},i_{2},i_{3}},\mathbb{W}_{i_{1},i_{2},i_{3}})|\leq b_{l}$ and with the assumption that $|\mathrm{S}_{\mathrm{Train}}|=|\mathrm{S}_{\mathrm{Test}}|=|S|/2$ , we have the following bound for (16) based on transductive Rademacher complexity theory (El-Yaniv and Pechyony, 2007; Shamir and Shalev-Shwartz, 2014) with probability $1-\delta$ ,

[TABLE]

where $R(\mathbb{W})$ is transductive Rademacher complexity defined as

[TABLE]

where $\sigma_{i_{1},i_{2},i_{3}}\in\{-1,1\}$ with probability $0.5$ if $(i_{1},i_{2},i_{3})\in\mathrm{S}$ , or $0$ otherwise.

We can rewrite (20) as

[TABLE]

where we have used that $\|\mathcal{W}\|_{\mathrm{F}}\leq B_{\mathcal{T}}$ and $\|W_{M}\|_{\mathrm{F}}\leq B_{M}$ , and $\Sigma$ has the same dimensions as the coupled tensor and consists of Rademacher variables ( $\Sigma_{i_{1},i_{2},i_{3}}=\sigma_{i_{1},i_{2},i_{3}}$ if $(i_{1},i_{2},i_{3})\in\mathrm{S}$ , else $\Sigma_{i_{1},i_{2},i_{3}}=0$ ).

Proof of Theorem 3: Let $\mathbb{W}=\mathcal{W}\cup W_{M}$ , where $\mathcal{W}$ and $W_{M}$ are the completed tensors of $\mathcal{T}$ and $M$ , and let $\Sigma=\Sigma_{\mathcal{T}}\cup\Sigma_{M}$ , where $\Sigma_{\mathcal{T}}$ and $\Sigma_{M}$ consist of the corresponding Rademacher variables ( $\sigma_{i_{1},i_{2},i_{3}}$ ) for $\mathcal{T}$ and $M$ . Since we use an overlapping norm, we have $\|\mathbb{W}\|_{\mathrm{cn}}=\|\mathcal{W},W_{M}\|^{1}_{(\mathrm{O},\mathrm{O},\mathrm{O})}$ from which we obtain

[TABLE]

where $(r_{1},r_{2},r_{3})$ is the multilinear rank of $\mathcal{W}$ and $r_{(1)}$ is the rank of the concatenated matrix of unfolding tensors on mode $1$ . To obtain the above inequality, we used the fact that, for any matrix $U$ with rank $r$ , we have $\|U\|_{\mathrm{tr}}\leq\sqrt{r}\|U\|_{\mathrm{F}}$ (Tomioka and Suzuki, 2013).
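The inequality $\|U\|_{\mathrm{tr}}\leq\sqrt{r}\|U\|_{\mathrm{F}}$ follows from the Cauchy-Schwarz inequality applied to the $r$ nonzero singular values. A quick numerical check on a random rank-$3$ matrix (illustrative values only):

```python
import numpy as np

rng = np.random.default_rng(0)
r = 3
mat = rng.standard_normal((20, r)) @ rng.standard_normal((r, 30))  # rank-r matrix

singular_values = np.linalg.svd(mat, compute_uv=False)
mat_trace_norm = singular_values.sum()   # sum of singular values
mat_frobenius = np.linalg.norm(mat)      # square root of the sum of squared singular values
bound = np.sqrt(r) * mat_frobenius
```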

Using Latała’s Theorem (Latała, 2005; Shamir and Shalev-Shwartz, 2014) for the mode $k$ unfolding, we can bound $\|\Sigma_{\mathcal{T}(k)}\|_{\mathrm{op}}$

[TABLE]

and since $\sqrt[4]{|\Sigma_{\mathcal{T}(k)}|}\leq\sqrt[4]{\prod_{i=1}^{3}n_{i}}\leq\frac{1}{2}\Bigg(\sqrt{n_{k}}+\sqrt{\prod_{j\neq k}^{3}n_{j}}\Bigg)$ , we have,

[TABLE]
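Both ingredients of this step can be checked numerically. The sketch below verifies the arithmetic-geometric mean inequality used above and estimates by Monte Carlo the expected spectral norm of an unfolded tensor of Rademacher variables with zeros at unobserved positions. The observed fraction, the number of trials, and the function names are assumptions for illustration, and the simulation only suggests the $\sqrt{n_{k}}+\sqrt{\prod_{j\neq k}n_{j}}$ scaling; it does not reproduce the constant in Latała's theorem.

```python
import numpy as np

def amgm_sides(dims, k):
    """Both sides of (n_k * prod_{j != k} n_j)^(1/4) <= (sqrt(n_k) + sqrt(prod_{j != k} n_j)) / 2."""
    n_k = dims[k]
    rest = int(np.prod(dims)) // n_k
    lhs = (n_k * rest) ** 0.25
    rhs = 0.5 * (np.sqrt(n_k) + np.sqrt(rest))
    return lhs, rhs

def mean_opnorm_of_sign_unfolding(dims, k, observed_frac=0.5, n_trials=20, seed=0):
    """Monte Carlo estimate of E||Sigma_(k)||_op for a masked Rademacher tensor."""
    rng = np.random.default_rng(seed)
    norms = []
    for _ in range(n_trials):
        signs = rng.choice([-1.0, 1.0], size=dims)
        observed = rng.random(dims) < observed_frac
        sigma = np.where(observed, signs, 0.0)        # zero at unobserved entries
        unfolding = np.moveaxis(sigma, k, 0).reshape(dims[k], -1)
        norms.append(np.linalg.norm(unfolding, 2))    # spectral norm
    return float(np.mean(norms))
```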

Similarly, using Latała’s Theorem, we obtain

[TABLE]

To bound $\mathbb{E}\|\Sigma_{\mathcal{T}},\Sigma_{M}\|^{1}_{(\mathrm{O},\mathrm{O},\mathrm{O})^{*}}$ , we use the duality relationship from Theorem 1 and Corollary 1

[TABLE]

Since we can take any $\Sigma^{(k)}_{\mathcal{T}}$ to be equal to $\Sigma_{\mathcal{T}}$ , the above norm can be upper bounded:

[TABLE]

Taking the expectation leads to

[TABLE]

Finally, we have

[TABLE]

$\square$

Before we give the excess risk bound for the $\|\cdot\|^{1}_{(\mathrm{S},\mathrm{S},\mathrm{S})}$ norm, we give in the following theorem the excess risk of coupled completion with the $\|\cdot\|^{1}_{(\mathrm{L},\mathrm{L},\mathrm{L})}$ norm.

Theorem 6.

Let $\|\cdot\|_{\mathrm{cn}}=\|\cdot\|^{1}_{(\mathrm{L},\mathrm{L},\mathrm{L})}$ ; then, with probability $1-\delta$

[TABLE]

where $(r_{1},r_{2},r_{3})$ is the multilinear rank of $\mathcal{W}$ , $r_{(1)}$ is the rank of the coupled unfolding on mode $1$ and $B_{M}$ , $B_{\mathcal{T}}$ , $C_{1}$ , and $C_{2}$ are constants.

Proof: Again, let $\mathbb{W}=\mathcal{W}\cup W_{M}$ , where $\mathcal{W}$ and $W_{M}$ are the completed tensors of $\mathcal{T}$ and $M$ , and $\Sigma=\Sigma_{\mathcal{T}}\cup\Sigma_{M}$ , where $\Sigma_{\mathcal{T}}$ and $\Sigma_{M}$ consist of the corresponding Rademacher variables. We can see that

[TABLE]

which can be bounded as

[TABLE]

where the last term is derived by considering the infimum with respect to $\mathcal{W}^{(2)}$ and $\mathcal{W}^{(3)}$ .

Using the duality result given in Theorem 1 (Corollary 1) and Latała’s Theorem, we obtain

[TABLE]

Finally, we have

[TABLE]

$\square$

Proof of Theorem 4: By definition, we have

[TABLE]

which results in

[TABLE]

Using the duality result given in Theorem 1 and Latała’s Theorem, we obtain

[TABLE]

Finally, we have

[TABLE]

$\square$

Proof of Theorem 5: First let us look at $\|\mathbb{W}\|^{1}_{(\mathrm{S},\mathrm{O},\mathrm{O})}$ , which is expressed as

[TABLE]

This norm can be upper bounded as

[TABLE]

Now we are left with bounding $\|\Sigma_{\mathcal{T}},\Sigma_{M}\|^{1}_{(\mathrm{S},\mathrm{O},\mathrm{O})^{*}}$. Using Theorem 2, we obtain

[TABLE]

We then have

[TABLE]

The final resulting bound is

[TABLE]

$\square$

In addition to the above transductive bounds for completion with coupled norms, we also provide bounds for individual tensor completion with tensor norms such as the overlapped trace norm, the latent trace norm, and the scaled latent trace norm. We can consider (16) for a matrix or a tensor alone, without coupling and with low-rank regularization. We then have the transductive bound for a matrix $M\in\mathbb{R}^{n_{1}\times m}$ (Shamir and Shalev-Shwartz, 2014) as

[TABLE]

where $\mathrm{S}^{M}$ is the index set of observed samples of matrix $M$, $\hat{r}$ is the rank induced by matrix trace norm regularization, and $c$ is a constant.

Next, we consider the transductive bounds for a tensor $\mathcal{T}\in\mathbb{R}^{n_{1}\times n_{2}\times n_{3}}$ regularized with norms such as the overlapped trace norm (Tomioka and Suzuki, 2013), the latent trace norm (Tomioka and Suzuki, 2013), and the scaled latent trace norm (Wimalawarne et al., 2014) in the following three theorems. We denote the index set of observed samples of $\mathcal{T}$ by $\mathrm{S}^{\mathcal{T}}$.

Theorem 7.

Using the overlapped trace norm regularization given as $\|\mathcal{W}\|_{\mathrm{overlap}}=\|\mathcal{W}\|_{(\mathrm{O},\mathrm{O},\mathrm{O})}$, we obtain

[TABLE]

for some constant $c_{1}$, where $(\hat{r}_{1},\hat{r}_{2},\hat{r}_{3})$ is the multilinear rank of $\mathcal{W}$.

Proof: Using the same procedure as for Theorem 3, we obtain

[TABLE]

Since $\|\mathcal{W}\|_{\mathrm{overlap}}\leq\bigg(\sum_{k=1}^{3}\sqrt{\hat{r}_{k}}\bigg)B_{\mathcal{T}}$, where $\|\mathcal{W}\|_{\mathrm{F}}\leq B_{\mathcal{T}}$ (Tomioka and Suzuki, 2013), we have

[TABLE]

$\square$
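The norm inequality used in this proof, $\|\mathcal{W}\|_{\mathrm{overlap}}\leq\big(\sum_{k}\sqrt{\hat{r}_{k}}\big)\|\mathcal{W}\|_{\mathrm{F}}$, follows from applying $\|X\|_{\mathrm{tr}}\leq\sqrt{\mathrm{rank}(X)}\,\|X\|_{\mathrm{F}}$ to each unfolding. A minimal numerical check is sketched below, assuming the overlapped trace norm is the unweighted sum of the trace norms of the three unfoldings; the helper names and the tensor sizes are illustrative.

```python
import numpy as np

def unfold(tensor, mode):
    """Mode-k unfolding: rows indexed by `mode`, columns by the remaining modes."""
    return np.moveaxis(tensor, mode, 0).reshape(tensor.shape[mode], -1)

def overlapped_trace_norm(tensor):
    """Sum over modes of the trace (nuclear) norm of each unfolding."""
    return sum(np.linalg.norm(unfold(tensor, k), "nuc")
               for k in range(tensor.ndim))

rng = np.random.default_rng(1)
# Tucker-structured tensor with multilinear rank equal to the core shape.
core = rng.standard_normal((2, 3, 4))
factors = [rng.standard_normal((n, r)) for n, r in zip((10, 12, 14), core.shape)]
W = np.einsum("abc,ia,jb,kc->ijk", core, *factors)

ranks = [int(np.linalg.matrix_rank(unfold(W, k))) for k in range(W.ndim)]
frob = np.linalg.norm(W)   # Frobenius norm of W, playing the role of B_T

lhs = overlapped_trace_norm(W)
rhs = sum(np.sqrt(r) for r in ranks) * frob
print("multilinear rank:", ranks)
print("overlapped norm:", lhs, "<= bound:", rhs)
```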

Theorem 8.

Using the latent trace norm regularization given by $\|\mathcal{W}\|_{\mathrm{latent}}=\|\mathcal{W}\|_{(\mathrm{L},\mathrm{L},\mathrm{L})}$, we obtain

[TABLE]

for some constant $c_{2}$, where $(\hat{r}_{1},\hat{r}_{2},\hat{r}_{3})$ is the multilinear rank of $\mathcal{W}$.

Proof: Using the duality result from Wimalawarne et al. (2014), we have

[TABLE]

Using Latała’s Theorem, we obtain

[TABLE]

Finally, using the known bound $\|\mathcal{W}\|_{\mathrm{latent}}\leq\min_{i}\sqrt{\hat{r}_{i}}B_{\mathcal{T}}$ (Wimalawarne et al., 2014), where $\|\mathcal{W}\|_{\mathrm{F}}\leq B_{\mathcal{T}}$, we obtain the excess risk:

[TABLE]

$\square$
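The known bound on the latent trace norm cited in this proof can be recovered in a few lines. The sketch below assumes the usual definition of the latent trace norm as an infimum over additive decompositions of $\mathcal{W}$, and writes $k^{*}=\arg\min_{i}\hat{r}_{i}$ (notation introduced here for illustration).

```latex
\begin{align*}
\|\mathcal{W}\|_{\mathrm{latent}}
 &= \inf_{\mathcal{W}^{(1)}+\mathcal{W}^{(2)}+\mathcal{W}^{(3)}=\mathcal{W}}
    \sum_{k=1}^{3}\big\|\mathcal{W}^{(k)}_{(k)}\big\|_{\mathrm{tr}} \\
 &\le \big\|\mathcal{W}_{(k^{*})}\big\|_{\mathrm{tr}}
    \qquad \big(\text{take } \mathcal{W}^{(k^{*})}=\mathcal{W},\
    \mathcal{W}^{(k)}=0 \text{ for } k\neq k^{*}\big) \\
 &\le \sqrt{\hat{r}_{k^{*}}}\,\|\mathcal{W}\|_{\mathrm{F}}
  \le \min_{i}\sqrt{\hat{r}_{i}}\,B_{\mathcal{T}}.
\end{align*}
```

The second inequality uses $\|X\|_{\mathrm{tr}}\leq\sqrt{\mathrm{rank}(X)}\,\|X\|_{\mathrm{F}}$ together with the fact that every unfolding has the same Frobenius norm as the tensor.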

Theorem 9.

Using the scaled latent trace norm regularization given by $\|\mathcal{W}\|_{\mathrm{scaled}}=\|\mathcal{W}\|_{(\mathrm{S},\mathrm{S},\mathrm{S})}$, we obtain

[TABLE]

for some constant $c_{3}$, where $(\hat{r}_{1},\hat{r}_{2},\hat{r}_{3})$ is the multilinear rank of $\mathcal{W}$.

Proof: From previous work (Wimalawarne et al., 2014), we can derive

[TABLE]

Using an approach similar to that for Theorem 8, with the additional scaling of $\sqrt{n_{k}}$, and using Latała’s Theorem, we arrive at the following bound:

[TABLE]

$\square$
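For orientation, the role of the $\sqrt{n_{k}}$ scaling can be sketched as follows, assuming the definition of the scaled latent trace norm and its dual as given in Wimalawarne et al. (2014); this is an outline under those assumptions rather than a reproduction of the omitted displays.

```latex
\begin{align*}
\|\mathcal{W}\|_{\mathrm{scaled}}
 &= \inf_{\mathcal{W}^{(1)}+\mathcal{W}^{(2)}+\mathcal{W}^{(3)}=\mathcal{W}}
    \sum_{k=1}^{3}\frac{1}{\sqrt{n_{k}}}\big\|\mathcal{W}^{(k)}_{(k)}\big\|_{\mathrm{tr}}
  \le \min_{k}\sqrt{\frac{\hat{r}_{k}}{n_{k}}}\;B_{\mathcal{T}}, \\
\|\Sigma_{\mathcal{T}}\|_{\mathrm{scaled}^{*}}
 &= \max_{k}\,\sqrt{n_{k}}\,\big\|\Sigma_{\mathcal{T}(k)}\big\|_{\mathrm{op}}.
\end{align*}
```

The factor $1/\sqrt{n_{k}}$ in the norm thus reappears as $\sqrt{n_{k}}$ in the dual norm, which is the quantity bounded with Latała’s Theorem.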

Bibliography

Acar, E., Bro, R., and Smilde, A. K. (2015). Data fusion in metabolomics using coupled matrix and tensor factorizations. Proceedings of the IEEE, 103(9):1602–1620.
Acar, E., Kolda, T. G., and Dunlavy, D. M. (2011). All-at-once optimization for coupled matrix and tensor factorizations. CoRR, abs/1105.3422.
Acar, E., Nilsson, M., and Saunders, M. (2014a). A flexible modeling framework for coupled matrix and tensor factorizations. In EUSIPCO, pages 111–115.
Acar, E., Papalexakis, E. E., Gürdeniz, G., Rasmussen, M. A., Lawaetz, A. J., Nilsson, M., and Bro, R. (2014b). Structure-revealing data fusion. BMC Bioinformatics, 15:239.
Amelunxen, D., Lotz, M., McCoy, M. B., and Tropp, J. A. (2014). Living on the edge: phase transitions in convex programs with random data. Information and Inference.
Argyriou, A., Evgeniou, T., and Pontil, M. (2006). Multi-task feature learning. In NIPS, pages 41–48.
Banerjee, A., Basu, S., and Merugu, S. (2007). Multi-way clustering on relation graphs. In ICDM, pages 145–156.
Boyd, S., Parikh, N., Chu, E., Peleato, B., and Eckstein, J. (2011). Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends in Machine Learning, 3(1):1–122.
