Hyperlink Regression via Bregman Divergence

Akifumi Okuno; Hidetoshi Shimodaira

arXiv:1908.02573·cs.SI·March 31, 2020

TL;DR

This paper introduces Bregman hyperlink regression (BHLR), a flexible framework for hyper-relational learning that predicts hyperlink weights from data tuples using Bregman divergence, with proven statistical consistency and computational efficiency.

Contribution

It proposes BHLR, a unified and general approach for hyper-relational learning that encompasses existing methods and provides theoretical guarantees for consistency and tractability.

Findings

01. BHLR is statistically consistent: it asymptotically recovers the true conditional expectation of hyperlink weights given the data vectors.

02. The framework is computationally tractable via stochastic optimization with a novel minibatch sampling procedure.

03. It unifies and extends various existing hyper-relational learning methods.

Abstract

A collection of $U\,(\in\mathbb{N})$ data vectors is called a $U$-tuple, and the association strength among the vectors of a tuple is termed the hyperlink weight, which is assumed to be symmetric with respect to permutation of the entries in the index. We herein propose Bregman hyperlink regression (BHLR), which learns a user-specified symmetric similarity function such that it predicts the tuple's hyperlink weight from the data vectors stored in the $U$-tuple. BHLR is a simple and general framework for hyper-relational learning that minimizes the Bregman divergence (BD) between the hyperlink weights and the estimated similarities defined for the corresponding tuples; BHLR encompasses various existing methods, such as logistic regression ($U = 1$), Poisson regression ($U = 1$), link prediction ($U = 2$), and those for representation learning, such as graph embedding ($U = 2$), matrix…

Tables (6)

Table 1: Bregman divergence family. See, e.g., Cichocki et al. (2009), Section 2.4, and Banerjee et al. (2005), Table 1, for details.

$φ (x)$	$dom (φ)$	$d_{φ} (a, b)$	Name of $D_{φ} (𝒂, 𝒃)$
$x \log x + (1 - x) \log (1 - x)$	$[0, 1]$	$\begin{matrix} - a \log b - (1 - a) \log (1 - b) \\ + a \log a + (1 - a) \log (1 - a) \end{matrix}$	Logistic loss^† (Banerjee et al., 2005)
$x \log x - x$	$ℝ_{\geq 0}$	$a \log \frac{a}{b} - (a - b)$	Kullback–Leibler div. (Cichocki et al., 2009)
$\frac{x^{1 + β}}{β (1 + β)} - \frac{x}{β}$	$ℝ_{\geq 0}$	$\frac{a^{1 + β}}{β (1 + β)} - \frac{a b^{β}}{β} + \frac{b^{1 + β}}{1 + β}$	$β$ -div.^‡ (Basu et al., 1998)
$- \log x$	$ℝ_{> 0}$	$\frac{a}{b} - \log \frac{a}{b} - 1$	Itakura-Saito div. (Cichocki et al., 2009)
$\frac{1}{x}$	$ℝ_{> 0}$	$\frac{{(a - b)}^{2}}{a b^{2}}$	Inverse div. (Cichocki et al., 2009)
$\frac{x^{2} - x}{2}$	$ℝ$	$\frac{1}{2} {(a - b)}^{2}$	Quadratic loss (Cichocki et al., 2009)
$\exp (x)$	$ℝ$	$\exp (a) - (a - b + 1) \exp (b)$	Exponential div. (Cichocki et al., 2009)
$\log (1 + \exp (x))$	$ℝ$	$\log \frac{1 + \exp (a)}{1 + \exp (b)} - (a - b) \frac{\exp (b)}{1 + \exp (b)}$	Dual logistic loss (Boissonnat et al., 2010)

Table 2: BHLR family members.

	Method	$𝒮$	$φ$	$μ_{𝜽} (𝑿_{𝒊})$	$𝚯$	$ℐ_{n}^{(U)}$	$\{𝒙_{i}\}_{i = 1}^{n}$
$U = 1$	Poisson reg. (Cameron and Trivedi, 2007)	$ℝ_{\geq 0}$	$φ_{KL}$	$\exp (𝜽^{⊤} 𝒙_{i_{1}})$ or $\exp (f_{𝜽} (𝒙_{i_{1}}))$	$ℝ^{p}$ or $ℱ (p, 1)$	$[n]$	observed
	Logistic reg. (Bishop, 2006)	$[0, 1]$	$φ_{Logistic}$	$σ (𝜽^{⊤} 𝒙_{i_{1}})$ or $σ (f_{𝜽} (𝒙_{i_{1}}))$	$ℝ^{p}$ or $ℱ (p, 1)$	$[n]$	observed
	LS reg. (Bishop, 2006)	$ℝ$	$φ_{Quad.}$	$𝜽^{⊤} 𝒙_{i_{1}}$ or $f_{𝜽} (𝒙_{i_{1}})$	$ℝ^{p}$ or $ℱ (p, 1)$	$[n]$	observed
	PBDR (Zhang et al., 2009)	any	any^†	$g (𝜽^{⊤} 𝒙_{i})$ for some $g$	$ℝ^{p}$	$[n]$	observed
$U = 2$	Matrix Fact. (Koren et al., 2009)	any	any^†	$⟨ 𝜽^{⊤} 𝒙_{i_{1}}, 𝜽^{⊤} 𝒙_{i_{2}} ⟩$	$ℝ^{(n_{1} + n_{2}) \times K}$	$𝒞 (n_{1}, n_{2})$	$1$-hot $\in \{0, 1\}^{n_{1} + n_{2}}$
	NMF (Cichocki et al., 2009)	any	any^†	$⟨ 𝜽^{⊤} 𝒙_{i_{1}}, 𝜽^{⊤} 𝒙_{i_{2}} ⟩$	$𝒜 (n_{1} + n_{2}, K)$	$𝒞 (n_{1}, n_{2})$	$1$-hot $\in \{0, 1\}^{n_{1} + n_{2}}$
	LINE (Tang et al., 2015)	$[0, 1]$	$φ_{Logistic}$	$σ (⟨ 𝒇_{𝜽} (𝒙_{i_{1}}), 𝒇_{𝜽} (𝒙_{i_{2}}) ⟩)$	$ℱ (p, K)$	any	$1$-hot $\in \{0, 1\}^{n}$
	KL-GE (Okuno et al., 2018)	$ℝ_{\geq 0}$	$φ_{KL}$	$\exp (⟨ 𝒇_{𝜽} (𝒙_{i_{1}}), 𝒇_{𝜽} (𝒙_{i_{2}}) ⟩)$	$ℱ (p, K)$	any	observed
	$β$-GE (Okuno and Shimodaira, 2019)	$ℝ_{\geq 0}$	$φ_{β}$	$\exp (⟨ 𝒇_{𝜽} (𝒙_{i_{1}}), 𝒇_{𝜽} (𝒙_{i_{2}}) ⟩)$	$ℱ (p, K)$	any	observed
	Poincaré Emb. (Nickel and Kiela, 2017)	$[0, 1]$	$φ_{Logistic}$	$σ (- d_{Poincaré} (𝒇_{𝜽} (𝒙_{i_{1}}), 𝒇_{𝜽} (𝒙_{i_{2}})))$	$ℱ (p, K)$	any	$1$-hot $\in \{0, 1\}^{n}$
	SBM (Holland et al., 1983)	$[0, 1]$	$φ_{Logistic}$	$θ_{1} 𝟏 (x_{i_{1}} = x_{i_{2}}) + θ_{2} 𝟏 (x_{i_{1}} \neq x_{i_{2}})$	${[0, 1]}^{2}$	${[n]}^{2}$	cluster indicator $\in [C]$
$U \geq 2$	PARAFAC (Bro, 1997)	any	any^†	$⟨ 𝜽^{⊤} 𝒙_{i_{1}}, 𝜽^{⊤} 𝒙_{i_{2}}, \dots, 𝜽^{⊤} 𝒙_{i_{U}} ⟩$	$ℝ^{(\sum_{u = 1}^{U} n_{u}) \times K}$	$𝒞 (n_{1}, n_{2}, \dots, n_{U})$	$1$-hot $\in \{0, 1\}^{\sum_{u = 1}^{U} n_{u}}$
$U \geq 2$	NTF (Cichocki et al., 2009)	any	any^†	$⟨ 𝜽^{⊤} 𝒙_{i_{1}}, 𝜽^{⊤} 𝒙_{i_{2}}, \dots, 𝜽^{⊤} 𝒙_{i_{U}} ⟩$	$𝒜 (\sum_{u = 1}^{U} n_{u}, K)$	$𝒞 (n_{1}, n_{2}, \dots, n_{U})$	$1$-hot $\in \{0, 1\}^{\sum_{u = 1}^{U} n_{u}}$

Table 3: Poisson regression ($U=1$) is conducted on a randomly sampled Boston housing dataset, and the sample average and standard error of the mean squared error for 100 experiments are listed. A smaller score is better. The best score is bolded, and the second best score is underlined.

	Generating function	$μ_{𝜽} (𝒙) := \exp (f_{𝜽} (𝒙))$	$μ_{𝜽} (𝒙) := f_{𝜽} (𝒙)$
Neural Network	BHLR + $β$ -div. ( $β = 2.0$ )	$\underline{14.57} \pm 0.65$	$14.03 \pm 0.62$
	BHLR + $β$ -div. ( $β = 1.5$ )	$14.12 \pm 0.60$	$\underline{14.20} \pm 0.70$
	BHLR + $β$ -div. ( $β = 1.0$ )	$14.32 \pm 0.70$	$15.30 \pm 0.50$
	BHLR + $β$ -div. ( $β = 0.5$ )	$14.90 \pm 0.64$	$15.31 \pm 0.64$
	BHLR + $β$ -div. ( $β = 0.1$ )	$16.12 \pm 0.70$	$16.07 \pm 0.62$
	Poisson regression^† (Fallah et al., 2009)	$16.08 \pm 0.58$	$16.86 \pm 0.73$
Linear	Poisson regression^† (Cameron and Trivedi, 2013)	$18.86 \pm 0.56$
Linear	LS regression^† (Bishop, 2006)	$24.58 \pm 0.64$
Random^†		$170.01 \pm 3.51$

Table 4: Link prediction ($U=2$) is conducted on the attributed DBLP co-authorship network dataset (Desmier et al., 2012), and the sample average and standard error of the ROC-AUC test scores for 40 experiments are listed. A higher score is better. The best score is bolded, and the second best score is underlined.

$K = 10$	Method	$m_{+}/m_{-} = 1/15$	$m_{+}/m_{-} = 3/13$	$m_{+}/m_{-} = 6/10$	$m_{+}/m_{-} = 10/6$	Best (validated)
Neural network	BHLR + exponential div.	$81.5 \pm 0.4$	$\underline{82.5} \pm 0.2$	$82.7 \pm 0.4$	$\underline{82.7} \pm 0.3$	$83.0 \pm 0.4$
	BHLR + dual logistic loss	$80.0 \pm 0.1$	$81.4 \pm 0.2$	$81.7 \pm 0.2$	$81.5 \pm 0.1$	$81.7 \pm 0.2$
	KL-GE^†,1 (Okuno et al., 2018)	$80.1 \pm 0.2$	$81.5 \pm 0.3$	$82.1 \pm 0.2$	$82.1 \pm 0.2$	$82.2 \pm 0.3$
	$β$ -GE^†,2 (Okuno and Shimodaira, 2019) ( $β = 0.1$ )	$\underline{81.4} \pm 0.1$	$82.3 \pm 0.2$	$82.3 \pm 0.2$	$\underline{82.7} \pm 0.2$	$82.3 \pm 0.3$
	$β$ -GE^†,2 (Okuno and Shimodaira, 2019) ( $β = 0.5$ )	$80.6 \pm 0.3$	$82.2 \pm 0.2$	$\underline{82.5} \pm 0.2$	$82.9 \pm 0.2$	$82.2 \pm 0.3$
	$β$ -GE^†,2 (Okuno and Shimodaira, 2019) ( $β = 1$ )	$81.2 \pm 0.3$	$82.2 \pm 0.2$	$82.4 \pm 0.3$	$82.4 \pm 0.2$	$82.2 \pm 0.3$
	LINE^†,3 (Tang et al., 2015)	$\underline{81.4} \pm 0.2$	$82.6 \pm 0.1$	$82.0 \pm 0.2$	$82.3 \pm 0.3$	$\underline{82.8} \pm 0.2$
Linear	LPP^† (He and Niyogi, 2004)	$78.9 \pm 0.3$

Table 5: Hyperlink prediction ($U=3$) with the setting (a) is conducted on the attributed DBLP co-authorship network dataset (Desmier et al., 2012), and the sample average and standard error of the ROC-AUC test scores for 40 experiments are listed. A higher score is better. The best score is bolded, and the second best score is underlined.

$K = 10$	Method	$m_{+}/m_{-} = 1/15$	$m_{+}/m_{-} = 3/13$	$m_{+}/m_{-} = 6/10$	$m_{+}/m_{-} = 10/6$	Best (validated)
Neural network	BHLR + exponential div.	$86.1 \pm 0.3$	$87.3 \pm 0.3$	$\underline{87.5} \pm 0.3$	$87.2 \pm 0.3$	$\underline{87.3} \pm 0.3$
	BHLR + dual logistic loss	$85.4 \pm 0.3$	$86.1 \pm 0.2$	$86.0 \pm 0.3$	$86.7 \pm 0.3$	$86.2 \pm 0.3$
	BHLR + KL-div.	$85.2 \pm 0.3$	$85.1 \pm 0.2$	$85.3 \pm 0.3$	$85.6 \pm 0.2$	$85.4 \pm 0.3$
	BHLR + $β$ -div. ( $β = 0.1$ )	$85.3 \pm 0.3$	$85.5 \pm 0.3$	$85.8 \pm 0.3$	$85.8 \pm 0.2$	$85.8 \pm 0.3$
	BHLR + $β$ -div. ( $β = 0.5$ )	$85.3 \pm 0.3$	$86.1 \pm 0.2$	$86.9 \pm 0.3$	$86.3 \pm 0.3$	$86.6 \pm 0.3$
	BHLR + $β$ -div. ( $β = 1$ )	$85.7 \pm 0.3$	$\underline{86.5} \pm 0.2$	$86.8 \pm 0.3$	$\underline{87.0} \pm 0.3$	$\underline{87.3} \pm 0.2$
	BHLR + logistic loss	$\underline{86.0} \pm 0.3$	$87.3 \pm 0.3$	$87.9 \pm 0.2$	$87.2 \pm 0.2$	$87.4 \pm 0.3$
Linear	HIMFAC^† (Nori et al., 2012) + (i)	$48.4 \pm 0.5$
Linear	HIMFAC^† (Nori et al., 2012) + (ii)	$76.9 \pm 0.3$

Table 6: Hyperlink prediction ($U=3$) with the setting (b) is conducted on the attributed DBLP co-authorship network dataset (Desmier et al., 2012), and the sample average and standard error of the ROC-AUC test scores for 40 experiments are listed. A higher score is better. The best score is bolded, and the second best score is underlined.

$K = 10$	Method	$m_{+}/m_{-} = 1/15$	$m_{+}/m_{-} = 3/13$	$m_{+}/m_{-} = 6/10$	$m_{+}/m_{-} = 10/6$	Best (validated)
Neural network	BHLR + exponential div.	$\underline{85.7} \pm 0.3$	$\underline{86.6} \pm 0.3$	$\underline{86.7} \pm 0.3$	$\underline{86.9} \pm 0.3$	$\underline{86.7} \pm 0.3$
	BHLR + dual logistic loss	$\underline{85.7} \pm 0.4$	$85.9 \pm 0.4$	$86.0 \pm 0.3$	$86.2 \pm 0.3$	$86.4 \pm 0.3$
	BHLR + KL-div.	$84.5 \pm 0.4$	$85.0 \pm 0.4$	$85.6 \pm 0.4$	$85.3 \pm 0.5$	$86.1 \pm 0.5$
	BHLR + $β$ -div. ( $β = 0.1$ )	$84.9 \pm 0.4$	$85.7 \pm 0.3$	$85.5 \pm 0.3$	$85.8 \pm 0.3$	$85.9 \pm 0.4$
	BHLR + $β$ -div. ( $β = 0.5$ )	$85.0 \pm 0.4$	$85.7 \pm 0.3$	$85.9 \pm 0.3$	$86.3 \pm 0.4$	$86.5 \pm 0.4$
	BHLR + $β$ -div. ( $β = 1$ )	$85.4 \pm 0.4$	$86.0 \pm 0.4$	$\underline{86.7} \pm 0.3$	$86.4 \pm 0.3$	$86.6 \pm 0.3$
	BHLR + logistic loss	$85.9 \pm 0.3$	$86.8 \pm 0.3$	$87.2 \pm 0.3$	$87.3 \pm 0.3$	$86.8 \pm 0.3$
Linear	HIMFAC^† (Nori et al., 2012) + (i)	$49.1 \pm 1.3$
Linear	HIMFAC^† (Nori et al., 2012) + (ii)	$82.6 \pm 0.4$

Equations

$$D_{\varphi}(\boldsymbol{a},\boldsymbol{b})$$

$$d_{\varphi}(a,b)$$

$$\mu_{\boldsymbol{\theta}}(\boldsymbol{X})\approx\mu_{*}(\boldsymbol{X}),\quad\boldsymbol{X}\in\mathcal{X}^{U}$$

$$\prod_{1\leq i_{1}<i_{2}\leq n}\underbrace{\mu_{*}(\boldsymbol{X}_{i_{1},i_{2}})^{w_{i_{1}i_{2}}}\,(1-\mu_{*}(\boldsymbol{X}_{i_{1},i_{2}}))^{1-w_{i_{1}i_{2}}}}_{=\,q(w_{i_{1}i_{2}}\mid\boldsymbol{X}_{i_{1},i_{2}})}\;\prod_{i=1}^{n}q_{X}(\boldsymbol{x}_{i}).$$

$$\mu_{\boldsymbol{\theta}}(\boldsymbol{X}_{\boldsymbol{i}})\approx w_{\boldsymbol{i}},\quad\boldsymbol{i}\in\mathcal{I}_{n}^{(U)}$$

$$q(w_{\boldsymbol{i}}\mid\boldsymbol{X}_{\boldsymbol{i}}):=\tilde{q}(w_{r(\boldsymbol{i})}\mid\boldsymbol{X}_{r(\boldsymbol{i})}),\quad(\boldsymbol{i}\in\mathcal{I}_{n}^{(U)}).$$

$$\prod_{\boldsymbol{i}\in\mathcal{I}_{n}^{(U)}}q(w_{\boldsymbol{i}}\mid\boldsymbol{X}_{\boldsymbol{i}}),$$

$$\frac{1}{n}\sum_{i=1}^{n}D_{\varphi}(\boldsymbol{q}_{i},\boldsymbol{p}_{\boldsymbol{\theta},i}),$$

$$\begin{aligned}
|\mathcal{I}|\,D_{\varphi}(\hat{\boldsymbol{q}}_{i},\boldsymbol{p}_{\boldsymbol{\theta},i})
&=\sum_{w\in\mathbb{N}_{0}}\left\{\varphi^{\prime}(p_{\boldsymbol{\theta},iw})\,p_{\boldsymbol{\theta},iw}-\varphi(p_{\boldsymbol{\theta},iw})-\varphi^{\prime}(p_{\boldsymbol{\theta},iw})\,\hat{q}_{iw}\right\}+\text{Const.}\\
&=\sum_{w\in\mathbb{N}_{0}}\left\{\varphi^{\prime}(p_{\boldsymbol{\theta}}(w\mid\boldsymbol{x}_{i}))\,p_{\boldsymbol{\theta}}(w\mid\boldsymbol{x}_{i})-\varphi(p_{\boldsymbol{\theta}}(w\mid\boldsymbol{x}_{i}))\right\}-\varphi^{\prime}(p_{\boldsymbol{\theta}}(w_{i}\mid\boldsymbol{x}_{i}))+\text{Const.}\\
&\quad\left(\because\ p_{\boldsymbol{\theta},iw}=p_{\boldsymbol{\theta}}(w\mid\boldsymbol{x}_{i}),\quad\hat{q}_{iw}=\begin{cases}1&(w=w_{i})\\0&(w\neq w_{i})\end{cases}\right)
\end{aligned}$$

$$\frac{1}{n}\sum_{i=1}^{n}\bigg\{\underbrace{\sum_{w\in\mathbb{N}_{0}}\left(\varphi^{\prime}(p_{\boldsymbol{\theta}}(w\mid\boldsymbol{x}_{i}))\,p_{\boldsymbol{\theta}}(w\mid\boldsymbol{x}_{i})-\varphi(p_{\boldsymbol{\theta}}(w\mid\boldsymbol{x}_{i}))\right)}_{(\star)}-\varphi^{\prime}(p_{\boldsymbol{\theta}}(w_{i}\mid\boldsymbol{x}_{i}))\bigg\}.$$

$$D_{\varphi}(\{\mu_{*}(\boldsymbol{x}_{i})\}_{i=1}^{n},\{\mu_{\boldsymbol{\theta}}(\boldsymbol{x}_{i})\}_{i=1}^{n}),$$

$$D_{\varphi}(\{w_{i}\}_{i=1}^{n},\{\mu_{\boldsymbol{\theta}}(\boldsymbol{x}_{i})\}_{i=1}^{n})=\frac{1}{n}\sum_{i=1}^{n}\left\{\varphi^{\prime}(\mu_{\boldsymbol{\theta}}(\boldsymbol{x}_{i}))\,\mu_{\boldsymbol{\theta}}(\boldsymbol{x}_{i})-\varphi(\mu_{\boldsymbol{\theta}}(\boldsymbol{x}_{i}))-w_{i}\,\varphi^{\prime}(\mu_{\boldsymbol{\theta}}(\boldsymbol{x}_{i}))\right\}+C$$

$$L_{\varphi,n}(\boldsymbol{\theta})=\frac{1}{|\mathcal{I}_{n}^{(U)}|}\sum_{\boldsymbol{i}\in\mathcal{I}_{n}^{(U)}}\left\{\varphi^{\prime}(\mu_{\boldsymbol{\theta}}(\boldsymbol{X}_{\boldsymbol{i}}))\,\mu_{\boldsymbol{\theta}}(\boldsymbol{X}_{\boldsymbol{i}})-\varphi(\mu_{\boldsymbol{\theta}}(\boldsymbol{X}_{\boldsymbol{i}}))-w_{\boldsymbol{i}}\,\varphi^{\prime}(\mu_{\boldsymbol{\theta}}(\boldsymbol{X}_{\boldsymbol{i}}))\right\}+C,$$

$$\hat{\boldsymbol{\theta}}_{\varphi,n}:=\mathop{\mathrm{arg\,min}}_{\boldsymbol{\theta}\in\boldsymbol{\Theta}}L_{\varphi,n}(\boldsymbol{\theta}).$$

$$\mu_{\boldsymbol{\theta}}(\boldsymbol{x}_{i_{1}},\boldsymbol{x}_{i_{2}},\ldots,\boldsymbol{x}_{i_{U}})=\mu_{\boldsymbol{\theta}}(\boldsymbol{x}_{i_{1}^{\prime}},\boldsymbol{x}_{i_{2}^{\prime}},\ldots,\boldsymbol{x}_{i_{U}^{\prime}})$$

$$\mu_{\boldsymbol{\theta}}(\boldsymbol{x}_{i_{1}},\boldsymbol{x}_{i_{2}},\ldots,\boldsymbol{x}_{i_{U}})=\eta\left(\langle\boldsymbol{f}_{\boldsymbol{\theta}}(\boldsymbol{x}_{i_{1}}),\boldsymbol{f}_{\boldsymbol{\theta}}(\boldsymbol{x}_{i_{2}}),\ldots,\boldsymbol{f}_{\boldsymbol{\theta}}(\boldsymbol{x}_{i_{U}})\rangle\right),$$

$$p_{\zeta}(w\mid\mu)$$

$$\begin{aligned}
\exp\left(-|\mathcal{I}_{n}^{(U)}|\,L_{\varphi,n}(\boldsymbol{\theta})\right)
&=\prod_{\boldsymbol{i}\in\mathcal{I}_{n}^{(U)}}\exp\left(-\left\{\varphi(w_{\boldsymbol{i}})-\varphi(\mu_{\boldsymbol{\theta}}(\boldsymbol{X}_{\boldsymbol{i}}))-\varphi^{\prime}(\mu_{\boldsymbol{\theta}}(\boldsymbol{X}_{\boldsymbol{i}}))\,(w_{\boldsymbol{i}}-\mu_{\boldsymbol{\theta}}(\boldsymbol{X}_{\boldsymbol{i}}))\right\}\right)\\
&=D\cdot\prod_{\boldsymbol{i}\in\mathcal{I}_{n}^{(U)}}\exp\left(w_{\boldsymbol{i}}\,\zeta_{1}(\mu_{\boldsymbol{\theta}}(\boldsymbol{X}_{\boldsymbol{i}}))+\zeta_{2}(\mu_{\boldsymbol{\theta}}(\boldsymbol{X}_{\boldsymbol{i}}))+\zeta_{3}(w_{\boldsymbol{i}})\right)\\
&=:D\cdot\prod_{\boldsymbol{i}\in\mathcal{I}_{n}^{(U)}}p_{\zeta}(w_{\boldsymbol{i}}\mid\mu_{\boldsymbol{\theta}}(\boldsymbol{X}_{\boldsymbol{i}})),
\end{aligned}$$

$$L_{\varphi_{\text{Logistic}},n}(\boldsymbol{\theta})$$

$$L_{\varphi_{\text{KL}},n}(\boldsymbol{\theta})$$

$$L_{\varphi_{\text{Quad.}},n}(\boldsymbol{\theta})$$

$$L_{\varphi_{\beta},n}(\boldsymbol{\theta})$$

$$C_{\text{Logistic}}^{(U)}$$

$$C_{\text{KL}}^{(U)}$$

$$\mathcal{A}(p,K)$$

$$\mathcal{F}(p,K)$$

$$\mathcal{C}(n_{1},n_{2},\ldots,n_{U})$$

$$i_{2}=n_{1}+1,\,n_{1}+2,\,\ldots,\,n_{1}+n_{2};\ \ldots;\ i_{U}=\sum_{u=1}^{U-1}n_{u}+1,\,\ldots,\,\sum_{u=1}^{U}n_{u}\},$$

$$\boldsymbol{W}=(w_{\boldsymbol{i}})=\begin{pmatrix}\boldsymbol{O}_{n_{1}\times n_{1}}&\boldsymbol{V}\\ \boldsymbol{V}^{\top}&\boldsymbol{O}_{n_{2}\times n_{2}}\end{pmatrix},$$

$$D_{\varphi}\left(\{v_{\boldsymbol{j}}\}_{\boldsymbol{j}\in[n_{1}]\times[n_{2}]},\ \{(\boldsymbol{\xi}^{(1)}\boldsymbol{\xi}^{(2)\top})_{\boldsymbol{j}}\}_{\boldsymbol{j}\in[n_{1}]\times[n_{2}]}\right)$$

Taxonomy

Topics: Tensor decomposition and applications · Advanced Graph Neural Networks · Complex Network Analysis Techniques

Methods: Logistic Regression

Full text

Hyperlink Regression via Bregman Divergence

Akifumi Okuno, RIKEN Center for Advanced Intelligence Project

Hidetoshi Shimodaira, RIKEN Center for Advanced Intelligence Project

Graduate School of Informatics, Kyoto University

Abstract

A collection of $U\,(\in\mathbb{N})$ data vectors is called a $U$-tuple, and the association strength among the vectors of a tuple is termed the hyperlink weight, which is assumed to be symmetric with respect to permutation of the entries in the index. We herein propose Bregman hyperlink regression (BHLR), which learns a user-specified symmetric similarity function such that it predicts the tuple's hyperlink weight from the data vectors stored in the $U$-tuple. BHLR is a simple and general framework for hyper-relational learning that minimizes the Bregman divergence (BD) between the hyperlink weights and the estimated similarities defined for the corresponding tuples; BHLR encompasses various existing methods, such as logistic regression ($U=1$), Poisson regression ($U=1$), link prediction ($U=2$), and those for representation learning, such as graph embedding ($U=2$), matrix factorization ($U=2$), tensor factorization ($U\geq 2$), and their variants equipped with arbitrary BD. Nonlinear functions (e.g., neural networks) can be employed as the similarity functions. However, BHLR poses theoretical challenges: unlike the i.i.d. setting of classical regression, different tuples may share data vectors. We address these theoretical issues and prove that BHLR equipped with arbitrary BD and $U\in\mathbb{N}$ is (P-1) statistically consistent, that is, it asymptotically recovers the underlying true conditional expectation of hyperlink weights given data vectors, and (P-2) computationally tractable, that is, it is efficiently computed by stochastic optimization algorithms using a novel generalized minibatch sampling procedure for hyper-relational data. Consequently, theoretical guarantees for BHLR, including several existing methods that have so far been examined only experimentally, are provided in a unified manner.

1 Introduction

Many real-world datasets are in the form of undirected graphs comprising nodes and their links, where nodes may have attributes called data vectors and the links are specified by link weights representing the strength of association between the corresponding data vectors. A friend network is an example whose data vectors and binary link weights represent properties of people and their friendships, respectively.

Although such a graph-structured dataset contains rich information, a large number of underlying link weights may be missing in practice (Clauset et al., 2008; Lü and Zhou, 2011). Such missing link weights may be inferred by considering the observed link weights; for instance, two nodes that are connected to the same types of nodes in common are supposed to have high link weights (Lü and Zhou, 2011; Liben-Nowell and Kleinberg, 2007). However, such an inference deteriorates easily when no or only a few positive link weights to the target nodes are observed.
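The neighborhood-based inference described above, and the way it breaks down for a node with no observed links, can be illustrated with a small sketch. The common-neighbor count used here is one standard heuristic of this kind, chosen for illustration; the toy graph and the function name are hypothetical and not taken from the paper.

```python
def common_neighbor_score(adjacency, i, j):
    """Number of nodes that are linked to both i and j in the observed graph."""
    return len(adjacency[i] & adjacency[j])

# Hypothetical observed undirected graph: node -> set of observed neighbors.
adjacency = {
    "a": {"c", "d"},
    "b": {"c", "d"},
    "c": {"a", "b"},
    "d": {"a", "b"},
    "e": set(),  # no observed links at all
}

# "a" and "b" are not linked but share two neighbors, so a link looks plausible.
score_ab = common_neighbor_score(adjacency, "a", "b")

# For "e", every candidate gets the same score, so nothing can be inferred
# from the link structure alone.
scores_e = {v: common_neighbor_score(adjacency, "e", v) for v in adjacency if v != "e"}
```

In this toy graph the score for node `e` is identical for every candidate, which is exactly the situation in which node data vectors become the only usable signal.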

Even in such a severe situation, missing link weights can be inferred by additionally using the node data vectors, because similarities between data vectors are informative about the link weights. Accordingly, various methods that infer link weights from data vectors, nowadays often implemented with neural networks, have been developed. We refer to these methods collectively as link regression.

A simple implementation of link regression is similarity learning, where a user-specified similarity function defined for pairs of data vectors is trained to predict link weights. Although arbitrary similarity functions can be employed, many existing studies leverage the Mahalanobis distance (De Maesschalck et al., 2000) and Mahalanobis inner product (Kung, 2014). Using these Mahalanobis similarities is mathematically equivalent to using the Euclidean distance or inner product between low-dimensional linearly transformed data vectors (Goldberger et al., 2005), implying that Mahalanobis similarity learning implicitly obtains the optimal low-dimensional linear transformation of data vectors.
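The equivalence between Mahalanobis similarities and Euclidean similarities after a linear transformation can be checked numerically. The sketch below assumes the Mahalanobis matrix is parametrized as $M = L^{\top}L$ for a $K\times p$ matrix $L$; the matrix `L` and the vectors are hypothetical toy values.

```python
import math

def matvec(A, x):
    """Multiply a matrix A (list of rows) by a vector x."""
    return [sum(a_ij * x_j for a_ij, x_j in zip(row, x)) for row in A]

def gram(L):
    """Return M = L^T L, a positive semi-definite p x p matrix."""
    p = len(L[0])
    return [[sum(row[i] * row[j] for row in L) for j in range(p)] for i in range(p)]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def mahalanobis_distance(x, y, M):
    """sqrt((x - y)^T M (x - y))."""
    d = [a - b for a, b in zip(x, y)]
    return math.sqrt(dot(d, matvec(M, d)))

def mahalanobis_inner(x, y, M):
    """x^T M y."""
    return dot(x, matvec(M, y))

# Hypothetical transformation from p = 3 to K = 2 dimensions.
L = [[1.0, 0.5, 0.0],
     [0.0, 2.0, -1.0]]
M = gram(L)
x = [1.0, 2.0, 3.0]
y = [0.0, -1.0, 0.5]

Lx, Ly = matvec(L, x), matvec(L, y)
d_mahalanobis = mahalanobis_distance(x, y, M)
d_euclid_low = math.sqrt(sum((a - b) ** 2 for a, b in zip(Lx, Ly)))
ip_mahalanobis = mahalanobis_inner(x, y, M)
ip_low = dot(Lx, Ly)
```

Both pairs of quantities agree up to floating-point error, so learning $M$ amounts to learning the low-dimensional map $\boldsymbol{x}\mapsto L\boldsymbol{x}$.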

Obtaining such an optimal transformation is also known as graph embedding (GE). GE is a method for representation learning; it computes feature vectors such that their inner products predict link weights, and the obtained feature vectors can be used for a variety of downstream tasks in machine learning and statistics. Neural networks (NNs) have recently been incorporated for computing the feature vectors (Tang et al., 2015) to enhance the expressive power of GE. Graph embedding with NNs demonstrates promising experimental performance with some theoretical justification; Okuno et al. (2018) proved that the inner product similarity (IPS) between NN-based transformations of data vectors can approximate arbitrary positive-definite (PD) similarities. Furthermore, Okuno et al. (2019) proposed a shifted IPS that introduces NN-based bias terms to approximate a larger class of similarities, called conditionally PD similarities, which includes PD similarities and some non-PD similarities as special cases; an example is the recently popular negative Poincaré distance (Nickel and Kiela, 2017, 2018) for embedding in hyperbolic space. In addition, Kim et al. (2019) proposed a weighted IPS for approximating general similarities. Therefore, GE equipped with these similarities can be regarded as a theoretically guaranteed and highly expressive link regression.

Along with the development of highly expressive GEs, progress has been made in replacing the loss functions used for learning GE. Whereas many GEs minimize the logistic loss (Tang et al., 2015) or the Kullback–Leibler (KL) divergence (Okuno et al., 2018) between the observed link weights and those predicted from data vectors, Okuno and Shimodaira (2019) recently proposed $\beta$-GE, which instead minimizes the $\beta$-divergence (Basu et al., 1998); the $\beta$-divergence reduces to the KL divergence in the limit $\beta\to 0$. In addition to showing the robustness of $\beta$-GE against noisy link weights, Okuno and Shimodaira (2019) proved that $\beta$-GE exhibits the following two desirable properties: (P-1) statistical consistency, that is, it asymptotically recovers the underlying true conditional expectation of link weights given data vectors, and (P-2) computational tractability, that is, it can be computed efficiently by stochastic algorithms using minibatch sampling for relational data.
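The relation between the $\beta$-divergence and the KL divergence can be checked numerically. The sketch below uses the scalar forms of $d_{\varphi}(a,b)$ listed in Table 1; the test values of `a`, `b`, and the grid of `beta` are arbitrary choices for illustration.

```python
import math

def d_beta(a, b, beta):
    """Scalar beta-divergence d_phi(a, b) from Table 1, for beta > 0."""
    return (a ** (1 + beta) / (beta * (1 + beta))
            - a * b ** beta / beta
            + b ** (1 + beta) / (1 + beta))

def d_kl(a, b):
    """Scalar (generalized) KL divergence d_phi(a, b) from Table 1."""
    return a * math.log(a / b) - (a - b)

a, b = 2.0, 0.7
betas = (1e-1, 1e-2, 1e-3, 1e-4)

# Absolute gap between the two divergences for decreasing beta.
gaps = [abs(d_beta(a, b, beta) - d_kl(a, b)) for beta in betas]
```

The gap shrinks roughly linearly in $\beta$, which is consistent with the KL divergence being the $\beta\to 0$ limit rather than a value attained by plugging $\beta=0$ into the formula (the formula divides by $\beta$).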

Although the existing GEs above achieved success from both theoretical and application perspectives, several challenges still remain.

The first challenge is that the existing GEs consider only link weights defined between two nodes, although link weights can be defined in the same way for a set of three or more nodes. We call a weight defined for three or more nodes a hyperlink weight. Hyperlink weights appear in many practical situations; in a friend network, the existence of a group to which all of the selected $U(\geq 2)$ people belong can be expressed as a binary hyperlink weight. Similarly, the number of papers co-authored by all of the selected $U(\geq 2)$ people in a co-authorship network can be represented as a hyperlink weight taking values in the non-negative integers. The existing link regression methods, including metric learning and GE, cannot handle such hyperlink weights.
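The co-authorship example can be made concrete with a short sketch that counts, for every group of $U$ authors, the papers written jointly by all of them. The toy paper list and the function names are hypothetical.

```python
from itertools import combinations

# Hypothetical toy data: each paper is represented by the set of its author ids.
papers = [
    {1, 2, 3},
    {1, 2, 3, 4},
    {1, 3},
    {2, 4},
]

def hyperlink_weight(authors, papers):
    """Number of papers co-authored by *all* of the given authors."""
    group = set(authors)
    return sum(1 for paper in papers if group <= paper)

def all_hyperlink_weights(n_authors, U, papers):
    """Weights for every U-subset of authors 1..n_authors.

    Keys are sorted tuples; since the weight depends only on the set of
    authors, it is symmetric under permutation of the index by construction.
    """
    return {index: hyperlink_weight(index, papers)
            for index in combinations(range(1, n_authors + 1), U)}

weights_U3 = all_hyperlink_weights(4, 3, papers)  # hyperlink weights, U = 3
weights_U2 = all_hyperlink_weights(4, 2, papers)  # ordinary link weights, U = 2
```

With $U=2$ the same function returns the usual pairwise co-authorship counts, which is the sense in which hyperlink weights generalize link weights.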

The second challenge is that it is unclear whether the properties (P-1) and (P-2) above hold only for the $\beta$-divergence function class or also for some larger function classes. Because only $\beta$-GE has been theoretically proven to exhibit such favorable properties, the present situation may limit the choice of loss function and may result in a missed opportunity to improve the performance of GE.

To address these two challenges simultaneously, we propose Bregman hyperlink regression (BHLR) by (i) extending link regression to hyperlink regression (HLR), which predicts the hyperlink weight defined for a collection of $U(\in\mathbb{N})$ vectors called a $U$-tuple, and (ii) employing the Bregman divergence (BD), which includes many loss functions such as the logistic loss, KL divergence, and $\beta$-divergence as special cases. BHLR is a general framework for hyper-relational learning that encompasses various existing methods; we show that BHLR in general possesses the two desirable properties (P-1) statistical consistency and (P-2) computational tractability.

1.1 Contribution

The contribution of this study is summarized as follows.

1. In Section 3.4, we propose BHLR, a simple and general framework for hyper-relational learning. BHLR predicts the hyperlink weight $w_{i_{1},i_{2},\ldots,i_{U}}\in\mathcal{S}\,(\subset\mathbb{R})$ from the corresponding tuple of data vectors $\boldsymbol{x}_{i_{1}},\boldsymbol{x}_{i_{2}},\ldots,\boldsymbol{x}_{i_{U}}\in\mathcal{X}\,(\subset\mathbb{R}^{p})$ through a user-specified symmetric similarity function $\mu_{\boldsymbol{\theta}}(\boldsymbol{x}_{i_{1}},\boldsymbol{x}_{i_{2}},\ldots,\boldsymbol{x}_{i_{U}})$; highly expressive nonlinear functions, e.g., neural networks, can be employed for the similarity function.

2. In Section 4, we demonstrate that BHLR encompasses various existing methods, such as logistic regression ($U=1$), Poisson regression ($U=1$), and link prediction ($U=2$). Furthermore, BHLR also includes methods for representation learning, such as graph embedding ($U=2$), matrix factorization ($U=2$), tensor factorization ($U\geq 2$), and their variants equipped with arbitrary BD; the feature vectors obtained through these representation learning methods can be used for a variety of downstream tasks (e.g., clustering and visualization) besides predicting hyperlink weights.

3. In Section 5, we prove the following properties (P-1) and (P-2) for BHLR equipped with arbitrary BD and $U\in\mathbb{N}$:

(P-1) Statistical consistency. Some tuples in hyper-relational learning may share data vectors. For instance, two different tuples $\boldsymbol{X}_{(1,2,3)}=(\boldsymbol{x}_{1},\boldsymbol{x}_{2},\boldsymbol{x}_{3})$ and $\boldsymbol{X}_{(1,3,4)}=(\boldsymbol{x}_{1},\boldsymbol{x}_{3},\boldsymbol{x}_{4})$ share the two data vectors $\boldsymbol{x}_{1},\boldsymbol{x}_{3}$. This data structure makes the underlying theory for BHLR differ from that of classical regression; Proposition 1 proves that the convergence rate of the loss function used in BHLR is $O(1/\sqrt{n})$ even if $O(n^{U})$ tuples are leveraged; this rate is similar to that of a $U$-statistic and differs from the rate $O(1/\sqrt{n^{U}})$ of classical regression using $O(n^{U})$ i.i.d. data vectors. Also, Theorem 1 proves that the similarity $\mu_{\hat{\boldsymbol{\theta}}_{\varphi,n}}(\boldsymbol{x}_{i_{1}},\boldsymbol{x}_{i_{2}},\ldots,\boldsymbol{x}_{i_{U}})$ estimated via BHLR asymptotically recovers the underlying true conditional expectation of the tuple's hyperlink weight $\mu_{*}(\boldsymbol{x}_{i_{1}},\boldsymbol{x}_{i_{2}},\ldots,\boldsymbol{x}_{i_{U}}):=\mathbb{E}(w_{i_{1},i_{2},\ldots,i_{U}}\mid\boldsymbol{x}_{i_{1}},\boldsymbol{x}_{i_{2}},\ldots,\boldsymbol{x}_{i_{U}})$, i.e., $\|\mu_{\hat{\boldsymbol{\theta}}_{\varphi,n}}-\mu_{*}\|\overset{p}{\to}0$ as the number of data vectors $n$ goes to infinity. Theorem 1 assumes that the similarity function $\mu_{\boldsymbol{\theta}}$ is correctly specified, i.e., $\exists\boldsymbol{\theta}_{*}\in\boldsymbol{\Theta}$ such that $\mu_{\boldsymbol{\theta}_{*}}=\mu_{*}$, but it does not require specifying the probability distribution of $w_{i_{1},i_{2},\ldots,i_{U}}$.

(P-2) Computational tractability. Because handling the $O(n^{U})$ hyperlink weights that appear in hyper-relational learning is computationally expensive, we employ stochastic optimization algorithms using a novel generalized minibatch sampling procedure for hyper-relations. The proposed procedure is a hyper-relational extension ($U\geq 2$) of negative sampling (Mikolov et al., 2013), which is often used for graph embedding ($U=2$). Our numerical experiments empirically demonstrate that BHLR is efficiently computed by stochastic optimization, and Theorem 2 provides a theoretical guarantee for the entire optimization procedure, in the sense that the full-batch gradient of the loss function, evaluated at each step of the minibatch-based stochastic optimization, approaches $\boldsymbol{0}$ in probability as the number of iterations goes to infinity.

Consequently, BHLR, including several existing methods that have so far been examined only experimentally, is theoretically justified in a unified manner.

4. In Section 6, we perform BHLR on real-world datasets.
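To make the first contribution above more tangible, the following is a minimal sketch of a symmetric similarity function and the corresponding empirical loss for the KL generating function $\varphi(x)=x\log x-x$, for which each tuple contributes $\mu-w\log\mu$ up to an additive constant. Several ingredients are assumptions made for illustration and not the authors' implementation: the feature map is linear, the $U$-way inner product is taken to be $\sum_{k}\prod_{u}(\cdot)_{k}$, the link function is $\exp$, and the data and parameter values are toy numbers.

```python
import math
from itertools import permutations

def feature(theta, x):
    """Hypothetical linear feature map f_theta(x) = theta^T x (theta: p rows, K columns)."""
    return [sum(theta[j][k] * x[j] for j in range(len(x))) for k in range(len(theta[0]))]

def multiway_inner(vectors):
    """Assumed U-way inner product: sum over k of the product of the k-th coordinates."""
    return sum(math.prod(v[k] for v in vectors) for k in range(len(vectors[0])))

def similarity(theta, X_tuple):
    """mu_theta(X) = exp(<f(x_1), ..., f(x_U)>); invariant to the order of X_tuple."""
    return math.exp(multiway_inner([feature(theta, x) for x in X_tuple]))

def bhlr_loss_kl(theta, xs, weights):
    """Average over tuples of mu - w * log(mu), i.e. the Bregman loss for
    phi(x) = x log x - x with terms that do not depend on theta dropped."""
    total = 0.0
    for index, w in weights.items():
        mu = similarity(theta, [xs[i] for i in index])
        total += mu - w * math.log(mu)
    return total / len(weights)

# Toy data: n = 4 data vectors in R^2 and hyperlink weights for U = 3.
xs = [[1.0, 0.0], [0.5, 1.0], [0.2, -0.3], [1.0, 1.0]]
weights = {(0, 1, 2): 2.0, (0, 2, 3): 1.0, (1, 2, 3): 0.0}
theta = [[0.3, -0.1], [0.2, 0.4]]

loss = bhlr_loss_kl(theta, xs, weights)
values = [similarity(theta, [xs[i] for i in perm]) for perm in permutations((0, 1, 2))]
```

All entries of `values` coincide up to rounding, reflecting the symmetry requirement on the similarity function. In practice `theta` would be fitted by a gradient-based optimizer and `feature` replaced by a neural network.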

1.2 Organization

The remainder of this paper is organized as follows. In Section 2, we introduce the Bregman divergence. In Section 3, we formalize hyperlink regression and propose BHLR. In Section 4, we explain the BHLR family members and related work. In Section 5, we show the two favorable properties of BHLR, (P-1) statistical consistency and (P-2) computational tractability. In Section 6, we describe the numerical experiments conducted with BHLR. In Section 7, we present our conclusions and future work.

2 Bregman Divergence

In this section, we introduce Bregman divergence (BD) for formulating the Bregman hyperlink regression later in Section 3.

Here, we consider an index set $\mathcal{I}$ , which is specifically defined as the set of tuple indices in our problem setting explained in Section 3.1. With a continuously differentiable and strictly convex generating function $\varphi:\text{dom}(\varphi)\to\mathbb{R}$ whose domain is a set $\text{dom}(\varphi)\subset\mathbb{R}$ , the BD (Bregman, 1967; Censor et al., 1997) between $\boldsymbol{a}:=\{a_{\boldsymbol{i}}\in\text{dom}(\varphi)\mid\boldsymbol{i}\in\mathcal{I}\}$ and $\boldsymbol{b}:=\{b_{\boldsymbol{i}}\in\text{dom}(\varphi)\mid\boldsymbol{i}\in\mathcal{I}\}$ is defined by

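$$D_{\varphi}(\boldsymbol{a},\boldsymbol{b}):=\frac{1}{|\mathcal{I}|}\sum_{\boldsymbol{i}\in\mathcal{I}}d_{\varphi}(a_{\boldsymbol{i}},b_{\boldsymbol{i}}),$$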

where $d_{\varphi}:\text{dom}(\varphi)^{2}\to\mathbb{R}$ indicates the difference between $\varphi(a)$ and the first-order Taylor approximation of $\varphi(a)$ around $b\in\text{dom}(\varphi)$ as

$$d_{\varphi}(a,b):=\varphi(a)-\varphi(b)-\varphi^{\prime}(b)(a-b).$$

Because $\varphi$ is strictly convex, $d_{\varphi}(a,b)$ is always non-negative and attains its minimum value $0$ at $b=a$ for any fixed $a\in\text{dom}(\varphi)$. Similarly, $D_{\varphi}(\boldsymbol{a},\boldsymbol{b})\geq 0\,(\forall\boldsymbol{a},\boldsymbol{b}\in\text{dom}(\varphi)^{|\mathcal{I}|})$, and equality holds if and only if $\boldsymbol{a}=\boldsymbol{b}$ (basic property 2 in Cichocki et al. (2009), p. 101). Thus, for any fixed $\boldsymbol{a}\in\text{dom}(\varphi)^{|\mathcal{I}|}$, minimizing $D_{\varphi}(\boldsymbol{a},\boldsymbol{b})$ with respect to $\boldsymbol{b}\in\text{dom}(\varphi)^{|\mathcal{I}|}$ is expected to bring $\boldsymbol{b}$ closer to $\boldsymbol{a}$. In our proposed BHLR, $\boldsymbol{a}$ and $\boldsymbol{b}$ are the observed hyperlink weights and their predicted weights, respectively, as explained in Section 3.4; owing to this property of the BD, the predicted weights are expected to approach the observed weights.
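The non-negativity and the zero-at-equality property can be verified numerically for a few generating functions. The sketch below takes $d_{\varphi}(a,b)$ to be $\varphi(a)$ minus its first-order Taylor approximation around $b$, as described in this section, and averages it over the index set; the $1/|\mathcal{I}|$ normalization and all numerical values are assumptions made for illustration.

```python
import math

# (phi, phi') pairs for three members of the Bregman divergence family.
GENERATORS = {
    "kl": (lambda x: x * math.log(x) - x, lambda x: math.log(x)),
    "quadratic": (lambda x: (x * x - x) / 2.0, lambda x: x - 0.5),
    "itakura_saito": (lambda x: -math.log(x), lambda x: -1.0 / x),
}

def d_phi(a, b, name):
    """phi(a) minus the first-order Taylor approximation of phi around b."""
    phi, dphi = GENERATORS[name]
    return phi(a) - phi(b) - dphi(b) * (a - b)

def D_phi(a_vec, b_vec, name):
    """Average of d_phi over the index set (normalization assumed to be 1/|I|)."""
    return sum(d_phi(a, b, name) for a, b in zip(a_vec, b_vec)) / len(a_vec)

observed = [2.0, 0.5, 1.0]    # plays the role of a (observed weights)
predicted = [1.5, 0.7, 1.0]   # plays the role of b (predicted weights)

divergences = {name: D_phi(observed, predicted, name) for name in GENERATORS}
self_divergences = {name: D_phi(observed, observed, name) for name in GENERATORS}
```

Every entry of `divergences` is strictly positive because the two vectors differ, while every entry of `self_divergences` is zero up to rounding.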

Some of the BD family members such as the KL divergence are originally defined for measuring the difference between two probability distributions. That is, they assume that $\boldsymbol{a},\boldsymbol{b}$ satisfy (1) $a_{\boldsymbol{i}},b_{\boldsymbol{i}}\geq 0\>(\forall\boldsymbol{i}\in\mathcal{I})$ , and (2) $\sum_{\boldsymbol{i}\in\mathcal{I}}a_{\boldsymbol{i}}=\sum_{\boldsymbol{i}\in\mathcal{I}}b_{\boldsymbol{i}}=1$ . However, assumptions (1) and (2) are in fact not required for the BD to hold the favorable property above. Thus, we do not assume (1) and (2) hereinafter, similarly to some existing studies (Cichocki et al., 2009; Banerjee et al., 2005; Sra and Dhillon, 2006).

The BD includes a variety of loss functions such as the KL divergence, $\beta$ -divergence, quadratic loss, and logistic loss, as shown in Table 1.
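As a worked example of how a generating function yields one of these losses, the KL row can be derived by hand. The derivation below assumes the scalar form $d_{\varphi}(a,b)=\varphi(a)-\varphi(b)-\varphi^{\prime}(b)(a-b)$, i.e., $\varphi(a)$ minus its first-order Taylor approximation around $b$ as described in this section, with $\varphi(x)=x\log x-x$ and hence $\varphi^{\prime}(x)=\log x$.

```latex
\begin{aligned}
d_{\varphi}(a,b)
  &= (a\log a - a) - (b\log b - b) - (\log b)\,(a-b) \\
  &= a\log a - a - b\log b + b - a\log b + b\log b \\
  &= a\log\frac{a}{b} - (a-b).
\end{aligned}
```

The same three-step substitution with $\varphi(x)=(x^{2}-x)/2$ and $\varphi^{\prime}(x)=x-\tfrac{1}{2}$ gives $\tfrac{1}{2}(a-b)^{2}$, the quadratic loss.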

By removing the strict convexity assumption on $\varphi$ and additionally assuming $a\in\{0,1\}$ , the BD includes margin-based loss functions. For instance, $\varphi(x)=\max\{-x,x-1\}$ results in the misclassification loss $d_{\varphi}(a,b)=I(a\neq I(b>1/2))$ , where $I(\cdot)$ represents the indicator function; other examples can be found in Zhang et al. (2009) Section 6.2.
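The misclassification example can be checked directly. Because $\varphi(x)=\max\{-x,x-1\}$ is not differentiable at $x=1/2$, the sketch below uses a subgradient in place of $\varphi^{\prime}$ and excludes $b=1/2$; this choice, like the grid of test points, is an assumption made for illustration.

```python
def phi(x):
    """Convex but not strictly convex generating function max(-x, x - 1)."""
    return max(-x, x - 1.0)

def subgradient(b):
    """Slope of phi at b; unique whenever b != 1/2."""
    if b == 0.5:
        raise ValueError("phi is not differentiable at b = 1/2")
    return -1.0 if b < 0.5 else 1.0

def d_phi(a, b):
    """phi(a) minus the first-order approximation of phi around b."""
    return phi(a) - phi(b) - subgradient(b) * (a - b)

def misclassification(a, b):
    """Indicator that the label a differs from the thresholded prediction I(b > 1/2)."""
    return float(a != (1 if b > 0.5 else 0))

grid = [0.1, 0.3, 0.49, 0.51, 0.8, 0.95]
table = {(a, b): (d_phi(a, b), misclassification(a, b)) for a in (0, 1) for b in grid}
```

For every binary label `a` and every `b` in the grid, the two entries stored in `table` agree up to rounding.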

3 Bregman Hyperlink Regression (BHLR)

In this section, we first describe the problem setting in Section 3.1; subsequently, we formally define the conditional distribution of hyperlink weights in Section 3.2. We compare two different approaches to HLR in Section 3.3, and propose BHLR in Section 3.4. In Section 3.5, we demonstrate that the BHLR can be interpreted as a maximum likelihood estimation using some exponential family model.

3.1 Problem Setting

For fixed $p,n,U\in\mathbb{N}$ and non-empty sets $\mathcal{X}\subset\mathbb{R}^{p},\mathcal{S}\subset\mathbb{R}$, our dataset comprises $p$-dimensional data vectors $\{\boldsymbol{x}_{i}\}_{i=1}^{n}\subset\mathcal{X}$ and symmetric hyperlink weights $\{w_{\boldsymbol{i}}\}_{\boldsymbol{i}\in\mathcal{I}_{n}^{(U)}}\subset\mathcal{S}$, where $\boldsymbol{i}=(i_{1},i_{2},\ldots,i_{U})$ is an index in a set $\mathcal{I}_{n}^{(U)}\subset[n]^{U}$, and $[n]$ represents the set $\{1,2,\ldots,n\}$. Formal descriptions of the tuple of data vectors, the hyperlink weights, and the index set are provided below.

•

$U$-tuple $\boldsymbol{X}=(\boldsymbol{x},\boldsymbol{x}^{\prime},\boldsymbol{x}^{\prime\prime},\ldots)\in\mathcal{X}^{U}$ is an array of $U$ vectors, where $\boldsymbol{x},\boldsymbol{x}^{\prime},\boldsymbol{x}^{\prime\prime},\ldots\in\mathcal{X}\>(\subset\mathbb{R}^{p})$ are $p$-dimensional vectors. For an index $\boldsymbol{i}=(i_{1},i_{2},\ldots,i_{U})\in\mathcal{I}_{n}^{(U)}\>(\subset[n]^{U})$, a collection of $U$ data vectors $\boldsymbol{x}_{i_{1}},\boldsymbol{x}_{i_{2}},\ldots,\boldsymbol{x}_{i_{U}}\in\mathcal{X}$ constitutes the $U$-tuple $\boldsymbol{X}_{\boldsymbol{i}}=(\boldsymbol{x}_{i_{1}},\boldsymbol{x}_{i_{2}},\ldots,\boldsymbol{x}_{i_{U}})$ indexed by $\boldsymbol{i}$. Although the order of the vectors is provided, it is in effect ignored in the proposed method, because only symmetric functions of the tuple are considered. Note that two different tuples may share the same data vectors. For instance, $\boldsymbol{X}_{(1,2,3)}=(\boldsymbol{x}_{1},\boldsymbol{x}_{2},\boldsymbol{x}_{3})$ and $\boldsymbol{X}_{(1,3,4)}=(\boldsymbol{x}_{1},\boldsymbol{x}_{3},\boldsymbol{x}_{4})$ share the two data vectors $\boldsymbol{x}_{1},\boldsymbol{x}_{3}$; we use the multi-index $\boldsymbol{i}\in\mathcal{I}_{n}^{(U)}$ to handle data vectors that appear in several different tuples.

•

Hyperlink weight $w_{\boldsymbol{i}}\in\mathcal{S}\>(\subset\mathbb{R})$ represents the strength of association defined for the $U$-tuple $\boldsymbol{X}_{\boldsymbol{i}}$. A hyperlink is also called a hyperedge in hypergraph theory, and its weight is assumed to be symmetric with respect to permutation of the entries $i_{1},i_{2},\ldots,i_{U}$ in the index $\boldsymbol{i}$. Although in practice we often consider non-negative hyperlink weights, i.e., $\mathcal{S}:=\mathbb{R}_{\geq 0}$, such that a weight of $0$ represents no association among the tuple, $\mathcal{S}$ is not restricted to be non-negative; $\mathcal{S}$ can be specified arbitrarily depending on the setting.

•

Index set $\mathcal{I}_{n}^{(U)}\subset[n]^{U}$ is typically defined as $\mathcal{I}_{n}^{(U)}=[n]^{U}$, or as $\mathcal{I}_{n}^{(U)}=\{\boldsymbol{i}\in[n]^{U}\mid u\neq u^{\prime}\Rightarrow i_{u}\neq i_{u^{\prime}}\}$ so that no tuple contains duplicate vectors within itself, though different tuples may share some data vectors. A particular set $\mathcal{I}_{n}^{(U)}=\mathcal{J}_{n}^{(U)}:=\{\boldsymbol{i}\in[n]^{U}\mid 1\leq i_{1}<i_{2}<\cdots<i_{U}\}$ is employed later in Section 5.1 for showing asymptotic properties of the proposed method. Although the examples of $\mathcal{I}_{n}^{(U)}$ mentioned above cover all the combinations of indices under some constraints, $\mathcal{I}_{n}^{(U)}$ may also be a subset of them, which accommodates the practical situation in which only a limited number of hyperlink weights are actually observed.
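The three index sets above can be enumerated directly for small $n$ and $U$. The sketch below relies on Python's standard `itertools` module; the helper name `index_sets` is hypothetical and used only for illustration.

```python
from itertools import combinations, permutations, product

def index_sets(n, U):
    """Enumerate the three index sets over [n] = {1, ..., n} for tuples of length U."""
    base = range(1, n + 1)
    full = list(product(base, repeat=U))      # [n]^U: repeated entries allowed
    distinct = list(permutations(base, U))    # i_u != i_u' whenever u != u'
    increasing = list(combinations(base, U))  # J_n^(U): i_1 < i_2 < ... < i_U
    return full, distinct, increasing

full, distinct, increasing = index_sets(n=4, U=3)

# Cardinalities are n^U, n!/(n-U)! and the binomial coefficient C(n, U).
sizes = (len(full), len(distinct), len(increasing))
```

For $n=4$ and $U=3$ the three sets contain $64$, $24$, and $4$ indices, respectively; a partially observed dataset would correspond to any subset of these.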

Such hyperlink weights defined for $U$ -tuples appear in many practical situations. Two different types of hyperlink weights for $\mathcal{S}:=\mathbb{R}_{\geq 0}$ are shown in the following Examples 1 and 2. They are also referred to as a hypernetwork (Jeffrey, 2013).

Example 1 (Friend network).

Data vector $\boldsymbol{x}_{i}$ represents the property of person $i\in[n]$ , e.g., age, gender, education, etc., and the hyperlink weight $w_{\boldsymbol{i}}\in\{0,1,2,\ldots\}(\subset\mathcal{S})$ represents the number of social groups to which all the $U$ people indexed by $\boldsymbol{i}=(i_{1},i_{2},\ldots,i_{U})$ belong.

Example 2 (Co-authorship network).

Data vector $\boldsymbol{x}_{i}$ represents the attributes of researcher $i\in[n]$ such as number of publications in each journal, and the hyperlink weight $w_{\boldsymbol{i}}\in\{0,1,2,\ldots\}(\subset\mathcal{S})$ represents the number of co-authored papers written by all the $U$ researchers indexed by $\boldsymbol{i}=(i_{1},i_{2},\ldots,i_{U})$ .

Here, we consider a user-specified parametric model of similarity function $\mu_{\boldsymbol{\theta}}:\mathcal{X}^{U}\to\mathcal{S}$ with parameter vector $\boldsymbol{\theta}\in\boldsymbol{\Theta}\subset\mathbb{R}^{q}$ . For $U$ -tuple $\boldsymbol{X}=(\boldsymbol{x},\boldsymbol{x}^{\prime},\boldsymbol{x}^{\prime\prime},\ldots)\in\mathcal{X}^{U}$ , we consider a random variable $w\in\mathcal{S}$ with conditional expectation $\mu_{*}(\boldsymbol{X}):=\mathbb{E}(w\mid\boldsymbol{X})$ . $w$ and $\boldsymbol{X}$ are linked by a conditional probability mass (or density) function $q$ , as will be formally described in the following Section 3.2. Then, learning the similarity function $\mu_{\boldsymbol{\theta}}$ so that

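$$\mu_{\boldsymbol{\theta}}(\boldsymbol{X})\approx\mu_{*}(\boldsymbol{X})\quad(\forall\boldsymbol{X}\in\mathcal{X}^{U})$$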

is called hyperlink regression (HLR); this is analogous to the ordinary regression analysis, where $w$ and $\boldsymbol{X}$ correspond to the response and explanatory variables, respectively. For illustrating the HLR, two simple instances are provided in the following Examples 3 and 4.

Example 3 (Linear regression).

As will be explained in Section 4.1, linear regression (LR) is the simplest case of HLR ( $U=1$ ); “LS reg.” in Table 2. Given data vectors $\boldsymbol{x}_{1},\boldsymbol{x}_{2},\ldots,\boldsymbol{x}_{n}\in\mathcal{X}$ and the corresponding response variables $w_{1},w_{2},\ldots,w_{n}\in\mathbb{R}$ , LR considers a probabilistic model $w_{i}=\langle\boldsymbol{\theta}_{*},\boldsymbol{x}_{i}\rangle+\varepsilon_{i}$ , where $\langle\cdot,\cdot\rangle$ represents the inner product and $\boldsymbol{\theta}_{*}\in\mathbb{R}^{p}$ is an underlying true parameter. Assuming that $\mathbb{E}(\varepsilon_{i}\mid\boldsymbol{x}_{i})=0$ , the conditional expectation is $\mu_{*}(\boldsymbol{x}_{i})=\mathbb{E}(w_{i}\mid\boldsymbol{x}_{i})=\langle\boldsymbol{\theta}_{*},\boldsymbol{x}_{i}\rangle$ ; linear regression aims at learning the function $\mu_{\boldsymbol{\theta}}(\boldsymbol{x}):=\langle\boldsymbol{\theta},\boldsymbol{x}\rangle$ , so that it satisfies $\mu_{\boldsymbol{\theta}}(\boldsymbol{x})\approx\mu_{*}(\boldsymbol{x})$ for all $\boldsymbol{x}\in\mathcal{X}$ .

Example 4 (Graph embedding).

As will be explained in Section 4.2, graph embedding is a special case of HLR ( $U=2$ ). Let $\boldsymbol{x}_{1},\boldsymbol{x}_{2},\ldots,\boldsymbol{x}_{n}\in\mathcal{X}$ be data vectors, and $\{w_{i_{1}i_{2}}\}_{1\leq i_{1},i_{2}\leq n}$ be the corresponding weights, where $w_{i_{1}i_{2}}$ represents the strength of association between a pair of two vectors $\boldsymbol{X}_{i_{1},i_{2}}=(\boldsymbol{x}_{i_{1}},\boldsymbol{x}_{i_{2}})$. We regard $\{\boldsymbol{x}_{i}\}_{i=1}^{n}$ as the nodes of a graph, and $(w_{i_{1}i_{2}})_{1\leq i_{1},i_{2}\leq n}\in\mathbb{R}^{n\times n}$ as the adjacency matrix of the graph. In addition to the conditional expectation $\mu_{*}(\boldsymbol{X}_{i_{1},i_{2}}):=\mathbb{E}(w_{i_{1}i_{2}}\mid\boldsymbol{X}_{i_{1},i_{2}})$, we may also specify a conditional distribution $q(w\mid\boldsymbol{X})$ of $w$ given $\boldsymbol{X}=(\boldsymbol{x},\boldsymbol{x}^{\prime})$; typically, the Bernoulli distribution $q(w\mid\boldsymbol{X})=\mu_{*}(\boldsymbol{X})^{w}(1-\mu_{*}(\boldsymbol{X}))^{1-w}$ is considered for binary $w\in\{0,1\}$. Furthermore, we assume that the data vectors $\boldsymbol{x}_{1},\boldsymbol{x}_{2},\ldots,\boldsymbol{x}_{n}$ are i.i.d. generated from a pdf $q_{X}$; then, the link weights $\{w_{i_{1}i_{2}}\}_{1\leq i_{1}<i_{2}\leq n}$ and the data vectors $\{\boldsymbol{x}_{i}\}_{i=1}^{n}$ follow the joint distribution

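$$\prod_{1\leq i_{1}<i_{2}\leq n}q(w_{i_{1}i_{2}}\mid\boldsymbol{X}_{i_{1},i_{2}})\prod_{i=1}^{n}q_{X}(\boldsymbol{x}_{i}).\tag{2}$$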

The remaining link weights are specified by $w_{i_{2}i_{1}}=w_{i_{1}i_{2}}$ for $1\leq i_{1}<i_{2}\leq n$ and $w_{ii}=0$. See Figure 1 for the generative model (2); it is straightforwardly generalized to arbitrary $U\in\mathbb{N}$ and arbitrary $q(w\mid\boldsymbol{X})$ in Section 3.2. To complete the description, we also define a similarity function $\mu_{\boldsymbol{\theta}}(\boldsymbol{X}_{i_{1},i_{2}}):=\sigma(\langle\boldsymbol{f}_{\boldsymbol{\theta}}(\boldsymbol{x}_{i_{1}}),\boldsymbol{f}_{\boldsymbol{\theta}}(\boldsymbol{x}_{i_{2}})\rangle)$, where $\sigma(z)=(1+\exp(-z))^{-1}$ represents the sigmoid function, and $\boldsymbol{f}_{\boldsymbol{\theta}}:\mathcal{X}\to\mathbb{R}^{K}\>(K\in\mathbb{N})$ is a user-specified parametric function such as a neural network. Then, graph embedding learns the function $\boldsymbol{f}_{\boldsymbol{\theta}}$ so that the similarity function $\mu_{\boldsymbol{\theta}}$ satisfies $\mu_{\boldsymbol{\theta}}(\boldsymbol{X})\approx\mu_{*}(\boldsymbol{X})$ for any $\boldsymbol{X}=(\boldsymbol{x},\boldsymbol{x}^{\prime})\in\mathcal{X}^{2}$. A better feature vector $\boldsymbol{y}_{i}=\boldsymbol{f}_{\boldsymbol{\theta}}(\boldsymbol{x}_{i})\in\mathbb{R}^{K}$ can be obtained by applying the trained $\boldsymbol{f}_{\boldsymbol{\theta}}$ to the data vector $\boldsymbol{x}_{i}\in\mathcal{X}$; such feature vectors are often used for several tasks, including “link prediction”, which examines the value of $\sigma(\langle\boldsymbol{y}_{i},\boldsymbol{f}_{\boldsymbol{\theta}}(\boldsymbol{x})\rangle)$, $i=1,\ldots,n$, for a newly obtained vector $\boldsymbol{x}\in\mathcal{X}$.

Given that our dataset consists of data vectors $\{\boldsymbol{x}_{i}\}_{i=1}^{n}$ and hyperlink weights $\{w_{\boldsymbol{i}}\}_{\boldsymbol{i}\in\mathcal{I}_{n}^{(U)}}$, the parameter vector $\boldsymbol{\theta}$ is optimized by minimizing an empirical loss function so that

[TABLE]

hold. This paper aims at providing a general framework for HLR, named BHLR, such that it encompasses a variety of existing methods. This paper also intends to provide theoretical guarantees for general BHLR; several existing methods, that have been examined experimentally, are also theoretically justified in a unified manner.

3.2 Probability Distributions of Hyperlink Weights and Tuples

In order to obtain the conditional expectation $\mu_{*}(\boldsymbol{X}_{\boldsymbol{i}})=\mathbb{E}(w_{\boldsymbol{i}}\mid\boldsymbol{X}_{\boldsymbol{i}})$ , we first formally define the conditional distribution of hyperlink weights given data vectors by straightforwardly generalizing the probabilistic model for $U=2$ shown in Example 4 and Figure 1.

Here, we explain why extra care is required in defining the conditional distribution of hyperlink weights given data vectors. For any $\boldsymbol{i}^{\prime}$ obtained by permuting the elements of $\boldsymbol{i}$, the tuples $\boldsymbol{X}_{\boldsymbol{i}},\boldsymbol{X}_{\boldsymbol{i}^{\prime}}$ consist of the same vectors $\boldsymbol{x}_{i_{1}},\boldsymbol{x}_{i_{2}},\ldots,\boldsymbol{x}_{i_{U}}$, and it holds that $w_{\boldsymbol{i}}=w_{\boldsymbol{i}^{\prime}}$ since the hyperlink weights are assumed to be symmetric. In the case of $U=2$, this symmetry coincides with considering undirected links; link weights should satisfy $w_{i_{1}i_{2}}=w_{i_{2}i_{1}}$ for all $i_{1}$ and $i_{2}$, which constrains the distributions of $w_{i_{1}i_{2}}$ and $w_{i_{2}i_{1}}$.

For specifying the distribution appropriately, we employ a simple idea. We first specify the conditional probability density function (cpdf) or conditional probability mass function (cpmf) $\tilde{q}$ only for $w_{i_{1}i_{2}}\mid\boldsymbol{X}_{i_{1},i_{2}}$ whose index is in non-decreasing order $i_{1}\leq i_{2}$. Then, the cpdf or cpmf $q$ of $w_{i_{2}i_{1}}\mid\boldsymbol{X}_{i_{2},i_{1}}$, whose index is in reverse order, can be defined as that of $w_{i_{1}i_{2}}\mid\boldsymbol{X}_{i_{1},i_{2}}$, since the weights satisfy the symmetry $w_{i_{1}i_{2}}=w_{i_{2}i_{1}}$ and both tuples $\boldsymbol{X}_{i_{1},i_{2}},\boldsymbol{X}_{i_{2},i_{1}}$ consist of the same vectors $\boldsymbol{x}_{i_{1}},\boldsymbol{x}_{i_{2}}$. This idea of symmetry is readily generalized to $U\in\mathbb{N}$; we specify the cpdf or cpmf $\tilde{q}$ of $w_{\boldsymbol{i}^{\prime}}\mid\boldsymbol{X}_{\boldsymbol{i}^{\prime}}$ only for indices $\boldsymbol{i}^{\prime}\in[n]^{U}$ in non-decreasing order, i.e., $i^{\prime}_{1}\leq i^{\prime}_{2}\leq\cdots\leq i^{\prime}_{U}$, and consider a mapping $r:\boldsymbol{i}\mapsto\boldsymbol{i}^{\prime}$ such that $\boldsymbol{i}^{\prime}=r(\boldsymbol{i})$ is obtained by sorting the elements of $\boldsymbol{i}$ in non-decreasing order. Then, the cpdf or cpmf $q$ of $w_{\boldsymbol{i}}\mid\boldsymbol{X}_{\boldsymbol{i}}$ is defined as

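$$q(w_{\boldsymbol{i}}\mid\boldsymbol{X}_{\boldsymbol{i}}):=\tilde{q}(w_{r(\boldsymbol{i})}\mid\boldsymbol{X}_{r(\boldsymbol{i})}).$$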

Therefore, we have a well-defined conditional distribution for hyperlink weights. Then, the cpdf (or cpmf) of all the hyperlink weights $\{w_{\boldsymbol{i}}\}_{\boldsymbol{i}\in\mathcal{I}_{n}^{(U)}}$ given data vectors $\boldsymbol{x}_{1},\boldsymbol{x}_{2},\ldots,\boldsymbol{x}_{n}\in\mathcal{X}$ is

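$$q(\{w_{\boldsymbol{i}}\}_{\boldsymbol{i}\in\mathcal{I}_{n}^{(U)}}\mid\boldsymbol{x}_{1},\boldsymbol{x}_{2},\ldots,\boldsymbol{x}_{n})=\prod_{\boldsymbol{i}\in\mathcal{I}_{n}^{(U)}}q(w_{\boldsymbol{i}}\mid\boldsymbol{X}_{\boldsymbol{i}}),$$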

meaning that each hyperlink weight $w_{\boldsymbol{i}}$ is generated conditionally independently by following the probabilistic model (4). When $U=2$, $q(w\mid\boldsymbol{X})=\mu_{*}(\boldsymbol{X})^{w}(1-\mu_{*}(\boldsymbol{X}))^{1-w}$ is the cpmf of the Bernoulli distribution whose expectation is $\mu_{*}(\boldsymbol{X}):=\mathbb{E}(w\mid\boldsymbol{X})$, and $\boldsymbol{X}=(\boldsymbol{x},\boldsymbol{x}^{\prime})$ is a pair of latent variables, the probabilistic model (4) is also known as the latent position random graph (LPRG) model with kernel $\mu_{*}$. The LPRG model is considered in Tang et al. (2013) and Athreya et al. (2018) Definition 6, and it originates from the random dot product graph model (Young and Scheinerman, 2007), which corresponds to the case $\mu_{*}(\boldsymbol{X}):=\langle\boldsymbol{x},\boldsymbol{x}^{\prime}\rangle$ for $\boldsymbol{X}=(\boldsymbol{x},\boldsymbol{x}^{\prime})\in\mathcal{X}^{2}$. Our probabilistic model (4) generalizes the LPRG model to an arbitrary probability distribution with arbitrary $U\in\mathbb{N}$, whereas the previous studies focus on spectral analyses of the matrix $\boldsymbol{W}=(w_{ij})$ of Bernoulli link weights with $U=2$ and assume that $\boldsymbol{x}_{1},\boldsymbol{x}_{2},\ldots,\boldsymbol{x}_{n}$ are latent variables.
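The sorting map $r$ and the Bernoulli case with $U=2$ can be illustrated by a small simulation. This is a minimal sketch under stated assumptions: the kernel is taken to be $\mu_{*}(\boldsymbol{x},\boldsymbol{x}^{\prime})=\sigma(\langle\boldsymbol{x},\boldsymbol{x}^{\prime}\rangle)$ and the data vectors are standard Gaussian, both chosen only for illustration, and the function names are hypothetical.

```python
import math
import random

def r(index):
    """Map an index tuple to its rearrangement in non-decreasing order."""
    return tuple(sorted(index))

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sample_graph(n, p, seed=0):
    """Draw data vectors and Bernoulli link weights for sorted index pairs i1 < i2."""
    rng = random.Random(seed)
    x = {i: [rng.gauss(0.0, 1.0) for _ in range(p)] for i in range(1, n + 1)}
    w_sorted = {}
    for i1 in range(1, n + 1):
        for i2 in range(i1 + 1, n + 1):
            # Assumed kernel mu_*(x, x') = sigmoid(<x, x'>).
            mu = sigmoid(sum(a * b for a, b in zip(x[i1], x[i2])))
            w_sorted[(i1, i2)] = 1 if rng.random() < mu else 0
    return x, w_sorted

def weight(w_sorted, index):
    """Return w_i for any ordering of the index via r(i); w_ii is set to 0."""
    i1, i2 = r(index)
    return 0 if i1 == i2 else w_sorted[(i1, i2)]

x, w_sorted = sample_graph(n=5, p=2)
```

Only the $n(n-1)/2$ weights with sorted indices are stored; `weight(w_sorted, (4, 2))` and `weight(w_sorted, (2, 4))` return the same value by construction.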

We next describe the probability distribution of the tuple $\boldsymbol{X}_{\boldsymbol{i}}$. In Section 5, we will simply assume that the data vectors $\{\boldsymbol{x}_{i}\}_{i=1}^{n}$ are i.i.d. generated from a distribution $q_{X}$, for showing statistical consistency of BHLR. Then, the joint distribution over all the hyperlink weights and data vectors is specified as $\prod_{\boldsymbol{i}\in\mathcal{I}_{n}^{(U)}}q(w_{\boldsymbol{i}}\mid\boldsymbol{X}_{\boldsymbol{i}})\prod_{i=1}^{n}q_{X}(\boldsymbol{x}_{i})$. Note that the marginal distribution of $\boldsymbol{Z}_{\boldsymbol{i}}:=(w_{\boldsymbol{i}},\boldsymbol{X}_{\boldsymbol{i}})$ does not depend on the index $\boldsymbol{i}$; thus $\boldsymbol{Z}_{\boldsymbol{i}},\boldsymbol{i}\in\mathcal{I}_{n}^{(U)}$ are identically distributed. However, even if the data vectors $\{\boldsymbol{x}_{i}\}_{i=1}^{n}$ are i.i.d. generated, two different $\boldsymbol{Z}_{\boldsymbol{i}},\boldsymbol{Z}_{\boldsymbol{i}^{\prime}}$ can be dependent, as their tuples $\boldsymbol{X}_{\boldsymbol{i}},\boldsymbol{X}_{\boldsymbol{i}^{\prime}}$ may share data vectors. For instance, $\boldsymbol{X}_{(1,2,3)}=(\boldsymbol{x}_{1},\boldsymbol{x}_{2},\boldsymbol{x}_{3})$ and $\boldsymbol{X}_{(1,3,4)}=(\boldsymbol{x}_{1},\boldsymbol{x}_{3},\boldsymbol{x}_{4})$ share the two data vectors $\boldsymbol{x}_{1}$ and $\boldsymbol{x}_{3}$. Therefore, $\boldsymbol{Z}_{\boldsymbol{i}},\boldsymbol{i}\in\mathcal{I}_{n}^{(U)}$ are NOT independently distributed. This property for $U\geq 2$ makes our setting interesting and requires special care in the asymptotic theory. In this regard, theories for HLR, which predicts hyperlink weights from such constrained tuples, can differ from those of classical regression, which typically predicts response variables from i.i.d. data vectors. We consider such constrained tuples, and the statistical consistency of BHLR is proved later in Section 5.

3.3 Two Different Approaches to HLR

In this section, we show two different approaches to HLR with $\mathcal{S}:=\mathbb{R}_{\geq 0}$ , and explain why we employ the second approach. Although the case of $U=1$ is illustrated here, it can be easily generalized to arbitrary $U\in\mathbb{N}$ .

Considering a weight $w_{i}$ taking a value in the set $\{0,1,2,\ldots\}\subset\mathcal{S}$ and a data vector $\boldsymbol{x}_{i}\in\mathbb{R}^{p}$ $(i=1,2,\ldots,n)$, HLR predicts the weight $w_{i}\in\mathcal{S}$ through the function $\mu_{\boldsymbol{\theta}}(\boldsymbol{x}_{i})\in\mathcal{S}$. There are two different approaches to this problem. The first approach is based on matching the conditional probability mass function (pmf) $q(w_{i}\mid\boldsymbol{x}_{i})$ shown in Fig. 2 (a) and the parametric generative model $p_{\boldsymbol{\theta}}(w_{i}\mid\boldsymbol{x}_{i})$ whose expectation is $\mu_{\boldsymbol{\theta}}(\boldsymbol{x}_{i})=\sum_{w\in\mathbb{N}_{0}}wp_{\boldsymbol{\theta}}(w\mid\boldsymbol{x}_{i})$. Although this approach naturally extends maximum likelihood regression, several challenges remain, as explained below. To address these challenges, we also consider the second approach, which instead matches only the conditional expectation function $\mu_{*}(\boldsymbol{x}_{i}):=\mathbb{E}(w_{i}\mid\boldsymbol{x}_{i})$ shown in Fig. 2 (b) and the model $\mu_{\boldsymbol{\theta}}(\boldsymbol{x}_{i})$. Consequently, we employ and generalize the second approach, and propose Bregman-HLR (BHLR) in Section 3.4.

Hereinafter, we describe the details of the two approaches to HLR.

The first approach matches the underlying conditional pmf $q(w_{i}\mid\boldsymbol{x}_{i})$ and the parametric generative model $p_{\boldsymbol{\theta}}(w_{i}\mid\boldsymbol{x}_{i})$. Let $q_{iw}=q(w\mid\boldsymbol{x}_{i})$ and $p_{\boldsymbol{\theta},iw}=p_{\boldsymbol{\theta}}(w\mid\boldsymbol{x}_{i})$ for $w\in\mathbb{N}_{0}$, $i=1,\ldots,n$. They are put together as vectors $\boldsymbol{q}_{i}:=(q_{i0},q_{i1},q_{i2},\ldots),\boldsymbol{p}_{\boldsymbol{\theta},i}:=(p_{\boldsymbol{\theta},i0},p_{\boldsymbol{\theta},i1},p_{\boldsymbol{\theta},i2},\ldots)$, so that each of the vectors $\boldsymbol{q}_{i},\boldsymbol{p}_{\boldsymbol{\theta},i}$ represents a distribution of $w_{i}\mid\boldsymbol{x}_{i}$. Then, we may estimate $\boldsymbol{\theta}$ by minimizing

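$$\frac{1}{n}\sum_{i=1}^{n}D_{\varphi}(\boldsymbol{q}_{i},\boldsymbol{p}_{\boldsymbol{\theta},i}),\tag{5}$$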

where $\varphi$ is a user-specified generating function. However, the underlying conditional distributions $\boldsymbol{q}_{1},\boldsymbol{q}_{2},\ldots,\boldsymbol{q}_{n}$ used in (5) cannot be observed in practice; we instead consider the empirical conditional distribution $\hat{\boldsymbol{q}}_{i}=(\hat{q}_{i0},\hat{q}_{i1},\hat{q}_{i2},\ldots)$ whose $w_{i}$-th entry is $1$ and $0$ otherwise, for $i=1,2,\ldots,n$. Considering $\mathcal{I}=\mathbb{N}$ ,

[TABLE]

holds; minimizing (5) equipped with the empirical distributions $\{\hat{\boldsymbol{q}}_{i}\}_{i=1}^{n}$ is equivalent to minimizing

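$$-\frac{1}{n}\sum_{i=1}^{n}\varphi^{\prime}(p_{\boldsymbol{\theta}}(w_{i}\mid\boldsymbol{x}_{i}))+\frac{1}{n}\sum_{i=1}^{n}\underbrace{\sum_{w\in\mathbb{N}_{0}}\left\{p_{\boldsymbol{\theta}}(w\mid\boldsymbol{x}_{i})\varphi^{\prime}(p_{\boldsymbol{\theta}}(w\mid\boldsymbol{x}_{i}))-\varphi(p_{\boldsymbol{\theta}}(w\mid\boldsymbol{x}_{i}))\right\}}_{(\star)}.\tag{6}$$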

(6) appears in some existing studies, such as Ghosh et al. (2013) for the $\beta$-divergence in Table 1. However, as Okuno and Shimodaira (2019) Section 3.2 pointed out in a special case of HLR, the term ( $\star$ ) in eq. (6) is computationally intractable due to the infinite summation $\sum_{w\in\mathbb{N}_{0}}$; a computational challenge thus remains in this approach. The infinite summation similarly appears in eq. (4) of Kawashima and Fujisawa (2019), who compute the term by a finite-sum approximation instead. Note that the term ( $\star$ ) reduces to $\sum_{w\in\mathbb{N}_{0}}p_{\boldsymbol{\theta}}(w\mid\boldsymbol{x}_{i})=1$ if the generating function is specified as $\varphi(x)=x\log x-x$; the computational issue does not occur if the KL divergence is considered.
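The contrast between the KL divergence and other generating functions can be seen numerically. The sketch below assumes that the term ( $\star$ ) takes the form $\sum_{w}\{p\,\varphi^{\prime}(p)-\varphi(p)\}$ with $p=p_{\boldsymbol{\theta}}(w\mid\boldsymbol{x}_{i})$, which is what expanding the BD yields, and uses a Poisson pmf with rate `lam` as a hypothetical generative model; the truncation point `w_max` plays the role of a finite-sum approximation.

```python
import math

BETA = 0.5  # illustrative value of beta

def poisson_pmf(w, lam):
    return math.exp(-lam) * lam ** w / math.factorial(w)

def phi_kl(x):
    return x * math.log(x) - x

def dphi_kl(x):
    return math.log(x)

def phi_beta(x):
    return x ** (1 + BETA) / (BETA * (1 + BETA)) - x / BETA

def dphi_beta(x):
    return (x ** BETA - 1.0) / BETA

def star_term(phi, dphi, lam, w_max=60):
    """Truncated sum over w of p * phi'(p) - phi(p), where p is the Poisson pmf."""
    total = 0.0
    for w in range(w_max + 1):
        p = poisson_pmf(w, lam)
        total += p * dphi(p) - phi(p)
    return total

rates = (1.0, 3.0)
star_kl = [star_term(phi_kl, dphi_kl, lam) for lam in rates]
star_beta = [star_term(phi_beta, dphi_beta, lam) for lam in rates]

# For the beta generating function the summand simplifies to p^(1+beta) / (1+beta).
closed_form_beta = [
    sum(poisson_pmf(w, lam) ** (1 + BETA) for w in range(61)) / (1 + BETA)
    for lam in rates
]
```

With the KL generating function the term equals one (up to truncation error) regardless of the rate, so it can be dropped from the objective; with the $\beta$ generating function it varies with the model and has to be evaluated or approximated at every parameter update.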

For solving the computational challenge, we also consider the second approach. This second approach simply matches the underlying expectation function $\mu_{*}(\boldsymbol{x}_{i})=\mathbb{E}(w_{i}\mid\boldsymbol{x}_{i})$ and the parametric model $\mu_{\boldsymbol{\theta}}(\boldsymbol{x}_{i})$ without assuming any specific probability distribution for $w_{i}\mid\boldsymbol{x}_{i}$ ; we may obtain the estimator of $\boldsymbol{\theta}$ by minimizing

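$$\frac{1}{n}\sum_{i=1}^{n}d_{\varphi}(\mu_{*}(\boldsymbol{x}_{i}),\mu_{\boldsymbol{\theta}}(\boldsymbol{x}_{i})),\tag{7}$$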

where $\varphi$ is a user-specified generating function whose domain $\text{dom}(\varphi)$ includes the set $\mathcal{S}$ . However, the underlying expectation function $\mu_{*}$ cannot be observed in practice; we instead minimize

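$$\frac{1}{n}\sum_{i=1}^{n}d_{\varphi}(w_{i},\mu_{\boldsymbol{\theta}}(\boldsymbol{x}_{i}))=\frac{1}{n}\sum_{i=1}^{n}\left\{-\varphi(\mu_{\boldsymbol{\theta}}(\boldsymbol{x}_{i}))-\varphi^{\prime}(\mu_{\boldsymbol{\theta}}(\boldsymbol{x}_{i}))(w_{i}-\mu_{\boldsymbol{\theta}}(\boldsymbol{x}_{i}))\right\}+C,\tag{8}$$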

that approximates (7) in the sense that the underlying true conditional expectation $\mu_{*}(\boldsymbol{x}_{i})=\mathbb{E}(w_{i}\mid\boldsymbol{x}_{i})$ is replaced with the observation $w_{i}$ . $C:=\frac{1}{n}\sum_{i=1}^{n}\varphi(w_{i})$ is a constant independent of the parameter $\boldsymbol{\theta}$ . (8) reduces to Zhang et al. (2009) eq. (20), if the model is specified as $\mu_{\boldsymbol{\theta}}(\boldsymbol{x})=g(\boldsymbol{\theta}^{\top}\boldsymbol{x})$ for some non-linear function $g:\mathbb{R}\to\mathbb{R}$ , whereas arbitrary similarity function $\mu_{\boldsymbol{\theta}}$ is considered in this study.

The second approach bypasses the computational challenge of the first approach, since (8) does not include any infinite summation; we consequently employ the second approach, and generalize it from $U=1$ to $U\in\mathbb{N}$ as shown in the next section.

3.4 Proposed BHLR

We here consider HLR with arbitrary $U\in\mathbb{N}$ , for predicting the hyperlink weights $w_{\boldsymbol{i}}$ taking values in a set $\mathcal{S}\subset\mathbb{R}$ via a user-specified symmetric similarity function $\mu_{\boldsymbol{\theta}}:\mathcal{X}^{U}\to\mathcal{S}$ . By generalizing the loss function (8) from $U=1$ to $U\in\mathbb{N}$ , we propose to minimize a simple loss function

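$$L_{\varphi,n}(\boldsymbol{\theta}):=\frac{1}{|\mathcal{I}_{n}^{(U)}|}\sum_{\boldsymbol{i}\in\mathcal{I}_{n}^{(U)}}d_{\varphi}(w_{\boldsymbol{i}},\mu_{\boldsymbol{\theta}}(\boldsymbol{X}_{\boldsymbol{i}}))=\frac{1}{|\mathcal{I}_{n}^{(U)}|}\sum_{\boldsymbol{i}\in\mathcal{I}_{n}^{(U)}}\left\{-\varphi(\mu_{\boldsymbol{\theta}}(\boldsymbol{X}_{\boldsymbol{i}}))-\varphi^{\prime}(\mu_{\boldsymbol{\theta}}(\boldsymbol{X}_{\boldsymbol{i}}))(w_{\boldsymbol{i}}-\mu_{\boldsymbol{\theta}}(\boldsymbol{X}_{\boldsymbol{i}}))\right\}+C,\tag{9}$$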

where $\varphi$ is a user-specified generating function whose domain $\text{dom}(\varphi)$ includes the set $\mathcal{S}$ , and $C:=\frac{1}{|\mathcal{I}_{n}^{(U)}|}\sum_{\boldsymbol{i}\in\mathcal{I}_{n}^{(U)}}\varphi(w_{\boldsymbol{i}})$ is a constant independent of the parameter $\boldsymbol{\theta}$ . Subsequently, the estimator is defined as

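$$\hat{\boldsymbol{\theta}}_{\varphi,n}:=\mathop{\arg\min}_{\boldsymbol{\theta}\in\boldsymbol{\Theta}}L_{\varphi,n}(\boldsymbol{\theta}).\tag{10}$$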

Once the estimator $\hat{\boldsymbol{\theta}}_{\varphi,n}$ is obtained, we may predict $w_{\boldsymbol{i}}$ by the estimated similarity function $\mu_{\hat{\boldsymbol{\theta}}_{\varphi,n}}(\boldsymbol{X}_{\boldsymbol{i}})$ . We formally define predicting $w_{\boldsymbol{i}}$ by the function $\mu_{\hat{\boldsymbol{\theta}}_{\varphi,n}}(\boldsymbol{X}_{\boldsymbol{i}})$ as the BHLR.
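The loss and its decomposition into a $\boldsymbol{\theta}$-dependent part plus the constant $C$ can be illustrated on a toy dataset. The sketch below assumes $U=2$, the quadratic generating function $\varphi(x)=x^{2}-x$, and a hypothetical one-parameter similarity $\mu_{\theta}(\boldsymbol{x},\boldsymbol{x}^{\prime})=\exp(\theta\langle\boldsymbol{x},\boldsymbol{x}^{\prime}\rangle)$; the grid search is only a stand-in for a proper optimizer, and all names are illustrative.

```python
import math

def phi(x):
    """Quadratic generating function."""
    return x * x - x

def dphi(x):
    return 2.0 * x - 1.0

def mu(theta, tuple_x):
    """Hypothetical symmetric similarity for U = 2: exp(theta * <x, x'>)."""
    x, x_prime = tuple_x
    return math.exp(theta * sum(a * b for a, b in zip(x, x_prime)))

def bhlr_loss(theta, x, w):
    """Average Bregman divergence between observed weights and predictions."""
    total = 0.0
    for index, w_i in w.items():
        m = mu(theta, [x[j] for j in index])
        total += phi(w_i) - phi(m) - dphi(m) * (w_i - m)
    return total / len(w)

def bhlr_loss_expanded(theta, x, w):
    """Same loss, written as the theta-dependent part plus the constant C."""
    part = 0.0
    for index, w_i in w.items():
        m = mu(theta, [x[j] for j in index])
        part += -phi(m) - dphi(m) * (w_i - m)
    C = sum(phi(w_i) for w_i in w.values()) / len(w)
    return part / len(w) + C

x = {1: [1.0, 0.0], 2: [0.5, 0.5], 3: [0.0, 1.0]}   # data vectors
w = {(1, 2): 2.0, (1, 3): 0.0, (2, 3): 1.0}          # observed hyperlink weights

grid = [k / 10 for k in range(-20, 21)]
theta_hat = min(grid, key=lambda t: bhlr_loss(t, x, w))
predictions = {i: mu(theta_hat, [x[j] for j in i]) for i in w}
```

With this generating function the loss is simply the mean squared error between `w` and the predictions, and `predictions` holds the fitted similarities $\mu_{\hat{\theta}}(\boldsymbol{X}_{\boldsymbol{i}})$ used to predict each $w_{\boldsymbol{i}}$.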

Since the hyperlink weights are symmetric, we assume that the function $\mu_{\boldsymbol{\theta}}$ also satisfies the symmetry

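$$\mu_{\boldsymbol{\theta}}(\boldsymbol{X}_{\boldsymbol{i}})=\mu_{\boldsymbol{\theta}}(\boldsymbol{X}_{\boldsymbol{i}^{\prime}})\tag{11}$$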

for any $\boldsymbol{i}^{\prime}=(i_{1}^{\prime},i_{2}^{\prime},\ldots,i_{U}^{\prime})$ obtained by permuting the elements of $\boldsymbol{i}=(i_{1},i_{2},\ldots,i_{U})\in\mathcal{I}_{n}^{(U)}$. This symmetry should hold for all $\boldsymbol{x}_{i_{1}},\boldsymbol{x}_{i_{2}},\ldots,\boldsymbol{x}_{i_{U}}\in\mathcal{X}$ and $\boldsymbol{\theta}\in\boldsymbol{\Theta}$; the similarity function $\mu_{\boldsymbol{\theta}}$ in effect ignores the order of the vectors, as long as this symmetry is assumed. An example of such a symmetric similarity function is

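$$\mu_{\boldsymbol{\theta}}(\boldsymbol{X}_{\boldsymbol{i}}):=\eta\left(\langle\boldsymbol{f}_{\boldsymbol{\theta}}(\boldsymbol{x}_{i_{1}}),\boldsymbol{f}_{\boldsymbol{\theta}}(\boldsymbol{x}_{i_{2}}),\ldots,\boldsymbol{f}_{\boldsymbol{\theta}}(\boldsymbol{x}_{i_{U}})\rangle\right),\tag{12}$$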

where $\boldsymbol{f}_{\boldsymbol{\theta}}:\mathcal{X}\to\mathbb{R}^{K}$ is a function parametrized by $\boldsymbol{\theta}$, e.g., a vector-valued neural network, $\eta:\mathbb{R}\to\mathcal{S}$ is a link function, e.g., the exponential function for $\mathcal{S}=\mathbb{R}_{\geq 0}$ and the sigmoid function for $\mathcal{S}=[0,1]$, and $\langle\boldsymbol{y},\boldsymbol{y}^{\prime},\boldsymbol{y}^{\prime\prime},\ldots\rangle:=\sum_{k=1}^{K}y_{k}y_{k}^{\prime}y_{k}^{\prime\prime}\cdots$. The above function (12) is employed for our numerical experiments later in Section 6, and it reduces to the tensor decomposition explained in Section 4.3 if $\boldsymbol{f}_{\boldsymbol{\theta}}(\boldsymbol{x})=\boldsymbol{\theta}^{\top}\boldsymbol{x}$, $\eta(z)=z$, and $\boldsymbol{x}_{i}$ is a $1$-hot vector.
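The permutation invariance of this similarity function can be verified directly. The following sketch assumes a linear map $\boldsymbol{f}_{\boldsymbol{\theta}}(\boldsymbol{x})=\boldsymbol{\theta}^{\top}\boldsymbol{x}$ and the exponential link, with arbitrary illustrative numbers for $\boldsymbol{\theta}$ and the data vectors; a neural network could replace `f_theta` without changing the argument.

```python
import math
from itertools import permutations

def multiway_inner(vectors):
    """<y, y', y'', ...> = sum over k of y_k * y'_k * y''_k * ..."""
    K = len(vectors[0])
    return sum(math.prod(v[k] for v in vectors) for k in range(K))

def f_theta(theta, x):
    """Linear map theta^T x, with theta stored as a p x K nested list."""
    p, K = len(theta), len(theta[0])
    return [sum(theta[a][k] * x[a] for a in range(p)) for k in range(K)]

def mu_theta(theta, tuple_x, eta=math.exp):
    """Similarity eta(<f(x_1), ..., f(x_U)>) for a U-tuple of data vectors."""
    return eta(multiway_inner([f_theta(theta, x) for x in tuple_x]))

theta = [[0.2, -0.1], [0.4, 0.3], [-0.5, 0.1]]            # p = 3, K = 2
X = ([1.0, 0.0, 2.0], [0.5, 1.0, 0.0], [0.0, 1.5, 1.0])   # U = 3

# Evaluate the similarity on every ordering of the same three vectors.
values = [mu_theta(theta, perm) for perm in permutations(X)]
```

All six entries of `values` agree up to floating-point rounding, because the multi-way inner product is a sum of coordinate-wise products and therefore does not depend on the order of its arguments.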

The BHLR reduces to several existing methods, such as logistic regression ( $U=1$ ), Poisson regression ( $U=1$ ), and link prediction ( $U=2$ ), by specifying $\mu_{\boldsymbol{\theta}}$ and $\varphi$ . Furthermore, BHLR also reduces to several methods for representation learning, such as graph embedding ( $U=2$ ), matrix factorization ( $U=2$ ), tensor factorization ( $U\geq 2$ ), and their variants equipped with arbitrary BD. We describe the relation between the BHLR and these existing methods in Section 4.

In addition to the rich examples for the BHLR family, the BHLR possesses the following two favorable properties: (P-1) statistical consistency, and (P-2) computational tractability. We further explain these properties (P-1) and (P-2) in Section 5.1 and Section 5.2, respectively, along with the proposal of a novel and generalized minibatch sampling procedure for hyper-relational data that can be used for efficient stochastic algorithms.

3.5 BHLR is Equivalent to MLE through Corresponding Exponential Family Model

In this section, we demonstrate that BHLR is interpreted as the maximum-likelihood estimation with a corresponding exponential family model. In other words, specifying a generating function $\varphi$ for BD implicitly specifies a cpdf or cpmf for $w_{\boldsymbol{i}}\mid\boldsymbol{X}_{\boldsymbol{i}}$ of the form

$$p_{\boldsymbol{\zeta}}(w\mid\mu):=\exp\left(w\zeta_{1}(\mu)+\zeta_{2}(\mu)+\zeta_{3}(w)\right)\tag{13}$$

with $\mu=\mu_{\boldsymbol{\theta}}(\boldsymbol{X}_{\boldsymbol{i}})$ , where $\zeta_{1}(\mu):=\varphi^{\prime}(\mu),\zeta_{2}(\mu):=\varphi(\mu)-\mu\varphi^{\prime}(\mu)$ , and $\zeta_{3}(w)$ is specified such that $\int_{\mathcal{S}}p_{\boldsymbol{\zeta}}(w|\mu)\,\mathrm{d}w=1$ (cpdf) or $\sum_{w\in\mathcal{S}}p_{\boldsymbol{\zeta}}(w|\mu)=1$ (cpmf) holds. This is easily understood as explained below. Starting from (9), a simple calculation leads to

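$$\exp\left(-|\mathcal{I}_{n}^{(U)}|\,L_{\varphi,n}(\boldsymbol{\theta})\right)=D\prod_{\boldsymbol{i}\in\mathcal{I}_{n}^{(U)}}p_{\boldsymbol{\zeta}}(w_{\boldsymbol{i}}\mid\mu_{\boldsymbol{\theta}}(\boldsymbol{X}_{\boldsymbol{i}})),$$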

where $D:=\prod_{\boldsymbol{i}\in\mathcal{I}_{n}^{(U)}}\exp(-\varphi(w_{\boldsymbol{i}})-\zeta_{3}(w_{\boldsymbol{i}}))$ is a constant independent of the parameter $\boldsymbol{\theta}$ . The normalizing function $\zeta_{3}(w)$ is explicitly specified as $\zeta_{3}(w)=-\log\int_{\mathcal{S}}\exp(w\zeta_{1}(\mu)+\zeta_{2}(\mu))\,\mathrm{d}w$ (cpdf) or $\zeta_{3}(w)=-\log\sum_{w\in\mathcal{S}}\exp(w\zeta_{1}(\mu)+\zeta_{2}(\mu))$ (cpmf). Therefore minimizing $L_{\varphi,n}(\boldsymbol{\theta})$ in BHLR is formally equivalent to maximizing the likelihood function of the exponential family model $p_{\boldsymbol{\zeta}}(w_{\boldsymbol{i}}\mid\mu_{\boldsymbol{\theta}}(\boldsymbol{X}_{\boldsymbol{i}}))$ .
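As a concrete instance, the KL generating function $\varphi(x)=x\log x-x$ gives $\zeta_{1}(\mu)=\log\mu$ and $\zeta_{2}(\mu)=-\mu$, which matches the Poisson pmf with $\zeta_{3}(w)=-\log w!$. The sketch below checks numerically that the divergence and the Poisson negative log-likelihood differ only by a term independent of $\mu$; the function names are illustrative.

```python
import math

def d_kl(w, mu):
    """Bregman divergence for phi(x) = x log x - x, valid for w > 0 and mu > 0."""
    return w * math.log(w / mu) - w + mu

def log_poisson(w, mu):
    """Log pmf of the Poisson distribution with mean mu."""
    return w * math.log(mu) - mu - math.lgamma(w + 1)

w = 3
# d_kl + log-likelihood should not depend on mu if the two objectives are equivalent.
gaps = [d_kl(w, mu) + log_poisson(w, mu) for mu in (0.5, 1.0, 2.0, 7.5)]
expected_gap = w * math.log(w) - w - math.log(math.factorial(w))
```

Every entry of `gaps` equals `expected_gap`, so minimizing the divergence over $\mu$ and maximizing the Poisson likelihood over $\mu$ select the same value.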

When $U=1$ , we associate the BHLR with the MLE of the generalized linear model (GLM) (Bishop, 2006). They are almost the same but do not exhibit inclusion in the following sense: (i) The GLM restricts $\zeta_{1}$ in (13) to be an identity function, and the function $\mu_{\boldsymbol{\theta}}(\boldsymbol{X}_{\boldsymbol{i}})$ is in the form of $g(\boldsymbol{\theta}^{\top}\boldsymbol{x}_{i_{1}})$ for some function $g$ , whereas the BHLR is free from these constraints. (ii) Meanwhile, function $\zeta_{2}$ in (13) is constrained by the generating function $\varphi$ , whereas this does not apply to GLM.

4 BHLR Family Members and Related Works

In this section, we describe the BHLR family members by specifying $U\in\mathbb{N}$ and the generating function $\varphi$ in Sections 4.1–4.3 and Table 2. Other related works are explained in Section 4.4.

Before explaining the BHLR family members, we first explicitly derive the corresponding loss functions $L_{\varphi,n}(\boldsymbol{\theta})$ associated with some generating functions $\varphi_{\text{Logistic}}(x):=x\log x+(1-x)\log(1-x),\varphi_{\text{KL}}(x):=x\log x-x,\varphi_{\text{Quad.}}(x):=x^{2}-x$ and $\varphi_{\beta}(x):=\frac{x^{1+\beta}}{\beta(1+\beta)}-\frac{x}{\beta}$ , that are listed in Table 1. Subsequently, for an arbitrary $U\in\mathbb{N}$ , we have

[TABLE]

respectively, where

[TABLE]

are constants independent of the parameter $\boldsymbol{\theta}$ . By utilizing these loss functions (14)–(17), and sets

[TABLE]

various existing methods can be regarded as BHLR family members, as shown in Table 2. Detailed explanations of the BHLR family members are provided in Section 4.1 for $U=1$, Section 4.2 for $U=2$, and Section 4.3 for $U\geq 2$. Other related works are explained in Section 4.4.
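The way each generating function produces a familiar loss can be checked numerically. In the sketch below, the $\boldsymbol{\theta}$-dependent expressions are derived here by expanding $d_{\varphi}(w,\mu)=\varphi(w)-\varphi(\mu)-\varphi^{\prime}(\mu)(w-\mu)$ and are not quoted from equations (14)–(17); the value of `BETA` and the evaluation points are arbitrary. For each family, the difference between the divergence and the candidate expression should not depend on $\mu$.

```python
import math

BETA = 0.5  # illustrative value of beta

def d_phi(phi, dphi, a, b):
    return phi(a) - phi(b) - dphi(b) * (a - b)

# name: (phi, phi', candidate theta-dependent part of d_phi(w, mu))
families = {
    "Logistic": (
        lambda x: x * math.log(x) + (1 - x) * math.log(1 - x),
        lambda x: math.log(x / (1 - x)),
        lambda w, m: -w * math.log(m) - (1 - w) * math.log(1 - m),  # cross-entropy
    ),
    "KL": (
        lambda x: x * math.log(x) - x,
        lambda x: math.log(x),
        lambda w, m: -w * math.log(m) + m,  # Poisson-type loss
    ),
    "Quad.": (
        lambda x: x * x - x,
        lambda x: 2 * x - 1,
        lambda w, m: m * m - 2 * w * m,  # squared error up to w^2
    ),
    "beta": (
        lambda x: x ** (1 + BETA) / (BETA * (1 + BETA)) - x / BETA,
        lambda x: (x ** BETA - 1) / BETA,
        lambda w, m: -w * m ** BETA / BETA + m ** (1 + BETA) / (1 + BETA),
    ),
}

w = 0.3  # lies in (0, 1) so that every generating function is defined
residuals = {
    name: [d_phi(phi, dphi, w, m) - part(w, m) for m in (0.2, 0.5, 0.8)]
    for name, (phi, dphi, part) in families.items()
}
```

Each list in `residuals` is constant across the three values of $\mu$, which confirms that the remaining terms depend only on $w$ and can be absorbed into a constant.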

4.1 $U=1$

•

Least-squares (LS) regression (Bishop, 2006) minimizes $-\sum_{i_{1}\in\mathcal{I}_{n}^{(1)}}\log p_{\text{Norm}}(w_{i_{1}}\mid\mu_{\boldsymbol{\theta}}(\boldsymbol{X}_{i_{1}}))$ using the normal probability density function $p_{\text{Norm}}(w\mid\mu):=\frac{1}{\sqrt{2\pi}}\exp(-\frac{(w-\mu)^{2}}{2})$ for learning $\mu_{\boldsymbol{\theta}}(\boldsymbol{X}_{i_{1}})=f_{\boldsymbol{\theta}}(\boldsymbol{x}_{i_{1}})$. LS regression is equivalent to minimizing $L_{\varphi_{\text{Quad.}},n}(\boldsymbol{\theta})$, and similarly, logistic regression (Bishop, 2006) and Poisson regression (Cameron and Trivedi, 2007) minimize $L_{\varphi_{\text{Logistic}},n}(\boldsymbol{\theta})$ and $L_{\varphi_{\text{KL}},n}(\boldsymbol{\theta})$, respectively. The regression function $f_{\boldsymbol{\theta}}:\mathbb{R}^{p}\to\mathbb{R}$ used in the regression methods above can be specified arbitrarily. Whereas the linear transformation $\boldsymbol{\theta}^{\top}\boldsymbol{x}_{i}\in\mathbb{R}$ is typically used (Zhang et al., 2010), neural networks (NNs) are now often incorporated to enhance the expressive power of the regression function.

•

Parametric Bregman-divergence regression (PBDR) (Zhang et al., 2009) generalizes Poisson regression, logistic regression and least squares (LS) regression; it is equivalent to the BHLR equipped with arbitrary generating functions $\varphi$ and functions $\mu_{\boldsymbol{\theta}}(\boldsymbol{X}_{\boldsymbol{i}})$ in the form of $g(\boldsymbol{\theta}^{\top}\boldsymbol{x}_{i_{1}})$ for some function $g$ . The PBDR is a special case of the BHLR. However, PBDR considers only the limited form of functions $\mu_{\boldsymbol{\theta}}(\boldsymbol{X}_{\boldsymbol{i}})$ , whereas BHLR can employ arbitrary function including neural networks.

4.2 $U=2$

•

Matrix factorization (MF) (Koren et al., 2009) decomposes a given matrix $\boldsymbol{V}=(v_{\boldsymbol{j}})\in\mathbb{R}^{n_{1}\times n_{2}}$ into matrices $\boldsymbol{\xi}^{(u)}\in\mathbb{R}^{n_{u}\times K}\>(u=1,2)$ , by minimizing the BD between entries of $\boldsymbol{V}$ and those of $\boldsymbol{\xi}^{(1)}\boldsymbol{\xi}^{(2)\top}$ . Subsequently, we can expect that $\boldsymbol{V}\approx\boldsymbol{\xi}^{(1)}\boldsymbol{\xi}^{(2)\top}$ .

Here, we briefly explain that the BHLR includes MF as a special case, by considering link weights

[TABLE]

and ( $n_{1}+n_{2}$ )-dimensional $1$ -hot data vectors $\{\boldsymbol{x}_{i}\}_{i=1}^{n_{1}+n_{2}}$ .

Using the parameter $\boldsymbol{\theta}=(\boldsymbol{\xi}^{(1)\top},\boldsymbol{\xi}^{(2)\top})^{\top}\in\mathbb{R}^{(n_{1}+n_{2})\times K}$ and an index set $\mathcal{C}(n_{1},n_{2}):=\{(i_{1},i_{2})\mid i_{1}=1,2,\ldots,n_{1};i_{2}=n_{1}+1,n_{1}+2,\ldots,n_{1}+n_{2}\}$ , it holds that

[TABLE]

where $v_{\boldsymbol{j}}$ and $w_{\boldsymbol{i}}$ represent elements of the matrices $\boldsymbol{V}$ and $\boldsymbol{W}$ respectively. Thus, MF minimizing the objective on the left-hand side is equivalent to the BHLR minimizing the objective on the right-hand side. Although MF employs the quadratic loss $L_{\varphi_{\text{Quad.}},n}(\boldsymbol{\theta})$ in many cases, MF is in fact defined with an arbitrary BD (Cichocki et al., 2009).

MF ( $U=2$ ) can be generalized to $U\geq 2$ ; this generalization is called tensor factorization (TF). We describe TF in Section 4.3, and its relation to the BHLR is described in detail in Appendix B.

Finally, MF is called a non-negative MF (NMF) (Cichocki et al., 2009) if the entries of the decomposed matrices $\boldsymbol{\xi}^{(1)},\boldsymbol{\xi}^{(2)}$ are restricted to be non-negative.

•

Graph embedding (GE) (Tang et al., 2015; Okuno et al., 2018; Nickel and Kiela, 2017; Okuno and Shimodaira, 2019) is a method for representation learning that trains the transformation $\boldsymbol{f}_{\boldsymbol{\theta}}:\mathcal{X}(\subset\mathbb{R}^{p})\to\mathbb{R}^{K}$ with a user-specified dimension $K\in\mathbb{N}$ , such that the link weight $w_{\boldsymbol{i}}\geq 0$ is predicted through $\mu_{\boldsymbol{\theta}}(\boldsymbol{X}_{\boldsymbol{i}})=g(\langle\boldsymbol{f}_{\boldsymbol{\theta}}(\boldsymbol{x}_{i_{1}}),\boldsymbol{f}_{\boldsymbol{\theta}}(\boldsymbol{x}_{i_{2}})\rangle)$ . Here, $g:\mathbb{R}^{K}\times\mathbb{R}^{K}\to\mathbb{R}$ is a symmetric function, and $\boldsymbol{\theta}$ is a parameter vector to be estimated. The parameter is estimated by minimizing $L_{\varphi_{\text{Logistic}},n}(\boldsymbol{\theta})$ with the sigmoid function $g(\cdot)=\sigma(\cdot)$ in large-scale information network embedding (LINE) (Tang et al., 2015), and by minimizing $L_{\varphi_{\text{KL}},n}(\boldsymbol{\theta})$ with $g(\cdot)=\exp(\cdot)$ in the $1$ -view version of probabilistic multi-view graph embedding (Okuno et al., 2018), which we denote as KL-GE herein.

While these GEs achieved outstanding success, the observed link weights may contain noise in practice that may degrade the GE’s performance; $\beta$ -GE (Okuno and Shimodaira, 2019) minimizes $L_{\varphi_{\beta},n}(\boldsymbol{\theta})$ associated with $\beta$ -divergence for learning the similarity function $\mu_{\boldsymbol{\theta}}(\boldsymbol{X}_{\boldsymbol{i}})$ robustly from noisy link weights.

The GEs above are special cases of the BHLR. Once the estimator $\hat{\boldsymbol{\theta}}_{\varphi,n}$ for GE is obtained, we may compute feature vectors $\boldsymbol{y}_{i}:=\boldsymbol{f}_{\hat{\boldsymbol{\theta}}_{\varphi,n}}(\boldsymbol{x}_{i})$ , $(i=1,2,\ldots,n)$ . Applying further statistical analysis methods such as visualization, clustering, and discriminant analysis to the obtained feature vectors $\{\boldsymbol{y}_{i}\}_{i=1}^{n}$ has demonstrated empirically better performance than using the original data vectors $\{\boldsymbol{x}_{i}\}_{i=1}^{n}$ .

Many GEs employ the IPS model $\langle\boldsymbol{f}_{\boldsymbol{\theta}}(\boldsymbol{x}_{i_{1}}),\boldsymbol{f}_{\boldsymbol{\theta}}(\boldsymbol{x}_{i_{2}})\rangle$ equipped with a vector-valued NN $\boldsymbol{f}_{\boldsymbol{\theta}}$ in their similarity function $\mu_{\boldsymbol{\theta}}$ . In terms of its expressive power, Okuno et al. (2018) proved that the IPS approximates any PD similarity $g^{(\text{PD})}(\boldsymbol{x}_{i_{1}},\boldsymbol{x}_{i_{2}})$ arbitrarily well. However, non-PD similarities are not expressed by the IPS model, and thus some other similarity models are drawing attention. For instance, Nickel and Kiela (2017, 2018) employ negative Poincaré distance that can efficiently embed tree-structured graphs. Furthermore, shifted IPS (SIPS) (Okuno et al., 2019) $\langle\boldsymbol{f}_{\boldsymbol{\theta}}(\boldsymbol{x}_{i_{1}}),\boldsymbol{f}_{\boldsymbol{\theta}}(\boldsymbol{x}_{i_{2}})\rangle+u_{\boldsymbol{\theta}}(\boldsymbol{x}_{i_{1}})+u_{\boldsymbol{\theta}}(\boldsymbol{x}_{i_{2}})$ is proposed for GE by introducing the bias terms using a NN $u_{\boldsymbol{\theta}}:\mathcal{X}\to\mathbb{R}$ , and it has been proven to approximate a wider class called conditionally PD similarities that include PD similarities and various non-PD similarities, such as negative Poincaré distance. Recently Kim et al. (2019) proposed the weighted inner product similarity (WIPS) for approximating general similarities including PD and conditionally PD similarities as special cases.

•

Stochastic block model (SBM) (Holland et al., 1983) considers a graph for which each node $i\in[n]$ is associated with the cluster index $x_{i}\in[C]$ . The SBM learns $\theta_{1},\theta_{2}\in[0,1]$ , representing probabilities that a link exists between two nodes belonging to the same cluster and different clusters, respectively. As the probability $\mathbb{P}(w_{\boldsymbol{i}}=1\mid\boldsymbol{X}_{\boldsymbol{i}})$ is expressed as $\mu_{\boldsymbol{\theta}}(\boldsymbol{X}_{\boldsymbol{i}}):=\theta_{1}\boldsymbol{1}(x_{i_{1}}=x_{i_{2}})+\theta_{2}\boldsymbol{1}(x_{i_{1}}\neq x_{i_{2}})$ and the parameter $\boldsymbol{\theta}=(\theta_{1},\theta_{2})$ is learned by minimizing $L_{\varphi_{\text{Logistic}},n}(\boldsymbol{\theta})$ , the SBM is a special case of the BHLR.
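Returning to the matrix factorization item of the list above, the role of the $1$ -hot data vectors can be illustrated with a small numerical check. This is only a sketch under the assumption that the similarity is the inner product $\langle\boldsymbol{\theta}^{\top}\boldsymbol{x}_{i_{1}},\boldsymbol{\theta}^{\top}\boldsymbol{x}_{i_{2}}\rangle$ ; the function name `mu` and the toy sizes are hypothetical, and indices are 0-based in the code.

```python
import numpy as np

rng = np.random.default_rng(0)
n1, n2, K = 3, 4, 2
xi1 = rng.normal(size=(n1, K))       # factor matrix xi^(1)
xi2 = rng.normal(size=(n2, K))       # factor matrix xi^(2)
theta = np.vstack([xi1, xi2])        # theta in R^{(n1+n2) x K}
X = np.eye(n1 + n2)                  # 1-hot data vectors x_1, ..., x_{n1+n2}

def mu(theta, x_a, x_b):
    """Inner-product similarity <theta^T x_a, theta^T x_b> (assumed form)."""
    return float((theta.T @ x_a) @ (theta.T @ x_b))

# Index set C(n1, n2) in 0-based form: i1 runs over the first n1 vectors,
# i2 over the remaining n2 vectors (shifted by n1).
sims = np.array([[mu(theta, X[i1], X[n1 + j2]) for j2 in range(n2)]
                 for i1 in range(n1)])
product = xi1 @ xi2.T                # entries of xi^(1) xi^(2)^T
```

Because $\boldsymbol{\theta}^{\top}\boldsymbol{x}_{i}$ simply selects the $i$ -th row of $\boldsymbol{\theta}$ when $\boldsymbol{x}_{i}$ is $1$ -hot, `sims` reproduces `product` entry by entry, so fitting the similarities over $\mathcal{C}(n_{1},n_{2})$ amounts to fitting the entries of the factorized matrix.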

4.3 $U\geq 2$

•

PARAFAC (Cichocki et al., 2009; Cong et al., 2015), which is also called TF, CP-decomposition, and CANDECOMP, decomposes a given tensor $\boldsymbol{V}:=(v_{\boldsymbol{j}})\in\mathbb{R}^{n_{1}\times n_{2}\times\cdots\times n_{U}}$ into matrices $\boldsymbol{\xi}^{(u)}:=(\xi^{(u)}_{jk})\in\mathbb{R}^{n_{u}\times K}\>(u\in[U])$ , by minimizing the BD between entries of $\boldsymbol{V}$ and $[\![\boldsymbol{\xi}^{(1)},\boldsymbol{\xi}^{(2)},\ldots,\boldsymbol{\xi}^{(U)}]\!]$ whose $\boldsymbol{j}=(j_{1},j_{2},\ldots,j_{U})$ -th entry is specified as $\sum_{k=1}^{K}\xi^{(1)}_{j_{1}k}\xi^{(2)}_{j_{2}k}\cdots\xi^{(U)}_{j_{U}k}$ . Subsequently, we can expect that $\boldsymbol{V}\approx[\![\boldsymbol{\xi}^{(1)},\boldsymbol{\xi}^{(2)},\ldots,\boldsymbol{\xi}^{(U)}]\!]$ . TF $(U\geq 2)$ generalizes the MF ( $U=2$ ) explained in Section 4.2 because $[\![\boldsymbol{\xi}^{(1)},\boldsymbol{\xi}^{(2)}]\!]=\boldsymbol{\xi}^{(1)}\boldsymbol{\xi}^{(2)\top}$ . Similar to MF, TF is a special case of the BHLR. See Appendix B for details.

•

PARAFAC is called a non-negative tensor factorization (NTF) (Cichocki et al., 2009; Kolda and Bader, 2009) or non-negative PARAFAC, if the entries of the decomposed matrices $\boldsymbol{\xi}^{(u)}\>(u\in[U])$ are restricted to be non-negative. Although this PARAFAC-based NTF can be applied to general $U\in\mathbb{N}$ (Kolda and Bader, 2009), many different types of NTFs have been developed especially for $U=3$ ; by referring to Cichocki et al. (2009) p.54 Table 1.2, NTF1, NTF2 (Cichocki and Zdunek, 2006), and shifted NTF (Harshman et al., 2003) decompose a given tensor into $2$ matrices and a tensor, and convolutive NTF (CNTF) and C2NTF (Mørup and Schmidt, 2006) decompose the tensor into a matrix and $2$ tensors.
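The entry-wise definition of $[\![\boldsymbol{\xi}^{(1)},\boldsymbol{\xi}^{(2)},\ldots,\boldsymbol{\xi}^{(U)}]\!]$ given in the PARAFAC item can be written out directly. The sketch below is illustrative (the helper name `cp_reconstruct` and the toy shapes are hypothetical); it builds the tensor as a sum of $K$ outer products and checks that the $U=2$ case reduces to $\boldsymbol{\xi}^{(1)}\boldsymbol{\xi}^{(2)\top}$ .

```python
import numpy as np

def cp_reconstruct(factors):
    """Return [[xi^(1), ..., xi^(U)]] from factor matrices of shape (n_u, K)."""
    K = factors[0].shape[1]
    out = np.zeros([f.shape[0] for f in factors])
    for k in range(K):
        # Outer product of the k-th columns of all factors, accumulated over k.
        comp = factors[0][:, k]
        for f in factors[1:]:
            comp = np.multiply.outer(comp, f[:, k])
        out += comp
    return out

rng = np.random.default_rng(0)
A = rng.normal(size=(2, 3))   # xi^(1), n_1 = 2, K = 3
B = rng.normal(size=(4, 3))   # xi^(2), n_2 = 4
C = rng.normal(size=(5, 3))   # xi^(3), n_3 = 5

tensor3 = cp_reconstruct([A, B, C])   # U = 3
matrix2 = cp_reconstruct([A, B])      # U = 2, expected to equal A @ B.T
```

The non-negative variants discussed above would additionally constrain the entries of the factor matrices, which this sketch does not enforce.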

4.4 Other Related Works

In this section, we list some other related works. See also Appendix A for the remaining related works.

For $U=1$ , the MLE of a generalized linear model (Bishop, 2006) and the BHLR are almost the same; however, they do not exhibit inclusion, as explained in Section 3.5.
For $U=2$ , Locality preserving projections (LPP) (He and Niyogi, 2004) computes low-dimensional linearly transformed feature vectors $\boldsymbol{y}_{i}=\boldsymbol{A}^{\top}\boldsymbol{x}_{i}\>(i=1,2,\ldots,n)$ by considering link weights $w_{i_{1}i_{2}}\geq 0$ . Cross-Domain Matching Correlation Analysis (CDMCA) (Shimodaira, 2016) is a multiview extension of LPP. Considering that (i) LPP can be regarded as $1$ -view CDMCA and (ii) CDMCA is a quadratic approximation of multiview KL–GE equipped with linear transformations, as shown in Okuno et al. (2018), Section 3.6, LPP is a quadratic approximation of KL–GE, which is included in the BHLR. LPP reduces to spectral graph embedding (Chung, 1997) if the data vectors are $1$ -hot.
For $U\geq 2$ , Hypergraph Incidence Matrix Factorization (HIMFAC) (Nori et al., 2012) computes the linear transformation of given data vectors by considering the observed hyperlinks defined for $U$ -tuples. HIMFAC consists of the following two steps: (i) for $i,i^{\prime}\in[n]$ , HIMFAC first counts the number $v_{ii^{\prime}}$ of hyperlinks to which both data vectors $\boldsymbol{x}_{i},\boldsymbol{x}_{i^{\prime}}$ belong; (ii) by regarding $\boldsymbol{V}=(v_{ii^{\prime}})$ as a new adjacency matrix of data vectors, HIMFAC computes the LPP (He and Niyogi, 2004) if the link weight is defined among a single type of data, and CDMCA (Shimodaira, 2016) for multiple types of data (e.g., text, images, etc.). Similarly to LPP explained above $(U=2)$ , HIMFAC can be regarded as a quadratic approximation of BHLR $(U=2)$ , though the hyperlink weights $(U\geq 2)$ are converted into link weights $(U=2)$ through the preprocessing step (i).

5 BHLR Properties

In this section, we show two favorable properties of BHLR. The first property, (P-1) statistical consistency, states that the BHLR asymptotically recovers the true conditional expectation of link weights; it is explained in Section 5.1. The second property, (P-2) computational tractability, states that the BHLR can be efficiently computed by stochastic algorithms; it is explained in Section 5.2.

5.1 BHLR Asymptotically Recovers True Conditional Expectations

In this section, we demonstrate via Theorem 1 that the similarity function $\mu_{\hat{\boldsymbol{\theta}}_{\varphi,n}}(\boldsymbol{X}_{\boldsymbol{i}})$ estimated by the BHLR asymptotically recovers the true conditional expectation $\mu_{*}(\boldsymbol{X}_{\boldsymbol{i}})=\mathbb{E}(w_{\boldsymbol{i}}\mid\boldsymbol{X}_{\boldsymbol{i}})$ . For proving the asymptotic properties of BHLR in Proposition 1 and Theorem 1, only in this section, we specify the increasing order index set as

[TABLE]

such that it includes all the possible combinations of $U$ different entries $i_{1},i_{2},\ldots,i_{U}\in[n]$ , whereas no two distinct indices $\boldsymbol{i},\boldsymbol{i}^{\prime}\in\mathcal{J}_{n}^{(U)}$ are obtained from each other by permutation. Then, hyperlink weights $\{w_{\boldsymbol{i}}\}_{\boldsymbol{i}\in\mathcal{J}_{n}^{(U)}}$ are free from the symmetry constraints described in Section 3.1; the underlying conditional distribution of $w_{\boldsymbol{i}}\mid\boldsymbol{X}_{\boldsymbol{i}}$ can be defined without the constraints, thus making the theoretical development easier.
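For concreteness, the increasing order index set can be enumerated for small $n$ and $U$ . The sketch below assumes, based on the description above, that $\mathcal{J}_{n}^{(U)}$ consists of the tuples with strictly increasing entries $1\leq i_{1}<i_{2}<\cdots<i_{U}\leq n$ ; the helper name `increasing_index_set` is hypothetical.

```python
from itertools import combinations
from math import comb

def increasing_index_set(n, U):
    """All U-tuples (i_1 < i_2 < ... < i_U) with entries in {1, ..., n}."""
    return list(combinations(range(1, n + 1), U))

J = increasing_index_set(4, 2)
# J contains (1, 2) but not its permutation (2, 1); its size is "n choose U".
size_7_3 = len(increasing_index_set(7, 3))
```

Under this assumption the set has $\binom{n}{U}$ elements, which still grows as $O(n^{U})$ for fixed $U$ .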

In the following, we list conditions (C-1)–(C-5) needed for theoretical development. $w$ represents a random variable that follows a cpdf (or cpmf) $q$ of $w\mid\boldsymbol{X}$ , for $\boldsymbol{X}=(\boldsymbol{x},\boldsymbol{x}^{\prime},\boldsymbol{x}^{\prime\prime},\ldots)\in\mathcal{X}^{U}$ .

(C-1) $\boldsymbol{\Theta}$ is compact.

(C-2) Real-valued functions $\mu_{\boldsymbol{\theta}}(\boldsymbol{X})$ and $\mu_{*}(\boldsymbol{X}):=\mathbb{E}(w\mid\boldsymbol{X})$ are continuous on $\boldsymbol{\Theta}\times\mathcal{X}^{U}$ and $\mathcal{X}^{U}$ , respectively. In particular, the function $\mu_{\boldsymbol{\theta}}(\boldsymbol{X})$ is Lipschitz continuous on $\boldsymbol{\Theta}$ for each $\boldsymbol{X}\in\mathcal{X}^{U}$ .

(C-3) Hyperlink weights $\{w_{\boldsymbol{i}}\}_{\boldsymbol{i}\in\mathcal{I}_{n}^{(U)}}$ follow a distribution whose cpdf (or cpmf) is specified as $\prod_{\boldsymbol{i}\in\mathcal{I}_{n}^{(U)}}q(w_{\boldsymbol{i}}\mid\boldsymbol{X}_{\boldsymbol{i}})$ , and data vectors $\boldsymbol{x}_{1},\boldsymbol{x}_{2},\ldots,\boldsymbol{x}_{n}$ i.i.d. follow a pdf $q_{X}$ , where the support of $q_{X}$ is compact.

(C-4) $\mathbb{E}(w^{2}\mid\boldsymbol{X})<\infty$ and $\mathbb{E}(\varphi(w)^{2}\mid\boldsymbol{X})<\infty$ for all $\boldsymbol{X}\in\mathcal{X}^{U}$ .

(C-5) $\varphi$ is $C^{2}$ and strongly convex.

It is noteworthy that all the functions listed in Table 1 satisfy the condition (C-5); all the conditions (C-1)–(C-5) are not difficult to satisfy in practice. Using these conditions, we demonstrate in the following Proposition 1 that $L_{\varphi,n}(\boldsymbol{\theta})$ empirically approximates the expected value of $d_{\varphi}(\mu_{*}(\boldsymbol{X}),\mu_{\boldsymbol{\theta}}(\boldsymbol{X}))$ up to a constant.

Proposition 1.

Let $U\in\mathbb{N}$ , $\mathcal{I}_{n}^{(U)}=\mathcal{J}_{n}^{(U)}$ defined in eq. (22) and suppose that (C-1)–(C-5) hold. Let $\mathbb{E}_{\mathcal{X}^{U}}$ represent the expectation with respect to the density of the $U$ -tuple $\boldsymbol{X}=(\boldsymbol{x},\boldsymbol{x}^{\prime},\boldsymbol{x}^{\prime\prime},\ldots)\in\mathcal{X}^{U}$ ; more specifically, $\boldsymbol{x},\boldsymbol{x}^{\prime},\boldsymbol{x}^{\prime\prime},\ldots$ i.i.d. follow a pdf $q_{X}$ . Then, for $n\to\infty$ , it holds that

[TABLE]

for each $\boldsymbol{\theta}\in\boldsymbol{\Theta}$ , where $C_{\varphi}:=\mathbb{E}_{\mathcal{X}^{U}}\left(\mathbb{E}(\varphi(w)\mid\boldsymbol{X})-\varphi(\mu_{*}(\boldsymbol{X}))\right)$ is a constant independent of the parameter $\boldsymbol{\theta}$ .

The proof is obtained by applying the law of large numbers for multiply indexed, partially dependent random variables. See Appendix C.2 for details.
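The constant $C_{\varphi}$ can be understood from an elementary identity. Assuming the standard definition $d_{\varphi}(a,b)=\varphi(a)-\varphi(b)-\varphi^{\prime}(b)(a-b)$ , expanding the divergence gives $\mathbb{E}(d_{\varphi}(w,\mu)\mid\boldsymbol{X})=d_{\varphi}(\mu_{*}(\boldsymbol{X}),\mu)+\mathbb{E}(\varphi(w)\mid\boldsymbol{X})-\varphi(\mu_{*}(\boldsymbol{X}))$ for any fixed $\mu$ . The sketch below checks the sample analogue of this identity, with the sample mean in place of $\mu_{*}$ ; the gamma-distributed toy weights and the choice $\varphi(z)=z\log z$ are assumptions made only for illustration.

```python
import numpy as np

def bregman(phi, dphi, a, b):
    """Bregman divergence d_phi(a, b) = phi(a) - phi(b) - phi'(b) * (a - b)."""
    return phi(a) - phi(b) - dphi(b) * (a - b)

phi = lambda z: z * np.log(z)        # assumed KL-type generating function
dphi = lambda z: np.log(z) + 1.0

rng = np.random.default_rng(0)
w = rng.gamma(shape=2.0, scale=1.5, size=1000)   # toy positive weights at one fixed X
w_bar = w.mean()                                  # plays the role of mu_*(X)

# Sample analogue of the theta-independent constant: mean(phi(w)) - phi(mean(w)).
const = phi(w).mean() - phi(w_bar)

# For any candidate prediction mu, the averaged loss exceeds d_phi(w_bar, mu)
# by exactly the same constant.
gaps = [bregman(phi, dphi, w, m).mean() - bregman(phi, dphi, w_bar, m)
        for m in (0.5, 1.0, 3.0, 10.0)]
```

Since the gap does not depend on the candidate prediction, minimizing the averaged divergence to the observed weights is equivalent to minimizing the divergence to the conditional mean, which is the mechanism behind Proposition 1.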

As explained in Section 3.2, different tuples $\boldsymbol{X}_{\boldsymbol{i}},\boldsymbol{X}_{\boldsymbol{i}^{\prime}}$ may be constrained, as they may share some data vectors, even if the data vectors $\boldsymbol{x}_{i}$ are generated i.i.d.; the theory for HLR can therefore differ from that of classical regression, which predicts response variables from i.i.d. explanatory variables. Owing to the constraint, the convergence rate of the loss function for BHLR is $O(1/\sqrt{n})$ even though the estimation leverages $|\mathcal{I}_{n}^{(U)}|=O(n^{U})$ samples. This convergence rate is similar to that of $U$ -statistics (Lee, 1990), and differs from the rate $O(1/\sqrt{n^{U}})$ for classical regression using $O(n^{U})$ i.i.d. data vectors. In addition, Proposition 1 with the $\beta$ -divergence listed in Table 1 and $U=2$ corresponds to a special case $(\varepsilon=0)$ of Theorem 3.1 in Okuno and Shimodaira (2019), which indicates the convergence of the GE’s loss function using $\beta$ -divergence.

Proposition 1 leads to the following Theorem 1, which claims that the estimated model $\mu_{\hat{\boldsymbol{\theta}}_{\varphi,n}}$ converges to $\mu_{*}$ in probability, by considering that $d_{\varphi}(\mu_{*}(\boldsymbol{X}),\mu_{\boldsymbol{\theta}}(\boldsymbol{X}))$ with fixed $\mu_{*}(\boldsymbol{X})$ is minimized if $\mu_{\boldsymbol{\theta}}(\boldsymbol{X})=\mu_{*}(\boldsymbol{X})$ .

Theorem 1.

The symbols and conditions are the same as those of Proposition 1 except for the additional condition: there exists $\boldsymbol{\theta}_{*}\in\boldsymbol{\Theta}$ such that $\mu_{\boldsymbol{\theta}_{*}}=\mu_{*}$ . Using a norm $\|f\|:=\mathbb{E}_{\mathcal{X}^{U}}(f(\boldsymbol{X})^{2})^{1/2}$ defined for functions $f:\mathcal{X}^{U}\to\mathbb{R}$ , it holds that

[TABLE]

where $\hat{\boldsymbol{\theta}}_{\varphi,n}$ is the estimator (10) computed with $n$ data vectors $\{\boldsymbol{x}_{i}\}_{i=1}^{n}$ and their hyperlink weights $\{w_{\boldsymbol{i}}\}_{\boldsymbol{i}\in\mathcal{I}_{n}^{(U)}}$ .

The proof is provided in Appendix C.3. As indicated in Theorem 1 above, the estimated similarity function $\mu_{\hat{\boldsymbol{\theta}}_{\varphi,n}}$ asymptotically recovers the underlying expectation function $\mu_{*}$ in probability, regardless of the choice of $\varphi$ . Thus, the BHLR is statistically consistent.

Interestingly, Theorem 1 does not rely on the underlying conditional distribution of hyperlink weights; BHLR is also robust against the distributional misspecification for the weights, as long as the set of user-specified similarity functions $\{\mu_{\boldsymbol{\theta}}(\boldsymbol{X})\}_{\boldsymbol{\theta}\in\boldsymbol{\Theta}}$ includes the conditional expectation $\mu_{*}(\boldsymbol{X}):=\mathbb{E}(w\mid\boldsymbol{X})$ therein.

Note that a similar property is already known for exponential linear regression models (e.g., the Poisson regression model), which correspond to BHLR with $U=1$ . See Cameron and Trivedi (2013), Sections 2.4.2 and 3.2.3, for details.

5.2 BHLR can be Efficiently Computed by Stochastic Algorithm

In this section, we discuss the optimization for the BHLR. We first consider applying the classical fullbatch gradient descent (GD), i.e., GD using all data for computing gradients to obtain the estimator (10). Subsequently, we demonstrate that the fullbatch-based methods require considerable computational cost when considering $U\geq 2$ . For reducing the computational complexity, we introduce an efficient algorithm based on minibatch stochastic GD (SGD), i.e., GD using a sampled small dataset for computing gradients. Furthermore, we prove the asymptotics of the minibatch SGD, and demonstrate that it increases the ROC–AUC test score in our numerical experiments.

For notational simplicity, $n,U\in\mathbb{N}$ , generating function $\varphi$ , index set $\mathcal{I}_{n}^{(U)}(\neq\emptyset)\subset[n]^{U}$ , hyperlink weights $\{w_{\boldsymbol{i}}\}_{\boldsymbol{i}\in\mathcal{I}_{n}^{(U)}}$ , and data vectors $\{\boldsymbol{x}_{i}\}_{i=1}^{n}$ are fixed in this section. It is noteworthy that the index set $\mathcal{I}_{n}^{(U)}\subset[n]^{U}$ can be arbitrarily specified hereinafter, whereas the set $\mathcal{I}_{n}^{(U)}$ was restricted to the specific form $\mathcal{J}_{n}^{(U)}$ of eq. (22) in the previous Section 5.1 to simplify the theory. For example, both $(1,2)$ and $(2,1)$ can be included in $\mathcal{I}_{n}^{(2)}$ while only $(1,2)$ was included in $\mathcal{J}_{n}^{(2)}$ .

We begin by obtaining the estimator (10) by applying the fullbatch GD with $T\in\mathbb{N}$ iterations started from a randomly initialized vector $\boldsymbol{\theta}^{(1)}$ :

[TABLE]

where $\{\gamma^{(t)}\}_{t=1,2,\ldots,T}\subset\mathbb{R}_{>0}$ are step sizes, $g(\boldsymbol{\theta})$ is the gradient function, and $\mathcal{Q}_{\boldsymbol{\Theta}}(\boldsymbol{\theta}):=\mathop{\arg\min}_{\boldsymbol{\theta}^{\prime}\in\boldsymbol{\Theta}}\|\boldsymbol{\theta}^{\prime}-\boldsymbol{\theta}\|_{2}$ is the projection to the parameter space. This projection is required for ensuring that the estimator $\hat{\boldsymbol{\theta}}^{(t)}$ is included in the parameter space $\boldsymbol{\Theta}$ ; the projection can be ignored if $\boldsymbol{\Theta}=\mathbb{R}^{p}$ . The gradient function is expressed as

[TABLE]

where $\mathcal{P}_{n}^{(U)}:=\{\boldsymbol{i}\in\mathcal{I}_{n}^{(U)}\mid w_{\boldsymbol{i}}\neq 0\}$ is the set of indices whose corresponding weights are non-zero. Under some assumptions (Dunn, 1981), the iterate $\boldsymbol{\theta}^{(T+1)}$ obtained after $T$ iterations converges to the estimator (10) as $T\to\infty$ . However, computing the gradient (25) costs $O(|\mathcal{I}_{n}^{(U)}|)=O(n^{U})$ , which is non-negligible especially for $U\geq 2$ .

For efficiently computing the estimator (10), we alternatively employ minibatch SGD (Ruder, 2016) that iteratively updates the parameter as

[TABLE]

where $\tilde{g}_{\eta}^{(t)}(\boldsymbol{\theta})$ is a stochastic gradient as will be defined in (30) using the sampled small dataset called minibatch.

Although minibatch sampling can be easily formulated in the case of $U=1$ , several different sampling patterns may occur when $U\geq 2$ . For instance, when $U=2$ , the negative-sampling used in skip-gram (Mikolov et al., 2013) first randomly fixes the first entry $i_{1}$ in the index $\boldsymbol{i}=(i_{1},i_{2})$ and subsequently samples a minibatch as shown in Figure 3, whereas the minibatch SGD used in Okuno et al. (2018) and Okuno and Shimodaira (2019) samples a minibatch without fixing any entries in the index. Thus, we unify both of these existing methods in this study, and propose a general procedure for sampling a minibatch that can be used for both $U=1,2$ and $U\geq 3$ . The proposed general procedure is explained in the following and Algorithm 1.

In the proposed procedure, that generalizes negative sampling ( $U=2,v=1$ ) used in skip-gram (Mikolov et al., 2013), we first specify $v\in\{0,1,2,\ldots,U-1\}$ , that represents the number of entries in the index $\boldsymbol{i}$ to be fixed. $v=0$ indicates that no entry is fixed; we herein consider $v\geq 1$ . For fixing the entries, we specify $\boldsymbol{u}$ in a set

[TABLE]

Then, the proposed procedure is summarized in Algorithm 1 using a set of $\boldsymbol{i}\in\mathcal{I}_{n}^{(U)}$ whose $\boldsymbol{u}=(u_{1},u_{2},\ldots,u_{v})$ -th entry is fixed as $\boldsymbol{j}=(j_{1},j_{2},\ldots,j_{v})\in[n]^{v}$ , that is

[TABLE]

and a set

[TABLE]

that decomposes the index set as $\mathcal{I}_{n}^{(U)}=\bigcup_{\boldsymbol{j}\in\mathcal{K}_{\boldsymbol{u}}}\mathcal{I}_{n,\boldsymbol{u}}^{(U)}(\boldsymbol{j})$ without any overlap. $p_{\boldsymbol{j}}$ represents the probability to choose $\boldsymbol{j}$ from the set $\mathcal{K}_{\boldsymbol{u}}$ ; we employ $p_{\boldsymbol{j}}=1/|\mathcal{K}_{\boldsymbol{u}}|$ later in Theorem 2, whereas it can be arbitrarily specified by users in practice. The proposed minibatch sampling for hyper-relations is also illustrated in Example 5.

Example 5 (Minibatch sampling for hyper-relations).

We consider $n=7,U=4,v=2,\mathcal{I}_{n}^{(U)}=[n]^{U},\boldsymbol{u}=(1,3)$ , and $\boldsymbol{j}=(2,5)$ is herein randomly selected. We define a set of indices whose $\boldsymbol{u}=(1,3)$ -th entry is fixed as $\boldsymbol{j}=(2,5)$ , i.e.,

[TABLE]

and a set of indices whose corresponding hyperlink weights are non-zero, i.e., $\tilde{\mathcal{P}}_{n}^{(U)}=\{\boldsymbol{i}\in\tilde{\mathcal{I}}_{n}^{(U)}\mid w_{\boldsymbol{i}}\neq 0\}$ ; they are sets of candidate indices to be resampled. We uniformly and randomly choose $m_{+},m_{-}$ indices from the sets $\tilde{\mathcal{P}}_{n}^{(U)},\tilde{\mathcal{I}}_{n}^{(U)}$ , respectively, and denote the resulting sets as $\mathcal{P}_{\text{mini}}^{(U)},\mathcal{I}_{\text{mini}}^{(U)}$ ; they are used for computing the gradient (30) and updating the parameter by (26).
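The sampling step of Example 5 can be sketched as follows. This is not Algorithm 1 itself but a minimal illustration of it under stated assumptions: $\mathcal{I}_{n}^{(U)}=[n]^{U}$ , weights stored sparsely in a dictionary (missing keys mean zero weight), and uniform sampling without replacement. The function names and the toy weights are hypothetical.

```python
import itertools
import random

def candidate_indices(n, U, u, j):
    """All i in [n]^U whose u-th entries (1-based positions) are fixed to j."""
    fixed = dict(zip(u, j))
    free = [pos for pos in range(1, U + 1) if pos not in fixed]
    out = []
    for values in itertools.product(range(1, n + 1), repeat=len(free)):
        filled = dict(fixed)
        filled.update(zip(free, values))
        out.append(tuple(filled[pos] for pos in range(1, U + 1)))
    return out

def sample_minibatch(weights, n, U, u, j, m_plus, m_minus, rng):
    """Draw m_plus non-zero-weight indices and m_minus indices from all candidates."""
    cand = candidate_indices(n, U, u, j)
    positives = [i for i in cand if weights.get(i, 0) != 0]
    # Sampling without replacement is an assumption of this sketch.
    p_mini = rng.sample(positives, min(m_plus, len(positives)))
    i_mini = rng.sample(cand, min(m_minus, len(cand)))
    return p_mini, i_mini, cand, positives

rng = random.Random(0)
# Toy sparse weights: two tuples match the fixed entries (2, ., 5, .), one does not.
weights = {(2, 1, 5, 3): 1.0, (2, 4, 5, 4): 2.0, (1, 1, 1, 1): 1.0}
p_mini, i_mini, cand, positives = sample_minibatch(
    weights, n=7, U=4, u=(1, 3), j=(2, 5), m_plus=2, m_minus=5, rng=rng)
```

With $n=7$ and two free entries, the candidate set has $7^{2}=49$ indices. Enumerating it explicitly is only feasible for toy sizes; a practical implementation would sample the free entries directly.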

It is noteworthy that the sampling procedure in Algorithm 1 can efficiently pick up non-zero weights even if most of the weights $\{w_{\boldsymbol{i}}\}_{\boldsymbol{i}\in\mathcal{I}_{n}^{(U)}}$ are zero. Similarly to Mikolov et al. (2013) and Okuno and Shimodaira (2019), the gradient $g(\boldsymbol{\theta})$ at the iteration $t$ can be stochastically approximated by

[TABLE]

where the minibatch $\mathcal{M}^{(t)}:=(\tilde{\mathcal{P}}_{\text{mini}}^{(t)},\tilde{\mathcal{I}}_{\text{mini}}^{(t)},s_{+}^{(t)},s_{-}^{(t)})$ is obtained via Algorithm 1 and $\eta>0$ is a user-specified parameter. The coefficient $s^{(t)}_{-}=|\tilde{\mathcal{I}}_{n}^{(U)}|/|\tilde{\mathcal{I}}_{\text{mini}}^{(t)}|$ is needed for adjusting the first term in the stochastic gradient (30), since only the fixed size of minibatch $\tilde{\mathcal{I}}_{\text{mini}}^{(t)}$ is sampled from the set $\tilde{\mathcal{I}}_{n}^{(U)}$ whose size may depend on the selected $\boldsymbol{j}\in\mathcal{K}_{\boldsymbol{u}}$ . Similarly, $s^{(t)}_{+}=|\tilde{\mathcal{P}}_{n}^{(U)}|/|\tilde{\mathcal{P}}_{\text{mini}}^{(t)}|$ is needed for adjusting the second term. Although these coefficients $s^{(t)}_{+},s^{(t)}_{-}$ are required for theoretical development, they may be ignored in practice as explained later.
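The role of the coefficients $s_{+}^{(t)},s_{-}^{(t)}$ can be illustrated without reference to the exact form of (30). The sketch below only checks the generic fact that, when a minibatch is drawn uniformly without replacement, multiplying the minibatch sum by (size of the full set)/(size of the minibatch) gives an unbiased estimate of the full sum. The scalar "gradient terms" are toy values, and sampling without replacement is an assumption.

```python
from fractions import Fraction
from itertools import combinations

def scaled_minibatch_sum(values, batch):
    """s * (sum over the minibatch), with s = |full set| / |minibatch|."""
    s = Fraction(len(values), len(batch))
    return s * sum(values[k] for k in batch)

# Toy per-index gradient terms (scalars stand in for gradient vectors).
values = [Fraction(v) for v in (3, -1, 4, 1, -5, 9)]
m = 2
batches = list(combinations(range(len(values)), m))   # every minibatch of size m

# Exact expectation over uniformly chosen minibatches, via full enumeration.
expected = sum(scaled_minibatch_sum(values, b) for b in batches) / len(batches)
full_sum = sum(values)
```

Here `expected` equals `full_sum` exactly, which is why the scaling matters for the theory; dropping the coefficients rescales the two terms of the stochastic gradient rather than changing their directions.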

The computational complexity for the stochastic gradient (30) is $O(m_{+}+m_{-})$ , and it can be significantly less than the complexity $O(n^{U})$ of the fullbatch gradient (25), at least for each iteration. Moreover, the minibatch SGD (26) using (30) reaches approximately the optimal value within a reasonable number of iterations, as will be empirically demonstrated at the end of this section; thus, BHLR can be efficiently computed by the minibatch SGD.

The minibatch SGD equipped with Algorithm 1 and (30) can be applied to general $U\geq 2$ and $v\geq 0$ , and it encompasses several existing methods; in our context, it reduces to the minibatch SGD using the negative sampling for skip-gram (Mikolov et al., 2013) if $(U,v,\varphi,m_{+})=(2,1,\varphi_{\text{Logistic}},1)$ , and it also reduces to Okuno et al. (2018) and Okuno and Shimodaira (2019) if $(U,v,\varphi)=(2,0,\varphi_{\text{KL}}),(2,0,\varphi_{\beta})$ , respectively, where their sampling procedures are called “negative sampling: unigram” $(v=1)$ and “uniform link sampling” $(v=0)$ in Veitch et al. (2019). Other major stochastic algorithms such as AdaGrad (Duchi et al., 2011) and Adam (Kingma and Ba, 2014) can be employed as well, once the minibatch-based stochastic gradient (30) is formally defined with Algorithm 1.

Hereinafter, we discuss the asymptotics of the minibatch SGD when the number of iterations is sufficiently large, by employing Ghadimi and Lan (2013) Theorem 2.1 (a).

Whereas standard stochastic optimization algorithms determine the number of iterations $T$ in advance, Ghadimi and Lan (2013), for theoretical purposes, randomly choose the number of iterations $\tau$ from the set $[T]=\{1,2,\ldots,T\}$ with the probability $\mathbb{P}(\tau)$ , and update the parameter $\boldsymbol{\theta}$ within $\tau$ iterations. In this setting, the expectation of the stochastic gradient $\tilde{g}_{\eta}^{(\tau)}(\tilde{\boldsymbol{\theta}}^{(\tau)})$ is proved to approach $\boldsymbol{0}$ as $T\to\infty$ ; considering the Bregman divergence between the hyperlink weights multiplied by a user-specified constant $\eta>0$ and the similarities $\{\mu_{\boldsymbol{\theta}}(\boldsymbol{X}_{\boldsymbol{i}})\}_{\boldsymbol{i}\in\mathcal{I}_{n}^{(U)}}$ , i.e.,

[TABLE]

we apply Ghadimi and Lan (2013) to our setting, and show in the following Theorem 2 that the gradient of $Q_{\eta}(\boldsymbol{\theta})$ approaches $\boldsymbol{0}$ as $T$ increases.

For applying Ghadimi and Lan (2013), we further assume the following conditions (D-1)–(D-3):

(D-1) Differentiability of $Q_{\eta}(\boldsymbol{\theta})$ : the loss function $Q_{\eta}(\boldsymbol{\theta})$ defined in eq. (31) is differentiable with respect to $\boldsymbol{\theta}$ .

(D-2) Lipschitz continuity for the gradient of $Q_{\eta}(\boldsymbol{\theta})$ : using the coefficient $\alpha:=\begin{cases}|\mathcal{I}_{n}^{(U)}|/|\mathcal{K}_{\boldsymbol{u}}|&(v=1)\\ |\mathcal{I}_{n}^{(U)}|&(v=0)\\ \end{cases}$ , the gradient $\alpha\frac{\partial}{\partial\boldsymbol{\theta}}Q_{\eta}(\boldsymbol{\theta})$ is $H$ -Lipschitz continuous for some $H>0$ , i.e., $\|\alpha\frac{\partial}{\partial\boldsymbol{\theta}}Q_{\eta}(\boldsymbol{\theta})-\alpha\frac{\partial}{\partial\boldsymbol{\theta}}Q_{\eta}(\boldsymbol{\theta}^{\prime})\|_{2}\leq H\|\boldsymbol{\theta}-\boldsymbol{\theta}^{\prime}\|_{2},\>(\forall\boldsymbol{\theta},\boldsymbol{\theta}^{\prime}\in\boldsymbol{\Theta})$ .

(D-3) Bounded variance for the stochastic gradient: variance of the minibatch-based stochastic gradient $\tilde{g}^{(1)}_{\eta}(\boldsymbol{\theta})$ is uniformly bounded with respect to resampling the minibatch, i.e., $\sup_{\boldsymbol{\theta}\in\boldsymbol{\Theta}}\text{tr}\mathbb{V}_{\mathcal{M}^{(1)}}(\tilde{g}_{\eta}^{(1)}(\boldsymbol{\theta}))<\infty$ .

Symbols $\mathbb{E}_{\mathcal{M}^{(t)}}(\cdot),\mathbb{V}_{\mathcal{M}^{(t)}}(\cdot)$ represent the expectation and the variance-covariance matrix with respect to resampling the minibatch $\mathcal{M}^{(t)}=(\tilde{\mathcal{P}}_{\text{mini}}^{(t)},\tilde{\mathcal{I}}_{\text{mini}}^{(t)},s_{+}^{(t)},s_{-}^{(t)})$ , and $E_{\tau}(\cdot)$ takes expectation with respect to selecting $\tau\in[T]$ . $\text{tr}\boldsymbol{Z}$ represents the trace of the matrix $\boldsymbol{Z}=(z_{ij})\in\mathbb{R}^{p\times p}$ , i.e., $\text{tr}\boldsymbol{Z}=\sum_{i=1}^{p}z_{ii}$ .

(D-1)–(D-3) are assumed in Ghadimi and Lan (2013), and they are not unusually strong assumptions in our setting; when assuming (C-1) compactness of the parameter set $\boldsymbol{\Theta}$ , $Q_{\eta}(\boldsymbol{\theta})$ using any generating function listed in Table 1 and the similarity function (12) equipped with vector-valued neural networks $\boldsymbol{f}_{\boldsymbol{\theta}}:\mathcal{X}^{U}\to\mathbb{R}^{K}$ activated by sigmoid function, satisfies the assumptions (D-1)–(D-2). Then, (D-3) also holds since the stochastic gradient $\tilde{g}_{\eta}^{(1)}(\boldsymbol{\theta})$ is $C^{1}$ on the compact set $\boldsymbol{\Theta}$ and the minibatch $\mathcal{M}^{(t)}$ is a realization of random variable taking value in a finite set.

Theorem 2.

Let $m_{+},m_{-},q,T,U\in\mathbb{N},v\in\{0,1,\ldots,U-1\},\eta>0,\boldsymbol{\Theta}:=\mathbb{R}^{q}$ , let $\{\tilde{\boldsymbol{\theta}}^{(t)}\}_{t=1}^{T}$ be a sequence generated by the minibatch SGD (26), and assume the conditions (D-1)–(D-3). If $v\geq 1$ , let $\boldsymbol{u}$ be a vector in the set defined above for specifying $\boldsymbol{u}$ , and $p_{\boldsymbol{j}}:=1/|\mathcal{K}_{\boldsymbol{u}}|$ for all $\boldsymbol{j}\in\mathcal{K}_{\boldsymbol{u}}$ . By specifying $\gamma^{(t)}=\gamma t^{-1}$ with $\gamma\in(0,2/H)$ , and choosing the number of iterations $\tau\in[T]$ with the probability $\mathbb{P}(\tau=t)=\frac{2\gamma/t-H\gamma^{2}/t^{2}}{\sum_{t=1}^{T}(2\gamma/t-H\gamma^{2}/t^{2})}$ , it holds that

[TABLE]

See Appendix C.4 for the proof.
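The step sizes and the random stopping time in Theorem 2 are straightforward to compute. The following sketch evaluates $\mathbb{P}(\tau=t)\propto 2\gamma/t-H\gamma^{2}/t^{2}$ and draws $\tau$ from it; the function names are hypothetical, and the Lipschitz constant $H$ is treated as a known input, which in practice it rarely is.

```python
import numpy as np

def tau_distribution(T, gamma, H):
    """P(tau = t) proportional to 2*gamma/t - H*gamma**2/t**2 for t = 1, ..., T."""
    if not 0 < gamma < 2.0 / H:
        raise ValueError("gamma must lie in (0, 2/H)")
    t = np.arange(1, T + 1, dtype=float)
    # Each weight equals (gamma/t) * (2 - H*gamma/t), positive since H*gamma < 2.
    weights = 2.0 * gamma / t - H * gamma ** 2 / t ** 2
    return weights / weights.sum()

def draw_tau(T, gamma, H, rng):
    """Sample the number of iterations tau in {1, ..., T}."""
    return int(rng.choice(np.arange(1, T + 1), p=tau_distribution(T, gamma, H)))

p_small = tau_distribution(100, gamma=0.5, H=1.0)
p_large = tau_distribution(10000, gamma=0.5, H=1.0)
tau = draw_tau(100, gamma=0.5, H=1.0, rng=np.random.default_rng(0))
```

Because the normalizing sum grows roughly like $2\gamma\log T$ , the probability mass assigned to any fixed initial segment $\{1,\ldots,T^{\prime}\}$ shrinks as $T$ grows, so large $\tau$ becomes increasingly likely.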

Theorem 2 indicates that the gradient $\frac{\partial}{\partial\boldsymbol{\theta}}Q_{\eta}(\tilde{\boldsymbol{\theta}}^{(\tau)})=\frac{\partial}{\partial\boldsymbol{\theta}}D_{\varphi}(\{\eta w_{\boldsymbol{i}}\}_{\boldsymbol{i}\in\mathcal{I}_{n}^{(U)}},\{\mu_{\boldsymbol{\theta}}(\boldsymbol{X}_{\boldsymbol{i}})\}_{\boldsymbol{i}\in\mathcal{I}_{n}^{(U)}})\bigg|_{\boldsymbol{\theta}=\tilde{\boldsymbol{\theta}}^{(\tau)}}$ approaches $\boldsymbol{0}$ as $T\to\infty$ . Considering $\lim_{T\to\infty}\mathbb{P}(\tau\leq T^{\prime})=0$ for any fixed constant $T^{\prime}\in\mathbb{N}$ , indicating that large $\tau$ tends to be selected when $T$ is sufficiently large, the estimator $\tilde{\boldsymbol{\theta}}^{(t)}$ computed through the iterative update (26) approaches a set of stationary points of the function $D_{\varphi}(\{\eta w_{\boldsymbol{i}}\}_{\boldsymbol{i}\in\mathcal{I}_{n}^{(U)}},\{\mu_{\boldsymbol{\theta}}(\boldsymbol{X}_{\boldsymbol{i}})\}_{\boldsymbol{i}\in\mathcal{I}_{n}^{(U)}})$ as $t$ increases. Although the estimator can be trapped in local minimizers or saddle points during the iterative update, gradient descent using randomly perturbed gradients is proved to escape saddle points efficiently (Jin et al., 2017). Similar behavior is expected for minibatch SGD; the estimator may approach a good minimizer efficiently, depending on the situation. When the estimator approaches a global minimizer, under some assumptions, we can expect that

[TABLE]

for some sufficiently large $n,t\in\mathbb{N}$ , by considering Theorem 1 with $\mathbb{E}(\eta w_{\boldsymbol{i}}\mid\boldsymbol{X}_{\boldsymbol{i}})=\eta\mu_{*}(\boldsymbol{X}_{\boldsymbol{i}})$ . Although specifying $\eta=1$ appears better in terms of exactly recovering the underlying true similarity function $\mu_{*}$ , it is not necessarily so in practice; only the ratio $\mu_{\boldsymbol{\theta}}(\boldsymbol{X}_{\boldsymbol{i}})/\mu_{\boldsymbol{\theta}}(\boldsymbol{X}_{\boldsymbol{i}^{\prime}})$ is required to infer which of the tuples $\boldsymbol{X}_{\boldsymbol{i}},\boldsymbol{X}_{\boldsymbol{i}^{\prime}}$ exhibits a stronger relation. Thus $\eta$ can be arbitrarily specified by users. In practice, we may set $s_{+}^{(t)}=s_{-}^{(t)}=1,\eta=1$ in (30), which is justified if the ratio $|\tilde{\mathcal{I}}_{n}^{(U)}|/|\tilde{\mathcal{P}}_{n}^{(U)}|$ is constant; this in effect specifies $\eta=(|\tilde{\mathcal{I}}_{n}^{(U)}|m_{+})/(|\tilde{\mathcal{P}}_{n}^{(U)}|m_{-})$ in (32) and $\gamma^{(t)}$ being multiplied by $|\tilde{\mathcal{I}}_{n}^{(U)}|/m_{-}$ in (26).

It is noteworthy that Okuno and Shimodaira (2019) Theorem 3.2 already shows the convergence of the estimator $\tilde{\boldsymbol{\theta}}^{(t)}$ when $(U,v,\varphi)=(2,0,\varphi_{\beta})$ , by assuming that the loss function is locally strongly convex. However, Theorem 2 admits non-convex loss functions by considering not the convergence of the estimator $\tilde{\boldsymbol{\theta}}^{(t)}$ but that of the gradient $\frac{\partial}{\partial\boldsymbol{\theta}}Q(\tilde{\boldsymbol{\theta}}^{(t)})$ . As the objective function $Q(\boldsymbol{\theta})$ is typically unidentifiable when NNs are used therein, implying that strong convexity is rarely satisfied, Theorem 2 suits practical situations better than Okuno and Shimodaira (2019) Theorem 3.2. Furthermore, Theorem 2 can be applied to general $U\in\mathbb{N}$ , whereas only a few theoretical aspects of stochastic algorithms have been investigated even for $U=2$ (Veitch et al., 2019).

Here, we empirically demonstrate in Figure 4 that a stochastic optimization algorithm called Adam (Kingma and Ba, 2014), equipped with the proposed minibatch sampling procedure shown in Algorithm 1, appropriately optimizes the similarity function within a reasonable number of iterations.

6 Experiments

In this section, we describe the numerical experiments that we conducted on real-world datasets. In Section 6.1, we utilized the Boston housing dataset to perform the BHLR with $U=1$ , which corresponds to Poisson regression. In Sections 6.2 and 6.3, we employed the attributed DBLP co-authorship network dataset (Desmier et al., 2012) for performing the BHLR with $U=2$ and $U=3$ , corresponding to link regression and hyperlink regression, respectively.

Hereinafter, we incorporate a regularization $\varphi_{\text{KL}}(z)=z\log(z+\varepsilon)$ with a small constant $\varepsilon:=10^{-4}$ into the KL divergence, for numerically stabilizing the experimental results.
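As a minimal sketch of the regularized generator above, the following Python code evaluates $\varphi_{\text{KL}}(z)=z\log(z+\varepsilon)$ and the induced Bregman divergence $d_{\varphi}(a,b)=\varphi(a)-\varphi(b)-\varphi^{\prime}(b)(a-b)$ . The function names `phi_kl`, `dphi_kl`, and `bregman` are illustrative and not taken from the authors' implementation; the derivative is obtained by differentiating the stated generator by hand.

```python
import math

EPS = 1e-4  # the small constant epsilon stated in the text


def phi_kl(z, eps=EPS):
    """Regularized generator phi_KL(z) = z * log(z + eps), for z >= 0."""
    return z * math.log(z + eps)


def dphi_kl(z, eps=EPS):
    """Derivative of phi_kl with respect to z (hand-derived)."""
    return math.log(z + eps) + z / (z + eps)


def bregman(a, b, phi=phi_kl, dphi=dphi_kl):
    """Bregman divergence d_phi(a, b) = phi(a) - phi(b) - phi'(b) * (a - b)."""
    return phi(a) - phi(b) - dphi(b) * (a - b)


# The regularization keeps the divergence finite even when a weight is exactly 0.
d_zero_weight = bregman(0.0, 1.0)
```

Because $\varphi_{\text{KL}}^{\prime\prime}(z)=1/(z+\varepsilon)+\varepsilon/(z+\varepsilon)^{2}>0$ for $z\geq 0$ , the regularized generator remains strictly convex, so the divergence is non-negative and vanishes only when both arguments coincide.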

6.1 Poisson regression ( $U=1$ )

•

Dataset: We employ the Boston housing dataset (http://lib.stat.cmu.edu/datasets/boston, visited on June 13th, 2019) that contains $n=506$ samples, comprising $p=13$ dimensional standardized explanatory variables $\{\boldsymbol{x}_{i}\}_{i=1}^{506}\subset\mathbb{R}^{13}$ and non-negative-valued target variables $\{y_{i}\}_{i=1}^{506}\subset\mathbb{R}_{\geq 0}$ .

•

Architecture of $\mu_{\boldsymbol{\theta}}$ : A 1-hidden-layer multilayer perceptron (see, e.g., Bishop (2006) Chapter 5) with $1{,}000$ hidden units activated by the Rectified Linear Unit (ReLU), i.e., $\text{ReLU}(z):=\max\{0,z\}$ , and an unactivated $1$ -dimensional output unit is used for $f_{\boldsymbol{\theta}}:\mathbb{R}^{13}\to\mathbb{R}$ . Using the NN $f_{\boldsymbol{\theta}}$ , we define two different functions $\mu_{\boldsymbol{\theta}}(\boldsymbol{x}_{i}):=\exp(f_{\boldsymbol{\theta}}(\boldsymbol{x}_{i}))$ and $\mu_{\boldsymbol{\theta}}(\boldsymbol{x}_{i}):=f_{\boldsymbol{\theta}}(\boldsymbol{x}_{i})$ , where the former is restricted to positive values whereas the latter is not.

•

Learning $\mu_{\boldsymbol{\theta}}$ : The NN in the function $\mu_{\boldsymbol{\theta}}$ is trained through the BHLR with $U=1$ using fullbatch gradient descent with the training dataset.

•

Evaluation: The dataset is randomly divided into $3$ non-overlapping sets for training, validation, and test, whose sizes are $304~{}(60\%)$ , $101~{}(20\%)$ , and $101~{}(20\%)$ , respectively. We predict the target variables for the validation and test datasets, and the mean squared error between the predicted values $\{\mu_{\hat{\boldsymbol{\theta}}_{\varphi,n}}(\boldsymbol{X}_{\boldsymbol{i}})\}$ and the observed values $w_{\boldsymbol{i}}$ is recorded at each iteration of GD. After the final iteration, the test score at the iteration whose validation score is the best is recorded as the “optimal” test score. We repeat the experiment 100 times, and compute the sample average and the standard error of the optimal test scores, for each setting.

•

Baselines: We perform Poisson regression using a linear model and a simple linear regression that are already implemented in a Python statsmodels module (Seabold and Perktold, 2010). We also perform Poisson regression using a neural network (Fallah et al., 2009). (Random) We first compute the sample average $\hat{\mu}$ and the sample standard deviation $\hat{\sigma}$ for the target variables in each of the $100$ test datasets. For each, we generate random numbers from a normal distribution whose mean and standard deviation are $\hat{\mu},\hat{\sigma}$ , respectively, and evaluate the mean-squared error between the target variables in the test dataset and the generated random numbers. We repeat this evaluation 100 times for each of the $100$ test datasets, and compute the sample average and standard error.

Results: The experimental results are shown in Table 3. Although the linear methods are much better than the baseline (Random), NN-based methods outperformed the linear methods. Among the NN-based methods, using $\varphi_{\beta}$ with $\beta\geq 1$ , which corresponds to using $\beta$ -divergence, demonstrated better performance than $\varphi_{\text{KL}}$ . This result indicates that the classical loss function for the Poisson regression $L_{\varphi_{\text{KL}},n}(\boldsymbol{\theta})$ is not always the best choice for learning the function $\mu_{\boldsymbol{\theta}}$ .
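The evaluation protocol of this subsection (report the test score at the iteration with the best validation score, then summarize over repetitions) can be sketched as follows. This is an illustrative reading of the protocol rather than the authors' code; the helper names `optimal_test_score` and `mean_and_standard_error` are hypothetical, and "best" is taken to mean the lowest mean squared error.

```python
import math


def optimal_test_score(val_scores, test_scores):
    """Return the test score recorded at the iteration whose validation
    score is lowest (lower is better for mean squared error)."""
    if not val_scores or len(val_scores) != len(test_scores):
        raise ValueError("score lists must be non-empty and of equal length")
    best_iter = min(range(len(val_scores)), key=val_scores.__getitem__)
    return test_scores[best_iter]


def mean_and_standard_error(values):
    """Sample average and standard error (sample std / sqrt(n)) of the scores."""
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / (n - 1)
    return mean, math.sqrt(var / n)


# Toy usage with made-up numbers: three GD iterations in one repetition.
example = optimal_test_score([3.0, 1.0, 2.0], [30.0, 10.0, 20.0])
```

Selecting the iteration on the validation set only keeps the reported test score free of selection bias from the test set itself.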

6.2 Link regression ( $U=2$ )

•

Dataset: We utilize a network comprising $n=2{,}723$ attributed nodes and $37{,}322$ positive binary link weights, which aggregates $9$ snapshots of the DBLP dynamic co-authorship network dataset (Desmier et al., 2012). In the aggregated network, each binary link weight represents whether the corresponding authors have at least one co-authorship relation in the $9$ snapshots; $w_{i_{1}i_{2}}=1$ if the authors $i_{1}$ and $i_{2}$ have the relation, and $w_{i_{1}i_{2}}=0$ otherwise. Each node has a $p=43$ dimensional data vector, representing the number of publications, summed up over the $9$ snapshots, in each of the selected 43 journals/conferences.

•

Similarity function architecture: Vector-valued NN $\boldsymbol{f}_{\boldsymbol{\theta}}:\mathbb{R}^{43}\to\mathbb{R}^{K}$ is a $1$ -hidden-layer multilayer perceptron with $1{,}000$ hidden units activated by the ReLU and $K$ unactivated output units. Using $\boldsymbol{f}_{\boldsymbol{\theta}}$ , we exploit a similarity function $\mu_{\boldsymbol{\theta}}(\boldsymbol{X}_{\boldsymbol{i}}):=\sigma(\langle\boldsymbol{f}_{\boldsymbol{\theta}}(\boldsymbol{x}_{i_{1}}),\boldsymbol{f}_{\boldsymbol{\theta}}(\boldsymbol{x}_{i_{2}})\rangle)$ , where $\sigma(z):=(1+\exp(-z))^{-1}$ is a sigmoid function.

•

Learning similarity functions: The NN $\boldsymbol{f}_{\boldsymbol{\theta}}$ in the similarity function is trained by the Adam optimizer (Kingma and Ba, 2014) using Algorithm 1 for minibatch sampling. For computing the stochastic gradient (30), we utilize $s_{+}^{(t)}=s_{-}^{(t)}=1,\eta=1$ , and batch sizes $(m_{+},m_{-})$ are selected from the set $\{(1,15),(3,13),(6,10),(10,6)\}$ . For each of the batch sizes $(m_{+},m_{-})$ , the weight decay is grid searched over $\{10^{-2},10^{-3}\}$ .

•

Evaluation: The set of data vectors is randomly divided into $3$ non-overlapping sets for training, validation, and test, whose sizes are $n_{\text{train}}=1,907~{}(70\%),\,n_{\text{valid}}=408~{}(15\%),\,n_{\text{test}}=408~{}(15\%)$ . In the test dataset, $10$ pairs are sampled from the set $\{\boldsymbol{i}=(i_{1},i_{2})\mid w_{\boldsymbol{i}}^{(\text{test})}=0\}$ for each $i_{1}=1,2,\ldots,n_{\text{test}}$ , and combined with positive pairs $\{\boldsymbol{i}=(i_{1},i_{2})\mid w_{\boldsymbol{i}}^{(\text{test})}>0\}$ ; we compute the ROC-AUC score (Bradley, 1997) using these link weights, and record the scores for each of the $50$ iterations. Similarly, we compute the ROC-AUC score for the validation dataset. At the end of the iteration ( $T=3n_{\text{train}}$ ), we record the test score whose validation score is the best. We repeat this experiment $40$ times, and compute the sample average and the standard error for each $(m_{+},m_{-})$ ; the best validated score amongst all $(m_{+},m_{-})$ is also computed.

•

Baselines: We employ LINE (Tang et al., 2015), KL-GE (Okuno et al., 2018), and $\beta$ -GE (Okuno and Shimodaira, 2019) that correspond to the BHLR equipped with $L_{\varphi_{\text{Logistic}},n}(\boldsymbol{\theta})$ , $L_{\varphi_{\text{KL}},n}(\boldsymbol{\theta})$ , and $L_{\varphi_{\beta},n}(\boldsymbol{\theta})$ , respectively. LPP (He and Niyogi, 2004) is also applied to obtain the linearly transformed feature vectors $\tilde{\boldsymbol{y}}_{i}:=\hat{\boldsymbol{A}}^{\top}\boldsymbol{x}_{i}\>(i\in[n])$ . Subsequently, similarities for the feature vectors are computed by $\mu_{\boldsymbol{\theta}}(\boldsymbol{X}_{\boldsymbol{i}})=\sigma(\langle\tilde{\boldsymbol{y}}_{i_{1}},\tilde{\boldsymbol{y}}_{i_{2}}\rangle)$ .

Results: The experimental results are shown in Table 4. Overall, the NN-based methods outperformed LPP, as the NN is highly expressive whereas LPP is linear. In addition, NN-based methods demonstrated better performance by increasing the dimension $K$ of the feature vectors, unlike LPP, which imposes a quadratic constraint on the feature vectors $\{\boldsymbol{y}_{i}\}_{i=1}^{n}$ . Overall, the exponential divergence and logistic loss demonstrated good performances; particularly, the exponential divergence demonstrated the best performance among the KL divergence, $\beta$ -divergence, logistic loss, dual logistic loss, and exponential divergence employed in this experiment. In terms of selecting $m_{+}$ and $m_{-}$ , in this case, using more than one positive minibatch sample $(m_{+}>1)$ is better.
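For illustration, the similarity $\sigma(\langle\boldsymbol{y},\boldsymbol{y}^{\prime}\rangle)$ used in this subsection and the ROC-AUC score computed from positive and sampled negative pairs can be sketched in plain Python. The sketch takes already-computed feature vectors as input (the NN $\boldsymbol{f}_{\boldsymbol{\theta}}$ is not reproduced here), and it computes ROC-AUC through its standard pairwise-comparison form; the function names are hypothetical.

```python
import math


def sigmoid(z):
    """Numerically stable sigmoid sigma(z) = 1 / (1 + exp(-z))."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def pair_similarity(y1, y2):
    """Similarity sigma(<y1, y2>) between two feature vectors."""
    return sigmoid(sum(a * b for a, b in zip(y1, y2)))


def roc_auc(pos_scores, neg_scores):
    """ROC-AUC as the probability that a positive pair is scored above a
    negative pair, counting ties as one half."""
    wins = 0.0
    for p in pos_scores:
        for q in neg_scores:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))
```

The pairwise form costs $O(|\text{pos}|\cdot|\text{neg}|)$ comparisons, which is acceptable for a sketch; a rank-based implementation would be preferable for large evaluation sets.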

6.3 Hyperlink regression ( $U=3$ )

The experimental settings are almost the same as those of $U=2$ . We employ the same dataset used in Section 6.2, and compute synthetic hyperlink weights from their link weights.

•

Similarity function architecture: using $\boldsymbol{f}_{\boldsymbol{\theta}}$ defined in Section 6.2, we exploit a similarity function: $\mu_{\boldsymbol{\theta}}(\boldsymbol{X}_{\boldsymbol{i}}):=\sigma\left(\langle\boldsymbol{f}_{\boldsymbol{\theta}}(\boldsymbol{x}_{i_{1}}),\boldsymbol{f}_{\boldsymbol{\theta}}(\boldsymbol{x}_{i_{2}}),\boldsymbol{f}_{\boldsymbol{\theta}}(\boldsymbol{x}_{i_{3}})\rangle\right)$ , where $\langle\boldsymbol{y},\boldsymbol{y}^{\prime},\boldsymbol{y}^{\prime\prime}\rangle=\sum_{k=1}^{K}y_{k}y^{\prime}_{k}y^{\prime\prime}_{k}$ . Similarity functions are trained and evaluated similarly to those of $U=2$ .

•

Evaluation: We first divide the set of data vectors into training, validation, and test sets, similarly to $U=2$ . However, these datasets contain only the link weights ( $U=2$ ) but not hyperlink weights ( $U=3$ ); in each of the datasets, we compute synthetic hyperlink weights $\boldsymbol{W}:=(w_{\boldsymbol{i}})$ in two different ways:

(a)

$w_{\boldsymbol{i}}=w_{i_{1}i_{2}i_{3}}=1$ if $\boldsymbol{x}_{i_{1}},\boldsymbol{x}_{i_{2}},\boldsymbol{x}_{i_{3}}$ are connected, i.e., a path exists between any two of the nodes in $\boldsymbol{i}=(i_{1},i_{2},i_{3})$ , and $w_{\boldsymbol{i}}=0$ otherwise.

(b)

$w_{\boldsymbol{i}}=w_{i_{1}i_{2}i_{3}}=1$ if $\boldsymbol{x}_{i_{1}},\boldsymbol{x}_{i_{2}},\boldsymbol{x}_{i_{3}}$ are fully connected, i.e., every pair of nodes in $\boldsymbol{i}=(i_{1},i_{2},i_{3})$ is connected, and $w_{\boldsymbol{i}}=0$ otherwise.

In the test dataset, $15$ tuples are sampled from the set $\{\boldsymbol{i}=(i_{1},i_{2},i_{3})\mid w^{\text{(test)}}_{\boldsymbol{i}}=0\}$ for each $i_{1}=1,2,\ldots,n_{\text{test}}$ , and combined with positive tuples $\{\boldsymbol{i}=(i_{1},i_{2},i_{3})\mid w^{(\text{test})}_{\boldsymbol{i}}>0\}$ . Using these tuples, we evaluate the experimental results by the ROC-AUC score, similarly to $U=2$ .

•

Baseline: We employ HIMFAC (Nori et al., 2012) for obtaining the linearly transformed feature vectors $\tilde{\boldsymbol{y}}_{i}:=\hat{\boldsymbol{A}}^{\top}\boldsymbol{x}_{i}\>(i\in[n])$ . Subsequently, similarities for the feature vectors are computed by (i) $\mu_{\boldsymbol{\theta}}(\boldsymbol{X}_{\boldsymbol{i}}):=\sigma(\langle\tilde{\boldsymbol{y}}_{i_{1}},\tilde{\boldsymbol{y}}_{i_{2}},\tilde{\boldsymbol{y}}_{i_{3}}\rangle)$ and (ii) $\mu_{\boldsymbol{\theta}}(\boldsymbol{X}_{\boldsymbol{i}}):=\sigma(\sum_{1\leq k<l\leq 3}\langle\tilde{\boldsymbol{y}}_{i_{k}},\tilde{\boldsymbol{y}}_{i_{l}}\rangle)$ .

Results: The experimental results are shown in Table 5 for the setting (a) and Table 6 for (b). Overall, the NN-based methods outperformed HIMFAC, since the NN is highly expressive whereas HIMFAC is linear. NN-based methods demonstrated a slight improvement by increasing the dimension $K$ of the feature vectors. There is a significant difference between the settings (a) and (b) for HIMFAC, unlike for the NN-based methods. Regarding the setting (a), the logistic loss, exponential divergence and $\beta$ -divergence with $\beta=1$ demonstrated good performances for $K=10$ . On the other hand, the $\beta$ -divergence with $\beta=0.5$ and KL-divergence, whose scores for $K=10$ were not that high, demonstrated good performance for $K=40$ ; the experimental results thus depend on the choice of $K$ . HIMFAC with (i) demonstrates a low performance, since its feature vectors are obtained via LPP, which is based on the simple inner product $\langle\boldsymbol{y},\boldsymbol{y}^{\prime}\rangle$ whereas (i) is based on the similarity for triplets $\langle\boldsymbol{y},\boldsymbol{y}^{\prime},\boldsymbol{y}^{\prime\prime}\rangle$ . On the other hand, HIMFAC with (ii) demonstrates much higher performance than (i), since HIMFAC is compatible with the simple inner product. In terms of selecting $m_{+}$ and $m_{-}$ , in this case, using more than one positive minibatch sample $(m_{+}>1)$ is better. Regarding the setting (b), the tendency of the results is almost the same as in the setting (a).
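The two constructions of synthetic hyperlink weights and the triple inner product of this subsection may be sketched as below. This is only one possible reading: for setting (a), connectivity is assumed to be judged within the subgraph induced by the three nodes (so at least two of the three possible links must be present), which the text does not state explicitly. The names `hyperlink_weights` and `triple_inner` are hypothetical.

```python
from itertools import combinations


def hyperlink_weights(edges, nodes, fully_connected):
    """Synthetic weights w_i for all triples i = (i1, i2, i3) with i1 < i2 < i3.

    edges: iterable of 2-tuples with link weight 1 (undirected).
    fully_connected=False -> setting (a): the three nodes form a connected
        induced subgraph (assumption: at least 2 of the 3 links are present).
    fully_connected=True  -> setting (b): all 3 links are present.
    """
    edge_set = {frozenset(e) for e in edges}
    threshold = 3 if fully_connected else 2
    weights = {}
    for triple in combinations(sorted(nodes), 3):
        n_links = sum(frozenset(p) in edge_set for p in combinations(triple, 2))
        weights[triple] = 1 if n_links >= threshold else 0
    return weights


def triple_inner(y1, y2, y3):
    """Generalized inner product <y, y', y''> = sum_k y_k * y'_k * y''_k."""
    return sum(a * b * c for a, b, c in zip(y1, y2, y3))


# Toy graph: a path 0 - 1 - 2 plus an isolated node 3.
w_a = hyperlink_weights([(0, 1), (1, 2)], range(4), fully_connected=False)
w_b = hyperlink_weights([(0, 1), (1, 2)], range(4), fully_connected=True)
```

On three nodes, any two distinct links share a node, so counting at least two links is equivalent to the induced subgraph being connected; setting (b) is therefore always at least as strict as setting (a).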

7 Conclusion and future works

In this study, we considered hyperlink weight $w_{\boldsymbol{i}}$ defined for $U$ -tuple $\boldsymbol{X}_{\boldsymbol{i}}$ that is a collection of $U$ data vectors $(\boldsymbol{x}_{i_{1}},\boldsymbol{x}_{i_{2}},\ldots,\boldsymbol{x}_{i_{U}})$ . The hyperlink weights are assumed to be symmetric with respect to permutation of the entries $i_{1},i_{2},\ldots,i_{U}$ in the index. We proposed the BHLR that learns a user-specified symmetric similarity function $\mu_{\boldsymbol{\theta}}(\boldsymbol{X}_{\boldsymbol{i}})$ such that it predicts a tuple’s hyperlink weight $w_{\boldsymbol{i}}$ through data vectors $(\boldsymbol{x}_{i_{1}},\boldsymbol{x}_{i_{2}},\ldots,\boldsymbol{x}_{i_{U}})$ stored in the corresponding $U$ -tuple $\boldsymbol{X}_{\boldsymbol{i}}$ . The BHLR encompassed various existing methods such as logistic regression ( $U=1$ ), Poisson regression ( $U=1$ ), graph embedding ( $U=2$ ), matrix factorization ( $U=2$ ), stochastic block model ( $U=2$ ), tensor factorization ( $U\geq 2$ ), and their variants equipped with arbitrary BD. We provided theoretical guarantees for BHLR including several existing methods, in the sense that general BHLR possessed the following two favorable properties: (P-1) statistical consistency and (P-2) computational tractability. A novel minibatch-sampling procedure for hyper-relations and a theoretical guarantee for the entire stochastic optimization were also provided.

For future work, it would be worthwhile to simultaneously learn several BHLRs with different sizes of tuples; it is straightforward to modify our method to incorporate several $U$ values. Because a single BHLR first fixes the tuple size $U\in\mathbb{N}$ , the association strengths for the different sizes of tuples cannot be measured by the similarity function. Although we empirically demonstrated the BHLR only for $U=1,2,3$ in this study, a BHLR with a larger $U$ can be conducted, and it would be natural to learn tuples with several sizes at the same time.

Another interesting direction is designing a better similarity function for $U$ -tuples. Although we employed limited forms of similarity functions in our numerical experiments in the current study, arbitrary similarity functions can be employed for the BHLR. We are especially interested in identifying highly expressive similarity functions for capturing the underlying complicated data structure. Some recent studies (Okuno et al., 2018, 2019; Kim et al., 2019) demonstrated that the inner product similarity used in graph embedding ( $U=2$ ) exhibited a limited representation capability, and more expressive similarities have been proposed; their results may be simply generalized to the setting of the BHLR with general $U\in\mathbb{N}$ .

The last direction is to apply the proposed BHLR to larger-scale hypernetworks. Although the BHLR is already demonstrated on several thousands of nodes in our numerical experiments, a more efficient implementation is required for conducting the BHLR on much larger hypernetworks.

Acknowledgement

This work was partially supported by JSPS KAKENHI grant 16H02789 to HS, and 17J03623 to AO.

Appendix A Remaining related works

In this section, we describe the remaining related works, that are not listed in Section 4.4.

For $U=2$ ,

•

Metric learning (Bellet et al., 2013) is a type of similarity learning that captures the discrepancy between two data vectors $\boldsymbol{x}_{i_{1}},\boldsymbol{x}_{i_{2}}$ by some metric function. Many existing methods consider the Mahalanobis distance and Mahalanobis inner product $\boldsymbol{x}_{i_{1}}^{\top}\boldsymbol{M}\boldsymbol{x}_{i_{2}}$ where $\boldsymbol{M}\in\mathbb{R}^{p\times p}$ is a non-negative definite matrix to be estimated. Owing to the decomposition $\boldsymbol{M}=\boldsymbol{\theta}\boldsymbol{\theta}^{\top}$ with $\boldsymbol{\theta}\in\mathbb{R}^{p\times K}$ , the Mahalanobis inner product measures the inner product similarity between $\boldsymbol{\theta}^{\top}\boldsymbol{x}_{i_{1}}$ and $\boldsymbol{\theta}^{\top}\boldsymbol{x}_{i_{2}}$ ; obtaining such a linear transformation $\boldsymbol{x}\mapsto\boldsymbol{\theta}^{\top}\boldsymbol{x}$ is also known as graph embedding. Although the Mahalanobis metric/similarity learning above is an HLR similarly to graph embedding, it is not exactly a BHLR as most of the existing studies employ loss functions that are not exactly consistent with the BD, such as triplet loss and margin-based loss functions. However, some margin-based loss functions can be written in the form of BD by removing the strict convexity assumption of $\varphi$ , as explained in Section 2.
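The identity behind the decomposition $\boldsymbol{M}=\boldsymbol{\theta}\boldsymbol{\theta}^{\top}$ can be checked numerically. The following NumPy sketch uses randomly generated $\boldsymbol{\theta}$ and data vectors (illustrative values only; the dimensions $p=5$ and $K=2$ are arbitrary) and confirms that $\boldsymbol{x}_{i_{1}}^{\top}\boldsymbol{M}\boldsymbol{x}_{i_{2}}$ equals the inner product of the linearly transformed vectors.

```python
import numpy as np


def mahalanobis_inner(x1, x2, M):
    """Mahalanobis inner product x1^T M x2."""
    return float(x1 @ M @ x2)


def embedded_inner(x1, x2, theta):
    """Inner product <theta^T x1, theta^T x2> of the embedded vectors."""
    return float((theta.T @ x1) @ (theta.T @ x2))


rng = np.random.default_rng(0)
p, K = 5, 2                       # arbitrary illustrative dimensions
theta = rng.normal(size=(p, K))   # linear transformation x -> theta^T x
M = theta @ theta.T               # non-negative definite by construction
x1, x2 = rng.normal(size=p), rng.normal(size=p)

lhs = mahalanobis_inner(x1, x2, M)
rhs = embedded_inner(x1, x2, theta)
```

Here `lhs` and `rhs` agree up to floating-point error, since $\boldsymbol{x}_{i_{1}}^{\top}\boldsymbol{\theta}\boldsymbol{\theta}^{\top}\boldsymbol{x}_{i_{2}}=(\boldsymbol{\theta}^{\top}\boldsymbol{x}_{i_{1}})^{\top}(\boldsymbol{\theta}^{\top}\boldsymbol{x}_{i_{2}})$ .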

For $U\geq 2$ ,

•

Hyperlink prediction using latent social features (HPLSF) (Xu et al., 2013) first computes the entropy of data vectors. Let $\boldsymbol{z}_{\boldsymbol{i}}=(z_{\boldsymbol{i}1},z_{\boldsymbol{i}2},\ldots,z_{\boldsymbol{i}p})\in\mathbb{R}^{p}$ be a vector of entropy for each tuple $\boldsymbol{X}_{\boldsymbol{i}}$ such that the $j$ -th entry $z_{\boldsymbol{i}j}$ ( $j=1,\ldots,p$ ) is defined as the entropy of $\{x_{i_{1}j},x_{i_{2}j},\ldots,x_{i_{U}j}\}\subset\mathbb{R}$ , where $\boldsymbol{x}_{i}:=(x_{i1},x_{i2},\ldots,x_{ip})\in\mathbb{R}^{p},\>i\in[n]$ . Subsequently, hyperlink weight $w_{\boldsymbol{i}}$ can be predicted through the single vector $\boldsymbol{z}_{\boldsymbol{i}}$ ; applying a structural SVM results in a hyperlink prediction. As the SVM finally predicts the target label $w_{\boldsymbol{i}}$ through the similarity function $\mu_{\boldsymbol{\theta}}(\boldsymbol{X}_{\boldsymbol{i}}):=\langle\boldsymbol{\theta},\boldsymbol{\Psi}(\boldsymbol{z}_{\boldsymbol{i}})\rangle$ with a high-dimensional feature map $\boldsymbol{\Psi}:\mathbb{R}^{p}\to\mathbb{R}^{p^{\prime}}$ , the HPLSF is an HLR. However, the similarity function is typically trained with some loss functions that are not consistent with the BD; the HPLSF is not exactly included in the BHLR.

•

Coordinated matrix minimization (CMM) (Zhang et al., 2018) efficiently infers a subset of user-specified candidate hyperlinks that are the most suitable to fill the training hypernetworks using a low-rank approximation. However, CMM can find hyperlinks only among the training nodes, implying that it cannot be used for obtaining hyperlinks among test nodes outside the training dataset. CMM is neither an HLR nor a BHLR.

•

Deep Sets (Zaheer et al., 2017) provides a permutation invariant expressive similarity function $\tilde{\mu}_{\boldsymbol{\theta}}:2^{\mathcal{X}}\to\mathbb{R}$ defined for sets of data vectors. The function $\tilde{\mu}_{\boldsymbol{\theta}}$ is trained by leveraging KL-divergence and logistic loss, whereas BHLR is equipped with arbitrary Bregman divergence. Although the similarity function of Deep Sets can be used for BHLR, the functional form is more restrictive than those considered in our setting. As the price paid for allowing vector sets of arbitrary size, their Theorem 2 proves that a function $\tilde{\mu}_{\boldsymbol{\theta}}:2^{\mathcal{X}}\to\mathbb{R}$ is permutation invariant if and only if $\tilde{\mu}_{\boldsymbol{\theta}}$ is in the form of $\tilde{\mu}_{\boldsymbol{\theta}}(\boldsymbol{x}_{i_{1}},\boldsymbol{x}_{i_{2}},\ldots)=\rho_{\boldsymbol{\theta}}(\sum_{u}\phi_{\boldsymbol{\theta}}(\boldsymbol{x}_{i_{u}}))$ for some functions $\rho_{\boldsymbol{\theta}}$ and $\phi_{\boldsymbol{\theta}}$ , by assuming that the set $\mathcal{X}$ is countable, or the dimension of $\mathcal{X}$ is $1$ .
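The sum-decomposed form $\rho_{\boldsymbol{\theta}}(\sum_{u}\phi_{\boldsymbol{\theta}}(\boldsymbol{x}_{i_{u}}))$ can be illustrated with a small sketch. The maps `phi` and `rho` below are fixed toy functions chosen for illustration, not the trained networks of Deep Sets; the point is only that pooling by summation makes the output independent of the order of the inputs.

```python
import math
from itertools import permutations


def phi(x):
    """Toy per-vector feature map (hypothetical): x -> (x, x^2) entrywise."""
    return list(x) + [v * v for v in x]


def rho(pooled):
    """Toy read-out function (hypothetical) applied to the pooled features."""
    return math.tanh(sum(pooled))


def set_similarity(vectors):
    """rho(sum_u phi(x_u)): summation over the set removes any order dependence."""
    feats = [phi(x) for x in vectors]
    pooled = [math.fsum(column) for column in zip(*feats)]
    return rho(pooled)


tuple_of_vectors = [(0.1, -0.2), (0.3, 0.4), (-0.5, 0.05)]
values = [set_similarity(list(perm)) for perm in permutations(tuple_of_vectors)]
```

All entries of `values` coincide up to floating-point error, and the same function accepts tuples of any size $U$ .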

Appendix B Tensor factorization (TF) is a special case of BHLR

As explained in Section 4.3, tensor factorization (TF) (Cichocki et al., 2009) decomposes a given tensor $\boldsymbol{V}=(v_{\boldsymbol{j}})\in\mathbb{R}^{n_{1}\times n_{2}\times\cdots\times n_{U}}$ into matrices $\boldsymbol{\xi}^{(u)}=(\xi^{(u)}_{ik})\in\mathbb{R}^{n_{u}\times K}$ , by minimizing the BD between the entries of $\boldsymbol{V}$ and $[\![\boldsymbol{\xi}^{(1)},\boldsymbol{\xi}^{(2)},\ldots,\boldsymbol{\xi}^{(U)}]\!]$ whose $\boldsymbol{j}=(j_{1},j_{2},\ldots,j_{U})$ -th entry is specified as $\sum_{k=1}^{K}\xi^{(1)}_{j_{1}k}\xi^{(2)}_{j_{2}k}\cdots\xi^{(U)}_{j_{U}k}$ . Namely, TF minimizes the BD

[TABLE]

where $\langle\boldsymbol{y},\boldsymbol{y}^{\prime},\boldsymbol{y}^{\prime\prime}\ldots\rangle:=\sum_{k=1}^{K}y_{k}y_{k}^{\prime}y_{k}^{\prime\prime}\cdots$ , and $\boldsymbol{\xi}_{l}^{(u)}=(\xi_{l1}^{(u)},\xi_{l2}^{(u)},\ldots,\xi_{lK}^{(u)})\>(l\in[n_{u}])$ are the row vectors of the matrix $\boldsymbol{\xi}^{(u)}$ . Subsequently, we can expect that $v_{\boldsymbol{j}}\approx\langle\boldsymbol{\xi}_{j_{1}}^{(1)},\boldsymbol{\xi}_{j_{2}}^{(2)},\ldots,\boldsymbol{\xi}_{j_{U}}^{(U)}\rangle$ for all $\boldsymbol{j}\in[n_{1}]\times[n_{2}]\times\cdots\times[n_{U}]$ .

To show that BHLR includes TF ( $U\geq 2$ ), we first briefly review the relation between BHLR and MF ( $U=2$ ), which is explained in Section 4.2. In the case of $U=2$ , factorizing the matrix $\boldsymbol{V}$ corresponds to BHLR using

[TABLE]

that is defined in eq. (20). The link weights (36) indicate $v_{\boldsymbol{j}}=v_{j_{1},j_{2}}=w_{j_{1},n_{1}+j_{2}}=w_{\boldsymbol{i}}$ ; indices of the matrix $\boldsymbol{V}=(v_{\boldsymbol{j}})$ are formally transformed into those of the matrix $\boldsymbol{W}=(w_{\boldsymbol{i}})$ , by utilizing the conversion $\mathcal{F}:(j_{1},j_{2})\mapsto(j_{1},n_{1}+j_{2})=:(i_{1},i_{2})$ . Although this conversion only considers the correspondence between $\boldsymbol{V}$ and the upper-right part of the matrix $\boldsymbol{W}$ , the lower-left part is specified by the symmetry of $\boldsymbol{W}$ . In the case of $U\geq 2$ , we generalize the conversion as

[TABLE]

whose inverse $\mathcal{F}^{-1}$ can be defined over a set

[TABLE]

such that $\mathcal{F}^{-1}:\mathcal{C}(n_{1},n_{2},\ldots,n_{U})\ni\boldsymbol{i}\mapsto\boldsymbol{j}\in[n_{1}]\times[n_{2}]\times\cdots\times[n_{U}]$ . Since $\mathcal{F}^{-1}$ converts the indices of $\boldsymbol{W}=(w_{\boldsymbol{i}})$ to those of $\boldsymbol{V}=(v_{\boldsymbol{j}})$ , we may specify the hyperlink weights as $w_{\boldsymbol{i}}:=v_{\mathcal{F}^{-1}(\boldsymbol{i})}$ for all $\boldsymbol{i}\in\mathcal{C}(n_{1},n_{2},\ldots,n_{U})$ , similarly to $U=2$ .

Although the above specification is essentially sufficient for describing the relation between BHLR and TF, the hyperlink weights $\boldsymbol{W}=(w_{\boldsymbol{i}})$ are assumed to be symmetric as explained in Section 3.1. The symmetry can be realized by considering the non-decreasing order permutation $r(\boldsymbol{i})$ defined for any $\boldsymbol{i}$ ; a tensor $\boldsymbol{W}=(w_{\boldsymbol{i}})\in\mathbb{R}^{N^{U}}$ ( $N:=\sum_{u=1}^{U}n_{u}$ ), whose entries are specified as

[TABLE]

simultaneously satisfies the symmetry $w_{\boldsymbol{i}}=w_{\boldsymbol{i}^{\prime}}$ for any $\boldsymbol{i}^{\prime}\in[N]^{U}$ obtained by permuting the entries of $\boldsymbol{i}\in[N]^{U}$ , and the above specification $w_{\boldsymbol{i}}=v_{\mathcal{F}^{-1}(\boldsymbol{i})}$ for any $\boldsymbol{i}\in\mathcal{C}(n_{1},n_{2},\ldots,n_{U})$ . Therefore, (37) generalizes (36) from the case of $U=2$ to $U\geq 2$ .

Using the hyperlink weights (37), the parameter $\boldsymbol{\theta}=(\boldsymbol{\xi}^{(1)\top},\boldsymbol{\xi}^{(2)\top},\ldots,\boldsymbol{\xi}^{(U)\top})^{\top}\in\mathbb{R}^{N\times K}$ , and one-hot vector $\boldsymbol{x}_{i}\in\{0,1\}^{N}$ whose $i$ -th entry is $1$ and $0$ otherwise ( $i\in[N]$ ), we have

[TABLE]

generalizing eq. (21) from $U=2$ to $U\geq 2$ . Therefore, TF that minimizes the BD objective stated at the beginning of this appendix is equivalent to BHLR minimizing the objective displayed immediately above; TF is a special case of BHLR.
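The correspondence can be checked numerically on a small example. In the NumPy sketch below, the index conversion is assumed to shift the $u$ -th index by $n_{1}+\cdots+n_{u-1}$ , which is the natural generalization of the $U=2$ conversion $(j_{1},j_{2})\mapsto(j_{1},n_{1}+j_{2})$ given in the text; indices are 0-based in the code, and all function names are hypothetical.

```python
from itertools import product

import numpy as np


def index_conversion(j, sizes):
    """Assumed conversion F: shift the u-th tensor index by n_1 + ... + n_{u-1}
    (0-based indices), mapping j to node indices i in [N]."""
    offsets = np.concatenate(([0], np.cumsum(sizes)[:-1]))
    return tuple(int(o + jj) for o, jj in zip(offsets, j))


def tf_entry(factors, j):
    """TF reconstruction sum_k xi^(1)[j_1, k] * ... * xi^(U)[j_U, k]."""
    rows = [xi[jj] for xi, jj in zip(factors, j)]
    return float(np.sum(np.prod(rows, axis=0)))


def bhlr_similarity(theta, i):
    """<theta^T x_{i_1}, ..., theta^T x_{i_U}> with one-hot data vectors x_i."""
    N = theta.shape[0]
    feats = [theta.T @ np.eye(N)[idx] for idx in i]  # one-hot picks a row of theta
    return float(np.sum(np.prod(feats, axis=0)))


rng = np.random.default_rng(0)
sizes, K = (2, 3, 4), 3                        # n_1, n_2, n_3 and rank K
factors = [rng.normal(size=(n, K)) for n in sizes]
theta = np.vstack(factors)                     # N x K stacked parameter, N = 9

max_gap = max(
    abs(tf_entry(factors, j) - bhlr_similarity(theta, index_conversion(j, sizes)))
    for j in product(*(range(n) for n in sizes))
)
```

The resulting `max_gap` is at the level of floating-point error, reflecting that multiplying $\boldsymbol{\theta}^{\top}$ by a one-hot vector simply selects the corresponding row of the stacked factor matrices.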

Appendix C Proofs

In C.1, we first state and prove Theorem 3, which is a law of large numbers for multiply-indexed partially-dependent random variables. In C.2, we prove Proposition 1 by applying Theorem 3. In C.3, we prove Theorem 1, indicating that BHLR asymptotically recovers the underlying conditional expectation of link weights as $n\to\infty$ . In C.4, we finally prove Theorem 2, showing the asymptotics of the minibatch SGD using the proposed Algorithm 1, as $T\to\infty$ .

C.1 Preliminary for proofs

Theorem 3.

Let $\boldsymbol{Z}:=(Z_{\boldsymbol{i}})$ be an array of random variables $Z_{\boldsymbol{i}}\in\mathcal{Z}$ , $\boldsymbol{i}\in\mathcal{I}_{n}^{(U)}=\mathcal{J}_{n}^{(U)}:=\{(i_{1},i_{2},\ldots,i_{U})\mid 1\leq i_{1}<i_{2}<\cdots<i_{U}\leq n\}$ , and $h:\mathcal{Z}\to\mathbb{R}$ be a continuous function. We assume that $Z_{\boldsymbol{i}}$ is independent of $Z_{\boldsymbol{j}}$ if $\boldsymbol{j}\in\mathcal{R}_{n}^{(U)}(\boldsymbol{i}):=\{(j_{1},j_{2},\ldots,j_{U})\in\mathcal{I}_{n}^{(U)}\mid j_{1},j_{2},\ldots,j_{U}\in\{1,\ldots,n\}\setminus\{i_{1},i_{2},\ldots,i_{U}\}\}$ , and $\mathbb{E}_{\boldsymbol{Z}}(h(Z_{\boldsymbol{i}})^{2})<\infty$ , for all $\boldsymbol{i}\in\mathcal{I}_{n}^{(U)}$ . Then the average of $h(Z_{\boldsymbol{i}})$ over $\mathcal{I}_{n}^{(U)}$ converges to the expectation in probability as $n\to\infty$ ; that is

[TABLE]

Proof of Theorem 3. The proof is almost the same as that of Okuno and Shimodaira (2019) Theorem A.1, which shows the same assertion for $U=2$ . Regarding the variance of the average, we have

[TABLE]

where $\mathbb{E}_{\boldsymbol{Z}},\mathbb{V}_{\boldsymbol{Z}}$ represent expectation and variance with respect to $\boldsymbol{Z}$ . Considering $\mathbb{E}_{\boldsymbol{Z}}(|h(Z_{\boldsymbol{i}})|)\leq\mathbb{E}_{\boldsymbol{Z}}(h(Z_{\boldsymbol{i}})^{2})^{1/2}<\infty,\mathbb{E}_{\boldsymbol{Z}}(|h(Z_{\boldsymbol{i}})h(Z_{\boldsymbol{j}})|)\leq\sqrt{\mathbb{E}_{\boldsymbol{Z}}(h(Z_{\boldsymbol{i}})^{2})\mathbb{E}_{\boldsymbol{Z}}(h(Z_{\boldsymbol{j}})^{2})}<\infty$ , $|\mathcal{I}_{n}^{(U)}|=O(n^{U})$ , and

[TABLE]

for any fixed $\boldsymbol{i}=(i_{1},i_{2},\ldots,i_{U})\in\mathcal{I}_{n}^{(U)}$ , the formula (39) is of order $O(n^{-2U}\cdot n^{U}\cdot n^{U-1})=O(n^{-1})$ . Therefore,

[TABLE]

(40) and Chebyshev’s inequality indicate the assertion. ∎

This theorem generalizes Okuno and Shimodaira (2019) Theorem A.1, which proves the same assertion for $U=2$ . We note that the convergence rate is $O_{p}(n^{-1/2})$ but not $O_{p}(1/|\mathcal{I}_{n}^{(U)}|^{1/2})=O_{p}(n^{-U/2})$ , even though we leverage $|\mathcal{I}_{n}^{(U)}|=O(n^{U})$ observations $\{Z_{\boldsymbol{i}}\}_{\boldsymbol{i}\in\mathcal{I}_{n}^{(U)}}$ .
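A toy example may help to see why the rate is governed by $n$ rather than by the number of tuples. Take $U=2$ and $Z_{(i,j)}:=a_{i}+a_{j}$ with i.i.d. $a_{1},\ldots,a_{n}$ (an illustrative construction, not from the paper): $Z_{\boldsymbol{i}}$ and $Z_{\boldsymbol{j}}$ are independent whenever the index pairs are disjoint, as Theorem 3 requires, yet the average over all pairs collapses to $2\bar{a}$ , whose variance is $4\mathbb{V}(a_{1})/n$ .

```python
import random
from itertools import combinations
from statistics import mean


def pair_average(a):
    """Average of Z_(i,j) = a_i + a_j over all index pairs i < j."""
    n = len(a)
    total = sum(a[i] + a[j] for i, j in combinations(range(n), 2))
    return total / (n * (n - 1) / 2)


rng = random.Random(0)
a = [rng.gauss(0.0, 1.0) for _ in range(30)]

# Each a_i appears in exactly (n - 1) pairs, so the pair average equals 2 * mean(a).
gap = abs(pair_average(a) - 2.0 * mean(a))
```

Although $n(n-1)/2$ terms are averaged, only $n$ independent sources of randomness are involved, so the fluctuation is of order $n^{-1/2}$ , in line with the remark above.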

C.2 Proof of Proposition 1

By a simple calculation, we have

[TABLE]

Under the conditions (C-1)–(C-5), Theorem 3 can be applied to the terms (41)–(43) as shown in the following:

specifying $Z_{\boldsymbol{i}}:=\boldsymbol{X}_{\boldsymbol{i}},h(Z_{\boldsymbol{i}}):=d_{\varphi}(\mu_{*}(\boldsymbol{X}_{\boldsymbol{i}}),\mu_{\boldsymbol{\theta}}(\boldsymbol{X}_{\boldsymbol{i}}))$ leads to

[TABLE]

specifying $Z_{\boldsymbol{i}}:=(w_{\boldsymbol{i}},\boldsymbol{X}_{\boldsymbol{i}}),h(Z_{\boldsymbol{i}}):=\varphi(w_{\boldsymbol{i}})-\varphi(\mu_{*}(\boldsymbol{X}_{\boldsymbol{i}}))$ leads to

[TABLE]

and specifying $Z_{\boldsymbol{i}}:=(w_{\boldsymbol{i}},\boldsymbol{X}_{\boldsymbol{i}}),h(Z_{\boldsymbol{i}}):=\varphi^{\prime}(\mu_{\boldsymbol{\theta}}(\boldsymbol{X}_{\boldsymbol{i}}))(\mu_{*}(\boldsymbol{X}_{\boldsymbol{i}})-w_{\boldsymbol{i}})$ leads to

[TABLE]

This proves the assertion

[TABLE]

∎

C.3 Proof of Theorem 1

Definition of the estimator (10) leads to

[TABLE]

We evaluate both sides of the inequality (44), for proving the assertion.

•

Regarding the left-hand side of the inequality (44), Proposition 1 indicates that

[TABLE]

where $\varepsilon^{(1)}_{n}:=L_{\varphi,n}(\boldsymbol{\theta}_{*})-\left(\mathbb{E}_{\mathcal{X}^{U}}\left(d_{\varphi}(\mu_{*}(\boldsymbol{X}),\mu_{\boldsymbol{\theta}_{*}}(\boldsymbol{X}))\right)+C_{\varphi}\right)=O_{p}(1/\sqrt{n})$ .

•

We here consider the right-hand side of the inequality (44). Since the function $\varphi$ is strongly convex, the definition indicates the existence of $M_{\varphi}>0$ such that

[TABLE]

for all $a,b\in\text{dom}(\varphi)$ . This inequality indicates that the squared difference is bounded by the function $d_{\varphi}$ . By substituting $\mu_{*}(\boldsymbol{X}),\mu_{\boldsymbol{\theta}}(\boldsymbol{X})$ into $a,b$ , respectively, we have an inequality

[TABLE]

Using the above inequality (47), the right-hand side of the inequality (44) is evaluated as

[TABLE]

where $\|f\|:=\mathbb{E}_{\mathcal{X}^{U}}(f(\boldsymbol{X})^{2})^{1/2}$ for functions $f:\mathcal{X}^{U}\to\mathbb{R}$ and $\varepsilon_{n}^{(2)}(\boldsymbol{\theta}):=L_{\varphi,n}(\boldsymbol{\theta})-\left\{\mathbb{E}_{\mathcal{X}^{U}}\left(d_{\varphi}(\mu_{*}(\boldsymbol{X}),\mu_{\boldsymbol{\theta}}(\boldsymbol{X}))\right)+C_{\varphi}\right\}$ represents the residual in Proposition 1 using the parameter $\boldsymbol{\theta}$ , which satisfies $\varepsilon_{n}^{(2)}(\boldsymbol{\theta})=O_{p}(1/\sqrt{n})$ for each $\boldsymbol{\theta}\in\boldsymbol{\Theta}$ .

By substituting (46) and (48) into (44), we have

[TABLE]

indicating that

[TABLE]

where $\varepsilon_{n}^{(1)}=O_{p}(1/\sqrt{n})=o_{p}(1)$ . The term $\varepsilon^{(2)}_{n}(\hat{\boldsymbol{\theta}}_{\varphi,n})$ is proved to be $o_{p}(1)$ in the remainder of this proof; then, (49) immediately proves Theorem 1.

Finally, we prove $\varepsilon^{(2)}_{n}(\hat{\boldsymbol{\theta}}_{\varphi,n})=o_{p}(1)$ , by employing Newey (1991) Corollary 2.2, indicating that $\sup_{\boldsymbol{\theta}\in\boldsymbol{\Theta}}|\varepsilon_{n}^{(2)}(\boldsymbol{\theta})|=o_{p}(1)$ under the following assumptions: (i) $\boldsymbol{\Theta}$ is compact, (ii) $\varepsilon_{n}^{(2)}(\boldsymbol{\theta})=o_{p}(1)$ for each $\boldsymbol{\theta}\in\boldsymbol{\Theta}$ , and (iii) $\exists B_{n}=O_{p}(1)$ such that $|\varepsilon_{n}^{(2)}(\boldsymbol{\theta})-\varepsilon_{n}^{(2)}(\boldsymbol{\theta}^{\prime})|\leq B_{n}\|\boldsymbol{\theta}-\boldsymbol{\theta}^{\prime}\|_{2}$ for all $\boldsymbol{\theta},\boldsymbol{\theta}^{\prime}\in\boldsymbol{\Theta}$ . The above assumptions (i), (ii) and (iii) correspond to assumptions 1, 2 and 3A in Newey (1991). In our setting, (i) is assumed, and (ii) is proved by Proposition 1. (iii) is obtained similarly to Proof B.1 in Supplement of Okuno et al. (2018); since the product of two bounded Lipschitz continuous (LC) functions is LC, a $C^{1}$ -function applied to an LC function is LC, and the expectation of an LC function is also LC, there exist $M_{1},M_{2}>0$ such that

[TABLE]

Denoting by $B_{n}:=M_{1}\left(\frac{1}{|\mathcal{I}_{n}^{(U)}|}\sum_{\boldsymbol{i}\in\mathcal{I}_{n}^{(U)}}|w_{\boldsymbol{i}}|\right)+M_{2}$ , Proposition 1 indicates $B_{n}=O_{p}(1)$ . Therefore the condition (iii) holds; Newey (1991) Corollary 2.2 proves

[TABLE]

indicating that $\varepsilon_{n}^{(2)}(\hat{\boldsymbol{\theta}}_{\varphi,n})=o_{p}(1)$ . ∎

C.4 Proof of Theorem 2

The proof is twofold. In the following, we first verify that (i) $\mathbb{E}_{\mathcal{M}^{(t)}}(\tilde{g}^{(t)}_{\eta}(\boldsymbol{\theta}))=\alpha\frac{\partial}{\partial\boldsymbol{\theta}}Q_{\eta}(\boldsymbol{\theta})$ , where

[TABLE]

and we next prove (ii) $\mathbb{E}_{\tau}\left(\mathbb{E}_{\{\mathcal{M}^{(t)}\}_{t\in[\tau]}}(\|\frac{\partial}{\partial\boldsymbol{\theta}}Q_{\eta}(\tilde{\boldsymbol{\theta}}^{(\tau)})\|_{2}^{2})\right)=O(1/\log T)$ by referring to (i) and Ghadimi and Lan (2013) Theorem 2.1 (a). Then, the assertion is proved.

(i)

We first verify that $\mathbb{E}_{\mathcal{M}^{(t)}}(\tilde{g}^{(t)}_{\eta}(\boldsymbol{\theta}))=\alpha\frac{\partial}{\partial\boldsymbol{\theta}}Q_{\eta}(\boldsymbol{\theta})$ . We begin with the case $U\geq 2,v\geq 1$ . A vector $\boldsymbol{u}=(u_{1},u_{2},\ldots,u_{v})$ , indicating which entries of the index $\boldsymbol{i}=(i_{1},i_{2},\ldots,i_{U})$ are fixed, is specified in advance by the user from the set $\{\boldsymbol{u}=(u_{1},u_{2},\ldots,u_{v})\in[U]^{v}\mid u_{1}<u_{2}<\cdots<u_{v}\}$ . Then, with the set $\mathcal{I}_{n,\boldsymbol{u}}^{(U)}(\boldsymbol{j}):=\{\boldsymbol{i}:=(i_{1},i_{2},\ldots,i_{U})\mid\boldsymbol{i}\in\mathcal{I}_{n}^{(U)},i_{u_{1}}=j_{1},\ldots,i_{u_{v}}=j_{v}\}$ defined for $\boldsymbol{j}\in[n]^{v}$ , Algorithm 1, which defines $\mathcal{M}^{(t)}=(\tilde{\mathcal{P}}_{\text{mini}}^{(t)},\tilde{\mathcal{I}}_{\text{mini}}^{(t)},s_{+}^{(t)},s_{-}^{(t)})$ , consists of the following two steps at iteration $t$ :

step 1.

$\boldsymbol{j}$ is randomly selected from the set $\mathcal{K}_{\boldsymbol{u}}:=\{\boldsymbol{j}\in[n]^{v}\mid\mathcal{I}^{(U)}_{n,\boldsymbol{u}}(\boldsymbol{j})\neq\emptyset\}$ with probability $p_{\boldsymbol{j}}$ (in Theorem 2, $p_{\boldsymbol{j}}$ is assumed to be $1/|\mathcal{K}_{\boldsymbol{u}}|$ ).

step 2.

$m_{-}$ and $m_{+}$ entries are selected uniformly at random from the sets $\tilde{\mathcal{I}}_{n}^{(U)}=\mathcal{I}_{n,\boldsymbol{u}}^{(U)}(\boldsymbol{j})$ and $\tilde{\mathcal{P}}_{n}^{(U)}=\mathcal{P}_{n,\boldsymbol{u}}^{(U)}(\boldsymbol{j}):=\{\boldsymbol{i}^{\prime}\mid\boldsymbol{i}^{\prime}\in\mathcal{I}_{n,\boldsymbol{u}}^{(U)}(\boldsymbol{j}),w_{\boldsymbol{i}^{\prime}}\neq 0\}$ , respectively, and the selected sets are denoted by $\tilde{\mathcal{I}}_{\text{mini}}^{(t)}$ and $\tilde{\mathcal{P}}_{\text{mini}}^{(t)}$ . The coefficients $s_{+}^{(t)}:=|\tilde{\mathcal{P}}_{n}^{(U)}|/m_{+}$ and $s_{-}^{(t)}:=|\tilde{\mathcal{I}}_{n}^{(U)}|/m_{-}$ are also defined.

Therefore, the expectation of the stochastic gradient $\tilde{g}^{(t)}_{\eta}(\boldsymbol{\theta})$ with respect to the sampling of the minibatch $\mathcal{M}^{(t)}$ is

[TABLE]

where the term $(\star 1)$ is evaluated by taking expectation with respect to the two steps in Algorithm 1 as

[TABLE]

and similarly,

[TABLE]

Substituting (52) and (53) into (51) leads to

[TABLE]

Thus (i) is proved for the case $U\geq 2,v\geq 1$ . We next consider the case $U\in\mathbb{N},v=0$ . Since $v=0$ means that no entry of the index $\boldsymbol{i}$ is fixed, step 1 above is skipped and Algorithm 1 consists of step 2 only. Thus, noting that $\tilde{\mathcal{P}}_{n}^{(U)}=\mathcal{P}_{n}^{(U)},\tilde{\mathcal{I}}_{n}^{(U)}=\mathcal{I}_{n}^{(U)}$ , the same calculation leads to the equation $\mathbb{E}_{\mathcal{M}^{(t)}}(\tilde{g}^{(t)}_{\eta}(\boldsymbol{\theta}))=\alpha\frac{\partial}{\partial\boldsymbol{\theta}}Q_{\eta}(\boldsymbol{\theta})$ , which is the same as in the case $U\geq 2,v\geq 1$ .

Since $v$ takes values only in $\{0,1,2,\ldots,U-1\}$ , (i) is hereby proved for all possible $(U,v)$ .
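The two-step construction of $\mathcal{M}^{(t)}$ used in part (i) can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names `build_slices` and `sample_minibatch` are hypothetical, positions in `u` are 0-based, the toy index set and weights are made up, and step 2 draws with replacement, which the text does not specify.

```python
import random
from collections import defaultdict

def build_slices(index_set, weights, u):
    """Group tuples i by j = (i[u_1], ..., i[u_v]), the values at the fixed positions u."""
    all_slices = defaultdict(list)  # j -> I_{n,u}(j)
    pos_slices = defaultdict(list)  # j -> P_{n,u}(j): tuples in the slice with nonzero weight
    for i in index_set:
        j = tuple(i[k] for k in u)
        all_slices[j].append(i)
        if weights.get(i, 0) != 0:
            pos_slices[j].append(i)
    return all_slices, pos_slices

def sample_minibatch(all_slices, pos_slices, m_plus, m_minus, rng):
    # Step 1: pick j uniformly from K_u; the keys of all_slices are exactly the non-empty slices.
    j = rng.choice(sorted(all_slices))
    I_j = all_slices[j]
    P_j = pos_slices.get(j, [])
    # Step 2: draw m_- tuples from I_j and m_+ tuples from P_j
    # (with replacement: an assumption made for this sketch).
    I_mini = rng.choices(I_j, k=m_minus)
    P_mini = rng.choices(P_j, k=m_plus) if P_j else []
    # Rescaling coefficients s_+ = |P_j| / m_+ and s_- = |I_j| / m_-.
    s_plus = len(P_j) / m_plus
    s_minus = len(I_j) / m_minus
    return j, P_mini, I_mini, s_plus, s_minus

# Toy example (hypothetical data): U = 3, n = 4, first entry fixed (v = 1).
index_set = [(a, b, c) for a in range(4) for b in range(4) for c in range(4) if a < b < c]
weights = {(0, 1, 2): 1.0, (0, 2, 3): 2.0}
all_slices, pos_slices = build_slices(index_set, weights, u=(0,))
j, P_mini, I_mini, s_plus, s_minus = sample_minibatch(
    all_slices, pos_slices, m_plus=2, m_minus=3, rng=random.Random(0)
)
```

Passing `u=()` places every tuple in a single slice, so step 1 becomes trivial; this mirrors the case $v=0$ discussed above.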

(ii)

We next prove that $\mathbb{E}_{\tau}\left(\mathbb{E}_{\{\mathcal{M}^{(t)}\}_{t\in[\tau]}}(\|\frac{\partial}{\partial\boldsymbol{\theta}}Q_{\eta}(\tilde{\boldsymbol{\theta}}^{(\tau)})\|_{2}^{2})\right)=O(1/\log T)$ by referring to (i) and Ghadimi and Lan (2013) Theorem 2.1 (a). The following explanations are based on Ghadimi and Lan (2013), with corresponding symbols $k\Leftrightarrow t$ , $R\Leftrightarrow\tau$ , $N\Leftrightarrow T$ , $\gamma_{k}\Leftrightarrow\gamma^{(t)}$ , $x_{k}\Leftrightarrow\tilde{\boldsymbol{\theta}}^{(t)}$ , $f(x)\Leftrightarrow\alpha Q_{\eta}(\boldsymbol{\theta})$ , $G(\cdot,\xi_{k})\Leftrightarrow\tilde{g}^{(t)}_{\eta}(\cdot)$ , $L\Leftrightarrow H$ , $D_{f}\Leftrightarrow D$ , $\nabla\Leftrightarrow\frac{\partial}{\partial\boldsymbol{\theta}}$ .

Ghadimi and Lan (2013) Theorem 2.1 (a) shows that the iterative update

[TABLE]

satisfies

[TABLE]

where $D:=\sqrt{\frac{2}{H}\left(Q_{\eta}(\tilde{\boldsymbol{\theta}}^{(1)})-\inf_{\boldsymbol{\theta}\in\boldsymbol{\Theta}}Q_{\eta}(\boldsymbol{\theta})\right)}$ , $H>0$ is the Lipschitz constant of $\alpha\frac{\partial}{\partial\boldsymbol{\theta}}Q_{\eta}(\boldsymbol{\theta})$ , $\gamma^{(t)}$ represents the step size satisfying $\gamma^{(t)}<2/H$ , and the number of iterations $\tau$ is chosen from $\{1,2,\ldots,T\}$ with the probability $\mathbb{P}(\tau=t)=\frac{2\gamma^{(t)}-H\gamma^{(t)2}}{\sum_{t=1}^{T}(2\gamma^{(t)}-H\gamma^{(t)2})}$ , if assumptions (C-1) $\mathbb{E}_{\mathcal{M}^{(t)}}(\tilde{g}_{\eta}^{(t)}(\boldsymbol{\theta}))=\alpha\frac{\partial}{\partial\boldsymbol{\theta}}Q_{\eta}(\boldsymbol{\theta})$ and (C-2) $\mathbb{E}_{\mathcal{M}^{(t)}}(\|\tilde{g}_{\eta}^{(t)}(\boldsymbol{\theta})-\alpha\frac{\partial}{\partial\boldsymbol{\theta}}Q_{\eta}(\boldsymbol{\theta})\|_{2}^{2})<\sigma^{2}$ for some $\sigma\in(0,\infty)$ , $(\forall\boldsymbol{\theta}\in\boldsymbol{\Theta})$ hold. These assumptions (C-1) and (C-2) correspond to eq. (1.2) and eq. (1.3) in Ghadimi and Lan (2013), respectively.
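The randomized choice of the output iterate $\tau$ described above can be sketched as follows. The function names `tau_distribution` and `sample_tau` and the numerical values are hypothetical and chosen only for illustration; the sketch simply normalizes the weights $2\gamma^{(t)}-H\gamma^{(t)2}$ and draws $\tau$ from the resulting distribution.

```python
import random

def tau_distribution(step_sizes, H):
    """Return [P(tau = 1), ..., P(tau = T)] with P(tau = t) proportional to 2*g_t - H*g_t**2."""
    if H <= 0:
        raise ValueError("H must be positive")
    if any(not (0 < g < 2 / H) for g in step_sizes):
        raise ValueError("every step size must lie in (0, 2/H) so that all weights are positive")
    raw = [2 * g - H * g * g for g in step_sizes]
    total = sum(raw)
    return [w / total for w in raw]

def sample_tau(probs, rng):
    """Draw tau from {1, ..., T} according to probs."""
    return rng.choices(range(1, len(probs) + 1), weights=probs, k=1)[0]

# Illustrative values only: T = 5 step sizes decaying as 0.5 / t, with H = 1.
step_sizes = [0.5 / t for t in range(1, 6)]
probs = tau_distribution(step_sizes, H=1.0)
tau = sample_tau(probs, random.Random(0))
```

Because $2\gamma-H\gamma^{2}$ is increasing in $\gamma$ on $(0,1/H)$, a decaying step size in that range puts more probability on early iterates.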

In the case of Theorem 2, the minibatch SGD (26) reduces to (54) due to the assumption $\boldsymbol{\Theta}=\mathbb{R}^{q}$ , the step size satisfies $\gamma^{(t)}=\gamma t^{-1}\leq\gamma\overset{\text{(assumption)}}{<}2/H$ , (C-1) is proved by the above calculation (i), and (C-2) is proved by

[TABLE]

where $(\boldsymbol{z})_{\alpha}$ represents the $\alpha$ -th entry of the vector $\boldsymbol{z}=(z_{1},z_{2},\ldots,z_{p})$ , $\boldsymbol{z}^{\otimes 2}:=\boldsymbol{z}\boldsymbol{z}^{\top}$ , and $\text{tr}\boldsymbol{Z}$ represents the trace of the matrix $\boldsymbol{Z}=(z_{ij})$ , i.e., $\text{tr}\boldsymbol{Z}=\sum_{\alpha=1}^{p}z_{\alpha\alpha}$ . Thus (55) holds; finally, we evaluate the right-hand side of (55).

By the assumptions, we have $H=O(1)$ and $\sigma^{2}=O(1)$ , and $D=O(1)$ since the Lipschitz continuity of $\frac{\partial}{\partial\boldsymbol{\theta}}Q_{\eta}(\boldsymbol{\theta})$ implies that $Q_{\eta}(\tilde{\boldsymbol{\theta}}^{(1)})$ is finite for any fixed $\tilde{\boldsymbol{\theta}}^{(1)}\in\boldsymbol{\Theta}$ . Then, it holds for $\gamma^{(t)}=\gamma t^{-1}$ that

[TABLE]

Thus, substituting $\alpha=O(1)$ and (56) into (55) leads to

[TABLE]

By noticing that $Q_{\eta}(\boldsymbol{\theta})=D_{\varphi}(\{\eta w_{\boldsymbol{i}}\}_{\boldsymbol{i}\in\mathcal{I}_{n}^{(U)}},\{\mu_{\boldsymbol{\theta}}(\boldsymbol{X}_{\boldsymbol{i}})\}_{\boldsymbol{i}\in\mathcal{I}_{n}^{(U)}})$ , Theorem 2 is proved. ∎
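The $O(1/\log T)$ rate in part (ii) can be checked numerically. The sketch below assumes the bound has the form usually stated for Ghadimi and Lan (2013) Theorem 2.1 (a), namely a right-hand side $\left(D^{2}+\sigma^{2}\sum_{t=1}^{T}\gamma^{(t)2}\right)/\sum_{t=1}^{T}\left(2\gamma^{(t)}-H\gamma^{(t)2}\right)$ ; since (55) is not reproduced in this text, this form, the function name `gl_bound`, and all numerical constants are assumptions. With $\gamma^{(t)}=\gamma t^{-1}$ the numerator stays bounded because $\sum_{t}t^{-2}$ converges, while the denominator grows like $2\gamma\log T$.

```python
import math

def gl_bound(gamma, H, D, sigma2, T):
    """Assumed right-hand side of the bound for step sizes gamma_t = gamma / t."""
    steps = [gamma / t for t in range(1, T + 1)]
    numerator = D ** 2 + sigma2 * sum(g * g for g in steps)   # bounded in T
    denominator = sum(2 * g - H * g * g for g in steps)       # grows like 2 * gamma * log(T)
    return numerator / denominator

# Illustrative constants (assumptions): gamma = H = D = sigma^2 = 1.
for T in (10, 1_000, 100_000):
    b = gl_bound(1.0, 1.0, 1.0, 1.0, T)
    # b shrinks as T grows, while b * log(T) stays bounded.
    print(T, b, b * math.log(T))
```

The product of the bound and $\log T$ remaining bounded is what the $O(1/\log T)$ statement asserts for this step-size schedule.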

Bibliography

1. Clauset et al. [2008] Aaron Clauset, Cristopher Moore, and Mark EJ Newman. Hierarchical structure and the prediction of missing links in networks. Nature, 453(7191):98–101, 2008.
2. Lü and Zhou [2011] Linyuan Lü and Tao Zhou. Link prediction in complex networks: A survey. Physica A: Statistical Mechanics and its Applications, 390(6):1150–1170, 2011.
3. Liben-Nowell and Kleinberg [2007] David Liben-Nowell and Jon Kleinberg. The Link-Prediction Problem for Social Networks. Journal of the American Society for Information Science and Technology, 58(7):1019–1031, 2007.
4. De Maesschalck et al. [2000] Roy De Maesschalck, Delphine Jouan-Rimbaud, and Désiré L Massart. The Mahalanobis distance. Chemometrics and Intelligent Laboratory Systems, 50(1):1–18, 2000.
5. Kung [2014] Sun Yuan Kung. Kernel Methods and Machine Learning. Cambridge University Press, 2014.
6. Goldberger et al. [2005] Jacob Goldberger, Geoffrey E Hinton, Sam T Roweis, and Ruslan R Salakhutdinov. Neighbourhood Components Analysis. In Advances in Neural Information Processing Systems, pages 513–520, 2005.
7. Tang et al. [2015] Jian Tang, Meng Qu, Mingzhe Wang, Ming Zhang, Jun Yan, and Qiaozhu Mei. LINE: Large-scale Information Network Embedding. In Proceedings of the International Conference on World Wide Web, pages 1067–1077, 2015.
8. Okuno et al. [2018] Akifumi Okuno, Tetsuya Hada, and Hidetoshi Shimodaira. A probabilistic framework for multi-view feature learning with many-to-many associations via neural networks. In Proceedings of the International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pages 3888–3897. PMLR, 2018.
