Learning from Multi-View Multi-Way Data via Structural Factorization Machines

Chun-Ta Lu; Lifang He; Hao Ding; Bokai Cao; Philip S. Yu

arXiv:1704.03037·cs.LG·February 16, 2018

TL;DR

This paper introduces Structural Factorization Machines (SFMs), a multi-tensor approach that models multi-view, multi-way data, capturing shared structures and view importance, leading to improved prediction accuracy and scalability.

Contribution

The paper proposes SFMs, a novel model that preserves multi-view data structure, learns shared latent spaces, and adjusts view importance, with linear complexity for large-scale applications.

Findings

01 SFMs outperform state-of-the-art methods in accuracy

02 SFMs have linear complexity, suitable for large datasets

03 Experiments confirm effectiveness on real-world data

Abstract

Real-world relations among entities can often be observed and determined by different perspectives/views. For example, the decision made by a user on whether to adopt an item relies on multiple aspects such as the contextual information of the decision, the item's attributes, the user's profile and the reviews given by other users. Different views may exhibit multi-way interactions among entities and provide complementary information. In this paper, we introduce a multi-tensor-based approach that can preserve the underlying structure of multi-view data in a generic predictive model. Specifically, we propose structural factorization machines (SFMs) that learn the common latent spaces shared by multi-view tensors and automatically adjust the importance of each view in the predictive model. Furthermore, the complexity of SFMs is linear in the number of parameters, which makes SFMs suitable…

Tables

Table 1. List of basic symbols.

Symbol	Definition and description
$x$	each lowercase letter represents a scalar
$𝐱$	each boldface lowercase letter represents a vector
$𝐗$	each boldface uppercase letter represents a matrix
$𝒳$	each calligraphic letter represents a tensor
$𝔛$	each gothic letter represents a general set or space
$[1 : N]$	the set of integers from $1$ to $N$, inclusive
$⟨ \cdot, \cdot ⟩$	denotes inner product
$\circ$	denotes tensor product (outer product)
$*$	denotes Hadamard (element-wise) product

Table 2. The statistics for each dataset. $N_{z}(X)$ and $N_{z}(\mathcal{B})$ are the numbers of non-zeros in the plain-formatted feature matrix and in the relational structures, respectively. Game: Video Games, Cloth: Clothing, Shoes and Jewelry, Sport: Sports and Outdoors, Health: Health and Personal Care, Home: Home and Kitchen, Elec: Electronics.

Dataset	#Samples	Mode					Density	$N_{z} (X)$	$N_{z} (ℬ)$
Amazon		#Users	#Items	#Words	#Categories	#Links
Game	231,780	24,303	10,672	7,500	193	17,974	0.089%	32.9M	15.2M
Cloth	278,677	39,387	23,033	3,493	1,175	107,139	0.031%	25.6M	7.3M
Sport	296,337	35,598	18,357	5,202	1,432	73,040	0.045%	34.2M	10.2M
Health	346,355	38,609	18,534	5,889	849	80,379	0.048%	33.6M	12.1M
Home	551,682	66,569	28,237	6,455	970	99,090	0.029%	46.8M	19.4M
Elec	1,689,188	192,403	63,001	12,805	967	89,259	0.014%	161.5M	69M
		#Users	#Venues	#Friends	#Categories	#Cities
Yelp	1,319,870	88,009	40,520	88,009	892	412	0.037%	70.5M	1.4M
		#Users	#Books	#Countries	#Ages	#Authors
BX	244,848	24,325	45,074	57	8	17,178	0.022%	1.2M	163K

Equations

$$\langle\mathcal{X},\mathcal{Y}\rangle=\sum_{i_{1}=1}^{I_{1}}\sum_{i_{2}=1}^{I_{2}}\cdots\sum_{i_{M}=1}^{I_{M}}x_{i_{1},i_{2},\dots,i_{M}}\,y_{i_{1},i_{2},\dots,i_{M}}.$$

$$(\mathcal{X}\circ\mathcal{Y})_{i_{1},i_{2},\dots,i_{N},i_{1}^{\prime},i_{2}^{\prime},\dots,i_{M}^{\prime}}=x_{i_{1},i_{2},\dots,i_{N}}\,y_{i_{1}^{\prime},i_{2}^{\prime},\dots,i_{M}^{\prime}}$$

$$\langle\mathcal{X},\mathcal{Y}\rangle=\langle\mathbf{x}^{(1)},\mathbf{y}^{(1)}\rangle\langle\mathbf{x}^{(2)},\mathbf{y}^{(2)}\rangle\cdots\langle\mathbf{x}^{(M)},\mathbf{y}^{(M)}\rangle.$$

$$\mathcal{X}=\sum_{r=1}^{R}\mathbf{x}_{r}^{(1)}\circ\mathbf{x}_{r}^{(2)}\circ\cdots\circ\mathbf{x}_{r}^{(M)}=\llbracket\mathbf{X}^{(1)},\mathbf{X}^{(2)},\dots,\mathbf{X}^{(M)}\rrbracket,$$

$$\tilde{\mathcal{X}}=\tilde{\mathbf{x}}^{(1)}\circ\tilde{\mathbf{x}}^{(2)}\circ\cdots\circ\tilde{\mathbf{x}}^{(M)}\in\mathbb{R}^{(1+I_{1})\times\cdots\times(1+I_{M})},$$

$$f\big(\{\tilde{\mathcal{X}}^{(1)},\tilde{\mathbf{X}}^{(2)}\}\big)=\langle\tilde{\mathcal{W}}^{(1)},\tilde{\mathcal{X}}^{(1)}\rangle+\langle\tilde{\mathbf{W}}^{(2)},\tilde{\mathbf{X}}^{(2)}\rangle$$

$$\mathbf{e}_{v}=[\underbrace{0,\dots,0}_{v-1},1,0,\dots,0]^{\mathrm{T}},$$

$$f\big(\{\tilde{\mathcal{X}}^{(1)},\tilde{\mathbf{X}}^{(2)}\}\big)=\langle\hat{\mathcal{W}}^{(1)},\tilde{\mathcal{X}}^{(1)}\circ\mathbf{e}_{1}\rangle+\langle\hat{\mathcal{W}}^{(2)},\tilde{\mathbf{X}}^{(2)}\circ\mathbf{e}_{2}\rangle,$$

$$\hat{\mathcal{W}}^{(1)}=\llbracket\hat{\bm{\Theta}}^{(1,1)},\hat{\bm{\Theta}}^{(1,2)},\hat{\bm{\Theta}}^{(1,3)},\bm{\Phi}\rrbracket=\llbracket[\mathbf{b}^{(1,1)};\bm{\Theta}^{(1)}],[\mathbf{b}^{(1,2)};\bm{\Theta}^{(2)}],[\mathbf{b}^{(1,3)};\bm{\Theta}^{(3)}],\bm{\Phi}\rrbracket,$$

$$\hat{\mathcal{W}}^{(2)}=\llbracket\hat{\bm{\Theta}}^{(2,3)},\hat{\bm{\Theta}}^{(2,4)},\bm{\Phi}\rrbracket=\llbracket[\mathbf{b}^{(2,3)};\bm{\Theta}^{(3)}],[\mathbf{b}^{(2,4)};\bm{\Theta}^{(4)}],\bm{\Phi}\rrbracket,$$

$$\begin{aligned}&\langle\hat{\mathcal{W}}^{(1)},\tilde{\mathcal{X}}^{(1)}\circ\mathbf{e}_{1}\rangle+\langle\hat{\mathcal{W}}^{(2)},\tilde{\mathbf{X}}^{(2)}\circ\mathbf{e}_{2}\rangle\\ =&\sum_{r=1}^{R}\langle\hat{\bm{\theta}}_{r}^{(1,1)}\circ\hat{\bm{\theta}}_{r}^{(1,2)}\circ\hat{\bm{\theta}}_{r}^{(1,3)}\circ\bm{\phi}_{r},\tilde{\mathbf{x}}^{(1)}\circ\tilde{\mathbf{x}}^{(2)}\circ\tilde{\mathbf{x}}^{(3)}\circ\mathbf{e}_{1}\rangle+\sum_{r=1}^{R}\langle\hat{\bm{\theta}}_{r}^{(2,3)}\circ\hat{\bm{\theta}}_{r}^{(2,4)}\circ\bm{\phi}_{r},\tilde{\mathbf{x}}^{(3)}\circ\tilde{\mathbf{x}}^{(4)}\circ\mathbf{e}_{2}\rangle\\ =&\bm{\phi}^{1}\left(\prod_{m=1}^{3}\ast\left(\tilde{\mathbf{x}}^{(m)^{\mathrm{T}}}\hat{\bm{\Theta}}^{(1,m)}\right)\right)^{\mathrm{T}}+\bm{\phi}^{2}\left(\prod_{m=3}^{4}\ast\left(\tilde{\mathbf{x}}^{(m)^{\mathrm{T}}}\hat{\bm{\Theta}}^{(2,m)}\right)\right)^{\mathrm{T}}\\ =&\bm{\phi}^{1}\left(\prod_{m=1}^{3}\ast\left(\mathbf{x}^{(m)^{\mathrm{T}}}\bm{\Theta}^{(m)}+\mathbf{b}^{(1,m)}\right)\right)^{\mathrm{T}}+\bm{\phi}^{2}\left(\prod_{m=3}^{4}\ast\left(\mathbf{x}^{(m)^{\mathrm{T}}}\bm{\Theta}^{(m)}+\mathbf{b}^{(2,m)}\right)\right)^{\mathrm{T}}\end{aligned}$$

$$f\big(\{\tilde{\mathcal{X}}^{(v)}\}\big)=\sum_{v=1}^{V}\langle\hat{\mathcal{W}}^{(v)},\tilde{\mathcal{X}}^{(v)}\circ\mathbf{e}_{v}\rangle=\sum_{v=1}^{V}\bm{\phi}^{v}\prod_{m\in S_{M}(v)}\ast\left(\mathbf{x}^{(m)^{\mathrm{T}}}\bm{\Theta}^{(m)}+\mathbf{b}^{(v,m)}\right)^{\mathrm{T}}=\sum_{v=1}^{V}\bm{\phi}^{v}\prod_{m\in S_{M}(v)}\ast\left(\mathbf{h}^{(m)}+\mathbf{b}^{(v,m)^{\mathrm{T}}}\right)$$

$$\mathcal{R}=\frac{1}{N}\sum_{n=1}^{N}\ell\left(f\big(\{\mathcal{X}_{n}^{(v)}\}\big),y_{n}\right)+\lambda\,\Omega\left(\bm{\Phi},\{\bm{\Theta}^{(m)}\},\{\mathbf{b}^{(v,m)}\}\right)$$

$$\frac{\partial\mathcal{R}}{\partial\bm{\Theta}^{(m)}}=\frac{\partial\mathcal{L}}{\partial f}\frac{\partial f}{\partial\bm{\Theta}^{(m)}}+\lambda\frac{\partial\Omega_{\lambda}(\bm{\Theta}^{(m)})}{\partial\bm{\Theta}^{(m)}}$$

$$\frac{\partial\mathcal{L}}{\partial f}\frac{\partial f}{\partial\bm{\Theta}^{(m)}}=\mathbf{X}^{(m)}\sum_{v\in S_{V}(m)}\left(\left(\frac{\partial\mathcal{L}}{\partial f}\bm{\phi}^{v}\right)\ast\bm{\Pi}^{(v,-m)}\right)$$

$$\frac{\partial\mathcal{R}}{\partial\mathbf{b}^{(v,m)}}=\mathbf{1}^{\mathrm{T}}\left(\left(\frac{\partial\mathcal{L}}{\partial f}\bm{\phi}^{v}\right)\ast\bm{\Pi}^{(v,-m)}\right)+\lambda\frac{\partial\Omega_{\lambda}(\mathbf{b}^{(v,m)})}{\partial\mathbf{b}^{(v,m)}}$$

$$\frac{\partial\mathcal{R}}{\partial\bm{\Phi}}$$

$$\nabla\mathcal{R}=\left[\begin{array}{c}\text{vec}\big(\frac{\partial\mathcal{R}}{\partial\bm{\Theta}^{(1)}}\big)\\ \vdots\\ \text{vec}\big(\frac{\partial\mathcal{R}}{\partial\bm{\Theta}^{(M)}}\big)\\ \text{vec}\big(\frac{\partial\mathcal{R}}{\partial\mathbf{b}^{(1,1)}}\big)\\ \vdots\\ \text{vec}\big(\frac{\partial\mathcal{R}}{\partial\mathbf{b}^{(V,M)}}\big)\\ \text{vec}\big(\frac{\partial\mathcal{R}}{\partial\bm{\Phi}}\big)\end{array}\right]$$

$$f\big(\{\mathcal{X}_{n}^{(v)}\}\big)=\sum_{v=1}^{V}\bm{\phi}^{v}\prod_{m\in S_{M}(v)}\ast\left(\mathbf{h}_{\psi(n)}^{\mathcal{B}^{(m)}}+\mathbf{b}^{(v,m)^{\mathrm{T}}}\right),$$

Full text

Learning from Multi-View Multi-Way Data via Structural Factorization Machines

Chun-Ta Lu, University of Illinois at Chicago

Lifang He, Cornell University

Hao Ding, Purdue University

Bokai Cao, University of Illinois at Chicago

Philip S. Yu, University of Illinois at Chicago and Tsinghua University

(2018)

Abstract.

Real-world relations among entities can often be observed and determined by different perspectives/views. For example, the decision made by a user on whether to adopt an item relies on multiple aspects such as the contextual information of the decision, the item’s attributes, the user’s profile and the reviews given by other users. Different views may exhibit multi-way interactions among entities and provide complementary information. In this paper, we introduce a multi-tensor-based approach that can preserve the underlying structure of multi-view data in a generic predictive model. Specifically, we propose structural factorization machines (SFMs) that learn the common latent spaces shared by multi-view tensors and automatically adjust the importance of each view in the predictive model. Furthermore, the complexity of SFMs is linear in the number of parameters, which makes SFMs suitable for large-scale problems. Extensive experiments on real-world datasets demonstrate that the proposed SFMs outperform several state-of-the-art methods in terms of prediction accuracy and computational cost.

Tensor Factorization; Multi-Way Interaction; Multi-View Learning

Published in: The 2018 Web Conference, April 23–27, 2018, Lyon, France. Article 4. DOI: https://doi.org/10.1145/3178876.3186071. ISBN: 978-1-4503-5639-8. CCS concepts: Computing methodologies → Machine learning; Computing methodologies → Supervised learning; Computing methodologies → Factorization methods.

1. Introduction

With the ability to access massive amounts of heterogeneous data from multiple sources, multi-view data have become prevalent in many real-world applications. For instance, in recommender systems, online review sites (like Amazon and Yelp) have access to contextual information of shopping histories of users, the reviews written by the users, the categorizations of the items, as well as the friends of the users. Each view may exhibit pairwise interactions (e.g., the friendships between users) or even higher-order interactions (e.g., a customer writes a review for a product) among entities (such as customers, products, and reviews), and can be represented in a multi-way data structure, i.e., a tensor. Since different views usually provide complementary information (cao2014tensor; cao2016multi; lu2017mfm), how to effectively incorporate information from multiple structural views is critical to good prediction performance for various machine learning tasks.

Typically, a predictive model is defined as a function of predictor variables (e.g., the customer id, the product id, and the categories of the product) to some target (e.g., the rating). The most common approach in predictive modeling for multi-view multi-way data is to describe samples with feature vectors that are flattened and concatenated from structural views, and apply a vector-based method, such as linear regression (LR) and support vector machines (SVMs), to learn the target function from observed samples. Recent works have shown that linear models fail for tasks with very sparse data (rendle2012factorization). A variety of methods have been proposed to address the data sparsity issue by factorizing the monomials (or feature interactions) with kernels, such as the ANOVA kernels used in FMs (rendle2012factorization; blondel2016higher) and the polynomial kernels used in polynomial networks (livni2014computational; blondelIFU16). However, this approach has two disadvantages: (1) the important structural information of each view is discarded, which may degrade prediction performance, and (2) the feature vectors can grow very large, which can make learning and prediction very slow or even infeasible, especially if each view involves relations of high cardinality. For example, including the relation “friends of a user” in the feature vector (represented by their IDs) can result in a very long feature vector. Further, it will repeatedly appear in many samples that involve the given user.

Matrix/tensor factorization models have been a topic of interest in the areas of multi-way data analysis, e.g., community detection (he2016joint), collaborative filtering (koren2008factorization; rendle2010pairwise), knowledge graph completion (zhang2017connecting), and neuroimage analysis (he2014dusk). Assuming multi-view data have the same underlying low-rank structure (at least in one mode), coupled data analysis such as collective matrix factorization (CMF) (singh2008relational) and coupled matrix and tensor factorization (CMTF) (acar2011all), which jointly factorize multiple matrices (or tensors), has been applied to applications such as clustering and missing data recovery. However, these models are only applicable to categorical variables. Moreover, since existing coupled factorization models are unsupervised, the importance of each structural view in modeling the target value cannot be automatically learned. Furthermore, when applying these models to data with rich meta information (e.g., friendships) but extremely sparse target values (e.g., ratings), it is very likely that the learning process will be dominated by the meta information unless some hyperparameters are manually tuned, e.g., the weights of the fitting error of each matrix/tensor in the objective function (singh2008relational), the weights of different types of latent factors in the predictive models (koren2010factor), or the regularization hyperparameters of latent factor alignment (lu2016item).

In this paper, we propose a general and flexible framework for learning the predictive structure from the complex relationships within the multi-view multi-way data. Each view of an instance in this framework is represented by a tensor that describes the multi-way interactions of subsets of entities, and different views have some entities in common. Constructing the tensors for each instance may not be realistic for real-world applications in terms of space and computational complexity, and the number of model parameters can grow exponentially, which makes the model prone to overfitting. In order to preserve the structural information of multi-view data without physically constructing the tensors, we introduce structural factorization machines (SFMs) that can learn the consistent representations in the latent feature spaces shared in the multi-view tensors while automatically adjusting the contribution of each view in the predictive model. Furthermore, we provide an efficient method to avoid redundant computing on repeating patterns stemming from the relational structure of the data, such that SFMs can make the same predictions with substantially faster computation.

The contributions of this paper are summarized as follows:

• We introduce a novel multi-tensor framework for mining data from heterogeneous domains, which can explore the high-order correlations underlying multi-view multi-way data in a generic predictive model.

• We develop structural factorization machines (SFMs) tailored for learning the common latent spaces shared in multi-view tensors and automatically adjusting the importance of each view in the predictive model. The complexity of SFMs is linear in the number of features, which makes SFMs suitable for large-scale problems.

• Extensive experiments on eight real-world datasets, with comparisons to existing state-of-the-art factorization models, demonstrate the advantages of SFMs.

The rest of this paper is organized as follows. In Section 2, we briefly review related work on factorization models and multi-view learning. We introduce the preliminary concepts and problem definition in Section 3. In Section 4, we propose the framework for learning from multi-view multi-way data, develop the structural factorization machines (SFMs), and provide an efficient computing method. The experimental results and parameter analysis are reported in Section 5. Section 6 concludes this paper.

2. Related Work

Feature Interactions. Rendle pioneered the concept of feature interactions in Factorization Machines (FM) (rendle2012factorization). Juan et al. presented Field-aware Factorization Machines (FFM) (juan2016field) to allow each feature to interact differently with another feature depending on its field. Novikov et al. proposed Exponential Machines (ExM) (novikov2016exponential) where the weight tensor is represented in a factorized format called Tensor Train. Zhang et al. used FM to initialize the embedding layer in a deep model (zhang2016deep). Qu et al. added a product layer on the top of the embedding layer to increase the model capacity (qu2016product). Other extensions of FM to deep architectures include Neural Factorization Machines (NFM) (he2017neural) and Attentional Factorization Machines (AFM) (xiao2017attentional). In order to effectively model feature interactions, a variety of models have been developed in industry as well. Microsoft studied feature interactions in deep models, including Deep Semantic Similarity Model (DSSM) (huang2013learning), Deep Crossing (shan2016deep) and Deep Embedding Forest (zhu2017deep). They use features as raw as possible without manually crafted combinatorial features, and let deep neural networks take care of the rest. Alibaba proposed a Deep Interest Network (DIN) (zhou2017deep) to learn user embeddings as a function of ad embeddings. Google used deep neural networks to learn from heterogeneous signals for YouTube recommendations (covington2016deep). In addition, Wide & Deep Models (cheng2016wide) were developed for app recommender systems in Google Play where the wide component includes cross features that are good at memorization and the deep component includes embedding layers for generalization. Guo et al. proposed to use FM as the wide component in Wide & Deep with shared embeddings in the deep component (guo2017deepfm). Wang et al. developed the Deep & Cross Network (DCN) to learn explicit cross features of bounded degree (wang2017deep).

Multi-View Learning. Multi-view learning (MVL) is concerned with predicting unknown values by taking multiple views into account. Traditional MVL uses relational features to construct a set of disjoint views, and these uncorrelated views are then used to model a target function to approximate the target concept to be learned (guo2006mining). There is currently a plethora of studies available for MVL. Interested readers are referred to (xu2013survey) for a comprehensive survey of these techniques and applications. The works most related to ours are (cao2016multi; cao2017deepmood; liang2017icdm), which introduced and explored the tensor product operator to integrate different views together in a tensor. Lu et al. further studied the multi-view feature interactions in the context of multi-task learning (lu2017mfm). However, this approach will introduce unexpected noise from the irrelevant feature interactions that can even be exaggerated after combinations, thereby degrading performance as demonstrated in the experiments. Different from conventional MVL approaches, the proposed algorithm can learn the common latent spaces shared in multi-view tensors and automatically adjust the importance of each view in the predictive model.

3. Preliminaries

In this section, we begin with a brief introduction to some related concepts and notation in tensor algebra, and then proceed to formulate the multi-view learning problem we are concerned with.

3.1. Tensor Basics and Notation

A tensor is a mathematical representation of a multi-way array. The order of a tensor is the number of modes (or ways). A zero-order tensor is a scalar, a first-order tensor is a vector, a second-order tensor is a matrix, and a tensor of order three or higher is called a higher-order tensor. An element of a vector $\mathbf{x}$, a matrix $\mathbf{X}$, or a tensor $\mathcal{X}$ is denoted by $x_{i}$, $x_{i,j}$, $x_{i,j,k}$, etc., depending on the number of modes. All vectors are column vectors unless otherwise specified. For an arbitrary matrix $\mathbf{X}\in\mathbb{R}^{I\times J}$, its $i$-th row and $j$-th column vector are denoted by $\mathbf{x}^{i}$ and $\mathbf{x}_{j}$, respectively. Given two matrices $\mathbf{X},\mathbf{Y}\in\mathbb{R}^{I\times J}$, $\mathbf{X}*\mathbf{Y}$ denotes the element-wise (Hadamard) product between $\mathbf{X}$ and $\mathbf{Y}$, defined as the matrix in $\mathbb{R}^{I\times J}$ whose $(i,j)$-th entry is $x_{i,j}y_{i,j}$. An overview of the basic symbols used in this paper can be found in Table 1.

Definition 3.1 (Inner product).

The inner product of two same-sized tensors $\mathcal{X},\mathcal{Y}\in\mathbb{R}^{I_{1}\times I_{2}\times\cdots\times I_{M}}$ is defined as the sum of the products of their entries:

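$$\langle\mathcal{X},\mathcal{Y}\rangle=\sum_{i_{1}=1}^{I_{1}}\sum_{i_{2}=1}^{I_{2}}\cdots\sum_{i_{M}=1}^{I_{M}}x_{i_{1},i_{2},\dots,i_{M}}\,y_{i_{1},i_{2},\dots,i_{M}}.$$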

Definition 3.2 (Outer product).

The outer product of two tensors $\mathcal{X}\in\mathbb{R}^{I_{1}\times I_{2}\times\cdots\times I_{N}}$ and $\mathcal{Y}\in\mathbb{R}^{I_{1}^{\prime}\times I_{2}^{\prime}\times\cdots\times I_{M}^{\prime}}$ is an $(N+M)$th-order tensor denoted by $\mathcal{X}\circ\mathcal{Y}$, whose elements are defined by

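$$(\mathcal{X}\circ\mathcal{Y})_{i_{1},i_{2},\dots,i_{N},i_{1}^{\prime},i_{2}^{\prime},\dots,i_{M}^{\prime}}=x_{i_{1},i_{2},\dots,i_{N}}\,y_{i_{1}^{\prime},i_{2}^{\prime},\dots,i_{M}^{\prime}}$$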

for all values of the indices.

Notice that for rank-one tensors $\mathcal{X}=\mathbf{x}^{(1)}\circ\mathbf{x}^{(2)}\circ\cdots\circ\mathbf{x}^{(M)}$ and $\mathcal{Y}=\mathbf{y}^{(1)}\circ\mathbf{y}^{(2)}\circ\cdots\circ\mathbf{y}^{(M)}$ , it holds that

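$$\langle\mathcal{X},\mathcal{Y}\rangle=\langle\mathbf{x}^{(1)},\mathbf{y}^{(1)}\rangle\langle\mathbf{x}^{(2)},\mathbf{y}^{(2)}\rangle\cdots\langle\mathbf{x}^{(M)},\mathbf{y}^{(M)}\rangle.$$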
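The rank-one identity above is what later lets the model avoid building any tensor explicitly, so a small numerical check may help. The sketch below uses plain Python with made-up mode vectors; the helper names `outer`, `inner`, and `dot` are illustrative and do not come from the paper.

```python
from itertools import product
from math import isclose, prod


def outer(*vectors):
    """Outer product of vectors, stored as {index tuple: entry}."""
    index_ranges = [range(len(v)) for v in vectors]
    return {idx: prod(v[i] for v, i in zip(vectors, idx))
            for idx in product(*index_ranges)}


def inner(X, Y):
    """Tensor inner product: sum of products of matching entries."""
    return sum(X[idx] * Y[idx] for idx in X)


def dot(a, b):
    """Ordinary vector inner product."""
    return sum(p * q for p, q in zip(a, b))


# Toy mode vectors (made-up values) for two rank-one third-order tensors.
xs = [[1.0, 2.0], [0.5, -1.0, 3.0], [2.0, 4.0]]
ys = [[3.0, -1.0], [1.0, 2.0, 1.0], [-1.0, 0.25]]

lhs = inner(outer(*xs), outer(*ys))             # sums over all 2 * 3 * 2 entries
rhs = prod(dot(x, y) for x, y in zip(xs, ys))   # product of three vector inner products
print(lhs, rhs)
```

The left-hand side touches every entry of both tensors, while the right-hand side only needs one vector inner product per mode.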

Definition 3.3 (CP factorization (kolda2009tensor, )).

Given a tensor $\mathcal{X}\in\mathbb{R}^{I_{1}\times I_{2}\times\cdots\times I_{M}}$ and an integer $R$ , the CP factorization is defined by factor matrices $\mathbf{X}^{(m)}\in\mathbb{R}^{I_{m}\times R}$ for $m\in[1:M]$ , respectively, such that

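$$\mathcal{X}=\sum_{r=1}^{R}\mathbf{x}_{r}^{(1)}\circ\mathbf{x}_{r}^{(2)}\circ\cdots\circ\mathbf{x}_{r}^{(M)}=\llbracket\mathbf{X}^{(1)},\mathbf{X}^{(2)},\dots,\mathbf{X}^{(M)}\rrbracket,$$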

where $\mathbf{x}_{r}^{(m)}\in\mathbb{R}^{I_{m}}$ is the $r$ -th column of the factor matrix $\mathbf{X}^{(m)}$ , and $\llbracket\cdot\rrbracket$ is used for shorthand notation of the sum of rank-one tensors.
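To make Definition 3.3 concrete, the following sketch rebuilds a small tensor from its factor matrices as a sum of $R$ rank-one terms. It is only an illustration in plain Python: the function name `cp_to_tensor` and the factor values are assumptions, and the tensor is stored as a dictionary for readability rather than efficiency.

```python
from itertools import product
from math import prod


def cp_to_tensor(factors):
    """Rebuild a tensor from CP factor matrices.

    factors[m] is an I_m x R matrix given as a list of rows; the result
    maps each index tuple (i_1, ..., i_M) to sum_r prod_m factors[m][i_m][r].
    """
    rank = len(factors[0][0])
    index_ranges = [range(len(F)) for F in factors]
    return {idx: sum(prod(F[i][r] for F, i in zip(factors, idx))
                     for r in range(rank))
            for idx in product(*index_ranges)}


# Made-up factor matrices with R = 2 for a 2 x 3 x 1 tensor.
X1 = [[1.0, 0.0], [2.0, 1.0]]
X2 = [[1.0, 1.0], [0.0, 2.0], [3.0, 0.0]]
X3 = [[1.0, 2.0]]

T = cp_to_tensor([X1, X2, X3])
print(T[(1, 1, 0)])  # 2*0*1 + 1*2*2
```

Each entry needs only the $R$ columns of the factor matrices, which is why storing the factors (here $R\,(2+3+1)$ numbers) can be far cheaper than storing the full tensor when the mode sizes are large.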

3.2. Problem Formulation

Our problem is different from conventional multi-view learning approaches where multiple views of data are assumed independent and disjoint, and each view is described by a vector. We formulate the multi-view learning problem using coupled analysis of multi-view features in the form of multiple tensors.

Suppose that the problem includes $V$ views where each view consists of a collection of subsets of entities (such as person, company, location, product) and different views have some entities in common. We denote a view as a tuple $(\mathbf{x}^{(1)},\mathbf{x}^{(2)},\cdots,\mathbf{x}^{(M)}),M\geq 2$, where $\mathbf{x}^{(m)}\in\mathbb{R}^{I_{m}}$ is a feature vector associated with the entity $m$. Inspired by (cao2016multi), we construct a tensor representation for each view over its entities by

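$$\tilde{\mathcal{X}}=\tilde{\mathbf{x}}^{(1)}\circ\tilde{\mathbf{x}}^{(2)}\circ\cdots\circ\tilde{\mathbf{x}}^{(M)}\in\mathbb{R}^{(1+I_{1})\times\cdots\times(1+I_{M})},$$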

where $\tilde{\mathbf{x}}^{(m)}=[1;\mathbf{x}^{(m)}]\in\mathbb{R}^{1+I_{m}}$ and $\circ$ is the outer product operator. In this manner, the full-order interactions between entities are embedded within the tensor structure, which not only provides a unified and compact representation for each view but also facilitates efficient design methods. (Full-order interactions range from the first-order interactions, i.e., contributions of single entity features, to the highest-order interactions, i.e., contributions of the outer product of features from all entities.) Fig. 1 shows an example of two structural views, where the first view consists of the full-order interactions among the first three modes (e.g., review text, item ID, and user ID), and the second view consists of the full-order interactions among the last two modes (e.g., user ID and friend IDs).
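A tiny example may clarify why prepending the constant 1 embeds interactions of every order. The sketch below builds the view tensor for a hypothetical two-mode view and labels each entry by how many modes contribute a real feature; the function name `view_tensor` and the feature values are assumptions made for illustration.

```python
from itertools import product
from math import prod


def view_tensor(*features):
    """Outer product of the augmented vectors [1; x^(m)], as {index: entry}."""
    augmented = [[1.0] + list(x) for x in features]
    index_ranges = [range(len(v)) for v in augmented]
    return {idx: prod(v[i] for v, i in zip(augmented, idx))
            for idx in product(*index_ranges)}


# Hypothetical two-mode view: x1 has two features, x2 has one.
T = view_tensor([2.0, 3.0], [5.0])

for idx, value in sorted(T.items()):
    order = sum(i > 0 for i in idx)   # how many modes contribute a real feature
    print(idx, value, f"order-{order} term")
```

Index 0 in a mode selects the constant 1, so the entry at `(0, 0)` is a bias term, entries with a single nonzero index are single-mode features, and the remaining entries are pairwise products.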

After generating the tensor representation for each view, we define the multi-view learning problem as follows. We are given a training set $\mathfrak{D}=\big\{\big(\big\{\tilde{\mathcal{X}}^{(1)}_{n},\tilde{\mathcal{X}}^{(2)}_{n},\cdots,\tilde{\mathcal{X}}^{(V)}_{n}\big\},~y_{n}\big)~|~n\in[1:N]\big\}$, where $\tilde{\mathcal{X}}^{(v)}_{n}\in\mathbb{R}^{(1+I_{1})\times\cdots\times(1+I_{M_{v}})}$ is the tensor representation in the $v$-th view for the $n$-th instance, $y_{n}$ is the response of the $n$-th instance, $M_{v}$ is the number of constitutive modes in the $v$-th view, and $N$ is the number of labeled instances. We assume that different views have common entities, so the resulting tensors share common modes, e.g., the third mode in Fig. 1. As we are concerned with predicting unknown values of multiple coupled tensors, our goal is to leverage the relational information from all the views to help predict the unlabeled instances, as well as to use the complementary information among different views to improve the performance. Specifically, we are interested in finding a predictive function $f:\mathfrak{X}^{(1)}\times\mathfrak{X}^{(2)}\times\cdots\times\mathfrak{X}^{(V)}\rightarrow\mathfrak{Y}$ that minimizes the expected loss, where $\mathfrak{X}^{(v)},v\in[1:V]$ is the input space in the $v$-th view and $\mathfrak{Y}$ is the output space.

4. Methodology

In this section, we first discuss how to design the predictive models for learning from multiple coupled tensors. We then derive structural factorization machines (SFMs) that can learn the common latent spaces shared in multi-view coupled tensors and automatically adjust the importance of each view in the predictive model.

4.1. Predictive Models

Without loss of generality, we take two views as an example to introduce our basic design of the predictive models. Specifically, we consider coupled analysis of a third-order tensor and a matrix with one mode in common, as shown in Fig. 1. Consider an input instance $\big(\big\{\tilde{\mathcal{X}}^{(1)},\tilde{\mathbf{X}}^{(2)}\big\},~y\big)$, where $\tilde{\mathcal{X}}^{(1)}=\tilde{\mathbf{x}}^{(1)}\circ\tilde{\mathbf{x}}^{(2)}\circ\tilde{\mathbf{x}}^{(3)}\in\mathbb{R}^{(1+I)\times(1+J)\times(1+K)}$ and $\tilde{\mathbf{X}}^{(2)}=\tilde{\mathbf{x}}^{(3)}\circ\tilde{\mathbf{x}}^{(4)}\in\mathbb{R}^{(1+K)\times(1+L)}$. An intuitive solution is to build the following multiple linear model:

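$$f\big(\{\tilde{\mathcal{X}}^{(1)},\tilde{\mathbf{X}}^{(2)}\}\big)=\langle\tilde{\mathcal{W}}^{(1)},\tilde{\mathcal{X}}^{(1)}\rangle+\langle\tilde{\mathbf{W}}^{(2)},\tilde{\mathbf{X}}^{(2)}\rangle\tag{5}$$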

where $\tilde{\mathcal{W}}^{(1)}\in\mathbb{R}^{(1+I)\times(1+J)\times(1+K)}$ and $\tilde{\mathbf{W}}^{(2)}\in\mathbb{R}^{(1+K)\times(1+L)}$ are the weights for each view to be learned.

However, this model does not take into account the relations and differences between the two views. In order to incorporate the relations between the two views and also discriminate the importance of each view, we introduce an indicator vector $\mathbf{e}_{v}\in\mathbb{R}^{V}$ for each view $v$ as

$$\mathbf{e}_{v}=[\underbrace{0,\dots,0}_{v-1},1,0,\dots,0]^{\mathrm{T}},$$

and transform the predictive model in Eq. (5) into

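$$f\big(\{\tilde{\mathcal{X}}^{(1)},\tilde{\mathbf{X}}^{(2)}\}\big)=\langle\hat{\mathcal{W}}^{(1)},\tilde{\mathcal{X}}^{(1)}\circ\mathbf{e}_{1}\rangle+\langle\hat{\mathcal{W}}^{(2)},\tilde{\mathbf{X}}^{(2)}\circ\mathbf{e}_{2}\rangle,\tag{6}$$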

where $\hat{\mathcal{W}}^{(1)}\in\mathbb{R}^{(1+I)\times(1+J)\times(1+K)\times 2}$ and $\hat{\mathcal{W}}^{(2)}\in\mathbb{R}^{(1+K)\times(1+L)\times 2}$ .

Directly learning the weight tensors $\hat{\mathcal{W}}$s leads to two drawbacks. First, the weight parameters are learned independently for different modes and different views. When the feature interactions rarely (or even never) appear during training, it is unlikely that the associated parameters will be learned appropriately. Second, the number of parameters in Eq. (6) is exponential in the number of features, which can make the model prone to overfitting and ineffective on sparse data. Here, we assume that each weight tensor has a low-rank approximation, and $\hat{\mathcal{W}}^{(1)}$ and $\hat{\mathcal{W}}^{(2)}$ can be decomposed by CP factorization as

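$$\hat{\mathcal{W}}^{(1)}=\llbracket\hat{\bm{\Theta}}^{(1,1)},\hat{\bm{\Theta}}^{(1,2)},\hat{\bm{\Theta}}^{(1,3)},\bm{\Phi}\rrbracket=\llbracket[\mathbf{b}^{(1,1)};\bm{\Theta}^{(1)}],[\mathbf{b}^{(1,2)};\bm{\Theta}^{(2)}],[\mathbf{b}^{(1,3)};\bm{\Theta}^{(3)}],\bm{\Phi}\rrbracket,$$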

and

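$$\hat{\mathcal{W}}^{(2)}=\llbracket\hat{\bm{\Theta}}^{(2,3)},\hat{\bm{\Theta}}^{(2,4)},\bm{\Phi}\rrbracket=\llbracket[\mathbf{b}^{(2,3)};\bm{\Theta}^{(3)}],[\mathbf{b}^{(2,4)};\bm{\Theta}^{(4)}],\bm{\Phi}\rrbracket,$$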

where $\bm{\Theta}^{(m)}\in\mathbb{R}^{I_{m}\times R}$ is the factor matrix for the features in the $m$ -th mode. It is worth noting that $\mathbf{\Theta}^{(3)}$ is shared in the two views. $\bm{\Phi}\in\mathbb{R}^{2\times R}$ is the factor matrix for the view indicator, and $\mathbf{b}^{(v,m)}\in\mathbb{R}^{1\times R}$ , which is always associated with the constant one in $\tilde{\mathbf{x}}^{(m)}=[1;\mathbf{x}^{(m)}]$ , represents the bias factors of the $m$ -th mode in the $v$ -th view. Through $\mathbf{b}^{(v,m)}$ , the lower-order interactions (the interactions excluding the features from the $m$ -th mode) in the $v$ -th view are explored in the predictive function.

Then we can transform Eq. (6) into

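$$\begin{aligned}&\langle\hat{\mathcal{W}}^{(1)},\tilde{\mathcal{X}}^{(1)}\circ\mathbf{e}_{1}\rangle+\langle\hat{\mathcal{W}}^{(2)},\tilde{\mathbf{X}}^{(2)}\circ\mathbf{e}_{2}\rangle\\ =&\sum_{r=1}^{R}\langle\hat{\bm{\theta}}_{r}^{(1,1)}\circ\hat{\bm{\theta}}_{r}^{(1,2)}\circ\hat{\bm{\theta}}_{r}^{(1,3)}\circ\bm{\phi}_{r},\tilde{\mathbf{x}}^{(1)}\circ\tilde{\mathbf{x}}^{(2)}\circ\tilde{\mathbf{x}}^{(3)}\circ\mathbf{e}_{1}\rangle+\sum_{r=1}^{R}\langle\hat{\bm{\theta}}_{r}^{(2,3)}\circ\hat{\bm{\theta}}_{r}^{(2,4)}\circ\bm{\phi}_{r},\tilde{\mathbf{x}}^{(3)}\circ\tilde{\mathbf{x}}^{(4)}\circ\mathbf{e}_{2}\rangle\\ =&\bm{\phi}^{1}\left(\prod_{m=1}^{3}\ast\left(\tilde{\mathbf{x}}^{(m)^{\mathrm{T}}}\hat{\bm{\Theta}}^{(1,m)}\right)\right)^{\mathrm{T}}+\bm{\phi}^{2}\left(\prod_{m=3}^{4}\ast\left(\tilde{\mathbf{x}}^{(m)^{\mathrm{T}}}\hat{\bm{\Theta}}^{(2,m)}\right)\right)^{\mathrm{T}}\\ =&\bm{\phi}^{1}\left(\prod_{m=1}^{3}\ast\left(\mathbf{x}^{(m)^{\mathrm{T}}}\bm{\Theta}^{(m)}+\mathbf{b}^{(1,m)}\right)\right)^{\mathrm{T}}+\bm{\phi}^{2}\left(\prod_{m=3}^{4}\ast\left(\mathbf{x}^{(m)^{\mathrm{T}}}\bm{\Theta}^{(m)}+\mathbf{b}^{(2,m)}\right)\right)^{\mathrm{T}},\end{aligned}$$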

where $\ast$ is the Hadamard (elementwise) product and $\bm{\phi}^{v}\in\mathbb{R}^{1\times R}$ is the $v$ -th row of the factor matrix $\bm{\Phi}$ .
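The step from explicit weight tensors to the Hadamard-product form can be checked numerically on the two-view example. The sketch below, in plain Python, builds $\hat{\mathcal{W}}^{(v)}$ and $\tilde{\mathcal{X}}^{(v)}\circ\mathbf{e}_{v}$ in full and compares their inner product with the factorized expression. All parameter and feature values are made up, and the helper names (`explicit`, `factorized`, and so on) are illustrative assumptions rather than the authors' implementation.

```python
from itertools import product
from math import isclose, prod


def outer(*vectors):
    ranges = [range(len(v)) for v in vectors]
    return {idx: prod(v[i] for v, i in zip(vectors, idx))
            for idx in product(*ranges)}


def cp_to_tensor(factors):
    rank = len(factors[0][0])
    ranges = [range(len(F)) for F in factors]
    return {idx: sum(prod(F[i][r] for F, i in zip(factors, idx))
                     for r in range(rank))
            for idx in product(*ranges)}


def inner(X, Y):
    return sum(X[idx] * Y[idx] for idx in X)


# Toy setup: view 1 uses modes 1-3, view 2 uses modes 3-4, rank R = 2.
views = {1: [1, 2, 3], 2: [3, 4]}
x = {1: [1.0, 2.0], 2: [0.5], 3: [-1.0, 3.0], 4: [2.0]}
Theta = {1: [[0.1, 0.2], [0.3, -0.1]], 2: [[0.5, 0.4]],
         3: [[-0.2, 0.1], [0.3, 0.6]], 4: [[0.7, -0.3]]}
b = {(1, 1): [0.1, 0.0], (1, 2): [0.2, -0.1], (1, 3): [0.0, 0.3],
     (2, 3): [0.4, 0.1], (2, 4): [-0.2, 0.5]}
Phi = [[1.0, 0.5], [-0.5, 2.0]]          # row v-1 holds phi^v
e = {1: [1.0, 0.0], 2: [0.0, 1.0]}       # view indicator vectors


def explicit(v):
    """<W_hat^(v), X_tilde^(v) o e_v> with both tensors built in full."""
    hat_factors = [[b[(v, m)]] + Theta[m] for m in views[v]]   # [b; Theta]
    W = cp_to_tensor(hat_factors + [Phi])
    X = outer(*[[1.0] + x[m] for m in views[v]], e[v])
    return inner(W, X)


def factorized(v):
    """phi^v times the Hadamard product of (x^T Theta + b) over the view's modes."""
    rank = len(Phi[0])
    pi = [1.0] * rank
    for m in views[v]:
        h = [sum(xi * row[r] for xi, row in zip(x[m], Theta[m]))
             for r in range(rank)]
        pi = [p * (hr + br) for p, hr, br in zip(pi, h, b[(v, m)])]
    return sum(phi * p for phi, p in zip(Phi[v - 1], pi))


for v in views:
    print(v, explicit(v), factorized(v))
```

Note that `Theta[3]` is used by both views, which mirrors the shared factor matrix $\bm{\Theta}^{(3)}$, while each view keeps its own bias vectors.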

For convenience, we let $\mathbf{h}^{(m)}=\bm{\Theta}^{(m)^{\mathrm{T}}}\mathbf{x}^{(m)}$, let $S_{M}(v)$ denote the set of modes in the $v$-th view, and define $\bm{\pi}^{(v)}=\prod\limits_{m\in S_{M}(v)}\ast\left(\mathbf{h}^{(m)}+\mathbf{b}^{(v,m)^{\mathrm{T}}}\right)$ and $\bm{\pi}^{(v,-m)}=\prod\limits_{m^{\prime}\in S_{M}(v),m^{\prime}\neq m}\ast\left(\mathbf{h}^{(m^{\prime})}+\mathbf{b}^{(v,m^{\prime})^{\mathrm{T}}}\right)$. The predictive model for the general case is given as follows:

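$$f\big(\{\tilde{\mathcal{X}}^{(v)}\}\big)=\sum_{v=1}^{V}\langle\hat{\mathcal{W}}^{(v)},\tilde{\mathcal{X}}^{(v)}\circ\mathbf{e}_{v}\rangle=\sum_{v=1}^{V}\bm{\phi}^{v}\prod_{m\in S_{M}(v)}\ast\left(\mathbf{x}^{(m)^{\mathrm{T}}}\bm{\Theta}^{(m)}+\mathbf{b}^{(v,m)}\right)^{\mathrm{T}}=\sum_{v=1}^{V}\bm{\phi}^{v}\prod_{m\in S_{M}(v)}\ast\left(\mathbf{h}^{(m)}+\mathbf{b}^{(v,m)^{\mathrm{T}}}\right)$$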

A graphical illustration of the proposed model is shown in Fig. 2. We call this model structural factorization machines (SFMs). The parameters are jointly factorized, which benefits parameter estimation under sparsity, since dependencies exist whenever interactions share the same features. Therefore, the model parameters can be learned effectively without direct observations of such interactions, especially in highly sparse data. More importantly, after factorizing the weight tensors $\hat{\mathcal{W}}$s, there is no need to construct the input tensor physically. Furthermore, the model complexity is linear in the number of original features. In particular, the model complexity is $O(R(V+I+\sum_{v}M_{v}))$, where $I=\sum_{m}I_{m}$ is the total number of features over all modes and $M_{v}$ is the number of modes in the $v$-th view.
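To give a feel for the gap between the two parameterizations, the following sketch counts the entries of the unfactorized weight tensors and the parameters of the factorized model for one hypothetical configuration. The mode sizes and rank are made-up numbers, not statistics from the paper's datasets, and the count assumes every mode appears in at least one view.

```python
from math import prod


def explicit_weight_count(mode_sizes, views):
    """Entries in the unfactorized weight tensors W_hat^(v), one tensor per view."""
    V = len(views)
    return sum(V * prod(1 + mode_sizes[m] for m in modes)
               for modes in views.values())


def sfm_parameter_count(mode_sizes, views, rank):
    """R * (V + I + sum_v M_v): Phi, the Theta^(m), and the bias vectors b^(v,m)."""
    V = len(views)
    total_features = sum(mode_sizes.values())                    # I = sum_m I_m
    total_biases = sum(len(modes) for modes in views.values())   # sum_v M_v
    return rank * (V + total_features + total_biases)


# Hypothetical sizes, not taken from the paper's datasets.
mode_sizes = {1: 1000, 2: 500, 3: 2000, 4: 2000}
views = {1: [1, 2, 3], 2: [3, 4]}

print(explicit_weight_count(mode_sizes, views))
print(sfm_parameter_count(mode_sizes, views, rank=20))
```

The first count multiplies the mode sizes within each view, while the second only adds them, which is the sense in which the factorized model is linear in the number of features.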

4.2. Learning Structural Factorization Machines

Following the traditional supervised learning framework, we propose to learn the model parameters by minimizing the following regularized empirical risk:

[TABLE]

where $\ell$ is a prescribed loss function, $\Omega$ is the regularizer encoding the prior knowledge of $\{\mathbf{\Theta}^{(m)}\}$ and $\mathbf{\Phi}$ , and $\lambda\geq 0$ is the regularization parameter that controls the trade-off between the empirical loss and the prior knowledge.

The partial derivative of $\mathcal{R}$ w.r.t. $\mathbf{\Theta}^{(m)}$ is given by

[TABLE]

where $\frac{\partial\mathcal{L}}{\partial f}=\frac{1}{N}\left[\begin{array}[]{c}\frac{\partial\ell_{1}}{\partial f},\cdots,\frac{\partial\ell_{N}}{\partial f}\end{array}\right]^{\mathrm{T}}\in\mathbb{R}^{N}$ .

For convenience, we let $S_{V}(m)$ denote the set of views that contains the $m$ -th mode, $\mathbf{X}^{(m)}=[\mathbf{x}_{1}^{(m)},\cdots,\mathbf{x}_{N}^{(m)}]$ , $\bm{\Pi}^{(v)}=[\bm{\pi}_{1}^{(v)},\cdots,\bm{\pi}_{N}^{(v)}]^{\mathrm{T}}$ and $\bm{\Pi}^{(v,-m)}=[\bm{\pi}_{1}^{(v,-m)},\cdots,\bm{\pi}_{N}^{(v,-m)}]^{\mathrm{T}}$ . We then have that

[TABLE]
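The chain-rule structure behind the gradient with respect to $\mathbf{\Theta}^{(m)}$ can be checked numerically. The sketch below is an illustration under stated assumptions rather than the paper's code: it uses squared loss without the regularizer, hypothetical mode sizes and views, and helper names (`project`, `predict`, `grad_theta`) introduced only for this example. Only the views in $S_{V}(m)$ contribute, each through $\bm{\Pi}^{(v,-m)}$ scaled by $\bm{\phi}^{v}$.

```python
import numpy as np

def project(X, thetas):
    # H[m] has shape (N, R); row n is h_n^(m) = Theta^(m)^T x_n^(m)
    return {m: X[m].T @ thetas[m] for m in X}

def predict(X, thetas, biases, phi, view_modes):
    H = project(X, thetas)
    N, R = next(iter(H.values())).shape
    f = np.zeros(N)
    for v, modes in enumerate(view_modes):
        Pi_v = np.ones((N, R))
        for k in modes:
            Pi_v = Pi_v * (H[k] + biases[(v, k)])
        f += Pi_v @ phi[v]
    return f

def mse(X, y, thetas, biases, phi, view_modes):
    return float(np.mean((predict(X, thetas, biases, phi, view_modes) - y) ** 2))

def grad_theta(X, y, thetas, biases, phi, view_modes, m):
    """Gradient of the mean squared loss w.r.t. Theta^(m), derived by the chain rule."""
    H = project(X, thetas)
    N, R = H[m].shape
    dL_df = 2.0 * (predict(X, thetas, biases, phi, view_modes) - y) / N
    G = np.zeros((N, R))
    for v, modes in enumerate(view_modes):
        if m not in modes:               # only views in S_V(m) contribute
            continue
        Pi_minus = np.ones((N, R))       # Pi^(v,-m): product over the other modes
        for k in modes:
            if k != m:
                Pi_minus = Pi_minus * (H[k] + biases[(v, k)])
        G += Pi_minus * phi[v]
    return X[m] @ (dL_df[:, None] * G)   # (I_m, N) @ (N, R) -> (I_m, R)

rng = np.random.default_rng(1)
N, R = 8, 3
dims = {1: 4, 2: 5, 3: 2}                # hypothetical mode sizes
view_modes = [(1, 2), (2, 3)]            # hypothetical views; mode 2 is shared
X = {m: rng.random((I, N)) for m, I in dims.items()}
y = rng.random(N)
thetas = {m: rng.normal(size=(I, R)) for m, I in dims.items()}
biases = {(v, k): rng.normal(size=R) for v, ms in enumerate(view_modes) for k in ms}
phi = rng.normal(size=(len(view_modes), R))

analytic = grad_theta(X, y, thetas, biases, phi, view_modes, m=2)

# Central finite differences as an independent check
numeric = np.zeros_like(analytic)
eps = 1e-6
for idx in np.ndindex(*analytic.shape):
    old = thetas[2][idx]
    thetas[2][idx] = old + eps
    up = mse(X, y, thetas, biases, phi, view_modes)
    thetas[2][idx] = old - eps
    down = mse(X, y, thetas, biases, phi, view_modes)
    thetas[2][idx] = old
    numeric[idx] = (up - down) / (2 * eps)
```

Because the prediction is linear in each $\mathbf{\Theta}^{(m)}$ when the other factors are held fixed, the analytic and finite-difference gradients should agree up to floating-point error.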

Similarly, the partial derivative of $\mathcal{R}$ w.r.t. $\mathbf{b}^{(v,m)}$ is given by

[TABLE]

The partial derivative of $\mathcal{R}$ w.r.t. $\mathbf{\Phi}$ is given by

[TABLE]

Finally, the gradient of $\mathcal{R}$ can be formed by vectorizing the partial derivatives with respect to each factor matrix and concatenating them all, i.e.,

[TABLE]

Once we have the function $\mathcal{R}$ and its gradient $\nabla\mathcal{R}$ , we can use any gradient-based optimization algorithm to compute the factor matrices. For the results presented in this paper, we use the Adaptive Moment Estimation (Adam) optimization algorithm (kingma2014adam, ) for parameter updates. Adam is an adaptive variant of gradient descent that maintains individual learning rates for different parameters from estimates of the first and second moments of the gradient. It combines the best properties of AdaGrad (duchi2011adaptive, ), which works well with sparse gradients, and RMSProp (hinton2012rmsprop, ), which works well in on-line and non-stationary settings. Readers can refer to (kingma2014adam, ) for details of the Adam optimization algorithm.
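For readers unfamiliar with Adam, a minimal NumPy sketch of one update step is given below. It follows the standard formulation with the commonly used defaults $\beta_{1}=0.9$, $\beta_{2}=0.999$ and $\epsilon=10^{-8}$; the function name `adam_step`, the state dictionary, and the toy quadratic objective are assumptions made for illustration, not the training code used in the experiments.

```python
import numpy as np

def adam_step(param, grad, state, lr=0.01, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update; `state` holds the step count and both moment estimates."""
    state["t"] += 1
    state["m"] = beta1 * state["m"] + (1 - beta1) * grad        # first moment (mean)
    state["v"] = beta2 * state["v"] + (1 - beta2) * grad ** 2   # second moment (uncentered)
    m_hat = state["m"] / (1 - beta1 ** state["t"])              # bias correction
    v_hat = state["v"] / (1 - beta2 ** state["t"])
    # Each coordinate is scaled by its own second-moment estimate
    return param - lr * m_hat / (np.sqrt(v_hat) + eps)

# Toy objective 0.5 * ||theta - target||^2, whose gradient is theta - target
target = np.array([1.0, -2.0, 3.0])
theta = np.zeros(3)
state = {"t": 0, "m": np.zeros(3), "v": np.zeros(3)}
for _ in range(2000):
    grad = theta - target
    theta = adam_step(theta, grad, state, lr=0.05)
```

In an SFM setting the same step would be applied to each factor matrix with its own moment buffers, using the partial derivatives of $\mathcal{R}$ as `grad`.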

4.3. Efficient Computing with Relational Structures

In relational domains, we can often observe that feature vectors of the same entity repeatedly appear in the plain formatted feature matrix $\mathbf{X}$ , where $\mathbf{X}=[\mathbf{X}^{(1)};\cdots;\mathbf{X}^{(M)}]\in\mathbb{R}^{I\times N}$ and $\mathbf{X}^{(m)}\in\mathbb{R}^{I_{m}\times N}$ is the feature matrix in the $m$ -th mode. Consider Fig. 3(a) as an example, where the parts highlighted in yellow in the fourth mode (which represents the friends of the user) repeatedly appear in the first three columns. Clearly, these repeating patterns stem from the relational structure of the same entity.

In the following, we show how the proposed SFM method can make use of the relational structure of each mode, so that learning and prediction can be scaled to predictor variables generated from relational data involving relations of high cardinality. We adopt the idea from (rendle2013scaling, ) to avoid redundant computation on repeating patterns over a set of feature vectors.

Let $\mathcal{B}=\{(\mathbf{X}^{B^{(m)}},\psi^{B^{(m)}})\}_{m=1}^{M}$ be the set of relational structures, where $\mathbf{X}^{B^{(m)}}\in\mathbb{R}^{I_{m}\times N_{m}}$ denotes the relational matrix of the $m$ -th mode and $\psi^{B^{(m)}}:\{1,\cdots,N\}\rightarrow\{1,\cdots,N_{m}\}$ denotes the mapping from columns in the feature matrix $\mathbf{X}$ to columns within $\mathbf{X}^{B^{(m)}}$ . To shorten notation, the index $B$ is dropped from the mapping $\psi^{B}$ whenever it is clear which block the mapping belongs to. From $\mathcal{B}$ , one can reconstruct $\mathbf{X}$ by concatenating the corresponding columns of the relational matrices using the mappings. For instance, the feature vector $\mathbf{x}_{n}$ of the $n$ -th case in the plain feature matrix $\mathbf{X}$ is represented as $\mathbf{x}_{n}=[\mathbf{x}_{\psi(n)}^{(1)};\cdots;\mathbf{x}_{\psi(n)}^{(M)}]$ . Fig. 3(b) shows an example of how the feature matrix can be represented in relational structures. Let $N_{z}(\mathbf{A})$ denote the number of non-zeros in a matrix $\mathbf{A}$ . The space required for using relational structures to represent the input data is $|\mathcal{B}|=NM+\sum_{m}N_{z}(\mathbf{X}^{B^{(m)}})$ , which is much smaller than $N_{z}(\mathbf{X})$ if there are repeating patterns in the feature matrix $\mathbf{X}$ .
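The representation can be illustrated with a small synthetic example. The sketch below builds two hypothetical blocks (a user block whose columns hold $\ell_{1}$-normalized friend indicators and a one-hot item block), reconstructs the plain matrix $\mathbf{X}$ through the mappings, and compares the two storage costs. All sizes and variable names are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 1000                                              # number of samples
n_users, n_friend_feats, friends_per_user = 20, 200, 50
n_items = 100

# Block 1: one column per distinct user, with l1-normalized friend indicators
XB_user = np.zeros((n_friend_feats, n_users))
for u in range(n_users):
    friends = rng.choice(n_friend_feats, size=friends_per_user, replace=False)
    XB_user[friends, u] = 1.0 / friends_per_user
psi_user = rng.integers(0, n_users, size=N)           # psi: sample index -> block column

# Block 2: one-hot items
XB_item = np.eye(n_items)
psi_item = rng.integers(0, n_items, size=N)

blocks = [(XB_user, psi_user), (XB_item, psi_item)]

def reconstruct(blocks):
    # x_n = [x^(1)_{psi(n)}; ...; x^(M)_{psi(n)}], stacked for all n
    return np.vstack([XB[:, psi] for XB, psi in blocks])

X = reconstruct(blocks)
nz_plain = int(np.count_nonzero(X))                   # N_z(X)
# |B| = N * M + sum_m N_z(X^{B(m)})
size_blocks = N * len(blocks) + sum(int(np.count_nonzero(XB)) for XB, _ in blocks)
```

With these sizes the plain matrix holds 51,000 non-zeros, while the block representation needs 3,100 entries, because each user's friend list is stored once instead of once per sample.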

Now we can rewrite the predictive model in Eq. (8) as follows

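$$f(\mathbf{x}_{n})=\sum_{v=1}^{V}\bm{\phi}^{v}\prod\limits_{m\in S_{M}(v)}\ast\left(\mathbf{h}^{B^{(m)}}_{\psi(n)}+\mathbf{b}^{(v,m)^{\mathrm{T}}}\right)\qquad(23)$$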

with the caches $\mathbf{H}^{B^{(m)}}=[\mathbf{h}^{B^{(m)}}_{1},\cdots,\mathbf{h}^{B^{(m)}}_{N_{m}}]$ for each mode, where $\mathbf{h}^{B^{(m)}}_{j}=\bm{\Theta}^{(m)^{\mathrm{T}}}\mathbf{x}_{j}^{B^{(m)}},~{}\forall j\in[1:N_{m}]$ .

This directly shows how $N$ samples can be efficiently predicted: (i) compute $\mathbf{H}^{B^{(m)}}$ in $O(RN_{z}(\mathbf{X}^{B^{(m)}}))$ for each mode, (ii) compute $N$ predictions with Eq. (23) using caches in $O(RN(V+\sum_{v}M_{v}))$ . With the help of relational structures, SFMs can learn the same parameters and make the same predictions but with a much lower runtime complexity.
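The two-step procedure can be sketched as follows: project every distinct block column once, then gather the cached projections through the mappings. This is an illustrative NumPy sketch with hypothetical sizes, views, and names (`predict_from_projections`, `HB`), and it compares the cached result with a reference that expands the plain feature matrix first.

```python
import numpy as np

rng = np.random.default_rng(0)
N, R = 500, 8
view_modes = [(0, 1), (1, 2)]                  # hypothetical views over three modes
sizes = [(30, 10), (40, 25), (20, 5)]          # hypothetical (I_m, N_m) per mode
M = len(sizes)
XB = [rng.random((I, Nm)) for I, Nm in sizes]              # relational matrices
psi = [rng.integers(0, Nm, size=N) for _, Nm in sizes]     # sample -> block column
theta = [rng.normal(scale=0.1, size=(I, R)) for I, _ in sizes]
bias = {(v, m): rng.normal(scale=0.1, size=R)
        for v, ms in enumerate(view_modes) for m in ms}
phi = rng.normal(size=(len(view_modes), R))

def predict_from_projections(H):
    # H[m]: (N, R) array whose n-th row is the projection h^(m) of sample n
    f = np.zeros(N)
    for v, modes in enumerate(view_modes):
        Pi_v = np.ones((N, R))
        for m in modes:
            Pi_v = Pi_v * (H[m] + bias[(v, m)])
        f += Pi_v @ phi[v]
    return f

# (i) cache: project each distinct column once
HB = [XB[m].T @ theta[m] for m in range(M)]                # each (N_m, R)
# (ii) gather cached rows through psi and combine them per view
f_cached = predict_from_projections([HB[m][psi[m]] for m in range(M)])

# Reference: expand to the plain feature matrix and project every sample
f_plain = predict_from_projections([XB[m][:, psi[m]].T @ theta[m] for m in range(M)])
```

Both paths return the same predictions; the cached path simply avoids repeating the projection for samples that share a block column.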

5. Experiments

5.1. Datasets

To evaluate the ability and applicability of the proposed SFMs, we include a spectrum of large datasets from different domains. The statistics for each dataset are summarized in Table 2, the schema of the structural views in each dataset is presented in Fig. 4, and the details are as follows:

Amazon (http://jmcauley.ucsd.edu/data/amazon/): The first group of datasets is from Amazon.com, recently introduced by (mcauley2015inferring, ). This is among the largest datasets available that include review texts and metadata of items. Each top-level category of products on Amazon.com has been constructed as an independent dataset in (mcauley2015inferring, ). In this paper, we take a variety of large categories as listed in Table 2.

Each sample in these datasets has five modes, i.e., users, items, review texts, categories, and linkage. The user mode and item mode are represented by one-hot encoding. The $\ell_{2}$ -normalized TF-IDF vector representation of the review text of the item given by the user is used as the text mode (stemming, lemmatization, removing stop-words and words with frequency less than 100 times, etc., are handled beforehand). The category mode and linkage mode consist of all the categories and all the co-purchasing items of the item, which might be from other categories. The last two modes are $\ell_{1}$ -normalized.

Yelp (https://www.yelp.com/dataset-challenge): It is a large-scale dataset consisting of venue reviews. Each sample in this dataset contains five modes, i.e., users, venues, friends, categories and cities. The user mode and venue mode are represented by one-hot encoding. The friend mode consists of the friends’ ids of users. The category mode and city mode consist of all the categories and the city of the venue. The last three modes are $\ell_{1}$ -normalized.

BookCrossing (BX) (http://www2.informatik.uni-freiburg.de/~cziegler/BX/): It is a book review dataset collected from the Book-Crossing community. Each sample in this dataset contains five modes, i.e., users, books, countries, ages and authors. The ages are split into eight bins as in (harper2016movielens, ). The country mode and age mode consist of the corresponding meta information of the user. The author mode represents the authors of the book. All the modes are represented by one-hot encoding.

The values of samples range within [1:5] in Amazon and Yelp datasets, and range within [1:10] in BX dataset.

5.2. Comparison Methods

In order to demonstrate the effectiveness of the proposed SFMs, we compare against a series of state-of-the-art methods.

Matrix Factorization (MF) is used to validate that meta information is helpful for improving prediction performance. We use the LIBMF implementation (chin2016libmf, ) for comparison in the experiment.

Factorization Machine (FM) (rendle2012factorization, ) is the state-of-the-art method in recommender systems. We compare with its higher-order extension (blondel2016higher, ) with up to second-order, and third-order feature interactions, and denote them as FM-2 and FM-3.

Polynomial Network (PolyNet) (livni2014computational, ) is a recently proposed method that utilizes polynomial kernel on all features. We compare the augmented PolyNet (which adds a constant one to the feature vector (blondelIFU16, )) with up to the second-order, and third-order kernel and denote them as PolyNet-2 and PolyNet-3.

Multi-View Machine (MVM) (cao2016multi, ) is a tensor factorization based method that explores the latent representation embedded in the full-order interactions among all the modes.

Structural Factorization Machine (SFM) is the proposed model that learns the common latent spaces shared in multi-way data.

5.3. Experimental Settings

For each dataset, we randomly split $50\%$ , $10\%$ , and $40\%$ of labeled samples as training set, validation set, and testing set, respectively. Validation sets are used for hyper-parameter tuning for each model. The training, validation, and testing sets are mutually disjoint so as to ensure the soundness of the experiment. For simplicity and fair comparison, for all the comparison methods we set the dimension of latent factors to $R=20$ and the maximum number of epochs to $400$ , and we use early stopping to obtain the best results for each method. Frobenius norm regularizers are used to avoid overfitting. The regularization hyper-parameter is tuned from $\{10^{-5},~{}10^{-4},~{}\cdots,~{}10^{0}\}$ .
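A disjoint random split of this kind can be sketched in a few lines. The helper name `split_indices`, the seed, and the sample count are assumptions for illustration; the paper does not describe its splitting code.

```python
import numpy as np

def split_indices(n_samples, fractions=(0.5, 0.1, 0.4), seed=0):
    """Randomly partition sample indices into train/validation/test index arrays."""
    rng = np.random.default_rng(seed)
    perm = rng.permutation(n_samples)            # every index appears exactly once
    n_train = int(round(fractions[0] * n_samples))
    n_val = int(round(fractions[1] * n_samples))
    # The remainder goes to the test set, so the three parts never overlap
    return perm[:n_train], perm[n_train:n_train + n_val], perm[n_train + n_val:]

train_idx, val_idx, test_idx = split_indices(1000)
```

Slicing a single permutation guarantees that the three index sets are disjoint and together cover all samples.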

All the methods except MF are implemented in TensorFlow, and the parameters are initialized using the variance scaling initializer (he2015delving, ). We tune the scaling factor of the initializer $\sigma$ from $\{1,2,5,10,100\}$ and the learning rate $\eta$ from $\{0.01,0.1,1\}$ using the validation sets. In the experiment, we set $\sigma=2$ (default setting in TensorFlow) and $\eta=0.01$ for these methods except MVM. We found that MVM is more sensitive to the configuration, because MVM multiplies the latent factors of all the modes element-wise, which leads to extremely small values approaching zero. $\sigma=10$ and $\eta=0.1$ yielded the best performance for MVM.

To investigate the performance of comparison methods, we adopt mean squared error (MSE) on the test data as the evaluation metric (mcauley2013hidden, ; zheng2017joint, ). A smaller value of the metric indicates better performance. Each experiment was repeated 10 times, and the mean and standard deviation of the metric on each dataset are reported. All experiments are conducted on a single machine with Intel Xeon $6$ -Core CPUs of 2.4 GHz and equipped with a Maxwell Titan X GPU.

5.4. Performance Analysis

The experimental results are shown in Table 3. The best method of each dataset is in bold. For clarity, on the right of the tables we show the percentage improvement of the proposed SFM method over a variety of methods. From these results, we can observe that SFM consistently outperforms all the comparison methods. We also make a few comparisons and summarize our findings as follows.

Compared with MF, SFM performs better with an average improvement of nearly 50%. MF usually performs well in practice (ling2014ratings, ; rendle2012factorization, ), but on datasets that are extremely sparse, as in our case, MF is unable to learn an accurate representation of users/items. Thus MF under-performs the other methods, which take the meta information into consideration.

In both FM and PolyNet methods, the feature vectors from all the modes are concatenated as a single input feature vector. The major difference between these two methods is the choice of kernel applied (blondel2016higher, ). The polynomial kernel used in PolyNet considers all monomials (the products of features), i.e., all combinations of features with replacement. The ANOVA kernel used in FM considers only monomials composed of distinct features, i.e., feature combinations without replacement. Compared with the best results obtained from FM methods and from PolyNet methods, SFM leads to an average improvement of 3.3% and 2.4% in MSE, respectively.
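The distinction between combinations with and without replacement can be made concrete for second-order terms. The sketch below enumerates the monomials over three hypothetical features and checks the well-known identity that lets the degree-2 ANOVA kernel be computed in linear time. The vectors `p` and `x` are arbitrary illustrative values, and the homogeneous kernel $(\mathbf{p}^{\mathrm{T}}\mathbf{x})^{2}$ is used here as a simplified stand-in for the augmented polynomial kernel.

```python
import numpy as np
from itertools import combinations, combinations_with_replacement

names = ["x1", "x2", "x3"]
# ANOVA kernel: distinct features only (combinations without replacement)
anova_monomials = ["*".join(c) for c in combinations(names, 2)]
# Polynomial kernel: repeated features allowed (combinations with replacement)
poly_monomials = ["*".join(c) for c in combinations_with_replacement(names, 2)]

def anova2(p, x):
    """Degree-2 ANOVA kernel via 0.5 * ((sum z)^2 - sum z^2), with z = p * x."""
    z = p * x
    return 0.5 * (z.sum() ** 2 - (z ** 2).sum())

def anova2_bruteforce(p, x):
    """Same quantity by summing z_i * z_j over all pairs i < j."""
    z = p * x
    return sum(z[i] * z[j] for i, j in combinations(range(len(z)), 2))

p = np.array([0.5, -1.0, 2.0])
x = np.array([1.0, 3.0, 0.5])
homogeneous_poly2 = float(p @ x) ** 2   # includes the squared terms z_i^2
```

Here `anova_monomials` contains three products and `poly_monomials` contains six, the extra three being the squares `x1*x1`, `x2*x2` and `x3*x3`.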

The primary reason behind the results is how the latent factors of each feature are learned. For any factorization based method, the latent factors of a feature are essentially learned from its interactions with other features observed in the data, as can be observed from its update rule. In FM and PolyNet, all the feature interactions are taken into consideration without distinguishing the features from different modes. As a result, important feature interactions (e.g., the interactions between the given user and her friend) would be easily buried in irrelevant feature interactions from the same modes (e.g., the interactions between the friends of the same user). Hence, the learned latent factors are less representative in FM and PolyNet, compared with the proposed SFM. Besides, we can find that including higher-order interactions in FM and PolyNet (i.e., FM-3 and PolyNet-3) does not always improve the performance. Instead, it may even degrade the performance, as shown in Cloth, Yelp, and BX datasets. This is probably due to overfitting, as they need to include more parameters to model the interactions in higher orders while the datasets are extremely sparse such that the parameters cannot be properly learned.

Compared to the MVM method, which models the full-order interactions among all the modes, our proposed SFM leads to an average improvement of 5.87%. This is because not all the modes are relevant, and some irrelevant feature interactions may introduce unexpected noise to the learning task. The irrelevant information can even be exaggerated after combinations, thereby degrading performance. This suggests that preserving the nature of relational structure is important in building predictive models.

5.5. Computational Cost Analysis

Next, we investigate the computational cost of the comparison methods. The averaged training time (seconds per epoch) required for each dataset is shown in Fig. 5. We can see that the proposed SFM requires much less computational cost on all the datasets, especially for the Yelp dataset (roughly 11% of the computational cost required for training FM-3). The efficiency comes from the use of the relational structure representation. As shown in Table 2, the number of non-zeros of the feature matrix $N_{z}(\mathbf{X})$ is much larger than the number of non-zeros of the relational structure representation $N_{z}(\mathcal{B})$ . The amount of repeating patterns is much higher for the Yelp dataset than for the other datasets, because adding all the friends of a user results in large repeating blocks in the plain feature matrix. Standard ML algorithms like the compared methods typically have at best a linear complexity in $N_{z}(\mathbf{X})$ , while SFM with the relational structure representation has a linear complexity in $N_{z}(\mathcal{B})$ . This experiment substantiates the efficiency of the proposed SFM for large datasets.

5.6. Analysis of the Impact of Data Sparsity

We proceed by further studying the impact of data sparsity on different methods. As can be found in the experimental results, the improvement of SFM over the traditional collaborative filtering methods (e.g., MF) is significant for datasets that are sparse, mainly because the samples are too scarce to model the items and users adequately. We verify this finding by comparing the performance of the comparison methods with MF on users with limited training data. Fig. 6 shows the gain of each method compared with MF for users with limited training samples, where $G_{1}$ , $G_{2}$ , and $G_{3}$ are groups of users with $[1,3]$ , $[4,6]$ , and $[7,10]$ observed samples in the training set. Due to the space limit, we only report the results from two Amazon datasets (Sport and Health), while the observations still hold for the remaining datasets. It can be seen that the proposed SFM gains the most in group $G_{1}$ , in which the users have extremely few training items. The performance gain decreases as the number of training items available for each user grows. The results indicate that meta information is especially valuable when limited training information is available.
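The grouping of users by the number of observed training samples can be sketched as follows. The function name `group_users_by_train_count` and the toy user ids are hypothetical, and the sketch covers only the grouping step, since the exact definition of the gain over MF is not restated here.

```python
import numpy as np

def group_users_by_train_count(train_user_ids, bins=((1, 3), (4, 6), (7, 10))):
    """Return one set of user ids per bin of training-sample counts (inclusive)."""
    users, counts = np.unique(train_user_ids, return_counts=True)
    groups = []
    for lo, hi in bins:
        mask = (counts >= lo) & (counts <= hi)
        groups.append(set(users[mask].tolist()))
    return groups

# Toy training set: user 0 has 2 samples, user 1 has 5, user 2 has 8, user 3 has 1
train_user_ids = np.array([0, 0, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 3])
G1, G2, G3 = group_users_by_train_count(train_user_ids)
```

Test samples can then be filtered by group membership of their user before computing a per-group error for each method.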

5.7. Sensitivity Analysis

The number of latent factors $R$ is an important hyperparameter for the factorization models. We analyze different values of $R$ and report the averaged results in Fig. 7. The results again show that SFM consistently outperforms other methods with various values of $R$ . In contrast to findings in other related factorization models (yan2014coupled, ), where prediction error can be steadily reduced with larger $R$ , we observe that the performance of each method is rather stable even as $R$ increases. This is reasonable in a general sense, as the expressiveness of the model is already enough to describe the information embedded in the data. Although a larger $R$ gives the model greater expressiveness, when the available observations regarding the target values are too sparse but the meta information is rich, only a small number of factors is required to fit the data well.

6. Conclusions

In this paper, we introduce a generic framework for learning structural data from heterogeneous domains, which can explore the high order correlations underlying multi-view multi-way data. We develop structural factorization machines (SFMs) that learn the common latent spaces shared in the multi-view tensors while automatically adjusting the contribution of each view in the predictive model. With the help of relational structure representation, we further provide an efficient approach to avoid unnecessary computation costs on repeating patterns of the multi-view data. It was shown that the proposed SFMs outperform state-of-the-art factorization models on eight large-scale datasets in terms of prediction accuracy and computational cost.

Acknowledgments

This work is supported in part by NSF through grants IIS-1526499, and CNS-1626432, and NSFC 61672313, 61503253 and NSF of Guangdong Province (2017A030313339). We gratefully acknowledge the support of NVIDIA Corporation with the donation of the Titan X GPU used for this research.

Bibliography


(1) Acar, E., Kolda, T. G., and Dunlavy, D. M. All-at-once optimization for coupled matrix and tensor factorizations. arXiv preprint arXiv:1105.3422 (2011).
(2) Blondel, M., Fujino, A., Ueda, N., and Ishihata, M. Higher-order factorization machines. In Advances in Neural Information Processing Systems (2016), pp. 3351–3359.
(3) Blondel, M., Ishihata, M., Fujino, A., and Ueda, N. Polynomial networks and factorization machines: New insights and efficient training algorithms. In Proceedings of the 33rd International Conference on Machine Learning (2016), pp. 850–858.
(4) Cao, B., He, L., Kong, X., Yu, P. S., Hao, Z., and Ragin, A. B. Tensor-based multi-view feature selection with applications to brain diseases. In IEEE International Conference on Data Mining (2014), pp. 40–49.
(5) Cao, B., Zheng, L., Zhang, C., Yu, P. S., Piscitello, A., Zulueta, J., Ajilore, O., Ryan, K., and Leow, A. D. Deepmood: Modeling mobile phone typing dynamics for mood detection. In Proceedings of ACM SIGKDD international conference on Knowledge discovery and data mining (2017), pp. 747–755.
(6) Cao, B., Zhou, H., Li, G., and Yu, P. S. Multi-view machines. In ACM International Conference on Web Search and Data Mining (2016), pp. 427–436.
(7) Cheng, H.-T., Koc, L., Harmsen, J., Shaked, T., Chandra, T., Aradhye, H., Anderson, G., Corrado, G., Chai, W., Ispir, M., et al. Wide & deep learning for recommender systems. In DLRS (2016), ACM, pp. 7–10.
(8) Chin, W.-S., Yuan, B.-W., Yang, M.-Y., Zhuang, Y., Juan, Y.-C., and Lin, C.-J. Libmf: A library for parallel matrix factorization in shared-memory systems. The Journal of Machine Learning Research 17, 1 (2016), 2971–2975.