Gated Attentive-Autoencoder for Content-Aware Recommendation

Chen Ma; Peng Kang; Bin Wu; Qinglong Wang; Xue Liu

arXiv:1812.02869·cs.IR·December 10, 2018


TL;DR

This paper introduces GATE, a neural autoencoder with attention mechanisms that combines item content and user feedback to improve personalized recommendations, especially in sparse data scenarios.

Contribution

The paper presents a novel gated attentive-autoencoder model that fuses content and rating data using attention modules, enhancing recommendation accuracy and interpretability.

Findings

01. Outperforms state-of-the-art methods on multiple datasets

02. Effectively handles sparse implicit feedback

03. Provides interpretable attention-based insights

Abstract

The rapid growth of Internet services and mobile devices provides an excellent opportunity to satisfy the strong demand for the personalized item or product recommendation. However, with the tremendous increase of users and items, personalized recommender systems still face several challenging problems: (1) the hardness of exploiting sparse implicit feedback; (2) the difficulty of combining heterogeneous data. To cope with these challenges, we propose a gated attentive-autoencoder (GATE) model, which is capable of learning fused hidden representations of items' contents and binary ratings, through a neural gating structure. Based on the fused representations, our model exploits neighboring relations between items to help infer users' preferences. In particular, a word-level and a neighbor-level attention module are integrated with the autoencoder. The word-level attention learns the…

Tables

Table 1. The statistics of datasets.

Dataset	#Users	#Items	#Ratings	#Words	Density
citeulike-a	5,551	16,980	204,986	8,000	0.217%
ML20M	138,493	18,307	19,977,049	12,397	0.788%
Books	65,476	41,264	1,947,765	27,584	0.072%
CDs	24,934	24,634	478,048	24,341	0.078%
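The Density column can be reproduced from the other columns, assuming it is defined as the number of ratings divided by the number of user-item pairs (the table does not state the definition, so this is an assumption that happens to match the reported values). A minimal sketch:

```python
# Dataset statistics copied from Table 1: (users, items, ratings).
stats = {
    "citeulike-a": (5551, 16980, 204986),
    "ML20M": (138493, 18307, 19977049),
    "Books": (65476, 41264, 1947765),
    "CDs": (24934, 24634, 478048),
}


def density_percent(users, items, ratings):
    """Share of the user-item matrix that is observed, in percent (assumed definition)."""
    return 100.0 * ratings / (users * items)


densities = {name: density_percent(*s) for name, s in stats.items()}
for name, value in densities.items():
    print(f"{name}: {value:.3f}%")
```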

Table 2. The performance comparison of all methods in terms of Recall@10 and NDCG@10. The best performing method is boldfaced. The underlined number is the second best performing method. *, **, and *** indicate the statistical significance for $p\leq 0.05$, $p\leq 0.01$, and $p\leq 0.001$, respectively, compared to the best baseline method based on the paired t-test. Improv. denotes the improvement of our model over the best baseline method.

	WRMF	CDAE	CDL	CVAE	CML+F	ConvMF	JRL	GATE	Improv.
Recall@10
citeulike-a	0.0946	0.0888	0.1317	0.1371	0.1283	0.1153	0.1325	0.1419	3.50%
movielens-20M	0.1075	0.0751	0.1287	0.1303	0.1123	0.1201	0.1401	0.1625**	15.99%
Amazon-Books	0.0553	0.0132	0.0648	0.0632	0.0756	0.0524	0.0924	0.1133*	22.62%
Amazon-CDs	0.0779	0.0191	0.0827	0.0811	0.0824	0.0753	0.0816	0.1057***	27.81%
NDCG@10
citeulike-a	0.0843	0.0736	0.0949	0.0952	0.1035	0.0914	0.0982	0.1082	4.54%
movielens-20M	0.1806	0.1774	0.1836	0.1939	0.2479	0.1807	0.2439	0.2992**	20.69%
Amazon-Books	0.0377	0.0105	0.0393	0.0384	0.0456	0.0324	0.0592	0.0708***	19.59%
Amazon-CDs	0.0357	0.0105	0.0356	0.0349	0.0364	0.0323	0.0386	0.0477***	23.58%
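The Improv. column is consistent with a relative improvement of GATE over the best baseline in each row. The sketch below recomputes it for the Recall@10 rows; the formula is inferred from the caption and the numbers, not quoted from the paper.

```python
def improvement_percent(gate, best_baseline):
    """Relative improvement of GATE over the best baseline, in percent."""
    return 100.0 * (gate - best_baseline) / best_baseline


# (GATE, best baseline) Recall@10 pairs read from Table 2.
recall_pairs = {
    "citeulike-a": (0.1419, 0.1371),    # best baseline: CVAE
    "movielens-20M": (0.1625, 0.1401),  # best baseline: JRL
    "Amazon-Books": (0.1133, 0.0924),   # best baseline: JRL
    "Amazon-CDs": (0.1057, 0.0827),     # best baseline: CDL
}

improv = {name: improvement_percent(*pair) for name, pair in recall_pairs.items()}
for name, value in improv.items():
    print(f"{name}: {value:.2f}%")
```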

Table 3. The ablation analysis on Amazon-CDs and Amazon-Books datasets.

Architecture	CDs R@10	CDs N@10	Books R@10	Books N@10
(1) stacked AE	0.0672	0.0315	0.0745	0.0484
(2) reg: AE + W_Att	0.0676	0.0318	0.0304	0.0265
(3) gating: AE + W_Att	0.0816	0.0353	0.0793	0.0515
(4) gating: AE + GRU	0.0818	0.0352	0.0789	0.0512
(5) gating: AE + CNN	0.0777	0.0335	0.0791	0.0495
(6) GATE	0.1057	0.0477	0.1133	0.0708

Table 4. A case study of the importance scores computed by the neighbor-attention module. The number inside (·) indicates the number of occurrences of "fluctuation" in an article, excluding references.

Target	Neighbor	Score
Fluctuations in network dynamics	Genomic analysis of regulatory network dynamics reveals large topological changes (0)	0.07172
	Frequency of occurrence of numbers in the World Wide Web (10)	0.22090
	Complex networks: Structure and dynamics (16)	0.26835
	Noise in protein expression scales with natural protein abundance (36)	0.43903

Table 5. A case study of the word-attention.

The Summary of Article 16797 in citeulike-a

We present the first parallel implementation of the T-Coffee consistency-based multiple aligner. We benchmark it on the Amazon Elastic Cloud (EC2) and show that the parallelization procedure is reasonably effective. We also conclude that for a web server with moderate usage (10K hits/month) the cloud provides a cost-effective alternative to in-house deployment.

Equations

$$\mathrm{enc}:\ \begin{cases}\mathbf{z}_{i}^{(1)}=a_{1}(\mathbf{W}_{1}\mathbf{r}_{i}+\mathbf{b}_{1})\\ \mathbf{z}_{i}^{r}=a_{2}(\mathbf{W}_{2}\mathbf{z}_{i}^{(1)}+\mathbf{b}_{2})\end{cases}\qquad \mathrm{dec}:\ \begin{cases}\mathbf{z}_{i}^{(3)}=a_{3}(\mathbf{W}_{3}\mathbf{z}_{i}^{r}+\mathbf{b}_{3})\\ \hat{\mathbf{r}}_{i}=a_{4}(\mathbf{W}_{4}\mathbf{z}_{i}^{(3)}+\mathbf{b}_{4})\end{cases}$$

$$\mathbf{D}_{i}=[\,\dots\ \ \mathbf{e}_{j-1}\ \ \mathbf{e}_{j}\ \ \mathbf{e}_{j+1}\ \ \dots\,]$$

$$\mathbf{a}_{i}=softmax(\mathbf{w}_{a_{1}}^{\top}\tanh(\mathbf{W}_{a_{2}}\mathbf{D}_{i}+\mathbf{b}_{a_{2}})),$$

$$\mathbf{z}_{i}^{c}=\sum_{\mathbf{e}_{j}\in\mathbf{D}_{i}}a_{i,j}\mathbf{e}_{j}.$$

$$\mathbf{A}_{i}=softmax(\mathbf{W}_{a_{1}}\tanh(\mathbf{W}_{a_{2}}\mathbf{D}_{i}+\mathbf{b}_{a_{2}})+\mathbf{b}_{a_{1}}),$$

$$\mathbf{Z}_{i}^{c}=\mathbf{A}_{i}\mathbf{D}_{i}^{\top},$$

$$\mathbf{z}_{i}^{c}=a_{t}(\mathbf{Z}_{i}^{c\top}\mathbf{w}_{t}),$$

$$\mathbf{G}=sigmoid(\mathbf{W}_{g_{1}}\mathbf{z}_{i}^{r}+\mathbf{W}_{g_{2}}\mathbf{z}_{i}^{c}+\mathbf{b}_{g}),$$

$$\mathbf{z}_{i}^{g}=\mathbf{G}\odot\mathbf{z}_{i}^{r}+(1-\mathbf{G})\odot\mathbf{z}_{i}^{c},$$

$$s_{i,j}=\tanh(\mathbf{z}_{i}^{g\top}\mathbf{W}_{n}\mathbf{z}_{j}^{g}),\ \forall j\in\mathcal{N}_{i},$$

$$\mathbf{a}_{i}=softmax(\mathbf{s}_{i}),$$

$$\mathbf{z}_{i}^{n}=\sum_{j\in\mathcal{N}_{i}}a_{i,j}\mathbf{z}_{j}^{g},$$

$$\mathbf{z}_{i}^{(3,g)}=a_{3}(\mathbf{W}_{3}\mathbf{z}_{i}^{g}+\mathbf{b}_{3}),$$

$$\mathbf{z}_{i}^{(3,n)}=a_{3}(\mathbf{W}_{3}\mathbf{z}_{i}^{n}+\mathbf{b}_{3}),$$

$$\hat{\mathbf{r}}_{i}=a_{4}(\mathbf{W}_{4}\mathbf{z}_{i}^{(3,g)}+\mathbf{W}_{4}\mathbf{z}_{i}^{(3,n)}+\mathbf{b}_{4}).$$

$$\mathcal{L}_{AE}=\sum_{i=1}^{n}\sum_{u=1}^{m}\|C_{u,i}(R_{u,i}-\hat{R}_{u,i})\|_{2}^{2}=\|\mathbf{C}^{\top}\odot(\mathbf{R}^{\top}-\hat{\mathbf{R}}^{\top})\|_{F}^{2},$$

$$C_{u,i}=\begin{cases}\rho & \text{if } R_{u,i}=1\\ 1 & \text{otherwise}\end{cases}$$

$$\mathcal{L}=\mathcal{L}_{AE}+\lambda(\|\mathbf{W}_{*}\|_{F}^{2}+\|\mathbf{w}_{t}\|_{2}^{2}),$$


Code & Models

Repositories

allenjack/GATE (PyTorch, official)


Taxonomy

Topics: Recommender Systems and Techniques · Topic Modeling · Advanced Graph Neural Networks

Full text

Gated Attentive-Autoencoder for Content-Aware Recommendation

Chen Ma (McGill University), Peng Kang (McGill University), Bin Wu (Zhengzhou University), Qinglong Wang (McGill University), and Xue Liu (McGill University)

(2019)

Abstract.

The rapid growth of Internet services and mobile devices provides an excellent opportunity to satisfy the strong demand for the personalized item or product recommendation. However, with the tremendous increase of users and items, personalized recommender systems still face several challenging problems: (1) the hardness of exploiting sparse implicit feedback; (2) the difficulty of combining heterogeneous data. To cope with these challenges, we propose a gated attentive-autoencoder (GATE) model, which is capable of learning fused hidden representations of items’ contents and binary ratings, through a neural gating structure. Based on the fused representations, our model exploits neighboring relations between items to help infer users’ preferences. In particular, a word-level and a neighbor-level attention module are integrated with the autoencoder. The word-level attention learns the item hidden representations from items’ word sequences, while favoring informative words by assigning larger attention weights. The neighbor-level attention learns the hidden representation of an item’s neighborhood by considering its neighbors in a weighted manner. We extensively evaluate our model with several state-of-the-art methods and different validation metrics on four real-world datasets. The experimental results not only demonstrate the effectiveness of our model on top-N recommendation but also provide interpretable results attributed to the attention modules.

Published in: The Twelfth ACM International Conference on Web Search and Data Mining (WSDM ’19), February 11–15, 2019, Melbourne, VIC, Australia. DOI: 10.1145/3289600.3290977. ISBN: 978-1-4503-5940-5/19/02.

1. Introduction

With the rapid growth of Internet services and mobile devices, it has become more convenient for people to access large amounts of online products and multimedia content, such as movies and articles. Although this growth gives users many choices, it also makes it more difficult to pick a user’s most preferred items out of thousands of candidates. For example, users who like to watch movies may find it difficult to decide which movie to watch when there are thousands of selections, and gourmet eaters may find it hard to discover new restaurants tailored to their tastes. These needs call for a promising service: personalized recommender systems. These systems are becoming increasingly essential, serving a potentially huge service demand and bringing significant benefits to at least two parties: (1) they help users easily discover products that they are interested in; (2) they create opportunities for product providers to increase revenue.

To build personalized recommender systems, two types of data are generally available and utilized: user ratings and item descriptions, e.g., users’ ratings on movies and movies’ plots. Approaches based on item text modeling such as latent Dirichlet allocation (LDA), stacked denoising autoencoder (SDAE), and variational autoencoder (VAE) have been proposed to additionally utilize items’ descriptions, e.g., reviews, abstracts, or synopses (Wang and Blei, 2011; Wang et al., 2015; Li and She, 2017), to enhance the top-N recommendation performance. Collaborative deep learning (CDL) (Wang et al., 2015) and collaborative variational autoencoder (CVAE) (Li and She, 2017) are two representative methods, which explicitly link the learning of item content to the recommendation task. In particular, CVAE and CDL apply a VAE and an SDAE, respectively, to learn hidden representations from items’ bag-of-words; these representations are integrated with probabilistic matrix factorization (PMF) by using them to regularize PMF’s item latent factors.

Although existing methods have proposed effective models and achieved satisfactory results, we argue that several factors can still be considered to enhance performance. First, previous studies (Wang et al., 2015; Li and She, 2017) learn the content hidden representations from items’ normalized bag-of-words vectors, which does not consider the importance of different words for describing a certain item. Treating informative words the same as all other words may lead to an incomplete understanding of the item content. Second, previous works (Kim et al., 2016; Li and She, 2017) combine the hidden representations from heterogeneous information, e.g., items’ ratings and descriptions, by a weighted regularization term. This may not make full use of the data from heterogeneous sources and may require tedious hyper-parameter tuning, since different data sources are characterized by distinct statistical properties and different orders of magnitude. Third, it is also important to note that the relations between items, e.g., movies in the same genre and citations between articles, are neglected in previous works. Closely related items are likely to share the same topics or have similar attributes. As such, exploring users’ preferences on an item’s neighbors also helps to infer users’ preferences on this item.

To address the problems mentioned above, we propose a novel recommendation model, gated attentive-autoencoder (GATE), for content-aware recommendation. GATE consists of a word-attention module, a neighbor-attention module, and a neural gating structure, integrated with a stacked autoencoder (AE). The encoder of the stacked AE encodes users’ implicit feedback on a certain item into the item’s hidden representation. Then the word-attention module learns the item embedding from its sequence of words, where the informative words can be adaptively selected without using complex recurrent or convolutional neural networks. To smoothly fuse the representations of items’ ratings and descriptions, we propose a neural gating layer to extract and merge the salient parts of these two hidden representations, which is inspired by the long short-term memory (LSTM) (Hochreiter and Schmidhuber, 1997). Moreover, item-item relations provide important auxiliary information to predict users’ preferences, since closely related items may have the same topics or attributes. Thus, we apply a neighbor-attention module to learn the hidden representation of an item’s neighborhood. By modeling users’ preferences on the item’s neighborhood, the users’ preferences on this item can be indirectly reflected. We extensively evaluate our model against many state-of-the-art methods with different validation metrics on four real-world datasets. The experimental results not only demonstrate the improvements of our model over other baselines but also show the effectiveness of the gating layer and attention modules.

To summarize, the major contributions of this paper are listed as follows:

• To learn the hidden representations from items’ sequences of words, we apply a word-attention module to adaptively distinguish informative words, leading to better comprehension of the item content. Our word-attention module can achieve performance comparable to complex recurrent or convolutional neural networks, yet with fewer parameters.

• To effectively fuse the hidden representations of items’ contents and ratings, we propose a neural gating layer to extract and combine the salient parts of them.

• According to item-item relations, we utilize a neighbor-attention module to learn the hidden representation of an item’s neighborhood. Modeling user preferences on the neighborhood of an item provides a significant supplement for inferring user preferences on this item.

• Both proposed attention modules are capable of interpreting and visualizing the important words and neighbors of items, respectively. Experiments on four real-world datasets show that the proposed GATE model significantly outperforms the state-of-the-art methods for content-aware recommendation.

2. Related Work

In this section, we review work related to the proposed model: personalized recommendation, recommendation with content features, and attention mechanisms.

2.1. Recommendation with Implicit Feedback

Early studies on recommendation largely focused on explicit feedback (Sarwar et al., 2001; Salakhutdinov et al., 2007), whereas recent research has shifted towards implicit data (Wang et al., 2015; Li and She, 2017). Collaborative filtering (CF) with implicit feedback is usually treated as a top-N item recommendation task, where the goal is to recommend to each user a list of items that the user may be interested in. Compared to the rating prediction task, the item recommendation problem is more practical and challenging (Pan et al., 2008), and it accords better with real-world recommendation scenarios. To make use of latent factor models for item recommendation, early works either applied a uniform weighting scheme to treat all missing data as negative samples (Hu et al., 2008), or sampled negative instances from missing data (Rendle et al., 2009). Recently, He et al. (He et al., 2016) and Liang et al. (Liang et al., 2016) proposed dedicated models to weight missing data, and Bayer et al. (Bayer et al., 2017) developed an implicit coordinate descent method for feature-based factorization models. With the ability to learn salient representations, (deep) neural network-based methods were also adopted. In (Wu et al., 2016), Wu et al. proposed the collaborative denoising autoencoder (CDAE) for top-N recommendation learning from implicit feedback. In (He et al., 2017), He et al. proposed a neural network-based collaborative filtering model, which leverages a multi-layer perceptron to learn the non-linear user-item interactions. In (Xue et al., 2017; Guo et al., 2017; Lian et al., 2018), deep learning techniques were adopted to boost the traditional matrix factorization and factorization machine methods.

2.2. Content-Aware Recommendation

To further improve performance, researchers incorporate content features to help alleviate the sparsity and cold-start problems in user-item interaction data. Some early works (McAuley and Leskovec, 2013; Wang and Blei, 2011) applied latent Dirichlet allocation (LDA) to learn abstract topics that occur in a collection of documents. In recent years, deep learning models have demonstrated great power for effective text representation learning. In (Wang et al., 2015; Zhang et al., 2016), researchers utilized the stacked denoising autoencoder (SDAE) on items’ bag-of-words to learn the item latent representations. Li et al. (Li et al., 2015) proposed to combine probabilistic matrix factorization with marginalized denoising autoencoders. In (Li and She, 2017), Li and She adopted a variational autoencoder, a Bayesian probabilistic generative model, to learn the latent representations from items’ content. These studies model the item content through its bag-of-words. On the other hand, some studies also incorporate contextual information for a better understanding of the text. For example, in (Zhang et al., 2017), the doc2vec model (Le and Mikolov, 2014) was utilized to model the text information in user reviews. In (Kim et al., 2016; Seo et al., 2017; Zheng et al., 2017; Chen et al., 2018), researchers adopted convolutional neural networks, with max-pooling and fully connected layers, to learn the item hidden representation from the item’s sequence of word embeddings, where words’ contextual information can be captured by the convolutional filters and the sliding window strategy.

2.3. Attention Mechanism in Recommendation

Recently, the attention mechanism has demonstrated its effectiveness in various machine learning tasks such as image captioning (Xu et al., 2015; You et al., 2016), document classification (Yang et al., 2016), and machine translation (Luong et al., 2015; Vaswani et al., 2017). Researchers have also adopted the attention mechanism in recommendation tasks. In (Pei et al., 2017), Pei et al. adopted an attention model to measure the relevance between users and items. Wang et al. (Wang et al., 2017) proposed a hybrid attention model to adaptively capture the change of editors’ selection criteria. In (Gong and Zhang, 2016), Gong and Zhang adopted an attention model to scan input microblogs and select trigger words. Chen et al. (Chen et al., 2017) proposed item- and component-level attention mechanisms to model the implicit feedback in multimedia recommendation. In (Seo et al., 2017), Seo et al. proposed to model user preferences and item properties using convolutional neural networks (CNNs) with dual local and global attention. In (Tay et al., 2018), Tay et al. proposed a multi-pointer attention mechanism to enhance the rating prediction accuracy. In (Ma et al., 2018), Ma et al. integrated the attention mechanism with autoencoders to discriminate the user preferences on users’ visited locations. In (Chen et al., 2018), an attention-based review pooling mechanism was proposed to select the important user reviews.

However, our word- and neighbor-attention modules differ from the above studies. For the word-attention module, we adopt multi-dimensional attention to select informative words by computing a score vector for each word, whereas vanilla attention computes a single importance score for each word, which cannot sufficiently express the complex relations among words when the number of words is large. For the neighbor-attention module, we learn an item’s neighborhood representation according to the importance scores between the item and its neighbors. The item-item relation information is rarely considered in previous works. Moreover, we propose a neural gating layer to adaptively merge items’ hidden representations from different data sources.

3. Problem Formulation

The recommendation task considered in this paper takes implicit feedback (Hu et al., 2008) as the training and test data. The user preferences are represented by an $m$-by-$n$ binary matrix $\mathbf{R}$. The entire collection of $n$ items is represented by a list of documents $\mathcal{D}$, where each document in $\mathcal{D}$ is represented by a sequence of words. The item relations are represented by a binary adjacency matrix $\mathbf{N}\in\mathbb{R}^{n\times n}$, where $N_{ij}=1$ if items $i$ and $j$ are related or connected. Given the item descriptions $\mathcal{D}$, the item relations $\mathbf{N}$, and part of the ratings in $\mathbf{R}$, the problem is to predict the rest of the ratings in $\mathbf{R}$.

Here, following common symbolic notation, upper case bold letters denote matrices, lower case bold letters denote column vectors unless otherwise specified, and non-bold letters represent scalars.

4. Methodologies

In this section, we introduce the proposed model, which is shown in Figure 1. We first illustrate the basic model to learn item representations from users’ binary ratings. We then introduce the multi-dimensional attention for learning item representations from word sequences. Next, we present the neural gating layer to combine the item representations from ratings and contents. We then demonstrate how to learn the hidden representation of an item’s neighborhood and utilize it to assist in inferring user preferences. Lastly, we go through the loss function and training process of the proposed model.

4.1. Model Basics

The substantial increase of users and items makes the user-item interactions more complex and hard to model. Classical matrix factorization (MF) methods apply the inner product to predict user preferences on items, which linearly combines users’ and items’ latent factors. However, it has been shown in (He et al., 2017; Hsieh et al., 2017) how the linear combination of the inner product can limit the expressiveness of MF. Inspired by the recent works using autoencoders (AEs) to model explicit feedback (Sedhain et al., 2015) and implicit feedback (Wu et al., 2016), we also adopt AE as our base building block due to its ability to learn richer representations and the close relationship to MF (Wu et al., 2016).

To capture users’ preferences on an item, we apply a stacked AE to encode users’ binary ratings $\mathbf{r}_{i}\in\mathbb{R}^{m}$ on a certain item $i$ into the item’s rating hidden representation $\mathbf{z}_{i}^{r}$ (the superscript $r$ indicates the hidden representation is learned from items’ binary ratings):

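$$\mathrm{enc}:\ \begin{cases}\mathbf{z}_{i}^{(1)}=a_{1}(\mathbf{W}_{1}\mathbf{r}_{i}+\mathbf{b}_{1})\\ \mathbf{z}_{i}^{r}=a_{2}(\mathbf{W}_{2}\mathbf{z}_{i}^{(1)}+\mathbf{b}_{2})\end{cases}\qquad \mathrm{dec}:\ \begin{cases}\mathbf{z}_{i}^{(3)}=a_{3}(\mathbf{W}_{3}\mathbf{z}_{i}^{r}+\mathbf{b}_{3})\\ \hat{\mathbf{r}}_{i}=a_{4}(\mathbf{W}_{4}\mathbf{z}_{i}^{(3)}+\mathbf{b}_{4})\end{cases}\tag{1}$$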

where $\mathbf{W}_{1}\in\mathbb{R}^{h_{1}\times m}$ , $\mathbf{W}_{2}\in\mathbb{R}^{h\times h_{1}}$ , $\mathbf{W}_{3}\in\mathbb{R}^{h_{1}\times h}$ , and $\mathbf{W}_{4}\in\mathbb{R}^{m\times h_{1}}$ are the weight matrices. $m$ is the number of users, $h_{1}$ is the dimension of the first hidden layer, and $h$ is the dimension of the bottleneck layer. $\mathbf{r}_{i}$ is a multi-hot vector, where $r_{u,i}=1$ indicates that the user $u$ prefers the item $i$ .
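The encoder and decoder can be sketched as a plain forward pass. The following is a minimal NumPy illustration, not the authors' implementation: the activations $a_{1},\dots,a_{4}$ are not specified in this section, so tanh for the hidden layers and a sigmoid for the output are assumptions, as are the toy sizes and the random initialization.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def init_autoencoder(m, h1, h, rng):
    """Random parameters with the shapes given in the text (init scale is arbitrary)."""
    shapes = {"W1": (h1, m), "W2": (h, h1), "W3": (h1, h), "W4": (m, h1)}
    params = {name: rng.normal(0.0, 0.1, size=shape) for name, shape in shapes.items()}
    params.update({"b1": np.zeros(h1), "b2": np.zeros(h),
                   "b3": np.zeros(h1), "b4": np.zeros(m)})
    return params


def encode(r_i, p):
    z1 = np.tanh(p["W1"] @ r_i + p["b1"])   # a_1 assumed to be tanh
    return np.tanh(p["W2"] @ z1 + p["b2"])  # z_i^r, a_2 assumed to be tanh


def decode(z_r, p):
    z3 = np.tanh(p["W3"] @ z_r + p["b3"])   # a_3 assumed to be tanh
    return sigmoid(p["W4"] @ z3 + p["b4"])  # r_hat_i, a_4 assumed to be sigmoid


rng = np.random.default_rng(0)
m, h1, h = 6, 4, 3                               # toy sizes: users, hidden, bottleneck
params = init_autoencoder(m, h1, h, rng)
r_i = np.array([1.0, 0.0, 0.0, 1.0, 0.0, 1.0])   # multi-hot ratings of item i
z_r = encode(r_i, params)
r_hat = decode(z_r, params)
print(z_r.shape, r_hat.shape)
```

Note that the input is one column of $\mathbf{R}$ (all users' feedback on one item), so the bottleneck vector is an item representation rather than a user representation.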

4.2. Word-Attention Module

Unlike previous works (Wang et al., 2015; Li and She, 2017; Hsieh et al., 2017) that learn item embeddings from bag-of-words and neglect the importance of different words, we propose a word-attention module based on items’ word sequences. Compared to learning from items’ bag-of-words, the attention weights learned by our module adaptively select the informative words with different importance, and make the informative words contribute more to depicting items.

Embedding Layer. In the proposed module, the input of item $i$ is a sequence of $l_{i}$ words from its text description, where each word is represented as a one-hot vector. At the embedding layer, each one-hot vector is converted into a low-dimensional real-valued dense vector by a word embedding matrix $\mathbf{E}\in\mathbb{R}^{h\times v}$, where $h$ is the dimension of the word embedding and $v$ is the size of the vocabulary. After conversion by the embedding layer, the item text is represented as:

$$\mathbf{D}_{i}=[\,\dots\ \ \mathbf{e}_{j-1}\ \ \mathbf{e}_{j}\ \ \mathbf{e}_{j+1}\ \ \dots\,]$$

where $\mathbf{D}_{i}\in\mathbb{R}^{h\times l_{i}}$ and $\mathbf{e}_{j}\in\mathbb{R}^{h}$ .
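As a small illustration of the embedding layer, multiplying $\mathbf{E}$ by one-hot columns is the same as selecting columns of $\mathbf{E}$. The word indices and sizes below are hypothetical toy values.

```python
import numpy as np

rng = np.random.default_rng(0)
h, v = 4, 10                        # toy embedding size and vocabulary size
E = rng.normal(size=(h, v))         # word embedding matrix E (h x v)
word_ids = np.array([3, 7, 7, 1])   # hypothetical word indices of item i, l_i = 4

# One-hot encoding: column j has a single 1 at the row of the j-th word.
one_hot = np.zeros((v, len(word_ids)))
one_hot[word_ids, np.arange(len(word_ids))] = 1.0

D_i = E @ one_hot                   # h x l_i matrix of word embeddings
D_lookup = E[:, word_ids]           # equivalent column lookup, cheaper in practice
print(D_i.shape)
```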

Multi-dimensional Attention. Inspired by the Transformer (Vaswani et al., 2017), which relies solely on attention mechanisms for machine translation, we apply a multi-dimensional attention mechanism on word sequences to learn items’ hidden representations without using complex recurrent or convolutional neural networks. The reason is that, in real-world scenarios, users may care more about the topics or motifs of items, which can be conveyed in a few words, than about the word-word relations in the sequence.

The goal of the word-attention is to assign different importance to words, then aggregate word embeddings in a weighted manner to characterize the item. Given the word embeddings $\mathbf{D}_{i}$ of an item, a vanilla attention mechanism to compute the attention weights is represented by a two-layer neural network:

$$\mathbf{a}_{i}=softmax(\mathbf{w}_{a_{1}}^{\top}\tanh(\mathbf{W}_{a_{2}}\mathbf{D}_{i}+\mathbf{b}_{a_{2}})),$$

where $\mathbf{w}_{a_{1}}\in\mathbb{R}^{h}$, $\mathbf{W}_{a_{2}}\in\mathbb{R}^{h\times h}$, and $\mathbf{b}_{a_{2}}\in\mathbb{R}^{h}$ are the parameters to be learned, and the $softmax(\cdot)$ ensures that all the computed weights sum up to 1. Then we sum up the embeddings in $\mathbf{D}_{i}$ according to the weights provided by $\mathbf{a}_{i}$ to get the vector representation of the item (the superscript $c$ indicates the hidden representation is learned from items’ contents):

$$\mathbf{z}_{i}^{c}=\sum_{\mathbf{e}_{j}\in\mathbf{D}_{i}}a_{i,j}\mathbf{e}_{j}.$$

However, assigning a single importance value to a word embedding usually makes the model focus on a specific aspect of an item content (Lin et al., 2017). There can be multiple aspects in the item content that together characterize this item, especially when the number of words is large.
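The vanilla attention described above can be sketched in a few lines of NumPy. This is an illustrative sketch with random toy parameters; the bias $\mathbf{b}_{a_{2}}$ is assumed to be broadcast across the $l_{i}$ word columns.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(shifted)
    return e / np.sum(e, axis=axis, keepdims=True)


def vanilla_word_attention(D_i, w_a1, W_a2, b_a2):
    """One importance weight per word, then a weighted sum of word embeddings."""
    hidden = np.tanh(W_a2 @ D_i + b_a2[:, None])  # h x l_i
    a_i = softmax(w_a1 @ hidden)                  # l_i weights that sum to 1
    z_c = D_i @ a_i                               # h, equals sum_j a_ij * e_j
    return a_i, z_c


rng = np.random.default_rng(0)
h, l_i = 4, 5                                     # toy sizes
D_i = rng.normal(size=(h, l_i))
a_i, z_c = vanilla_word_attention(
    D_i, rng.normal(size=h), rng.normal(size=(h, h)), np.zeros(h))
print(a_i.round(3), z_c.shape)
```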

Thus, we need multiple $\mathbf{a}_{i}$ to focus on different parts of the item content. Based on this inspiration, we adopt a matrix instead of $\mathbf{a}_{i}$ to capture the multi-dimensional attention and assign an attention weight vector to each word embedding. Each dimension of the attention weight vector represents an aspect of relations among all embeddings in $\mathbf{D}_{i}$. Suppose we want $d_{a}$ aspects of attention to be extracted from the embeddings; then we extend $\mathbf{w}_{a_{1}}$ to $\mathbf{W}_{a_{1}}\in\mathbb{R}^{d_{a}\times h}$, which behaves like a high-level representation of a fixed query "what are the informative words" over the other words in the text:

$$\mathbf{A}_{i}=softmax(\mathbf{W}_{a_{1}}\tanh(\mathbf{W}_{a_{2}}\mathbf{D}_{i}+\mathbf{b}_{a_{2}})+\mathbf{b}_{a_{1}}),$$

where $\mathbf{A}_{i}\in\mathbb{R}^{d_{a}\times l_{i}}$ is the attention weight matrix, $\mathbf{b}_{a_{1}}\in\mathbb{R}^{d_{a}}$ is the bias term, and the $softmax$ is performed along the second dimension of its input. By multiplying the attention weight matrix with word embeddings, we have the matrix representation of an item:

$$\mathbf{Z}_{i}^{c}=\mathbf{A}_{i}\mathbf{D}_{i}^{\top},$$

where $\mathbf{Z}_{i}^{c}\in\mathbb{R}^{d_{a}\times h}$ is the matrix representation of the item. Then we have another neural layer to aggregate the item matrix representation into a vector representation. The hidden representation of the item is revised as:

$$\mathbf{z}_{i}^{c}=a_{t}(\mathbf{Z}_{i}^{c\top}\mathbf{w}_{t}),$$

where $\mathbf{w}_{t}\in\mathbb{R}^{d_{a}}$ is the parameter in the aggregation layer, $a_{t}(\cdot)$ is the activation function.
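The multi-dimensional variant differs from vanilla attention only in that it produces $d_{a}$ weight distributions over the words and then aggregates them with $\mathbf{w}_{t}$. A minimal NumPy sketch follows; the aggregation activation $a_{t}$ is not specified here and is assumed to be tanh, and all parameter values are random toy data.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(shifted)
    return e / np.sum(e, axis=axis, keepdims=True)


def multi_dim_word_attention(D_i, W_a1, b_a1, W_a2, b_a2, w_t):
    hidden = np.tanh(W_a2 @ D_i + b_a2[:, None])          # h x l_i
    # Softmax along the second dimension: each of the d_a rows sums to 1.
    A_i = softmax(W_a1 @ hidden + b_a1[:, None], axis=1)  # d_a x l_i
    Z_c = A_i @ D_i.T                                     # d_a x h matrix representation
    z_c = np.tanh(Z_c.T @ w_t)                            # h, a_t assumed to be tanh
    return A_i, z_c


rng = np.random.default_rng(0)
h, l_i, d_a = 4, 5, 3                                     # toy sizes
D_i = rng.normal(size=(h, l_i))
A_i, z_c = multi_dim_word_attention(
    D_i,
    W_a1=rng.normal(size=(d_a, h)), b_a1=np.zeros(d_a),
    W_a2=rng.normal(size=(h, h)), b_a2=np.zeros(h),
    w_t=rng.normal(size=d_a),
)
print(A_i.shape, z_c.shape)
```

Inspecting the rows of `A_i` is what makes the module interpretable: each row shows which words one attention aspect focuses on.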

4.3. Neural Gating Layer

We have obtained the item hidden representations from two heterogeneous data sources, i.e., the binary ratings and the content descriptions of items. The next aim is to combine these two kinds of hidden representations to facilitate the user preference prediction on unrated items. Unlike previous works (Wang et al., 2015; Li and She, 2017) regularizing these two kinds of hidden representations, we propose a neural gating layer to adaptively merge them. This is inspired by the gates in long short-term memory (LSTM) (Hochreiter and Schmidhuber, 1997). The gate $\mathbf{G}$ and the fused item hidden representation $\mathbf{z}_{i}^{g}$ are computed by:

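$$\mathbf{G}=sigmoid(\mathbf{W}_{g_{1}}\mathbf{z}_{i}^{r}+\mathbf{W}_{g_{2}}\mathbf{z}_{i}^{c}+\mathbf{b}_{g}),$$

$$\mathbf{z}_{i}^{g}=\mathbf{G}\odot\mathbf{z}_{i}^{r}+(1-\mathbf{G})\odot\mathbf{z}_{i}^{c},$$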

where $\mathbf{W}_{g_{1}}\in\mathbb{R}^{h\times h}$ , $\mathbf{W}_{g_{2}}\in\mathbb{R}^{h\times h}$ , and $\mathbf{b}_{g}\in\mathbb{R}^{h}$ are the parameters in the gating layer. By using a gating layer, the salient parts from these two hidden representations can be extracted and smoothly combined.
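The gating layer computes an element-wise convex combination of the two representations. The sketch below uses random toy vectors and parameters purely for illustration.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def gate_fuse(z_r, z_c, W_g1, W_g2, b_g):
    """Fuse rating and content representations with a learned element-wise gate."""
    G = sigmoid(W_g1 @ z_r + W_g2 @ z_c + b_g)  # gate values in (0, 1)
    z_g = G * z_r + (1.0 - G) * z_c             # element-wise convex combination
    return G, z_g


rng = np.random.default_rng(0)
h = 4                                           # toy bottleneck size
z_r, z_c = rng.normal(size=h), rng.normal(size=h)
G, z_g = gate_fuse(z_r, z_c,
                   rng.normal(size=(h, h)), rng.normal(size=(h, h)), np.zeros(h))
print(G.round(3), z_g.round(3))
```

Because each entry of the gate lies strictly between 0 and 1, each entry of the fused vector lies between the corresponding entries of the two inputs, so neither source can be scaled out of range regardless of its magnitude.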

4.4. Neighbor-Attention Module

Some items are inherently related to each other, e.g., through paper citations. Closely related items may form a local neighborhood that shares the same topic or has the same attributes. Therefore, for a certain item, if a user is interested in its neighborhood, the user may also be interested in this item. Besides, in an item’s local neighborhood, some items may be more representative and should play a more important role in describing the neighborhood. Inspired by this intuition, we propose a neighbor-attention module to learn the neighborhood hidden representation of a certain item. This attention mechanism is similar to that used in machine translation (Luong et al., 2015).

Formally, we define the neighbor set of item $i$ as $\mathcal{N}_{i}$, which can be obtained from the item adjacency matrix $\mathbf{N}$ (for items that do not inherently have item-item relations, we can compute the item-item similarity from the binary rating matrix $\mathbf{R}$ and set a threshold to select neighbors). The neighborhood hidden representation $\mathbf{z}_{i}^{n}$ of item $i$ is computed by:

$$s_{i,j}=\tanh(\mathbf{z}_{i}^{g\top}\mathbf{W}_{n}\mathbf{z}_{j}^{g}),\ \forall j\in\mathcal{N}_{i},$$

$$\mathbf{a}_{i}=softmax(\mathbf{s}_{i}),$$

$$\mathbf{z}_{i}^{n}=\sum_{j\in\mathcal{N}_{i}}a_{i,j}\mathbf{z}_{j}^{g},$$

where $\mathbf{W}_{n}\in\mathbb{R}^{h\times h}$ is the parameter matrix to be learned in the neighbor-attention layer.
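A minimal NumPy sketch of the neighbor-attention computation is given below. It stores the fused item vectors as rows of a matrix `Z_g`, and it returns a zero vector for an item with no neighbors; both choices are assumptions made for this illustration and are not taken from the text.

```python
import numpy as np


def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()


def neighbor_attention(i, Z_g, N, W_n):
    """Return (neighbor ids, attention weights, neighborhood vector z_i^n)."""
    neighbors = np.flatnonzero(N[i])
    if neighbors.size == 0:
        # Assumption: an item without neighbors gets a zero neighborhood vector.
        return neighbors, np.zeros(0), np.zeros(Z_g.shape[1])
    # s_ij = tanh(z_i^T W_n z_j) for every neighbor j, computed in one product.
    s = np.tanh(Z_g[neighbors] @ W_n.T @ Z_g[i])
    a = softmax(s)
    z_n = a @ Z_g[neighbors]            # weighted sum of the neighbors' fused vectors
    return neighbors, a, z_n


rng = np.random.default_rng(0)
n, h = 4, 3                             # toy sizes
Z_g = rng.normal(size=(n, h))           # row j holds z_j^g
N = np.array([[0, 1, 1, 0],             # toy symmetric adjacency matrix
              [1, 0, 0, 1],
              [1, 0, 0, 0],
              [0, 1, 0, 0]])
W_n = rng.normal(size=(h, h))
nbrs0, a0, z_n0 = neighbor_attention(0, Z_g, N, W_n)
nbrs3, a3, z_n3 = neighbor_attention(3, Z_g, N, W_n)
print(nbrs0, a0.round(3))
```

The weights `a0` are the kind of per-neighbor importance scores reported in the Table 4 case study.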

To simultaneously capture users’ preferences on a certain item and its neighborhood, the decoder in Eq. 1 is rewritten as:

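$$\mathbf{z}_{i}^{(3,g)}=a_{3}(\mathbf{W}_{3}\mathbf{z}_{i}^{g}+\mathbf{b}_{3}),$$

$$\mathbf{z}_{i}^{(3,n)}=a_{3}(\mathbf{W}_{3}\mathbf{z}_{i}^{n}+\mathbf{b}_{3}),$$

$$\hat{\mathbf{r}}_{i}=a_{4}(\mathbf{W}_{4}\mathbf{z}_{i}^{(3,g)}+\mathbf{W}_{4}\mathbf{z}_{i}^{(3,n)}+\mathbf{b}_{4}).$$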
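In the rewritten decoder, the fused vector and the neighborhood vector pass through the same weights $\mathbf{W}_{3}$ and $\mathbf{W}_{4}$. A small NumPy sketch, again assuming tanh for $a_{3}$ and a sigmoid for $a_{4}$ and using random toy inputs:

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def decode_with_neighborhood(z_g, z_n, W3, b3, W4, b4):
    """Decoder that shares W3 and W4 between the item and its neighborhood."""
    z3_g = np.tanh(W3 @ z_g + b3)                # a_3 assumed to be tanh
    z3_n = np.tanh(W3 @ z_n + b3)
    return sigmoid(W4 @ z3_g + W4 @ z3_n + b4)   # a_4 assumed to be sigmoid


rng = np.random.default_rng(0)
m, h1, h = 6, 4, 3                               # toy sizes
W3, b3 = rng.normal(size=(h1, h)), np.zeros(h1)
W4, b4 = rng.normal(size=(m, h1)), np.zeros(m)
z_g, z_n = rng.normal(size=h), rng.normal(size=h)
r_hat = decode_with_neighborhood(z_g, z_n, W3, b3, W4, b4)
print(r_hat.round(3))
```

Since $\mathbf{W}_{4}$ is shared, the output layer is equivalent to applying $\mathbf{W}_{4}$ once to the sum of the two decoded hidden vectors.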

4.5. Weighted Loss

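To model user preferences from implicit feedback, we follow an approach similar to (Hu et al., 2008) and plug a confidence matrix into the square loss function:

$$\mathcal{L}_{AE}=\sum_{i=1}^{n}\sum_{u=1}^{m}\|C_{u,i}(R_{u,i}-\hat{R}_{u,i})\|_{2}^{2}=\|\mathbf{C}^{\top}\odot(\mathbf{R}^{\top}-\hat{\mathbf{R}}^{\top})\|_{F}^{2},$$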

where $\odot$ is the element-wise multiplication of matrices. $||\cdot||_{F}$ is the Frobenius norm of matrices. In particular, we set the confidence matrix $\mathbf{C}\in\mathbb{R}^{m\times n}$ as follows,

$$C_{u,i}=\begin{cases}\rho & \text{if } R_{u,i}=1\\ 1 & \text{otherwise}\end{cases}$$

where the hyper-parameter $\rho>1$ is a constant.
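The confidence-weighted loss is straightforward to compute. In the sketch below, $\rho=2$ and the $2\times 2$ matrices are toy values chosen so the result can be checked by hand; they are not settings from the paper.

```python
import numpy as np


def confidence_matrix(R, rho):
    """C_ui = rho for observed positives (R_ui = 1), and 1 otherwise."""
    return np.where(R == 1, rho, 1.0)


def weighted_ae_loss(R, R_hat, rho):
    """Squared Frobenius norm of C * (R - R_hat), with * taken element-wise."""
    C = confidence_matrix(R, rho)
    return float(np.sum((C * (R - R_hat)) ** 2))


R = np.array([[1.0, 0.0],
              [0.0, 1.0]])
R_hat = np.full((2, 2), 0.5)
loss = weighted_ae_loss(R, R_hat, rho=2.0)
print(loss)  # 2.5: each positive contributes (2 * 0.5)^2, each missing entry (1 * 0.5)^2
```

Because the confidence sits inside the squared norm, an error on an observed positive is weighted by $\rho^{2}$ relative to the same error on a missing entry.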

4.6. Network Training

By combining with regularization terms, the objective function of the proposed model is shown as follows:

[TABLE]

where $\lambda$ is the regularization parameter. To minimize the objective function, the partial derivatives with respect to all the parameters are computed by back-propagation, and the parameters are updated by gradient descent. We apply Adam (Kingma and Ba, 2014) to automatically adapt the learning rate during the learning procedure.
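For readers unfamiliar with Adam, the update for a single scalar parameter can be sketched as below. This is the standard rule from Kingma and Ba (2014) with its usual default decay rates, not the training code of the proposed model. The function name and the toy objective $(\theta-3)^{2}$ are hypothetical.

```python
import math

def adam_step(theta, grad, m, v, t, lr=0.01,
              beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a scalar parameter; t is the 1-based step index."""
    m = beta1 * m + (1 - beta1) * grad          # first-moment estimate
    v = beta2 * v + (1 - beta2) * grad * grad   # second-moment estimate
    m_hat = m / (1 - beta1 ** t)                # bias correction
    v_hat = v / (1 - beta2 ** t)
    theta = theta - lr * m_hat / (math.sqrt(v_hat) + eps)
    return theta, m, v

# Toy objective f(theta) = (theta - 3)^2 with gradient 2 * (theta - 3).
theta, m, v = 0.0, 0.0, 0.0
theta, m, v = adam_step(theta, 2 * (theta - 3), m, v, t=1)
print(theta)  # the first step moves theta by roughly lr toward 3
```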

5. Experiments

In this section, we evaluate the proposed model against state-of-the-art methods on four real-world datasets.

5.1. Datasets

The proposed models are evaluated on four real-world datasets from various domains with different sparsities: citeulike-a (Wang and Blei, 2011), movielens-20M (Harper and Konstan, 2016), Amazon-Books, and Amazon-CDs (He and McAuley, 2016). The citeulike-a dataset provides user preferences on articles as well as article titles, abstracts, and citations. The movielens-20M dataset is a user-movie dataset where the movie descriptions are crawled from TMDB (https://www.themoviedb.org/). The Amazon-Books and Amazon-CDs datasets are adopted from the Amazon review dataset (http://jmcauley.ucsd.edu/data/amazon/), which covers a large amount of user-item interaction data, e.g., reviews, ratings, and helpfulness ratings of reviews. We select the user review with the highest helpfulness rating as the item’s description. To be consistent with the implicit feedback setting, on the latter three datasets we keep ratings no less than four (out of five) as positive feedback and treat all other ratings as missing entries. Since items in the latter three datasets do not inherently have item-item relations, we compute the item-item similarity from the binary rating matrix $\mathbf{R}$ and set the threshold to 0.2 to select the neighbors of items. To filter noisy data, we only keep users with at least ten ratings and items with at least five ratings. The data statistics after preprocessing are shown in Table 1. For each user, we randomly select 20% of her rated items as ground truth for testing. The remaining items constitute the training set. The random selection is carried out five times independently, and we report the average results.
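The binarization and neighbor-selection steps described above can be sketched as follows. The text does not state which similarity measure is used, so cosine similarity between binary item columns is assumed here, as is treating a similarity equal to the threshold as a match. All function names and the toy ratings are hypothetical.

```python
import math

def binarize(ratings, min_rating=4):
    """Keep (user, item) pairs whose explicit rating is at least min_rating."""
    return {(u, i) for u, i, r in ratings if r >= min_rating}

def item_neighbors(positives, sim_threshold=0.2):
    """Select item neighbors by thresholding an item-item similarity.

    Assumption: cosine similarity between binary item columns, i.e.
    |users(a) & users(b)| / sqrt(|users(a)| * |users(b)|).
    """
    users_of = {}
    for u, i in positives:
        users_of.setdefault(i, set()).add(u)
    items = sorted(users_of)
    neighbors = {i: [] for i in items}
    for idx, a in enumerate(items):
        for b in items[idx + 1:]:
            common = len(users_of[a] & users_of[b])
            sim = common / math.sqrt(len(users_of[a]) * len(users_of[b]))
            if sim >= sim_threshold:
                neighbors[a].append(b)
                neighbors[b].append(a)
    return neighbors

# Toy explicit ratings as (user, item, rating) triples.
ratings = [("u1", "a", 5), ("u1", "b", 4), ("u2", "a", 4),
           ("u2", "b", 2), ("u3", "c", 5)]
positives = binarize(ratings)
neighbors = item_neighbors(positives)
print(neighbors)  # items a and b share user u1; c has no neighbors
```

The pairwise loop is quadratic in the number of items, so a dataset of the size reported in Table 1 would need a sparse matrix product instead. The loop is kept here only for readability.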

5.2. Evaluation Metrics

We evaluate our model versus other methods in terms of Recall@k and NDCG@k. For each user, Recall@k (R@k) indicates what percentage of her rated items can emerge in the top $k$ recommended items. NDCG@k (N@k) is the normalized discounted cumulative gain at $k$ , which takes the position of correctly recommended items into account.
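For concreteness, the two metrics can be computed for a single user as in the sketch below. It assumes binary relevance, a logarithmic position discount with base 2, and an ideal DCG built from as many hits as fit in the top $k$ . The function names and the toy ranking are hypothetical.

```python
import math

def recall_at_k(ranked, relevant, k):
    """Fraction of the user's held-out items that appear in the top k."""
    hits = sum(1 for item in ranked[:k] if item in relevant)
    return hits / len(relevant)

def ndcg_at_k(ranked, relevant, k):
    """Binary-relevance NDCG: DCG of the ranking divided by the ideal DCG."""
    dcg = sum(1.0 / math.log2(pos + 2)
              for pos, item in enumerate(ranked[:k]) if item in relevant)
    ideal = sum(1.0 / math.log2(pos + 2)
                for pos in range(min(k, len(relevant))))
    return dcg / ideal

# One user with three held-out items; only "a" is ranked in the top 3.
ranked = ["a", "b", "c", "d"]
relevant = {"a", "d", "e"}
r3 = recall_at_k(ranked, relevant, 3)
n3 = ndcg_at_k(ranked, relevant, 3)
print(r3, n3)
```

The reported R@k and N@k would then be averages of these per-user values over all test users.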

5.3. Methods Studied

To demonstrate the effectiveness of our model, we compare to the following recommendation methods.

Classical methods for implicit feedback:

• WRMF, weighted regularized matrix factorization (Hu et al., 2008), which minimizes the squared error loss by assigning different confidence values to user-rated and unrated items.

• CDAE, collaborative denoising autoencoder (Wu et al., 2016), which utilizes the denoising autoencoder to learn the user hidden representation from implicit feedback.

Methods learning from bag-of-words:

• CDL, collaborative deep learning (Wang et al., 2015), is a probabilistic feedforward model for joint learning of stacked denoising autoencoder (SDAE) and collaborative filtering.

• CVAE, collaborative variational autoencoder (Li and She, 2017), is a generative latent variable model that jointly models the generation of content and rating and uses variational Bayes with inference network for variational inference.

• CML+F, collaborative metric learning with item features (Hsieh et al., 2017), which learns a metric space to encode not only users’ preferences but also the user-user and item-item similarities.

Methods learning from word sequences:

• ConvMF, convolutional matrix factorization (Kim et al., 2016), which applies the convolutional neural network (CNN) to capture contextual information of documents and integrates CNN into the probabilistic matrix factorization (PMF).

• JRL, joint representation learning (Zhang et al., 2017), is a framework that learns joint representations from different information sources for top-N recommendation.

The proposed method:

• GATE, the proposed model, fuses hidden representations from items’ ratings and contents by a gating layer; moreover, word-attention and neighbor-attention are adopted for selecting informative words and learning hidden representations of items’ neighborhoods, respectively.

Given our extensive comparisons against the state-of-the-art methods, we omit comparisons with methods such as HFT (McAuley and Leskovec, 2013), CTR (Wang and Blei, 2011), SVDFeature (Chen et al., 2012), and DeepMusic (van den Oord et al., 2013) since they have been outperformed by the recently proposed CDL, CVAE, and JRL.

5.4. Experiment Settings

In the experiments, the latent dimension of all the models is set to 50. WRMF adopts the same heuristic weighting function as the proposed model. For CDAE, we follow the settings in the original paper. For CDL, we set $a=1$ , $b=0.01$ , and find that $\lambda_{u}=1$ , $\lambda_{v}=10$ , $\lambda_{n}=100$ , and $\lambda_{w}=0.0001$ achieve good performance. For CVAE, the parameters $a=1$ and $b=0.01$ are the same, and $\lambda_{u}=0.1$ , $\lambda_{v}=10$ , and $\lambda_{r}=0.01$ achieve good performance. For CML+F, we follow the authors’ code and set the margin $m=2.0$ , $\lambda_{f}=0.1$ , and $\lambda_{c}=1$ . The item features are learned by a multi-layer perceptron with a 512-dimensional hidden layer and $0.3$ dropout. For ConvMF, we use the same CNN configuration as the original paper and find that it achieves good results when $a=1$ , $b=0.01$ , $\lambda_{u}=0.1$ , and $\lambda_{v}=10$ . For JRL, we follow the original paper and set the batch size to $64$ , the number of negative samples to $t=5$ , and $\lambda_{1}=1$ . The network architectures of the above methods are also set the same as in the original papers.

For GATE, the rating input of an item is its binary rating vector from all users, and the content of an item is the word sequence from its description. We set the maximum length of the word sequence to $300$ , and the same setting is adopted in ConvMF and JRL. Hyper-parameters are set by grid search. The network architecture is set to $[m,100,50,100,m]$ on all datasets. $\rho$ is set to $5$ on citeulike-a, $20$ on movielens-20M, $15$ on Amazon-Books, and $20$ on Amazon-CDs. $d_{a}$ is set to $20$ ; its effect is shown in Section 5.7. The learning rate and $\lambda$ are set to $0.01$ and $0.001$ , respectively. The activation function is set to $\tanh$ , and the batch size is set to $1024$ . Our experiments are conducted with PyTorch (https://pytorch.org/) running on GPU machines with Nvidia GeForce GTX 1080 Ti. The code is available on GitHub: https://github.com/allenjack/GATE.

5.5. Performance Comparison

The performance comparison results are shown in Figures 2, 3, 4, and 5, and Table 2. Since CDAE is not as good as other state-of-the-art methods when the dataset becomes sparse, we do not present the results of CDAE in the aforementioned figures.

Observations about our model. First, the proposed model, GATE, achieves the best performance on three datasets with all evaluation metrics, except for Recall@15 and Recall@20 on citeulike-a, which illustrates the superiority of our model. Second, GATE obtains better results than JRL and ConvMF. Although JRL and ConvMF capture the contextual information in item descriptions by the doc2vec model (Le and Mikolov, 2014) and the convolutional neural network, respectively, they treat every word of an item equally and do not consider the effects of informative words, which leads to an incomplete understanding of item content. Third, GATE outperforms CML+F, CVAE, and CDL. The reasons are two-fold: (1) these three methods learn the item content representation through bag-of-words, which neglects the fact that important words can describe the topics or synopses of items; (2) these three methods link the hidden representations from different data sources by a regularization term, which may not smoothly balance the effects of the various data representations and incurs tedious hyper-parameter tuning. Fourth, GATE achieves better results than WRMF and CDAE, since these two methods do not incorporate content information, which is crucial when the user-item interaction data is sparse. Fifth, it is important to note that none of the compared methods considers the user preference on an item’s neighborhood, which is captured by the neighbor-attention module of GATE. Sixth, GATE does not significantly improve the performance over other methods on the citeulike-a dataset. One possible reason is that the citeulike-a dataset is relatively small, which makes GATE overfit the data.

Other observations. First, all the results reported on citeulike-a and movielens-20M are better than the results on Amazon-Books and Amazon-CDs. The major reason is that the latter two datasets are sparser, and data sparsity degrades the recommendation performance. Second, JRL and CML+F perform better than other state-of-the-art methods on sparser datasets. The reason may be that JRL models the contextual information in the item descriptions, which captures the word-word relations in the text. CML+F, on the other hand, encodes user-item relationships and user-user/item-item similarities in a joint metric space, which helps to find users’ preferred items when the data is sparse. Third, although ConvMF models the contextual information from items’ descriptions, it still does not perform better than JRL, CML+F, CVAE, and CDL. One possible reason is that the regularization term in ConvMF does not effectively transfer the latent features learned from text to the item latent factors learned from matrix factorization. Fourth, CVAE and CDL achieve similar results on all datasets. One reason is that they have a similar Bayesian probabilistic framework. Fifth, WRMF and CDAE only adopt implicit feedback as input and do not model the auxiliary information, which is why their performance drops when the dataset becomes sparse. In addition, WRMF has similar results to CDL and CVAE on some metrics, which may indicate that CDL and CVAE do not fully take advantage of the heterogeneous data.

5.6. Ablation Analysis

To verify the effectiveness of the proposed word-attention, gating layer, and neighbor-attention modules, we conduct an ablation analysis in Table 3 to demonstrate how much each module contributes to the GATE model. In (1), we utilize the weighted stacked AE without any other components. In (2), we regularize $\mathbf{z}_{i}^{r}$ and $\mathbf{z}_{i}^{c}$ by the $L2$ norm on top of (1), following the same manner as in (Wang et al., 2015; Li and She, 2017). We tried the regularization parameters $\{0.01,0.1,0.5,1,10\}$ , where $0.1$ gives the best results. In (3), we plug in the gating layer to connect $\mathbf{z}_{i}^{r}$ and $\mathbf{z}_{i}^{c}$ on top of (1). In (4), we adopt a recurrent neural network structure, gated recurrent units (GRUs) (Cho et al., 2014), to learn $\mathbf{z}_{i}^{c}$ , which is also linked to $\mathbf{z}_{i}^{r}$ by the proposed gating layer. In (5), we replace the GRUs in (4) with a convolutional neural network (CNN), where the structure and hyper-parameters are set the same as in (Kim et al., 2016). In (6), we present the overall GATE model to show the significance of the neighbor-attention module.

From the results shown in Table 3, we make several observations. First, from (2) and (3), the gating layer achieves better results than regularization. One possible reason is that the neural gate can extract representative parts and mask off insignificant parts from the input hidden representations. Second, from (3), (4), and (5), we observe that our word-attention module has similar performance to GRUs and CNNs but with fewer parameters: if we set the word embedding size to 50 ( $h=50$ ), the number of learned parameters of our word-attention module is 3,590, that of the one-recurrent-layer GRU is 15,300, and that of the CNN in (Kim et al., 2016) is 75,350. (We verified the number of parameters of all three models with the named_parameters() function provided by PyTorch.) This result demonstrates that the proposed word-attention module can effectively learn the item hidden representation from items’ descriptions. Third, from (1), (3), and (6), we observe that our neighbor-attention may play a critical role in the overall model. The results demonstrate that modeling users’ preferences on an item’s neighborhood is an effective supplement for inferring their preferences on this item.
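The GRU figure quoted above can be reproduced by hand. The sketch below assumes the layout used by PyTorch's `nn.GRU` for a single layer (three gates, separate input-to-hidden and hidden-to-hidden weight matrices, and two bias vectors) and assumes that both the input size and the hidden size equal $h=50$ . The function name is hypothetical.

```python
def gru_parameter_count(input_size, hidden_size):
    """Parameter count of a one-layer GRU with two bias vectors per gate."""
    gates = 3  # reset, update, and candidate (new) gates
    weights = gates * hidden_size * (input_size + hidden_size)
    biases = gates * hidden_size * 2  # input-side and hidden-side biases
    return weights + biases

count = gru_parameter_count(50, 50)
print(count)  # 3 * 50 * 100 + 3 * 50 * 2 = 15300
```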

5.7. The Sensitivity of Hyper-parameters

The effects of $\rho$ and $d_{a}$ are shown in Figure 6; the trends are similar on the other datasets. We can observe that as $\rho$ increases, the performance improves and then becomes stable. The reason is that a larger value of $\rho$ makes the model concentrate more on the items that users interacted with before, where users’ preferences are more accurately captured. For the variation of $d_{a}$ , we verify that utilizing a vector to measure the importance of a word is more effective than using a single value in our scenario, because the score vector describes the relations between words from different aspects. Note that we do not include the neighbor-attention module when testing the effect of $d_{a}$ .

5.8. Word- and Neighbor-Attention Case Studies

To visualize the word-attention effects, we sum along the first dimension of $\mathbf{A}_{i}\in\mathbb{R}^{d_{a}\times l_{i}}$ (Eq. 4) to get $\mathbf{a}_{i}\in\mathbb{R}^{l_{i}}$ , which can be treated as the accumulated attention weights of each word. For ease of visualization, we normalize $\mathbf{a}_{i}$ following the same procedure as in (Lin et al., 2017), and words with lower scores are not colored. Two examples of word-attention visualization are shown in Table 5. In the first example, we can observe that the words aligner and cloud have the highest importance scores, which may reflect the topic and platform of this paper. On the other hand, the words present, show, and conclude are widely used in all papers and are therefore less informative. The second example shows the same pattern. The most important word selected by the word-attention is metaphor, which may reveal the motif of the article.

The neighbor-attention case study is shown in Table 4. The neighbors of the target article are provided by the citation graph of the citeulike-a dataset. From this case, we observe that the neighbor attention score can identify an item’s important neighbors. In the example, the target article finds a scaling rule in network dynamics, and the fourth neighbor of the target also observes the same scaling behavior for all groups of genes. If we treat fluctuation as the key topic of the target article, the number of occurrences of fluctuation in the target’s neighbors may reveal how related the target is to its neighbors. We therefore also list the count of fluctuation after each article title. These counts further verify the importance scores computed by our neighbor-attention module.
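The accumulation step for the word-attention weights can be sketched as below. The normalization is deferred to Lin et al. (2017) in the text, so dividing by the total so that the per-word weights sum to one is an assumption made here. The function name and the toy attention matrix are hypothetical.

```python
def accumulated_attention(A):
    """Collapse a d_a x l_i attention matrix into l_i per-word weights.

    Sums over the first dimension (the d_a attention rows), then rescales
    so the weights sum to one (assumed normalization).
    """
    summed = [sum(column) for column in zip(*A)]
    total = sum(summed)
    return [s / total for s in summed]

# Toy matrix with d_a = 2 attention rows over a 3-word description.
A = [[0.5, 0.25, 0.25],
     [0.25, 0.5, 0.25]]
a = accumulated_attention(A)
print(a)  # [0.375, 0.375, 0.25]
```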

6. Conclusion

In this paper, we proposed a gated autoencoder with word- and neighbor-attention. The model learned items’ hidden representations from ratings and contents in a gated manner. Moreover, the model captured items’ informative words and representative neighbors by the word- and neighbor-attention modules, respectively. Experimental results on four real-world datasets validated the performance of our model over many state-of-the-art methods and showed the effectiveness of the gating and attention modules.

Bibliography

Bayer et al. (2017) Immanuel Bayer, Xiangnan He, Bhargav Kanagal, and Steffen Rendle. 2017. A Generic Coordinate Descent Framework for Learning from Implicit Feedback. In WWW. ACM, 1341–1350.
Chen et al. (2018) Chong Chen, Min Zhang, Yiqun Liu, and Shaoping Ma. 2018. Neural Attentional Rating Regression with Review-level Explanations. In WWW. ACM, 1583–1592.
Chen et al. (2017) Jingyuan Chen, Hanwang Zhang, Xiangnan He, Liqiang Nie, Wei Liu, and Tat-Seng Chua. 2017. Attentive Collaborative Filtering: Multimedia Recommendation with Item- and Component-Level Attention. In SIGIR. ACM, 335–344.
Chen et al. (2012) Tianqi Chen, Weinan Zhang, Qiuxia Lu, Kailong Chen, Zhao Zheng, and Yong Yu. 2012. SVDFeature: a toolkit for feature-based collaborative filtering. Journal of Machine Learning Research 13 (2012), 3619–3622.
Cho et al. (2014) Kyunghyun Cho, Bart van Merrienboer, Çaglar Gülçehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2014. Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation. In EMNLP. ACL, 1724–1734.
Gong and Zhang (2016) Yuyun Gong and Qi Zhang. 2016. Hashtag Recommendation Using Attention-Based Convolutional Neural Network. In IJCAI. IJCAI/AAAI Press, 2782–2788.
Guo et al. (2017) Huifeng Guo, Ruiming Tang, Yunming Ye, Zhenguo Li, and Xiuqiang He. 2017. DeepFM: A Factorization-Machine based Neural Network for CTR Prediction. In IJCAI. ijcai.org, 1725–1731.