Signed Distance-based Deep Memory Recommender

Thanh Tran; Xinyue Liu; Kyumin Lee; Xiangnan Kong

arXiv:1905.00453·cs.IR·May 3, 2019

TL;DR

This paper introduces a deep learning recommender system that captures complex non-linear user-item relationships using signed distance measures, significantly outperforming existing models on multiple real-world datasets.

Contribution

The paper proposes a novel deep memory recommender leveraging signed distance to model non-linear user-item interactions explicitly and implicitly.

Findings

01. Achieved significant improvements over ten state-of-the-art models.

02. Performed well in both general and shopping basket-based recommendation tasks.

03. Validated on six real-world datasets.

Tables

Table 1. Statistics of the four datasets in the general recommendation task.

Statistics	ML-100k	ML-1M	Netflix	Epinions
# of users	943	6,040	1,888	23,137
# of items	1,682	3,706	3,724	23,585
# of interactions	100,000	1,000,209	103,254	461,982
Density (%)	6.3%	4.5%	1.5%	0.08%

Table 2. Statistics of the two real-world transactional datasets in the shopping basket-based recommendation task.

Statistics	IJCAI-15	Ta-Feng
# of users	2,433	22,851
# of items	4,534	22,291
avg # of items in a transaction	6.28	9.28
# of generated instances	15,422	523,653
Density (%)	0.14%	0.10%

Table 3. General Recommendation Task: Overall performance of the baselines, and our proposed SDP, SDM, and SDMR on four datasets. The last four lines show the relative improvement of the SDM and SDMR over the best baseline method in General Recommenders (Group 1) and Sequential Recommenders (Group 2), respectively.

Method type	Method	ML-100k hit@10	ML-100k NDCG@10	ML-1M hit@10	ML-1M NDCG@10	Netflix hit@10	Netflix NDCG@10	Epinions hit@10	Epinions NDCG@10
General Recommenders (Group 1)	Item-KNN	0.166	0.073	0.235	0.110	0.039	0.019	0.121	0.096
	SLIM	0.520	0.298	0.677	0.420	0.358	0.212	0.249	0.189
	MF-BPR	0.554	0.316	0.595	0.352	0.352	0.193	0.384	0.232
	CML	0.596	0.326	0.662	0.390	0.447	0.254	0.376	0.237
	NeuMF++	0.623	0.341	0.716	0.438	0.509	0.279	0.428	0.274
	CMN++	0.620	0.344	0.729	0.442	0.523	0.293	0.423	0.272
Sequential Recommenders (Group 2)	PRME	0.638	0.381	0.724	0.486	0.509	0.329	0.538	0.346
	PRME_s	0.674	0.398	0.734	0.491	0.539	0.348	0.380	0.244
	TransRec	0.684	0.402	0.770	0.524	0.511	0.345	0.551	0.357
	Caser	0.674	0.386	0.826	0.606	0.480	0.253	0.326	0.268
Ours	SDP	0.616	0.349	0.694	0.424	0.497	0.279	0.416	0.266
	SDM	0.713	0.435	0.816	0.584	0.584	0.379	0.575	0.390
	SDMR	0.695	0.562	0.810	0.662	0.592	0.449	0.568	0.423
Compared to Group 1	Imprv. of SDM	14.54%	26.51%	11.93%	32.13%	11.71%	29.32%	34.35%	42.34%
Compared to Group 1	Imprv. of SDMR	11.65%	63.44%	11.11%	49.77%	13.24%	53.20%	32.71%	54.38%
Compared to Group 2	Imprv. of SDM	4.24%	8.21%	-1.21%	-3.63%	8.35%	8.91%	4.36%	9.24%
Compared to Group 2	Imprv. of SDMR	1.61%	39.80%	-1.94%	9.24%	9.83%	29.02%	3.09%	18.49%

Table 4. Shopping basket-based Recommendation Task: Overall performance of the baselines, and our proposed models on two datasets. The last two lines show the relative improvement of the SDM and SDMR over the best baseline.

Method	IJCAI-15 hit@10	IJCAI-15 NDCG@10	Ta-Feng hit@10	Ta-Feng NDCG@10
PRME	0.276	0.177	0.594	0.365
PRME_s	0.229	0.133	0.590	0.355
TransRec	0.262	0.168	0.622	0.401
Caser	0.173	0.096	0.605	0.373
SDP	0.323	0.201	0.633	0.401
SDM	0.316	0.189	0.646	0.439
SDMR	0.336	0.222	0.627	0.559
Imprv. of SDM	14.49%	6.78%	3.86%	9.48%
Imprv. of SDMR	21.74%	25.42%	0.80%	39.40%

Equations

$$\bm{e}^{(1)}=f_{1}\Big(\mathbf{W}^{(1)}\begin{bmatrix}\bm{u}_{i}\\ \bm{v}_{j}\end{bmatrix}+\bm{b}^{(1)}\Big)$$

$$\bm{e}^{(2)}=f_{2}\big(\mathbf{W}^{(2)}\bm{e}^{(1)}+\bm{b}^{(2)}\big)$$

$$\dots$$

$$\bm{e}^{(\ell)}=f_{\ell}\big(\mathbf{W}^{(\ell)}\bm{e}^{(\ell-1)}+\bm{b}^{(\ell)}\big)$$

$$\bm{e}^{(\ell+1)}=square\big(\bm{e}^{(\ell)}\big)$$

$$o^{(SDP)}=\bm{w}^{(o)\top}\bm{e}^{(\ell+1)}+b^{(o)}$$

$$\bm{q}_{ij}=f\Big(\mathbf{W}_{a}\begin{bmatrix}\bm{u}_{i}^{(i)}\\ \bm{v}_{j}^{(i)}\end{bmatrix}+\bm{b}_{a}\Big)$$

$$\bm{p}_{ij}=f\Big(\mathbf{W}_{b}\begin{bmatrix}\bm{u}_{i}^{(o)}\\ \bm{v}_{j}^{(o)}\end{bmatrix}+\bm{b}_{b}\Big)$$

$$z_{ijk}=\Big\|f\Big(\mathbf{W}_{c}\begin{bmatrix}\bm{q}_{ij}\\ \bm{v}_{k}^{(i)}\end{bmatrix}+\bm{b}_{c}\Big)\Big\|_{2}^{2}$$

$$a_{ijk}=\frac{\exp(-z_{ijk})}{\sum_{p\in\mathcal{T}^{i}_{j}}\exp(-z_{ijp})}$$

$$\bm{a}_{ij}=[a_{ij1},\dots,a_{ijs}]^{\top}$$

$$o_{ij}^{(SDM)}=\bm{w}_{e}^{\top}\bm{e}_{ij}+b_{e}$$

$$\bm{e}_{ij}=\sum_{k\in\mathcal{T}^{i}_{j}}a_{ijk}\times square\Big(f\Big(\mathbf{W}_{d}\begin{bmatrix}\bm{p}_{ij}\\ \bm{v}_{k}^{(o)}\end{bmatrix}+\bm{b}_{d}\Big)\Big)$$

$$g^{(h-1)}$$

$$q^{(h)}$$

$$a_{ijk}^{(h)}=\frac{\exp(-z^{(h)}_{ijk})}{\sum_{p\in\mathcal{T}^{i}_{j}}\exp(-z^{(h)}_{ijp})}\text{, where}\;z_{ijk}^{(h)}=\Big\|f\Big(\mathbf{W}_{c}^{(h)}\begin{bmatrix}\bm{q}^{(h)}_{ij}\\ \bm{v}_{k}^{(i)}\end{bmatrix}+\bm{b}_{c}\Big)\Big\|_{2}^{2}$$

$$o_{ij}^{(SDM-h)}=\bm{w}_{e}^{\top}\bm{e}_{ij}^{(h)}+b_{e}^{(h)}$$

$$\bm{e}_{ij}^{(h)}=\sum_{k\in\mathcal{T}^{i}_{j}}a_{ijk}^{(h)}\times square\Big(f\Big(\mathbf{W}_{d}^{(h)}\begin{bmatrix}\bm{p}_{ij}^{(h)}\\ \bm{v}_{k}^{(o)}\end{bmatrix}+\bm{b}_{d}^{(h)}\Big)\Big)$$

$$o=\beta\,o^{(SDP)}+(1-\beta)\,o^{(SDM)}$$

$$o=ReLU\bigg(\bm{w}_{u}^{\top}\begin{bmatrix}\bm{e}^{(\ell+1)}\\ \bm{e}^{(h)}\end{bmatrix}+b_{u}\bigg)$$

$$\mathcal{L}=\operatorname*{argmin}_{\theta}\Big(-\sum_{(u,i^{+},i^{-})}\log\sigma(o_{ui^{-}}-o_{ui^{+}})+\lambda\|\theta\|^{2}\Big)$$

$$PMI(j,k)=\log\frac{P(j,k)}{P(j)\times P(k)}$$

Code & Models

Repositories

thanhdtran/SDMR (PyTorch, official)

Full text

Signed Distance-based Deep Memory Recommender

Thanh Tran, Xinyue Liu, Kyumin Lee, Xiangnan Kong

Department of Computer Science

Worcester Polytechnic Institute, Massachusetts, USA

tdtran, xliu4, kmlee, [email protected]

(2019)

Abstract.

Personalized recommendation algorithms learn a user’s preference for an item by measuring a distance/similarity between them. However, some existing recommendation models (e.g., matrix factorization) assume a linear relationship between the user and the item. This assumption limits the capacity of recommender systems, since the interactions between users and items in real-world applications are much more complex than a linear relationship. To overcome this limitation, we design and propose a deep learning framework called Signed Distance-based Deep Memory Recommender, which captures non-linear relationships between users and items explicitly and implicitly, and works well in both the general recommendation task and the shopping basket-based recommendation task. Through an extensive empirical study on six real-world datasets in the two recommendation tasks, our proposed approach achieved significant improvement over ten state-of-the-art recommendation models.

Keywords: memory recommender; signed distance; metric-based attention.

Published in: Proceedings of the 2019 World Wide Web Conference (WWW ’19), May 13–17, 2019, San Francisco, CA, USA. DOI: 10.1145/3308558.3313460. ISBN: 978-1-4503-6674-8/19/05.

1. Introduction

Recommender systems (Aggarwal, 2016) have been deployed in many online applications such as e-commerce, music/video streaming services, and social media. They play a vital role in helping users explore new items and in helping companies increase their revenues. Most recommendation algorithms model user preferences and item properties based on observed interactions (e.g., clicks, reviews, ratings) between users and items (Koren, 2009, 2010; Liu et al., 2016). From one perspective, most recommendation models can be viewed as measuring a similarity or distance between a user and an item. For instance, the well-known latent factor (i.e., matrix factorization) models (Koren, 2008) usually employ an inner product to approximate the similarity between the user and the item. Although latent factor models achieve competitive performance on some datasets, they do not correctly capture complex (i.e., non-linear) relationships between users and items because the inner product is inherently linear.

Existing recommendation algorithms face difficulties in finding good kernels for different data patterns (Liu et al., 2016), focus only on the user-item latent space without also considering the item-item latent space (He et al., 2017b; Liang et al., 2018; Wu et al., 2016; Li et al., 2015; Sedhain et al., 2015), or require additional auxiliary information (e.g., item description, music content, reviews) (Kim et al., 2016; Van den Oord et al., 2013; Liu et al., 2017; Chen et al., 2017; Lu et al., 2018a). To overcome these drawbacks, in this paper we propose and build a deep learning framework that learns a non-linear relationship between a user and a target item by measuring a distance from the observed data. In particular, we propose Signed Distance-based Deep Memory Recommender (SDMR), which captures the non-linear relationship between the user and the item explicitly and implicitly, combines the explicitly and implicitly measured relationships to produce a final distance score for the recommendation, and performs well in both the general recommendation task and the shopping basket-based recommendation task.

SDMR internally combines two signed distances, measured respectively by our proposed Signed Distance-based Perceptron (SDP) and Signed Distance-based Memory Network (SDM). On one hand, SDP explicitly measures a non-linear signed distance between the user and the item. Many existing models (He et al., 2016; Hu et al., 2008) rely on a pre-defined metric such as Euclidean distance (the green line in Figure 1), which is much more limited than a customized non-linear signed distance learned from the data (the red curves in Figure 1). On the other hand, SDM implicitly measures a non-linear signed distance between the user and the item via the user’s recently consumed items. SDM is similar in nature to the item neighborhood-based recommender (Sarwar et al., 2001; Ning and Karypis, 2011). However, it is more advanced in several aspects, as shown on the right side of Figure 1. First, SDM focuses only on a set of recently consumed items of the target user (e.g., the book, CD and camera in Figure 1) as context items. Second, it employs additional memories to learn a novel personalized metric-based attention on the consumed items. The goal of our proposed attention is to compute a weight for each consumed item w.r.t. the target item (i.e., the camera lens). In the example, the attention module assigns a higher weight to the camera and lower weights to the book and CD. Unlike our approach, most existing neighborhood-based models treat the contributions of consumed items to the target item equally, leading to suboptimal results. Last but not least, we update the attention weights via a gated multi-hop design to build a long-term memory within SDM. This multi-hop design helps refine our attention module and produces more accurate attentive scores.

The contributions of this work are summarized as follows:

$\bullet$ We design a deep learning framework which can tackle both general recommendation task and shopping basket-based recommendation task.

$\bullet$ We propose SDMR that combines two signed distance scores internally measured by SDP and SDM, which capture non-linear relationship between a user and an item explicitly and implicitly.

$\bullet$ To better balance the weights among consumed items of the user, we propose a novel multi-hop memory network with a personalized metric-based attention mechanism in SDM.

$\bullet$ Extensive experiments on six datasets in two different recommendation tasks demonstrate the effectiveness of our proposed methods against ten baselines.

2. Related Work

Latent Factor Models (LFM) have been extensively studied in the literature, including Matrix Factorization (Hu et al., 2008), Bayesian Personalized Ranking (Rendle et al., 2009), and fast matrix factorization for implicit feedback (eALS) (He et al., 2016). Despite their success, LFM suffer from several limitations. First, LFM overlook associations between the user’s previously consumed items and the target item (e.g., mobile phones and phone cases). Second, LFM usually rely on the inner product function, whose linearity limits the capability of modeling complex user-item interactions. To address the second issue, several non-linear latent factor models have been proposed, with the help of Gaussian processes (Lawrence and Urtasun, 2009) or kernels (Liu et al., 2016; Zhou et al., 2012). However, they either require expensive hyper-parameter tuning or face difficulties in finding good kernels for different data patterns.

Neighborhood-based models (Sarwar et al., 2001; Ning and Karypis, 2011) are usually based on the principle that similar users prefer similar items. The problem turns into finding the neighbors of a user or an item based on a pre-defined distance/similarity metric, such as cosine vector similarity (Lang, 1995; Billsus and Pazzani, 2000) or Pearson correlation similarity (Deshpande and Karypis, 2004). The recommendation quality depends heavily on the chosen metric, but finding a good pre-defined metric is usually very challenging. Furthermore, these models are also sensitive to the selection of neighbors. Our proposed SDM is similar to neighborhood-based models in nature, but it exploits a novel personalized metric-based attention for assigning attentive weights to context items. Therefore, our approach is more robust and less sensitive than conventional neighborhood-based models.

NeuMF (He et al., 2017b) is a neural network that generalizes matrix factorization via a Multi-Layer Perceptron (MLP) for learning non-linear interaction functions. Similarly, some other works (Liang et al., 2018; Wu et al., 2016; Li et al., 2015; Sedhain et al., 2015) substitute the MLP with an auto-encoder architecture. It is worth noting that all these approaches are limited by considering only the user-item latent space, and overlook the correlations in the item-item latent space. Besides, some deep learning based works (Lu et al., 2018b; Tay et al., 2018b; Seo et al., 2017; Ma et al., 2019, 2018) employ auxiliary information such as item description (Kim et al., 2016), music content (Van den Oord et al., 2013), item visual features (Liu et al., 2017; Chen et al., 2017), and reviews (Lu et al., 2018a) to address the cold-start problem. However, this auxiliary information is not always available, which limits their applicability in many real-world systems. Another line of work uses deep neural networks to model temporal effects of consumed items (Hidasi et al., 2015; Wu et al., 2017; Quadrana et al., 2017; Tang and Wang, 2018). Although our proposed methods do not explicitly consider the temporal effects, SDM utilizes the time information to select a set of recently consumed items as the context items of the target item.

The work most closely related to ours is the recently proposed Collaborative Memory Network (CMN) (Ebesu et al., 2018). In that work, the Memory Network (Sukhbaatar et al., 2015) is adapted to measure similarities between users and user neighbors. The key differences between our work and CMN are as follows: (i) First, we follow an item neighborhood based design, whereas CMN follows a user neighborhood based design. Prior work showed that item neighborhood based models slightly outperformed user neighborhood based models (Linden et al., 2003; Sarwar et al., 2001); (ii) Second, our proposed SDM model uses our proposed personalized metric-based attention mechanism and produces signed distance scores as output, whereas CMN exploits a traditional inner product based attention; (iii) Third, we use a gated multi-hop architecture (Liu and Perez, 2017), which was shown to perform better than the original multi-hop design (Sukhbaatar et al., 2015).

3. Problem Statement

In this section, we describe two recommendation problems: (i) the general recommendation task; and (ii) the shopping basket-based recommendation task. In the following sections, we focus on solving them.

General recommendation task: Let $V=\{v_{1},v_{2},...,v_{|V|}\}$ be the whole item set and $U=\{u_{1},u_{2},...,u_{|U|}\}$ be the whole user set. Each user $u_{i}\in U$ may consume several items $\{v_{i1},v_{i2},...,v_{ik}\}$ in $V$, denoted as a set of context items $c$. In this task, given the previously consumed items of a user $u_{i}$, a recommendation model predicts the next target item $v_{j}$ that $u_{i}$ may prefer; we denote this task as estimating $P(u_{i},v_{j}|c)$. Note that some existing works assume that $v_{j}$ is independent of the context items in the set $c$, leading to $P(u_{i},v_{j}|c)=P(u_{i},v_{j})$ (He et al., 2017b, 2016). In our work, we model $u_{i}$’s preference on $v_{j}$ in two steps: (i) an explicit preference of $u_{i}$ on $v_{j}$ in a signed distance based perceptron, and (ii) an implicit preference of $u_{i}$ on $v_{j}$ via summing attentive effects of context items toward target item $v_{j}$ in a signed distance based memory network.

Shopping basket-based recommendation task: This problem is based on the fact that users go shopping offline/online and add several items into a basket/cart together. Each shopping basket/cart is seen as a transaction, and each user may shop once or multiple times, leading to one or multiple transactions. Let $T^{(u)}=\{t_{1},t_{2},...,t_{|T^{(u)}|}\}$ be the set of user $u$’s transactions, where $|T^{(u)}|$ denotes the number of user $u$’s transactions. Each transaction $t_{i}=\{v_{1},v_{2},...,v_{|t_{i}|}\}$ consists of several items in the whole item set $V$. In this problem, it is assumed that all the items in $t_{i}$ are inserted into the same basket at the same time, ignoring the actual order in which the items were inserted and taking $t_{i}$’s transaction time as each item’s insertion time. Given a target item $v_{j}\in t_{i}$, the rest of the items in $t_{i}$ are seen as the context items of $v_{j}$, denoted as $c$ (i.e., $c=t_{i}\setminus\{v_{j}\}$). Then, given the set of context items $c$, a recommendation model predicts a conditional probability $P(u,v_{j}|c)$, which is interpreted as the conditional probability that $u$ will add the item $v_{j}$ into the same basket as the other items $c$.
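As a rough illustration of how one transaction can yield training instances under this definition, the sketch below lets each item of a basket take a turn as the target $v_{j}$ and treats the remaining items as its context $c$. The function name `basket_instances` and the toy item names are hypothetical; whether the authors generate their instances in exactly this way is an assumption, not something stated in the text.

```python
def basket_instances(user, transaction):
    """Yield (user, target_item, context_items) triples for one transaction.

    Each item takes a turn as the target v_j; the remaining items of the
    same basket form its context c (the transaction minus the target).
    """
    items = list(dict.fromkeys(transaction))  # drop duplicates, keep order
    for target in items:
        context = [v for v in items if v != target]
        yield user, target, context


# Toy basket (hypothetical data, for illustration only).
instances = list(basket_instances("u1", ["milk", "bread", "eggs"]))
for inst in instances:
    print(inst)
```

A basket with three distinct items therefore produces three instances, one per choice of target item.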

Both of the recommendation tasks above are popular in the literature (Quadrana et al., 2017; Rendle et al., 2010; Feng et al., 2015; He et al., 2017b). The general recommendation task differs from the shopping basket-based recommendation task because there are no specific context items of the target item in the general recommendation task. Note that both tasks are personalized recommendation problems. There are also non-personalized recommendation problems such as session-based recommendation (Hidasi et al., 2015), where users (i.e., user IDs) are not available in transactions. However, in this paper, we focus on personalized recommendation tasks because they are more commonly studied in the literature (Quadrana et al., 2017; Rendle et al., 2010; Feng et al., 2015).

4. Proposed Methods

Our proposed Signed Distance-based Deep Memory Recommender (SDMR) consists of two major components: Signed Distance-based Perceptron (SDP) and Signed Distance-based Memory network (SDM). We first describe an overview of our models as follows:

$\bullet$ Given a target user $i$ and a target item $j$ as two one-hot vectors, we pass the two vectors through the user and item embedding spaces to get user embedding $u_{i}$ and item embedding $v_{j}$.

$\bullet$ On one hand, our proposed Signed Distance-based Perceptron (SDP) will measure a signed distance score between $u_{i}$ and $v_{j}$ by a multi-layer perceptron network.

$\bullet$ On the other hand, given target user $i$, target item $j$, and the user $i$’s recently consumed context items $s$ as the input, our Signed Distance-based Memory network (SDM) will measure a signed distance score between user $i$ and item $j$ via attentive distances between context items $s$ and target item $j$.

$\bullet$ Then, the Signed Distance-based Deep Memory Recommender (SDMR) model will measure a total distance between user $i$ and item $j$ by learning a combination of SDP and SDM. The smaller the total distance is, the more likely user $i$ will consume item $j$.

Next, we describe SDP, SDM, and SDMR in detail.

4.1. Signed Distance-based Perceptron (SDP)

We first propose Signed Distance-based Perceptron (SDP) that explicitly learns a signed distance between a target user $i$ and a target item $j$ . An illustration of SDP is shown in Figure 2. Let the embedding of a target user $i$ be $\bm{u}_{i}\in\mathbb{R}^{d}$ , and the embedding of a target item $j$ be $\bm{v}_{j}\in\mathbb{R}^{d}$ , where $d$ is the number of dimensions in each embedding. First, SDP takes a concatenation of these two embeddings as the input and proceeds as follows:

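$$\bm{e}^{(1)}=f_{1}\Big(\mathbf{W}^{(1)}\begin{bmatrix}\bm{u}_{i}\\ \bm{v}_{j}\end{bmatrix}+\bm{b}^{(1)}\Big) \tag{1}$$

$$\bm{e}^{(2)}=f_{2}\big(\mathbf{W}^{(2)}\bm{e}^{(1)}+\bm{b}^{(2)}\big) \tag{2}$$

$$\dots \tag{3}$$

$$\bm{e}^{(\ell)}=f_{\ell}\big(\mathbf{W}^{(\ell)}\bm{e}^{(\ell-1)}+\bm{b}^{(\ell)}\big) \tag{4}$$

$$\bm{e}^{(\ell+1)}=square\big(\bm{e}^{(\ell)}\big) \tag{5}$$

$$o^{(SDP)}=\bm{w}^{(o)\top}\bm{e}^{(\ell+1)}+b^{(o)} \tag{6}$$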

where $f_{\ell}(\cdot)$ refers to a non-linear activation function at the $\ell$-th layer (e.g., sigmoid, ReLU or tanh), and $square(\cdot)$ denotes an element-wise square function (e.g., $square([2,3])=[4,9]$). Based on experimental results, we choose tanh as the activation function because it yields slightly better results than ReLU. From now on, we use $f(\cdot)$ to denote the tanh function. It can be easily observed that Eq. (1) – (4) form a plain Multi-Layer Perceptron (MLP) network, which is a popular design (He et al., 2017b; Xue et al., 2017) for learning a complex and non-linear interaction between user embedding $\bm{u}_{i}$ and item embedding $\bm{v}_{j}$. Our new design starts at Eq. (5) – Eq. (6). In Eq. (5), we apply the element-wise square function $square(\cdot)$ to the output vector $\bm{e}^{(\ell)}$ of the MLP and obtain a new output vector $\bm{e}^{(\ell+1)}$. Next, in Eq. (6), we use a fully connected layer $\bm{w}^{(o)}$ to combine the different dimensions of $\bm{e}^{(\ell+1)}$ and yield a final distance value $o^{(SDP)}$. The idea behind using $\bm{w}^{(o)}$ here is that, after applying the element-wise square function $square(\cdot)$ in Eq. (5), all the dimensions of $\bm{e}^{(\ell+1)}$ are non-negative. Thus, we consider each dimension of $\bm{e}^{(\ell+1)}$ as a distance value. The edge weights $\bm{w}^{(o)}$ are then used to combine those per-dimension distances into a more fine-grained distance.

We note that SDP reduces to a squared Euclidean distance under the following setting: in Eq. (1), $\mathbf{W}^{(1)}=[\mathbb{1},-\mathbb{1}]$, where $\mathbb{1}$ denotes an identity matrix, and $\bm{b}^{(1)}=\bm{0}$, so that $\mathbf{W}^{(1)}\begin{bmatrix}\bm{u}_{i}\\ \bm{v}_{j}\end{bmatrix}=\bm{u}_{i}-\bm{v}_{j}$; the activation $f(\cdot)$ is the identity function; the number of MLP layers is $\ell=1$; and the edge-weights layer in Eq. (6) is $\bm{w}^{(o)}=\bm{1}$ (i.e., the all-ones vector) with bias $b^{(o)}=0$. Note that if $\bm{w}^{(o)}$ in Eq. (6) is an all-negative layer, it will yield a negative value, which we call a signed distance score (see https://en.wikipedia.org/wiki/Signed_distance_function). If we see each user $\bm{i}$ as a point in a multi-dimensional space, and the user’s preference space is defined by a boundary $\Omega$, we can interpret this signed distance score as follows: When the item $\bm{j}$ is outside user $\bm{i}$’s preference boundary $\Omega$, the distance $d(\bm{i},\bm{j})$ between them is positive (i.e., $d(\bm{i},\bm{j})>0$), which reflects that user $\bm{i}$ does not prefer item $\bm{j}$. When the distance between user $\bm{i}$ and item $\bm{j}$ shrinks and $\bm{j}$ lies exactly on the boundary $\Omega$, the distance between them is zero, which indicates that user $\bm{i}$ likes item $\bm{j}$. As $\bm{j}$ moves inside $\Omega$, the distance between them becomes negative and reflects a higher preference of user $\bm{i}$ for item $\bm{j}$. In short, we can see SDP as a signed distance function, which can learn a complex signed distance between a user and an item via an MLP architecture with non-linear activations and an element-wise square function $square(\cdot)$. In the recommendation domain, signed distances provide more fine-grained distance values and thus reflect users’ preferences on items more accurately.
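The reduction described above can be checked numerically. The following is a minimal sketch of Eq. (1) – (6) in plain Python, assuming list-based vectors and toy values; the names `sdp_score`, `matvec` and `identity` are hypothetical helpers introduced only for this illustration and are not taken from the authors' repository.

```python
import math


def matvec(W, x):
    """Multiply a matrix (given as a list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def sdp_score(u, v, layers, w_out, b_out, act=math.tanh):
    """Sketch of Eq. (1)-(6): MLP on [u; v], element-wise square, linear read-out."""
    e = list(u) + list(v)                        # concatenation [u_i; v_j]
    for W, b in layers:                          # Eq. (1)-(4): MLP layers
        e = [act(z + bi) for z, bi in zip(matvec(W, e), b)]
    e_sq = [x * x for x in e]                    # Eq. (5): non-negative entries
    return sum(w * x for w, x in zip(w_out, e_sq)) + b_out  # Eq. (6)


def identity(x):
    return x


u, v = [1.0, 2.0], [3.0, -1.0]                   # toy embeddings (assumed values)
d = len(u)
# W1 = [I, -I], so that W1 [u; v] = u - v
W1 = [[1.0 if c == r else -1.0 if c == r + d else 0.0 for c in range(2 * d)]
      for r in range(d)]
layers = [(W1, [0.0] * d)]

# All-ones read-out: squared Euclidean distance between u and v.
euclid = sdp_score(u, v, layers, [1.0] * d, 0.0, act=identity)
# All-negative read-out: the same magnitude with a negative sign.
signed = sdp_score(u, v, layers, [-1.0] * d, 0.0, act=identity)
print(euclid, signed)
```

With these toy values, $\|\bm{u}-\bm{v}\|_{2}^{2}=(1-3)^{2}+(2-(-1))^{2}=13$, so the all-ones read-out gives 13 and the all-negative read-out gives -13. With learned weights and the tanh activation, the same function can return scores of either sign.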

4.2. Signed Distance-based Memory Network (SDM)

We propose a multi-hop memory network, Signed Distance-based Memory network (SDM), to model the implicit preference of a user on the target item via the user’s previously consumed items (i.e., context items). The implicit preference is represented as a signed distance. First, we describe a single-hop SDM, and then describe how to extend it into a multi-hop design. Following the traditional architecture of a memory network (Sukhbaatar et al., 2015; Liu and Perez, 2017; Xiong et al., 2016), our proposed single-hop SDM has four main components: a memory module, an input module, an attention module, and an output module. An overview of SDM’s architecture is presented in Figure 3. We describe each module of SDM in detail below:

4.2.1. Memory Module:

We maintain two memories called input memory and output memory. The input memory contains two embedding matrices $\mathbf{U}^{(i)}\in\mathbb{R}^{M\times d}$ and $\mathbf{V}^{(i)}\in\mathbb{R}^{N\times d}$ , where $M$ and $N$ are the number of users and the number of items in the system, respectively. $d$ denotes the embedding size of each user and each item. Similarly, the output memory also contains two embedding matrices $\mathbf{U}^{(o)}\in\mathbb{R}^{M\times d}$ and $\mathbf{V}^{(o)}\in\mathbb{R}^{N\times d}$ . As shown in Figure 3, the input memory will be used to calculate attention weights of a user’s consumed items (i.e., context items), whereas the output memory will be used to measure a final signed distance between the target user and the target item via the user’s context items.

Given a target user $i$, a target item $j$ and a set of user $i$’s consumed items as context items $\mathcal{T}^{i}_{j}$, the output of this module is the embeddings of user $i$, item $j$, and all context items $k\in\mathcal{T}^{i}_{j}$: $(\bm{u}_{i},\bm{v}_{j},\langle\bm{v}_{1},\bm{v}_{2},...,\bm{v}_{k}\rangle)$. Since this module has separate input and output memories, we obtain $(\bm{u}_{i}^{(i)},\bm{v}_{j}^{(i)},\langle\bm{v}_{1}^{(i)},\bm{v}_{2}^{(i)},...,\bm{v}_{k}^{(i)}\rangle)$ as the output of the input memory, and $(\bm{u}_{i}^{(o)},\bm{v}_{j}^{(o)},\langle\bm{v}_{1}^{(o)},\bm{v}_{2}^{(o)},...,\bm{v}_{k}^{(o)}\rangle)$ as the output of the output memory. Here $\bm{u}_{i}^{(i)}$ is the $i$-th row of $\mathbf{U}^{(i)}$, and $\bm{v}_{j}^{(i)}$ and $\bm{v}_{k}^{(i)}$ are the $j$-th and $k$-th rows of $\mathbf{V}^{(i)}$, respectively. The same holds for $\bm{u}_{i}^{(o)}$, $\bm{v}_{j}^{(o)}$, and $\bm{v}_{k}^{(o)}$ with respect to $\mathbf{U}^{(o)}$ and $\mathbf{V}^{(o)}$.

4.2.2. Input Module:

The goal of the input module is to form a non-linear combination between the target user embedding and the target item embedding. Given the target user embedding $\bm{u}_{i}^{(i)}$ and the target item embedding $\bm{v}_{j}^{(i)}$ from the input memory in the memory module, following the widely adopted design in multimodal deep learning work (Zhang et al., 2014; Srivastava and Salakhutdinov, 2012), the input module simply concatenates the two embeddings, and then applies a fully connected layer with a non-linear activation $f(\cdot)$ (i.e. tanh function) to obtain a coherent hidden feature vector as follows:

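$$\bm{q}_{ij}=f\Big(\mathbf{W}_{a}\begin{bmatrix}\bm{u}_{i}^{(i)}\\ \bm{v}_{j}^{(i)}\end{bmatrix}+\bm{b}_{a}\Big) \tag{7}$$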

where $\mathbf{W}_{a}\in\mathbb{R}^{d\times 2d}$ is the weight matrix of the input module. Note that $\bm{q}_{ij}\in\mathbb{R}^{d}$ can be seen as a query embedding in Memory Network (Sukhbaatar et al., 2015).

Similarly, if the inputs of the input module are the target user embedding $\bm{u}_{i}^{(o)}$ and the target item embedding $\bm{v}_{j}^{(o)}$ from the output memory, we can form a non-linear combination of $\bm{u}_{i}^{(o)}$ and $\bm{v}_{j}^{(o)}$ (i.e., an output query), denoted as $\bm{p}_{ij}$, as follows:

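$$\bm{p}_{ij}=f\Big(\mathbf{W}_{b}\begin{bmatrix}\bm{u}_{i}^{(o)}\\ \bm{v}_{j}^{(o)}\end{bmatrix}+\bm{b}_{b}\Big) \tag{8}$$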
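To make the memory lookup and the two queries concrete, here is a minimal sketch assuming randomly initialised toy memories and weights. The helper names `dense_tanh` and `rand_matrix`, the sizes `M`, `N`, `d`, and the chosen indices are illustrative assumptions, not values from the paper or its code.

```python
import math
import random


def dense_tanh(W, b, x):
    """Compute tanh(W x + b) for a matrix W given as a list of rows."""
    return [math.tanh(sum(w * xi for w, xi in zip(row, x)) + bi)
            for row, bi in zip(W, b)]


def rand_matrix(rows, cols, rng):
    """Small random matrix, standing in for learned parameters."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
M, N, d = 4, 5, 3                      # toy numbers of users, items, dimensions

# Separate input and output memories, each holding user and item embeddings.
U_in, V_in = rand_matrix(M, d, rng), rand_matrix(N, d, rng)
U_out, V_out = rand_matrix(M, d, rng), rand_matrix(N, d, rng)

# Input-module weights W_a, W_b in R^{d x 2d}, with zero biases for simplicity.
W_a, b_a = rand_matrix(d, 2 * d, rng), [0.0] * d
W_b, b_b = rand_matrix(d, 2 * d, rng), [0.0] * d

i, j = 1, 2                            # target user and target item indices
# Row lookup in each memory, concatenation, then one tanh layer.
q_ij = dense_tanh(W_a, b_a, U_in[i] + V_in[j])     # Eq. (7): input query
p_ij = dense_tanh(W_b, b_b, U_out[i] + V_out[j])   # Eq. (8): output query
print(q_ij, p_ij)
```

Both queries live in $\mathbb{R}^{d}$; the input query feeds the attention module, while the output query is used later when the final signed distance is computed.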

4.2.3. Attention Module:

The goal of the attention module is to assign attentive scores to different context items (or candidates) given the combined vector (or a query) $\bm{q}_{ij}$ of the target user $i$ and target item $j$ obtained in Eq. (7). First, we calculate the squared $\mathcal{L}2$ distance between $\bm{q}_{ij}$ and each candidate item $\bm{v}_{k}^{(i)}$ as follows:

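$$z_{ijk}=\Big\|f\Big(\mathbf{W}_{c}\begin{bmatrix}\bm{q}_{ij}\\ \bm{v}_{k}^{(i)}\end{bmatrix}+\bm{b}_{c}\Big)\Big\|_{2}^{2} \tag{9}$$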

where $||\cdot||_{2}$ refers to the $\mathcal{L}2$ distance (or Euclidean distance), which is widely used in previous works to measure similarity among items (Feng et al., 2015) or between users and items (Hsieh et al., 2017). To better convey the intuition behind Eq. (9), we break it into smaller parts. First, similar to Eq. (7), the component $f\Big(\mathbf{W}_{c}\begin{bmatrix}\bm{q}_{ij}\\ \bm{v}_{k}^{(i)}\end{bmatrix}+\bm{b}_{c}\Big)$ defines a non-linear combination of the input query $\bm{q}_{ij}$ and each context item embedding $\bm{v}_{k}^{(i)}$. Then, $||\cdot||_{2}^{2}$ measures the squared $\mathcal{L}2$ norm of the combined vector. Two special cases are worth noting. In the first, $\mathbf{W}_{a}=[\bm{0},\mathbb{1}]$, where $\mathbb{1}$ refers to an identity matrix and $\bm{0}$ is an all-zeros matrix; $f(\cdot)$ is the identity function; $\mathbf{W}_{c}=[\mathbb{1},-\mathbb{1}]$; and the bias terms are $\bm{b}_{a}=\bm{b}_{c}=\bm{0}$. Then, Eq. (7) gives $\bm{q}_{ij}=f\Big(\mathbf{W}_{a}\begin{bmatrix}\bm{u}_{i}^{(i)}\\ \bm{v}_{j}^{(i)}\end{bmatrix}+\bm{b}_{a}\Big)=\bm{v}_{j}^{(i)}$; in Eq. (9), $f\Big(\mathbf{W}_{c}\begin{bmatrix}\bm{q}_{ij}\\ \bm{v}_{k}^{(i)}\end{bmatrix}+\bm{b}_{c}\Big)=\bm{v}_{j}^{(i)}-\bm{v}_{k}^{(i)}$, and $z_{ijk}=||\bm{v}_{j}^{(i)}-\bm{v}_{k}^{(i)}||_{2}^{2}$, so Eq. (9) reduces to the squared $\mathcal{L}2$ distance between the target item $j$ and the context item $k$. In the second, $\mathbf{W}_{a}=[\mathbb{1},-\mathbb{1}]$; $f(\cdot)$ is the identity function; $\mathbf{W}_{c}=[\mathbb{1},\mathbb{1}]$; and the bias terms are $\bm{b}_{a}=\bm{b}_{c}=\bm{0}$. Then, Eq. (7) gives $\bm{q}_{ij}=f\Big(\mathbf{W}_{a}\begin{bmatrix}\bm{u}_{i}^{(i)}\\ \bm{v}_{j}^{(i)}\end{bmatrix}+\bm{b}_{a}\Big)=\bm{u}_{i}^{(i)}-\bm{v}_{j}^{(i)}$; in Eq. (9), $f\Big(\mathbf{W}_{c}\begin{bmatrix}\bm{q}_{ij}\\ \bm{v}_{k}^{(i)}\end{bmatrix}+\bm{b}_{c}\Big)=\bm{u}_{i}^{(i)}-\bm{v}_{j}^{(i)}+\bm{v}_{k}^{(i)}$, and $z_{ijk}=||\bm{v}_{k}^{(i)}+\bm{u}_{i}^{(i)}-\bm{v}_{j}^{(i)}||_{2}^{2}$, so Eq. (9) reduces to a squared $\mathcal{L}2$ distance between the target item $j$ and the context item $k$ in which the user $i$ acts as a translator (He et al., 2017a). These two examples show that our proposed design can learn a more general distance between target and context items.

The squared $\mathcal{L}2$ distance output by Eq. (9) indicates how similar the target item $j$ and the context item $k$ are: the lower the distance score, the more similar the two items. Next, we use the Softmax function to normalize the distances and obtain the attentive score between $j$ and $k$ as follows:

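$$a_{ijk}=\frac{\exp(-z_{ijk})}{\sum_{k^{\prime}\in\mathcal{T}^{i}_{j}}\exp(-z_{ijk^{\prime}})}\qquad(10)$$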

where $\mathcal{T}^{i}_{j}$ is the set of user $i$'s neighborhood items. The minus sign in Eq. (10) assigns a higher attention score to a lower distance between the two items $(j,k)$.
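The distance-based attention described for Eqs. (9) and (10) can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the function names `matvec` and `distance_attention` are hypothetical, vectors are plain Python lists, and the example uses the first special case from the text ($\bm{W}_{c}=[\mathbb{1},-\mathbb{1}]$, identity $f$, zero bias), where the score reduces to a squared $\mathcal{L}2$ distance between query and context item.

```python
import math

def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def distance_attention(q, context, W_c, b_c, f=lambda t: t):
    """Softmax over negative squared L2 norms, following the text around Eqs. (9)-(10).

    q       : query vector q_ij (length d)
    context : list of context item embeddings v_k (each of length d)
    W_c     : d x 2d weight matrix, b_c : length-d bias
    f       : element-wise activation (identity by default, an assumption here)
    """
    z = []
    for v_k in context:
        # q + v_k is list concatenation, i.e. the stacked vector [q; v_k].
        combined = [f(h + b) for h, b in zip(matvec(W_c, q + v_k), b_c)]
        z.append(sum(c * c for c in combined))  # squared L2 norm
    shift = min(z)  # subtracting the minimum keeps exp() numerically stable
    weights = [math.exp(-(z_k - shift)) for z_k in z]
    total = sum(weights)
    return z, [w / total for w in weights]

# Special case from the text: W_c = [I, -I], so z_k = ||q - v_k||^2.
W_c = [[1, 0, -1, 0],
       [0, 1, 0, -1]]
q = [1.0, 2.0]
context = [[1.0, 2.0], [3.0, 2.0]]
z, a = distance_attention(q, context, W_c, [0.0, 0.0])
```

Here the first context item coincides with the query, so it has distance 0 and receives the larger attention weight, which is the behavior the minus sign in Eq. (10) is meant to produce.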

We note that the $\mathcal{L}2$ distance (or Euclidean distance) satisfies the four conditions of a metric (see https://en.wikipedia.org/wiki/Metric_(mathematics)). While the crucial triangle inequality property of a metric has been shown to provide better performance than the inner product in recommendation domains (Shrivastava and Li, 2014; Ram and Gray, 2012; Hsieh et al., 2017), to the best of our knowledge, most existing attention designs (Vaswani et al., 2017; Luong et al., 2015; Lin et al., 2017; Choi et al., 2018; Seo et al., 2016; Bahdanau et al., 2014; Xu et al., 2015) adopt the inner product for measuring attentive scores. Hence, this proposed attention design is the first attempt to bring metric properties into the attention mechanism.

Similar to (Tay et al., 2018a), we limit the number of context items considered by choosing user $i$'s $s$ most recently consumed items before target item $j$ as the context items of $j$. Here, $s$ can be selected via tuning on a development dataset. The soft attention vector containing the attentive contribution scores of the $s$ context items toward the target item $j$ of user $i$ is given as follows:

[TABLE]

4.2.4. Output Module:

Given the attentive scores $\bm{a}_{ij}$ in Eq. (11) and the combined vector $\bm{p}_{ij}\in\mathbb{R}^{d}$ of the user embedding $\bm{u}_{i}^{(o)}$ and item embedding $\bm{v}_{j}^{(o)}$ from the output memories $\bm{U}^{(o)}$ and $\bm{V}^{(o)}$ , the goal of the output module is to measure a total output distance $\bm{o}_{ij}^{(SDM)}$ between the output target item embedding $\bm{v}_{j}^{(o)}$ and all of user $i$'s output context item embeddings $\bm{v}_{k}^{(o)}$ ($k\in\mathcal{T}_{j}^{i}$), using the attention weights $\bm{a}_{ij}$ and the output query $\bm{p}_{ij}$ as follows:

[TABLE]

where $\bm{e}_{ij}\in\mathbb{R}^{d}$ is calculated as follows:

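$$\bm{e}_{ij}=\sum_{k\in\mathcal{T}_{j}^{i}}a_{ijk}\;square\Big{(}f\Big{(}\mathbf{W}_{d}\begin{bmatrix}\bm{p}_{ij}\\ \bm{v}_{k}^{(o)}\end{bmatrix}+\bm{b}_{d}\Big{)}\Big{)}\qquad(13)$$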

Here, let $\bm{r}_{ijk}=f\Big{(}\mathbf{W}_{d}\begin{bmatrix}\bm{p}_{ij}\\ \bm{v}_{k}^{(o)}\end{bmatrix}+\bm{b}_{d}\Big{)}$ . Similar to the intuition discussed for Eq. (9), $\bm{r}_{ijk}$ is a flexible combination of $\bm{p}_{ij}$ and each output context item embedding $\bm{v}_{k}^{(o)}$ , and $square(\cdot)$ is the element-wise square function. The idea behind Eqs. (12) and (13) is similar to that of Eqs. (5) and (6) in the SDP model. First, in Eq. (13), each context item $k$ attentively contributes to the target item $j$ via a squared Euclidean measure. Second, in Eq. (12), each non-negative dimension of $\bm{e}_{ij}$ is treated as a distance dimension, and an edge-weights layer $\bm{w}_{e}$ combines them flexibly. When there is only one context item in $\mathcal{T}_{j}^{i}$ , the attention score in Eq. (13) is $a_{ijk}=1.0$, leading to $\bm{e}_{ij}=square(\bm{r}_{ijk})$ , which is similar to Eq. (5). In this case, SDM measures the distance between target item $j$ and context item $k$ in the same way as the SDP model does. Note that Eq. (12) is similar to Eq. (6), so SDM can also learn a signed distance value, which provides a more fine-grained distance than a general (non-negative) distance value.
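The output module can be sketched as follows. Two points are assumptions rather than facts from the text: Eq. (12) is taken to be a plain weighted sum $\bm{w}_{e}^{T}\bm{e}_{ij}$ with no bias or activation, and the function name `output_distance` is hypothetical. The example uses a single context item with attention weight 1.0, the case the text compares to SDP.

```python
def output_distance(p, context_out, a, W_d, b_d, w_e, f=lambda t: t):
    """Attentive sum of element-wise squared combinations, then an edge-weights layer.

    p           : output query p_ij (length d)
    context_out : output context item embeddings v_k^(o) (each of length d)
    a           : attention weights a_ijk, one per context item
    W_d, b_d    : d x 2d weight matrix and length-d bias
    w_e         : length-d edge weights (assumed form of Eq. (12): o = w_e . e)
    """
    d = len(p)
    e = [0.0] * d
    for a_k, v_k in zip(a, context_out):
        x = p + v_k  # stacked vector [p; v_k]
        r = [f(sum(w * xi for w, xi in zip(row, x)) + b)
             for row, b in zip(W_d, b_d)]
        for t in range(d):
            e[t] += a_k * r[t] * r[t]  # every dimension of e is non-negative
    o = sum(w * e_t for w, e_t in zip(w_e, e))  # may be negative: a signed distance
    return e, o

W_d = [[1, 0, -1, 0],
       [0, 1, 0, -1]]
e, o = output_distance(
    p=[1.0, 2.0],
    context_out=[[3.0, 1.0]],
    a=[1.0],
    W_d=W_d,
    b_d=[0.0, 0.0],
    w_e=[1.0, -1.0],
)
```

Because the edge weights can be negative, the combined score is not restricted to non-negative values even though each dimension of `e` is, which is one way to read the "signed distance" described above.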

4.2.5. Multi-hop SDM:

Inspired by previous work (Sukhbaatar et al., 2015) where the multi-hop design helped to refine the attention module in Memory Network, we also integrate multiple hops to further extend our SDM model to build a deeper network (Figure 4). As the gated multi-hop design (Liu and Perez, 2017) was shown to perform better than the original multi-hop design with a simple residual connection in (Sukhbaatar et al., 2015), we employ this gated memory update from hop to hop as follows:

[TABLE]

where $\bm{q}^{(h-1)}$ is the input query embedding as shown in Eq. (7) at hop $h-1$ , $\mathbf{W}_{g}^{(h-1)}$ and the bias $\bm{b}_{g}^{(h-1)}$ are hop-specific parameters, $\sigma$ is the sigmoid function, $\bm{e}^{(h-1)}$ is the output of Eq. (13) at hop $h-1$ , and $\bm{q}^{(h)}$ is the input query embedding at the next hop $h$ . The attention can then be updated at hop $h$ using $\bm{q}^{(h)}$ as follows:

[TABLE]

The multi-hop architecture with gated design further refines the attention for different users based on the previous output from hop to hop. Hence, if the final hop is $h$ then the SDM model with $h$ hops, denoted as SDM-h, will use $\bm{a}_{ij}^{(h)}$ to yield a final signed distance score as follows:

[TABLE]

where $\bm{e}_{ij}$ is calculated as:

[TABLE]

Weight constraints in multi-hop SDM model: To save memory, we use the global weight constraint in multi-hop SDM. In particular, the input memories $\bm{U^{(i)}},\bm{V^{(i)}}$ and the output memories $\bm{U^{(o)}},\bm{V^{(o)}}$ are shared among different hops. All the weights are shared from hop to hop: $\bm{W}_{a}^{(1)}$ = $\bm{W}_{a}^{(2)}$ = … = $\bm{W}_{a}^{(h)}$ ; $\bm{W}_{b}^{(1)}$ = $\bm{W}_{b}^{(2)}$ = … = $\bm{W}_{b}^{(h)}$ ; $\bm{W}_{c}^{(1)}$ = $\bm{W}_{c}^{(2)}$ = … = $\bm{W}_{c}^{(h)}$ ; $\bm{W}_{d}^{(1)}$ = $\bm{W}_{d}^{(2)}$ = … = $\bm{W}_{d}^{(h)}$ ; and so are all bias terms. The gate weights are also global weights: $\bm{W}_{g}^{(1)}$ = $\bm{W}_{g}^{(2)}$ = … = $\bm{W}_{g}^{(h)}$ .
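As an illustration of the gated hop-to-hop update, the sketch below assumes the usual gated end-to-end memory network form (Liu and Perez, 2017): a sigmoid gate computed from the previous query interpolates, per dimension, between the previous query and the previous hop output. The exact form of the update equation is not recoverable from this text, so this interpolation and the name `gated_update` are assumptions. With the global weight constraint, the same `W_g` and `b_g` would be reused at every hop.

```python
import math

def gated_update(q_prev, e_prev, W_g, b_g):
    """Assumed gated update: q_h = (1 - g) * q_prev + g * e_prev, element-wise,
    with gate g = sigmoid(W_g q_prev + b_g)."""
    gate = [
        1.0 / (1.0 + math.exp(-(sum(w * x for w, x in zip(row, q_prev)) + b)))
        for row, b in zip(W_g, b_g)
    ]
    return [(1.0 - g) * q + g * e for g, q, e in zip(gate, q_prev, e_prev)]

# With zero gate weights and bias, the gate is 0.5 in every dimension,
# so the new query is the average of the old query and the hop output.
W_g = [[0.0, 0.0], [0.0, 0.0]]
b_g = [0.0, 0.0]
q_next = gated_update([2.0, 0.0], [0.0, 4.0], W_g, b_g)
```

In a full model, the hop output passed as `e_prev` would itself be computed from the attention at that hop, and the updated query would drive the attention at the next hop.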

4.3. Signed Distance-based Deep Memory Recommender (SDMR)

Now we propose Signed Distance-based Deep Memory Recommender (SDMR), a hybrid network that combines SDP and SDM. The first approach to combine them is to employ a weighted summation of the output scores from SDP and SDM as follows:

$$o=\beta\,o^{\text{(SDP)}}+(1-\beta)\,o^{\text{(SDM)}}\qquad(19)$$

where $o^{\text{(SDP)}}$ is the signed distance score obtained at Eq. (6), $o^{\text{(SDM)}}$ is the signed distance score obtained at Eq. (17), and $\beta\in[0,1]$ is a hyper-parameter to control the contribution of SDP and SDM. When $\beta$ =0, SDMR becomes SDM. When $\beta$ =1, SDMR becomes SDP.

However, to avoid tuning an additional hyper-parameter $\beta$ , we do not use Eq. (19) for SDMR. Instead, we let SDMR learn the combination of SDP and SDM itself as follows:

[TABLE]

where $\bm{e}^{(\ell+1)}$ is the final layer embedding from SDP and is obtained at Eq. (5), $\bm{e}^{(h)}$ is the final hop output from the multi-hop SDM obtained at Eq. (18). We note that SDP and SDM are first pre-trained separately using the BPR loss function (see the next section). Then, we obtain $\bm{e}^{(\ell+1)}$ from SDP, and $\bm{e}^{(h)}$ from SDM, and keep them fixed in Eq. (20) to learn $\bm{w}_{u}$ and $b_{u}$ . We use ReLU in Eq. (20) because ReLU encourages sparse activations and helps to reduce over-fitting when combining the two components SDP and SDM.

4.4. Loss Functions

We adopt the Bayesian Personalized Ranking (BPR) as our loss function, which is similar to the idea of AUC (area under the curve):

[TABLE]

where we uniformly sample tuples of the form $(u,i^{+},i^{-})$ for user $u$ with a positive (consumed) item $i^{+}$ and a negative (unconsumed) item $i^{-}$ . $\lambda$ is a hyper-parameter that controls the regularization term, and $\sigma(\cdot)$ is the sigmoid function. Note that other pairwise probability functions could be plugged into Eq. (21) to replace $\sigma(\cdot)$ . Both SDP and SDM are end-to-end differentiable since we use soft attention over the output memory. Hence, we can use back-propagation to learn our models with stochastic gradient descent or Adam (Kingma and Ba, 2014).
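A pairwise BPR-style loss over distance scores can be sketched as below. The sign convention is an assumption: a lower distance score is taken to mean a stronger preference, so the loss pushes the negative item's score above the positive item's score, in line with the CML-with-BPR formulation given for the baselines. The function name and arguments are hypothetical.

```python
import math

def bpr_distance_loss(pos_scores, neg_scores, params=(), lam=0.0):
    """Sum of -log(sigmoid(o_neg - o_pos)) over sampled pairs, plus L2 regularization.

    pos_scores / neg_scores : distance scores for (u, i+) and (u, i-) pairs
    params                  : flat iterable of model parameters to regularize
    lam                     : regularization strength (lambda)
    """
    loss = 0.0
    for o_pos, o_neg in zip(pos_scores, neg_scores):
        x = o_neg - o_pos
        # -log(sigmoid(x)) = log(1 + exp(-x)), written in a form that
        # avoids overflow for large |x|.
        if x >= 0:
            loss += math.log1p(math.exp(-x))
        else:
            loss += -x + math.log1p(math.exp(x))
    return loss + lam * sum(p * p for p in params)

tied = bpr_distance_loss([0.0], [0.0])       # equal scores
separated = bpr_distance_loss([0.0], [5.0])  # negative item is farther away
```

The loss is `log(2)` when the two scores are tied and shrinks as the negative item's distance grows relative to the positive item's.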

5. Empirical Study

We evaluate our SDP, SDM, SDMR models against ten state-of-the-art baselines in two recommendation tasks: (i) general recommendation task, and (ii) shopping basket-based recommendation task. We mainly aim to answer the following research questions (RQs):

$\bullet$ RQ1: How do SDP, SDM, and SDMR perform compared to other state-of-the-art models in both the general recommendation task and the shopping basket-based recommendation task?

$\bullet$ RQ2: Why and how does the multi-hop design help to improve the proposed models' performance?

5.1. Datasets

General recommendation task: In this task, we evaluate our proposed models and state-of-the-art methods using different datasets with various density levels as follows:

$\bullet$ MovieLens (Resnick et al., 1994): It is a widely adopted benchmark dataset for collaborative filtering evaluation. We use two versions of this benchmark dataset, namely MovieLens100k (or ML-100k) and MovieLens1M (or ML-1M).

$\bullet$ Netflix Prize (https://www.netflixprize.com/): It is a real-world dataset collected by Netflix from 1999 to 2005, consisting of 463,435 users and 17,769 items with 56.9M interactions. Since the dataset is extremely large, we subsample it by randomly picking one month of data for evaluation.

$\bullet$ Epinions (Massa and Avesani, 2007) (http://www.trustlet.org/downloaded_epinions.html): It is an online rating dataset where users share product feedback by giving explicit ratings and reviews.

In preprocessing, we adopt the popular k-core preprocessing step (He and McAuley, 2016b; Liang et al., 2018; Tran et al., 2018) (with k = 5) to filter out inactive users with fewer than five ratings and items consumed by fewer than five users. Since ML-100k and ML-1M are already preprocessed, we only apply the 5-core preprocessing step to the Netflix and Epinions datasets. We also binarize the rating scores as implicit feedback by treating all observed ratings as positive interactions and the remaining user-item pairs as negative interactions. The statistics of the four datasets are summarized in Table 1.
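The k-core filtering and binarization steps can be sketched as follows. The text does not say whether the filter is applied once or repeated until no user or item falls below the threshold; the sketch assumes the iterative variant, which is the usual meaning of a k-core. The function name `k_core` and the toy data are hypothetical.

```python
from collections import Counter

def k_core(interactions, k=5):
    """Repeatedly drop users and items with fewer than k interactions.

    interactions : iterable of (user, item) pairs. Ratings are already dropped,
                   so every observed pair counts as a positive interaction.
    Returns the surviving set of (user, item) pairs.
    """
    data = set(interactions)
    while True:
        user_count = Counter(u for u, _ in data)
        item_count = Counter(i for _, i in data)
        kept = {(u, i) for u, i in data
                if user_count[u] >= k and item_count[i] >= k}
        if len(kept) == len(data):  # nothing removed, so the core is stable
            return kept
        data = kept

# Toy example with k=2: user 3 has a single interaction and is removed.
toy = [(1, "a"), (1, "b"), (2, "a"), (2, "b"), (3, "a")]
core = k_core(toy, k=2)
```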

Shopping basket-based recommendation task: We use two real-world transaction datasets as follows:

$\bullet$ IJCAI-15 (https://tianchi.aliyun.com/datalab/dataSet.htm?id=1): It consists of shopping logs of users from Tmall (https://www.tmall.com). Since the original dataset is extremely large, we subsample IJCAI-15 by randomly picking 20k transactions for evaluation.

$\bullet$ Tafeng (http://stackoverflow.com/questions/25014904/download-link-for-ta-feng-grocery-dataset): It is a grocery store transaction dataset. It contains four months of transaction data, from November 2000 to February 2001, from the Ta-Feng supermarket.

User behavior in both the IJCAI-15 and Tafeng datasets is logged under four types of actions: click, add-to-cart, purchase, and add-to-favourite. We treat all four types as the click action. We only keep transactions with at least five items, because we take one item out for testing and another for development. Of the remaining three items, one is taken out as a target item and the other two are used as context items, to which attentive scores are assigned. From each original transaction, we generate data instances of the format $\langle\mathbf{c},v_{c}\rangle$ , where $v_{c}$ is the target (predicted) item and $\mathbf{c}$ is the set of all other items in the same transaction as $v_{c}$ . In particular, for each transaction $t$ , we pick one item at a time as the target item and leave the rest of the items in $t$ as the corresponding context items. Thus, for each transaction $t$ containing $|t|$ items, we can generate $|t|$ data instances. The statistics of the two transactional datasets are summarized in Table 2.
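The instance generation described above amounts to a leave-one-item-out enumeration within each transaction. A minimal sketch, with the hypothetical helper name `make_instances` and made-up item identifiers:

```python
def make_instances(transaction):
    """Return one (context, target) pair per item in the transaction.

    Each item is used once as the target v_c, and all other items in the
    same transaction form its context c.
    """
    items = list(transaction)
    return [(items[:t] + items[t + 1:], items[t]) for t in range(len(items))]

instances = make_instances(["milk", "bread", "eggs"])
# A transaction with |t| items yields |t| instances.
```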

For easy reference, we refer to (ML-100k, ML-1M, Netflix, Epinions) as Group-1 datasets and (IJCAI-15, Ta-Feng) as Group-2 datasets.

5.2. Baselines and State-of-the-art Methods

We compared our proposed models against several strong baselines in the general recommendation task as follows:

$\bullet$ ItemKNN (Sarwar et al., 2001): It is an item neighborhood-based collaborative filtering method. It exploits cosine item-item similarities to produce recommendation results.

$\bullet$ Bayesian Personalized Ranking (MF-BPR) (Rendle et al., 2009): It is a state-of-the-art pairwise matrix factorization method for implicit feedback datasets. It minimizes $\sum_{i}\sum_{j^{+},j^{-}}-\log\sigma(u_{i}^{T}v_{j^{+}}-u_{i}^{T}v_{j^{-}})+\lambda(||u_{i}||^{2}+||v_{j^{+}}||^{2})$ , where $(u_{i},v_{j^{+}})$ is a positive interaction and $(u_{i},v_{j^{-}})$ is a negative sample.

$\bullet$ Sparse LInear Method (SLIM) (Ning and Karypis, 2011): It learns a sparse item-item similarity matrix by minimizing the squared loss $||A-AW||^{2}+\lambda_{1}||W||+\lambda_{2}||W||^{2}$ , where $A$ is an $m\times n$ user-item interaction matrix and $W$ is an $n\times n$ sparse matrix of aggregation coefficients of context items.

$\bullet$ Collaborative Metric Learning (CML) (Hsieh et al., 2017): It is a state-of-the-art collaborative metric-based model that uses Euclidean distance to measure similarities between users and items. For a fair comparison, we learn CML with the BPR loss by minimizing $-\sum_{i,j^{+},j^{-}}\log(\sigma(||u_{i}-v_{j^{-}}||_{2}^{2}-||u_{i}-v_{j^{+}}||_{2}^{2}))$ , where $||\cdot||_{2}^{2}$ is a squared Euclidean distance, $(u_{i},v_{j^{+}})$ is a positive interaction and $(u_{i},v_{j^{-}})$ is a negative sample.

$\bullet$ Neural Collaborative Filtering (NeuMF++) (He et al., 2017b): It is a state-of-the-art matrix factorization method using a deep learning architecture. We use a pre-trained NeuMF to achieve its best performance, and denote it as NeuMF++.

$\bullet$ Collaborative Memory Network (CMN++) (Ebesu et al., 2018): It is a state-of-the-art memory network-based recommender. Its architecture follows traditional user neighborhood-based collaborative filtering approaches. It adopts a memory network to assign attentive weights to other similar users.

Our approaches do not model the order of consumed items in the user's purchase history (e.g., rigid orders of items). However, since we consider the latest $s$ items as the context items to predict the next item, we also compare our models with some key sequential models to further show our models' effectiveness:

$\bullet$ Personalized Ranking Metric Embedding (PRME) (Feng et al., 2015): Given a user $u$ , a target item $j$ , and a previously consumed item $k$ , it models a personalized first-order Markov behavior with two components: $d_{ujk}=\alpha||v_{u}-v_{j}||_{2}^{2}+(1-\alpha)||v_{k}-v_{j}||_{2}^{2}$ , where $||\cdot||_{2}^{2}$ is a squared $\mathcal{L}2$ distance. PRME is then learned by minimizing the BPR loss.

$\bullet$ PRME_s: It is our extension of PRME, where the distance between the target item $j$ and the previously consumed item $k$ is replaced by the average distance between $j$ and each of the previous $s$ items: $d_{ujs}=\alpha||v_{u}-v_{j}||_{2}^{2}+(1-\alpha)\frac{1}{|s|}\sum_{k\in s}||v_{k}-v_{j}||_{2}^{2}$ . We use the BPR loss to learn PRME_s.

$\bullet$ Translation-based Recommendation (TransRec) (He et al., 2017a): It uses a first-order Markov model and considers a user $u$ as a translator from his/her previously consumed item $k$ to the next item $j$ . In other words, $prob(j|u,k)\propto\beta_{j}-d(u+v_{k}-v_{j})$ , where $\beta_{j}$ is an item bias term and $d$ is a distance function (e.g., $\mathcal{L}1$ or $\mathcal{L}2$ distance). We use the $\mathcal{L}2$ distance because it was shown to perform better than $\mathcal{L}1$ (He et al., 2017a). TransRec is then learned with the BPR loss.

$\bullet$ Convolutional Sequence Embedding Recommendation (Caser) (Tang and Wang, 2018): It is a state-of-the-art sequential model. It uses a convolutional neural network with many horizontal and vertical kernels to capture the complex relationships among items.
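The distance functions of the metric-based sequential baselines above can be written out directly from their formulas. This is a sketch for illustration only: the function names are hypothetical, embeddings are plain lists, and the TransRec score uses the (non-squared) $\mathcal{L}2$ norm as the distance $d$, which is one reading of the text.

```python
import math

def sq_l2(x, y):
    """Squared L2 distance between two vectors."""
    return sum((a - b) ** 2 for a, b in zip(x, y))

def prme_distance(v_u, v_j, v_k, alpha):
    """PRME: weighted user-item and previous-item-to-item distances."""
    return alpha * sq_l2(v_u, v_j) + (1 - alpha) * sq_l2(v_k, v_j)

def prme_s_distance(v_u, v_j, prev_items, alpha):
    """PRME_s: the item-item term is averaged over the previous s items."""
    avg = sum(sq_l2(v_k, v_j) for v_k in prev_items) / len(prev_items)
    return alpha * sq_l2(v_u, v_j) + (1 - alpha) * avg

def transrec_score(u, v_k, v_j, beta_j=0.0):
    """TransRec: item bias minus the distance between u + v_k and v_j."""
    translated = [a + b for a, b in zip(u, v_k)]
    return beta_j - math.sqrt(sq_l2(translated, v_j))

v_u, v_j, v_k = [0.0, 0.0], [1.0, 0.0], [1.0, 2.0]
d_prme = prme_distance(v_u, v_j, v_k, alpha=0.5)
d_prme_s = prme_s_distance(v_u, v_j, [v_k], alpha=0.5)
score = transrec_score([0.0, -2.0], v_k, v_j)
```

With a single previous item, PRME_s coincides with PRME, and the TransRec score reaches its maximum (the item bias) when the user vector translates the previous item exactly onto the target.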

The strong sequential baselines above have been shown to surpass many other sequential models: TransRec outperformed FMC (Rendle et al., 2010), FPMC (Rendle et al., 2010), and HRM (Wang et al., 2015), and Caser surpassed GRU4Rec (Hidasi et al., 2015) and Fossil (He and McAuley, 2016a). We therefore exclude these weaker models from our evaluation.

Comparison: In the general recommendation task, we compare our proposed models with all ten strong baselines listed above. In the shopping basket-based recommendation task, since sequential models often work better than general recommendation models (see Table 5.3), we only compare our proposed models with the sequential baselines. For easy reference, we refer to the general recommendation baselines (i.e., ItemKNN, MF-BPR, SLIM, CML, NeuMF++, CMN++) as Group-1 baselines and to the sequential baselines (i.e., PRME, PRME_s, TransRec, Caser) as Group-2 baselines.

5.3. Experimental Settings

Protocol: We adopt the widely used leave-one-out setting (He et al., 2017b; Xue et al., 2017), in which for each user, we reserve her last interaction as the test sample. If there are no timestamps available in the dataset, then the test sample is randomly drawn. Among the remaining data, we randomly hold one interaction for each user to form the development set, while all others are utilized as the training set. Since it is very time-consuming and unnecessary to rank all the unobserved items for each user, we follow the standard strategy to randomly sample 100 unobserved items for each user. Then, we rank them together with the test item (He et al., 2017b; Koren, 2008).
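The leave-one-out protocol with sampled negatives can be sketched as follows. The helper names, the integer item ids in `range(n_items)`, and the seeded `random.Random` instance are all assumptions for illustration; the sketch covers the case where timestamps are available, so the last interaction is the test item.

```python
import random

def leave_one_out(history, rng):
    """Split one user's time-ordered history (oldest first) into train/dev/test."""
    test_item = history[-1]           # last interaction is held out for testing
    rest = history[:-1]               # slicing copies, so `history` is untouched
    dev_item = rest.pop(rng.randrange(len(rest)))  # random development item
    return rest, dev_item, test_item

def sample_candidates(consumed, n_items, test_item, rng, n_neg=100):
    """Sample n_neg unobserved items and append the held-out test item."""
    unobserved = [i for i in range(n_items) if i not in consumed]
    return rng.sample(unobserved, n_neg) + [test_item]

rng = random.Random(0)
history = [3, 7, 1, 9, 4]
train, dev_item, test_item = leave_one_out(history, rng)
candidates = sample_candidates(set(history), 200, test_item, rng)
```

Each user's test item is then ranked among these 101 candidates.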

Assigning item orders: Sequential models need rigid orders of consumed items, but items consumed in the same transaction (in the IJCAI-15 and TaFeng datasets) share the timestamp of that transaction. Hence, we assign item timestamps such that the order of items is kept as in the original dataset. This may favor the sequential models over our methods (because our methods use all consumed items in the same transaction as context items and do not model item order).

Hyper-parameters selection: We perform a grid search for the embedding size $d$ over $\{8,16,32,64,128\}$ and for the regularization terms over $\{0.1,0.01,0.001,0.0001,0.00001\}$ in all the models. We select the best number of hops for CMN++ and our SDM from $\{1,2,3,4\}$ . In NeuMF++, we select the best number of MLP layers from $\{1,2,3\}$ . In our models, we fix the batch size to $256$ . We adopt the Adam optimizer (Kingma and Ba, 2014) with a fixed learning rate of 0.001. Similar to CMN++ and NeuMF++, the number of negative samples is set to 4. We use a one-layer perceptron for SDP (more complex datasets may need more than one layer to get better results). In the four datasets used in the general recommendation task (i.e., ML-100k, ML-1M, Netflix, Epinions), keeping every consumed item as context would require too many zero paddings for users with few consumed items, or would keep too many context items in memory, either of which unnecessarily slows down the model's execution. We therefore follow (Tay et al., 2018a) and limit the context items to the latest $s$ consumed items, searching $s$ in $\{5,10,20\}$ . In the two shopping basket-based recommendation datasets (i.e., IJCAI-15 and TaFeng), since the maximum number of items in a transaction is small (e.g., 13 in IJCAI-15 and 18 in TaFeng), we consider all the other items in the same transaction as the target item to be its context items. All the hyper-parameters are tuned using the development dataset. Our source code is available at: https://github.com/thanhdtran/SDMR.
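For reference, the search space stated above for SDM on the general recommendation datasets can be enumerated as a simple grid. The values are taken from the text; the dictionary keys and the function name `sdm_grid` are hypothetical and do not correspond to the released code.

```python
from itertools import product

EMBEDDING_SIZES = [8, 16, 32, 64, 128]
REGULARIZATION = [0.1, 0.01, 0.001, 0.0001, 0.00001]
NUM_HOPS = [1, 2, 3, 4]
CONTEXT_SIZES = [5, 10, 20]  # s, the number of latest consumed items kept

# Settings the text reports as fixed rather than searched.
FIXED = {
    "batch_size": 256,
    "optimizer": "adam",
    "learning_rate": 0.001,
    "num_negatives": 4,
}

def sdm_grid():
    """Yield one configuration per point of the grid."""
    for d, reg, hops, s in product(EMBEDDING_SIZES, REGULARIZATION,
                                   NUM_HOPS, CONTEXT_SIZES):
        yield dict(FIXED, embedding_size=d, regularization=reg,
                   num_hops=hops, context_size=s)

configs = list(sdm_grid())
```

Each configuration would be scored on the development set, and the best one kept for testing.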

Evaluation Metrics: We evaluate all models' performance with two widely used metrics: Hit Ratio (hit@ $k$ ) and Normalized Discounted Cumulative Gain (NDCG@ $k$ ), where $k$ is the cutoff of the top-$k$ recommendation list. Intuitively, hit@ $k$ indicates whether the test item is in the top-$k$ list, while NDCG@ $k$ accounts for the position of the hit by assigning higher scores to hits at top ranks and discounting hits at lower ranks by a $\log_{2}$ factor.
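With a single held-out test item per user, both metrics depend only on the rank of that item among the candidates. A minimal sketch, assuming candidates are ranked by ascending distance score (lower is better) and using a 0-based rank; the function names are hypothetical.

```python
import math

def rank_by_distance(distances, test_index):
    """0-based rank of the test item when candidates are sorted by ascending distance."""
    test_distance = distances[test_index]
    return sum(1 for d in distances if d < test_distance)

def hit_at_k(rank, k):
    """1 if the test item is in the top-k list, else 0."""
    return 1.0 if rank < k else 0.0

def ndcg_at_k(rank, k):
    """With one relevant item the ideal DCG is 1, so NDCG is the log2 discount."""
    return 1.0 / math.log2(rank + 2) if rank < k else 0.0

distances = [0.9, 0.2, 0.5, 0.1]  # the test item sits at index 2
rank = rank_by_distance(distances, test_index=2)
```

Here two candidates are closer than the test item, so its rank is 2: it counts as a hit at $k=3$ with NDCG@3 of $1/\log_{2}4=0.5$, and as a miss at $k=2$. The per-user values are then averaged over all users.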

Bibliography

Aggarwal (2016) Charu C Aggarwal. 2016. Recommender systems. Springer.
Bahdanau et al. (2014) Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2014. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014).
Billsus and Pazzani (2000) Daniel Billsus and Michael J Pazzani. 2000. User modeling for adaptive news access. User modeling and user-adapted interaction 10, 2-3 (2000), 147–180.
Chen et al. (2017) Jingyuan Chen, Hanwang Zhang, Xiangnan He, Liqiang Nie, Wei Liu, and Tat-Seng Chua. 2017. Attentive collaborative filtering: Multimedia recommendation with item- and component-level attention. In Proceedings of the 40th International ACM SIGIR conference on Research and Development in Information Retrieval. 335–344.
Choi et al. (2018) Heeyoul Choi, Kyunghyun Cho, and Yoshua Bengio. 2018. Fine-grained attention mechanism for neural machine translation. Neurocomputing 284 (2018), 171–176.
Deshpande and Karypis (2004) Mukund Deshpande and George Karypis. 2004. Item-based top-n recommendation algorithms. ACM Transactions on Information Systems 22, 1 (2004), 143–177.
Ebesu et al. (2018) Travis Ebesu, Bin Shen, and Yi Fang. 2018. Collaborative Memory Network for Recommendation Systems. In Proceedings of the 41st ACM International Conference on Research and Development in Information Retrieval.