Integrating Reviews into Personalized Ranking for Cold Start Recommendation

Guang-Neng Hu; Xin-Yu Dai

arXiv:1701.08888·cs.IR·January 15, 2021


TL;DR

This paper proposes two models that incorporate item reviews into Bayesian personalized ranking to improve recommendation accuracy, especially for cold-start items, by leveraging review text features and uncovering review-based user preferences.

Contribution

The paper introduces novel models that integrate item reviews into Bayesian ranking, enhancing cold-start recommendation performance using text features and review dimensions.

Findings

01. Leveraging item reviews improves ranking accuracy.

02. Models outperform baseline methods on six datasets.

03. Review-based features help mitigate cold-start issues.


Tables

Table 1: Statistics of Datasets

| Dataset | #Users | #Items | #Feedback | #Words | #Cold Users | #Cold Items | Density (%) |
|---|---|---|---|---|---|---|---|
| Girls | 778 | 3,963 | 5,474 | 302M | 572 | 3,946 | 0.177 |
| Boys | 981 | 4,114 | 6,388 | 302M | 787 | 4,080 | 0.158 |
| Baby | 1,238 | 4,592 | 8,401 | 302M | 959 | 4,482 | 0.147 |
| Men | 21,793 | 55,647 | 157,329 | 302M | 15,821 | 52,031 | 0.013 |
| Women | 62,928 | 157,656 | 504,847 | 302M | 41,409 | 143,444 | 0.005 |
| Phones | 58,741 | 77,979 | 420,847 | 210M | 43,429 | 67,706 | 0.009 |

Table 2: AUC Performance Results (#factors = 15, best result is boldfaced).

| Dataset | Setting | POP | BPR-MF | TBPR-Diff | TBPR-Shared | Improv1 (%) | Improv2 (%) |
|---|---|---|---|---|---|---|---|
| Girls | All | 0.1699 | 0.5658 | 0.5919 | **0.5939** | 4.966 | 7.09 |
| Boys | All | 0.2499 | 0.5493 | 0.5808 | **0.5852** | 6.535 | 11.99 |
| Baby | All | 0.3451 | 0.5663 | 0.5932 | **0.6021** | 6.321 | 16.18 |
| Men | All | 0.5486 | 0.6536 | 0.6639 | **0.6731** | 2.983 | 18.57 |
| Men | Cold | 0.4725 | 0.5983 | 0.6114 | **0.6225** | 4.044 | 19.23 |
| Women | All | 0.5894 | 0.6735 | 0.6797 | **0.6842** | 1.588 | 12.72 |
| Women | Cold | 0.4904 | 0.6026 | 0.6110 | **0.6152** | 2.090 | 11.22 |
| Phones | All | 0.7310 | 0.7779 | 0.7799 | **0.7809** | 0.386 | 6.39 |
| Phones | Cold | 0.5539 | 0.6415 | 0.6464 | **0.6467** | 0.811 | 5.94 |

Equations

$$\min_{P,Q}\sum_{x_{u,i}\neq 0}(x_{u,i}-\hat{x}_{u,i})^{2}+\lambda(\|P\|_{F}^{2}+\|Q\|_{F}^{2}),$$

$$\hat{x}_{u,i}^{Diff}=\alpha+\beta_{u}+\beta_{i}+P_{u}^{\mathsf{T}}Q_{i}+\theta_{u}^{\mathsf{T}}(Hf_{i})+\beta^{\prime\mathsf{T}}f_{i},$$

$$\hat{x}_{u,i}^{Shared}=Q_{i}^{\mathsf{T}}\Big(P_{u}+|N_{u}|^{-1/2}\sum_{k\in N_{u}}Hf_{k}\Big)+\alpha+\beta_{u}+\beta_{i}+\beta^{\prime\mathsf{T}}f_{i}.$$

$$\mathcal{L}(\Theta)\equiv\sum_{(u,i,j)\in D_{S}}\ln\sigma(\hat{x}_{uij})-\lambda\|\Theta\|^{2},$$

$$\Theta\leftarrow\Theta+\eta\Big(\sigma(-\hat{x}_{uij})\frac{\partial\hat{x}_{uij}}{\partial\Theta}-\lambda\Theta\Big).$$

$$\frac{\partial\hat{x}_{uij}}{\partial P_{u}}=Q_{i}-Q_{j},\quad\frac{\partial\hat{x}_{uij}}{\partial\beta^{\prime}}=f_{i}-f_{j},\quad\frac{\partial\hat{x}_{uij}}{\partial\beta_{i}}=1,\quad\frac{\partial\hat{x}_{uij}}{\partial\beta_{j}}=-1.$$

$$\frac{\partial\hat{x}_{uij}}{\partial Q_{i}}=P_{u},\quad\frac{\partial\hat{x}_{uij}}{\partial Q_{j}}=-P_{u},\quad\frac{\partial\hat{x}_{uij}}{\partial\theta_{u}}=H(f_{i}-f_{j}),\quad\frac{\partial\hat{x}_{uij}}{\partial H}=\theta_{u}(f_{i}-f_{j})^{\mathsf{T}}.$$

$$\frac{\partial\hat{x}_{uij}}{\partial Q_{i}}=P_{u}+|N_{u}|^{-1/2}\sum_{k\in N_{u}}Hf_{k},\quad\frac{\partial\hat{x}_{uij}}{\partial Q_{j}}=-\Big(P_{u}+|N_{u}|^{-1/2}\sum_{k\in N_{u}}Hf_{k}\Big),$$

$$\frac{\partial\hat{x}_{uij}}{\partial H}=|N_{u}|^{-1/2}(Q_{i}-Q_{j})\Big(\sum_{k\in N_{u}}f_{k}\Big)^{\mathsf{T}}.$$

$$f_{i}\equiv\frac{1}{|d_{i}|}\sum_{w\in d_{i}}\mathbf{e}_{w}.$$

$$AUC=\frac{1}{|\mathcal{U}|}\sum_{u\in\mathcal{U}}\frac{1}{|E(u)|}\sum_{(i,j)\in E(u)}\delta(\hat{x}_{u,i}>\hat{x}_{u,j}),$$


Full text


Guang-Neng Hu (1), Xin-Yu Dai (2, corresponding author)

(1) Department of Computer Science and Engineering, Hong Kong University of Science and Technology, Hong Kong, China. Email: [email protected]

(2) National Key Laboratory for Novel Software Technology, Nanjing University, Nanjing 210023, China. Email: [email protected]

Abstract

The item recommendation task predicts a personalized ranking over a set of items for each individual user. One paradigm is the rating-based methods, which concentrate on explicit feedback and hence face difficulties in collecting it. Meanwhile, the ranking-based methods are presented with rated items and rank the rated above the unrated. This paradigm takes advantage of widely available implicit feedback. It, however, usually ignores an important kind of information: item reviews. Item reviews not only justify the preferences of users, but also help alleviate the cold-start problem, on which collaborative filtering fails. In this paper, we propose two novel and simple models to integrate item reviews into Bayesian personalized ranking. In each model, we make use of text features extracted from item reviews using word embeddings. On top of the text features we uncover the review dimensions that explain the variation in users’ feedback, and these review factors represent a prior preference of users. Experiments on six real-world data sets show the benefits of leveraging item reviews for ranking prediction. We also conduct analyses to understand the proposed models.

1 Introduction

Users are confronted with the “information overload” dilemma, and it is increasingly difficult for them to choose preferred items over others as the item set grows ever larger, e.g., hundreds of millions of products at Amazon.com [6] and tens of thousands of videos at Netflix.com [1]. Recommender systems (RSs) assist users in tackling this problem and help them make choices by ranking items based on their past behavior. Item recommendation predicts a personalized ranking over a set of items for each individual user and hence leads to personalized recommendation.

The rating-based (or point-wise) methods predict the ratings that a user will give to items and then rank the items according to their predicted ratings. Several methods have been proposed, and matrix factorization based models are the most popular due to their scalability, simplicity, and flexibility [9, 5, 2]. This paradigm concentrates on explicit feedback and faces difficulties in collecting it. Meanwhile, the ranking-based (pair-wise) methods are presented with seen items and rank the seen above the unseen. Bayesian personalized ranking (BPR-MF) and collaborative item selection are typical representatives [13, 10]. This paradigm takes advantage of widely available implicit feedback, but it usually ignores an important kind of information: item reviews.

**Related work.** Item reviews justify the preferences of users and help alleviate the cold-start problem. They are a diverse and complementary data source for recommendation beyond the user-item co-rating information. The collective matrix factorization (CMF) method [14] can be adapted to factorize the item-word matrix as well as the user-item matrix. The collaborative topic regression (CTR) [17] and hidden factors and topics (HFT) [7] models integrate user-item interactions with item text content to build better rating predictors. They both employ topic modeling to learn hidden topic factors which explain the variations of users’ preferences. The CTRank model [18] also adopts topic modeling to exploit item meta-data such as article titles and abstracts, using a bag-of-words representation for one-class CF [11]. The CDR [19] and CKE [20] models use deep learning techniques such as stacked denoising autoencoders to mine the text content. Nevertheless, integrating item reviews into the ranking-based methods presents both opportunities and challenges for traditional BPR, and there are few works on leveraging item reviews to improve personalized ranking. Beyond reviews, other auxiliary sources such as social relations have also been integrated into CF models [4]. We focus on item reviews.

In this paper we propose two novel and simple models to incorporate item reviews into matrix factorization based Bayesian personalized ranking. Like HFT, they integrate item reviews; unlike HFT, they generate a ranked list of items for each individual user. Like CTRank, they focus on personalized ranking; unlike CTRank, they are based on matrix factorization and use word embeddings to extract features. Like BPR-MF, they rank preferred items over others; unlike BPR-MF, they leverage the information from item reviews. In each of the two models, we make use of text features extracted from item reviews using word embeddings. On top of the text features we uncover the review dimensions that explain the variation in users’ feedback. These review factors represent a prior preference of a user. One model treats the review factor space as independent of the latent factor space; the other connects implicit feedback and item reviews through the shared item space.

The contributions of this work are summarized as follows.

We propose two novel models to integrate item reviews into matrix factorization based Bayesian personalized ranking (Section 3.2 and Section 3.3). They generate a ranked list of items for each individual user by leveraging the information from item reviews.
To exploit item reviews, we build the proposed models on top of text features extracted from them. We demonstrate a simple and effective way of extracting features from item reviews by averaging word embeddings (Section 4).
We empirically evaluate the proposed models on multiple real-world datasets which contain more than one million feedback events in total. The experimental results show the benefit of leveraging item reviews for personalized ranking prediction. We also conduct analyses to understand the proposed models, including the training efficiency and the impact of the number of latent factors.

2 Notation and Problem Statement

Before proposing our models, we briefly review the personalized ranking task and then describe the problem statement. To this end, we first introduce the notations used throughout the paper.

2.1 Notation

Suppose there are $M$ users $\mathcal{U}=\{u_{1},...,u_{M}\}$ and $N$ items $\mathcal{I}=\{i_{1},...,i_{N}\}$. We reserve $u,v$ for indexing users and $i,j$ for indexing items. Let $X\in\mathbb{R}^{M\times N}$ denote the user-item binary implicit feedback matrix, where $x_{u,i}$ is the preference of user $u$ on item $i$, and we mark a zero if it is unknown. Define $N_{u}$ as the set of items on which user $u$ has an action: $N_{u}\equiv\{i|i\in\mathcal{I}\land x_{u,i}>0\}$. Rating-based methods [9, 5] and ranking-based methods [13, 3] mainly learn the latent user factors $P=[P_{1},...,P_{M}]\in\mathbb{R}^{F\times M}$ and latent item factors $Q=[Q_{1},...,Q_{N}]\in\mathbb{R}^{F\times N}$ from the partially observed $X$.

Item $i$ may have text information, e.g., review $d_{ui}$ written by user $u$. We aggregate all reviews of a particular item as a ‘doc’ $d_{i}=\cup_{u\in\mathcal{U}}{d_{ui}}$. Approaches like CTR and HFT [17, 7] integrate item content/reviews with explicit ratings for rating prediction using topic modeling. Another approach is to learn word embeddings and then compose them at the document level as the item text features; we adopt this way of extracting text features $f_{i}\in\mathbb{R}^{D}$ from $d_{i}$ (see Section 4).

2.2 Problem Statement

Our work focuses on the item recommendation, or personalized ranking, task, where a ranked list of items is generated for each individual user. The goal is to accurately rank the unobserved items, which contain both truly negative items (e.g., the user dislikes the Netflix movies or is not interested in buying the Amazon products) and missing ones (e.g., the user would want to see a movie or buy a product once she becomes aware of it).

Instead of accurately predicting unseen ratings by learning a model from training samples $(u,i,x_{u,i})$ where $x_{u,i}>0$, personalized ranking optimizes for correctly ranking item pairs by learning a model from training samples $D_{S}\equiv\{(u,i,j)|u\in\mathcal{U}\land i\in N_{u}\land j\in\mathcal{I}\backslash N_{u}\}$. An item pair of a user, $(u,i,j)$, means that she prefers the former to the latter, i.e., the model tries to reconstruct parts of a total order $>_{u}$ for each user $u$. From the feedback history $X$ we can infer that the observed items $i$ are ranked higher than the unobserved ones $j$; for two observed items $i_{1},i_{2}$ or two unobserved items $j_{1},j_{2}$ we can infer nothing. Random (negative) sampling is adopted since the number of such pairs is huge. See the original BPR paper [13] for more details.
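The construction of $D_{S}$ by random negative sampling can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `sample_triple`, the dictionary layout, and the toy feedback are all hypothetical, and uniform sampling over users is an assumption (the BPR paper discusses sampling schemes in more detail).

```python
import random

def sample_triple(train_items, num_items, rng):
    """Draw one (u, i, j) with i in N_u (observed) and j outside N_u (unobserved)."""
    u = rng.choice(sorted(train_items))       # pick a user uniformly (assumption)
    i = rng.choice(sorted(train_items[u]))    # observed (positive) item
    j = rng.randrange(num_items)              # candidate negative item
    while j in train_items[u]:                # reject items the user has observed
        j = rng.randrange(num_items)
    return u, i, j

# Hypothetical toy feedback: user id -> set of observed item ids (N_u).
train_items = {0: {0, 2}, 1: {1}, 2: {3, 4}}
rng = random.Random(0)
triples = [sample_triple(train_items, num_items=6, rng=rng) for _ in range(5)]
```

Rejection sampling is cheap here because $|N_{u}|\ll|\mathcal{I}|$, so almost every draw of $j$ is already unobserved.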

**Problem 1.** Personalized Ranking with Item Reviews.

Input: 1) A binary implicit feedback matrix $X$, 2) an item reviews corpus $C$, and 3) a user $u$ in the user set $\mathcal{U}$.

Output: A ranked list $>_{u}$ over the unobserved items $\mathcal{I}\backslash N_{u}$.

In Problem 1, to generate the ranked list, we have item reviews to exploit besides implicit feedback.

3 The Proposed Models

In this section, we propose two models as a solution to Problem 1, both of which integrate item reviews into Bayesian personalized ranking. One model treats the review factor space as independent of the latent factor space (Section 3.2). The other connects implicit feedback and item reviews through the shared item space (Section 3.3). In each of the two proposed models, we make use of text features extracted from item reviews using word embeddings (Section 4). On top of the text features we uncover the review dimensions that explain the variation in users’ feedback, and these review factors represent a prior preference of a user. Both models are based on basic matrix factorization (Section 3.1) and learned with Bayesian personalized ranking (Section 3.4).

3.1 Basic Matrix Factorization

Basic matrix factorization (Basic MF) finds the latent user-specific feature matrix $[P_{u}]^{M}_{1}$ and item-specific feature matrix $[Q_{i}]^{N}_{1}$ that approximate the partially observed feedback matrix $X$ in the regularized least-squares (or ridge regression) sense by solving the following problem:

$$\min_{P,Q}\sum_{x_{u,i}\neq 0}(x_{u,i}-\hat{x}_{u,i})^{2}+\lambda(\|P\|_{F}^{2}+\|Q\|_{F}^{2}),\tag{1}$$

where $\lambda$ is the regularization parameter to avoid over-fitting. The predicted scores $\hat{x}_{u,i}$ can be modeled in various forms, which reflects the flexibility of matrix factorization. A basic form is $\hat{x}_{u,i}^{Basic}=\alpha+\beta_{u}+\beta_{i}+P_{u}^{\textrm{T}}Q_{i}$, where $\alpha$, $\beta_{u}$ and $\beta_{i}$ are biases [5].

3.2 Integrating Item Reviews into Basic MF: Different Space Case

In this section, we propose our first model, TBPR-Diff, to integrate item reviews with implicit feedback. Analogous to Basic MF, which factorizes the ratings into user and item latent factors, we can factorize the reviews into user and item text factors (see the illustration in Figure 1, top). The TBPR-Diff model sharpens this idea and teases apart the rating dimensions into latent factors and text factors:

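$$\hat{x}_{u,i}^{Diff}=\alpha+\beta_{u}+\beta_{i}+P_{u}^{\mathsf{T}}Q_{i}+\theta_{u}^{\mathsf{T}}(Hf_{i})+\beta^{\prime\mathsf{T}}f_{i},\tag{2}$$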

where the term $\theta_{u}^{\mathsf{T}}(Hf_{i})$ is newly introduced to capture the text interaction between user $u$ and item $i$. To exploit item reviews, text features $f_{i}\in\mathbb{R}^{D}$ are first extracted from item reviews using word embeddings. The embedding kernel $H\in\mathbb{R}^{K\times D}$ linearly transforms $f_{i}$ from the text feature space (e.g., $D=200$) into a low-dimensional text rating space (e.g., $K=15$), where it interacts with the text factors of users $\theta_{u}\in\mathbb{R}^{K}$. A text bias vector $\beta^{\prime}$ is also introduced to model users’ overall preferences towards the item reviews. The details of the text features extracted from item reviews using word embeddings are described later (see Section 4).
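To make the shapes in the TBPR-Diff predictor concrete, here is a minimal NumPy sketch. The function name `predict_diff` and the random toy parameters are hypothetical; only the formula and the dimensions $F=K=15$, $D=200$ come from the text.

```python
import numpy as np

def predict_diff(alpha, beta_u, beta_i, P_u, Q_i, theta_u, H, beta_text, f_i):
    """alpha + beta_u + beta_i + P_u^T Q_i + theta_u^T (H f_i) + beta'^T f_i."""
    latent = P_u @ Q_i                  # feedback (latent factor) interaction
    text = theta_u @ (H @ f_i)          # text interaction in the K-dim space
    return alpha + beta_u + beta_i + latent + text + beta_text @ f_i

rng = np.random.default_rng(0)
F, K, D = 15, 15, 200                   # dimensions mentioned in the text
P_u, Q_i = rng.normal(size=F), rng.normal(size=F)
theta_u = rng.normal(size=K)            # user text factors
H = rng.normal(size=(K, D))             # embedding kernel
beta_text = rng.normal(size=D)          # text bias vector beta'
f_i = rng.normal(size=D)                # toy stand-in for the item text feature
score = predict_diff(0.0, 0.1, -0.2, P_u, Q_i, theta_u, H, beta_text, f_i)
```

Setting `H` and `beta_text` to zero recovers the Basic MF predictor, which is one way to see that the text terms are purely additive in this model.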

Since the text factors of users $\theta_{u}$ and of items $(Hf_{i})$ are independent of the latent factors $P_{u}$ and $Q_{i}$, there are no deep interactions between the two information sources, observed feedback and item reviews, and hence they cannot benefit from each other. The additional parameters also increase the model complexity. Based on these observations, we propose another model to alleviate these problems.

3.3 Integrating Item Reviews into Basic MF: Shared Space Case

In this section, we propose our second model, TBPR-Shared, to integrate item reviews with implicit feedback more compactly. For an item $i$, its latent factors $Q_{i}$ learned from feedback can be considered as characteristics that it possesses; meanwhile, these characteristics are probably discussed in its reviews and hence show up in its text factors $Hf_{i}$ (see the illustration in Figure 1, bottom). For user $u$, if we let $Q_{i}$ and $\{Hf_{k}|k\in N_{u}\}$ be in the same space, then this leads to deep interactions between the text factors of user $u$ and the latent factors of item $i$. The TBPR-Shared model sharpens this idea and enables deep interactions between text factors and latent factors, as well as reducing the complexity of the model:

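$$\hat{x}_{u,i}^{Shared}=Q_{i}^{\mathsf{T}}\Big(P_{u}+|N_{u}|^{-1/2}\sum_{k\in N_{u}}Hf_{k}\Big)+\alpha+\beta_{u}+\beta_{i}+\beta^{\prime\mathsf{T}}f_{i}.\tag{3}$$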

On the right-hand side, the last four terms are the same as in the TBPR-Diff model. Unlike in the TBPR-Diff model, the shared item factors $Q_{i}$ now have a two-fold meaning: one is as item latent factors that represent items’ characteristics; the other is to interact with item text factors that capture items’ semantics from item reviews. Also unlike in the TBPR-Diff model, the preferences of a user now have a prior term which shows the ‘text influence of her rated items’ captured by the text factors of the corresponding items. In summary, on top of text features the TBPR-Shared model uncovers the review dimensions that explain the variation in users’ feedback, and these factors represent a prior preference of the user.
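The TBPR-Shared predictor can be sketched in the same style. The name `predict_shared` and the toy data are hypothetical. Note the assumption that $H$ maps into the same $F$-dimensional space as $Q_{i}$ (i.e., $K=F$), which the shared-space construction requires.

```python
import numpy as np

def predict_shared(alpha, beta_u, beta_i, P_u, Q_i, H, beta_text, f_i, F_Nu):
    """Q_i^T (P_u + |N_u|^{-1/2} sum_{k in N_u} H f_k) + alpha + beta_u + beta_i + beta'^T f_i.

    F_Nu has shape (|N_u|, D): one row of text features per item the user rated.
    """
    prior = (H @ F_Nu.sum(axis=0)) / np.sqrt(len(F_Nu))   # text prior of user u
    return Q_i @ (P_u + prior) + alpha + beta_u + beta_i + beta_text @ f_i

rng = np.random.default_rng(0)
F, D = 15, 200
P_u, Q_i = rng.normal(size=F), rng.normal(size=F)
H = rng.normal(size=(F, D))              # K = F so that H f_k lives in the item space
beta_text = rng.normal(size=D)
f_i = rng.normal(size=D)
F_Nu = rng.normal(size=(7, D))           # toy features of 7 rated items
score = predict_shared(0.0, 0.1, -0.2, P_u, Q_i, H, beta_text, f_i, F_Nu)
```

Compared with the TBPR-Diff sketch, there is no per-user $\theta_{u}$: the user's text preference is derived from the features of the items in $N_{u}$.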

**Remarks.** I. The VBPR model [3] proposed a formulation analogous to Eq (2). It exploits visual features extracted from item images, whereas we leverage item features extracted from item reviews. The SVD++ and NSVD models [5, 12] proposed formulas similar to Eq (3). They learn an implicit feature matrix to capture implicit feedback, whereas we learn a text correlation matrix to capture text factors; note that they do not exploit item reviews and hence have no text bias term. II. There can be an adjustable weight on the text term (i.e., $\theta_{u}^{\mathsf{T}}(Hf_{i})$ in Eq (2) and $Q_{i}^{\mathsf{T}}|N_{u}|^{-1/2}\sum\nolimits_{k\in N_{u}}Hf_{k}$ in Eq (3)) to balance the influence of feedback and of reviews, but here we simply let feedback and reviews be equally important.

Before we delve into the learning algorithm, the preference predictors of TBPR-Diff and of TBPR-Shared models are shown in Figure 1.

3.4 Model Learning with BPR

Revisiting Problem 1, we need to generate a ranked list of items for each individual user. Bayesian personalized ranking [13] is a generic pair-wise optimization framework that learns from the training item pairs using gradient descent. Denote the model parameters as $\Theta$ and let $\hat{x}_{uij}(\Theta)$ (for simplicity we omit the model parameters, and the notation $x_{ui}$ is the same as $x_{u,i}$) represent an arbitrary real-valued mapping under the model parameters. Then the optimization criterion for personalized ranking, BPR-OPT, is:

$$\mathcal{L}(\Theta)\equiv\sum_{(u,i,j)\in D_{S}}\ln\sigma(\hat{x}_{uij})-\lambda\|\Theta\|^{2},\tag{4}$$

where $\hat{x}_{uij}\equiv\hat{x}_{ui}-\hat{x}_{uj}$, and the sigmoid function is defined as $\sigma(x)=1/({1+\exp(-x)})$. BPR-OPT thus rewards ranking item pairs correctly, while the regularization term favors a simple model.

Under the generic BPR-OPT framework, we derive the learning process for our proposed models TBPR-Diff and TBPR-Shared by instantiating $\hat{x}_{ui}$ with $\hat{x}_{ui}^{Diff}$ and $\hat{x}_{ui}^{Shared}$, respectively. The BPR-OPT defined in Eq (4) is differentiable and hence gradient ascent methods can be used to maximize it. For stochastic gradient ascent, a triple $(u,i,j)$ is randomly sampled from the training set $D_{S}$ and the model parameters are then updated by:

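$$\Theta\leftarrow\Theta+\eta\Big(\sigma(-\hat{x}_{uij})\frac{\partial\hat{x}_{uij}}{\partial\Theta}-\lambda\Theta\Big).\tag{5}$$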
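The factor $\sigma(-\hat{x}_{uij})$ in this update comes from differentiating the log-sigmoid. As a short worked step (a sketch; the constant factor 2 from differentiating $\lambda\|\Theta\|^{2}$ is assumed here to be absorbed into $\lambda$, which the text does not state explicitly):

$$
\frac{d}{dx}\ln\sigma(x)=\frac{\sigma(x)\,(1-\sigma(x))}{\sigma(x)}=1-\sigma(x)=\sigma(-x),
\qquad\text{so}\qquad
\frac{\partial}{\partial\Theta}\ln\sigma(\hat{x}_{uij})=\sigma(-\hat{x}_{uij})\,\frac{\partial\hat{x}_{uij}}{\partial\Theta}.
$$

Intuitively, a triple that the model already ranks correctly by a wide margin ($\hat{x}_{uij}\gg 0$) has $\sigma(-\hat{x}_{uij})\approx 0$ and contributes almost no update, while a badly mis-ranked triple contributes a nearly full gradient step.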

The gradients for the user latent factors and bias terms are the same in both models:

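$$\frac{\partial\hat{x}_{uij}}{\partial P_{u}}=Q_{i}-Q_{j},\quad\frac{\partial\hat{x}_{uij}}{\partial\beta^{\prime}}=f_{i}-f_{j},\quad\frac{\partial\hat{x}_{uij}}{\partial\beta_{i}}=1,\quad\frac{\partial\hat{x}_{uij}}{\partial\beta_{j}}=-1.$$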

Parameter gradients of the model TBPR-Diff are:

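$$\frac{\partial\hat{x}_{uij}}{\partial Q_{i}}=P_{u},\quad\frac{\partial\hat{x}_{uij}}{\partial Q_{j}}=-P_{u},\quad\frac{\partial\hat{x}_{uij}}{\partial\theta_{u}}=H(f_{i}-f_{j}),\quad\frac{\partial\hat{x}_{uij}}{\partial H}=\theta_{u}(f_{i}-f_{j})^{\mathsf{T}}.$$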

Parameter gradients of the model TBPR-Shared are:

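$$\frac{\partial\hat{x}_{uij}}{\partial Q_{i}}=P_{u}+|N_{u}|^{-1/2}\sum_{k\in N_{u}}Hf_{k},\quad\frac{\partial\hat{x}_{uij}}{\partial Q_{j}}=-\Big(P_{u}+|N_{u}|^{-1/2}\sum_{k\in N_{u}}Hf_{k}\Big),$$

$$\frac{\partial\hat{x}_{uij}}{\partial H}=|N_{u}|^{-1/2}(Q_{i}-Q_{j})\Big(\sum_{k\in N_{u}}f_{k}\Big)^{\mathsf{T}}.$$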
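As an illustration of how the update rule and these gradients combine, the following sketch performs one stochastic gradient ascent step for TBPR-Diff on toy data. It is not the released implementation: the parameter dictionary `p`, the function names, and the use of a single penalty `lam` (the experiments use separate penalties for latent and text factors) are simplifying assumptions.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def x_uij_diff(p, u, i, j, f):
    """x_hat_ui - x_hat_uj for TBPR-Diff (alpha and beta_u cancel in the difference)."""
    df = f[i] - f[j]
    return (p["beta"][i] - p["beta"][j] + p["P"][u] @ (p["Q"][i] - p["Q"][j])
            + p["theta"][u] @ (p["H"] @ df) + p["beta_text"] @ df)

def sgd_step_diff(p, u, i, j, f, eta=0.001, lam=0.0):
    """One ascent step: Theta <- Theta + eta * (sigma(-x_uij) * grad - lam * Theta)."""
    df = f[i] - f[j]
    g = sigmoid(-x_uij_diff(p, u, i, j, f))          # common scalar factor
    # Snapshot old values so every gradient uses pre-update parameters.
    P_u, Q_i, Q_j = p["P"][u].copy(), p["Q"][i].copy(), p["Q"][j].copy()
    theta_u, H = p["theta"][u].copy(), p["H"].copy()
    p["P"][u] += eta * (g * (Q_i - Q_j) - lam * P_u)
    p["Q"][i] += eta * (g * P_u - lam * Q_i)
    p["Q"][j] += eta * (-g * P_u - lam * Q_j)
    p["theta"][u] += eta * (g * (H @ df) - lam * theta_u)
    p["H"] += eta * (g * np.outer(theta_u, df) - lam * H)
    p["beta_text"] += eta * (g * df - lam * p["beta_text"])
    p["beta"][i] += eta * (g - lam * p["beta"][i])
    p["beta"][j] += eta * (-g - lam * p["beta"][j])

# Hypothetical toy problem: 3 users, 4 items, F = K = 2, D = 5.
rng = np.random.default_rng(0)
M, N, F, K, D = 3, 4, 2, 2, 5
p = {"P": 0.1 * rng.normal(size=(M, F)), "Q": 0.1 * rng.normal(size=(N, F)),
     "theta": 0.1 * rng.normal(size=(M, K)), "H": 0.1 * rng.normal(size=(K, D)),
     "beta_text": np.zeros(D), "beta": np.zeros(N)}
f = rng.normal(size=(N, D))                          # toy item text features
before = x_uij_diff(p, 0, 1, 2, f)
sgd_step_diff(p, 0, 1, 2, f, eta=0.01)
after = x_uij_diff(p, 0, 1, 2, f)                    # margin for (u=0, i=1, j=2)
```

With `lam=0` and a small step size, the margin $\hat{x}_{uij}$ of the sampled triple should increase after the step, which is a quick sanity check that the gradient signs are right.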

**Complexity of Models and Learning.** The complexity of model TBPR-Diff is $(M+N)F+(M+D)K+D$ while the complexity of model TBPR-Shared is $(M+N)F+(D+1)K$. We can see that the latter model reduces the complexity by $\mathcal{O}(MK)$, i.e., the parameters $[\theta_{u}]^{M}_{1}$. For updating on each training sample $(u,i,j)\in D_{S}$, the complexity of learning TBPR-Diff is linear in the number of dimensions ($F,K,D$), while the complexity of learning TBPR-Shared is also linear provided that the number of rated items per user is an amortized constant, i.e., $\sum_{u\in\mathcal{U}}{|N_{u}|/|\mathcal{U}|}\approx const\ll|\mathcal{I}|$, which holds in real-world datasets because of sparsity (see Table 1).
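To give a sense of scale, the two parameter-count formulas can be evaluated for the Women dataset from Table 1 with the default $F=K=15$ and $D=200$. The function names are hypothetical; the formulas are copied from the text as stated.

```python
def n_params_diff(M, N, F, K, D):
    """Parameter count of TBPR-Diff as given in the text."""
    return (M + N) * F + (M + D) * K + D

def n_params_shared(M, N, F, K, D):
    """Parameter count of TBPR-Shared as given in the text."""
    return (M + N) * F + (D + 1) * K

M, N = 62_928, 157_656        # users and items of the Women dataset (Table 1)
F = K = 15                    # default number of latent / text factors
D = 200                       # embedding dimensionality
diff = n_params_diff(M, N, F, K, D)        # 4,255,880
shared = n_params_shared(M, N, F, K, D)    # 3,311,775
saved = diff - shared                      # 944,105, dominated by M * K = 943,920
```

Under these formulas the shared-space model has roughly 22% fewer parameters on this dataset, almost all of it from dropping the per-user text factors.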

4 Feature Representations of Item Reviews

Recall that when generating the ranked list of items for each individual user, we have item reviews to exploit in addition to implicit feedback. To exploit item reviews, we extract text features from them, i.e., there is a feature vector for each item. Our two proposed models are both built on top of the text features ($[f_{i}]_{i=1}^{N}$), and hence these features are important for improving personalized ranking. In this section, we give one simple way to extract text features from the reviews of an item: word embeddings.

The SGNS (skip-gram with negative sampling) model [8] is an architecture for learning continuous representations of words from a large corpus; these representations, or word embeddings, can capture the syntactic and semantic relationships of words. We first run the Google word2vec code on the Amazon reviews corpus (see Table 1) using the default setting (in particular, dimensionality $D=200$) to learn a vector $\mathbf{e}_{w}$ for each word $w$. We then directly average all of the embeddings in an item’s reviews (excluding stop words) to get a composition vector as the text feature for this item:

$$f_{i}\equiv\frac{1}{|d_{i}|}\sum_{w\in d_{i}}\mathbf{e}_{w}.$$
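The averaging step can be sketched as below. The toy embedding table, stop-word list, and function name are hypothetical, and two details are assumptions not stated in the text: $|d_{i}|$ is taken to count only the words that remain after stop-word removal, and items with no usable words fall back to a zero vector.

```python
import numpy as np

def item_text_feature(doc_tokens, embeddings, stop_words, dim):
    """Average the embeddings of all non-stop-words in an item's aggregated reviews."""
    vecs = [embeddings[w] for w in doc_tokens
            if w not in stop_words and w in embeddings]
    if not vecs:                          # assumption: zero vector for empty docs
        return np.zeros(dim)
    return np.mean(vecs, axis=0)

# Hypothetical 2-dimensional embeddings (the paper uses D = 200 from word2vec).
embeddings = {"soft": np.array([1.0, 0.0]),
              "cotton": np.array([0.0, 1.0]),
              "shirt": np.array([1.0, 1.0])}
stop_words = {"the", "a", "is"}
doc = ["the", "soft", "cotton", "shirt"]  # tokens of all reviews of one item
f_i = item_text_feature(doc, embeddings, stop_words, dim=2)
```

Repeated words contribute once per occurrence, so frequently mentioned terms pull the item feature toward their embedding.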

To get $f_{i}$ , we can also use complex methods to compose the individual embeddings [15] and to learn the doc representation directly [16]; these complex methods are left for future investigation.

5 Experiments

We have proposed two models as a solution to Problem 1. The two models, TBPR-Diff and TBPR-Shared, integrate item reviews into the Bayesian personalized ranking optimization criterion and uncover the text dimensions in users’ feedback. To measure the benefit of leveraging item reviews, we compare them with BPR-MF [13], which ignores the information in item reviews. In addition, we report results for the most popular (POP) baseline, which ranks items by their ‘popularity’ and is therefore not personalized. Furthermore, we analyse the impact of the number of latent factors on our proposed models.

5.1 Datasets

We evaluate our models on six Amazon datasets (http://jmcauley.ucsd.edu/data/amazon/). Five come from the clothing and shoes category, and one from cell phones and accessories. We use the review history as implicit feedback and aggregate all users’ reviews of an item into a doc for this item. We draw the samples from the original datasets such that every user has rated at least five items (i.e., $\forall u\in\mathcal{U}:|N_{u}|\geq 5$), and the statistics of the final evaluation datasets are shown in Table 1. From the table we can see that: 1) the observed feedback is very sparse, with density below 0.2% on every dataset and below 0.02% on the three larger ones; 2) the average number of feedback events per user is on the order of ten, i.e., $\sum_{u\in\mathcal{U}}{|N_{u}|/|\mathcal{U}|}\approx 10\ll|\mathcal{I}|$ holds; 3) more than half of the users and of the items are cold. Note that cold users/items are those that have fewer than seven feedback events, and the feedback Density = $\#Feedback/(\#Users*\#Items)$.
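The Density column of Table 1 can be recomputed from the other columns. A small sketch, using the Girls and Women rows (the function name is hypothetical):

```python
def density_percent(n_feedback, n_users, n_items):
    """Feedback density in percent: #Feedback / (#Users * #Items) * 100."""
    return 100.0 * n_feedback / (n_users * n_items)

# (#Users, #Items, #Feedback) taken from Table 1.
stats = {"Girls": (778, 3963, 5474), "Women": (62928, 157656, 504847)}
density = {name: density_percent(fb, users, items)
           for name, (users, items, fb) in stats.items()}
avg_feedback_girls = stats["Girls"][2] / stats["Girls"][0]   # feedback events per user
```

The recomputed values agree with the table up to the last printed digit.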

We split each dataset into three parts: training, validation, and test. In detail, for each user $u\in\mathcal{U}$, we randomly sample two items from her feedback history for the test set $Test_{u}$, two for the validation set $Valid_{u}$, and use the rest for the training set $Train_{u}$; hence $N_{u}=Train_{u}\cup Valid_{u}\cup Test_{u}$. This is why we discard users who rated fewer than five items: it ensures that there is at least one training sample for each user.
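The per-user split can be sketched as follows. The function name and the toy item set are hypothetical; only the two/two/rest scheme comes from the text.

```python
import random

def split_user_items(items, rng, n_test=2, n_valid=2):
    """Randomly split N_u into Train_u, Valid_u, Test_u (2 test, 2 validation, rest train)."""
    shuffled = sorted(items)              # sort first so the split is reproducible
    rng.shuffle(shuffled)
    test = set(shuffled[:n_test])
    valid = set(shuffled[n_test:n_test + n_valid])
    train = set(shuffled[n_test + n_valid:])
    return train, valid, test

N_u = {10, 11, 12, 13, 14, 15}            # hypothetical feedback history of one user
train_u, valid_u, test_u = split_user_items(N_u, random.Random(0))
```

With $|N_{u}|\geq 5$ the training part is never empty, which is the reason for the five-item filter.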

5.2 Evaluation Protocol

For item recommendation or personalized ranking, we need to generate a ranked list over the unobserved items. Therefore, for the hold-out test item $i\in Test_{u}$ of an individual user $u$, the evaluation calculates how accurately the model ranks $i$ above the other unobserved items $j\in\mathcal{I}\backslash N_{u}$. The widely used measure Area Under the ROC Curve (AUC) captures this ranking correctness intuition:

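$$AUC=\frac{1}{|\mathcal{U}|}\sum_{u\in\mathcal{U}}\frac{1}{|E(u)|}\sum_{(i,j)\in E(u)}\delta(\hat{x}_{u,i}>\hat{x}_{u,j}),$$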

where $E(u)=\{(i,j)|i\in Test_{u}\land j\in\mathcal{I}\land j\notin N_{u}\}$ and $\delta(\cdot)$ is an indicator function. A higher AUC score indicates better recommendation performance.
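A direct, unoptimized sketch of this AUC computation is given below. The function name, the dense score matrix, and the toy data are hypothetical; users with an empty $E(u)$ are skipped, which is an assumption for the edge case.

```python
import numpy as np

def auc(scores, test_items, observed_items):
    """Mean over users of the fraction of pairs (i, j) in E(u) with score_ui > score_uj.

    scores: (M, N) array of predicted x_hat; test_items[u] = Test_u;
    observed_items[u] = N_u (train, validation and test items of user u).
    """
    n_items = scores.shape[1]
    per_user = []
    for u, test_u in test_items.items():
        negatives = [j for j in range(n_items) if j not in observed_items[u]]
        if not test_u or not negatives:   # assumption: skip users with empty E(u)
            continue
        correct = sum(scores[u, i] > scores[u, j] for i in test_u for j in negatives)
        per_user.append(correct / (len(test_u) * len(negatives)))
    return float(np.mean(per_user))

# Toy example: 2 users, 4 items.
scores = np.array([[0.9, 0.1, 0.5, 0.3],
                   [0.2, 0.8, 0.4, 0.6]])
observed = {0: {0, 1}, 1: {0}}
test = {0: {0}, 1: {0}}
result = auc(scores, test, observed)      # user 0 ranks perfectly, user 1 not at all
```

In this toy case user 0's test item beats both unobserved items (per-user AUC 1.0) and user 1's test item beats none (0.0), so the average is 0.5.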

The validation set $\mathcal{V}=\cup_{u\in\mathcal{U}}Valid_{u}$ is used to tune hyperparameters and we report the corresponding results on the test set $\mathcal{T}=\cup_{u\in\mathcal{U}}Test_{u}$.

5.3 Comparing Methods

We compare our proposed models TBPR-Diff (see Eq (2)) and TBPR-Shared (see Eq (3)) with the Most Popular (POP) and BPR-MF [13] baselines. The difference of models lies in their preference predictors.

Reproducibility. We use the code released with [3] to implement the compared methods and our proposed models. The hyperparameters are tuned on the validation set. Following the default setting, for the BPR-MF model the norm-penalty is $\lambda=11$ and the learning rate is $\eta=0.005$. For our proposed models TBPR-Diff and TBPR-Shared, the norm-penalty is $\lambda_{latent}=11$ for latent factors and $\lambda_{text}=5$ for text factors, and the learning rate is $\eta=0.001$. For simplicity, the number of latent factors equals the number of text factors; the default value for both is fifteen (i.e., $F=K=15$). The impact of the number of factors is analysed in Section 5.5. Since the raw datasets, comparison code, and parameter settings are publicly available, we believe our experiments are easily reproduced.

5.4 Performance Results

The AUC performance results on the six Amazon.com datasets are shown in Table 2, where the second-to-last column is $(AUC_{\mathrm{TBPR-Shared}}-AUC_{\mathrm{BPR-MF}})/AUC_{\mathrm{BPR-MF}}\times 100\%$, and the last column is $(AUC_{\mathrm{TBPR-Shared}}-AUC_{\mathrm{BPR-MF}})/(AUC_{\mathrm{BPR-MF}}-AUC_{\mathrm{POP}})\times 100\%$. For each dataset there are three evaluation settings: the All Items (All) setting evaluates the models on the full test set $\mathcal{T}$; the Cold Start (Cold) setting evaluates the models on a subset $\mathcal{T}_{cold}\subseteq\mathcal{T}$ such that the number of training samples for each item within $\mathcal{T}_{cold}$ is no greater than three (i.e., $|Train_{u}|\leq 3$ or $|N_{u}|\leq 7$); the Warm setting evaluates the models on the difference set of All and Cold. Revisiting Table 1, we can see that: 1) almost all of the items are cold items for the Girls, Boys, and Baby datasets, and hence the results of the Cold setting are almost the same as those of All, while the Warm setting has too few items to give statistically reliable results; and 2) for the other three datasets, the percentage of cold items is also more than 86%, which requires the model to address the inherent cold-start nature of the recommendation problem.
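The two improvement columns can be reproduced from the AUC values. A small sketch for the Girls row of Table 2 (function names are hypothetical):

```python
def improv1(auc_shared, auc_bpr):
    """Relative improvement of TBPR-Shared over BPR-MF, in percent."""
    return 100.0 * (auc_shared - auc_bpr) / auc_bpr

def improv2(auc_shared, auc_bpr, auc_pop):
    """Improvement of TBPR-Shared over BPR-MF relative to BPR-MF's gain over POP, in percent."""
    return 100.0 * (auc_shared - auc_bpr) / (auc_bpr - auc_pop)

# Girls, All setting (Table 2): POP, BPR-MF, TBPR-Shared.
pop, bpr, shared = 0.1699, 0.5658, 0.5939
girls_improv1 = improv1(shared, bpr)
girls_improv2 = improv2(shared, bpr, pop)
```

The second measure normalizes by how much personalization already gains over the popularity baseline, so it is larger on datasets where POP is a strong baseline.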

There are several observations from the evaluation results.

Under the All setting, TBPR-Shared is the top performer, TBPR-Diff is second, BPR-MF comes in third, and POP is the weakest. These results show, first, that leveraging item reviews in addition to the feedback can improve personalized ranking, and second, that the personalization methods are distinctly better than the user-independent POP method. For example, TBPR-Shared obtains an average relative improvement of 4.83% over BPR-MF on the three smaller datasets in terms of the AUC metric, and 2.74% over all six datasets. These two figures show, to some extent, that transferring knowledge from an auxiliary data source (here item reviews) helps most when the target data source (here rating feedback) is not so rich.
Under the Cold setting, TBPR-Shared is again the top performer, TBPR-Diff is second, BPR-MF comes in third, and POP is again the weakest. These results show, first, that leveraging item reviews in addition to the feedback can improve personalized ranking even in the cold-start setting, and second, that the personalization methods are distinctly better than the user-independent POP method, since the cold items are not popular. In detail, TBPR-Shared obtains an average relative improvement of 2.31% over BPR-MF in terms of the AUC metric. Furthermore, the relative improvement of TBPR-Shared over BPR-MF in the cold-start setting is about 1.6 times that in the All setting, which implies that integrating item reviews is more beneficial when the observed feedback is sparser. As for the results on the Phones dataset, revisiting Table 1 we can see that the ratio of cold items to all items is 86.8%, which is far less than on the other two datasets ($\sim 92.2\%$); in this case adding auxiliary information does not help much.

We also evaluate on the Warm setting (not shown in Table 2), where all of the personalized, complex methods are worse than the simple, user-independent POP method. Warm items are more likely to be popular and show fewer personalized characteristics. This recalls the commonplace observation that recommendation plays an important role for long-tail items.
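
The All, Cold, and Warm settings can be sketched as a partition of the test set by how often each test item appears in training. The snippet below is an illustration only: it assumes feedback is stored as `(user, item)` pairs and that coldness is decided by the per-item training count with a threshold of three, following the textual description; the function name and the toy data are hypothetical.

```python
from collections import Counter

COLD_THRESHOLD = 3  # an item is cold if it has at most this many training samples


def split_cold_warm(train_pairs, test_pairs, threshold=COLD_THRESHOLD):
    """Partition test (user, item) pairs into Cold and Warm subsets.

    Items that never occur in training have a count of zero and are cold.
    """
    item_counts = Counter(item for _, item in train_pairs)
    cold = [(u, i) for u, i in test_pairs if item_counts[i] <= threshold]
    warm = [(u, i) for u, i in test_pairs if item_counts[i] > threshold]
    return cold, warm


# Toy data (hypothetical): item "a" has four training samples, item "b" has one,
# and item "c" does not appear in training at all.
train = [("u1", "a"), ("u2", "a"), ("u3", "a"), ("u4", "a"), ("u1", "b")]
test = [("u5", "a"), ("u2", "b"), ("u3", "c")]

cold, warm = split_cold_warm(train, test)
# The All setting is simply the full test list; Cold and Warm partition it.
assert len(cold) + len(warm) == len(test)
```

Here the pairs on items "b" and "c" fall into Cold and the pair on item "a" falls into Warm, so evaluating on `warm` alone concentrates on the comparatively popular items.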

5.5 Analysis of the Proposed Models

After demonstrating the benefits of leveraging item reviews, we analyse the proposed models from two angles: the impact of the number of latent factors, and the training efficiency and convergence. A more in-depth investigation, such as the impact of the embedding dimensionality and of the corpus used to train the embeddings, is left to future work.

**Impact of the Number of Latent Factors.** The two proposed models, TBPR-Shared and TBPR-Diff, have two important hyperparameters: the number of latent factors $F$ and the number of text factors $K$ . For simplicity, we set the two values equal. We vary the number of latent factors $\#factors=\{5,10,15,20,25\}$ to observe the performance of the different methods. The test AUC scores are shown in Figure 2. On the Girls and Boys datasets, both personalized models perform better as the number of factors increases; on the other datasets, the performance improves as the number of factors increases to around fifteen, after which it stops improving and may even degrade. We set the default value to 15.

The plots also show visually the benefits of integrating item reviews (TBPR-Shared vs. BPR-MF) and of generating a personalized ranked item list for each individual user (TBPR-Shared and BPR-MF vs. POP).
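
The paper does not state a formal rule for choosing the number of factors. One hypothetical rule that matches the plateau behaviour described above is to take the smallest factor count whose AUC is within a small tolerance of the best observed score. The sketch below assumes this rule; the function name, the tolerance, and the AUC values are illustrative and are not read from Figure 2.

```python
def pick_num_factors(auc_by_factors, tolerance=0.001):
    """Return the smallest factor count whose AUC is within `tolerance` of the best."""
    if not auc_by_factors:
        raise ValueError("no scores given")
    best = max(auc_by_factors.values())
    for num_factors in sorted(auc_by_factors):
        if best - auc_by_factors[num_factors] <= tolerance:
            return num_factors


# Hypothetical scores over the grid {5, 10, 15, 20, 25} used in the experiments.
scores = {5: 0.580, 10: 0.600, 15: 0.6150, 20: 0.6155, 25: 0.6140}
chosen = pick_num_factors(scores)
print(chosen)
```

Preferring the smallest near-optimal value keeps the number of parameters, and hence the training cost, low once the curve has flattened.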

**Training Efficiency and Convergence Analysis.** The complexity of learning is approximately linear in the number of parameters of our proposed models. Figure 3 shows the AUC scores of the TBPR-Shared model on the validation sets as the number of training iterations increases. In summary, our models take 3-4 times as many iterations to converge as BPR-MF. On the three smaller datasets (Girls, Boys, and Baby), the first five iterations are enough to obtain a better score than POP; on the three larger datasets (Men, Women, and Phones), it takes longer.

As a reference, the BPR-MF model usually converges in 50 iterations. As another reference, all of our experiments are completed in about one week using one server that has 65GiB memory and 12 cores with frequency 3599MHz.
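
Statements such as "converges in 50 iterations" or "the first five iterations are enough to beat POP" can be read off a validation curve like the one in Figure 3. The sketch below shows one way to do so under stated assumptions: convergence is taken to mean that validation AUC has not improved by more than `min_delta` for `patience` consecutive iterations. This criterion, the function names, and the toy curve are hypothetical; the paper does not specify its stopping rule.

```python
def iterations_to_converge(val_auc, min_delta=1e-4, patience=3):
    """Return the 1-based iteration of the last meaningful improvement.

    Scanning stops once `patience` iterations pass without an improvement
    larger than `min_delta`.
    """
    best = float("-inf")
    best_iter = 0
    for it, auc in enumerate(val_auc, start=1):
        if auc > best + min_delta:
            best, best_iter = auc, it
        elif it - best_iter >= patience:
            break
    return best_iter


def first_iteration_above(val_auc, baseline_auc):
    """Return the first 1-based iteration whose AUC exceeds the baseline, or None."""
    for it, auc in enumerate(val_auc, start=1):
        if auc > baseline_auc:
            return it
    return None


# Toy validation curve and POP score (hypothetical values).
curve = [0.50, 0.55, 0.58, 0.60, 0.60, 0.60, 0.60]
pop_auc = 0.56
converged_at = iterations_to_converge(curve)
beats_pop_at = first_iteration_above(curve, pop_auc)
```

On this toy curve the last improvement occurs at iteration 4 and the POP score is first exceeded at iteration 3.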

6 Conclusion and Future Work

Item reviews justify the rating behavior of users and hence are useful for improving recommender performance. Based on matrix factorization techniques, we proposed two models that integrate item reviews into Bayesian personalized ranking. In each model, we make use of text features extracted from item reviews using word embeddings. On top of the text features, we uncover the review dimensions that explain the variation in users’ feedback. These review factors represent a prior preference of a user and show the ‘text influence of her rated items’. Empirical results on multiple real-world datasets demonstrate that the proposed models improve ranking performance in terms of AUC under both the All setting and the cold start setting. The shared space model is slightly better than the different space one, which shows the benefits of considering the interactions between latent factors and text factors and of reducing the model complexity. Furthermore, we analyzed the impact of the number of latent factors and the efficiency of model learning.

Focusing on leveraging item reviews to improve personalized ranking, there are several directions we want to explore. First, we integrate item reviews with implicit feedback by adding text dimensions to the rating predictors; this is certainly not the only way to exploit item features. Second, more evaluation metrics besides AUC (e.g., hit rate) should be explored. Third, the construction strategy for positive/negative samples is also worth further investigation because it deeply affects the model design, the learning results, and the evaluation performance. Last but not least, since we investigate the benefits of leveraging item reviews, we only compare our models with BPR-MF (and POP); comparison with more baselines is needed to establish their effectiveness.

**Acknowledgments** The work is supported by HKPFS (PF15-16701), NSFC (61472183), and 863 Program (2015AA015406).

Bibliography

[1] J. Bennett and S. Lanning. The Netflix prize. In Proceedings of KDD Cup and Workshop, 2007.
[2] P. Cremonesi, Y. Koren, and R. Turrin. Performance of recommender algorithms on top-n recommendation tasks. In Proceedings of ACM RecSys, 2010.
[3] R. He and J. McAuley. VBPR: Visual Bayesian personalized ranking from implicit feedback. In Proceedings of AAAI, 2016.
[4] G. Hu, X. Dai, Y. Song, S. Huang, and J. Chen. A synthetic approach for recommendation: Combining ratings, social relations, and reviews. In Proceedings of IJCAI, 2015.
[5] Y. Koren. Factorization meets the neighborhood: a multifaceted collaborative filtering model. In Proceedings of SIGKDD, 2008.
[6] G. Linden, B. Smith, and J. York. Amazon.com recommendations: Item-to-item collaborative filtering. Internet Computing, IEEE, 7(1):76–80, 2003.
[7] J. McAuley and J. Leskovec. Hidden factors and hidden topics: understanding rating dimensions with review text. In Proceedings of ACM RecSys, 2013.
[8] T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean. Distributed representations of words and phrases and their compositionality. In Advances in NIPS, 2013.