Deep Personalized Re-targeting
Meisam Hejazinia, Pavlos Mitsoulis-Ntompos, Serena Zhang

TL;DR
This paper introduces a hybrid neural network and gradient boosting model to predict traveler booking probability and value, improving accuracy by 7% in vacation rental marketplaces.
Contribution
It presents a novel hybrid model combining deep and shallow neural embeddings with gradient boosting, tailored for large-scale traveler behavior prediction.
Findings
Hybrid model improves prediction accuracy by 7%
Latent traveler preferences are learned from sparse session logs
Deployed architecture is suitable for production systems
Vrbo, part of Expedia Group
Abstract
Predicting booking probability and value at the traveler level plays a central role in computational advertising for massive two-sided vacation rental marketplaces. These marketplaces host millions of travelers with long shopping cycles, spending a lot of time in the discovery phase. The footprint of the travelers in their discovery is a useful data source to help these marketplaces to predict shopping probability and value. However, there is no one-size-fits-all solution for this purpose. In this paper, we propose a hybrid model that infuses deep and shallow neural network embeddings into a gradient boosting tree model. This approach allows the latent preferences of millions of travelers to be automatically learned from sparse session logs. In addition, we present the architecture that we deployed into our production system. We find that there is a pragmatic sweet spot between expensive complex deep neural networks and simple shallow neural networks that can increase the prediction performance of a model by seven percent, based on offline analysis.
Index Terms:
computational advertising, re-targeting, personalized advertising, shopping funnel, deep learning, embeddings, e-commerce, gradient boosting
I Introduction
Every day, millions of travelers enter massive online two-sided marketplaces from various advertising channels such as search engines, display ads and meta search engines to discover and book their dream vacation property from millions of options. A significant portion of these travelers have never booked a vacation rental before, or are not willing to repeat their previous trips as they seek variety. Marketplace platforms must bid appropriately in order to gain traffic from these travelers through online advertising channels. To accomplish this, marketplaces may estimate the booking probability and value of potential travelers based on various engagement signals. While traditional customer lifetime value estimation methods are strong in the context of repeated purchases, these methods fall short in the context of infrequent or first time travelers. This paper attempts to fill in this gap by proposing a solution that leverages search and engagement signals to predict booking intent and value for a traveler as they progress through the shopping cycle, regardless of their previous booking history. In particular, we propose a hybrid method that infuses shallow and deep neural network embeddings into a gradient boosting tree model in order to automatically extract features that improve booking intent and value prediction. We discover a pragmatic sweet spot between expensive complex deep neural networks and shallow neural networks that improves the shopping intent prediction model by seven percent. We explain how our method is lightweight in terms of computational resources and easily deployable into a large-scale production system.
Four areas of study relate to this paper: probabilistic customer lifetime value, machine learning in computational advertising and click-through rate (CTR) prediction, embeddings, and deep neural networks. We compare and contrast this study with relevant studies in each of these areas in the following subsections.
I-A Probabilistic Customer Lifetime Value
Historically, businesses have leveraged approaches to value their customers in order to optimize their return on advertising spend (ROAS) [1], particularly for traditional media, affiliate marketing, display ads, and search engine advertising. Industry adopted probabilistic and statistical learning approaches because they are explainable, work with small amounts of data, and fit within computational restrictions. These models place simplifying assumptions on the time between purchases, time until the churn event, purchase count, and purchase value, proposing the following structures: the Bayesian Pareto/Negative Binomial Distribution (NBD) baseline, Beta-Geometric (BG)/NBD, Gamma/Gompertz, and various enhancements on them [2] [3] [4] [5] [6]. As computational power allowed, more complex approaches in the same vein, such as the Bayesian Hierarchical Hidden Markov Model (HHMM), were proposed [7]. While these studies open up a path for valuing customers, they might not be the right approach for re-targeting in a massive online marketplace with millions of travelers and listings, where travelers are heterogeneous, seek variety in experience, do not identify themselves, and book infrequently. In such an environment, we need approaches that are not only scalable, but also leverage various pre-purchase signals, such as browsing behavior, to predict shopping intent and value. Furthermore, a traveler's preferences shift from trip to trip, making historical behavior less relevant. Machine learning approaches have a relatively strong track record in handling such cases.
I-B Machine learning in computational advertising
In more recent years, industry practitioners have leveraged machine learning approaches for re-targeting at scale. Due to their ability to generate superior predictions, tree-based approaches, such as random forests and gradient boosted trees, were the first to gain popularity [8] [9] [10]. However, these approaches required significant feature engineering effort, making them hard to generalize and expensive to maintain. To overcome this limitation, many studies began to leverage deep neural networks to automatically generate latent features in this domain [11] [12] [13] [14] [15] [16]. While these approaches have achieved significant improvements, they take a long time to train on the multi-million by multi-million sparse space of the traveler and listing graph. There might be a sweet spot between tree-based methods and deep neural networks that delivers better results under computational budget constraints. Next, we review embedding methods, which are reputed for handling sparse spaces, and then we review the history of deep neural networks to introduce the space we explore in this study.
I-C Embedding methods
The data sparsity problem has long been studied in the recommendation system literature. Collaborative filtering methods project, or embed, the user-item matrix into a lower dimensional space through matrix decomposition [17] [18] [19] [20] [21] [22]. More recently, session data has been proposed as an alternative to purchase data for collaborative filtering methods, generating a new trend in the recommendation system literature that focuses on session-based recommendation systems (SBRS) [23] [24]. In essence, the methods used in SBRS extend natural language processing (NLP) methods to predict the context of user browsing, defined by the items the user views before or after a given item, using neural network methods [25]. The advantage of these methods is that they can project millions of items into a lower dimensional space in which contextually similar items appear close to each other [26] [22] [27] [28] [29] [30] [31] [32]. These lower dimensional representations can be leveraged as automatically generated features in shopping intent and valuation models conditional on the user activity [33] [34]. Many of these studies extend the embedding space from shallow to deep neural network models.
I-D Deep Neural Network
Due to their success in the text and image domains, deep neural networks have recently gained popularity in recommendation systems and embedding spaces [35] [36] [37]. Recurrent neural networks have been shown to be effective in capturing temporal dependencies between user item views [38] [39] [40] [35] [32], and convolutional neural networks have shown success in capturing latent intent structure in item images [41] [40] [42]. Recently, a new architecture called the attention network has gained popularity due to its ability to automatically reweigh all the signals that users capture, resembling user memory attention [43] [44] [45] [46]. In this study, we evaluate the merits of various candidate solutions, such as Deep Average Networks, Long Short Term Memory, and Attention networks, in adding value to shopping intent and value prediction. The closest study to this paper is [47], which suggests using embeddings in customer lifetime value prediction. However, our paper differentiates itself by extending the embedding features from linear to non-linear spaces, finding a sweet spot with low cost and high value in combining a simple deep neural network with a shallow neural network and a tree-based method to predict shopping intent and value.
II Notations and Problem Formulation
We formalize the personalized re-targeting problem as follows. The traveler $t$ performs a set of activities $a_1, \dots, a_n$ in a traveler shopping session $s$ when they visit our two-sided platform, where $n$ is the length of the sequence. An activity here could be either a click or a page view. We also represent each session as a set of listings that the traveler interacts with, $s = \{l_1, \dots, l_m\}$, where $l_i \in \mathcal{L}$. At the end of multiple visit sessions the traveler may either make a booking, represented by $y = 1$, or they might leave the platform for another time, represented by $y = 0$. We denote the probability of booking conditional on historical session context with $P(y = 1 \mid s)$, and the conditional probability of booking value with $P(v \mid y = 1, s)$. For re-targeting returning travelers, we want to estimate the distribution of both the booking event and the booking value.
In an advertising real-time bidding (RTB) system (such as Bing Ads, Google AdWords, and Criteo), a quantitative bidding function for bid utility can be boiled down to two components: the estimated utility of the ad opportunity and the estimated cost to win it [1]. The hypothesis is that all other information and the bid price are conditionally independent given these two components. Further, we can decompose the utility of bidding into two components: first the on-site conversion $c$, and second the marginal value $m$; in summary, $u = c \cdot m$. Ideally, we should optimize this objective function under budget constraints, yet we focus on estimating the utility in this paper. In particular, our methodology estimates $P(y = 1 \mid s)$ and $P(v \mid y = 1, s)$ to extract $c$ and $m$, respectively, to re-target travelers for their next visit in the shopping funnel. We will not cover other aspects of bid optimization here. We propose a hybrid deep learning framework deployed in an end-to-end fashion in order to solve the above challenges.
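As a rough illustration of the decomposition above, the following sketch multiplies a predicted booking probability by a predicted booking value (conditional on booking) to obtain an expected utility per traveler. The function name `expected_utility` and the sample numbers are hypothetical and are not taken from the paper.

```python
def expected_utility(p_booking: float, value_given_booking: float) -> float:
    """Expected utility of re-targeting one traveler.

    p_booking: predicted probability that the traveler books (0..1).
    value_given_booking: predicted booking value, conditional on a booking.
    """
    if not 0.0 <= p_booking <= 1.0:
        raise ValueError("p_booking must be a probability in [0, 1]")
    return p_booking * value_given_booking


# Hypothetical travelers: (booking probability, value if they book).
travelers = {"early_browser": (0.02, 1200.0), "late_funnel": (0.30, 800.0)}
utilities = {name: expected_utility(p, v) for name, (p, v) in travelers.items()}
```

Under these made-up inputs the late-funnel traveler carries roughly ten times the expected utility of the early browser, which is the kind of gap a bidding system could act on.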
III Model
In this section, we describe our solutions to the personalized re-targeting problem. Our solution has two modular parts: the conversion prediction and marginal value prediction. We use the XGBoost algorithm in both of the components. Later, we will focus on describing our ongoing efforts to supplement the handcrafted features in the deployed systems with automatic feature learning using traveler embeddings. The traveler embeddings are generated based on a two-stage neural network model described below.
III-A Traveler Booking Intent and Booking Value XGBoost Models
We used XGBoost, a scalable machine learning system for gradient tree boosting [48] [9] in our solutions for both modular parts. There are a few advantages of using XGBoost for this problem:
- Tree boosting has been shown to give state-of-the-art results on many standard classification benchmarks.
- Parallel and distributed computation and capability to handle sparse data.
- Deployment of XGBoost in an end-to-end system efficiently scales to large data sets.
Feature engineering and domain-dependent data analysis play a pivotal role in this solution. For this model, we have created a rich set of handcrafted features using session data that includes user onsite interactions, e.g. search, property listing page, checkout flow and contacts (inquiries and messages). An in-house model was used to remove bot traffic. The session data are further aggregated across the entire period for each traveler to obtain the final set of features and labels $(x_i, y_i)$, where $y_i$ is the binary booking indicator in the booking intent model and the continuous booking value in the booking value model.
Within the XGBoost framework, the objective is to use $K$ additive functions to predict the output $\hat{y}_i = \sum_{k=1}^{K} f_k(x_i)$, where $f_k \in \mathcal{F}$ and $\mathcal{F}$ is the space of classification and regression trees. The following regularized objective function is optimized in an incremental and greedy fashion:

$$\mathcal{L}^{(t)} = \sum_{i=1}^{n} l\left(y_i,\ \hat{y}_i^{(t-1)} + f_t(x_i)\right) + \Omega(f_t) \qquad (1)$$

where $\hat{y}_i^{(t)}$ is the prediction of the $i$-th instance at the $t$-th iteration, $l$ is the loss function (log likelihood) that measures the difference between the prediction and the target, and $\Omega$ is a regularization term which penalizes the complexity of the tree functions. XGBoost uses a second-order approximation to fit a base learner that minimizes (1) at each iteration. In addition, shrinkage and column sub-sampling are employed in XGBoost to avoid over-fitting.
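To make the second-order step concrete, here is a minimal sketch of how a single leaf weight follows from first and second derivatives of the logistic loss. It assumes the L2 leaf penalty from the XGBoost paper, under which the optimal weight of a leaf is $-G / (H + \lambda)$ with $G$ and $H$ the summed gradients and Hessians of the instances in that leaf. The function names and the toy labels are hypothetical.

```python
import math


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def logloss_grad_hess(label: int, margin: float) -> tuple[float, float]:
    """Gradient and Hessian of the logistic loss w.r.t. the raw margin."""
    p = sigmoid(margin)
    return p - label, p * (1.0 - p)


def optimal_leaf_weight(grads, hessians, reg_lambda: float = 1.0) -> float:
    """Leaf weight minimizing the second-order approximation of the objective."""
    return -sum(grads) / (sum(hessians) + reg_lambda)


# Toy leaf holding four travelers (1 = booked); previous prediction is margin 0.
labels = [1, 0, 1, 1]
stats = [logloss_grad_hess(y, 0.0) for y in labels]
leaf_weight = optimal_leaf_weight(
    [g for g, _ in stats], [h for _, h in stats], reg_lambda=1.0
)
```

With three of four travelers booking, the leaf weight comes out positive, nudging the predicted margin upward for travelers that fall in this leaf.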
For the prediction of traveler booking intent, we use a binary classification framework that applies a sigmoid function to return a probability value. For the prediction of booking value conditional on booking, we use a regression framework that applies a log transformation to the target to improve accuracy.
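The two output transformations can be sketched as follows. This is only an illustration: the paper states that a log transformation is used but not which variant, so `log1p`/`expm1` is an assumption here, and the function names are hypothetical.

```python
import math


def booking_probability(margin: float) -> float:
    """Map a raw classifier margin to a booking probability via the sigmoid."""
    return 1.0 / (1.0 + math.exp(-margin))


def to_log_value(booking_value: float) -> float:
    """Regression target: log(1 + value), assumed variant of the log transform."""
    if booking_value < 0:
        raise ValueError("booking value must be non-negative")
    return math.log1p(booking_value)


def from_log_value(prediction: float) -> float:
    """Invert the transform to report a prediction in currency units."""
    return math.expm1(prediction)


round_trip = from_log_value(to_log_value(1250.0))
```

Training on the log scale compresses the long right tail of booking values; predictions are mapped back with the inverse transform before they are used downstream.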
III-B Skip-gram Sequence Model
To learn features from travelers' session sequences that augment the handcrafted features, we used a skip-gram model. In our context, the model attempts to predict the listings $l_{i-j}$ and $l_{i+j}$ viewed before and after a given listing $l_i$ in a traveler session $s$, based on the premise that a traveler's views of listings in the same session signal the similarity of those listings. We use a shallow neural network with one hidden layer of lower dimension for this purpose. The training objective is to find the local listing representation that best identifies the most similar surrounding listings. More formally, the objective function can be specified by the log probability maximization problem as follows:

$$\max \sum_{s \in \mathcal{S}} \sum_{l_i \in s} \sum_{-c \le j \le c,\ j \ne 0} \log p(l_{i+j} \mid l_i)$$

where $c$ is the window size representing the listing context. The basic skip-gram formulation defines $p(l_{i+j} \mid l_i)$ using the softmax function as follows:

$$p(l_{i+j} \mid l_i) = \frac{\exp\left(\mathbf{v}'^{\top}_{l_{i+j}} \mathbf{v}_{l_i}\right)}{\sum_{l=1}^{L} \exp\left(\mathbf{v}'^{\top}_{l} \mathbf{v}_{l_i}\right)}$$

where $\mathbf{v}_l$ and $\mathbf{v}'_l$ are the input and output representation vectors (the neural network weights) of listing $l$, and $L$ is the number of listings available on our platform. To simplify the task, we used the sigmoid formula, which makes the model a binary classifier, with negative samples drawn randomly from the list of all available listings on our platform [25]. Formally, we use $\frac{1}{1 + \exp\left(-\mathbf{v}'^{\top}_{l_{i+j}} \mathbf{v}_{l_i}\right)}$ for positive samples, and $\frac{1}{1 + \exp\left(\mathbf{v}'^{\top}_{l_k} \mathbf{v}_{l_i}\right)}$ for a negative sample $l_k$.
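The skip-gram setup with negative sampling can be sketched in a few lines. This is a minimal illustration under assumptions, not the production code: the helper names, the toy session, and the uniform negative sampler are all hypothetical, and a real system would train the vectors with an optimizer rather than only evaluate the loss.

```python
import math
import random


def context_pairs(session, window):
    """(center, context) listing pairs within `window` positions of each other."""
    pairs = []
    for i, center in enumerate(session):
        for j in range(max(0, i - window), min(len(session), i + window + 1)):
            if j != i:
                pairs.append((center, session[j]))
    return pairs


def sample_negatives(all_listings, exclude, k, rng):
    """Draw k listings uniformly at random, skipping the ones in `exclude`."""
    candidates = [l for l in all_listings if l not in exclude]
    return [rng.choice(candidates) for _ in range(k)]


def log_sigmoid(z):
    """Numerically stable log(sigmoid(z))."""
    return -math.log1p(math.exp(-z)) if z >= 0 else z - math.log1p(math.exp(z))


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def sgns_loss(center_vec, context_vec, negative_vecs):
    """Negative log-likelihood of one positive pair plus its negative samples."""
    loss = -log_sigmoid(dot(center_vec, context_vec))
    for neg in negative_vecs:
        loss -= log_sigmoid(-dot(center_vec, neg))
    return loss


session = ["l1", "l2", "l3"]
pairs = context_pairs(session, window=1)
negatives = sample_negatives(
    ["l1", "l2", "l3", "l4", "l5"], exclude=set(session), k=2, rng=random.Random(0)
)
```

Each positive pair contributes one sigmoid term and each negative sample another, so the cost per training example depends on the number of negatives rather than on the total number of listings, which is what makes this cheaper than the full softmax.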
We have two more issues to address: sparsity and heterogeneity in views per item. It is not uncommon to observe a long-tail distribution of views across listings. To handle this, we leverage the approach of [25], wherein especially frequent items are downsampled using the inverse square root of their frequency. Additionally, we removed listings with very low frequency. To resolve the cold start issue, we leverage the contextual information that relates destinations (or search terms) to listings based on booking information. Formally, considering that destination $d$ drives a proportion $\pi_{d,l}$ of the demand for a given listing $l$, we form the expectation of the latent representation for each location using $\mathbf{v}_d = \frac{1}{Z_d} \sum_{l} \pi_{d,l}\, \mathbf{v}_l$, where $Z_d$ is the normalizing factor. Then, given the latitude and longitude of the cold listing $l_c$ (for which we have no data), we form a belief $\hat{\pi}_{d,l_c}$ about the proportion of demand driven from each of the search terms. Then, we use the destination embeddings from the previous step to find the expected listing embedding for the cold listing as follows: $\mathbf{v}_{l_c} = \sum_{d} \hat{\pi}_{d,l_c}\, \mathbf{v}_d$.
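A small sketch of both ideas, under stated assumptions: the keep-probability follows the word2vec-style rule of [25] with a hypothetical threshold of 1e-3, and the cold-start step is modeled as two weighted averages (listings to destinations, then destinations to the cold listing). How the demand belief is derived from latitude and longitude is not specified in the text, so it is passed in here as given data; all names and numbers are illustrative.

```python
import math


def keep_probability(relative_freq: float, threshold: float = 1e-3) -> float:
    """Chance of keeping a view of a listing; frequent listings are kept less often."""
    return min(1.0, math.sqrt(threshold / relative_freq))


def weighted_average(vectors, weights):
    """Weighted mean of equal-length vectors, normalized by the total weight."""
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive number")
    dim = len(vectors[0])
    return [sum(w * v[d] for v, w in zip(vectors, weights)) / total for d in range(dim)]


# Step 1: destination embeddings as demand-weighted averages of listing embeddings.
listing_vecs = {"l1": [1.0, 0.0], "l2": [0.0, 1.0]}
demand_share = {"beach": {"l1": 0.75, "l2": 0.25}, "ski": {"l1": 0.0, "l2": 1.0}}
destination_vecs = {
    dest: weighted_average([listing_vecs[l] for l in shares], list(shares.values()))
    for dest, shares in demand_share.items()
}

# Step 2: cold listing embedding from a (given) belief over destinations.
cold_belief = {"beach": 0.5, "ski": 0.5}
cold_vec = weighted_average(
    [destination_vecs[d] for d in cold_belief], list(cold_belief.values())
)
```

The cold listing lands between the destinations that are believed to drive its demand, so it gets a usable embedding before any traveler has viewed it.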
III-C Deep Average Network (DAN) and Alternatives
In the second stage, given the listing embeddings from the previous stage, we model traveler embeddings using a sandwiched encoder-decoder with non-linear ReLU activations. In contrast to the relatively weak implicit view signals, in this stage we leverage strong booking signals as the target variable, based on historical traveler-listing interactions. We have various choices for this purpose, including a Deep Average Network with an auto-encoder-decoder, Long Short Term Memory (LSTM), and Attention Networks. The simplest approach is to take the point-wise average of the embedding vectors and use it directly in the model. The second approach is to feed the average embedding into a non-linear encoder-decoder architecture that first expands and then reduces dimensionality, i.e., a Deep Average Network, to extract the signals [35]. Hypothetically, this architecture may project the embeddings first into a larger space to isolate noise and then into a smaller space to remove it [49]. The third approach could incorporate LSTM networks [38] [12]. Hypothetically, this architecture may emulate the travelers' memory in gathering and forgetting signals in the shopping funnel [50]. The fourth approach could add an attention layer on top of the LSTM [45]. Hypothetically, this architecture may capture an extra step in travelers' trip booking behavior: putting different weights on the signals accumulated in their memory [51].
We take a probabilistic approach to model traveler booking events based on the embedding vectors of historical units they have interacted with. Formally, given the traveler embeddings (or last layer of the traveler booking prediction neural network), the probability of the booking is defined as:
[TABLE]
Then, the Deep Average Network layers are defined as:
[TABLE]
Alternatively, we can use an LSTM network with forget, input, and output gates as follows:
[TABLE]
And finally, we can also use an attention network on the top of LSTM network as follows:
[TABLE]
Here the weight matrices and bias vectors are the parameters to estimate, and the hidden-layer states are the functions to estimate.
Among these models, DAN is the most parsimonious, in keeping with Occam's razor, and the fastest to train. LSTM, and Attention Networks on top of it, are more theoretically appealing. From a pragmatic standpoint, however, DAN is the more appealing choice for deployment across millions of listings and travelers, as depicted in Figure 1. We train these neural networks with an adaptive stochastic gradient descent method, minimizing the binary cross-entropy loss.
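The forward pass of a Deep Average Network can be sketched as below. This is a minimal illustration under assumptions: one expansion layer and one reduction layer with ReLU activations, followed by a sigmoid booking output; the paper does not state the number or size of layers in this passage, and the parameter names and toy weights are hypothetical.

```python
import math


def average(vectors):
    """Point-wise average of the listing embeddings a traveler interacted with."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def dense(x, weights, bias):
    """Affine layer: one output per weight row."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, bias)]


def relu(x):
    return [max(0.0, xi) for xi in x]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def dan_forward(listing_vecs, params):
    """Return (traveler embedding, booking probability)."""
    avg = average(listing_vecs)
    expanded = relu(dense(avg, params["W_expand"], params["b_expand"]))  # wider layer
    traveler_emb = relu(dense(expanded, params["W_reduce"], params["b_reduce"]))  # narrower
    logit = sum(w * e for w, e in zip(params["w_out"], traveler_emb)) + params["b_out"]
    return traveler_emb, sigmoid(logit)


params = {
    "W_expand": [[1.0, 0.0], [0.0, 1.0], [1.0, -1.0]],  # 2 -> 3 dimensions
    "b_expand": [0.0, 0.0, 0.0],
    "W_reduce": [[0.5, 0.0, 0.0], [0.0, 0.5, 1.0]],     # 3 -> 2 dimensions
    "b_reduce": [0.0, 0.0],
    "w_out": [0.0, 0.0],
    "b_out": 0.0,
}
traveler_emb, p_booking = dan_forward([[1.0, 3.0], [3.0, 1.0]], params)
```

In this sketch the output of the reduction layer serves as the traveler embedding, while the sigmoid output is only needed during training against the booking label.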
Table I summarizes the notation used in this section. In the next section we review how to deploy this model end to end.
IV System Overview
The high-level real-time system architecture is shown in Figure 2. Each traveler's interaction (listing view, dated search, etc.) leads to an event that is transformed into handcrafted features and, if there is a listing view, traveler embeddings. The Model Stream Processor, which has loaded the Traveler Booking Intent Classification and Booking Value Regression models, consumes the handcrafted features and traveler embeddings for each traveler, and calls the prediction functions of the two models. Predictions are consumed by the Bidding Optimizer. More specifically, we use thresholds to assign travelers to buckets $B_1, \dots, B_K$, where $K$ is the total number of buckets, based on their booking probability and value predictions. We tune the thresholds to evenly distribute the density across the buckets. As a result, the expected average booking Revenue Per Click (RPC) in the buckets is monotonically increasing by design. From an empirical standpoint, these buckets mirror the shopping funnel of the travelers, i.e. bucket $B_1$ contains travelers with the lowest predicted booking probability and values, resembling the discovery stage, whereas $B_K$ contains travelers that are close to the end of the booking funnel. The Bidding Optimizer generates higher ROI with the same budget when it is given these buckets with traveler identifiers, because it is guided to allocate financial resources to each bucket accordingly.
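The bucketing step can be sketched as follows. Two assumptions are made for illustration: the booking probability and value predictions are first combined into a single score (for example their product), and the thresholds are taken as empirical quantiles so that each bucket receives roughly the same number of travelers. The function names and scores are hypothetical.

```python
import bisect


def quantile_thresholds(scores, n_buckets):
    """Cut points that split the observed scores into equally populated buckets."""
    ordered = sorted(scores)
    n = len(ordered)
    return [ordered[(k * n) // n_buckets] for k in range(1, n_buckets)]


def assign_bucket(score, thresholds):
    """1-based bucket index; bucket 1 holds the lowest scores."""
    return bisect.bisect_right(thresholds, score) + 1


# Hypothetical per-traveler scores, e.g. booking probability times booking value.
scores = [8.0, 1.0, 6.0, 3.0, 5.0, 2.0, 7.0, 4.0]
thresholds = quantile_thresholds(scores, n_buckets=4)
buckets = [assign_bucket(s, thresholds) for s in sorted(scores)]
```

Because the cut points are quantiles of the score distribution, the buckets are ordered by score and evenly filled, which matches the monotonic-by-design property described above.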
For training, real-time handcrafted features and traveler embeddings are persisted in a Data Lake based on S3 (an Amazon Web Services cloud solution that provides object storage), where they are joined with booking data to get labels. Then, the XGBoost Traveler Booking Intent and Value models are trained on this data using H2O [52]. Another offline process frequently retrains the TensorFlow [53] Deep Average Network based traveler embeddings using historical data from that Data Lake, to address the seasonality of the travel industry and the cold start issue. Finally, traveler booking probabilities, values and embeddings are consumed by other systems and stakeholders apart from the Bidding Optimizer. Such systems include in-house recommender systems and email marketing related models.
V Experiments and Results
We compare the Traveler Booking Intent (TBI) accuracy-uplift of our Deep Average Network based approach to various baselines in this section. For offline evaluation, we merged the handcrafted features and the traveler embeddings, generated by all different model settings, and fed them to the TBI model.
V-A Methodologies
In this subsection, we describe three baseline methods that we compare against our proposed Deep Average Network (DAN) on top of Skip-Gram:
1. Random: a heuristic rule that chooses a random listing embedding among those listings a traveler has previously interacted with in the current session.
2. Averaging Embeddings: a simple point-wise averaging of the embeddings of listings a traveler has previously interacted with in the current session.
3. LSTM with Attention: a recurrent neural network, inspired by [38], [11] and [12], that uses LSTM units and an attention mechanism on top of them in order to combine the embeddings of listings a user has previously interacted with in the current session.
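The first two baselines are simple enough to sketch directly. The function names and the toy session are hypothetical; the listing embeddings are assumed to be plain lists of floats.

```python
import random


def random_embedding(session_vecs, rng):
    """Baseline 1: pick the embedding of one listing from the session at random."""
    return rng.choice(session_vecs)


def average_embedding(session_vecs):
    """Baseline 2: point-wise average of the session's listing embeddings."""
    n = len(session_vecs)
    return [sum(col) / n for col in zip(*session_vecs)]


session_vecs = [[1.0, 2.0], [3.0, 4.0]]
picked = random_embedding(session_vecs, random.Random(42))
averaged = average_embedding(session_vecs)
```

Neither baseline has trainable parameters, so any uplift from a learned model over the averaged embedding can be attributed to the layers added on top of the average.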
V-B Datasets
For the experiments, anonymized clickstream data was collected for millions of users from two different seven-day periods. The first dataset was used to generate embeddings using Deep Average Network and the LSTM with Attention. The second dataset was used to evaluate the learned embeddings on the Traveler Booking Intent Model.
V-C Results
We evaluated the performance of the Traveler Booking Intent model under the different settings using AUC, Precision, Recall and F1 scores. The best results of each model are shown in Table II. They show that our proposed Deep Average Network approach contributes more uplift to the TBI model than the baselines.
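For reference, the four metrics can be computed as in the sketch below. The labels, scores, and the 0.5 decision threshold are made up for illustration and are unrelated to the results in the tables; AUC is computed with the pairwise (Mann-Whitney) definition, counting ties as half.

```python
def precision_recall_f1(y_true, y_pred):
    """Precision, recall and F1 for binary labels and binary predictions."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


def auc(y_true, scores):
    """Probability that a random booker is scored above a random non-booker."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


y_true = [1, 0, 1, 0]
scores = [0.9, 0.8, 0.4, 0.1]
y_pred = [1 if s >= 0.5 else 0 for s in scores]
metrics = precision_recall_f1(y_true, y_pred)
area = auc(y_true, scores)
```

AUC is threshold-free, whereas precision, recall and F1 depend on the chosen cut-off, which is one reason to report them side by side.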
Moreover, Table III shows the TBI performance improvement when the DAN-generated traveler embeddings are merged with the initial handcrafted features. Our findings suggest that embeddings have predictive power comparable to that of handcrafted features.
VI Conclusion
In this paper, we introduced a hybrid deep learning framework for a massive vacation rental marketplace. Deployed in an end-to-end manner, this pragmatic framework aims to help solve challenges in prediction of traveler shopping journey within a re-targeting scope. Our results show that by leveraging neural network traveler embeddings trained on session logs we are able to enhance the prediction of our original booking intent model which used handcrafted features. We also find that there is a pragmatic sweet spot between expensive complex deep neural networks and simple shallow neural networks that can increase the performance of the boosting tree model, based on offline analysis. Furthermore, incremental complexity in this model enables extracting traveler embeddings, which can also be used for personalizing recommender systems. To further improve and extend this work, we are starting to explore options to transfer the learnings to other problems in the marketing domain which are subject to data sparsity issues. It can also be beneficial to infuse other contextual spatio-temporal information into our model to help drive a smooth and personalized full-cycle traveler booking experience.
VII Acknowledgments
This project is a collaborative effort between the recommendation, marketing data science and growth marketing teams. The authors would like to thank Chandri Krishnan, Andrew Reuben, Travis Brady, Wenjun Ke, Ali Miraftab and Ravi Divvela for their contribution to this paper.
[1] J. Wang, W. Zhang, S. Yuan et al., “Display advertising with real-time bidding (RTB) and behavioural targeting,” Foundations and Trends® in Information Retrieval, vol. 11, no. 4-5, pp. 297–435, 2017.
[2] P. S. Fader, B. G. Hardie, and K. L. Lee, “Counting your customers the easy way: An alternative to the Pareto/NBD model,” Marketing Science, vol. 24, no. 2, pp. 275–284, 2005.
[3] P. S. Fader and B. G. Hardie, “The Pareto/NBD is not a lost-for-good model,” 2016.
[4] P. Fader and B. G. Hardie, “Probability models for customer-base analysis,” Journal of Interactive Marketing, vol. 23, no. 1, pp. 61–69, 2009.
[5] N. Glady, B. Baesens, and C. Croux, “A modified Pareto/NBD approach for predicting customer lifetime value,” Expert Systems with Applications, vol. 36, no. 2, pp. 2062–2071, 2009.
[6] T. Y. Chan, C. Wu, and Y. Xie, “Measuring the lifetime value of customers acquired from Google search advertising,” Marketing Science, vol. 30, no. 5, pp. 837–850, 2011.
[7] O. Netzer, J. M. Lattin, and V. Srinivasan, “A hidden Markov model of customer relationship dynamics,” Marketing Science, vol. 27, no. 2, pp. 185–204, 2008.
[8] Y. Juan, Y. Zhuang, W.-S. Chin, and C.-J. Lin, “Field-aware factorization machines for CTR prediction,” in Proceedings of the 10th ACM Conference on Recommender Systems. ACM, 2016, pp. 43–50.
