Modeling Time to Open of Emails with a Latent State for User Engagement Level
Moumita Sinha, Vishwa Vinay, Harvineet Singh

TL;DR
This paper introduces a survival analysis framework using Cox Proportional Hazards and a mixture model to predict email open times, accounting for user engagement levels, and demonstrates improved accuracy on real-world marketing data.
Contribution
It extends CoxPH models with a latent state mixture approach to better capture user engagement variability in email open time prediction.
Findings
Mixture model outperforms standard models in accuracy.
Survival analysis jointly models open event and time-to-open.
Approach effective on large real-world marketing dataset.

| Dataset | #Recipients (millions) | #Emails (millions) |
|---|---|---|
| Training | 2.05 | 31.86 |
| Validation | 2.04 | 22.73 |
| Test | 2.22 | 19.14 |

| Abbreviation | Model |
|---|---|
| B | Baselines |
| LR | Logistic Regression (Classification) / Linear Regression (Time to Open) |
| CPH-L | CoxPH Model with relative hazard $\psi(X)=\exp(\beta^{T}X)$ |
| CPH-G | CoxPH Model with relative hazard $\psi(X)$ from a GBM |
| MM | Mixture Model with Proportional Hazards |

| Censoring Window | Metric | B | LR* | CPH-L | CPH-G | MM |
|---|---|---|---|---|---|---|
| 3 hours | AUC | 0.863 | 0.931 | 0.931 | 0.932 | 0.929 |
| 3 hours | MRAD(A) | 1.226 | 1.332 | 1.085 | 0.941 | 0.483 |
| 3 hours | MRAD(O) | 26.641 | 8.411 | 11.953 | 12.217 | 9.499 |
| 6 hours | AUC | 0.870 | 0.939 | 0.939 | 0.940 | 0.938 |
| 6 hours | MRAD(A) | 2.504 | 1.653 | 1.835 | 1.372 | 0.678 |
| 6 hours | MRAD(O) | 40.602 | 11.706 | 23.245 | 14.788 | 9.832 |
| 12 hours | AUC | 0.878 | 0.948 | 0.948 | 0.949 | 0.948 |
| 12 hours | MRAD(A) | 5.079 | 2.332 | 1.707 | 1.572 | 1.318 |
| 12 hours | MRAD(O) | 62.740 | 17.501 | 19.831 | 28.978 | 15.657 |

| Censoring Window | Model | MRAD(O) | | | | | |
|---|---|---|---|---|---|---|---|
| 3 hours | CPH-L | 11.952 | 12.608 | 13.929 | 16.744 | 23.462 | 25.716 |
| 3 hours | CPH-G | 12.217 | 14.593 | 18.501 | 25.407 | 26.629 | 26.641 |
| 3 hours | MM | 9.499 | 26.641 | 26.641 | 26.641 | 26.641 | 26.641 |
| 6 hours | CPH-L | 23.245 | 17.456 | 18.857 | 19.870 | 27.041 | 34.542 |
| 6 hours | CPH-G | 14.788 | 19.588 | 26.233 | 30.970 | 38.102 | 40.602 |
| 6 hours | MM | 9.832 | 12.483 | 40.602 | 40.602 | 40.602 | 40.602 |
| 12 hours | CPH-L | 19.831 | 34.632 | 27.229 | 32.510 | 40.356 | 40.750 |
| 12 hours | CPH-G | 28.978 | 31.866 | 29.986 | 34.590 | 57.086 | 61.040 |
| 12 hours | MM | 15.657 | 21.705 | 62.545 | 62.740 | 62.740 | 62.740 |

| Censoring Window | Model | AUC Mean | AUC StdDev | MRAD(O) Mean | MRAD(O) StdDev |
|---|---|---|---|---|---|
| 3 hours | LR* | 0.931 | 4e-5 | 8.215 | 0.036 |
| 3 hours | CPH-L | 0.931 | 2e-5 | 13.579 | 1.913 |
| 3 hours | CPH-G | 0.929 | 8e-3 | 13.746 | 0.743 |
| 3 hours | MM | 0.929 | 3e-4 | 9.277 | 1.854 |
| 6 hours | LR* | 0.939 | 4e-5 | 11.651 | 0.079 |
| 6 hours | CPH-L | 0.939 | 1e-5 | 20.301 | 2.486 |
| 6 hours | CPH-G | 0.939 | 3e-4 | 19.208 | 0.798 |
| 6 hours | MM | 0.938 | 2e-4 | 9.753 | 1.194 |
| 12 hours | LR* | 0.948 | 4e-5 | 17.514 | 0.096 |
| 12 hours | CPH-L | 0.948 | 7e-5 | 38.143 | 3.333 |
| 12 hours | CPH-G | 0.949 | 2e-4 | 29.455 | 1.071 |
| 12 hours | MM | 0.948 | 2e-4 | 15.444 | 1.683 |

| Censoring Window | Model | AUC | MRAD(O) |
|---|---|---|---|
| 3 hours | LR* | 0.937 | 7.753 |
| 3 hours | CPH-L | 0.937 | 10.009 |
| 3 hours | CPH-G | 0.934 | 13.787 |
| 3 hours | MM | 0.935 | 7.381 |
| 6 hours | LR* | 0.944 | 10.194 |
| 6 hours | CPH-L | 0.944 | 23.227 |
| 6 hours | CPH-G | 0.940 | 18.339 |
| 6 hours | MM | 0.942 | 9.340 |
| 12 hours | LR* | 0.952 | 13.409 |
| 12 hours | CPH-L | 0.952 | 29.131 |
| 12 hours | CPH-G | 0.950 | 21.080 |
| 12 hours | MM | 0.951 | 11.653 |

Modeling Time to Open of Emails with a Latent State for User Engagement Level
Moumita Sinha (Adobe Research), Vishwa Vinay (Adobe Research), and Harvineet Singh (Adobe Research)
(2018)
Abstract.
Email messages have been an important mode of communication, not only for work, but also for social interactions and marketing. When messages have time sensitive information, it becomes relevant for the sender to know what is the expected time within which the email will be read by the recipient. In this paper we use a survival analysis framework to predict the time to open an email once it has been received. We use the Cox Proportional Hazards (CoxPH) model that offers a way to combine various features that might affect the event of opening an email. As an extension, we also apply a mixture model (MM) approach to CoxPH that distinguishes between recipients, based on a latent state of how prone to opening the messages each individual is. We compare our approach with standard classification and regression models. While the classification model provides predictions on the likelihood of an email being opened, the regression model provides prediction of the real-valued time to open. The use of survival analysis based methods allows us to jointly model both the open event as well as the time-to-open. We experimented on a large real-world dataset of marketing emails sent in a 3-month time duration. The mixture model achieves the best accuracy on our data where a high proportion of email messages go unopened.
Keywords: Email interaction data, survival analysis, time-to-event prediction, enterprise email marketing, Cox-proportional hazards model
Published in WSDM 2018: The Eleventh ACM International Conference on Web Search and Data Mining, February 5–9, 2018, Marina Del Rey, CA, USA. DOI: 10.1145/3159652.3159683. ISBN: 978-1-4503-5581-0/18/02.
1. Introduction
Email has a rich history of being a data source for machine learning techniques. Starting with spam filtering (Cormack, 2008), the range of applications today covers a rich spectrum of scenarios. The Enron Corpus (Klimt and Yang, 2004) enabled research into the modeling of users’ interactions with email in a collaborative environment (Chapanond et al., 2005). For email service providers, detailed understanding of consumers’ interactions with the email system allows building predictive models for specific actions, e.g. if an email will be replied to or not (Yang et al., 2017) and creating rich experiences (Karagiannis and Vojnovic, 2009; Kannan et al., 2016) for the recipients. On the consumer side, given its popularity, there has been much work on different ways to handle large volumes of email effectively (Whittaker and Sidner, 1996). An early paper by Horvitz et al. (Horvitz et al., 1999) proposed that autonomous agents may be able to identify and prioritize emails that need attention. The authors of (Di Castro et al., 2016) show that historical data allows the prediction of what actions a user might take on the receipt of an email, for example, marking it for deletion (Dabbish et al., 2003). Apart from being a mode of communication, email is also used as a personal information management environment (Ducheneaut and Bellotti, 2001), leading to the need to support other forms of interactions like search (Narang et al., 2017).
The domain of interest in the current paper is marketing, where the email channel ranks high in popularity (VanBoskirk et al., 2011) alongside social media, search & display advertising. Email based marketing is predicted to have a compound annual growth rate of (VanBoskirk et al., 2011) and nearly every enterprise marketer uses it as a delivery channel (Tsirulnik, 2011; Chaffey, 2009). The engagement levels however are typically low, as compared to personal email messages at work or among friends. The open rates for the marketing email messages, vary by industry - ranging from to in the e-commerce, beauty and personal care, and gambling industries, and in the range of to in the hobbies, home and garden or health and fitness industries (Wells, 2016). Marketers are therefore always on the lookout for techniques that might enhance the engagement levels. For example, Kumar et al. (Kumar et al., 2014) modeled opt-in and opt-out behaviour and related these to transactions made by the consumer. Bonfrer et al. (Bonfrer and Drèze, 2009) proposed a framework that allows real-time evaluation of an email campaign.
In this submission, we propose the use of survival analysis for jointly modeling the open event on an email, as well as the time-to-open. The next section provides technical background to some important concepts in survival analysis that are relevant in the current scenario.
2. Survival Analysis
Survival analysis refers to an area of statistical modeling where the main variable of interest is the time to an event. Historically, the event is assumed to be death. One characteristic of data that makes the use of survival models appropriate is the presence of censoring: not all individuals will have experienced the event within the observation window. Censoring may arise because the event had not yet occurred at the time of analysis, or because the corresponding individual can no longer be tracked. Figure 1 is a pictorial representation of survival data in the context of emails. Observations are synchronized at $t=0$, which is the time at which the individuals receive the email. If the email is not read within a chosen time interval, e.g. a fixed number of hours, this would be a censored data point. And some recipients may of course not read the email at all.
Consider a random variable $T$ for the time to the event of interest, with the corresponding probability density function $f(t)$ and cumulative distribution function $F(t)$ at a given time $t$. Then the survivor function is defined as
$$S(t) = P(T > t) = 1 - F(t) \tag{1}$$
It represents the probability that an individual will survive beyond time $t$. Relatedly, given that the individual has not yet experienced the event till time $t$, the hazard function $h(t)$ represents the instantaneous chance of the event occurring at time $t$.
$$h(t) = \lim_{\Delta t \to 0} \frac{P(t \le T < t + \Delta t \mid T \ge t)}{\Delta t} = \frac{f(t)}{S(t)} \tag{2}$$
The relationship between the survivor function and the hazard function can be derived as $S(t)=\exp\{-H(t)\}$, where $H(t)=\int_{0}^{t} h(u)\,du$ is the cumulative hazard function corresponding to $h(t)$.
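The relationship between the survivor and hazard functions follows in a few lines. As a brief sketch, write $f(t)$ for the density, $S(t)$ for the survivor function, $h(t)=f(t)/S(t)$ for the hazard and $H(t)$ for the cumulative hazard (standard definitions, assumed here to match the paper's notation), and use $S(0)=1$:

```latex
\begin{aligned}
h(t) &= \frac{f(t)}{S(t)} = -\frac{S'(t)}{S(t)} = -\frac{d}{dt}\log S(t), \\
H(t) &= \int_{0}^{t} h(u)\,du = -\log S(t) + \log S(0) = -\log S(t), \\
S(t) &= \exp\{-H(t)\}.
\end{aligned}
```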
A survival analysis dataset containing $N$ individuals is represented as $\{(X_i, y_i, \delta_i)\}$, with $i = 1, \dots, N$. For the $i^{th}$ individual, $X_i$ is a vector of features that are believed to be predictive of the survival time. The target $y_i = \min(t_i, C)$ represents the survival time, where $C$ represents the duration of time for which the individual was observed and is also known as the censoring window. If observed within the censoring window, $t_i$ is the time to event for the $i^{th}$ individual. The indicator variable $\delta_i$ encodes if the individual experienced the event of interest within the censoring window.
$$\delta_i = \begin{cases} 1 & \text{if } t_i \le C \\ 0 & \text{otherwise} \end{cases} \tag{3}$$
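To make the censoring convention concrete, the following is a minimal sketch of how a raw time-to-open could be turned into a survival target and an event indicator. The function name `censor`, the use of hours as the unit, and the choice to record the censoring window as the target for unopened emails are illustrative assumptions, not details taken from the paper.

```python
def censor(time_to_open_hours, window_hours):
    """Return (target, event_indicator) for a single email.

    time_to_open_hours: hours between receipt and open, or None if the
        email was never opened (hypothetical encoding).
    window_hours: length of the censoring window.
    """
    opened_in_window = (
        time_to_open_hours is not None and time_to_open_hours <= window_hours
    )
    if opened_in_window:
        # Event observed: the target is the actual time to open.
        return time_to_open_hours, 1
    # Censored: all we know is that no open happened within the window.
    return window_hours, 0


# Four emails under a 3-hour window: two opens, one late open, one never opened.
raw_times = [0.5, 7.2, None, 2.9]
labels = [censor(t, 3.0) for t in raw_times]
```

An email opened after 7.2 hours and one that is never opened are indistinguishable under a 3-hour window; both are censored at 3.0.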
2.1. Cox Proportional Hazard Regression
Given a feature vector $X_i$ for the $i^{th}$ individual, the hazard function for the individual at any given time $t$ can be defined as
$$h(t \mid X_i) = h_0(t)\,\psi(X_i) \tag{4}$$
Here $h_0(t)$ is the baseline hazard function at time $t$, and $\psi(X_i)$ incorporates the dependence on the individual-specific features $X_i$, which are independent of time. The specific factorization of $h(t \mid X_i)$ into a global time-dependent component ($h_0(t)$) and an individual's time-independent factor ($\psi(X_i)$) is the Proportional Hazards assumption; Section 3.2 provides a methodology to validate this assumption on a given dataset. What has been defined above is a semi-parametric approach, in that no assumptions have been made about the shape of the baseline hazard function $h_0(t)$. The parametric alternative would be to impose a functional form, e.g. a Weibull distribution. Based on the relation between the survivor and hazard functions, the survivor function of the $i^{th}$ individual for Cox Proportional Hazard (CoxPH) regression is
$$S(t \mid X_i) = \exp\{-H_0(t)\,\psi(X_i)\} = S_0(t)^{\psi(X_i)} \tag{5}$$
where $H_0(t)$ is the cumulative baseline hazard and $S_0(t)=\exp\{-H_0(t)\}$ is the baseline survivor function.
The corresponding partial likelihood function (Cox, 1972) is defined as
$$L(\beta) = \prod_{i:\,\delta_i = 1} \frac{\psi(X_i)}{\sum_{j \in R(t_i)} \psi(X_j)} \tag{6}$$
where the function $\psi$ has been parameterized by $\beta$ that controls the combination of the features. $R(t_i)$ is the set of individuals who are at-risk of the event at time $t_i$, that is, the set of individuals for whom the event has not occurred yet. $t_i$ is also the observed time to event of the $i^{th}$ individual. Note that the numerator of the likelihood is a function of only the individuals that observed the event, and censored individuals only contribute to the denominator of Equation 6. The $\beta$ values are estimated by maximizing the above likelihood using a gradient based method.
The most common form is $\psi(X_i)=\exp(\beta^{T}X_i)$, where $\beta$ is a vector of parameters controlling the dependence between the features in $X_i$ and the target $y_i$. Doing so assumes a linear scaling of the relative (log) hazards of different individuals with respect to the values of the features. Ridgeway (Ridgeway, 1999) proposed that the likelihood in Equation 6 can alternatively be optimized directly using gradient boosting methods that might provide benefits in scenarios where the effect of the features is non-linear. Note that this is still a Proportional Hazards model, but with $\psi(X_i)$ taken to be the output of a gradient boosting machine (GBM).
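As an illustration of the linear case, the sketch below evaluates the negative log partial likelihood and the CoxPH survivor function in plain Python. It assumes the standard form in which each observed event contributes its own relative hazard divided by the summed relative hazards of everyone still at risk, with ties handled by giving each tied event the full risk set (Breslow-style). The function names and the toy data are hypothetical, and a real fit would hand this objective to a gradient-based optimizer.

```python
import math


def linear_score(beta, x):
    """Log relative hazard beta^T x for one individual."""
    return sum(b * v for b, v in zip(beta, x))


def neg_log_partial_likelihood(beta, X, y, delta):
    """Negative log partial likelihood for psi(x) = exp(beta^T x).

    X: list of feature vectors, y: observed times, delta: event indicators.
    """
    scores = [linear_score(beta, x) for x in X]
    nll = 0.0
    for i in range(len(X)):
        if delta[i] == 0:
            continue  # censored rows never appear in the numerator
        # Risk set: everyone whose observed time is at least y[i].
        risk = [scores[j] for j in range(len(X)) if y[j] >= y[i]]
        m = max(risk)  # log-sum-exp shift for numerical stability
        log_denominator = m + math.log(sum(math.exp(s - m) for s in risk))
        nll -= scores[i] - log_denominator
    return nll


def cox_survival(baseline_survival_at_t, beta, x):
    """Survivor function S_0(t) ** exp(beta^T x) at a single time point."""
    return baseline_survival_at_t ** math.exp(linear_score(beta, x))


# Toy data: three individuals with one feature each, all events observed.
X = [[1.0], [0.0], [2.0]]
y = [1.0, 2.0, 3.0]
delta = [1, 1, 1]
loss_at_zero = neg_log_partial_likelihood([0.0], X, y, delta)
```

With all coefficients at zero every individual has the same relative hazard, so the loss reduces to the sum of the log risk-set sizes, here $\log 3 + \log 2 + \log 1$.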
2.2. Mixture Model with Cox Proportional Hazard Regression
The CoxPH model assumes that all individuals will eventually experience the event. But there may be a proportion of individuals who are not prone to the event, i.e., who are not predisposed to opening emails. The level at which an individual user is engaged with marketing messages influences his/her act of opening the email (and how quickly). The CoxPH model described earlier tries to explain all the observations using only the features ($X_i$) as the explanatory factors. Through the use of mixture models (Farewell, 1982; Branders et al., 2015), we might expect to get more discriminatory power. The $i^{th}$ individual is now additionally associated with $L_i$, a latent indicator variable such that
$$L_i = \begin{cases} 1 & \text{if the individual is prone to the event} \\ 0 & \text{otherwise} \end{cases} \tag{7}$$
$Z_i$ is a set of features that help predict if an individual is prone to the event of interest or not. The feature set $Z_i$ can also be the same as the feature set $X_i$.
$$\pi(Z_i) = P(L_i = 1 \mid Z_i) = \frac{1}{1 + \exp(-\gamma^{T} Z_i)} \tag{8}$$
The probability $\pi(Z_i)$ is estimated using logistic regression here, and is introduced as a mixture probability into the overall survivor function:
$$S(t \mid X_i, Z_i) = \pi(Z_i)\,S(t \mid L=1, X_i) + \big(1 - \pi(Z_i)\big) \tag{9}$$
If the individual is predisposed to not experiencing the event, then $\pi(Z_i) \approx 0$, leading to a prediction of a survival probability close to $1$. Conversely, a scenario with $\pi(Z_i) \approx 1$ leads to the first term dominating, with the quantity $S(t \mid L=1, X_i)$ representing the survival probability in the traditional sense. A proportional hazards assumption can be encoded by setting $S(t_i \mid L=1, X_i)=S_0(t_i)^{\exp(\beta^{T}X_i)}$ as before. The likelihood of the model is given by:
$$\mathcal{L} = \prod_{i=1}^{N} \big[\pi(Z_i)\,f(t_i \mid L=1, X_i)\big]^{\delta_i} \big[\big(1 - \pi(Z_i)\big) + \pi(Z_i)\,S(t_i \mid L=1, X_i)\big]^{1-\delta_i} \tag{10}$$
where $f(t \mid L=1, X_i)$ is the density corresponding to $S(t \mid L=1, X_i)$, $\delta_i$ indicates whether the event was observed, and $t_i$ is the observed time (the time to event if $\delta_i = 1$, the censoring time otherwise).
Since there are latent variables (the $L_i$), the optimization is an Expectation Maximization based iterative procedure that estimates the $L_i$, along with $\gamma$ (for calculating $\pi(Z_i)$) and $\beta$ controlling how the features of an individual affect the relative hazards. In the current setting, we are interpreting $L_i$ as the engagement level of a given user $i$; the model however is more general. For example, it can be used to represent the probability that a patient has been cured, which in turn affects the chances that he/she will experience the event.
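The sketch below shows the mixture idea numerically. It assumes the usual cure-model form, in which the overall survival probability is the engaged-user survival weighted by the probability of being prone to the event, plus the probability of not being prone at all. It also includes the posterior weight that an Expectation Maximization step would typically assign to the latent state. All function names are hypothetical, and this is an illustration of the general technique rather than the authors' implementation.

```python
import math


def engagement_probability(gamma, z):
    """Logistic model for the probability that a user is prone to opening."""
    linear = sum(g * v for g, v in zip(gamma, z))
    return 1.0 / (1.0 + math.exp(-linear))


def mixture_survival(pi, survival_if_engaged):
    """Overall survival: engaged users follow the CoxPH survivor function,
    non-engaged users never experience the event (survival of 1)."""
    return pi * survival_if_engaged + (1.0 - pi)


def posterior_engaged(pi, survival_if_engaged, delta):
    """E-step weight: probability the user is engaged given the outcome."""
    if delta == 1:
        # An observed open can only come from an engaged user.
        return 1.0
    # Censored: engaged-but-not-yet-opened versus not engaged at all.
    return pi * survival_if_engaged / mixture_survival(pi, survival_if_engaged)


pi = engagement_probability([0.0], [1.0])   # zero weights give 0.5
overall = mixture_survival(pi, 0.5)          # 0.5 * 0.5 + 0.5
weight = posterior_engaged(pi, 0.5, 0)       # 0.25 / 0.75
```

For a censored email the posterior weight falls below the prior: not having opened within the window is mild evidence that the recipient is not engaged.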
2.3. Related Work
Survival analysis has traditionally been used in the health-care domain to determine the time to ‘death’ in patients, but the usage of this range of techniques has recently expanded to other application areas (Wang et al., 2017). Examples include prediction of early student dropouts (Ameri et al., 2016), post-click engagement on native ads (Barbieri et al., 2016), query specific micro-blog ranking for improved retrieval (Efron, 2012), recommender systems in e-commerce (Wang and Zhang, 2013), search engine evaluation via the use of “absence time” (Chakraborty et al., 2014), and predicting time for crowd-sourced tasks (Lease et al., 2011).
By appropriately defining the event being modeled, existing marketing concepts also lend themselves to survival analysis techniques. For example, re-purchasing behavior is an indicator of high engagement (Lee et al., 2012) and a proxy for the potential value of a customer (Drye et al., 2001; Lu and Park, 2003). Attrition modeling helps businesses identify customers who are most at-risk so that attempts can be made to keep them in the system, and Lee et al. (Lee et al., 2012) propose a survival analysis based solution.
Much of the literature referred to above involves applying well-known and established models (like CoxPH) in different scenarios. But more recently, growing interest in the use of survival analysis has led to modeling improvements. For instance, when modeling time-to-event of related tasks, the parameters of the different models can be more reliably estimated using regularization techniques commonly used in multi-task learning (Li et al., 2016). Even in traditional application areas of survival analysis, given a large number of data points and a variety of features that potentially have a highly non-linear dependence on the time-to-event, deep latent models provide better performance (Ranganath et al., 2016).
The closest related work to that presented here is described in (Dave et al., 2017), where time-to-event is modelled in the email domain. Given this context, the contribution of the current paper is two-fold: (1) we describe techniques from the rich history of survival analysis to identify those models whose assumptions are better matched with the characteristics of the data, and (2) for the application of predicting time-to-event when the censored rows dominate, the mixture model (MM) described above is shown to not only describe the data better but also provide better predictive performance.
3. Problem Definition and Data Description
When emails containing time sensitive information are sent, it may be relevant for the sender to know what is the expected time within which the email will be read by the recipient. Specifically in marketing messages, if the email advertises a flash sale, the marketer will need to decide on the time window for the sale - to optimize between reaching sufficient consumers within the window and yet keep it exclusive. Prediction of time-to-open of an email by a consumer helps to determine the size of the recipient list one wants to reach.
Our dataset corresponds to email marketing campaigns that are sent out to consumers of an enterprise and we are interested in a predictive model that answers questions of two types: (a) Is a particular email likely to be opened by a given recipient? (b) Can we predict the time within which the email will be opened?
In the dataset, there is a high degree of variability amongst the marketing messages - some are sent to a large group of recipients, while others are targeted at a narrow set of consumers - e.g. a personalized birthday communication. We expect that the nature of people’s interaction with these different types of emails varies drastically. In particular, we are interested in modeling how people differ in terms of their engagement with the mass marketing emails. For this reason, the analysis presented here includes only those emails that were sent to at least of the total consumers. We have additionally dropped those consumers who received fewer than messages during the period of interest.
The time at which an email reaches a consumer is labelled as its start-time. In the event that the email is read, the email has a corresponding open-time. The difference between the two time-stamps is referred to as the time-to-open. The emails are divided into 3 non-overlapping buckets based on the start-time: a Training dataset (spanning 4 weeks) and one dataset each for Validation and Test (spanning 3 weeks each). Table 3 shows the size of each of these datasets. Chronologically ordered, these 3 datasets cover 13 weeks of email messages with a week gap between Validation and Test. Within each group, data from the initial two weeks are used to compute features that will be used to model users’ interaction with emails sent in the subsequent week(s).
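A minimal sketch of the bookkeeping described above: computing the time-to-open from two timestamps, and deciding whether an email falls in the feature-computation window (the initial two weeks of a group) or the subsequent modeling window. The function names, the hour unit and the example dates are hypothetical; the paper does not specify calendar dates.

```python
from datetime import datetime, timedelta


def time_to_open_hours(start_time, open_time):
    """Hours between delivery and open, or None if the email was never opened."""
    if open_time is None:
        return None
    return (open_time - start_time).total_seconds() / 3600.0


def assign_window(start_time, group_start, feature_weeks=2):
    """Label an email by where its start-time falls within a dataset group."""
    boundary = group_start + timedelta(weeks=feature_weeks)
    return "features" if start_time < boundary else "modeling"


group_start = datetime(2017, 1, 2)                 # hypothetical group start
delivered = datetime(2017, 1, 10, 9, 0)
opened = datetime(2017, 1, 10, 11, 30)

hours = time_to_open_hours(delivered, opened)       # 2.5
early = assign_window(delivered, group_start)       # first two weeks
late = assign_window(datetime(2017, 1, 20), group_start)
```

Keeping the feature window strictly earlier than the modeling window means features are computed only from interactions that precede the emails being predicted.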
References
- Ameri et al. (2016) Sattar Ameri, Mahtab J. Fard, Ratna B. Chinnam, and Chandan K. Reddy. 2016. Survival Analysis Based Framework for Early Prediction of Student Dropouts. In Proceedings of the 25th ACM International on Conference on Information and Knowledge Management (CIKM ’16). ACM, New York, NY, USA, 903–912. https://doi.org/10.1145/2983323.2983351
- Barbieri et al. (2016) Nicola Barbieri, Fabrizio Silvestri, and Mounia Lalmas. 2016. Improving post-click user engagement on native ads via survival analysis. In Proceedings of the 25th International Conference on World Wide Web. International World Wide Web Conferences Steering Committee, 761–770.
- Bonfrer and Drèze (2009) André Bonfrer and Xavier Drèze. 2009. Real-time evaluation of e-mail campaign performance. Marketing Science 28, 2 (2009), 251–263.
- Branders et al. (2015) Samuel Branders, Roberto D’Ambrosio, and Pierre Dupont. 2015. A mixture Cox-Logistic model for feature selection from survival and classification data. arXiv preprint arXiv:1502.01493 (2015).
- Burke et al. (1997) Harry B Burke, Philip H Goodman, David B Rosen, Donald E Henson, John N Weinstein, Frank E Harrell, Jeffrey R Marks, David P Winchester, and David G Bostwick. 1997. Artificial neural networks improve the accuracy of cancer survival prediction. Cancer 79, 4 (1997), 857–862.
- Chaffey (2009) D Chaffey. 2009. Mint.com used StrongMail Influencer to create this viral program. http://www.strongmail.com/pdf/sm_casestudy_mint.pdf. (2009).
- Chakraborty et al. (2014) Sunandan Chakraborty, Filip Radlinski, Milad Shokouhi, and Paul Baecke. 2014. On Correlation of Absence Time and Search Effectiveness. In Proceedings of the 37th International ACM SIGIR Conference on Research & Development in Information Retrieval (SIGIR ’14). ACM, New York, NY, USA, 1163–1166. https://doi.org/10.1145/2600428.2609535
