Sequential Counterfactual Risk Minimization
Houssam Zenati, Eustache Diemert, Matthieu Martin, Julien Mairal, Pierre Gaillard

TL;DR
This paper extends Counterfactual Risk Minimization to a sequential setting where policies can be deployed multiple times, introducing a new estimator and demonstrating improved theoretical and empirical performance.
Contribution
It proposes Sequential Counterfactual Risk Minimization (SCRM), extending CRM theory to multiple deployments and introducing a novel estimator for better performance.
Findings
Improved excess risk and regret rates with multiple deployments
Empirical validation in discrete and continuous action spaces
Demonstrated benefits of multiple policy deployments
Abstract
Counterfactual Risk Minimization (CRM) is a framework for dealing with the logged bandit feedback problem, where the goal is to improve a logging policy using offline data. In this paper, we explore the case where it is possible to deploy learned policies multiple times and acquire new data. We extend the CRM principle and its theory to this scenario, which we call "Sequential Counterfactual Risk Minimization (SCRM)." We introduce a novel counterfactual estimator and identify conditions that can improve the performance of CRM in terms of excess risk and regret rates, by using an analysis similar to restart strategies in accelerated optimization methods. We also provide an empirical evaluation of our method in both discrete and continuous action settings, and demonstrate the benefits of multiple deployments of CRM.
[TABLE: CRM vs. SCRM (ours), by percentage]
[TABLE: results on Pricing, Advertising, Yeast, TMC2007]
[TABLE: Baseline, SBPE, BKUCB, TRPO, PPO, CRM, SCRM (ours), Skyline on Pricing, Advertising, Scene, Yeast, TMC2007; SBPE and BKUCB: DNF on three datasets]
Machine Learning, ICML
1 Introduction
Counterfactual reasoning in the logged bandit problem has become a common task for practitioners in a wide range of applications such as recommender systems (Swaminathan & Joachims, 2015a), ad placements (Bottou et al., 2013) or precision medicine (Kallus & Zhou, 2018). Such a task typically consists of learning an optimal decision policy from logged contextual features and partial feedback induced by predictions from a logging policy. To do so, the logged data is originally obtained from a randomized data collection experiment. However, the success of counterfactual risk minimization is highly dependent on the quality of the logging policy and its ability to sample meaningful actions.
Counterfactual reasoning can be challenging due to large variance issues associated with counterfactual estimators (Swaminathan & Joachims, 2015b). Additionally, as pointed out by Bottou et al. (2013), confidence intervals obtained from counterfactual estimates may not be sufficiently accurate to select a final policy from offline data (Dai et al., 2020). This can occur when the logging policy does not sufficiently explore the action space. To address this, one option is to simply collect additional data from the same logging system to increase the sample size. However, it may be more efficient to use already collected data to design a better data collection experiment through a sequential design approach (Bottou et al., 2013, see Section 6.4). It is thus appealing to consider successive policy deployments when possible.
We tackle this sequential design problem and are interested in multiple deployments of the CRM setup of Swaminathan & Joachims (2015a), which we call sequential counterfactual risk minimization (SCRM). SCRM performs a sequence of data collection experiments by determining at each round a policy using data samples collected during previous experiments. The obtained policy is then deployed for the next round to collect additional samples. Such a sequential decision making system thus entails designing an adaptive learning strategy that minimizes the excess risk and expected regret of the learner. In contrast to the conservative learning strategy in CRM, the exploration induced by sequential deployments of enhanced logging policies should allow for improved excess risk and regret guarantees. Yet, obtaining such guarantees is nontrivial and we address it in this work.
In order to accomplish this, we first propose a new counterfactual estimator that controls the variance and analyze its convergence guarantees. Specifically, we obtain an improved dependence on the variance of importance weights between the optimal and logging policy. Second, leveraging this estimator and a weak assumption on the concentration of this variance term, we show how the error bound sequentially concentrates through CRM rollouts. This allows us to improve the excess risk bounds convergence rate as well as the regret rate. Our analysis employs methods similar to restart strategies in acceleration methods (Nesterov, 2012) and optimization for strongly convex functions (Boyd & Vandenberghe, 2004). We also conduct numerical experiments to demonstrate the effectiveness of our method in both discrete and continuous action settings, and how it improves upon CRM and other existing methods in the literature.
2 Related Work
Counterfactual learning from logged feedback (Bottou et al., 2013) uses only past interactions to learn a policy without interacting with the environment. Counterfactual risk minimization methods (Swaminathan & Joachims, 2015a, b) propose learning formulations using a variance penalization as in (Maurer & Pontil, 2009) to find policies with minimal variance. Even so, counterfactual methods remain prone to large variance issues (Dudík et al., 2014). These problems may arise when the logging policy under-explores the action space, making it difficult to use importance sampling techniques (Owen, 2013) that are key to counterfactual reasoning. While one could collect additional data to counter this problem, our method focuses on sequential deployments (Bottou et al., 2013, see Section 6.4) to collect data obtained from adaptive policies to explore the action space. Note also that the original motivation is related to but different from the support deficiency problem (Sachdeva et al., 2020), where the support of the logging policy does not cover the support of the optimal policy.
Another literature related to our framework is that of batch bandit methods. Originally introduced by Perchet et al. (2015) and then extended by Gao et al. (2019) in the multi-arm setting, batch bandit agents make decisions and only observe feedback in batches. This therefore differs from the classic bandit setting (Auer et al., 2002; Audibert et al., 2007) where rewards are observed after each action taken by an agent. Extensions to the contextual case have been proposed by Han et al. and could easily be kernelized (Valko et al., 2013). The sequential counterfactual risk minimization problem is thus closely related to this setting. However, major differences can be noted. First, SCRM does not leverage any problem structure as in stochastic contextual bandits (Li et al., 2010) by assuming a linear reward function (Chu et al., 2011; Goldenshluger & Zeevi, 2013; Han et al., ), nor does it use regression oracles as in (Foster & Rakhlin, 2020; Simchi-Levi & Xu, 2020). Second, deterministic decision rules taken by bandit agents (Lattimore & Szepesvari, 2019) do not allow for counterfactual reasoning or causal inference (Peters et al., 2017), unlike our framework, which performs sequential randomized data collection. Third, unlike the gradient-based methods used in counterfactual methods with parametric policies, batch bandit methods use zero-order methods to learn from data and require approximations to be scalable (Calandriello et al., 2020; Zenati et al., 2022).
The sequential designs that we use are adaptive data collection experiments, which have been studied by Bakshy et al. (2018); Kasy & Sautmann (2021). Closely related to our method is policy learning from adaptive data, which has been studied by Zhan et al. (2021) and Bibaut et al. (2021) in the online setting. In contrast, we consider a batch setting and our analysis achieves fast rates under more general conditions. Zhan et al. (2021) use a doubly robust estimator and provide regret guarantees but assume a deterministic lower bound on the propensity score to control the variance. Instead, our novel counterfactual estimator does not require such an assumption. Bibaut et al. (2021) propose a novel maximal inequality and derive from it fast-rate regret guarantees under an additional margin condition that can only hold for finite action sets. Our work instead uses a different assumption on the expected risk, which is similar to Hölderian error bounds in acceleration methods (d’Aspremont et al., 2021) that are known to be satisfied for a broad class of subanalytic functions (Bolte et al., 2007).
In the reinforcement learning literature (Sutton & Barto, 1998), off-policy methods (Harutyunyan et al., 2016; Munos et al., 2016) evaluate and learn a policy using actions sampled from a behavior (logging) policy, which is therefore closely related to our setting. Among methods that have proven empirically successful are the PPO (Schulman et al., 2017) and TRPO (Schulman et al., 2015) algorithms, which learn policies using a Kullback-Leibler distributional constraint to ensure robust learning, and which can be compared to our learning strategy that improves the logging policy at each round. However, reinforcement learning models transitions in the states (contexts) induced by the agent’s actions, while bandit problems like ours assume that actions do not influence the context distribution. This makes it possible to design algorithms that exploit the problem structure, have theoretical guarantees and can achieve better performance in practice.
Finally, our method is related to acceleration methods (d’Aspremont et al., 2021) where current iterates are used as new initial points in the optimization of strongly convex functions (Boyd & Vandenberghe, 2004). While different schemes use fixed (Powell, 1977) or adaptive (Nocedal & Wright, 2006; Becker et al., 2011; Nesterov, 2012; Bolte et al., 2007; Gaillard & Wintenberger, 2018) strategies, our method differs in that it does not consider the same original setting, does not require the same assumptions and does not provide the same guarantees. In addition, while current models are also used as new starting points, additional data is actually collected in our setting, unlike in those previous works, which do not assume partial feedback as in our case.
3 Sequential Counterfactual Risk Minimization
In this section, we introduce the CRM framework and motivate the use of sequential designs for SCRM.
Notations
For random variables , and , we write the expectation and do the same for the variance . Moreover, denotes approximate inequalities up to universal multiplicative terms.
3.1 Background
In the counterfactual risk minimization (CRM) problem, we are given logged observations where contexts are sampled from a stochastic environment distribution , actions are drawn from a logging policy with a model in a parameter space . The losses are drawn from a conditional distribution . We note the associated propensities and assume them to be known. We will assume that the policies in admit densities so that the propensities will denote the density function of the logging policy on the actions given the contexts. The expected risk of a model is defined as:
[TABLE]
Counterfactual reasoning uses the logged data sampled from the logging policy associated to to estimate the risk of any model with importance sampling:
[TABLE]
under the common support assumption (the support of the evaluated policy is included in the support of the logging policy). The goal in CRM is to find a model with minimal risk by minimizing
[TABLE]
where uses the sample variance penalization principle (Maurer & Pontil, 2009) on samples from with counterfactual estimates of the expected risk , an empirical variance and . Specifically, in the CRM framework, multiple estimators are derived from the IPS method (Horvitz & Thompson, 1952), which uses the following clipped importance sampling estimator of the risk of a model by Bottou et al. (2013) and Swaminathan & Joachims (2015a):
[TABLE]
where is a clipping parameter. Writing and the empirical variance estimator becomes:
[TABLE]
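As an illustration of the clipped estimator and the sample variance penalization principle described above, here is a minimal NumPy sketch. It assumes that clipping truncates each importance weight at a threshold (named `clip` here), that the variance is the usual unbiased sample variance of the weighted losses, and that the penalty has the form `lam * sqrt(variance / n)`. The function name, argument names and toy data are hypothetical and are not taken from the authors' code.

```python
import numpy as np

def clipped_ips_objective(y, pi_theta, pi_0, clip, lam):
    """Clipped IPS risk estimate plus a sample-variance penalty (sketch)."""
    y = np.asarray(y, dtype=float)
    # Importance weights, truncated at `clip` to limit their variance.
    w = np.minimum(np.asarray(pi_theta) / np.asarray(pi_0), clip)
    chi = y * w                      # per-sample weighted losses
    risk = chi.mean()                # clipped IPS estimate of the risk
    var = chi.var(ddof=1)            # unbiased empirical variance
    objective = risk + lam * np.sqrt(var / len(y))
    return objective, risk, var

# Toy logged data (made up): bounded losses, logging propensities, and the
# propensities that a candidate policy assigns to the same logged actions.
y = np.array([-1.0, -0.5, 0.0, -0.2])
pi_0 = np.array([0.5, 0.25, 0.1, 0.4])
pi_theta = np.array([0.5, 0.5, 0.8, 0.2])

obj, risk, var = clipped_ips_objective(y, pi_theta, pi_0, clip=4.0, lam=1.0)
print(f"risk={risk:.4f} variance={var:.4f} objective={obj:.4f}")
```

In this toy run the third sample has a raw weight of 8, which the threshold reduces to 4. This is the mechanism by which clipping trades bias for variance.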
Other estimators aim at controlling the variance of the estimator with self-normalized estimators (Swaminathan & Joachims, 2015b) or with direct methods (Dudik et al., 2011; Dudík et al., 2014) in doubly robust estimators. Even so, the performance of counterfactual learning is harmed when the logging policy under-explores the action space (Owen, 2013). Likewise, counterfactual estimates obtained from a first round of randomized data collection may not suffice (Bottou et al., 2013) to select a model . In those cases, it could be natural to consider collecting additional samples. While it is possible to use the same logging model to do so, we will present a framework for designing an improved sequential data collection strategy, following the intuition of sequential designs of Bottou et al. (2013).
3.2 Sequential Designs
In this section we present a design of data collections that sequentially learn a policy from logged data in order to deploy it and learn from the newly collected data. Specifically, we assume that at a round , a model is deployed and a set of observations is collected thereof, with propensities to learn a new model and reiterate. In this work, we assume that the loss is bounded in as in (Swaminathan & Joachims, 2015a) (note however that this assumption could be relaxed to bounded losses) and follows a fixed distribution . Next, we will introduce useful definitions.
Definition 3.1 (Excess Risk and Expected Regret).
Given an optimal model , we write for each rollout the excess risk:
[TABLE]
and define the expected regret as:
[TABLE]
The objective is now to find a sequence of models that have an excess risk and an expected regret that improve upon CRM guarantees. To do so, we define a sequence of minimization problems for :
[TABLE]
where is an objective function that we define in Section 4.2. Note that in the setting we consider, samples are i.i.d inside a rollout but dependencies exist between different sets of observations. From a causal inference perspective (Peters et al., 2017), this does not incur an additional bias because of the successive conditioning on past observations. We provide detailed explanations in Appendix A.1 on this matter. Note also that the main intuition and motivation of our work is to shed light on how learning intermediate models to adaptively collect data can improve upon sampling from the same logging system by using the same total sample size . To illustrate the learning benefits of SCRM we now provide a simple example.
Example 3.1 (Gaussian policies with quadratic loss).
Let us consider Gaussian parametrized policies and a loss where . We illustrate in Figure 1 the evolution of the losses of learned models through 15 rollouts with either i) Batch CRM learning on aggregation of data, being generated by the unique initial logging policy or ii) Sequential CRM learning with models deployed adaptively, with data being generated by the last learned model for the batch . We see that the models learned with SCRM take larger optimization steps than the ones with CRM.
We summarize our SCRM framework in Algorithm 1 with the different blocks exposed previously. We provide an additional graphical illustration of SCRM compared to CRM in Appendix A.1. In the next section we will define counterfactual estimators from the observations at each round and define a learning strategy .
4 Variance-Dependent Convergence Guarantees
In this part we aim at providing convergence guarantees of counterfactual learning. We show how we can obtain a dependency of the excess risk on the variance of importance weights between the logging model and the optimal model.
4.1 Implicit exploration and controlled variance
We first introduce a new counterfactual estimator. For this, we will require a common support assumption as in importance sampling methods (Owen, 2013). We will assume that the policies for have all the same support. We then consider the following estimator of the risk of a model :
[TABLE]
where and is like a clipping parameter which ensures that the modified propensities are lower bounded. Noting $\zeta_i(\theta) = \big(\frac{\pi_{\theta,i}}{\pi_{m,i} + \alpha\pi_{\theta,i}} - 1\big) y_{m,i}$, we can write the empirical variance estimator as:
[TABLE]
Here, the empirical variance uses a control variate since it uses the expression of above instead of . This improves the dependency on the variance in the excess risk bound provided in Proposition 4.2. Note also that our estimator resembles the implicit exploration estimator in the EXP3-IX algorithm (Lattimore & Szepesvari, 2019), as our motivation is to improve the control of the variance.
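The estimator and its control-variate variance can be sketched in a few lines of NumPy. The modified weight below is taken from the expression of $\zeta_i(\theta)$ given above. The averaging over samples and the unbiased sample variance of the $\zeta_i$ are assumptions about the exact form of the two estimators, and the function and variable names are hypothetical.

```python
import numpy as np

def ips_ix(y, pi_theta, pi_m, alpha):
    """IPS-IX style risk estimate with a control-variate variance (sketch)."""
    y = np.asarray(y, dtype=float)
    pi_theta = np.asarray(pi_theta, dtype=float)
    pi_m = np.asarray(pi_m, dtype=float)
    # Modified importance weights: the denominator is at least
    # alpha * pi_theta, so every weight is at most 1 / alpha.
    w = pi_theta / (pi_m + alpha * pi_theta)
    risk = np.mean(w * y)
    # zeta subtracts the loss itself (the control variate) from the
    # weighted loss before the variance is computed.
    zeta = (w - 1.0) * y
    var = zeta.var(ddof=1)
    return risk, var, w

# Made-up logged losses and propensities for illustration.
y = np.array([-1.0, -0.3, -0.7, 0.0])
pi_m = np.array([0.5, 0.01, 0.3, 0.2])
pi_theta = np.array([0.4, 0.9, 0.3, 0.1])

risk, var, w = ips_ix(y, pi_theta, pi_m, alpha=0.1)
print(risk, var, w.max())
```

Two properties are visible in this sketch: the weights never exceed `1 / alpha`, and when the evaluated policy coincides with the logging policy and `alpha = 0` every `zeta` is zero, so the variance term vanishes.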
4.2 Learning strategy
Next, we define the learning objective referred to in Eq. (8). Our approach, like the CRM framework, uses the sample variance penalization principle (Maurer & Pontil, 2009) to learn models that have low expected risk with high probability. To do so, we first provide an assumption to be used in our generalization error bound.
Assumption 4.1 (Bounded importance weights).
For any models and any , we assume , for some .
This assumption has been made in previous works (Kallus & Zhou, 2018; Zenati et al., 2020) and is reasonable when we consider a bounded parameter space . Next, we state an error bound for our estimator.
Proposition 4.1 (Generalization Error Bound).
Let and be the empirical estimators defined respectively in Eq. (9) and Eq. (10). Let , , and . Then, under Ass. 4.1, for , with probability at least :
[TABLE]
where is a metric entropy complexity measure defined in App. B.1 and .
This proposition is proved in Appendix B.2 and essentially uses empirical bounds (Maurer & Pontil, 2009). By minimizing this high-probability upper bound, we can find models with guarantees on their expected risk. Therefore, at each round, we minimize the following loss:
[TABLE]
where is a positive parameter. Unlike deterministic decision rules used for example in UCB-based algorithms (Lattimore & Szepesvari, 2019), the exploration is naturally guaranteed by the stochasticity of the policies we use.
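The following toy sketch puts the pieces together: at each round the current policy is deployed, fresh samples are collected, and the next model minimizes a variance-penalized IPS-IX objective. Everything specific here is an assumption made for illustration: a context-free one-dimensional Gaussian policy, a hypothetical bounded loss, a grid search in place of gradient-based optimization, rollout sizes that double at each round, and `alpha = 1 / n_m`.

```python
import numpy as np

def gaussian_pdf(a, mean, sigma):
    return np.exp(-0.5 * ((a - mean) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

def toy_loss(a, theta_star):
    # Hypothetical bounded loss in [-1, 0], minimal at a = theta_star.
    return np.minimum((a - theta_star) ** 2 / 16.0, 1.0) - 1.0

def penalized_objective(theta, a, y, theta_m, sigma, alpha, lam):
    # IPS-IX style estimate plus a sample-variance penalty (assumed form).
    pi_theta = gaussian_pdf(a, theta, sigma)
    pi_m = gaussian_pdf(a, theta_m, sigma)
    w = pi_theta / (pi_m + alpha * pi_theta)
    zeta = (w - 1.0) * y
    return np.mean(w * y) + lam * np.sqrt(zeta.var(ddof=1) / len(y))

def scrm(theta_0, theta_star, sigma=1.0, n_0=200, rounds=5, lam=0.1, seed=0):
    rng = np.random.default_rng(seed)
    grid = np.linspace(-1.0, 5.0, 121)       # candidate models
    theta_m, history = theta_0, [theta_0]
    for m in range(rounds):
        n_m = n_0 * 2 ** m                   # assumed doubling schedule
        a = rng.normal(theta_m, sigma, size=n_m)   # deploy the current policy
        y = toy_loss(a, theta_star)                # observe bandit feedback
        scores = [penalized_objective(t, a, y, theta_m, sigma, 1.0 / n_m, lam)
                  for t in grid]
        theta_m = float(grid[int(np.argmin(scores))])  # model for next round
        history.append(theta_m)
    return history

history = scrm(theta_0=0.0, theta_star=3.0)
print(history)
```

Only the samples of the latest rollout are used at each round in this sketch, so each minimization sees data drawn from a single known policy.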
4.3 Excess risk upper bound
Finally, we establish an upper bound on the excess risk of the IPS-IX estimator for counterfactual risk minimization using the learning strategy that we just defined. For this, we require an assumption on the complexity measure.
Assumption 4.2.
We assume that the set is compact and that there exists such that
This assumption states that the complexity grows logarithmically with the sample size. It holds for parametric policies so long as the propensities are lower bounded, which is verified using our estimator. We now state our variance-dependent excess risk bound.
Proposition 4.2 (Excess Risk Bound).
Let and . Let be a set of samples collected with policy . Then, under Assumptions 4.1 and 4.2, a minimizer of Eq. (11) on the samples satisfies the excess risk upper-bound: w.p.
[TABLE]
where .
The proof is postponed to Appendix B.2. The modified propensities in IPS-IX as well as the control variate used in the variance estimator allow us to improve the dependency in , compared to obtained in previous work (Zenati et al., 2020). This turns out to be a crucial point to use these error bounds sequentially as in acceleration methods since if , as explained in the next section.
5 SCRM Analysis
In this section we provide the main theoretical result of this work on the excess risk and regret analysis of SCRM. We start by stating an assumption that is common in acceleration methods (d’Aspremont et al., 2021) with restart strategies (Becker et al., 2011; Nesterov, 2012) that we will require to achieve the benefits of sequential designs.
Assumption 5.1 (Hölderian Error Bound).
We assume that there exist and such that for any , there exists such that
[TABLE]
Typically, in acceleration methods, Hölderian error bounds (Bolte et al., 2007) are of the form:
[TABLE]
for some and where is some distance to the optimal set (). This bound is akin to a local version of strong convexity () or a bounded parameter space () if is the Euclidean distance. When , this has also been referred to as the Łojasiewicz assumption introduced in (Łojasiewicz, 1963, 1993). Notably, it has been used in online learning (Gaillard & Wintenberger, 2018) to obtain fast rates with restart strategies. This assumption holds for instance for Example 3.1 with (see App. C.1). We also discuss this assumption for distributions in the exponential family in Appendix C.2, notably for distributions that have been used in practice (Swaminathan & Joachims, 2015b; Kallus & Zhou, 2018; Zenati et al., 2020). Next, we state our main result, namely the acceleration of the excess risk convergence rate and the regret upper bound of SCRM.
Proposition 5.1.
Let and . Let for $m = 0, \dots, M = \big\lfloor \log_2\big(1 + \frac{n}{n_0}\big) \big\rfloor$. Then, under Assumptions 4.1, 4.2 and 5.1 with , the SCRM procedure (Alg. 1) satisfies the excess risk upper-bound
[TABLE]
Moreover, the expected regret is bounded as follows:
[TABLE]
The proof of our result is detailed in Appendix B.3.
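To make the rollout schedule concrete, the short sketch below computes $M = \lfloor \log_2(1 + n/n_0) \rfloor$ and a doubling sequence of rollout sizes. The doubling rule `n_m = n_0 * 2**m` is an assumption, based on the geometric sample sizes mentioned in the concluding discussion, and the function name is hypothetical.

```python
import math

def scrm_schedule(n, n_0):
    """Number of rounds M and assumed geometric rollout sizes n_0 * 2**m."""
    M = math.floor(math.log2(1 + n / n_0))
    sizes = [n_0 * 2 ** m for m in range(M + 1)]
    return M, sizes

M, sizes = scrm_schedule(n=10_000, n_0=100)
print(M, sizes)
print(sum(sizes[:M]))   # samples consumed by rollouts 0, ..., M-1
```

Under this reading, rollouts 0 to M-1 together use n_0 (2^M - 1) samples, which never exceeds n, so the number of model updates grows only logarithmically with the total budget.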
Discussion
This result illustrates that an excess risk of order $O\big(\frac{\log(n)}{n}\big)$ may be obtained when (which is implied by a local version of the strong convexity assumption in acceleration methods). When , which merely states that the variance of the importance weights is bounded, we simply recover the original rate of CRM of order . The SCRM procedure thus improves the excess risk rate whenever . It is worth emphasizing that the knowledge of is not needed by Alg. 1. We also note that our assumption seems related to the Bernstein condition (Bartlett & Mendelson, 2006, see Def 2.6) and (van Erven et al., 2015, see Def 5.1), which bounds a variance term by an excess risk term to the power. In empirical risk minimization, this implies the same excess risk rate and regret rate (van Erven & Koolen, 2016), which are exactly the same rates as ours (up to logs).
6 Empirical Evaluation
In this section we perform numerical experiments to validate our method in practical settings. We present the experimental setup as well as experiments comparing SCRM to related approaches and internal details of the method.
6.1 Experimental setup
As our method is able to handle both discrete and continuous actions, we experiment in both settings. We now provide a brief description of the setups, with extensive details available in Appendix D.2. All the code to reproduce the empirical results is available at: https://github.com/criteo-research/sequential-conterfactual-risk-minimization
Continuous actions
We perform evaluation on synthetic problems pertaining to personalized pricing problems from (Demirer et al., 2019) (Pricing) and advertising from (Zenati et al., 2020) (Advertising). We consider Gaussian policies with linear contextual parametrization and fixed variance that corresponds to the exploration budget allowed in the original randomized experiment. The features are up to 10 dimensions and the actions are one-dimensional. We keep the original logging baselines from the settings and compare results to a skyline supervised model trained on the whole training data with full information.
Discrete actions
We adapt the setup of (Swaminathan & Joachims, 2015a) that transforms a multilabel classification task into a contextual bandit problem with a discrete, combinatorial action space. We keep the original modeling (akin to CRF) with categorical policies . The baseline (resp. skyline) is a supervised, full information model with the same parameter space as the CRM methods, trained on 5% (resp. 100%) of the training data. We consider the class of probabilistic policies that satisfy Assumption 5.1 by predicting actions in an Epsilon Greedy fashion (Sutton & Barto, 1998): where . Real-world datasets include Scene, Yeast and TMC2007 with feature space up to 30,438 dimensions and action space up to . To account for this combinatorial action space we allow a model to be learned using data from all past rollouts for better sample efficiency and therefore adjust variance estimation in Appendix A.2 to take into account sequential dependencies.
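As a rough illustration of epsilon-greedy action selection with known propensities, the sketch below mixes a greedy choice with uniform exploration over a small finite action set. This is a simplification: the experiments use a combinatorial multilabel action space, while the sketch uses a plain categorical one. The function name, the `scores` input and the value of `eps` are hypothetical.

```python
import numpy as np

def epsilon_greedy_probs(scores, eps):
    """Action probabilities: greedy action with extra mass, uniform elsewhere."""
    scores = np.asarray(scores, dtype=float)
    k = scores.shape[-1]
    probs = np.full(scores.shape, eps / k)          # uniform exploration mass
    greedy = np.argmax(scores, axis=-1)
    np.put_along_axis(probs, greedy[..., None], 1.0 - eps + eps / k, axis=-1)
    return probs

# Made-up model scores for 2 contexts and 3 actions.
scores = np.array([[0.1, 2.0, -1.0],
                   [1.5, 0.2, 0.3]])
eps = 0.1
probs = epsilon_greedy_probs(scores, eps)

# Sample one action per context and record its propensity for later reweighting.
rng = np.random.default_rng(0)
actions = np.array([rng.choice(probs.shape[1], p=p) for p in probs])
propensities = probs[np.arange(len(actions)), actions]
print(probs, actions, propensities)
```

Every action keeps probability at least `eps / k`, so the logged propensities are bounded away from zero and importance weights stay bounded.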
6.2 SCRM compared to CRM and related methods
We first compare SCRM to CRM and existing methods in the literature.
Comparison between SCRM and CRM
First, we provide insights on the performance that SCRM can achieve compared to classical CRM with increasing sample sizes. The key difference between CRM and SCRM is that for each sample size CRM learns from samples generated by the logging model (see Alg. 2) whereas SCRM learns from samples generated by a series of optimized models (see Alg. 1). For each sample size we select a posteriori the best for both methods based on test set loss value. We report in Figure 2 over rollouts the mean test loss depending on sample size up to , with standard deviation estimated over 10 random runs. We observe that SCRM converges very fast, often within the first rollouts. Conversely, CRM needs more samples and the variance is higher. We conclude that there is a striking benefit in using a sequential design to achieve near optimal loss with far fewer samples and better confidence compared to CRM. Complementary results on other datasets are available in Appendix E.1.
Moreover, to further illustrate this benefit of efficient learning we also report in Table 1 the sample size needed to attain near optimal performance when is known as in Example 3.1, where we also observe that SCRM reaches optimal performances faster than CRM. This corroborates the benefits of improved excess risk rates for SCRM.
Hyper-parameter selection for SCRM
In our experiments, hyperparameter selection consists in choosing a value for . We describe a simple heuristic and evaluate its performance on different datasets. We propose to select by estimating the non-penalized CRM loss (eq. 3) using offline cross-validation on past data . We report in Table 2 the test loss obtained when choosing a fixed a posteriori () or with this heuristic (). We observe that loss confidence intervals for both methods intersect for all discrete datasets, except on TMC2007 where the degradation shows only at the 3rd digit. On continuous datasets, the heuristic actually improves upon the fixed a posteriori selection. We conclude that this heuristic is usable in practice.
Comparison with other methods
In this paragraph we compare SCRM to related methods to explore the practical implications of existing methods in our setting. We first consider batch bandit methods and implement the stochastic sequential batch pure exploitation (SBPE) algorithm in (Han et al., ) and a batch version of the kernel UCB (Valko et al., 2013) algorithm (BKUCB) with an optimized library (see implementation details in Appendix D.3). We also experiment with off-policy RL methods PPO (Schulman et al., 2017) and TRPO (Schulman et al., 2015) from the StableBaselines library (Raffin et al., 2021) (see Appendix D.3). Such methods model more general state transitions based on past actions, but they can still be used in our setting. To fairly compare all methods (in particular those for which no heuristic exists for hyper-parameter selection) we report the mean and standard deviation over 10 random runs of the best test loss a posteriori over hyperparameter grids of the same size. First, we observe that SCRM beats CRM on all datasets, illustrating the benefit of the sequential design. Second, on discrete tasks (where the combinatorial action space is large) we observe that SCRM achieves nearly the best test loss in all tasks, while RL methods have difficulties maintaining good performance. Third, batch bandit algorithms can achieve good performance in practice because of their deterministic decision rules. However, they involve an matrix inversion and therefore did not finish (DNF) in 24h (per single run) on a 46 CPU / 500G RAM machine in most of our settings with large sample size , which makes them impractical for large scale experiments. We conclude that SCRM is an effective learning paradigm and that it scales successfully to a variety of settings.
6.3 Details on SCRM
Next, we provide additional empirical evaluations of details of our method.
Evaluation of IPS-IX
To understand the bias-variance trade-off that IPS-IX can achieve in practice compared to other counterfactual estimators we consider a policy evaluation experiment. The task we consider uses sinusoidal losses and evaluated policies are shifted Gaussians , with being the logging policy. Evaluated policies with large shifts therefore simulate the setting where the logging policy under-explores the action space. The estimators we consider include IPS, SNIPS (Swaminathan & Joachims, 2015b), clipped IPS (eq. 4) with the heuristic from (Bottou et al., 2013) and IPS-IX (eq. 9) with . All methods therefore use their respective heuristics to set hyperparameters. We report in Figure 3 the bias and variance of estimators for each shift for . We observe that IPS-IX shows an empirical bias comparable to IPS and lower than SNIPS and clipped IPS, while maintaining a lower variance. Moreover, its variance is only slightly higher than that of clipped IPS, which introduces a large bias. We conclude that besides being a key component of our analysis, IPS-IX also controls the variance with a better trade-off in practice. More details are available in Appendix E.2.
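A small version of such a policy evaluation experiment can be sketched as follows. The loss `sin(a)`, the unit-variance Gaussians, the clipping threshold and the value of `alpha` are all arbitrary choices made for this illustration and are not the settings or heuristics of the paper. With this particular loss, the true risk of a policy N(shift, 1) is `sin(shift) * exp(-0.5)`, which gives a reference for the bias.

```python
import numpy as np

def gaussian_pdf(a, mean, sigma=1.0):
    return np.exp(-0.5 * ((a - mean) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

def estimators(y, pi_eval, pi_log, clip, alpha):
    """Four counterfactual risk estimates computed from the same logged sample."""
    ratio = pi_eval / pi_log
    return {
        "IPS": np.mean(ratio * y),
        "SNIPS": np.sum(ratio * y) / np.sum(ratio),
        "clipped IPS": np.mean(np.minimum(ratio, clip) * y),
        "IPS-IX": np.mean(pi_eval / (pi_log + alpha * pi_eval) * y),
    }

rng = np.random.default_rng(0)
n = 1000
a = rng.normal(0.0, 1.0, size=n)     # actions logged by N(0, 1)
y = np.sin(a)                        # hypothetical sinusoidal loss

for shift in (0.0, 1.0, 2.0):
    est = estimators(y, gaussian_pdf(a, shift), gaussian_pdf(a, 0.0),
                     clip=10.0, alpha=1.0 / n)
    truth = np.sin(shift) * np.exp(-0.5)
    print(shift, round(float(truth), 3),
          {name: round(float(v), 3) for name, v in est.items()})
```

Repeating the loop over many seeds and averaging the squared deviations from `truth` would give empirical bias and variance curves per shift.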
When is SCRM useful
Knowing when SCRM is useful is a natural question of interest when choosing the method to be used on a given logged bandit feedback problem. Intuitively, one can imagine that SCRM will be most useful when the logging policy underexplores the action space, for example when the distance (in parameter space) between the logging and optimal parameters is large. To study this question we run the following experiment on the setup of Example 3.1 with Gaussian distributions and fixed loss variance . We vary the distance between the optimal model and the logging model . Since the ideal exploration level may be task dependent we choose a posteriori the best on a grid, for both CRM and SCRM. We report in Figure 4 the best final loss for both CRM and SCRM for a range of values of . We observe in particular that SCRM achieves better final losses for larger distances than CRM. With the same number of rollouts , SCRM can extend the exploration to further areas while CRM fails for any exploration level in those cases, which advocates for using sequential deployments.
7 Discussions
In this work, we have proposed a method to extend the CRM perspective for designing sequential data collection experiments. We have introduced a novel counterfactual estimator to improve variance control in excess risk bounds. Under a weak error bound assumption, we have sequentially applied these excess risk guarantees to achieve faster rates, similarly to acceleration methods. Our method also improves upon CRM in practice and is particularly well-suited for this setting compared to existing methods in the literature. It is worth noting that, in order to avoid introducing dependencies in the excess risk bounds we analyzed, the theoretical algorithm we have studied uses geometric sample sizes to discard previous samples. However, using all past samples has also been found to be effective in practice, and developing guarantees for this case would be an interesting area for future research. Additionally, similar to online settings that involve an exploration-exploitation tradeoff, investigating the use of the optimism in the face of uncertainty (OFUL) principle in SCRM would also be a promising avenue for future work.
Acknowledgements
The authors thank Alberto Bietti for the insightful early discussions on this project. The authors also thank the reviewers for their feedback on this paper. This work was supported by ANR 3IA MIAI@Grenoble-Alpes (ANR-19-P3IA0003).
Appendix A Additional details on counterfactual estimators
A.1 Unconfoundedness in sequential designs
In these explanations, we recall that the distributions of contexts as well as the distribution of losses are fixed. In other words, the latter do not vary from one batch to another. In the counterfactual risk minimization framework (CRM) (Swaminathan & Joachims, 2015a), the causal graph (using the conventions in (Peters et al., 2017)) can be represented as shown in Figure 5.
In the sequential counterfactual risk minimization (SCRM) framework, if we unfold the causal graph, the following representation can be given in Figure 6.
Therefore, it is clear that in general, . However, from d-separation and faithfulness (Peters et al., 2017), we have for :
[TABLE]
Therefore, given that all the dependencies are observed and that we can condition on the direct parents of a given model , sequential randomized data collection is possible. Finally, we provide in Figure 7 an illustration of SCRM and CRM.
A.2 Multiple Importance Sampling Estimators
Note that, in order to avoid introducing dependencies in the excess risk bounds we analyzed, the theoretical algorithm we have studied uses geometric sample sizes to discard previous samples. However, using all past samples is effective in practice and developing guarantees for this case would be an interesting area for future research. We present in this section estimators that aggregate all previous information. In particular, we can use Multiple Importance Sampling (MIS) (Owen, 2013) over all previous samples. Consider in particular a partition of unity with weight functions which satisfies for all and . The MIS estimator writes:
[TABLE]
In multiple importance sampling we usually assume that the behavior distributions are independent. In our case, when we optimize based on the models , we break this assumption. However, as we will see, we can still have the unbiasedness property and derive an estimator for the variance of the estimator.
Proposition A.1 (Unbiasedness).
The MIS estimator (12) is unbiased when the loss is fixed (its distribution does not depend on time rollout ).
Proof.
Let . We recall that at all rounds , models were deployed and sets of observations were collected thereof, with propensities to learn the next model . To prove the unbiasedness we use the tower rule on the expectation and condition on previous observations :
[TABLE]
where the second last line is true only when the distribution of does not change over time roll-outs . ∎
Among the proposals for functions , the most ’naive’ and natural heuristic is to choose
[TABLE]
which gives the naive concatenation of all IPS estimators
[TABLE]
where .
With the previous definition of the empirical mean estimator, we can now derive an empirical variance estimator, starting with the naive multi importance sampling estimator. We write the random variable . We note that for inside a batch each realization of and are independent. But the realizations of the random variables and are dependent. Writing
[TABLE]
where the second last equality is obtained with the bilinearity of the covariance. Given the latter expression of the variance, we propose the following estimator and with a linear sampling where all for :
[TABLE]
where $\hat{V}(r^{m})=\frac{1}{n_{m}(n_{m}-1)}\sum_{i=1}^{n_{m}}\big(r_{i}^{m}-\bar{r}^{m}\big)^{2}$ and .
Note also that for other functions , the most studied one is the balance heuristic with , that is:
[TABLE]
The latter heuristic has been studied for its low variance (Owen, 2013), but these properties have been established under an i.i.d. assumption that is broken in our adaptive data collection strategy. Finally, note that controlling the variance of this estimator with an implicit exploration estimator, as we do in the i.i.d. case, would be an interesting research direction.
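To make the two weighting heuristics of this section concrete, here is a minimal sketch in the context-free case. It assumes Gaussian logging policies with known densities; the data layout (`batches` as a list of density and sample pairs) and all function names are illustrative choices, not taken from the released code. The helper `variance_of_mean` implements the per-batch quantity $\hat{V}(r^{m})$ given above.

```python
import math


def gaussian_pdf(mean, std):
    """Return the density function of N(mean, std^2)."""
    def pdf(a):
        z = (a - mean) / std
        return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))
    return pdf


def naive_mis(batches, target_pdf):
    """Pooled IPS: each sample is reweighted by its own batch's logging density.

    batches is a list of (logging_pdf, samples), samples = [(action, loss), ...].
    """
    n = sum(len(samples) for _, samples in batches)
    total = 0.0
    for logging_pdf, samples in batches:
        for a, y in samples:
            total += target_pdf(a) / logging_pdf(a) * y
    return total / n


def balance_mis(batches, target_pdf):
    """Balance heuristic: reweight by the size-weighted mixture of logging densities."""
    n = sum(len(samples) for _, samples in batches)
    total = 0.0
    for _, samples in batches:
        for a, y in samples:
            mixture = sum(len(s) / n * pdf(a) for pdf, s in batches)
            total += target_pdf(a) / mixture * y
    return total / n


def variance_of_mean(values):
    """Sample variance of an empirical mean: sum (r_i - r_bar)^2 / (n (n - 1))."""
    n = len(values)
    mean = sum(values) / n
    return sum((v - mean) ** 2 for v in values) / (n * (n - 1))


# Two hypothetical batches logged by different Gaussian policies.
batches = [
    (gaussian_pdf(0.0, 1.0), [(-0.3, 0.2), (0.4, 0.5), (1.1, 0.9)]),
    (gaussian_pdf(0.5, 1.0), [(0.2, 0.4), (0.9, 0.8)]),
]
target = gaussian_pdf(0.8, 1.0)
print(naive_mis(batches, target), balance_mis(batches, target))
```

When all batches share the same logging density, the mixture in the balance heuristic reduces to that density and the two estimators coincide; they differ only once the deployed policies move between rollouts.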
Appendix B Analysis details
In this section, we provide the details of our analysis by starting with essential definitions, then our proofs of variance dependent excess risk bounds and finally our regret analysis.
B.1 Definitions
is a complexity measure that will be upper-bounded by the metric entropy in sup-norm at level of the following function set,
[TABLE]
The latter corresponds to clipped prediction errors of policies normalized into . More precisely, to define rigorously , we denote for any and , the complexity of a class by
[TABLE]
where $\mathcal{F}\big(\{x_{i},a_{i},y_{i}\}\big)=\big\{\big(f(x_{1},a_{1},y_{1}),\dots,f(x_{n},a_{n},y_{n})\big),f\in\mathcal{F}\big\}\subseteq\mathbb{R}^{n}$ and the number is the smallest cardinality of a set such that is contained in the finite union of -balls centered at points in in the metric induced by . Then, is defined by
[TABLE]
B.2 Variance-dependent excess risk bounds
We will denote by the conditional expectation given the set of observation samples up to the rollout . Here, we recall that , , , and . Furthermore, throughout the document, $\mathbb{E}_{x,\theta_{m},y}\big[\cdot\big]$ (resp. $\mathrm{Var}_{x,\theta_{m},y}\big[\cdot\big]$) denotes the expectation (resp. variance) in where , , and .
Proposition 4.1 (Generalization Error Bound).
Let and be the empirical estimators defined respectively in Eq. (9) and Eq. (10). Let , , and the number of samples associated to the logged dataset at round . Then, with probability at least ,
[TABLE]
where .
Proof.
Let and . Since all functions in defined in Eq. (17) take values in , we can apply the concentration bound of Maurer & Pontil (2009, Theorem 6) to the set . This yields, with probability at least ,
[TABLE]
where
[TABLE]
is an estimation of the sample variance. Let and define the following biased estimator of the excess risk:
[TABLE]
We recall that $\mathbb{E}_{x,\theta_{m},y}\big[\cdot\big]$ denotes the expectation in where , , and . By construction of (see Eq. (17)),
[TABLE]
where and are defined respectively in Eq. (9) and Eq. (10). Thus, multiplying (21) by , substituting the above terms, and using , yields
[TABLE]
with probability . Now, let us decompose
[TABLE]
But, since the losses are bounded in almost surely,
[TABLE]
which, substituted into the previous equation, entails,
[TABLE]
Lower-bounding the left-hand side of (26), we thus get w.p ,
[TABLE]
Using and applying Hoeffding’s inequality, this further yields w.p.
[TABLE]
Eventually, note that since . Thus,
[TABLE]
which concludes the proof.
∎
Proposition 4.2 (Conservative Excess Risk).
Let and . Let be a set of samples collected with . Then, under Assumptions 4.1 and 4.2, the solution of Problem (8) with the IPS-IX estimator in Eq. (11) on the samples satisfies the excess risk upper-bound
[TABLE]
where .
Proof.
We consider the notations of the proof of Proposition 4.1. Fix . Applying Theorem 15 of (Maurer & Pontil, 2009) (note that in their notation, equals , is the dataset where , and is the expectation with respect to one test sample ) to the function set defined in (17), we get with probability
[TABLE]
This can be written as:
[TABLE]
with the following definitions:
[TABLE]
Step: Lower bounding
Using the definition of in (17) and that of in Eq. (22), we have
[TABLE]
Thus, can be re-written as
[TABLE]
which we now lower-bound. To do so, we begin by upper-bounding . It can be expressed as
[TABLE]
To shorten notation, from now on and throughout this proof, we write instead of , omitting the dependence on and . Using the inequality for , we have
[TABLE]
where the last inequality is by Assumption 4.1 and because . Together with (30), we get
[TABLE]
We recall that by Eq.(24). Therefore,
[TABLE]
which finally gives
[TABLE]
Step: Upper bound
By definition of in (17), we have
[TABLE]
Then, using the inequality , for , this may be upper-bounded as
[TABLE]
On the one hand, the first term of the right-hand side may be upper-bounded as
[TABLE]
where . On the other hand, for the second term, we use the same factorization as in Eq. (31) to get
[TABLE]
which yields the upper-bound
[TABLE]
Therefore, substituting the last two upper-bounds into (34) entails
[TABLE]
Then, replacing this upper-bound into the definition of in (29) and using Assumption 4.2 to upper bound the terms in , we obtain the following upper-bound
[TABLE]
where the last inequality is because .
Step: excess risk upper bound
Setting and using the two previous bounds (33) and (35) respectively on and on into (28), we get
[TABLE]
Using that , we have that
[TABLE]
Then, since and , we have , which yields
[TABLE]
Substituting the last two inequalities into (36) finally entails
[TABLE]
which concludes the proof. ∎
B.3 Regret analysis
Proposition 5.1 (Regret upper-bound).
Let and . Let for $m=0,\dots,M=\big\lfloor\log_{2}(1+\frac{n}{n_{0}})\big\rfloor$. Then, under Assumptions 4.1, 4.2 and 5.1, the SCRM procedure (Alg. 1) satisfies the excess risk upper-bound
[TABLE]
Moreover, the expected regret is upper-bounded as follows:
[TABLE]
Proof.
First, note that for and $M=\big\lfloor\log_{2}(1+\frac{n}{n_{0}})\big\rfloor$, we have Hence, Alg. 1 has collected at most samples to design the estimator . For , we recall and use Eq. (37) to write
[TABLE]
where and are independent of .
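The sample budget used in this argument can be checked numerically. The sketch below assumes the doubling schedule $n_m = n_0 2^m$ (the batch-size formula is not legible in this version of the text, so this form is an assumption) together with the stated $M=\lfloor\log_{2}(1+n/n_{0})\rfloor$. The function name `geometric_schedule` is hypothetical. Integer arithmetic is used to avoid floating-point errors in the logarithm.

```python
def geometric_schedule(n0, n):
    """Return [n_0, ..., n_M] with n_m = n0 * 2**m and M = floor(log2(1 + n / n0)).

    M is the largest integer with 2**M <= 1 + n / n0, i.e. n0 * 2**M <= n0 + n.
    """
    if n0 < 1 or n < 1:
        raise ValueError("n0 and n must be positive integers")
    big_m = 0
    while n0 * 2 ** (big_m + 1) <= n0 + n:
        big_m += 1
    return [n0 * 2 ** m for m in range(big_m + 1)]


schedule = geometric_schedule(n0=100, n=10_000)
# Samples collected before the last model is learned: rollouts 0, ..., M - 1.
used = sum(schedule[:-1])
print(schedule, used)
```

Under this schedule, the first $M$ rollouts use $n_0(2^M-1)$ samples, which is at most $n$ by the choice of $M$.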
Step: Obtaining a recurrence relation for
By Assumption 5.1, there exist and such that
[TABLE]
Replacing in Eq. (38) thus entails
[TABLE]
Step: Solving the recurrence relation for
We then show by induction that satisfies
[TABLE]
for some that will be specified by the analysis.
Base step Since losses take values in , . Equation (40) is thus satisfied for as soon as .
Induction step Let . We assume that and prove Equation (40) for . Using Eq. (39), we have
[TABLE]
Now, we show that both terms inside the maximum can be upper-bounded by as soon as is large enough. On the one hand, if , we have
[TABLE]
On the other hand, if , we also have
[TABLE]
Combining the above two upper-bounds with (41) concludes the induction step under the condition
[TABLE]
Step: conclusion
Finally, setting the above value for we proved that for all , we have
[TABLE]
where the last equality is by substituting the values of and from (38). For the final step , this yields
[TABLE]
This concludes the first part of the proof.
Regret upper-bound
To upper bound the cumulative regret, using , we write
[TABLE]
where
[TABLE]
Then, computing the sum for , we have
[TABLE]
Using that , we finally obtain
[TABLE]
∎
Appendix C Additional discussions on the Hölderian Bound Assumption 5.1
In this appendix, we discuss Assumption 5.1 on different particular examples.
C.1 Verification of the assumption on a toy example with Gaussian families
We consider the setting of Example 3.1. In the latter, the policies are Gaussian of the form and the loss is defined by where . There is no loss in generality in assuming . Then, we can compute
[TABLE]
We recall that we are interested in verifying the existence of and for which Assumption 5.1 holds, that is in this case for any :
[TABLE]
which may be re-written here as
[TABLE]
The latter is satisfied for any as soon as is a bounded interval. Note that the constant may decrease exponentially fast as the diameter of increases. To illustrate the existence of such couples , we plot in Fig. 8 different values of the following ratio
[TABLE]
The value of can be found for different values of in Fig. 8 by taking .
Higher values of induce faster rates and lower values of induce worse constant terms in the excess risk and regret bounds. Finally, note that SCRM does not need those parameters to run and those two parameters are automatically calibrated by SCRM to find the best trade-off.
C.2 Discussion of Assumption 5.1 for Exponential Families
In this section, we consider a more realistic example in which policies belong to an exponential family. That is, we assume that the policies are parameterized by a parameter and can be written in the form:
[TABLE]
for some known function and sufficient statistic . Here, is a normalization constant, so that . We provide in Example C.1 a concrete example considered by (Swaminathan & Joachims, 2015a; Faury et al., 2020). To ease the notation, we removed here the dependency on contexts, but the generalization to contextual policies can be made similarly. The importance weight ratio may be written as,
[TABLE]
To verify Assumption 5.1, we need to upper bound their variance, which we shall write as,
[TABLE]
Now, computing the moment generating function (MGF) of the statistic
[TABLE]
the variance term may be written as
[TABLE]
This eventually leads us to
[TABLE]
We now discuss two cases that are used for discrete actions (Swaminathan & Joachims, 2015a) and continuous actions (Kallus & Zhou, 2018; Zenati et al., 2020).
Bounded sufficient statistic
Supposing that there exists an upper bound such that , the Cauchy-Schwarz inequality states that , which entails
[TABLE]
Assuming that the parameter space is compact, i.e., , there exists a constant that depends on and such that this may be further upper-bounded as
[TABLE]
Therefore, Assumption 5.1 is implied by
[TABLE]
The latter is implied by a local version of strong convexity for (d’Aspremont et al., 2021), and holds with for .
Example C.1.
For discrete actions , we consider, as in (Swaminathan & Joachims, 2015a) and (Faury et al., 2020), policies where given a context x, probabilities of sampling an action are given by
[TABLE]
The function is typically a feature map associated to a kernel in a RKHS. In this case, the natural parameter and the sufficient statistic may be written as
[TABLE]
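As an illustration of Example C.1, the sketch below assumes the usual softmax form, in which the probability of an action is proportional to the exponential of the inner product between the parameter and the feature vector of the context-action pair. The feature vectors, parameter values, and function names are hypothetical.

```python
import math


def softmax_policy(theta, features):
    """Action probabilities proportional to exp(<theta, phi(x, a)>).

    features holds one feature vector phi(x, a) per action a, for a fixed context x.
    """
    scores = [sum(t * f for t, f in zip(theta, phi)) for phi in features]
    top = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - top) for s in scores]
    normalizer = sum(exps)
    return [e / normalizer for e in exps]


def importance_weights(theta, theta_log, features):
    """Ratio between the evaluated policy and the logging policy, per action."""
    p_eval = softmax_policy(theta, features)
    p_log = softmax_policy(theta_log, features)
    return [pe / pl for pe, pl in zip(p_eval, p_log)]


# Hypothetical features for three actions in a two-dimensional feature space.
features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
theta_log = [0.0, 0.0]   # uniform logging policy
theta = [0.5, -0.2]
print(softmax_policy(theta, features))
print(importance_weights(theta, theta_log, features))
```

The importance weights have expectation one under the logging policy, so their variance, which Assumption 5.1 controls, is their second moment minus one.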
Lognormal and Normal distributions
For normal and lognormal distributions with fixed variance as considered by (Kallus & Zhou, 2018; Zenati et al., 2020), the normalizing constant writes , and we then obtain that:
[TABLE]
which gives:
[TABLE]
In that case, for a bounded parameter space it is again possible to linearize , and to consider losses that satisfy: for all , there exists an optimal such that
[TABLE]
Again, this holds generally for and for locally strongly convex losses for .
Appendix D Experiment details
D.1 Code
All the code to reproduce figures and tables is available in the following repository: https://github.com/criteo-research/sequential-conterfactual-risk-minimization.
D.2 Empirical settings details
Pricing
The pricing application in (Demirer et al., 2019) considers a "personalized pricing" setting where given contexts , prices (which are the actions) need to be predicted to maximize the revenue:
[TABLE]
where and is akin to an unknown context-specific demand function. The data generating process uses contexts for a positive integer. Only dimensions however affect the demand, that is if we write . The price is generated from a Gaussian logging policy centered at . We consider in our example the quadratic function and as in the original paper.
Advertising
The advertising simulation in (Zenati et al., 2020) consists in predicting the potential of a user that may be compared to their a priori responsiveness to a treatment. The potential is caused by an unobserved random group variable in (groups of "high" or "low" potential users in their responsiveness) that influences the context of users. The goal is then to find a policy that maximizes reward by adapting to an unobserved potential. The potentials are normally distributed conditionally on the group index, where and or for two groups. The observed reward is then a function of the action and the context through the associated potential of the user . The reward function mimics reward over the offline continuous bidding dataset in (Zenati et al., 2020) with the form:
[TABLE]
The logging policy is a lognormal distribution, as is common in advertising applications (Bottou et al., 2013). In particular, as in (Zenati et al., 2020), where the mean and the variance .
Yeast, Scene, TMC2007
We follow (Swaminathan & Joachims, 2015a) and briefly recall the setup. The problem is a binary multilabel classification with potential labels. All models are parametrized by . The baseline (resp. skyline) is a supervised, full-information model with the same parameter space as the CRM methods, trained on 5% (resp. 100%) of the training data. Our main modification is to consider the class of probabilistic policies that satisfy Assumption 5.1 by predicting actions in an epsilon-greedy fashion (Sutton & Barto, 1998): where . The loss is the Hamming loss (number of incorrectly assigned labels, both false positives and false negatives, in the action vector):
[TABLE]
where (resp. ) is the -th component of the label vector (resp. action vector) of line . A uniform policy will thus evaluate at a loss of .
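The loss described above can be sketched as follows, taking the Hamming loss as an unnormalized count of mismatched labels as in the text. The exact form of the epsilon-greedy policy is not legible in this version of the text, so the mixture used below (uniform binary vector with probability epsilon, greedy prediction otherwise) is only one possible reading, and all function names are hypothetical.

```python
import itertools
import random


def hamming_loss(labels, action):
    """Number of positions where the action vector disagrees with the label vector."""
    return sum(1 for y, a in zip(labels, action) if y != a)


def epsilon_greedy_action(greedy, epsilon, rng):
    """With probability epsilon draw a uniform binary vector, else play the greedy one."""
    if rng.random() < epsilon:
        return tuple(rng.randint(0, 1) for _ in greedy)
    return tuple(greedy)


def epsilon_greedy_propensity(action, greedy, epsilon):
    """Probability of the action under the mixture policy defined above."""
    uniform = 1.0 / 2 ** len(greedy)
    return epsilon * uniform + (1.0 - epsilon) * (tuple(action) == tuple(greedy))


def uniform_policy_risk(labels):
    """Exact expected Hamming loss of a uniform policy over all binary action vectors."""
    all_actions = list(itertools.product((0, 1), repeat=len(labels)))
    return sum(hamming_loss(labels, a) for a in all_actions) / len(all_actions)


labels = (1, 0, 1, 1)
rng = random.Random(0)
action = epsilon_greedy_action((1, 0, 0, 1), epsilon=0.1, rng=rng)
print(action, hamming_loss(labels, action), uniform_policy_risk(labels))
```

With this unnormalized count, enumerating all binary action vectors gives an expected loss equal to half the number of labels for the uniform policy, since each label is wrong with probability one half.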
D.3 Implementation details
Counterfactual methods
In this paragraph, we start by detailing the non-adaptive counterfactual risk minimization method that we compare to in this work.
We also provide the grid of hyperparameters evaluated for the CRM and SCRM methods.
Batch Bandits
Let be a bounded positive definite Kernel associated to a RKHS , is the feature map such that for any . Context-actions pairs are written as and denoting the history of all context-actions pairs seen up until the end of batch . is the kernel matrix of all context-actions seen until the end of the batch . Eventually, is the kernel column vector of size . denotes the vector of concatenated rewards observed up until the end of the batch .
At a batch , a context is sampled for , and then to sample an action , the following decision rule is applied:
[TABLE]
In batch Kernel UCB, is defined as
[TABLE]
where
[TABLE]
and is a theoretical parameter that is set to in practical heuristics (Lattimore & Szepesvari, 2019). In SBPE (Han et al., ), is defined directly as
[TABLE]
SBPE (Han et al., ) uses a linear modelling, therefore we used a linear kernel. For the Kernel UCB (Valko et al., 2013) method, we used Gaussian and Polynomial kernels in our experiments. Note also that no regularization parameter is used in SBPE so we set in our experiments, and for K-UCB we chose in the grid .
Note in particular that we adapted the batch bandit baselines to the CRM setting by using the logged dataset at initialization to set the Gram matrix as well as the reward vector with information from the logging data. This modification changes the original methods, which take random actions at initialization.
Finally, the baselines were carefully optimized using the JAX library (https://github.com/google/jax) to allow for just-in-time compilation of algebraic blocks in both methods and to maximize their scaling capacity.
RL baselines
In order to compare our method to the two known off-policy online RL algorithms PPO (Schulman et al., 2017) and TRPO (Schulman et al., 2015), we do the following:
1. We use the stable_baselines3 (Raffin et al., 2021) library for the implementation. When necessary, we call the PPO or TRPO model multiple times, so that the buffer size increases geometrically.
2. We initialize the ActorCriticPolicy with a simpler MLP model having only one layer with an output dimension of 1 (with argument net_arch=[1], which is mathematically the same modelling as in the CRM and SCRM baselines).
3. At the initial step only, and to enable a fair comparison with counterfactual methods that use a logging dataset, we pretrain the RL policies to imitate the actions sampled from the logging policy: we run multiple steps of the Adam optimizer, minimizing a loss that is the sum of two terms:
   - an MSE term between the action sampled by the ActorCriticPolicy for the contexts in the instances and the actions sampled by the logging policy;
   - the ENTROPY term, which guarantees a minimum of exploration in order to initialize the RL algorithm ().
4. We combine the last two terms in a linear combination with hyperparameters being tuned a posteriori, i.e. with the hyperparam
Appendix E Additional empirical results
E.1 SCRM compared to CRM
We provide here the additional plot in the Pricing setting.
E.2 Evaluation of IPS-IX
We provide here the plots for the whole setting considered in policy evaluation with IPS-IX.
E.3 Exploration/Exploitation tradeoff
In this part we give the details used for the experiment described in Section 6.3. We consider again Example 3.1 with the Gaussian parametrized policies and a loss where with . Recall that . We consider a grid of and consider . Our experiment aims at illustrating the influence of sequential exploration that is an important detail of the SCRM and CRM principles.
References
- Audibert et al. (2007) Audibert, J.-Y., Munos, R., and Szepesvari, C. Tuning bandit algorithms in stochastic environments. In International Conference on Algorithmic Learning Theory, 2007.
- Auer et al. (2002) Auer, P., Cesa-Bianchi, N., and Fischer, P. Finite-time analysis of the multiarmed bandit problem. Machine Learning, 47:235–256, 2002.
- Bakshy et al. (2018) Bakshy, E., Dworkin, L., Karrer, B., Kashin, K., Letham, B., Murthy, A., and Singh, S. Ae: A domain-agnostic platform for adaptive experimentation. 2018.
- Bartlett & Mendelson (2006) Bartlett, P. L. and Mendelson, S. Empirical minimization. Probability Theory and Related Fields, 2006.
- Becker et al. (2011) Becker, S. R., Candès, E. J., and Grant, M. C. Templates for convex cone problems with applications to sparse signal recovery. Mathematical Programming Computation, 3(3):165–218, 2011.
- Bibaut et al. (2021) Bibaut, A., Kallus, N., Dimakopoulou, M., Chambaz, A., and van der Laan, M. Risk minimization from adaptively collected data: Guarantees for supervised and policy learning. In Ranzato, M., Beygelzimer, A., Dauphin, Y., Liang, P., and Vaughan, J. W. (eds.), Advances in Neural Information Processing Systems, volume 34, pp. 19261–19273. Curran Associates, Inc., 2021.
- Bolte et al. (2007) Bolte, J., Daniilidis, A., and Lewis, A. The Łojasiewicz inequality for nonsmooth subanalytic functions with applications to subgradient dynamical systems. SIAM Journal on Optimization, 17(4):1205–1223, 2007.
- Bottou et al. (2013) Bottou, L., Peters, J., Quiñonero Candela, J., Charles, D. X., Chickering, D. M., Portugaly, E., Ray, D., Simard, P., and Snelson, E. Counterfactual reasoning and learning systems: The example of computational advertising. Journal of Machine Learning Research (JMLR), 14(1):3207–3260, 2013.
