A Python Library For Empirical Calibration

Xiaojing Wang; Jingang Miao; Yunting Sun

arXiv:1906.11920·stat.CO·July 29, 2019


TL;DR

This paper introduces a Python library called EC that efficiently computes empirical calibration weights to correct data biases in survey sampling and observational studies, offering a flexible and robust tool for statistical bias correction.

Contribution

The paper presents a new Python library EC for empirical calibration, formulated as convex optimization, with enhanced efficiency, robustness, and usability features compared to existing software.

Findings

01. EC is more efficient than existing software.

02. EC supports various optimization objectives and inexact calibration.

03. Demonstrated effectiveness on simulated and real-world data.

Abstract

Dealing with biased data samples is a common task across many statistical fields. In survey sampling, bias often occurs due to unrepresentative samples. In causal studies with observational data, the treated versus untreated group assignment is often correlated with covariates, i.e., not random. Empirical calibration is a generic weighting method that presents a unified view on correcting or reducing the data biases for the tasks mentioned above. We provide a Python library EC to compute the empirical calibration weights. The problem is formulated as convex optimization and solved efficiently in the dual form. Compared to existing software, EC is both more efficient and robust. EC also accommodates different optimization objectives, supports weight clipping, and allows inexact calibration, which improves usability. We demonstrate its usage across various experiments with both simulated…

Tables

Table 1: Performance benchmarks of empirical calibration software packages. Run time in milliseconds.

Package	Mean	Min	Max
ebal	30.7	26.3	121.7
CVXR	102.2	88.4	140.6
EC	1.4	1.3	2.1

Table 2: Causal methods with observational data.

	Discrete weights	Continuous weights
By raw covariates	Raw matching	Empirical calibration
By propensity scores	Propensity score matching	Propensity score weighting

Equations

Empirical calibration problem:

$$
\begin{aligned}
\underset{\mathbf{w}}{\text{minimize}} \quad & L(\mathbf{w}) \\
\text{subject to} \quad & \sum_{i=1}^{n} w_{i} c_{j}(X_{i}) = \bar{c}_{j}, \quad j = 1, \dots, p, \\
& \sum_{i=1}^{n} w_{i} = 1, \\
& w_{i} \geq 0, \quad i = 1, \dots, n.
\end{aligned}
$$

Entropy loss:

$$L(\mathbf{w}) = \sum_{i=1}^{n} w_{i} \log(w_{i}),$$

Quadratic loss:

$$L(\mathbf{w}) = \frac{1}{2} \sum_{i=1}^{n} w_{i}^{2}.$$

Effective sample size:

$$\frac{\left(\sum_{i=1}^{n} w_{i}\right)^{2}}{\sum_{i=1}^{n} w_{i}^{2}}.$$

Sample average treatment effect on the treated (ATT):

$$\frac{1}{n_{1}} \sum_{i: T_{i}=1} \left[Y_{i}(1) - Y_{i}(0)\right],$$

Empirical calibration estimator:

$$\hat{\tau}_{EC} = \frac{1}{n_{1}} \sum_{i: T_{i}=1} Y_{i} - \sum_{i: T_{i}=0} w_{i} Y_{i},$$

Kang-Schafer simulation:

$$(Z_{1}, Z_{2}, Z_{3}, Z_{4}) \sim N(0, I_{4}).$$

$$Y = 210 + 27.4 Z_{1} + 13.7 Z_{2} + 13.7 Z_{3} + 13.7 Z_{4} + \epsilon,$$

$$\Pr(T = 1 \mid Z) = \operatorname{expit}(-Z_{1} + 0.5 Z_{2} - 0.25 Z_{3} - 0.1 Z_{4}).$$

Misspecified covariates: $X_{i1}$, $X_{i2}$, $X_{i3}$, $X_{i4}$.

Vector form of the quadratic problem:

$$
\begin{aligned}
\underset{\mathbf{w}}{\text{minimize}} \quad & \frac{1}{2} \mathbf{w}^{T} \mathbf{w} \\
\text{subject to} \quad & Z^{T} \mathbf{w} = \mathbf{0}_{p}, \\
& \mathbf{1}_{n}^{T} \mathbf{w} = 1, \\
& \mathbf{w} \geq \mathbf{0}_{n},
\end{aligned}
$$

Lagrangian:

$$L = \frac{1}{2} \mathbf{w}^{T} \mathbf{w} - \beta^{T} Z^{T} \mathbf{w} - \theta (\mathbf{1}_{n}^{T} \mathbf{w} - 1) - \alpha^{T} \mathbf{w}.$$

Weight link functions:

$$w_{i} = Z_{i}^{T} \beta + \theta + \alpha_{i}.$$

$$w_{i} (w_{i} - Z_{i}^{T} \beta - \theta) = 0,$$

$$w_{i} = \max\{0,\ Z_{i}^{T} \beta + \theta\}.$$

$$w_{i} = e^{Z_{i}^{T} \beta + \theta}.$$

Bounded weight link functions:

$$w_{i} = \min\{u,\ \max\{l,\ Z_{i}^{T} \beta + \theta\}\},$$

$$w_{i} = \min\{u,\ \max\{l,\ e^{Z_{i}^{T} \beta + \theta}\}\}.$$

Inexact calibration:

$$\lVert Z^{T} \mathbf{w} \rVert_{2} \leq \epsilon,$$

$$
\begin{aligned}
\underset{\mathbf{w}}{\text{minimize}} \quad & \frac{1}{2} \mathbf{w}^{T} \mathbf{w} \\
\text{subject to} \quad & Z^{T} \mathbf{w} + \Delta_{1} - \Delta_{2} = \mathbf{0}_{p}, \\
& \mathbf{1}_{n}^{T} \mathbf{w} = 1, \\
& \mathbf{w} \geq \mathbf{0}_{n}, \\
& \epsilon^{2} - \Delta_{1}^{T} \Delta_{1} - \Delta_{2}^{T} \Delta_{2} \geq 0, \\
& \Delta_{1} \geq \mathbf{0}_{p}, \\
& \Delta_{2} \geq \mathbf{0}_{p}.
\end{aligned}
$$

$$L = \frac{1}{2} \mathbf{w}^{T} \mathbf{w} - \beta^{T} (Z^{T} \mathbf{w} + \Delta_{1} - \Delta_{2}) - \theta (\mathbf{1}_{n}^{T} \mathbf{w} - 1) - \alpha^{T} \mathbf{w} - \lambda \left(\frac{1}{2} \epsilon^{2} - \frac{1}{2} \Delta_{1}^{T} \Delta_{1} - \frac{1}{2} \Delta_{2}^{T} \Delta_{2}\right) - \gamma_{1}^{T} \Delta_{1} - \gamma_{2}^{T} \Delta_{2}.$$

Terms appearing in the stationarity conditions: $\lambda \Delta_{1}$, $\lambda \Delta_{2}$.


Taxonomy

Topics: Advanced Causal Inference Techniques · Statistical Methods and Inference · Statistical Methods and Bayesian Inference

Full text

A Python Library for Empirical Calibration

Xiaojing Wang, Jingang Miao, Yunting Sun

Google

Abstract

Dealing with biased data samples is a common task across many statistical fields. In survey sampling, bias often occurs due to unrepresentative samples. In causal studies with observational data, the treated versus untreated group assignment is often correlated with covariates, i.e., not random. Empirical calibration is a generic weighting method that presents a unified view on correcting or reducing the data biases for the tasks mentioned above. We provide a Python library EC to compute the empirical calibration weights. The problem is formulated as convex optimization and solved efficiently in the dual form. Compared to existing software, EC is both more efficient and robust. EC also accommodates different optimization objectives, supports weight clipping, and allows inexact calibration, which improves usability. We demonstrate its usage across various experiments with both simulated and real-world data.

1 Introduction

Biased data samples are prevalent in statistical problems. In survey sampling, unrepresentative data can occur when certain sub-populations are under- or over-represented in the sample. For example, researchers may decide to over-sample females in a marketing study of cosmetics using differential sampling probabilities, which, if not accounted for, would lead to selection bias. Further, once invited to join the study with a cash incentive, richer individuals may be less likely to participate, and the differential response rates may cause self-selection bias. To guard against such biases, sampling weights, non-response weights, and calibration weights are used at different stages of survey data adjustment. In causal inference with observational data, the treated versus untreated group assignment is often correlated with covariates, i.e., not random. Weighting can be applied to untreated individuals such that the reweighted untreated group better mimics the control group in a randomized controlled experiment.

Weighting is a generic bias correction technique that can take different forms. One is inverse probability weighting, where the inverse of probability/propensity of being included in the sample or receiving the treatment is used as weights. Such probability is either known or estimated from a propensity model. Popular examples include the Horvitz-Thompson estimator (Horvitz & Thompson, 1952) in survey sampling and inverse propensity score weighting (Rosenbaum & Rubin, 1983) in causal inference.

Empirical calibration takes a different form, where instead of trying to summarize the systematic differences with one probability/propensity, one seeks weights that directly balance out the covariates but deviate least from the uniform weights to reduce the inflation of variance.

Solving for the empirical calibration weights can be formulated as a convex optimization problem. Suppose there are $n$ biased data points $\{X_{i}\}_{i=1}^{n}$ and $p$ marginal constraints $\{\bar{c}_{j}\}_{j=1}^{p}$. The goal is to find an $n$-dimensional weight vector ${\mathbf{w}}$ that is the least unequal while subject to the weight normalization and marginal balance constraints:

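$$
\begin{aligned}
\underset{\mathbf{w}}{\text{minimize}} \quad & L(\mathbf{w}) \\
\text{subject to} \quad & \sum_{i=1}^{n} w_{i} c_{j}(X_{i}) = \bar{c}_{j}, \quad j = 1, \dots, p, \\
& \sum_{i=1}^{n} w_{i} = 1, \\
& w_{i} \geq 0, \quad i = 1, \dots, n,
\end{aligned}
$$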

where $L$ is the convex loss function and $c_{j}(X_{i})$ is the $j$ -th transformation.

There are variations of the unequalness definition, which determines the loss function $L$ . Hainmueller (2012) proposed entropy balancing in the context of causal inference with observational data. It corresponds to the entropy loss

$$L(\mathbf{w}) = \sum_{i=1}^{n} w_{i} \log(w_{i}),$$

which can be viewed, up to an additive constant, as the Kullback-Leibler divergence between ${\mathbf{w}}$ and uniform weights. Alternatively, Deville & Särndal (1992) used the Euclidean or squared distance between ${\mathbf{w}}$ and some base weights, typically uniform weights. It is equivalent to a quadratic loss

$$L(\mathbf{w}) = \frac{1}{2} \sum_{i=1}^{n} w_{i}^{2}.$$
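A short worked derivation may help connect the two distance-based descriptions above to the stated losses. It assumes the weights are normalized ($\sum_{i} w_{i} = 1$) and writes $\mathbf{u}$ for the uniform weights $u_{i} = 1/n$, a symbol introduced here only for illustration. Under that assumption, each distance differs from the corresponding loss by a constant that does not depend on ${\mathbf{w}}$:

```latex
\begin{aligned}
\mathrm{KL}(\mathbf{w} \,\|\, \mathbf{u})
  &= \sum_{i=1}^{n} w_{i} \log\frac{w_{i}}{1/n}
   = \sum_{i=1}^{n} w_{i} \log(w_{i}) + \log n, \\
\frac{1}{2} \lVert \mathbf{w} - \mathbf{u} \rVert_{2}^{2}
  &= \frac{1}{2} \sum_{i=1}^{n} w_{i}^{2} - \frac{1}{n} \sum_{i=1}^{n} w_{i} + \frac{1}{2n}
   = \frac{1}{2} \sum_{i=1}^{n} w_{i}^{2} - \frac{1}{2n}.
\end{aligned}
```

Minimizing either distance over the feasible set therefore gives the same weights as minimizing the corresponding loss.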

Further, minimizing this quadratic loss can be seen as maximizing the effective sample size (Kish, 1965):

$$\frac{\left(\sum_{i=1}^{n} w_{i}\right)^{2}}{\sum_{i=1}^{n} w_{i}^{2}}.$$
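The effective sample size is simple to compute directly. The following is a minimal sketch in plain Python; the function name `effective_sample_size` is chosen here for illustration and is not part of EC.

```python
def effective_sample_size(weights):
    """Kish effective sample size: (sum of w)^2 / (sum of w^2).

    Illustrative helper, not part of the EC library.
    """
    total = sum(weights)
    total_sq = sum(w * w for w in weights)
    if total_sq == 0:
        raise ValueError("at least one weight must be non-zero")
    return total * total / total_sq


# Uniform weights keep the full sample; concentrating weight shrinks it.
uniform = [0.25, 0.25, 0.25, 0.25]
skewed = [0.7, 0.1, 0.1, 0.1]
print(effective_sample_size(uniform))
print(effective_sample_size(skewed))
```

For normalized weights the numerator equals 1, so the effective sample size is $1 / \sum_{i} w_{i}^{2}$, which is largest exactly when the quadratic loss is smallest.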

More broadly, both entropy and quadratic losses can be considered as special cases of the empirical likelihood (Owen, 2001).

We provide a Python library EC that supports both entropy and quadratic loss functions. When there is no feasible solution to satisfy the balance constraints, EC allows for inexact constraints, which improves robustness and usability. For some applications, it can be desirable to bound the weights to a certain range; EC has a built-in weight restraint option, which is superior to post-optimization clipping or winsorization.

This paper is structured as follows. Section 2 briefly presents EC’s interface. In section 3, we give a short tour of various applications of empirical calibration, and present experimental results on both simulated and real-world data. Some implementation issues are discussed in section 4. Concluding remarks are given in section 5.

2 Software

The EC library is available at https://github.com/google/empirical_calibration and can be imported as

    import empirical_calibration as ec

The primary interface is the function ec.calibrate:

    def calibrate(covariates: np.ndarray,
                  target_covariates: np.ndarray,
                  target_weights: np.ndarray = None,
                  autoscale: bool = False,
                  objective: Objective = ec.Objective.ENTROPY,
                  max_weight: float = 1.0,
                  l2_norm: float = 0) -> Tuple[np.ndarray, bool]:

Its arguments are

covariates: $X$, covariates to be calibrated. All values must be numeric. For categorical values, the from_formula function is often more convenient.

target_covariates: covariates to be used as target in calibration. The sum of each column is a marginal constraint ($\bar{c}_{j}$). The number of columns should match covariates.

target_weights: weights for target_covariates. These are needed when the target_covariates themselves have weights. Its length must equal the number of rows in target_covariates. If None, equal weights are used.

autoscale: whether to scale covariates to [0, 1] and apply the same scaling to target covariates. Setting it to True might help improve numerical stability.

objective: the objective of the convex optimization problem. Supported values are ec.Objective.ENTROPY ($\sum_{i=1}^{n} w_{i} \log(w_{i})$) and ec.Objective.QUADRATIC ($\frac{1}{2} \sum_{i=1}^{n} w_{i}^{2}$).

max_weight: the upper bound on weights. Must be between the uniform weight (1 / number of rows in covariates) and 1.0.

l2_norm: $\epsilon$, the L2 norm of the covariates balance constraint; i.e., the Euclidean distance between the weighted mean of covariates and the simple mean of target covariates after balancing.

It returns a tuple (weights, success), where

weights: the weights for the subjects. They should sum up to 1.

success: whether the constrained optimization succeeds.

The entropy objective is a common choice for causal inference problems, while the quadratic objective is more widely used in the context of survey calibration. One practical difference is that the entropy objective yields strictly positive weights, while the quadratic objective allows zero weights, in which case zero-weight data points are effectively discarded in subsequent analysis. The given constraints may yield no feasible solution, in which case one can either drop some constraints or soften the hard constraints by specifying a maximal distance allowed between the weighted covariates and the target covariates via the l2_norm argument. Further, instead of manually specifying the soft constraint, one can use ec.maybe_exact_calibrate, which automatically chooses the smallest soft margin that yields a feasible solution. Also provided is the from_formula interface, which is parallel to the functions above with covariates and target_covariates replaced with three arguments: a formula used to generate the design matrix (no outcome variable allowed), the data to be calibrated, and the data containing the target.

The formula follows patsy style so that one can conveniently construct square terms, interaction terms, etc. For example, formula='~ -1 + x1 + x2 ** 2 + x3:x4' tells EC to match the first moment of $x1$, the second moment of $x2$, as well as the interaction between $x3$ and $x4$. This saves the user the trouble of manually constructing these quantities. We illustrate the usage of these interfaces in the Applications section below.

3 Applications

3.1 Survey Calibration

Surveys or samples are drawn to make inferences about the target population, which is the bread and butter of statistics. A valid inference is only possible when the survey or sample is representative of the target population. Survey weighting is commonly used to correct for unequal sampling probability, differential response rates, and coverage errors (Deville & Särndal, 1992; Kott, 2006). Calibration, which is often one of the steps in survey weighting, makes use of information about the target population that is available after the sample is collected (auxiliary information). For example, for a survey of internet users, such information may be found in the Current Population Survey Computer and Internet Use Supplement (Ryan & Lewis, 2017), including distributions of internet users’ gender, age, education, and household income. Calibration aims to make the sample more representative of the target population. Popular methods of using auxiliary information include ratio estimation (Fuller, 2011), post-stratification or raking (Little, 1993), and calibration estimation (Deville & Särndal, 1992). Weighted samples can be used to estimate the population average of survey response. The weights are often constructed such that weighted moments — typically the first two moments — agree with known population benchmarks. The benchmarks could come from census data or other large scale and high-quality surveys. Meanwhile, the calibration weights are chosen to least deviate from the base weights, which account for sampling design and differential non-response. In the simplest form, the base weights are uniform.

3.1.1 Estimate population mean

A common use case of surveys is to estimate the mean or total of a quantity. We replicate the direct standardization (Fleiss et al., 2003) example by Fu et al. (2018). Data is obtained from the CVXR package (Fu et al., 2017), where dspop contains 1000 rows with columns $\text{sex} \sim \text{Bernoulli}(0.5)$, $\text{age} \sim \text{Uniform}(10, 60)$, and $y_{i} \sim N(5 \times \text{sex}_{i} + 0.1 \times \text{age}_{i}, 1)$. dssamp contains a skewed sample of 100 rows with small values of $y$ over-represented, thus biasing its distribution downwards.

    import numpy as np
    import empirical_calibration as ec

    # dssamp (skewed sample) and dspop (population) are the data frames
    # from the CVXR package described above.
    cols = ['sex', 'age']
    # Calibrate the sample to the population on sex and age; the margin
    # is relaxed automatically if exact calibration is infeasible.
    weights, _ = ec.maybe_exact_calibrate(
        covariates=dssamp[cols],
        target_covariates=dspop[cols],
        objective=ec.Objective.ENTROPY
    )
    print('True mean of y: {}'.format(dspop['y'].mean()))
    print('Unweighted sample mean: {}'.format(dssamp['y'].mean()))
    print('Weighted mean: {}'.format(np.average(dssamp['y'], weights=weights)))

The true mean of $y$ based on the data generating process is 6.0. Using the generated population of size 1000 and a sample of size 100 contained in the CVXR package, the population mean is 6.01, but the mean of the skewed sample is 3.76, which is a gross underestimation. With empirical calibration, however, the weighted mean is 5.82, which is closer to the population mean.

The kernel density plots (Figure 1) show that the unweighted curve is biased toward smaller values of $y$ , but the weights help recover the true density.

3.2 Observational Data with a Binary Treatment

Empirical calibration is also applicable in observational studies with a binary treatment. It aims to achieve covariate balance by calibrating the untreated group against the treated group. It assigns non-negative weights to individual control units such that certain moments — typically means — of covariates between the treatment group and the untreated group are matched. The weighted untreated units then mimic a control sample as if it was from a randomized experiment. We follow the potential outcome language of Rubin causal model to describe the inference problem with a binary treatment. Suppose each unit $i$ is associated with a pair of potential outcomes: $Y_{i}(1)$ if treated ( $T_{i}=1$ ) and $Y_{i}(0)$ if untreated ( $T_{i}=0$ ). The treatment effect for this unit is defined as $\tau_{i}=Y_{i}(1)-Y_{i}(0)$ . Inference of $\tau_{i}$ ’s becomes a missing data problem since the two potential outcomes are never both observed — commonly known as the fundamental problem of causal inference. If $Y_{i}(T_{i})$ is observed, $Y_{i}(1-T_{i})$ is the counterfactual, i.e., what would have been observed if $T_{i}$ did not occur. We limit the discussion to the sample average treatment effect on the treated (ATT) estimand:

$$\frac{1}{n_{1}} \sum_{i: T_{i}=1} \left[Y_{i}(1) - Y_{i}(0)\right], \tag{1}$$

where $n_{1}$ is the number of treated units. The first term in (1) is straightforward to compute using the treated observations. The problem boils down to estimating the second term — the mean of unobserved counterfactuals — from the control observations. In experimental settings, the treatment assignment is independent of the outcome, $Y(1),Y(0)\perp T$ , and one can simply use $\frac{1}{n_{0}}\sum_{i:T_{i}=0}Y_{i}(0)$ as an estimate of $\frac{1}{n_{1}}\sum_{i:T_{i}=1}Y_{i}(0)$ . In observational studies, however, due to the treatment selection bias, the treated group is often systematically different from the control group, rendering the latter two quantities unequal. Conventionally, we assume observed covariates $X$ contain all the information about the selection bias (i.e., there is no unobserved confounding variable, which is a strong assumption), and estimate ATT in two stages: 1) “design” purely based on matching of observed covariates $X$ and 2) outcome analysis of $Y$ . Stage 1 equates or balances the distributions of covariates between the treated and control groups. Matching and weighting are the two main approaches to achieve the balance, see Appendix B for a short tour of common methods. Stage 2 compares the outcomes of the treated and control units, and estimates the causal effect of the treatment. It generally involves some regression adjustments to account for the small residual covariate imbalance between the groups after stage 1. The outcome regression must be able to accept weights if stage 1 is done via weighting. The matching methods and outcome analysis in the two stages have been shown to work best in combination (Rubin, 1973). The intuition is the same as that behind the double-robust estimators (Robins et al., 1994), which are asymptotically unbiased if either the propensity score matching model or the outcome regression model is correctly specified. For empirical calibration weighting, we focus on the affine estimator:

$$\hat{\tau}_{EC} = \frac{1}{n_{1}} \sum_{i: T_{i}=1} Y_{i} - \sum_{i: T_{i}=0} w_{i} Y_{i},$$

where $n_{1}$ is the number of treatment units and $w_{i}$ ’s are the empirical calibration weights that sum up to 1. The first term is the simple average of the observed outcomes for the treated units, and the second term is the weighted average of the observed outcomes for the control units.
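The estimator above is a difference between a simple average and a weighted average, which the following minimal sketch spells out in plain Python. The function name `att_estimate` and its argument layout are assumptions made for illustration; the control weights would come from a calibration step such as ec.calibrate.

```python
def att_estimate(outcomes, treated, control_weights):
    """Mean outcome of treated units minus weighted mean of control units.

    Illustrative helper, not part of the EC library.
    outcomes:        observed outcome Y_i for every unit
    treated:         1 for treated units, 0 for control units
    control_weights: one weight per control unit (in order), summing to 1
    """
    treated_y = [y for y, t in zip(outcomes, treated) if t == 1]
    control_y = [y for y, t in zip(outcomes, treated) if t == 0]
    if not treated_y:
        raise ValueError("need at least one treated unit")
    if len(control_y) != len(control_weights):
        raise ValueError("need exactly one weight per control unit")
    if abs(sum(control_weights) - 1.0) > 1e-8:
        raise ValueError("control weights must sum to 1")
    treated_mean = sum(treated_y) / len(treated_y)
    counterfactual = sum(w * y for w, y in zip(control_weights, control_y))
    return treated_mean - counterfactual


# Two treated units (mean 11) and two control units weighted 0.75 / 0.25.
print(att_estimate([10.0, 12.0, 8.0, 6.0], [1, 1, 0, 0], [0.75, 0.25]))
```

With uniform control weights the estimate reduces to the plain difference in means.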

3.2.1 Kang-Schafer Simulation

Kang et al. (2007) used a simulation study to illustrate the selection bias of outcome under informative non-response. The study became a standard benchmark to compare different estimators for causal estimands. The true set of covariates is generated independently and identically distributed from the standard normal distribution

$$(Z_{1}, Z_{2}, Z_{3}, Z_{4}) \sim N(0, I_{4}).$$

The outcome is generated as

$$Y = 210 + 27.4 Z_{1} + 13.7 Z_{2} + 13.7 Z_{3} + 13.7 Z_{4} + \epsilon,$$

where $\epsilon\sim N(0,1)$ . The propensity score is defined as

$$\Pr(T = 1 \mid Z) = \operatorname{expit}(-Z_{1} + 0.5 Z_{2} - 0.25 Z_{3} - 0.1 Z_{4}).$$

This mechanism produces an equal-sized treated and control group on average. Given the covariates, the outcome is independent of the treatment assignment; thus the true ATT is zero. The overall outcome mean is 210. Due to the treatment selection bias, the outcome mean for the treated group (200) is lower than that of the control group (220). A typical exercise is to examine the performance of an observational method under both correctly specified and misspecified propensity score and/or outcome regression models. Misspecification occurs when the following nonlinear transformation $X_{i}$ ’s are observed in place of the true covariates

[TABLE]
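The data-generating process described above (true covariates, outcome, and propensity score) can be sketched with the standard library alone. The function name `simulate_kang_schafer` and the returned tuple layout are assumptions for illustration, and only the true covariates $Z$ are generated; the misspecified covariates $X$ are left out of this sketch.

```python
import math
import random


def expit(x):
    """Logistic function 1 / (1 + exp(-x))."""
    return 1.0 / (1.0 + math.exp(-x))


def simulate_kang_schafer(n, seed=0):
    """Draw n units as (z, t, y) from the simulation design described above.

    Illustrative helper, not part of the EC library.
    """
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        # True covariates: four independent standard normals.
        z = [rng.gauss(0.0, 1.0) for _ in range(4)]
        # Outcome is linear in z plus standard normal noise.
        y = 210 + 27.4 * z[0] + 13.7 * (z[1] + z[2] + z[3]) + rng.gauss(0.0, 1.0)
        # Treatment is assigned with the stated propensity score.
        p = expit(-z[0] + 0.5 * z[1] - 0.25 * z[2] - 0.1 * z[3])
        t = 1 if rng.random() < p else 0
        rows.append((z, t, y))
    return rows


rows = simulate_kang_schafer(20000, seed=1)
treated_y = [y for _, t, y in rows if t == 1]
control_y = [y for _, t, y in rows if t == 0]
print(sum(treated_y) / len(treated_y), sum(control_y) / len(control_y))
```

The gap between the two printed group means is the selection bias that the weighting is meant to remove, since the outcome does not depend on the treatment in this design.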

To estimate ATT, we use empirical calibration to weight the control group so that it matches the treatment group in terms of their covariate distribution. Simulations with a sample of 1000 are used. Without weights, the bias is 20.2, and RMSE is 20.3. With correctly specified covariates to match ($Z_{1},\dots,Z_{4}$), the bias is almost zero (0.001), and RMSE is 0.09. With incorrectly specified covariates to match ($X_{1},\dots,X_{4}$), the bias is -4.5, and RMSE is 4.6. With matching on additional transformations of incorrectly specified covariates ($X_{1},\dots,X_{4}$ plus interaction terms and log-transformed versions), the bias is -1.8 and RMSE is 2.0. Weighting evidently reduces the bias of the ATT estimate substantially. We also use empirical calibration to estimate the population mean and compare it with the results reported in Kang et al. (2007). The treatment group is weighted so that it matches the population in terms of their covariate distribution. The estimator is the weighted value of $y$ in the treatment group. Again, simulations with a sample of 1000 are used. With correctly specified covariates to match ($Z_{1},\dots,Z_{4}$), the bias is almost zero (0.0003), and RMSE is 1.13, which is better than all the methods described in Kang et al. (2007).

3.2.2 LaLonde Data Analysis

The LaLonde (1986) data is another canonical benchmark in the causal inference literature. It consists of three groups for evaluating the effect of a large-scale job training program, the National Supported Work Demonstration (NSW): an experimental treatment group with 185 observations, an experimental control group with 260 observations, and an observational control group drawn from the Current Population Survey (CPS) with 15,992 observations. The outcome variable is the post-intervention earnings in 1978, and common covariates for all three datasets include age, education, earnings in 1974, earnings in 1975, and binary indicators of being black, being Hispanic, being married, and having completed high school. Using the experimental control, the simple difference-in-means estimate of the average treatment effect on the treated is 1794.3 with a 95% confidence interval of [479.2, 3109.5]. A linear regression with earnings in 1978 as the outcome and the treatment indicator together with other variables as independent variables gives an estimate of 1698 with a 95% confidence interval of [458, 2938], which is narrower than that of the difference-in-means estimate. We apply entropy balancing on the observational control with respect to 52 covariates as described in Hainmueller (2012), and the difference-in-means estimate is 1571, which is identical to what was reported in Hainmueller (2012). Using quadratic balancing, the point estimate is 1713, which is very close to the estimate using the experimental control.

3.3 Geo Experiments

An advertiser must measure the impact and return on investment of its online ad campaigns. A randomized controlled experiment, or A/B test, is considered the gold standard for drawing causal conclusions. A user-level experiment randomly assigns users to the treatment or control group. No ad is served to the control users. The causal effect of advertising can then be estimated by comparing the observed outcomes between the two groups. It can be difficult, however, to maintain the integrity of the user-level group assignment due to issues such as users signing in and out, cookie churn, and cross-device usage. Geo-level experiments offer an attractive alternative to user-level experiments by experimenting at the geographic level. A region is first partitioned into non-overlapping subregions, or simply “geos”, which are then randomly assigned to a control or treatment condition. Each geo realizes its assigned treatment condition through the use of geo-targeted advertising. For example, an online ad campaign is run in the treatment geos but not in the control geos. The behavior changes of customers in the treatment geos, such as website visits and purchases, can then be attributed to the ad campaign. Despite the random assignment of treatment and control geos, the systematic difference between the two groups can still be substantial, because there may only be a small number of highly heterogeneous geos available for experimentation. A common solution is first to pair similar geos together and then apply random assignment within each pair. The issue is that some geos, for example, New York City, are too large to find a matching geo for and cannot be sufficiently balanced out. Empirical calibration can help lessen the imbalance of the geo-level randomization, and thus reduce the bias of the causal effect estimate. It finds weights for the control geos such that the weighted average of the control geo time series matches that of the treatment geos in the pretest period.
The same weights are then applied to control time series in the test period as the treatment counterfactual time series. Empirical calibration does not aggregate time series across pretest or test period as in geo-based regression (GBR) (Vaver & Koehler, 2011), nor does it aggregate data across control geos as in the time-based regression (TBR) (Kerman et al., 2017). It leverages individual time series at each geo for causal inference. Although GBR allows geo level weights in its linear model to account for heteroscedasticity caused by the differences in geo size, unlike in empirical calibration, GBR’s weights are determined in a less principled way — simply the inverse of the pretest period metrics. Empirical calibration shares the spirit of using contemporaneous controls with Bayesian structural time series (Brodersen et al., 2015) models but does not need to specify a time series model, thus is more robust to model misspecification. We demonstrate the usage of empirical calibration for geo experiments using the data found in Google Inc (2017). The goal is to estimate the treatment effect on geo 30.

The unweighted mean of controls is systematically above that of the treatment geo in the pretest period, which suggests that the difference between the unweighted means in the test period is not an indication of any causal effect of the treatment. The weighted mean of controls, on the other hand, matches the treatment geo much better in the pretest period, which leads to a more reasonable counterfactual estimate. Instead of looking at one treatment geo at a time, one can also choose to study the mean of all treatment geos as compared to control geos.

4 Implementation

Stata package Ebalance (Hainmueller & Xu, 2013) and R package ebal (Hainmueller, 2014) find the entropy balancing weights using the algorithm proposed by Hainmueller (2012). The algorithm exploits the dual form of convex optimization, where the primal weights can be expressed as a log-linear function of the covariates specified in the moment conditions. The dual problem is more tractable than the primal problem because it is unconstrained and the dimensionality reduces to the number of marginal constraints. A Levenberg–Marquardt scheme is employed to update the dual solution iteratively. When the problem is feasible, the algorithm is globally convergent. R package survey offers some overlapping functionality with our Python implementation. Calling survey::calibrate with the argument calfun="linear" is equivalent to using ec.calibrate with a quadratic objective. If no upper or lower bounds are placed on the weights, a closed-form solution is used; however, negative weights are possible. If, on the other hand, bounds are set, then an iterative algorithm is used instead. One advantage of our implementation is that survey::calibrate would fail when there is no feasible solution, while ec.calibrate allows gradual relaxation of the constraints until a feasible solution can be found. Alternatively, both the entropy balancing and quadratic balancing weights can be solved by general-purpose convex optimization software. Python library CVXPY (Diamond & Boyd, 2016) and R package CVXR (Fu et al., 2017) are two common choices, which provide domain-specific modeling languages and interface with open-source solvers such as ECOS (Domahidi et al., 2013) and SCS (O’Donoghue et al., 2016). Our implementation mostly follows the entropy balancing dual solution. We also express the primal weights as a function of the covariates and Lagrangian multipliers.
The function is log-linear for entropy balancing and linear with a ReLU activation for quadratic balancing. Instead of relying on the Levenberg–Marquardt update, we find the dual solution by solving a set of nonlinear equations with SciPy's scipy.optimize.fsolve; see Appendix A for details. There may be no feasible solution when the sample size is small or the number of imposed constraints is large. In the context of observational studies with a binary treatment, it is particularly common to have no feasible solution when there is limited overlap between the treated and untreated covariate distributions. Instead of forcing users to drop certain marginal constraints altogether, we provide an option to relax the hard constraints. In our experience, a small constraint relaxation often leads to feasible solutions without too much degradation of the weights, significantly improving the numerical computation experience. In other applications, it might be desirable to restrict the weights to a particular range. The conventional solution is to apply weight clipping or winsorization in a post-processing step, but this leads to a violation of the weight normalization constraint and sub-optimal weights. Our implementation allows the weight bounds to be imposed directly as an additional constraint in the optimization problem, yielding optimal weights satisfying all constraints at once.
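A toy example may make the problem with post-processing concrete. The sketch below uses made-up weights and a hypothetical upper bound of 0.5, and the helper names are illustrative only. Clipping breaks the normalization, and renormalizing afterwards pushes the largest weight back above the bound.

```python
def clip_upper(weights, upper):
    """Cap every weight at `upper` (post-hoc clipping)."""
    return [min(w, upper) for w in weights]


def renormalize(weights):
    """Rescale weights so that they sum to 1."""
    total = sum(weights)
    return [w / total for w in weights]


weights = [0.7, 0.2, 0.1]      # made-up calibration weights summing to 1
upper = 0.5                    # hypothetical upper bound

clipped = clip_upper(weights, upper)
print(sum(clipped))            # about 0.8: normalization is violated

renormalized = renormalize(clipped)
print(max(renormalized))       # about 0.625: the bound is violated again
```

Neither step accounts for the balance constraints either, which is why imposing the bound inside the optimization is preferable.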

4.1 Benchmark against R Packages ebal and CVXR

Benchmarking was done against R package ebal version 0.1-6 and package CVXR version 0.99-4 on a Kang et al. (2007) simulation of sample size 2,000. The computation was repeated 200 times for each package. By mean run time, EC was 22 and 73 times as fast as ebal and CVXR, respectively (Table 1). Further, EC had less variability in run time.
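The quoted speedups follow from the mean run times in Table 1. The snippet below reproduces that arithmetic; the variable names are chosen here for illustration.

```python
# Mean run times in milliseconds, copied from Table 1.
mean_runtime_ms = {"ebal": 30.7, "CVXR": 102.2, "EC": 1.4}

# Speedup of EC relative to each of the other packages.
speedup = {
    package: runtime / mean_runtime_ms["EC"]
    for package, runtime in mean_runtime_ms.items()
    if package != "EC"
}

for package, ratio in speedup.items():
    print(f"EC is about {ratio:.0f} times as fast as {package}")
```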

5 Conclusion

Empirical calibration has many applications, particularly in survey sampling and causal inference. EC provides an easy-to-use, robust, and efficient implementation. The source code is shared at https://github.com/google/empirical_calibration, where additional documentation and application examples can be found.

Acknowledgments

The authors thank Art Owen, Jim Koehler, Joseph Kelly, Georg Goerg, Jon Vaver, Susanna Makela, Mike Hankin, and Mike Wurm for helpful discussions. The authors appreciate Tony Fagan, Penny Chu, and Elissa Lee for their support. Special thanks go to Jim Koehler, David Chan, and Tim Au for reviewing this manuscript.

Appendix A Convex Optimization Details

A.1 Solving the Dual Problem

We present a detailed solution for the quadratic loss. The solution for entropy loss can be derived similarly. We first convert the optimization into vector form

[TABLE]

where $Z$ is the $n\times p$ covariate matrix centered at the marginal constraint, i.e., the $i$-th row is $Z_{i}={\mathbf{c}}(X_{i})-\mathbf{\bar{c}}$. We construct the Lagrangian with $\beta\in\mathbb{R}^{p}$, $\theta\in\mathbb{R}$, and $\alpha\in\mathbb{R}_{\geq 0}^{n}$ as the Lagrange multipliers

[TABLE]

Setting $\frac{\partial L}{\partial w_{i}}$ to zero reveals that the primal weight is a linear function of the transformed covariate and Lagrange multipliers

[TABLE]

Multiplying both sides by $w_{i}$ and using the complementary slackness Karush-Kuhn-Tucker (KKT) condition $w_{i}\alpha_{i}=0$, we eliminate $\alpha_{i}$ and obtain

[TABLE]

which further reduces to

[TABLE]

Substituting (8) into (4)-(5), we obtain a $(p+1)$-dimensional estimating equation with respect to the dual parameter $(\beta,\ \theta)$. The equation can be solved using a general-purpose nonlinear equation solver, such as scipy.optimize.fsolve. When such a solution exists, it is unique because the quadratic objective is strictly convex. The dual solution for the entropy objective can be derived similarly, but with an exponential link function replacing (8):

[TABLE]

By construction of (8) and (9), sample data points with identical covariates would be given identical weights, which is a desirable property.
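The quadratic case can be sketched in the same way. The code below is a hedged illustration, not the EC source. It assumes the link has the form $w_i=\max(0,\ Z_i\beta+\theta)$, with any constant offset or scaling from the objective absorbed into the dual parameters. The function name `quadratic_balance` and the simulated data are hypothetical. The first two rows are made identical to illustrate that identical covariates receive identical weights.

```python
import numpy as np
from scipy.optimize import fsolve


def quadratic_balance(covariates, target_means):
    """Hypothetical helper: quadratic-balancing weights via a ReLU link."""
    n, p = covariates.shape
    z = covariates - target_means  # rows are Z_i = c(X_i) - c_bar

    def link(params):
        beta, theta = params[:p], params[p]
        # Assumed ReLU link: linear in Z_i, truncated at zero.
        return np.maximum(z @ beta + theta, 0.0)

    def estimating_equations(params):
        w = link(params)
        # (p + 1)-dimensional system: balance equations and normalization.
        return np.append(z.T @ w, w.sum() - 1.0)

    # Start from uniform weights 1/n (beta = 0, theta = 1/n).
    start = np.append(np.zeros(p), 1.0 / n)
    solution = fsolve(estimating_equations, start)
    return link(solution)


rng = np.random.default_rng(1)
sample = rng.normal(size=(500, 2))
sample[1] = sample[0]  # two units with identical covariates
target = np.array([0.3, -0.2])

weights = quadratic_balance(sample, target)
print("sum of weights:", weights.sum())
print("weighted means:", weights @ sample)
print("equal weights for identical rows:", weights[0] == weights[1])
```

Because the link is non-smooth at zero, a general-purpose root finder is not guaranteed to converge on every data set; in practice the residuals of the estimating equations should be checked after solving.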

A.2 Bounding the Weights

For certain applications, it is desirable to impose additional upper and/or lower bounds on the weights to reduce the impact of extreme weights. Suppose the imposed bound is $[l,u]$, where $l$ and $u$ are both non-negative. We only need to change the weight link functions accordingly. Equation (8) becomes

[TABLE]

and (9) is now

[TABLE]
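A small numeric illustration shows why the bounds are built into the link function rather than applied afterward. It assumes the bounded links simply clip the unbounded links to $[l,u]$, which is what the KKT conditions give for a separable convex objective with box constraints; the function names and the example weights are hypothetical.

```python
import numpy as np


def bounded_quadratic_link(linear_term, lower, upper):
    """Assumed bounded version of the linear (ReLU) link."""
    return np.clip(linear_term, lower, upper)


def bounded_entropy_link(linear_term, lower, upper):
    """Assumed bounded version of the exponential link."""
    return np.clip(np.exp(linear_term), lower, upper)


# Post-hoc clipping: weights that sum to one no longer do after clipping.
weights = np.array([0.5, 0.3, 0.1, 0.1])
clipped = np.clip(weights, 0.0, 0.4)
print("before clipping:", weights.sum())
print("after clipping: ", clipped.sum())

# The bounded links keep every weight inside [lower, upper] by construction.
print(bounded_quadratic_link(np.array([-1.0, 0.2, 2.0]), 0.0, 0.5))
```

When the clipped link is used inside the estimating equations, the solver adjusts $(\beta,\ \theta)$ so that the bounded weights still satisfy the balance and normalization constraints, which post-hoc clipping cannot do.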

A.3 Relaxing the Equality Constraint

When there is no feasible solution, we need to either drop some marginal constraints or soften the hard constraints, as in a soft-margin support vector machine

[TABLE]

where $\epsilon$ is the tolerance. Solving the optimization problem with the added tolerance requires only slightly more work. Following common practice, we first introduce two $p$-dimensional non-negative slack variables $\Delta_{1}$ and $\Delta_{2}$, and reformulate the optimization problem as

[TABLE]

We then construct the Lagrangian with $\beta\in\mathbb{R}^{p}$, $\theta\in\mathbb{R}$, $\alpha\in\mathbb{R}_{\geq 0}^{n}$, $\lambda\in\mathbb{R}_{\geq 0}$, and $\gamma_{1},\ \gamma_{2}\in\mathbb{R}_{\geq 0}^{p}$ as the Lagrange multipliers

[TABLE]

Setting $\frac{\partial L}{\partial w_{i}}$ to zero and applying the complementary slackness KKT condition $w_{i}\alpha_{i}=0$ leads to the same solution (8) for $w_{i}$. Setting $\frac{\partial L}{\partial\Delta_{1}}$ and $\frac{\partial L}{\partial\Delta_{2}}$ to zero and applying their respective complementary slackness KKT conditions, we have

[TABLE]

which in turn leads to

[TABLE]

As a result the slack variables can be eliminated from (12)

[TABLE]

Combining (19) and the complementary slackness KKT condition

[TABLE]

we get

[TABLE]

which connects the $L_{2}$ tolerance $\epsilon$ to the $L_{2}$ norm of the dual parameter $\beta$, and further eliminates $\lambda$ from (20),

[TABLE]

This reveals that the scaled dual parameter $\beta$ serves as the slack for the relaxed covariate balance constraint (12). Finally, combining (8), (13), and (21), we again obtain a $(p+1)$-dimensional estimating equation with respect to the dual parameter $(\beta,\ \theta)$, which can be solved using a general-purpose nonlinear equation solver.

Appendix B Empirical Calibration and Other Causal Methods

We briefly compare empirical calibration with other causal inference methods. Randomized experiments use a randomized assignment mechanism to ensure that the covariate distributions between the treated and control groups are naturally balanced. For observational data, matching and weighting are the two main methods to enforce this balance. Matching pairs treatment and control units at the individual level, while weighting assigns continuous weights to control units such that their weighted average matches the treatment average. Zhao & Percival (2017) nicely summarized all matching and weighting methods in the table below

Weighting is generally preferred to matching due to its efficiency gain (Zhao & Percival, 2017). Compared to propensity score weighting, empirical calibration enjoys similar theoretical properties but yields more stable weights.

Raw Matching

uses raw pre-treatment covariates for matching and does not attempt to model the assignment mechanism. Both exact matching and $k:1$ nearest neighbor matching fall into this category. Their variants differ in the choice of distance metric and in whether the same control unit may be used multiple times as a match (matching with replacement). Raw matching is intuitive and easy to implement. In many applications, however, it can be difficult or impossible to find $k$ closely matched control units for certain treatment units. Empirical calibration, on the other hand, aims to balance the treated and control groups at the aggregate level, which is usually sufficient for estimating ATT. Nearest neighbor matching implies binary or discrete weights on control units. For observational data with a large control group, many control units would be effectively discarded in the analysis. By adopting continuous weights, empirical calibration uses information from more control units. This leads to a larger effective sample size for the control group and reduces the variance of the ATT estimate.
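The effective sample size argument can be made concrete with a small example. The sketch below assumes that "effective sample size" refers to Kish's formula $(\sum_i w_i)^2/\sum_i w_i^2$, which the text does not state explicitly; the weight vectors are hypothetical numbers chosen only for illustration.

```python
import numpy as np


def effective_sample_size(weights):
    """Kish's effective sample size: (sum of w)^2 / (sum of w^2)."""
    weights = np.asarray(weights, dtype=float)
    return weights.sum() ** 2 / np.square(weights).sum()


# Matching that ends up using only 3 of 10 control units (binary weights).
matching_weights = np.array([1, 1, 1, 0, 0, 0, 0, 0, 0, 0]) / 3

# Continuous weights that spread mass over all 10 control units.
calibration_weights = np.array(
    [0.16, 0.14, 0.12, 0.10, 0.10, 0.10, 0.08, 0.08, 0.06, 0.06]
)

ess_matching = effective_sample_size(matching_weights)
ess_calibration = effective_sample_size(calibration_weights)
print("ESS with matching weights:   ", ess_matching)
print("ESS with continuous weights: ", ess_calibration)
```

Under this definition, discarded control units contribute nothing to the effective sample size, whereas weights closer to uniform push it toward the full control group size.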

Propensity Score Matching (PSM)

assumes a regression model for the propensity score $p(X)=P(T=1|X)$. The treated and control units are matched using the estimated propensity scores. The propensity score serves as the sufficient statistic of treatment assignment, and reduces the problem to a univariate matching problem. The success of the propensity score method hinges on the quality of the estimated scores. Accurately estimated propensity scores stochastically balance the covariates between the treatment and control groups. In practice, however, matching or weighting solely based on propensity score rarely guarantees actual covariate balancing. If the data is balanced to begin with, PSM may increase the imbalance (King & Nielsen, 2015). Empirical calibration takes the guesswork out of the procedure by directly optimizing towards an explicit covariate balancing goal, with guaranteed results.

Propensity Score Weighting (PSW)

weights the control units using the inverse probability weighting (IPW) scheme

[TABLE]

where $\hat{p}(X_{i})$ is the estimated propensity score for the $i$-th unit. The inverse probability weights could be volatile and sensitive to model misspecification. If some estimated propensity score is close to zero for a control unit, its inverse weight becomes very large and unstable, inflating the variance of the IPW estimate for ATT. One remedy is to trim extreme weights or apply winsorization, but it is hard to choose the trimming threshold in a principled way. Empirical calibration seeks weights that least deviate from uniform weights, so large weights are naturally penalized in the optimization. The resulting weights are far more stable. Furthermore, if desired, bounds on the weights can be directly imposed in optimization without the need for a separate trimming step. Conventionally, propensity score regression and outcome regression models are fitted separately and then combined to construct a doubly robust estimator: if either model is correctly specified, the estimator is statistically consistent. Recently, Zhao & Percival (2017) showed that entropy balancing (EB) is also doubly robust with respect to the implied logistic propensity score regression and linear outcome regression. Hirano et al. (2003) showed that IPW paired with the sieve propensity score model could achieve the semiparametric efficiency bound, which justifies why weighting is usually preferred over matching. Empirical calibration with the entropy loss has also been shown to reach the asymptotic semiparametric efficiency bound when both implied propensity score regression and outcome regression are correctly specified (Zhao & Percival, 2017).

Bibliography


1. Brodersen et al. (2015) Kay H Brodersen, Fabian Gallusser, Jim Koehler, Nicolas Remy, Steven L Scott, et al. Inferring causal impact using Bayesian structural time-series models. The Annals of Applied Statistics, 9(1):247–274, 2015.
2. Deville & Särndal (1992) Jean-Claude Deville and Carl-Erik Särndal. Calibration estimators in survey sampling. Journal of the American Statistical Association, 87(418):376–382, 1992.
3. Diamond & Boyd (2016) Steven Diamond and Stephen Boyd. CVXPY: A Python-embedded modeling language for convex optimization. The Journal of Machine Learning Research, 17(1):2909–2913, 2016.
4. Domahidi et al. (2013) Alexander Domahidi, Eric Chu, and Stephen Boyd. ECOS: An SOCP solver for embedded systems. In Control Conference (ECC), 2013 European, pp. 3071–3076. IEEE, 2013.
5. Fleiss et al. (2003) Joseph L. Fleiss, Bruce Levin, and Myunghee Cho Paik. Statistical Methods for Rates and Proportions. Wiley, 2003.
6. Fu et al. (2017) Anqi Fu, Balasubramanian Narasimhan, and Stephen Boyd. CVXR: An R package for disciplined convex optimization. arXiv preprint arXiv:1711.07582, 2017.
7. Fu et al. (2018) Anqi Fu, Balasubramanian Narasimhan, and Stephen Boyd. CVXR: A direct standardization example. https://cvxr.rbind.io/cvxr_examples/cvxr_direct-standardization, 2018.
8. Fuller (2011) Wayne A Fuller. Sampling Statistics, volume 560. John Wiley & Sons, 2011.