Worth Weighting? How to Think About and Use Weights in Survey Experiments
Luke W. Miratrix, Jasjeet S. Sekhon, Alexander G. Theodoridis, Luis F. Campos

TL;DR
This paper provides practical guidance on using weights in survey experiments, showing that unweighted estimates often suffice for population effects, while weighted methods are better for precise population estimates.
Contribution
It offers a Neyman-Rubin model-based framework and empirical evidence for when to use weights versus unweighted estimators in survey experiments.
Findings
Unweighted sample estimates often match weighted ones for population effects.
Weighting can reduce statistical power and may be unnecessary with high-quality samples.
Post-stratification on weights or covariates improves estimates when precise population effects are needed.
| Estimator | Mean | Bias | SE | RMSE | Boot SE | Coverage |
|---|---|---|---|---|---|---|
| A | | | | | | |
| 1 | 40.36 | 7.77 | 1.35 | 7.89 | | |
| 2 | 32.58 | 0.00 | 1.84 | 1.84 | | |
| 3 | 40.37 | 7.78 | 3.14 | 8.39 | 3.12 | 30% |
| 4 | 32.62 | 0.03 | 3.91 | 3.91 | 3.79 | 95% |
| 5 | 32.60 | 0.01 | 2.67 | 2.67 | 2.69 | 95% |
| B | | | | | | |
| 1 | 30.00 | 0.00 | 0.00 | 0.00 | | |
| 2 | 30.00 | 0.00 | 0.00 | 0.00 | | |
| 3 | 30.01 | 0.01 | 2.58 | 2.58 | 2.58 | 95% |
| 4 | 30.03 | 0.03 | 3.35 | 3.35 | 3.31 | 95% |
| 5 | 30.02 | 0.02 | 3.32 | 3.32 | 3.29 | 95% |

| Survey | Year | Geography | Base Stimulus | Conditions | Outcome(s) | N Democrats | N Republicans |
|---|---|---|---|---|---|---|---|
| CCES Module 1 | 2010 | National | News Report | Democratic/Republican Candidate | Report Fair; Report Biased; Topic Important; Candidate Deserves Credit; Candidate Typical | 231 | 211 |
| CCES Module 1 | 2010 | National | Image | Democratic Donkey/Republican Elephant | Unemployment Rate Estimate | 190 | 219 |
| CCES Module 2 | 2010 | National | Image | Democratic Donkey/Republican Elephant | Unemployment Rate Estimate | 201 | 193 |
| YouGov Study | 2011 | National | News Report | Democratic/Republican Candidate | Report Fair; Report Biased; Topic Important; Candidate Deserves Credit; Candidate Typical | 442 | 372 |
| CCES | 2012 | National | Campaign Advertisement Video | Obama/Romney | Time watched; Repeat; Share; See More | 326 | 228 |
| CCES | 2012 | National | Campaign Advertisement Video | Negative/Positive | Time watched; Repeat; Share; See More | 326 | 228 |
| CCES | 2012 | National | Candidate Vignette | Democratic/Republican Label | Trait Ratings: Compassionate; Moral; Strong Leader; Really Cares; Knowledgeable; Greedy; Indecisive; Hard Working; Honest | 321 | 225 |
| CCES | 2012 | National | Voter Fraud Hypothetical | Democrats/Republicans | Would this group commit fraud? | 223 | 145 |
| Gubernatorial Election | 2013 | Virginia | Campaign Advertisement Video | McAuliffe/Cuccinelli | Time watched; Repeat; Share; See More | 454 | 350 |
| Gubernatorial Election | 2013 | Virginia | Campaign Advertisement Video | Negative/Positive | Time watched; Repeat; Share; See More | 454 | 350 |
| YouGov Study | 2013 | National | News Report | Democratic/Republican Candidate | Report Fair; Report Biased; Topic Important; Candidate Deserves Credit; Candidate Typical | 456 | 353 |
| CCES | 2014 | National | Candidate Conjoint 1 | Male/Female Candidate | Is candidate more likely a Democrat or Republican? | 504 | 330 |
| CCES | 2014 | National | Candidate Conjoint 2 | Male/Female Candidate | Is candidate more likely a Democrat or Republican? | 504 | 330 |
| CCES | 2014 | National | Candidate Conjoint 3 | Male/Female Candidate | Is candidate more likely a Democrat or Republican? | 504 | 330 |
| CCES | 2014 | National | Candidate Conjoint 4 | Male/Female Candidate | Is candidate more likely a Democrat or Republican? | 504 | 330 |
| CCES | 2014 | National | Painting by George W. Bush | Bush Revealed as Artist/Not Revealed | Rating of Painting Quality | 504 | 329 |
| CCES | 2014 | National | Sketch by Barack Obama | Obama Revealed as Artist/Not Revealed | Rating of Sketch Quality | 394 | 278 |
| CCES | 2014 | National | News Story about Stampede at July 4 Gathering | Democratic/Republican Event | In-Party Shame | 338 | 204 |
Worth Weighting? How to Think About and Use Weights in Survey Experiments
Luke W. Miratrix
Harvard University
Jasjeet S. Sekhon
University of California, Berkeley
Alexander G. Theodoridis
University of California, Merced
Luis F. Campos
Harvard University
For research support and funding of the experimental studies analyzed here, Theodoridis thanks the University of California, Merced, the National Science Foundation Awards #1430505, #1225750 and #0924191, the Empirical Implications of Theoretical Models program and the Integrative Graduate Education and Research Traineeship program, and the Berkeley Institute of Governmental Studies and its Mike Synar Graduate Research Fellowship; Miratrix thanks the Institute of Education Sciences, U.S. Department of Education, through Grant R305D150040; and Sekhon thanks Office of Naval Research (ONR) grants N00014-15-1-2367 and N00014-17-1-2176. The opinions expressed are those of the authors and do not represent views of the Institute, the U.S. Department of Education, ONR, or any of the organizations mentioned above. Please send comments to [email protected], [email protected], [email protected], or [email protected]. Replication data happily provided upon request.
Abstract
The popularity of online surveys has increased the prominence of using weights that capture units’ probabilities of inclusion for claims of representativeness. Yet, much uncertainty remains regarding how these weights should be employed in analysis of survey experiments: Should they be used or ignored? If they are used, which estimators are preferred? We offer practical advice, rooted in the Neyman-Rubin model, for researchers producing and working with survey experimental data. We examine simple, efficient estimators for analyzing these data, and give formulae for their biases and variances. We provide simulations that examine these estimators as well as real examples from experiments administered online through YouGov. We find that for examining the existence of population treatment effects using high-quality, broadly representative samples recruited by top online survey firms, sample quantities, which do not rely on weights, are often sufficient. We found that Sample Average Treatment Effect (SATE) estimates did not appear to differ substantially from their weighted counterparts, and they avoided the substantial loss of statistical power that accompanies weighting. When precise estimates of Population Average Treatment Effects (PATE) are essential, we analytically show post-stratifying on survey weights and/or covariates highly correlated with the outcome to be a conservative choice. While we show these substantial gains in simulations, we find limited evidence of them in practice.
1 Introduction
Population-based survey experiments have become increasingly common in political science in recent decades (Gaines, Kuklinski and Quirk, 2007; Mutz, 2011; Sniderman, 2011). However, practical advice remains limited in the literature and uncertainty persists among scholars regarding the role of weights that capture differing probabilities of eventual inclusion across units in the analysis of survey experiments (Franco et al., Forthcoming). Should they be used or ignored? If they are to be used, which estimators are to be preferred? As Mutz (2011, 113-120) notes,
“there has been no systematic treatment of this topic to date, and some scholars have used weights while others have not … the practice of weighting was developed as a survey research tool—that is, for use in observational settings. The use of experimental methodology with representative samples is not yet sufficiently common for the analogous issue to have been explored in the statistical literature.”
We seek to fill this void with a systematic treatment, based on sound statistical principles rooted in the Neyman-Rubin model, yielding practical advice for scholars seeking to make the best possible decisions when using (or electing not to use) weights in their analysis of survey experiments. We explore the topic through a combination of formulae, simulation, and examination of real data.
Taken together, these explorations lead to the conclusion that, for scholars examining population treatment effects using the high-quality, broadly representative samples recruited and delivered by top online survey firms, sample quantities, which do not rely on weights, are often sufficient. Sample Average Treatment Effect (SATE) estimates tend not to differ substantially from weighted estimates, and they avoid the statistical power loss that accompanies weighting. When precise estimates of Population Average Treatment Effects (PATE) are essential, we conclude that a “double-Hájek” weighted estimator is a straightforward and reliable option in many cases. We also analytically show that post-stratifying on survey weights and/or covariates highly correlated with the outcome is a conservative choice for precision improvement, because it is unlikely to do harm and could be quite beneficial in certain circumstances.
The greater prevalence of online surveys has gone hand-in-hand with the boom in survey experiments. Firms such as YouGov (formerly Polimetrix) and Knowledge Networks (now owned by GfK) provide researchers platforms through which to run experiments. The firms offer representative samples generated through extensive panel recruitment efforts and sophisticated sample matching and weighting procedures. By reducing or eliminating costs, subsidized, grant-based and collective programs such as Time Sharing Experiments for the Social Sciences (TESS), the Cooperative Congressional Election Study (CCES), and Cooperative Campaign Analysis Project (CCAP) have further facilitated researchers’ access to time on high-end online surveys. Other firms and platforms, such as Survey Sampling International, Google Consumer Surveys (Santoso, Stein and Stevenson, 2016), and Amazon’s Mechanical Turk (Berinsky, Huber and Lenz, 2012), offer even less costly access to large and diverse convenience samples on which researchers can also conduct survey experiments. Researchers using these sometimes generate their own weights to improve representativeness. However, because we view population inferences with such convenience samples as rather tenuous, our primary interest is in methods for analysis of data from sources, such as YouGov and Knowledge Networks, that actively recruit subjects and provide the researcher with weights.
Survey experiments are a two-step process where a sample is first obtained from a parent population, and then that sample is randomized into different treatment arms. The sample selection and treatment assignment processes are generally independent of each other. Sampling procedures have changed in recent years because of increasing rates of non-response and new technologies. As a result, weights can vary substantially across units, with some units having only a small probability of being in the sample. In contrast, the treatment assignment mechanisms are usually simple and relatively balanced, rendering the SATE straightforward to estimate. Estimating the PATE, however, is less so because these estimates need to incorporate the weights, which introduces additional variance as well as a host of complexities.
In this work we assume the weights are known, and further assume that they incorporate both sampling probabilities and non-response. In particular, if there is non-response, and the non-response is correctly modeled as a function of some set of covariates, the overall weight would then reflect the product of the probability of being included in the sample and the probability of responding conditional on that inclusion. We use weight rather than sampling weight to indicate this more general view. In fact, for the primary type of data targeted by this work, the weights are typically calculated by the survey firms to represent the relative chances that a newly arrived recruit would get selected into the survey; as volunteering is in part self-selection, the non-response is built into the final weight calculations automatically. We believe the findings based on these assumptions are nevertheless informative, but we also discuss the additional complications of weight uncertainty in the body of this paper.
Overall, we encourage researchers choosing between these approaches to first give serious thought to the types of inferences they will make. Do they simply wish to establish the presence or absence of an effect in a given population? If so, the SATE may suffice. Or do they hope to measure the magnitude of an effect that may or may not already be documented? In this case, the scholar may consider her options for weighted estimators.
In Section 2 we overview general survey methodology. In Section 3 we then formally consider survey experiments and relate them to the SATE. We formally define the PATE and some estimators of it in Section 4, where we also discuss weights and uncertainty in weights in more detail, and we introduce a post-stratification estimator in Appendix B (Post-Stratification for PATE in Survey Experiments). We then investigate the performance of these estimators through simulation studies in Section 5, and analyze trends and features of real survey experimental data collected through YouGov in Section 6. We conclude with an extended discussion, providing some advice and high-level pointers to applied practitioners.
2 Surveys and Survey Experiments through the Lens of Potential Outcomes
We formalize surveys and survey experiments in terms of the Neyman-Rubin model of potential outcomes (Splawa-Neyman, Dabrowska and Speed, 1923/1990). Assume we have a population of $N$ units indexed as $i = 1, \ldots, N$. We take a sample from this population using a sample selection mechanism, and we then randomly assign treatment in this sample using a treatment assignment mechanism. Both mechanisms will be formally defined in subsequent sections. Each unit $i$ in the population has a pair of values, $(y_i(0), y_i(1))$, called its potential outcomes. Let $y_i(1)$ be unit $i$’s outcome if it were treated, and $y_i(0)$ its outcome if it were not. For each selected unit, we observe either $y_i(1)$ or $y_i(0)$ depending on whether we treat it or not. For any unselected unit, we observe neither.
We make the usual no-interference assumption that implies that treatment assignment for any particular unit has no impact on the potential outcomes of any other unit. This assumption is natural in survey experiments. The treatment effect $\Delta_i$ for unit $i$ is then the difference in potential outcomes, $\Delta_i \equiv y_i(1) - y_i(0)$. These individual treatment effects are deterministic, pre-treatment quantities.
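Let $S$ be our sample of $n$ units. Then the Sample Average Treatment Effect (SATE) is the mean treatment effect over the sample:
$$\tau_S = \frac{1}{n} \sum_{i \in S} \Delta_i = \frac{1}{n} \sum_{i \in S} y_i(1) - \frac{1}{n} \sum_{i \in S} y_i(0).$$
This is a parameter for the sample at hand, but is random in its own right if we view the sample as a draw from the larger population. By comparison, a parameter of interest in the population is the Population Average Treatment Effect (PATE) defined as
$$\tau = \frac{1}{N} \sum_{i=1}^{N} \Delta_i = \frac{1}{N} \sum_{i=1}^{N} y_i(1) - \frac{1}{N} \sum_{i=1}^{N} y_i(0).$$
In general $\tau_S \neq \tau$, and if the sampling scheme is not simple (e.g., some types of units are more likely to be selected), then potentially $E[\tau_S] \neq \tau$.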
We discuss some results concerning the sample selection mechanism in the next section. After that, we will combine the sample selection with the treatment assignment process.
2.1 Simple Surveys (No Experiments)
Let $S_i$ be a dummy variable indicating selection of unit $i$ into the sample, with $S_i = 1$ if unit $i$ is in the sample, and 0 if not. Let $S$ be $(S_1, \ldots, S_N)$, the vector of selections. In a slight abuse of notation, let $S$ also denote the random sample. Thus, for example, $i \in S$ would mean unit $i$ was selected into sample $S$. Finally let the overall selection probability or sampling probability for unit $i$ be
$$\pi_i \equiv P(S_i = 1) = E[S_i],$$
which is the probability of unit $i$ being included in the sample. The more the $\pi_i$ vary, the more the sample could be unrepresentative of the population. We assume $\pi_i > 0$ for all $i$, meaning every unit has some chance of being selected into $S$. The $\pi_i$ depend, among other things, on the desired size of sample $S$. We assume the $\pi_i$ are fixed and known and incorporate non-response; we discuss uncertainty in them in Section 4.2.
Consider the case where we have no treatment and we see $y_i$ for any selected unit. Our task is to estimate the mean of the population, $\mu = \frac{1}{N} \sum_{i=1}^{N} y_i$. Estimating the mean of a population under a sampling framework has a long, rich history. We base our work on two estimators from that history here.
Let $\bar{\pi} = \frac{1}{N} \sum_{i=1}^{N} \pi_i$ be the average selection probability in the population and $n = \sum_{i=1}^{N} S_i$ be the realized sample size for sample $S$, with $E[n] = N\bar{\pi}$. Then let $w_i = \bar{\pi} / \pi_i$ be the weight. These weights are relative to a baseline of 1, which eases interpretability due to removing dependence on $n$. A weight of 1 means the unit stands for itself, a weight of 2 means the unit “counts” as 2 units, a weight of 0.5 means units of this type tend to be over-represented and so this unit counts as half, and so forth. The total weight of our sample is then
$$Z = \sum_{i=1}^{N} S_i w_i = \sum_{i=1}^{N} S_i \frac{\bar{\pi}}{\pi_i}.$$
$Z$ is random, but $E[Z] = E[n] = N\bar{\pi}$.
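As a small numerical illustration of these relative weights, the sketch below uses five made-up inclusion probabilities (an assumption for illustration only, not values from the paper) and checks that the expected total weight equals the expected sample size.

```python
import numpy as np

# Hypothetical inclusion probabilities for a toy population of N = 5 units.
pi = np.array([0.05, 0.10, 0.10, 0.20, 0.40])
pi_bar = pi.mean()            # average selection probability
w = pi_bar / pi               # relative weights; 1 means the unit stands for itself

expected_n = pi.sum()         # E[n] = sum_i pi_i = N * pi_bar
expected_Z = np.sum(pi * w)   # E[Z] = sum_i pi_i * w_i

print("weights:", w)
print("E[n] =", expected_n, " E[Z] =", expected_Z)
```

The rarest unit (probability 0.05) receives the largest weight, and the most over-represented unit (probability 0.40) counts for less than half a unit.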
The Horvitz-Thompson estimator (Horvitz and Thompson, 1952), an inverse probability weighting estimator, is then
$$\hat{y}_{HT} = \frac{1}{E[n]} \sum_{i=1}^{N} S_i w_i y_i = \frac{1}{N} \sum_{i=1}^{N} S_i \frac{y_i}{\pi_i}.$$
Although unbiased, the Horvitz-Thompson estimator is well known to be highly variable. This variability comes from the weights; if you randomly get too many rare units in the sample, their large weights will inflate $\hat{y}_{HT}$, even if all $y_i$ are the same. We are not controlling for the realized size of the sample. This is reparable by normalizing by the realized weight of the sample, $Z$, rather than the expected weight, $E[n]$.
This gives the Hájek estimator, which is the usual weighted average of the selected units, and which likely reflects the approach used by most scholars:
$$\hat{y}_{H} = \frac{1}{Z} \sum_{i=1}^{N} S_i w_i y_i.$$
The Hájek estimator is not unbiased, but it often has smaller MSE than Horvitz-Thompson (Hájek, 1958; Särndal, Swensson and Wretman, 2003). The bias, however, will tend to be negligible, as shown by the following lemma:
Lemma 2.1.
[A variation on Result 6.34 of Cochran (1977)] Under a Poisson selection scheme, i.e. units sampled independently with individual probability $\pi_i$, the bias of the Hájek estimator is $O(1/E[n])$. In particular, the bias can be approximated as
$$E[\hat{y}_{H}] - \mu \approx -\frac{1}{E[n]} \mathrm{Cov}[w_i, y_i], \quad \text{where } \mathrm{Cov}[w_i, y_i] = \frac{1}{N} \sum_{i=1}^{N} (w_i - \bar{w})(y_i - \mu).$$
See Appendix C for proof. The above shows that, for a fixed population, the bias decreases rapidly as sample size increases. If we sample with equal probability or if the outcomes $y_i$ are constant, the bias is 0. However, if the covariance between the weights and $y_i$ is large, the bias could potentially be large also. In particular, the covariance will be large if rare units (those with small $\pi_i$) systematically tend to be outliers with large $y_i$ because, as weights are non-negative inverses of the $\pi_i$, their distribution can feature a long right tail that drives the covariance.
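The contrast between the two estimators can be seen in a small simulation. This is a minimal sketch under Poisson selection with hypothetical selection probabilities and constant outcomes; the function names `horvitz_thompson` and `hajek` and all parameter values are illustrative assumptions, not the authors' code.

```python
import numpy as np

def horvitz_thompson(y, pi, selected):
    """HT estimate of the population mean: (1/N) * sum over sampled units of y_i / pi_i."""
    return np.sum(y[selected] / pi[selected]) / len(y)

def hajek(y, pi, selected):
    """Hajek estimate: weighted average of sampled outcomes with w_i = pi_bar / pi_i."""
    w = pi.mean() / pi[selected]
    return np.sum(w * y[selected]) / np.sum(w)

rng = np.random.default_rng(0)
N = 5000
pi = rng.uniform(0.02, 0.20, size=N)   # hypothetical selection probabilities
y = np.full(N, 10.0)                   # constant outcomes, so the population mean is 10

ht_draws, hajek_draws = [], []
for _ in range(500):
    selected = rng.random(N) < pi      # Poisson selection: independent inclusion
    ht_draws.append(horvitz_thompson(y, pi, selected))
    hajek_draws.append(hajek(y, pi, selected))

ht_draws, hajek_draws = np.array(ht_draws), np.array(hajek_draws)
print("HT    mean, sd:", ht_draws.mean(), ht_draws.std())
print("Hajek mean, sd:", hajek_draws.mean(), hajek_draws.std())
```

With constant outcomes the Hájek estimate equals the common value in every sample, while the Horvitz-Thompson estimate fluctuates with the realized total weight even though it is centered on the truth.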
3 Survey Experiments and SATE
Survey experiments are surveys with an additional treatment assigned at random to all selected units. Independent of $S$ let $T_i$ be a treatment assignment, with $T_i = 1$ if unit $i$ is treated, 0 otherwise. The most natural such assignment mechanism for our context is Bernoulli assignment, where each responding unit $i$ is treated independently with probability $p$ for some $p \in (0, 1)$. Another common mechanism is the classic complete randomization, when an $np$-sized simple random sample of the $n$ units is treated. Regardless, we assume randomization is a separate process from selection. In particular, we assume that randomization does not depend on the weights.
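If our interest is in the SATE, then a natural estimator is Neyman’s difference-in-means estimator of
$$\hat{\tau}_S = \frac{1}{n_1} \sum_{i=1}^{N} S_i T_i y_i(1) - \frac{1}{n_0} \sum_{i=1}^{N} S_i (1 - T_i) y_i(0),$$
with $n_1 = \sum_{i=1}^{N} S_i T_i$ the (possibly random) number of treated units and $n_0 = n - n_1$ the number of control units (see Splawa-Neyman, Dabrowska and Speed, 1923/1990).
This estimator is essentially unbiased for the SATE ($E[\hat{\tau}_S \mid S] = \tau_S$), but unfortunately, the SATE is not generally the same as the PATE and $E[\hat{\tau}_S] \neq \tau$ in general. The bias, for fixed $n$, is
$$E[\hat{\tau}_S] - \tau = \frac{1}{\bar{\pi}} \mathrm{Cov}[\pi_i, \Delta_i], \quad \text{where } \mathrm{Cov}[\pi_i, \Delta_i] = \frac{1}{N} \sum_{i=1}^{N} (\pi_i - \bar{\pi})(\Delta_i - \tau). \tag{3}$$
See Appendix C for derivation. As units with higher $\pi_i$ will be more likely to be selected into $S$, the estimator will be biased toward the treatment effect of these units. Equation 3 shows an important fact: if the treatment impacts are not correlated with the weights, then there will be no bias. In particular, if the selection probabilities are all the same, or there is no treatment effect heterogeneity, then the bias of $\hat{\tau}_S$ for estimating the PATE will be 0.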
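The bias expression can be checked numerically. The sketch below assumes the bias takes the covariance form $\mathrm{Cov}[\pi_i, \Delta_i] / \bar{\pi}$ with a divide-by-$N$ population covariance and treats $n = N\bar{\pi}$ as fixed; the function name `sate_bias_for_pate` and the simulated values are hypothetical.

```python
import numpy as np

def sate_bias_for_pate(pi, delta):
    """(1 / pi_bar) * Cov(pi_i, Delta_i), using the population (divide-by-N) covariance."""
    pi_bar = pi.mean()
    cov = np.mean((pi - pi_bar) * (delta - delta.mean()))
    return cov / pi_bar

rng = np.random.default_rng(1)
N = 1000
pi = rng.uniform(0.05, 0.50, size=N)                   # hypothetical selection probabilities
delta_hetero = 2.0 + 10.0 * pi + rng.normal(size=N)    # effects correlated with selection
delta_const = np.full(N, 2.0)                          # no effect heterogeneity

# Direct computation: selection-probability-weighted mean effect minus the PATE.
direct = np.sum(pi * delta_hetero) / pi.sum() - delta_hetero.mean()

print("covariance form:", sate_bias_for_pate(pi, delta_hetero))
print("direct form:    ", direct)
print("constant effect:", sate_bias_for_pate(pi, delta_const))
```

The two forms agree, and the bias vanishes when either the effects or the selection probabilities are constant.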
The variance of $\hat{\tau}_S$, conditional on the sample $S$, is well known, but we include it here as we use it extensively.
Theorem 3.1.
Let sample $S$ be randomly assigned to treatment and control with $E[T_i] = p$ for all $i$ with either a complete randomization or Bernoulli assignment mechanism. The unadjusted simple-difference estimator $\hat{\tau}_S$ is unbiased for the SATE, i.e. $E[\hat{\tau}_S \mid S] = \tau_S$. [Footnote 1: Nearly unbiased, that is. Under randomizations where the estimator could be undefined (e.g., there is a chance of all units getting assigned to treatment, such as with Bernoulli assignment where $n_1$ is random and $n_1 = 0$ or $n_1 = n$), this unbiasedness is conditional on the event of the estimator being defined. Because this probability is generally exponentially small the bias is as well, however. See Miratrix, Sekhon and Yu (2013) for further discussion.] Its variance is
$$\mathrm{Var}[\hat{\tau}_S \mid S] = \frac{1}{n} \left[ \beta_1 \sigma_S^2(1) + \beta_0 \sigma_S^2(0) - \sigma_S^2(\Delta) \right],$$
where $\sigma_S^2(1)$, $\sigma_S^2(0)$ and $\sigma_S^2(\Delta)$ are the variances of the individual potential outcomes and treatment effects for the sample, and $\beta_\ell = E[n / n_\ell \mid S]$ are the expectations (across randomizations) of the inverses of the proportion of units in the two treatment arms.
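If $n_1$ is fixed, such as with a completely randomized experiment, then $\beta_\ell = n / n_\ell$, and the above simplifies to Neyman’s result of
$$\mathrm{Var}[\hat{\tau}_S \mid S] = \frac{\sigma_S^2(1)}{n_1} + \frac{\sigma_S^2(0)}{n_0} - \frac{\sigma_S^2(\Delta)}{n}.$$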
For Bernoulli assignment, the $\beta_\ell$ are complicated because of the expectation of the random denominator, and there are mild technical issues because the estimator is undefined when, for example, $n_1 = 0$. One approach is to use $n / E[n_\ell]$ as an approximation for $\beta_\ell$. This approximation is quite good; the bias is of high order for the same reasons as the bias for the Hájek estimator (see Lemma 2.1). Furthermore, the undefined issue is of small concern as the chance of $n_\ell = 0$ is exponentially small; giving the estimator an arbitrary value (e.g., 0) if this rare event occurs introduces only a small bias. An alternate approach is to condition on the number of units treated: set $\beta_\ell = n / n_\ell$ and use Neyman’s results. Conditioning is a reasonable choice that we prefer. It leads to a more accurate (and more readable) formula. For details, including formal definitions and the derivations, see Miratrix, Sekhon and Yu (2013).
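Neyman's variance result for complete randomization can be verified exactly on a toy sample by enumerating every possible assignment. This is an illustrative sketch: the six units and their potential outcomes are made up, the variances are assumed to use $n - 1$ divisors, and the function names are hypothetical.

```python
import numpy as np
from itertools import combinations

def neyman_variance(y1, y0, n1):
    """Var of the difference in means under complete randomization (n - 1 divisors)."""
    n = len(y1)
    n0 = n - n1
    var1 = np.var(y1, ddof=1)
    var0 = np.var(y0, ddof=1)
    var_delta = np.var(y1 - y0, ddof=1)
    return var1 / n1 + var0 / n0 - var_delta / n

def exact_randomization_moments(y1, y0, n1):
    """Mean and variance of the estimator over all ways to treat n1 of the n units."""
    n = len(y1)
    estimates = []
    for treated in combinations(range(n), n1):
        t = np.zeros(n, dtype=bool)
        t[list(treated)] = True
        estimates.append(y1[t].mean() - y0[~t].mean())
    estimates = np.array(estimates)
    return estimates.mean(), estimates.var()

# Both potential outcomes for a toy sample of six units (never observable in practice).
y0 = np.array([1.0, 3.0, 2.0, 5.0, 4.0, 6.0])
y1 = y0 + np.array([0.5, 2.0, -1.0, 3.0, 0.0, 1.5])

mean_est, var_est = exact_randomization_moments(y1, y0, n1=3)
print("SATE:", (y1 - y0).mean(), " mean of estimator:", mean_est)
print("exact variance:", var_est, " Neyman formula:", neyman_variance(y1, y0, 3))
```

The enumeration conditions on the number treated, which matches the conditioning approach preferred in the text.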
It is important to underscore that any SATE analysis on the sample, given a correctly modeled treatment assignment mechanism, is valid. I.e., such an analysis is estimating a true treatment effect parameter, the SATE. If $E[\tau_S] = \tau$ then any SATE analysis will be correct for PATE as well (although estimates of uncertainty may be too low if they do not account for variability in $\tau_S$). In particular, if there is a constant treatment effect, then $\tau_S = \tau$ for any sample, and the SATE will be the PATE, and all uncertainty estimates for the SATE will be the same as for the PATE. [Footnote 2: The estimated uncertainty will, however, depend on the sample $S$. For example, if $S$ happens to have widely varying units, $\hat{\tau}_S$ will have high variance and the sample-dependent SATE SE estimate should generally reflect that by being large to give correct coverage for $\tau_S$. Now, as this is true for any sample, the overall process will have correct coverage.] But a constant treatment effect is a large assumption.
4 Estimating the PATE
Imagine we had both potential outcomes for all the sampled $i \in S$. These would give us exact knowledge of the SATE, and we could also use this information, coupled with the weights, to estimate the PATE. In particular, with knowledge of the $y_i(0)$ and $y_i(1)$ we have a sample of treatment effects:
$$\Delta_i = y_i(1) - y_i(0) \quad \text{for } i \in S.$$
We can use these to estimate the PATE, $\tau$, with, for example, a Hájek estimator:
$$\nu_S = \frac{1}{Z} \sum_{i=1}^{N} S_i w_i \Delta_i = \frac{1}{Z} \sum_{i=1}^{N} S_i w_i y_i(1) - \frac{1}{Z} \sum_{i=1}^{N} S_i w_i y_i(0). \tag{6}$$
This oracle estimator is slightly biased, but the bias is small, giving $E[\nu_S] \approx \tau$. If we wanted an unbiased estimator, we could use a Horvitz-Thompson estimator by replacing $Z$ with $E[n]$, the expected sample size.
Unfortunately, we do not, for a given sample $S$, observe $\nu_S$. We can, however, estimate it given the randomization and partially observed potential outcomes. Estimating the PATE is now implicitly a two-step process: estimate the sample dependent $\nu_S$, which in turn estimates the population parameter $\tau$. Under this view, we have two concerns. First, we want to accurately estimate $\nu_S$ using all the tools available to simple randomized experiments such as adjustment methods or, if we can control the randomization, blocking. Second, we want to focus on a sample parameter $\nu_S$ that is itself a good estimator of $\tau$. See Appendix A for a more formal treatment of this.
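4.1 Estimating $\nu_S$
Equation 6 shows that our estimator is the difference in weighted means of our treatment potential outcomes and our control potential outcomes. This immediately motivates estimating these means with the units randomized to each arm of our study, as with the following “double-Hájek” estimator
$$\hat{\tau}_{hh} = \frac{1}{Z_1} \sum_{i=1}^{N} S_i T_i w_i y_i(1) - \frac{1}{Z_0} \sum_{i=1}^{N} S_i (1 - T_i) w_i y_i(0),$$
with
$$Z_1 = \sum_{i=1}^{N} S_i T_i w_i \quad \text{and} \quad Z_0 = \sum_{i=1}^{N} S_i (1 - T_i) w_i.$$
The $Z_\ell$ are the total sample masses in each treatment arm. $E[Z_1] = p E[n]$, the expected number of units that will land in treatment (similarly for control).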
This estimator is two separate Hájek estimators, one for the mean treatment outcome and one for the mean control. Each estimator adjusts for the total mass selected into that condition. This difference of weighted means is the one naturally seen in the field. It corresponds to the weighted OLS estimate from regressing the observed outcomes on the treatment indicators with weights $w_i$. This equivalence is shown in Appendix C.
Because this is a Hájek estimator, there is bias for $\hat{\tau}_{hh}$ in the randomization step as well as the selection step because the $Z_\ell$ depend on the realized randomization. Again, this bias is small, which means the expected value of our actual estimator, conditional on the sample, is approximately $\nu_S$, our Hájek “estimator” of the population $\tau$: $E[\hat{\tau}_{hh} \mid S] \approx \nu_S$. (For unbiased versions, see Appendix A.)
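The equivalence between the difference of weighted means and the weighted least squares coefficient can be illustrated numerically. This is a minimal sketch on simulated data; the function names and data-generating values are assumptions for illustration, and the weighted regression is solved by rescaling rows with the square root of the weights.

```python
import numpy as np

def double_hajek(y, t, w):
    """Difference of weighted means between treated (t == 1) and control (t == 0) units."""
    treated = t == 1
    mu1 = np.sum(w[treated] * y[treated]) / np.sum(w[treated])
    mu0 = np.sum(w[~treated] * y[~treated]) / np.sum(w[~treated])
    return mu1 - mu0

def wls_treatment_coef(y, t, w):
    """Coefficient on t from weighted least squares of y on an intercept and t."""
    X = np.column_stack([np.ones_like(y), t])
    sqrt_w = np.sqrt(w)
    coef, *_ = np.linalg.lstsq(X * sqrt_w[:, None], y * sqrt_w, rcond=None)
    return coef[1]

rng = np.random.default_rng(2)
n = 200
w = rng.uniform(0.3, 4.0, size=n)        # hypothetical survey weights for the sample
t = rng.integers(0, 2, size=n)           # Bernoulli(0.5) treatment assignment
y = 1.0 + 0.5 * w + 2.0 * t + rng.normal(size=n)

print("double-Hajek:", double_hajek(y, t, w))
print("weighted OLS:", wls_treatment_coef(y, t, w))
```

The point estimates coincide; the standard errors reported by off-the-shelf weighted regression routines are a separate matter and are not addressed by this sketch.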
We can obtain approximate results for the population variance of $\hat{\tau}_{hh}$ if we view the entire selection-and-assignment process as drawing two samples from a larger population. We ignore the finite-sample issues of no unit being able to appear in both treatment arms (i.e., we assume a large population) and use approximate formulae based on sampling theory. For a Poisson selection scheme and Bernoulli assignment mechanism we then have:
Theorem 4.1.
The approximate variance (AV) of $\hat{\tau}_{hh}$ is
[TABLE]
with . This formula assumes the are small; see Appendix C for a more exact form. This variance can be estimated by
[TABLE]
where and .
See Appendix C for the derivation, which also gives more general formulae that can be adapted for other selection mechanisms. For related work and similar derivations, see Aronow and Middleton (2013) and Wood (2008).
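The two-stage process of Poisson selection followed by Bernoulli assignment is straightforward to simulate, which gives a way to compare the unweighted difference in means with the double-Hájek estimator as estimators of the PATE. The sketch below is illustrative only: the population, the link between selection probability and treatment effect, and all parameter values are assumptions, and it does not reproduce the simulations reported in the paper.

```python
import numpy as np

def diff_in_means(y, t):
    """Unweighted difference in means between treated and control units."""
    return y[t].mean() - y[~t].mean()

def double_hajek(y, t, w):
    """Difference of weighted means between treated and control units."""
    mu1 = np.sum(w[t] * y[t]) / np.sum(w[t])
    mu0 = np.sum(w[~t] * y[~t]) / np.sum(w[~t])
    return mu1 - mu0

rng = np.random.default_rng(3)
N, p = 20000, 0.5
pi = rng.uniform(0.01, 0.10, size=N)   # hypothetical selection probabilities
y0 = rng.normal(size=N)
delta = 1.0 + 20.0 * pi                # effects are larger for easy-to-sample units
y1 = y0 + delta
pate = delta.mean()
w_pop = pi.mean() / pi                 # relative weights w_i = pi_bar / pi_i

sate_hat, hh_hat = [], []
for _ in range(300):
    s = rng.random(N) < pi             # Poisson selection into the sample
    t = rng.random(s.sum()) < p        # Bernoulli assignment among selected units
    y_obs = np.where(t, y1[s], y0[s])  # observed outcome for each selected unit
    sate_hat.append(diff_in_means(y_obs, t))
    hh_hat.append(double_hajek(y_obs, t, w_pop[s]))

sate_hat, hh_hat = np.array(sate_hat), np.array(hh_hat)
print("PATE:", pate)
print("unweighted   mean, sd:", sate_hat.mean(), sate_hat.std())
print("double-Hajek mean, sd:", hh_hat.mean(), hh_hat.std())
```

Because treatment effects here are deliberately correlated with selection probability, the unweighted estimator is centered away from the PATE while the weighted one is centered close to it; comparing the two standard deviations shows the precision cost of weighting in this particular setup.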
4.2 Uncertain and misspecified weights
Following the survey sampling literature, this paper assumes the weights are exact, correct, and known. They are considered to be the total chance of selection into the sample. In particular, again following standard practice, the $w_i$ are the product of any original sampling weights and any non-response weights, given a classic sampling context (Groves et al., 2011). By contrast, for surveys such as YouGov the non-response is built in, as the recruited panels are in effect self-selected, so we get the overall weights (which they call propensity or case weights) directly. Our results are regarding these total weights.
Of course, especially when considering non-response, weights are not known but instead estimated using a model and, ideally, a rich set of covariates. This raises two concerns. The first is if the weights are systematically inaccurate due to some selection mechanism that has not been correctly captured. In this case, as the weights are independent of the assignment mechanism, the SATE estimates are still valid and unbiased. The PATE estimates, however, can be arbitrarily biased, and this bias is not necessarily detectable. For example, if only those susceptible to treatment join the study, the PATE estimate will be too high, and there may be no measured covariate that allows for detection of this.
The second concern is whether there is additional uncertainty that needs to be accounted for, given the estimated weights, when doing inference for the PATE. There is, although we believe this uncertainty can often be much smaller than the uncertainty in the randomized experiment itself. [Footnote 3: Consider that the estimated weights are usually calibrating the full sample to a larger known population, while the uncertainty of the experiment is of the difference in two subsamples, which will tend to have about four times the variance, at least.] While this uncertainty could be taken into account, much of the literature does not tend to do so. Interestingly, it is not obvious whether estimating the weights given the sample could actually improve PATE precision, similar to using estimated propensity scores instead of known propensity scores; see, for example, Hirano, Imbens and Ridder (2000). We leave this as an area for future investigation.
For further thoughts on concerns regarding uncertainty in the weights, we point to the literature on generalizing randomized trials to wider populations, such as discussed in Hartman et al. (2015) and Imai, King and Stuart (2008). Here, the approach is generally to estimate units’ propensity for inclusion into the experiment, and then weight units by these quantities in order to estimate population characteristics. These propensities of inclusion are usually estimated by borrowing from the propensity score literature for observational studies (Cole and Stuart, 2010). One nice aspect of this approach is it provides diagnostics in the form of a placebo test. In particular, the characteristics of the re-weighted control group of the randomized experiment should match the characteristics of the target population of interest (see Stuart et al. (2010) for a discussion).
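The placebo-style diagnostic described above can be sketched as a comparison of covariate means. The example below assumes a single covariate whose population mean is known and uses the true inclusion propensities as weights; in practice the propensities would be estimated, and all names and values here are hypothetical.

```python
import numpy as np

def weighted_mean(x, w):
    """Weighted average of x with weights w."""
    return np.sum(w * x) / np.sum(w)

rng = np.random.default_rng(4)
N = 20000
x = rng.normal(size=N)                        # covariate known for the whole population
pi = np.clip(0.05 + 0.03 * x, 0.01, 0.20)     # hypothetical inclusion propensities
w = pi.mean() / pi                            # relative weights

selected = rng.random(N) < pi                 # units that end up in the experiment
control = selected & (rng.random(N) < 0.5)    # its randomized control arm

pop_mean = x.mean()
raw_gap = abs(x[control].mean() - pop_mean)
weighted_gap = abs(weighted_mean(x[control], w[control]) - pop_mean)

print("population mean of x:   ", pop_mean)
print("unweighted control gap: ", raw_gap)
print("re-weighted control gap:", weighted_gap)
```

A re-weighted gap that remains large on covariates like this would suggest that the weights do not capture the selection mechanism; a small gap is reassuring only for the covariates that were measured.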
Relatedly, O’Muircheartaigh and Hedges (2014) and Tipton (2013) propose post-stratified estimators, stratifying on these estimated weights. In their case, however, they also have the population proportions of the strata as given, which allows for simpler variance expressions and arguably less sensitivity to error in the weights themselves. Furthermore, they do not incorporate the unit-level weights once they stratify. Tipton (2013) investigates the associated bias-variance trade-offs due to stratification, and gives advice as to when stratification will be effective.
Generalization assumes we know the assignment mechanism, but not necessarily the sampling mechanism. There is some work on the reverse case, with estimated propensity scores of treatment and known weights; see DuGoff, Schuler and Stuart (2013). Here the final propensity weights are also treated as fixed for inference.
4.3 Post-stratification to improve precision
One can improve the precision of an experiment by adjusting for covariates. For an examination of this under the potential outcome framework, see, for example, Lin (2013). We use post-stratification for this adjustment. In post-stratification, treatment effects are first estimated within each of a series of specified strata of the data and then averaged together with weights proportional to strata size (Miratrix, Sekhon and Yu, 2013).
We use post-stratification because it relies on very weak modeling assumptions and naturally connects with the weighting involved in estimating the PATE. See Appendix B for the overall framework and associated estimators. Other estimators that rely on regression and other forms of modeling are also possible; see Zheng and Little (2003) or, more recently, Si, Pillai and Gelman (2015). For post-stratification, the more the mean potential outcomes vary between strata, the greater the gain in precision. And given that it is precisely when the weights and outcomes are correlated that we must worry about the weights, post-stratifying on them is a natural choice. Such stratification is easy to implement: simply build strata pre-randomization (but not necessarily pre-sampling) by, e.g., taking the weighted quantiles of the survey weights as the strata.
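The stratify-then-average recipe described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions rather than the estimator code used for the paper: the function names (`double_hajek`, `weight_quantile_strata`, `post_stratified_double_hajek`) are hypothetical, strata are formed so that each holds roughly the same total weight, and each stratum is assumed to contain at least one treated and one control unit.

```python
import numpy as np

def double_hajek(y, t, w):
    """Weighted mean of treated outcomes minus weighted mean of control outcomes."""
    y, w = np.asarray(y, float), np.asarray(w, float)
    t = np.asarray(t, bool)
    if t.all() or not t.any():
        raise ValueError("need at least one treated and one control unit")
    return np.average(y[t], weights=w[t]) - np.average(y[~t], weights=w[~t])

def weight_quantile_strata(w, n_strata=4):
    """Label units 0..n_strata-1 by weighted quantile of their survey weight."""
    w = np.asarray(w, float)
    order = np.argsort(w, kind="stable")
    w_sorted = w[order]
    # Midpoint of each unit's interval on the cumulative-weight scale, in [0, 1).
    mid_share = (np.cumsum(w_sorted) - w_sorted / 2) / w_sorted.sum()
    labels = np.minimum((mid_share * n_strata).astype(int), n_strata - 1)
    strata = np.empty(len(w), dtype=int)
    strata[order] = labels
    return strata

def post_stratified_double_hajek(y, t, w, strata):
    """Average within-stratum estimates, weighting each stratum by its share of total weight."""
    y, w = np.asarray(y, float), np.asarray(w, float)
    t, strata = np.asarray(t, bool), np.asarray(strata)
    estimate = 0.0
    for k in np.unique(strata):
        in_k = strata == k
        share_k = w[in_k].sum() / w.sum()
        estimate += share_k * double_hajek(y[in_k], t[in_k], w[in_k])
    return estimate
```

Because the strata depend only on the weights, they can be built before treatment is assigned, which is what keeps the randomization-based reasoning intact.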
When the units are divided into quantiles by survey weight, the cut-points of those quantiles depend on the realized weights of the sample. Because this is still pre-randomization, this does not impact the validity of the variance and variance-estimation formulae for the post-stratified SATE estimate. It does, however, make generating appropriate population variance formulae difficult. Given this, we propose using the bootstrap, re-deriving the strata within each bootstrap sample so that the sample-dependence of this stage is taken into account. The bootstrap is natural here because in survey experiments we are pulling units from a large population, and so simulating independent draws is reasonable. While a technical analysis of this approach is beyond the scope of this paper, we discuss some particulars of implementation in Appendix B. Reassuringly, our simulation studies in the next section show excellent coverage rates.
5 Simulation Studies
We here present a series of simulation studies to assess the relative performance of the respective estimators. We also assess the performance of the bootstrap estimates of the standard errors.
Our simulation studies are as follows: we generate a large population with the two potential outcomes and a selection probability for each unit. Using this fixed population, we repeatedly take a sample and run a subsequent experiment, recording the treatment effect estimates for the different estimators. In particular, we first select a sample, sampling without replacement but with probabilities of selection inversely proportional to the weights. [Footnote 4: We ignore a mild technical issue of the realized selection probabilities not matching the weights exactly because we sample without replacement.] Once we have obtained the final sample, we randomly assign treatment and estimate the treatment effect. After doing this 10,000 times we estimate the overall mean, variance, and MSE of the different estimators to compare their performance to the PATE. We also calculate bootstrap standard error estimates for all the estimators using the case-wise bootstrap scheme discussed in Appendix B.
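The loop just described (fix a population, sample with probability inversely proportional to the weights, randomize, estimate) can be sketched as follows. The data-generating process here is a hypothetical stand-in chosen only so that weights, outcomes and treatment effects are related; it is not the population of Appendix D, and the function name `run_simulation` and its defaults are assumptions for illustration.

```python
import numpy as np

def run_simulation(n_pop=20_000, n=500, n_reps=300, seed=0):
    rng = np.random.default_rng(seed)

    # Hypothetical fixed population: the effect grows with the survey weight.
    u = rng.uniform(size=n_pop)
    w = 1.0 + 7.0 * u**2                      # survey weights between 1 and 8
    y0 = 10.0 + 5.0 * u + rng.normal(size=n_pop)
    y1 = y0 + 20.0 * u
    pate = float(np.mean(y1 - y0))
    p_select = (1.0 / w) / np.sum(1.0 / w)    # selection inversely proportional to weight

    sate_hat = np.empty(n_reps)
    hh_hat = np.empty(n_reps)
    for r in range(n_reps):
        idx = rng.choice(n_pop, size=n, replace=False, p=p_select)
        t = rng.random(n) < 0.5               # Bernoulli treatment assignment
        y = np.where(t, y1[idx], y0[idx])     # only one potential outcome is observed
        ws = w[idx]
        sate_hat[r] = y[t].mean() - y[~t].mean()
        hh_hat[r] = (np.average(y[t], weights=ws[t])
                     - np.average(y[~t], weights=ws[~t]))

    def summarize(est):
        return {
            "mean": float(est.mean()),
            "bias": float(est.mean() - pate),
            "se": float(est.std(ddof=1)),
            "rmse": float(np.sqrt(np.mean((est - pate) ** 2))),
        }

    return {"pate": pate,
            "sate": summarize(sate_hat),
            "double_hajek": summarize(hh_hat)}
```

Under this assumed population the unweighted difference in means should show a clear bias for the PATE, while the weighted estimator should trade that bias for a larger standard error, which is the qualitative pattern the simulations in this section examine.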
Simulation A.
Our first simulation is for a population with a heterogeneous treatment effect that varies in connection to the weight. See Appendix D for some simple plots showing the structure of the population and a single sample. Our treatment effect, outcomes and sampling probabilities are all strongly related. We then took samples of a specified size from this fixed population, and examined the performance of our estimators as estimators for the PATE.
Results for are on Table 1. Other sample sizes such as , not shown, are substantively the same. The first two lines of the table show the performance of the two “oracle” estimators (Equation 1) and (Equation 6), which we could use if all of the potential outcomes were known. For there is bias because the treatment effect of a sample is not generally the same as the treatment effect of the population. The Hàjek approach of , second line, is therefore superior despite the larger SE. Line 3 is the simple estimate of the SATE from Equation 2. Because it is estimating , it has the same bias as line 1, but because it only uses observed outcomes, the SE is larger. Line 4 uses the “double-Hàjek” estimator shown in Equation 7. This estimator is targeting , reducing bias, but has a larger SE relative to line 3 due to the fact that we are incorporating weights. Line 5 is the post-stratified “double-Hàjek” given in Appendix B. Units were stratified by their survey weight, with equally sized (by weight) strata. For this scenario, post-stratifying helps, as illustrated by the smaller SE and RMSE, compared to .
An inspection of the coverage rates reflects what we have already discussed: The estimate does not target the PATE while the other two sample estimates, and , do. Therefore it has terrible coverage. Furthermore, the latter two estimates give correct coverage, which is reflective of the bootstrap SE estimates hitting their mark.
Simulation B.
As a second simulation we kept the original structure between and , but set a constant treatment effect of 30 for all units. Results are on the bottom half of Table 1. Here, for any sample , so there is no error in either estimate with known potential outcomes (lines 1–2). This also means that is a valid estimate of the PATE and this is reflected in the lack of bias and nominal coverage rate (line 3). The increase of SE of the weighted and post-stratified estimators (lines 4–5) reflects the use of weights when they are in fact unnecessary. Overall the SATE estimate is the best, as expected in this situation.
Simulation C.
In our final simulation, we systematically varied the relationship between selection probability and outcomes while maintaining the same marginal distributions in order to examine the benefits of post-stratification.
In our DGP we first generate a bivariate normal pair of latent variables with correlation , and then generate the weights as a function of the first variable and the outcomes as a function of the second. Then, by varying we can vary the strength of the relationship between outcome and weight. (See Appendix D for the particulars.) When , which corresponds to Simulation A, and have a very strong relationship and we benefit greatly from post-stratification. Conversely when , and are unrelated and there will be no such benefit.
For each we generated 20 populations, conducting a simulation study within each population. We then averaged the results and plotted the averages against on Figure 1. The solid lines give the performance of the oracle estimators and , and the non-solid lines are the estimators. The gray lines are estimators that do not incorporate the weights, and the black lines are estimators that do. The light grey points show the individual population simulation studies; they vary due to the variation in the finite populations.
We first see that, because both the double-Hàjek and its post-stratified version are targeting , which in turn estimates the PATE, they remain unbiased regardless of the latent correlation. On the other hand, the SATE and its estimator, , are affected. The bias continually increases as the relationship between weight and treatment effect increases.
As expected, the SE of the estimators that do not use weights, and , stay the same regardless of because the marginal distributions of the outcomes are the same across . The estimators that only use weights to adjust for sampling differences, and , also remain the same, although their SEs are larger than for and because of incorporating the weights. We pay for unbiasedness with greater variability. The post-stratified estimator , however, sees continual precision gains as the weights are increasingly predictive of treatment effect. For low , it has roughly the same uncertainty as , but is soon the most precise of all (non-oracle) estimators.
These conclusions are tied together in the right-most panel of Figure 1, showing the RMSE, which gives the combined impact of bias and variance on performance. As increases, the RMSE of steadily climbs due to bias, eventually being the worst at . Meanwhile, the post-stratified estimator that exploits weights, , performs better and better. Overall, if weights are important then 1) the bias terms can be too large to be ignored, and 2) there is something to be gained by adjusting the estimates of treatment effects with those weights beyond simple re-weighting. Otherwise, SATE estimators are superior, as incorporating weights can be costly.
6 Real Data Application
To better understand the overall trade-offs involved in using weighted estimators of PATE versus simply estimating the SATE on actual survey experiments, we analyzed a set of survey experiments embedded in 7 separate surveys fielded by us through YouGov over the course of 5 years. Studies appeared in two modules of the 2010 Cooperative Congressional Election Study (CCES), one module each of the 2012 and 2014 CCES, a survey of Virginia voters run prior to the 2013 gubernatorial election in that state, and two other national YouGov surveys. Each survey had post-hoc weights assigned by YouGov through that firm’s standard procedure. Across these surveys were 18 separate assignments of respondents to binary treatments. In several of these cases, multiple outcomes were measured, producing 46 randomization/outcome combinations. [Footnote 5: In the analyses presented here, we use all of these outcomes. We elected to do this because more outcomes provide more opportunity for divergence between SATE and PATE, and thus a more conservative test of our conclusion that such divergence is rare. Also, selecting outcomes would represent an added researcher degree of freedom, which we sought to avoid. In the interest of transparency, we present, in Appendix E, the results of our examination when only the primary outcome for each randomization is used. Our findings remain the same.] All of the studies examined were conducted in the United States and focused on topics related to partisan political behavior. [Footnote 6: Some of the studies featured random assignment of campaign advertisements shown to respondents with ad tone and partisan source varied. Several of the studies presented respondents with vignettes or news stories describing candidates or groups of voters with characteristics such as party label and gender randomized. Another set of studies asked respondents to evaluate artwork when told (or not told) that the art was produced by Presidents Bush and Obama. More details regarding the specific studies used can be found in Table 2 in the Appendix.] As such, we make our comparisons of SATE and PATE by looking at Democratic and Republican respondents separately. This is because treatment effects for such studies are generally highly heterogeneous by respondent party identification. Our set of 46 randomization/outcome combinations produces 92 experiments (half among Democrats and half among Republicans). Sample sizes for the experiments analyzed range from 145 to 504. Weights varied substantially in these samples, ranging from near 0 to around 8, when normalized to 1 across the sample (standard deviation of 1.04). Sixty-five of them (71%) showed SATEs that were significantly different from zero. However, once the weights provided by YouGov were taken into account to estimate the PATE (via the double-Hàjek estimate) only 52 experiments (57%) had significant effects.
Our first finding is that incorporating weights substantially increased the standard errors. Figure 2(a) shows an average increase in standard error estimates of over across experiments.
We next examined whether there is evidence of some experiments having a PATE substantially different from the SATE. To do this, we calculated bootstrap estimates of the standard error for the difference in the estimators, and calculated a standardized difference in estimates of . If there were no difference between the SATE and the PATE, the s should be roughly distributed as a standard Normal. First, the average of these is , giving no evidence for any systematic trend of the PATE being larger or smaller than the SATE. Second, when we compared our 92 values to the standard normal with a qq-plot (Figure 2(b)), we find excellent fit. While there is a somewhat suggestive tail departing from the expected line, the bulk of the experiments closely follow the standard normal distribution, suggesting that the SATE and PATE were generally quite similar relative to their estimation uncertainty. A test using q-statistics (modeled after Weiss et al. (2017)) failed to reject the null of no differences across the experiments (); especially considering the possible correlation of outcomes would make this test anti-conservative, we have no evidence that the PATE and SATE estimates differ (see Appendix E for further investigation of this). An FDR test also showed no experiments with a significant difference.
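The standardized difference used in this comparison can be sketched as below: the gap between the weighted and unweighted estimates is divided by a bootstrap standard error of that gap, with both estimators recomputed on the same resampled cases so that their correlation is respected. This is a hedged illustration only; the function names and defaults are hypothetical, and replicates with an empty treatment arm are simply skipped, which is an assumption of this sketch.

```python
import numpy as np

def sate_and_pate(y, t, w):
    """Unweighted difference in means and double-Hajek weighted difference."""
    sate = y[t].mean() - y[~t].mean()
    pate = np.average(y[t], weights=w[t]) - np.average(y[~t], weights=w[~t])
    return sate, pate

def standardized_difference(y, t, w, n_boot=1000, seed=0):
    y, w = np.asarray(y, float), np.asarray(w, float)
    t = np.asarray(t, bool)
    rng = np.random.default_rng(seed)
    n = len(y)
    sate, pate = sate_and_pate(y, t, w)

    diffs = []
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)      # case-wise resample of (y, t, w) triples
        tb = t[idx]
        if tb.all() or not tb.any():
            continue                          # skip degenerate replicates (assumption)
        s_b, p_b = sate_and_pate(y[idx], tb, w[idx])
        diffs.append(p_b - s_b)

    se_diff = np.std(diffs, ddof=1)
    return float((pate - sate) / se_diff)
```

If the SATE and PATE coincide, values computed this way across many experiments should look roughly standard Normal, which is what the qq-plot comparison checks.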
Finally, we consider whether post-stratification on weights improved precision. Generally, it did not: the estimated SEs of are very similar to those for , with an average increase of about . Further examination offers a hint as to why post-stratification did not yield benefits: the weights generated by YouGov for these samples do not correlate meaningfully with the outcomes of interest. In no case did the magnitude of the correlation between weights and outcome exceed . To further explore potential benefits of post-stratification we examine the effects of adjusting for a covariate, respondent party identification, on the full experiments not separated by party ID. Relative to no stratification, if we post-stratify on party ID, weighting the strata by the total sampled mass in both treatment and control, our PATE estimate shows an average reduction of in estimated SEs across experiments with participants of both major parties. If we post-stratify on both party ID and the weights, we see an average estimated SEs reduction of . These reductions would in no way make up for the larger SEs from attempting to estimate the PATE. However, they are consistent with the fact that post-stratification can generally only help.
While this does not imply that scholars need not consider post-stratifying on weights, it does show that outcomes of interest in political science studies are not necessarily going to be correlated with these weights. This makes clear the importance of researchers understanding, and reporting, the process used to generate weights and being aware of the covariates with which those weights are likely to be highly correlated (for online surveys, such a list would often include certain racial and education-level categories).
Discussion.
Overall, it appears that in this context and for these experiments, the survey weights significantly increase uncertainty, and that there is little evidence that the RMSE (which includes the SATE-PATE bias) for estimating the PATE is improved by estimators that include these weights. Furthermore, the weights are not predictive enough of outcome to help the post-stratified estimator. With regard to post-stratification, we note that in practice the analysis of any particular experiment would likely be improved by post-stratifying on known covariates predictive of outcome rather than naïvely on the weights.
In understanding these findings, it is useful to consider the ways in which data from these leading online survey firms (in this case, YouGov) may differ from more convenience-based online samples. Even unweighted, datasets from these firms tend to be more representative. This is because they often engage in extensive panel recruitment and retention efforts and assign subjects from their panels to client samples through mechanisms such as block randomization. As a result, the unweighted data are often largely representative of the overall population along many relevant dimensions. Relatedly, firms may use a clean-up matching step, such as the one employed by YouGov, where they down-sample their data to generate more uniform weights (Rivers, 2006). This will likely increase the heterogeneity of the final sample, which could decrease precision. [Footnote 7: Consider a standard scenario wherein a researcher purchases a sample of 1000 respondents. To generate these data, the survey firm might recruit 1400 respondents, all of whom participate in the study. Two datasets result from this. The first contains all 1400 respondents. The second is a trimmed version, where the firm drops 400 of the most overrepresented respondents (which is tantamount to assigning these respondents a weight of 0). This second set, which comes with weights assigned to each observation, is what many scholars analyze. Some firms will, upon request, also provide the full data set, but these data do not generally include weights, as the process for generating these weights is combined with the procedure for trimming down the larger data set by matching it to some frame based upon population characteristics. The weights will be less extreme than they would have been had the entire original sample been included, and the trimmed sample will be more heterogeneous, as many similar observations will be purged. This will make it more difficult to estimate its SATE compared to the full set (do note the SATEs could differ). Furthermore, post-stratification shows that estimators that include weights for the trimmed set will also be more variable than the same estimators on the full dataset (assuming weights could be obtained), even though the trimmed dataset weights will be less variable. Consider a case with two classes of respondents, reluctant and eager, equally represented in the population. The trimmed sample will have fewer eager respondents. Then, compared to the full data set, we will have a less precise estimate for the eager respondents in the trimmed data set. The precision for the reluctant respondents would be the same. Overall, our combined estimate will be, therefore, less precise.] We recommend that researchers request the original, pre-weighted data, in order to work with a larger and more homogeneous sample. For the SATE the gains are immediate. For the PATE, one might generate weights for the full sample by extrapolating from the weights assigned in the trimmed sample or by contracting with the survey firm to obtain weights for this full unmatched sample. Then, by post-stratifying on the weights, the researcher can take advantage of the additional units to increase precision in some strata without increasing variability in the others. For both SATE and PATE estimation, power would be improved.
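The reluctant/eager example can be made concrete with a small calculation. The numbers below are hypothetical and chosen only to mirror the 1400-versus-1000 scenario: the full sample is assumed to hold 900 eager and 500 reluctant respondents, trimming removes 400 eager respondents, the two classes each make up half of the population, and the within-class variance is assumed equal. The variance of a post-stratified estimate is then the sum over classes of the squared population share times the within-class variance divided by the class sample size.

```python
def post_stratified_variance(n_by_class, share_by_class, sigma2=1.0):
    """Variance of a share-weighted combination of independent class-level estimates."""
    return sum(share**2 * sigma2 / n
               for n, share in zip(n_by_class, share_by_class))

shares = (0.5, 0.5)                           # eager, reluctant population shares

# Full data: 900 eager + 500 reluctant respondents (hypothetical counts).
var_full = post_stratified_variance((900, 500), shares)

# Trimmed data: 400 eager respondents dropped, leaving 500 + 500.
var_trimmed = post_stratified_variance((500, 500), shares)

# Ratio above 1 means the trimmed sample gives the noisier combined estimate.
variance_ratio = var_trimmed / var_full
```

In this sketch the reluctant class contributes the same variance in both datasets, so the whole loss comes from the smaller eager group, which is the logic of the recommendation to work with the full sample.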
7 Discussion and Practical Guidance
We investigate incorporating weights in survey experiments under the potential outcomes framework. We focus on two styles of estimator, those that incorporate these weights to take any selection mechanisms into account, and those that ignore weights and instead focus on estimating the SATE. We primarily find that incorporating weights, even when they are exactly known, substantially decreases precision. Because of this, researchers are faced with a trade-off: more powerful estimates for the SATE, or more uncertain estimates of the PATE. We conclude with several observations that should inform how one navigates this trade-off.
The PATE can only be different from the SATE when two things hold: (1) there is meaningful variation in the treatment impact, and (2) that variation is correlated with the weights. See Equation 3. Moreover, the random assignment of treatment protects inference for the weighted estimator, even if the weights are incorrect or known only approximately: because the randomization of units into treatment is independent of the (possibly incorrect) weights, any inference conditional on the sample and the weights is a valid inference. When PATE is the estimand, we are estimating the treatment effect for a hypothetical population defined by the weights and sample, even if it does not correspond to the actual population. For example, if we find a treatment effect in our weighted sample, we know the treatment does have an effect for at least some units. See Hartman et al. (2015) for a discussion of this issue in the case of evaluating the external validity of an experiment.
It is important to compare the PATE and SATE estimates. A meaningful discrepancy between them is a signal to look for treatment effect heterogeneity and a flag that weight misspecification could be a real concern. If the estimates do not differ, however, and there is no other evidence of heterogeneity, then extrapolation is less of a concern, and furthermore the SATE is probably a sufficient estimate for the PATE. Of course, if the weights are misspecified and there is heterogeneity associated with selection into the experiment that is not captured by the covariates, then PATE estimation can be undetectably biased. For more on assessing heterogeneity, see Ding, Feller and Miratrix (2016).
Interestingly, our examination of real survey data found no strong connection between the weights and outcomes. The SATE and PATE estimators tended to be similar. Based on this, we have several general pieces of practical guidance: (1) When analyzing survey experiments using high-quality, broadly representative samples, such as those recruited and provided by firms like YouGov and Knowledge Networks, SATE estimates will generally be sufficient for most purposes. (2) If a particular research question calls for estimates of the PATE, a “double-Hàjek” estimator is probably the most straightforward (and a defensible) approach, unless weights are highly correlated with the outcome variables. (3) If weights are strongly correlated with a study’s outcome(s) of interest, post-stratification on the weights with bootstrap standard errors can help offset the cost of including weights for those seeking to draw population inferences.
This motivates a two-stage approach: first focus on the SATE using the entire, unweighted sample and determine whether the treatment had impact. This will generally be the most powerful strategy for detecting an effect, as the weights, being set aside, will not inflate uncertainty estimates. Then, once a treatment effect is established, work on how to generalize it to the population. This second stage is an assessment of the magnitude of an effect in the population once an effect on at least some members of the population has been established. First estimate the PATE with the weights, and then compare it to the SATE estimate. If they differ, then consider working to explain any treatment effect heterogeneity with covariates, and think carefully about weight quality. Regardless, ensure that all analyses preserve the original strength of the assignment mechanism; the weights do not need to jeopardize valid assessment of the presence of causal effects. Part of preserving valid statistical inference would be to commit to a particular procedure before analyzing the given dataset. A pre-analysis plan or sample splitting would help prevent a fishing expedition to find treatment effects.
Acknowledgements
For helpful and careful comments we would like to thank Henry Brady, Devin Caughey, Christopher Fariss, Erin Hartman, Steve Nicholson, Liz Stuart, Chelsea Zhang, and participants of the ACIC 2015 and 2014 Society of Political Methodology Summer Meetings. We would like to thank Guillaume Basse for his insights into the connection between weighted linear regression and the double-Hàjek estimator. We are also grateful for the valuable feedback and commentary received from two anonymous reviewers and the Editor, who pushed us to clarify and refine our findings.
Appendix A: A general class of estimators
When estimating the PATE, our overall estimation error is a combination of our error due to the randomized experiment for estimating and the difference between our survey-sampling estimate and the PATE . We can break this error down for any estimator of . First, given , we have , with being a bias term. Then
[TABLE]
Given a choice of , the first term is the expected MSE of the estimator for estimating when we consider all possible randomizations of treatment assignment on the given sample . The second term is the MSE of as an estimator for across all samples. The third term is a cross-bias term; it depends on how the bias of a sample is correlated with the error of its . We generally assume it is small and ignore it. This gives a rough formula for the overall mean square error of
[TABLE]
The first term will tend to be a function of the randomization method used and sample-dependent parameters such as , , , and, importantly, the choice of estimator . For a given choice of , if we reduce this inner term, we reduce the expectation and therefore increase the overall precision of the estimator for PATE. We reduce this term with better estimators, e.g., ones that exploit covariates; this is the goal of post-stratification.
The sampling scheme and choice of governs the second term. If we reduce it by changing , we increase precision. The main way to do this is to sample better, e.g., move closer to equal probability sampling. No estimation strategy can reduce this term.
Alternate estimators.
Given the above, our primary “double-Hàjek” estimator can be viewed as doubly biased: the expected value across randomizations is approximately , and the expected value of is approximately . We could instead use Horvitz-Thompson style estimators at either or both levels to remove these biases. In particular, if we select an estimator that is unbiased at the randomization level, i.e. , then we have
[TABLE]
One such estimator is the “single-Hàjek” estimator of
[TABLE]
This estimator is tied to double-Hàjek by and . It is a Horvitz-Thompson estimator with respect to the randomization for the two parts of our estimand . Interestingly, this estimator has the same asymptotic variance expression found in Theorem 4.1 as .
Finally, if we have
[TABLE]
For fixed , we have such an estimator as
[TABLE]
This estimator generally pays a large price for unbiasedness with high variance.
Appendix B: Post-Stratification for PATE in Survey Experiments
Post-stratification is motivated by viewing PATE estimation as a two-step process. In particular, estimators that have higher precision will give overall gains. Say we had a categorical covariate associated with our outcomes. We can then express our overall estimand as:
[TABLE]
with being the number of units in the population in stratum and being the proportion of the population in stratum . We could then estimate the population with strata level estimators of
[TABLE]
As before, we would then need to estimate these .
This motivates a post-stratified estimator that combines estimates of the population stratum sizes with estimates of the population stratum effects:
[TABLE]
where estimates , with the being the total weight in the sample and the
[TABLE]
being the total weights of the strata. These are not dependent on the randomization so
[TABLE]
If we had population knowledge we might actually know the and simply plug them in; this connects to the generalization of experiments. See, for example, Tipton (2013).
For the we have several options. Arguably the most natural is the double-Hàjek estimator of
[TABLE]
with being the total weight in the treatment group in stratum , and similarly for the control. The will have the usual bias from being Hàjek estimators. Here, however, this bias is of order , not (see Lemma 2.1), and so could potentially be larger than one might expect.
Regardless, combining gives our final
[TABLE]
If we want to avoid this bias, we could instead use a single-Hàjek estimator in each stratum:
[TABLE]
For the single-Hàjek, we immediately have , i.e., unbiasedness in the randomization step. This also causes the to cancel. If the weights within strata are generally homogeneous, the single-Hàjek will be essentially the same as the double. And if is built by stratifying on weights then we would indeed expect such homogeneity. Thus, with post-stratification, we can remove some bias for very little cost in variance.
Variance Estimation
As discussed in the main text, the post-stratification step can be sample-dependent. For example, if the units are divided into quantiles by survey weight, the cut-points of those quantiles depend on the realized weights of the sample. Because this is still pre-randomization, this does not impact the validity of the variance and variance-estimation formulae for the post-stratified SATE estimate. It does, however, make generating appropriate population variance formulae difficult. Furthermore, even if the strata are pre-defined, the formulae of Theorem 4.1 are actually for a linearized version of the ratio estimators, and as the strata are smaller than the overall sample, one might be concerned that these approximations would be less accurate when applied to individual strata. This is why we propose the bootstrap.
Appropriate implementation of the bootstrap deserves some discussion. Bootstrap is a “by analogy” technique. To obtain the variability of an estimator we repeatedly simulate obtaining a sample from some population using our hypothesized sampling mechanism, randomizing it into treatment, and estimating the treatment effect using our estimator on that sample. We first, therefore, need to have a population to sample from. Our best estimate of this population is the sample weighted by the weights. We then take a size- i.i.d. sample from this population with probability proportional to the inverse of these weights. The treatment assignment being Bernoulli means we take a case-wise bootstrap, bootstrapping the original treatment assignment along with the outcome. This avoids any need to impute any missing potential outcomes.
The up-weighting and subsequent weighted sampling steps collapse to generating a bootstrap sample by taking a classic with-replacement unweighted sample (i.e., a case-wise bootstrap) from the original sample of the triples .
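A minimal sketch of this case-wise bootstrap for the post-stratified estimator follows. It resamples whole (outcome, treatment, weight) triples with replacement and rebuilds the weight-quantile strata inside every replicate, so the sample-dependence of the strata is carried into the standard error. The helper names, the equal-total-weight stratification rule, and the choice to drop replicates in which some stratum lacks a treated or a control unit are all assumptions of this illustration, not details taken from the paper.

```python
import numpy as np

def hajek_diff(y, t, w):
    """Double-Hajek difference: weighted treated mean minus weighted control mean."""
    return np.average(y[t], weights=w[t]) - np.average(y[~t], weights=w[~t])

def weight_strata(w, n_strata):
    """Label units by weighted quantile of w (strata of roughly equal total weight)."""
    order = np.argsort(w, kind="stable")
    w_sorted = w[order]
    mid_share = (np.cumsum(w_sorted) - w_sorted / 2) / w_sorted.sum()
    labels = np.minimum((mid_share * n_strata).astype(int), n_strata - 1)
    strata = np.empty(len(w), dtype=int)
    strata[order] = labels
    return strata

def post_stratified(y, t, w, n_strata):
    """Post-stratified double-Hajek estimate; NaN if a stratum lacks an arm."""
    strata = weight_strata(w, n_strata)       # strata are rebuilt from this sample's weights
    estimate = 0.0
    for k in range(n_strata):
        in_k = strata == k
        if not in_k.any():
            continue                          # an empty stratum carries no weight
        t_k = t[in_k]
        if t_k.all() or not t_k.any():
            return np.nan
        estimate += w[in_k].sum() / w.sum() * hajek_diff(y[in_k], t_k, w[in_k])
    return estimate

def bootstrap_se(y, t, w, n_strata=4, n_boot=1000, seed=0):
    y, w = np.asarray(y, float), np.asarray(w, float)
    t = np.asarray(t, bool)
    rng = np.random.default_rng(seed)
    n = len(y)
    reps = np.empty(n_boot)
    for b in range(n_boot):
        idx = rng.integers(0, n, size=n)      # unweighted with-replacement draw of cases
        reps[b] = post_stratified(y[idx], t[idx], w[idx], n_strata)
    reps = reps[np.isfinite(reps)]            # drop degenerate replicates (assumption)
    return float(reps.std(ddof=1))
```

Resampling the observed treatment indicator along with the outcome is what removes any need to impute missing potential outcomes.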
Appendix C: Derivations
In the following we derive the bias of the Hàjek estimator, show that it is small, and derive the bias of as an estimator for the PATE. After this we show how a weighted OLS regression can be used in practice to estimate the double-Hàjek. Finally, we derive properties of the unstratified PATE estimators.
Bias of the Hàjek Estimator
The proof of Lemma 2.1, that the bias of a Hàjek estimator is , follows a similar strategy to the proof of Result 6.34 in Cochran (1977). That result is of the bias of a general ratio estimator for a fixed sample size under simple random sampling. We adapt this result to the Hàjek estimator (also a ratio estimator) under independent Poisson random sampling with variable sample size. A fixed sample size correction is possible, but is not needed for our purposes.
We extend the notation described in Section 2.1. Denote so that we can write . The expected values of both the numerator and denominator are
[TABLE]
These results alone should motivate why the Hàjek estimator should be approximately unbiased, but let us be a bit more rigorous. By first manipulating the difference of the estimator and its target and then applying the first order Taylor approximation, , we can get the approximate difference.
[TABLE]
Taking expectations and noting that by Equation 12 leads to the approximate bias:
[TABLE]
These expanded terms can be calculated individually for our estimator using properties of variance and covariance.
[TABLE]
Finally, substitute Equations 12, 14 and 15 into Equation 13 and simplify:
[TABLE]
We finally use the relation
[TABLE]
to get our final covariance formulation.
We have ignored a mild technical issue of an undefined estimator with probability . For the Poisson selection scheme, with the independent, which will be exponentially small in . Letting the estimator be defined as 0 under this circumstance gives a bounded, exponentially small term far less in magnitude than other bias terms.
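The small-bias claim can be checked numerically. The sketch below is an illustration under assumed settings, not part of the paper's simulations: the population, the inclusion probabilities `pis`, and the outcome model are all hypothetical. It draws repeated Poisson samples, computes the Hàjek estimate of the population mean (set to 0 when the sample is empty, as in the text), and compares its Monte Carlo bias with that of the unweighted sample mean.

```python
import random

rng = random.Random(0)

# Hypothetical finite population: outcomes are correlated with inclusion probability.
N = 200
pis = [rng.uniform(0.1, 0.5) for _ in range(N)]
ys = [10.0 + 20.0 * p + rng.gauss(0.0, 1.0) for p in pis]
true_mean = sum(ys) / N

def one_draw():
    """One Poisson sample: unit i enters independently with probability pis[i]."""
    sel = [i for i in range(N) if rng.random() < pis[i]]
    if not sel:
        return 0.0, 0.0  # estimator defined as 0 on the (very rare) empty sample
    hajek = sum(ys[i] / pis[i] for i in sel) / sum(1.0 / pis[i] for i in sel)
    naive = sum(ys[i] for i in sel) / len(sel)
    return hajek, naive

reps = 2000
draws = [one_draw() for _ in range(reps)]
hajek_bias = sum(d[0] for d in draws) / reps - true_mean
naive_bias = sum(d[1] for d in draws) / reps - true_mean
```

In this setup the unweighted mean over-represents high-probability units and is noticeably biased, while the Hàjek bias is small by comparison, consistent with the ratio-estimator argument above.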
Bias of the SATE for the PATE
To see that (or ) is a biased estimate for PATE, assume fixed sample size to obtain:
[TABLE]
For a random sample size, there is an additional, but negligible, bias term. We can see that the above is a first order approximation of the overall bias by replacing with . The difference in these terms is of order , as with our bias lemma.
The double-Hàjek as weighted OLS
In Section 4.1 we introduced the “double-Hàjek” estimator. Here we show that this estimate is equivalent to a weighted OLS where the weights are and we regress on the treatment indicator. In other words we fit the model
[TABLE]
with weights . The weighted OLS estimates and are the solutions to the normal equations:
[TABLE]
These are obtained by taking derivatives with respect to and of the weighted sum of squares, , and setting them to zero. Grouping by treatment indicators, we get the following:
[TABLE]
Taking the difference of these equations implies that
[TABLE]
To make the connection to the “double-Hàjek” estimate, denote and , as before. If we distribute the summation in the second normal equation (Equation 17), we get
[TABLE]
Written in the most general sense and replacing the weights, we get back our “double-Hàjek” estimate.
[TABLE]
Hence one way of calculating is by fitting a weighted OLS regression onto the treatment indicator and inspecting the coefficients.
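As a numerical check of this equivalence, the sketch below solves the two weighted normal equations in closed form and compares the slope with the difference of weighted arm means. The function names and the toy data are illustrative assumptions, not the authors' code.

```python
import random

def weighted_ols_on_treatment(ys, zs, ws):
    """Solve the weighted normal equations for y = a + b*z (z binary, so z*z == z)."""
    sw = sum(ws)
    swz = sum(w * z for w, z in zip(ws, zs))
    swy = sum(w * y for w, y in zip(ws, ys))
    swzy = sum(w * z * y for w, z, y in zip(ws, zs, ys))
    det = sw * swz - swz * swz
    slope = (sw * swzy - swz * swy) / det
    intercept = (swy - slope * swz) / sw
    return intercept, slope

def double_hajek(ys, zs, ws):
    """Weighted treated mean minus weighted control mean."""
    t = sum(w * y for y, z, w in zip(ys, zs, ws) if z == 1) / sum(
        w for z, w in zip(zs, ws) if z == 1)
    c = sum(w * y for y, z, w in zip(ys, zs, ws) if z == 0) / sum(
        w for z, w in zip(zs, ws) if z == 0)
    return t - c

# Hypothetical toy data.
rng = random.Random(2)
ws = [rng.uniform(1.0, 5.0) for _ in range(50)]
zs = [i % 2 for i in range(50)]
ys = [rng.gauss(0.0, 1.0) + 2.0 * z for z in zs]

intercept, slope = weighted_ols_on_treatment(ys, zs, ws)
gap = abs(slope - double_hajek(ys, zs, ws))  # should be zero up to rounding
```

The intercept equals the weighted control-group mean and the slope equals the weighted difference in means, so any weighted least squares routine that accepts case weights should return the same point estimate.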
Properties of
Our estimator can be expressed as
[TABLE]
For the expectation of , we have
[TABLE]
For variance we use results and notation from Särndal, Swensson and Wretman (2003) to obtain approximate variance terms as follows. Define as the event of unit being selected and also treated. We then have and the probability that units and are both selected and treated is
[TABLE]
For the treatment group specifically we have
[TABLE]
with . The check notation denotes a value divided by its probability of being included in the sample: . The above is a classic ratio estimator with selection probabilities of for the ratio of
[TABLE]
since .
The approximate variance of a ratio estimator (Särndal, Swensson and Wretman, 2003) is:
[TABLE]
with
[TABLE]
We can estimate this variance with a sum over the treatment group of
[TABLE]
with and .
The Poisson-Bernoulli Model.
Under Poisson selection we have for (with ). With Bernoulli assignment we have for (with ) giving for and for . This gives
[TABLE]
and
[TABLE]
The above formulae are problematic in that they depend on our rather than the weights. However, if we assume we can make progress. In particular, in this case, under mild regularity conditions on the sampling probabilities, we can assume for all . This means that . Couple this with to get a fairly tight upper bound on our two formulae of
[TABLE]
and, using with ,
[TABLE]
Finally, to get the overall variance presented in Theorem 4.1, we first treat the sample in the treatment arm as independent of the sample in the control arm, which is again motivated by the assumption. For the control arm, we then repeat the above derivation with and . Lengthier derivations that account for the dependence structure give higher-order terms that are, in the end, negligible. See Wood (2008) for an approach.
Appendix D: The simulation’s DGP
In this section we provide additional simulation details and explanations of some of the choices made throughout the simulations of Section 5. In all our simulations, the potential outcomes are simulated as nonlinear functions of the weights.
To generate our populations we use the following algorithm: let be a correlation measuring the strength of the relationship between the weights and outcomes. We then generate two latent parameters as a bivariate standard normal draw with correlation . (We do this by generating , and , with .)
We then generate uniformly distributed weights on pre-specified interval by using the c.d.f. transformation:
[TABLE]
where is the standard normal c.d.f. We also generate shadow weights
[TABLE]
also uniform, and with the same distribution as .
Our potential outcomes are then a function of the shadow weights :
[TABLE]
with as independent Gaussian noise. The treatment potential outcomes are generated to give a non-linear heterogeneous treatment effect. When , , giving the strongest possible relationship between outcome and weight. Conversely when the weights are completely unrelated to the potential outcomes, so stratifying on them should not help improve estimation.
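The weight-generation step can be sketched as below. This is a minimal illustration under stated assumptions: the name `gamma` for the correlation parameter, the interval endpoints `lo` and `hi`, and the particular way of building the correlated normal pair (a standard construction) are choices made for the example, and the paper's outcome functions are not reproduced here.

```python
import math
import random
from statistics import NormalDist

def simulate_weights(n_pop, gamma, lo, hi, seed=0):
    """Return (weights, shadow_weights), each uniform on [lo, hi].

    The two latent normals have correlation gamma, so gamma = 1 makes the
    shadow weights identical to the weights and gamma = 0 makes them independent.
    """
    rng = random.Random(seed)
    phi = NormalDist().cdf  # standard normal c.d.f.
    weights, shadow = [], []
    for _ in range(n_pop):
        u = rng.gauss(0.0, 1.0)
        e = rng.gauss(0.0, 1.0)
        v = gamma * u + math.sqrt(1.0 - gamma ** 2) * e  # corr(u, v) = gamma
        weights.append(lo + (hi - lo) * phi(u))
        shadow.append(lo + (hi - lo) * phi(v))
    return weights, shadow

def correlation(xs, ys):
    """Plain Pearson correlation, used only to inspect the output."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

w_full, s_full = simulate_weights(5000, 1.0, 1.0, 9.0)
w_none, s_none = simulate_weights(5000, 0.0, 1.0, 9.0)
```

Because both sets of weights pass through the same c.d.f. transformation, their marginal distributions stay fixed as `gamma` changes; only their dependence varies, which is the property the text relies on for Simulation C.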
Once we have a population, we then sample with probability inversely proportional to the weight . For example, in Simulation A we take a fixed sample size of ( of the population). Our post-stratification estimator stratifies on the weight to increase precision. The stratifying variable is defined in Appendix B: Post-Stratification for PATE in Survey Experiments.
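One way to draw a fixed-size sample with selection propensity inversely proportional to the weight is sketched below. The text does not say which unequal-probability scheme was used, so this example substitutes the Efraimidis-Spirakis keyed method (equivalent to successive sampling without replacement); under it, inclusion probabilities are only approximately proportional to the inverse weights. The population size, sample size, and weight range are hypothetical.

```python
import math
import random

def sample_inverse_to_weight(weights, n, seed=0):
    """Pick n distinct indices, favoring units with small weights.

    Each unit gets the key log(u) / p with p = 1 / weight, i.e. log(u) * weight;
    the n largest keys form the sample (Efraimidis-Spirakis).
    """
    rng = random.Random(seed)
    keyed = []
    for i, w in enumerate(weights):
        u = 1.0 - rng.random()  # lies in (0, 1], so log(u) is always defined
        keyed.append((math.log(u) * w, i))
    keyed.sort(reverse=True)
    return [i for _, i in keyed[:n]]

# Hypothetical population of weights, uniform on [1, 9].
rng = random.Random(4)
pop_weights = [rng.uniform(1.0, 9.0) for _ in range(5000)]
sample_idx = sample_inverse_to_weight(pop_weights, 500)

pop_mean_w = sum(pop_weights) / len(pop_weights)
sample_mean_w = sum(pop_weights[i] for i in sample_idx) / len(sample_idx)
```

Units with large weights are under-represented in the draw, so the mean weight in the sample falls below the population mean, matching the statement that the weight distributions differ between sample and population.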
Simulation A has maximal covariance, with . Figure 3 shows a subset of the population and a sample from this scenario to illustrate the structure of our DGP. Figure 3(a) shows the characteristics of the simulated population while Figure 3(b) shows how a weighted sample might look.
Overall, Figure 3 shows that the weight and potential outcome distributions differ in the sample and population. Furthermore, because the potential outcomes are related to the weights they are consequently related to the post-stratification levels in the sample.
For Simulation B we simply replace the formula for with a constant treatment effect of , so . We still have the same general relationships between the sample and population, but as we see in Section 5, the estimators behave quite differently.
For Simulation C we varied , which controls the relationship between the weight and the potential outcomes. The top two right-most panels of Figure 3(b) show there is smaller variability within strata for and than if we consider the entire sample at once. As our weights become less predictive of outcome, this variability will increase. Our formulation, however, maintains the marginal distributions of , , and as changes so that any benefits we see from post-stratification can only be attributed to the changing relationship.
Appendix E: Further Details and Results of the Real Data Application
As mentioned in the main text, the 92 survey experiments analyzed in Section 6 were generated from 18 unique randomizations on 7 separate surveys. We split each randomization by subject party identification and considered multiple outcomes per treatment randomization. One might worry that the potential correlation of the multiple outcomes might be influencing the results, so we append here the results when considering only one unique outcome per randomization.
The 18 unique randomizations give rise to 36 survey experiments after splitting each randomization by subject party identification, considering only the larger Democratic and Republican leaning subgroups. of them () showed SATEs that were significantly different from zero. Once the weights were taken into account to estimate the PATE (via the double-Hàjek estimate) experiments () had significant effects. Even though more experiments showed significant PATE than SATE estimates, incorporating weights still increased standard errors: there was a average increase in variance of over across experiments. The raw SE increases can be seen in Figure 4(a).
We further examined whether there is evidence of some experiments having a PATE substantially different from the SATE. We calculated the 36 values and compared them to a standard normal with a qq-plot (Figure 4(b)). While visually there do seem to be some distributional departures from a standard Normal, a Kolmogorov–Smirnov test does not support this hypothesis (with a p-value of ). Furthermore, an FDR test also fails to find any experiments with significant differences. All of this suggests a general equivalence between the SATE and the PATE in this subset of experiments as well.
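A false discovery rate screen of this kind can be sketched as follows. The text does not name the procedure, so the Benjamini-Hochberg step-up rule is assumed here, applied to two-sided p-values from standardized differences; the z-values in the example are made up for illustration and are not the paper's data.

```python
import math

def two_sided_p(z):
    """Two-sided p-value of z under a standard normal reference."""
    return math.erfc(abs(z) / math.sqrt(2.0))

def benjamini_hochberg(pvals, q=0.05):
    """Return a reject/keep flag per hypothesis, controlling FDR at level q."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    k_max = 0
    for rank, i in enumerate(order, start=1):
        if pvals[i] <= q * rank / m:
            k_max = rank  # largest rank whose p-value clears its threshold
    reject = [False] * m
    for i in order[:k_max]:
        reject[i] = True
    return reject

# Hypothetical standardized differences between two estimates.
example_z = [0.1, -0.5, 5.0]
flags = benjamini_hochberg([two_sided_p(z) for z in example_z])
```

If every standardized difference is modest, no hypothesis clears its threshold and the screen flags nothing, which is the pattern the text reports for these experiments.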
To explore whether post-stratification on weights improved precision, we compared the estimated SEs. The estimated SEs of are very similar to those for , with an average increase of about . Post-stratifying on party ID in the original 18 experiments led to modest variance reduction. Relative to no stratification, we see an average reduction of in variance across experiments with participants of both major parties. If we post-stratify on both party ID and the weights, we see an average reduction of . These findings, like those in the main text, show that while post-stratification should in theory help reduce the variance, the gains can be rather modest in practice.
[FIGURE:]
References
- Aronow, Peter M. and Joel A. Middleton. 2013. “A class of unbiased estimators of the average treatment effect in randomized experiments.” Journal of Causal Inference 1(1):135–154.
- Berinsky, Adam J., Gregory A. Huber and Gabriel S. Lenz. 2012. “Evaluating Online Labor Markets for Experimental Research: Amazon.com’s Mechanical Turk.” Political Analysis 20(3):351–368.
- Cochran, William G. 1977. Sampling Techniques, 3rd Edition. New York: John Wiley and Sons.
- Cole, S. R. and Elizabeth A. Stuart. 2010. “Generalizing Evidence From Randomized Clinical Trials to Target Populations: The ACTG 320 Trial.” American Journal of Epidemiology 172(1):107–115.
- Ding, Peng, Avi Feller and Luke Miratrix. 2016. “Decomposing treatment effect variation.” arXiv preprint arXiv:1605.06566.
- DuGoff, Eva H., Megan Schuler and Elizabeth A. Stuart. 2013. “Generalizing Observational Study Results: Applying Propensity Score Methods to Complex Surveys.” Health Services Research 49(1):284–303.
- Franco, Annie, Gabor Simonovits, Neil Malhotra and L.J. Zigerell. Forthcoming. “Developing Standards for Post-Hoc Weighting in Population-Based Survey Experiments.” Journal of Experimental Political Science.
