Bias and high-dimensional adjustment in observational studies of peer effects

Dean Eckles; Eytan Bakshy

arXiv:1706.04692·stat.ME·February 16, 2021

TL;DR

This study demonstrates that high-dimensional behavioral data and propensity score models can significantly improve causal inference of peer effects in observational social media studies, achieving estimates comparable to randomized experiments.

Contribution

It shows that detailed past behavior records and advanced statistical models can reduce bias in observational peer effect studies, validating their credibility.

Findings

01. Naive estimators overstate peer effects by 320%.

02. Adjusting for prior behaviors reduces bias by 91%.

03. Full high-dimensional models reduce bias by over 97%.

Abstract

Peer effects, in which the behavior of an individual is affected by the behavior of their peers, are posited by multiple theories in the social sciences. Other processes can also produce behaviors that are correlated in networks and groups, thereby generating debate about the credibility of observational (i.e. nonexperimental) studies of peer effects. Randomized field experiments that identify peer effects, however, are often expensive or infeasible. Thus, many studies of peer effects use observational data, and prior evaluations of causal inference methods for adjusting observational data to estimate peer effects have lacked an experimental "gold standard" for comparison. Here we show, in the context of information and media diffusion on Facebook, that high-dimensional adjustment of a nonexperimental control group (677 million observations) using propensity score models produces…

Tables

Table 1: Variables included in models predicting exposure. The final column indicates which base model specifications include that variable. Some variables are transformed and/or contribute multiple inputs (columns) to a model. †: Includes untransformed and squared terms, $x$ and $x^{2}$; ∗: Transformed with $\log(x+1)$; ‡: Includes binary indicator, $1\{x>0\}$. All other variables are untransformed; if categorical, one indicator (dummy) for each level is included in the model matrix.

Category	Name	Description	Columns	Models
Demographics	Age^†	As indicated on profile	2	$𝖠, 𝖣$
	Gender	Indicated or inferred: female, male, or unknown	2	$𝖠, 𝖣$
Facebook	Friend count	Number of extant friendships	1	$𝖠$
	Friend initiation	Number and proportion of extant friendships initiated	2	$𝖠$
	Tenure^∗	Days since registration of account	1	$𝖠$
	Profile picture	Whether the user has a profile picture	1	$𝖠$
	Visitation freq.	Days active in prior 30, 91, and 182 day periods	3	$𝖠$
Communication	Action count^∗	Number of posts (including URLs), comments, and likes in a prior one month period	1	$𝖠$
	Post count^∗	Number of posts (including URLs) in a prior one month period	1	$𝖠$
	Comment count^∗	Number of comments on posts in a prior one month period	1	$𝖠$
	Like count^∗	Number of posts and comments “liked” in a prior one month period	1	$𝖠$
Link sharing	Shares^∗‡	Number of URLs shared in a one month period	2	$𝖠$
	Unique domains^∗	Number of unique domains of URLs shared in a six month period	1	$𝖠$
	Same domain shares^∗‡	Number of times shared any URL with the same domain as outcome URL in six month period	2	$𝗌$
	Other domain shares^∗	Number of times shared any URL in six month period for each of the other domains	3,703	$𝖬$

Table 2: Comparison of experimental and observational estimates of peer effects. Estimates of the probability of sharing if not exposed $p^{(0)}$, relative risk ($RR$), and the risk difference ($\delta$) for each model with 95% bootstrap standard confidence intervals in brackets.

Model	$\hat{p}^{(0)}$	$\widehat{RR}$	$\hat{\delta}$
AMs	1.751e-04 [1.563e-04, 1.940e-04]	7.44 [6.65, 8.23]	1.128e-03 [1.097e-03, 1.159e-03]
Ms	1.677e-04 [1.503e-04, 1.851e-04]	7.77 [7.02, 8.52]	1.135e-03 [1.109e-03, 1.162e-03]
AM	1.124e-04 [9.828e-05, 1.265e-04]	11.59 [10.42, 12.77]	1.191e-03 [1.163e-03, 1.218e-03]
M	9.989e-05 [8.855e-05, 1.112e-04]	13.05 [11.84, 14.25]	1.203e-03 [1.179e-03, 1.228e-03]
As	1.489e-04 [1.329e-04, 1.649e-04]	8.75 [7.80, 9.70]	1.154e-03 [1.127e-03, 1.181e-03]
Ds	1.469e-04 [1.318e-04, 1.619e-04]	8.87 [7.92, 9.82]	1.156e-03 [1.128e-03, 1.183e-03]
A	6.501e-05 [5.761e-05, 7.241e-05]	20.04 [17.82, 22.27]	1.238e-03 [1.214e-03, 1.263e-03]
D	5.806e-05 [5.018e-05, 6.593e-05]	22.45 [19.53, 25.36]	1.245e-03 [1.221e-03, 1.269e-03]
naive	4.567e-05 [3.842e-05, 5.291e-05]	28.54 [24.12, 32.95]	1.257e-03 [1.233e-03, 1.282e-03]
exp	1.920e-04 [1.845e-04, 1.995e-04]	6.79 [6.54, 7.04]	1.111e-03 [1.086e-03, 1.136e-03]

Equations

\hat{e}_{d}(X_{iu})=\text{logit}^{-1}\big(X_{iu}\hat{\beta}_{d}\big).

\hat{p}_{dj}^{(0)}=\frac{1}{n_{dj}^{(0)}}\sum_{\langle i,u\rangle\in C(d)}Y_{iu}\,1\big[\hat{e}_{d}(X_{iu})\in\hat{Q}_{dj}\big]

\hat{p}_{d}^{(0)}=\sum_{j=1}^{J}\frac{n_{dj}^{(1)}}{n_{d}^{(1)}}\,\hat{p}_{dj}^{(0)}.

\hat{p}^{(0)}=\sum_{d}\frac{n_{d}^{(1)}}{n^{(1)}}\,\hat{p}_{d}^{(0)}.

100\,\frac{RR_{m}-RR_{exp}}{RR_{exp}}.

100\,\frac{\hat{\delta}_{m}-\hat{\delta}_{exp}}{\hat{p}^{(0)}}
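As a rough check, the percent-error formula for the relative risk can be applied to the point estimates in Table 2. The sketch below is illustrative only: the `bias_reduction` helper is an assumed definition (the share of the naive estimator's error that a model removes), and the inputs are the rounded table values, so results agree with the reported 320%, 91%, and over 97% only approximately.

```python
# Relative-risk point estimates copied from Table 2 (rounded as reported).
rr = {"naive": 28.54, "D": 22.45, "A": 20.04, "As": 8.75, "AMs": 7.44, "exp": 6.79}


def percent_error(rr_model, rr_exp):
    """Percent error of a model's RR relative to the experimental RR."""
    return 100 * (rr_model - rr_exp) / rr_exp


def bias_reduction(rr_model, rr_naive, rr_exp):
    """Assumed definition: percent of the naive estimator's error that is removed."""
    return 100 * (1 - (rr_model - rr_exp) / (rr_naive - rr_exp))


for name in ("naive", "D", "As", "AMs"):
    err = percent_error(rr[name], rr["exp"])
    red = bias_reduction(rr[name], rr["naive"], rr["exp"])
    print(f"{name:>5}: error {err:6.1f}%  bias reduction {red:5.1f}%")
```

With these inputs the naive estimator's error is about 320%, model As removes about 91% of it, and model AMs removes slightly more than 97%.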

Code & Models

Repositories

fghjorth/vkme18

Taxonomy

Methods: Causal inference

Full text

Bias and high-dimensional adjustment in observational studies of peer effects∗

Dean Eckles¹ and Eytan Bakshy²

¹Massachusetts Institute of Technology, ²Facebook

Abstract

Peer effects, in which the behavior of an individual is affected by the behavior of their peers, are posited by multiple theories in the social sciences. Other processes can also produce behaviors that are correlated in networks and groups, thereby generating debate about the credibility of observational (i.e. nonexperimental) studies of peer effects. Randomized field experiments that identify peer effects, however, are often expensive or infeasible. Thus, many studies of peer effects use observational data, and prior evaluations of causal inference methods for adjusting observational data to estimate peer effects have lacked an experimental “gold standard” for comparison. Here we show, in the context of information and media diffusion on Facebook, that high-dimensional adjustment of a nonexperimental control group (677 million observations) using propensity score models produces estimates of peer effects statistically indistinguishable from those from using a large randomized experiment (220 million observations). Naive observational estimators overstate peer effects by 320% and commonly used variables (e.g., demographics) offer little bias reduction, but adjusting for a measure of prior behaviors closely related to the focal behavior reduces bias by 91%. High-dimensional models adjusting for over 3,700 past behaviors provide additional bias reduction, such that the full model reduces bias by over 97%. This experimental evaluation demonstrates that detailed records of individuals’ past behavior can improve studies of social influence, information diffusion, and imitation; these results are encouraging for the credibility of some studies but also cautionary for studies of rare or new behaviors. More generally, these results show how large, high-dimensional data sets and statistical learning techniques can be used to improve causal inference in the behavioral sciences.

We are grateful to L. Adamic, S. Aral, J. Bailenson, J. H. Fowler, W. H. Hobbs, G. W. Imbens, S. Messing, C. Nass, M. Nowak, A. B. Owen, A. Peysakhovich, B. Reeves, D. Rogosa, J. Sekhon, A. C. Thomas, J. Ugander, and participants in seminars at New York University Stern School of Business, Stanford University Graduate School of Business, UC Berkeley Department of Biostatistics, Johns Hopkins University Bloomberg School of Public Health, University of Chicago Booth School of Business, Columbia University Department of Statistics, and UC Davis Department of Statistics for comments on this work. D.E. was previously an employee of Facebook while contributing to this research and is a contractor with Facebook. D.E. and E.B. have significant financial interests in Facebook.

1 Introduction

Understanding how the behavior of individuals is affected by the behavior of their peers is of central importance for the social and behavioral sciences, and many theories suggest that positive peer effects are ubiquitous (Sherif, 1936; Blume, 1995; Centola and Macy, 2007; Granovetter, 1978; Manski, 2000; Montanari and Saberi, 2010). However, it has been difficult to identify and estimate peer effects in situ. Much of the most credible evidence about peer effects in humans and primates comes from small experiments in artificial social environments (Asch, 1956; Sherif, 1936; Whiten et al., 2005; Herbst and Mas, 2015). In some cases, field experiments modulating tie formation and group membership (Sacerdote, 2001; Zimmerman, 2003; Lyle, 2007; Carrell et al., 2009; Centola, 2010; Firth et al., 2016), shocks to group or peer behavior (Aplin et al., 2015; Banerjee et al., 2013; Bond et al., 2012; Cai et al., 2015; Eckles et al., 2016; van de Waal et al., 2013), or subsequent exposure to peer behaviors (Aral and Walker, 2011; Bakshy et al., 2012a; Salganik et al., 2006) have been possible, but in many cases these experimental designs are infeasible. Thus, much recent work on peer effects uses observational data from new large-scale measurement of behavior (Aral et al., 2009; Bakshy et al., 2011; Friggeri et al., 2014; Ugander et al., 2012; Allen et al., 2013) or longitudinal surveys (Christakis and Fowler, 2007; Iyengar et al., 2011; Banerjee et al., 2013; Card and Giuliano, 2013; Christakis and Fowler, 2013; Fortin and Yazbeck, 2015). Many of these studies are expected to suffer from substantial confounding of peer effects with other processes that also produce clustering of behavior in social networks, such as homophily (McPherson et al., 2001) and external causes common to network neighbors. 
Thus, even when issues of simultaneity (including “reflection”; Manski, 1993) can be avoided, it is generally not possible to identify peer effects using observational data without the often suspect assumption that adjusting for available covariates is sufficient to make peer behavior unconfounded (i.e., as-if randomly assigned; Shalizi and Thomas, 2011; Angrist, 2014). However, even if these assumptions are not strictly satisfied, some observational estimators, especially those that adjust for numerous or particularly relevant behavioral variables, may have relatively small bias in practice, such that the bias is small compared with other sources of error (e.g., sampling error) or is small enough to not change choices of theories or policies.

Using a massive field experiment as a “gold standard”, we conduct a constructed observational study by adding a nonexperimental control group to a randomized experiment (LaLonde, 1986; Dehejia and Wahba, 1999, 2002; Hill et al., 2004) to assess bias in observational estimators of peer effects in the diffusion of information and media, which has been widely studied (Katz and Lazarsfeld, 1955; Wu et al., 2011; Berger, 2011; Myers et al., 2012; Bakshy et al., 2012b; Bakshy et al., 2015; Friggeri et al., 2014; Cheng et al., 2014; Flaxman et al., 2016; Goel et al., 2015). Diffusion of information and media, especially via Internet services, is now central to multiple topics in applied research, including studies of product adoption (Goel et al., 2015) and political participation (Friggeri et al., 2014; Bakshy et al., 2015; Barberá et al., 2015; Flaxman et al., 2016; Allcott and Gentzkow, 2017). The present work is the first to experimentally evaluate state-of-the-art estimators for observational studies of peer effects, and it does so in a setting of substantial and increasing relevance.

We review related work in the next subsection. Section 2 describes the randomized experiment, observational data, and estimators used in this study. Section 3 compares the resulting experimental and observational estimates of peer effects, where we find substantial variation among the performance of the observational estimators.

1.1 Related work

Prior evaluations of observational estimators of peer effects lacked comparison with an experiment and instead relied on sensitivity analysis (VanderWeele, 2011), simulations (Thomas, 2013), and analyses when the absence of peer effects is assumed (Cohen-Cole and Fletcher, 2008). For this reason, we review prior work on peer effects in this section, but also consider observational–experimental comparisons in other areas of study. While causal inference for peer effects faces distinctive threats to validity (e.g., homophily), the present study also has methodological advantages over many prior observational–experimental comparisons for educational, medical, and public policy interventions (e.g., LaLonde, 1986; Heckman et al., 1997; Dehejia and Wahba, 1999, 2002; Hill et al., 2004; Michalopoulos et al., 2004; Diaz and Handa, 2006; Shadish et al., 2008). Unlike other constructed observational studies (where, e.g., different survey questions were used for the experimental and nonexperimental data (Diaz and Handa, 2006)), here measures and outcomes for the experimental and nonexperimental data are identically defined and measured. Furthermore, unlike prior “double-randomized” designs (Shadish et al., 2008), this study has sufficient statistical power to detect confounding bias and examines processes that cause exposure in situ. Here we are able to study peer effects across many distinct behaviors and evaluate high-dimensional adjustment for covariates that can be constructed from routinely collected behavioral data, rather than custom-made survey instruments (Diaz and Handa, 2006; Shadish et al., 2008; Pohl et al., 2009). In the remainder of this section, we make more detailed comparisons with such prior evaluation of observational methods outside the study of peer effects. We focus on single-study comparisons, where the prior work is most similar to the present contribution. 
We highlight some of the sources of ambiguity in this prior work and note the advances of the present study with respect to methodology and applicability to contemporary research.

1.1.1 Simulations

Simulations can be used to evaluate observational methods under known models for selection into the treatment. Of course, simulations require assuming both selection and outcome models, which are typically unknown for the circumstances of interest. Simulations have been used to evaluate the consequences of adjustment for additional covariates (Steiner et al., 2015), which need not always be bias reducing (Ding and Miratrix, 2015). In the context of peer effects, simulations have been used to illustrate how peer effects are confounded with homophily (Shalizi and Thomas, 2011) and the consequences of specific modeling choices (Thomas, 2013). Realism can be somewhat improved by using real data as a starting point for a simulation, e.g., by only simulating the missing potential outcomes (e.g., Imai and van Dyk, 2004) but using real covariates and treatment data. Nonetheless, simulations are limited in their ability to tell us about the magnitude of bias and bias reduction in practice.

1.1.2 Meta-analyses

Meta-analyses of observational and experimental studies are sometimes used to assess the bias of observational methods. For example, Hemkens et al. (2016) compare experimental and observational studies of effects of several clinical treatments on mortality. Similarly, Schuemie et al. (2014) examine numerous studies using routinely collected data and demonstrate how dependent observational estimates can be on specific modeling choices. Meta-analyses often face the problem that the observational and experimental studies differ in numerous ways besides how units were assigned to treatment. The populations sampled from, the implementation of the treatment, and the outcome measures may all differ. In the context of peer effects, meta-analysis is especially difficult because of the small number of field experiments previously conducted. One recent meta-analytic study of peer effects in worker productivity (Herbst and Mas, 2015) is able to compare lab experiments with field studies, but the field studies are largely not randomized experiments; they do not attempt a comparison of field experiments and observational studies. [Footnote 1: There are two field studies with random assignment. Both use random assignment of peers, rather than random assignment of exposure to peer behavior. Goto et al. (2011) estimate peer effects in the productivity of rice planters; this estimate is the largest of any lab or field study in the meta-analysis. Guryan et al. (2009) estimate peer effects in the performance of professional golfers; they find no significant peer effects. These distinctive settings and results illustrate the challenge of conducting a meta-analysis such that randomized and observational studies are comparable.] To our knowledge, the experiment we use is the only large field experiment identifying peer effects in online information diffusion. 
So a meta-analysis here would be limited to comparing a single experimental estimate (or a small number of estimates) to a number of observational estimates.

1.1.3 Single-study comparisons

Single-study comparisons — such as the present work — involve comparing the results from a single study that includes observations where some units are randomly assigned and other units are not randomly assigned. Among single-study comparisons, one can distinguish between studies that combine such data by different means.

Constructed observational studies. First, there are studies in which an existing randomized experiment is augmented by adding observational data, usually by adding a non-experimental control group (NECG); these are sometimes called constructed observational studies (Hill et al., 2004; Hill, 2008), since an observational study is constructed by the addition of data to an experiment. Prominent examples of constructed observational studies have compared observational and experimental estimates of effects of job training programs (LaLonde, 1986; Heckman et al., 1997; Dehejia and Wahba, 1999, 2002), social welfare interventions (e.g., conditional cash transfers; Diaz and Handa, 2006), welfare-to-work programs (Michalopoulos et al., 2004), and medical interventions (Hill et al., 2004). There are a small number of papers that conduct quantitative (Heinsman and Shadish, 1996; Glazerman et al., 2003) or qualitative (Michalopoulos et al., 2004; Cook et al., 2008; Wong et al., 2016) reviews of these within-study comparisons, though the set of available studies to review has been dominated by studies of job-training and other educational treatments.

The results of these constructed observational studies have often been ambiguous for multiple reasons. First, different choices of models and theory-driven criteria for excluding units yield dramatically different comparisons with the experimental estimates. For example, in evaluating a job training program, Dehejia and Wahba (2002) exclude units that do not have earning data for two prior periods, based on theoretical reasons to expect a dip in earnings right before the decision to enroll; other authors have highlighted how consequential this decision is for the results (Smith and Todd, 2005; Dehejia, 2005; Hill, 2008). Second, while it has been common to interpret observational–experimental discrepancies as due to bias in the observational estimators, there are often other explanations. Though constructed observational studies aim to avoid the incomparability of observational and experimental estimates that occurs in meta-analyses, Shadish et al. (2008) argue that many constructed observational studies “confound assignment method with other study features” (p. 1335), such as the places or times data is sampled from, the implementation of the treatments, the version of the measures used as covariates or outcomes, and different patterns of missing data. For example, while Diaz and Handa (2006) find experimental–observational discrepancies in estimated effects of a conditional cash transfer program, for many of the outcomes these can be attributed to differences in the survey measures used for the experimental and observational data. Despite these limitations, of the six observational–experimental comparisons they review, Cook et al. (2008) categorize Diaz and Handa (2006), along with Shadish et al. (2008), as one of the two less ambiguous comparisons.

Doubly randomized preference trials. Second, there are studies in which units are randomly assigned to whether they will be randomly assigned to treatment or whether another process (e.g., self-selection when given a choice) will determine their treatment. This design is sometimes called a doubly randomized preference trial (DRPT) (Long et al., 2008; Shadish et al., 2008). Shadish et al. (2008) motivate its use for evaluating observational methods by noting some of the shortcomings of constructed observational studies we have described above: a DRPT involves only varying the assignment method, while holding constant the population, treatment implementation, and measures.

Papers reporting on DRPTs have argued they provide evidence about the bias of observational estimators and which types of covariates and analysis methods most reduce that bias. Shadish et al. (2008, p. 1341) conclude that “this study suggests that adjusted results from nonrandomized experiments can approximate results from randomized experiments.” Steiner et al. (2010, p. 260) describe their analysis as having “identified the specific contributions” made by different sets of covariates. Pohl et al. (2009), who conduct a smaller replication of this DRPT, conclude that “we have shown that it is possible to model the selection process and to get an unbiased treatment effect even with nonrandomized experiments” (p. 475) and that “the choice of covariates made all the difference, and the mode of data analysis [regression adjustment, propensity score methods] did not” (p. 474).

DRPTs are a useful tool for evaluating observational methods, but the DRPTs conducted to date have important threats to their statistical and external validity. The two DRPTs described above have relatively small samples (n = 445 and n = 202) for making a comparison between estimates computed on disjoint halves of the data. The conclusions reviewed above are apparently not based on formal statistical inference: for Shadish et al. (2008) and Steiner et al. (2010) the experimental and unadjusted observational estimates are statistically indistinguishable, so their results are “basically descriptive” (Steiner et al., 2010, p. 256). The same is true of the smaller replication by Pohl et al. (2009). Thus, there is no evidence of any bias to be reduced or eliminated. Of course, depending on how much bias was expected, this could be notable on its own. While these papers do not report such tests, the comparisons among different observational estimators will generally have greater power. A reanalysis of the results reported in Steiner et al. (2010) suggests that at least some of the observational estimators are converging to different estimands; see Supplementary Information Section 5. Thus, if these studies do provide formal statistical evidence for bias and bias reduction, it is through comparison of multiple observational estimators that would all converge to the same quantity in the absence of bias; that is, the randomized experiment arm of these particular DRPTs does not contribute evidence of bias or bias reduction.

Besides issues of statistical validity, DRPTs can lack external validity and relevance to use of routinely collected data for social science. In addition to using convenience samples of college students, many of the factors that are expected to contribute to confounding bias are absent from the self-selection process in the non-randomized arms of the DRPTs. That is, the participants are presented with a choice between two alternative activities (math and vocabulary exercises), so it is not that some people are unaware of some of the choices, unable to choose them because of costs, etc. This simplification is especially notable for rare treatments. In the context of peer effects, where the treatment is exposure to, e.g., a friend’s adoption of a product or sharing of a URL, such rarity is common.

Differences with the present study. We have described the present study as a constructed observational study, in that we graft observational data (the NECG) onto a randomized experiment. Like some other constructed observational studies, the present study has the advantage of evaluating observational methods when applied to the actual processes of selection into treatment (i.e., exposure to a peer behavior) in situ.

The present study also enjoys some of the advantages of DRPTs. Like DRPTs, the experimental control group and the non-experimental control group are drawn from the same population and all of the variables are measured in the same way. In fact, our NECG consists of individuals and items drawn from the same population of individuals and items in the experiment. Thus, the main explanation of any observational–experimental discrepancies is a process that causes users to be selected into exposure to URLs they are more (or less) likely to share — that is, confounding by, e.g., homophily, common external causes, and past influence.

Because there are millions of distinct URLs that can be grouped into thousands of domains, each of which is analyzed separately, the present study also allows for a form of internal replication that makes it similar to meta-analysis. Other single-study comparisons have usually only studied a single contrast between two treatments.

The present study is of particular relevance to other contemporary empirical research that makes use of new sources of routinely collected data about human behavior, such as from Internet services and economic activity. This is in contrast to some of the prior work that made use of custom purpose-built measures (e.g., measures of preferences for one treatment over another, as in DRPTs) that are of less relevance to use of routinely collected data.

2 Data and method

We analyze a large experiment that randomly modulated the primary mechanism of peer effects in information and media sharing behaviors on Facebook: the Facebook News Feed (Bakshy et al., 2012b). Facebook users can share links to particular Web pages (URLs), which is a common way of disseminating news, entertainment, and other information and media. Using a cryptographic hash function, a small percentage of user–URL pairs were randomly assigned to a no feed condition in which News Feed stories about a peer sharing that URL were not displayed to that focal user. Deliveries of these stories and held out deliveries (i.e., those for pairs in the no feed condition) were recorded. Taking exposure to a peer sharing a URL as the treatment, this experiment identifies peer effects for users who would have been exposed to peer sharing. These peer effects may arise from multiple processes, including peer sharing exposing egos to novel information, as is typical in many diffusion and product-adoption settings. Unlike other work (Bakshy et al., 2012a) that holds the content fixed and only considers the presence of a peer’s name being linked to the content, this experiment targets an estimand that combines these effects. Key causal quantities identified by the experiment include the relative risk of sharing, $RR=p^{(1)}/p^{(0)}$, and the risk difference of sharing (i.e., the average treatment effect on the treated), $\delta=p^{(1)}-p^{(0)}$, where $p^{(1)}$ is the probability of sharing a particular URL when exposed to a peer sharing that URL for those that would be exposed and $p^{(0)}$ is the probability of sharing a particular URL when not exposed to a peer sharing that URL for those that would be exposed. Note that $p^{(0)}$, but not $p^{(1)}$, involves a counterfactual.
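To make the two estimands concrete, the short sketch below evaluates them on the experimental row of Table 2. It assumes that $\hat{p}^{(1)}$ can be recovered as $\hat{p}^{(0)}+\hat{\delta}$ from the rounded table entries; the function and variable names are illustrative and not taken from the paper.

```python
def relative_risk(p1, p0):
    """RR = p1 / p0: ratio of sharing probabilities with and without exposure."""
    return p1 / p0


def risk_difference(p1, p0):
    """delta = p1 - p0: difference in sharing probabilities."""
    return p1 - p0


# Experimental ("exp") row of Table 2, rounded as reported.
p0_exp = 1.920e-04      # estimated probability of sharing if not exposed
delta_exp = 1.111e-03   # estimated risk difference

# Assumption: recover the exposed sharing probability as p0 + delta.
p1_exp = p0_exp + delta_exp

rr_exp = relative_risk(p1_exp, p0_exp)
print(f"p1 = {p1_exp:.3e}, RR = {rr_exp:.2f}, delta = {risk_difference(p1_exp, p0_exp):.3e}")
```

The recovered relative risk is close to the 6.79 reported in Table 2. Because the baseline probability is so small, a modest absolute difference corresponds to a large ratio.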

2.1 Propensity score modeling and stratification

All of the observational estimators evaluated here are the result of post-stratification (i.e., subclassification) by the domain name of the URL (e.g., for the URL http://www.cnn.com/article_x, the domain is www.cnn.com); that is, per-domain estimates are combined weighting by the number of exposed observations per domain. The adjusted estimators additionally use granular stratification on estimated propensity scores (Rosenbaum and Rubin, 1983, 1984; Rubin, 1997). The propensity score $e(X_{iu})=\text{Pr}(E_{iu}=1\;|\;X_{iu})$ is the probability that user $i$ is exposed to a peer sharing URL $u$, where $X_{iu}$ are variables describing that user–URL pair (Rosenbaum and Rubin, 1984). In an observational study, researchers typically rely on the following assumptions. Under conditional unconfoundedness, the potential outcomes are independent of the exposure conditional on the covariates, $(Y_{iu}(0),Y_{iu}(1))\perp E_{iu}\;|\;X_{iu}$. Under overlap or positivity, units have positive conditional probability of exposure and non-exposure, $e(X_{iu})\in(0,1)$. Conditional unconfoundedness implies that exposure is also unconfounded conditional on $e(X_{iu})$. In observational studies, propensity scores are estimated using available covariates and conditioned on using regression adjustment, matching, weighting, or post-stratification. We estimate propensity scores using $L_{2}$-penalized logistic regression (i.e., logistic ridge regression) using different sets of predictors. We set the $L_{2}$ penalty $\lambda=0.5$. The effect of the penalty on the resulting estimates is expected to be small for two reasons. First, the estimated scores are used for stratification, so only the rank of the scores matters for the analysis; thus, the size of the penalty primarily serves to control how much more small principal components are shrunk than the larger principal components. Note that for linear ridge regression with a univariate or orthonormal basis, the penalty has no effect on the ranks of the scores. 
Second, most of the models have many more observations than input dimensions; even for models with $\mathsf{M}$ , since this matrix is sparse, for domains with a small number of observations, only some of the columns have any non-zero values. Thus, changes to $\lambda$ are expected to produce only small changes to the estimates. Analysis of other penalties $\lambda\in\{0.1,0.5,5,50\}$ (not shown) was consistent with this expectation.
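For concreteness, one common way to write an $L_{2}$-penalized logistic regression fit for a single domain $d$ is given below. This is a sketch, not necessarily the exact specification used here: the text does not state how the penalty is scaled (e.g., $\lambda$ versus $\lambda/2$, or whether the log-likelihood is averaged) or whether an intercept is penalized, and $\mathcal{D}_{d}$ is notation introduced only for this illustration to denote all user–URL pairs in domain $d$.

```latex
\hat{\beta}_{d}
  = \arg\min_{\beta}
    \Big\{
      -\sum_{\langle i,u\rangle \in \mathcal{D}_{d}}
        \big[ E_{iu}\log e_{\beta}(X_{iu})
              + (1-E_{iu})\log\big(1-e_{\beta}(X_{iu})\big) \big]
      + \lambda \lVert \beta \rVert_{2}^{2}
    \Big\},
\qquad
e_{\beta}(x) = \text{logit}^{-1}(x\beta).
```

Under this form, a larger $\lambda$ shrinks $\hat{\beta}_{d}$ toward zero. Since the scores are used only for stratification, a change in $\lambda$ affects the analysis only to the extent that it reorders the fitted values $X_{iu}\hat{\beta}_{d}$.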

Since the true model for peer and ego behavior is expected to be highly heterogeneous across very different URLs, we fit a separate model for each domain. This also facilitates a form of internal replication. So for model $m$ , the estimated propensity score for user $i$ being exposed to a URL $u$ from domain $d$ is

[TABLE]

This procedure is conducted for each of the models described below. (We suppress indication of which model $m$ is used from notation except where needed.)

The resulting estimated propensity scores can then be used in three closely related ways: to construct weights for each unit, to match exposed and unexposed units, or to divide the sample into strata (i.e., subclasses). We use (post-)stratification (i.e., subclassification) on the estimated propensity scores. Such stratification can also be regarded as a form of nonparametric weighting or a form of matching, sometimes called “blocking” (Imbens, 2004) or “interval matching” (Morgan and Harding, 2006), that does not impose a particular ratio of treated to control units, as one-to-one matching methods do. For very large data sets, such as the current study, stratification has computational advantages over matching, easily supports a much larger control group than treatment group, and the larger sample sizes afford using more strata than is otherwise common. Additionally, as discussed in Supplementary Information Section 2.3, the best available method for producing confidence intervals (a multiway cluster-robust bootstrap strategy) is inconsistent for nearest neighbor matching, unlike stratification, on propensity scores. For smaller data sets without this dependence structure, other recent developments, such as direct matching on many covariates, could be preferable (Diamond and Sekhon, 2013). Post-stratification on propensity scores is not covered by some recent results by King and Nielsen (2015), who argue against the use of propensity scores for matching. Together, these considerations motivated our use of propensity score stratification.

The boundaries of the strata are given by quantiles of the estimated propensity scores for each user–URL pair within each domain. For each domain, we use a number of strata $J$ proportional to the square root of the number of exposed user–URL pairs, though any large number of strata yields similar results (see Supplementary Information Section 2.1). So for each model $m$, domain $d$, and $j\in\{1,2,\ldots,J\}$ we have an interval $\hat{Q}_{dj}\subset[0,1]$ of the scores between the $(j-1)$th and $j$th quantiles. (For some of the simpler models, discreteness in the estimates means there are not $J$ unique quantiles.) The stratum-specific probability of sharing is estimated with a simple average of the outcomes for all the unexposed pairs in that stratum

[TABLE]

where $C(d)$ is the set of user–URL pairs in the NECG from domain $d$. The estimate for a particular domain $d$ for model $m$ is an average of the estimates for each stratum weighted by the number of exposed pairs within that stratum

[TABLE]

Propensity score stratification thus results in weighting outcomes for unexposed individuals in the NECG according to the number of exposed units with similar propensity scores. This process is illustrated with $J=100$ strata for a single domain in Fig. 1.
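The per-domain procedure described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name, the proportionality constant for $J$, the use of pooled scores to set quantile boundaries, and the handling of strata without unexposed pairs are all choices made here for the example.

```python
import math


def stratified_p0(scores, exposed, shared, strata_per_sqrt_exposed=1.0):
    """Estimate the no-exposure sharing probability for one domain.

    scores  : estimated propensity score per user-URL pair
    exposed : 1 if the pair was exposed, 0 if it is in the comparison group
    shared  : 1 if the user shared the URL, else 0
    """
    n_exposed = sum(exposed)
    # J proportional to sqrt(number of exposed pairs); the constant is an assumption.
    n_strata = max(1, int(strata_per_sqrt_exposed * math.sqrt(n_exposed)))

    # Assumption: stratum boundaries are quantiles of the pooled scores.
    ordered = sorted(scores)
    cuts = [ordered[(j * len(ordered)) // n_strata] for j in range(1, n_strata)]

    n1 = [0] * n_strata  # exposed pairs per stratum
    n0 = [0] * n_strata  # unexposed pairs per stratum
    y0 = [0] * n_strata  # shares among unexposed pairs per stratum
    for s, e, y in zip(scores, exposed, shared):
        j = sum(s >= c for c in cuts)  # index of the stratum containing s
        if e:
            n1[j] += 1
        else:
            n0[j] += 1
            y0[j] += y

    # Weight each stratum's unexposed mean by its number of exposed pairs.
    # Assumption: strata with no unexposed pairs are dropped from the average.
    usable = [j for j in range(n_strata) if n0[j] > 0]
    return sum(n1[j] * y0[j] / n0[j] for j in usable) / sum(n1[j] for j in usable)


# Hypothetical toy data: (score, exposed, shared) for ten user-URL pairs.
pairs = [
    (0.10, 0, 0), (0.10, 0, 0), (0.20, 0, 0), (0.20, 0, 1), (0.15, 1, 0),
    (0.60, 0, 1), (0.70, 0, 0), (0.65, 1, 1), (0.80, 1, 0), (0.90, 1, 1),
]
scores, exposed, shared = zip(*pairs)
estimate = stratified_p0(scores, exposed, shared)
naive = sum(y for _, e, y in pairs if not e) / sum(1 for _, e, _ in pairs if not e)
```

In this toy data the exposed pairs concentrate in the high-score stratum, so the stratified estimate gives more weight to unexposed pairs with high scores than the naive unexposed mean does.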

Estimates from multiple domains are combined in the same way by weighting the estimate for each domain by the number of exposed pairs for that domain

[TABLE]

This weighted average of domain-specific estimates is then used to estimate the other quantities of interest (e.g., $\delta$, $RR$) in combination with the estimate of $p^{(0)}$ common to both the experimental and observational analyses.
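As a small illustration of this aggregation step, the sketch below combines hypothetical per-domain estimates of the no-exposure sharing probability using exposed-pair counts as weights, then forms the risk difference and relative risk. The domain names, counts, and probabilities are invented for the example, and $RR$ is taken to be the ratio $p^{(1)}/p^{(0)}$.

```python
def combine_domains(p0_by_domain, n_exposed_by_domain):
    """Average per-domain estimates, weighting by each domain's exposed-pair count."""
    total = sum(n_exposed_by_domain[d] for d in p0_by_domain)
    weighted = sum(p0_by_domain[d] * n_exposed_by_domain[d] for d in p0_by_domain)
    return weighted / total


# Hypothetical per-domain estimates and exposed-pair counts.
p0_by_domain = {"a.example": 0.0002, "b.example": 0.0004}
n_exposed_by_domain = {"a.example": 3000, "b.example": 1000}

p0_hat = combine_domains(p0_by_domain, n_exposed_by_domain)
p1_hat = 0.001  # hypothetical share rate among exposed pairs

delta_hat = p1_hat - p0_hat  # risk difference (ATT)
rr_hat = p1_hat / p0_hat     # relative risk
```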

2.1.1 Efficiency

Even under overlap and conditional unconfoundedness assumptions, this estimator will only achieve the asymptotic semiparametric efficiency bound for the ATT under stronger assumptions than some alternative methods (Hahn, 1998). If $e(X_{iu})$ were estimated nonparametrically, then weighting by the inverse of the estimated propensity scores would be efficient (Hirano et al., 2003), and similar arguments apply to post-stratification on the estimated propensity scores with a growing number of strata (Imbens, 2004). The sparse, high-dimensional covariates available here motivate instead using a parametric $L_{2}$-regularized logistic regression, at the cost of this asymptotic property. Alternative methods could combine this regularized propensity score model with an outcome model to achieve efficient estimation (Robins and Rotnitzky, 1995; Hahn, 1998; Imbens, 2004; Belloni et al., 2014; Athey et al., 2016; Chernozhukov et al., 2016). In the present and similar settings, there are reasons to think the potential variance reductions from modeling the outcome are small. In particular, with a rare ($<0.2\%$) binary outcome, there is little information in the outcome for such methods to exploit. We selected the present methods in part because of their appealing computational properties for large data sets.

2.2 Sets of covariates

There are numerous variables available for the propensity score model. (For example, an analyst could construct the individual–term matrix counting all the words used by each individual in their Facebook communications; each of these thousands of variables could be used as a covariate.) It is not possible or desirable to include all the variables that an analyst could construct because of the work involved in defining variables, the costs of increasing dimensionality for precision, and computational challenges in using all of them in an analysis. Furthermore, many situations may require that investigators decide in advance what variables are worth measuring. In both of these cases, it is standard practice to use theory and other domain knowledge to select variables. In the case of peer effects in URL sharing, the analyst would select variables believed to be related to causes of sharing a URL and to be associated with network structure (i.e., peer and ego variables are associated because of homophily, common external causes, and prior influence). This is not to say that the analyst must think each variable is a likely cause of sharing behaviors, but simply that each is either a cause of sharing behaviors or a descendant of such a cause.

Table 1 lists the variables we computed for use as covariates. These variables are each included in at least one of the model specifications, which are designed to correspond to selections of variables that an analyst might make and to evaluate the contribution of different sets of variables to bias reduction. Model $\mathsf{A}$ includes all of the base variables. This model is expected to have the largest potential for bias reduction but also to suffer from increased sampling variance. In other settings, many of these variables might not be available to analysts. Model $\mathsf{D}$ includes demographic variables only. At least some of these variables, or similar measurements, would likely be available in many other settings. These are all expected to be associated with consuming content from particular sources. $\mathsf{D}$ can also be seen as a relatively minimal convenience selection of covariates.

We consider two additional sets of predictors that can be combined with these sets of covariates. First, we expected that, by virtue of serving as measures of a user’s latent interest in and likelihood of independently encountering a URL, variables describing prior interactions with related URLs could result in substantial bias reduction. In particular, for some user–URL pair $iu$, let same domain shares count the number of URLs that $i$ shared in the six months prior to the experiment that have the same domain name as $u$. Models that add this variable are indicated with $\mathsf{s}$; for example, Model $\mathsf{As}$ adds same domain shares to Model $\mathsf{A}$. This allows for straightforward evaluation of the consequences of using this variable for the observational analysis.

We regard same domain shares as an example of more specific information about related prior behaviors. In some cases, such information will be available to analysts. In other cases, this information may not be available, or the related behaviors may not be sufficiently common to be useful. In particular, if the focal behavior is new (e.g., a new product launch) or only recently popular, then this information may be limited. In the present case, very few users may have shared any URLs from a particular domain during the prior six months; that is, same domain shares can be 0 for most or all users for some domains.

For this reason, we also evaluate models that include the number of times a user shared URLs from each of the other 3,703 domain names; we indicate the presence of these predictors with $\mathsf{M}$, as this corresponds to the addition of a large sparse matrix of (log-transformed) counts. These models have important similarities with the use of low-rank matrix decomposition methods in, e.g., recommendation systems: the $L_{2}$ penalty results in shrinking larger principal components of the training data less, where many matrix decomposition methods would simply select a small number of components to use to represent the tastes of individuals (Hastie et al., 2008, §3.4.1).
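To make the two kinds of prior-behavior predictors concrete, the following sketch builds them for a single user–URL pair from a list of the domains the user shared previously. It is illustrative only: the function name and input format are hypothetical, the $\log(x+1)$ transform for the other-domain counts follows the transform noted in Table 1, and leaving same domain shares untransformed is an assumption made here.

```python
import math
from collections import Counter


def domain_features(user_share_domains, focal_domain):
    """Build sparse prior-sharing covariates for one user-URL pair.

    user_share_domains : domains of the URLs the user shared in the prior window
    focal_domain       : domain of the URL u in the pair
    Returns (same_domain_shares, {other_domain: log(count + 1)}).
    """
    counts = Counter(user_share_domains)
    same_domain_shares = counts.get(focal_domain, 0)
    # Only domains the user actually shared from are stored, so the row stays sparse.
    other = {d: math.log1p(c) for d, c in counts.items() if d != focal_domain}
    return same_domain_shares, other


# Hypothetical sharing history for one user.
history = ["www.cnn.com", "www.cnn.com", "example.org", "example.net", "example.org"]
same, other = domain_features(history, "www.cnn.com")
```

Storing only the non-zero entries mirrors the sparsity noted above: most users have shared from only a handful of the thousands of domains.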

2.3 Comparisons of estimators

To evaluate observational estimators of the relative risk $RR$ and risk difference (or ATT) $\delta$, we use the NECG, as described above, to produce estimates of $p^{(0)}$ that make no use of the control group from the randomized experiment. Recall that the experimental and observational estimates of $p^{(1)}$ are identical, as they are both the proportion of exposed user–URL pairs that resulted in sharing; thus, all discrepancies are due to differences in estimating $p^{(0)}$.

We compute the discrepancy between each of the resulting observational estimates and the experimental estimates. Our focus is primarily on estimates of the relative risk $RR$. We also consider the risk difference $\delta$ (i.e., the average treatment effect on the treated, ATT). For each observational estimator $m$, we have two estimators, $\widehat{RR}_{m}$ and $\hat{\delta}_{m}$. We generally take the experimental estimates as the gold standard, that is, as unbiased for the causal estimand of interest. This motivates the description of these discrepancies as estimates of bias.

For the relative risk, we can compute the absolute discrepancy in the estimates, $\widehat{RR}_{m}-\widehat{RR}_{\text{exp}}$. To put this in relative terms, we can compute the relative percent bias in the relative risk:

[TABLE]

For the risk difference, we can also compute absolute and percent bias, similarly to above. Since the risk difference $\delta=p^{(1)}-p^{(0)}$ is bounded from above by $p^{(1)}$ (i.e., when the behavior cannot occur without exposure), the maximum possible overestimate of $\delta$ is $p^{(0)}$. Thus, we can also characterize error as a percentage of this maximum possible overestimate:

[TABLE]

where we assume $\hat{\delta}_{m}\geq\hat{\delta}_{\text{exp}}$.
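One reading of the two percent-bias measures described above is sketched here. The function names and input values are hypothetical, and the denominators are inferred from the surrounding text: the experimental relative risk for the first measure, and the experimental estimate of $p^{(0)}$ (the maximum possible overestimate of $\delta$) for the second.

```python
def percent_bias_rr(rr_obs, rr_exp):
    """Discrepancy in relative risk as a percentage of the experimental estimate."""
    return 100.0 * (rr_obs - rr_exp) / rr_exp


def percent_of_max_overestimate(delta_obs, delta_exp, p0_exp):
    """Overestimate of the risk difference as a percentage of its maximum, p0_exp."""
    if delta_obs < delta_exp:
        raise ValueError("defined here only when delta_obs >= delta_exp")
    return 100.0 * (delta_obs - delta_exp) / p0_exp


# Hypothetical values: a shared p1, and two different estimates of p0.
p1 = 0.010
p0_exp = 0.002   # experimental estimate
p0_obs = 0.0015  # observational estimate (too low, so effects are overstated)

bias_rr = percent_bias_rr(p1 / p0_obs, p1 / p0_exp)
bias_max = percent_of_max_overestimate(p1 - p0_obs, p1 - p0_exp, p0_exp)
```

Because both analyses share the same $p^{(1)}$, the overestimate of $\delta$ equals the underestimate of $p^{(0)}$, so the second measure reduces to the percentage by which $p^{(0)}$ is underestimated (25% in this toy example).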

To account for dependence among observations of the same user or same URL, all confidence intervals reported in this paper are 95% standard bootstrap confidence intervals that are robust to dependence among repeated observations of both users and URLs (Owen and Eckles, 2012) and that account for the sampling error in both the experimental and observational estimates; see Supplementary Information Section 2.3.
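The general idea of a bootstrap that respects dependence on both users and URLs can be sketched as follows: each user and each URL receives an independent random weight, and each observation is weighted by the product of its user and URL weights. This is only a rough illustration, not the procedure in Supplementary Information Section 2.3. The double-or-nothing weights, the number of replicates, the normal-approximation interval, and all names and data are assumptions made here.

```python
import random
import statistics


def multiway_bootstrap_se(rows, estimator, n_reps=200, seed=0):
    """Bootstrap standard error with independent reweighting of users and URLs.

    rows      : list of (user, url, outcome) tuples
    estimator : function taking (rows, weights) and returning a float
    """
    rng = random.Random(seed)
    users = sorted({r[0] for r in rows})
    urls = sorted({r[1] for r in rows})
    reps = []
    for _ in range(n_reps):
        # Assumption: "double-or-nothing" weights (0 or 2); other choices are possible.
        wu = {u: rng.choice((0, 2)) for u in users}
        wv = {v: rng.choice((0, 2)) for v in urls}
        weights = [wu[u] * wv[v] for u, v, _ in rows]
        if sum(weights) == 0:
            continue  # degenerate replicate with no mass; skipped in this sketch
        reps.append(estimator(rows, weights))
    return statistics.stdev(reps)


def weighted_mean(rows, weights):
    """Weighted share rate over all user-URL pairs."""
    return sum(w * y for (_, _, y), w in zip(rows, weights)) / sum(weights)


# Hypothetical data: 6 users crossed with 4 URLs and a deterministic outcome.
rows = [(f"u{i}", f"v{j}", 1 if (i + j) % 3 == 0 else 0)
        for i in range(6) for j in range(4)]
estimate = weighted_mean(rows, [1] * len(rows))
se = multiway_bootstrap_se(rows, weighted_mean)
ci = (estimate - 1.96 * se, estimate + 1.96 * se)  # normal-approximation 95% interval
```

Reweighting whole users and whole URLs, rather than individual pairs, keeps all observations of the same user or URL moving together across replicates, which is what makes the resulting variability reflect both sources of dependence.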

3 Results

On average, a user exposed to a peer sharing a URL (i.e., a user–URL pair in the feed condition) goes on to share that URL 0.130% of the time, while a user who was not exposed to a URL because that user–URL pair was randomly assigned to the no feed condition goes on to share that URL 0.019% of the time. That is, exposure to a peer sharing a URL causes sharing for $\hat{\delta}_{\text{exp}}=0.111\%$ of pairs (CI = [0.109, 0.114]), and users are $\widehat{RR}_{\text{exp}}=6.8$ times as likely to share a URL in the feed condition compared to those in the no feed condition (CI = [6.5, 7.0]). These are the experimental estimates of peer effects to which we compare observational estimates.
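As a check on the arithmetic, the two experimental estimates follow directly from the reported (rounded) sharing rates; the hat notation for the estimated probabilities is assumed here.

```latex
\hat{\delta}_{\text{exp}} = \hat{p}^{(1)} - \hat{p}^{(0)} = 0.130\% - 0.019\% = 0.111\%,
\qquad
\widehat{RR}_{\text{exp}} = \frac{\hat{p}^{(1)}}{\hat{p}^{(0)}} = \frac{0.130\%}{0.019\%} \approx 6.8
```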

Bibliography


Allcott, H. and Gentzkow, M. (2017). Social media and fake news in the 2016 election. Journal of Economic Perspectives, 31(2):211–236.
Allen, J., Weinrich, M., Hoppitt, W., and Rendell, L. (2013). Network-based diffusion analysis reveals cultural transmission of lobtail feeding in humpback whales. Science, 340(6131):485–488.
Angrist, J. D. (2014). The perils of peer effects. Labour Economics, 30:98–108.
Aplin, L. M., Farine, D. R., Morand-Ferron, J., Cockburn, A., Thornton, A., and Sheldon, B. C. (2015). Experimentally induced innovations lead to persistent culture via conformity in wild birds. Nature, 518(7540):538–541.
Aral, S., Muchnik, L., and Sundararajan, A. (2009). Distinguishing influence-based contagion from homophily-driven diffusion in dynamic networks. Proceedings of the National Academy of Sciences, 106(51):21544–21549.
Aral, S. and Walker, D. (2011). Creating social contagion through viral product design: A randomized trial of peer influence in networks. Management Science, 57(9):1623–1639.
Asch, S. E. (1956). Studies of independence and conformity: I. A minority of one against a unanimous majority. Psychological Monographs: General and Applied, 70(9):1–70.
Athey, S., Imbens, G. W., Wager, S., et al. (2016). Efficient inference of average treatment effects in high dimensions via approximate residual balancing. https://arxiv.org/abs/1604.07125.