Reducing Seed Bias in Respondent-Driven Sampling by Estimating Block Transition Probabilities
Yilin Zhang, Karl Rohe, Sebastien Roch

TL;DR
This paper introduces a method to reduce seed bias in respondent-driven sampling by estimating block transition probabilities and using them in a post-stratified estimator, improving accuracy in population proportion estimates.
Contribution
It presents a novel approach to estimate block transition probabilities and applies them to create a seed-bias-reducing estimator with proven consistency and improved performance.
Findings
Estimated block transition probabilities are highly accurate.
The proposed post-stratified estimator reduces seed bias effectively.
Simulation results show lower RMSE compared to existing methods.
Yilin Zhang, Karl Rohe
Department of Statistics
University of Wisconsin-Madison
1300 University Ave
Madison, WI 53706
USA
Sebastien Roch
Department of Mathematics
University of Wisconsin-Madison
480 Lincoln Drive
Madison, WI 53706
USA
Abstract
Respondent-driven sampling (RDS) is a popular approach to study marginalized or hard-to-reach populations. It collects samples from a networked population by incentivizing participants to refer their friends into the study. One major challenge in analyzing RDS samples is seed bias. Seed bias refers to the fact that when the social network is divided into multiple communities (or blocks), the RDS sample might not provide a balanced representation of the different communities in the population, and this imbalance is correlated with the initial participant (or the seed). In this case, the distributions of estimators are typically non-trivial mixtures, which are determined (1) by the seed and (2) by how the referrals transition from one block to another. This paper shows that (1) block-transition probabilities are easy to estimate with high accuracy, and (2) we can use these estimated block-transition probabilities to estimate the stationary distribution over blocks and thus obtain an estimate of the block proportions. This stationary distribution on blocks has previously been used in the RDS literature to evaluate whether the sampling process has appeared to “mix”. We use these estimated block proportions in a simple post-stratified (PS) estimator that greatly diminishes seed bias. By aggregating over the blocks/strata in this way, we prove that the PS estimator is $\sqrt{n}$-consistent under a Markov model, even when other estimators are not. Simulations show that the PS estimator has smaller Root Mean Square Error (RMSE) compared to the state-of-the-art estimators.
Keywords: respondent-driven sampling, post-stratification, social network, Stochastic Blockmodel, Markov process.
t1: These authors gratefully acknowledge support from NSF grant DMS-1612456 and ARO grant W911NF-15-1-0423.
t2: This author gratefully acknowledges support from NSF grants DMS-1149312 (CAREER), DMS-1614242 and CCF-1740707 (TRIPODS), and a Simons Fellowship.
1 Introduction
Respondent-driven sampling (RDS) is one of the most popular network-based approaches to sample marginalized and hard-to-reach populations, such as drug users, sex workers, and the homeless [1]. RDS has been widely used, for instance, to quantify HIV prevalence in at-risk populations [2, 3]. According to a recent literature review [4], RDS has been used in over 460 studies from 69 countries.
RDS collects samples through peer referral on a social network. It starts from some initial participant as the seed, which forms wave zero. In the process, we incentivize each participant to pass some (usually three to five) referral coupons to their friends. Those who return to the study site with a referral coupon form the next wave of samples. We repeat this process until we get enough samples or the participants stop referring. Figure 1 from [5] gives an illustration of the RDS sampling process. There are three components in RDS sampling: (1) the social network, (2) the sampling tree, and (3) the variable of interest (denoted by color in Figure 1). The underlying social network is the target population to study, which is unobserved. For each sampled node, we observe their HIV status (black or grey in Figure 1) and which node referred them into the sample. We aim to estimate the proportion of people in the population with a certain trait, such as being HIV positive (nodes that are grey in Figure 1).
The link-tracing sampling procedure of RDS enables us to reach hard-to-reach populations. However, RDS samples are dependent. This dependence is particularly bad when there are multiple communities (i.e., blocks) in the target population and people form most of their friendships within their own communities. For example, people from the east side of the town might only know a few people from the west side of the town, and thus they are much more likely to refer people from the east side of the town. This is referred to as a “bottleneck” and it leads to a sample that is unbalanced between the different communities. If the HIV prevalence is higher on one side of the town, then this bottleneck creates dependence between observations in an RDS sample. If the initial participant is from the east side, then the sample may underrepresent people from the west side. This creates “seed bias.” In statistical models which presume that the seed node is randomized, this “seed bias” appears as additional variance in the final estimator. When some participants refer too many contacts, the variance of the traditional RDS estimator, the Volz-Heckathorn (VH) estimator [1], decays at a rate slower than $1/n$ [5]. We provide an example in Appendix B.3. To address this issue, recent work [6] has derived an idealized generalized least squares (GLS) estimator for which the standard error decays at rate $1/\sqrt{n}$ with growing sample size $n$ under a fixed social network. The practical implementation of the estimator, called the feasible GLS (fGLS) estimator, requires solving an $n \times n$ system of equations and comes with no theoretical guarantees.
This paper provides an estimator that is easy to compute and has root mean squared error that decays at rate $1/\sqrt{n}$ up to log factors, by implicitly adjusting for bottlenecks between different communities. While this estimator is new, its essential components are well known and reported in the RDS literature. This new estimator assumes that we have collected the “bottlenecked” community memberships of the sampled individuals. With these data, a key summary is the empirical transition matrix between communities, in which element $(k, l)$ is the proportion of referrals from participants in community $k$ that go to participants in community $l$. In the RDS literature, this matrix is a common way to summarize the sampling procedure and understand the underlying social network. For example, the original RDS paper [1] reports on a sample of drug users. Table 1c from that paper (reprinted as Figure 2 herein) gives the empirical transition matrix between communities defined by drug preference. This empirical transition matrix is also a key piece of the feasible GLS estimator [6].
Interestingly, an estimate of the proportion of nodes in each community can be derived from the empirical transition matrix. Notice in Figure 2 that [1] reports the equilibrium distribution on the different strata/communities. This takes the empirical transition matrix as a Markov transition matrix on the different communities and computes the stationary (i.e. equilibrium) distribution of this Markov process (i.e. the leading left eigenvector of the transition matrix). In Figure 2, the equilibrium distribution is close to the total distribution of recruits. When there is a bottleneck, this paper shows that the equilibrium distribution is a better estimator than the total distribution of recruits. The basic reason is that even when there is a bottleneck, each row of the empirical transition matrix is composed of nearly independent multinomial samples. There is one caveat; our estimator does not use the actual equilibrium distribution of the empirical transition matrix (i.e. the quantity reported in Figure 2). Instead, we have a simple approximation of the equilibrium which is easier to compute and thus simplifies the proof.
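The equilibrium computation described above can be illustrated with a minimal sketch. The referral counts below are hypothetical, and the function name `stationary_distribution` is ours; note also that this computes the exact equilibrium of the empirical transition matrix, not the simpler approximation that the PS estimator uses.

```python
import numpy as np

def stationary_distribution(P):
    """Leading left eigenvector of a row-stochastic matrix, scaled to sum to 1."""
    P = np.asarray(P, dtype=float)
    # Left eigenvectors of P are right eigenvectors of P transposed.
    eigvals, eigvecs = np.linalg.eig(P.T)
    lead = np.argmax(eigvals.real)  # eigenvalue 1 is the largest for a stochastic matrix
    v = np.abs(eigvecs[:, lead].real)
    return v / v.sum()

# Hypothetical referral counts (rows: recruiter's community, columns: recruit's community).
counts = np.array([[90.0, 10.0],
                   [20.0, 80.0]])
P_hat = counts / counts.sum(axis=1, keepdims=True)  # empirical transition matrix
pi_hat = stationary_distribution(P_hat)             # approximately (2/3, 1/3)
```

In this toy example the recruits split evenly between the two communities, while the equilibrium distribution puts about two thirds of its mass on the first one.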
The final estimator is a post-stratified estimator where the strata are the community memberships and the estimated proportion of nodes in each stratum is derived from the estimated equilibrium distribution. We call this the PS estimator. The PS estimator has three major advantages: (1) computational efficiency, (2) smaller variation (squared bias, variance, and RMSE), and (3) block-wise byproducts. We show in Theorem 4.1 that both the bias and the standard deviation of our PS estimator decay at rate $1/\sqrt{n}$ up to log factors, which does not hold for the popular Volz-Heckathorn (VH) estimator [1] and has not been shown for the GLS estimator [6]. The simulation studies also show that our PS estimator has smaller variation (squared bias, variance, and RMSE) than the VH and fGLS estimators. The improvement is especially pronounced when there is a bottleneck in the social network.
The paper is organized as follows. Section 2 defines the Markov model, the quantity to estimate, and the traditional RDS estimators. Section 3 introduces the PS estimator. Section 4 shows that the PS estimator is $\sqrt{n}$-consistent under the Degree Corrected Stochastic Blockmodel (DC-SBM). In Section 5, we show by simulations that the PS estimator has smaller variation than the state-of-the-art estimators, especially when there is a bottleneck in the social network. We conclude with a discussion in Section 6.
2 Preliminaries
We model referrals using a Markov process similar to the ones previously considered in the RDS literature [7, 1, 8, 9, 5, 6].
2.1 Markov process on a social network
A social network $G = (V, E)$ consists of a node set $V = \{1, \dots, N\}$ of individuals and an undirected edge set
$$E = \{(i, j) : i \text{ and } j \text{ are friends}\}.$$
We use $i \in G$ and $i \in V$ interchangeably. We assume that $G$ is connected. Let $w_{ij}$ be the weight of edge $(i, j) \in E$, which models recruitment preference (more details in Section 4). For any $(i, j) \notin E$, we let $w_{ij} = 0$ by convention. If the graph is unweighted, then $w_{ij} = 1$ for all $(i, j) \in E$. For each node $i \in G$, we denote its neighbors in the network by $\mathcal{N}(i) = \{j : (i, j) \in E\}$. We denote the degree of node $i$ as $\deg(i) = \sum_{j} w_{ij}$ and the mean degree of graph $G$ as $\bar{d} = \frac{1}{N} \sum_{i \in V} \deg(i)$.
We model the collection of samples in RDS with a Markov process on the social network indexed by a tree. It starts with an initial participant as seed, which we index as vertex 0, and develops into a rooted tree, $\mathbb{T}$ (a connected graph with $n$ nodes, no cycles, and a vertex $0$). We use $\tau \in \mathbb{T}$ to denote that node $\tau$ belongs to the samples indexed by $\mathbb{T}$. For each node $\tau \neq 0$, we denote the parent of $\tau$ as $p(\tau)$ (the node that refers $\tau$ to the sample). Formally, an RDS sample is an indexed collection of random nodes $\{X_\tau \in G : \tau \in \mathbb{T}\}$, where each referral has probability
$$\mathbb{P}(X_\tau = j \mid X_{p(\tau)} = i) = P_{ij},$$
where the transition matrix $P$ has elements
$$P_{ij} = \frac{w_{ij}}{\deg(i)}.$$
Since the graph is undirected and connected, $P$ is a reversible Markov transition matrix with unique stationary distribution $\pi$ with
$$\pi_i = \frac{\deg(i)}{\sum_{j \in V} \deg(j)}.$$
While the referrals are random, we think of $\mathbb{T}$ itself as deterministic.
Following [10], we refer to this Markov process as a $(\mathbb{T}, P)$-walk on $G$. Note that $G$ and $\mathbb{T}$ are two distinct graphs: the node set in $G$ indexes the population, which is a social network, and the node set in $\mathbb{T}$ indexes the samples, which is a sampling tree. We say that the $(\mathbb{T}, P)$-walk is stationary if the seed is chosen according to the stationary distribution $\pi$.
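To make the tree-indexed walk concrete, here is a minimal simulation sketch for an unweighted graph. It assumes every participant makes the same fixed number of referrals and that recruits are drawn with replacement; the function name `simulate_rds` and the parameter `referrals_per_person` are illustrative choices, not notation from this paper.

```python
import random

def simulate_rds(neighbors, n, referrals_per_person=3, seed_node=None, rng=None):
    """Simulate a tree-indexed random walk on an unweighted, connected graph.

    neighbors: dict mapping each node to the list of its neighbors.
    Returns (nodes, parent): nodes[t] is the sampled node for sample t, and
    parent[t] is the index of the sample that referred t (None for the seed).
    """
    rng = rng or random.Random(0)
    if seed_node is None:
        # Stationary seed: choose a node with probability proportional to its degree.
        pool = list(neighbors)
        seed_node = rng.choices(pool, weights=[len(neighbors[i]) for i in pool])[0]
    nodes, parent = [seed_node], [None]
    recruiter = 0  # index of the next sample whose turn it is to refer
    while len(nodes) < n and recruiter < len(nodes):
        for _ in range(referrals_per_person):
            if len(nodes) >= n:
                break
            # Unweighted graph: each referral is a uniform draw among the recruiter's neighbors.
            nodes.append(rng.choice(neighbors[nodes[recruiter]]))
            parent.append(recruiter)
        recruiter += 1
    return nodes, parent
```

The tree (who refers whom) is fixed by `referrals_per_person` and `n`, while the nodes filling it are random, which mirrors the convention that the tree is deterministic and only the referrals are random.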
2.2 Quantity to estimate and the Volz-Heckathorn estimator
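For each node $i \in G$, we denote the variable of interest (e.g., the indicator of HIV status) as $y(i)$. We wish to estimate the population mean of the variable of interest
$$\mu_{\text{true}} = \frac{1}{N} \sum_{i \in V} y(i).$$
For each sample $\tau \in \mathbb{T}$, we observe
$$Y_\tau = y(X_\tau).$$
The sample average
$$\hat{\mu} = \frac{1}{n} \sum_{\tau \in \mathbb{T}} Y_\tau$$
is generally biased, since nodes with larger degrees are more likely to be sampled in the Markov process. Specifically, under the stationary $(\mathbb{T}, P)$-walk on $G$, it has expectation
$$\mathbb{E}[\hat{\mu}] = \sum_{i \in V} \pi_i \, y(i).$$
In general, $\mathbb{E}[\hat{\mu}] \neq \mu_{\text{true}}$.
To obtain an unbiased estimator of $\mu_{\text{true}}$, the sample average must be adjusted. Using $\pi$, the inverse probability weighted estimator (IPW),
$$\hat{\mu}_{\mathrm{IPW}} = \frac{1}{n} \sum_{\tau \in \mathbb{T}} \frac{Y_\tau}{N \pi_{X_\tau}} = \frac{1}{n} \sum_{\tau \in \mathbb{T}} \frac{Y_\tau}{\deg(X_\tau)/\bar{d}},$$
is an unbiased estimator of $\mu_{\text{true}}$ [11]. Additionally estimating $\bar{d}$ with the harmonic mean of the observed node degrees,
$$\hat{H} = \left( \frac{1}{n} \sum_{\tau \in \mathbb{T}} \frac{1}{\deg(X_\tau)} \right)^{-1},$$
leads to the popular Volz-Heckathorn (VH) estimator [9],
$$\hat{\mu}_{\mathrm{VH}} = \frac{1}{n} \sum_{\tau \in \mathbb{T}} \frac{Y_\tau}{\deg(X_\tau)/\hat{H}}.$$
The VH estimator has been extensively used in the study of marginalized populations [2, 3, 4], but it is highly variable. The variance of the VH estimator in general may decay at a rate slower than $1/n$ [5], implying that many more samples are required to reduce the standard error. See Section B.3. We address this issue by introducing a post-stratification approach to RDS in the following section.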
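As a small numerical illustration of the degree adjustment, the sketch below computes the plain sample average and the VH estimate, written in its equivalent form as a mean weighted by inverse degree. The data are made up for illustration and the function names are ours.

```python
def sample_mean(y):
    """Unadjusted sample average of the observed outcomes."""
    return sum(y) / len(y)

def vh_estimate(y, deg):
    """Volz-Heckathorn estimate: mean of y weighted by 1/degree.

    Equivalent to (1/n) * sum(y_t / (deg_t / H)) with H the harmonic mean degree.
    """
    inv = [1.0 / d for d in deg]
    return sum(yi * w for yi, w in zip(y, inv)) / sum(inv)

# Hypothetical sample: the two high-degree participants have the trait.
y = [1, 1, 0, 0]
deg = [4, 4, 1, 1]
naive = sample_mean(y)        # 0.5
adjusted = vh_estimate(y, deg)  # 0.5 / 2.5 = 0.2
```

High-degree nodes are over-sampled by the walk, so down-weighting them lowers the estimate when the trait is concentrated among them.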
3 A new estimator
3.1 A post-stratification approach to RDS
Stratification
Stratification has been extensively used in traditional random sampling to reduce variance. The key idea of stratified sampling is as follows. Assume that the overall population can be divided into (ideally homogeneous) sub-groups (which we refer to as blocks) based on some variable, such as gender, race, etc. Then the sample mean and sample variance of the total population can be calculated using block-wise sample means and variances.
Specifically, suppose there are $K$ blocks in a population with $N$ individuals. For each block $k$, we denote the block size as $N_k$, the block-wise population mean as $\mu_k$, the sample size as $n_k$ and the block-wise sample average as $\hat{\mu}_k$. The sample average and sample variance for the total population can be derived from the block-wise quantities by
[TABLE]
Stratified sampling by proportionate allocation randomly selects individuals proportionally to the sizes of the different blocks, with the goal of improving accuracy by reducing sampling error. Post-stratified sampling, on the other hand, performs stratification after sampling and calculates the overall sample average and sample variance as above. Post-stratification is useful when the samples constitute an unbalanced representation of the full population.
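A minimal sketch of classical post-stratification for the mean, assuming the block proportions are known, may help fix ideas. The data, the block labels, and the function name are hypothetical.

```python
def post_stratified_mean(y, block, proportions):
    """Post-stratified mean with known block proportions.

    proportions: dict mapping block label -> population share of that block.
    Every block in `proportions` must appear at least once in the sample.
    """
    total = 0.0
    for k, share in proportions.items():
        yk = [yi for yi, b in zip(y, block) if b == k]
        total += share * sum(yk) / len(yk)  # reweight the block-wise sample average
    return total

# Hypothetical unbalanced sample: block "a" is over-represented (4 of 6 samples).
y = [1, 1, 1, 0, 0, 1]
block = ["a", "a", "a", "a", "b", "b"]
estimate = post_stratified_mean(y, block, {"a": 0.5, "b": 0.5})  # 0.5*0.75 + 0.5*0.5
```

Here the plain sample average is 4/6, while reweighting the two block averages by their population shares gives 0.625.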
Block proportions are unobserved in marginalized populations
We seek to apply this last approach to RDS in order to deal with seed bias. An important issue arises, however. Per (3.1), traditional post-stratification requires knowledge of the block proportions $N_k/N$. These are typically unknown in marginalized populations. Hence, we need to estimate the block proportions from the samples. In the next section, we describe how we do this and we formally define a novel post-stratified estimator for RDS.
3.2 Block-wise quantities
For a set , denote its cardinality by . Suppose there are blocks in the social network . For each node , denote its block membership as , i.e., if belongs to block . To simplify notation, we write to mean . For each block , we denote the block size as and the block-wise mean as .
For each sample , we let its block membership be and we write to mean . We define for each block the sample size as , the block-wise harmonic average degree as
[TABLE]
and the block-wise sample average weighted by degree, i.e., the VH estimator for , as
[TABLE]
Suppose that we observe the block membership of each sample, i.e., we observe for all . We define the matrix such that, for any two blocks ,
[TABLE]
and the row-normalized matrix whose -entry is
[TABLE]
Here, for a matrix , we let and is the indicator of event . Finally we define the vector with entries
[TABLE]
3.3 The post-stratified estimator
We define our new estimator next.
Definition 3.1 (The post-stratified estimator).
For an RDS sample on a graph with blocks, the post-stratified (PS) estimator is
[TABLE]
with
[TABLE]
where , , and are defined in (3.2), (3.3) and (3.4) respectively.
Comparing (3.6) with (3.1), the estimator can indeed be seen as a post-stratified estimator. In Section 3.4, we argue that is an estimator of the block proportion of block . Note that we also use the VH estimator on each block , instead of the block-wise sample average, to adjust for the bias induced by node degrees.
3.4 Motivation for the PS estimator
To motivate our new estimator, we analyze its behavior under a standard model of random social network with community structure, the degree-corrected stochastic blockmodel (DC-SBM) [12].
Definition 3.2 (Degree-corrected stochastic blockmodel).
Let be a positive, symmetric matrix and let be a positive vector. Under the DC-SBM, a social network with is drawn randomly as follows. Assume that we have a partition of into blocks labeled . Let be the respective sizes of the blocks. For a node , let be its block. Each possible edge is present independently from all other edges with probability
[TABLE]
By convention, we assume for all block .
Remark 3.3 (Self-loops).
To simplify the notation throughout, we allow self-loops in the DC-SBM, each of which will contribute $1$ to degree counts (instead of the standard convention of $2$). Note that, in a dense graph, such self-loops will play a negligible role.
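The following sketch draws a graph from a DC-SBM, assuming the usual form of the model in which nodes $i$ and $j$ are connected independently with probability $\theta_i \theta_j B_{z(i) z(j)}$. The argument names `B`, `theta`, and `z` are our labels for the block matrix, the degree parameters, and the block memberships; self-loops are allowed and counted once, as in the remark above.

```python
import numpy as np

def sample_dcsbm(B, theta, z, rng=None):
    """Draw a symmetric 0/1 adjacency matrix from a DC-SBM (self-loops allowed).

    B: K x K symmetric matrix, theta: length-N positive vector,
    z: length-N integer block labels in {0, ..., K-1}.
    """
    rng = rng or np.random.default_rng(0)
    B = np.asarray(B, dtype=float)
    theta = np.asarray(theta, dtype=float)
    z = np.asarray(z)
    # Edge probabilities theta_i * theta_j * B[z_i, z_j], clipped to [0, 1].
    prob = np.clip(np.outer(theta, theta) * B[np.ix_(z, z)], 0.0, 1.0)
    # One independent draw per unordered pair i <= j, then symmetrize.
    upper = np.triu(rng.random(prob.shape) < prob)
    return (upper | upper.T).astype(int)

# Two blocks of three nodes with a bottleneck: dense within, sparse across.
A = sample_dcsbm(B=[[0.9, 0.05], [0.05, 0.9]], theta=np.ones(6), z=[0, 0, 0, 1, 1, 1])
degrees = A.sum(axis=1)  # a self-loop on the diagonal adds 1 to the degree
```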
To justify our PS estimator under the DC-SBM, we make three observations:
1. Define the matrices , where , and , where
[TABLE]
for any two blocks . Since is a positive, row-normalized version of the symmetric matrix , it has a unique stationary distribution , where
[TABLE]
Indeed
[TABLE]
2. The expected degree of node in block is
[TABLE]
Hence the block-wise mean expected degree over block is
[TABLE]
3. Combining the two observations above, we get
[TABLE]
Because the denominator is constant, we have finally
[TABLE]
Therefore, by (3.1), the population mean can be re-written as
[TABLE]
From this it follows that, to estimate , it suffices to estimate the block-wise mean , the block-wise expected mean degree , and the stationary distribution of , for each block . We estimate them with , , and , respectively—leading to the PS estimator in (3.6). In the proof of Theorem 4.1 below, we analyze the accuracy of these estimators (see Claims B.6, B.9 and B.7).
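The recipe above can be sketched in code as follows. This is only an illustration under stated assumptions, not the paper's definition (3.6): it reads the argument above as saying that the proportion of block $k$ is proportional to the stationary probability of block $k$ divided by its block-wise mean degree, it estimates the mean degree by the block-wise harmonic mean of observed degrees, and it uses the exact stationary distribution of the row-normalized referral counts rather than the simpler approximation mentioned in the introduction. All function and argument names are ours.

```python
import numpy as np

def ps_estimate(y, deg, block, parent, K):
    """Sketch of a post-stratified RDS estimate.

    y, deg, block: per-sample outcome, degree, and block label in {0, ..., K-1}.
    parent: index of each sample's recruiter (None for the seed).
    Assumes every block appears in the sample and makes at least one referral.
    """
    y = np.asarray(y, dtype=float)
    deg = np.asarray(deg, dtype=float)
    block = np.asarray(block)
    # Count referrals between blocks, then row-normalize into a transition matrix.
    Q = np.zeros((K, K))
    for t, p in enumerate(parent):
        if p is not None:
            Q[block[p], block[t]] += 1
    P = Q / Q.sum(axis=1, keepdims=True)
    # Stationary distribution of the block chain (leading left eigenvector).
    vals, vecs = np.linalg.eig(P.T)
    pi = np.abs(vecs[:, np.argmax(vals.real)].real)
    pi /= pi.sum()
    mu_k = np.empty(K)
    H_k = np.empty(K)
    for k in range(K):
        w = 1.0 / deg[block == k]
        mu_k[k] = np.sum(w * y[block == k]) / np.sum(w)  # block-wise VH estimate
        H_k[k] = len(w) / np.sum(w)                      # block-wise harmonic mean degree
    lam = pi / H_k
    lam /= lam.sum()  # estimated block proportions
    return float(np.dot(lam, mu_k))
```

For example, with six samples of equal degree, block labels `[0, 0, 1, 1, 0, 1]`, recruiters `[None, 0, 0, 1, 2, 2]`, and an outcome equal to 1 exactly in block 0, the referral counts are `[[1, 2], [1, 1]]`, the stationary distribution is (3/7, 4/7), and the estimate is 3/7 rather than the sample average 1/2.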
4 Main theoretical result
In this section, we show that the PS estimator defined in (3.6) has error of order $1/\sqrt{n}$, up to log factors, with high probability when the social network is distributed under a dense DC-SBM.
Theorem 4.1 (Main result).
Suppose the social network of size is distributed according to the DC-SBM with blocks of respective sizes and parameters and . Suppose is a sampling tree of size . Let be the variable of interest. Assume that there are universal constants and independent of and such that the following assumptions hold:
(a) [Linear-sized blocks] for all k;
(b) [Dense graph] for all blocks ;
(c) [Degree homogeneity] for all nodes ;
(d) [Bounded variables] for all nodes ;
(e) [Limited referrals] The maximum degree of is less than or equal to .
Then, for any , there exists a constant (not depending on ) such that, with probability over the choice of , the following holds. For any -walk on the PS estimator defined in (3.6) satisfies
[TABLE]
with probability .
A direct consequence of Theorem 4.1 is that the bias and standard deviation decay at rate $1/\sqrt{n}$ up to log factors. This does not hold for the traditional VH estimator [1], since its standard deviation decays at a rate slower than $1/\sqrt{n}$ [5], which we also show by example in Appendix B.3. For the recent GLS-based estimators proposed in [6], it is shown that their standard deviation decays at rate $1/\sqrt{n}$ as $n$ goes to infinity for a fixed network size, but no finite-size guarantees are provided.
Assumptions (b) and (c) require the graph to be dense. In the following section, we show through simulations that the PS estimator also works well on sparse graphs.
5 Simulations
This section compares the PS estimator to the VH and fGLS estimators on simulated networks (in Section 5.1) as well as social networks collected by the National Longitudinal Study of Adolescent Health (Add Health Networks) (in Section 5.2), both with simulated RDS samples. In both cases, the PS estimator has smaller variation than the VH and fGLS estimators.
5.1 Simulated Networks
We simulated 100 random social networks by DC-SBM with nodes, expected average degree , and blocks with the same sizes. The stochastic matrix was chosen proportional to
[TABLE]
We simulated the binary outcomes to be perfectly aligned with one of the block labels.
On each social network, we generated RDS samples by link tracing with replacement. We randomly sampled the seed proportionally to the node degree. Then, for each participant in the sample, we recruited number of friends, where . The recruiting process stopped when there were 1000 participants in the RDS sample. If it terminated before recruiting 1000 participants, then we re-started the recruiting process. We generated 200 different RDS samples on each network. For each RDS sample, we computed the VH, fGLS, and PS estimators. On each network, we computed the absolute bias, standard deviation, and RMSE of the 200 estimates of each type. In the simulations, we computed the fGLS estimator as in [6], which re-weights the outcome to adjust for the sampling bias.
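The three summaries reported for each network can be computed as in the sketch below. It assumes the standard deviation is taken with divisor equal to the number of replicates, so that the squared RMSE equals the squared bias plus the variance; the function name and the example values are hypothetical.

```python
import math

def error_summary(estimates, truth):
    """Absolute bias, standard deviation, and RMSE of repeated estimates."""
    m = len(estimates)
    mean = sum(estimates) / m
    abs_bias = abs(mean - truth)
    sd = math.sqrt(sum((e - mean) ** 2 for e in estimates) / m)
    rmse = math.sqrt(sum((e - truth) ** 2 for e in estimates) / m)
    return abs_bias, sd, rmse

# Two hypothetical replicate estimates of a true proportion of 0.3.
abs_bias, sd, rmse = error_summary([0.4, 0.6], truth=0.3)
```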
Figure 3 shows that the PS estimator has smaller variation than the VH and fGLS estimators in terms of absolute bias, standard deviation, and RMSE.
Note that several factors may affect the performance of the estimators, such as (1) bottlenecks in the social network, (2) the alignment of the block labels with the variable of interest, and (3) the network density. We explored these factors and how they affected the performance of the estimators. Further explorations of other factors, including network sizes and sample sizes, are in Section A of the appendix. The simulations in Figures 4, 5 and 6 have the same setting as in Figure 3, except that the value of the corresponding factor is varied.
Bottleneck
Bottlenecks exist when there are far fewer connections across different blocks than within blocks. Recall that, in the DC-SBM, the stochastic block matrix gives the average number of links between any two blocks. We simulated the stochastic block matrix such that
[TABLE]
with for identification. We refer to the difference as the bottleneck strength. With a larger bottleneck strength, there are more connections within blocks and fewer connections across blocks. When there is no bottleneck (strength is zero), there is only one block in the network. Figure 4 shows that the PS estimator has smaller variation than the fGLS and VH estimators, especially when there exists a bottleneck. In particular, the PS estimator appears to reduce the seed bias and standard deviation caused by bottlenecks much better than the fGLS and VH estimators.
Alignment
We capture the alignment of the block labels and the variable of interest by the difference of the block-wise means of the variable of interest, i.e., $|\mu_1 - \mu_2|$ with $K = 2$ blocks. Figure 5 shows that the fGLS and PS estimators exhibit the largest improvement over the VH estimator when the block label perfectly aligns with the variable of interest (i.e., the alignment is $1$). The three estimators perform equally well when the block-wise means of the variable of interest are equal (i.e., the alignment is $0$). When the block label partially aligns with the variable of interest (i.e., the alignment is strictly between $0$ and $1$), the fGLS and VH estimators exhibit similar variation, but the PS estimator has smaller variation when the block-wise difference is over 0.4.
Network density
We use the expected average degree of the network to quantify the network density. Though Theorem 4.1 requires the networks to be dense enough, Figure 6 shows that the estimators perform similarly on sparse networks.
5.2 Add Health Networks
In this section, we consider RDS simulations obtained by tracing contacts in social networks collected in the National Longitudinal Study of Adolescent Health (Add Health Networks). This study collected a nationally representative sample of adolescents from grades 7 to 12 in the United States in the 1994-1995 school year. The sample covers 84 pairs of middle and high schools in which students nominated up to five male and five female friends in their middle or high school network [13]. In this analysis, we symmetrized all contacts to create a social network, and we restricted each network to its largest connected component. These networks were previously studied in [14], [15], and [6].
We restricted our analysis to the 25 Add Health networks with over 1000 nodes. On each network, we simulated 200 different RDS samples, each with 500 participants. On each RDS sample, we computed the VH, fGLS, and PS estimators. In the simulation, we randomly sampled seed nodes proportional to node degrees. We computed the absolute bias, RMSE, and standard deviation of the estimators on each network. In the analysis we used the school label (middle school or high school) as the outcome and the grade label (7-12) as the block labels.
The recruitment process was similar to that in Section 5.1, but without replacement: each person could be recruited no more than once. If a participant had fewer unrecruited friends than the number of referrals assigned to them, then we recruited all of their unrecruited friends.
Figure 7 shows the variation of the estimators. Overall, the PS estimator has substantially smaller variation than the fGLS and VH estimators.
6 Discussion
RDS has been widely used in studying marginalized populations, but the estimators derived from RDS samples have suffered from high variance. This is due to two related issues: (1) the complicated network dependence of the RDS samples, and (2) seed bias caused by bottlenecks. In this paper, we introduced post-stratification to RDS and provided a novel estimator. Our easy-to-compute PS estimator reduces seed bias. We derived theoretical results for the PS estimator, showing that its bias and standard deviation decay at rate $1/\sqrt{n}$ (up to log factors) under the degree-corrected stochastic blockmodel. This is the first estimator with such guarantees. Though we require the networks to be dense in theory, we showed through simulations that the estimator performs similarly on sparse networks.
One future direction is how to select the block labels in practice. In [6], an approach for selecting block labels using eigenvalues of the block-wise transition matrix is proposed. Further discussions on this issue would be helpful to apply the PS (and fGLS) estimators.
Appendix A More Simulations
In this section, we explore how network sizes and sample sizes affect the performances of RDS estimators. The simulation settings are the same as in Section 5.1. Figure 8 shows the estimators perform similarly with different sizes of networks. Figure 9 shows the RDS estimators have smaller variation with larger sample sizes.
Appendix B Proof of the main theorem
B.1 Notation
For each node , we denote its neighborhood in the social network as and its neighborhood within block as We denote by the size of the latter. The degree of node is denoted and we have .
While the RDS sampling procedure is a random walk on the social network , under a dense DC-SBM our analysis relies on establishing an approximation of the process by a “population-level” random walk on blocks. We define the block transition probability at node by
[TABLE]
for any blocks and any sample . Recall that, for any sample , we denote its parent as .
Under our assumptions, is the expected number of edges between blocks ; indeed
[TABLE]
Recalling the matrix , where is the matrix in the definition of the DC-SBM model (3.7) and , the population block transition probability is given by
[TABLE]
for any two blocks . We refer to
[TABLE]
as the population transition matrix on blocks. Recall that its unique stationary distribution is .
For each block , . We also define the number of referrals from block to be
[TABLE]
For any two blocks , we define the number of referrals between block and block as
[TABLE]
Note that and The elements of the estimated block transition matrix in (3.4) can be rewritten as . We use to denote these quantities, i.e.,
[TABLE]
To summarize, for any blocks , the quantities , and represent respectively the block transition probability at node , the population block transition probability, and the estimated block transition probability.
B.2 Proof
The proof of Theorem 4.1 follows from a series of claims. We begin with a sketch of the proof in this section.
1. Under the dense DC-SBM, the random walk mixes fast within each block (Claims B.2 and B.5). This plays a key role in estimating block-wise means, for which we use the VH estimator (Claims B.7, B.8, and B.10).
2. To estimate block proportions, we use the stationary distribution of the block-wise transition matrix, which is the main, non-trivial contribution of this work. Indeed, the standard empirical frequency gives an estimate with much larger variance (see Section B.3). Instead we estimate the transition matrix between blocks, which is a “more local” quantity in the sense that it is not affected strongly by the seed, and compute its stationary distribution. As a result, block-wise transition probabilities are highly concentrated around their true value under the Markov chain on the blocks; their stationary distributions are also close to each other (Claim B.3, Claim B.6).
Note that there are two sources of randomness, the social network and the -indexed random walk. Claims B.1-B.5 are concerned with the randomness of , while Claims B.6-B.10 deal with the random walk.
[Figure: proof roadmap. Claim B.10 relates $\hat{\mu}_{k\mathrm{VH}}$ to $\mu_{k}$; Claim B.6 relates $\hat{\pi}_{k}^{B}$ to $\pi_{k}^{B}$; Claim B.9 relates $H^{(\delta_{k})}$ to $\delta_{k}^{(B)}=N_{k}^{-1}B_{k\ast}$; combining the three relates $\hat{\mu}_{\mathrm{mVH}}$ to $\mu_{\text{true}}$.]
Throughout, is as in the statement of the theorem.
High-probability properties of the social network
We first use standard concentration inequalities to control the degrees of . Recall that under the DC-SBM the expectation of is .
Claim B.1 (Degrees are concentrated).
Under the DC-SBM, there exists (depending on but not on ) such that, with probability over the choice of , the following event holds: simultaneously for all pairs of blocks and all nodes ,
[TABLE]
We let be the event in the claim.
Proof of Claim B.1.
Fix blocks and . Under the DC-SBM, each node in block connects with node independently with probability . Hence we can write as a sum of independent indicators, whose overall expectation is , where we used that . By Hoeffding’s inequality [16], for any constant , by choosing large enough
[TABLE]
where we used that in the second inequality. Taking a union bound over , and gives
[TABLE]
simultaneously for all and all with probability at least . Dividing by and using for any node and for any blocks , gives the result for appropriately chosen . ∎
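For reference, the standard form of Hoeffding's inequality for a sum of independent bounded variables, which is the type of bound invoked in the proof above, can be written as follows. The symbols $S$, $m$, $t$, and $c$ are generic and are not taken from the paper's notation.

```latex
% S = \sum_{r=1}^{m} Z_r, with Z_1, ..., Z_m independent and taking values in [0, 1].
\[
  \mathbb{P}\bigl( |S - \mathbb{E}S| \ge t \bigr)
  \;\le\; 2 \exp\!\left( -\frac{2 t^{2}}{m} \right),
  \qquad t > 0 .
\]
% Choosing t = \sqrt{(c/2)\, m \log N} makes the right-hand side equal to 2 N^{-c}.
```

With a deviation of order $\sqrt{m \log N}$, the failure probability is polynomially small in $N$, which is what allows a union bound over polynomially many nodes and blocks.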
The following claim will be useful to control the mixing rate within a block. For any blocks and two distinct nodes , , we consider the number of two-edge paths from to in whose middle vertex is in block , weighted by a quantity related to the expected degree of the middle vertex under the DC-SBM:
[TABLE]
Claim B.2 (Two-edge paths).
There exists such that, with probability over the choice of , the following holds: simultaneously for all blocks , and all , with ,
[TABLE]
We let be the event in the claim.
Proof of Claim B.2.
Fix blocks , and nodes . By Claim B.1, we can choose large enough such that
[TABLE]
for some .
We treat the case where all blocks are distinct. The other cases are similar. Let be the event that and note that . Conditioned on , each of the edges incident to and block has a corresponding endpoint which itself connects to —independently of all other such endpoints—with probability . Since weighs this last edge by , its expected contribution is . Moreover, the possibly non-zero terms in the sum defining are uniformly bounded by a constant by the assumption that . Hence, we can apply Hoeffding’s inequality again, and by choosing large enough we have
[TABLE]
for some , where we used (B.7) in the second inequality and we used that and in the last inequality.
Combining (B.7) and (B.8), and taking a union bound over , , , and gives
[TABLE]
for a constant chosen large enough. Dividing by and using again that and gives the result, for an appropriately chosen constant . ∎
Properties of the walk
Before proving our main theorem, we will also need some results about the behavior of simple random walk on the network. We first show that, from any , the probability of jumping to a vertex in block is close to the population-level probability .
Claim B.3 (Transitions between blocks).
There exists such that, conditioned on , for any blocks and any
[TABLE]
Proof.
Fix and . Recall
[TABLE]
Under ,
[TABLE]
for a constant large enough. A similar inequality holds in the opposite direction. ∎
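To make the statement of Claim B.3 concrete, here is a small simulation sketch. It assumes a plain stochastic block model (all degree-correction parameters equal to 1), and it takes the population-level transition probability from block u to block v to be the expected number of edges between the two blocks, normalized over v. The block sizes, the matrix `B`, and the helper names are hypothetical.

```python
import numpy as np

def simulate_sbm(block_sizes, B, rng):
    """Undirected SBM without self-loops; returns adjacency matrix and block labels."""
    labels = np.repeat(np.arange(len(block_sizes)), block_sizes)
    probs = B[labels][:, labels]
    upper = np.triu(rng.random(probs.shape) < probs, k=1)
    adj = (upper | upper.T).astype(int)
    return adj, labels

def node_block_transitions(adj, labels, num_blocks):
    """For each node, the fraction of its neighbours that lie in each block."""
    onehot = np.eye(num_blocks, dtype=int)[labels]
    deg_to_block = adj @ onehot
    return deg_to_block / deg_to_block.sum(axis=1, keepdims=True)

rng = np.random.default_rng(0)
sizes = np.array([400, 400])                 # hypothetical block sizes
B = np.array([[0.3, 0.1], [0.1, 0.3]])       # hypothetical edge probabilities
adj, labels = simulate_sbm(sizes, B, rng)

# Assumed population-level block transition matrix: normalized expected edge counts.
pop_P = (B * sizes) / (B * sizes).sum(axis=1, keepdims=True)

node_P = node_block_transitions(adj, labels, 2)
max_dev = np.abs(node_P - pop_P[labels]).max()
print(pop_P)
print(max_dev)
```

In this setup every node, not just the average node, has one-step block transition probabilities near the population-level row for its block, which is the uniform-over-nodes flavor of the claim.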
The previous claim also implies that any step has a probability bounded away from 0 of landing in any block.
Claim B.4 (Landing in a block).
There is such that, conditioned on , for any blocks and any , we have
[TABLE]
provided is larger than a sufficiently large constant.
Proof.
Let
[TABLE]
The result then follows from Claim B.3. ∎
We next show that two steps of the walk are enough to mix within a block.
Claim B.5 (Two steps suffice for within-block mixing).
For each sample , we denote its grandchildren as . For a -walk on , there exists such that, on and , for all and
[TABLE]
for all blocks , and nodes , with .
Proof.
To simplify notation, the conditioning on is implicit throughout the proof. Assume and hold. Fix blocks as well as nodes and with . Let and let be the ancestor of on , which is necessarily a child of . Then, for some constants , using and
[TABLE]
where recall that . A similar inequality holds in the other direction. That implies the claim. ∎
Concentration of key estimates
The PS estimator defined in (3.1) relies on three key estimates, whose concentration we establish now.
We begin with the concentration of by showing that our estimates of block transition probabilities are concentrated, which boils down to proving that the ’s are concentrated. Recall that Claim B.3 implies that the block transition probabilities are concentrated at each , i.e., the ’s are concentrated. Proving that the estimate itself is concentrated requires an argument. Indeed, as shown in Section B.3 below, both the numerator and denominator of this estimator in general may have variance asymptotically much greater than . Instead, we use the Markovian structure of the model to control the deviation of .
Claim B.6 (Concentration of block-wise steady-state probability estimates).
Conditioned on and , there exists such that, for any block , with probability at least ,
[TABLE]
Recall that was defined in (B.3).
Proof.
Throughout this proof, we implicitly condition on and assume that (from Claim B.1) holds. We let be a topological ordering of the vertices of , i.e., an ordering such that: if is an ancestor of , then . For a fixed , we let be the corresponding filtration, i.e.,
[TABLE]
Recall that is the parent of . The proof relies on three sub-claims:
1. Deviation of : For and , let
[TABLE]
where recall that is the block of . Note that
[TABLE]
We consider the process
[TABLE]
with . We claim that is a martingale with bounded increments. Indeed, by the ordering of the samples, since for some . Hence for all . So . Moreover, following a standard calculation,
[TABLE]
Finally, observe that by definition
[TABLE]
By the Azuma-Hoeffding inequality (see e.g. [17]), for a constant large enough
[TABLE]
2. Deviation of : Next, we bound
[TABLE]
where we use the Markov property of the walk indexed by . By Claim B.3, for all ,
[TABLE]
Combining (B.10) and (B.11), we get
[TABLE]
where we used that and that is non-decreasing for .
3. Lower bound on : Let be the number of internal vertices in . Because each leaf has a parent that is an internal vertex and has maximum degree for some constant , it follows that . Moreover, by Claim B.4, the state of each internal vertex of (except the root) has probability at least of coming from block , independently of all other ’s. As a result, stochastically dominates a binomial random variable with trials and probability of success . By Hoeffding’s inequality we therefore have for a constant large enough that
[TABLE]
Together with , that implies that for some constant
[TABLE]
Combining (B.9), (B.12), and (B.13), with probability at least for any block , there exists some constant , such that
[TABLE]
Recall that the stationary distribution of is
[TABLE]
for any and that
[TABLE]
Then, there exists some constant , such that
[TABLE]
Indeed,
[TABLE]
for large enough , and similarly in the other direction. The second line is from (B.15) and (B.16), while the fourth line is from (B.14). ∎
We then evaluate the deviation of
[TABLE]
Recall that, for any block , the population block-wise average is
[TABLE]
Before showing that our block-wise estimator is close to , we first look at a related quantity, below, which serves as a “bridge.” We define the weighted block-wise average as
[TABLE]
Using an argument similar to that in Claim B.6, we show in Claim B.7 that is concentrated for each block . We then show in Claim B.10 that is close to . As a result, we will have established that is close to .
[Diagram: $\hat{\mu}_{k\mathrm{VH}}$ is close to $\hat{\mu}_{k,\mathrm{w}}$ (Claim B.10), and $\hat{\mu}_{k,\mathrm{w}}$ is close to $\mu_{k}$ (Claim B.7).]
Claim B.7 (Concentration of block-wise sample averages weighted by degrees).
Conditioned on , and , there exists such that, with probability at least , for any block
[TABLE]
Proof.
Because the structure of the proof is similar to that of Claim B.6, we only sketch it here. We also make use of Claim B.5, which shows that simple random walk on mixes well within blocks in two steps. Because of the latter, we control separately the odd and even levels of . Let be the vertices of whose graph distance to the root is even, including the root , in a topological ordering. Let be the grand-children of in . Let and for
[TABLE]
For each node , define
[TABLE]
Fix block and let
[TABLE]
and
[TABLE]
where note that the last sum excludes the root. Following the proof of Claim B.6, we note that the partial sums
[TABLE]
form a martingale indexed by with increments satisfying
[TABLE]
where we used that has maximum degree and by assumption. Hence, arguing as in Step 1 of Claim B.6, we get that with probability at least for all
[TABLE]
Moreover, let and notice that by construction . Hence, by Claim B.5,
[TABLE]
where recall that we condition on . Similarly in the opposite direction. So, arguing as in Step 2 of Claim B.6, for some large enough ,
[TABLE]
where we used .
In addition, we argue as in Step 3 of Claim B.6. Because each node with odd distance to the root has a parent with even distance to the root, and has maximum degree , it follows that . Moreover, by Claim B.4, the state of each internal vertex of (except the root) has probability at least of coming from block , independently of all other ’s. As a result, stochastically dominates a binomial random variable with trials and probability of success . By Hoeffding’s inequality we therefore have for a constant large enough that
[TABLE]
Together with , that implies that with probability at least for all blocks for some constant
[TABLE]
Finally, following the proof of Claim B.6 once again, we also get that with probability at least for all
[TABLE]
for some constant . Combining (B.18), (B.19), and (B.21), with probability at least
[TABLE]
for some constant .
The same holds for the odd levels. Together with (B.20) and a similar inequality for odd levels (and the fact that the first two levels of have negligible effect asymptotically), we get the claim. ∎
By replacing by 1 in the proof of Claim B.7, we can also derive the following.
Claim B.8.
Conditioned on , and , there exists such that, with probability at least , for any block
[TABLE]
Using Claims B.1 and B.8, we derive the deviation of the block-wise harmonic average degrees. Recall, for any block , the block population mean degree is
[TABLE]
and we define the block-wise harmonic average degree as
[TABLE]
Claim B.9 (Concentration of block-wise harmonic average of degrees).
Conditioned on , and , there exists such that, with probability at least , for any block ,
[TABLE]
Proof.
Conditioned on , and , under the DC-SBM,
[TABLE]
for some large enough constant . The first inequality is from Claim B.1, which holds with probability . The second inequality is from Claim B.8. A similar inequality holds for the opposite direction. Thus,
[TABLE]
Thus,
[TABLE]
for some large enough constant . By definition of we are done. ∎
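The role of the harmonic average in Claim B.9 can be seen in a simple numerical sketch. It relies on an idealization: nodes are drawn independently with probability proportional to degree, which mimics a random walk at stationarity but is not the tree-indexed walk analyzed above. The degree distribution, sample size, and function name are hypothetical.

```python
import numpy as np

def harmonic_mean_degree(sampled_degrees):
    """Harmonic mean of the degrees of sampled nodes."""
    d = np.asarray(sampled_degrees, dtype=float)
    if np.any(d <= 0):
        raise ValueError("degrees must be positive")
    return float(len(d) / np.sum(1.0 / d))

rng = np.random.default_rng(1)
population_degrees = rng.integers(1, 21, size=10_000)  # hypothetical degrees in one block

# Idealized degree-proportional sampling with replacement.
sampling_probs = population_degrees / population_degrees.sum()
sample_idx = rng.choice(population_degrees.size, size=50_000, p=sampling_probs)
sampled = population_degrees[sample_idx]

population_mean = population_degrees.mean()
harmonic = harmonic_mean_degree(sampled)
arithmetic = sampled.mean()
print(population_mean, harmonic, arithmetic)
```

Under degree-proportional sampling the arithmetic mean of sampled degrees overshoots the population mean degree, while the harmonic mean tracks it, which is why the harmonic average is the natural sample quantity here.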
Directly from Claim B.9, we show that is close to for each block in the following claim.
Claim B.10.
Conditioned on , and , there exists such that, with probability at least , for any block ,
[TABLE]
Proof.
Conditioned on , and , under the DC-SBM, Claims B.9 and B.7 hold simultaneously with probability . Then
[TABLE]
for some large enough constant . The first inequality is from Claim B.9, while the second inequality is from Claim B.1. A similar bound holds for the opposite direction. Combining with Claim B.7,
[TABLE]
for some large enough constant . ∎
Putting everything together
Finally, we prove the main result.
Proof of Theorem 4.1.
By Claims B.1 and B.2, events and hold with probability at least . Under those events, by Claims B.6 and B.9 with hold with probability ,
[TABLE]
for some large enough . Similarly for the other direction. Then, using Claim B.10,
[TABLE]
for some constant . Similarly for the other direction. Thus, there exists constant such that
[TABLE]
∎
B.3 A simple instance showing that the variance of the VH estimator converges slower than
The following example shows that, in general, the Volz-Heckathorn estimator, i.e.,
[TABLE]
has a variance asymptotically worse than on a two-block stochastic block model. Recall that is the block of .
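For reference, the Volz-Heckathorn estimator is usually written as an inverse-degree-weighted average of the sampled outcomes. The sketch below implements that standard form; the displayed formula is missing from this copy, so the exact notation used in the paper is an assumption, as are the function and argument names.

```python
import numpy as np

def volz_heckathorn(outcomes, degrees):
    """Inverse-degree-weighted mean: sum(y_i / d_i) / sum(1 / d_i)."""
    y = np.asarray(outcomes, dtype=float)
    d = np.asarray(degrees, dtype=float)
    if y.shape != d.shape:
        raise ValueError("outcomes and degrees must have the same shape")
    if np.any(d <= 0):
        raise ValueError("degrees must be positive")
    weights = 1.0 / d
    return float(np.sum(weights * y) / np.sum(weights))

# Hypothetical sample: three respondents with binary outcomes and reported degrees.
estimate = volz_heckathorn([1, 0, 1], [1, 2, 4])
print(estimate)
```

When all degrees are equal, the weights cancel and the estimator reduces to the plain sample mean.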
Theorem B.11 (Negative example).
Let and denote the blocks by . Let , where , for all . Let be chosen uniformly at random. Let be a complete -ary tree. Assume that for some and that
[TABLE]
Then, with probability at least over the network,
[TABLE]
for some .
Proof.
By Claim B.1, the event occurs with probability at least . Therefore, by the conditional variance formula,
[TABLE]
By symmetry, . Hence, on , we have further that
[TABLE]
by our assumption on , where we used that for all . To simplify notation, in the rest of the proof, we implicitly condition on and .
The population-level chain satisfies
[TABLE]
Let be a Markov chain on indexed by with transition probabilities . By Claim B.3, on , we can couple and except with probability , an event we denote by . This is because, for each of the transitions, there can only be a difference in probability of . Hence, by the conditional variance formula again,
[TABLE]
To simplify notation, in the rest of the proof, we implicitly condition on .
Define
[TABLE]
and notice that, by translation,
[TABLE]
and that is centered under . Under , the function is a right-eigenvector with eigenvalue
[TABLE]
Hence, for any at graph distance , it holds that
[TABLE]
and
[TABLE]
where we used that . Let be the leaves of . Because the samples are positively correlated by the above calculation and , we have further that
[TABLE]
Finally, by symmetry and the conditional variance formula once more, recalling that is the root of we have
[TABLE]
with by (B.22). Combining the latter with (B.23), (B.24), (B.25), (B.26), and (B.27) gives the result. ∎
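The eigenvector step in this proof can be checked numerically on a toy chain. The sketch below assumes a symmetric two-block chain in which a referral stays in the same block with probability `p`; the value of `p` and the helper name are hypothetical, and the theorem's actual parameters are not recoverable from this copy. For such a chain, the function that is +1 on one block and -1 on the other is a right eigenvector with eigenvalue 2p - 1, and the stationary correlation between two samples at graph distance d is that eigenvalue raised to the power d.

```python
import numpy as np

p = 0.9                                   # hypothetical probability of staying in the same block
P = np.array([[p, 1 - p], [1 - p, p]])    # symmetric two-block transition matrix
f = np.array([1.0, -1.0])                 # +1 on one block, -1 on the other
lam = 2 * p - 1                           # claimed eigenvalue

def correlation_at_distance(P, f, d):
    """E[f(X_0) f(X_d)] under the uniform stationary distribution of a symmetric chain."""
    pi = np.full(P.shape[0], 1.0 / P.shape[0])
    Pd = np.linalg.matrix_power(P, d)
    return float(np.sum(pi * f * (Pd @ f)))

eigen_residual = np.abs(P @ f - lam * f).max()
corr_3 = correlation_at_distance(P, f, 3)
print(eigen_residual, corr_3, lam**3)
```

Because the correlation is positive and decays only geometrically in the distance, samples in the same subtree stay dependent, which is the mechanism behind the slow variance decay in the negative example.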
References
- [1] Douglas D Heckathorn. Respondent-driven sampling: a new approach to the study of hidden populations. Social problems, 44(2):174–199, 1997.
- [2] Mohsen Malekinejad, Lisa Grazina Johnston, Carl Kendall, Ligia Regina Franco Sansigolo Kerr, Marina Raven Rifkin, and George W Rutherford. Using respondent-driven sampling methodology for HIV biological and behavioral surveillance in international settings: a systematic review. AIDS and Behavior, 12(1):105–130, 2008.
- [3] LG Johnston. Introduction to HIV/AIDS and sexually transmitted infection surveillance: Module 4: Introduction to respondent driven sampling. World Health Organization, 2013.
- [4] Richard G White, Avi J Hakim, Matthew J Salganik, Michael W Spiller, Lisa G Johnston, Ligia Kerr, Carl Kendall, Amy Drake, David Wilson, Kate Orroth, et al. Strengthening the reporting of observational studies in epidemiology for respondent-driven sampling studies: “STROBE-RDS” statement. Journal of clinical epidemiology, 68(12):1463–1471, 2015.
- [5] Karl Rohe. Network driven sampling; a critical threshold for design effects. arXiv preprint arXiv:1505.05461, 2015.
- [6] Sebastien Roch and Karl Rohe. Generalized least squares can overcome the critical threshold in respondent-driven sampling. arXiv preprint arXiv:1708.04999, 2017.
- [7] Sharad Goel and Matthew J Salganik. Respondent-driven sampling as Markov chain Monte Carlo. Statistics in medicine, 28(17):2202–2229, 2009.
- [8] Matthew J Salganik and Douglas D Heckathorn. Sampling and estimation in hidden populations using respondent-driven sampling. Sociological methodology, 34(1):193–240, 2004.
