Reducing Seed Bias in Respondent-Driven Sampling by Estimating Block Transition Probabilities

Yilin Zhang; Karl Rohe; Sebastien Roch

arXiv:1812.01188·math.ST·December 21, 2018

TL;DR

This paper introduces a method to reduce seed bias in respondent-driven sampling by estimating block transition probabilities and using them in a post-stratified estimator, improving accuracy in population proportion estimates.

Contribution

It presents a novel approach to estimate block transition probabilities and applies them to create a seed-bias-reducing estimator with proven consistency and improved performance.

Findings

1. Estimated block transition probabilities are highly accurate.

2. The proposed post-stratified estimator reduces seed bias effectively.

3. Simulation results show lower RMSE compared to existing methods.

Equations

$$E=\{\{i,j\}\,:\,i\text{ and }j\text{ can refer one another}\}.$$

$$\mathbbm{P}(X_{\tau}=j\mid X_{\tau^{\prime}}=i)=P_{ij},\quad\forall i,j\in G,$$

$$P_{ij}=\frac{w_{ij}}{d_{i}}.$$

$$\pi_{i}=\frac{d_{i}}{N\bar{d}}.$$

$$\mu_{\text{true}}=\frac{1}{N}\sum_{i\in G}y(i).$$

$$Y_{\tau}=y(X_{\tau}),\quad\forall\tau\in\mathbbm{T}.$$

$$\hat{\mu}=\frac{1}{n}\sum_{\tau\in\mathbbm{T}}Y_{\tau}$$

$$\mathbbm{E}[\hat{\mu}]=\mu=\mathbbm{E}[Y_{0}]=\sum_{i\in G}y(i)\pi_{i}.$$

$$\hat{\mu}_{\mathrm{IPW}}=\frac{1}{n}\sum_{\tau\in\mathbbm{T}}\frac{Y_{\tau}}{\pi_{X_{\tau}}N}=\frac{\bar{d}}{n}\sum_{\tau\in\mathbbm{T}}\frac{Y_{\tau}}{d_{X_{\tau}}},$$

$$\hat{H}=\left(\frac{1}{n}\sum_{\tau\in\mathbbm{T}}\frac{1}{d_{X_{\tau}}}\right)^{-1},$$

$$\hat{\mu}_{\mathrm{VH}}=\frac{\hat{H}}{n}\sum_{\tau\in\mathbbm{T}}\frac{Y_{\tau}}{d_{X_{\tau}}}.$$

$$\hat{\mu}=\sum_{k=1}^{K}\left(\frac{N_{k}}{N}\right)\hat{\mu}_{k},\quad\text{and}\quad s^{2}=\sum_{k=1}^{K}\left(\frac{N_{k}}{N}\right)^{2}\frac{N_{k}-n_{k}}{N_{k}}\frac{s_{k}^{2}}{n_{k}}.$$

$$\hat{H}_{k}=\left(\frac{1}{n_{k}}\sum_{\tau\in\mathbbm{T}_{k}}\frac{1}{d_{X_{\tau}}}\right)^{-1},$$

$$\hat{\mu}_{k\mathrm{VH}}=\frac{\hat{H}_{k}}{n_{k}}\sum_{\tau\in\mathbbm{T}_{k}}\frac{Y_{\tau}}{d_{X_{\tau}}}.$$

$$\hat{Q}_{uv}=\frac{1}{n}\times\text{number of referrals from block }u\text{ to block }v,$$

$$\hat{p}_{uv}=\frac{\hat{Q}_{uv}}{\hat{Q}_{u\ast}}.$$

$$\hat{\pi}_{k}^{B}=\left[\sum_{v}\frac{\hat{p}_{kv}}{\hat{p}_{vk}}\right]^{-1}.$$

$$\hat{\mu}_{\mathrm{PS}}=\sum_{k}\hat{\alpha}_{k}\hat{\mu}_{k\mathrm{VH}},$$

$$\hat{\alpha}_{k}=\frac{\hat{\pi}_{k}^{B}/\hat{H}_{k}}{\sum_{\ell}\hat{\pi}_{\ell}^{B}/\hat{H}_{\ell}},$$

$$\mathbbm{P}[\{i,j\}\in E]=\theta_{i}\theta_{j}B_{Z_{i},Z_{j}}.$$

$$p_{uv}=\frac{B_{uv}}{B_{u\ast}}=\frac{Q_{uv}}{Q_{u\ast}},$$

$$\pi_{k}^{B}=Q_{k\ast}=\left[\sum_{v}\frac{Q_{v\ast}}{Q_{k\ast}}\right]^{-1}=\left[\sum_{v}\frac{Q_{kv}/Q_{k\ast}}{Q_{vk}/Q_{v\ast}}\right]^{-1}=\left[\sum_{v}\frac{p_{kv}}{p_{vk}}\right]^{-1}.$$

$$\sum_{k}Q_{k\ast}p_{kv}=\sum_{k}Q_{k\ast}\frac{Q_{kv}}{Q_{k\ast}}=\sum_{k}Q_{kv}=Q_{v\ast}.$$

$$\mathbbm{E}[d_{i}]=\sum_{j}\theta_{i}\theta_{j}B_{Z_{i},Z_{j}}=\theta_{i}\sum_{\ell}\sum_{j\in V_{\ell}}\theta_{j}B_{k\ell}=\theta_{i}B_{k\ast}.$$

$$\delta_{k}^{B}=\frac{1}{N_{k}}\sum_{i\in V_{k}}\theta_{i}B_{k\ast}=\frac{B_{k\ast}}{N_{k}}.$$

$$\frac{\pi_{k}^{B}}{\delta_{k}^{B}}=\frac{N_{k}}{\sum_{\ell}B_{\ell\ast}}.$$

$$\alpha_{k}=\frac{\pi_{k}^{B}/\delta_{k}^{B}}{\sum_{\ell}\pi_{\ell}^{B}/\delta_{\ell}^{B}}=\frac{N_{k}}{N}.$$

$$\mu_{\text{true}}=\sum_{k}\alpha_{k}\mu_{k}.$$

$$|\hat{\mu}_{\mathrm{PS}}-\mu_{\text{true}}|\leq c\sqrt{\frac{\log n}{n}},$$

$$\begin{pmatrix}0.95&0.05\\0.05&0.95\end{pmatrix}.$$

Taxonomy

Topics: HIV, Drug Use, Sexual Risk · HIV/AIDS Research and Interventions · Opioid Use Disorder Treatment

Full text

Reducing Seed Bias in Respondent-Driven Sampling by Estimating Block Transition Probabilities

Yilin Zhang, Karl Rohe, and Sebastien Roch

University of Wisconsin-Madison, Department of Statistics and Department of Mathematics

Yilin Zhang, Karl Rohe: Department of Statistics, University of Wisconsin-Madison, 1300 University Ave, Madison, WI 53706, USA

Sebastien Roch: Department of Mathematics, University of Wisconsin-Madison, 480 Lincoln Drive, Madison, WI 53706, USA

Abstract

Respondent-driven sampling (RDS) is a popular approach to study marginalized or hard-to-reach populations. It collects samples from a networked population by incentivizing participants to refer their friends into the study. One major challenge in analyzing RDS samples is seed bias. Seed bias refers to the fact that when the social network is divided into multiple communities (or blocks), the RDS sample might not provide a balanced representation of the different communities in the population, and such unbalance is correlated with the initial participant (or the seed). In this case, the distributions of estimators are typically non-trivial mixtures, which are determined (1) by the seed and (2) by how the referrals transition from one block to another. This paper shows that (1) block-transition probabilities are easy to estimate with high accuracy, and (2) we can use these estimated block-transition probabilities to estimate the stationary distribution over blocks and thus, an estimate of the block proportions. This stationary distribution on blocks has previously been used in the RDS literature to evaluate whether the sampling process has appeared to “mix”. We use these estimated block proportions in a simple post-stratified (PS) estimator that greatly diminishes seed bias. By aggregating over the blocks/strata in this way, we prove that the PS estimator is $\sqrt{n}$ -consistent under a Markov model, even when other estimators are not. Simulations show that the PS estimator has smaller Root Mean Square Error (RMSE) compared to the state-of-the-art estimators.

Keywords: respondent-driven sampling, post-stratification, social network, Stochastic Blockmodel, Markov process

t1: These authors gratefully acknowledge support from NSF grant DMS-1612456 and ARO grant W911NF-15-1-0423.

t2: This author gratefully acknowledges support from NSF grants DMS-1149312 (CAREER), DMS-1614242 and CCF-1740707 (TRIPODS), and a Simons Fellowship.

1 Introduction

Respondent-driven sampling (RDS) is one of the most popular network-based approaches to sample marginalized and hard-to-reach populations, such as drug users, sex workers, and the homeless [1]. RDS has been widely used, for instance, to quantify HIV prevalence in at-risk populations [2, 3]. According to a recent literature review [4], RDS has been used in over 460 studies from 69 countries.

RDS collects samples through peer referral on a social network. It starts from an initial participant, the seed, who forms wave zero. Each participant is then incentivized to pass a few (usually three to five) referral coupons to their friends. Those who return to the study site with a referral coupon form the next wave of samples. This process repeats until enough samples are collected or the participants stop referring. Figure 1, from [5], illustrates the RDS sampling process. There are three components in RDS sampling: (1) the social network, (2) the sampling tree, and (3) the variable of interest (denoted by color in Figure 1). The underlying social network is the target population to study, which is unobserved. For each sampled node, we observe their HIV status (black or grey in Figure 1) and which node referred them to the sample. We aim to estimate the proportion of people in the population with a certain trait, such as being HIV positive (the grey nodes in Figure 1).

The link-tracing sampling procedure of RDS enables us to reach hard-to-reach populations. However, RDS samples are dependent. This dependence is particularly severe when there are multiple communities (i.e. blocks) in the target population and people form most of their friendships within their own community. For example, people from the east side of town might know only a few people from the west side, and thus they are much more likely to refer people from the east side. This is referred to as a “bottleneck” and it leads to a sample that is unbalanced between the different communities. If the HIV prevalence is higher on one side of town, then this bottleneck creates dependence between observations in an RDS sample. If the initial participant is from the east side, then the sample may underrepresent people from the west side. This creates “seed bias.” In statistical models which presume that the seed node is randomized, this “seed bias” appears as additional variance in the final estimator. When some participants refer too many contacts, the variance of the traditional RDS estimator, the Volz-Heckathorn (VH) estimator [1], decays at a rate slower than $O(n^{-1})$ [5]. We provide an example in Appendix B.3. To address this issue, recent work [6] has derived an idealized generalized least squares (GLS) estimator for which the standard error decays at rate $O(n^{-1/2})$ with growing sample size $n$ under a fixed social network. The practical implementation of the estimator, called the feasible GLS (fGLS) estimator, requires solving an $n\times n$ system of equations and comes with no theoretical guarantees.

This paper provides an estimator that is easy to compute and has root mean squared error that decays at rate $\Theta(n^{-1/2})$ up to log factors, by implicitly adjusting for bottlenecks between different communities. While this estimator is new, its essential components are well known and reported in the RDS literature. This new estimator assumes that we have collected the “bottlenecked” community memberships of the sampled individuals. With this data, a key summary is the empirical transition matrix between communities, in which element $u,v$ is the proportion of referrals from participants in community $u$ to participants in community $v$ . In the RDS literature, this matrix is a common way to summarize the sampling procedure and understand the underlying social network. For example, the original RDS paper [1] reports on a sample of drug users. Table 1c from that paper (reprinted as Figure 2 herein) gives the empirical transition matrix between communities defined by drug preference. This empirical transition matrix is also a key piece of the feasible GLS estimator [6].

Interestingly, an estimate of the proportion of nodes in each community can be derived from the empirical transition matrix. Notice in Figure 2 that [1] reports the equilibrium distribution on the different strata/communities. This takes the empirical transition matrix as a Markov transition matrix on the different communities and computes the stationary (i.e. equilibrium) distribution of this Markov process (i.e. the leading left eigenvector of the transition matrix). In Figure 2, the equilibrium distribution is close to the total distribution of recruits. When there is a bottleneck, this paper shows that the equilibrium distribution is a better estimator than the total distribution of recruits. The basic reason is that even when there is a bottleneck, each row of the empirical transition matrix is composed of $O(n)$ nearly independent multinomial samples. There is one caveat; our estimator does not use the actual equilibrium distribution of the empirical transition matrix (i.e. the quantity reported in Figure 2). Instead, we have a simple approximation of the equilibrium which is easier to compute and thus simplifies the proof.
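The equilibrium computation described above can be sketched in a few lines. This is a minimal illustration, assuming a small hypothetical transition matrix `P_hat` (not the values from Figure 2) and a helper name `stationary_distribution` chosen here for illustration; it computes the exact leading left eigenvector, whereas the estimator in this paper uses a simpler approximation.

```python
import numpy as np

def stationary_distribution(P):
    """Leading left eigenvector of a row-stochastic matrix, normalized to sum to 1."""
    P = np.asarray(P, dtype=float)
    # Left eigenvectors of P are right eigenvectors of P transposed.
    eigvals, eigvecs = np.linalg.eig(P.T)
    lead = np.argmax(eigvals.real)        # eigenvalue 1 has the largest real part
    v = np.abs(eigvecs[:, lead].real)
    return v / v.sum()

# Hypothetical empirical transition matrix between 3 communities (rows sum to 1).
P_hat = np.array([[0.8, 0.1, 0.1],
                  [0.2, 0.7, 0.1],
                  [0.1, 0.3, 0.6]])
pi_hat = stationary_distribution(P_hat)
print(pi_hat)
```

The returned vector satisfies `pi_hat @ P_hat == pi_hat` up to floating-point error, which is the defining property of the equilibrium distribution.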

The final estimator is a post-stratified estimator where the strata are the community memberships and the estimated proportion of nodes in each stratum is derived from the estimated equilibrium distribution. We call this the PS estimator. The PS estimator has three major advantages: (1) computational efficiency, (2) smaller variation (squared bias, variance, and RMSE), and (3) block-wise byproducts. We show in Theorem 4.1 that both the bias and the standard deviation of our PS estimator decay at rate $\Theta(n^{-1/2})$ up to log factors, which does not hold for the popular Volz-Heckathorn (VH) estimator [1] and has not been established for the GLS estimator [6]. The simulation studies also show that our PS estimator has smaller variation (squared bias, variance, and RMSE) than the VH and fGLS estimators. The improvement is especially pronounced when there is a bottleneck in the social network.

The paper is organized as follows. Section 2 defines the Markov model, the quantity to estimate, and the traditional RDS estimators. Section 3 introduces the PS estimator. Section 4 shows that the PS estimator is $\sqrt{n}$-consistent under the Degree Corrected Stochastic Blockmodel (DC-SBM). In Section 5, we show by simulations that the PS estimator has smaller variation than the state-of-the-art estimators, especially when there is a bottleneck in the social network. We summarize with a discussion in Section 6.

2 Preliminaries

We model referrals using a Markov process similar to the ones previously considered in the RDS literature [7, 1, 8, 9, 5, 6].

2.1 Markov process on a social network

A social network $G$ consists of a node set $V=\{1,\dots,N\}$ of individuals and an undirected edge set

$$E=\{\{i,j\}\,:\,i\text{ and }j\text{ can refer one another}\}.$$

We use $i\in V$ and $i\in G$ interchangeably. We assume that $G$ is connected. Let $w_{ij}=w_{ji}>0$ be the weight of edge $\{i,j\}\in E$, which models recruitment preference (more details in Section 4). For any $\{i,j\}\not\in E$, we let $w_{ij}=w_{ji}=0$ by convention. If the graph is unweighted, then $w_{ij}=1$ for all $\{i,j\}\in E$. For each node $i\in V$, we denote its set of neighbors in the network $G$ by $\mathcal{N}(i)=\left\{j\in V\,:\,\{i,j\}\in E\right\}$. We denote the degree of node $i$ as $d_{i}=\sum_{j}w_{ij}$ and the mean degree of graph $G$ as $\bar{d}=\sum_{i}d_{i}/N$.

We model the collection of samples in RDS with a Markov process on the social network $G$ indexed by a tree. It starts with an initial participant as seed, which we index as vertex 0, and develops into a rooted tree, $\mathbbm{T}$ (a connected graph with $n$ nodes, no cycles, and a vertex $0$). We use $\tau\in\mathbbm{T}$ to denote that node $\tau$ belongs to the samples indexed by $\mathbbm{T}$. For each node $\tau\in\mathbbm{T}$, we denote the parent of $\tau$ as $\tau^{\prime}$ (the node that refers $\tau$ to the sample). Formally, an RDS sample is an indexed collection of random nodes $(X_{\tau}\in G:\tau\in\mathbbm{T})$, where each referral $X_{\tau^{\prime}}\rightarrow X_{\tau}$ has probability

$$\mathbbm{P}(X_{\tau}=j\mid X_{\tau^{\prime}}=i)=P_{ij},\quad\forall i,j\in G,$$

where the transition matrix $P\in\mathbbm{R}^{N\times N}$ has elements

$$P_{ij}=\frac{w_{ij}}{d_{i}}.$$

Since the graph $G$ is undirected and connected, $P$ is a reversible Markov transition matrix with unique stationary distribution $\bm{\pi}=(\pi_{i})_{i\in G}\in\mathbbm{R}^{N}$ with

$$\pi_{i}=\frac{d_{i}}{N\bar{d}}.$$

While the referrals are random, we think of $\mathbbm{T}$ itself as deterministic.

Following [10], we refer to this Markov process as a $(\mathbbm{T},P)$-walk on $G$. Note that $G$ and $\mathbbm{T}$ are two distinct graphs: the node set in $G$ indexes the population, which is a social network, and the node set in $\mathbbm{T}$ indexes the samples, which is a sampling tree. We say that the $(\mathbbm{T},P)$-walk is stationary if the seed is chosen according to the stationary distribution.
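As a concrete illustration of the tree-indexed walk, the sketch below simulates a stationary $(\mathbbm{T},P)$-walk on a tiny weighted graph. The weight matrix `W`, the sampling tree `parent`, and the function name `tree_indexed_walk` are hypothetical choices made for this example; the walk samples with replacement, as in the Markov model.

```python
import numpy as np

def tree_indexed_walk(W, parent, seed_node, rng):
    """Simulate X_tau for tau = 0..n-1, where parent[tau] < tau is the referrer of tau.

    W is a symmetric nonnegative weight matrix; transitions use P_ij = w_ij / d_i.
    """
    W = np.asarray(W, dtype=float)
    d = W.sum(axis=1)                     # node degrees d_i = sum_j w_ij
    P = W / d[:, None]                    # row-normalized transition matrix
    X = np.empty(len(parent), dtype=int)
    X[0] = seed_node
    for tau in range(1, len(parent)):
        # The referral from the parent's node follows the corresponding row of P.
        X[tau] = rng.choice(len(d), p=P[X[parent[tau]]])
    return X

rng = np.random.default_rng(0)
W = np.array([[0, 1, 1, 0],
              [1, 0, 1, 1],
              [1, 1, 0, 1],
              [0, 1, 1, 0]], dtype=float)   # hypothetical unweighted graph on 4 nodes
d = W.sum(axis=1)
pi = d / d.sum()                            # pi_i = d_i / (N * d_bar)
parent = [-1, 0, 0, 1, 1, 2]                # hypothetical sampling tree, index 0 is the seed
seed_node = rng.choice(len(d), p=pi)        # stationary start
X = tree_indexed_walk(W, parent, seed_node, rng)
print(X)
```

Drawing the seed from `pi` makes the walk stationary, so every $X_{\tau}$ has marginal distribution $\bm{\pi}$.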

2.2 Quantity to estimate and the Volz-Heckathorn estimator

For each node $i\in G$, we denote the variable of interest (e.g., the indicator of HIV status) as $y(i)$. We wish to estimate the population mean of the variable of interest

$$\mu_{\text{true}}=\frac{1}{N}\sum_{i\in G}y(i).$$

For each sample $X_{\tau}$, we observe

$$Y_{\tau}=y(X_{\tau}),\quad\forall\tau\in\mathbbm{T}.$$

The sample average

$$\hat{\mu}=\frac{1}{n}\sum_{\tau\in\mathbbm{T}}Y_{\tau}$$

is generally biased, since nodes with larger degrees are more likely to be sampled in the Markov process. Specifically, under the stationary $(\mathbbm{T},P)$-walk on $G$, it has expectation

$$\mathbbm{E}[\hat{\mu}]=\mu=\mathbbm{E}[Y_{0}]=\sum_{i\in G}y(i)\pi_{i}.$$

In general, $\mu\not=\mu_{\text{true}}$.

To obtain an unbiased estimator of $\mu_{\text{true}}$, the sample average must be adjusted. Using $\pi_{i}=d_{i}/(N\bar{d})$, the inverse probability weighted (IPW) estimator,

$$\hat{\mu}_{\mathrm{IPW}}=\frac{1}{n}\sum_{\tau\in\mathbbm{T}}\frac{Y_{\tau}}{\pi_{X_{\tau}}N}=\frac{\bar{d}}{n}\sum_{\tau\in\mathbbm{T}}\frac{Y_{\tau}}{d_{X_{\tau}}},$$

is an unbiased estimator of $\mu_{\text{true}}$ [11]. Additionally estimating $\bar{d}$ with the harmonic mean of the observed node degrees,

$$\hat{H}=\left(\frac{1}{n}\sum_{\tau\in\mathbbm{T}}\frac{1}{d_{X_{\tau}}}\right)^{-1},$$

leads to the popular Volz-Heckathorn (VH) estimator [9],

$$\hat{\mu}_{\mathrm{VH}}=\frac{\hat{H}}{n}\sum_{\tau\in\mathbbm{T}}\frac{Y_{\tau}}{d_{X_{\tau}}}.$$

The VH estimator has been extensively used in the study of marginalized populations [2, 3, 4], but it is highly variable. The variance of the VH estimator may in general decay at a rate slower than $O(n^{-1})$ [5], implying that many more samples are required to reduce the standard error; see Appendix B.3. We address this issue by introducing a post-stratification approach to RDS in the following section.
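The VH estimator is simple to compute from the sampled outcomes and degrees. The following is a minimal sketch, with made-up inputs `y` and `d` and a function name `vh_estimate` chosen here for illustration.

```python
import numpy as np

def vh_estimate(y, d):
    """Volz-Heckathorn estimate: harmonic-mean degree times the mean of y / d."""
    y = np.asarray(y, dtype=float)
    d = np.asarray(d, dtype=float)
    H = 1.0 / np.mean(1.0 / d)            # harmonic mean of the sampled degrees
    return H * np.mean(y / d)

# Hypothetical sample: outcomes Y_tau and degrees d_{X_tau} of four participants.
y = [1, 0, 1, 1]
d = [2, 4, 4, 8]
mu_vh = vh_estimate(y, d)
print(mu_vh)
```

Equivalently, the estimate is the ratio $\sum_{\tau}(Y_{\tau}/d_{X_{\tau}})/\sum_{\tau}(1/d_{X_{\tau}})$, so low-degree participants receive larger weights than in the plain sample average.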

3 A new estimator

3.1 A post-stratification approach to RDS

Stratification

Stratification has been extensively used in traditional random sampling to reduce variance. The key idea of stratified sampling is as follows. Assume that the overall population can be divided into (ideally homogeneous) sub-groups (which we refer to as blocks) based on some variable, such as gender, race, etc. Then the sample mean and sample variance of the total population can be calculated using block-wise sample means and variances.

Specifically, suppose there are $K$ blocks in a population with $N$ individuals. For each block $k$, we denote the block size as $N_{k}$, the block-wise population mean as $\mu_{k}$, the sample size as $n_{k}$ and the block-wise sample average as $\hat{\mu}_{k}$. The sample average $\hat{\mu}$ and sample variance $s^{2}$ for the total population can be derived from the block-wise quantities by

$$\hat{\mu}=\sum_{k=1}^{K}\left(\frac{N_{k}}{N}\right)\hat{\mu}_{k},\quad\text{and}\quad s^{2}=\sum_{k=1}^{K}\left(\frac{N_{k}}{N}\right)^{2}\frac{N_{k}-n_{k}}{N_{k}}\frac{s_{k}^{2}}{n_{k}}.\tag{3.1}$$

Stratified sampling by proportionate allocation randomly selects individuals proportionally to the sizes of the different blocks, with the goal of improving accuracy by reducing sampling error. Post-stratified sampling, on the other hand, performs stratification after sampling and calculates $\hat{\mu}$ and $s^{2}$ as above. Post-stratification is useful when the samples constitute an unbalanced representation of the full population.
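The stratified formulas above translate directly into code. This sketch assumes that $s_{k}^{2}$ is the usual unbiased within-block sample variance (the text does not spell this out) and uses made-up block samples and block sizes; the function name is hypothetical.

```python
import numpy as np

def stratified_mean_and_variance(samples, block_sizes):
    """samples: one 1-D array per block; block_sizes: population sizes N_k."""
    N_k = np.asarray(block_sizes, dtype=float)
    N = N_k.sum()
    n_k = np.array([len(s) for s in samples], dtype=float)
    mu_k = np.array([np.mean(s) for s in samples])
    # Assumption: s_k^2 is the unbiased within-block sample variance.
    s2_k = np.array([np.var(s, ddof=1) for s in samples])
    mu_hat = np.sum(N_k / N * mu_k)
    s2 = np.sum((N_k / N) ** 2 * (N_k - n_k) / N_k * s2_k / n_k)
    return mu_hat, s2

# Hypothetical data: two blocks of sizes 300 and 700 with four samples each.
samples = [np.array([1.0, 0.0, 1.0, 1.0]), np.array([0.0, 0.0, 1.0, 0.0])]
mu_hat, s2 = stratified_mean_and_variance(samples, block_sizes=[300, 700])
print(mu_hat, s2)
```

Here the pooled sample average is 0.5, while the stratified estimate is 0.4 because the second block makes up 70% of the population.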

Block proportions are unobserved in marginalized populations

We seek to apply this last approach to RDS in order to deal with seed bias. An important issue arises however. Per (3.1), traditional post-stratification requires the knowledge of the block proportions $N_{k}/N$ . These are typically unknown in marginalized populations. Hence, we need to estimate the block proportions from the samples. In the next section, we describe how we do this and we formally define a novel post-stratified estimator for RDS.

3.2 Block-wise quantities

For a set $V^{\prime}$, denote its cardinality by $|V^{\prime}|$. Suppose there are $K$ blocks in the social network $G$. For each node $i\in G$, denote its block membership as $z(i)$, i.e., $z(i)=k$ if $i$ belongs to block $k\in\{1,\dots,K\}$. To simplify notation, we write $i\in V_{k}$ to mean $z(i)=k$. For each block $k$, we denote the block size as $N_{k}=|V_{k}|$ and the block-wise mean as $\mu_{k}=N_{k}^{-1}\sum\limits_{i\in V_{k}}y(i)$.

For each sample $\tau\in\mathbbm{T}$, we let its block membership be $Z_{\tau}=z(X_{\tau})$ and we write $\tau\in\mathbbm{T}_{k}$ to mean $Z_{\tau}=k$. We define for each block $k$ the sample size as $n_{k}$, the block-wise harmonic average degree as

$$\hat{H}_{k}=\left(\frac{1}{n_{k}}\sum_{\tau\in\mathbbm{T}_{k}}\frac{1}{d_{X_{\tau}}}\right)^{-1},\tag{3.2}$$

and the block-wise sample average weighted by degree, i.e., the VH estimator for $\mu_{k}$, as

$$\hat{\mu}_{k\mathrm{VH}}=\frac{\hat{H}_{k}}{n_{k}}\sum_{\tau\in\mathbbm{T}_{k}}\frac{Y_{\tau}}{d_{X_{\tau}}}.\tag{3.3}$$

Suppose that we observe the block membership of each sample, i.e., we observe $Z_{\tau}=z(X_{\tau})$ for all $\tau\in\mathbbm{T}$. We define the matrix $\hat{Q}\in\mathbbm{R}^{K\times K}$ such that, for any two blocks $u,v\in\{1,\dots,K\}$,

$$\hat{Q}_{uv}=\frac{1}{n}\times\text{number of referrals from block }u\text{ to block }v,$$

and the row-normalized matrix $\hat{P}^{B}\in\mathbbm{R}^{K\times K}$ whose $(u,v)$-entry is

$$\hat{p}_{uv}=\frac{\hat{Q}_{uv}}{\hat{Q}_{u\ast}}.$$

Here, for a matrix $A$, we let $A_{u\ast}=\sum\limits_{v}A_{uv}$ and $\mathbf{1}\{\mathcal{E}\}$ is the indicator of event $\mathcal{E}$. Finally we define the vector $\hat{\bm{\pi}}^{B}=(\hat{\pi}_{k}^{B})_{k}$ with entries

$$\hat{\pi}_{k}^{B}=\left[\sum_{v}\frac{\hat{p}_{kv}}{\hat{p}_{vk}}\right]^{-1}.\tag{3.4}$$
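The block-wise referral quantities can be computed from the block labels and the referral tree alone. The sketch below uses a hypothetical encoding (labels in `0..K-1`, `parent[tau] = -1` for the seed) and a made-up sample of eight participants; it assumes every block pair has at least one observed referral in each direction, since otherwise the ratios are undefined.

```python
import numpy as np

def block_quantities(z, parent, K):
    """Return (Q_hat, P_hat, pi_hat) from block labels z and referrer indices parent."""
    n = len(z)
    Q = np.zeros((K, K))
    for tau in range(n):
        if parent[tau] >= 0:                       # skip the seed, which has no referrer
            Q[z[parent[tau]], z[tau]] += 1.0
    Q /= n                                         # Q_hat_uv = (1/n) * referral count u -> v
    P = Q / Q.sum(axis=1, keepdims=True)           # row-normalized block transitions
    # Entry (k, v) of P / P.T is p_kv / p_vk; sum over v, then invert.
    pi = 1.0 / (P / P.T).sum(axis=1)
    return Q, P, pi

# Hypothetical sample: 8 participants, 2 blocks.
z = [0, 0, 1, 0, 1, 1, 0, 1]
parent = [-1, 0, 0, 1, 2, 2, 4, 4]
Q_hat, P_hat, pi_hat = block_quantities(z, parent, K=2)
print(P_hat)
print(pi_hat)
```

In this example the referral counts are 2, 1, 1, 3 (block 0 to 0, 0 to 1, 1 to 0, 1 to 1), which gives $\hat{\pi}^{B}=(3/7,4/7)$; with $K=2$ this coincides with the exact stationary distribution of $\hat{P}^{B}$.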

3.3 The post-stratified estimator

We define our new estimator next.

Definition 3.1 (The post-stratified estimator).

For an RDS sample on a graph $G$ with $K$ blocks, the post-stratified (PS) estimator is

$$\hat{\mu}_{\mathrm{PS}}=\sum_{k}\hat{\alpha}_{k}\hat{\mu}_{k\mathrm{VH}},$$

with

$$\hat{\alpha}_{k}=\frac{\hat{\pi}_{k}^{B}/\hat{H}_{k}}{\sum_{\ell}\hat{\pi}_{\ell}^{B}/\hat{H}_{\ell}},$$

where $\hat{H}_{k}$, $\hat{\mu}_{k\mathrm{VH}}$, and $\hat{\pi}_{k}^{B}$ are defined in (3.2), (3.3) and (3.4) respectively.

Comparing (3.6) with (3.1), the estimator $\hat{\mu}_{\mathrm{PS}}$ can indeed be seen as a post-stratified estimator. In Section 3.4, we argue that $\hat{\alpha}_{k}$ is an estimator of the block proportion of block $k$. Note that we also use the VH estimator $\hat{\mu}_{k\mathrm{VH}}$ on each block $k$, instead of the block-wise sample average, to adjust for the bias induced by node degrees.
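Putting the pieces together, the PS estimator can be sketched as a single function. The input encoding (labels in `0..K-1`, `parent[tau] = -1` for the seed), the function name `ps_estimate`, and the example data are all hypothetical; the sketch assumes every block is sampled and every block pair has referrals in both directions.

```python
import numpy as np

def ps_estimate(y, d, z, parent, K):
    """Post-stratified estimate from outcomes y, degrees d, block labels z, referrers parent."""
    y = np.asarray(y, dtype=float)
    d = np.asarray(d, dtype=float)
    z = np.asarray(z)
    # Block-to-block referral counts; the 1/n factor in Q_hat cancels after row normalization.
    C = np.zeros((K, K))
    for tau, par in enumerate(parent):
        if par >= 0:
            C[z[par], z[tau]] += 1.0
    P = C / C.sum(axis=1, keepdims=True)
    pi = 1.0 / (P / P.T).sum(axis=1)                                  # pi_hat_k^B
    H = np.array([1.0 / np.mean(1.0 / d[z == k]) for k in range(K)])  # H_hat_k
    mu_vh = np.array([H[k] * np.mean(y[z == k] / d[z == k]) for k in range(K)])
    alpha = (pi / H) / np.sum(pi / H)                                 # alpha_hat_k
    return float(np.sum(alpha * mu_vh)), alpha

# Hypothetical sample: outcome equals the block label, degree 2 in block 0 and 4 in block 1.
z = np.array([0, 0, 1, 0, 1, 1, 0, 1])
parent = [-1, 0, 0, 1, 2, 2, 4, 4]
y = z.astype(float)
d = np.where(z == 0, 2.0, 4.0)
mu_ps, alpha_hat = ps_estimate(y, d, z, parent, K=2)
print(mu_ps, alpha_hat)
```

In this toy sample the plain sample average is 0.5, while the PS estimate is 0.4: the estimated block proportions are $(0.6, 0.4)$ because block 1 is visited more often but its members have twice the degree.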

3.4 Motivation for the PS estimator

To motivate our new estimator, we analyze its behavior under a standard model of random social network with community structure, the degree-corrected stochastic blockmodel (DC-SBM) [12].

Definition 3.2 (Degree-corrected stochastic blockmodel).

Let $B\in\mathbbm{R}_{+}^{K\times K}$ be a positive, symmetric matrix and let $\theta\in\mathbbm{R}_{+}^{N}$ be a positive vector. Under the DC-SBM, a social network $G=(V,E)$ with $V=\{1,\ldots,N\}$ is drawn randomly as follows. Assume that we have a partition $V_{1},\ldots,V_{K}$ of $V$ into $K$ blocks labeled $\{1,\ldots,K\}$. Let $N_{1},\ldots,N_{K}$ be the respective sizes of the blocks. For a node $i\in V$, let $Z_{i}$ be its block. Each possible edge $\{i,j\}$ is present independently from all other edges with probability

$$\mathbbm{P}[\{i,j\}\in E]=\theta_{i}\theta_{j}B_{Z_{i},Z_{j}}.$$

By convention, we assume $\sum_{i\in V_{k}}\theta_{i}=1$ for every block $k$.

Remark 3.3 (Self-loops).

To simplify the notation throughout, we allow self-loops $\{i,i\}$ in the DC-SBM, each of which will contribute $1$ to degree counts (instead of the standard convention of $2$ ). Note that, in a dense graph, such self-loops will play a negligible role.
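To make the definition concrete, the sketch below draws one network from a DC-SBM, allowing self-loops as in Remark 3.3. The block sizes, the scale of `B`, and the function name `sample_dcsbm` are illustrative assumptions, chosen so that every edge probability stays below 1.

```python
import numpy as np

def sample_dcsbm(B, theta, z, rng):
    """Return a symmetric 0/1 adjacency matrix with self-loops allowed."""
    # Edge probability for each pair: theta_i * theta_j * B[z_i, z_j].
    prob = np.outer(theta, theta) * B[np.ix_(z, z)]
    if prob.max() > 1:
        raise ValueError("theta_i * theta_j * B must be a probability")
    N = len(z)
    # One independent draw per unordered pair {i, j} with i <= j, then symmetrize.
    upper = np.triu(rng.random((N, N)) < prob)
    return (upper | upper.T).astype(int)

rng = np.random.default_rng(1)
N_k = 50                                        # hypothetical block size
z = np.repeat([0, 1], N_k)                      # block label of each node
theta = np.full(2 * N_k, 1.0 / N_k)             # sums to 1 within each block
B = np.array([[0.95, 0.05],
              [0.05, 0.95]]) * 1000.0           # hypothetical scale
A = sample_dcsbm(B, theta, z, rng)
degrees = A.sum(axis=1)                         # a self-loop contributes 1 to the degree
print(degrees.mean())
```

With these values, within-block pairs connect with probability 0.38 and cross-block pairs with probability 0.02, so the sampled graph has a visible bottleneck between the two blocks.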

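To justify our PS estimator under the DC-SBM, we make three observations:

1. Define the matrices $Q=B/m$, where $m=\mathbf{1}^{T}B\mathbf{1}$, and $P^{B}=(p_{uv})_{u,v}$, where

$$p_{uv}=\frac{B_{uv}}{B_{u\ast}}=\frac{Q_{uv}}{Q_{u\ast}},$$

for any two blocks $u,v$. Since $P^{B}$ is a positive, row-normalized version of the symmetric matrix $Q$, it has a unique stationary distribution $\bm{\pi}^{B}=(\pi_{k}^{B})_{k}$, where

$$\pi_{k}^{B}=Q_{k\ast}=\left[\sum_{v}\frac{Q_{v\ast}}{Q_{k\ast}}\right]^{-1}=\left[\sum_{v}\frac{Q_{kv}/Q_{k\ast}}{Q_{vk}/Q_{v\ast}}\right]^{-1}=\left[\sum_{v}\frac{p_{kv}}{p_{vk}}\right]^{-1}.$$

Indeed

$$\sum_{k}Q_{k\ast}p_{kv}=\sum_{k}Q_{k\ast}\frac{Q_{kv}}{Q_{k\ast}}=\sum_{k}Q_{kv}=Q_{v\ast}.$$

2. The expected degree of node $i$ in block $k$ is

$$\mathbbm{E}[d_{i}]=\sum_{j}\theta_{i}\theta_{j}B_{Z_{i},Z_{j}}=\theta_{i}\sum_{\ell}\sum_{j\in V_{\ell}}\theta_{j}B_{k\ell}=\theta_{i}B_{k\ast}.$$

Hence the block-wise mean expected degree over block $k$ is

$$\delta_{k}^{B}=\frac{1}{N_{k}}\sum_{i\in V_{k}}\theta_{i}B_{k\ast}=\frac{B_{k\ast}}{N_{k}}.$$

3. Combining the two observations above, we get

$$\frac{\pi_{k}^{B}}{\delta_{k}^{B}}=\frac{N_{k}}{\sum_{\ell}B_{\ell\ast}}.$$

Because the denominator is constant, we have finally

$$\alpha_{k}=\frac{\pi_{k}^{B}/\delta_{k}^{B}}{\sum_{\ell}\pi_{\ell}^{B}/\delta_{\ell}^{B}}=\frac{N_{k}}{N}.$$

Therefore, by (3.1), the population mean $\mu_{\text{true}}$ can be re-written as

$$\mu_{\text{true}}=\sum_{k}\alpha_{k}\mu_{k}.$$

From this it follows that, to estimate $\mu_{\text{true}}$, it suffices to estimate the block-wise mean $\mu_{k}$, the block-wise expected mean degree $\delta^{B}_{k}$, and the stationary distribution $\pi_{k}^{B}$ of $P^{B}$, for each block $k$. We estimate them with $\hat{\mu}_{k\mathrm{VH}}$, $\hat{H}_{k}$, and $\hat{\pi}_{k}^{B}$, respectively, leading to the PS estimator in (3.6). In the proof of Theorem 4.1 below, we analyze the accuracy of these estimators (see Claims B.6, B.9 and B.7).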
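The three observations can be checked numerically at the population level. The sketch below uses a made-up symmetric matrix `B` and made-up block sizes `N_k`, and verifies that the ratio formula recovers the stationary distribution and that the resulting weights equal the block proportions.

```python
import numpy as np

# Hypothetical symmetric, positive block matrix and block sizes.
B = np.array([[6.0, 1.0, 2.0],
              [1.0, 8.0, 3.0],
              [2.0, 3.0, 5.0]])
N_k = np.array([200.0, 500.0, 300.0])

Q = B / B.sum()                                  # Q = B / m with m = 1^T B 1
P = Q / Q.sum(axis=1, keepdims=True)             # p_uv = Q_uv / Q_{u*}
pi_B = Q.sum(axis=1)                             # stationary distribution pi_k^B = Q_{k*}
pi_from_ratios = 1.0 / (P / P.T).sum(axis=1)     # [sum_v p_kv / p_vk]^{-1}
delta_B = B.sum(axis=1) / N_k                    # block-wise mean expected degree
alpha = (pi_B / delta_B) / np.sum(pi_B / delta_B)

print(np.allclose(pi_B @ P, pi_B))               # pi^B is stationary for P^B
print(np.allclose(pi_from_ratios, pi_B))         # the ratio formula is exact when Q is symmetric
print(alpha, N_k / N_k.sum())                    # alpha_k equals N_k / N
```

The ratio formula is exact here because $Q$ is symmetric; for the empirical matrix $\hat{Q}$, which need not be symmetric, it is only an approximation of the equilibrium distribution.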

4 Main theoretical result

In this section, we show that the PS estimator defined in (3.6) has error $O(\sqrt{\log n/n})$ with high probability when the social network is distributed under a dense DC-SBM.

Theorem 4.1 (Main result).

Suppose the social network $G=(V,E)$ of size $N$ is distributed according to the DC-SBM with $K$ blocks of respective sizes $N_{1},\ldots,N_{K}$ and parameters $B\in\mathbbm{R}_{+}^{K\times K}$ and $\theta\in\mathbbm{R}_{+}^{N}$. Suppose $\mathbbm{T}$ is a sampling tree of size $n\leq N$. Let $y\in\mathbbm{R}_{+}^{N}$ be the variable of interest. Assume that there are universal constants $0<c_{-}<c_{+}<+\infty$ and $0<c_{y},c_{d}<+\infty$ independent of $N$ and $n$ such that the following assumptions hold:

(a) [Linear-sized blocks] $c_{-}N\leq N_{k}\leq c_{+}N$ for all $k$;

(b) [Dense graph] $c_{-}N^{2}\leq B_{uv}\leq c_{+}N^{2}$ for all blocks $u,v$;

(c) [Degree homogeneity] $c_{-}N^{-1}\leq\theta_{i}\leq c_{+}N^{-1}$ for all nodes $i\in G$;

(d) [Bounded variables] $0\leq y(i)\leq c_{y}$ for all nodes $i\in G$;

(e) [Limited referrals] The maximum degree of $\mathbbm{T}$ is less than or equal to $c_{d}$.

Then, for any $\varepsilon,\varepsilon^{\prime}>0$, there exists a constant $c>0$ (not depending on $n,N$) such that, with probability $1-\varepsilon$ over the choice of $G$, the following holds. For any $(\mathbbm{T},P)$-walk on $G$ the PS estimator defined in (3.6) satisfies

$$|\hat{\mu}_{\mathrm{PS}}-\mu_{\text{true}}|\leq c\sqrt{\frac{\log n}{n}},$$

with probability $1-\varepsilon^{\prime}$.

A direct consequence of Theorem 4.1 is that the bias and standard deviation decay at rate $O(n^{-1/2})$ up to log factors. This does not hold for the traditional VH estimator [1], whose standard deviation can decay at a rate slower than $O(n^{-1/2})$ [5], as we also show by example in Appendix B.3. For the recent GLS-based estimators proposed in [6], it is shown that their standard deviation decays at rate $O(n^{-1/2})$ as $n$ goes to infinity for a fixed network size, but no finite-size guarantees are provided.

Assumptions (b) and (c) require the graph to be dense. In the following section, we show through simulations that the PS estimator also works well on sparse graphs.

5 Simulations

This section compares the PS estimator to the VH and fGLS estimators on simulated networks (in Section 5.1) as well as social networks collected by the National Longitudinal Study of Adolescent Health (Add Health Networks) (in Section 5.2), both with simulated RDS samples. In both cases, the PS estimator has smaller variation than the VH and fGLS estimators.

5.1 Simulated Networks

We simulated 100 random social networks from the DC-SBM with $10^{5}$ nodes, expected average degree $100$, and $K=2$ blocks of equal size. The stochastic block matrix $B$ was chosen proportional to

$$\begin{pmatrix}0.95&0.05\\0.05&0.95\end{pmatrix}.$$

We simulated the binary outcomes to be perfectly aligned with one of the block labels.

On each social network, we generated RDS samples by link tracing without replacement. We randomly sampled the seed proportionally to the node degree. Then, for each participant $\tau$ in the sample, we recruited $R_{\tau}\in\mathbbm{N}$ friends, where $R_{\tau}\overset{\text{iid}}{\sim}\text{Poi}(2)$. The recruiting process stopped when there were 1000 participants in the RDS sample. If it terminated before recruiting 1000 participants, then we restarted the recruiting process. We generated 200 different RDS samples on each network. For each RDS sample, we computed the VH, fGLS, and PS estimators. On each network, we computed the absolute bias, standard deviation, and RMSE of the 200 estimates of each type. In the simulations, we computed the fGLS estimator as in [6], which re-weights the outcome $Y$ to adjust for the sampling bias.
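The recruitment procedure described above can be sketched as follows. Several details are assumptions made for this illustration and are not specified in the text: participants recruit in breadth-first order, the number of referrals is capped by the number of not-yet-sampled neighbors, and the number of restarts is bounded by a hypothetical `max_tries` parameter. The toy network is far smaller than the simulated ones.

```python
import numpy as np
from collections import deque

def rds_sample(neighbors, n_target, rng, mean_referrals=2.0, max_tries=1000):
    """Link-tracing sample without replacement.

    neighbors[i] lists the neighbors of node i. Returns (nodes, parent), where
    parent[t] is the sample index of the referrer of participant t (-1 for the seed).
    """
    degrees = np.array([len(nb) for nb in neighbors], dtype=float)
    for _ in range(max_tries):
        # Seed drawn proportionally to node degree.
        seed = int(rng.choice(len(neighbors), p=degrees / degrees.sum()))
        nodes, parent, sampled = [seed], [-1], {seed}
        queue = deque([0])                      # sample indices waiting to recruit
        while queue and len(nodes) < n_target:
            t = queue.popleft()
            candidates = [j for j in neighbors[nodes[t]] if j not in sampled]
            # Poisson number of referrals, capped by availability and the target size.
            r = min(rng.poisson(mean_referrals), len(candidates), n_target - len(nodes))
            if r > 0:
                for j in rng.choice(candidates, size=r, replace=False):
                    sampled.add(int(j))
                    nodes.append(int(j))
                    parent.append(t)
                    queue.append(len(nodes) - 1)
        if len(nodes) == n_target:
            return nodes, parent                # otherwise recruitment died out: restart
    raise RuntimeError("recruitment died out in every attempt")

# Hypothetical small network: a random graph on 200 nodes.
rng = np.random.default_rng(0)
A = np.triu(rng.random((200, 200)) < 0.1, k=1)
A = A | A.T
neighbors = [np.flatnonzero(row).tolist() for row in A]
nodes, parent = rds_sample(neighbors, n_target=50, rng=rng)
print(len(nodes), len(set(nodes)))
```

The returned `parent` list is the sampling tree, which is exactly the input needed to count block-to-block referrals for the PS estimator.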

Figure 3 shows that the PS estimator has smaller variation than the VH and fGLS estimators in terms of absolute bias, standard deviation, and RMSE.

Several factors may affect the performance of the estimators, such as (1) bottlenecks in the social network, (2) the alignment of the block labels $z(i)$ with the variable of interest $y(i)$, and (3) the network density. We explored how these factors affected the performance of the estimators. Further explorations of other factors, including network size and sample size, are in Section A of the appendix. The simulations in Figures 4, 5, and 6 use the same setting as in Figure 3, except that the value of the corresponding factor varies.

Bottleneck

Bottlenecks exist when there are far fewer connections across blocks than within blocks. Recall that, in the DC-SBM, the stochastic block matrix $B$ gives the expected number of links between any two blocks. We simulated the stochastic block matrix such that,

[TABLE]

with $p+q=1$ for identification. We refer to the difference $p-q$ as the bottleneck strength. With a larger bottleneck strength, there are more connections within blocks and fewer connections across blocks. When there is no bottleneck (strength is zero), there is only one block in the network. Figure 4 shows that the PS estimator has smaller variation than the fGLS and VH estimators, especially when there exists a bottleneck. In particular, the PS estimator appears to reduce the seed bias and standard deviation caused by bottlenecks much better than the fGLS and VH estimators.

Alignment

We capture the alignment of the block labels and the variable of interest by the difference of the block-wise means of the variable of interest, i.e. $|\mu_{1}-\mu_{2}|$ with $K=2$ blocks. Figure 5 shows that the fGLS and PS estimators exhibit the largest improvement over the VH estimator when the block label perfectly aligns with the variable of interest (i.e., the alignment is $1$ ). The three estimators perform equally well when the block-wise means of the variable of interest are equal (i.e., the alignment is $0$ ). When the block label partially aligns with the variable of interest (i.e., the alignment is strictly between $0$ and $1$ ), the fGLS and VH estimators exhibit similar variation, but the PS estimator has smaller variation when the block-wise difference is over 0.4.

Network density

We use the expected average degree of the network to quantify the network density. Though Theorem 4.1 requires the networks to be dense enough, Figure 6 shows that the estimators perform similarly on sparse networks.

5.2 Add Health Networks

In this section, we consider RDS simulations obtained by tracing contacts in social networks collected in the National Longitudinal Study of Adolescent Health (Add Health Networks). This study collected a nationally representative sample of adolescents from grades 7 to 12 in the United States in the 1994-1995 school year. The sample covers 84 pairs of middle and high schools in which students nominated up to five male and five female friends in their middle or high school network ([13]). In this analysis, we symmetrized all contacts to create a social network, and we restricted each network to its largest connected component. These networks were previously studied in [14], [15], and [6].
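The two preprocessing steps (symmetrizing nominations and keeping the largest connected component) might look like the sketch below. It is an illustration under assumptions: nominations are given as a dictionary from each student to the students they named, self-nominations are dropped, and the function names are hypothetical.

```python
from collections import deque


def symmetrize(nominations):
    """Turn directed nominations {i: iterable of j} into undirected adjacency sets."""
    adj = {}
    for i, friends in nominations.items():
        adj.setdefault(i, set())
        for j in friends:
            if i == j:
                continue  # ignore self-nominations
            adj.setdefault(j, set()).add(i)
            adj[i].add(j)
    return adj


def largest_component(adj):
    """Restrict an undirected adjacency dict to its largest connected component."""
    seen, best = set(), set()
    for start in adj:
        if start in seen:
            continue
        component, queue = {start}, deque([start])
        while queue:  # breadth-first search from `start`
            v = queue.popleft()
            for w in adj[v]:
                if w not in component:
                    component.add(w)
                    queue.append(w)
        seen |= component
        if len(component) > len(best):
            best = component
    return {v: adj[v] & best for v in best}
```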

We restricted our analysis to the 25 Add Health networks with over 1000 nodes. On each network, we simulated 200 different RDS samples, each with 500 participants. On each RDS sample, we computed the VH, fGLS, and PS estimators. In the simulation, we sampled seed nodes with probability proportional to node degree. We computed the absolute bias, RMSE, and standard deviation of the estimators on each network. In the analysis we used the school label (middle school or high school) as the outcome and the grade label (7-12) as the block label.

The recruitment process was similar to that in Section 5.1, but without replacement, so each person could be recruited at most once. For each participant $\tau$ , if they had fewer than $R_{\tau}$ unrecruited friends, then we recruited all of their unrecruited friends.
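The without-replacement rule can be sketched as a single recruitment step. The helper name `recruit_without_replacement` is hypothetical, and choosing uniformly among the unrecruited friends when there are more than $R_{\tau}$ of them is an assumption of this sketch.

```python
import random


def recruit_without_replacement(adj, node, n_requested, recruited, rng):
    """Recruit up to n_requested friends of `node` who are not yet in the sample.

    `recruited` is the set of people already sampled; it is updated in place.
    """
    available = [f for f in adj[node] if f not in recruited]
    if len(available) <= n_requested:
        chosen = available  # too few unrecruited friends: take them all
    else:
        chosen = rng.sample(available, n_requested)
    recruited.update(chosen)
    return chosen
```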

Figure 7 shows the variation of the estimators. Overall, the PS estimator has substantially smaller variation than the fGLS and VH estimators.

6 Discussion

RDS has been widely used in studying marginalized populations, but the estimators derived from RDS samples have suffered from high variance. This is due to two related issues: (1) the complicated network dependence of the RDS samples, and (2) seed bias caused by bottlenecks. In this paper, we introduced post-stratification to RDS and provided a novel estimator. Our easy-to-compute PS estimator reduces seed bias. We derived some theoretical results for the PS estimator, showing its bias and standard deviation decay at $O(n^{-1/2})$ (up to log factors) under the degree-corrected stochastic block model. This is the first estimator with such guarantees. Though we require the networks to be dense in theory, we showed through simulations that the estimator performs similarly on sparse networks.

One future direction is how to select the block labels in practice. In [6], an approach for selecting block labels using eigenvalues of the block-wise transition matrix $\hat{Q}$ is proposed. Further work on this issue would help in applying the PS (and fGLS) estimators.

Appendix A More Simulations

In this section, we explore how network sizes and sample sizes affect the performances of RDS estimators. The simulation settings are the same as in Section 5.1. Figure 8 shows the estimators perform similarly with different sizes of networks. Figure 9 shows the RDS estimators have smaller variation with larger sample sizes.

Appendix B Proof of the main theorem

B.1 Notation

For each node $i\in V$ , we denote its neighborhood in the social network $G$ as $\mathcal{N}(i)=\left\{j\in V\,:\,\{i,j\}\in E\right\}$ and its neighborhood within block $k$ as $\mathcal{N}(i;k)=\left\{j\in V_{k}\,:\,\{i,j\}\in E\right\}.$ We denote by $d(i;k)=|\mathcal{N}(i;k)|$ the size of the latter. The degree of node $i$ is denoted $d_{i}=|\mathcal{N}(i)|$ and we have $d_{i}=\sum\limits_{k}d(i;k)$ .

While the RDS sampling procedure is a random walk on the social network $G$ , under a dense DC-SBM our analysis relies on establishing an approximation of the process by a “population-level” random walk on blocks. We define the block transition probability at node $i\in V$ by

[TABLE]

for any blocks $u,v$ and any sample $\tau\in\mathbb{T}$ . Recall that, for any sample $\tau\in\mathbbm{T}$ , we denote its parent as $\tau^{\prime}$ .

Under our assumptions, $B_{uv}$ is the expected number of edges between blocks $u\neq v$ ; indeed

[TABLE]

Recalling the matrix $Q=B/m$ , where $B$ is the matrix in the definition of the DC-SBM model (3.7) and $m=\mathbf{1}^{T}B\mathbf{1}$ , the population block transition probability is given by

[TABLE]

for any two blocks $u,v$ . We refer to

[TABLE]

as the population transition matrix on blocks. Recall that its unique stationary distribution is $\bm{\pi}^{B}=(\pi_{k}^{B})_{k}$ .
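A small numerical check of the population-level chain may help. The sketch assumes a symmetric block matrix $B$, transition probabilities $B_{uv}/B_{u\ast}$ (the form recalled in the proof of Claim B.5), and a candidate stationary distribution proportional to the row sums of $B$; the function name `block_chain` is hypothetical.

```python
def block_chain(B):
    """Return (P, pi): row-normalized transitions and row-sum-proportional weights."""
    k = len(B)
    row_sums = [sum(row) for row in B]
    m = sum(row_sums)  # total mass of B, i.e. 1^T B 1
    P = [[B[u][v] / row_sums[u] for v in range(k)] for u in range(k)]
    pi = [r / m for r in row_sums]
    return P, pi


def step(pi, P):
    """One step of the chain applied to a distribution: returns pi P."""
    k = len(P)
    return [sum(pi[u] * P[u][v] for u in range(k)) for v in range(k)]
```

For a symmetric $B$, `step(pi, P)` returns `pi` again, because $\pi_{u}P_{uv}=B_{uv}/m$ is symmetric in $u$ and $v$.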

For each block $k\in\{1,\dots,K\}$ , let $n_{k}=|\mathbbm{T}_{k}|$ . We also define the number of referrals from block $k$ to be

[TABLE]

For any two blocks $u,v\in\{1,\dots,K\}$ , we define the number of referrals between block $u$ and block $v$ as

[TABLE]

Note that $\hat{Q}_{uv}=n_{u^{\prime}v}/n$ . The elements of the estimated block transition matrix $\hat{P}^{B}$ in (3.4) can therefore be rewritten as $\hat{P}^{B}_{uv}=n_{u^{\prime}v}/n_{u^{\prime}}$ . We use $\hat{p}_{uv}$ to denote these quantities, i.e.,

[TABLE]

To summarize, for any blocks $u,v$ , the quantities $p_{uv}(i)$ , $\tilde{p}_{uv}$ and $\hat{p}_{uv}$ represent respectively the block transition probability at node $i\in G$ , the population block transition probability, and the estimated block transition probability.
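The estimated quantities $\hat{p}_{uv}=n_{u^{\prime}v}/n_{u^{\prime}}$ are simple ratios of referral counts. The sketch below assumes the sample is available as a list of (parent block, child block) pairs, one per non-seed participant, with blocks coded as integers starting at 0; the function name is hypothetical.

```python
def estimate_block_transitions(referrals, n_blocks):
    """Count referrals between blocks and row-normalize the counts.

    Returns (counts, P_hat), where counts[u][v] is the number of referrals
    from block u to block v and P_hat[u][v] = counts[u][v] / sum(counts[u]).
    """
    counts = [[0] * n_blocks for _ in range(n_blocks)]
    for u, v in referrals:
        counts[u][v] += 1
    P_hat = []
    for u in range(n_blocks):
        total = sum(counts[u])  # referrals made from block u
        # A block with no outgoing referrals has no defined estimate.
        P_hat.append([c / total if total else float("nan") for c in counts[u]])
    return counts, P_hat
```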

B.2 Proof

The proof of Theorem 4.1 follows from a series of claims. We begin with a sketch of the proof in this section.

1. Under the dense DC-SBM, the random walk mixes fast within each block (Claims B.2 and B.5). This plays a key role in estimating block-wise means, for which we use the VH estimator (Claims B.7, B.8, and B.10).

2. To estimate block proportions, we use the stationary distribution of the block-wise transition matrix, which is the main, non-trivial contribution of this work. Indeed, the standard empirical frequency gives an estimate with much larger variance (see Section B.3). Instead we estimate the transition matrix between blocks, which is a “more local” quantity in the sense that it is not affected strongly by the seed, and compute its stationary distribution. As a result, block-wise transition probabilities are highly concentrated around their true value under the Markov chain on the blocks; their stationary distributions are also close to each other (Claim B.3, Claim B.6).
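The second step of the sketch above, computing the stationary distribution of an estimated block transition matrix, can be illustrated with plain power iteration. This is one possible numerical method chosen for the example, not necessarily the one used by the authors, and it assumes the estimated chain is irreducible and aperiodic; the function name is hypothetical.

```python
def stationary_distribution(P, n_iter=10_000, tol=1e-12):
    """Approximate the stationary distribution of a row-stochastic matrix P."""
    k = len(P)
    pi = [1.0 / k] * k  # start from the uniform distribution
    for _ in range(n_iter):
        new = [sum(pi[u] * P[u][v] for u in range(k)) for v in range(k)]
        if max(abs(a - b) for a, b in zip(new, pi)) < tol:
            return new
        pi = new
    return pi
```

For a two-block matrix with rows $(2/3, 1/3)$ and $(1/4, 3/4)$, balancing the flow between the blocks gives $\pi_{0}/3=\pi_{1}/4$, so the stationary distribution is $(3/7, 4/7)$.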

Note that there are two sources of randomness, the social network $G$ and the $\mathbbm{T}$ -indexed random walk. Claims B.1-B.5 are concerned with the randomness of $G$ , while Claims B.6-B.10 deal with the random walk.

[Diagram: proof roadmap. $\hat{\mu}_{{k}\mathrm{VH}}$ is close to $\mu_{k}$ (Claim B.10); $\hat{\pi}_{k}^{B}$ is close to $\pi_{k}^{B}$ (Claim B.6); $H^{(\delta_{k})}$ is close to $\delta_{k}^{(B)}=N_{k}^{-1}B_{k\ast}$ (Claim B.9); combining the above, $\hat{\mu}_{\mathrm{mVH}}$ is close to $\mu_{\text{true}}$ .]

Throughout, $\varepsilon>0$ is as in the statement of the theorem.

High-probability properties of the social network

We first use standard concentration inequalities to control the degrees of $G$ . Recall that under the DC-SBM the expectation of $d(i;v)$ is $\theta_{i}B_{uv}$ .

Claim B.1 (Degrees are concentrated).

Under the DC-SBM, there exists $c_{1}>0$ (depending on $\varepsilon$ but not on $N$ ) such that, with probability $1-\varepsilon/2$ over the choice of $G$ , the following event holds: simultaneously for all pairs of blocks $u,v$ and all nodes $i\in V_{u}$ ,

[TABLE]

We let $\mathcal{E}_{\mathrm{D}}$ be the event in the claim.

Proof of Claim B.1.

Fix blocks $u,v$ and $i\in V_{u}$ . Under the DC-SBM, each node $j$ in block $v$ connects with node $i$ independently with probability $\theta_{i}\theta_{j}B_{uv}$ . Hence we can write $d(i;v)$ as a sum of $N_{v}$ independent indicators, whose overall expectation is $\theta_{i}B_{uv}$ , where we used that $\sum\limits_{j\in V_{v}}\theta_{j}=1$ . By Hoeffding’s inequality [16], for any constant $c_{1}^{\prime}>1$ , by choosing $c_{1}>0$ large enough

[TABLE]

where we used that $N_{v}=\Theta(N)$ in the second inequality. Taking a union bound over $u$ , $v$ and $i$ gives

[TABLE]

simultaneously for all $u,v$ and all $i\in V_{u}$ with probability at least $1-K^{2}\cdot N\cdot N^{-c_{1}^{\prime}}$ . Dividing by $\theta_{i}B_{uv}$ and using $\theta_{i}=\Theta(N^{-1})$ for any node $i$ and $B_{uw}=\Theta(N^{2})$ for any blocks $u,w$ , gives the result for appropriately chosen $c_{1},c_{1}^{\prime}>0$ . ∎
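For readers who want the arithmetic behind the Hoeffding step in the proof above, the following worked bound shows how a deviation of order $\sqrt{N\log N}$ yields a polynomially small failure probability. It is a sketch using the standard two-sided form of Hoeffding's inequality for variables in $[0,1]$ and the fact that $N_{v}\leq N$; the constants are illustrative and need not match those in the omitted displays.

```latex
% Hoeffding's inequality for S = X_1 + ... + X_M, with the X_j independent
% and taking values in [0,1]:
\[
  \mathbb{P}\bigl(|S-\mathbb{E}S|\ge t\bigr)
  \le 2\exp\!\left(-\frac{2t^{2}}{M}\right).
\]
% Apply it with S = d(i;v), M = N_v \le N and t = c_1 \sqrt{N \log N}:
\[
  \mathbb{P}\Bigl(\bigl|d(i;v)-\theta_{i}B_{uv}\bigr|\ge c_{1}\sqrt{N\log N}\Bigr)
  \le 2\exp\!\left(-\frac{2c_{1}^{2}N\log N}{N_{v}}\right)
  \le 2N^{-2c_{1}^{2}}.
\]
% Union bound over at most K^2 N triples (u, v, i):
\[
  \mathbb{P}\bigl(\text{the bound fails for some } (u,v,i)\bigr)
  \le 2K^{2}N^{1-2c_{1}^{2}},
\]
% which tends to zero as soon as 2 c_1^2 > 1.
```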

The following claim will be useful to control the mixing rate within a block. For any blocks $u,w,v$ and two distinct nodes $i\in V_{u}$ , $j\in V_{v}$ , we consider the number of two-edge paths from $i$ to $j$ in $G$ whose middle vertex is in block $w$ , weighted by a quantity related to the expected degree of the middle vertex under the DC-SBM:

[TABLE]

Claim B.2 (Two-edge paths).

There exists $c_{2}>0$ such that, with probability $1-\varepsilon/2$ over the choice of $G$ , the following holds: simultaneously for all blocks $u,w,v$ , and all $i\in V_{u}$ , $j\in V_{v}$ with $i\neq j$ ,

[TABLE]

We let $\mathcal{E}_{\mathrm{D},2}$ be the event in the claim.

Proof of Claim B.2.

Fix blocks $u,w,v$ , and nodes $i\not=j$ . By Claim B.1, we can choose $c_{1}^{\prime\prime}$ large enough such that

[TABLE]

for some $c_{1}^{\prime\prime\prime}>1$ .

We treat the case where all blocks are distinct. The other cases are similar. Let $\mathcal{E}_{i,w}$ be the event that $\left|d(i;w)-\theta_{i}B_{uw}\right|\leq c_{1}^{\prime\prime}\sqrt{N\log N}$ and note that $j\not\in\mathcal{N}(i;w)$ . Conditioned on $\mathcal{E}_{i,w}$ , each of the $d(i;w)$ edges incident to $i$ and block $w$ has a corresponding endpoint $k\in V_{w}$ which itself connects to $j$ —independently of all other such endpoints—with probability $\theta_{k}\theta_{j}B_{wv}$ . Since $d_{\theta}^{(2)}(i,j;w)$ weighs this last edge by $(N\theta_{k})^{-1}$ , its expected contribution is $N^{-1}\theta_{j}B_{wv}$ . Moreover, the $d(i;w)$ possibly non-zero terms in the sum defining $d_{\theta}^{(2)}(i,j;w)$ are uniformly bounded by a constant by the assumption that $\theta_{i}=\Theta(N^{-1})$ . Hence, we can apply Hoeffding’s inequality again, and by choosing $c_{2}^{\prime}>1$ large enough we have

[TABLE]

for some $c_{2}^{\prime\prime}>2$ , where we used (B.7) in the second inequality and we used that $\theta_{i}=\Theta(N^{-1})$ and $B_{uw}=\Theta(N^{2})$ in the last inequality.

Combining (B.7) and (B.8), and taking a union bound over $u$ , $w$ , $v$ , $i$ and $j$ gives

[TABLE]

for a constant $c_{2}^{\prime\prime\prime}>0$ chosen large enough. Dividing by $N^{-1}\theta_{i}\theta_{j}B_{uw}B_{wv}$ and using again that $\theta_{i}=\Theta(N^{-1})$ and $B_{uv}=\Theta(N^{2})$ gives the result, for an appropriately chosen constant $c_{2}>0$ . ∎

Properties of the walk

Before proving our main theorem, we will also need some results about the behavior of simple random walk on the network. We first show that, from any $i\in V_{u}$ , the probability of jumping to a vertex in block $v$ is close to the population-level probability $p_{uv}$ .

Claim B.3 (Transitions between blocks).

There exists $c_{3}>0$ such that, conditioned on $\mathcal{E}_{\mathrm{D}}$ , for any blocks $u,v$ and any $i\in V_{u}$

[TABLE]

Proof.

Fix $u,v$ and $i\in V_{u}$ . Recall

[TABLE]

Under $\mathcal{E}_{\mathrm{D}}$ ,

[TABLE]

for a constant $c_{3}>0$ large enough. A similar inequality holds in the opposite direction. ∎

The previous claim also implies that any step has a probability bounded away from $0$ of landing in any block.

Claim B.4 (Landing in a block).

There is $p_{*}\in(0,1)$ such that, conditioned on $\mathcal{E}_{\mathrm{D}}$ , for any blocks $u,v$ and any $i\in V_{u}$ , we have

[TABLE]

provided $N$ is larger than a sufficiently large constant.

Proof.

Let

[TABLE]

The result then follows from Claim B.3. ∎

We next show that two steps of the walk are enough to mix within a block.

Claim B.5 (Two steps suffice for within-block mixing).

For each sample $\tau\in\mathbbm{T}$ , we denote its grandchildren as $\mathcal{C}^{(2)}(\tau)$ . For a $(\mathbbm{T},P)$ -walk on $G$ , there exists $c_{4}>0$ such that, on $\mathcal{E}_{\mathrm{D}}$ and $\mathcal{E}_{\mathrm{D},2}$ , for all $\tau$ and $\tau_{**}\in\mathcal{C}^{(2)}(\tau)$

[TABLE]

for all blocks $u,v$ , and nodes $i\in V_{u}$ , $j\in V_{v}$ with $i\neq j$ .

Proof.

To simplify notation, the conditioning on $G$ is implicit throughout the proof. Assume $\mathcal{E}_{\mathrm{D}}$ and $\mathcal{E}_{\mathrm{D},2}$ hold. Fix blocks $u,v$ as well as nodes $i\in V_{u}$ and $j\in V_{v}$ with $i\neq j$ . Let $\tau_{**}\in\mathcal{C}^{(2)}(\tau)$ and let $\tau_{*}$ be the parent of $\tau_{**}$ on $\mathbbm{T}$ , which is necessarily a child of $\tau$ . Then, for some constants $c_{4}^{\prime},c_{4}^{\prime\prime}>0$ , using $\mathcal{E}_{\mathrm{D}}$ and $\mathcal{E}_{\mathrm{D},2}$

[TABLE]

where recall that $p_{uv}=B_{uv}/B_{u\ast}$ . A similar inequality holds in the other direction. That implies the claim. ∎

Concentration of key estimates

The PS estimator defined in (3.1) relies on three key estimates, whose concentration we establish now.

We begin with the concentration of $\hat{\pi}_{k}^{B}$ by showing that our estimates of block transition probabilities are concentrated, which boils down to proving that the $\hat{p}_{uv}$ ’s are concentrated. Recall that Claim B.3 implies that the block transition probabilities are concentrated at each $i$ , i.e., the $p_{uv}(i)$ ’s are concentrated. Proving that the estimate $\hat{p}_{uv}=n_{u^{\prime}v}/n_{u^{\prime}}$ itself is concentrated requires an argument. Indeed, as shown in Section B.3 below, both the numerator and denominator of this estimator in general may have variance asymptotically much greater than $1/n$ . Instead, we use the Markovian structure of the model to control the deviation of $\hat{p}_{uv}$ .

Claim B.6 (Concentration of block-wise steady-state probability estimates).

Conditioned on $G$ and $\mathcal{E}_{\mathrm{D}}$ , there exists $c_{5}>0$ such that, for any block $k$ , with probability at least $1-\varepsilon^{\prime}/2$ ,

[TABLE]

Recall that $\bm{\pi}^{B}$ was defined in (B.3).

Proof.

Throughout this proof, we implicitly condition on $G$ and assume that $\mathcal{E}_{\mathrm{D}}$ (from Claim B.1) holds. We let $\tau_{0},\ldots,\tau_{n-1}$ be a topological ordering of the vertices of $\mathbbm{T}$ , i.e., an ordering such that: if $\tau_{i}$ is an ancestor of $\tau_{j}$ , then $i<j$ . For a fixed $G$ , we let $\mathcal{F}_{0},\ldots,\mathcal{F}_{n-1}$ be the corresponding filtration, i.e.,

[TABLE]

Recall that $\tau^{\prime}$ is the parent of $\tau\neq\tau_{0}$ . The proof relies on three sub-claims:

1. Deviation of $n_{u^{\prime}v}$ : For $u,v$ and $j=1,\ldots,n-1$ , let

[TABLE]

where recall that $z(i)$ is the block of $i$ . Note that

[TABLE]

We consider the process

[TABLE]

with $W_{0}=0$ . We claim that $\{W_{t}\}_{t}$ is a martingale with bounded increments. Indeed, by the ordering of the samples, $X_{\tau_{j}^{\prime}}\in\mathcal{F}_{j}$ since $\tau_{j}^{\prime}=\tau_{s}$ for some $s<j$ . Hence $I_{j}-\mathbb{E}[I_{j}\,|\,\mathcal{F}_{j-1}]\in\mathcal{F}_{t}$ for all $j\leq t$ . So $W_{t}\in\mathcal{F}_{t}$ . Moreover, following a standard calculation,

[TABLE]

Finally, observe that by definition

[TABLE]

By the Azuma-Hoeffding inequality (see e.g. [17]), for a constant $c_{5}>0$ large enough

[TABLE]

2. Deviation of $\sum\limits_{j=1}^{n-1}\mathbb{E}[I_{j}\,|\,\mathcal{F}_{j-1}]$ : Next, we bound

[TABLE]

where we use the Markov property of the walk indexed by $\mathbbm{T}$ . By Claim B.3, for all $i\in V_{u}$ ,

[TABLE]

Combining (B.10) and (B.11), we get

[TABLE]

where we used that $n\leq N$ and that $x/\ln x$ is non-decreasing for $x\geq e$ .

3. Lower bound on $n_{u^{\prime}}$ : Let $n_{\mathrm{in}}$ be the number of internal vertices in $\mathbbm{T}$ . Because each leaf has a parent that is an internal vertex and $\mathbbm{T}$ has maximum degree $d_{\mathrm{max}}\leq c_{d}$ for some constant $c_{d}>0$ , it follows that $n_{\mathrm{in}}=\Theta(n)$ . Moreover, by Claim B.4, the state of each internal vertex of $\mathbbm{T}$ (except the root) has probability at least $p_{*}$ of coming from block $u$ , independently of all other $X_{\tau}$ ’s. As a result, $n_{u^{\prime}}$ stochastically dominates a binomial random variable with $n_{\mathrm{in}}-1$ trials and probability of success $p_{*}$ . By Hoeffding’s inequality we therefore have for a constant $c_{6}>0$ large enough that

[TABLE]

Together with $n_{\mathrm{in}}=\Theta(n)$ , that implies that for some constant $c_{6}^{\prime}>0$

[TABLE]

Combining (B.9), (B.12), and (B.13), with probability at least $1-\varepsilon^{\prime}/2$ , for any blocks $u,v$ , there exists some constant $c_{6}^{\prime\prime}>0$ such that

[TABLE]

Recall that the stationary distribution of $P^{B}$ is

[TABLE]

for any $k\in\{1,\dots,K\}$ and that

[TABLE]

Then, there exists some constant $c_{6}^{\prime\prime\prime}>0$ , such that

[TABLE]

Indeed,

[TABLE]

for large enough $c_{6}^{\prime\prime\prime}$ , and similarly in the other direction. The second line is from (B.15) and (B.16), while the fourth line is from (B.14). ∎
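The martingale step in the proof above rests on the Azuma-Hoeffding inequality. As a worked illustration, the bound below uses the standard form of the inequality with increments bounded by $1$ (each increment is an indicator minus its conditional expectation) and, purely for illustration, a deviation level $\lambda=c\sqrt{n\log n}$; the exact deviation level and constants in the omitted displays may differ.

```latex
% Azuma-Hoeffding: if (W_t)_{t=0}^{m} is a martingale with
% |W_t - W_{t-1}| \le b for all t, then for every \lambda > 0
\[
  \mathbb{P}\bigl(|W_{m}-W_{0}|\ge\lambda\bigr)
  \le 2\exp\!\left(-\frac{\lambda^{2}}{2mb^{2}}\right).
\]
% With m = n - 1 steps, b = 1 and \lambda = c\sqrt{n\log n}:
\[
  \mathbb{P}\bigl(|W_{n-1}|\ge c\sqrt{n\log n}\bigr)
  \le 2\exp\!\left(-\frac{c^{2}\,n\log n}{2(n-1)}\right)
  \le 2n^{-c^{2}/2}.
\]
% Dividing by n shows that n_{u'v}/n deviates from the normalized sum of
% conditional expectations by at most c\sqrt{\log n / n} on this event.
```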

We then evaluate the deviation of

[TABLE]

Recall that, for any block $k$ , the population block-wise average is

[TABLE]

Before showing that our block-wise estimator $\hat{\mu}_{{k}\mathrm{VH}}$ is close to $\mu_{k}$ , we first look at a related quantity, $\hat{\mu}_{{k},\mathrm{w}}$ below, which serves as a “bridge.” We define the weighted block-wise average as

[TABLE]

Using an argument similar to that in Claim B.6, we show in Claim B.7 that $\hat{\mu}_{{k},\mathrm{w}}$ is concentrated for each block $k$ . We then show in Claim B.10 that $\hat{\mu}_{{k},\mathrm{w}}$ is close to $\hat{\mu}_{{k}\mathrm{VH}}$ . As a result, we will have established that $\hat{\mu}_{{k}\mathrm{VH}}$ is close to $\mu_{k}$ .

[Diagram: $\hat{\mu}_{{k}\mathrm{VH}}$ is close to $\hat{\mu}_{{k},\mathrm{w}}$ (Claim B.10), which is close to $\mu_{k}$ (Claim B.7).]

Claim B.7 (Concentration of block-wise sample averages weighted by degrees).

Conditioned on $G$ , $\mathcal{E}_{\mathrm{D}}$ and $\mathcal{E}_{\mathrm{D},2}$ , there exists $c_{7}>0$ such that, with probability at least $1-\varepsilon^{\prime}/4$ , for any block $k$

[TABLE]

Proof.

Because the structure of the proof is similar to that of Claim B.6, we only sketch it here. We also make use of Claim B.5, which shows that simple random walk on $G$ mixes well within blocks in two steps. Because of the latter, we control separately the odd and even levels of $\mathbbm{T}$ . Let $\nu_{1},\nu_{2},\ldots,\nu_{n^{(\mathrm{e})}}$ be the vertices of $\mathbbm{T}$ whose graph distance to the root is even, including the root $\nu_{1}=\tau_{0}$ , in a topological ordering. Let $\mathcal{C}^{(2)}(\nu)$ be the grand-children of $\nu$ in $\mathbbm{T}$ . Let $\mathcal{G}_{0}=\sigma(X_{\nu_{1}})=\sigma(X_{\tau_{0}})$ and for $j\geq 1$

[TABLE]

For each node $X_{\nu}\in V_{k}$ , define

[TABLE]

Fix block $k$ and let

[TABLE]

and

[TABLE]

where note that the last sum excludes the root. Following the proof of Claim B.6, we note that the partial sums

[TABLE]

form a martingale indexed by $J$ with increments satisfying

[TABLE]

where we used that $\mathbbm{T}$ has maximum degree $\leq c_{d}$ and $0\leq y(x)\leq c_{y}$ by assumption. Hence, arguing as in Step 1 of Claim B.6, we get that with probability at least $1-\varepsilon^{\prime}/20K$ for all $k$

[TABLE]

Moreover, let $\nu\in\mathcal{C}^{(2)}(\nu_{j})$ and notice that by construction $X_{\nu_{j}}\in\mathcal{G}_{j-1}$ . Hence, by Claim B.5,

[TABLE]

where recall that we condition on $G$ . Similarly in the opposite direction. So, arguing as in Step 2 of Claim B.6, for some large enough $c_{7}^{\prime\prime}>0$ ,

[TABLE]

where we used $n\leq N$ .

In addition, we argue as in Step 3 of Claim B.6. Because each node with odd distance to the root has a parent with even distance to the root, and $\mathbbm{T}$ has maximum degree $\leq c_{d}$ , it follows that $n^{(e)}=\Theta(n)$ . Moreover, by Claim B.4, the state of each internal vertex of $\mathbbm{T}$ (except the root) has probability at least $p_{*}$ of coming from block $k$ , independently of all other $X_{\tau}$ ’s. As a result, $n^{(\mathrm{e})}_{k}$ stochastically dominates a binomial random variable with $n^{(e)}-1$ trials and probability of success $p_{*}$ . By Hoeffding’s inequality we therefore have for a constant $c_{8}>0$ large enough that

[TABLE]

Together with $n^{(e)}=\Theta(n)$ , that implies that with probability at least $1-\varepsilon^{\prime}/10$ , for all blocks $k$ and some constant $c_{8}^{\prime}>0$ ,

[TABLE]

Finally, following the proof of Claim B.6 once again, we also get that with probability at least $1-\varepsilon^{\prime}/10$ for all $k,$

[TABLE]

for some constant $c_{8}^{\prime\prime}>0$ . Combining (B.18), (B.19), and (B.21), with probability at least $1-\varepsilon^{\prime}/5$

[TABLE]

for some constant $c_{8}^{\prime\prime\prime}>0$ .

The same holds for the odd levels. Together with (B.20) and a similar inequality for odd levels (and the fact that the first two levels of $\mathbbm{T}$ have negligible effect asymptotically), we get the claim. ∎

By replacing $y(X_{\tau})$ by 1 in the proof of Claim B.7, we can also derive the following.

Claim B.8.

Conditioned on $G$ , $\mathcal{E}_{\mathrm{D}}$ and $\mathcal{E}_{\mathrm{D},2}$ , there exists $c_{9}>0$ such that, with probability at least $1-\varepsilon^{\prime}/4$ , for any block $k$

[TABLE]

Using Claims B.1 and B.8, we derive the deviation of the block-wise harmonic average degrees. Recall, for any block $k$ , the block population mean degree is

[TABLE]

and the block-wise harmonic average degree is

[TABLE]

Claim B.9 (Concentration of block-wise harmonic average of degrees).

Conditioned on $G$ , $\mathcal{E}_{\mathrm{D}}$ and $\mathcal{E}_{\mathrm{D},2}$ , there exists $c_{10}>0$ such that, with probability at least $1-\varepsilon^{\prime}/4$ , for any block $k$ ,

[TABLE]

Proof.

Conditioned on $G$ , $\mathcal{E}_{\mathrm{D}}$ and $\mathcal{E}_{\mathrm{D},2}$ , under the DC-SBM,

[TABLE]

for some large enough constant $c_{10}^{\prime}>0$ . The first inequality is from Claim B.1, which holds with probability $1-\varepsilon^{\prime}/4$ . The second inequality is from Claim B.8. A similar inequality holds for the opposite direction. Thus,

[TABLE]

Thus,

[TABLE]

for some large enough constant $c_{10}>0$ . By definition of $\delta^{B}_{k}$ we are done. ∎

Directly from Claim B.9, we show that $\hat{\mu}_{{k},\mathrm{w}}$ is close to $\hat{\mu}_{{k}\mathrm{VH}}$ for each block $k$ in the following claim.

Claim B.10.

Conditioned on $G$ , $\mathcal{E}_{\mathrm{D}}$ and $\mathcal{E}_{\mathrm{D},2}$ , there exists $c_{11}>0$ such that, with probability at least $1-\varepsilon^{\prime}/2$ , for any block $k$ ,

[TABLE]

Proof.

Conditioned on $G$ , $\mathcal{E}_{\mathrm{D}}$ and $\mathcal{E}_{\mathrm{D},2}$ , under the DC-SBM, Claims B.9 and B.7 hold simultaneously with probability $1-\varepsilon^{\prime}/2$ . Then

[TABLE]

for some large enough constant $c_{11}^{\prime}>0$ . The first inequality is from Claim B.9, while the second inequality is from Claim B.1. A similar bound holds for the opposite direction. Combining with Claim B.7,

[TABLE]

for some large enough constant $c_{11}>0$ . ∎

Putting everything together

Finally, we prove the main result.

Proof of Theorem 4.1.

By Claims B.1 and B.2, events $\mathcal{E}_{\mathrm{D}}$ and $\mathcal{E}_{\mathrm{D},2}$ hold with probability at least $1-\varepsilon$ . Under those events, the bounds in Claims B.6 and B.9 hold with probability $1-\varepsilon^{\prime}$ , so that

[TABLE]

for some large enough $c_{12}^{\prime}>0$ . Similarly for the other direction. Then, using Claim B.10,

[TABLE]

for some constant $c_{12}>0$ . Similarly for the other direction. Thus, there exists constant $c>0$ such that

[TABLE]

∎

B.3 A simple instance showing that the variance of the VH estimator converges slower than $O(n^{-1})$

The following example shows that, in general, the Volz-Heckathorn estimator, i.e.,

[TABLE]

has a variance asymptotically worse than $1/n$ on a two-block stochastic block model. Recall that $z(x)$ is the block of $x$ .
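For reference, the Volz-Heckathorn estimator is an inverse-degree-weighted sample mean. The sketch below assumes the sampled outcomes $y(X_{\tau})$ and the corresponding degrees are given as two parallel lists with strictly positive degrees; the function name is hypothetical.

```python
def vh_estimator(outcomes, degrees):
    """Inverse-degree-weighted mean: sum(y / d) / sum(1 / d)."""
    weights = [1.0 / d for d in degrees]
    weighted_sum = sum(w * y for w, y in zip(weights, outcomes))
    return weighted_sum / sum(weights)
```

The weights compensate for the fact that a random walk visits a node with probability proportional to its degree, so high-degree participants are down-weighted.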

Theorem B.11 (Negative example).

Let $K=2$ and denote the blocks by $\{0,1\}$ . Let $N_{0}=N_{1}=N/2$ , $B_{01}=B_{10}=1-B_{00}=1-B_{11}=pN^{2}$ where $p\in(0,1/2)$ , $y(x)=z(x)$ for all $x\in V$ . Let $x_{0}\in V$ be chosen uniformly at random. Let $\mathbbm{T}$ be a complete $(\alpha-1)$ -ary tree. Assume that $N\gg n^{2+\gamma}$ for some $\gamma>0$ and that

[TABLE]

Then, with probability at least $1/2$ over the network,

[TABLE]

for some $\zeta>0$ .

Proof.

By Claim B.1, the event $\mathcal{E}_{\mathrm{D}}$ occurs with probability at least $1/2$ . Therefore, by the conditional variance formula,

[TABLE]

By symmetry, $\delta^{B}_{0}=\delta^{B}_{1}=N/2$ . Hence, on $\mathcal{E}_{\mathrm{D}}$ , we have further that

[TABLE]

by our assumption on $N$ , where we used that $y(x)\in[0,1]$ for all $x$ . To simplify notation, in the rest of the proof, we implicitly condition on $G$ and $\mathcal{E}_{\mathrm{D}}$ .

The population-level chain satisfies

[TABLE]

Let $(\tilde{f}_{\tau})_{\tau\in\mathbbm{T}}$ be a Markov chain on $\{0,1\}$ indexed by $\mathbbm{T}$ with transition probabilities $(p_{bu})_{bu\in\{0,1\}}$ . By Claim B.3, on $\mathcal{E}_{\mathrm{D}}$ , we can couple $(y(X_{\tau}))_{\tau}$ and $(\tilde{f}_{\tau})_{\tau}$ except with probability $O(n\sqrt{\log N/N})=o(1)$ , an event we denote by $\tilde{\mathcal{E}}$ . This is because, for each of the $n-1$ transitions, there can only be a difference in probability of $O(\sqrt{\log N/N})$ . Hence, by the conditional variance formula again,

[TABLE]

To simplify notation, in the rest of the proof, we implicitly condition on $\tilde{\mathcal{E}}$ .

Define

[TABLE]

and notice that, by translation,

[TABLE]

and that $\tilde{g}_{\tau}$ is centered under $\bm{\pi}^{B}$ . Under $(p_{bu})_{bu\in\{0,1\}}$ , the function $(-1,+1)$ is a right-eigenvector with eigenvalue

[TABLE]

Hence, for any $\tau,\tau^{\prime}\in\mathbbm{T}$ at graph distance $\eta$ , it holds that

[TABLE]

and

[TABLE]

where we used that $\tilde{g}_{\tau}^{2}=1$ . Let $\mathcal{L}$ be the leaves of $\mathbbm{T}$ . Because the samples $(\tilde{g}_{\tau})_{\tau\in\mathbbm{T}}$ are positively correlated by the above calculation and $|\mathcal{L}|=\Omega(n)$ , we have further that

[TABLE]

Finally, by symmetry and the conditional variance formula once more, recalling that $\tau_{0}$ is the root of $\mathbbm{T}$ we have

[TABLE]

with $\zeta=1+\log_{2}\theta^{2}=\log_{2}(2\theta^{2})>0$ by (B.22). Combining the latter with (B.23), (B.24), (B.25), (B.26), and (B.27) gives the result. ∎
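The eigenvector step in the proof above can be checked by hand. The computation below is a sketch that assumes the population-level chain on $\{0,1\}$ is symmetric with off-diagonal probability $p$, which is how the two-block setting of Theorem B.11 is read here; under that assumption the eigenvalue is $\theta=1-2p$, and the $\eta$-step conditional expectation of $\tilde{g}$ along a path is $\theta^{\eta}$ times its starting value.

```latex
% Assumed population-level transition matrix on {0,1}:
\[
  P=\begin{pmatrix}1-p & p\\ p & 1-p\end{pmatrix},
  \qquad p\in(0,1/2).
\]
% The function (-1,+1) is a right-eigenvector:
\[
  P\begin{pmatrix}-1\\ +1\end{pmatrix}
  =\begin{pmatrix}-(1-p)+p\\ -p+(1-p)\end{pmatrix}
  =(1-2p)\begin{pmatrix}-1\\ +1\end{pmatrix},
  \qquad\text{so }\theta=1-2p .
\]
% Iterating over \eta steps of the chain:
\[
  P^{\eta}\begin{pmatrix}-1\\ +1\end{pmatrix}
  =\theta^{\eta}\begin{pmatrix}-1\\ +1\end{pmatrix}.
\]
% The exponent \zeta = \log_2(2\theta^2) is positive exactly when
% \theta^2 > 1/2.
```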

Bibliography

[1] Douglas D Heckathorn. Respondent-driven sampling: a new approach to the study of hidden populations. Social problems, 44(2):174–199, 1997.
[2] Mohsen Malekinejad, Lisa Grazina Johnston, Carl Kendall, Ligia Regina Franco Sansigolo Kerr, Marina Raven Rifkin, and George W Rutherford. Using respondent-driven sampling methodology for HIV biological and behavioral surveillance in international settings: a systematic review. AIDS and Behavior, 12(1):105–130, 2008.
[3] LG Johnston. Introduction to HIV/AIDS and sexually transmitted infection surveillance: Module 4: Introduction to respondent driven sampling. World Health Organization, 2013.
[4] Richard G White, Avi J Hakim, Matthew J Salganik, Michael W Spiller, Lisa G Johnston, Ligia Kerr, Carl Kendall, Amy Drake, David Wilson, Kate Orroth, et al. Strengthening the reporting of observational studies in epidemiology for respondent-driven sampling studies: “STROBE-RDS” statement. Journal of clinical epidemiology, 68(12):1463–1471, 2015.
[5] Karl Rohe. Network driven sampling; a critical threshold for design effects. arXiv preprint arXiv:1505.05461, 2015.
[6] Sebastien Roch and Karl Rohe. Generalized least squares can overcome the critical threshold in respondent-driven sampling. arXiv preprint arXiv:1708.04999, 2017.
[7] Sharad Goel and Matthew J Salganik. Respondent-driven sampling as Markov chain Monte Carlo. Statistics in medicine, 28(17):2202–2229, 2009.
[8] Matthew J Salganik and Douglas D Heckathorn. Sampling and estimation in hidden populations using respondent-driven sampling. Sociological methodology, 34(1):193–240, 2004.
