Respondent driven sampling and sparse graph convergence
Siva Athreya, Adrian Röllin

TL;DR
This paper studies a respondent-driven sampling method modeled by a graphon, demonstrating that under certain conditions, the resulting sparse graphs converge to the graphon using advanced probabilistic tools.
Contribution
It introduces a novel approach to analyze respondent-driven sampling via graphon convergence and develops a specific clumping procedure for sparse graph construction.
Findings
Sparse graphs constructed via the method converge to the graphon in the cut-metric.
Stationarity of the vertex-sets is key for convergence.
Uses concentration inequalities and Stein-Chen method for analysis.
Abstract
We consider a particular respondent-driven sampling procedure governed by a graphon. By a specific clumping procedure of the sampled vertices we construct a sequence of sparse graphs. If the sequence of the vertex-sets is stationary then the sequence of sparse graphs converges to the governing graphon in the cut-metric. The tools used are a concentration inequality for Markov chains and the Stein-Chen method.
Siva Athreya: Indian Statistical Institute, 8th Mile Mysore Road, Bangalore, 560059 India. Email: [email protected]
Adrian Röllin: Department of Statistics and Applied Probability, National University of Singapore, 6 Science Drive 2, Singapore 117546. Email: [email protected]
2000 Mathematics Subject Classification. Primary 05C80, 60J20; Secondary 37A30, 9482.
Keywords. Respondent Driven Sampling; random graph; sparse graph limits; dense graph limits.
1 Introduction
Respondent Driven Sampling (RDS), popularised by Heckathorn (1997), is a method to sample from hard-to-reach populations, such as drug users, men who have sex with men (MSM) and people with HIV, and it is routinely used in studies involving such populations. The sampling procedure is subject to various biases, one of which is a bias towards individuals with higher degrees, as these are more likely to appear in the sample.
How this bias affects the network as a whole has been described by Athreya and Röllin (2016) in the context of dense graph limits. The model considered there is defined in terms of a two-step procedure. First, vertices are sampled according to an ergodic process (the important point to note is that the vertices need not be sampled independently of each other). Second, edges between vertices are sampled independently of each other, where the probability of an edge is determined via a graphon representing the underlying network.
Dense graphs are at one extreme of graph sequences. These are graphs on $n$ vertices with the number of edges being of order $n^2$, which is far more than what is observed in real-world networks. At the opposite end are sequences of graphs with bounded (average) degree and consequently having order $n$ edges. These have a separate limiting theory which is not quite applicable to many real-world networks. There is a class of graph sequences between these two extremes, called sparse graphs: these are graphs for which the average degree grows with the number of vertices, but only at sub-linear speed.
The purpose of this note is to extend the work of Athreya and Röllin (2016) to sparse graphs, and to consider more realistic models of sampling. Since RDS data typically comes in the form of trees, the actual graphs are those with average degrees remaining bounded as the number of nodes grows. We propose a model where “close enough” participants are “clumped” together so that the average degree now grows in $n$. Our main result is that the random sparse graph sequence obtained through a specific respondent-driven sampling procedure converges almost surely to the graphon underlying the network in the cut-metric, provided the sequence of the vertex-sets is stationary.
The method of proof in this article is entirely different from that of Athreya and Röllin (2016). This is mainly due to the fact that, unlike in the dense case, subgraph counts no longer characterise graph convergence. We compare our random sparse graph sequence with an “expected” (deterministic) sparse graph via a concentration inequality. We then use the Stein-Chen method to compare this deterministic sparse graph to a sequence of graphs which are close to the graphon of the underlying network.
The rest of the article is organised as follows. In Section 2 we provide a brief introduction to sparse graph convergence. In Section 3 we describe our model and state our main result (Theorem 3.1). We present the proof of the main result in Section 4. We then conclude with some remarks in a final discussion section on Respondent Driven Sampling and Dense graph sequences.
Acknowledgements:
Adrian Röllin was supported by NUS Research Grant R-155-000-167-112. Siva Athreya was supported by a CPDA grant from the Indian Statistical Institute and an ISF-UGC project grant.
2 Sparse graph convergence
This section is a very brief introduction to sparse graph convergence. The study of sparse graph convergence was initiated by Bollobás and Riordan (2009), and the theory was then established in Borgs et al. (2014a) and Borgs et al. (2014b). We present the minimal amount of material necessary to formulate and prove our main result. We first define weighted graphs, followed by the definition of a graphon, and conclude with a brief discussion of a convergence result.
Weighted graphs.
Consider a graph $G=(V,E)$, given by its set of vertices $V$ and set of edges $E$. A (edge-)weighted graph is simply a graph which has, in addition, a weight function $w\colon E\to[0,\infty)$, where, for each $e\in E$, we interpret the value $w(e)$ as the weight of that edge. By making the convention that $w(u,v)=0$ whenever there is no edge between vertices $u$ and $v$, the information about $E$ is contained in $w$, so that any weighted graph is determined by $V$ and $w$. Moreover, any unweighted graph can be interpreted as a weighted graph by setting $w(u,v)=1$ whenever $\{u,v\}\in E$.
For any weighted graph $G$ and any constant $c>0$, we shall define $cG$ to be the weighted graph on the same set of vertices and edge weights $c\,w$.
Graphons.
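A graphon is any symmetric function $\kappa\colon[0,1]^2\to[0,\infty)$ which is integrable; note that we restrict ourselves to non-negative graphons, whereas Borgs et al. (2014a) allow for more general graphons. For any graphon $\kappa$, the cut-norm of $\kappa$ is defined as
\[ \|\kappa\|_{\square} = \sup_{S,T\subseteq[0,1]} \Bigl\lvert \int_{S\times T} \kappa(x,y)\,dx\,dy \Bigr\rvert, \]
where the supremum is taken over Lebesgue-measurable subsets of $[0,1]$. The $L^1$-norm of $\kappa$ is given by
\[ \|\kappa\|_{1} = \int_{[0,1]^2} \lvert\kappa(x,y)\rvert\,dx\,dy. \]
For any two graphons $\kappa$ and $\kappa'$, we let
\[ d_{\square}(\kappa,\kappa') = \|\kappa-\kappa'\|_{\square}. \]
Since a Lebesgue measure preserving transformation of $[0,1]$ will not change the norm of a graphon, it is customary to define the cut-metric on graphons by
\[ \delta_{\square}(\kappa,\kappa') = \inf_{\sigma} d_{\square}(\kappa^{\sigma},\kappa'), \]
where the infimum ranges over all measure-preserving bijections $\sigma\colon[0,1]\to[0,1]$, and where the graphon $\kappa^{\sigma}$ is defined as $\kappa^{\sigma}(x,y)=\kappa(\sigma(x),\sigma(y))$.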
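Every weighted graph $G$ is naturally associated with a graphon $\kappa^{G}$ in the following way. First, divide the interval $[0,1]$ into intervals $I_v$ of lengths $1/\lvert V\rvert$ for each $v\in V$. The function $\kappa^{G}$ is then given the constant value $w(u,v)$ on $I_u\times I_v$ for every $u,v\in V$. It is easily verified that $\kappa^{G}$ is indeed a graphon.
Thus, even if $G$ and $G'$ have different sets of vertices, we can define their cut-distance through the cut-distance of their associated graphons; that is,
\[ d_{\square}(G,G') = d_{\square}(\kappa^{G},\kappa^{G'}). \]
If two weighted graphs $G$ and $G'$, with weight functions $w$ and $w'$, have the same set of vertices $V$, then it is clear that we can express their cut-distance as
\[ d_{\square}(G,G') = \frac{1}{\lvert V\rvert^{2}} \max_{S,T\subseteq V} \Bigl\lvert \sum_{u\in S,\,v\in T} \bigl(w(u,v)-w'(u,v)\bigr) \Bigr\rvert. \]
Finally, if $\kappa$ is a graphon and $G$ is a weighted graph, then we will define
\[ d_{\square}(G,\kappa) = d_{\square}(\kappa^{G},\kappa). \]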
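For two weighted graphs on the same small vertex set, the finite expression for the cut-distance (a maximum over pairs of vertex subsets, normalised by the squared number of vertices) can be evaluated by brute force. The following is a minimal sketch under that reading; the function name and the matrix representation of the weights are illustrative assumptions, not notation from the paper, and the running time is exponential in the number of vertices.

```python
def cut_distance_same_vertices(w1, w2):
    """Brute-force cut distance of two weighted graphs on the same n vertices.

    w1, w2: n x n lists of edge weights (0 where there is no edge).
    Returns max over vertex subsets S, T of
        |sum_{i in S, j in T} (w1[i][j] - w2[i][j])| / n^2.
    Only intended for tiny examples (2^n subsets S are enumerated).
    """
    n = len(w1)
    diff = [[w1[i][j] - w2[i][j] for j in range(n)] for i in range(n)]
    best = 0.0
    for s_mask in range(1, 2 ** n):
        # Column sums of the difference matrix restricted to rows in S.
        col = [
            sum(diff[i][j] for i in range(n) if (s_mask >> i) & 1)
            for j in range(n)
        ]
        # For a fixed S, the optimal T collects all positive columns
        # or all negative columns.
        positive = sum(c for c in col if c > 0)
        negative = -sum(c for c in col if c < 0)
        best = max(best, positive, negative)
    return best / n ** 2


# Hypothetical example: a triangle versus the empty graph on 3 vertices.
triangle = [[0, 1, 1], [1, 0, 1], [1, 1, 0]]
empty = [[0, 0, 0], [0, 0, 0], [0, 0, 0]]
distance = cut_distance_same_vertices(triangle, empty)
```

Because the sum is linear in the indicator of each subset, the supremum over measurable sets for step functions on equal-length intervals is attained on unions of intervals, which is why a search over vertex subsets suffices here.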
Convergence to graphon.
Let be a graphon with . Let satisfy and as . Let the vertex set be given by . Let be i.i.d. chosen uniformly in .
Define to be the graph defined by connecting and with probability . It is clear that is a sparse graph sequence and in (Borgs et al., 2014a, Theorem 2.14 and Corollary 2.15) it is shown that, with probability ,
[TABLE]
as $n\to\infty$. In this article we generalise the above result to the case where the vertex labels come from a Markov chain and the sparse graph is constructed after suitable clumping.
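The i.i.d. construction recalled in this paragraph can be simulated directly. The sketch below assumes the usual sparse random graph model from the literature cited here: vertices receive i.i.d. uniform labels, and two vertices are joined independently with probability equal to the graphon value at their labels times a sparsity parameter, capped at 1. The graphon `kappa_example`, the parameter name `rho`, and the chosen values are hypothetical and serve only as an illustration.

```python
import random


def sample_sparse_graph(kappa, n, rho, rng):
    """Sample labels and edges of a sparse graphon-driven random graph.

    kappa: symmetric non-negative function on [0,1]^2 (assumed graphon).
    rho:   sparsity parameter; edge probability is min(1, rho * kappa).
    """
    labels = [rng.random() for _ in range(n)]
    edges = set()
    for i in range(n):
        for j in range(i + 1, n):
            p = min(1.0, rho * kappa(labels[i], labels[j]))
            if rng.random() < p:
                edges.add((i, j))
    return labels, edges


def kappa_example(x, y):
    # Hypothetical graphon with integral 1 over the unit square.
    return (0.5 + x) * (0.5 + y)


rng = random.Random(0)
n = 400
rho = n ** -0.5  # rho -> 0 while n * rho -> infinity along this choice
labels, edges = sample_sparse_graph(kappa_example, n, rho, rng)
edge_density = len(edges) / (n * (n - 1) / 2)
```

With this normalisation the ratio `edge_density / rho` should be close to the integral of the graphon, which is 1 for the example above.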
3 Model and main results
3.1 Constructing a random graph from RDS
We shall construct a sparse graph on $n$ vertices driven by Respondent Driven Sampling (RDS). We will sample $N$ individuals, labelled $X_1,\dots,X_N$, where $X_k\in[0,1]$. We note that the label space is chosen arbitrarily to be the unit interval only for the sake of mathematical convenience. After sampling, the individuals are clumped into $n$ equally spaced bins, which we represent by the intervals $I_1,\dots,I_n$, where $I_i=[(i-1)/n,\,i/n)$ (it is understood that $I_n$ also includes the right-most point 1). We connect $i$ and $j$ if two successive individuals fall into bin $I_i$ followed by bin $I_j$ or vice-versa. We choose $N$ in such a way that the graph constructed is sparse, and we establish a limit for it. We begin with a precise definition of the sampling scheme via a Markov chain.
Markov Chain representing RDS.
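Let $\kappa$ be a graphon. Let $(\Omega,\mathcal{F},\mathbb{P})$ be a probability space, on which we define a Markov chain $X=(X_k)_{k\geq 1}$ with transition probabilities given by
\[ \mathbb{P}[X_{k+1}\in dy \mid X_k=x] = \frac{\kappa(x,y)\,dy}{\int_0^1 \kappa(x,z)\,dz}. \]
Since $\kappa$ is symmetric, the Markov chain is time-reversible with stationary distribution
\[ \pi(dx) = \frac{\int_0^1 \kappa(x,z)\,dz}{\|\kappa\|_{1}}\,dx. \]
We shall assume that $X_1\sim\pi$, which means the chain is stationary. Then the probability of seeing a transition from $dx$ to $dy$ is given by
\[ \mathbb{P}[X_k\in dx,\,X_{k+1}\in dy] = \hat{\kappa}(x,y)\,dx\,dy, \qquad \hat{\kappa}(x,y) = \frac{\kappa(x,y)}{\|\kappa\|_{1}}. \]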
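A chain of this type can be simulated by rejection sampling when the graphon is bounded. The sketch below assumes a one-step transition density proportional to the graphon in its second argument, and starts the chain from the first coordinate of a point drawn from the normalised graphon on the unit square, whose marginal is the stationary law. The function names, the bound `kappa_max`, and the example graphon are hypothetical choices made only for illustration.

```python
import random


def sample_stationary(kappa, kappa_max, rng):
    """Draw x from the marginal of the density kappa / ||kappa||_1 on [0,1]^2."""
    while True:
        x, y = rng.random(), rng.random()
        if rng.random() * kappa_max <= kappa(x, y):
            return x


def step(kappa, kappa_max, x, rng):
    """Draw y with density proportional to kappa(x, .) by rejection sampling."""
    while True:
        y = rng.random()
        if rng.random() * kappa_max <= kappa(x, y):
            return y


def rds_chain(kappa, kappa_max, n_steps, rng):
    """Simulate n_steps states of the chain, started in stationarity."""
    chain = [sample_stationary(kappa, kappa_max, rng)]
    while len(chain) < n_steps:
        chain.append(step(kappa, kappa_max, chain[-1], rng))
    return chain


# Hypothetical graphon kappa(x, y) = x + y, bounded by 2 on the unit square.
# Its stationary density is x + 1/2, with mean 7/12.
rng = random.Random(1)
chain = rds_chain(lambda x, y: x + y, 2.0, 20000, rng)
empirical_mean = sum(chain) / len(chain)
```

Rejection sampling is used here only because it needs nothing beyond an upper bound on the graphon; for unbounded graphons a different sampler would be required.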
Sparse Random Graph from RDS.
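Let $\kappa$ be a graphon. Let $n\geq 1$ and $N\geq 1$. We will now construct a random graph $G_n$ via the following steps:
- Let the vertex set be $[n]=\{1,\dots,n\}$.
- Let $X_1,\dots,X_N$ be a realisation of the stationary Markov chain defined in the previous section up to time $N$.
- Equi-partition the unit interval by the intervals $I_1,\dots,I_n$. For $i,j\in[n]$ with $i\neq j$, define
\[ E_n(i,j) = \sum_{k=1}^{N-1} \bigl( \mathbf{1}\{X_k\in I_i,\,X_{k+1}\in I_j\} + \mathbf{1}\{X_k\in I_j,\,X_{k+1}\in I_i\} \bigr). \]
- For $i,j\in[n]$ with $i\neq j$, connect $i$ and $j$ if $E_n(i,j)>0$, and leave them unconnected otherwise.
If we choose $N=N(n)$ appropriately (i.e. $N=o(n^2)$) then the above random graph will be a sparse random graph sequence.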
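The clumping steps listed above can be sketched in a few lines: each sample is mapped to its bin, and two distinct bins are joined whenever two successive samples fall into them, in either order. The function name and the 0-based bin indices are illustrative assumptions; the input is any sequence of labels in the unit interval.

```python
def clumped_graph(chain, n):
    """Edge set of the clumped graph on bins 0, ..., n-1.

    chain: sequence of labels in [0, 1], in sampling order.
    n:     number of equally spaced bins; the last bin also contains 1.
    An edge (i, j) with i < j is present if some pair of successive
    samples falls into bins i and j, in either order.
    """
    def bin_of(x):
        return min(int(x * n), n - 1)

    edges = set()
    for x, y in zip(chain, chain[1:]):
        i, j = bin_of(x), bin_of(y)
        if i != j:
            edges.add((min(i, j), max(i, j)))
    return edges


# Small hypothetical sample with n = 4 bins.
sample = [0.05, 0.55, 0.95, 0.5, 0.07, 1.0]
edges = clumped_graph(sample, 4)
```

A sample of length $N$ produces at most $N-1$ edges, so the average degree is at most $2(N-1)/n$; this is why the relative growth of the sample size and the number of bins controls the sparsity of the resulting graph.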
3.2 Main Result
Let be a given graphon, and consider the sparse graph sequence defined as in the previous paragraph. We shall make the following assumptions on and .
Assumption (K1).
There are a constant and an integrable function such that
[TABLE]
Assumption (N1).
There are constants and , where and , such that the sequence satisfies
[TABLE]
We are now ready to state the main result.
Theorem 3.1.
Under Assumption (K1) and Assumption (N1), and if ,
[TABLE]
almost surely with respect to .
4 Proof of Theorem 3.1
To prove our result we will need to define two (deterministic and intermediate) weighted graphs. The first graph is an “averaged” version of , which we shall denote by ; it is the weighted graph on the vertices with edge weights
[TABLE]
Denote by be the weighted graph obtained by scaling the weights of by (as described in Section 2). The second graph, denoted by , is the weighted graph on the vertices with edge weights
[TABLE]
For , let and be such that and for all . Observe that by the Lebesgue density theorem, almost everywhere on .
Our strategy will be to show that, for large , is close to , followed by the fact that is close to , and finally that is close to .
We start with the first lemma, which shows that the distance between and goes to $0$ almost surely with respect to . The key ingredient of the proof is a concentration inequality of Paulin (2015).
Lemma 4.1.
We have
[TABLE]
Proof.
Note that
[TABLE]
where . As is Harris recurrent by Assumption (K1), we obtain from (Meyn and Tweedie, 2009, Theorem 16.0.2) that the Markov chain has finite mixing time . Let be given. Now, changing one point in will change by at most 2 edges; that is, is -Hamming-Lipschitz. Therefore, by (Paulin, 2015, Corollary 2.10),
[TABLE]
where is a constant that only depends on . Using the union bound,
[TABLE]
By (3.3) and Borel-Cantelli, the claim follows. ∎
Our second lemma shows that the distance between and goes to $0$. The key ingredient of the proof is an application of the Stein-Chen method.
Lemma 4.2.
We have
[TABLE]
Proof.
Let
[TABLE]
where with , and note that
[TABLE]
Clearly, . Now,
[TABLE]
where $Z_{n}(i,j)\stackrel{\mathscr{D}}{=}\mathrm{Poisson}\bigl(\frac{2N}{n^{2}}\mu_{n}(i,j)\bigr)$. Now, let be a random variable having the size-bias distribution of . Then, the Stein-Chen method (see, for example, (Barbour et al., 1992, Theorem 1.B)) yields
[TABLE]
where denotes the total variation distance. Note that for all , hence for all . Thus, we can use the standard way to construct the size-bias distribution (see for example Goldstein and Rinott (1996)). To this end, let be a uniformly chosen index from to , independent of all else. It is not difficult to show that is the size-bias distribution of . We now construct on the same probability space as in the following way. Consider as given, and consider a process with law
[TABLE]
Let , and observe that
[TABLE]
Thus,
[TABLE]
has the size-bias distribution of . If , we can couple the two processes and perfectly. If , we couple the two processes as follows. Condition (3.2) implies that is Harris-recurrent; that is,
[TABLE]
Thus, it is possible to couple and such that
[TABLE]
and, similarly,
[TABLE]
We can easily extend the processes $X$ and $X'$ so that and are defined for all . Now, let $G_1$ and $G_2$ be geometric random variables with success probability dominating the coupling time forward and backward in time from and respectively. Note that we can construct $G_1$ and $G_2$ such that $(G_{1},G_{2})\perp\!\!\!\perp X$ and $(G_{1},G_{2})\perp\!\!\!\perp X'$ (note, however, that $(G_{1},G_{2})\not\perp\!\!\!\perp (X,X')$). Then,
[TABLE]
Now,
[TABLE]
Applying this bound to (4.5), we have, for each and ,
[TABLE]
In conjunction with the Stein-Chen bound above and interchanging summation with integration, we arrive at
[TABLE]
Using (3.3), the claim follows. ∎
Our third lemma shows that the distance between and goes to $0$. The proof is a basic exercise in real analysis.
Lemma 4.3.
We have
[TABLE]
Proof.
To simplify writing, we introduce the notation . Recall that is the weighted graph on the vertices with edge weights as in (4.1). Define the graphon by
[TABLE]
Let be the graphon associated with the graph , which is given by
[TABLE]
Now,
[TABLE]
By (Borgs et al., 2014b, Lemma 5.6),
[TABLE]
Note that, by Taylor’s approximation, for . Hence, we have for any that
[TABLE]
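Let $\tau>0$ (to be chosen later). For any graphon $\kappa$, let $\kappa\wedge\tau$ be the graphon defined as $(\kappa\wedge\tau)(x,y)=\min\{\kappa(x,y),\tau\}$ and let the graphon $(\kappa\wedge\tau)_n$ be defined analogously to (4.7). Now,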
[TABLE]
By the contraction property,
[TABLE]
Let . Then there exists such that
[TABLE]
For this choice of $\tau$, as $\min\bigl\{1,\frac{2N}{n^{2}}\hat{\kappa}_{n}(x,y)\bigr\}(\hat{\kappa}\wedge\tau)_{n}(x,y)$ converges to zero pointwise and is bounded by $\tau$, we can use dominated convergence to conclude that there exists $n_0$ such that
[TABLE]
for all $n\geq n_0$. Therefore, applying (4.11)–(4.13) to (4.10), we have that
[TABLE]
As was arbitrary, we conclude that
[TABLE]
From (4.8), (4.9), and (4.14) the claim now follows. ∎
We are now ready to prove the main result. It follows immediately from the triangle inequality and the above three lemmas.
Proof of Theorem 3.1.
As indicated above using the triangle inequality, we have
[TABLE]
Application of Lemma 4.1, Lemma 4.2 and Lemma 4.3 completes the proof. ∎
5 Discussion
We conclude this note with some remarks on dense graph sequences and Respondent Driven Sampling.
Dense Graph Sequence.
We have chosen $N$ so as to ensure that the graph sequence is sparse. If $2N/n^{2}\to\lambda$ for some $\lambda>0$, then we obtain a dense graph sequence. In this case as well, convergence in the cut-metric would hold, but to a “Poissonised” graphon in the following sense.
Proposition 5.1.
Under Assumption (K1) and Assumption (N1), and if ,
[TABLE]
almost surely with respect to , where the graphon is given by
[TABLE]
Proof.
The proof proceeds in the same way as the proof of Theorem 3.1, so we only provide a sketch.
Lemma 4.1 and Lemma 4.2 hold in this case as well. Instead of Lemma 4.3, we have to show
[TABLE]
Define the graphon $f_n$ as $f_{n}(x,y)=\lambda^{-1}\bigl(1-e^{-\lambda\hat{\kappa}_{n}(x,y)}\bigr)$. Now,
[TABLE]
Recall that for all . So, for any ,
[TABLE]
By (Borgs et al., 2014b, Lemma 5.6), . Hence, using the above this readily implies
[TABLE]
Note that for and ,
[TABLE]
So, for any , we have
[TABLE]
As $\bigl\lvert n^{2}/(2N)-1/\lambda\bigr\rvert\to 0$, dominated convergence implies
[TABLE]
From this the result follows as in the proof of Theorem 3.1. ∎
We note that the Stein-Chen method plays a critical role in the proof of Lemma 4.2 when , as ; that is, the mean of the Poisson random variable does not converge to $0$, so that moment bounds would not suffice to prove Lemma 4.2.
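The effect of the “Poissonisation” in the dense case can be seen numerically. The sketch below evaluates the transform $\lambda^{-1}(1-e^{-\lambda k})$ at a single graphon value $k$, mirroring the form of $f_n$ in the proof sketch above; the function name and the sample values are hypothetical. For small $\lambda$ the transform is close to $k$, while for larger $\lambda$ it saturates below $1/\lambda$.

```python
import math


def poissonised(kappa_value, lam):
    """Evaluate (1 - exp(-lam * kappa_value)) / lam for lam > 0.

    math.expm1 is used to keep precision when lam * kappa_value is tiny.
    """
    return -math.expm1(-lam * kappa_value) / lam


k = 2.0  # hypothetical graphon value
near_sparse = poissonised(k, 1e-6)   # small lambda: close to k
dense = poissonised(k, 1.0)          # lambda = 1: equals 1 - exp(-2)

# Gap between the graphon value and its Poissonised version; by the
# elementary bound 0 <= x - (1 - exp(-x)) <= x^2 / 2 for x >= 0,
# the gap lies between 0 and lam * k^2 / 2.
gap = k - dense
```

This is the same second-order Taylor bound that makes the correction vanish in the sparse regime, where the analogue of $\lambda$ tends to zero.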
Respondent Driven Sampling (RDS).
One common approach in RDS to correct for bias towards high degrees is to ask participants of the study to estimate their own degree and then weigh the participants by the inverse of their reported degree. This procedure is known as multiplicity sampling, and was first used in the context of RDS by Rothbart et al. (1982). What Theorem 3.1 implies in essence is that one could also clump participants together according to general characteristics (such as age, gender, etc.). If the degree of the participants is captured by these characteristics, the bias towards participants with high degrees would disappear.
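The inverse-degree weighting described in this paragraph can be sketched as a weighted proportion. The function below is a generic illustration, not the estimator of any particular software package; the data are hypothetical.

```python
def inverse_degree_estimate(traits, degrees):
    """Proportion of participants with a trait, weighted by 1 / degree.

    traits:  list of booleans, one per sampled participant.
    degrees: list of positive self-reported degrees, same order.
    """
    weights = [1.0 / d for d in degrees]
    weighted_with_trait = sum(w for w, t in zip(weights, traits) if t)
    return weighted_with_trait / sum(weights)


# Hypothetical sample: the two participants with the trait report degree 10,
# the two without report degree 2.
traits = [True, True, False, False]
degrees = [10, 10, 2, 2]
naive = sum(traits) / len(traits)
weighted = inverse_degree_estimate(traits, degrees)
```

In this toy sample the unweighted proportion is one half, while the weighted proportion is much smaller, because high-degree participants are assumed to be over-represented in proportion to their degree.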
It was argued by Heckathorn (2007) that multiplicity sampling cannot in general correct for the bias towards nodes with high degree due to possible differential recruitment, which means that some groups of participants are systematically able to recruit more people than others. Other methods of estimation, including the original estimators of Heckathorn (1997) as well as the clumping procedure proposed in this article, are equally susceptible to differential recruitment bias.
The mathematical reason behind this bias is that the stationary distribution of a one-referral Markov process on a set of types, which is the commonly used mathematical tool to derive RDS estimators, can be different from the stationary distribution of a multi-type branching process with the same transition probabilities if the average number of offspring depends on the types. This was described precisely by Athreya and Röllin (2016), where the two models, a one-referral Markov chain and Poisson-offspring branching process, show substantially different over-sampling of high-degree vertices in the network. In the one-referral Markov chain case, the over-sampling is exactly proportional to the degree, but in the case of a Poisson number of referrals, it is proportional to a quantity that is harder to calculate (the eigenfunction of the mean replacement measure of the branching process). In practice, differential recruitment bias is typically reduced by limiting the number of referrals, traditionally to no more than three.
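The statement that a one-referral Markov chain over-samples exactly in proportion to the degree can be checked on a small example with finitely many types. The sketch below assumes a symmetric matrix of tie weights between types and a chain that moves to a type with probability proportional to the corresponding weight; the matrix and the function name are hypothetical.

```python
def stationary_by_iteration(weights, n_iter=2000):
    """Approximate the stationary law of the random walk on a weighted graph.

    weights: symmetric matrix of non-negative tie weights between types.
    The walk moves from type i to type j with probability
    weights[i][j] / sum(weights[i]).
    """
    n = len(weights)
    row_sums = [sum(row) for row in weights]
    transition = [
        [weights[i][j] / row_sums[i] for j in range(n)] for i in range(n)
    ]
    dist = [1.0 / n] * n
    for _ in range(n_iter):
        dist = [
            sum(dist[i] * transition[i][j] for i in range(n))
            for j in range(n)
        ]
    return dist


# Hypothetical symmetric weights between three types (row sums 4, 6, 6).
weights = [[1, 2, 1], [2, 1, 3], [1, 3, 2]]
stationary = stationary_by_iteration(weights)
total = sum(sum(row) for row in weights)
degree_share = [sum(row) / total for row in weights]
```

Here `stationary` and `degree_share` agree up to numerical error, which is the finite-type analogue of the reversibility argument for symmetric graphons. A branching-process model with type-dependent offspring numbers would not in general share this property.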
Heckathorn (2007) also proposes a method, called estimation through dual-components, which is supposed to take differential recruitment into account. This is the default method used in the widely-used statistical software RDSAT (see Volz et al. (2012)). The basic idea is to estimate the transition probabilities governing the referrals, calculate the proportion of different types one would expect to see in the absence of both bias due to different degrees and bias due to differential recruitment, compare with the actual observed proportions, and then work backwards to find the true proportions in the population. However, the theoretical justifications in Heckathorn (2007) for the details of the procedure are somewhat opaque.
Open Problems.
We conclude the article with a couple of questions that can be explored.
(1) In Athreya and Röllin (2016) a rigorous framework was set up to handle convergence in dense graph limits. For dense graphs, the theory of graphons (whose range is $[0,1]$) was used to establish the convergence. Graphons in the dense graph setting characterise the limit via convergence of subgraph counts. This aspect applies under several equivalent metrics. One should be able to use the RDS models of Athreya and Röllin (2016) to prove convergence in the cut-metric, as in this article. The approach could be one as laid out in the proof of (Borgs et al., 2014a, Theorem 2.14).
(2) As already mentioned before, in practice, an RDS sample comes typically in the form of a tree, rather than a single chain, and hence, a multi-type branching process, where the types could represent characteristics such as gender, age etc., would constitute a more realistic mathematical model. The stationary distribution of such a branching process is difficult to solve analytically in general, but under additional assumptions, such as considering only finitely many types, a numerical approach would definitely be feasible. In this light, it seems that a statistical theory based on branching process theory, rather than Markov chain theory, could put the framework of dual-components from Heckathorn (2007) onto solid ground or even improve on it.
References
- S. Athreya and A. Röllin (2016). Dense graph limits under respondent-driven sampling. Ann. Appl. Probab. 26, 2193–2210.
- A. D. Barbour, L. Holst and S. Janson (1992). Poisson approximation. Oxford University Press, New York.
- B. Bollobás and O. Riordan (2009). Metrics for sparse graphs. In Surveys in combinatorics 2009, volume 365 of London Math. Soc. Lecture Note Ser., pages 211–287. Cambridge University Press, Cambridge.
- C. Borgs, J. T. Chayes, H. Cohn and Y. Zhao (2014a). An $L^{p}$ theory of sparse graph convergence I: limits, sparse random graph models, and power law distributions. arXiv preprint arXiv:1401.2906.
- C. Borgs, J. T. Chayes, H. Cohn and Y. Zhao (2014b). An $L^{p}$ theory of sparse graph convergence II: LD convergence, quotients, and right convergence. arXiv preprint arXiv:1408.0744.
- L. Goldstein and Y. Rinott (1996). Multivariate normal approximations by Stein’s method and size bias couplings. J. Appl. Probab. 33, 1–17.
- D. D. Heckathorn (1997). Respondent-driven sampling: A new approach to the study of hidden populations. Soc. Probl. 44, 174–199.
- D. D. Heckathorn (2007). Extensions of respondent-driven sampling: analyzing continuous variables and controlling for differential recruitment. Sociol. Methodol. 37, 151–207.
