Asymptotic Theory for Clustered Samples

Bruce E. Hansen; Seojeong Lee

arXiv:1902.01497·econ.EM·February 3, 2026

TL;DR

This paper develops a comprehensive asymptotic distribution theory for large samples of clustered data, accommodating heterogeneity and unbounded cluster sizes, and extends classical results to complex clustered sampling scenarios.

Contribution

It generalizes classical asymptotic results to clustered data with heterogeneous and unbounded clusters, providing a unified framework for various estimators.

Findings

01 Unified asymptotic distribution theory for clustered data

02 Conditions that include classical i.i.d. results as special cases

03 Applicable to linear, 2SLS, nonlinear MLE, and GMM estimators

Abstract

We provide a complete asymptotic distribution theory for clustered data with a large number of independent groups, generalizing the classic laws of large numbers, uniform laws, central limit theory, and clustered covariance matrix estimation. Our theory allows for clustered observations with heterogeneous and unbounded cluster sizes. Our conditions cleanly nest the classical results for i.n.i.d. observations, in the sense that our conditions specialize to the classical conditions under independent sampling. We use this theory to develop a full asymptotic distribution theory for estimation based on linear least-squares, 2SLS, nonlinear MLE, and nonlinear GMM.

Equations

$$\widetilde{X}_{g}=\sum_{j=1}^{n_{g}}X_{gj}$$

$$\overline{X}_{n}=\frac{1}{n}\sum_{g=1}^{G}\widetilde{X}_{g}.$$

$$\max_{g\leq G}\frac{n_{g}}{n}\rightarrow 0.$$

$$\lim_{M\rightarrow\infty}\sup_{i}\left(E\left\|X_{i}\right\|1\left(\left\|X_{i}\right\|>M\right)\right)=0$$

$$\overline{X}_{n}-E\overline{X}_{n}\xrightarrow{p}0.$$

$$\frac{\sum_{g=1}^{G}n_{g}^{2}}{n^{2}}\rightarrow 0.$$

$$\max_{g\leq G}\frac{n_{g}}{n}=\left(\max_{g\leq G}\frac{n_{g}^{2}}{n^{2}}\right)^{1/2}\leq\left(\sum_{g=1}^{G}\frac{n_{g}^{2}}{n^{2}}\right)^{1/2}\rightarrow 0$$

$$\text{sd}\left(\overline{X}_{n}\right)=\frac{1}{n}\left(\sum_{g=1}^{G}\text{var}\left(\widetilde{X}_{g}\right)\right)^{1/2}.$$

$$\text{var}\left(\widetilde{X}_{g}\right)=n_{g}=n^{\alpha}$$

$$\text{sd}\left(\overline{X}_{n}\right)=n^{-1/2}.$$

$$\text{var}\left(\widetilde{X}_{g}\right)=n_{g}^{2}=n^{2\alpha}$$

$$\text{sd}\left(\overline{X}_{n}\right)=n^{-(1-\alpha)/2}=G^{-1/2}.$$

$$\text{var}\left(\widetilde{X}_{g}\right)\sim n_{g}\log n_{g}\sim n^{\alpha}\log n$$

$$\text{sd}\left(\overline{X}_{n}\right)\sim\sqrt{\log n/n}.$$

$$\text{var}\left(\widetilde{X}_{g}\right)\sim n_{g}^{3}$$

$$\text{sd}\left(\overline{X}_{n}\right)\sim n^{\alpha-1/2}.$$

$$\text{sd}\left(\overline{X}_{n}\right)=\left(\frac{G_{1}+G_{2}n^{2\alpha}}{n^{2}}\right)^{1/2}=\left(\frac{1+n^{\alpha}}{2n}\right)^{1/2}=O\left(n^{-(1-\alpha)/2}\right).$$

$$\Omega_{n}=\frac{1}{n}\sum_{g=1}^{G}E\left(\left(\widetilde{X}_{g}-E\widetilde{X}_{g}\right)\left(\widetilde{X}_{g}-E\widetilde{X}_{g}\right)^{\prime}\right).$$

$$\frac{\left(\sum_{g=1}^{G}n_{g}^{r}\right)^{2/r}}{n}\leq C<\infty,$$

$$\max_{g\leq G}\frac{n_{g}^{2}}{n}\rightarrow 0,$$

$$\lim_{M\rightarrow\infty}\sup_{i}\left(E\left\|X_{i}\right\|^{r}1\left(\left\|X_{i}\right\|>M\right)\right)=0,$$

$$\lambda_{n}=\lambda_{\min}\left(\Omega_{n}\right)\geq\lambda>0,$$

$$\Omega_{n}^{-1/2}\sqrt{n}\left(\overline{X}_{n}-E\overline{X}_{n}\right)\xrightarrow{d}N\left(0,I_{p}\right).$$

$$\max_{g\leq G}\frac{n_{g}}{n^{\frac{r-2}{2(r-1)}}}\leq C$$

$$\frac{\left(\sum_{g=1}^{G}n_{g}^{r}\right)^{2/r}}{n\lambda_{n}}\leq C<\infty$$

$$\max_{g\leq G}\frac{n_{g}^{2}}{n\lambda_{n}}\rightarrow 0$$

$$\sum_{g=1}^{G}\frac{n_{g}^{2}}{n\lambda_{n}}E\left(\left\|Z_{g}\right\|^{2}1\left(\left\|Z_{g}\right\|^{2}\geq\frac{n\lambda_{n}\varepsilon}{n_{g}^{2}}\right)\right)\rightarrow 0$$

$$\max_{g\leq G}\frac{n_{g}}{n^{\frac{r-2}{2(r-1)}}\lambda_{n}^{\frac{r}{2(r-1)}}}=o(1).$$

$$\left(\max_{g\leq G}\frac{n_{g}^{2}}{n\lambda_{n}}\right)^{1/2}=\max_{g\leq G}\frac{n_{g}}{n^{\frac{r-2}{2(r-1)}}\lambda_{n}^{\frac{r}{2(r-1)}}}\left(\frac{\lambda_{n}}{n}\right)^{\frac{1}{2(r-1)}}=o(1)$$

$$\Omega_{n}=\frac{1}{n}\sum_{g=1}^{G}E\left(\widetilde{X}_{g}\widetilde{X}_{g}^{\prime}\right).$$

Full text

Asymptotic Theory for Clustered Samples

Bruce E. Hansen (University of Wisconsin) and Seojeong Lee (University of New South Wales)

Hansen thanks the National Science Foundation and the Phipps Chair for research support. Lee acknowledges that this research was supported under the Australian Research Council Discovery Early Career Researcher Award (DECRA) funding scheme (project number DE170100787).

(February 2019)

We thank the Co-Editor Han Hong and two referees for helpful comments on a previous version, and Morten Nielsen and James MacKinnon for valuable conversations and suggestions.

1 Introduction

Clustered samples are widely used in current applied econometric practice. Despite this dominance, there is little formal large-sample theory for estimation and inference. This paper provides such a foundation. We develop a complete, rigorous, and easily interpretable asymptotic distribution theory for the “large number of clusters” framework. Our theory allows heterogeneous and growing cluster sizes, but requires that the number of clusters $G$ grows with the sample size $n$. Our core theory provides a weak law of large numbers (WLLN), central limit theorem (CLT), and consistent clustered variance estimation for clustered sample means. We also provide uniform laws of large numbers and uniformly consistent clustered variance estimation appropriate for the distribution theory of nonlinear econometric estimators.

We apply this core theory to develop large sample distribution theory for standard econometric estimators: linear least-squares, 2SLS, MLE, and GMM. For each, we provide conditions for consistent estimation, asymptotic normality, consistent covariance matrix estimation, and asymptotic distributions for t-ratios and Wald statistics. The theory provided in this paper is the first formal theory for such econometric estimators allowing for clustered dependence.

Our assumptions are minimal, requiring only uniform integrability for the WLLN and squared uniform integrability for the CLT and clustered covariance matrix estimators, plus the requirement that individual clusters are asymptotically negligible. Our results show that there are inherent trade-offs in the conditions between the allowed degree of heterogeneity in cluster sizes and the number of finite moments. These trade-offs are least restrictive for the WLLN, are more restrictive for the CLT and consistent cluster covariance matrix estimation, and are strongest for CLTs applied to clustered second moments. These trade-offs do not arise in the independent sampling context.

We show that under clustering the convergence rate depends on the degree of clustered dependence. Convergence rates may equal the square root of the sample size, the square root of the number of clusters, be a rate in between these two, or even slower than both. Our assumptions and theory allow for these possibilities. This is in contrast to the existing literature, which imposes specific rate assumptions. One useful finding is that the rate does not need to be known by the user; the asymptotic distribution of t-ratios and Wald statistics does not depend on the underlying rate of convergence. This generalizes similar results in C. Hansen (2007) and related results in Tabord-Meehan (2018).

This paper makes the following technical contributions. We show that the key to extending the classical WLLN and CLT to cluster-level data is developing uniform integrability bounds for cluster sums. To allow for arbitrary within-cluster dependence, such bounds must be scaled by cluster sizes. This leads to bounds on the degree of cluster size heterogeneity which can be allowed under cluster dependence. Some of the most difficult technical work presented here is the extension of classical results to clustered covariance matrix estimators. These are not sample averages, but rather averages across clusters of squared cluster sums. Handling such estimators requires a new technical treatment.

Clustered dependence in econometrics dates to the work of Moulton (1986, 1990), Liang and Zeger (1986), and in particular Arellano (1987), who proposed the popular cluster-robust covariance matrix estimator. The method was popularized by the implementation in Stata by Rogers (1994) and the widely-cited paper of Bertrand, Duflo and Mullainathan (2004). Surveys can be found in Wooldridge (2003), Cameron and Miller (2011, 2015), MacKinnon (2012, 2016), and textbook treatments in Angrist and Pischke (2009) and Wooldridge (2010).

The “large $G$ ” asymptotic theory develops normal approximations under the assumption that $G\rightarrow\infty$ . The earliest treatment appears in White (1984). Wooldridge (2010) asserts a distribution theory under the assumption that the cluster sizes are fixed. C. Hansen (2007) provides two sets of asymptotic results, including both $\sqrt{G}$ and $\sqrt{n}$ convergence rates under two distinct assumptions on the rate of convergence of the estimation variance. His results are derived under the assumption that all clusters are identical in size. Carter, Schnepel and Steigerwald (2017) provided asymptotic results allowing for heterogeneous clusters, but their results are limited by atypical regularity conditions. Independently of this paper, Djogbenou, MacKinnon, and Nielsen (2018) have provided a rigorous asymptotic theory for heterogeneous clusters, with similar but stronger regularity conditions than ours. Their primary focus is theory for regression wild bootstrap, while our focus is regularity conditions for general econometric estimators.

An alternative to the “large $G$” asymptotic is the “fixed $G$” framework, which leads to a non-normal inference theory. Contributions to this literature include C. Hansen (2007), Bester, Conley and C. Hansen (2011), and Ibragimov and Müller (2010, 2016). A related paper is Conley and Taber (2011), who provide an asymptotic theory under the assumption of a small number of groups with policy changes. Canay, Romano, and Shaikh (2017) provide approximate randomization tests.

Small sample approaches to cluster robust inference include Donald and Lang (2007), Imbens and Kolesár (2016), and Young (2016). Bootstrap approaches are provided by Cameron, Gelbach and Miller (2008), and MacKinnon and Webb (2017, 2018).

A recent contribution which develops cluster-robust inference for GMM is Hwang (2017).

The organization of the paper is as follows. After Section 2, which introduces cluster sampling, Sections 3-8 cover the core asymptotic theory, providing rigorous conditions for the WLLN (Section 3), rates of convergence (Section 4), the CLT (Section 5), cluster-robust covariance matrix estimation (Section 6), the ULLN (Section 7), and the CLT for clustered second moments (Section 8). Following this, we provide the distribution theory for the core econometric estimators, specifically linear regression and 2SLS (Section 9), Maximum Likelihood (Section 10), and GMM (Section 11). Each of these latter sections is written to be self-contained, so it can be used directly by readers. Proofs of the core theorems are provided in the Appendix, and proofs for the applications are provided in the Supplemental Appendix.

2 Cluster Sampling

The observations are $X_{i}\in\mathbb{R}^{p}$ for $i=1,...,n$. They are grouped into $G$ mutually independent known clusters, indexed $g=1,...,G$, where the $g^{th}$ cluster has $n_{g}$ observations. The clustering can be due to the sampling scheme, or done by the researcher due to known correlation structures. The number of observations $n_{g}$ per cluster (the “cluster sizes”) may vary across clusters. The total number of observations is $n=\sum_{g=1}^{G}n_{g}$. It will also be convenient to double-index the observations as $X_{gj}$ for $g=1,...,G$ and $j=1,...,n_{g}$.

As is conventional in the clustering literature, the only dependence assumption we make is that the observations are independent across clusters, while the dependence within each cluster is unrestricted. Furthermore, we do not require that the observations or clusters come from identical distributions. Thus our framework includes i.n.i.d. (independent, not necessarily identically distributed) sampling as the special case $n_{g}=1$.

The notation and assumptions allow for linear panel data models with cluster-specific fixed effects. In this case the observations $X_{gj}$ should be viewed as cluster-demeaned observations. Another common application is linear panel data models with both cluster-specific and time-specific fixed effects. Our assumptions do not cover this case, as removing the time effects will induce cross-cluster correlations. This is essentially “multiway” clustering and requires different methods. See MacKinnon, Nielsen and Webb (2017).

Our distributional framework is asymptotic as $n$ and $G$ simultaneously diverge to infinity. This is typically referred to as the “large $G$ ” framework. Our assumptions, however, will allow $G$ to diverge at a rate slower than $n$ , by allowing the cluster sizes $n_{g}$ to diverge. This is in contrast to the early asymptotic theory for clustering, which implicitly assumed that the cluster sizes were bounded.

Our theory assumes that the clusters are known, and observations are independent across clusters. This is a substantive restriction. Alternatively, it may be possible to develop a distribution theory which allows weak dependence across clusters, but we do not do so here.

A word on notation. For a vector $a$ let $\left\|a\right\|=\left(a^{\prime}a\right)^{1/2}$ denote the Euclidean norm. For a positive semi-definite matrix $A$ let $\lambda_{\min}(A)$ and $\lambda_{\max}(A)$ denote its smallest and largest eigenvalue, respectively. For a general matrix $A$ let $\left\|A\right\|=\sqrt{\lambda_{\max}\left(A^{\prime}A\right)}$ denote the spectral norm. For a positive semi-definite matrix $A$ let $A^{1/2}$ denote the symmetric square root matrix such that $A^{1/2}A^{1/2}=A$ . We let $C$ denote a generic positive constant, that may be different in different uses.

3 Weak Law of Large Numbers

For our core theory (WLLN & CLT), we focus on the sample mean $\overline{X}_{n}=\frac{1}{n}\sum_{i=1}^{n}X_{i}$ as an estimator of $E\overline{X}_{n}$ . It will be convenient to define the cluster sums

$$\widetilde{X}_{g}=\sum_{j=1}^{n_{g}}X_{gj},$$

which are mutually independent under clustered sampling. The sample mean can then be written as

$$\overline{X}_{n}=\frac{1}{n}\sum_{g=1}^{G}\widetilde{X}_{g}.$$

We use the following regularity condition.

Assumption 1.

As $n\rightarrow\infty$

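$$\max_{g\leq G}\frac{n_{g}}{n}\rightarrow 0. \tag{1}$$

Theorem 1.

(WLLN for clustered means). If Assumption 1 holds and

$$\lim_{M\rightarrow\infty}\sup_{i}\left(E\left\|X_{i}\right\|1\left(\left\|X_{i}\right\|>M\right)\right)=0 \tag{2}$$

then as $n\rightarrow\infty$,

$$\overline{X}_{n}-E\overline{X}_{n}\xrightarrow{p}0. \tag{3}$$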

The condition (2) states that $X_{i}$ is uniformly integrable. (A referee points out that the sup in (2) could be weakened to an average. However, our later results will use uniform integrability conditions similar to (2), so we state all results in this format.) This condition is identical to the standard condition for the WLLN for independent heterogeneous observations, and thus Theorem 1 is a direct generalization of the WLLN for i.n.i.d. samples. Condition (2) simplifies to $E\left\|X_{i}\right\|<\infty$ when the observations have identical marginal distributions. A sufficient condition allowing for distributional heterogeneity is $\sup_{i}E\left\|X_{i}\right\|^{r}<\infty$ for some $r>1$.

Assumption 1 states that each cluster size $n_{g}$ is asymptotically negligible. This implies $G\rightarrow\infty$ , so we do not explicitly need to list the latter as an assumption. Assumption 1 allows for considerable heterogeneity in cluster sizes. It allows the cluster sizes to grow with sample size, so long as the growth is not proportional. For example, it allows clusters to grow at the rate $n_{g}=n^{\alpha}$ for $0\leq\alpha<1$ .

Assumption 1 is necessary for parameter estimation consistency while allowing arbitrary within-cluster dependence. Otherwise a single cluster could dominate the sample average. To see this, suppose that there is a cluster $\ell$ such that all observations within the cluster are identical, so that $X_{\ell j}=Z_{\ell}$ for some non-degenerate random variable $Z_{\ell}$, and that this cluster violates Assumption 1, so that $n_{\ell}/n\rightarrow c>0$. Suppose for all other clusters that $EX_{gj}=0$ and $n_{g}/n\rightarrow 0$. Then $\overline{X}_{n}\xrightarrow{p}cZ_{\ell}$, which is random, so $\overline{X}_{n}$ is inconsistent. Thus Assumption 1 is necessary for the WLLN (3) if we allow for unstructured cluster heterogeneity.
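The failure described above can be seen in a small simulation. This is only a sketch under illustrative assumptions that are not in the paper: the dominant cluster holds a fixed share `c` of the sample, its common value is standard normal, and all other observations are independent standard normal singletons. The function name is hypothetical. The sample mean tracks the dominant cluster's draw scaled by its share, so it stays random however large $n$ becomes.

```python
import random


def sample_mean_with_dominant_cluster(n, c, rng):
    """Simulate one draw of the sample mean when one cluster has share c.

    The dominant cluster holds n_l = round(c * n) identical copies of a single
    draw z ~ N(0, 1); the remaining observations are independent N(0, 1)
    singletons (an illustrative design, not taken from the paper).
    """
    n_l = round(c * n)
    z = rng.gauss(0.0, 1.0)
    rest = sum(rng.gauss(0.0, 1.0) for _ in range(n - n_l))
    return (n_l * z + rest) / n, z


rng = random.Random(0)
for n in (100, 10_000, 100_000):
    xbar, z = sample_mean_with_dominant_cluster(n, c=0.5, rng=rng)
    # The gap xbar - c*z shrinks as n grows, but xbar itself remains random.
    print(n, round(xbar, 4), round(0.5 * z, 4))
```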

Assumption 1 is equivalent to the condition

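$$\frac{\sum_{g=1}^{G}n_{g}^{2}}{n^{2}}\rightarrow 0. \tag{4}$$

To see this, first observe that since $\sum_{g=1}^{G}n_{g}=n$, the left-hand side of (4) is no larger than $\max\limits_{g\leq G}n_{g}/n\rightarrow 0$ under Assumption 1. Thus Assumption 1 implies (4). Second,

$$\max_{g\leq G}\frac{n_{g}}{n}=\left(\max_{g\leq G}\frac{n_{g}^{2}}{n^{2}}\right)^{1/2}\leq\left(\sum_{g=1}^{G}\frac{n_{g}^{2}}{n^{2}}\right)^{1/2}\rightarrow 0$$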

under (4). Thus (4) implies Assumption 1, so the two are equivalent.
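The two directions of this argument amount to the sandwich $(\max_{g}n_{g}/n)^{2}\leq\sum_{g}n_{g}^{2}/n^{2}\leq\max_{g}n_{g}/n$. A minimal numerical check is sketched below; the function names and the equal-cluster design are illustrative assumptions, not part of the paper.

```python
def max_share(sizes):
    """Largest cluster share max_g n_g / n (the quantity in Assumption 1)."""
    n = sum(sizes)
    return max(sizes) / n


def sum_sq_share(sizes):
    """Sum of squared cluster shares sum_g n_g^2 / n^2 (the quantity in (4))."""
    n = sum(sizes)
    return sum(m * m for m in sizes) / (n * n)


def equal_clusters(n, alpha):
    """Hypothetical design: about n^(1-alpha) clusters, each of size about n^alpha."""
    size = max(1, round(n ** alpha))
    return [size] * max(1, n // size)


for n in (10**2, 10**4, 10**6):
    sizes = equal_clusters(n, alpha=0.5)
    a, b = max_share(sizes), sum_sq_share(sizes)
    # Sandwich from the text: a^2 <= b <= a, so a -> 0 if and only if b -> 0.
    print(n, a, b, a * a <= b <= a)
```

Both quantities shrink together as $n$ grows, which is the content of the equivalence.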

4 Rate of Convergence

Under i.i.d. sampling the rate of convergence of the sample mean is $n^{-1/2}$ . Clustering can alter the rate of convergence. In this section we explore possible rates of convergence. From the work of C. Hansen (2007) it has been understood that if the dependence within each cluster is weak then the rate of convergence would be the i.i.d. rate $n^{-1/2}$ but if the dependence within each cluster is strong then the rate of convergence would be determined by the number of clusters: $G^{-1/2}$ . What we now show is that the rate of convergence can be in between or even slower than these rates.

The convergence rate can be calculated as the standard deviation of the sample mean. For simplicity we focus on the scalar case $p=1$ . The standard deviation of $\overline{X}_{n}$ is

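$$\text{sd}\left(\overline{X}_{n}\right)=\frac{1}{n}\left(\sum_{g=1}^{G}\text{var}\left(\widetilde{X}_{g}\right)\right)^{1/2}.$$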

We now consider several examples. For our first four we take the case where the clusters are all the same size: $n_{g}=n^{\alpha}$ for $0<\alpha<1$ . In this case the number of clusters is $G=n^{1-\alpha}$ .

We first consider a case where the convergence is the i.i.d. rate $n^{-1/2}$ .

Example 1. The observations are independent within each cluster and $\text{var}(X_{i})=1$ . Then

$$\text{var}\left(\widetilde{X}_{g}\right)=n_{g}=n^{\alpha}$$

and

$$\text{sd}\left(\overline{X}_{n}\right)=n^{-1/2}.$$

The $n^{-1/2}$ rate extends to any case where the within-cluster dependence is weak, including autoregressive and moving average dependence.

Our second example is a case where the convergence is determined by the number of clusters.

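Example 2. The observations are identical within each cluster (i.e., perfectly correlated) and $\text{var}(X_{i})=1$. Then

$$\text{var}\left(\widetilde{X}_{g}\right)=n_{g}^{2}=n^{2\alpha}$$

and

$$\text{sd}\left(\overline{X}_{n}\right)=n^{-(1-\alpha)/2}=G^{-1/2}.$$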

The assumption that the observations are perfectly correlated is not essential to obtain the $G^{-1/2}$ rate. What is important is that there is a common component to the observations within a cluster.

Our third example is a case where the convergence rate is in between the above two cases. Not surprisingly, it can be obtained by constructing strong but decaying within-cluster dependence.

Example 3. The observations are correlated within each cluster with $\text{var}(X_{i})=1$ and $\text{cov}(X_{gj},X_{gl})=1/|j-l|$ for $j\neq l$. Then

$$\text{var}\left(\widetilde{X}_{g}\right)\sim n_{g}\log n_{g}\sim n^{\alpha}\log n$$

and

$$\text{sd}\left(\overline{X}_{n}\right)\sim\sqrt{\log n/n}.$$

Furthermore, $G\text{var}\left(\overline{X}_{n}\right)\rightarrow 0.$ Thus $\text{sd}\left(\overline{X}_{n}\right)$ converges at a rate in between $n^{-1/2}$ and $G^{-1/2}$ .
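The variance order in Example 3 follows from counting covariance terms by lag. As a sketch, write $m=n_{g}$ and let $H_{k}=\sum_{i=1}^{k}1/i$ denote the harmonic numbers (notation introduced here for illustration only):

```latex
\begin{aligned}
\text{var}\left(\widetilde{X}_{g}\right)
  &= \sum_{j=1}^{m}\sum_{l=1}^{m}\text{cov}\left(X_{gj},X_{gl}\right)
   = m + 2\sum_{k=1}^{m-1}\frac{m-k}{k} \\
  &= m + 2mH_{m-1} - 2(m-1) \sim 2m\log m , \\
\text{var}\left(\overline{X}_{n}\right)
  &= \frac{G\,\text{var}\left(\widetilde{X}_{g}\right)}{n^{2}}
   \sim \frac{2\alpha\log n}{n}
   \quad\text{with } m=n^{\alpha},\ G=n^{1-\alpha}.
\end{aligned}
```

Hence $n\,\text{var}(\overline{X}_{n})\rightarrow\infty$ while $G\,\text{var}(\overline{X}_{n})\sim 2\alpha n^{-\alpha}\log n\rightarrow 0$, which places the rate strictly between the two benchmark rates.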

Our next two examples are somewhat surprising. They are cases where the convergence rate is slower than both $n^{-1/2}$ and $G^{-1/2}$ .

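Example 4. The observations follow random walks within each cluster: $X_{gj}=X_{g,j-1}+\varepsilon_{gj}$ with $\varepsilon_{gj}$ i.i.d. $(0,1)$ and $X_{g0}=0$. Then

$$\text{var}\left(\widetilde{X}_{g}\right)\sim n_{g}^{3}$$

and

$$\text{sd}\left(\overline{X}_{n}\right)\sim n^{\alpha-1/2}.$$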

Thus $\text{sd}\left(\overline{X}_{n}\right)$ converges at a rate slower than both $n^{-1/2}$ and $G^{-1/2}$ .
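The cubic order in Example 4 can be verified by rewriting the cluster sum in terms of the innovations. A short worked sketch, with $m=n_{g}$ used as shorthand:

```latex
\begin{aligned}
\widetilde{X}_{g}
  &= \sum_{j=1}^{m}\sum_{i=1}^{j}\varepsilon_{gi}
   = \sum_{i=1}^{m}(m-i+1)\,\varepsilon_{gi}, \\
\text{var}\left(\widetilde{X}_{g}\right)
  &= \sum_{k=1}^{m}k^{2}
   = \frac{m(m+1)(2m+1)}{6} \sim \frac{m^{3}}{3}, \\
\text{sd}\left(\overline{X}_{n}\right)
  &= \frac{1}{n}\left(G\,\text{var}\left(\widetilde{X}_{g}\right)\right)^{1/2}
   \sim \frac{n^{\alpha-1/2}}{\sqrt{3}}
   \quad\text{with } m=n^{\alpha},\ G=n^{1-\alpha}.
\end{aligned}
```

Relative to $G^{-1/2}=n^{-(1-\alpha)/2}$ the ratio is of order $n^{\alpha/2}\rightarrow\infty$, and the standard deviation vanishes only when $\alpha<1/2$.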

Example 5. The clusters are of two sizes, $n_{g}=1$ and $n_{g}=n^{\alpha}$. There are $G_{1}=n/2$ of the first type and $G_{2}=n^{1-\alpha}/2$ of the second type, so $G=G_{1}+G_{2}=O\left(n\right)$. Within each cluster the observations are identical and have unit variances, so $\text{var}(\widetilde{X}_{g})$ equals $1$ and $n^{2\alpha}$ for the two types of clusters, respectively. Then

$$\text{sd}\left(\overline{X}_{n}\right)=\left(\frac{G_{1}+G_{2}n^{2\alpha}}{n^{2}}\right)^{1/2}=\left(\frac{1+n^{\alpha}}{2n}\right)^{1/2}=O\left(n^{-(1-\alpha)/2}\right).$$

Thus $\text{sd}\left(\overline{X}_{n}\right)$ converges at a rate slower than both $n^{-1/2}$ and $G^{-1/2}$.

The final example illustrates the importance of considering heterogeneous cluster sizes. The convergence rate is slower than both $n^{-1/2}$ and $G^{-1/2}$ because the number of clusters is determined by the large number of small clusters, while the convergence rate is determined by the (relatively) small number of large clusters.

What we have seen is that the convergence rate $\text{sd}\left(\overline{X}_{n}\right)$ can equal the inverse square root of the sample size, $n^{-1/2}$, can equal the inverse square root of the number of groups, $G^{-1/2}$, can be in between $G^{-1/2}$ and $n^{-1/2}$, or can be slower than both $n^{-1/2}$ and $G^{-1/2}$.
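The five cases can be compared numerically by plugging exact cluster-sum variances into $\text{sd}(\overline{X}_{n})=n^{-1}(\sum_{g}\text{var}(\widetilde{X}_{g}))^{1/2}$. The sketch below is illustrative only: the function names are hypothetical, and the sample size and exponents ($\alpha=1/4$ for Examples 1 to 4, $\alpha=1/2$ for Example 5) are arbitrary choices.

```python
import math


def var_cluster_sum(m, example):
    """Exact variance of a cluster sum of size m under Examples 1-4."""
    if example == 1:      # independent within cluster, unit variance
        return m
    if example == 2:      # identical within cluster, unit variance
        return m * m
    if example == 3:      # cov(X_j, X_l) = 1/|j - l| for j != l
        return m + 2 * sum((m - k) / k for k in range(1, m))
    if example == 4:      # random walk started at zero, unit innovations
        return m * (m + 1) * (2 * m + 1) / 6
    raise ValueError("example must be 1, 2, 3 or 4")


def sd_mean(cluster_vars, n):
    """sd of the sample mean: (1/n) * sqrt(sum of cluster-sum variances)."""
    return math.sqrt(sum(cluster_vars)) / n


n, m = 10_000, 10            # equal clusters: alpha = 1/4, G = n / m
G = n // m
sds = {ex: sd_mean([var_cluster_sum(m, ex)] * G, n) for ex in (1, 2, 3, 4)}

# Example 5: n/2 singletons plus n^(1-alpha)/2 clusters of size n^alpha, alpha = 1/2.
big = 100
G1, G2 = n // 2, (n // big) // 2
sds[5] = sd_mean([1] * G1 + [big * big] * G2, n)
G5 = G1 + G2

print("n^-1/2 =", n ** -0.5, " G^-1/2 =", G ** -0.5, " Example 5 G^-1/2 =", G5 ** -0.5)
for ex, sd in sorted(sds.items()):
    print("Example", ex, "sd =", sd)
```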

When $\overline{X}_{n}$ is a vector, it is likely that its elements converge at different rates since they can have different within-cluster correlation structures. For example, some variables could be independent within clusters while others could be identical within clusters.

These examples show that under cluster dependence the convergence rate is context-dependent and variable-dependent, and it is therefore important to allow for general rates of convergence and to not impose arbitrary rates in asymptotic analysis.

5 Central Limit Theory

Under i.i.d. sampling the standard deviation of the sample mean is of order $O(n^{-1/2})$ , so $\sqrt{n}$ is the appropriate scaling to obtain the central limit theorem (CLT). As discussed in the previous section, clustering can alter the rate of convergence, so it is essential to standardize the sample mean by the actual variance rather than an assumed rate. The variance matrix of $\sqrt{n}\overline{X}_{n}$ is

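$$\Omega_{n}=\frac{1}{n}\sum_{g=1}^{G}E\left(\left(\widetilde{X}_{g}-E\widetilde{X}_{g}\right)\left(\widetilde{X}_{g}-E\widetilde{X}_{g}\right)^{\prime}\right).$$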

We use the following regularity condition.

Assumption 2.

For some $2\leq r<\infty$

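$$\frac{\left(\sum_{g=1}^{G}n_{g}^{r}\right)^{2/r}}{n}\leq C<\infty, \tag{5}$$

$$\max_{g\leq G}\frac{n_{g}^{2}}{n}\rightarrow 0, \tag{6}$$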

as $n\rightarrow\infty.$

Theorem 2.

(CLT) If for some $2\leq r<\infty$ Assumption 2 holds,

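$$\lim_{M\rightarrow\infty}\sup_{i}\left(E\left\|X_{i}\right\|^{r}1\left(\left\|X_{i}\right\|>M\right)\right)=0, \tag{7}$$

and

$$\lambda_{n}=\lambda_{\min}\left(\Omega_{n}\right)\geq\lambda>0, \tag{8}$$

then as $n\rightarrow\infty$

$$\Omega_{n}^{-1/2}\sqrt{n}\left(\overline{X}_{n}-E\overline{X}_{n}\right)\xrightarrow{d}N\left(0,I_{p}\right). \tag{9}$$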

Theorem 2 provides a CLT for cluster samples which generalizes the classic CLT for independent heterogeneous samples. The latter holds with $r=2$ , $n_{g}=1$ and $G=n$ .

Assumption 2 and (7) are stronger than Assumption 1 and (2), and thus the conditions for the CLT imply those for the WLLN.

The condition (7) states that $\left\|X_{i}\right\|^{r}$ is uniformly integrable. When $r=2$ this is similar to the Lindeberg condition for the CLT under independent heterogeneous sampling. (7) simplifies to $E\left\|X_{i}\right\|^{r}<\infty$ when the observations have identical marginal distributions. A sufficient condition allowing for distributional heterogeneity is $\sup_{i}E\left\|X_{i}\right\|^{s}<\infty$ for some $s>r\geq 2$ .

Assumption 2 (5) is a restriction on the cluster sizes. It involves a trade-off with the number of moments $r$ . It is least restrictive for large $r$ , and more restrictive for small $r$ . As $r\rightarrow\infty$ it approaches $\max_{g\leq G}n_{g}^{2}/n=O(1)$ , which is implied by Assumption 2 (6).

Assumption 2 allows for growing and heterogeneous cluster sizes. For example, it allows clusters to grow uniformly at the rate $n_{g}=n^{\alpha}$ for $0\leq\alpha\leq(r-2)/2(r-1)$ . (Note that this requires the cluster sizes to be bounded if $r=2$ .) It also allows for only a small number of clusters to grow. For example, suppose that $n_{g}=\overline{n}$ (bounded) for $G-K$ clusters and $n_{g}=G^{\alpha/2}$ for $K$ clusters, with $K$ fixed. Then Assumption 2 holds for any $\alpha<1$ and $r\geq 2$ .
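The two cluster-size conditions in Assumption 2 are easy to evaluate for a candidate design. The sketch below computes the left-hand sides of (5) and (6) for the heterogeneous example just described; the function names and the particular values $K=3$, $\overline{n}=5$, $\alpha=0.8$ are illustrative assumptions.

```python
def assumption2_quantities(sizes, r):
    """Return the left-hand sides of (5) and (6) for the given cluster sizes."""
    n = sum(sizes)
    lhs5 = sum(m ** r for m in sizes) ** (2.0 / r) / n
    lhs6 = max(sizes) ** 2 / n
    return lhs5, lhs6


def few_large_clusters(G, K, nbar, alpha):
    """Hypothetical design: G-K clusters of size nbar and K of size about G^(alpha/2)."""
    big = max(1, round(G ** (alpha / 2)))
    return [nbar] * (G - K) + [big] * K


for G in (10**2, 10**3, 10**4, 10**5):
    sizes = few_large_clusters(G, K=3, nbar=5, alpha=0.8)
    # (5) asks that the first number stay bounded; (6) that the second vanish.
    print(G, assumption2_quantities(sizes, r=2))
```

In this design the first quantity stays close to $\overline{n}$ while the second decays like $G^{\alpha-1}$, which is slow when $\alpha$ is near one.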

Assumption 2 (5) is implied by

$$\max_{g\leq G}\frac{n_{g}}{n^{\frac{r-2}{2(r-1)}}}\leq C \tag{10}$$

and they are equivalent when the cluster sizes are homogeneous. In general, however, (5) is less restrictive than (10). For example, when $r=2$ , (10) requires the cluster sizes to be bounded, while (5) does not. (Consider the heterogeneous example given in the previous paragraph. This satisfies (5) but not (10) when $r=2$ .)
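The implication from (10) to (5) can be spelled out in one line. As a sketch, bound each $n_{g}^{r}$ by $(\max_{h\leq G}n_{h})^{r-1}n_{g}$ and use $\sum_{g}n_{g}=n$:

```latex
\frac{\left(\sum_{g=1}^{G}n_{g}^{r}\right)^{2/r}}{n}
  \leq \frac{\left(\left(\max_{g\leq G}n_{g}\right)^{r-1}n\right)^{2/r}}{n}
  = \left(\max_{g\leq G}\frac{n_{g}}{n^{\frac{r-2}{2(r-1)}}}\right)^{\frac{2(r-1)}{r}}
  \leq C^{\frac{2(r-1)}{r}} .
```

When all clusters have the same size $m$, $\sum_{g}n_{g}^{r}=nm^{r-1}$ and the first inequality holds with equality, which is why the two conditions coincide in the homogeneous case.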

The condition (8) specifies that $\text{var}\left(\sqrt{n}\alpha^{\prime}\overline{X}_{n}\right)$ does not vanish for any conformable vector $\alpha\neq 0$. This excludes degenerate cases and perfect negative within-cluster correlation. In general, if $X_{i}$ is non-degenerate then (8) is not restrictive, as there is no reasonable setting where it will be violated. If $\overline{X}_{n}$ converges at rate $n^{-1/2}$ then $\lambda_{n}=O(1)$, but when $\overline{X}_{n}$ converges at a rate slower than $n^{-1/2}$ then $\lambda_{n}$ will actually diverge with $n$. Condition (8) also allows the components of $\Omega_{n}$ to converge at different rates.

Our proof of Theorem 2 actually uses the conditions

$$\frac{\left(\sum_{g=1}^{G}n_{g}^{r}\right)^{2/r}}{n\lambda_{n}}\leq C<\infty \tag{11}$$

and

$$\max_{g\leq G}\frac{n_{g}^{2}}{n\lambda_{n}}\rightarrow 0 \tag{12}$$

instead of (5)-(8). Conditions (11)-(12) are weaker than (5)-(8) when $\lambda_{n}$ diverges to infinity (which occurs when $\overline{X}_{n}$ converges at a rate slower than $n^{-1/2}$). Since the sequence $\lambda_{n}$ is unknown in an application, it is difficult to interpret the assumptions (11)-(12). Hence we prefer the assumptions (5)-(8).

The conditions (11)-(12) may be stronger than necessary when within-cluster dependence is weak, but are necessary under strong within-cluster dependence. To see this, suppose that all observations within a cluster are identical, so that $X_{gj}=Z_{g}$ and $Z_{g}$ has a finite variance but no higher moments. Then the Lindeberg condition for the CLT can be simplified to

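$$\sum_{g=1}^{G}\frac{n_{g}^{2}}{n\lambda_{n}}E\left(\left\|Z_{g}\right\|^{2}1\left(\left\|Z_{g}\right\|^{2}\geq\frac{n\lambda_{n}\varepsilon}{n_{g}^{2}}\right)\right)\rightarrow 0$$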

for all $\varepsilon>0$ . Each term in the sum must limit to zero, which requires (11)-(12) with $r=2$ .

We now compare our conditions with those of Djogbenou, MacKinnon, and Nielsen (2018). Their Assumption 3 states (in our notation) for $r\geq 4$

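$$\max_{g\leq G}\frac{n_{g}}{n^{\frac{r-2}{2(r-1)}}\lambda_{n}^{\frac{r}{2(r-1)}}}=o(1). \tag{13}$$

Equation (13) implies and is stronger than (11). Calculations similar to those in our appendix show that $\lambda_{n}\leq O\left(\max_{g}n_{g}\right)=O(n)$. So (13) also implies

$$\left(\max_{g\leq G}\frac{n_{g}^{2}}{n\lambda_{n}}\right)^{1/2}=\max_{g\leq G}\frac{n_{g}}{n^{\frac{r-2}{2(r-1)}}\lambda_{n}^{\frac{r}{2(r-1)}}}\left(\frac{\lambda_{n}}{n}\right)^{\frac{1}{2(r-1)}}=o(1)$$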

which is (12). Thus our conditions (11)-(12) are less restrictive than their condition (13), and do not require $r\geq 4$ .

6 Cluster-Robust Variance Matrix Estimation

We now discuss cluster-robust covariance matrix estimation.

We first consider the case where $X_{i}$ is mean zero (or equivalently that the mean is known). In this case the covariance matrix equals

$$\Omega_{n}=\frac{1}{n}\sum_{g=1}^{G}E\left(\widetilde{X}_{g}\widetilde{X}_{g}^{\prime}\right).$$

In this case a natural estimator is

[TABLE]

Theorem 3.

Under the assumptions of Theorem 2, if in addition $EX_{i}=0$ then as $n\rightarrow\infty$

[TABLE]

and

[TABLE]

Theorem 3 shows that the cluster-robust covariance matrix estimator is consistent, and replacing the covariance matrix in the CLT with the estimated covariance matrix does not affect the asymptotic distribution. Implications of (15) are that cluster-robust t-ratios are asymptotically standard normal, and that cluster-robust Wald statistics are asymptotically chi-square distributed with $p$ degrees of freedom.
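For scalar data the mean-zero estimator and the resulting t-ratio take only a few lines. The displayed formula for the estimator is not reproduced in this text, so the sketch below assumes it is the sample analog of $\Omega_{n}$, namely $n^{-1}\sum_{g}\widetilde{X}_{g}^{2}$ when $p=1$; the function names and the simulated data are hypothetical.

```python
import math
import random


def cluster_sums(x, cluster_ids):
    """Sum scalar observations within each cluster; returns {cluster id: sum}."""
    sums = {}
    for xi, g in zip(x, cluster_ids):
        sums[g] = sums.get(g, 0.0) + xi
    return sums


def omega_hat_known_mean(x, cluster_ids):
    """Assumed sample analog of Omega_n for mean-zero data: (1/n) sum_g (cluster sum)^2."""
    sums = cluster_sums(x, cluster_ids)
    return sum(s * s for s in sums.values()) / len(x)


def cluster_t_ratio(x, cluster_ids):
    """t-ratio sqrt(n) * mean / sqrt(omega_hat) for the hypothesis E X_i = 0."""
    n = len(x)
    xbar = sum(x) / n
    return math.sqrt(n) * xbar / math.sqrt(omega_hat_known_mean(x, cluster_ids))


# Illustrative data: 200 clusters of size 5 sharing a common cluster effect.
rng = random.Random(0)
x, ids = [], []
for g in range(200):
    effect = rng.gauss(0.0, 1.0)
    for _ in range(5):
        x.append(effect + rng.gauss(0.0, 1.0))
        ids.append(g)
print(omega_hat_known_mean(x, ids), cluster_t_ratio(x, ids))
```

The factors of $n$ cancel in the t-ratio, which equals $\sum_{g}\widetilde{X}_{g}/(\sum_{g}\widetilde{X}_{g}^{2})^{1/2}$, so the user does not need to know the rate of convergence to compute it.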

Construction of practical covariance matrix estimators is context-specific, depending on the mean structure. For example, suppose that $\mu=EX_{i}$ does not vary across observations. In this case we can write

[TABLE]

The natural estimator for $\mu$ is $\overline{X}_{n}$ and that for $\Omega_{n}$ is

[TABLE]

Theorem 4.

Under the assumptions of Theorem 2, if in addition $\mu=EX_{i}$ does not vary across observations, then as $n\rightarrow\infty$

[TABLE]

and

[TABLE]
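When the common mean is unknown, the cluster sums must be centered before squaring. The sketch below assumes the estimator centers each cluster sum at $n_{g}\overline{X}_{n}$, which is the natural plug-in form but is not reproduced in this text; it handles scalar data only and the function name is hypothetical.

```python
def omega_hat_centered(x, cluster_ids):
    """Assumed plug-in estimator: (1/n) sum_g (cluster sum - n_g * sample mean)^2."""
    n = len(x)
    xbar = sum(x) / n
    sums, counts = {}, {}
    for xi, g in zip(x, cluster_ids):
        sums[g] = sums.get(g, 0.0) + xi
        counts[g] = counts.get(g, 0) + 1
    # Center each cluster sum by its size times the overall sample mean.
    return sum((sums[g] - counts[g] * xbar) ** 2 for g in sums) / n


data = [1.0, 2.0, -1.0, 3.0]
groups = [0, 0, 1, 1]
print(omega_hat_centered(data, groups))
# Adding a constant to every observation leaves the centered estimator unchanged.
print(omega_hat_centered([v + 10.0 for v in data], groups))
```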

7 Uniform Laws of Large Numbers

Now consider a uniform WLLN. Consider functions $f(x,\theta)\in\mathbb{R}^{k}$ indexed on $\theta\in\Theta$ where $\Theta$ is compact. Define the sample mean

[TABLE]

The following result is an application of Theorem 3 of Andrews (1992).

Theorem 5.

(ULLN for clustered means). Suppose that Assumption 1 holds and for each $\theta\in\Theta$

[TABLE]

Suppose as well that for each $\theta_{1},\theta_{2}\in\Theta$

[TABLE]

where $h(u)\downarrow 0$ as $u\downarrow 0$ and $\sup_{i}EA(X_{i})\leq C$ . Then $E\overline{f}_{n}(\theta)$ is continuous in $\theta$ uniformly over $\theta\in\Theta$ and $n\geq 1$ , and as $n\rightarrow\infty$

[TABLE]

We also consider a uniform law for the clustered variance. Set $\mu(\theta)=Ef(X_{i},\theta)$ so that it does not vary across observations. The variance of $\sqrt{n}\overline{f}_{n}(\theta)$ is

[TABLE]

where $\widetilde{f}_{g}(\theta)=\sum_{j=1}^{n_{g}}f(X_{gj},\theta)$ are the cluster sums. An appropriate estimator for $\Omega_{n}(\theta)$ is

[TABLE]

In practice, a simpler estimator

[TABLE]

is often used if $\mu(\theta_{0})=0$ for $\theta_{0}\in\text{interior}\left(\Theta\right)$ and $\widehat{\theta}\xrightarrow{p}\theta_{0}$ for some estimator $\widehat{\theta}$.

The following result is an extension of Theorem 5 to the case of clustered variance estimators. It also relies on Theorem 3 of Andrews (1992).

Theorem 6.

(ULLN for clustered variance). Suppose that Assumption 2 holds with $r=2$ , $\mu(\theta)=Ef(X_{i},\theta)$ does not vary across $i$ , for each $\theta\in\Theta$ ,

[TABLE]

and for each $\theta_{1},\theta_{2}\in\Theta$ (19) holds with $\sup_{i}EA(X_{i})^{2}\leq C$ . Then as $n\rightarrow\infty$

[TABLE]

If $\mu(\theta)=0$ , then as $n\rightarrow\infty$

[TABLE]

8 Central Limit Theorem for Clustered Second Moments

Although our primary focus is the sample mean, the core theory can be extended to statistics which are not sample means. In this section, we focus on the vectorized variance estimators

[TABLE]

where

[TABLE]

or

[TABLE]

The WLLN for $\overline{f}_{G}$ holds by Theorem 3 (14) and Theorem 4 (16), and the ULLN for $\overline{f}_{G}$ holds by Theorem 6. However, the CLT given in Theorem 2 cannot be applied to $\overline{f}_{G}$ because $\overline{f}_{G}$ cannot be written as the sample mean over $i$ . We provide the CLT for $\overline{f}_{G}$ below. This is useful to establish asymptotic distributions of estimators in a non-standard setting. For example, the asymptotic distribution of the generalized method of moments (GMM) estimators depends on the limiting distribution of the weight matrix when the moment condition is misspecified (Hall and Inoue, 2003; Lee, 2014; Hansen and Lee, 2018).

Similar to the sample mean, the convergence rate of $\overline{f}_{G}$ can vary under cluster dependence. Consider $\widetilde{f}_{g}=\widetilde{X}_{g}\otimes\widetilde{X}_{g}$ and assume $p=1$ for simplicity. The standard deviation of $\overline{f}_{G}$ is

[TABLE]

Under i.i.d. sampling $\text{sd}\left(\overline{f}_{G}\right)=O\left(n^{-1/2}\right)$. Under Examples 1 and 2 of Section 4, the convergence rate is $G^{-1/2}$.
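To spell out the i.i.d. rate, a short worked step follows. It uses only independence across clusters and, as an added assumption for the final equality, identically distributed $X_{i}$ with finite fourth moment.

```latex
\operatorname{var}\left(\overline{f}_{G}\right)
=\frac{1}{n^{2}}\sum_{g=1}^{G}\operatorname{var}\left(\widetilde{X}_{g}^{2}\right),
\qquad
n_{g}=1,\ G=n
\ \Longrightarrow\
\operatorname{sd}\left(\overline{f}_{G}\right)
=\frac{1}{\sqrt{n}}\operatorname{sd}\left(X_{i}^{2}\right)
=O\left(n^{-1/2}\right).
```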

Define the variance matrix of $\sqrt{n}\overline{f}_{G}$ as

[TABLE]

We use the following regularity condition.

Assumption 3.

For some $2\leq r<\infty$

[TABLE]

as $n\rightarrow\infty$ .

Note that Assumption 3 is a strengthening of Assumption 2.

Theorem 7.

(CLT for clustered variance). If for some $2\leq r<\infty$ Assumption 3 holds,

[TABLE]

and

[TABLE]

then as $n\rightarrow\infty$

[TABLE]

where $q=p^{2}$ .

Finally we provide a CLT combining the previous results. For $Y_{i}\in\mathbb{R}^{s}$ , $i=1,...,n$ , obtained by cluster sampling, let $\widetilde{\psi}_{g}$ be the stacked vector

[TABLE]

or

[TABLE]

and $\overline{\psi}_{G}=n^{-1}\sum_{g=1}^{G}\widetilde{\psi}_{g}$ . Let the variance matrix of $\sqrt{n}\overline{\psi}_{G}$ be

[TABLE]

The following Corollary provides the CLT for the joint process. Since it immediately follows from Theorems 2 and 7, the proof is omitted.

Corollary 1.

If for some $2\leq r<\infty$ Assumption 3 holds,

[TABLE]

and

[TABLE]

then as $n\rightarrow\infty$

[TABLE]

where $q=s+p+p^{2}$ .

9 Linear Regression and Two-Stage Least Squares

It is useful to use cluster-level notation. Let $\boldsymbol{y}_{g}=(y_{g1},...,y_{gn_{g}})^{\prime}$ , $\boldsymbol{X}_{g}=(\boldsymbol{x}_{g1},...,\boldsymbol{x}_{gn_{g}})^{\prime}$ and $\boldsymbol{Z}_{g}=(\boldsymbol{z}_{g1},...,\boldsymbol{z}_{gn_{g}})^{\prime}$ denote an $n_{g}\times 1$ vector of dependent variables, $n_{g}\times k$ matrix of regressors, and $n_{g}\times l$ matrix of instruments for the $g^{th}$ cluster. A linear model can be written using cluster notation as

[TABLE]

where $\boldsymbol{e}_{g}$ is an $n_{g}\times 1$ error vector. Equation (29) is the structural equation and (30) is the first-stage equation. Assume $l\geq k$. Linear regression is the special case where $\boldsymbol{Z}_{g}=\boldsymbol{X}_{g}$ and $l=k$ (so that (30) becomes an identity).

The two-stage least squares (2SLS) estimator for $\boldsymbol{\beta}$ can be written as

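$$\widehat{\boldsymbol{\beta}}=\left(\left(\sum_{g=1}^{G}\boldsymbol{X}_{g}^{\prime}\boldsymbol{Z}_{g}\right)\left(\sum_{g=1}^{G}\boldsymbol{Z}_{g}^{\prime}\boldsymbol{Z}_{g}\right)^{-1}\left(\sum_{g=1}^{G}\boldsymbol{Z}_{g}^{\prime}\boldsymbol{X}_{g}\right)\right)^{-1}\left(\sum_{g=1}^{G}\boldsymbol{X}_{g}^{\prime}\boldsymbol{Z}_{g}\right)\left(\sum_{g=1}^{G}\boldsymbol{Z}_{g}^{\prime}\boldsymbol{Z}_{g}\right)^{-1}\left(\sum_{g=1}^{G}\boldsymbol{Z}_{g}^{\prime}\boldsymbol{y}_{g}\right).$$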

We first show consistency of $\widehat{\boldsymbol{\beta}}$ . Define

[TABLE]

Theorem 8.

If Assumption 1 holds, $Q_{n}$ has full rank $k$, $\lambda_{\min}(W_{n})\geq C>0$, and either

1. $(y_{i},\boldsymbol{x}_{i},\boldsymbol{z}_{i})$ have identical marginal distributions with finite second moments; or

2. for some $r>2$, $\sup_{i}E\left|y_{i}\right|^{r}<\infty$, $\sup_{i}E\left\|\boldsymbol{x}_{i}\right\|^{r}<\infty$, and $\sup_{i}E\left\|\boldsymbol{z}_{i}\right\|^{r}<\infty$;

then as $n\rightarrow\infty$, $\widehat{\boldsymbol{\beta}}\xrightarrow{p}\boldsymbol{\beta}.$

Next we provide the asymptotic distribution. Define

[TABLE]

The residuals for the $g^{th}$ cluster are

[TABLE]

Define

[TABLE]

The variance estimator is

[TABLE]

with $d_{n}$ a possible finite-sample degree-of-freedom adjustment. For example, C. Hansen (2007) proposed $d_{n}=G/(G-1)$ for the regression case (under homogeneous cluster sizes), and Stata sets

$$d_{n}=\left(\frac{n-1}{n-k}\right)\left(\frac{G}{G-1}\right)$$

for the OLS and 2SLS estimators under the cluster option.

Theorem 9.

Suppose that Assumption 2 holds for some $2\leq r\leq s<\infty$, $Q_{n}$ has full rank $k$, $\lambda_{\min}(W_{n})\geq C>0$, $\lambda_{\min}(\Omega_{n})\geq\lambda>0$, $\sup_{i}E\left|y_{i}\right|^{2s}<\infty$, $\sup_{i}E\left\|\boldsymbol{x}_{i}\right\|^{2s}<\infty$, and $\sup_{i}E\left\|\boldsymbol{z}_{i}\right\|^{2s}<\infty$, and either

1. $(y_{i},\boldsymbol{x}_{i},\boldsymbol{z}_{i})$ have identical marginal distributions; or

2. $r<s$;

then, for any sequence of full-rank $k\times q$ matrices $R_{n}$, as $n\rightarrow\infty$

[TABLE]

and

[TABLE]

The standard errors for $R_{n}^{\prime}\widehat{\boldsymbol{\beta}}$ can be obtained by taking the square roots of the diagonal elements of $n^{-1}R_{n}^{\prime}\widehat{V}_{n}R_{n}$ .
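The estimator and its standard errors can be sketched in a few lines of Python. This is a minimal illustration under assumptions: it uses the standard sandwich form $(\widehat{Q}^{\prime}\widehat{W}^{-1}\widehat{Q})^{-1}\widehat{Q}^{\prime}\widehat{W}^{-1}\widehat{\Omega}\widehat{W}^{-1}\widehat{Q}(\widehat{Q}^{\prime}\widehat{W}^{-1}\widehat{Q})^{-1}$ with $\widehat{Q}=n^{-1}\sum_{g}\boldsymbol{Z}_{g}^{\prime}\boldsymbol{X}_{g}$, $\widehat{W}=n^{-1}\sum_{g}\boldsymbol{Z}_{g}^{\prime}\boldsymbol{Z}_{g}$ and $\widehat{\Omega}=d_{n}n^{-1}\sum_{g}\boldsymbol{Z}_{g}^{\prime}\widehat{\boldsymbol{e}}_{g}\widehat{\boldsymbol{e}}_{g}^{\prime}\boldsymbol{Z}_{g}$, which may differ in detail from the displays above. The function name `cluster_2sls` is hypothetical.

```python
import numpy as np

def cluster_2sls(y, X, Z, cluster_ids, d_n=1.0):
    """2SLS estimate with cluster-robust variance and standard errors.

    y : (n,), X : (n, k) regressors, Z : (n, l) instruments with l >= k.
    d_n is an optional finite-sample adjustment (1.0 means none).
    """
    y = np.asarray(y, dtype=float)
    X = np.asarray(X, dtype=float)
    Z = np.asarray(Z, dtype=float)
    n = y.shape[0]
    Q = Z.T @ X / n                       # l x k
    W = Z.T @ Z / n                       # l x l
    A = np.linalg.solve(W, Q)             # W^{-1} Q
    bread = np.linalg.inv(Q.T @ A)        # (Q' W^{-1} Q)^{-1}
    beta = bread @ (A.T @ (Z.T @ y / n))  # 2SLS point estimate
    resid = y - X @ beta
    # Cluster sums of z_i * e_i, i.e. Z_g' e_g, stacked as (G, l)
    _, inverse = np.unique(cluster_ids, return_inverse=True)
    inverse = inverse.ravel()
    scores = np.zeros((inverse.max() + 1, Z.shape[1]))
    np.add.at(scores, inverse, Z * resid[:, None])
    omega = d_n * scores.T @ scores / n
    V = bread @ A.T @ omega @ A @ bread   # variance of sqrt(n) * (beta_hat - beta)
    se = np.sqrt(np.diag(V) / n)          # standard errors of beta_hat
    return beta, V, se
```

Passing `X` as `Z` gives OLS with clustered standard errors, and setting `d_n = G / (G - 1)` applies the adjustment attributed to C. Hansen (2007) above.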

10 (Pseudo) Maximum Likelihood

Suppose that we observe a sequence of random vectors $X_{i}\in\mathbb{R}^{p}$, $i=1,...,n$, with the same marginal distribution, which has density $f(x,\boldsymbol{\theta})$ where $\boldsymbol{\theta}\in\mathbb{R}^{k}$. Let $\boldsymbol{X}_{g}=(X_{g1},...,X_{gn_{g}})^{\prime}$ be an $n_{g}\times p$ matrix for each cluster. For the observations in cluster $g$, let $f_{g}(\boldsymbol{X}_{g},\boldsymbol{\theta}_{0})$ be the joint density. Since the observations within the same cluster need not be independent, $f_{g}(\boldsymbol{X}_{g},\boldsymbol{\theta}_{0})\neq\prod_{i=1}^{n_{g}}f(X_{gi},\boldsymbol{\theta}_{0})$ in general. The joint densities may also differ across clusters: in general $f_{g}(\boldsymbol{X}_{g},\boldsymbol{\theta}_{0})\neq f_{h}(\boldsymbol{X}_{h},\boldsymbol{\theta}_{0})$ for $g\neq h$. Given a specification of $f_{g}(\boldsymbol{X}_{g},\boldsymbol{\theta}_{0})$, the maximum likelihood estimator (MLE) can be obtained as the maximizer of

[TABLE]

However, the joint density $f_{g}(\boldsymbol{X}_{g},\boldsymbol{\theta})$ may be difficult to specify in practice. A simpler alternative is to use a pseudo-likelihood $\prod_{i=1}^{n_{g}}f(X_{gi},\boldsymbol{\theta}_{0})$ for the joint density $f_{g}(\boldsymbol{X}_{g},\boldsymbol{\theta}_{0})$ , and specify the log likelihood function as

[TABLE]

Define the pseudo-MLE as

[TABLE]

This estimator is also called the partial (or pooled) MLE (Wooldridge, 2010).

This estimator is the standard implementation of MLE under clustered dependence. To our knowledge there is no existing distribution theory for this standard estimator.

We first show consistency of $\widehat{\boldsymbol{\theta}}$ . The following is based on Theorem 2.1 of Newey and McFadden (1994).

Theorem 10.

If Assumption 1 holds,

1. $X_{i}$ have identical marginal distributions with the density $f(x,\boldsymbol{\theta}_{0})$ and $\boldsymbol{\theta}_{0}\in\boldsymbol{\Theta}$, which is compact,

2. if $\boldsymbol{\theta}\neq\boldsymbol{\theta}_{0}$ then $f(x,\boldsymbol{\theta})\neq f(x,\boldsymbol{\theta}_{0})$,

3. $E[\sup_{\boldsymbol{\theta}\in\Theta}|\log f(X_{i},\boldsymbol{\theta})|]<\infty$,

4. for each $\boldsymbol{\theta}_{1},\boldsymbol{\theta}_{2}\in\Theta$,

[TABLE]

where $h(u)\downarrow 0$ as $u\downarrow 0$ and $EA(X_{i})\leq C$,

then as $n\rightarrow\infty$, $\widehat{\boldsymbol{\theta}}\xrightarrow{p}\boldsymbol{\theta}_{0}.$

Next we show the asymptotic distribution. Define

[TABLE]

Define the sample versions

[TABLE]

The variance estimator is

[TABLE]

Note that the information matrix equality does not hold because $\sum_{j=1}^{n_{g}}\log f(X_{gj},\boldsymbol{\theta}_{0})\neq\log f_{g}(\boldsymbol{X}_{g},\boldsymbol{\theta}_{0})$ in general.

Theorem 11.

In addition to the assumptions of Theorem 10, if Assumption 2 holds with $r=2$,

1. $\boldsymbol{\theta}_{0}\in\text{interior}(\boldsymbol{\Theta})$,

2. for some neighborhood $\mathcal{N}$ of $\boldsymbol{\theta}_{0}$,

(a) $f(x,\boldsymbol{\theta})$ is twice continuously differentiable and $f(x,\boldsymbol{\theta})>0$,

(b) $\int\sup_{\boldsymbol{\theta}\in\mathcal{N}}\left\|\frac{\partial}{\partial\boldsymbol{\theta}}\log f(x,\boldsymbol{\theta})\right\|dx<\infty$,

(c) $E\left\|\frac{\partial}{\partial\boldsymbol{\theta}}\log f(X_{i},\boldsymbol{\theta})\right\|^{2}<\infty$,

(d) $E\sup_{\boldsymbol{\theta}\in\mathcal{N}}\left\|\frac{\partial^{2}}{\partial\boldsymbol{\theta}\partial\boldsymbol{\theta}^{\prime}}\log f(X_{i},\boldsymbol{\theta})\right\|^{2}<\infty$,

(e) for each $\boldsymbol{\theta}_{1},\boldsymbol{\theta}_{2}\in\mathcal{N}$,

[TABLE]

where $h(u)\downarrow 0$ as $u\downarrow 0$ and $EA(X_{i})\leq C$,

3. $\lambda_{\min}(H_{n}(\boldsymbol{\theta}_{0}))\geq C>0$,

4. $\lambda_{\min}(\Omega_{n}(\boldsymbol{\theta}_{0}))\geq\lambda>0$,

then for any sequence of full-rank $k\times q$ matrices $R_{n}$, as $n\rightarrow\infty$

[TABLE]

and

[TABLE]

The standard errors for $R_{n}^{\prime}\widehat{\boldsymbol{\theta}}$ can be obtained by taking the square roots of the diagonal elements of $n^{-1}R_{n}^{\prime}\widehat{V}_{n}R_{n}$.
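Because the information matrix equality fails, the variance of the pooled estimator has a sandwich form. The sketch below illustrates this for a deliberately simple case that is not taken from the paper: a Poisson marginal density with a scalar rate $\theta$, where the pooled MLE is the sample mean. The sandwich $H^{-1}\Omega H^{-1}$ with a clustered score term is assumed to correspond to the variance estimator above, and the function name `poisson_pooled_mle` is hypothetical.

```python
import numpy as np

def poisson_pooled_mle(x, cluster_ids):
    """Pooled (pseudo) MLE of a Poisson rate with a cluster-robust variance."""
    x = np.asarray(x, dtype=float)
    n = x.size
    theta = x.mean()                  # maximizes sum_i log f(x_i, theta)
    score = x / theta - 1.0           # d/dtheta log f(x_i, theta) at theta_hat
    hess = np.mean(x) / theta**2      # minus the average second derivative
    _, inverse = np.unique(cluster_ids, return_inverse=True)
    # Sum the scores within each cluster before squaring
    cluster_scores = np.bincount(inverse.ravel(), weights=score)
    omega = np.sum(cluster_scores**2) / n
    V = omega / hess**2               # H^{-1} Omega H^{-1} for scalar theta
    se = np.sqrt(V / n)
    return theta, V, se

theta_hat, V_cluster, se_cluster = poisson_pooled_mle([1, 1, 3, 3], [0, 0, 1, 1])
_, V_single, _ = poisson_pooled_mle([1, 1, 3, 3], [0, 1, 2, 3])
```

In this toy sample the within-cluster scores move together, so the clustered variance estimate is twice the one obtained when every observation is treated as its own cluster.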

11 Generalized Method of Moments

Suppose that we observe a sequence of random vectors $X_{i}\in\mathbb{R}^{p}$ , $i=1,...,n$ from cluster sampling. A known moment function is given by $m(X_{i},\boldsymbol{\theta})$ where $m(\cdot,\cdot)$ is $l\times 1$ and $\boldsymbol{\theta}$ is $k\times 1$ . Define the cluster sum as

$$\widetilde{m}_{g}(\boldsymbol{\theta})=\sum_{j=1}^{n_{g}}m(X_{gj},\boldsymbol{\theta}).$$

An unconditional moment model in cluster notation is given by

$$E\widetilde{m}_{g}(\boldsymbol{\theta}_{0})=0,\qquad g=1,...,G.\qquad(37)$$

We assume that $\boldsymbol{\theta}_{0}$ is identified and $l>k$ so the moment model is over-identified. Write the sample mean of the moment function as

$$\overline{m}_{n}(\boldsymbol{\theta})=\frac{1}{n}\sum_{g=1}^{G}\widetilde{m}_{g}(\boldsymbol{\theta}).$$

Since (37) holds for all $g=1,...,G$ , the usual unconditional moment condition $E\overline{m}_{n}(\boldsymbol{\theta}_{0})=0$ follows. The generalized method of moments (GMM) estimator is given by

[TABLE]

where $\widehat{W}_{n}^{-1}$ is an $l\times l$ positive definite weight matrix, which may or may not depend on an estimated parameter. Typically, the weight matrix is obtained by plugging in a preliminary consistent estimator, $\widetilde{\boldsymbol{\theta}}$ , so that $\widehat{W}_{n}^{-1}=\widehat{W}_{n}(\widetilde{\boldsymbol{\theta}})^{-1}$ .

We consider two forms of GMM estimator. The first one is based on a non-clustered weight matrix, which takes the form of

[TABLE]

for some $l\times 1$ vector $v(x,\boldsymbol{\theta})$ . This includes the conventional one-step and two-step GMM estimators. For 2SLS, $v(X_{i},\boldsymbol{\theta})=Z_{i}$ where $Z_{i}$ is an $l\times 1$ vector of instruments. The efficient two-step GMM uses $v(X_{i},\boldsymbol{\theta})=m(X_{i},\boldsymbol{\theta})$ or $v(X_{i},\boldsymbol{\theta})=m(X_{i},\boldsymbol{\theta})-\overline{m}_{n}(\boldsymbol{\theta})$ . The conventional efficient weight matrix, however, does not provide efficiency anymore under cluster sampling because a weight matrix of the form of (39) is not consistent for the variance matrix of $\sqrt{n}(\overline{m}_{n}(\boldsymbol{\theta})-E\overline{m}_{n}(\boldsymbol{\theta}))$ in general.

The second is based on the clustered efficient weight matrix, which leads to the two-step efficient GMM under cluster sampling. The weight matrix takes the form of

[TABLE]

Alternatively, the uncentered version of $\widehat{W}_{n}(\boldsymbol{\theta})$ and $\widehat{\Omega}_{n}(\boldsymbol{\theta})$ can be used to obtain the efficient two-step GMM estimator but the centered version is generally recommended. For more discussion, see Hansen (2018).

Since we assume that the weight matrix depends on a consistent preliminary estimator, we exclude the continuously updating (CU) GMM estimator in our analysis. Whenever possible, we omit the dependence of the weight matrices on $\widetilde{\boldsymbol{\theta}}$ and write $\widehat{W}_{n}=\widehat{W}_{n}(\widetilde{\boldsymbol{\theta}})$ . Define $W_{n}=E\widehat{W}_{n}(\boldsymbol{\theta}_{0})$ .

We first show consistency of the GMM estimator. The following is based on Theorem 2.1 of Newey and McFadden (1994).

Theorem 12.

If Assumption 1 holds,

1. $\Theta$ is compact,

2. $\boldsymbol{\theta}_{0}$ is the unique solution to $E\overline{m}_{n}(\boldsymbol{\theta})=0$,

3. for each $\boldsymbol{\theta}\in\boldsymbol{\Theta}$, either $X_{i}$ have identical marginal distributions with $E\left\|m(X_{i},\boldsymbol{\theta})\right\|<\infty$, or $\sup_{i}E\left\|m(X_{i},\boldsymbol{\theta})\right\|^{r}<\infty$ for some $r>1$,

4. for each $\boldsymbol{\theta}_{1},\boldsymbol{\theta}_{2}\in\Theta$

[TABLE]

where $h(u)\downarrow 0$ as $u\downarrow 0$ and $EA(X_{i})\leq C$,

5. $\lambda_{\min}(W_{n})\geq C>0$,

6. $\widehat{W}_{n}^{-1}-W_{n}^{-1}\xrightarrow{p}0$,

then as $n\rightarrow\infty$, $\widehat{\boldsymbol{\theta}}\xrightarrow{p}\boldsymbol{\theta}_{0}.$

Primitive conditions under which Condition 6 of Theorem 12 holds can be found given the choice of the weight matrix. For simplicity, we assume that if the conventional weight matrix is used then either $v(X_{i},\boldsymbol{\theta})=m(X_{i},\boldsymbol{\theta})$ or $v(X_{i},\boldsymbol{\theta})=m(X_{i},\boldsymbol{\theta})-\overline{m}_{n}(\boldsymbol{\theta})$ . If the clustered weight matrix is used then it takes the form of (40). The conditions of Theorem 13 are sufficient for Condition 6 of Theorem 12 to hold.

To show the asymptotic distribution of the GMM estimator, define

[TABLE]

where $Q_{n}=Q_{n}(\boldsymbol{\theta}_{0})$ and $\Omega_{n}=\Omega_{n}(\boldsymbol{\theta}_{0})$ . If the clustered efficient weight matrix (40) is used, then the asymptotic variance matrix simplifies to

[TABLE]

Define the sample versions as

[TABLE]

and let $\widehat{Q}_{n}=\widehat{Q}_{n}(\widehat{\boldsymbol{\theta}})$ and $\widehat{\Omega}_{n}=\widehat{\Omega}_{n}(\widehat{\boldsymbol{\theta}})$ . The variance estimator is

[TABLE]

if $\widehat{W}_{n}$ is given by (39) and

[TABLE]

if $\widehat{W}_{n}$ is given by (40), i.e., $\widehat{W}_{n}=\widehat{\Omega}_{n}$ .

The over-identifying restrictions test (the J test, hereinafter) is a test based on the GMM criterion to test whether the moment model is correctly specified or not, i.e., $E\widetilde{m}_{g}(\boldsymbol{\theta}_{0})=0$ . An implication of cluster sampling is that the conventional J test statistic will not have a standard chi-square asymptotic distribution because the conventional efficient weight matrix is not consistent for the inverse of the variance matrix of the moment function. The GMM criterion (38) based on the clustered efficient weight matrix (40) evaluated at the estimator is the robust J test statistic. Define

[TABLE]
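The two-step procedure and the robust J statistic can be illustrated for linear moments $m(X_{i},\boldsymbol{\theta})=\boldsymbol{z}_{i}(y_{i}-\boldsymbol{x}_{i}^{\prime}\boldsymbol{\theta})$. The sketch below is an assumption-laden example rather than the paper's implementation: it takes the first step to be 2SLS, uses the centered clustered weight matrix evaluated at the first-step estimate, and assumes the statistic is $n\,\overline{m}_{n}(\widehat{\boldsymbol{\theta}})^{\prime}\widehat{W}_{n}^{-1}\overline{m}_{n}(\widehat{\boldsymbol{\theta}})$. The names `cluster_sums` and `two_step_gmm_linear` are hypothetical.

```python
import numpy as np

def cluster_sums(a, inverse, G):
    """Sum the rows of a within clusters; returns a (G, l) array."""
    out = np.zeros((G, a.shape[1]))
    np.add.at(out, inverse, a)
    return out

def two_step_gmm_linear(y, X, Z, cluster_ids):
    """Two-step GMM for linear moments with a clustered weight matrix."""
    y, X, Z = (np.asarray(a, dtype=float) for a in (y, X, Z))
    n = y.size
    _, inverse = np.unique(cluster_ids, return_inverse=True)
    inverse = inverse.ravel()
    G = inverse.max() + 1
    n_g = np.bincount(inverse, minlength=G).astype(float)
    Szx, Szy = Z.T @ X / n, Z.T @ y / n

    def solve(W):
        # Minimizer of mbar(theta)' W^{-1} mbar(theta) for linear moments
        A = np.linalg.solve(W, Szx)
        return np.linalg.solve(Szx.T @ A, A.T @ Szy)

    theta1 = solve(Z.T @ Z / n)                 # first step: 2SLS weights
    m1 = Z * (y - X @ theta1)[:, None]          # moment contributions at theta1
    centered = cluster_sums(m1, inverse, G) - n_g[:, None] * m1.mean(axis=0)
    W = centered.T @ centered / n               # clustered, centered weight matrix
    theta2 = solve(W)                           # second step
    mbar = (Z * (y - X @ theta2)[:, None]).mean(axis=0)
    J = n * mbar @ np.linalg.solve(W, mbar)     # robust J statistic
    return theta2, J
```

The weight matrix is only invertible when the number of clusters exceeds the number of moments, since the centered cluster sums have rank at most $G-1$.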

Theorem 13.

In addition to the assumptions of Theorem 12, if Assumption 2 holds with $r=2$,

1. $\boldsymbol{\theta}_{0}\in\text{interior}(\boldsymbol{\Theta})$,

2. for some neighborhood $\mathcal{N}$ of $\boldsymbol{\theta}_{0}$,

(a) $m(X_{i},\boldsymbol{\theta})$ is continuously differentiable with probability approaching one,

(b) either $X_{i}$ have identical marginal distributions with $E\sup_{\boldsymbol{\theta}\in\mathcal{N}}\left\|m(X_{i},\boldsymbol{\theta})\right\|^{2}<\infty$; or $E\sup_{i}\sup_{\boldsymbol{\theta}\in\mathcal{N}}\left\|m(X_{i},\boldsymbol{\theta})\right\|^{r}<\infty$ for some $r>2$,

(c) $E\sup_{i}\sup_{\boldsymbol{\theta}\in\mathcal{N}}\left\|\frac{\partial}{\partial\boldsymbol{\theta}^{\prime}}m(X_{i},\boldsymbol{\theta})\right\|^{2}<\infty$,

(d) for each $\boldsymbol{\theta}_{1},\boldsymbol{\theta}_{2}\in\mathcal{N}$

[TABLE]

where $h(u)\downarrow 0$ as $u\downarrow 0$ and $\sup_{i}EA(X_{i})\leq C$,

3. $\lambda_{\min}(W_{n}(\boldsymbol{\theta}_{0}))\geq C>0$,

4. $\lambda_{\min}(\Omega_{n}(\boldsymbol{\theta}_{0}))\geq\lambda>0$,

5. $Q_{n}$ is full column rank,

then for any sequence of full-rank $k\times q$ matrices $R_{n}$, as $n\rightarrow\infty$

[TABLE]

and

[TABLE]

The standard errors for $R_{n}^{\prime}\widehat{\boldsymbol{\theta}}$ can be obtained by taking the square roots of the diagonal elements of $n^{-1}R_{n}^{\prime}\widehat{V}_{n}R_{n}$.

12 Appendix

We start with a useful technical result which states that if random variables are uniformly integrable then so are their cluster averages, regardless of their joint dependence.

Lemma 1.

For random vectors $X_{i}$ set $\widetilde{X}_{m}=\sum_{i=1}^{m}X_{i}$ . For $r\geq 1$ , if

[TABLE]

then

[TABLE]

Proof of Lemma 1: The proof is based on the proof of Theorem 1 of Etemadi (2006). Equation (45) implies that $\sup_{i}E\left\|X_{i}\right\|^{r}\leq C$ for some $C<\infty$ . By the $C_{r}$ inequality

[TABLE]

and hence

[TABLE]

Fix $\varepsilon>0$ . Find $B\geq(C/\varepsilon)^{2/r}$ sufficiently large such that

[TABLE]

which is feasible under (45). Using (47),

[TABLE]

by (49), Markov’s inequality, (48), and $B^{r/2}\geq C/\varepsilon$ . Since $\varepsilon$ is arbitrary this implies (46). $\blacksquare$
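For readers reconstructing the argument, the moment bound behind the $C_{r}$ step can be written out explicitly. The display below is a sketch of the standard inequality for $r\geq 1$, with $C$ the uniform bound on $E\left\|X_{i}\right\|^{r}$; its correspondence to the numbered displays referenced in the proof is an assumption.

```latex
\left\|\widetilde{X}_{m}\right\|^{r}
=\left\|\sum_{i=1}^{m}X_{i}\right\|^{r}
\leq m^{r-1}\sum_{i=1}^{m}\left\|X_{i}\right\|^{r},
\qquad\text{so}\qquad
E\left\|\frac{1}{m}\widetilde{X}_{m}\right\|^{r}
\leq\frac{1}{m}\sum_{i=1}^{m}E\left\|X_{i}\right\|^{r}
\leq C .
```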

The next Lemma is useful for establishing the WLLN and CLT for the vectorized clustered second moments.

Lemma 2.

For random vectors $X_{i}$ set $\widetilde{X}_{m}=\sum_{i=1}^{m}X_{i}$ and $\widetilde{f}_{m}=\widetilde{X}_{m}\otimes\widetilde{X}_{m}$ or $\widetilde{f}_{m}=\left(\widetilde{X}_{m}-m\overline{X}_{n}\right)\otimes\left(\widetilde{X}_{m}-m\overline{X}_{n}\right)$ where $\overline{X}_{n}=n^{-1}\sum_{i=1}^{n}X_{i}$ . For $r\geq 2$ , if (45) holds then

[TABLE]

Proof of Lemma 2: The proof proceeds similarly to that of Lemma 1. First consider $\widetilde{f}_{m}=\widetilde{X}_{m}\otimes\widetilde{X}_{m}$. By the triangle inequality, the $C_{r}$ inequality, the fact that $\|\widetilde{X}_{m}\otimes\widetilde{X}_{m}\|^{r/2}=\|\widetilde{X}_{m}\|^{r}$, and (48),

[TABLE]

Fix $\varepsilon>0$ . Find $B\geq\left(2^{r-2}C(1+\sqrt{1+2^{3-r}\varepsilon})/\varepsilon\right)^{4/r}$ sufficiently large such that

[TABLE]

which is feasible under (45). Using (51) and (47),

[TABLE]

by (52), Markov’s inequality, (48), and $2^{r-1}(B^{r/4}+C)C/B^{r/2}\leq\varepsilon$ using the discriminant. Since $\varepsilon$ is arbitrary this implies (50).

Now consider $\widetilde{f}_{m}=\left(\widetilde{X}_{m}-m\overline{X}_{n}\right)\otimes\left(\widetilde{X}_{m}-m\overline{X}_{n}\right)$ . By Minkowski’s inequality, the $C_{r}$ inequality, (47), and (48),

[TABLE]

and

[TABLE]

Given $\varepsilon$ , find $B\geq\left(2^{3r-2}C(1+\sqrt{1+2^{3(1-r)}\varepsilon})/\varepsilon\right)^{4/r}$ sufficiently large such that

[TABLE]

and proceed as above to show (50). This completes the proof. $\blacksquare$

Proof of Theorem 1: Without loss of generality assume $EX_{i}=0$ . Fix $\varepsilon>0$ . Pick $B$ sufficiently large so that

[TABLE]

which is feasible by Lemma 1 with $r=1$ under (2). Using the triangle inequality, Jensen’s inequality and (53),

[TABLE]

The equality uses the assumption that the clusters are independent and thus uncorrelated and the fact $\sum_{g=1}^{G}n_{g}=n$ . The third inequality uses the bound

[TABLE]

The fourth inequality is (4). Since $\varepsilon$ is arbitrary, $E\left\|\overline{X}_{n}\right\|\rightarrow 0$ . By Markov’s inequality, (3) follows. $\blacksquare$

Proof of Theorem 2: Without loss of generality we assume $EX_{i}=0.$ Note that

[TABLE]

We apply the multivariate Lindeberg-Feller central limit theorem (e.g. Hansen (2018) Theorem 6.15) since $\widetilde{X}_{g}$ are independent but not identically distributed. A sufficient condition for the CLT (9) is that for all $\varepsilon>0$

[TABLE]

as $n\rightarrow\infty$ .

Fix $\varepsilon>0$ and $\delta>0$ . Pick $B$ sufficiently large so that

[TABLE]

which is feasible by Lemma 1 under (7). Pick $n$ large enough so that

[TABLE]

which is feasible by (12). Thus

[TABLE]

The second inequality is (56), the third is (55), and the final is (11). Since $\varepsilon$ and $\delta$ are arbitrary we have established (54) and hence (9). $\blacksquare$

Proof of Theorem 3: Fix $\delta>0$ . Set $\varepsilon=\delta^{2}/4p$ . Define $\widetilde{X}_{g}^{\ast}=\Omega_{n}^{-1/2}\widetilde{X}_{g}$ and $\widetilde{Y}_{g}=\widetilde{X}_{g}^{\ast}1\left(\left\|\widetilde{X}_{g}^{\ast}\right\|^{2}\leq n\varepsilon\right)$ . Then

[TABLE]

By the triangle inequality,

[TABLE]

An argument similar to (57) shows that for $n$ sufficiently large (59) is bounded by $2\delta$ . We now consider (58).

Using Jensen’s inequality, the assumption that the clusters are independent and thus uncorrelated, and the triangle inequality, (58) is bounded by

[TABLE]

Using the bounds $\left\|\widetilde{Y}_{g}\widetilde{Y}_{g}^{\prime}\right\|\leq n\varepsilon$ and $\left\|\widetilde{Y}_{g}\widetilde{Y}_{g}^{\prime}\right\|\leq\left\|\widetilde{X}_{g}^{\ast}\right\|^{2}$ , we deduce $\left\|\widetilde{Y}_{g}\widetilde{Y}_{g}^{\prime}\right\|^{2}\leq n\varepsilon\left\|\widetilde{X}_{g}^{\ast}\right\|^{2}$ . Thus (60) is bounded by

[TABLE]

The first equality holds because $\widetilde{X}_{g}^{\ast}$ are independent and mean zero, and the second and third use the definition of $\overline{X}_{n}^{\ast}$ . The final equality is $\varepsilon=\delta^{2}/4p$ .

Together, we have shown that for $n$ sufficiently large,

[TABLE]

and hence (14) by Markov’s inequality.

By the continuous mapping theorem

[TABLE]

and

[TABLE]

Combined with Theorem 2 we find

[TABLE]

This is (15). $\blacksquare$

Proof of Theorem 4: Since the estimator $\widehat{\Omega}_{n}$ is invariant to $\mu$ , without loss of generality we assume $\mu=0$ . In this case

[TABLE]

Then by the triangle inequality, Theorem 3, Theorem 2, and (6),

[TABLE]

This is (16). Equation (17) follows as in the proof of (15). $\blacksquare$

Proof of Theorem 5: Define the cluster sums $\widetilde{f}_{g}(\theta)=\sum_{i=1}^{n_{g}}f(X_{gi},\theta)$ so that $\overline{f}_{n}(\theta)=\frac{1}{n}\sum_{g=1}^{G}\widetilde{f}_{g}(\theta)$ where $\widetilde{f}_{g}(\theta)$ are mutually independent.

Andrews (1992, Theorem 3) shows that (20) holds if $\Theta$ is totally bounded,

[TABLE]

and for all $\theta_{1},\theta_{2}\in\Theta$,

[TABLE]

where $h(u)\downarrow 0$ as $u\downarrow 0$ and $\frac{1}{n}\sum_{g=1}^{G}E\left(A_{g}\right)\leq A<\infty$ . The total boundedness condition holds by assumption and the WLLN holds by Theorem 1 under Assumption 1 and (18), so it only remains to establish the Lipschitz condition (61). Indeed, using the triangle inequality and (19)

[TABLE]

where $A_{g}=\sum_{j=1}^{n_{g}}A(X_{gj})$ . Notice that

[TABLE]

since $\sup_{i}EA(X_{i})\leq C$ . This verifies (61) and hence (20) holds. $\blacksquare$

Proof of Theorem 6: Without loss of generality, assume $\mu(\theta)=0$ .

We first examine the case with no estimated mean (23). Andrews (1992, Theorem 3) shows that (23) holds if for all $\theta\in\Theta$

[TABLE]

and for all $\theta_{1}$ , $\theta_{2}\in\Theta$ ,

[TABLE]

with $h(u)\downarrow 0$ as $u\downarrow 0$ and $\frac{1}{n}\sum_{g=1}^{G}EA_{g}\leq A<\infty$ . We now establish (62) and (63).

Take (62). Fix $\theta\in\Theta$ . For brevity, suppress the dependence of $\widetilde{f}_{g}(\theta)$ on $\theta$ . Fix $\delta>0$ . Set $\varepsilon=\left(\delta/C\right)^{2}$ . Define $\widetilde{h}_{g}=\widetilde{f}_{g}1\left(\left\|\widetilde{f}_{g}\right\|\leq\sqrt{n\varepsilon}\right)$ . Then

[TABLE]

By the triangle inequality

[TABLE]

Take (64). Assumption (21) and the $C_{r}$ inequality allow us to deduce that

[TABLE]

for some $C<\infty$. Using Jensen’s inequality, the assumption that the clusters are independent and thus uncorrelated, the bounds $\left\|\widetilde{h}_{g}\right\|\leq\sqrt{n\varepsilon}$ and $\left\|\widetilde{h}_{g}\right\|\leq\left\|\widetilde{f}_{g}\right\|$, (66), (5) with $r=2$ and the definition of $\varepsilon$, we obtain that (64) is bounded by

[TABLE]

Take (65). Lemma 1 implies that $\left\|n_{g}^{-1}\widetilde{f}_{g}\right\|^{2}$ is uniformly integrable given Assumption (21). This means we can pick $B$ sufficiently large so that

[TABLE]

Pick $n$ large enough so that

[TABLE]

which is feasible by (6). Then (65) is bounded by

[TABLE]

using (67) and (5) with $r=2$ . We have shown that $E\left\|\widetilde{\Omega}_{n}(\theta)-E\widetilde{\Omega}_{n}(\theta)\right\|\leq 3\delta$ . Since $\delta$ is arbitrary, by Markov’s inequality, (62) is shown.

Take (63). Fix any $\theta_{1},\theta_{2}\in\Theta$ . Set $\widetilde{f}_{g}=\sup_{\theta\in\Theta}\left\|\widetilde{f}_{g}(\theta)\right\|$ . Using the triangle inequality and Assumption (19)

[TABLE]

Then

[TABLE]

Hence (63) holds with $A_{g}=2\widetilde{f}_{g}\left(\sum_{j=1}^{n_{g}}A(X_{gj})\right)$ .

It remains to show that $\frac{1}{n}\sum_{g=1}^{G}EA_{g}\leq A<\infty$. Assumption (21) and the $C_{r}$ inequality allow us to deduce that $E\widetilde{f}_{g}^{2}\leq Cn_{g}^{2}$. Applying Hölder’s inequality

[TABLE]

Hence

[TABLE]

by Assumption (5) with $r=2$ . This establishes (63).

By showing (62) and (63) we have established (23).

The case with estimated mean (22) immediately follows from (23) and Theorem 5. $\blacksquare$

Proof of Theorem 7: Define

[TABLE]

Then

[TABLE]

where $n\text{var}\left(\overline{f}_{G}^{\ast}\right)=I_{q}$.

Since $\widetilde{f}_{g}^{\ast}$ are independent but not identically distributed, we apply the multivariate Lindeberg-Feller central limit theorem (e.g. Hansen (2018) Theorem 6.15). Since $\text{var}\left(\sqrt{n}\overline{f}_{G}^{\ast}\right)=I_{q}$ a sufficient condition for the CLT (28) is that for all $\varepsilon>0$

[TABLE]

as $n\rightarrow\infty$ .

Fix $\varepsilon>0$ and $\delta>0$ . Pick $B$ sufficiently large so that

[TABLE]

which is feasible by Lemma 2 under (26). Pick $n$ large enough so that

[TABLE]

which is feasible by (25). Thus

[TABLE]

The second inequality is (70), the third is (69), and the final is (24). Since $\varepsilon$ and $\delta$ are arbitrary we have established (68) and hence (28). $\blacksquare$

The proofs of Theorems 8-13 are presented in the Supplemental Appendix.

Bibliography

[1] Andrews, Donald W.K. (1992). Generic uniform convergence. Econometric Theory, 8(2), 241-257.
[2] Angrist, Joshua D. and J. S. Pischke (2009). Mostly Harmless Econometrics: An Empiricist’s Companion. Princeton: Princeton University Press.
[3] Arellano, Manuel (1987). Computing robust standard errors for within-groups estimators. Oxford Bulletin of Economics and Statistics, 49, 431-434.
[4] Bertrand, Marianne, Esther Duflo, and Sendhil Mullainathan (2004). How much should we trust differences-in-differences estimates? The Quarterly Journal of Economics, 119(1), 249-275.
[5] Bester, C. Alan, Timothy G. Conley, and Christian B. Hansen (2011). Inference with dependent data using cluster covariance estimators. Journal of Econometrics, 165(2), 137-151.
[6] Cameron, A. Colin, and Douglas L. Miller (2011). Robust inference with clustered data. Handbook of Empirical Economics and Finance, ed. A. Ullah and D.E. Giles, New York: CRC Press, 1-28.
[7] Cameron, A. Colin, and Douglas L. Miller (2015). A practitioner’s guide to cluster robust inference. Journal of Human Resources, 50, 317-372.
[8] Cameron, A. Colin, Jonah B. Gelbach, and Douglas L. Miller (2008). Bootstrap-based improvements for inference with clustered errors. Review of Economics and Statistics, 90, 414-427.
