Permutation inference with a finite number of heterogeneous clusters
Andreas Hagemann

TL;DR
This paper proposes a simple, robust permutation test for treatment effects when only a finite number of large, heterogeneous clusters is available. The test controls size asymptotically and accommodates variability among clusters.
Contribution
It introduces an easy-to-implement permutation procedure with level adjustments that reliably tests hypotheses in clustered experimental designs with heterogeneity.
Findings
Controls size asymptotically with level-adjusted permutation test
Performs well with at least four treated and control clusters
Robust to high variability among clusters
Abstract
I introduce a simple permutation procedure to test conventional (non-sharp) hypotheses about the effect of a binary treatment in the presence of a finite number of large, heterogeneous clusters when the treatment effect is identified by comparisons across clusters. The procedure asymptotically controls size by applying a level-adjusted permutation test to a suitable statistic. The adjustments needed for most empirically relevant situations are tabulated in the paper. The adjusted permutation test is easy to implement in practice and performs well at conventional levels of significance with at least four treated clusters and a similar number of control clusters. It is particularly robust to situations where some clusters are much more variable than others. Examples and an empirical application are provided.
| α |  | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 |
|---|---|---|---|---|---|---|---|---|---|---|
| .10 | 4 | .0428 | | | | | | | | |
| | 5 | .0317 | .0595 | | | | | | | |
| | 6 | .0238 | .0432 | .0660 | | | | | | |
| | 7 | .0181 | .0340 | .0500 | .0760 | | | | | |
| | 8 | .0161 | .0303 | .0493 | .0600 | .0813 | | | | |
| | 9 | .0153 | .0246 | .0400 | .0580 | .0740 | .0900 | | | |
| | 10 | .0129 | .0220 | .0366 | .0500 | .0700 | .0826 | .0926 | | |
| | 11 | .0153 | .0193 | .0313 | .0420 | .0606 | .0746 | .0853 | .0953 | |
| | 12 | .0106 | .0193 | .0260 | .0420 | .0580 | .0673 | .0800 | .0926 | .0953 |
| .05 | 5 | | .0158 | | | | | | | |
| | 6 | | .0108 | .0227 | | | | | | |
| | 7 | | .0088 | .0200 | .0253 | | | | | |
| | 8 | | .0062 | .0120 | .0233 | .0306 | | | | |
| | 9 | | .0113 | .0120 | .0213 | .0300 | .0393 | | | |
| | 10 | | .0100 | .0113 | .0166 | .0286 | .0340 | .0420 | | |
| | 11 | | .0100 | .0080 | .0153 | .0240 | .0313 | .0393 | .0440 | |
| | 12 | | .0073 | .0080 | .0153 | .0213 | .0266 | .0366 | .0440 | .0491 |
| .025 | 6 | | | .0043 | | | | | | |
| | 7 | | | .0040 | .0086 | | | | | |
| | 8 | | | .0026 | .0086 | .0153 | | | | |
| | 9 | | | .0026 | .0066 | .0100 | .0146 | | | |
| | 10 | | | .0026 | .0046 | .0093 | .0146 | .0166 | | |
| | 11 | | | .0020 | .0033 | .0080 | .0106 | .0166 | .0180 | |
| | 12 | | | .0020 | .0033 | .0073 | .0093 | .0120 | .0173 | .0206 |
| .01 | 7 | | | | .0026 | | | | | |
| | 8 | | | | .0013 | .0026 | | | | |
| | 9 | | | | .0013 | .0020 | .0033 | | | |
| | 10 | | | | .0013 | .0020 | .0033 | .0040 | | |
| | 11 | | | | .0013 | .0020 | .0033 | .0040 | .0066 | |
| | 12 | | | | .0013 | .0013 | .0026 | .0033 | .0053 | .0066 |
| .005 | 8 | | | | | - | | | | |
| | 9 | | | | | - | .0013 | | | |
| | 10 | | | | | - | .0013 | .0013 | | |
| | 11 | | | | | - | .0006 | .0013 | .0020 | |
| | 12 | | | | | - | - | .0013 | .0020 | .0033 |

Note: "-" means that the critical value should be the second largest order statistic. More values are available at https://hgmn.github.io/ap.
|  | AP | IM | BCH | WCB | CRS (oracle) | AP | IM | BCH | WCB | CRS (oracle) |
|---|---|---|---|---|---|---|---|---|---|---|
| | (size) | | | | | (power) | | | | |
| 1 | .0244 | .0086 | .0265 | .0392 | .0474 | .2826 | .1176 | .2930 | .3981 | .4437 |
| 3 | .0316 | .0287 | .0641 | .0538 | .0513 | .1214 | .0706 | .1433 | .1493 | .1627 |
| 5 | .0377 | .0507 | .0787 | .0635 | .0451 | .0549 | .0662 | .1086 | .0887 | .0792 |
| 7 | .0358 | .0475 | .0735 | .0634 | .0442 | .0438 | .0560 | .0924 | .0791 | .0659 |
| | (power) | | | | | (power) | | | | |
| 1 | .5541 | .3142 | .5631 | .6234 | .6036 | .6227 | .4797 | .7001 | .7054 | .6799 |
| 3 | .1896 | .1263 | .2375 | .2435 | .2410 | .2445 | .1900 | .3448 | .3420 | .3056 |
| 5 | .0728 | .0889 | .1566 | .1325 | .1192 | .0982 | .1188 | .2214 | .1897 | .1565 |
| 7 | .0533 | .0707 | .1306 | .1110 | .0908 | .0715 | .0915 | .1715 | .1488 | .1168 |
University of Michigan, Stephen M. Ross School of Business, 701 Tappan Ave, Ann Arbor, MI 48109, USA. Tel.: +1 (734) 615-6663
[email protected] umich.edu/ hagem
JEL classification: C01, C12, C21
Keywords: cluster-robust inference, randomization, permutation, Behrens-Fisher
I would like to thank co-editor Bryan Graham, three anonymous referees, Isaiah Andrews, Michal Kolesár, Aprajit Mahajan, and several seminar audiences for useful comments and discussions. All errors are my own.
1. Introduction
It has become widespread practice in economics to conduct inference that is robust to within-cluster dependence. Typical examples of clusters are states, counties, cities, schools, firms, or stretches of time. Units within the same cluster are likely to influence one another or are influenced by the same external shocks. Several analytical and computationally intensive procedures such as the bootstrap are available to account for the presence of data clusters. Most of these procedures achieve consistency by requiring the number of clusters to go to infinity. Numerical evidence by Bertrand et al. (2004), MacKinnon and Webb (2017), and others suggests that this type of asymptotics often translates into heavily distorted inference in empirically relevant situations when the number of clusters is small or the clusters are heterogeneous. In both situations, the overall finding is that true null hypotheses are rejected far too often. In this paper, I introduce an adjusted permutation procedure that is able to asymptotically control the size of tests about the effect of a binary treatment in the presence of finitely many large and heterogeneous clusters. The procedure applies to difference-in-differences estimation and other situations where treatment occurs in some but not all clusters and the treatment effect of interest is identified by between-cluster comparisons.
The main theoretical insight of this paper is that classical permutation inference can be adjusted to test the null hypothesis of equality of means of two finite samples of mutually independent but arbitrarily heterogeneous normal variables. This runs counter to classical permutation testing (Hoeffding, 1952), where the data under the null are presumed to be exchangeable. The adjustment corrects the significance level of the test downwards to account for heterogeneity. I prove that this is possible for empirically relevant levels of significance if both samples consist of more than three observations. The corrections needed for all standard levels of significance are tabulated in the paper. I also show that if a random vector of interest converges weakly to multivariate normal with diagonal covariance matrix, then permutation inference remains approximately valid for that vector. To exploit this result in a cluster context, I construct asymptotically normal statistics from each cluster and then apply adjusted permutation inference to the collection of these statistics. The resulting permutation test is consistent against all fixed alternatives to the null, powerful against local alternatives, and is free of user-chosen parameters.
The strategy of using cluster-level estimates as the basis for a test goes back at least to Fama and MacBeth (1973), who without formal justification run tests on regression coefficients obtained from year-by-year cross-sectional regressions. Their approach is generalized and formalized by Ibragimov and Müller (2010, 2016), who construct statistics from cluster-level estimates and show that for certain combinations of numbers of clusters and significance levels these statistics can be compared to Student critical values. The Ibragimov-Müller test and the adjusted permutation test complement one another because they both rely on finite-sample inference with heterogeneous normal variables but apply to non-nested combinations of numbers of clusters and significance levels. The empirical example in this paper features a practically relevant situation where the Ibragimov-Müller test does not apply but the adjusted permutation test does. If both tests apply, the Monte Carlo results in this paper indicate that neither test dominates the other in terms of power but the adjusted permutation test has clear advantages if the underlying data are heavy tailed.
Several other papers show that inference with a fixed number of clusters is possible under a variety of conditions: Canay et al. (2017) permute the signs of cluster-level statistics under symmetry assumptions. This approach requires the parameter of interest to be identified within each cluster and clusters therefore have to be paired in an ad-hoc manner for difference-in-differences estimation. This pairing has a substantial impact on the test decision and requires a large number of choices on the part of the researcher. Bester et al. (2011) use standard cluster-robust covariance matrix estimators but adjust critical values under homogeneity assumptions on the clusters. Canay et al. (2021) show that certain cluster-robust versions of the wild bootstrap can be valid under strong homogeneity assumptions with a fixed number of clusters. In sharp contrast, the test developed here does not require pairing clusters or any other decisions on the part of the researcher and applies even if the clusters are arbitrarily heterogeneous.
I will use the following notation: $1\{\cdot\}$ is the indicator function and the cardinality of a set $A$ is $|A|$. The smallest integer larger than $x$ is $\lceil x \rceil$ and the largest integer smaller than $x$ is $\lfloor x \rfloor$. Limits are as $n \to \infty$ unless noted otherwise.
All proofs can be found in the online appendix.
2. Permutation inference with heterogeneous symmetric variables
In this section I show that classical permutation inference can be adjusted to test for the equality of location of two finite samples of independent symmetric variables with heterogeneous scales. The discussion focuses on heterogeneous normal variables but several of the results apply more generally.
Suppose the random vector has entries for and for , where the are iid symmetric variables. The are not known and no estimates are assumed to be available. The number of variables is taken as fixed throughout this paper. The goal is to construct an -level permutation test of the hypothesis . This is a two-sample problem with “treatment” sample and “control” sample . The test statistic considered here is the comparison of means
[TABLE]
No standardization is needed.
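For concreteness, the comparison-of-means statistic can be written out as below. This is a sketch of the standard form rather than a verbatim copy of the displayed equation: the symbols $q_1$ (number of treated entries), $q_0 = q - q_1$ (number of control entries), and the convention that the treated entries come first in $X$ are notational assumptions made here.

```latex
T(X) = \frac{1}{q_1}\sum_{k=1}^{q_1} X_k \;-\; \frac{1}{q_0}\sum_{k=q_1+1}^{q} X_k,
\qquad q_0 = q - q_1 .
```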
Let be the group of permutations of the set . For , denote by the value the permutation assigns to for . The “group action” on in is the relabeling of the indices . A permutation test derives its critical values from the permutation statistics . Because is invariant to the ordering of the first and last entries of , it suffices to compute the for the set of group actions with unique combinations of and . One way of representing this set is
[TABLE]
Denote by the ordered values of as varies over and define critical values
[TABLE]
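The construction above can be sketched in a few lines of Python. The function names are hypothetical, the data are made up, and the rank convention used for the critical value (the order statistic with rank equal to the smallest integer not below (1 - alpha) times the number of unique splits) is an assumption of this sketch, since the displayed definition is not reproduced here.

```python
from itertools import combinations
from math import ceil

import numpy as np


def compare_means(x, treated_idx):
    """Mean of the entries indexed by treated_idx minus the mean of the rest."""
    x = np.asarray(x, dtype=float)
    mask = np.zeros(x.size, dtype=bool)
    mask[list(treated_idx)] = True
    return x[mask].mean() - x[~mask].mean()


def permutation_statistics(x, q1):
    """Sorted statistics over all unique treated/control splits of x."""
    q = len(x)
    stats = [compare_means(x, idx) for idx in combinations(range(q), q1)]
    return np.sort(np.array(stats))


def critical_value(sorted_stats, alpha):
    """Order statistic with 1-indexed rank ceil((1 - alpha) * m); assumed convention."""
    m = len(sorted_stats)
    rank = ceil((1 - alpha) * m)
    return sorted_stats[rank - 1]


# Hypothetical data: the first four entries play the role of the treated sample.
x = [2.1, 0.4, 1.7, 3.0, -0.3, 0.9, 0.2, -1.1]
stats = permutation_statistics(x, q1=4)
print(len(stats))                    # number of unique splits, C(8, 4) = 70
print(critical_value(stats, 0.10))   # unadjusted critical value at alpha = .10
```

Enumerating subsets with `itertools.combinations` visits each unique split exactly once, which is why only 70 statistics rather than 8! are computed in this example.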
Classical permutation inference operates under the null hypothesis that has the same distribution as for all . In the present context this would be equivalent to assuming that and that all are identical under the null. An argument due to Hoeffding (1952) would then show that could be used as the critical value for an -level test against the alternative . If the null hypothesis is weakened to without restrictions on , a natural question to ask is whether there exists any order statistic , , that can be used as a critical value for an -level test even if the classical permutation hypothesis for all fails. As I will discuss now, the answer to this question is affirmative for empirically relevant choices of if and are larger than .
Because , it is always true that . The largest non-trivial critical value from is therefore the second largest order statistic . The following theorem shows that the probability that exceeds is necessarily small under . In fact, this probability is so small that is well below any standard choice of for most values of and . By monotonicity, the existence of a such that is then guaranteed.
Theorem 2.1 (Size for heterogeneous symmetric variables).
Let with , , where and the are iid copies of a continuous random variable . If and have the same distribution, then
[TABLE]
A byproduct of the theorem is a bound for the case where the scales are replaced by positive random variables independent of . The are then called “scale mixtures” of a symmetric variable . The following corollary is immediately obtained from Theorem 2.1 by conditioning on a given set of random scales. (Footnote 1: A referee points out that Székely (2006) studies one-sample Student t-tests for similar classes of distributions. Székely does not deal with permutation inference and uses a fundamentally different proof technique, but the results are also powers of two.)
Corollary 2.2 (Size for symmetric scale mixtures).
Suppose with , , where the are iid copies of a continuous random variable and ) is a possibly dependent random vector independent of with for . If and have the same distribution, then .
Theorem 2.1 shows that a test with critical value has size , , , , as increases from to . Consequently, a 10%-level permutation test that relies only on symmetry is available with and as small as . One can perform a 5%-level test with , a 5%-level two-sided test (see the discussion below (2.5) ahead) with , a 1%-level test with , and a 1%-level two-sided test with .
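A small simulation can make the role of the second largest order statistic tangible. The sketch below draws mean-zero normal vectors with user-chosen scales and records how often the observed comparison of means strictly exceeds the second largest permutation statistic, that is, how often it is the unique maximum. The function names, scale choices, and number of draws are assumptions made for illustration; the code does not rely on the exact form of the bound in Theorem 2.1.

```python
from itertools import combinations

import numpy as np


def split_matrix(q, q1):
    """Rows are weight vectors: +1/q1 on a treated subset, -1/q0 on the rest."""
    q0 = q - q1
    rows = []
    for idx in combinations(range(q), q1):
        w = np.full(q, -1.0 / q0)
        w[list(idx)] = 1.0 / q1
        rows.append(w)
    return np.array(rows)


def exceed_second_largest_rate(scales, q1, n_sim=20000, seed=0):
    """Share of draws in which the observed statistic is strictly above the
    second largest permutation statistic (first q1 entries are treated)."""
    rng = np.random.default_rng(seed)
    scales = np.asarray(scales, dtype=float)
    weights = split_matrix(scales.size, q1)   # row 0 is the observed split
    x = rng.standard_normal((n_sim, scales.size)) * scales   # common mean 0
    stats = x @ weights.T                     # shape (n_sim, number of splits)
    observed = stats[:, 0]
    second_largest = np.sort(stats, axis=1)[:, -2]
    return np.mean(observed > second_largest)


equal = exceed_second_largest_rate([1.0] * 8, q1=4)
unequal = exceed_second_largest_rate([10.0] * 4 + [0.1] * 4, q1=4)
print(equal, unequal)
```

With equal scales the data are exchangeable, so each of the 70 splits is equally likely to produce the maximum and the frequency should be close to 1/70. With four very noisy treated entries and four nearly constant controls, the observed split is the maximum roughly when all four treated draws are positive, so the frequency moves toward 1/16 while staying below 10%.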
More generally, Theorem 2.1 implies that for many combinations of , , and there exist such that and . The largest such value of maximizes power while still controlling the size of the test. Finding this is theoretically and computationally challenging. However, computation can be simplified if is restricted to a single distribution. For normal distributions, the best possible is
[TABLE]
where I suppress the dependence on and to prevent notational clutter. By construction, controls the size of the permutation test not only for arbitrarily heterogeneous normal variables but also for the entire class of scale mixtures of normals. This class includes all Student t and Laplace distributions, as well as many other standard distributions (see, e.g., Gneiting, 1997). Moreover, because the critical value is from a permutation distribution, the test also controls size for all exchangeable distributions. The remainder of the paper therefore focuses on this and heterogeneous normal but other choices of distributions are possible.
A convenient feature of is that it does not depend on the data and can therefore be tabulated. To this end, I use a location-scale invariance argument to reduce the inner supremum in (2.4) to a supremum over , simulate over large random grids on , and compute by iteratively searching over these grids. (See Online Appendix E for details.) The search is not exhaustive and does not guarantee that the target quantity in (2.4) is found. However, in experiments this method consistently replicated the theoretical result in Theorem 2.1 up to a small approximation error, which indicates—but does not unequivocally establish—that this approximation of is reliable.
Table 1 lists for common choices of as a function of and . As can be seen, the adjustment needed to make inference robust to variance heterogeneity is substantial if is very small but disappears quickly as increases. For example, for a robust 10%-level test requires using the 95.62% quantile of the unadjusted test but for the 91% quantile is already sufficient for a robust 10%-level test. For larger numbers of variables the need for adjustment nearly disappears at conventional levels of significance. This is also confirmed by results in Hagemann (2019), who shows that unadjusted permutation inference in this context with the statistic is consistent if the number of treated and control units grows in a balanced manner.
The test decision is now simple. For , choose for a feasible from Table 1 to ensure under . The existence of such an for the comparison-of-means test statistic is guaranteed by Theorem 2.1. For an -level test of the null hypothesis , reject in favor of the alternative if
[TABLE]
For a one-sided test of level against , reject if or, equivalently, . For a two-sided test of level against , reject if or . Test decisions can also be equivalently made with the p-value of the unadjusted test
[TABLE]
because if and only if for every . A p-value for a two-sided test can be defined as Reject the null hypothesis if the p-value does not exceed from Table 1 to perform an -level test.
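The p-value version of the decision rule can be sketched as follows. The p-value used here is the usual permutation p-value, the share of unique splits whose statistic is at least as large as the observed one; this form is an assumption because the displayed definition is not reproduced. The function names and data are hypothetical, and the adjusted level .0428 is the Table 1 entry for a 10%-level test with four treated and four control observations.

```python
from itertools import combinations

import numpy as np


def unadjusted_p_value(x, q1):
    """Share of unique treated/control splits whose comparison of means is
    at least as large as the observed one (first q1 entries treated)."""
    x = np.asarray(x, dtype=float)
    q = x.size
    total = x.sum()

    def stat(idx):
        s = x[list(idx)].sum()
        return s / q1 - (total - s) / (q - q1)

    observed = stat(range(q1))
    stats = np.array([stat(idx) for idx in combinations(range(q), q1)])
    return np.mean(stats >= observed - 1e-12)   # small tolerance for float ties


def adjusted_test(x, q1, alpha_bar):
    """Reject when the unadjusted p-value does not exceed the adjusted level."""
    return unadjusted_p_value(x, q1) <= alpha_bar


# Hypothetical cluster-level estimates: four treated, then four controls.
x = [10.0, 11.0, 12.0, 13.0, 0.0, 1.0, 2.0, 3.0]
alpha_bar = 0.0428
print(unadjusted_p_value(x, q1=4))   # 1/70: the observed split is the unique maximum
print(adjusted_test(x, q1=4, alpha_bar=alpha_bar))
```

Comparing the p-value with the adjusted level instead of the nominal level is the only change relative to an ordinary permutation test, so existing permutation code can be reused with a different threshold.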
Online Appendix A contains additional results on power, stochastic approximation of , and large sample approximation of . The next section applies Theorem 2.1 to situations where is the distributional limit of cluster-level statistics.
3. Permutation inference with heterogeneous clusters
In this section, I establish large sample results for an adjusted permutation test with finitely many clusters under a single high-level condition. I then outline how these results can be applied in empirical practice.
Suppose data from large clusters (e.g., counties, regions, schools, firms, or stretches of time) are available. Observations are independent across clusters but dependent within clusters. An intervention took place during which clusters received treatment and clusters did not. The quantity of interest is a treatment effect or an object related to a treatment effect that can be represented by a scalar parameter . Because entire clusters receive treatment, this parameter is only identified up to a location shift within a treated cluster. Hence, only the left-hand side of
[TABLE]
can be identified from such a cluster. If the clusters have similar characteristics, then can be identified from an untreated cluster. Comparing the two clusters identifies .
The identification strategy outlined in the preceding paragraph is the basis for difference-in-differences estimation (arguably the most popular identification strategy in economics today) and a variety of other models. The purpose of this section is to use the results from Section 2 to develop a permutation test of the conventional (non-sharp) hypothesis
[TABLE]
or, equivalently, . The idea is to obtain independent estimates of and independent estimates of so that is approximately multivariate normal with diagonal covariance matrix. The following example outlines a simple situation where this is possible.
Example 3.1 (Difference in differences).
Consider the regression model
[TABLE]
where indexes individual units, indexes time, indicates time periods after an intervention at a known time , the dummy indicates whether unit eventually received treatment, and the are individual fixed effects. Provided has conditional mean zero and the covariates vary before or after , the data identify in a treated cluster and in an untreated cluster. View each cluster as a separate regression and rewrite (3.1) as
[TABLE]
and use the least squares estimates of and as .
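The cluster-by-cluster estimation step can be sketched as below. This is a simplified illustration, not the full specification of the example: it assumes no covariates and no unit fixed effects beyond a cluster-level constant, so the estimate for each cluster is the least squares coefficient on the post-intervention indicator. All function names are hypothetical.

```python
import numpy as np


def cluster_estimate(y, post):
    """Least squares coefficient on the post-intervention indicator in a
    regression of y on a constant and that indicator, for one cluster."""
    y = np.asarray(y, dtype=float)
    post = np.asarray(post, dtype=float)
    design = np.column_stack([np.ones_like(post), post])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coef[1]


def cluster_estimates(clusters):
    """Apply cluster_estimate to each (y, post) pair, treated clusters first."""
    return np.array([cluster_estimate(y, post) for y, post in clusters])


# Two hypothetical clusters with two pre- and two post-intervention periods.
clusters = [
    ([0.0, 0.0, 1.0, 1.0], [0, 0, 1, 1]),
    ([2.0, 2.0, 2.0, 2.0], [0, 0, 1, 1]),
]
print(cluster_estimates(clusters))   # post-minus-pre mean difference per cluster
```

In this stripped-down case the coefficient equals the post-intervention mean minus the pre-intervention mean within the cluster. The resulting vector of estimates is what the permutation test takes as input.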
The cluster-level statistics can be combined with the results in the previous section to perform a consistent permutation test as the sample size grows large. The test is not limited to the constructed in the preceding example. Instead, the key high-level condition is that a centered and scaled version of some estimate converges to a -dimensional standard normal distribution,
[TABLE]
The may depend on or but are not presumed to be known or estimable by the researcher. This is an important feature of the test because consistent covariance matrix estimation would require knowledge of an explicit ordering of the dependence structure within each cluster. While ordering the data is straightforward for time-dependent data, it may be difficult or impossible to infer or credibly assume an ordering of the data within villages or schools. In contrast, (3.3) can be established under weak dependence assumptions where it is only presumed that there exists a possibly unknown ordering for which the dependence decays at a certain rate. El Machkouri et al. (2013) present easy-to-use moment bounds and limit theorems for this situation; see also Bester et al. (2011) for further results.
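One way to write a condition of the kind described in (3.3) is shown below. This is a sketch consistent with the verbal description (a centered and scaled vector of cluster-level estimates converging to a standard normal vector), not a verbatim copy of the displayed equation. The symbols $\hat\theta_{n,k}$ for the estimate from cluster $k$, $\sigma_k$ for its unknown asymptotic scale, and the centering at $\theta_1$ for the $q_1$ treated clusters and at $\theta_0$ for the remaining clusters are assumptions made here.

```latex
\sqrt{n}\,\Bigl(
\frac{\hat\theta_{n,1}-\theta_{1}}{\sigma_{1}},\dots,
\frac{\hat\theta_{n,q_1}-\theta_{1}}{\sigma_{q_1}},
\frac{\hat\theta_{n,q_1+1}-\theta_{0}}{\sigma_{q_1+1}},\dots,
\frac{\hat\theta_{n,q}-\theta_{0}}{\sigma_{q}}
\Bigr)
\;\xrightarrow{d}\; N(0, I_q).
```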
I now show that under the joint convergence (3.3) a permutation test based on comparison of means of and can be adjusted to be asymptotically of level with a fixed number of clusters. This is possible for if in Table 1 is available at the desired significance level . In that case, the test has power against fixed alternatives with and local alternatives converging to the null. In the latter situation, is fixed and implicitly depends on . The convergence in (3.3) is then no longer pointwise in but a statement about the sequence . As before, the test can be made two-sided to have power against fixed and local alternatives from either direction. Let .
Theorem 3.2 (Consistency and local power).
Suppose (3.3) holds. If , then
[TABLE]
Let . If with , then . If and the are continuous and positive at , then
[TABLE]
Remarks.
(i) Because if and only if , where and is a -vector of ones, the root- rate in (3.3) and in the theorem can be replaced by any other rate as long as the asymptotic normal distribution in (3.3) is still attained. The theorem therefore covers several semiparametric or nonstandard estimators.
(ii) To test for a given , define and reject if . Consistency follows from part (i) of this remark and Theorem 3.2.
(iii) If evaluating all elements of is too costly, the computational burden can be reduced by working with a random sample of random draws from . As long as and then , the theorem and parts (i)-(ii) of this remark also hold for with the exception of the local power bound if happens to be an integer. In that case, the inequality (3.4) holds after subtracting from its right-hand side, where , the are independent standard normal, and is defined in (2.6). This corrects for the discreteness of the test. (See also Online Appendix A.)
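The random-draw shortcut in remark (iii) can be sketched as follows. The number of draws, the seed, and the choice to always count the identity relabeling alongside the random draws are assumptions of this sketch; the function name is hypothetical.

```python
import numpy as np


def sampled_p_value(x, q1, n_draws=9999, seed=0):
    """Approximate the unadjusted permutation p-value from random relabelings
    of x (first q1 entries treated). The identity relabeling is always counted."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=float)
    observed = x[:q1].mean() - x[q1:].mean()
    count = 1                                    # identity relabeling
    for _ in range(n_draws):
        xp = rng.permutation(x)                  # uniform draw over all orderings
        stat = xp[:q1].mean() - xp[q1:].mean()
        count += stat >= observed - 1e-12        # small tolerance for float ties
    return count / (n_draws + 1)


# Hypothetical cluster-level estimates: four treated, then four controls.
x = [10.0, 11.0, 12.0, 13.0, 0.0, 1.0, 2.0, 3.0]
print(sampled_p_value(x, q1=4))   # should be close to the exact value 1/70
```

Drawing full orderings uniformly induces a uniform distribution over the unique treated/control splits, so the sampled share approximates the exact share computed by full enumeration.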
Example 3.3 (Difference in differences, cont.).
Suppose there are pre-intervention and post-intervention periods for unit . The data from the time periods available for unit are the -th cluster. Let . In the absence of covariates (i.e., ), each least squares estimate in (3.2) satisfies
[TABLE]
under . If the pre-intervention and post-intervention periods are long in the sense that and for , then condition (3.3) already holds if is independent across and has a non-degenerate normal limiting distribution for each . A large number of central limit theorems for time dependent data exist; see, e.g., White (2001). Alternatively, if relatively few post-intervention periods are available so that satisfies and for , the scale invariance of the test allows replacement of the in (3.3) by . Then (3.3) holds if and obeys a central limit theorem for . This argument also applies if relatively few pre-intervention periods are available with the roles of and reversed. If the pre-intervention and post-intervention periods are short, Theorem 2.1 implies that the permutation test can still be applied if is multivariate normal for .
The calculations in the preceding paragraph can be adjusted to include covariates. Similar calculations also apply if each cluster is a collection of individual-level data over time, although in that case more general limit theory is needed. See, e.g., Jenish and Prucha (2009) and El Machkouri et al. (2013) for appropriate results.
The model in (3.1) can be modified in several ways. For instance, cluster-specific can be assumed instead of a fixed . The null hypothesis is then for all and the test has power against the alternative without changes to estimation and inference. (Conversely, the parameter does not need to vary across clusters for the results to go through.) The method discussed here can also be applied in difference-in-differences designs with staggered adoption (see, e.g., de Chaisemartin and D’Haultfœille, 2020). However, as Roth et al. (2022) point out, cannot vary by cluster, which rules out heterogeneous trends in untreated potential outcomes across clusters.
Online Appendix B provides more practical guidance for the implementation of the adjusted permutation test and applies the test to several standard econometric models.
4. Numerical results
This section studies the behavior of the adjusted permutation test and related methods in a Monte Carlo experiment and in data from a randomized trial. The discussion focuses on one-sided tests to the right but the results apply more generally. Online Appendix C contains additional numerical examples and empirical applications.
Example 4.1 (Difference in differences, cont.).
This example explores the behavior of the adjusted permutation (AP hereafter) test, the Ibragimov and Müller (2016, IM) test (see Online Appendix C for a description and more results), the Bester et al. (2011, BCH) test, and a clustered wild bootstrap (Cameron et al., 2008, WCB) in a version of a Monte Carlo experiment in Conley and Taber (2011). The BCH test estimates parameters by least squares in the pooled sample and standardizes this estimate with the usual cluster-robust covariance matrix with a degrees-of-freedom adjustment. The resulting statistic is compared to the quantile of a Student t distribution with degrees of freedom. BCH show that this test is valid for certain ranges of and under regularity conditions if the distribution of the covariates is very similar across clusters. The WCB takes the same statistic but compares it to the bootstrap distribution of the statistic obtained from the cluster-robust version of the wild bootstrap using the Rademacher distribution and with the null imposed. This procedure is outlined in detail in Cameron et al. (2008). It is valid with (Djogbenou et al., 2019) under mild homogeneity conditions and valid for fixed under strong homogeneity conditions (Canay et al., 2021). The bootstrap here uses 199 repetitions.
The data generating process is the model in (3.1) specialized to
[TABLE]
with , , , and . As before, is a post-intervention indicator and is a treatment indicator. There are pre-intervention and post-intervention periods, six clusters received treatment, and six did not. I consider for every and . The experiment varies and cluster heterogeneity as follows: for , the last clusters had and the remaining clusters had .
Table 2 shows the rejection frequencies of the four tests outlined above under the null and the alternative. Each entry was computed from 10,000 Monte Carlo simulations and all methods were faced with the same data. As can be seen, all tests were conservative when there was little heterogeneity (). However, the BCH test and the WCB were no longer able to control size as the heterogeneity increased. The over-rejection in both methods led to higher rejection frequencies under the alternative, which therefore should not be viewed as evidence of their power. The AP test rejected far more false nulls than the IM test when there was little heterogeneity. As the heterogeneity increased, the IM test had a slight advantage. The BCH test and the WCB performed well at . However, even then there was little cost to using the AP test. It rejected nearly as many false nulls as the BCH test and at most 11.55 percentage points fewer false nulls than the WCB but was able to control size.
Several other methods for inference specifically designed for difference in differences such as Donald and Lang (2007) and Conley and Taber (2011) are available. Here I focus only on methods that apply more broadly and that are valid with a fixed number of clusters. The test of Canay et al. (2017, CRS) technically applies here but requires matching each treated cluster with a control cluster. In the present example, with six treated and six control clusters, there are 6! = 720 potential matches and equally many potential tests. A single match is enough to perform the test but different matches can lead to different test outcomes. This arbitrariness can be unattractive in applied work because the number of ways in which tests can be selected (and potentially combined) is large. However, if a pilot study or pre-analysis plan prescribed the cluster pairs, the (randomized) CRS test would be asymptotically similar and therefore provides a useful benchmark for the AP test. To this end, Table 2 shows results of an oracle version of the CRS test that presumes that a pre-analysis plan is in place. As can be seen, the AP test compares well to the CRS test while completely avoiding the issue that different cluster pairs can lead to different test results.
Example 4.2 (Achievement awards; Angrist and Lavy 2009).
In this example, I reanalyze data from a randomized trial of Angrist and Lavy (2009) in Israel. Their intervention provided cash rewards to low-achieving high school students if they performed well on the Bagrut certification exams for university admission in Israel. I follow the analysis in Table 5 of Angrist and Lavy (2009) and focus on 32 schools in the sample for which Bagrut rates from 2000 to 2002 are available. Of these schools, 15 received treatment and 17 did not. Because 5 schools did not comply with treatment, the estimates below should be interpreted as intent-to-treat effects. Following Angrist and Lavy, I investigate the performance of girls in the June 2001 exams who were close to achieving Bagrut certification in the sense that they were ranked above the median of the credit-weighted January 2001 scores of girls. The sample also includes all girls who were above the median in 2000 and 2002. The 2948 girls who met these criteria had an above 50% chance of Bagrut certification. I view each school over time as a cluster, which yields an average cluster size of approximately 92 students.
Angrist and Lavy (2009) report a large number of specifications. I consider a version of their fixed-effects model and estimate where indexes students, indexes time, indexes schools, indicates Bagrut status, is the treatment indicator, equals 1 in 2001 and is 0 otherwise, equals 1 in 2002 and is 0 otherwise, indicates whether a student is in the top quartile of the pre-Bagrut grade distribution of girls in the cohort, and is a school fixed effect. Angrist and Lavy estimate several related specifications by logit in their Table 5. They report heteroskedasticity-robust standard errors for that table and argue that clustering is accounted for by their fixed effects. For simplicity and ease of interpretation, I estimate the model by least squares. The model predicts an average increase in the probability of receiving Bagrut status by relative to a mean of with a robust standard error of . A null of no effect against the alternative that is positive is rejected at any conventional significance level if standard normal critical values are used. This is in line with Table 5, col. (3) of Angrist and Lavy (2009), who report significant effects ranging from to with standard errors ranging from to for this sample and several subsamples.
To apply the adjusted permutation test, I view each cluster as an individual regression and separately estimate each of the equations in
[TABLE]
Note that is now simply the constant term in each regression. The resulting test statistic can be viewed as an alternative point estimate of and is comparable in magnitude to the estimates reported in Angrist and Lavy (2009). However, as can be seen in Figure 1, which plots the permutation distribution from 100,000 draws together with the corresponding critical values, the adjusted permutation test only rejects the null of no effect in favor of a positive effect at the 10% level and narrowly fails to reject at the 5% level. If the fixed effects in the regression do not fully account for the within-cluster dependence in the data, the positive effect for girls may therefore be far less significant than previously reported. This result is also in line with Angrist and Lavy, who find substantial but statistically marginal positive effects for girls across a wide variety of plausible specifications when they use cluster-robust standard errors. Also note that the 5% and 10% level one-sided tests performed here are outside the feasible range of the Ibragimov and Müller (2016) test. For the Canay et al. (2017) test, there are ways of testing if 15 treated clusters are paired with 15 control clusters and two control clusters are dropped. In 1,000 randomly chosen unique pairings, the Canay et al. (2017) test rejected the null of no effect against for 425 pairings at the 5% level and for 48 pairings at the 1% level. Any desired conclusion could be reached by choosing a specific pairing.
**ONLINE SUPPLEMENTAL APPENDIX TO “PERMUTATION INFERENCE WITH A FINITE NUMBER OF HETEROGENEOUS CLUSTERS”**
Andreas Hagemann, University of Michigan
This supplemental appendix is organized as follows: Appendix A presents additional theoretical results, some of which are of potentially independent interest. Appendix B provides a step-by-step procedure for implementing the adjusted permutation test and applies that procedure in several examples. Appendix C contains additional numerical results and comparisons with the test of Ibragimov and Müller (2016). Appendix D contains proofs. Appendix E presents a simple algorithm for simulating critical values beyond those found in Table 1 in the main text.
Appendix A Additional theoretical results
I start with a discussion of the behavior of the test under the alternative . (Tests in the other direction follow by considering instead of .) Let and denote by the standard normal distribution function. The distribution function of is equal to and therefore has a continuous and strictly increasing inverse. The following result gives a simple lower bound on the power of a permutation test as a function of , , , and the standard deviations in the treatment group. Here I assume that the under consideration is feasible, i.e., the corresponding satisfies or, equivalently, . Otherwise the test becomes trivial because the null is never rejected.
**Theorem A.1 (Power).**
Suppose with independent , . Let . Then, for every ,
[TABLE]
As can be expected, the power of the test is driven by the strength of the signal relative to the noise represented by the standard deviations . For example, a small treatment effect can be drowned out by large variation in the control group because will then be positive and large for most values of . However, the power of the test is not inherently limited. The integrand on the right is bounded by and converges to as pointwise for every . The integral and consequently the power of the permutation test therefore approach by dominated convergence as . Both the bound and this result can be generalized to the symmetric scale mixtures from Corollary 2.2; see Lemma D.1 for details.
Next, I discuss several aspects of the practical implementation of the permutation test (2.5). First, one can still perform an asymptotic -level test if the observed data or statistic converges in distribution to the considered in Theorem 2.1 or Corollary 2.2. The reason is that the that order and as varies over eventually coincide if sufficiently many entries of are smooth. The proof is a consequence of arguments in Canay et al. (2017).
**Proposition A.2 (Large sample approximation).**
Let and let be as in (2.1). If has independent entries of which more than are continuously distributed, then
[TABLE]
Second, if evaluating over all elements of is too costly because is large, the computational burden can be reduced by working with a random sample of draws from the uniform distribution on . This is often referred to as “stochastic approximation.” The following result shows that the critical values and lead to identical test decisions for any and large as long as is not an integer. If is in fact an integer, the stochastic approximation can be marginally more conservative. The reason is that can vary discontinuously at integer values of . The stochastic approximation then hits the order statistic just above with nonzero probability. The same arguments apply if the identity transformation is always included in , which is common practice for randomization tests.
**Proposition A.3 (Stochastic approximation).**
Let be an arbitrary random vector possibly depending on . Suppose is a collection of random draws from independent of . Then
[TABLE]
with equality unless . The result remains true if one of the members of is replaced by the identity with probability one.
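As a rough illustration of the stochastic approximation, the following Python sketch compares the permutation p-value computed over every relabeling of clusters with the one computed from a random sample of relabelings that always contains the identity. It assumes the difference-in-means statistic with the first `q1` entries treated. The data vector, the seed, the tolerance `TOL`, and all function names are hypothetical and not part of the paper.

```python
import random
from itertools import combinations
from math import comb

TOL = 1e-12  # guards against floating-point noise when comparing statistics


def diff_in_means(x, treated):
    """Mean of the entries indexed by `treated` minus the mean of the rest."""
    treated = tuple(treated)
    inside = set(treated)
    t_vals = [x[i] for i in treated]
    c_vals = [x[i] for i in range(len(x)) if i not in inside]
    return sum(t_vals) / len(t_vals) - sum(c_vals) / len(c_vals)


def exact_pvalue(x, q1):
    """Share of all relabelings whose statistic is at least the observed one."""
    observed = diff_in_means(x, range(q1))
    hits = sum(
        diff_in_means(x, idx) >= observed - TOL
        for idx in combinations(range(len(x)), q1)
    )
    return hits / comb(len(x), q1)


def sampled_pvalue(x, q1, m, rng):
    """Same share over m random relabelings, one of which is the identity."""
    observed = diff_in_means(x, range(q1))
    hits = 1  # the identity relabeling always reproduces the observed statistic
    indices = list(range(len(x)))
    for _ in range(m - 1):
        rng.shuffle(indices)
        hits += diff_in_means(x, indices[:q1]) >= observed - TOL
    return hits / m


x = [1.2, 0.4, 0.9, 1.5, 0.7, 0.1, 0.8, -0.3, 0.5, 0.2]  # made-up cluster estimates
p_exact = exact_pvalue(x, q1=5)
p_sampled = sampled_pvalue(x, q1=5, m=20_000, rng=random.Random(0))
print(p_exact, p_sampled)
```

Because the statistic depends only on which clusters end up labeled as treated, the exact version enumerates treated sets rather than all permutations. Every treated set corresponds to the same number of permutations, so the resulting share is unchanged.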
As a referee points out, the choice of is important in practice. In particular, it seems if is large, then must be large as well to provide an accurate stochastic approximation of the test decision. However, this is only true if the -value , as defined in (2.6), is very close to . If is much larger than for a given , there is often enough information to conclude that is highly unlikely to be smaller than . The same is true if the direction of the inequalities is reversed. The reason is that and, for almost every realization of , the central limit theorem implies that converges to mean-zero normal with variance . It is therefore easy to test hypotheses of the form or with a very small error tolerance . For example, if for a given , one can check whether can be rejected at this . If not, one can add draws from until the decision becomes possible. This idea is, in fact, the basis for the widely used algorithm of Davidson and MacKinnon (2000) for determining a sufficient number of bootstrap repetitions in models where the bootstrap is expensive to compute. Their algorithm can be adapted to the present problem with only notational changes.
**Algorithm A.4 (Choosing if is very large).**
Choose a starting value (e.g., 10,000), a step size (e.g., 1,000), a maximal number of permutations (e.g., 100,000), and an error tolerance (e.g., ).
- (1) If , test the null hypothesis by rejecting in favor of if . Stop if the null is rejected and use as if it were .
- (2) If , test the null hypothesis by rejecting in favor of if . Stop if the null is rejected and use as if it were .
- (3) Stop if and use as if it were . Otherwise draw additional permutations from , set , and restart from step (1).
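The sequential idea behind this algorithm can be sketched in Python as follows. The exact rejection rules in steps (1) and (2) are not reproduced in the text above, so the sketch assumes a one-sided normal-approximation test with the binomial variance evaluated at the boundary value `level`. The function names, the seed, and the example data are hypothetical. The value .0428 is the Table 1 entry for four treated and four control clusters at the 10% level.

```python
import random
from statistics import NormalDist


def diff_in_means(x, q1):
    """Mean of the first q1 entries minus the mean of the remaining ones."""
    return sum(x[:q1]) / q1 - sum(x[q1:]) / (len(x) - q1)


def count_hits(x, q1, observed, n_draws, rng):
    """Number of random relabelings whose statistic is at least `observed`."""
    y = list(x)
    hits = 0
    for _ in range(n_draws):
        rng.shuffle(y)
        hits += diff_in_means(y, q1) >= observed - 1e-12
    return hits


def sequential_pvalue(x, q1, level, m_start=10_000, step=1_000,
                      m_max=100_000, tol=0.001, seed=0):
    """Add random permutations until the comparison with `level` is clear.

    Assumed stopping rule: stop once the simulated p-value differs from
    `level` by more than a normal critical value at error tolerance `tol`,
    or once m_max draws have been used.
    """
    rng = random.Random(seed)
    observed = diff_in_means(x, q1)
    z_crit = NormalDist().inv_cdf(1 - tol)
    m = m_start
    hits = count_hits(x, q1, observed, m, rng)
    while True:
        p_hat = hits / m
        z = (p_hat - level) / (level * (1 - level) / m) ** 0.5
        if abs(z) > z_crit or m >= m_max:
            return p_hat, m
        hits += count_hits(x, q1, observed, step, rng)
        m += step


x = [0.9, 1.1, 1.3, 1.0, 0.1, -0.2, 0.3, 0.0]  # made-up cluster estimates
p_hat, m_used = sequential_pvalue(x, q1=4, level=0.0428)
print(p_hat, m_used)
```

In this example the simulated p-value is far from the level, so the loop should stop at the starting value of draws. When the p-value is close to the level, the loop keeps adding draws until the cap `m_max` is reached.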
Finally, the two approximation results in Propositions A.2 and A.3 can be combined with Theorem 2.1 to obtain
[TABLE]
i.e., adjusted permutation inference with an asymptotically normally distributed vector with heterogeneous variances remains approximately valid even if the set of permutations is drawn at random. It should also be noted that Proposition A.3 is generic and can be restated for other statistics and finite groups with appropriate notational changes. Proposition A.2 can be extended to other statistics and groups under smoothness conditions.
Appendix B Additional examples
I first present a brief summary of how the permutation test can be implemented in practice. By Theorem 3.2, the following procedure provides an asymptotically -level test in the presence of a finite number of large clusters that are arbitrarily heterogeneous. The test is free of nuisance parameters, does not require matching clusters or any other decisions on the part of the researcher, can be two-sided or one-sided in either direction, and is able to detect all fixed and -local alternatives.
**Algorithm B.1 (Permutation test adjusted for cluster heterogeneity).**
- (1) Order the data such that clusters received treatment and clusters did not. Compute for each and using only data from cluster an estimate of either or depending on whether received treatment or not so that the difference is the treatment effect of interest. (Examples are provided below and in the main text.) Define and compute .
- (2) For the desired , choose from Table 1.
- (3) Compute the set of permutations defined in (2.2). Alternatively, draw a large random sample of permutations and replace by in step (4).
- (4) Reject the null hypothesis of no effect of treatment against
  - (a) if for a test with asymptotic level ,
  - (b) if for a test with asymptotic level ,
  - (c) if or for a test with asymptotic level ,

where , defined in (2.3), is the -th largest value of the permutation distribution of .
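The steps above can be sketched in Python for the one-sided test against a positive effect. The sketch assumes that the statistic is the difference between the treated and control means of the cluster-level estimates and that rejection is decided by comparing the permutation p-value with the adjusted level, which relies on the equivalence of p-values and critical values. The function names and data are hypothetical. The adjusted level .0428 is the Table 1 entry for four treated and four control clusters at the 10% level.

```python
from itertools import combinations
from math import comb


def diff_in_means(x, treated_idx):
    """Mean over the clusters labeled treated minus the mean over the rest."""
    treated_idx = tuple(treated_idx)
    members = set(treated_idx)
    treated = [x[i] for i in treated_idx]
    control = [x[i] for i in range(len(x)) if i not in members]
    return sum(treated) / len(treated) - sum(control) / len(control)


def permutation_pvalue(x, q1):
    """Share of relabelings whose statistic is at least the observed one.

    The first q1 entries of x are the treated clusters. Enumerating treated
    sets is equivalent to enumerating all permutations for this statistic.
    """
    q = len(x)
    observed = diff_in_means(x, range(q1))
    hits = sum(
        diff_in_means(x, idx) >= observed - 1e-12
        for idx in combinations(range(q), q1)
    )
    return hits / comb(q, q1)


def adjusted_test(x, q1, alpha_bar):
    """Reject the null of no effect in favor of a positive effect."""
    return permutation_pvalue(x, q1) <= alpha_bar


# Made-up cluster-level estimates: four treated clusters, then four controls.
x = [0.9, 1.1, 1.3, 1.0, 0.1, -0.2, 0.3, 0.0]
print(permutation_pvalue(x, q1=4), adjusted_test(x, q1=4, alpha_bar=0.0428))
```

A test against a negative effect can be obtained from the same functions by passing the negated estimates `[-v for v in x]`.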
I now discuss two additional examples of how the cluster-level statistics can be constructed such that the condition (3.3) required for Theorem 3.2 holds. For simplicity, the discussion focuses on (3.3) under the null hypothesis but the arguments apply more broadly.
**Example B.2 (Regression with cluster-level treatment).**
Consider a linear regression model
[TABLE]
where indexes individuals within clusters . The parameter of interest is the coefficient on the treatment dummy indicating whether cluster received treatment or not. The regression also includes covariates that vary within each cluster and have coefficients that may vary across clusters. The condition identifies within a treated cluster and within an untreated cluster. The preceding display can then be written as
[TABLE]
View these as separate regressions and use the least squares estimates of the constants and as . Also note that permuting is identical to permuting the vector of the observed treatment indicators that labels each of these regressions as coming from either a treated or an untreated cluster. The same types of arguments as in Example 3.1 can be used to establish a central limit theorem for .
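A minimal sketch of this construction, assuming a single scalar covariate per cluster and simulated data, is given below. The constants `theta0` and `delta`, the cluster designs, and the function names are hypothetical and chosen only for illustration.

```python
import random


def ols_intercept(y, x):
    """Constant term from a least squares regression of y on (1, x)."""
    n = len(y)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    return y_bar - (s_xy / s_xx) * x_bar


def simulate_cluster(n, constant, slope, noise_sd, rng):
    """One cluster with its own slope and error standard deviation."""
    x = [rng.gauss(0.0, 1.0) for _ in range(n)]
    y = [constant + slope * xi + rng.gauss(0.0, noise_sd) for xi in x]
    return y, x


rng = random.Random(1)
theta0, delta = 0.5, 0.3  # assumed baseline constant and treatment effect
# (treated?, slope, noise sd): treated clusters are listed first.
designs = [(1, 1.0, 0.5), (1, -0.5, 2.0), (1, 0.2, 1.0),
           (0, 0.8, 0.5), (0, 0.0, 3.0), (0, -1.0, 1.0)]

# One regression per cluster; the estimated constant is the cluster statistic.
estimates = [
    ols_intercept(*simulate_cluster(500, theta0 + delta * d, slope, sd, rng))
    for d, slope, sd in designs
]
q1 = sum(d for d, _, _ in designs)
statistic = sum(estimates[:q1]) / q1 - sum(estimates[q1:]) / (len(estimates) - q1)
print(estimates, statistic)
```

Permuting the entries of `estimates` is the same as permuting the treatment labels in `designs`, which is why only the vector of cluster-level constants is needed for the test.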
Under suitable conditions, the in this example can be interpreted as an average treatment effect in a potential outcomes framework. See, e.g., Słoczyński (2018) and references therein for a precise discussion. The goal here is to make permutation inference about . This should not be confused with testing the “sharp” null hypothesis that the treatment and control potential outcomes under the intervention are identical. Testing sharp nulls is often associated with permutation testing and is a much stronger restriction than that the average effect on the outcomes be zero. Rosenbaum (1984) explains how to use permutation inference to test sharp nulls in the presence of covariates under assumptions on the propensity score.
**Example B.3 (Binary choice with cluster-level treatment).**
Consider a version of the model in Example B.2 as the latent model in a binary choice setting. Here has a known, smooth, and symmetric distribution function and is independent of . Only , , and are observed. Each cluster has observations and can be viewed as a separate binary choice model
[TABLE]
If the treatment effect of interest is for some , then corresponds to the null hypothesis of no treatment effect. Let and suppose the moment condition holds for every and . The corresponding -estimates for the -th cluster are zeros of Denote the derivative of with respect to by .
Using the same limit theory as outlined in Example 3.3, it is possible to argue under regularity conditions that converges pointwise in probability to a limit and $(\hat{\theta}_{n,k},\hat{\beta}_{n,k}) \overset{\mathrm{P}}{\to} (\theta_{0},\beta_{0})$. If is non-singular and , then
[TABLE]
where is a conformable vector with a in the first position and [math] otherwise. Condition (3.3) is satisfied if a central limit theorem applies to . Because this is a scaled average of mean-zero random vectors, the same references as in Example 3.3 can be used to establish a central limit theorem.
Appendix C Additional numerical results
This section presents a detailed comparison of the Ibragimov and Müller (2016) and adjusted permutation tests in Monte Carlo experiments and empirical examples.
**Example C.1 (Equality of means).**
The adjusted permutation test developed here and the Ibragimov and Müller (2016) test both rely on results about the behavior of heterogeneous normal variables applied to certain test statistics. For the adjusted permutation test, this statistic is the comparison of means . For the Ibragimov-Müller test, it is the studentized two-sample statistic
[TABLE]
where and . This statistic is compared to the quantiles of the Student distribution with degrees of freedom. This example investigates the relative performance of the two tests.
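For readers who want to compute a studentized two-sample statistic of this kind, a small Python sketch follows. The display defining the statistic is not reproduced above, so the sketch assumes the usual unequal-variance form: the difference in group means divided by the square root of the sum of the group sample variances, each divided by its group size. The function name and the inputs are hypothetical, and the sketch does not compute Student critical values.

```python
def studentized_statistic(treated, control):
    """Difference in means divided by an unequal-variance standard error.

    Assumed form: (mean_1 - mean_0) / sqrt(s_1^2 / q_1 + s_0^2 / q_0),
    where s^2 is the sample variance with divisor (group size - 1).
    """
    def mean(v):
        return sum(v) / len(v)

    def sample_var(v):
        m = mean(v)
        return sum((vi - m) ** 2 for vi in v) / (len(v) - 1)

    se = (sample_var(treated) / len(treated)
          + sample_var(control) / len(control)) ** 0.5
    return (mean(treated) - mean(control)) / se


# Made-up cluster-level estimates for four treated and four control clusters.
t_stat = studentized_statistic([0.9, 1.1, 1.3, 1.0], [0.1, -0.2, 0.3, 0.0])
print(t_stat)
```

Unlike the comparison of means, this statistic requires an estimate of the within-group variability, which is where few large variances can reduce power.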
As in Section 2, suppose has independent entries with distributed as . The results reported here use . To investigate the impact of heterogeneity on the two tests, I considered the following six configurations of :
- (a) ,
- (b) ,
- (c) , , , ,
- (d) ,
- (e) , , ,
- (f) , , .
Configurations (a), (d), (e), and (f) are taken from Ibragimov and Müller (2016).
Rows (a)-(f) of Figure 2 correspond to the six configurations (a)-(f) and show the rejection frequencies of the adjusted permutation test (black lines) and the Ibragimov-Müller test (grey) at the 5% level (dashed line) as increases. The null hypothesis is correct at . The columns correspond, from left to right, to the sample sizes , , and . Each horizontal coordinate was computed from 10,000 Monte Carlo replications. As can be seen, the variation in led to marked differences in power at different levels of heterogeneity. The adjusted permutation test was able to reject far more false nulls than the Ibragimov-Müller test for small when there were few large variances as in (b) and (c). For instance, in (b) with at the adjusted permutation test rejected in 47.62% of all cases whereas the Ibragimov-Müller test rejected in only 6.36% of all cases. This difference eventually disappeared for large . However, neither test is uniformly more powerful. With slightly different variances within or across groups as in (d) and (f), the Ibragimov-Müller test had an advantage when the sample sizes differed substantially. The differences between the two tests were much smaller for the other configurations. Other sample sizes (not shown) led to qualitatively similar results.
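A Monte Carlo experiment of this type can be sketched in Python as follows. The sketch simulates independent normal cluster statistics with heterogeneous standard deviations and records how often the one-sided adjusted permutation test rejects. The standard deviations, the shift `delta`, the number of replications, and the seed are hypothetical and do not correspond to configurations (a)-(f). The adjusted level .0428 is the Table 1 entry for four treated and four control clusters at the 10% level.

```python
import random
from itertools import combinations


def rejection_rate(sigmas, delta, q1, alpha_bar, reps, rng):
    """Share of simulated samples in which the adjusted test rejects.

    The first q1 entries are treated and have mean delta; the rest have
    mean zero. Entry k has standard deviation sigmas[k].
    """
    q = len(sigmas)
    labelings = list(combinations(range(q), q1))
    rejections = 0
    for _ in range(reps):
        x = [rng.gauss(delta if k < q1 else 0.0, s) for k, s in enumerate(sigmas)]
        # With the total held fixed, the difference in means is increasing in
        # the treated-group sum, so comparing sums orders the statistics.
        observed = sum(x[:q1])
        hits = sum(
            sum(x[i] for i in idx) >= observed - 1e-12 for idx in labelings
        )
        rejections += hits / len(labelings) <= alpha_bar
    return rejections / reps


rng = random.Random(0)
# Null case with one very noisy treated cluster (made-up configuration).
size = rejection_rate([5, 1, 1, 1, 1, 1, 1, 1], 0.0, 4, 0.0428, 2000, rng)
# Strong alternative with homogeneous clusters.
power = rejection_rate([1] * 8, 5.0, 4, 0.0428, 2000, rng)
print(size, power)
```

Under the null, the simulated rejection frequency should stay at or below the nominal 10% up to simulation error, while the rejection frequency under the strong alternative should be close to one.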
As a referee points out, it would be interesting to compare the performance of the adjusted permutation test and the Ibragimov-Müller test in fat-tailed settings. Just like the adjusted permutation test, the Ibragimov-Müller test can be used with mixtures of normals, which includes models with infinite variances. I therefore repeated the above experiments with standard Cauchy distributed instead of standard normal distributions, holding all else equal. The results are plotted in Figure 3. As can be seen, within the scope of the configurations for (a)-(f), the adjusted permutation test was more powerful than the Ibragimov-Müller test for every configuration at all sample sizes and for all values of . In sharp contrast to the situation with standard normal , this was true even when the sample sizes differed.
A reviewer also recommends comparing the conclusions of adjusted permutation inference and the Ibragimov-Müller test in empirical examples discussed in Ibragimov and Müller (2016), which include tests of hypotheses on January effects and a randomized trial of Bloom et al. (2013).
**Example C.2 (January effects; Keim 1983).**
Keim (1983) investigates January effects in stock returns. He considers excess returns in portfolios constructed from firms in the top and bottom deciles of size, as measured by market value of equity on the New York Stock Exchange (NYSE) and American Stock Exchange (now called NYSE American) over the period 1963-1979. To test whether the January effect is time invariant, Ibragimov and Müller assume that the data are suitably approximated by a scale mixture of normals and implement their test by comparing the January excess returns for 1963-1969 to the January excess returns for 1970-1979. They do not reject the null hypothesis of time invariance at the 5% level but reject at the 10% level. The adjusted permutation test does not reject at either significance level.
**Example C.3 (Modern management practices; Bloom et al. 2013).**
In this example, I reanalyze data from a randomized trial of Bloom et al. (2013). Their intervention provided five months of extensive management consulting from a large international consulting firm to eleven randomly selected Indian textile plants. A control group of six randomly selected plants received only one month of diagnostic consulting. The experiment ran from 2008 to 2011 and several key performance measures were collected before, during, and after the intervention. These measures include data on quality defects, inventory, output, and total factor productivity. Here I focus on output because it is the only measure that has data for all 17 firms available. For the effect on output in their main results in their Table II, Bloom et al. (2013) run a regression of the log of picks (one pick is a single rotation of a weaving shuttle) on a treatment dummy, time fixed effects, and firm fixed effects. They find a 9% increase in output as a result of the intervention.
Bloom et al. (2013) use, among other methods, the Ibragimov and Müller (2016) test to conduct inference. The adjusted permutation test also applies and can be computed as outlined in Examples 3.1 and 3.3. Both the Ibragimov-Müller and the adjusted permutation test find a significant positive effect on log output at the 5% level, which confirms that the results of Bloom et al. (2013) remain valid even if methods designed for a small number of arbitrarily heterogeneous clusters are used.
Appendix D Proofs
Proof of Theorem 2.1 and Corollary 2.2.
Denote the distribution function of an arbitrary random variable by . We have if and only if . Because the test statistic is location invariant, assume without loss of generality that . Denote by the order statistics of . Then . Because and cannot be true at the same time and implies , it follows that equals
[TABLE]
Suppose , , where the is nonzero with probability one and the has a continuous distribution. The second line of the preceding display must then be zero conditional on and the same must therefore hold unconditionally. The first line conditional on for fixed scales is, by independence, equivalent to the statement ) with for . In the following, I will therefore work with first and return to the unconditional case later.
Let and . Symmetry of and independence of and imply
[TABLE]
Suppose . The two maxima and must satisfy
[TABLE]
Define . Note that the are independent across and symmetric because . The right-hand side of the preceding display then equals . Conclude from symmetry that . Repeat the argument with to obtain
[TABLE]
as desired. To see that this bound is tight, assume first that . Choose , , and let . Then . If , then almost surely as . If , then almost surely as . Conclude from dominated convergence that . If , switch and . This proves the theorem.
For the corollary, return to and redefine accordingly. It is still true that almost surely and therefore , as required for the corollary. ∎
Define and . Let be the distribution function of conditional on .
**Lemma D.1.**
Suppose with , , where the are iid copies of a random variable with continuous distribution function and ) is a random vector independent of with for . If and have the same distribution, then
[TABLE]
The right-hand side converges to as .
Proof of Lemma D.1.
This proof is similar to the proof of Theorem 2.1. As before, consider and assume without loss of generality the case so that has the same distribution as . Because , continuity implies
[TABLE]
Independence of and conditional on and continuity imply that there is an independent standard uniform such that the preceding display equals
[TABLE]
where the equality follows from Tonelli’s theorem. By independence, the distribution function of conditional on is . The first result now follows because The second result follows from (D.1) as . ∎
Proof of Theorem A.1.
This follows immediately from Lemma D.1 by letting with probability one and . ∎
Proof of Proposition A.2.
Following Canay et al. (2017), I only have to show that for any two distinct , either for all or . Let and notice that implies that for at least two . By the pigeonhole principle, or must be continuously distributed. Then is continuously distributed by independence and therefore . ∎
Proof of Proposition A.3.
All limits are as . Let be a collection of draws from the uniform distribution on , in which case . For almost every realization of , the central limit theorem implies that converges to mean-zero normal with variance . Because , this variance can only be zero if . This occurs if and only if for all , which also implies for such .
By the equivalence of -values and critical values, if and only if and therefore
[TABLE]
Since converges almost surely to a (possibly degenerate) normal distribution function, for every and almost every realization of there is an (possibly depending on and ) such that the limit of is at most and is at least . If , then is eventually larger than every such . If , then is eventually smaller than . If , which cannot occur if , the preceding display converges almost surely to 0.5. Conclude that the preceding display converges almost surely to . The dominated convergence theorem then implies
[TABLE]
The right hand side is equal to if , which is the case if is not an integer because infinitesimal changes in cannot change . If is nonzero, then the preceding display is smaller than .
If then, both unconditionally and conditional on ,
[TABLE]
The proof now follows from the arguments for . ∎
Proof of Theorem 3.2.
Suppose . Let denote a -vector of ones and . Notice that if and only if
[TABLE]
Hence, it suffices to prove the result with in place of . Because , the desired result for follows from Proposition A.2 and Theorem 2.1.
Suppose . Let and . Then by the assumed continuity and the Slutsky lemma. By construction, is equivalent to . Proposition A.2 then implies
[TABLE]
Now apply the lower bound developed in Theorem A.1 to the right-hand side.
Suppose . Let so that is equivalent to . For a large , the probability that the latter event occurs is bounded above by
[TABLE]
The first term is bounded above by . This can be made as small as desired by choosing large enough because the continuous mapping theorem implies that is uniformly tight. By the properties of quantile functions, the second term in the preceding display is equal to
[TABLE]
Because for and for , uniform tightness of for every implies $P\bigl(1\{T(gX_{n})+T(g\Delta_{n})-T(\Delta_{n})>-M\}=1\bigr)=P\bigl(T(gX_{n})+T(g\Delta_{n})-T(\Delta_{n})>-M\bigr)$ converges to [math] for every given if . In addition, for and hence the preceding display is within of
[TABLE]
which equals zero if . Let and then in (D.2) to conclude if . Because and , this proves the result for . If or, equivalently, , then is the maximal order statistic and the power of the test is zero for any sample size. ∎
Appendix E Numerical computation of
This section provides two algorithms for the numerical computation of as in Table 1. For the algorithms, notice that there is no loss of generality in assuming that the standard deviations are restricted to the interval because both sides of can be divided by the largest standard deviation without altering the test decision.
**Algorithm E.1 ( and small).**
- (1) Choose , starting with .
- (2) Draw a large number of iid copies of a -vector with independent Beta entries, e.g., Beta.
- (3) For each , draw a large number of iid copies of and approximate by
[TABLE]
- (4) If there is an in for which the number from step (3) is larger than (or, alternatively, for a small tolerance ), let . If not, decrease by and restart at step (1).
- (5) Define .
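The general idea of searching for an adjusted level by simulation can be sketched in Python. This is not a reproduction of Algorithm E.1, whose grid, stopping rule, and Beta parameters are not recoverable from the text above. The sketch assumes Beta(0.1, 0.1) draws for the standard deviations, a downward grid search from the nominal level, and a worst-case rejection rate taken over the simulated standard deviation vectors. All names and default values are hypothetical, and with so few draws the output is far too noisy to match Table 1.

```python
import random
from itertools import combinations


def pvalue(x, labelings, q1):
    """Permutation p-value of the treated-group sum (first q1 entries)."""
    observed = sum(x[:q1])
    hits = sum(sum(x[i] for i in idx) >= observed - 1e-12 for idx in labelings)
    return hits / len(labelings)


def search_alpha_bar(q1, q0, alpha, n_sigma=20, n_draws=200, step=0.002, seed=0):
    """Largest grid level whose worst simulated rejection rate is <= alpha."""
    rng = random.Random(seed)
    q = q1 + q0
    labelings = list(combinations(range(q), q1))

    # Simulate p-values under the null for several standard deviation vectors.
    pvals_by_sigma = []
    for _ in range(n_sigma):
        sigma = [rng.betavariate(0.1, 0.1) for _ in range(q)]  # assumed Beta
        pvals_by_sigma.append([
            pvalue([rng.gauss(0.0, s) for s in sigma], labelings, q1)
            for _ in range(n_draws)
        ])

    # Walk down from the nominal level until size is controlled in every
    # simulated configuration.
    candidate = alpha
    while candidate > 0:
        worst = max(
            sum(p <= candidate for p in pvals) / n_draws
            for pvals in pvals_by_sigma
        )
        if worst <= alpha:
            return candidate
        candidate = round(candidate - step, 10)
    return 0.0


print(search_alpha_bar(4, 4, 0.10, n_sigma=10, n_draws=100))
```

The search always terminates with a positive level on this grid because no permutation p-value can fall below one over the number of relabelings, so a sufficiently small candidate never rejects.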
**Algorithm E.2 ( or large).**
- (1) Choose a large number . Choose , starting with .
- (2) Draw a large number of iid copies of a -vector with independent Beta entries, e.g., Beta.
- (3) For each , draw a large number of iid copies of and approximate by
[TABLE]
- (4) If there is an in for which the number from step (3) is larger than (or, alternatively, for a small tolerance ), let . If not, decrease by and restart at step (1).
If , Table 1 uses two passes of Algorithm E.1 with and . The first pass computes steps (1)-(3) with . The second pass takes, for each , the top 1% values of that led to the highest rejections and computes steps (3)-(5) with . If , Table 1 uses two passes of Algorithm E.2 with , , and . The first pass computes steps (1)-(3) with . The second pass takes, for each , the top 1% values of that led to the highest rejections and computes steps (3)-(5) with . The Beta distribution is used here because the highest rejection rates seem to occur near the boundaries of the parameter space where this distribution has most of its mass.
References
- Angrist, Joshua and Victor Lavy (2009). The effects of high stakes high school achievement awards: Evidence from a randomized trial. American Economic Review 99:4, 301–331.
- Bertrand, Marianne, Esther Duflo, and Sendhil Mullainathan (2004). How much should we trust differences-in-differences estimates? Quarterly Journal of Economics 119:1, 249–275.
- Bester, C. Alan, Timothy G. Conley, and Christian B. Hansen (2011). Inference with dependent data using cluster covariance estimators. Journal of Econometrics 165:2, 137–151.
- Cameron, A. Colin, Jonah B. Gelbach, and Douglas L. Miller (2008). Bootstrap-based improvements for inference with clustered errors. Review of Economics and Statistics 90:3, 414–427.
- Canay, Ivan A., Joseph P. Romano, and Azeem M. Shaikh (2017). Randomization tests under an approximate symmetry assumption. Econometrica 85:3, 1013–1030.
- Canay, Ivan A., Andres Santos, and Azeem M. Shaikh (2021). The wild bootstrap with a “small” number of “large” clusters. Review of Economics and Statistics 103:2, 346–363.
- Conley, Timothy G. and Christopher R. Taber (2011). Inference with “difference in differences” with a small number of policy changes. Review of Economics and Statistics 93:1, 113–125.
- de Chaisemartin, Clément and Xavier D’Haultfœuille (2020). Two-way fixed effects estimators with heterogeneous treatment effects. American Economic Review 110:9, 2964–2996.
