Estimating the galaxy two-point correlation function using a split random catalog
E. Keih\"anen, H. Kurki-Suonio, V. Lindholm, A. Viitanen, A.-S., Suur-Uski, V. Allevato, E. Branchini, F. Marulli, P. Norberg, D. Tavagnacco,, S. de la Torre, J. Valiviita, M. Viel, J. Bel, M. Frailis, and A. G., S\'anchez

TL;DR
This paper proposes a split random catalog method to efficiently estimate the galaxy two-point correlation function, significantly reducing computation time while maintaining accuracy, crucial for large upcoming galaxy surveys.
Contribution
Introducing a split random catalog approach that optimizes the calculation of the galaxy two-point correlation function by reducing computational costs without sacrificing precision.
Findings
Splitting the random catalog into subcatalogs improves efficiency.
Excluding cross-subcatalog pairs maintains estimator accuracy.
Reduces computation time by over tenfold for large catalogs.
Abstract
The two-point correlation function of the galaxy distribution is a key cosmological observable that allows us to constrain the dynamical and geometrical state of our Universe. To measure the correlation function we need to know both the galaxy positions and the expected galaxy density field. The expected field is commonly specified using a Monte-Carlo sampling of the volume covered by the survey and, to minimize additional sampling errors, this random catalog has to be much larger than the data catalog. Correlation function estimators compare data-data pair counts to data-random and random-random pair counts, where random-random pairs usually dominate the computational cost. Future redshift surveys will deliver spectroscopic catalogs of tens of millions of galaxies. Given the large number of random objects required to guarantee sub-percent accuracy, it is of paramount importance to…
| method | time | mean variance | |||
|---|---|---|---|---|---|
| [s] | – | – | – | – | |
| [] | [] | [] | [] | ||
| 7889 | 1.01 | 4.42 | 1.23 | 7.01 | |
| 2306 | 1.01 | 4.39 | 1.23 | 7.04 | |
| 16.5 | 5.96 | 5.53 | 1.46 | 8.05 | |
| 2239 | 1.01 | 4.41 | 1.25 | 7.11 | |
| 812 | 1.02 | 4.41 | 1.25 | 7.42 | |
| 487 | 1.08 | 4.41 | 1.28 | 7.72 | |
| 1854 | 1.01 | 4.42 | 1.23 | 7.02 | |
| 763 | 1.00 | 4.42 | 1.23 | 7.01 | |
| 593 | 1.09 | 4.42 | 1.23 | 7.02 | |
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
11institutetext: Department of Physics and Helsinki Institute of Physics, Gustaf Hällströmin katu 2, 00014 University of Helsinki, Finland 22institutetext: Scuola Normale Superiore, Piazza dei Cavalieri 7, I-56126 Pisa, Italy 33institutetext: Department of Mathematics and Physics, Roma Tre University, Via della Vasca Navale 84, I-00146 Rome, Italy 44institutetext: Dipartimento di Fisica e Astronomia - Alma Mater Studiorum Università di Bologna, via Piero Gobetti 93/2, I-40129 Bologna, Italy 55institutetext: INAF - Osservatorio di Astrofisica e Scienza dello Spazio di Bologna, via Piero Gobetti 93/3, I-40129 Bologna, Italy 66institutetext: INFN - Sezione di Bologna, viale Berti Pichat 6/2, I-40127 Bologna, Italy 77institutetext: ICC & CEA, Department of Physics, Durham University, South Road, DH1 3LE UK 88institutetext: INAF, Osservatorio Astronomico di Trieste, via Tiepolo 11, I-34131 Trieste, Italy 99institutetext: Aix Marseille Univ, CNRS, CNES, LAM, Marseille, France 1010institutetext: SISSA, International School for Advanced Studies, Via Bonomea 265, 34136 Trieste TS, Italy 1111institutetext: INFN, Sezione di Trieste, Via Valerio 2, 34127 Trieste TS, Italy 1212institutetext: Aix Marseille Univ, Université de Toulon, CNRS, CPT, Marseille, France 1313institutetext: Max-Planck-Institut für extraterrestrische Physik, Postfach 1312, Giessenbachstr., 85741 Garching, Germany 1414institutetext: IFPU - Institute for Fundamental Physics of the Universe, Via Beirut 2, 34014 Trieste, Italy
Estimating the galaxy two-point correlation function using a split random catalog
E. Keihänen Estimating the galaxy two-point correlation function using a split random catalogEstimating the galaxy two-point correlation function using a split random catalog
H. Kurki-Suonio Estimating the galaxy two-point correlation function using a split random catalogEstimating the galaxy two-point correlation function using a split random catalog
V. Lindholm Estimating the galaxy two-point correlation function using a split random catalogEstimating the galaxy two-point correlation function using a split random catalog
A. Viitanen Estimating the galaxy two-point correlation function using a split random catalogEstimating the galaxy two-point correlation function using a split random catalog
A.-S. Suur-Uski Estimating the galaxy two-point correlation function using a split random catalogEstimating the galaxy two-point correlation function using a split random catalog
V. Allevato Estimating the galaxy two-point correlation function using a split random catalogEstimating the galaxy two-point correlation function using a split random catalogEstimating the galaxy two-point correlation function using a split random catalogEstimating the galaxy two-point correlation function using a split random catalog
E. Branchini Estimating the galaxy two-point correlation function using a split random catalogEstimating the galaxy two-point correlation function using a split random catalog
F. Marulli Estimating the galaxy two-point correlation function using a split random catalogEstimating the galaxy two-point correlation function using a split random catalogEstimating the galaxy two-point correlation function using a split random catalogEstimating the galaxy two-point correlation function using a split random catalogEstimating the galaxy two-point correlation function using a split random catalogEstimating the galaxy two-point correlation function using a split random catalog
P. Norberg Estimating the galaxy two-point correlation function using a split random catalogEstimating the galaxy two-point correlation function using a split random catalog
D. Tavagnacco Estimating the galaxy two-point correlation function using a split random catalogEstimating the galaxy two-point correlation function using a split random catalog
S. de la Torre Estimating the galaxy two-point correlation function using a split random catalogEstimating the galaxy two-point correlation function using a split random catalog
J. Valiviita Estimating the galaxy two-point correlation function using a split random catalogEstimating the galaxy two-point correlation function using a split random catalog
M. Viel Estimating the galaxy two-point correlation function using a split random catalogEstimating the galaxy two-point correlation function using a split random catalogEstimating the galaxy two-point correlation function using a split random catalogEstimating the galaxy two-point correlation function using a split random catalogEstimating the galaxy two-point correlation function using a split random catalogEstimating the galaxy two-point correlation function using a split random catalogEstimating the galaxy two-point correlation function using a split random catalogEstimating the galaxy two-point correlation function using a split random catalog
J. Bel Estimating the galaxy two-point correlation function using a split random catalogEstimating the galaxy two-point correlation function using a split random catalog
M. Frailis Estimating the galaxy two-point correlation function using a split random catalogEstimating the galaxy two-point correlation function using a split random catalog
A. G. Sánchez Estimating the galaxy two-point correlation function using a split random catalogEstimating the galaxy two-point correlation function using a split random catalog
(Received XXX / Accepted YYY)
The two-point correlation function of the galaxy distribution is a key cosmological observable that allows us to constrain the dynamical and geometrical state of our Universe. To measure the correlation function we need to know both the galaxy positions and the expected galaxy density field. The expected field is commonly specified using a Monte-Carlo sampling of the volume covered by the survey and, to minimize additional sampling errors, this random catalog has to be much larger than the data catalog. Correlation function estimators compare data–data pair counts to data–random and random–random pair counts, where random–random pairs usually dominate the computational cost. Future redshift surveys will deliver spectroscopic catalogs of tens of millions of galaxies. Given the large number of random objects required to guarantee sub-percent accuracy, it is of paramount importance to improve the efficiency of the algorithm without degrading its precision. We show both analytically and numerically that splitting the random catalog into a number of subcatalogs of the same size as the data catalog when calculating random–random pairs and excluding pairs across different subcatalogs provides the optimal error at fixed computational cost. For a random catalog fifty times larger than the data catalog, this reduces the computation time by a factor of more than ten without affecting estimator variance or bias.
Key Words.:
large-scale structure of Universe - Cosmology: observations - Methods: statistical - Methods: data analysis
1 Introduction
The spatial distribution of luminous matter in the Universe is a key diagnostic for studying cosmological models and the physical processes involved in the assembly of structure. In particular, light from galaxies is a robust tracer of the overall matter distribution, whose statistical properties can be predicted by cosmological models. Two-point correlation statistics are very effective tools for compressing the cosmological information encoded in the spatial distribution of the mass in the Universe. In particular, the two-point correlation function in configuration space has emerged as one of the most popular cosmological probes. Its success stems from the presence of characterized features that can be identified, measured, and effectively compared to theoretical models to extract clean cosmological information.
One such feature is baryon acoustic oscillations (BAOs), which imprint a characteristic scale in the two-point correlation function that can be used as a standard ruler. After the first detection in the two-point correlation function of SDSS DR3 and 2dFGRS galaxy catalogs (Eisenstein et al. 2005; Cole et al. 2005), the BAO signal was identified, with different degrees of statistical significance, and has since been used to constrain the expansion history of the Universe in many spectroscopic galaxy samples (see e.g., Percival et al. (2010) Blake et al. (2011), Beutler et al. (2011), Anderson et al. (2012), Anderson et al. (2014), Ross et al. (2015), Alam et al. (2017), Ross et al. (2017), Vargas-Magaña et al. (2018), Bautista et al. (2018), and Ata et al. (2018)). Several of these studies did not focus on the BAO feature only but also analyzed the anisotropies in the two-point correlation function induced by the peculiar velocities (Kaiser 1987), the so-called redshift space distortions (RSD), and by assigning cosmology-dependent distances to the observed redshifts (the Alcock & Paczyński (1979) test). For RSD analyses, see also, for example, Peacock et al. (2001), Guzzo et al. (2008), Beutler et al. (2012), Reid et al. (2012), de la Torre et al. (2017), Pezzotta et al. (2017), Zarrouk et al. (2018), Hou et al. (2018), and Ruggeri et al. (2019).
Methods to estimate the galaxy two-point correlation function (2PCF) from survey data are based on its definition as the excess probability of finding a galaxy pair. One counts from the data () catalog the number of pairs of galaxies with separation , where is a bin of separation vectors, and compares it to the number of pairs in a corresponding randomly generated () catalog and to the number of data-random pairs . The bin may be a 1D (), 2D, or a 3D bin. In the 1D case, is the length of the separation vector and is the width of the bin. From here on, ‘separation ’ indicates that the separation falls in this bin.
Several estimators of the 2PCF have been proposed by Hewett (1982), Davis & Peebles (1983), Hamilton (1993), and Landy & Szalay (1993), building on the original Peebles & Hauser (1974) proposal. These correspond to different combinations of the , , and counts to obtain a 2PCF estimate ; see Kerscher (1999) and Kerscher (2000) for more estimators. The Landy–Szalay (Landy & Szalay 1993) estimator
[TABLE]
(we call this method ‘standard LS’ in the following) is the most commonly used, since it provides the minimum variance when and is unbiased in the limit . Here is the size (number of objects) of the data catalog and is the size of the random catalog. We define . To minimize random error from the random catalog, should be used (for a different approach, see Demina et al. (2018)).
One is usually interested in only up to some (the maximum separation in the survey), and therefore pairs with larger separations can be skipped. Efficient implementations of the LS estimator involve pre-ordering of the catalogs through kd-tree, chain-mesh, or other algorithms (e.g., Moore et al. 2000; Alonso 2012; Jarvis 2015; Marulli et al. 2016) to facilitate this. The computational cost is then roughly proportional to the actual number of pairs with separation .
The correlation function is small for large separations, and in cosmological surveys is large enough so that for most pairs . The fraction of pairs with is therefore not very different from the fraction of or pairs with . The computational cost is dominated by the part proportional to the total number of pairs needed, , which in turn is dominated by the pairs as . The smaller number of pairs contribute much more to the error of the estimate than the large number of pairs, whereas the cost is dominated by . Thus, a significant saving of computation time with an insignificant loss of accuracy may be achieved by counting only a subset of pairs, while still counting the full set (up to ) of pairs.
A good way to achieve this is to use many small (i.e, low-density) catalogs instead of one large (high-density) catalog (Landy & Szalay 1993; Wall & Jenkins 2012; Slepian & Eisenstein 2015), or, equivalently, to split an already generated large catalog into small ones for the calculation of pairs while using the full catalog for the counts. This method has been used by some practitioners (e.g., Zehavi et al. (2011)111I. Zehavi, private communication), but this is usually not documented in the literature. One might also consider obtaining a similar cost saving by diluting (subsampling) the catalog for counts, but, as we show below, this is not a good idea. We refer to these two cost-saving methods as ‘split’ and ‘dilution’.
In this work we theoretically** **derive the additional covariance and bias due to the size and treatment of the catalog; test these predictions numerically with mock catalogs representative of next-generation datasets, such as the spectroscopic galaxy samples that will be obtained by the future Euclid satellite mission (Laureijs et al. 2011); and show that the ‘split’ method, while reducing the computational cost by a large factor, retains the advantages of the LS estimator.
We follow the approach of Landy & Szalay (1993; hereafter, LS93), but generalize it in a number of ways: In particular, since we focus on the effect of the random catalog, we do not work in the limit . Also, we calculate covariances, not just variances, and make fewer approximations (see Sect. 2.2).
The layout of the paper is as follows. In Sect. 2 we derive theoretical results for bias and covariance. In Sect. 3 we focus on the split LS estimator and its optimization. In Sect. 4 we test the different estimators with mock catalogs. Finally, we discuss the results and present our conclusions in Sect. 5.
2 Theoretical results: bias and covariance
2.1 General derivation
We follow the derivation and notations in LS93 but extend to the case that includes random counts covariance. We consider the survey volume as divided into microcells (very small subvolumes) and work in the limit , which means that no two objects will ever be located within the same microcell.
Here, , , and represent the relative deviation of the , , and counts from their expectation values (mean values over an infinite number of independent realizations):
[TABLE]
By definition . The factors , , and represent fluctuations in the pair counts, which arise as a result of a Poisson process. As long as the mean pair counts per bin are large () the relative fluctuations will be small. We calculate up to second order in , , and , and ignore the higher-order terms (in the limit , , so LS93 set at this point).
The expectation values for the pair counts are:
[TABLE]
where is the correlation function normalized to the actual number density of galaxies in the survey and
[TABLE]
is the fraction of microcell pairs with separation . Here if falls in the -bin, otherwise it is equal to zero.
The expectation value of the LS estimator (1) is
[TABLE]
A finite catalog thus introduces a (small) bias. (In LS93, , so the estimator is unbiased in this limit.) This expression is calculated to second order in , , and (we denote equality to second order by ‘’). Calculation to higher order is beyond the scope of this work. Since data and random catalogs are independent, .
We introduce shorthand notations for , for , and similarly for other terms.
For the covariance we get
[TABLE]
Terms with represent additional variance due to finite , and are new compared to those of LS93. Also, collects an additional contribution, which we denote by , from variations in the random field (see Sect. 2.3). The cross terms and , instead depend linearly on the random field, and average to the result. The additional contribution due to finite is thus
[TABLE]
From (2),
[TABLE]
and so on, so that the covariances of the deviations are obtained from
[TABLE]
2.2 Quadruplets, triplets, and approximations
We use
[TABLE]
to denote the fraction of ordered microcell triplets, where and . The notation means that only terms where all indices (microcells) are different are included. Here is of the same magnitude as but is larger.
Appendix A gives examples of how the and so on in (9) are calculated. These covariances involve expectation values , where is the number of objects (0 or 1) in microcell and so on, and only cases where the four microcells are separated pairwise by and are included. If all four microcells , , , and are different, we call this case a quadruplet; it consists of two pairs with separations and . If two of the indices, that is, microcells, are equal, we have a triplet with a center cell (the equal indices) and two end cells separated from the center by and .
We make the following three approximations:
For microcell quadruplets, the correlations between unconnected cells are approximated by zero on average. 2. 2.
Three-point correlations vanish. 3. 3.
The part of four-point correlations that does not arise from the two-point correlations vanishes.
With approximations (2) and (3), we have for the expectation value of a galaxy triplet
[TABLE]
where , and for a quadruplet
[TABLE]
We use ‘’ to denote results based on these three approximations. Approximation (1) is good as long as the survey size is large compared to . It allows us to drop terms other than in (12). Approximations (2) and (3) hold for Gaussian density fluctuations, but in the realistic cosmological situation they are not good: the presence of the higher-order correlations makes the estimation of the covariance of estimators a difficult problem. However, this difficulty applies only to the contribution of the data to the covariance, that is, to the part that does not depend on the size and treatment of the random catalog. The key point in this work is that while our theoretical result for the total covariance does not hold in a realistic situation (it is an underestimate), our results for the difference in estimator covariance due to different treatments of the random catalog hold well.
In addition to working in the limit (), LS93 considered only 1D bins and the case where (i.e., variances, not covariances) and made also a fourth approximation: for triplets (which in this case have legs of equal length) they approximated the correlation between the end cells (whose separation in this case varies between [math] and ) by . We use to denote the mean value of the correlation between triplet end cells (separated from the triplet center by and ). (For our plots in Sect. 4 we make a similar approximation of as Landy&Szalay, see Sect. 4.2.) Also, LS93 only calculated to first order in , whereas we do not make this approximation. Bernstein (1994) also considered covariances, and included the effect of three-point and four-point correlations, but worked in the limit ().
2.3 Poisson, edge, and terms
After calculating all the and so on (see Appendix A), (9) becomes
[TABLE]
for the standard LS estimator.
Following the definition of and in LS93, we define
[TABLE]
For their diagonals (), we write , , , , , , and . Thus, , and so on. (We use superscripts for the matrices, e.g., and subscripts for their diagonals, e.g., .)
Using these definitions, (13) becomes
[TABLE]
The new part in due to finite size of the random catalog is
[TABLE]
Thus only and are affected by (in our approximation its effect cancels in ). The results for , , and are exact. The result for involves all three approximations mentioned above, involves approximations (1) and (2), and involves approximation (1).
We refer to , , and as ‘Poisson’ terms and and as ‘edge’ terms (the difference between and is due to edge effects). While the Poisson terms are strongly diagonal dominated, the edge terms are not. Since , the terms are much larger than the edge terms, but they get multiplied by or . In the limit : , , ; and also are unaffected.
We see that - and - correlations arise from edge effects. If we increase the density of data or random objects, the Poisson terms decrease as but the edge terms decrease only as so the edge effects are more important for a higher density of objects.
Doubling the bin size (combining neighboring bins) doubles but makes four times as large, since triplets where one leg was in one of the original smaller bins and the other leg was in the other bin are now also included. Thus, the ratio and are not affected, but the dominant term in , is halved. Edge effects are thus more important for larger bins.
2.4 Results for the standard Landy–Szalay estimator
Inserting the results for and so on into Eqs. (5) and (6), we get that the expectation value of the standard LS estimator (1) is
[TABLE]
This holds also for large and in the presence of three-point and four-point correlations. A finite catalog thus introduces a bias ; the edge () part of the bias cancels in the limit.
For the covariance we get
[TABLE]
Because of the approximations made, this result for the covariance does not apply to the realistic cosmological case; not even for large separations , where is small, since large correlations at small increase the covariance also at large . However, this concerns only and . Our focus here is on the additional covariance due to the size and handling of the random catalog, which for standard LS is
[TABLE]
To zeroth order in the covariance is given by the Poisson terms and the edge terms cancel to first order in . This is the property for which the standard LS estimator was designed. To first order in , the terms contribute. This contribution involves the triplet correlation , which, depending on the form of , may be larger than or .
If we try to save cost by using a diluted random catalog with for pairs, is replaced by with in place of , but and are unaffected, so that the edge terms involving randoms no longer cancel. In Sect. 4 we see that this is a large effect. Therefore, one should not use dilution.
3 Split random catalog
3.1 Bias and covariance for the split method
In the split method one has independent smaller catalogs of size instead of one large random catalog . Their union, has a size of . The pair counts and are calculated as
[TABLE]
that is, pairs across different catalogs are not included in . The total number of pairs in is . Here, is equal to its value in standard LS.
The split Landy–Szalay estimator is
[TABLE]
Compared to standard LS, , , and are unaffected. We construct , , and from the standard LS results, bearing in mind that the random catalog is a union of independent catalogs, arriving at
[TABLE]
where
[TABLE]
The first is the same as in standard LS and dilution, but the second differs both from standard LS and from dilution, since it involves both and .
For the expectation value we get
[TABLE]
so that the bias is . In the limit the edge part cancels, leaving only the Poisson term.
The covariance is
[TABLE]
The change in the covariance compared to the standard LS method is
[TABLE]
which again applies in the realistic cosmological situation. Our main result is that in the split method the edge effects cancel and the bias and covariance are the same as for standard LS, except that the Poisson term from is replaced with the larger .
3.2 Optimizing computational cost and variance of the split method
The bias is small compared to variance in our application (see Fig. 1 for the theoretical result and Fig. 5 for an attempted bias measurement), and therefore we focus on variance as the figure of merit. The computational cost should be roughly proportional to
[TABLE]
and the additional variance due to finite catalog in the limit becomes
[TABLE]
Here, and are fixed by the survey and the requested binning, but we can vary and in the search for the optimal computational method. In the above we defined the ‘cost’ and ‘variance’ factors and .
We may ask two questions:
For a fixed level of variance , which combination of and minimizes computational cost ? 2. 2.
For a fixed computational cost , which combination of and minimizes the variance ?
The answer to both questions is (Slepian & Eisenstein 2015)
[TABLE]
Thus, the optimal version of the split method is the natural one where . In this case the additional variance in the limit becomes
[TABLE]
and the computational cost factor becomes
[TABLE]
meaning that pairs contribute twice as much as pairs to the variance and also twice as much computational cost is invested in them. The memory requirement for the random catalog is then the same as for the data catalog. The cost saving estimate above is optimistic, since the computation involves some overhead not proportional to the number of pairs.
For small scales, where , the situation is different. The greater density of pairs due to the correlation requires a greater density of the catalog so that the additional variance from it is not greater. From Eq. (19) we see that the balance of the and the contributions is different for large (the term vs. the other terms). We may consider recomputing for the small scales using a smaller and a larger catalog. Considering just the Poisson terms ( and or ) with a ‘representative’ value, (27) and (28) become and which modifies the above result (Eq. 29) for the optimal choice of and to
[TABLE]
This result is only indicative, since it assumes a constant for . In particular, it does not apply for , because then the approximation of ignoring the and terms in (19) is not good.
4 Tests on mock catalogs
4.1 Minerva simulations and methodology
The Minerva mocks are a set of 300 cosmological mocks produced with -body simulations (Grieb et al. 2016; Lippich et al. 2019), stored at five output redshifts . The cosmology is flat CDM with , and we use the outputs. The mocks have objects (“halos” found by a friends-of-friend algorithm) in a box of {\mathrm{\mathnormal{h}}}^{-1}\text{,}\mathrm{Mpc}$$ cubed.
To model the survey geometry of a redshift bin with at , we placed the observer at comoving distance {\mathrm{\mathnormal{h}}}^{-1}\text{,}\mathrm{Mpc} from the center of the cube and selected from the cube a shell $2201.34$–$2367.92${\mathrm{\mathnormal{h}}}^{-1}\text{\,}\mathrm{Mpc} from the observer. The comoving thickness of the shell is {\mathrm{\mathnormal{h}}}^{-1}\text{,}\mathrm{Mpc}$$. The resulting mock sub-catalogs have and are representative of the galaxy number density of the future Euclid spectroscopic galaxy catalog.
We ignore peculiar velocities, that is, we perform our analysis in real space. Therefore, we consider results for the 1D 2PCF . We estimated up to {\mathrm{\mathnormal{h}}}^{-1}\text{,}\mathrm{Mpc} using $\Delta r=1${\mathrm{\mathnormal{h}}}^{-1}\text{\,}\mathrm{Mpc} bins.
We chose standard LS with as the reference method. In the following, LS without further qualification refers to this. The random catalog was generated separately for each shell mock to measure their contribution to the variance. For one of the random catalogs we calculated also triplets to obtain the edge effect quantity .
While dilution can already be discarded on theoretical grounds, we show results obtained using dilution, since these results provide the scale for edge effects demonstrating the importance of eliminating them with a careful choice of method. For the dilution and split methods we also used , and tried out dilution fractions and split factors (chosen to have similar pairwise computational costs). In addition, we considered standard LS with , which has the same number of pairs as and , but only half the number of pairs; and standard LS with to demonstrate the effect of a small .
The code used to estimate the 2PCF implements a highly optimized pair-counting method, specifically designed for the search of object pairs in a given range of separations. In particular, the code provides two alternative counting methods, the chain-mesh and the kd-tree. Both methods measure the exact number of object pairs in separation bins, without any approximation. However, since they implement different algorithms to search for pairs, they perform differently at different scales, both in terms of CPU time and memory usage. Overall, the efficiency of the two methods depends on the ratio between the scale range of the searching region and the maximum separation between the objects in the catalog.
The kd-tree method first constructs a space-partitioning data structure that is filled with catalog objects. The minimum and maximum separations probed by the objects are kept in the data structure and are used to prune object pairs with separations outside the range of interest. The tree pair search is performed through the dual-tree method in which cross-pairs between two dual trees are explored. This is an improvement in terms of exploration time over the single-tree method.
On the other hand, in the chain-mesh method the catalog is divided in cubic cells of equal size, and the indexes of the objects in each cell are stored in vectors. To avoid counting object pairs with separations outside the interest range, the counting is performed only on the cells in a selected range of distances from each object. The chain-mesh algorithm has been imported from the CosmoBolognaLib, a large set of free software C++/python libraries for cosmological calculations (Marulli et al. 2016).
For our test runs we used the chain-mesh method.
4.2 Variance and bias
In Fig. 1 we show the mean (over the 300 mock shells) estimated correlation function and the scatter (square root of the variance) of the estimates using the LS, split, and dilution methods; our theoretical approximate result for the scatter for LS; and our theoretical result for bias for the different methods.
The theoretical result for the scatter is shown with and without the terms, which include the triplet correlation , for which we used here the approximation . This behaves as expected, that is, it underestimates the variance, since we neglected the higher-order correlations in the catalog. Nevertheless, it (see the dash-dotted line in Fig. 1) has similar features to the measured variance (dashed lines).
In Fig. 2 we plot the diagonals of the , , and quantities. This shows how their relative importance changes with separation scale. It also confirms that our initial assumption on small relative fluctuations is valid in this simulation case.
Consider now the variance differences (from standard LS with ), for which our theoretical results should be accurate. Figure 3 compares the measured variance difference to the theoretical result. For the diluted estimators and LS with the measured result agrees with theory, although clearly the measurement with just 300 mocks is rather noisy. For the split estimators and LS with the difference is too small to be appreciated with 300 mocks, but at least the measurement does not disagree with the theoretical result.
In Fig. 4 we show the relative theoretical increase in scatter compared to the best possible case, which is LS in the limit . Since we do not have a valid theoretical result for the total scatter, we estimate it by subtracting the theoretical difference from LS with from the measured variance of the latter.
At scales {\mathrm{\mathnormal{h}}}^{-1}\text{,}\mathrm{Mpc} the theoretical prediction is about the same for dilution and split and neither method looks promising for $r\ll 10${\mathrm{\mathnormal{h}}}^{-1}\text{\,}\mathrm{Mpc} where . This suggests that for optimizing cost and accuracy, a different method should be used for smaller scales than that used for large scales. The number of pairs with small separations is much less. Therefore, for the small-scale computation there is no need to restrict the computation to a subset of pairs, or alternatively, one can afford to increase . For the small scales, we may consider the split method with increased as an alternative to standard LS. We have the same number of pairs to compute as in the reference LS case, if we use and . We added this case to Fig. 4. It seems to perform better than LS at intermediate scales, but for the smallest scales LS has the smaller variance. This is in line with our conclusion in Sect. 3.2, which is that when , it is not optimal to split the catalog into small subsets.
We also compared the differences in the mean estimate from the different estimators to our theoretical results on the bias differences (see Fig. 5), but the theoretical bias differences are much smaller than the expected error of the mean from 300 mocks; and we simply confirm that the differences we see are consistent with the error of the mean and thus consistent with the true bias being much smaller. We also performed tests with completely random () mocks, and with a large number () of mocks confirmed the theoretical bias result for the different estimators in this case. Since the bias is too small to be interesting we do not report these results in more detail here.
However, we note that for the estimation of the 2D 2PCF and its multipoles, the 2D bins will contain a smaller number of objects than the 1D bins of these test runs and therefore the bias is larger. Using the theoretical results (17) or (24) the bias can be removed afterwards with accuracy depending on how well we know the true .
4.3 Computation time and variance
The test runs were made using a single full 24-core node for each run. Table 1 shows the mean computation time and mean estimator variance for different ranges for the different cases we tested. Of these ranges, the –{\mathrm{\mathnormal{h}}}^{-1}\text{,}\mathrm{Mpc}$$ is perhaps the most interesting, since it contains the important BAO scale. Therefore, we plot the mean variance at this range versus mean computation time in Fig. 6, together with our theoretical predictions. The theoretical estimate for the computation time for other dilution fractions and split factors is
[TABLE]
assuming . For standard LS with other random catalog sizes, the computation time estimate is
[TABLE]
5 Conclusions
The computational time of the standard Landy–Szalay estimator is dominated by the pairs. However, except at small scales where correlations are large, these make a negligible contribution to the expected error compared to the contribution from the and pairs. Therefore, a substantial saving of computation time with an insignificant loss of accuracy can be achieved by counting a smaller subset of pairs.
We considered two ways to reduce the number of pairs: dilution and split. In dilution, only a subset of the catalog is used for pairs. In split, the catalog is split into a number of smaller subcatalogs, and only pairs within each subcatalog are counted. We derived theoretical results for the additional estimator covariance and bias due to the finite size of the random catalog for these different variants of the LS estimator, extending in many ways the original results by Landy & Szalay (1993), who worked in the limit of an infinite random catalog. We tested our results using 300 mock data catalogs, representative of the – redshift range of the future Euclid survey. The split method maintains the property the Landy–Szalay estimator was designed for, namely cancelation of edge effects in bias and variance (for ), whereas dilution loses this cancellation and therefore should not be used.
For small scales, where correlations are large, one should not reduce counts as much. The natural dividing line is the scale where . Interestingly, the difference in bias and covariance between the different estimators (split, dilution, and LS) vanishes when . We recommend the natural version of the split method, , for large scales where . This leads to a saving in computation time by more than a factor of ten (assuming ) with a negligible effect on variance and bias. For small scales, where , one should consider using a larger random catalog and one can use either the standard LS method or the split method with a more modest split factor. Because the number of pairs with these small separations is much smaller, the computation time is not a similar issue as for large separations.
The results of our analysis will have an impact also on the computationally more demanding task of covariance matrix estimation. However, since in that case the exact computational cost is determined by the balance of data catalogs and random catalogs, which does not need to be the same as for the individual two-point correlation estimate, we postpone a quantitative analysis to a future, dedicated study. The same kind of methods can be applied to higher-order statistics (three-point and four-point correlation functions) to speed up their estimation (Slepian & Eisenstein 2015).
Acknowledgements.
We thank Will Percival and Cristiano Porciani for useful discussions. The 2PCF computations were done at the Euclid Science Data Center Finland (SDC-FI, urn:nbn:fi:research-infras-2016072529), for whose computational resources we thank CSC – IT Center for Science, the Finnish Grid and Cloud Computing Infrastructure (FGCI, urn:nbn:fi:research-infras-2016072533), and the Academy of Finland grant 292882. This work was supported by the Academy of Finland grant 295113. VL was supported by the Jenny and Antti Wihuri Foundation, AV by the Väisälä Foundation, AS by the Magnus Ehrnrooth Foundation, and JV by the Finnish Cultural Foundation. We also acknowledge travel support from the Jenny and Antti Wihuri Foundation. VA acknowledges funding from the European Union’s Horizon 2020 research and innovation programme under grant agreement No 749348. FM acknowledges the grants ASI n.I/023/12/0 “Attivit‘a relative alla fase B2/C per la missione Euclid” and PRIN MIUR 2015 “Cosmology and Fundamental Physics: illuminating the Dark Universe with Euclid”.
Appendix A Derivation examples
As examples of how the variances of the different deviations, and so on, are derived, following the method presented in LS93, we give three of the cases here: (common for all the variants of LS), and and for the split method. The rest are calculated in a similar manner.
To derive the appearing in the of (15) we start from
[TABLE]
where both and sum over all microcell pairs; is the number of galaxies in microcell . There are three different cases for the terms depending on how many indices are equal ( and for all of them).
The first case (quadruplets, i.e., two pairs of microcells) is when are all different. We use to denote this part of the sum. There are (we work in the limit ) such terms and they have
[TABLE]
where ) is the density perturbation (relative to the actual mean density of galaxies in the survey), is the three-point correlation, and is the connected (i.e., the part that does not arise from the two-point correlation) four-point correlation. The fraction of microcell quadruplets (pairs of pairs) that satisfy and is . In the limit of large the number of other index quadruplets is negligible compared to those where all indices have different values, so we have
[TABLE]
For the connected pairs and we have and .
The second case (triplets of microcells) is when or is equal to or . We denote this part of the sum by
[TABLE]
It turns out that it goes over all ordered combinations of , where are all different, exactly once, so there are such terms (triplets). Here
[TABLE]
and
[TABLE]
The third case (pairs of microcells) is when and . This part of the sum becomes
[TABLE]
where
[TABLE]
and
[TABLE]
that is, the sum vanishes unless the two bins are the same, .
We now apply the three approximations listed in Sect. 2.2:
[TABLE]
and using from (3) we arrive at the result given in Eq. (15).
For the split method, we give the calculation for
[TABLE]
below.
First the :
[TABLE]
where
[TABLE]
is the number (0 or 1) of objects in microcell , and is the number of objects in microcell .
There are quadruplet terms, for which (see (37)) , and
[TABLE]
Triplets where or is equal to have , since always (we cannot have two objects, from and , in the same cell). There are triplet terms where or is equal to : for them , and
[TABLE]
where if then also since .
Pairs where or have , since again we cannot have two different objects in the same cell. Thus
[TABLE]
and we obtain that
[TABLE]
which is equal to the of standard LS.
Then the :
[TABLE]
where there are terms with giving
[TABLE]
and terms with giving
[TABLE]
Adding these up gives
[TABLE]
and
[TABLE]
This differs both from standard LS and from dilution, since it involves both and .
The reference list from the paper itself. Each links out to its DOI / PubMed record.
- 1Alam et al. (2017) Alam S. et al. 2017, MNRAS 470, 2617
- 2Alcock & Paczyński (1979) Alcock, C. & Paczyński, B. 1979, Nature 281, 358
- 3Alonso (2012) Alonso, D. 2012, ar Xiv:1210.1833
- 4Ata et al. (2018) Ata, M. et al. 2018, MNRAS 473, 4773
- 5Anderson et al. (2012) Anderson, L. et al. 2012, MNRAS 427, 3435
- 6Anderson et al. (2014) Anderson, L. et al. 2014, MNRAS 441, 24
- 7Bautista et al. (2018) Bautista, J.E. et al. 2018, Ap J 863, 110
- 8Bernstein (1994) Bernstein, G.M. 1994, Ap J 424, 569
