Computing 3 point correlation function randoms counts without the randoms catalogue
David W. Pearson, Lado Samushia

TL;DR
This paper introduces a fast, novel method for calculating three-point correlation function randoms counts directly from data, significantly reducing computational time without needing random catalogues.
Contribution
The authors develop a new approach that computes triplet counts using one-dimensional integrals, bypassing the traditional need for dense random catalogues in three-point statistics.
Findings
Speeds up three-point function calculations by orders of magnitude.
Eliminates the need for generating and counting random catalogues.
Maintains accuracy while reducing computational resources.
Abstract
As we move towards future galaxy surveys, the three-point statistics will be increasingly leveraged to enhance the constraining power of the data on cosmological parameters. An essential part of the three-point function estimation is performing triplet counts of synthetic data points in random catalogues. Since triplet counting algorithms scale at best as with the number of particles and the random catalogues are typically at least 50 times denser than the data; this tends to be by far the most time-consuming part of the measurements. Here we present a simple method of computing the necessary triplet counts involving uniform random distributions through simple one-dimensional integrals. The method speeds up the computation of the three-point function by orders of magnitude, eliminating the need for random catalogues, with the simultaneous pair and triplet…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Computing 3 point correlation function
randoms counts without the randoms catalogue
David W. Pearson,1 & Lado Samushia1,2
1Department of Physics, Kansas State University, 116 Cardwell Hall, Manhattan, KS, 66506, USA
2National Abastumani Astrophysical Observatory, Ilia State University, 2A Kazbegi Ave., GE-1060 Tbilisi, Georgia E-mail: [email protected]
(Accepted 2019 May 01. Received 2019 April 23; in original form 2019 March 22)
Abstract
As we move towards future galaxy surveys, the three-point statistics will be increasingly leveraged to enhance the constraining power of the data on cosmological parameters. An essential part of the three-point function estimation is performing triplet counts of synthetic data points in random catalogues. Since triplet counting algorithms scale at best as with the number of particles and the random catalogues are typically at least 50 times denser than the data; this tends to be by far the most time-consuming part of the measurements. Here we present a simple method of computing the necessary triplet counts involving uniform random distributions through simple one-dimensional integrals. The method speeds up the computation of the three-point function by orders of magnitude, eliminating the need for random catalogues, with the simultaneous pair and triplet counting of the data points alone being sufficient.
keywords:
cosmology: miscellaneous – large-scale structure of Universe – methods: numerical
††pubyear: 2019††pagerange: Computing 3 point correlation function randoms counts without the randoms catalogue–Computing 3 point correlation function randoms counts without the randoms catalogue
1 Introduction
Current and past galaxy redshifts surveys have heavily relied upon the analysis of two-point statistics to constrain cosmological parameters down to the percent level (see e.g. Anderson et al., 2012; Chuang et al., 2016; Cuesta et al., 2016; Gil-Marín et al., 2016a, b; Alam et al., 2017). As upcoming galaxy redshift surveys seek to push constraints on cosmological parameters to the sub-percent level, three-point statistics – the three-point correlation function (3PCF) and the bispectrum – will begin to play a bigger role in analyses. Recent works have shown that the baryon acoustic oscillation features are detectable in both the 3PCF (Slepian & Eisenstein, 2015; Slepian et al., 2017a, b) and the bispectrum (Pearson & Samushia, 2018, 2019), hinting at the possibility of increased constraining power on cosmological parameters via the inclusion of three-point statistics. The large scale bispectrum has also been used to supplement the redshift-space distortion measurements from the power spectrum (Gil-Marín et al., 2015a, b; Gil-Marín et al., 2017).
Turning to small scales, the Halo Occupation Distribution (HOD) model (Scherrer & Bertschinger, 1991; Jing et al., 1998; Peacock & Smith, 2000; Seljak, 2000; Scoccimarro et al., 2001; Berlind & Weinberg, 2002; Zheng et al., 2009, and references therein) is a popular method of linking the galaxy and dark matter distributions (see e.g. Tinker et al., 2005; Guo et al., 2014, 2016; Rodríguez-Torres et al., 2016). Yuan, Eisenstein & Garrison (2018) have shown that the squeezed 3PCF can help tightly constrain the parameters of the HOD, making it very likely that the 3 point statistics will become important to the development of mock catalogs.
To measure the 3PCF from data, one must count the triangles of specific shapes and sizes from the data – e.g. the galaxy catalogue – as well as a set of unclustered random points – the random catalogue – a process which is naively in time complexity. The counts from the data are then compared to the expected mean numbers from the unclustered random points to get an estimate of the 3PCF (Peebles & Groth, 1975; Peebles, 1980), as the unclustered mean triplet counts are very sensitive to the geometry of the survey and its number density variations. The most popular estimator used for the 3PCF is that of Szapudi & Szalay (1998) due to its superior edge effect corrections (Kayo et al., 2004),
[TABLE]
where the combinations of s and s tell you how many vertices of the triangle come from either the data or the random points, respectively. The random catalogue usually contains 50 or more times as many objects as the data to make the shot-noise in this Monte-Carlo estimation subdominant to the variance in the data. Combine this with the complexity and the counts for , , and can consume a large amount of time.
While the computational complexity of the three-point statistics can be somewhat mitigated (Baldauf et al., 2015; Slepian & Eisenstein, 2015; Pearson & Samushia, 2018), their calculation still tends to take a significant amount of CPU time. Additionally, the studies which stand to benefit from the inclusion of the three-point statistics may need to measure them tens of thousands of times, making any potential reduction in the computational complexity a welcome improvement.
In this letter, we present a simple method of obtaining the counts involving the random points that does not require actually counting triangles from random catalogues. Since triplet-counting from random catalogues takes most of the computational time, this results in a significant reduction in required CPU hours. The method is only applicable to measurements of 3PCF from uniform periodic cubes, which means that, unfortunately, it does not apply to measurements from real survey data. There are, however, many stages in the cosmological analysis of the 3PCF that can be performed on uniform periodic cubes and do not need to account for survey geometry. These include the validation of theoretical templates – comparing model predictions to measurements from simulations and quantifying biases – and the HOD parameter fitting. For these applications our method allows one to set the number density of randoms arbitrarily high with no performance degradation.
2 Method
2.1 Pair counts
We start by reminding the reader why random pair counting is not really necessary when computing the two-point statistics from a periodic cube (see Roukema & Peterson, 1994, for an application of analytic two-point randoms counts to the angular correlation function). The number of unclustered pairs separated by a distance can be easily computed analytically. There are on average particles in a uniform periodic cube with volume and a number density of points . The volume between two concentric spheres of radii and – i.e. a spherical shell – is
[TABLE]
and the total number of pairs is
[TABLE]
since The cube is periodic there are no edge effects. We will now generalize this idea for triplet counts.
2.2 RRR counts
We need to estimate the number of triplets separated by in a uniform periodic cube with a number density of and volume . We will start by locating the number of points given and , which is achieved by finding the volume of the two overlapping spherical shells whose cross-sections are shown in Figure 1, and multiplying by our number density.
We will make use of the equation for the volume of overlap of two spheres with radii and separated by a distance (Weisstein, 1999),
[TABLE]
This equation is only valid if the spheres touch at one or more points, i.e. . Because of this, we need to explicitly define the volume outside of those bounds,
[TABLE]
From careful study of Figure 1, we can find that the volume of interest is given by first finding the overlap volume of the two outer spherical surfaces, then subtracting the overlap volume of the outer spherical surface of one shell with the inner spherical surface of the other and vice versa. However, this ends up removing the overlap volume of the two inner spherical surfaces twice, so we have to add one back. Mathematically, this can be expressed as
[TABLE]
Here, care must be taken that the of equation (5) is actually the smaller of the two spherical shell radii. For example, when considering isosceles or equilateral triangles, it is possible that is larger than , since may equal . Exercising this caution, along with the special considerations of equation (5) will yield the correct volume even in special cases where the overlap volume looks quite different than in Figure 1 – see e.g. Figure 2. This volume times is the number of points falling into the overlap region.
For a finite bin width, can be anywhere inside a spherical shell around so we will have to integrate equation (6) with respect to . Finally, we have to account for the fact that there are on average points . Combining all of this results in an exact expression for the expected number of counts without any shot-noise expressed as a simple one dimensional integral,
[TABLE]
Here is the number of unique permutations of the side lengths, with , , and for equilateral, isosceles, and general triangles, respectively. We note that for most triangles this reduces to a surprisingly simple expression,
[TABLE]
However, this expression will break down in special cases such as those shown in Figure 2. For this reason, we recommend simply evaluating the integral in equation (7) numerically.111The terms in equation (6) will at most be polynomials of degree 5 in , meaning a simple 3 point Gaussian quadrature rule is all that is needed.
2.3 DRR counts
For counts we have to replace the first vertex by a data point. Since data and random points are uncorrelated222This would not be the case for nontrivial survey geometries where the exact placement of data points with respect to the boundary makes a difference. between each other the formula for the counts coincides with the expression for in equation (7) with one minor change due to the potentially differing number densities of data and random points
[TABLE]
where is the average number density of data points – i.e. .
2.4 DDR counts
The procedure for predicting the counts is almost identical to that outlined in section 2.2, except for the last step where we integrate over . Since the data points are clustered, the distribution of around is not uniform and the number density depends on the distance resulting in
[TABLE]
Here is the nonuniform data number density, which can easily be found from the data pair counts by first computing the number density in each spherical shell
[TABLE]
and then linearly interpolating333Higher order interpolation schemes should also be fine to use here, but due to the need to also extrapolate in the first and last shell, linear order was used for simplicity. to the value at the particular needed for the numerical integration.
Note that we can no longer use a simple integer multiple to account for the unique permutations of the side lengths as the value of will change as we permute. Given this, there will be 1, 3 or 6 terms between the braces for equilateral, isosceles or general triangles, respectively. We note that when summing these terms, it is necessary to store the number of triangles as a floating point number to ensure accuracy. After the summing of all the needed terms in the braces, the number can be rounded to an integer if desired.
Equations (7), (9), and (10) express all terms in the estimator of equation (1) that involve random points without any shot-noise and in terms of simple one-dimensional integrals. counts do require the pair counts of data points, but by computing these along with the counts, no additional computational time is required. For the benefit of the community, we have publicly released our code for these predictive calculations.444https://github.com/dpearson1983/rawor
Currently, this library is available for use in C++ and Python, though we note that it is necessary to have the boost-python library to make use of the Python version.
3 Results
To test equations (7), (9) and (10) we compared their predictions with actual triplet counts. We generated 100 lognormal mock catalogues (Coles & Jones, 1991) – , – using the method described in Appendix A of Beutler et al. (2011) with a power spectrum from the camb software (Lewis et al., 2000) using the Planck cosmology (Planck Collaboration et al., 2018) to serve as proxy for data points555While lognormal mocks do not adequately reproduce the three-point clustering statistics observed in simulations or the real Universe, they do contain a non-zero three-point signal which is sufficient for the purposes of our testing. As we already had many lognormal mocks at our disposal, we utilized some of them out of convenience., and two sets of random distributions, 100 at and 100 the density of the data. We performed the direct counts in bins of width , for , and calculated our predictions. Finally, we separately averaged together all 100 counts and predictions in order to make our comparisons.
To perform the actual counts, we used a relatively straightforward, GPU accelerated, counting algorithm with periodic boundary conditions to remove any edge effects along with a few simple optimizations.666 While we do not release our exact code used here publicly, we utilized the same algorithm in a library for use in a different project. For those who may be interested, you can view the GPU implementation at https://github.com/dpearson1983/ganpcf/blob/master/source/ganpcf.cu, lines 353 – 396 and 426 – 561. We first bin the particles to a grid with spacing equal to , the maximum separation considered for any of the sides of the triangle, allowing us to limit our triplet searching to nearby particles. We also verify that the distance between the first and second vertex is less than before checking for the third vertex. While it would usually be faster to not repeat count the same triangle by simply relabeling which is the first, second and third vertex, due to the peculiarities of GPU memory accesses, we found that it was more efficient to do the repeat counting.
Figure 3 shows the difference of the average (top), (middle) and (bottom) counts with the predictions from equations (7), (9) and (10) divided by their standard deviations added in quadrature, with the and randoms cases on the left and right, respectively. The predictions match the actual counts remarkably well, especially as the number of random points increases.
The counts tend to show more significant deviations between the direct counts and our predictions at small triangle indices – i.e. small scales. We verified that these deviations are not systematic – i.e. they will fluctuate up and down with equal probability for independent realisations averaging to zero – and that they tend to zero in the limit of high number density. They look systematic because at small triangular indexes the measurements share the same small number of points and are strongly correlated – or anti-correlated depending on the shapes. Additionally, the deviations all occur when one side of the triangle is shorter than 5 Mpc, or 3 Mpc for the and randoms cases, respectively. The lognormal mocks were created using a grid spacing of 2 Mpc, where the number of galaxies was determined for each grid cell, then placed inside that cell in a uniform random fashion. We ran tests on mocks with a grid spacing of 0.5 Mpc but the same number density, and note a reduction in the deviations – see Figure 4.
To further test that these small scale deviations were due to some combination of low number densities and grid effects in the mocks, we generated haloes with an exactly known, Gaussian number density profile. These were purely artificial constructions where points were placed randomly a distance from the center with a probability proportional to the integral of the Gaussian function.777This gives you a function describing the number of points as a function of instead of just the number density. We then calculated the number density in thin spherical shells to verify that it followed the input Gaussian function. Since we then knew the exact number density profile, this eliminated the need to use the data-data pair counts so long as our was taken as the central particle of the halo. We could then use the Gaussian that described the number density in the integrals equation (10) and simply replace with 1. The random catalogues were set to the same number density – e.g. . The number density profile was such that particles were at most Mpc from the center with very few particles in the outer parts of the haloes, which led to seeing a similar, though significantly smaller, seemingly systematic effect for high triangle indices.
We show the results of using a higher resolution grid for creating the lognormal mocks and the Gaussian haloes in Figure 4. These results suggest that the seemingly systematic deviations in the counts versus predictions are a combined result of small numbers of data-data pairs causing correlations in the counts, and mock grid resolution, not a failing of the algorithm presented here.
4 Conclusions
We present a method for predicting the counts involving a random catalogue for the 3PCF analysis of simulated or mock data that is free of shot-noise and does not require random catalogues. We have shown that the predictions from equations (7), (9), and (10) agree remarkably well with the actual counts, while keeping the same computational complexity for arbitrarily high number densities of randoms.
The method only works for uniform periodic cubes which may lead the reader to believe that the method is not useful for the analysis of real survey data. This, however, is not true as there are many stages in the survey data analysis that require computing the 3PCF from a large number of periodic cubes.
The most obvious example is the validation of theoretical templates on -body simulations. To make sure that theoretical predictions of the 3PCF are sufficiently accurate – and to calibrate any systematic effects if they are not – they must be compared to the measurements from -body simulations with a known cosmology, or a hidden cosmology in the case of data challenges. For these validation tests to be meaningful, they need to be performed on periodic cubes to cleanly separate theoretical systematics from other observational effects. Since the cumulative volume of -body simulations used for this purpose needs to be much larger than the size of the data, the method presented in this paper can save a significant amount of computational time.
Acknowledgements
We wish to thank Federico Marulli, Michele Moresco and Zachary Slepian for their very insightful and useful comments on an earlier draft of this paper. LS is grateful for the support from NASA grant 12-EUCLID11-0004 and the DOE grant DE-SC0011840. NASA’s Astrophysics Data System Bibliographic Service and the arXiv e-print service were used for this work. Additionally, we wish to acknowledge gnuplot, a free, open-source plotting utility which was used to create all of our figures.
The reference list from the paper itself. Each links out to its DOI / PubMed record.
- 1Alam et al. (2017) Alam S., et al., 2017, MNRAS , 470, 2617 · doi ↗
- 2Anderson et al. (2012) Anderson L., et al., 2012, MNRAS , 427, 3435 · doi ↗
- 3Baldauf et al. (2015) Baldauf T., Mercolli L., Mirbabayi M., Pajer E., 2015, J. Cosmology Astropart. Phys. , 5, 007 · doi ↗
- 4Berlind & Weinberg (2002) Berlind A. A., Weinberg D. H., 2002, Ap J , 575, 587 · doi ↗
- 5Beutler et al. (2011) Beutler F., et al., 2011, MNRAS , 416, 3017 · doi ↗
- 6Chuang et al. (2016) Chuang C.-H., et al., 2016, MNRAS , 461, 3781 · doi ↗
- 7Coles & Jones (1991) Coles P., Jones B., 1991, MNRAS , 248, 1 · doi ↗
- 8Cuesta et al. (2016) Cuesta A. J., et al., 2016, MNRAS , 457, 1770 · doi ↗
