Graph Database Solution for Higher Order Spatial Statistics in the Era of Big Data
Cristiano G. Sabiu, Ben Hoyle, Juhan Kim, Xiao-Dong Li

TL;DR
This paper introduces a graph database algorithm for efficiently computing higher-order spatial correlation functions in large point datasets, enabling practical analysis of galaxy distributions at cosmological scales.
Contribution
The authors develop a novel graph-based algorithm utilizing kd-trees and graph databases to compute N-point correlation functions efficiently in big spatial data.
Findings
Significant computational speed-up over traditional methods.
Able to measure 3-point correlations beyond BAO scale in SDSS data.
First measurement of 4-point correlation function in large galaxy sample.
Abstract
We present an algorithm for the fast computation of the general -point spatial correlation functions of any discrete point set embedded within an Euclidean space of . Utilizing the concepts of kd-trees and graph databases, we describe how to count all possible -tuples in binned configurations within a given length scale, e.g. all pairs of points or all triplets of points with side lengths . Through bench-marking we show the computational advantage of our new graph based algorithm over more traditional methods. We show that all 3-point configurations up to and beyond the Baryon Acoustic Oscillation scale (200 Mpc in physical units) can be performed on current SDSS data in reasonable time. Finally we present the first measurements of the 4-point correlation function of 0.5 million SDSS galaxies over the redshift range .
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Graph Database Solution for Higher Order Spatial Statistics in the Era of Big Data
Department of Astronomy, Yonsei University, 50 Yonsei-ro, Seoul 03722, Korea
Universitaets-Sternwarte, Fakultaet fur Physik, Ludwig-Maximilians Universitaet Muenchen, Scheinerstr. 1, D-81679 Muenchen, Germany
Max Planck Institute fur Extraterrestrial Physics, Giessenbachstr. 1, D-85748 Garching, Germany
Center for Advanced Computation, Korea Institute for Advanced Study, 85 Hoegi-ro, Dongdaemun-gu, Seoul 02455, Korea
School of Physics and Astronomy, Sun Yat-Sen University, Guangzhou 510297, People’s Republic of China
Abstract
We present an algorithm for the fast computation of the general -point spatial correlation functions of any discrete point set embedded within an Euclidean space of . Utilizing the concepts of kd-trees and graph databases, we describe how to count all possible -tuples in binned configurations within a given length scale, e.g. all pairs of points or all triplets of points with side lengths . Through bench-marking we show the computational advantage of our new graph based algorithm over more traditional methods. We show that all 3-point configurations up to and beyond the Baryon Acoustic Oscillation scale (200 Mpc in physical units) can be performed on current SDSS data in reasonable time. Finally we present the first measurements of the 4-point correlation function of 0.5 million SDSS galaxies over the redshift range .
cosmology: theory; methods: data analysis
††journal: ApJS
1 Introduction
Cosmology is now entering an era of big data. From large optical surveys like LSST111https://www.lsst.org, Euclid222https://www.euclid-ec.org and DESI333https://www.desi.lbl.gov to all sky 21cm radio surveys like SKA444https://www.skatelescope.org, Chime555https://chime-experiment.ca and Tianlai666http://tianlai.bao.ac.cn, where we will reach petabytes of data, it is clear that the future of cosmology will require fast and efficient algorithms for extracting scientifically meaningful information from the wealth of data collected.
A common statistical tool for compressing the spatial information of the galaxy distribution is the -point correlation functions (or its Fourier counterpart the poly-spectra). Even at 3rd order statistics we have seen their usefulness in constraining the statistical bias between the distribution of galaxies and dark matter (e.g., Fry & Gaztanaga, 1993; Jing & Börner, 1998; Frieman & Gaztañaga, 1999; Szapudi et al., 2000; Scoccimarro et al., 2001; Verde et al., 2002) as well as to place constraints on the amount of primordial non-Gaussianity in the cosmic microwave background (CMB; Komatsu et al., 2003; Planck Collaboration, et al., 2016).
The first configuration space galaxy 3-point correlation function (3pCF) measurements were made by Groth & Peebles (1977) and subsequently by Gott et al. (1991), where they probed the hierarchical clustering ansatz (Peebles, 1980) with galaxies. However, more recently the 3pCF has been measured using hundreds of thousands of galaxies and used to place constrains on the Halo Occupation Distribution (HOD), by breaking degeneracies between galaxy bias and the amplitude of density fluctuations (Guo et al., 2015), it has shown promise at discriminating different gravity models (Sabiu et al., 2016), helping to constrain non-Gaussianity in the galaxy distribution (Nishimichi et al., 2010) and characterizing the turbulence of the interstellar medium (Portillo et al., 2018).
Regarding large scale cosmology, Slepian et al. (2017a, b) made a detection of the Baryon Acoustic Oscillations (BAO) in the 3rd order spatial clustering statistics of the Sloan Digital Sky Survey (SDSS) galaxies. Although an earlier tentative detection of the 3pCF BAO was claimed by Gaztañaga et al. (2009) who used the SDSS DR7 sample of Luminous Red Galaxies.
Naively, the CFs require computing the distance from every galaxy to every other galaxy in an operation for the 2pCF or for the 3pCF. However, there are some algorithmic methods that can speed up this kind of computation.
Arranging the data in a hierarchical structure known as a ‘tree’, allows fast distance matching to be performed. Particularly k-d trees have been utilized for CFs in codes such as Ntropy (Moore et al., 2001; Gardner et al., 2007) and KSTAT (Sabiu, 2018; Sabiu et al., 2016).
Other novel methods have been developed to quickly measure the higher order statistics, including the position-dependent power spectrum (Chiang et al., 2014) and multipole expansions (Szapudi, 2004). More recently Slepian & Eisenstein (2015, 2016) have developed a method based on Fourier transforms and spherical harmonics that can be combined to form the multipole coefficients of the 3pCF.
In this work we will present a new algorithm for computing spatial correlations that is based on the concept of a graph database. A graph database is a type of NoSQL database, i.e. it does not rely on the data being described in a tabular, relational format. Rather, a graph database, or more specifically a graph-oriented database uses graph theory to store, map and query relationships of the data.
The algorithm that we propose makes no approximation or data compression and is designed to measure all triplet configurations up to large cosmic scales beyond the BAO distance.
The outline of this paper is as follows. In §2 we briefly review spatial correlation analysis. We detail the working of our new algorithm and explain the basic concepts behind graph databases in §3. In §4 we test our new algorithm, performing benchmark tests for speed and scalability. We also apply our algorithm to observational data in two example use cases; i) the large scale 3pCF ii) the four-point correlation function (4pCF). We then conclude in §5.
2 Correlation Functions
The spatial distribution of galaxies encodes a wealth of cosmological information. In an effort to condense the information of millions of 3D galaxy positions into a manageable form we rely on correlation functions.
The two-point correlation function (2pCF) is defined as
[TABLE]
where is the density contrast, related to the density as and denotes spatial averaging over .
In practice the 2pCF is calculated using estimators, the most popular of which being the “Landy-Szalay” estimator (Landy & Szalay, 1993);
[TABLE]
where is the number of data–data pairs, the number of data-random pairs, and is the number of random–random pairs, all separated by a displacement vector and properly normalised. The number of random particles used to define the unclustered reference sample is typically 20 times the number of data particles. This is done to reduce statistical fluctuation due to Poisson noise in the random pair counting. In the anisotropic analysis we decompose the vector into a length component and the angle between the pair vector and the line-of-sight direction.
2.1 Three-point correlation function
The 3pCF is defined as the joint probability of there being a galaxy in each of the volume elements , , and given that these elements are arranged in a configuration defined by the sides of the triangle, , , and . The joint probability can be written as
[TABLE]
The expression above consists of several parts: the sum of 2pCFs for each side of the triangle, , the full three-point correlation function, and the mean density of data points. We utilise the 3pCF estimator of Szapudi & Szalay (1998),
[TABLE]
where each term represents the normalised triplet counts in the data (D) and random (R) fields that satisfy a particular triangular configuration of our choice.
The 3pCF is a function of the three sides of the triangle and additionally it may also be computed in the anisotropic case, thus introducing two angles relative to the line of sight, and . The triangular configuration parameters can be see in Figure 1.
We could imagine, for example, performing an analysis in the anisotropic 5-parameter space () with corresponding (20, 20, 20, 10, 10) equally spaced bins. Considering only legal triangles and symmetries, the number of possible bins is 200,000. The large number of possible bins makes covariance estimation a serious issue for higher order statistics. Thankfully recent work has shown the possibility to reduce the number of bins using various compression schemes (Gualdi et al., 2018a, b; Child, Takada, Nishimichi, Sunayama, Slepian, Habib & Heitmann, 2018; Child, Slepian & Takada, 2018).
3 Algorithm Design
3.1 Graph Database
A graph database does not rely on the data being described in a tabular or relational format, as with more traditional database structures. Rather, a graph database uses graph theory to store, map and query relationships of the data.
Each data point is called a node and has a number of associated properties. On the right side of Figure 2 we see the main elements of the graph. In our work, the important properties of a node are 1) if the point is a galaxy or random 2) any weight associated with the data point e.g. FKP weights (Feldman, Kaiser & Peacock, 1994), angular systematic weights (Ross, et al., 2012), etc 3) the number of neighbour points within a fixed radius 4) if the data point is within a buffer region. The buffer region (4) is only required if the data has been decomposed into multiple domains which will be discussed later. The number of neighbours (3) is required solely to facilitate dynamical memory allocation.
The graph is constructed by visiting each data point and building a list of relationships to its neighbors within a distance, , see left panel of Figure 2. The list of neighbors can be obtained quickly using a kd-tree777We have used the open source solution called kdtree2 from Kennel (2004) https://github.com/jmhodges/kdtree2.
As we can see in the right panel of Figure 2, each node relationship contains the distance information and optionally the angle relative the line of sight direction, if anisotropic correlation analysis is required. It also contains the unique ID of the data point. In an effort to be memory efficient, we bin the distance and angular information immediately thus turning a single/double precision floating point number into a integer. Since the number of bins is typically of order 100, we can save it as a 8byte integer ( unique values), reducing the required memory significantly.
3.2 Querying the database: specific configurations
Unlike for more traditional databases, graph database queries work on relationships and properties thereof. As a simple example if we want to compute the 2-point correlation function we can see that it can be obtained immediately by counting all possible relationships where the distance property of the relationship matches our desired scale. This list of relationships can be queried further to find which ones correspond to the pairs of data (DD), the pairs of random points (RR) and the mixture of both (DR).
Specific -tuples can be computed by querying the relationships and relationships of relationships such that the distances satisfy and that the initial and final data point have the same unique ID, thus closing the tuple.
3.3 Querying the database: all configurations
In some circumstances we may wish to compute all binned configurations of an -tuple correlation function. As an example in the 3pCF, we would have to count all possible triplets of points with the sole criteria that . Thankfully this calculation can be performed rather efficiently by making use of a nice geometrical property of our database.
Between a node and one of its relationship data points there can be defined a region containing all possible triplets associated to the original pair with configuration . This is a simple geometrical property and can be more easily visualized in the left panel of Figure 3.
In the right panel of Figure 3 we can see illustratively how the triplets are counted. We firstly visit each Node, eg Node A and open its list of relationships. We then open the Node corresponding to the first relationship (neighboring point). This Node (e.g. Node B) also contains a list of relationships. If we match each of these lists on the unique ID of each data point, then we will have a new list containing all possible triplets, where two corners are Nodes A and B and the side lengths satisfy the constraint . The union of two ordered sets can be performed very quickly, thus we initially sort the neighbor lists by their unique ID. We can see why this is computationally efficient, because the length of the 3rd side of the triangle does not need to be calculated, since all distance between data pair have been already been measured. This is the main factor in this algorithm, all distances are precomputed and saved into memory.
Due to the precomputation of the distances and the specific design of the graph database, it becomes trivial to go beyond the 2- and 3pCF to arbitrary order in the -point statistics. This may allow us to estimate directly the covariance matrices of lower order statistics, since for example the cov
3.4 Domain Decomposition
The graph database can become memory intensive, e.g. for 10 million points with a number density, Mpc and analysis scale Mpc/, this could reach almost 1TB of memory. Thus if we want to query such a database quickly it would ideally be loaded into RAM memory. We could imagine constructing a large database (e.g. SQL, etc) that can be externally called or large HDF5 files that can be optimally constructed for fast random access calls. However, we decided to opt for a distributed memory scheme that would allow different parts of the graph database to reside on many compute nodes separately. Since no compute node can hold all of the database, we use a domain decomposition to spatially partition the data nodes over compute nodes. This can be seen in Figure 4. Each domain must also include a buffer region since nodes outside the domain may be required for the correlation function analysis. Therefore the buffer region extends a distance in each spatial direction. With this scheme we can be sure that all relevant data nodes are available in memory for the required domain correlation analysis.
3.5 Computational Structure
In Figure 5 we show the computational structure of the algorithm. The precomputed domains are read by compute nodes, i.e. one domain per compute node. This can be achieved using MPI as shown in the figure, or since each domain is independent, the code could be run many times with 1 MPI thread and the user then gathers the results individual for each of the domains. For now we will consider the MPI case.
Each MPI task loads the particle data into memory and proceeds to construct a kd-tree. Once the kd-tree has been constructed we spawn a number of OpenMP threads for parallel computation. The kd-tree is then queried for every point, and their neighbors within a distance having their distance computed and, optionally, the angle between the connecting vector and the line-of-sight direction. The distance and angle can be saved for each neighbor of each data point as an 8byte integer in a structure, as shown in the right panel of Figure 2. This is the construction of the graph database.
Once the graph database has been constructed, the code proceeds to query the graph and measure all the triplet configurations, as discussed above. Again we invoke a number of OpenMP threads to speed up the calculation on each compute node.
The MPI tasks are then reduced to one master node where the , , , counts are each combined together and normalized by the total possible number of triplet counts (including weights).
In this work we present results from our own custom graph database written in modern FORTRAN and implementing both OpenMP and MPI protocols. However we have also tested the popular open source graph database Neo4j888https://neo4j.com finding similar performance.
4 Testing and Benchmarking
For the purposes of benchmarking we focus on small scales and count all possible triangular configurations within Mpc. We use 4 random data samples with a varying number of data points, , and densities while keeping the volume constant at 1 Gpc3. We use samples with = 0.5, 1.0, 2.0, 4.0 million points. In Figure 6 we compare our new algorithm (black points) with the kd-tree algorithm, KSTAT (red points). The new algorithm is significantly faster than KSTAT and the scaling also shows a gentler slope with the number data points with an approximate scaling relation of .
The reason for the discrepancy between the two algorithms is due to the efficient information handling of the graph database. In the case of KSTAT, the algorithm must compute the anisotropic angles for every pair and triplet of data. However, the graph database only requires this calculation to be performed on pairs of data points and then to compute 3-point statistics it simply retrieves the relevant information from the database of relationships as described in §3.2.
4.1 Scalability
We investigate the performance of the algorithm with increasing parallelisation. Adopting a hybrid MPI + OpenMP scheme we perform a test using multiple Intel Xeon Phi 7250 nodes, each comprising 68 computational cores and 96 GB of RAM. Thus it is natural to set the number of OpenMP threads as 68 and create (and query) the graph database for each domain on separate compute nodes, as illustrated in Figure 5.
In the right panel of Figure 6 we show how the wall clock time scales with number of processors. We find reasonable scaling up to several thousand processors. The departure from the 1 to 1 scaling is due to imperfect load-balancing i.e. the distribution of work over the individual processors.
In galaxy clustering it is impossible to know a priori how much computational time will be required for different parts of the data, because of the complex survey geometry and the spatial clustering of the data itself. Thus if we want to avoid significant interprocessor communication we reply on an an initial domain decomposition scheme, which was described earlier in §3.4.
4.2 Example case I: 3-point BAO
We now investigate the performance of the algorithm on large scales. One of the goals of modern cosmology is to determine the expansion history of the Universe through distance measurements, which can be obtained by identifying the BAO ‘bump’ in the 2pCF (Eisenstein, et al., 2005; Anderson et al., 2012). It is also expected that this feature should be present in the higher order statistics.
We consider Data Release 12 of the Sloan Digital Sky Survey’s (SDSS; Eisenstein et al., 2011) Baryon Oscillation Spectrioscopic Survey Constant Stellar Mass (CMASS; Bolton et al., 2012; Dawson et al., 2013; Alam et al., 2015) sample999https://data.sdss.org/sas/dr12/boss/lss/. Specifically we use their publicly released observational catalogues for the Northern Galactic Patch (i.e. galaxy_DR12v5_CMASS_North) and 150 Quick-Particle-Mesh (QPM; White, Tinker & McBride, 2014) mock galaxy catalogues that mimic the observational geometry and selection effects. In the observational sample and in each mock catalogue there are approximately 500,000 galaxies and 2 million random points over the redshift range . In transforming from redshift space to comoving Cartesian coordinates we adopt a flat CDM cosmology model with and .
We proceed to measure the 3pCF beyond the BAO scale out to 200 Mpc. In the left panel of Figure 7 we show the 3pCF for the mean of the mock catalogues and for all possible binned triangular configurations with Mpc. This information is difficult to display in 2-dimensions so for clarity we can look at slices or projections through this space. In the right panel of Figure 7 we display the equilateral configurations for each mock individually (red lines), for the mean of mock catalogues (black dashed line) and for the observational sample (back circles). Although there is significant scatter among the mocks, the BAO ‘bump’ is clearly visible in the mean of the mock samples and also in the observational sample.
Each galaxy catalogue count took just 515 seconds on a 68 core Intel Xeon Phi processor. Although the full calculation of the counts took considerably longer at 4 hours on 27 Intel Xeon Phi nodes (1,836 computational cores).
4.3 Example case II: 4-point function
Following hot on the heels of Fry & Peebles (1978), we compute the 4-point correlation function from the mocks and observational catalogues presented in the previous section.
For simplicity we adopt the following estimator for the 4pCF,
[TABLE]
where DDDD and RRRR represent quadruplets of data points that satisfy the criteria for each side of the quadrilateral. Although in general the sides may be vectors and the 4 points may not occupy a single plane in 3D in which case the most general quadruplets have skewed quadrilateral configurations.
In this example we restrict ourselves to general equi-sided quadrilateral configurations and varying the side length from Mpc in 10 equally spaced bins. We proceed to measure the 4pCF in the observational catalogue and in each of the 150 mock galaxy catalogues. In Figure 8 we show the equilateral 3pCF and the box 4pCF up to 30 Mpc. It is clearly noticeable that the 4pCF of the mocks and the observational galaxies are in tension, while the 3pCF shows much better agreement. This is not too surprising given the approximate nature of the QPM method, which is already known to breakdown for 3rd order statistics on mildly non-linear scales (Kitaura, et al., 2016).
The computational time required for one catalogue was approximately 1 minute on 4 OMP threads, which included 1.8s to construct the graph, 9s to query the required triplet counts and a further 40s to obtain the quadruplets for the 4pCF.
5 Conclusions
We present a fast algorithm for the computation of all possible triplet configurations with of a discrete point set.
Through benchmarking we demonstrate that the algorithm scales well with increasing number of data points up to millions and can easily handle the current cosmological data sets.
We show reasonable parallel scalability through initial domain decomposition and load-balancing. Although there may be room for improvement with a dynamical load-balancing scheme.
The BAO at 3rd order is presented showing the visual BAO peak structure in both mocks and observational data. As it was not the primary aim of this work, we save interpretation of the BAO signal for future work.
We also show for the first time the 4pCF of SDSS galaxies. Again, as it was not our primary goal to make cosmological inferences, we will leave the interpretation of the 4pCF to forthcoming work.
We optimized the code to run on the Cray CS500 system Nurion at the Korea Institute of Science and Technology Information (KISTI) which comprises 570,020 compute cores. Running the code on a single catalogue of the SDSS BOSS DR12 sample containing 500,000 galaxies and 2 million random points we computed the full anisotropic 3pCF up to 200 Mpc. This calculation took 4 hours on 27 Intel Xeon Phi 7250 nodes (1,836 computational cores).
Finally, we present the publicly available code GRAMSCI (GRAph Made Statistics for Cosmological Information; https://bitbucket.org/csabiu/gramsci), under a GNU General Public License.
Acknowledgements
We thank Mijin Yoon for useful discussion and comments on the manuscript.
We use the Nurion supercomputing cluster at the Korea Institute of Science and Technology Information (KISTI).
CGS acknowledges financial support from the National Research Foundation (NRF; #2017R1D1A1B03034900)
The reference list from the paper itself. Each links out to its DOI / PubMed record.
- 1Alam et al. (2015) Alam, S., Albareti, F. D., Allende Prieto, C., et al. 2015, Ap JS, 219, 12
- 2Anderson et al. (2012) Anderson, L., Aubourg, E., Bailey, S., et al. 2012, MNRAS, 427, 3435
- 3Bolton et al. (2012) Bolton, A. S., Schlegel, D. J., Aubourg, É., et al. 2012, AJ, 144, 144
- 4Chiang et al. (2014) Chiang, C.-T., Wagner, C., Schmidt, F., & Komatsu, E. 2014, J. Cosmology Astropart. Phys, 5, 048
- 5Child, Slepian & Takada (2018) Child H. L., Slepian Z., Takada M., 2018, ar Xiv e-prints, ar Xiv:1811.12396
- 6Child, Takada, Nishimichi, Sunayama, Slepian, Habib & Heitmann (2018) Child H. L., Takada M., Nishimichi T., Sunayama T., Slepian Z., Habib S., Heitmann K., 2018, ar Xiv e-prints, ar Xiv:1806.11147
- 7Dawson et al. (2013) Dawson, K. S., Schlegel, D. J., Ahn, C. P., et al. 2013, AJ, 145, 10
- 8Eisenstein, et al. (2005) Eisenstein D. J., et al., 2005, Ap J, 633, 560
