Improved renormalization group computation of likelihood functions for cosmological data sets

Patrick McDonald

arXiv:1906.09127·astro-ph.CO·August 14, 2019

TL;DR

This paper refines a renormalization group method for computing likelihood functions of large cosmological data sets, reducing the prefactor on the method's linear scaling with data set size so that an accurate likelihood computation for a million-cell Gaussian test data set runs in about two minutes on a laptop.

Contribution

The paper introduces a refined renormalization group approach that integrates out differences between specific adjacent cells, significantly improving computational efficiency and accuracy for large-scale likelihood evaluations.

Findings

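01 An accurate likelihood computation for a one-million-cell Gaussian test data set takes about two minutes on a laptop.

02 Computation time scales linearly with data set size; the improvements here reduce the prefactor on that scaling.

03 There is room for further optimization and straightforward parallelization.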

Equations

$$L(\bm{\theta}|\mathbf{o})=\int d\bm{\phi}\,L(\bm{\theta},\bm{\phi}|\mathbf{o})=\int d\bm{\phi}\,L(\mathbf{o}|\bm{\theta},\bm{\phi})\,L(\bm{\phi}|\bm{\theta}),$$

$$L(\bm{\theta}|\mathbf{o})=\int d\bm{\phi}\,e^{-\frac{1}{2}\bm{\phi}^{t}\mathbf{P}^{-1}\bm{\phi}-\frac{1}{2}{\rm Tr}\ln(2\pi\mathbf{P})+\ln L_{\rm NG}(\bm{\phi}|\bm{\theta})+\ln L(\mathbf{o}|\bm{\theta},\bm{\phi})},$$

$$L_{\rm Gaussian}(\bm{\theta}|\mathbf{o})$$

$$I\equiv\int d\bm{\phi}\,e^{-S(\bm{\phi})}\equiv\int d\bm{\phi}\,e^{-\frac{1}{2}\bm{\phi}^{t}\mathbf{Q}^{-1}\bm{\phi}-\frac{1}{2}{\rm Tr}\ln(2\pi\mathbf{Q})-S_{I}(\bm{\phi})}.$$

$$S_{I}^{\prime}=\frac{1}{2}\frac{\partial S_{I}}{\partial\bm{\phi}^{t}}\mathbf{Q}^{\prime}\frac{\partial S_{I}}{\partial\bm{\phi}}-\frac{1}{2}{\rm Tr}\left[\mathbf{Q}^{\prime}\frac{\partial^{2}S_{I}}{\partial\bm{\phi}\,\partial\bm{\phi}^{t}}\right],$$

$$L(\bm{\theta}|\mathbf{o})=\int d\bm{\phi}\,\frac{e^{-\frac{1}{2}\bm{\phi}^{t}\mathbf{P}^{-1}\bm{\phi}-\frac{1}{2}(\mathbf{o}-\mathbf{R}\bm{\phi})^{t}\mathbf{N}^{-1}(\mathbf{o}-\mathbf{R}\bm{\phi})}}{\sqrt{\det(2\pi\mathbf{P})\det(2\pi\mathbf{N})}}.$$

$$\mathbf{Q}^{-1}(0)\equiv\mathbf{P}^{-1}+\mathbf{A}_{\star}.$$

$$S_{I}(0)$$

$$S_{I}(\lambda)\equiv\frac{1}{2}\bm{\delta}^{t}\mathbf{A}(\lambda)\bm{\delta}-\mathbf{b}^{t}(\lambda)\bm{\delta}+\mathcal{N}(\lambda).$$

$$\mathbf{A}(0)\equiv\mathbf{R}^{t}\mathbf{N}^{-1}\mathbf{R}-\mathbf{A}_{\star}$$

$$\mathbf{b}(0)\equiv\mathbf{R}^{t}\mathbf{N}^{-1}(\mathbf{o}-\mathbf{R}\bm{\phi}_{0})-\mathbf{P}^{-1}\bm{\phi}_{0}$$

$$\mathcal{N}(0)\equiv\frac{1}{2}\bm{\phi}_{0}^{t}\mathbf{P}^{-1}\bm{\phi}_{0}+\frac{1}{2}(\mathbf{o}-\mathbf{R}\bm{\phi}_{0})^{t}\mathbf{N}^{-1}(\mathbf{o}-\mathbf{R}\bm{\phi}_{0})+\frac{1}{2}{\rm Tr}\ln(2\pi\mathbf{N})+\frac{1}{2}{\rm Tr}\ln(\mathbf{I}+\mathbf{A}_{\star}\mathbf{P}).$$

$$\mathbf{A}^{\prime}=\mathbf{A}\mathbf{Q}^{\prime}\mathbf{A}$$

$$\mathbf{b}^{\prime}=\mathbf{A}\mathbf{Q}^{\prime}\mathbf{b}$$

$$\mathcal{N}^{\prime}=\frac{1}{2}\mathbf{b}^{t}\mathbf{Q}^{\prime}\mathbf{b}-\frac{1}{2}{\rm Tr}\left[\mathbf{A}\mathbf{Q}^{\prime}\right].$$

$$L(\bm{\theta}|\mathbf{o})=e^{\frac{1}{2}\mathbf{b}^{t}(\mathbf{Q}^{-1}+\mathbf{A})^{-1}\mathbf{b}-\mathcal{N}-\frac{1}{2}{\rm Tr}\ln(\mathbf{I}+\mathbf{A}\mathbf{Q})}.$$

$$\mathbf{Q}^{-1}(\lambda)=\mathbf{Q}^{-1}(0)+\mathbf{K}(\lambda)$$

$$\mathbf{Q}(\lambda)=\mathbf{Q}(0)\mathbf{W}(\lambda)$$

$$\mathbf{Q}^{-1}(\alpha)\equiv\mathbf{P}^{-1}+\alpha\mathbf{K},$$

$$\mathbf{K}=\left[\begin{array}{ccc}\mathbf{k}&0&\cdots\\ 0&\mathbf{k}&\cdots\\ \vdots&\vdots&\ddots\end{array}\right],$$

$$\mathbf{k}=\left[\begin{array}{rr}1&-1\\ -1&1\end{array}\right].$$

$$\mathbf{Q}^{\prime}=-\mathbf{Q}\mathbf{K}\mathbf{Q}=-(\mathbf{P}^{-1}+\alpha\mathbf{K})^{-1}\mathbf{K}(\mathbf{P}^{-1}+\alpha\mathbf{K})^{-1}.$$

Full text

Improved renormalization group computation of likelihood functions for cosmological data sets

Patrick McDonald

Lawrence Berkeley National Laboratory, One Cyclotron Road, Berkeley, CA 94720, USA

Abstract

Evaluation of likelihood functions for cosmological large scale structure data sets (including CMB, galaxy redshift surveys, etc.) naturally involves marginalization, i.e., integration, over an unknown underlying random signal field. Recently, I showed how a renormalization group method can be used to carry out this integration efficiently by first integrating out the smallest scale structure, i.e., localized structure on the scale of differences between nearby data cells, then combining adjacent cells in a coarse graining step, then repeating this process over and over until all scales have been integrated. Here I extend the formulation in several ways in order to reduce the prefactor on the method’s linear scaling with data set size. The key improvement is showing how to integrate out the difference between specific adjacent cells before summing them in the coarse graining step, compared to the original formulation in which small-scale fluctuations were integrated more generally. I suggest some other improvements in details of the scheme, including showing how to perform the integration around a maximum likelihood estimate for the underlying random field. In the end, an accurate likelihood computation for a million-cell Gaussian test data set runs in two minutes on my laptop, with room for further optimization and straightforward parallelization.

I Introduction

McDonald (2019) presented a new method to evaluate large scale structure likelihood functions, inspired by renormalization group (RG) ideas from quantum field theory (e.g., Wilson and Kogut, 1974; Banks, 2008). This paper is a followup to that one, so some of the pedagogical discussion and derivations there will not be repeated here. To recap the basics: the fact that structure in the Universe starts as an almost perfectly Gaussian random field and evolves in a computable way on the largest scales (e.g., Peebles, 1993; McDonald and Roy, 2009; McDonald and Vlah, 2018) suggests a statistically rigorous first-principles likelihood analysis can be used to extract information on cosmological models from observational data sets (e.g., Bond et al., 1998; Wandelt et al., 2004; Kitaura and Enßlin, 2008; Enßlin et al., 2009; Enßlin, 2019). Generally, we have a data vector $\mathbf{o}$ , some relatively small number of global cosmological parameters we want to measure, $\bm{\theta}$ , and a random field we’d like to marginalize over, $\bm{\phi}$ . ( $\bm{\phi}$ could be a variety of different things, depending on the data set and theoretical setup, e.g., the underlying true temperature field for CMB, the linear regime density and/or potential fields for a galaxy redshift survey modeled by traditional perturbation theory, the evolving displacement field in the functional integral formulation of McDonald and Vlah (2018), etc.) Starting with Bayes’ rule $L(\bm{\theta},\bm{\phi}|\mathbf{o})L(\mathbf{o})=L(\mathbf{o}|\bm{\theta},\bm{\phi})L(\bm{\phi},\bm{\theta})=L(\mathbf{o}|\bm{\theta},\bm{\phi})L(\bm{\phi}|\bm{\theta})L(\bm{\theta})$ we obtain

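$$L(\bm{\theta}|\mathbf{o})=\int d\bm{\phi}\,L(\bm{\theta},\bm{\phi}|\mathbf{o})=\int d\bm{\phi}\,L(\mathbf{o}|\bm{\theta},\bm{\phi})\,L(\bm{\phi}|\bm{\theta}),\tag{1}$$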

where I have dropped $L(\mathbf{o})$ which has no parameter dependence and the prior $L(\bm{\theta})$ which plays no role in this discussion because it can be pulled out of the integral. I have highlighted the usual cosmological form where some of the cosmological parameters determine a prior on the signal field, $L(\bm{\phi}|\bm{\theta})$ , and then there is some likelihood for the observable given $\bm{\theta}$ and $\bm{\phi}$ , $L(\mathbf{o}|\bm{\theta},\bm{\phi})$ . It is this $\bm{\phi}$ integral that we need to carry out. Generally, we can take at least part of $L(\bm{\phi}|\bm{\theta})$ , $L_{G}(\bm{\phi}|\bm{\theta})$ , to be Gaussian, defined by its covariance, $\mathbf{P}(\bm{\theta})$ . In this case we have

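$$L(\bm{\theta}|\mathbf{o})=\int d\bm{\phi}\,e^{-\frac{1}{2}\bm{\phi}^{t}\mathbf{P}^{-1}\bm{\phi}-\frac{1}{2}{\rm Tr}\ln(2\pi\mathbf{P})+\ln L_{\rm NG}(\bm{\phi}|\bm{\theta})+\ln L(\mathbf{o}|\bm{\theta},\bm{\phi})},\tag{2}$$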

where I have used $\ln\det(\mathbf{P})={\mathrm{Tr}}\ln(\mathbf{P})$ and defined $\ln L_{\rm NG}(\bm{\phi}|\bm{\theta})\equiv\ln L(\bm{\phi}|\bm{\theta})-\ln L_{G}(\bm{\phi}|\bm{\theta})$ . (Even for what we call non-Gaussian initial conditions (e.g., McDonald, 2008; Bartolo et al., 2010; Giannantonio and Porciani, 2010; Gong and Yokoyama, 2011; Alvarez et al., 2014; Moradinezhad Dizgah and Dvorkin, 2018), the observable can often if not always be written as a function of an underlying Gaussian random field, i.e., no $L_{\rm NG}$ needed, and in other scenarios like McDonald and Vlah (2018) where the natural $\bm{\phi}$ is not Gaussian, there is still a natural Gaussian piece.) Less generally but still often usefully (e.g., for primary CMB and large scale galaxy clustering ignoring primordial non-Gaussianity) we can take $\ln L_{\rm NG}=0$ and $L(\mathbf{o}|\bm{\theta},\bm{\phi})$ to be Gaussian by assuming $\mathbf{o}$ is linearly related to $\bm{\phi}$ , i.e., $\mathbf{o}=\bm{\mu}+\mathbf{R}\bm{\phi}+\bm{\epsilon}$ where $\bm{\mu}$ is the mean vector, $\mathbf{R}$ is a linear response matrix, and $\bm{\epsilon}$ is Gaussian observational noise with covariance matrix $\mathbf{N}$ . Then we have

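$$\begin{aligned}L_{\rm Gaussian}(\bm{\theta}|\mathbf{o})&=\int d\bm{\phi}\,e^{-\frac{1}{2}\bm{\phi}^{t}\mathbf{P}^{-1}\bm{\phi}-\frac{1}{2}{\rm Tr}\ln(2\pi\mathbf{P})-\frac{1}{2}(\mathbf{o}-\bm{\mu}-\mathbf{R}\bm{\phi})^{t}\mathbf{N}^{-1}(\mathbf{o}-\bm{\mu}-\mathbf{R}\bm{\phi})-\frac{1}{2}{\rm Tr}\ln(2\pi\mathbf{N})}\\&=e^{-\frac{1}{2}(\mathbf{o}-\bm{\mu})^{t}\mathbf{C}^{-1}(\mathbf{o}-\bm{\mu})-\frac{1}{2}{\rm Tr}\ln(2\pi\mathbf{C})},\end{aligned}\tag{3}$$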

where in the last line the integration has been carried out analytically, with $\mathbf{C}\equiv\mathbf{N}+\mathbf{R}\mathbf{P}\mathbf{R}^{t}$ . Even this analytic integration does not really solve the Gaussian problem, however, as the time to calculate $\mathbf{C}^{-1}$ and $\det(\mathbf{C})$ (or its derivatives) by brute force numerical linear algebra routines scales like $N^{3}$ , where $N$ is the size of the data set, which becomes prohibitively slow for large data sets. The RG approach of McDonald (2019) addresses the Gaussian scenario by doing the $\bm{\phi}$ integral in a different way that produces the result directly as a number instead of these matrix expressions, and can also be applied to non-Gaussian scenarios. Note that, as discussed in McDonald (2019), the approach can also be used to directly compute derivatives of $\ln L(\bm{\theta}|\mathbf{o})$ with respect to $\bm{\theta}$ , not just the value at one choice of $\bm{\theta}$ , by passing the derivative inside the $\bm{\phi}$ integral to produce a new integral. Traditional power spectrum estimation can be done by taking $\bm{\theta}$ to parameterize $\mathbf{P}(\bm{\theta})$ by amplitudes in $k$ bands.
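For orientation, the brute-force evaluation that the RG method is designed to avoid can be sketched in a few lines. This is only an illustrative toy in pure Python, not the paper's implementation: the function names `cholesky` and `gaussian_lnL` are hypothetical, and the Cholesky factorization is the step whose cost grows like $N^{3}$.

```python
import math

def cholesky(C):
    """Lower-triangular L with L L^t = C, for symmetric positive-definite C."""
    n = len(C)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = C[i][j] - sum(L[i][k] * L[j][k] for k in range(j))
            L[i][j] = math.sqrt(s) if i == j else s / L[j][j]
    return L

def gaussian_lnL(o, mu, R, P, N):
    """ln L = -(o-mu)^t C^{-1} (o-mu)/2 - Tr ln(2 pi C)/2, with C = N + R P R^t."""
    n, m = len(o), len(P)
    RP = [[sum(R[i][k] * P[k][j] for k in range(m)) for j in range(m)]
          for i in range(n)]
    C = [[N[i][j] + sum(RP[i][k] * R[j][k] for k in range(m)) for j in range(n)]
         for i in range(n)]
    L = cholesky(C)  # the O(n^3) step
    # Forward substitution: solve L y = o - mu, so that chi^2 = y . y
    y = []
    for i in range(n):
        y.append((o[i] - mu[i] - sum(L[i][k] * y[k] for k in range(i))) / L[i][i])
    chi2 = sum(v * v for v in y)
    ln_det_2piC = n * math.log(2.0 * math.pi) + 2.0 * sum(math.log(L[i][i]) for i in range(n))
    return -0.5 * chi2 - 0.5 * ln_det_2piC

# Two-cell toy example with made-up numbers.
identity = [[1.0, 0.0], [0.0, 1.0]]
lnL = gaussian_lnL([1.0, 0.0], [0.0, 0.0], identity,
                   [[1.0, 0.5], [0.5, 1.0]], identity)
```

Doubling the number of cells multiplies the cost of the factorization by roughly eight, which is why this route becomes impractical for large data sets.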

Although fairly fast methods to evaluate at least the Gaussian likelihood [Eq. (3)] have existed for a long time (e.g., Pen, 2003; Padmanabhan et al., 2003; Smith et al., 2007; Seljak et al., 2017; Font-Ribera et al., 2018), more often in practice data analysts compute summary statistics not explicitly based on likelihood functions (e.g., Aghanim et al., 2016; Beutler et al., 2017), calibrating their parameter dependence and covariance by computing the same statistics on mock data sets. It is not entirely clear why existing likelihood-based methods are not used more often, and in McDonald (2019) I was cautious about advocating immediate implementation of the RG approach. One question was whether the prefactor on the linear scaling of computation time with data set size for this method might be so large as to make it significantly slower than others. This paper demonstrates that this is not a significant obstacle. At two minutes to accurately compute the likelihood function for a million-cell Gaussian test data set, the method is as fast as any that takes more than a few well-preconditioned conjugate gradient maximum likelihood solutions for the same data set (i.e., as fast as any method I know of, barring the possibility that my Julia implementation of conjugate gradient maximum likelihood is unfairly slow). The only reason not to implement this is if you believe the whole idea of likelihood-based analysis is a distraction. That would not necessarily be an entirely unreasonable position. E.g., if you believe that there is a lot of reliable cosmological constraining power to be gained from the deeply non-linear regime, heuristic summary statistics/“machine learning,” combined with exhaustive mocks/simulations is probably the only way to extract it. To me, however, the likelihood+RG approach proposed here seems like an appealing path to large scale analysis, especially for incorporating weakly nonlinear information (e.g., without the need to explicitly estimate a bispectrum and its covariance).

This paper lays out a series of essentially technical improvements to the basic approach presented in McDonald (2019). See that paper for a derivation of the general RG equation and some more pedagogical discussion. Some of those basics are explained in less detail here, since they can be read there.

II Revised formulation

II.1 Master RG equation

Consider the general functional integral over some field $\bm{\phi}$ ,

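$$I\equiv\int d\bm{\phi}\,e^{-S(\bm{\phi})}\equiv\int d\bm{\phi}\,e^{-\frac{1}{2}\bm{\phi}^{t}\mathbf{Q}^{-1}\bm{\phi}-\frac{1}{2}{\rm Tr}\ln(2\pi\mathbf{Q})-S_{I}(\bm{\phi})}.\tag{4}$$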

The connection to our cosmological likelihood functions, Eq. (2), is obvious, but not necessary for this subsection. Suppose that $\mathbf{Q}\rightarrow 0$ , i.e., $\mathbf{Q}^{-1}$ goes to infinity (all its eigenvalues). In that limit the $\mathbf{Q}$ part of $I$ becomes a representation of the delta function and it is clear that $I(\mathbf{Q}\rightarrow 0)\rightarrow\exp[-S_{I}(0)]$ , i.e., the integral can be done trivially. Generally, however, $\mathbf{Q}$ is not sufficiently small so if we want to do the integral this way we need to change $\mathbf{Q}$ to take it to zero. But we can’t simply change $\mathbf{Q}$ because that will change the value of $I$ , the integral we are trying to perform. If we want to change $\mathbf{Q}$ while preserving $I$ we need to simultaneously change $S_{I}$ . The renormalization group equation tells us how to do this. Guided by, e.g., Banks (2008), McDonald (2019) showed that we can preserve the value of $I$ if the following differential equation is satisfied:

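$$S_{I}^{\prime}=\frac{1}{2}\frac{\partial S_{I}}{\partial\bm{\phi}^{t}}\mathbf{Q}^{\prime}\frac{\partial S_{I}}{\partial\bm{\phi}}-\frac{1}{2}{\rm Tr}\left[\mathbf{Q}^{\prime}\frac{\partial^{2}S_{I}}{\partial\bm{\phi}\,\partial\bm{\phi}^{t}}\right],\tag{5}$$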

where we parameterize the evolution by $\lambda$ , i.e., $\mathbf{Q}=\mathbf{Q}(\lambda)$ , $S_{I}=S_{I}(\lambda)$ , and the prime means derivative with respect to $\lambda$ , where $\mathbf{Q}(\lambda=0)$ and $S_{I}(\lambda=0)$ represent the original elements of the integral. (Note that, relative to Eq. (7) of McDonald (2019), I have moved the normalization constant $\mathcal{N}$ into $S_{I}$ , after extracting ${\rm Tr}\ln\mathbf{Q}$ from it to keep the integral unit normalized when $S_{I}=0$ .) This formula is pure math, i.e., it assumes essentially nothing about $\mathbf{Q}$ , $\mathbf{Q}^{\prime}$ , and $S_{I}(\bm{\phi})$ . Typically $\lambda$ will represent a length scale, where structure in $\mathbf{Q}$ has already been erased on smaller scales, and $\mathbf{Q}^{\prime}$ is doing the job of erasing it on scale $\lambda$ , but Eq. (5) applies to any infinitesimal change in $\mathbf{Q}$ .

II.2 Application to Gaussian cosmological data

As in McDonald (2019), I will demonstrate the calculation for a purely Gaussian example, i.e., $S_{I}(\bm{\phi})$ at most quadratic in $\bm{\phi}$. This is a special case only; Eq. (5) applies for any $S_{I}(\bm{\phi})$. The likelihood function will be Eq. (3), except for simplicity I will set $\bm{\mu}=0$, i.e., I take

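$$L(\bm{\theta}|\mathbf{o})=\int d\bm{\phi}\,\frac{e^{-\frac{1}{2}\bm{\phi}^{t}\mathbf{P}^{-1}\bm{\phi}-\frac{1}{2}(\mathbf{o}-\mathbf{R}\bm{\phi})^{t}\mathbf{N}^{-1}(\mathbf{o}-\mathbf{R}\bm{\phi})}}{\sqrt{\det(2\pi\mathbf{P})\det(2\pi\mathbf{N})}}.\tag{6}$$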

For the RG method to be efficient, the linear response matrix $\mathbf{R}$ and the observational noise $\mathbf{N}$ cannot be completely arbitrary. Ideally $\mathbf{R}$ should be fairly short range, e.g., a CMB telescope beam convolution or redshift space cells in which we have counted galaxies. Similarly, $\mathbf{N}$ should be short-range, e.g., diagonal for uncorrelated noise. The general approach can be adapted for special kinds of deviations from short range $\mathbf{R}$ or $\mathbf{N}$, but I will assume they are short range here. I generally assume the problem can be formulated to make $\mathbf{P}$ translation invariant (i.e., diagonal in Fourier space), although slow evolution in statistics can easily be accommodated. It is potentially useful to change integration variables to $\bm{\delta}\equiv\bm{\phi}-\bm{\phi}_{0}$, where $\bm{\phi}_{0}$ is some constant field specified by hand. We plan to make $\bm{\phi}_{0}$ the maximum likelihood field, but do not need to assume that. Substituting this into Eq. (6) and comparing to Eq. (4), understanding that $\bm{\phi}$ in Eq. (4) is a dummy variable so we can just as well replace it with $\bm{\delta}$, we see that the general integral $I$ in Eq. (4) is equivalent to the Gaussian cosmological $L(\bm{\theta}|\mathbf{o})$ if we define

$$\mathbf{Q}^{-1}(0)\equiv\mathbf{P}^{-1}+\mathbf{A}_{\star}\tag{7}$$

and

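$$\begin{aligned}S_{I}(0)\equiv{}&\frac{1}{2}\bm{\delta}^{t}\left(\mathbf{R}^{t}\mathbf{N}^{-1}\mathbf{R}-\mathbf{A}_{\star}\right)\bm{\delta}-\left[\mathbf{R}^{t}\mathbf{N}^{-1}(\mathbf{o}-\mathbf{R}\bm{\phi}_{0})-\mathbf{P}^{-1}\bm{\phi}_{0}\right]^{t}\bm{\delta}\\&+\frac{1}{2}\bm{\phi}_{0}^{t}\mathbf{P}^{-1}\bm{\phi}_{0}+\frac{1}{2}(\mathbf{o}-\mathbf{R}\bm{\phi}_{0})^{t}\mathbf{N}^{-1}(\mathbf{o}-\mathbf{R}\bm{\phi}_{0})+\frac{1}{2}{\rm Tr}\ln(2\pi\mathbf{N})+\frac{1}{2}{\rm Tr}\ln(\mathbf{I}+\mathbf{A}_{\star}\mathbf{P}).\end{aligned}\tag{8}$$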

The reason for subtracting $\frac{1}{2}\bm{\delta}^{t}\mathbf{A}_{\star}\bm{\delta}$ from $S_{I}(0)$ and adding it to the $\mathbf{Q}^{-1}$ term (adding zero overall, with $\mathbf{A}_{\star}$ an as yet unspecified matrix) will become clear below.
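As a sketch of the substitution step, using only the definitions above (and the symmetry of $\mathbf{P}$, $\mathbf{N}$, and $\mathbf{A}_{\star}$), the prior term and the normalization split as follows:

$$\begin{aligned}\frac{1}{2}\bm{\phi}^{t}\mathbf{P}^{-1}\bm{\phi}&=\frac{1}{2}\bm{\delta}^{t}\mathbf{Q}^{-1}(0)\bm{\delta}-\frac{1}{2}\bm{\delta}^{t}\mathbf{A}_{\star}\bm{\delta}+\bm{\delta}^{t}\mathbf{P}^{-1}\bm{\phi}_{0}+\frac{1}{2}\bm{\phi}_{0}^{t}\mathbf{P}^{-1}\bm{\phi}_{0},\\\frac{1}{2}{\rm Tr}\ln(2\pi\mathbf{P})&=\frac{1}{2}{\rm Tr}\ln\left[2\pi\mathbf{Q}(0)\right]+\frac{1}{2}{\rm Tr}\ln\left[\mathbf{Q}^{-1}(0)\mathbf{P}\right]=\frac{1}{2}{\rm Tr}\ln\left[2\pi\mathbf{Q}(0)\right]+\frac{1}{2}{\rm Tr}\ln(\mathbf{I}+\mathbf{A}_{\star}\mathbf{P}).\end{aligned}$$

The first terms on the right are the pieces that Eq. (4) assigns to $\mathbf{Q}$; everything else, together with the expanded noise term, is collected into $S_{I}(0)$. This is where the ${\rm Tr}\ln(\mathbf{I}+\mathbf{A}_{\star}\mathbf{P})$ contribution to the normalization comes from.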

As in McDonald (2019), the evolving Gaussian $S_{I}(\lambda)$ is represented numerically by the evolving coefficients $\mathbf{A}(\lambda)$ , $\mathbf{b}(\lambda)$ , and $\mathcal{N}(\lambda)$ of the general form

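$$S_{I}(\lambda)\equiv\frac{1}{2}\bm{\delta}^{t}\mathbf{A}(\lambda)\bm{\delta}-\mathbf{b}^{t}(\lambda)\bm{\delta}+\mathcal{N}(\lambda).\tag{9}$$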

Comparison to Eq. (8) for $S_{I}(0)$ sets the initial conditions for $\mathbf{A}$, $\mathbf{b}$, and $\mathcal{N}$:

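$$\mathbf{A}(0)\equiv\mathbf{R}^{t}\mathbf{N}^{-1}\mathbf{R}-\mathbf{A}_{\star},\tag{10}$$

$$\mathbf{b}(0)\equiv\mathbf{R}^{t}\mathbf{N}^{-1}(\mathbf{o}-\mathbf{R}\bm{\phi}_{0})-\mathbf{P}^{-1}\bm{\phi}_{0},\tag{11}$$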

and

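$$\mathcal{N}(0)\equiv\frac{1}{2}\bm{\phi}_{0}^{t}\mathbf{P}^{-1}\bm{\phi}_{0}+\frac{1}{2}(\mathbf{o}-\mathbf{R}\bm{\phi}_{0})^{t}\mathbf{N}^{-1}(\mathbf{o}-\mathbf{R}\bm{\phi}_{0})+\frac{1}{2}{\rm Tr}\ln(2\pi\mathbf{N})+\frac{1}{2}{\rm Tr}\ln(\mathbf{I}+\mathbf{A}_{\star}\mathbf{P}).\tag{12}$$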

Plugging Eq. (9) into Eq. (5) we find the flow equations for $\mathbf{A}$ , $\mathbf{b}$ , and $\mathcal{N}$ :

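$$\mathbf{A}^{\prime}=\mathbf{A}\mathbf{Q}^{\prime}\mathbf{A},\tag{13}$$

$$\mathbf{b}^{\prime}=\mathbf{A}\mathbf{Q}^{\prime}\mathbf{b},\tag{14}$$

$$\mathcal{N}^{\prime}=\frac{1}{2}\mathbf{b}^{t}\mathbf{Q}^{\prime}\mathbf{b}-\frac{1}{2}{\rm Tr}\left[\mathbf{A}\mathbf{Q}^{\prime}\right].\tag{15}$$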

Note that if $\phi_{0}$ is the maximum likelihood field (for given values of $\mathbf{P}$ , $\mathbf{R}$ , etc.), $\mathbf{b}=\mathbf{b}(0)=0$ . If the problem happened to be statistically homogeneous (translation invariant), we could set $\mathbf{A}_{\star}=\mathbf{R}^{t}\mathbf{N}^{-1}\mathbf{R}$ to make $\mathbf{A}=\mathbf{A}(0)=0$ . In that case there would be no evolution— $\mathcal{N}(0)$ would simply be the answer. This is the point of $\mathbf{A}_{\star}$ , i.e., if we choose it to be as close as possible to $\mathbf{R}^{t}\mathbf{N}^{-1}\mathbf{R}$ , we can reduce the RG evolution to be a minimal correction due to statistical inhomogeneities. The limitation, i.e., why $\mathbf{A}_{\star}$ generally can only approximate $\mathbf{R}^{t}\mathbf{N}^{-1}\mathbf{R}$ , is that $\mathbf{A}_{\star}$ must maintain the symmetries necessary to allow us to efficiently evaluate ${\rm Tr}\ln\left(\mathbf{I}+\mathbf{A}_{\star}\mathbf{P}\right)$ in Eq. (12), e.g., in Fourier space, to set the initial value of $\mathcal{N}$ .
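For readers who want the intermediate step behind the flow equations, here is a short worked version. It assumes only that $\mathbf{A}$ and $\mathbf{Q}^{\prime}$ are symmetric. For the quadratic form of Eq. (9), the derivatives are $\partial S_{I}/\partial\bm{\delta}=\mathbf{A}\bm{\delta}-\mathbf{b}$ and $\partial^{2}S_{I}/\partial\bm{\delta}\,\partial\bm{\delta}^{t}=\mathbf{A}$, so Eq. (5) gives

$$\begin{aligned}S_{I}^{\prime}&=\frac{1}{2}(\mathbf{A}\bm{\delta}-\mathbf{b})^{t}\mathbf{Q}^{\prime}(\mathbf{A}\bm{\delta}-\mathbf{b})-\frac{1}{2}{\rm Tr}\left[\mathbf{Q}^{\prime}\mathbf{A}\right]\\&=\frac{1}{2}\bm{\delta}^{t}\mathbf{A}\mathbf{Q}^{\prime}\mathbf{A}\bm{\delta}-\left(\mathbf{A}\mathbf{Q}^{\prime}\mathbf{b}\right)^{t}\bm{\delta}+\frac{1}{2}\mathbf{b}^{t}\mathbf{Q}^{\prime}\mathbf{b}-\frac{1}{2}{\rm Tr}\left[\mathbf{A}\mathbf{Q}^{\prime}\right].\end{aligned}$$

Matching powers of $\bm{\delta}$ against $S_{I}^{\prime}=\frac{1}{2}\bm{\delta}^{t}\mathbf{A}^{\prime}\bm{\delta}-\mathbf{b}^{\prime t}\bm{\delta}+\mathcal{N}^{\prime}$ reproduces the three flow equations term by term.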

In terms of these definitions, the result of formal analytic integration is

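$$L(\bm{\theta}|\mathbf{o})=e^{\frac{1}{2}\mathbf{b}^{t}(\mathbf{Q}^{-1}+\mathbf{A})^{-1}\mathbf{b}-\mathcal{N}-\frac{1}{2}{\rm Tr}\ln(\mathbf{I}+\mathbf{A}\mathbf{Q})}.\tag{16}$$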

We can use this formula once the components have been coarse-grained sufficiently to allow brute force linear algebra. To be clear: if we plug $\mathbf{A}(0)$, $\mathbf{b}(0)$, $\mathcal{N}(0)$, and $\mathbf{Q}(0)$ into this equation, it becomes precisely the analytic integration result in Eq. (3) (with $\bm{\mu}=0$). The difference is that as these quantities evolve and are coarse grained their dimensions become smaller, with the result of the small-scale integration that has been performed stored in the simple number $\mathcal{N}$. See McDonald (2019) for more discussion.
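The statement that the initial conditions reproduce the brute-force Gaussian result can be checked numerically on a toy problem. The sketch below is not from the paper: it assumes two cells, $\mathbf{R}=\mathbf{I}$, $\mathbf{A}_{\star}$ proportional to the identity, and made-up numbers, and all function names are hypothetical.

```python
import math

def det2(X):
    return X[0][0] * X[1][1] - X[0][1] * X[1][0]

def inv2(X):
    d = det2(X)
    return [[X[1][1] / d, -X[0][1] / d], [-X[1][0] / d, X[0][0] / d]]

def matmul(X, Y):
    return [[sum(X[i][k] * Y[k][j] for k in range(2)) for j in range(2)] for i in range(2)]

def matvec(X, v):
    return [X[i][0] * v[0] + X[i][1] * v[1] for i in range(2)]

def dot(u, v):
    return u[0] * v[0] + u[1] * v[1]

def add(X, Y, s=1.0):
    """Return X + s * Y."""
    return [[X[i][j] + s * Y[i][j] for j in range(2)] for i in range(2)]

I2 = [[1.0, 0.0], [0.0, 1.0]]

def lnL_brute_force(o, P, N):
    """Analytic Gaussian result with mu = 0 and R = I, so C = N + P."""
    C = add(N, P)
    return -0.5 * dot(o, matvec(inv2(C), o)) - 0.5 * math.log((2.0 * math.pi) ** 2 * det2(C))

def lnL_from_initial_conditions(o, P, N, phi0, a_star):
    """Final analytic formula evaluated on Q(0), A(0), b(0), and the initial normalization."""
    Pinv, Ninv = inv2(P), inv2(N)
    A_star = [[a_star, 0.0], [0.0, a_star]]
    Q = inv2(add(Pinv, A_star))            # Q^{-1}(0) = P^{-1} + A_star
    A = add(Ninv, A_star, -1.0)            # A(0) = N^{-1} - A_star  (R = I)
    r = [o[0] - phi0[0], o[1] - phi0[1]]   # residual o - R phi0
    Nr, Pp = matvec(Ninv, r), matvec(Pinv, phi0)
    b = [Nr[0] - Pp[0], Nr[1] - Pp[1]]     # b(0)
    norm = (0.5 * dot(phi0, Pp) + 0.5 * dot(r, Nr)
            + 0.5 * math.log((2.0 * math.pi) ** 2 * det2(N))
            + 0.5 * math.log(det2(add(I2, matmul(A_star, P)))))
    G = inv2(add(inv2(Q), A))              # (Q^{-1} + A)^{-1}
    return (0.5 * dot(b, matvec(G, b)) - norm
            - 0.5 * math.log(det2(add(I2, matmul(A, Q)))))

o = [0.3, -1.2]
P = [[1.0, 0.5], [0.5, 1.0]]
N = [[1.0, 0.0], [0.0, 4.0]]
direct = lnL_brute_force(o, P, N)
shifted = lnL_from_initial_conditions(o, P, N, phi0=[0.1, 0.2], a_star=0.47)
```

The two numbers should agree to rounding error for any choice of `phi0` and `a_star`, which illustrates that both are purely numerical conveniences and do not change the answer.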

II.3 Integrating out the difference between adjacent cells

In McDonald (2019) I used

$$\mathbf{Q}^{-1}(\lambda)=\mathbf{Q}^{-1}(0)+\mathbf{K}(\lambda),\tag{17}$$

where $\mathbf{K}(\lambda\rightarrow\infty)\rightarrow\infty$ to suppress fluctuations. I mentioned the potentially cleaner possibility

$$\mathbf{Q}(\lambda)=\mathbf{Q}(0)\mathbf{W}(\lambda),\tag{18}$$

where $\mathbf{W}(\lambda\rightarrow\infty)\rightarrow 0$ , e.g., $W(k,\lambda)\equiv e^{-k^{2}\lambda^{2}}$ . Either of these was envisioned to suppress fluctuations in a smooth, homogeneous way (i.e., with no explicit connection to the data cell structure), starting from small scales to large. Once fluctuations were sufficiently suppressed on the scale of data cells, adjacent cells were combined, i.e., adjacent elements in $\mathbf{b}$ and the corresponding $2\times 2$ block in $\mathbf{A}$ were summed. This worked well enough, but the number of elements that I needed to store in $\mathbf{A}$ , which determines the speed of computation, seemed surprisingly large.

Here I introduce a new possibility, to more explicitly integrate out the fluctuations between pairs of cells that we are going to combine (see Appendix A for an alternative version of this idea). Given covariance matrix $\mathbf{Q}^{1}$ for some vector, we know that the covariance for a new vector where each adjacent pair of elements is replaced by one element with its average, $\mathbf{Q}^{2c}$ , is simply given by the average of the appropriate $2\times 2$ blocks of $\mathbf{Q}^{1}$ , e.g., $Q^{2c}_{11}=\frac{1}{4}(Q^{1}_{11}+Q^{1}_{12}+Q^{1}_{21}+Q^{1}_{22})$ , $Q^{2c}_{12}=\frac{1}{4}(Q^{1}_{13}+Q^{1}_{14}+Q^{1}_{23}+Q^{1}_{24})$ , etc. This makes clear that if we define $\mathbf{Q}^{\prime}\propto\mathbf{Q}^{2}-\mathbf{Q}^{1}$ , where $\mathbf{Q}^{2}$ is the matrix of equivalent dimension to $\mathbf{Q}^{1}$ but with the $2\times 2$ blocks that will be compressed to $\mathbf{Q}^{2c}$ replaced by their average (e.g., $Q^{2}_{11}=Q^{2}_{12}=Q^{2}_{21}=Q^{2}_{22}=Q^{2c}_{11}$ ), we can straightforwardly evolve Eq. (5) from a starting $\mathbf{Q}^{1}$ to ending $\mathbf{Q}^{2}$ , followed by a coarse graining combination of cells, and repeat. Formally, for each iteration what we are doing is defining $\mathbf{Q}(\lambda)=\mathbf{Q}^{1}+\lambda(\mathbf{Q}^{2}-\mathbf{Q}^{1})$ so that $\mathbf{Q}^{\prime}\equiv d\mathbf{Q}/d\lambda=\mathbf{Q}^{2}-\mathbf{Q}^{1}$ , and solving the differential equation (5) for $\lambda$ running from 0 [where $\mathbf{Q}(\lambda=0)=\mathbf{Q}^{1}$ ] to 1 [where $\mathbf{Q}(\lambda=1)=\mathbf{Q}^{2}$ ].
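The block-averaging construction can be made concrete with a small dense example. This is a minimal sketch with a made-up correlation function; `block_average` and `q_of_lambda` are hypothetical helper names, and a real implementation would never store these matrices densely.

```python
def block_average(Q1):
    """Return (Q2c, Q2): the pair-averaged covariance and its same-size expansion."""
    n = len(Q1) // 2
    Q2c = [[0.25 * (Q1[2 * i][2 * j] + Q1[2 * i][2 * j + 1]
                    + Q1[2 * i + 1][2 * j] + Q1[2 * i + 1][2 * j + 1])
            for j in range(n)] for i in range(n)]
    # Q2 has the dimension of Q1, with each 2x2 block filled by its average.
    Q2 = [[Q2c[i // 2][j // 2] for j in range(2 * n)] for i in range(2 * n)]
    return Q2c, Q2

def q_of_lambda(Q1, Q2, lam):
    """Linear interpolation Q(lambda) = Q1 + lambda (Q2 - Q1), for 0 <= lambda <= 1."""
    m = len(Q1)
    return [[Q1[i][j] + lam * (Q2[i][j] - Q1[i][j]) for j in range(m)] for i in range(m)]

# Translation-invariant toy covariance: elements depend only on separation.
xi = [1.0, 0.5, 0.25, 0.125]  # made-up correlation function
Q1 = [[xi[abs(i - j)] for j in range(4)] for i in range(4)]
Q2c, Q2 = block_average(Q1)
dQ = [[Q2[i][j] - Q1[i][j] for j in range(4)] for i in range(4)]  # Q' for this iteration
```

In this example the diagonal $2\times 2$ blocks of `dQ` are $-0.25$ times the matrix with rows $(1,-1)$ and $(-1,1)$, i.e., the evolution removes variance only from the difference between the two cells in each pair.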

The obvious problem here is that generally $\mathbf{Q}^{2}-\mathbf{Q}^{1}$ is a dense matrix, which we can’t have if the method is to be fast. The key to the RG approach working is that elements of $\mathbf{Q}^{2}-\mathbf{Q}^{1}$ will generally be small very far off-diagonal, i.e., physically we do not expect the correlation at wide separations to change much when the separation is changed by a small fractional amount. To put it another way, we do not expect to need to use small cells when measuring correlations at wide separations. This allows us to drop most elements of $\mathbf{Q}^{2}-\mathbf{Q}^{1}$ , keeping it, and $\mathbf{A}$ as influenced by it, sparse. The closest thing to an exception to this “no fine structure at large separations” rule that comes to mind is the BAO feature—a relatively narrow bump at wide separation. Considering such a thing, we observe that it is only necessary for $\mathbf{Q}^{\prime}$ to remain sparse, not strictly near-diagonal, i.e., we can if necessary include a strip of elements somewhere off-diagonal in $\mathbf{Q}^{\prime}$ , propagate this into $\mathbf{A}$ , etc., as long as there are not too many of these elements.

Operationally, this program is surprisingly straightforward. I start by computing one full row of $\mathbf{Q}(0)=\left(\mathbf{P}^{-1}+\mathbf{A}_{\star}\right)^{-1}$ . This is basically just a standard computation of a correlation function given a power spectrum, i.e., this matrix obeys translation invariance by construction, so its elements are a function only of separation, inverses can be done in Fourier space, and one row is all that is necessary to capture the full matrix. This $\mathbf{Q}(0)$ becomes $\mathbf{Q}^{1}$ described above and I compute the first two rows of $\mathbf{Q}^{2}$ (the $2\times 2$ block-averaged matrix) directly from it. From this I compute the full sparse $\mathbf{Q}^{\prime}$ including only elements above some threshold. I define the threshold to be some fraction of the maximum absolute value of $\mathbf{Q}^{\prime}$ , called $\epsilon_{\mathbf{Q}^{\prime}}$ , i.e., I keep elements with $|Q_{ij}^{\prime}|>\epsilon_{\mathbf{Q}^{\prime}}{\rm max}|\mathbf{Q}^{\prime}|$ . Note that this makes no assumption about the structure of $\mathbf{Q}^{\prime}$ , e.g., an off-diagonal stripe due to something like BAO will be propagated if it passes the threshold.
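The relative-threshold cut on $\mathbf{Q}^{\prime}$ can be sketched as follows. The dense input, the dictionary-based sparse storage, the function name `keep_large_elements`, and the toy numbers are all illustrative assumptions; the point is only that the cut looks at element size, not position.

```python
def keep_large_elements(M, eps):
    """Sparse copy {(i, j): value} of dense M keeping |M_ij| > eps * max|M|."""
    biggest = max(abs(x) for row in M for x in row)
    return {(i, j): x
            for i, row in enumerate(M)
            for j, x in enumerate(row)
            if abs(x) > eps * biggest}

# Toy Q' with made-up values: a strong near-diagonal band, negligible elements
# further out, and a larger stripe at separation 6 standing in for a
# BAO-like feature.
n = 8
dQ = [[0.0] * n for _ in range(n)]
for i in range(n):
    for j in range(n):
        sep = abs(i - j)
        if sep == 0:
            dQ[i][j] = -0.25
        elif sep == 1:
            dQ[i][j] = 0.25
        elif sep == 6:
            dQ[i][j] = 0.01
        else:
            dQ[i][j] = 0.001

sparse_dQ = keep_large_elements(dQ, eps=0.02)  # cutoff is 0.02 * 0.25 = 0.005
```

Here the separation-6 stripe survives the cut while the smaller elements at intermediate separations are dropped, without the code knowing anything about where the stripe sits.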

After evolving $\mathbf{A}$ , $\mathbf{b}$ , and $\mathcal{N}$ through Eqs. (13)-(15), they, along with $\mathbf{Q}$ as represented by a single row, are coarse-grained by factors of two (i.e., elements summed in the case of $\mathbf{b}$ and $\mathbf{A}$ and averaged in the case of $\mathbf{Q}$ ) and the next iteration proceeds exactly as before. All of the problem-specific details go into the construction of $\mathbf{Q}(0)$ , $\mathbf{A}(0)$ , $\mathbf{b}(0)$ , and $\mathcal{N}(0)$ —after that the algorithm proceeds essentially identically for any problem. After enough iterations the effective data set becomes small enough to finish the calculation by brute force using the analytic integral formula, Eq. (16).
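One full iteration (evolve, coarse grain, finish analytically) can be illustrated on a single pair of cells. The sketch below is a toy check with made-up numbers rather than the paper's code: it evolves the flow equations with midpoint steps from $\mathbf{Q}^{1}$ to its block average, merges the pair, and compares the one-cell analytic result with the two-cell analytic result before any evolution. Function names and the step count are assumptions.

```python
import math

def matmul(X, Y):
    return [[sum(X[i][k] * Y[k][j] for k in range(2)) for j in range(2)] for i in range(2)]

def matvec(X, v):
    return [X[i][0] * v[0] + X[i][1] * v[1] for i in range(2)]

def det2(X):
    return X[0][0] * X[1][1] - X[0][1] * X[1][0]

def inv2(X):
    d = det2(X)
    return [[X[1][1] / d, -X[0][1] / d], [-X[1][0] / d, X[0][0] / d]]

def flow(A, b, dQ):
    """Right-hand sides A' = A Q' A, b' = A Q' b, N' = b^t Q' b / 2 - Tr[A Q'] / 2."""
    AQ = matmul(A, dQ)
    Qb = matvec(dQ, b)
    dN = 0.5 * (b[0] * Qb[0] + b[1] * Qb[1]) - 0.5 * (AQ[0][0] + AQ[1][1])
    return matmul(AQ, A), matvec(AQ, b), dN

def evolve_and_combine(A, b, N, Q1, steps=200):
    """Evolve lambda from 0 to 1 with midpoint steps, then merge the two cells."""
    q = 0.25 * (Q1[0][0] + Q1[0][1] + Q1[1][0] + Q1[1][1])
    dQ = [[q - Q1[i][j] for j in range(2)] for i in range(2)]  # Q' = Q2 - Q1
    h = 1.0 / steps
    for _ in range(steps):
        dA, db, _ = flow(A, b, dQ)             # slopes at the start of the step
        A_mid = [[A[i][j] + 0.5 * h * dA[i][j] for j in range(2)] for i in range(2)]
        b_mid = [b[i] + 0.5 * h * db[i] for i in range(2)]
        dA, db, dN = flow(A_mid, b_mid, dQ)    # slopes at the midpoint
        A = [[A[i][j] + h * dA[i][j] for j in range(2)] for i in range(2)]
        b = [b[i] + h * db[i] for i in range(2)]
        N += h * dN
    # Coarse grain: sum the elements of A and b, average Q.
    return A[0][0] + A[0][1] + A[1][0] + A[1][1], b[0] + b[1], N, q

def lnL_pair(A, b, N, Q):
    """Analytic Gaussian integral for two cells."""
    Qinv = inv2(Q)
    G = inv2([[Qinv[i][j] + A[i][j] for j in range(2)] for i in range(2)])
    Gb = matvec(G, b)
    AQ = matmul(A, Q)
    D = [[(1.0 if i == j else 0.0) + AQ[i][j] for j in range(2)] for i in range(2)]
    return 0.5 * (b[0] * Gb[0] + b[1] * Gb[1]) - N - 0.5 * math.log(det2(D))

def lnL_single(a, b, N, q):
    """The same formula for one merged cell (all quantities are scalars)."""
    return 0.5 * b * b / (1.0 / q + a) - N - 0.5 * math.log(1.0 + a * q)

Q1 = [[1.0, 0.6], [0.6, 1.0]]
A0 = [[0.5, 0.1], [0.1, 0.3]]
b0 = [0.4, -0.2]
before = lnL_pair(A0, b0, 0.0, Q1)
after = lnL_single(*evolve_and_combine(A0, b0, 0.0, Q1))
```

The values of `before` and `after` should agree up to the discretization error of the midpoint steps, because the evolution preserves the integral and the merged description is exact once $\mathbf{Q}$ is constant within the block.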

Note that, while my test problems will be one dimensional, where factors of two coarse graining by combining adjacent pixels is the obvious thing to do, there is no obvious reason not to do this as well in higher dimensions. On a Cartesian grid we can combine adjacent cells in one direction at a time. On a sphere, a hierarchical block of four HEALPixels (Górski et al., 2002) can be combined in two steps of pair combinations. However, it should also be possible to generalize the method to combine more than two cells at a time. $\mathbf{Q}^{2}$ as discussed above just needs to represent the appropriately averaged covariance.

II.4 Sparsification

While the $\mathbf{Q}^{\prime}$ cut discussed above limits the range in $\mathbf{A}$ somewhat, in practice I find that the evolution of $\mathbf{A}$ produces many small elements that do not need to be fully propagated for accuracy and slow down the calculation significantly. In McDonald (2019) I maintained the sparsity of $\mathbf{A}$ by computing elements only out to some maximum separation, taken to be a multiple of the RG distance scale $\lambda$ . Here I suggest a potentially more generally adaptive method, along the lines of the element size cut discussed above involving $\epsilon_{\mathbf{Q}^{\prime}}$ . The key equation numerically is Eq. (13), because the matrix products there dominate the computation time. To control this, I introduce two more numerical parameters. When evaluating $\mathbf{A}\mathbf{Q}^{\prime}\mathbf{A}$ , I first trim $\mathbf{A}$ using another threshold parameter, $\epsilon_{\mathbf{A}}$ , again basing the cut on the absolute value of elements relative to the maximum absolute value. To be clear, I am not permanently dropping part of the stored, evolving $\mathbf{A}$ , only the matrix used to compute $\mathbf{A}\mathbf{Q}^{\prime}\mathbf{A}$ . I apply another similar cut defined by $\epsilon_{\mathbf{A}^{\prime}}$ to $\mathbf{A}^{\prime}=\mathbf{A}\mathbf{Q}^{\prime}\mathbf{A}$ , before using it to update $\mathbf{A}$ in each $\lambda$ step. In practice, for simplicity, I only use one of these two cuts at a time, finding the $\epsilon_{\mathbf{A}}$ cut to be slightly more efficient in my test problems.
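The $\epsilon_{\mathbf{A}}$ cut can be sketched with a simple dictionary-based sparse matrix. This is an illustrative assumption about data layout, not the paper's Julia implementation; `trim`, `sparse_matmul`, and `a_prime` are hypothetical names. Note that the stored `A` is left untouched and only the copy entering the product is trimmed.

```python
def trim(M, eps):
    """Copy of sparse M ({(i, j): value}) keeping |value| > eps * max|M|."""
    biggest = max(abs(v) for v in M.values())
    return {ij: v for ij, v in M.items() if abs(v) > eps * biggest}

def sparse_matmul(X, Y):
    """Product of two sparse matrices stored as {(i, j): value}."""
    rows_of_Y = {}
    for (k, j), y in Y.items():
        rows_of_Y.setdefault(k, []).append((j, y))
    out = {}
    for (i, k), x in X.items():
        for j, y in rows_of_Y.get(k, ()):
            out[(i, j)] = out.get((i, j), 0.0) + x * y
    return out

def a_prime(A, dQ, eps_A):
    """A' = A Q' A evaluated with a trimmed copy of A; A itself is not modified."""
    A_trim = trim(A, eps_A)
    return sparse_matmul(sparse_matmul(A_trim, dQ), A_trim)

# Made-up 2x2 example: the tiny off-diagonal elements of A fall below the cut.
A = {(0, 0): 1.0, (0, 1): 1e-5, (1, 0): 1e-5, (1, 1): 2.0}
dQ = {(0, 0): -0.2, (0, 1): 0.2, (1, 0): 0.2, (1, 1): -0.2}
dA = a_prime(A, dQ, eps_A=0.0005)
```

The cost of the two products scales with the number of stored elements that survive the cut, which is why a small change in $\epsilon_{\mathbf{A}}$ can have a large effect on run time.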

II.5 Numerical demonstration

For numerical tests I use one dimensional scenarios similar to McDonald (2019). I use signal power spectrum $P(k)=A(k/k_{p})^{\gamma}\exp(-k^{2})$ with $\gamma=0$ or $-0.5$ , where $k$ is measured in units of the data cell size. I add unit variance noise to each cell. I generate mock data with $A_{0}=1$ and calculate likelihoods as a function of $A$ . I use pivot $k_{p}=0.1$ so that the $\gamma=-0.5$ case has both signal and noise dominated ranges of scales. To be sure the test covers both fine structure and edges, I create statistically inhomogeneous data sets where the rms noise level in every fourth cell is multiplied by a factor of 10, and the noise in the last quarter of the data vector is similarly multiplied.
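The test configuration can be written down compactly. The sketch below makes several assumptions that the text does not pin down: "every fourth cell" is taken to mean indices 3, 7, 11, ..., cells that are both every-fourth and in the last quarter are boosted only once, and the $k=0$ mode (where the $\gamma=-0.5$ power law diverges) is left to the caller. Function names are hypothetical.

```python
import math

def signal_power(k, amplitude=1.0, gamma=-0.5, k_p=0.1):
    """P(k) = A (k / k_p)^gamma exp(-k^2), with k in units of the data cell size.

    Not defined at k = 0 when gamma < 0; the caller must treat that mode separately.
    """
    return amplitude * (abs(k) / k_p) ** gamma * math.exp(-k * k)

def noise_variance(n, rms_boost=10.0):
    """Per-cell noise variance: unit variance, with the rms multiplied by rms_boost
    in every fourth cell and in the last quarter of the data vector."""
    var = [1.0] * n
    for i in range(n):
        if i % 4 == 3 or i >= (3 * n) // 4:
            var[i] = rms_boost ** 2  # an rms factor of 10 is a variance factor of 100
    return var

var16 = noise_variance(16)
p_pivot = signal_power(0.1)  # at the pivot scale the power-law factor is 1
```

With $k_{p}=0.1$ and unit noise variance, the $\gamma=-0.5$ signal power exceeds the noise in the good cells for $k<k_{p}$ and falls below it at higher $k$, which is the sense in which the test covers both regimes.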

It is more difficult to make a non-trivial test with the innovations in this paper, because if I assume periodic data with homogeneous noise so that I can compute the exact likelihood to compare to using FFTs, the obvious choice of $\mathbf{A}_{\star}$ sets $\mathbf{A}\equiv 0$ so the RG evolution is almost trivial. If I also find the maximum likelihood field to use for $\bm{\phi}_{0}$ , so that $\mathbf{b}\equiv 0$ , it is completely trivial. For this reason I only do tests with inhomogeneous data in this paper, first on data sets small enough to compute the exact likelihood by brute force linear algebra, demonstrating that the RG method works precisely in the appropriate limit of the numerical parameters, then with large data sets where the truth is determined by using much better than necessary values for numerical parameters.

After some experimentation, my standard numerical parameter settings are as follows: $\mathbf{A}_{\star}$ is set to $0.47N_{0}^{-1}$ , where $N_{0}$ is the noise power in the good part of the data—this sets the accumulated ${\rm Tr}[\mathbf{A}\mathbf{Q}^{\prime}]$ term in Eq. (15) to approximately zero (the results are insensitive to the exact value of $\mathbf{A}_{\star}$ , as long as it is reasonable). I specify the number of mid-point method $\lambda$ steps per factor of two coarse graining by a numerical parameter $N_{d\mathbf{Q}^{\prime}}$ . My standard setting is $N_{d\mathbf{Q}^{\prime}}=8$ (in an advanced version of the method, one could try to apply all the usual tricks for solving differential equations numerically). I set $\epsilon_{\mathbf{Q}^{\prime}}=0.02$ , and $\epsilon_{\mathbf{A}}=0.0005$ .

II.5.1 Small problems

I first do some tests with $N=16384$ , where we can still pretty quickly compute the exact likelihood by brute force linear algebra, shown in Fig. 1.

The results are good, by construction of course. Both using a maximum likelihood $\bm{\phi}_{0}$ and using $\mathbf{A}_{\star}$ to remove the mean effect of $\mathbf{A}$ from the evolution improve the accuracy at fixed parameter settings, although for these settings (which were driven by larger data sets) the difference is not critical. This example has $\gamma=-0.5$ , which is generally a little more difficult for the algorithm than $\gamma=0$ .

II.5.2 Large problems

If we are convinced that the algorithm works in the sense of producing accurate results in the appropriate limit of numerical parameters, we can do non-trivial large-scale tests by simply looking for convergence as the numerical parameters are changed, i.e., we assume that if there is convergence it is to the correct result. Figure 2 shows an $N=524288$ test, for $\gamma=-0.5$ again.

The results are again excellent. One might guess based on these figures that my numerical parameter settings are too conservative, i.e., that I could loosen them to achieve better speed. This is not actually true—there seems to be some cancelation of errors that makes the results in these particular examples so perfect, and they go bad very quickly if the parameters are loosened.

I stop at $N=2^{19}$ for these examples because careful testing on my laptop becomes tedious beyond this, especially running with extremely conservative parameter settings to be certain of the exact result. I have run up to two million cells with good looking results. A one million cell example runs in two minutes. At four million I start to exhaust the memory on my laptop in my current Julia implementation, although it would be possible to go somewhat further with more optimization. In any case, it is clear that billion cell data sets could be done comfortably on a supercomputer.

I tried evolving using $Q(\lambda,k)=Q(0,k)e^{-k^{2}\lambda^{2}}$ , more like in McDonald (2019), but with a maximum likelihood $\bm{\phi}_{0}$ , $\mathbf{A}_{\star}$ , and element size cuts as introduced in this paper, but was unable to come within a factor of ten of the performance of the pairwise suppression approach of this paper.

III Discussion

To summarize, I have suggested the following improvements to the basic RG approach of McDonald (2019):

• Integrating out the difference between cells that are to be combined, rather than small-scale structure more generally, by defining $\mathbf{Q}^{\prime}$ directly to be proportional to the difference between the current and target covariance.

• Shifting integration variables to integrate around a maximum likelihood signal field, if available, as $\bm{\phi}_{0}$.

• Subtracting a statistically homogeneous approximation out of the numerically evolving matrix $\mathbf{A}$, through the definition of $\mathbf{A}_{\star}$.

• Cuts on matrix element size, specified by $\epsilon_{\mathbf{Q}^{\prime}}$, $\epsilon_{\mathbf{A}}$, etc., instead of a simple range cut.

The first of these is by far the most important. In the end it is clear that the algorithm is fast and straightforward enough for convenient practical data analysis.

It was surprising to me that the pair-oriented definition of $\mathbf{Q}^{\prime}$ made such a large (factor $\gtrsim 10$) difference in speed. While the principle that if we know which cells we will combine we should focus on integrating out the difference between them seems good enough to expect some improvement, I would have been happy with a factor of two. It may be that I do not have the best possible implementation of the smooth cutoff option. In any case, it seems like the pair-oriented approach is the way to go.

Of course it is only useful to integrate around a maximum likelihood field if that field can be found more quickly than the RG analysis could be done without it. This was the case in my tests, where finding the maximum likelihood field by conjugate gradient (CG) takes about 5% of the time in each likelihood computation. This might not always be the ratio, as my CG solution was massively accelerated by being able to multiply by things like $\mathbf{P}$ in Fourier space, including for preconditioning (e.g., without preconditioning, finding a maximum likelihood field takes longer than the RG integration without it). If, e.g., the CG had to be done using less efficient spherical harmonic transforms, it might be faster not to use it. An interesting possibility is to use the RG method itself to find the maximum likelihood field. McDonald (2019) showed how to find the data-constrained mean of any function of $\bm{\phi}$, with $\left<\bm{\phi}\right>$ itself being the simplest possible version of this. For a Gaussian problem $\left<\bm{\phi}\right>$ is the maximum likelihood field, while for a non-Gaussian problem it is not, but it would probably be a better starting point than the maximum likelihood field in that case anyway. Finding $\left<\bm{\phi}\right>$ can be piggybacked on a standard likelihood computation with minimal extra cost, but to get a speedup in likelihood calculations you would need to feed the result back into a recalculation. This would only be effective if a useful estimate of $\left<\bm{\phi}\right>$ could be found with looser numerical settings than would be required to do the calculation with $\bm{\phi}_{0}=0$, which seems quite possible. When, e.g., computing derivatives with respect to parameters, we would probably achieve most of the benefit by computing $\left<\bm{\phi}\right>$ only for the central model (remember that accurate results can be achieved for any $\bm{\phi}_{0}$; it is just a question of how tight numerical settings need to be to do it).
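
To make the CG step concrete, here is a minimal preconditioned conjugate gradient sketch. It assumes the standard Gaussian setup with a linear data model, in which the maximum likelihood field solves $(\mathbf{P}^{-1}+\mathbf{R}^{t}\mathbf{N}^{-1}\mathbf{R})\bm{\phi}_{0}=\mathbf{R}^{t}\mathbf{N}^{-1}\mathbf{o}$. Everything specific is an assumption for illustration: $\mathbf{R}$ is the identity, the noise is diagonal with two unobserved cells, $\mathbf{P}^{-1}$ is a toy tridiagonal matrix, and a simple diagonal (Jacobi) preconditioner stands in for the Fourier-space preconditioning described in the text.

```python
def matvec(M, x):
    return [sum(m * xj for m, xj in zip(row, x)) for row in M]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def pcg(apply_A, b, apply_Minv, tol=1e-12, max_iter=200):
    """Preconditioned CG for a symmetric positive definite operator."""
    x = [0.0] * len(b)
    r = list(b)                      # residual b - A x for x = 0
    z = apply_Minv(r)
    p = list(z)
    rz = dot(r, z)
    b_norm2 = dot(b, b)
    for iteration in range(max_iter):
        if dot(r, r) <= tol**2 * b_norm2:
            return x, iteration
        Ap = apply_A(p)
        alpha = rz / dot(p, Ap)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, Ap)]
        z = apply_Minv(r)
        rz_new = dot(r, z)
        p = [zi + (rz_new / rz) * pi for zi, pi in zip(z, p)]
        rz = rz_new
    return x, max_iter


# Toy problem (all values hypothetical).
n = 8
P_inv = [[0.0] * n for _ in range(n)]
for i in range(n):
    P_inv[i][i] = 2.1
    if i + 1 < n:
        P_inv[i][i + 1] = P_inv[i + 1][i] = -1.0
noise_inv = [1.0, 1.0, 1.0, 0.0, 0.0, 1.0, 1.0, 1.0]   # zeros mark a hole
data = [0.3, -0.1, 0.4, 0.0, 0.0, -0.2, 0.5, 0.1]


def apply_A(x):
    # (P^-1 + N^-1) x with R = identity and diagonal N
    return [px + w * xi for px, w, xi in zip(matvec(P_inv, x), noise_inv, x)]


diag_A = [P_inv[i][i] + noise_inv[i] for i in range(n)]


def apply_Minv(r):
    # Jacobi preconditioner: divide by the diagonal of A
    return [ri / d for ri, d in zip(r, diag_A)]


b = [w * o for w, o in zip(noise_inv, data)]
phi0, n_iter = pcg(apply_A, b, apply_Minv)
residual = max(abs(a - bi) for a, bi in zip(apply_A(phi0), b))
print(n_iter, residual)
```

Only products with the operator and the preconditioner appear in the loop, which is why fast Fourier-space multiplication, where available, translates directly into a fast solve.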

Note that it may not always be beneficial to use $\mathbf{A}_{\star}\neq 0$ . There is no cost if all cells in a formal data vector have measurements, i.e., there are no zeros on the diagonal of $\mathbf{R}^{t}\mathbf{N}^{-1}\mathbf{R}$ , but if a substantial number of cells represent large holes in the data set or zero padding, so that these elements of $\mathbf{A}(0)$ can be dropped from sparse storage, setting $\mathbf{A}_{\star}\neq 0$ will remove this possibility. This must be considered on a problem-by-problem basis.
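
The storage trade-off can be seen in a toy count of stored diagonal elements. The sketch assumes that $\mathbf{A}_{\star}$ acts as a constant subtracted from the diagonal of the stored matrix, that $N_{0}=1$ in arbitrary units, and that unobserved cells contribute exact zeros to $\mathbf{A}(0)$; the sign convention and the numbers are assumptions for illustration.

```python
def stored_count(diagonal):
    """Number of entries a sparse format would need to keep (nonzero values)."""
    return sum(1 for value in diagonal if value != 0.0)


noise_power = 1.0                                  # N_0, hypothetical units
a0_diag = [1.0 / noise_power] * 6 + [0.0] * 4      # 6 observed cells, 4 in a hole
a_star = 0.47 / noise_power                        # standard setting from the text
shifted_diag = [a - a_star for a in a0_diag]       # assumed form of the subtraction

print(stored_count(a0_diag), stored_count(shifted_diag))
```

With $\mathbf{A}_{\star}=0$ the hole cells need no storage, while after the shift every cell carries a nonzero entry.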

While my prototype code is already quite fast, at two minutes per likelihood evaluation per million cells, there is clearly more room for optimization. Most obviously, I am not taking advantage of the fact that $\mathbf{A}$ and $\mathbf{Q}^{\prime}$ are symmetric matrices at all, for no better reason than not knowing canned operations in Julia that will do this. Other simple improvements could come from tuning things like the cuts I’ve parameterized by $\epsilon_{\mathbf{Q}^{\prime}}$, etc. I kept these cuts constant for all iterations, but this could be wasteful if the required cut value is set by coarser levels of the calculation that do not take much total time. A less obvious but I think promising optimization idea is the following: The effect of evolving Eq. (13) is non-linear in the $\mathbf{Q}^{\prime}$ matrix, as initial changes in $\mathbf{A}$ are multiplied back together to find the next step, i.e., we get products of $\mathbf{Q}^{\prime}$ with itself. The required number of steps is surely set by the products of the largest elements of $\mathbf{Q}^{\prime}$; the products of small elements are perturbatively much smaller. This suggests that $\mathbf{Q}^{\prime}$ could be split into two or more pieces based on element size. The piece(s) with larger elements, which would be very short-range (i.e., few elements, i.e., fast to multiply), could be evolved first, then the longer-range pieces with smaller elements evolved with fewer steps, possibly even one, because their self-products are negligible. As long as our set of $\mathbf{Q}^{\prime}$ steps integrates to $\mathbf{Q}_{2}-\mathbf{Q}_{1}$, we are free to choose the details.
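
The proposed split can be sketched on a toy sparse matrix. The matrix below, with elements falling off as $0.5^{|i-j|}$ out to a range of four cells, is a hypothetical stand-in for $\mathbf{Q}^{\prime}$, and the threshold of 0.2 is an arbitrary choice. The direct self-product of each piece is used only as a rough proxy for the nonlinear terms that arise in the actual evolution.

```python
def split_by_size(matrix, threshold):
    """Split a sparse matrix {(i, j): value} into large- and small-element pieces."""
    large = {ij: v for ij, v in matrix.items() if abs(v) >= threshold}
    small = {ij: v for ij, v in matrix.items() if abs(v) < threshold}
    return large, small


def sparse_matmul(A, B):
    """Product of two sparse matrices stored as {(i, j): value} dictionaries."""
    rows_of_B = {}
    for (k, j), b in B.items():
        rows_of_B.setdefault(k, []).append((j, b))
    out = {}
    for (i, k), a in A.items():
        for j, b in rows_of_B.get(k, []):
            out[(i, j)] = out.get((i, j), 0.0) + a * b
    return out


def max_abs(matrix):
    return max(abs(v) for v in matrix.values())


# Toy stand-in for Q' (assumption): elements decay with separation.
n, max_range = 10, 4
q_prime = {(i, j): 0.5 ** abs(i - j)
           for i in range(n) for j in range(n) if abs(i - j) <= max_range}

large, small = split_by_size(q_prime, 0.2)
recombined = dict(large)
recombined.update(small)                  # the pieces must sum to the original
large_range = max(abs(i - j) for i, j in large)

print(len(large), len(small), large_range)
print(max_abs(sparse_matmul(large, large)), max_abs(sparse_matmul(small, small)))
```

In this toy case the large-element piece reaches only two cells from the diagonal, and the self-product of the small-element piece is well over an order of magnitude below that of the large-element piece, which is the property the argument above relies on.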

The next step is to implement this for realistic cosmological scenarios.

Acknowledgements.

I thank Zack Slepian and Uroš Seljak for helpful comments. This work was supported by the U.S. Department of Energy, Office of Science, Office of High Energy Physics, under Contract No. DE-AC02-05-CH11231.

Appendix A Alternative approach to integrating out differences between cells

Before realizing I could define $\mathbf{Q}^{\prime}$ by simply differencing the current and target $\mathbf{Q}$s, I worked out a method for integrating out the difference between cells that is closer to the original approach in McDonald (2019). I include it here to promote broader understanding of the possibilities.

The RG integration will be controlled by a parameter $\alpha$ which starts at zero and is taken to $\infty$ . $\mathbf{Q}$ and $S_{I}$ become functions of this parameter, i.e.,

[TABLE]

with $\mathbf{K}$ a fixed matrix to be specified. Obviously we can suppress fluctuations between cells 1 and 2 by adding a term to $S(\phi)$ proportional to $(\phi_{1}-\phi_{2})^{2}$ . Repeating this over and over (e.g., $(\phi_{3}-\phi_{4})^{2}$ , etc.) is equivalent to making $\mathbf{K}$ the following block diagonal matrix:

[TABLE]

where

[TABLE]

I.e., by dialing $\alpha$ from 0 to $\infty$ in $\mathbf{Q}^{-1}=\mathbf{P}^{-1}+\alpha\mathbf{K}$ , we will have effectively integrated out the differences between adjacent pairs of cells. We now have

[TABLE]

Unlike in McDonald (2019), $\mathbf{K}$ is not exactly translation invariant, so we can’t simply compute $\left(\mathbf{P}^{-1}+\alpha\mathbf{K}\right)^{-1}$ in Fourier space. The structure of $\mathbf{Q}^{\prime}$ is the same everywhere, however, up to a distinction between odd and even cells, and it is limited to short range, so we can compute it by brute force inversion for a limited representative stretch of cells and then translate it everywhere.
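
The displayed equations of this appendix are missing from this copy, so the following sketch rests on one reading of the surrounding text: a single pair of cells, with the $2\times 2$ block of $\mathbf{K}$ taken as $[[1,-1],[-1,1]]$ so that $\bm{\phi}^{t}\mathbf{K}\bm{\phi}=(\phi_{1}-\phi_{2})^{2}$. The normalization and the toy covariance $\mathbf{P}$ are assumptions. The code checks that dialing $\alpha$ up in $\mathbf{Q}^{-1}=\mathbf{P}^{-1}+\alpha\mathbf{K}$ suppresses the variance of the difference while leaving the variance of the sum unchanged, and that the derivative obeys the matrix identity $d\mathbf{Q}/d\alpha=-\mathbf{Q}\mathbf{K}\mathbf{Q}$, which follows from the definition of $\mathbf{Q}^{-1}$.

```python
def inv2(M):
    """Inverse of a 2x2 matrix."""
    (a, b), (c, d) = M
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]


def matmul2(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]


K2 = [[1.0, -1.0], [-1.0, 1.0]]     # assumed normalization of the pair block


def q_of_alpha(P, alpha):
    """Q(alpha) = (P^-1 + alpha K)^-1 for one pair of cells."""
    P_inv = inv2(P)
    Q_inv = [[P_inv[i][j] + alpha * K2[i][j] for j in range(2)] for i in range(2)]
    return inv2(Q_inv)


def var_diff(Q):
    # Variance of phi_1 - phi_2 under covariance Q
    return Q[0][0] - Q[0][1] - Q[1][0] + Q[1][1]


def var_sum(Q):
    # Variance of phi_1 + phi_2 under covariance Q
    return Q[0][0] + Q[0][1] + Q[1][0] + Q[1][1]


P = [[1.0, 0.6], [0.6, 1.0]]        # toy covariance (assumption)
Q_large = q_of_alpha(P, 1000.0)
print(var_diff(P), var_diff(Q_large), var_sum(P), var_sum(Q_large))

# Central finite difference versus the identity dQ/dalpha = -Q K Q
alpha, h = 0.5, 1e-4
Q_plus, Q_minus = q_of_alpha(P, alpha + h), q_of_alpha(P, alpha - h)
dQ_numeric = [[(Q_plus[i][j] - Q_minus[i][j]) / (2 * h) for j in range(2)]
              for i in range(2)]
Q_mid = q_of_alpha(P, alpha)
QKQ = matmul2(matmul2(Q_mid, K2), Q_mid)
dQ_error = max(abs(dQ_numeric[i][j] + QKQ[i][j])
               for i in range(2) for j in range(2))
print(dQ_error)
```

For this toy $\mathbf{P}$ the difference variance is $2/(2.5+2\alpha)$, so it falls from 0.8 at $\alpha=0$ toward zero, while the sum variance stays at 3.2 for every $\alpha$ because the sum direction is annihilated by $\mathbf{K}$.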

This approach worked in preliminary tests, but not as efficiently as the one in the paper.

Bibliography

1. McDonald (2019): Patrick McDonald, “Renormalization group computation of likelihood functions for cosmological data sets,” Phys. Rev. D 99, 043538 (2019).
2. Wilson and Kogut (1974): K. G. Wilson and J. Kogut, “The renormalization group and the $\epsilon$ expansion,” Phys. Rept. 12, 75–199 (1974).
3. Banks (2008): T. Banks, Modern Quantum Field Theory (Cambridge University Press, Cambridge, UK, 2008).
4. Peebles (1993): P. J. E. Peebles, Principles of Physical Cosmology (Princeton University Press, 1993), ISBN 978-0-691-01933-8.
5. McDonald and Roy (2009): P. McDonald and A. Roy, “Clustering of dark matter tracers: generalizing bias for the coming era of precision LSS,” JCAP 8, 020 (2009), arXiv:0902.0991 [astro-ph.CO].
6. McDonald and Vlah (2018): Patrick McDonald and Zvonimir Vlah, “Large-scale structure perturbation theory without losing stream crossing,” Phys. Rev. D 97, 023508 (2018).
7. Bond et al. (1998): J. R. Bond, A. H. Jaffe, and L. Knox, “Estimating the power spectrum of the cosmic microwave background,” Phys. Rev. D 57, 2117–2137 (1998).
8. Wandelt et al. (2004): Benjamin D. Wandelt, David L. Larson, and Arun Lakshminarayanan, “Global, exact cosmic microwave background data analysis using Gibbs sampling,” Phys. Rev. D 70, 083511 (2004), arXiv:astro-ph/0310080 [astro-ph].