Regularized estimation for highly multivariate log Gaussian Cox processes

Achmad Choiruddin; Francisco Cuevas-Pacheco; Jean-François Coeurjolly; Rasmus Waagepetersen

arXiv:1905.01455·stat.ME·May 7, 2019·Stat. Comput.


TL;DR

This paper introduces stable and efficient methods for estimating parameters in complex multivariate log Gaussian Cox processes, enabling better analysis of high-dimensional point pattern data, exemplified by tropical rainforest ecology.

Contribution

It develops novel numerical algorithms for parameter estimation and model selection in highly multivariate log Gaussian Cox processes, addressing computational challenges.

Findings

01 Algorithms are numerically stable and efficient.

02 Applied successfully to tropical rainforest data.

03 Improves modeling of complex multivariate point patterns.

Abstract

Statistical inference for highly multivariate point pattern data is challenging due to complex models with large numbers of parameters. In this paper, we develop numerically stable and efficient parameter estimation and model selection algorithms for a class of multivariate log Gaussian Cox processes. The methodology is applied to a highly multivariate point pattern data set from tropical rain forest ecology.

Tables

Table 1. Averages of the minimized objective function $Q(\boldsymbol{\theta})$ given by (5) and the computing time (in seconds) based on 200 simulations from a multivariate log Gaussian Cox process ($p=5$, $q=2$), modeled with $q\in\{1,2,3,4,5\}$, for two optimization methods.

Method	$q=1$	$q=2$	$q=3$	$q=4$	$q=5$
Minimized objective function
SQN	6.61	4.76	5.39	6.32	4.51
CBD	3.55	1.96	1.73	1.62	1.57
Timings (seconds)
SQN	0.96	1.98	3.97	6.45	8.99
CBD	1.99	3.11	4.26	5.30	5.92

Table 2. Average RMSEs for $\hat{\boldsymbol{\alpha}}\hat{\boldsymbol{\alpha}}^{\mbox{\scriptsize\sf T}}$, $\hat{\boldsymbol{\sigma}}^{2}$, and $\hat{\boldsymbol{\psi}}$ (see explanation in text) obtained from 200 simulations from a multivariate log Gaussian Cox process ($p=5$, $q=2$), modeled with $q\in\{1,2,3,4,5\}$. The estimates are obtained by minimizing (5) with two optimization methods. The last column shows the percentages of outlying parameter estimates removed in the RMSE calculation.

Method	$q=1$	$q=2$	$q=3$	$q=4$	$q=5$	Outliers (%)
$\hat{\boldsymbol{\alpha}}\hat{\boldsymbol{\alpha}}^{\mbox{\scriptsize\sf T}}$
SQN	0.41	0.93	1.10	1.17	1.09	10.3
CBD	0.41	0.25	0.29	0.32	0.39	0
$\hat{\boldsymbol{\sigma}}^{2}$
SQN	0.58	0.54	0.44	0.89	0.98	1.1
CBD	0.34	0.18	0.28	0.39	0.50	0
$\hat{\boldsymbol{\psi}}$
SQN	0.0791	0.1752	0.1337	0.4091	0.4566	11.5
CBD	0.0050	0.0091	0.0110	0.0005	0.0004	0

Table 3. Distribution of $|q_{\text{eff}}-2|$ (in %) over 200 simulations from a multivariate log Gaussian Cox process ($p=5$, $q=2$) using CBD for minimization.

	LSE: $q\in\boldsymbol{q},\lambda=0$				LASSO: $q\in\boldsymbol{q},\lambda\in\boldsymbol{\lambda}$				LASSO: $q=5,\lambda\in\boldsymbol{\lambda}$
$|q_{\text{eff}}-2|$	0	1	2	3	0	1	2	3	0	1	2	3
Min	47	28	13	12	42	32	21	5	16	37	30	17
1-SE	46	32	22	0	15	20	65	0	10	22	65	3

Table 4. Average RMSEs obtained from 200 simulations from a multivariate log Gaussian Cox process ($p=5$, $q=2$) for different methods of selecting $q$ and $\lambda$.

	$q=2$		LSE		LASSO		LASSO
	$\lambda=0$	$\lambda\in\boldsymbol{\lambda}$	$q\in\boldsymbol{q},\lambda=0$		$q\in\boldsymbol{q},\lambda\in\boldsymbol{\lambda}$		$q=5,\lambda\in\boldsymbol{\lambda}$
	Min	Min	Min	1-SE	Min	1-SE	Min	1-SE
$\hat{\boldsymbol{\alpha}}\hat{\boldsymbol{\alpha}}^{\mbox{\scriptsize\sf T}}$	0.26	0.33	0.33	0.40	0.36	0.54	0.40	0.54
$\hat{\boldsymbol{\sigma}}^{2}$	0.42	0.54	0.54	0.58	0.56	0.75	0.63	0.76
$\hat{\boldsymbol{\psi}}$	0.04	0.05	0.05	0.02	0.03	0.01	0.04	0.01
$\widehat{\mathrm{PV}}$	0.28	0.31	0.32	0.35	0.33	0.41	0.37	0.42

Table 5. Distribution of $|q_{\text{eff}}-4|$ from 200 simulations of a multivariate log Gaussian Cox process ($p=10$ and $q=4$).

	LSE: $q\in\boldsymbol{q},\lambda=0$					LASSO: $q\in\boldsymbol{q},\lambda\in\boldsymbol{\lambda}$					LASSO: $q=8,\lambda\in\boldsymbol{\lambda}$
$|q_{\text{eff}}-4|$	0	1	2	3	4	0	1	2	3	4	0	1	2	3	4
Min	19	21	18	19	23	14	31	20	19	16	6	15	20	21	38
1-SE	27	36	20	12	5	22	37	21	8	12	21	22	25	11	21

Table 6. Average of RMSEs obtained from 200 simulations from a multivariate log Gaussian Cox process ($p=10$, $q=4$) for different methods of selecting $q$ and $\lambda$.

	LSE		LASSO		$q=8$ (LASSO)
	Min	1-SE	Min	1-SE	Min	1-SE
$\hat{\boldsymbol{\alpha}}\hat{\boldsymbol{\alpha}}^{\mbox{\scriptsize\sf T}}$	0.50	0.67	0.44	0.48	0.78	0.51
$\hat{\boldsymbol{\sigma}}^{2}$	0.58	0.89	0.54	0.70	0.88	0.76
$\hat{\boldsymbol{\psi}}$	0.02	0.02	0.01	0.02	0.02	0.02
$\widehat{\mathrm{PV}}$	0.35	0.35	0.34	0.39	0.35	0.40

Table 7. Distribution (in %) of estimated inter-species correlations $\mathrm{corr}[Y_{i}(u),Y_{j}(u)]$ and $\mathrm{corr}[Z_{i}(u),Z_{j}(u)]$, $i\neq j$, over different intervals $[\text{Lower},\text{Upper}]$ for the 86 species application using elastic net ($\xi=0.5$) with $q=4$ and $\lambda=1.94$.

Lower	-1	-0.5	-0.2	0	0.2	0.5
Upper	-0.5	-0.2	0	0.2	0.5	1
$\mathrm{corr}[Y_{i}(u),Y_{j}(u)]$	2	6	9	13	22	48
$\mathrm{corr}[Z_{i}(u),Z_{j}(u)]$	0	2	15	60	19	4

Table 8. Distribution of estimated $\mathrm{PV}_{i}(0)$ for the 86 species application using elastic net ($\xi=0.5$) with $q=4$ and $\lambda=1.94$.

Interval	0-0.25	0.25-0.5	0.5-0.75	0.75-1
Number of species	46	20	10	10
Species (%)	53	23	12	12

Equations

Z_{i}(\mathbf{u})=\mu_{i}(\mathbf{u})+Y_{i}(\mathbf{u})+U_{i}(\mathbf{u}),\quad\mathbf{u}\in\mathbb{R}^{2}.

Y_{i}(\mathbf{u})=\sum_{l=1}^{q}\alpha_{il}E_{l}(\mathbf{u})

g_{ij}(t)=\exp\Big[\sum_{l=1}^{q}\alpha_{il}\alpha_{jl}r_{l}(t;\phi_{l})+1(i=j)\sigma^{2}_{i}c_{i}(t;\psi_{i})\Big]

\hat{g}_{ij}(t)=\frac{1}{2\pi t}\sum_{\substack{\mathbf{u}\in X_{i}\cap W,\,\mathbf{v}\in X_{j}\cap W\\ \mathbf{u}\neq\mathbf{v}}}\frac{k_{b}(t-\|\mathbf{u}-\mathbf{v}\|)}{\hat{\rho}_{i}(\mathbf{u})\hat{\rho}_{j}(\mathbf{v})|W\cap W_{\mathbf{u}-\mathbf{v}}|},\quad t>0,

\boldsymbol{\beta}_{ij}(\boldsymbol{\alpha},\boldsymbol{\sigma}^{2})

\boldsymbol{\beta}_{ii}(\boldsymbol{\alpha},\boldsymbol{\sigma}^{2})

Q(\boldsymbol{\theta})=\sum_{i,j=1}^{p}\|Y_{ij}-X_{ij}(\boldsymbol{\phi},\boldsymbol{\psi})\boldsymbol{\beta}_{ij}(\boldsymbol{\alpha},\boldsymbol{\sigma}^{2})\|^{2},

Y_{ij}=\big(\sqrt{w_{ij1}}\log\hat{g}_{ij}(t_{1}),\ldots,\sqrt{w_{ijL}}\log\hat{g}_{ij}(t_{L})\big)^{\mbox{\scriptsize\sf T}},

\mathbf{r}(t_{k};\boldsymbol{\phi})=(r_{1}(t_{k};\phi_{1}),\ldots,r_{q}(t_{k};\phi_{q})).

\mathrm{PV}_{i}(t)=\frac{\mathrm{cov}[Y_{i}(\mathbf{u}),Y_{i}(\mathbf{u}+\mathbf{h})]}{\mathrm{cov}[Z_{i}(\mathbf{u}),Z_{i}(\mathbf{u}+\mathbf{h})]}=\frac{\sum_{l=1}^{q}\alpha_{il}^{2}r_{l}(t;\phi_{l})}{\sum_{l=1}^{q}\alpha_{il}^{2}r_{l}(t;\phi_{l})+\sigma_{i}^{2}c_{i}(t;\psi_{i})},\quad\|\mathbf{h}\|=t.

\mathrm{cov}[Y_{i}(\mathbf{u}),Y_{j}(\mathbf{u})]=\boldsymbol{\alpha}_{i\cdot}\boldsymbol{\alpha}_{j\cdot}^{\mbox{\scriptsize\sf T}}

\mathrm{cov}[Z_{i}(\mathbf{u}),Z_{j}(\mathbf{u})]=\boldsymbol{\alpha}_{i\cdot}\boldsymbol{\alpha}_{j\cdot}^{\mbox{\scriptsize\sf T}}+1[i=j]\sigma_{i}^{2}.

Q_{\lambda}(\boldsymbol{\theta})=Q(\boldsymbol{\theta})+\lambda\sum_{i=1}^{p}\sum_{l=1}^{q}p(\alpha_{il})

Q_{\lambda,i}(\boldsymbol{\alpha}_{i\cdot},\sigma_{i}^{2})=2\sum_{j=1,\,j\neq i}^{p}\|Y_{ij}-\tilde{X}_{ij}\boldsymbol{\alpha}_{i\cdot}\|^{2}+\|Y_{ii}-X_{ii}\boldsymbol{\beta}_{ii}(\boldsymbol{\alpha},\boldsymbol{\sigma}^{2})\|^{2}+\lambda\sum_{l=1}^{q}p(\alpha_{il})

\hat{\sigma}_{i}^{2}

\|Y_{ii}-X_{ii}\boldsymbol{\beta}_{ii}(\boldsymbol{\alpha},\boldsymbol{\sigma}^{2})\|^{2}\approx\|Y_{ii}-\tilde{X}_{ii}^{k}[\boldsymbol{\alpha}_{i\cdot}^{\mbox{\scriptsize\sf T}},\sigma_{i}^{2}]^{\mbox{\scriptsize\sf T}}\|^{2},

\hat{Q}_{\lambda,i}(\boldsymbol{\alpha}_{i\cdot},\sigma_{i}^{2}|\boldsymbol{\alpha}_{i\cdot}^{(k)})=\sum_{j=1}^{p}\|Y_{ij}^{*}-X_{ij}^{*}\boldsymbol{\alpha}_{i\cdot}\|^{2}+\lambda\sum_{l=1}^{q}p(\alpha_{il}),

Y_{ij}^{*}

X_{ij}^{*}

Y_{ii}^{*}

X_{ii}^{*}

\hat{\boldsymbol{\alpha}}_{i\cdot}

\boldsymbol{\alpha}_{i\cdot}^{(k+1)}=\boldsymbol{\alpha}_{i\cdot}^{(k)}+t(\hat{\boldsymbol{\alpha}}_{i\cdot}-\boldsymbol{\alpha}_{i\cdot}^{(k)}).

\mathrm{CV}(\lambda,q)=\frac{1}{K}\sum_{c=1}^{K}\mathrm{CV}_{c},

(\lambda_{\text{opt}},q_{\text{opt}})=\operatorname*{arg\,min}_{m=1,\ldots,M,\;n=1,\ldots,N}\mathrm{CV}(\lambda_{m},q_{n}).

\mathrm{CV}(\lambda,q)\leq\mathrm{CV}(\lambda_{\text{opt}},q_{\text{opt}})+\mathrm{SE}(\lambda_{\text{opt}},q_{\text{opt}}),

\mathrm{SE}(\lambda_{\text{opt}},q_{\text{opt}})=\sqrt{\frac{\sum_{c=1}^{K}\big(\mathrm{CV}_{c}-\mathrm{CV}(\lambda_{\text{opt}},q_{\text{opt}})\big)^{2}}{(K-1)K}}.

\mathrm{RMSE}(\hat{\omega})=\sqrt{\mathbb{E}\big((\hat{\omega}-\omega)^{2}\big)}.

\boldsymbol{\alpha}^{\mbox{\scriptsize\sf T}}

\boldsymbol{\phi}

\boldsymbol{\sigma}^{2}

\boldsymbol{\alpha}


Full text

Regularized estimation for highly multivariate log Gaussian Cox processes

Achmad Choiruddin

Department of Mathematical Sciences, Aalborg University

Francisco Cuevas-Pacheco

Department of Mathematical Sciences, Aalborg University

Jean-François Coeurjolly

Department of Mathematics, Université du Québec à Montréal

Rasmus Waagepetersen

Department of Mathematical Sciences, Aalborg University

Abstract

Statistical inference for highly multivariate point pattern data is challenging due to complex models with large numbers of parameters. In this paper we develop numerically stable and efficient parameter estimation and model selection algorithms for a class of multivariate log Gaussian Cox processes. The methodology is applied to a highly multivariate point pattern data set from tropical rain forest ecology.

Key words: cross pair correlation, elastic net, LASSO, log Gaussian Cox process, multivariate point process, proximal Newton method.

1 Introduction

Highly multivariate point pattern data are becoming increasingly common. Tropical rain forest ecologists, for example, collect data on locations of thousands of trees belonging to hundreds of species. Likewise, huge space-time data sets regarding scene, time and type of crimes are recorded and made publicly available for many major cities across the world. Research on statistical methodology for multivariate point patterns has mainly considered bivariate or trivariate point patterns. Some exceptions are Diggle et al. (2005) and Baddeley et al. (2014) who considered four- and six-variate multivariate Poisson processes and more recently Jalilian et al. (2015) and Waagepetersen et al. (2016) who considered five- and nine-variate multivariate Cox processes. A truly high-dimensional analysis was conducted by Rajala et al. (2018) who introduced a multivariate Gibbs point process and applied it to a point pattern data set containing locations of 83 species of rain forest trees.

A particular challenge regarding modeling of highly multivariate point patterns is that models easily become very complex with large numbers of parameters. To enhance interpretability of fitted models and numerical stability of estimation, Rajala et al. (2018) used regularization methods such as the group lasso. The possibility of using regularization was also mentioned in the discussion of Waagepetersen et al. (2016) in the context of multivariate log Gaussian Cox processes.

The type of multivariate log Gaussian Cox process considered by Waagepetersen et al. (2016) and reviewed in Section 2 has a simple and natural interpretation. For example, it enables the user to decompose variation according to different sources and to group different types of point patterns according to similarities in their spatial distributions; see Waagepetersen et al. (2016) for details. However, the fitting of these models is very challenging in the highly multivariate case due to model complexity. In Section 3 of this paper, we develop a numerically stable and efficient parameter estimation methodology by introducing regularization and using efficient convex optimization algorithms. We test the methodology in a simulation study in Section 4 and apply it to a tropical rain forest data set in Section 5. Section 6 contains some concluding remarks.

2 Multivariate log Gaussian Cox processes

A multivariate log Gaussian Cox point process (see Møller et al., 1998) is a multivariate point process $\mathbf{X}=(X_{1},\ldots,X_{p})$ , $p>1$ , where each component $X_{i}$ , $i=1,\ldots,p$ , is a Cox process driven by a log Gaussian random intensity function $\Lambda_{i}$ . Conditionally on the $\Lambda_{i}$ , the $X_{i}$ are independent Poisson point processes each with intensity function $\Lambda_{i}$ . As in Waagepetersen et al. (2016), we assume that the random intensity functions are of the form $\Lambda_{i}(\mathbf{u})=\exp[Z_{i}(\mathbf{u})]$ with

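$$Z_{i}(\mathbf{u})=\mu_{i}(\mathbf{u})+Y_{i}(\mathbf{u})+U_{i}(\mathbf{u}),\quad\mathbf{u}\in\mathbb{R}^{2}.\tag{1}$$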

The terms $\mu_{i}$ are deterministic and typically given in terms of regressions on observed covariates. The terms $Y_{i}$ and $U_{i}$ are zero-mean Gaussian fields. The $Y_{i}$ can be mutually correlated while the $U_{i}$ are assumed to be independent. The $U_{i}$ are assumed to be stationary with variances $\sigma_{i}^{2}>0$ and correlation functions $c_{i}$ , $i=1,\ldots,p$ . For the $Y_{i}$ we assume that

$$Y_{i}(\mathbf{u})=\sum_{l=1}^{q}\alpha_{il}E_{l}(\mathbf{u})$$

where $q\geq 1$ , $\boldsymbol{\alpha}=[\alpha_{ij}]_{ij}$ is a $p\times q$ real valued coefficient matrix, and the $E_{l}$ , $l=1,\ldots,q$ , are independent zero-mean stationary Gaussian fields with variance one. In our applications we also consider the case $q=0$ meaning that the $Y_{i}$ are omitted in (1). The $Y_{i}$ can be interpreted as effects of unobserved spatial covariates while the $U_{i}$ represent sources of clustering which are specific to each type of points. We denote by $r_{l}$ the correlation function of $E_{l}$ . For the correlation functions $r_{l}$ and $c_{i}$ we introduce isotropic parametric models $r_{l}(\cdot;\phi_{l})=r(\|\cdot\|/\phi_{l})$ and $c_{i}(\cdot;\psi_{i})=c(\|\cdot\|/\psi_{i})$ , where $\phi_{l}$ and $\psi_{i}$ are correlation scale parameters. Specifically, we consider in this paper exponential correlation functions $r(t)=c(t)=\exp(-t)$ , $t\geq 0$ , although many other choices are available (Chilès and Delfiner, 1999).

2.1 Intensity function and pair correlation function

Let $\boldsymbol{\alpha}_{i\cdot}$ denote the $i$th row of $\boldsymbol{\alpha}$. Following Møller et al. (1998), the intensity function of $X_{i}$ is $\rho_{i}(\mathbf{u})=\exp\big[\mu_{i}(\mathbf{u})+\boldsymbol{\alpha}_{i\cdot}\boldsymbol{\alpha}^{{\mbox{\scriptsize\sf T}}}_{i\cdot}/2+\sigma^{2}_{i}/2\big]$, while the cross pair correlation function for the pair $X_{i}$ and $X_{j}$ is

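$$g_{ij}(t)=\exp\Big[\sum_{l=1}^{q}\alpha_{il}\alpha_{jl}r_{l}(t;\phi_{l})+1(i=j)\sigma^{2}_{i}c_{i}(t;\psi_{i})\Big]$$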

for $t\geq 0$. Consider two spatial locations $\mathbf{u}$ and $\mathbf{v}$. Then $\rho_{j}(\mathbf{v})g_{ij}(\|\mathbf{v}-\mathbf{u}\|)$ represents the cross-Palm intensity function (Coeurjolly et al., 2017) and can be interpreted as the intensity function of $X_{j}$ conditional on $\mathbf{u}\in X_{i}$. Hence $g_{ij}(\|\mathbf{v}-\mathbf{u}\|)>1$ ($<1$) implies that the presence of a point from $X_{i}$ at $\mathbf{u}$ increases (decreases) the intensity of $X_{j}$ at $\mathbf{v}$. Thus $\sum_{l=1}^{q}\alpha_{il}\alpha_{jl}r_{l}(t)<0$ ($>0$) implies repulsion (attraction) between points of $X_{i}$ and $X_{j}$ at lag $t$. Similarly, a large value of $\sum_{l=1}^{q}\alpha_{il}^{2}r_{l}(t)+\sigma^{2}_{i}c_{i}(t)$ leads to strong attraction among points of $X_{i}$ separated by a lag $t$.
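
As an illustration of this interpretation, the following minimal sketch evaluates $g_{ij}(t)$ under the exponential correlation model of Section 2. The function name `cross_pcf` and all parameter values are hypothetical and chosen only to show the three regimes (repulsion, independence, attraction).

```python
import math

def cross_pcf(alpha, sigma2, phi, psi, i, j, t):
    """Evaluate g_ij(t) with r_l(t) = exp(-t/phi_l) and c_i(t) = exp(-t/psi_i)."""
    # contribution of the common latent fields E_l
    common = sum(alpha[i][l] * alpha[j][l] * math.exp(-t / phi[l])
                 for l in range(len(phi)))
    # the type-specific field U_i only contributes when i == j
    specific = sigma2[i] * math.exp(-t / psi[i]) if i == j else 0.0
    return math.exp(common + specific)

# hypothetical parameters for p = 3 types and q = 2 common fields
alpha = [[0.5, 0.0], [-0.5, 0.0], [0.0, 0.7]]
sigma2 = [1.0, 1.0, 1.0]
phi = [0.02, 0.1]
psi = [0.01, 0.02, 0.03]

g01 = cross_pcf(alpha, sigma2, phi, psi, 0, 1, 0.05)  # opposite signs: repulsion
g02 = cross_pcf(alpha, sigma2, phi, psi, 0, 2, 0.05)  # no shared field: independence
g00 = cross_pcf(alpha, sigma2, phi, psi, 0, 0, 0.05)  # same type: attraction
```

Types 0 and 1 load on the first common field with opposite signs, so their cross pair correlation falls below one, while types 0 and 2 share no common field and give a value of exactly one.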

Non-parametric kernel estimates of the $g_{ij}$ are given by

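$$\hat{g}_{ij}(t)=\frac{1}{2\pi t}\sum_{\substack{\mathbf{u}\in X_{i}\cap W,\,\mathbf{v}\in X_{j}\cap W\\ \mathbf{u}\neq\mathbf{v}}}\frac{k_{b}(t-\|\mathbf{u}-\mathbf{v}\|)}{\hat{\rho}_{i}(\mathbf{u})\hat{\rho}_{j}(\mathbf{v})|W\cap W_{\mathbf{u}-\mathbf{v}}|},\quad t>0,\tag{3}$$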

where $W$ is the observation window, $k_{b}$ is a kernel function depending on a smoothing parameter $b>0$ , $|\cdot|$ denotes area and $W_{\mathbf{h}}$ denotes the translate of $W$ by the vector $\mathbf{h}\in\mathbb{R}^{2}$ (Møller and Waagepetersen, 2003). The quantities $\hat{\rho}_{i}$ and $\hat{\rho}_{j}$ are estimates of the intensity functions of $X_{i}$ and $X_{j}$ , typically obtained from regression models depending on observed covariates through maximizing the composite likelihood (see e.g. Waagepetersen, 2007; Møller and Waagepetersen, 2007) or its regularized versions (e.g. Thurman et al., 2015; Choiruddin et al., 2018).
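
A minimal sketch of such a kernel estimate is given below. It assumes a rectangular window $[0,\text{width}]\times[0,\text{height}]$, for which $|W\cap W_{\mathbf{u}-\mathbf{v}}|$ has a closed form, and a uniform kernel $k_{b}(x)=1(|x|\leq b)/(2b)$. The function name, its arguments and the convention of passing the intensity estimates as plain functions are assumptions made for illustration only.

```python
import math

def pcf_estimate(xi_pts, xj_pts, rho_i, rho_j, t, b,
                 width=1.0, height=1.0, same_type=False):
    """Kernel estimate of g_ij(t) on the rectangle [0, width] x [0, height]."""
    total = 0.0
    for a, (ux, uy) in enumerate(xi_pts):
        for c, (vx, vy) in enumerate(xj_pts):
            if same_type and a == c:
                continue  # skip pairs with u == v (assumes xi_pts is xj_pts)
            d = math.hypot(ux - vx, uy - vy)
            if abs(t - d) > b:
                continue  # outside the support of the uniform kernel
            # area of W intersected with its translate by u - v
            overlap = (width - abs(ux - vx)) * (height - abs(uy - vy))
            total += (1.0 / (2.0 * b)) / (rho_i(ux, uy) * rho_j(vx, vy) * overlap)
    return total / (2.0 * math.pi * t)

# toy check with one point of each type at distance 0.1 and constant intensity 10
rho = lambda x, y: 10.0
val = pcf_estimate([(0.2, 0.5)], [(0.3, 0.5)], rho, rho, t=0.1, b=0.005)
far = pcf_estimate([(0.2, 0.5)], [(0.3, 0.5)], rho, rho, t=0.2, b=0.005)
```

The double loop is quadratic in the number of points, so for patterns with thousands of points a spatial index or a dedicated point process library would be preferable.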

2.2 Least squares estimation

Let $\boldsymbol{\theta}$ be the parameter vector consisting of the components of $\boldsymbol{\alpha}$ , $\boldsymbol{\sigma}^{2}=(\sigma_{1}^{2},\ldots,\sigma_{p}^{2})^{{\mbox{\scriptsize\sf T}}}$ , $\boldsymbol{\phi}=(\phi_{1},\ldots,\phi_{q})^{{\mbox{\scriptsize\sf T}}},$ and $\boldsymbol{\psi}=(\psi_{1},\ldots,\psi_{p})^{{\mbox{\scriptsize\sf T}}}$ . Let further

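$$\boldsymbol{\beta}_{ij}(\boldsymbol{\alpha},\boldsymbol{\sigma}^{2})=(\alpha_{i1}\alpha_{j1},\ldots,\alpha_{iq}\alpha_{jq})^{\mbox{\scriptsize\sf T}},\ i\neq j,\qquad\boldsymbol{\beta}_{ii}(\boldsymbol{\alpha},\boldsymbol{\sigma}^{2})=(\alpha_{i1}^{2},\ldots,\alpha_{iq}^{2},\sigma_{i}^{2})^{\mbox{\scriptsize\sf T}}.\tag{4}$$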

The objective function used by Waagepetersen et al. (2016) for parameter estimation is of the form

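$$Q(\boldsymbol{\theta})=\sum_{i,j=1}^{p}\|Y_{ij}-X_{ij}(\boldsymbol{\phi},\boldsymbol{\psi})\boldsymbol{\beta}_{ij}(\boldsymbol{\alpha},\boldsymbol{\sigma}^{2})\|^{2},\tag{5}$$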

where

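$$Y_{ij}=\big(\sqrt{w_{ij1}}\log\hat{g}_{ij}(t_{1}),\ldots,\sqrt{w_{ijL}}\log\hat{g}_{ij}(t_{L})\big)^{\mbox{\scriptsize\sf T}},$$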

$\hat{g}_{ij}(t_{k})$, $k=1,\ldots,L$, are obtained using (3) for lags $0<t_{1}<t_{2}<\ldots<t_{L}$ and the $w_{ijk}\geq 0$ are non-negative weights. The matrix $X_{ij}(\boldsymbol{\phi},\boldsymbol{\psi})$ is $L\times q$ ($i\neq j$) or $L\times(q+1)$ ($i=j$) with rows $\sqrt{w_{ijk}}\mathbf{r}(t_{k};\boldsymbol{\phi})$ ($i\neq j$) or $\sqrt{w_{iik}}[\mathbf{r}(t_{k};\boldsymbol{\phi}),c_{i}(t_{k};\psi_{i})]$ ($i=j$), $k=1,\ldots,L$, where

$$\mathbf{r}(t_{k};\boldsymbol{\phi})=(r_{1}(t_{k};\phi_{1}),\ldots,r_{q}(t_{k};\phi_{q})).$$

Waagepetersen et al. (2016) minimized $Q(\boldsymbol{\theta})$ using a standard quasi-Newton method.

2.3 Inference regarding multivariate dependence structure

The model (1) enables us to decompose the covariances of the latent Gaussian fields $Z_{i}$ into contributions from the common fields $E_{l}$ and the type-specific fields $U_{i}$ . Specifically, Waagepetersen et al. (2016) considered for each type $i$ and lag $t$ the proportion of variance (PV) due to the common fields:

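$$\mathrm{PV}_{i}(t)=\frac{\mathrm{cov}[Y_{i}(\mathbf{u}),Y_{i}(\mathbf{u}+\mathbf{h})]}{\mathrm{cov}[Z_{i}(\mathbf{u}),Z_{i}(\mathbf{u}+\mathbf{h})]}=\frac{\sum_{l=1}^{q}\alpha_{il}^{2}r_{l}(t;\phi_{l})}{\sum_{l=1}^{q}\alpha_{il}^{2}r_{l}(t;\phi_{l})+\sigma_{i}^{2}c_{i}(t;\psi_{i})},\quad\|\mathbf{h}\|=t.$$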

These are useful e.g. for grouping species based on how much of the variation is due to common factors respectively type-specific factors. Furthermore, from $\boldsymbol{\alpha}$ and $\boldsymbol{\sigma}^{2}$ we can compute the matrix of lag zero inter-type covariances $\boldsymbol{\alpha}\boldsymbol{\alpha}^{\mbox{\scriptsize\sf T}}$ due to the common latent fields with $ij$ th entry

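$$\mathrm{cov}[Y_{i}(\mathbf{u}),Y_{j}(\mathbf{u})]=\boldsymbol{\alpha}_{i\cdot}\boldsymbol{\alpha}_{j\cdot}^{\mbox{\scriptsize\sf T}}$$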

as well as the lag zero covariances between the fields including both common and type-specific effects,

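$$\mathrm{cov}[Z_{i}(\mathbf{u}),Z_{j}(\mathbf{u})]=\boldsymbol{\alpha}_{i\cdot}\boldsymbol{\alpha}_{j\cdot}^{\mbox{\scriptsize\sf T}}+1[i=j]\sigma_{i}^{2}.$$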

A row $\boldsymbol{\alpha}_{i\cdot}$ informs on the dependence of $X_{i}$ on the common latent fields. Considering the norms of differences $\|\boldsymbol{\alpha}_{i.}-\boldsymbol{\alpha}_{j.}\|$ , we are able to group the different types of point patterns according to their dependence on the latent factors $E_{l}$ .
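
The summaries of this section are simple functions of $\boldsymbol{\alpha}$ and $\boldsymbol{\sigma}^{2}$. The sketch below computes the proportions of variance at lag zero (where $r_{l}(0)=c_{i}(0)=1$), the matrix $\boldsymbol{\alpha}\boldsymbol{\alpha}^{\mbox{\scriptsize\sf T}}$ and the distance between two rows. All function names and the example values are hypothetical.

```python
import math

def pv_at_zero(alpha, sigma2):
    """PV_i(0): share of the lag-zero variance of Z_i due to the common fields."""
    out = []
    for row, s2 in zip(alpha, sigma2):
        common = sum(a * a for a in row)  # r_l(0) = 1 for every l
        out.append(common / (common + s2))  # c_i(0) = 1
    return out

def common_cov(alpha):
    """Lag-zero inter-type covariance matrix alpha alpha^T."""
    return [[sum(a * b for a, b in zip(ri, rj)) for rj in alpha] for ri in alpha]

def row_distance(alpha, i, j):
    """Euclidean norm of alpha_i. - alpha_j., usable for grouping types."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(alpha[i], alpha[j])))

alpha = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
sigma2 = [1.0, 3.0, 1.0]
pv = pv_at_zero(alpha, sigma2)
cov = common_cov(alpha)
d02 = row_distance(alpha, 0, 2)
```

In this toy example types 0 and 2 have identical rows, so their distance is zero and they would be grouped together.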

As discussed in Waagepetersen et al. (2016), the distribution of our multivariate log Gaussian Cox process is invariant to 1) simultaneous permutation of columns in $\boldsymbol{\alpha}$ and the corresponding $\phi_{l}$'s and 2) multiplication of a column in $\boldsymbol{\alpha}$ by $-1$. Thus we cannot identify individual parameters $\alpha_{il}$ and $\phi_{l}$ without imposing constraints on the parameter space.

In our simulation studies in Section 4, we therefore follow Waagepetersen et al. (2016) by restricting attention to identifiable functions of $\boldsymbol{\alpha}$ and $\boldsymbol{\psi}$ such as the aforementioned proportions of variances and covariances and norms of differences between rows of $\boldsymbol{\alpha}$ . In the application, we also consider the percentage of zero entries when $\boldsymbol{\alpha}$ is estimated using elastic net regularization with $\xi>0$ , see next section. The more zeros, the less complex is the dependence structure of the multivariate log Gaussian Cox process.

3 Regularized least squares estimation

The parameter vector $\boldsymbol{\theta}$ is of potentially very high dimension, especially due to the many components of the $p\times q$ parameter matrix $\boldsymbol{\alpha}$. To enhance interpretability and numerical stability of estimation, we propose introducing regularization and thus consider the regularized least squares criterion

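$$Q_{\lambda}(\boldsymbol{\theta})=Q(\boldsymbol{\theta})+\lambda\sum_{i=1}^{p}\sum_{l=1}^{q}p(\alpha_{il})\tag{7}$$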

where $Q(\boldsymbol{\theta})$ is given by (5), $\lambda$ is a nonnegative tuning parameter and $p(\cdot)$ is a convex penalty function. We consider in the following the elastic net penalization (Zou and Hastie, 2005) $p(\alpha_{il})=(1-\xi)\alpha_{il}^{2}/2+\xi|\alpha_{il}|$ , $0\leq\xi\leq 1$ , which embraces LASSO (Tibshirani, 1996) and ridge regression (Hoerl and Kennard, 1988) techniques by setting $\xi=1$ or $\xi=0$ respectively.
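
The penalty term is straightforward to evaluate. The following sketch, with hypothetical function names, computes the elastic net penalty for a single coefficient and adds the total penalty to a given value of $Q(\boldsymbol{\theta})$, which is assumed to have been computed elsewhere.

```python
def elastic_net_penalty(a, xi):
    """p(a) = (1 - xi) * a^2 / 2 + xi * |a| for 0 <= xi <= 1."""
    return (1.0 - xi) * a * a / 2.0 + xi * abs(a)

def penalized_objective(q_value, alpha, lam, xi):
    """Q_lambda = Q + lambda * sum of p(alpha_il) over all entries of alpha."""
    return q_value + lam * sum(elastic_net_penalty(a, xi)
                               for row in alpha for a in row)

alpha = [[1.0, -2.0], [0.0, 0.5]]
lasso_value = penalized_objective(10.0, alpha, lam=2.0, xi=1.0)  # xi = 1: LASSO
ridge_value = penalized_objective(10.0, alpha, lam=2.0, xi=0.0)  # xi = 0: ridge
```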

Using regularization in a related factor analysis was previously suggested by Choi et al. (2010). Their simpler setting corresponds to directly observing vectors $(Z_{i}(u_{k}))_{i=1}^{p}$ , $k=1,\ldots,n$ , where $Z_{i}(u_{k})$ is modeled as in (1) but with zero spatial correlation. In contrast, our $Z_{i}$ are unobserved with spatial correlation modeled via the correlation functions $r_{l}$ and $c_{i}$ . Thus the computational methodology suggested by Choi et al. (2010) is not applicable in our situation.

To minimize (7) with respect to $\boldsymbol{\theta}$ , we employ a cyclical block descent algorithm where $\boldsymbol{\sigma}^{2}$ , $\boldsymbol{\alpha}$ , $\boldsymbol{\phi}$ and $\boldsymbol{\psi}$ are updated in turn. The updating is iterated until relative function convergence of the criterion (7). The details of the block updates are given in the following two sections and Appendices A-B. Pseudo-code for the full algorithm is given in Appendix B.3.

3.1 Update for $\boldsymbol{\sigma}^{2}$ and $\boldsymbol{\alpha}$

Our strategy for updating $\boldsymbol{\sigma}^{2}$ and $\boldsymbol{\alpha}$ is to use for $i=1,\ldots,p$ , a least squares update of $\sigma^{2}_{i}$ followed by an update of $\boldsymbol{\alpha}_{i\cdot}$ using a cyclical coordinate descent algorithm. The motivation for updating rows $\boldsymbol{\alpha}_{i\cdot}$ instead of other subsets of $\boldsymbol{\alpha}$ is that the update of $\boldsymbol{\alpha}_{i\cdot}$ , keeping all other parameters fixed, is quite close to a standard least squares problem, as will be evident in the following.

The relevant part of the objective function for the updates of $\sigma_{i}^{2}$ and $\boldsymbol{\alpha}_{i\cdot}$ given all other parameters is

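$$Q_{\lambda,i}(\boldsymbol{\alpha}_{i\cdot},\sigma_{i}^{2})=2\sum_{j=1,\,j\neq i}^{p}\|Y_{ij}-\tilde{X}_{ij}\boldsymbol{\alpha}_{i\cdot}\|^{2}+\|Y_{ii}-X_{ii}\boldsymbol{\beta}_{ii}(\boldsymbol{\alpha},\boldsymbol{\sigma}^{2})\|^{2}+\lambda\sum_{l=1}^{q}p(\alpha_{il})\tag{8}$$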

where the $l$ th column of $\tilde{X}_{ij}$ is the $l$ th column of $X_{ij}$ multiplied by $\alpha_{jl}$ . In other words, for $i\neq j$ , $\tilde{X}_{ij}=X_{ij}\text{Diag}(\alpha_{j1},\ldots,\alpha_{jq})$ where $\text{Diag}(\alpha_{j1},\ldots,\alpha_{jq})$ is the diagonal matrix with diagonal entries $\alpha_{j1},\ldots,\alpha_{jq}$ . For ease of notation we here omit the dependence of $\tilde{X}_{ij}$ and $X_{ii}$ on the fixed parameters $\boldsymbol{\psi}$ and $\boldsymbol{\phi}$ . Note that (8) is equivalent to a standard least squares objective function for $\boldsymbol{\alpha}_{i\cdot}$ except for the middle term that depends on $\alpha_{il}^{2}$ , $l=1,\ldots,q$ , cf. (4).

The minimization of $Q_{\lambda,i}$ with respect to $\sigma^{2}_{i}$ only involves the middle term in (8). This is a standard least squares problem except that we require $\sigma^{2}_{i}$ to be non-negative. Thus,

[TABLE]

An explicit formula for this update is given in Appendix B.1.

To update $\boldsymbol{\alpha}_{i\cdot}$ (given $\sigma_{i}^{2}$ and all other parameters), we use a so-called proximal Newton update (Lee et al., 2014, and Appendix A) where the middle term in (8) is replaced by a quadratic approximation around the current value $\boldsymbol{\alpha}_{i\cdot}^{(k)}$ . We denote by $\hat{Q}_{\lambda,i}(\boldsymbol{\alpha}_{i\cdot},\sigma^{2}_{i}|\boldsymbol{\alpha}_{i\cdot}^{(k)})$ the resulting approximate objective function (to be detailed in the next paragraph). Since $\hat{Q}_{\lambda,i}(\boldsymbol{\alpha}_{i\cdot},\sigma^{2}_{i}|\boldsymbol{\alpha}_{i\cdot}^{(k)})$ is a regularized linear least squares objective function, minimization can be performed using a standard coordinate descent algorithm (see e.g. Hastie et al., 2015).
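
For readers unfamiliar with coordinate descent, the sketch below solves a generic elastic net regularized linear least squares problem of the form $\|y-Xa\|^{2}+\lambda\sum_{l}\{(1-\xi)a_{l}^{2}/2+\xi|a_{l}|\}$ by cycling over coordinates with a soft-thresholding update. It is a textbook-style illustration written for this exposition, not the explicit update formula of Appendix B.2, and the names `soft_threshold` and `coordinate_descent` are hypothetical.

```python
def soft_threshold(z, gamma):
    """Shrink z towards zero by gamma, returning 0.0 inside [-gamma, gamma]."""
    if z > gamma:
        return z - gamma
    if z < -gamma:
        return z + gamma
    return 0.0

def coordinate_descent(X, y, lam, xi, n_sweeps=200):
    """Minimize ||y - X a||^2 + lam * sum((1 - xi) * a_l^2 / 2 + xi * |a_l|)."""
    n, q = len(X), len(X[0])
    a = [0.0] * q
    resid = list(y)  # equals y - X a for the starting value a = 0
    for _ in range(n_sweeps):
        for l in range(q):
            col = [X[k][l] for k in range(n)]
            norm2 = sum(c * c for c in col)
            # inner product of column l with the residual that excludes a_l
            z = sum(c * (r + c * a[l]) for c, r in zip(col, resid))
            denom = 2.0 * norm2 + lam * (1.0 - xi)
            new = soft_threshold(2.0 * z, lam * xi) / denom if denom > 0 else 0.0
            delta = new - a[l]
            if delta != 0.0:
                resid = [r - c * delta for r, c in zip(resid, col)]
                a[l] = new
    return a

ols = coordinate_descent([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]], [1.0, 2.0, 3.0], 0.0, 1.0)
sparse = coordinate_descent([[1.0, 0.0], [0.0, 1.0]], [3.0, -1.0], 4.0, 1.0)
```

With $\lambda=0$ the iteration converges to the ordinary least squares solution, while the LASSO case ($\xi=1$) sets small coefficients exactly to zero.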

A very simple quadratic approximation of the middle term of (8) is

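$$\|Y_{ii}-X_{ii}\boldsymbol{\beta}_{ii}(\boldsymbol{\alpha},\boldsymbol{\sigma}^{2})\|^{2}\approx\|Y_{ii}-\tilde{X}_{ii}^{k}[\boldsymbol{\alpha}_{i\cdot}^{\mbox{\scriptsize\sf T}},\sigma_{i}^{2}]^{\mbox{\scriptsize\sf T}}\|^{2},$$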

where $\tilde{X}_{ii}^{k}=X_{ii}\text{Diag}\big{\{}\alpha_{i1}^{(k)},\ldots,\alpha_{iq}^{(k)},1\big{\}}$ . Nevertheless, the curvature of this quadratic approximation does not match the curvature of the original term at $\boldsymbol{\alpha}_{i\cdot}^{(k)}$ . Instead we use a second-order Taylor approximation as detailed in the Appendix A.1 which results in the explicit expression for $\hat{Q}_{\lambda,i}(\boldsymbol{\alpha}_{i\cdot},\sigma^{2}_{i}|\boldsymbol{\alpha}_{i\cdot}^{(k)})$ given by

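$$\hat{Q}_{\lambda,i}(\boldsymbol{\alpha}_{i\cdot},\sigma_{i}^{2}|\boldsymbol{\alpha}_{i\cdot}^{(k)})=\sum_{j=1}^{p}\|Y_{ij}^{*}-X_{ij}^{*}\boldsymbol{\alpha}_{i\cdot}\|^{2}+\lambda\sum_{l=1}^{q}p(\alpha_{il}),$$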

where

[TABLE]

and $X_{ii,\cdot(1:q)}$ denotes the first $q$ columns in $X_{ii}$ .

We obtain

[TABLE]

using coordinate descent with an explicit formula for the updates given in Appendix B.2. Further, define for some $t>0$ ,

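$$\boldsymbol{\alpha}_{i\cdot}^{(k+1)}=\boldsymbol{\alpha}_{i\cdot}^{(k)}+t(\hat{\boldsymbol{\alpha}}_{i\cdot}-\boldsymbol{\alpha}_{i\cdot}^{(k)}).$$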

Thus, $\boldsymbol{\alpha}_{i\cdot}^{(k+1)}$ is obtained using $(\hat{\boldsymbol{\alpha}}_{i\cdot}-\boldsymbol{\alpha}_{i\cdot}^{(k)})$ as a search direction with step size controlled by $t$. Following Lee et al. (2014, Proposition 2.3), one can show (see Appendix A.2) that $Q_{\lambda,i}(\boldsymbol{\alpha}_{i\cdot}^{(k+1)})<Q_{\lambda,i}(\boldsymbol{\alpha}_{i\cdot}^{(k)})$ if $t$ is small enough. That is, if the minimization of $\hat{Q}_{\lambda,i}$ is combined with a line search, the resulting update is guaranteed to decrease the objective function $Q_{\lambda,i}$ given in (8).
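
A minimal sketch of such a damped update is shown below. It uses plain step halving until the objective decreases, which is one simple way to implement a line search and is an assumption of this illustration; the function name `damped_update` and the toy objective are hypothetical.

```python
def damped_update(objective, a_current, a_hat, shrink=0.5, max_halvings=30):
    """Move from a_current towards a_hat, halving t until the objective decreases."""
    f0 = objective(a_current)
    t = 1.0
    for _ in range(max_halvings):
        candidate = [a + t * (h - a) for a, h in zip(a_current, a_hat)]
        if objective(candidate) < f0:
            return candidate, t
        t *= shrink
    return list(a_current), 0.0  # no decrease found: keep the current value

# toy non-quadratic objective where the full step t = 1 overshoots
objective = lambda a: (a[0] ** 2 - 1.0) ** 2
new_value, step = damped_update(objective, [0.5], [3.0])
```

Here the full step and the half step both increase the objective, and the update is accepted at $t=0.25$.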

3.2 Update for $\boldsymbol{\psi}$ and $\boldsymbol{\phi}$

To update $\boldsymbol{\phi}$ and $\boldsymbol{\psi}$ given all other parameters, we first reparameterize the objective function in terms of ${\bf f}=(\log\phi_{1},\ldots,\log\phi_{q})^{\mbox{\scriptsize\sf T}}$ and ${\bf s}=(\log\psi_{1},\ldots,\log\psi_{p})^{\mbox{\scriptsize\sf T}}$. We then update ${\bf f}$ and ${\bf s}$ in turn using a standard quasi-Newton update as implemented in the optim routine in the R language with method BFGS (Broyden-Fletcher-Goldfarb-Shanno update). Finally, we transform back using the exponential to get updates of $\boldsymbol{\phi}$ and $\boldsymbol{\psi}$.

We also tried other options: joint update of $(\boldsymbol{\phi},\boldsymbol{\psi})$ without log-transformation but introducing box constraints to avoid negative values and joint quasi-Newton update of the log-transformed parameters $(\bf f,\bf s)$ . For simulated data examples, the option with separate updates of $\bf f$ and $\bf s$ performed best.

3.3 Initialization

We initialize the components of $\boldsymbol{\alpha}$ by a sample of independent normal variates with mean zero and standard deviation 0.05, and we set the initial values of the components of $\boldsymbol{\sigma}^{2}$ to 1. For $\boldsymbol{\phi}$ and $\boldsymbol{\psi}$ we choose initial values that depend on the scale of the observation window, so that the corresponding correlation functions do not become essentially constant, equal to zero (too small initial values) or to one (too large initial values). For the unit square observation window, for example, the initial values for $\boldsymbol{\phi}$ and $\boldsymbol{\psi}$ were chosen randomly from the uniform distribution on $[0.01,0.05]$. Regarding the choice of weights $w_{ijk}$ introduced in Section 2.2, we follow arguments by Waagepetersen et al. (2016) and fix, for $i,j=1,\ldots,p$ and $k=1,\ldots,L$, $w_{ijk}=\hat{g}_{ij}(t_{k})/2$ for $i\neq j$ and $w_{iik}=\hat{g}_{ii}(t_{k})$.
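
The initialization can be sketched as follows. Multiplying the uniform draws by a `window_scale` factor (equal to 1 for the unit square) is an assumption about how the dependence on the window scale might be implemented, and the function names are hypothetical.

```python
import random

def initialize(p, q, window_scale=1.0, seed=None):
    """Draw starting values for alpha, sigma^2, phi and psi."""
    rng = random.Random(seed)
    alpha = [[rng.gauss(0.0, 0.05) for _ in range(q)] for _ in range(p)]
    sigma2 = [1.0] * p
    # assumption: the [0.01, 0.05] range for the unit square scales linearly
    phi = [window_scale * rng.uniform(0.01, 0.05) for _ in range(q)]
    psi = [window_scale * rng.uniform(0.01, 0.05) for _ in range(p)]
    return alpha, sigma2, phi, psi

def weights(g_hat, i, j):
    """w_ijk = g_hat_ij(t_k) / 2 for i != j and w_iik = g_hat_ii(t_k)."""
    return [g if i == j else g / 2.0 for g in g_hat]

alpha0, sigma20, phi0, psi0 = initialize(p=5, q=2, seed=1)
w_cross = weights([2.0, 4.0], 0, 1)
w_diag = weights([2.0, 4.0], 1, 1)
```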

3.4 Strategy to determine $q$ and regularization parameters $\lambda$ and $\xi$

In our applications we consider just a few values $\xi=0$ (ridge), $\xi=0.5$ (mix of ridge and LASSO, i.e. elastic net) and $\xi=1$ (LASSO). For each of the values of $\xi$ we use a two-dimensional $K$ -fold cross validation approach to select optimal values $\lambda_{\text{opt}}$ and $q_{\text{opt}}$ among prespecified values $\lambda_{1},\ldots,\lambda_{M}$ and $q_{1},\ldots,q_{N}$ (e.g. Hastie et al., 2013, Chapter 7). The procedure is as follows.

Step 1. We split the indices $ijk$ ($i,j=1,\ldots,p$ and $k=1,\ldots,L$) into $K$ sets $S_{1},\ldots,S_{K}$ (see details below).

Step 2. For each $\lambda\in\{\lambda_{1},\ldots,\lambda_{M}\}$ and $q\in\{q_{1},\ldots,q_{N}\}$, we obtain an estimate $\boldsymbol{\hat{\theta}}_{c}$ by minimizing (7) with $w_{ijk}$ replaced by 0 for $ijk\in S_{c}$, $c=1,\ldots,K$. The cross validation score for $\lambda$ and $q$ is then obtained by

$$\mathrm{CV}(\lambda,q)=\frac{1}{K}\sum_{c=1}^{K}\mathrm{CV}_{c},\tag{12}$$

where $\mathrm{CV}_{c}=\sum_{ijk\in S_{c}}(Y_{ijk}-\hat{Y}_{ijk}(\boldsymbol{\hat{\theta}}_{c}))^{2}$ and $\hat{Y}_{ij}(\boldsymbol{\hat{\theta}}_{c})=X_{ij}(\hat{\boldsymbol{\phi}}_{c},\hat{\boldsymbol{\psi}}_{c})\boldsymbol{\beta}_{ij}(\hat{\boldsymbol{\alpha}}_{c},\hat{\boldsymbol{\sigma}}^{2}_{c})$.

Step 3. To obtain $\lambda_{\text{opt}}$ and $q_{\text{opt}}$, we minimize $\mathrm{CV}(\lambda,q)$ with respect to $\lambda$ and $q$, i.e.,

$$(\lambda_{\text{opt}},q_{\text{opt}})=\operatorname*{arg\,min}_{m=1,\ldots,M,\;n=1,\ldots,N}\mathrm{CV}(\lambda_{m},q_{n}).\tag{13}$$

The sets $S_{c}$ in Step 1 need to be chosen carefully. First, since $\log(\hat{g}_{ijk})$ and $\log(\hat{g}_{ijk^{\prime}})$ are strongly correlated when $k$ and $k^{\prime}$ are close, we leave out blocks of consecutive indices. Second, we do not include diagonal indices $iik$ in the sets $S_{c}$ since the values $Y_{iik}$ include contributions from the type-specific random fields. The diagonal values thus provide little information about $q$, and omitting them from the fit would moreover make the estimation procedure less stable regarding $\boldsymbol{\sigma}^{2}$ and $\boldsymbol{\psi}$. So, to determine each subset $S_{c}$, we arrange the $ijk$ with $i<j$ lexicographically in a vector $(121,122,\ldots)$ and split this vector into consecutive blocks of length $b$. These blocks are then assigned to the different $S_{c}$ at random.
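
The construction of the sets $S_{c}$ can be sketched as follows. The source only states that blocks are assigned at random; shuffling the blocks and dealing them out in turn, so that the folds have nearly equal sizes, is an assumption of this sketch, as is the function name `make_cv_folds`.

```python
import random

def make_cv_folds(p, L, K, b, seed=None):
    """Split the off-diagonal indices (i, j, k), i < j, into K folds of blocks."""
    # lexicographic order: (1,2,1), (1,2,2), ..., matching the vector (121, 122, ...)
    triples = [(i, j, k) for i in range(1, p + 1)
               for j in range(i + 1, p + 1) for k in range(1, L + 1)]
    # consecutive blocks of length b keep neighbouring lags in the same fold
    blocks = [triples[s:s + b] for s in range(0, len(triples), b)]
    rng = random.Random(seed)
    rng.shuffle(blocks)
    folds = [[] for _ in range(K)]
    for n, block in enumerate(blocks):
        folds[n % K].extend(block)  # assumption: deal shuffled blocks in turn
    return folds

folds = make_cv_folds(p=5, L=25, K=8, b=5, seed=0)
```

With $p=5$ and $L=25$ there are $10\times 25=250$ off-diagonal indices, which are distributed over the $K=8$ folds in blocks of five consecutive lags.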

The one standard error (1-SE) rule is an alternative way to select $\lambda$ and $q$ based on the CV scores obtained from (12) (e.g. Hastie et al., 2013). When $q$ is fixed, the 1-SE rule chooses the largest $\lambda$ for which the CV score is less than the smallest CV score plus one standard error. When both $\lambda$ and $q$ are to be selected, we adapt the 1-SE rule by starting with $(\lambda_{\text{opt}},q_{\text{opt}})$ given by (13) and then choosing $(\lambda,q)$ with the smallest $q$ and largest $\lambda$ possible such that the following condition holds:

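$$\mathrm{CV}(\lambda,q)\leq\mathrm{CV}(\lambda_{\text{opt}},q_{\text{opt}})+\mathrm{SE}(\lambda_{\text{opt}},q_{\text{opt}}),$$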

where

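$$\mathrm{SE}(\lambda_{\text{opt}},q_{\text{opt}})=\sqrt{\frac{\sum_{c=1}^{K}\big(\mathrm{CV}_{c}-\mathrm{CV}(\lambda_{\text{opt}},q_{\text{opt}})\big)^{2}}{(K-1)K}}.$$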

Hence, the 1-SE rule attempts to select the simplest model whose CV score is within one standard error of the minimal CV score.
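
The selection rule can be sketched as below, given the fold-wise scores $\mathrm{CV}_{c}$ for each grid point. The standard error is computed as the square root of the fold variance at the minimizer divided by $K$. Resolving "smallest $q$ and largest $\lambda$" by first minimizing $q$ and then maximizing $\lambda$ is an assumption, as are the function name and the toy scores.

```python
import math

def one_se_choice(fold_scores):
    """fold_scores[(lam, q)] = [CV_1, ..., CV_K]; returns (Min choice, 1-SE choice)."""
    mean = {key: sum(v) / len(v) for key, v in fold_scores.items()}
    opt = min(mean, key=mean.get)  # minimizer of the CV score
    v = fold_scores[opt]
    K = len(v)
    se = math.sqrt(sum((c - mean[opt]) ** 2 for c in v) / ((K - 1) * K))
    admissible = [key for key in mean if mean[key] <= mean[opt] + se]
    # assumption: prefer the smallest q first, then the largest lambda
    simplest = min(admissible, key=lambda key: (key[1], -key[0]))
    return opt, simplest

scores = {
    (0.0, 2): [1.0, 1.2, 0.8],
    (1.0, 2): [1.05, 1.1, 1.0],
    (1.0, 1): [1.1, 1.0, 1.05],
    (0.0, 1): [2.0, 2.0, 2.0],
}
opt, simplest = one_se_choice(scores)
```

In this toy grid the minimal score is attained at $(\lambda,q)=(0,2)$, while the 1-SE rule moves to the sparser and smaller model $(1,1)$, whose score lies within one standard error of the minimum.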

Finally, note that when $\xi=0.5$ or $\xi=1$ and $\lambda>0$ is chosen, the resulting estimate of $\boldsymbol{\alpha}$ may contain columns that consist entirely of zeros. The effective number $q_{\text{eff}}$ of columns in $\boldsymbol{\alpha}$ then becomes smaller than $q_{\text{opt}}$ .
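
Counting the effective number of columns is a one-line computation. The sketch below uses a hypothetical function `effective_q` with an optional tolerance for treating tiny values as zero.

```python
def effective_q(alpha, tol=0.0):
    """Number of columns of alpha with at least one entry exceeding tol in absolute value."""
    q = len(alpha[0])
    return sum(1 for l in range(q) if any(abs(row[l]) > tol for row in alpha))

alpha_hat = [[0.3, 0.0, 0.0], [0.0, 0.0, -0.2]]  # middle column is entirely zero
q_eff = effective_q(alpha_hat)
```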

4 Simulation study

We conduct two simulation studies to evaluate the regularized least squares technique for parameter estimation and the cross-validation (CV) method to select $q$ and $\lambda$ . The setting of the first study corresponds to the simulation study in Waagepetersen et al. (2016). We first compare the estimates obtained using the new cyclical block descent (CBD) algorithm developed in Section 3 with the method proposed by Waagepetersen et al. (2016). In this regard, we consider values of $q=1,\ldots,5$ and for comparison purposes, we fix $\lambda=0$ since regularization was not used in Waagepetersen et al. (2016). Next we consider only the new algorithm with the objective of comparing different CV options for selecting $q$ and $\lambda$ , cf. Section 3.4, and to study the effect of regularization. The second study has the same objective but with a more complex setting for the simulations. In both simulation studies we use $K=8$ for the CV and we only consider the LASSO option ( $\xi=1$ ) for regularization.

To assess the parameter estimates, we consider the root mean squared errors (RMSEs) of the estimates. For a real parameter $\omega$ and estimate $\hat{\omega}$ , the RMSE is

[TABLE]

For each of the parameter matrices/vectors $\boldsymbol{\alpha}\boldsymbol{\alpha}^{{\mbox{\scriptsize\sf T}}}$ , $\boldsymbol{\sigma}^{2}$ , $\boldsymbol{\psi}$ , or the vector of proportions of variances at lag 0 (PV), we evaluate the average of RMSEs for the components in these quantities. For example, we compute the average of RMSEs for each entry in the $p\times p$ matrix $\boldsymbol{\alpha}\boldsymbol{\alpha}^{\mbox{\scriptsize\sf T}}$ .
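The averaging over components can be sketched as follows. We assume the usual empirical RMSE, the square root of the mean squared deviation of the simulated estimates from the true value; the function names and the numbers are illustrative only.

```python
import math


def rmse(estimates, truth):
    """Empirical RMSE of scalar estimates around the true value."""
    return math.sqrt(sum((e - truth) ** 2 for e in estimates) / len(estimates))


def average_rmse(estimates_per_component, truths):
    """Average of the component-wise RMSEs of a (flattened) parameter vector or matrix."""
    values = [rmse(est, tr) for est, tr in zip(estimates_per_component, truths)]
    return sum(values) / len(values)


# Two components with three simulated estimates each (made-up numbers).
est = [[1.0, 1.2, 0.8], [0.1, 0.3, 0.2]]
true = [1.0, 0.2]
avg = average_rmse(est, true)
```

For a matrix such as $\boldsymbol{\alpha}\boldsymbol{\alpha}^{\mbox{\scriptsize\sf T}}$ the entries would first be flattened into one list of components.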

4.1 Comparison of methods for least squares estimation

The first study follows the one in Waagepetersen et al. (2016) for which 200 point patterns in $W=[0,1]^{2}$ are generated from multivariate log Gaussian Cox processes as defined in Section 2, with $p=5$ and $q=2$ . The true parameters are: $\boldsymbol{\sigma}^{2}=(1,1,1,1,1),\;\boldsymbol{\psi}=(0.01,0.02,0.02,0.03,0.04),\;\boldsymbol{\phi}=(0.02,0.1)$ and

[TABLE]

The trend models $\mu_{i}(u)=m_{i}$ are set such that the expected number of points is 1000 for each $i=1,\ldots,5$ . A uniform kernel with bandwidth 0.005 is used for the non-parametric estimation of the cross pair correlation function at $L=25$ equispaced lags between 0.025 and 0.25.

For each simulation we compare two methods for minimizing (7) with $\lambda=0$ and $q\in\{1,\cdots,5\}$ :

1. The standard quasi-Newton (SQN) optimization algorithm considered by Waagepetersen et al. (2016) and implemented in the R package optimx. This algorithm updates all parameters jointly.

2. The new CBD algorithm described in Section 3.

The comparison is in terms of minimization of the objective function, computing time and RMSEs.

Table 1 reports the averages of the values of the minimized objective functions and the computational times over the 200 simulations. All timings were obtained on a Dell R740 with 2 x 14 cores (Intel(R) Xeon(R) Gold 6132 CPU @ 2.60GHz), 768 GB RAM, 2 x 200 GB SSD and 960 GB NVMe. CBD performs considerably better than SQN in terms of minimizing the objective function (for example 1.96 versus 4.76 at the true $q=2$ ). SQN is somewhat faster than CBD for $q\leq 3$ but slower for $q\geq 4$ . The computing times for SQN grow quite quickly with increasing $q$ (from 0.96 to 8.99 seconds) while the computing times seem more stable for CBD (from 1.99 to 5.92 seconds).

The RMSE results are shown in Table 2. For the calculation of the RMSEs, we exclude small percentages of very extreme parameter estimates. These percentages are reported in the last column of Table 2. CBD performs better than SQN since smaller RMSEs are obtained and there are no outlying parameter estimates. For SQN quite large percentages of extreme parameter estimates are observed.

4.2 Assessment of cross-validation and regularization methods with $p=5$

In this section we continue with the simulations from the previous setting but restrict attention to CV selection of $q$ and $\lambda$ using CBD for optimization with the LASSO regularization ( $\xi=1$ ). We select values of $q$ in $\boldsymbol{q}=\{1,2,3,4,5\}$ and values of $\lambda$ in $\boldsymbol{\lambda}=\{0,10^{-3},\ldots,5\}$ which has 20 elements and where the non-zero values of $\boldsymbol{\lambda}$ grow log-linearly from $\log 10^{-3}$ to $\log 5$ . We consider three situations: (1) we select $q$ from $\boldsymbol{q}$ with $\lambda=0$ fixed, thus least squares estimation (LSE) is performed; (2) we search for the jointly optimal $(q,\lambda)$ ; (3) we fix $q=5$ and select $\lambda$ from $\boldsymbol{\lambda}$ . Recall that the selection of a relatively big $\lambda$ may lead to zero columns in the $\boldsymbol{\alpha}$ estimate. We therefore consider the effective $q_{\text{eff}}$ as defined in Section 3.4. Thereby we can also evaluate the selection of $q$ in situation (3). In case of (2) we both consider the minimum CV (Min) and the 1-standard error (1-SE) rules to select $q$ and $\lambda$ .
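The grid $\boldsymbol{\lambda}$ can be reproduced as below. This is a sketch under the assumption that the 20 elements consist of 0 followed by 19 values equispaced on the log scale between $10^{-3}$ and 5; the function name `lambda_grid` is ours.

```python
import math


def lambda_grid(lam_min=1e-3, lam_max=5.0, n_nonzero=19):
    """Return [0] followed by n_nonzero values equispaced on the log scale."""
    step = (math.log(lam_max) - math.log(lam_min)) / (n_nonzero - 1)
    nonzero = [math.exp(math.log(lam_min) + k * step) for k in range(n_nonzero)]
    return [0.0] + nonzero


grid = lambda_grid()
```

Under this assumption the grid contains values that round to 0.11, 0.29 and 1.94, which agrees with the values of $\lambda$ reported as selected in Section 5.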

Table 3 shows the distribution of the absolute distance between $q_{\text{eff}}$ and the true $q=2$ . For LSE, using the Min rule, $q_{\text{eff}}$ coincides with the true $q$ for 47% of the simulations and differs at most by 1 from the true $q$ in 75% of the simulations. The results with the 1-SE rule are similar, with percentages 46% and 78%. LASSO with the Min rule for joint selection of $(q,\lambda)$ performs similarly to LSE, with corresponding percentages 42% and 74%. With fixed $q=5$ the percentages are reduced to 16% and 53%. Using the 1-SE rule, the LASSO forces many columns to be zero, leading to quite small percentages where $|q_{\text{eff}}-2|\leq 1$ .

RMSEs are reported in Table 4 for all three situations. In addition, in the first columns, we consider the case fixed $q=2$ assuming the true $q$ is known. We first note that LASSO gives worse results than LSE when $q=2$ is fixed. In general, for unknown $q$ , LSE and LASSO perform quite similarly when the Min rule is used. The results are worse when 1-SE is used and in particular for LASSO. When $q$ is fixed to $5$ and only $\lambda$ is selected the results are worse than for LASSO with $q$ selected by the Min rule while the results with $q=5$ are similar to LASSO with $q$ selected by the 1-SE rule.

The overall impression is that LSE performs slightly better than LASSO, especially in estimating $\boldsymbol{\alpha}\boldsymbol{\alpha}^{\mbox{\scriptsize\sf T}}$ . This may indicate that when $p$ is relatively small, selection of $q$ with $\lambda=0$ (LSE) already gives sparse results. Another reason that LASSO does not improve RMSE may be that the true $\boldsymbol{\alpha}$ is not that sparse having only 40% zero components. Thus the bias introduced by regularization is not counterbalanced by a reduction in variance. Also, the 1-SE rule does not seem preferable in this situation. In the next section we consider a more complex setting with $p=10$ .

4.3 Assessment of cross-validation and regularization methods with $p=10$

In this experiment, we study a more complex situation with a higher $p$ and more variation in the parameters. We simulate 200 point patterns from a multivariate log Gaussian Cox process with $p=10$ , $q=4$ , $W=[0,1]^{2}$ , and parameters

[TABLE]

and $\boldsymbol{\psi}$ equal to

[TABLE]

The settings for the trend models, the kernel estimation and the cross validation are as in the previous simulation study except that $\boldsymbol{q}=\{0,\ldots,8\}$ . In $\boldsymbol{\alpha}$ , 40% of the components are zeros and 20% are of absolute value less than 0.15. The remaining components have absolute value greater than 0.7.

Table 5 shows the distribution of the absolute distance $|q_{\text{eff}}-4|$ between $q_{\text{eff}}$ and the true $q=4$ . Considering first the Min rule, with LSE, $q_{\text{eff}}$ concurs with the true $q$ in 19% of the simulations and differs at most by 2 from the true $q$ in 58% of the simulations. The corresponding percentages are 14% and 65% for LASSO, and 6% and 41% for LASSO with $q=8$ fixed. In this situation, the 1-SE rule seems advantageous for selecting $q$ . For example, the percentage of $q_{\text{eff}}$ values that differ from the true $q$ by at most 2 improves from 58% to 83% for LSE, from 65% to 80% for LASSO, and from 41% to 68% for LASSO with fixed $q=8$ .

Table 6 details the RMSE results. The superiority of the 1-SE rule when selecting $q$ does not translate into better results in terms of RMSE except for LASSO with fixed $q=8$ where better results are obtained with 1-SE than with Min. The best results are obtained with LASSO using the Min rule for selecting $q$ and $\lambda$ . This indicates that regularization is indeed helpful in complex settings with relatively large $p$ .

Based on the simulation studies, for analyzing highly multivariate point pattern data, we recommend using regularization with the Min rule for selecting $q$ and $\lambda$ .

5 Application

In a 50-hectare $1,000\;\mathrm{m}\times 500\;\mathrm{m}$ region of the tropical moist forest of Barro Colorado Island (BCI) in central Panama, censuses have been carried out where all free-standing woody stems with at least 10 mm diameter at breast height were identified, tagged, and mapped, resulting in maps of over 350,000 individual trees with around 300 species (see e.g. Hubbell and Foster, 1983; Condit et al., 1996; Condit, 1998). In addition, 13 spatial covariates are also available containing topological attributes and soil nutrients (see Figure 5). Our main objective is to study the impact of regularization and the computational feasibility of our method. We first consider 9 tree species, Psychotria, Protium t., Capparis, Protium p., Swartzia, Hirtella, Tetragastris, Garcinia, Mourmiri, with intermediate abundances ranging from 2500 to 7500 and previously analyzed by Waagepetersen et al. (2016). The plots of locations of each species are shown in Figure 6. The main aim of this analysis is to compare the results with our new algorithm to those obtained by Waagepetersen et al. (2016). Secondly, to test our algorithm in a more challenging situation, we analyze a highly multivariate point pattern involving species of trees with at least 400 individuals, resulting in 86 species.

For each species, we use maximum composite likelihood to fit log-linear regression models involving the spatial covariates for the $\mu_{i}$ -terms in (1). We then estimate the cross pair correlation functions using (3). The variation due to observed covariates is thereby filtered out, so the non-parametric estimates of the cross pair correlation functions capture the residual correlation due to unobserved covariates, species-specific factors, and any other sources.

5.1 Application with 9 species

For each value of $\xi=0,0.5,1$ we apply $8$ -fold CV to select $q$ and $\lambda$ where $\lambda\in\boldsymbol{\lambda}=\{0,10^{-3},\ldots,5\}$ as in the simulation studies and $q\in\boldsymbol{q}=\{0,\ldots,9\}$ . The upper left plot in Figure 1 shows for each $\xi$ , $\min_{\lambda\in\boldsymbol{\lambda}}\mathrm{CV}(q,\lambda)$ as a function of $q$ . For comparison with Waagepetersen et al. (2016) we also show in this plot $\mathrm{CV}(q,0)$ against $q$ (LSE). A general pattern for ridge, elastic net and LASSO is that the cross validation scores decrease quite quickly as a function of $q$ until around $q=4$ and after that the CV scores stabilize or decrease slowly. The CV scores for ridge ( $\xi=0$ ) are consistently smaller than those for elastic net ( $\xi=0.5$ ) and LASSO ( $\xi=1$ ). Hence we select $\xi=0$ . The minimal CV score for $\xi=0$ is obtained with $q=9$ . However, in the interest of model simplicity, we choose $q=4$ and $\lambda=0.29$ since the decrease in CV score is rather minor from $q=4$ to $q=9$ .

For comparison, the minimal CV score with LASSO is obtained with $q=8$ and $\lambda=0.11$ . However, in this case, the resulting effectively selected $q_{\text{eff}}$ is three since the resulting estimate of $\boldsymbol{\alpha}$ has 5 zero columns. In case of LSE ( $\lambda=0$ ), the CV procedure chooses $q=1$ . The second-smallest CV with LSE is obtained with $q=4$ which was the value chosen in Waagepetersen et al. (2016). The difference in cross validation results for LSE compared with Waagepetersen et al. (2016) is due to our new more efficient optimization algorithm, cf. the comparison in Section 4.1.

The middle plot in Figure 1 is an image plot of the CV scores for ridge ( $\xi=0$ ) where darker color corresponds to smaller CV score. The development of the CV scores across values of $q$ for fixed $\lambda$ appears quite erratic with several local minima. In contrast, for each $q$ there appears to be a well-defined minimum for $\lambda$ . As an example, the right plot in Figure 1 shows $\mathrm{CV}(4,\lambda)$ plotted against $\log\lambda$ (where we replace the undefined $\log 0$ by $\log(5\times 10^{-4})$ ). The computing time required to run the CV method with $\xi=0$ is $2.4$ hours with the same processor as used in the simulation study. Approximately $16$ seconds is required to estimate the parameters for the 9-species application using ridge with $q=4$ and $\lambda=0.29$ .

The results regarding the multivariate dependence structure of the 9 species are qualitatively similar to those obtained by Waagepetersen et al. (2016). The estimated inter-species correlations $\mathrm{corr}\{Z_{i}(u),Z_{j}(u)\}$ , cf. (6), are shown in the left plot of Figure 2. Most of the pairs of species have a positive correlation. However, the correlations between Psychotria and the other species are mainly close to zero. The right plot in Figure 2 shows a hierarchical clustering of the species based on the estimated coefficient rows $\boldsymbol{\alpha}_{i\cdot}$ , where Psychotria appears to form its own cluster in agreement with the estimated inter-species correlations. This clustering may have some relation to the families of species as shown by the cluster of Protium p., Protium t. and Tetragastris which come from the same family (see the table of the 86 species in the supplementary material).

5.2 Application with 86 tree species

For the 86-species application, we apply the 8-fold CV procedure with $\xi=0,0.5,1$ and $\lambda\in\{0,10^{-3},\ldots,5\}$ as in the previous section and $q\in\{0,\ldots,10\}$ . Figure 3 is similar to Figure 1. The left plot shows that consistently smaller CV scores are obtained with elastic net ( $\xi=0.5$ ) and the smallest CV score is obtained with $q=4$ . The remaining plots are obtained with $\xi=0.5$ . The image plot of cross validation scores in the middle plot looks much smoother than in the 9 species case. The right plot shows a well defined minimum for $\lambda=1.94$ given $q=4$ .

The computing time for the CV is 7.6 hours for $\xi=0.5$ and the computing time to estimate the parameters for the chosen $q=4$ and $\lambda=1.94$ is 3.2 minutes. Out of $4\times 86$ parameters in the estimated $\boldsymbol{\alpha}$ , 13 were set to zero by the elastic net regularization. We thereby model $86\times 87/2=3741$ distinct pair and cross pair correlation functions using only $6\times 86-13+4=507$ parameters. Thus we have indeed obtained a sparse model for the given data.
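The two counts can be verified with simple arithmetic. The breakdown of $6\times 86$ used below is our reading and is not spelled out in the text: $4\times 86$ entries of $\boldsymbol{\alpha}$ plus one $\sigma^{2}_{i}$ and one $\psi_{i}$ per species, with one $\phi_{l}$ per latent field added separately (consistent with the vector lengths used in the simulation settings).

```python
p, q = 86, 4                # number of species and of common latent fields
n_alpha = p * q             # entries of alpha
n_sigma2 = p                # assumed: one sigma_i^2 per species
n_psi = p                   # assumed: one psi_i per species
n_phi = q                   # assumed: one phi_l per latent field
n_zero = 13                 # entries of alpha set to zero by the elastic net

n_params = n_alpha + n_sigma2 + n_psi + n_phi - n_zero
n_functions = p * (p + 1) // 2   # distinct pair and cross pair correlation functions
```

This gives 507 free parameters for 3741 functions, that is roughly one parameter per seven correlation functions.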

The distribution of estimated PVs is shown in Table 8. Most species ( $53\%$ ) have estimated proportions of variances due to common factors less than 0.25.

Table 7 shows the distribution of estimated inter-species correlations due to common latent fields and the combination of common and species-specific fields (see Section 2.3) across 6 intervals. Most estimated correlations are positive. However, the correlations decrease a lot in absolute value when the species-specific fields are included (last row of Table 7).

Figure 4 shows a clustering of species based on estimated $\boldsymbol{\alpha}_{i\cdot}$ , $i=1,\ldots,86$ . The leaves are marked with species life form. There may be some indication that species of life form “Tree” (life form number 4) tend to cluster together. However, one should be careful with this interpretation since apparent patterns like this could be due to sampling variation.

6 Conclusion

We developed in this study a regularized estimation method for highly multivariate point patterns modeled by multivariate log Gaussian Cox processes. The procedure is numerically stable and performs well both in the considered simulations and applications. In our truly highly multivariate second application, we were able to fit a sparse model for a multivariate point pattern with 86 types of points.

An interesting application of obtained estimates is to group types of points according to their estimated dependence on common latent fields as expressed by the rows $\boldsymbol{\alpha}_{i\cdot}$ . Hence a further development could be to consider an extension of the so-called fused LASSO (Tibshirani et al., 2005) by introducing regularization for differences $\boldsymbol{\alpha}_{i\cdot}-\boldsymbol{\alpha}_{j\cdot}$ . A further possibility would be to consider a sparse group LASSO (Simon et al., 2013) to obtain estimates of $\boldsymbol{\alpha}$ with some zeros of $\alpha_{il}$ as developed in this paper and, in addition, with entire rows of zeros implying independence of corresponding types of points and all other types of points.

Acknowledgements The research by A. Choiruddin, F. Cuevas-Pacheco, and R. Waagepetersen is supported by The Danish Council for Independent Research | Natural Sciences, grant DFF – 7014-00074 "Statistics for point processes in space and beyond", and by the "Centre for Stochastic Geometry and Advanced Bioimaging", funded by grant 8721 from the Villum Foundation.

The BCI forest dynamics research project was made possible by National Science Foundation grants to Stephen P. Hubbell: DEB-0640386, DEB-0425651, DEB-0346488, DEB-0129874, DEB-00753102, DEB-9909347, DEB-9615226, DEB-9615226, DEB-9405933, DEB-9221033, DEB-9100058, DEB-8906869, DEB-8605042, DEB-8206-992, DEB-7922197, support from the Center for Tropical Forest Science, the Smithsonian Tropical Research Institute, the John D. and Catherine T. MacArthur Foundation, the Mellon Foundation, the Celera Foundation, and numerous private individuals, and through the hard work of over 100 people from 10 countries over the past two decades. The plot project is part of the Center for Tropical Forest Science, a global network of large-scale demographic tree plots.

The BCI soils data set were collected and analyzed by J. Dalling, R. John, K. Harms, R. Stallard and J. Yavitt with support from NSF DEB021104, 021115, 0212284, 0212818 and OISE 0314581, STRI and CTFS. Paolo Segre and Juan Di Trani provided assistance in the field. The covariates dem, grad, mrvbf, solar and twi were computed in SAGA GIS by Tomislav Hengl (http://spatial-analyst.net/). We thank Dr. Joseph Wright for sharing data on dispersal modes and life forms for the BCI tree species.

Appendix A Proximal Newton Method

Suppose we want to find the solution of

[TABLE]

where the function $f(\cdot)$ can be separated into two parts: the function $a(\cdot)$ which is a convex and twice continuously differentiable loss function and the function $c(\cdot)$ which is a convex but not necessarily differentiable penalty function. The proximal-Newton method is an iterative optimization algorithm that uses a quadratic approximation of the differentiable part $a(\cdot)$ :

[TABLE]

where $\boldsymbol{\theta}^{(k)}$ is the current value of $\boldsymbol{\theta}$ , $\nabla a(\cdot)$ is the first derivative of $a(\cdot)$ and $H(\cdot)$ is an approximation to the Hessian matrix $\nabla^{2}a(\cdot)$ . Letting $\tilde{\boldsymbol{\theta}}=\operatorname*{arg\,min}_{\boldsymbol{\theta}}\hat{f}(\boldsymbol{\theta})$ , the next value of $\boldsymbol{\theta}$ is obtained as

[TABLE]

for some $t>0$ . That is, $\tilde{\boldsymbol{\theta}}$ is used to construct a search direction for the $(k+1)$th value of $\boldsymbol{\theta}$ . Theoretical results in Lee et al. [2014] show that $t$ can be chosen so that $f(\boldsymbol{\theta}^{(k+1)})<f(\boldsymbol{\theta}^{(k)})$ . The matrix $H(\cdot)$ can be chosen in various ways, see Lee et al. [2014] and Hastie et al. [2015] for more details.
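To make the iteration concrete, the sketch below applies a proximal Newton scheme to a toy problem that is not the objective function of this paper: $a(\boldsymbol{\theta})=\tfrac{1}{2}\|A\boldsymbol{\theta}-b\|^{2}$ and $c(\boldsymbol{\theta})=\lambda\|\boldsymbol{\theta}\|_{1}$. All names are hypothetical. We take $H$ to be the exact Hessian $A^{\mbox{\scriptsize\sf T}}A$ (assumed to have a positive diagonal), solve the penalized quadratic subproblem by coordinate descent, and choose $t$ by simple halving.

```python
import numpy as np


def soft_threshold(a, tau):
    """Soft-thresholding operator sign(a) * (|a| - tau)_+ ."""
    return np.sign(a) * np.maximum(np.abs(a) - tau, 0.0)


def prox_newton_lasso(A, b, lam, n_outer=50, n_inner=200, tol=1e-10):
    """Minimise f(theta) = 0.5*||A theta - b||^2 + lam*||theta||_1."""
    d = A.shape[1]
    H = A.T @ A                      # Hessian of the smooth part a(.)
    theta = np.zeros(d)

    def f(th):
        return 0.5 * np.sum((A @ th - b) ** 2) + lam * np.sum(np.abs(th))

    for _ in range(n_outer):
        grad = A.T @ (A @ theta - b)
        # Subproblem: minimise the quadratic model of a(.) plus the penalty
        # by coordinate descent; the result z plays the role of theta-tilde.
        z = theta.copy()
        for _ in range(n_inner):
            z_old = z.copy()
            for j in range(d):
                # Gradient of the quadratic model in coordinate j,
                # excluding the contribution of z[j] itself.
                g_j = grad[j] + H[j] @ (z - theta) - H[j, j] * (z[j] - theta[j])
                z[j] = soft_threshold(H[j, j] * theta[j] - g_j, lam) / H[j, j]
            if np.max(np.abs(z - z_old)) < tol:
                break
        direction = z - theta
        if np.max(np.abs(direction)) < tol:
            break
        # Line search: halve t until the objective does not increase.
        t, f_old = 1.0, f(theta)
        while f(theta + t * direction) > f_old and t > 1e-10:
            t *= 0.5
        theta = theta + t * direction
    return theta


# For A = I the minimiser is the soft-thresholded b.
theta_hat = prox_newton_lasso(np.eye(3), np.array([3.0, -0.5, 1.0]), lam=1.0)
```

Because $a(\cdot)$ is quadratic here and $H$ is its exact Hessian, the quadratic model is exact and the step $t=1$ is accepted; with an approximate $H$ the line search becomes relevant.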

In the following sections, we adapt the proximal Newton method to minimization of our objective function.

A.1 Quadratic approximation for updating $\boldsymbol{\alpha}_{i\cdot}$

Let us first regard (8) as a function of $\boldsymbol{\alpha}_{i\cdot}$ ,

[TABLE]

To minimize (8), we consider the proximal Newton method stated in (15). In particular, we approximate $b(\boldsymbol{\alpha}_{i\cdot})$ by a quadratic approximation around the current value $\boldsymbol{\alpha}_{i\cdot}^{(k)}$ :

[TABLE]

Here, the first derivative is

[TABLE]

while $H(\boldsymbol{\alpha}_{i\cdot}^{(k)})$ is an approximation of the second derivative,

[TABLE]

where $D(\boldsymbol{\alpha}_{i\cdot}^{(k)})=\text{Diag}(\alpha_{i1}^{(k)},\ldots,\alpha_{iq}^{(k)})$ , $X_{ii,\cdot(1:q)}$ denotes the first $q$ columns in $X_{ii}$ , and $C(\boldsymbol{\alpha}^{(k)}_{i\cdot})=4\text{Diag}\bigg{(}X_{ii,\cdot(1:q)}^{\mbox{\scriptsize\sf T}}\Big{(}Y_{ii}-X_{ii}\boldsymbol{\beta}_{ii}(\boldsymbol{\alpha}^{(k)},\boldsymbol{\sigma}^{2})\Big{)}\bigg{)}$ . Specifically,

[TABLE]

To ease the presentation and computation, we write $\hat{b}(\boldsymbol{\alpha}_{i\cdot})$ from (17) in the form of a least squares problem

[TABLE]

where

[TABLE]

Replacing $b$ in (16) with $\hat{b}$ we obtain the approximate objective function $\hat{Q}_{\lambda,i}(\boldsymbol{\alpha}_{i\cdot}|\boldsymbol{\alpha}_{i\cdot}^{(k)})$ given in (9). Since (9) is a standard regularized least squares problem, we minimize (9) using a coordinate descent algorithm to obtain $\hat{\boldsymbol{\alpha}}_{i\cdot}$ as detailed in Section B.2.

A.2 Theoretical result regarding proximal Newton update

Let $\Delta(\boldsymbol{\alpha}_{i\cdot}^{(k)})=\hat{\boldsymbol{\alpha}}_{i\cdot}-\boldsymbol{\alpha}_{i\cdot}^{(k)}$ where $\hat{\boldsymbol{\alpha}}_{i\cdot}$ is the minimizer of (9) and according to a line search strategy let

[TABLE]

for some $t>0$ . Following the proof of Proposition 2.3 in Lee et al. [2014], we can verify the following theorem.

Theorem 1

Let $H(\boldsymbol{\alpha}_{i\cdot}^{(k)})=8D(\boldsymbol{\alpha}_{i\cdot}^{(k)})X_{ii}^{\mbox{\scriptsize\sf T}}X_{ii}D(\boldsymbol{\alpha}_{i\cdot}^{(k)})$ . Then

[TABLE]

Thus, by Theorem 1, if $H(\boldsymbol{\alpha}_{i\cdot}^{(k)})$ is positive definite, we can choose $t>0$ so that $Q_{\lambda,i}(\boldsymbol{\alpha}_{i\cdot}^{(k+1)},\sigma^{2}_{i})<Q_{\lambda,i}(\boldsymbol{\alpha}_{i\cdot}^{(k)},\sigma^{2}_{i})$ . That is, the update of $\boldsymbol{\alpha}_{i\cdot}$ results in a decrease of the objective function (8).

Appendix B Algorithm

In our block descent algorithm, we minimize (7) with respect to $\boldsymbol{\sigma}^{2},\boldsymbol{\alpha},\boldsymbol{\phi}$ , and $\boldsymbol{\psi}$ in turn. For $i=1,\ldots,p$ , we first update $\sigma^{2}_{i}$ by minimizing (8) using least squares estimation followed by an update of $\boldsymbol{\alpha}_{i\cdot}$ by minimizing (9) using a coordinate descent method. We denote by $X_{ij,\cdot k}$ the $k$ th column of $X_{ij}$ for $k=1,\ldots,q$ ( $i\neq j$ ) or $k=1,\ldots,q+1$ ( $i=j$ ). We detail, respectively in Appendices B.1 and B.2, the updates of $\sigma^{2}_{i}$ and the coordinate descent updates of $\alpha_{il}$ for $l=1,\ldots,q$ . A summary of the final algorithm is given by Appendix B.3.
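The cyclical pattern of block updates can be illustrated on a toy problem unrelated to (7). In the sketch below the two "blocks" are the scalars $x$ and $y$, and each update is the exact minimiser of $f(x,y)=x^{2}+y^{2}+xy-3x$ with the other block held fixed. The function and its name are chosen purely for illustration.

```python
def cyclical_block_descent(n_sweeps=50, tol=1e-12):
    """Minimise f(x, y) = x^2 + y^2 + x*y - 3*x by exact block updates."""
    x, y = 0.0, 0.0
    for _ in range(n_sweeps):
        x_old, y_old = x, y
        x = (3.0 - y) / 2.0   # argmin over x with y fixed: 2x + y - 3 = 0
        y = -x / 2.0          # argmin over y with x fixed: 2y + x = 0
        if max(abs(x - x_old), abs(y - y_old)) < tol:
            break
    return x, y


x_hat, y_hat = cyclical_block_descent()
```

The iterates converge to the joint minimiser $(2,-1)$, at which both partial derivatives vanish. In the algorithm of this appendix the blocks are instead $\sigma^{2}_{i}$, $\boldsymbol{\alpha}_{i\cdot}$, $\boldsymbol{\phi}$ and $\boldsymbol{\psi}$, and the block updates are those described in the text.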

B.1 Update of $\sigma_{i}^{2}$

The parameter $\hat{\sigma}_{i}^{2}$ is updated using least squares methods. More precisely, the gradient of (8) with respect to $\sigma_{i}^{2}$ is

[TABLE]

By solving $\frac{\partial Q_{\lambda,i}(\boldsymbol{\alpha}_{i\cdot},\sigma_{i}^{2})}{\partial\sigma_{i}^{2}}=0$ , we obtain the update

[TABLE]

where $\max\{a,0\}$ is used to avoid negative results of the update.
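The effect of the $\max\{a,0\}$ clipping can be seen on a generic one-parameter least squares problem. The sketch below does not reproduce the update formula above: `x` and `r` are hypothetical stand-ins for a single design column and the corresponding partial residual, and `s` plays the role of a variance parameter constrained to be non-negative.

```python
def clipped_ls_update(x, r):
    """Minimise sum_k (r_k - x_k * s)^2 over s >= 0 for a single coefficient s."""
    num = sum(xk * rk for xk, rk in zip(x, r))
    den = sum(xk * xk for xk in x)
    # The criterion is a convex quadratic in s, so the constrained minimiser
    # is the unconstrained one clipped at zero.
    return max(num / den, 0.0)


s_pos = clipped_ls_update([1.0, 1.0], [2.0, 4.0])    # unconstrained solution is kept
s_neg = clipped_ls_update([1.0, 1.0], [-2.0, -4.0])  # negative solution is clipped
```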

B.2 Update of $\alpha_{il}$

Let $r_{ij}=Y^{*}_{ij}-\sum_{\begin{subarray}{c}k=1\\ k\neq l\end{subarray}}^{q}X^{*}_{ij,\cdot k}\alpha_{ik}$ , where $Y^{*}_{ij}$ and $X^{*}_{ij}$ are specified in (10). Then we rewrite (9) as

[TABLE]

The gradient with respect to $\alpha_{il}$ is

[TABLE]

Following the main argument by Friedman et al. [2010], the coordinate-wise update for $\alpha_{il}$ is of the form

[TABLE]

where $S(A,\lambda\xi)=\text{sign}(A)(|A|-\lambda\xi)_{+}$ .
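A generic coordinate descent with this soft-thresholding operator can be sketched as follows. The penalty scaling is an assumption on our part: we minimise $\tfrac{1}{2}\|y-X\boldsymbol{a}\|^{2}+\lambda\{\xi\|\boldsymbol{a}\|_{1}+\tfrac{1}{2}(1-\xi)\|\boldsymbol{a}\|_{2}^{2}\}$ in the spirit of Friedman et al. [2010], which may differ from (9) by constants. `X` and `y` are hypothetical stand-ins for the stacked $X^{*}_{ij}$ and $Y^{*}_{ij}$.

```python
import numpy as np


def soft_threshold(a, tau):
    """S(a, tau) = sign(a) * (|a| - tau)_+ for a scalar a."""
    return np.sign(a) * max(abs(a) - tau, 0.0)


def elastic_net_cd(X, y, lam, xi, n_sweeps=200, tol=1e-10):
    """Coordinate descent for
    0.5*||y - X a||^2 + lam * (xi*||a||_1 + 0.5*(1 - xi)*||a||_2^2)."""
    d = X.shape[1]
    a = np.zeros(d)
    col_sq = np.sum(X ** 2, axis=0)
    for _ in range(n_sweeps):
        a_old = a.copy()
        for l in range(d):
            # Partial residual with the contribution of coordinate l removed.
            r = y - X @ a + X[:, l] * a[l]
            A = X[:, l] @ r
            a[l] = soft_threshold(A, lam * xi) / (col_sq[l] + lam * (1.0 - xi))
        if np.max(np.abs(a - a_old)) < tol:
            break
    return a


X = np.eye(3)
y = np.array([3.0, -0.5, 1.0])
a_lasso = elastic_net_cd(X, y, lam=1.0, xi=1.0)   # LASSO: some exact zeros
a_ridge = elastic_net_cd(X, y, lam=1.0, xi=0.0)   # ridge: shrinkage, no zeros
```

With $\xi=1$ the threshold $\lambda\xi$ sets small coordinates exactly to zero, whereas with $\xi=0$ the threshold vanishes and the estimate is only shrunk. This is the mechanism behind the zero columns of $\boldsymbol{\alpha}$ discussed in Section 3.4.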

B.3 Algorithm to update $\boldsymbol{\alpha},\boldsymbol{\sigma}^{2},\boldsymbol{\phi},\boldsymbol{\psi}$

For a given $q$ and sequence of $\lambda$ values $0\leq\lambda_{1},\ldots,\lambda_{M}$ , the overall procedure to estimate the parameters: $\boldsymbol{\alpha},\boldsymbol{\sigma}^{2},\boldsymbol{\phi},\boldsymbol{\psi}$ is described by Algorithm 1. Note that estimates obtained with $\lambda_{s-1}$ are used as initial values for the estimation with $\lambda_{s}$ , $s=2,\ldots,M$ .
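The warm-start strategy along the sequence of $\lambda$ values can be sketched as below. The solver `fit_one(lam, start)` is a hypothetical placeholder for one run of the estimation procedure at a single $\lambda$; it is not part of any existing package.

```python
def fit_path(lambdas, fit_one, init):
    """Fit along a sequence of lambda values, warm-starting each fit.

    fit_one(lam, start) is a hypothetical user-supplied solver returning the
    estimate for a single lambda given starting values.
    """
    estimates = []
    start = init
    for lam in lambdas:
        est = fit_one(lam, start)
        estimates.append(est)
        start = est  # the estimate for lambda_{s-1} initialises lambda_s
    return estimates
```

Warm starts of this kind typically reduce the number of iterations needed at each $\lambda$, since neighbouring values of $\lambda$ tend to give similar estimates.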

Appendix C Plots and detailed information on the BCI data used in the analysis

Plots of 13 spatial covariates used for analysis are depicted in Figure 5. Figure 6 shows locations of the 9 selected tree species.

Bibliography

Baddeley et al. [2014] Adrian Baddeley, Aruna Jammalamadaka, and Gopalan Nair. Multitype point process analysis of spines on the dendrite network of a neuron. Journal of the Royal Statistical Society: Series C (Applied Statistics), 63(5):673–694, 2014.
Chilès and Delfiner [1999] Jean-Paul Chilès and Pierre Delfiner. Geostatistics: modeling spatial uncertainty. Probability and Statistics. Wiley, New York, 1999.
Choi et al. [2010] Jang Choi, Gary Oehlert, and Hui Zou. A penalized maximum likelihood approach to sparse factor analysis. Statistics and its Interface, 3(4):429–436, 2010.
Choiruddin et al. [2018] Achmad Choiruddin, Jean-François Coeurjolly, and Frédérique Letué. Convex and non-convex regularization methods for spatial point processes intensity estimation. Electronic Journal of Statistics, 12(1):1210–1255, 2018.
Coeurjolly et al. [2017] Jean-François Coeurjolly, Jesper Møller, and Rasmus Waagepetersen. A tutorial on Palm distributions for spatial point processes. International Statistical Review, 85(3):404–420, 2017.
Condit [1998] R. Condit. Tropical Forest Census Plots. Springer-Verlag and R. G. Landes Company, Berlin, Germany and Georgetown, Texas, 1998.
Condit et al. [1996] Richard Condit, Stephen P Hubbell, and Robin B Foster. Changes in tree species abundance in a neotropical forest: impact of climate change. Journal of Tropical Ecology, 12(2):231–256, 1996.
Diggle et al. [2005] Peter Diggle, Pingping Zheng, and Peter Durr. Nonparametric estimation of spatial segregation in a multivariate point process: bovine tuberculosis in Cornwall, UK. Journal of the Royal Statistical Society: Series C (Applied Statistics), 54(3):645–658, 2005. ISSN 1467-9876. doi: 10.1111/j.1467-9876.2005.05373.x. URL http://dx.doi.org/10.1111/j.1467-9876.2005.05373.x.
