Langevin diffusions on the torus: estimation and applications

Eduardo García-Portugués, Michael Sørensen, Kanti V. Mardia, Thomas Hamelryck

arXiv:1705.00296·stat.ME·September 22, 2020·Stat. Comput.

TL;DR

This paper develops methods for estimating Langevin diffusions on the torus with applications to directional data and molecular dynamics, introducing computationally feasible likelihood approximations and demonstrating their effectiveness through simulations and real data.

Contribution

It introduces novel likelihood approximation techniques for Langevin diffusions on the torus, enabling practical estimation and application to complex directional data.

Findings

01. Approximate likelihoods perform well in simulations.

02. Methods successfully model protein backbone angles.

03. Software package sdetorus implements the proposed methods.

Tables

Table 1: Relative efficiencies for the WN diffusion with $p=1$ and $\mu=\tfrac{\pi}{2}$. Boldface highlights the highest relative efficiencies.

	$α = 0.5$ , $σ = 1$				$α = 1$ , $σ = 1$
$Δ$	E	SO	WOU	PDE	E	SO	WOU	PDE
$0.05$	0.9799	0.9392	0.9608	0.7596	0.9888	0.9229	0.9241	0.7276
$0.20$	0.9631	0.8554	0.8878	0.8319	0.9937	0.8425	0.8422	0.7852
$0.50$	0.8941	0.7444	0.9016	0.9340	0.6907	0.9904	1.0000	0.9826
$1.00$	0.5685	0.7504	0.8978	1.0000	0.5329	0.9763	0.9972	0.9969
	$α = 0.5$ , $σ = 2$				$α = 1$ , $σ = 2$
$Δ$	E	SO	WOU	PDE	E	SO	WOU	PDE
$0.05$	0.9688	0.9972	0.9700	0.9098	0.9392	0.9431	0.9205	0.8760
$0.20$	0.7586	0.9740	0.8586	0.7805	0.8040	0.8670	0.9319	0.9380
$0.50$	0.6272	0.9565	1.0000	0.8535	0.6297	0.7321	0.8368	1.0000
$1.00$	0.2784	0.6904	1.0000	0.8578	0.6090	0.8437	0.7823	0.8964

Table 2: Relative efficiencies for the WN diffusion with $p=2$, $\boldsymbol{\mu}=\left(\tfrac{\pi}{2},-\tfrac{\pi}{2}\right)$, $\alpha_{1}=\alpha_{2}=\alpha$, $\alpha_{3}=\tfrac{\alpha}{2}$, and $\boldsymbol{\Sigma}=\sigma^{2}\mathbf{I}$. Boldface highlights the highest relative efficiencies.

	$α = 1$ , $σ = 1$			$α = 2$ , $σ = 1$
$Δ$	E	SO	WOU	E	SO	WOU
$0.05$	0.9765	0.9244	0.8999	0.9920	0.8452	0.8460
$0.20$	0.9985	0.8214	0.8229	0.7234	0.9978	0.9993
$0.50$	0.5679	0.9868	0.9972	0.4370	1.0000	0.9980
$1.00$	0.4296	0.9872	0.9998	0.3467	1.0000	0.9970
	$α = 1$ , $σ = 2$			$α = 2$ , $σ = 2$
$Δ$	E	SO	WOU	E	SO	WOU
$0.05$	0.9297	1.0000	0.9422	0.9635	0.8752	0.8793
$0.20$	0.8249	0.9573	0.9916	0.6017	0.7333	1.0000
$0.50$	0.6050	0.6607	1.0000	0.3797	0.6406	1.0000
$1.00$	0.5254	0.5432	1.0000	0.2690	0.4214	1.0000

Table 3: Comparison of estimation methods for the WN process in $p=1,2$. The number of stars ranges from one to five. The more stars, the better the performance in the category. The first three columns give the behaviour of the tpd approximation when $t$ is small, intermediate, and large, respectively.

Tpd approx.	$t \to 0$	$t \in \mathbb{R}^{+}$	$t \to \infty$	Comput. expediency
E	★★★★★	★★	★	★★★★★
SO	★★★★	★★★	★★★	★★★
WOU	★★★★	★★★★	★★★★★	★★★★
PDE	★★★	★★★★★	★★★★★	★

Table 4: Relative efficiencies for the WC diffusion ($p=1$) with $\mu=\tfrac{\pi}{2}$. Boldface highlights the highest relative efficiencies.

	$α = 0.5$ , $σ = 1$			$α = 1$ , $σ = 1$
$Δ$	E	SO	PDE	E	SO	PDE
$0.05$	0.9277	1.0000	0.7682	0.5715	0.5938	0.9309
$0.20$	0.5968	0.7315	1.0000	0.3418	0.3524	1.0000
$0.50$	0.3548	0.4264	1.0000	0.2923	0.3030	1.0000
$1.00$	0.3068	0.3295	1.0000	0.2865	0.2774	1.0000
	$α = 0.5$ , $σ = 2$			$α = 1$ , $σ = 2$
$Δ$	E	SO	PDE	E	SO	PDE
$0.05$	0.9686	0.8947	0.8646	0.7325	0.6734	1.0000
$0.20$	0.8114	0.8720	1.0000	0.0213	0.1196	1.0000
$0.50$	0.1867	0.3634	1.0000	0.0258	0.0880	1.0000
$1.00$	0.1417	0.2396	1.0000	0.0441	0.0750	1.0000

Table 5: Relative efficiencies for the mivM diffusion with $p=2$, $\mathbf{M}=\left(\tfrac{\pi}{2},\tfrac{\pi}{2};-\tfrac{\pi}{2},-\tfrac{\pi}{2}\right)$, $\mathbf{A}=\left(\tfrac{3}{4},\tfrac{3}{4};\tfrac{3}{2},\tfrac{3}{2}\right)$, $\mathbf{p}=(q,1-q)$, and $\sigma=1$. Boldface highlights the highest relative efficiencies.

	$q = 0.25$		$q = 0.50$		$q = 0.75$
$Δ$	E	SO	E	SO	E	SO
$0.05$	0.9282	0.9595	0.9851	0.9620	0.9716	0.9527
$0.20$	0.8678	0.9901	0.8999	0.9616	0.9517	0.9296
$0.50$	0.8312	0.9825	0.8223	0.9454	0.9448	0.9640
$1.00$	0.8867	0.9984	0.8625	0.9742	0.7525	0.9661

Equations

$$\mathrm{d}\Theta_{t}=\alpha\sin(\mu-\Theta_{t})\mathrm{d}t+\sigma\mathrm{d}W_{t},$$

$$f_{\mathrm{vM}}(\theta;\mu,\kappa):=\frac{e^{\kappa\cos(\theta-\mu)}}{2\pi\mathcal{I}_{0}(\kappa)},\quad\theta,\mu\in[-\pi,\pi),\ \kappa\geq 0,$$

$$\mathrm{d}\mathbf{X}_{t}=b(\mathbf{X}_{t})\mathrm{d}t+\sigma(\mathbf{X}_{t})\mathrm{d}\mathbf{W}_{t},$$

$$\frac{\partial}{\partial t}p_{t}(\mathbf{x}\,|\,\mathbf{x}_{s})=-\sum_{i=1}^{p}\frac{\partial}{\partial x_{i}}\big(b_{i}(\mathbf{x})p_{t}(\mathbf{x}\,|\,\mathbf{x}_{s})\big)+\frac{1}{2}\sum_{i,j=1}^{p}\frac{\partial^{2}}{\partial x_{i}\partial x_{j}}\big(V_{ij}(\mathbf{x})p_{t}(\mathbf{x}\,|\,\mathbf{x}_{s})\big),$$

$$b(\mathbf{x}+2\mathbf{k}\pi)=b(\mathbf{x}),\quad\sigma(\mathbf{x}+2\mathbf{k}\pi)=\sigma(\mathbf{x}),\quad\forall\mathbf{k}\in\mathbb{Z}^{p},\ \forall\mathbf{x}\in\mathbb{R}^{p}.$$

$p_{t}^{\mathrm{W}}(\cdot\,|\,\boldsymbol{\theta}_{s})$

$\mathbb{P}\{\boldsymbol{\Theta}_{t_{2}}\in B\,|\,$

$$b_{i}(\mathbf{x})=\frac{1}{2}\sum_{j=1}^{p}V_{ij}(\mathbf{x})\frac{\partial}{\partial x_{j}}\log f(\mathbf{x})+\det V(\mathbf{x})^{\frac{1}{2}}\sum_{j=1}^{p}\frac{\partial}{\partial x_{j}}\big(V_{ij}(\mathbf{x})\det V(\mathbf{x})^{-\frac{1}{2}}\big),$$

$$\mathrm{d}\boldsymbol{\Theta}_{t}=\frac{1}{2}\boldsymbol{\Sigma}\nabla\log f(\boldsymbol{\Theta}_{t})\mathrm{d}t+\boldsymbol{\Sigma}^{\frac{1}{2}}\mathrm{d}\mathbf{W}_{t}.$$

$$f_{\mathrm{MvM}}(\boldsymbol{\theta};\boldsymbol{\mu},\boldsymbol{\kappa},\boldsymbol{\Lambda}):=T(\boldsymbol{\kappa},\boldsymbol{\Lambda})^{-1}\exp\bigg\{\boldsymbol{\kappa}^{\prime}\cos(\boldsymbol{\theta}-\boldsymbol{\mu})+\frac{1}{2}\sin(\boldsymbol{\theta}-\boldsymbol{\mu})^{\prime}\boldsymbol{\Lambda}\sin(\boldsymbol{\theta}-\boldsymbol{\mu})\bigg\},$$

$$\mathrm{d}\boldsymbol{\Theta}_{t}=\big[\boldsymbol{\alpha}\circ\sin(\boldsymbol{\mu}-\boldsymbol{\Theta}_{t})-(\mathbf{A}^{*}\sin(\boldsymbol{\mu}-\boldsymbol{\Theta}_{t}))\circ\cos(\boldsymbol{\mu}-\boldsymbol{\Theta}_{t})\big]\mathrm{d}t+\sigma\mathrm{d}\mathbf{W}_{t},$$

$$\mathrm{d}\Theta_{t}=\bigg[\alpha\sum_{k\in\mathbb{Z}}(\mu-\Theta_{t}-2k\pi)w_{k}(\Theta_{t})\bigg]\mathrm{d}t+\sigma\mathrm{d}W_{t},\quad w_{k}(\theta)=\frac{\phi_{\sigma/\sqrt{2\alpha}}\left(\theta-\mu+2k\pi\right)}{\sum_{m\in\mathbb{Z}}\phi_{\sigma/\sqrt{2\alpha}}\left(\theta-\mu+2m\pi\right)}.$$

$$0\leq a_{1}(\alpha,\sigma):=\frac{8\pi^{2}\alpha^{2}}{\sigma^{2}}\sum_{k\in\mathbb{Z}}k^{2}w_{k}(\mu)\leq\alpha,\quad 0\leq a_{2}(\alpha,\sigma):=-\alpha+\frac{2\pi^{2}\alpha^{2}}{\sigma^{2}}\bigg[4\sum_{k\in\mathbb{Z}}k^{2}w_{k}(\mu+\pi)-1\bigg].$$

$\mathrm{d}\Theta_{t}=$

$w_{k}(\theta)=$

$$\mathrm{d}\Theta_{t}=\frac{\sinh(2\alpha\psi)\sin(\mu-\Theta_{t})}{\psi\big(\cosh(2\alpha\psi)+\sinh(2\alpha\psi)\cos(\mu-\Theta_{t})\big)}\mathrm{d}t+\sigma\mathrm{d}W_{t}.$$

$$\mathrm{d}\boldsymbol{\Theta}_{t}=\bigg[\sum_{j=1}^{m}\boldsymbol{\alpha}_{j}\circ\sin(\boldsymbol{\mu}_{j}-\boldsymbol{\Theta}_{t})v_{j}(\boldsymbol{\Theta}_{t})\bigg]\mathrm{d}t+\sigma\mathrm{d}\mathbf{W}_{t},\quad v_{j}(\boldsymbol{\theta})=\frac{p_{j}f_{\mathrm{MvM}}\big(\boldsymbol{\theta};\boldsymbol{\mu}_{j},\frac{2}{\sigma^{2}}\boldsymbol{\alpha}_{j},\mathbf{0}\big)}{\sum_{l=1}^{m}p_{l}f_{\mathrm{MvM}}\big(\boldsymbol{\theta};\boldsymbol{\mu}_{l},\frac{2}{\sigma^{2}}\boldsymbol{\alpha}_{l},\mathbf{0}\big)},$$

$$\mathrm{d}\boldsymbol{\Theta}_{t}=b(\boldsymbol{\Theta}_{t};\lambda)\mathrm{d}t+\sigma(\boldsymbol{\Theta}_{t};\lambda)\mathrm{d}\mathbf{W}_{t},$$

$$\hat{\lambda}_{\mathrm{MLE}}:=\arg\max_{\lambda\in\Lambda}l\big(\lambda;\{\boldsymbol{\Theta}_{\Delta i}\}_{i=0}^{N}\big),$$

$$l\big(\lambda;\{\boldsymbol{\Theta}_{\Delta i}\}_{i=0}^{N}\big)=\log p(\boldsymbol{\Theta}_{0};\lambda)+\sum_{i=1}^{N}\log p_{\Delta}(\boldsymbol{\Theta}_{\Delta i}\,|\,\boldsymbol{\Theta}_{\Delta(i-1)};\lambda).$$

$$\hat{\lambda}_{\mathrm{SMLE}}^{\nu}:=\arg\max_{\lambda^{\nu}\in\Lambda^{\nu}}\sum_{i=0}^{N}\log\nu(\boldsymbol{\Theta}_{i};\lambda^{\nu}).$$

$$p_{\Delta}(\boldsymbol{\Theta}_{\Delta i}\,|\,\boldsymbol{\Theta}_{\Delta(i-1)})\approx f_{\mathrm{WN}}(\boldsymbol{\Theta}_{\Delta i};\boldsymbol{\Theta}_{\Delta(i-1)},\Delta\boldsymbol{\Sigma})\approx\phi_{\Delta\boldsymbol{\Sigma}}\big(\mathrm{cmod}(\boldsymbol{\Theta}_{\Delta i}-\boldsymbol{\Theta}_{\Delta(i-1)})\big).$$

$$\hat{\boldsymbol{\Sigma}}_{\mathrm{HF}}:=\frac{1}{N\Delta}\sum_{i=1}^{N}\mathrm{cmod}(\boldsymbol{\Theta}_{\Delta i}-\boldsymbol{\Theta}_{\Delta(i-1)})\,\mathrm{cmod}(\boldsymbol{\Theta}_{\Delta i}-\boldsymbol{\Theta}_{\Delta(i-1)})^{\prime}.$$

$$\boldsymbol{\Theta}_{\Delta i}=\mathrm{cmod}\big(\boldsymbol{\Theta}_{\Delta(i-1)}+b(\boldsymbol{\Theta}_{\Delta(i-1)})\Delta+\sqrt{\Delta}\sigma(\boldsymbol{\Theta}_{\Delta(i-1)})\mathbf{Z}^{i}\big),$$

$$p_{\Delta}^{\mathrm{E}}(\boldsymbol{\theta}\,|\,\boldsymbol{\varphi}):=f_{\mathrm{WN}}(\boldsymbol{\theta};\boldsymbol{\varphi}+b(\boldsymbol{\varphi})\Delta,V(\boldsymbol{\varphi})\Delta),\quad\boldsymbol{\theta},\boldsymbol{\varphi}\in\mathbb{T}^{p}.$$

$$\mathrm{d}\mathbf{X}_{t}=\big(b(\mathbf{X}_{s})+J_{s}(\mathbf{X}_{t}-\mathbf{X}_{s})\big)\mathrm{d}t+\sigma_{s}\mathrm{d}\mathbf{W}_{t},\quad t\in[s,s+\Delta).$$

$$\mathrm{vec}(\Gamma_{t})=(I\otimes J_{s}+J_{s}\otimes I)^{-1}v_{t},\quad v_{t}:=\exp\{J_{s}(t-s)\}V_{s}\exp\{J_{s}^{\prime}(t-s)\}-V_{s}.$$

$\Gamma_{t}=$

$$p_{\Delta}^{\mathrm{SO}}(\boldsymbol{\theta}\,|\,\boldsymbol{\varphi}):=f_{\mathrm{WN}}(\boldsymbol{\theta};E_{\Delta}(\boldsymbol{\varphi}),V_{\Delta}(\boldsymbol{\varphi})),\quad\boldsymbol{\theta},\boldsymbol{\varphi}\in\mathbb{T}^{p},$$

$E_{\Delta}(\boldsymbol{\varphi})=$

Code & Models

Repositories

egarpor/sdetorus (Official)

Full text

1 Department of Mathematical Sciences, University of Copenhagen (Denmark).

2 Bioinformatics Centre, Section for Computational and RNA Biology, Department of Biology, University of Copenhagen (Denmark).

3 Department of Statistics, Carlos III University of Madrid (Spain).

4 Department of Statistics, University of Leeds (UK).

5 Department of Statistics, University of Oxford (UK).

6 Image Section, Department of Computer Science, University of Copenhagen (Denmark).

7 Corresponding author. e-mail: [email protected].

Langevin diffusions on the torus: estimation and applications

Eduardo García-Portugués$^{1,2,3,7}$, Michael Sørensen$^{1}$, Kanti V. Mardia$^{4,5}$, and Thomas Hamelryck$^{2,6}$

Abstract

We introduce stochastic models for continuous-time evolution of angles and develop their estimation. We focus on studying Langevin diffusions with stationary distributions equal to well-known distributions from directional statistics, since such diffusions can be regarded as toroidal analogues of the Ornstein–Uhlenbeck process. Their likelihood function is a product of transition densities with no analytical expression, but that can be calculated by solving the Fokker–Planck equation numerically through adequate schemes. We propose three approximate likelihoods that are computationally tractable: (i) a likelihood based on the stationary distribution; (ii) toroidal adaptations of the Euler and Shoji–Ozaki pseudo-likelihoods; (iii) a likelihood based on a specific approximation to the transition density of the wrapped normal process. A simulation study compares, in dimensions one and two, the approximate transition densities to the exact ones, and investigates the empirical performance of the approximate likelihoods. Finally, two diffusions are used to model the evolution of the backbone angles of the protein G (PDB identifier 1GB1) during a molecular dynamics simulation. The software package sdetorus implements the estimation methods and applications presented in the paper.

Keywords: Circular data; Directional statistics; Likelihood; Protein structure; Stochastic differential equation; Wrapped normal.

1 Introduction

Useful proposals of stochastic processes must take into account the particular features of the data that they aim to model. This is so for toroidal data, where observations are elements on the torus $\mathbb{T}^{p}=[-\pi,\pi)\times\overset{p}{\cdots}\times[-\pi,\pi)$ (with $-\pi$ and $\pi$ identified). Models and inference for circular data ( $p=1$ ) are notably different from the Euclidean case; see Mardia and Jupp (2000) or Jammalamadaka and SenGupta (2001) for a comprehensive description and a review of applications. One of the first continuous-time processes on the circle was proposed by Kent (1975). It is defined as the solution to the Stochastic Differential Equation (SDE)

$$\mathrm{d}\Theta_{t}=\alpha\sin(\mu-\Theta_{t})\mathrm{d}t+\sigma\mathrm{d}W_{t},$$

where $\{W_{t}\}$ is a Wiener process, $\alpha>0$ is the strength of the drift towards $\mu\in[-\pi,\pi)$ , and $\sigma>0$ is the diffusion coefficient. This process, referred to below as the von Mises (vM) process, can be regarded as a circular analogue of the Ornstein–Uhlenbeck (OU) process. The process is attracted to $\mu$ and, in the neighbourhood of $\mu$ , the drift is approximately linear. Moreover, the process is ergodic (i.e., it has a unique stationary distribution) and the stationary distribution (abbreviated as sdi) is $\mathrm{vM}\big{(}\mu,\frac{2\alpha}{\sigma^{2}}\big{)}$ . $\mathrm{vM}(\mu,\kappa)$ denotes the vM distribution with probability density function (pdf)

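$$f_{\mathrm{vM}}(\theta;\mu,\kappa):=\frac{e^{\kappa\cos(\theta-\mu)}}{2\pi\mathcal{I}_{0}(\kappa)},\quad\theta,\mu\in[-\pi,\pi),\ \kappa\geq 0,$$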

with $\mathcal{I}_{\nu}$ being the modified Bessel function of the first kind and order $\nu$ . Despite its similarities with the OU process, the vM process is not as tractable as the former: no analytical expression for its transition probability density (tpd) is known. The vM process has been applied in mathematical biology (Hill and Häder, 1997; Codling and Hill, 2005), and related extensions were studied in physics in the context of oscillators (see Section 5.3.3 in Frank (2005) and references therein).
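As an illustration of the vM process, the following minimal sketch simulates a path with a wrapped Euler–Maruyama step and summarizes it with the circular mean direction and the mean resultant length. The function names (`cmod`, `simulate_vm`, `circular_summary`), the step size, and the seed are assumptions made for this example; it is not the implementation in sdetorus.

```python
import math
import random


def cmod(x):
    """Wrap an angle to its principal value in [-pi, pi)."""
    return (x + math.pi) % (2.0 * math.pi) - math.pi


def simulate_vm(alpha, mu, sigma, theta0, dt, n_steps, rng):
    """Wrapped Euler-Maruyama path of dTheta = alpha*sin(mu - Theta) dt + sigma dW."""
    path = [cmod(theta0)]
    for _ in range(n_steps):
        theta = path[-1]
        drift = alpha * math.sin(mu - theta) * dt
        noise = sigma * math.sqrt(dt) * rng.gauss(0.0, 1.0)
        path.append(cmod(theta + drift + noise))
    return path


def circular_summary(angles):
    """Return (mean direction, mean resultant length) of a sample of angles."""
    c = sum(math.cos(a) for a in angles) / len(angles)
    s = sum(math.sin(a) for a in angles) / len(angles)
    return math.atan2(s, c), math.hypot(c, s)


# Illustrative parameters (assumed): the sdi is vM(mu, 2*alpha/sigma^2) = vM(pi/2, 2).
rng = random.Random(1)
path = simulate_vm(alpha=1.0, mu=math.pi / 2, sigma=1.0,
                   theta0=-math.pi / 2, dt=0.01, n_steps=20000, rng=rng)
mean_dir, resultant = circular_summary(path[2000:])
print("circular mean:", mean_dir, "mean resultant length:", resultant)
```

For long paths the circular mean should settle near $\mu$, in line with the stated $\mathrm{vM}\big(\mu,\frac{2\alpha}{\sigma^{2}}\big)$ sdi, up to Monte Carlo and discretization error.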

The contributions of this paper are two-fold. Firstly, we propose ergodic diffusions on the torus whose sdis are well-established distributions from directional statistics. These diffusions can be regarded as toroidal analogues of the OU process. Specifically, we introduce several Langevin diffusions, each defined as the wrapping of a $p$ -dimensional Euclidean diffusion solving the time-homogeneous SDE

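$$\mathrm{d}\mathbf{X}_{t}=b(\mathbf{X}_{t})\mathrm{d}t+\sigma(\mathbf{X}_{t})\mathrm{d}\mathbf{W}_{t},\tag{2}$$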

where $b:\mathbb{R}^{p}\rightarrow\mathbb{R}^{p}$ is the drift, $\sigma:\mathbb{R}^{p}\rightarrow\mathbb{R}^{p\times p}$ is the diffusion coefficient, and $\mathbf{W}_{t}=(W_{t,1},\ldots,W_{t,p})^{\prime}$ is a vector of $p$ independent standard Wiener processes (′ denotes transposition). We provide insights on the wrapping of (2) and study the properties of the new diffusions. We give particular emphasis to the Langevin diffusion with Wrapped Normal (WN) sdi, since this is a toroidal OU analogue with more tractable estimation.

Secondly, we present estimation procedures for discretely observed toroidal diffusions. The likelihood function involves the evaluation of the tpd $p_{t}(\cdot\,|\,\mathbf{x}_{s})$ , the density function of the conditional distribution of $\mathbf{X}_{t+s}$ given $\mathbf{X}_{s}=\mathbf{x}_{s}$ . The tpd solves the Fokker–Planck (or Kolmogorov forward) equation, that is, the Partial Differential Equation (PDE)

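$$\frac{\partial}{\partial t}p_{t}(\mathbf{x}\,|\,\mathbf{x}_{s})=-\sum_{i=1}^{p}\frac{\partial}{\partial x_{i}}\big(b_{i}(\mathbf{x})p_{t}(\mathbf{x}\,|\,\mathbf{x}_{s})\big)+\frac{1}{2}\sum_{i,j=1}^{p}\frac{\partial^{2}}{\partial x_{i}\partial x_{j}}\big(V_{ij}(\mathbf{x})p_{t}(\mathbf{x}\,|\,\mathbf{x}_{s})\big),\tag{3}$$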

with $\mathbf{x},\mathbf{x}_{s}\in\mathbb{R}^{p}$ , $V(\cdot):=\sigma(\cdot)\sigma(\cdot)^{\prime}$ and initial condition $p_{0}(\mathbf{x}\,|\,\mathbf{x}_{s})=\delta(\mathbf{x}-\mathbf{x}_{s})$ ( $\delta(\cdot)$ represents Dirac’s delta). This PDE has no explicit solution except for very few particular choices of $b$ and $V$ . We consider maximum likelihood estimation based on the numerical solution of (3). This method is computationally costly, but serves as a benchmark to which other computationally more expedient methods can be compared. A simple solution is to replace the unknown tpd by the known sdi, hence reducing the problem to maximum likelihood estimation with independent and identically distributed data, but this is usually inefficient and only allows for the estimation of the parameters appearing in the sdi. We therefore develop better approximations to the tpd that are relatively easy to compute. For general diffusions, we introduce toroidal versions of the Euler and Shoji–Ozaki pseudo-likelihoods. For the WN process, we derive a specific, sdi-correct and computationally efficient tpd approximation. We investigate the quality of these estimators by calculating the Kullback–Leibler divergences of the approximating tpds with respect to the tpd obtained by numerically solving (3). Furthermore, in a simulation study for different discretization steps we compare, in the one- and two-dimensional cases, the performance of the proposed approximate likelihoods.
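To make the idea of solving the Fokker–Planck equation numerically concrete, here is a minimal sketch for the one-dimensional vM process on a periodic grid. It uses a simple explicit, conservative finite-volume scheme chosen for brevity; this scheme, the grid size, the time step, and all function names are assumptions of this example and are not the numerical scheme used in the paper.

```python
import math


def bessel_i0(kappa, terms=60):
    """Modified Bessel function I_0(kappa) from its power series."""
    total, term = 0.0, 1.0
    for k in range(terms):
        total += term
        term *= (0.5 * kappa) ** 2 / ((k + 1) ** 2)
    return total


def solve_fokker_planck_vm(alpha, mu, sigma, theta0, t_end, n_cells=64, dt=0.002):
    """Approximate the tpd of the vM process at time t_end, started at theta0.

    Explicit scheme in flux form on [-pi, pi) with periodic boundaries:
    flux = b * p - (sigma^2 / 2) * dp/dtheta, evaluated at cell interfaces.
    """
    dx = 2.0 * math.pi / n_cells
    centers = [-math.pi + (i + 0.5) * dx for i in range(n_cells)]
    # Drift evaluated at the right interface of each cell.
    b_face = [alpha * math.sin(mu - (c + 0.5 * dx)) for c in centers]
    # Discrete Dirac delta: all the mass in the cell containing theta0.
    p = [0.0] * n_cells
    start = int((theta0 + math.pi) / dx) % n_cells
    p[start] = 1.0 / dx
    half_v = 0.5 * sigma ** 2
    for _ in range(int(round(t_end / dt))):
        flux = [b_face[i] * 0.5 * (p[i] + p[(i + 1) % n_cells])
                - half_v * (p[(i + 1) % n_cells] - p[i]) / dx
                for i in range(n_cells)]
        # flux[i - 1] with i = 0 wraps to the last interface (periodicity).
        p = [p[i] - dt / dx * (flux[i] - flux[i - 1]) for i in range(n_cells)]
    return centers, p


# Illustrative check (assumed parameters): for large t the tpd should be
# close to the vM(mu, 2*alpha/sigma^2) stationary density.
alpha, mu, sigma = 1.0, math.pi / 2, 1.0
centers, p = solve_fokker_planck_vm(alpha, mu, sigma, theta0=-math.pi / 2, t_end=20.0)
kappa = 2.0 * alpha / sigma ** 2
norm = 2.0 * math.pi * bessel_i0(kappa)
stationary = [math.exp(kappa * math.cos(c - mu)) / norm for c in centers]
max_err = max(abs(a - b) for a, b in zip(p, stationary))
mass = sum(p) * (2.0 * math.pi / len(p))
print("total mass:", mass, "max abs deviation from vM density:", max_err)
```

The explicit time step must stay below roughly $\mathrm{d}x^{2}/\sigma^{2}$ for stability, which is one reason why PDE-based likelihoods are costly and serve mainly as a benchmark.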

Next, we describe the literature relevant to our contributions. Diffusions featuring trigonometric drifts were presented in Kessler and Sørensen (1999), Larsen and Sørensen (2007) and Sørensen (2012), although these processes are not designed to capture periodicity, but rather to have a bounded interval as their state space. Wrapped Gaussian processes have been considered by Jona-Lasinio et al. (2012) in the context of spatial modelling of wave directions. In a different setting, processes where the time-inhomogeneous drift $b(t,X_{t})$ is a periodic function of time have been studied by Dehay (2015) and Dehling et al. (2010). Discrete-time processes on the circle include the circular autoregressive models by Breckling (1989) and the Markov processes on the circle by Wehrly and Johnson (1979), Kato (2010) and Yeh et al. (2013). In a broader perspective, stochastic calculus on manifolds has been extensively developed, see for example Émery (1989), Stroock (2000) and Hsu (2002). For the case of the torus, a flat and compact manifold, the modelling challenges do not reside in the curvature of the manifold, but rather in capturing angular dependencies, a non-trivial and ubiquitous problem in directional statistics that is a consequence of the complex behaviour of rotations on the torus. Finally, we refer to Rogers and Williams (2000), Steele (2001) and Øksendal (2003) for an exhaustive introduction to SDEs, and to Kloeden and Platen (1992) and Iacus (2008) for a more applied perspective.

The rest of this paper is organized as follows. Section 2 introduces diffusions on the torus. Section 3 presents and analyses several estimation procedures for them, whilst the empirical estimation performance is assessed in a simulation study in Section 4. Section 5 gives an application to modelling the evolution of protein backbone angles. Conclusions and final comments are given in Section 6.

2 Toroidal diffusions

The state space of a stochastic process $\left\{\boldsymbol{\Theta}_{t}\right\}$ on the torus is $\mathbb{T}^{p}=[-\pi,\pi)\times\overset{p}{\cdots}\times[-\pi,\pi)$ . The space $\mathbb{R}^{p}$ also plays a relevant role, since $\left\{\boldsymbol{\Theta}_{t}\right\}$ can be regarded as a Euclidean process $\left\{\mathbf{X}_{t}\right\}$ that is wrapped into its principal angles by $\mathrm{cmod}\left(\cdot\right):=((\cdot+\pi)\mod 2\pi)-\pi$ . This approach eases the interpretation of crossings through boundaries and motivates the following definition.

Definition 1 (Toroidal diffusion).

The stochastic process $\{\boldsymbol{\Theta}_{t}\}\subset\mathbb{T}^{p}$ is said to be a toroidal diffusion if it arises as the wrapping $\boldsymbol{\Theta}_{t}=\mathrm{cmod}\left(\mathbf{X}_{t}\right)$ of a diffusion (2) such that $b$ and $\sigma$ are $2\pi$ -periodic:

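$$b(\mathbf{x}+2\mathbf{k}\pi)=b(\mathbf{x}),\quad\sigma(\mathbf{x}+2\mathbf{k}\pi)=\sigma(\mathbf{x}),\quad\forall\mathbf{k}\in\mathbb{Z}^{p},\ \forall\mathbf{x}\in\mathbb{R}^{p}.$$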

The toroidal diffusion coming from the wrapping of (2) is denoted as $\mathrm{d}\boldsymbol{\Theta}_{t}=b(\boldsymbol{\Theta}_{t})\mathrm{d}t+\sigma(\boldsymbol{\Theta}_{t})\mathrm{d}\mathbf{W}_{t}$ .

The periodicity of $b$ and $\sigma$ is required to make $\{\boldsymbol{\Theta}_{t}\}$ a diffusion, since $\{\boldsymbol{\Theta}_{t}\}$ can only be Markovian if $\{\mathbf{X}_{t}\}$ is non-ergodic in $\mathbb{R}^{p}$ , as the next result shows.

Proposition 1 (Wrapped ergodic diffusion).

Let $\{\mathbf{X}_{t}\}$ be an ergodic diffusion on $\mathbb{R}^{p}$ with stationary density $\nu$ and tpd $p_{t}(\cdot\,|\,\mathbf{x}_{s})$ . The following statements hold for the wrapped process $\boldsymbol{\Theta}_{t}:=\mathrm{cmod}\left(\mathbf{X}_{t}\right)$ :

i. $\{\boldsymbol{\Theta}_{t}\}$ is ergodic on $\mathbb{T}^{p}$ , with stationary density $\nu^{\mathrm{W}}(\cdot):=\sum_{\mathbf{k}\in\mathbb{Z}^{p}}\nu(\cdot+2\pi\mathbf{k})$ .

ii. If $\mathbf{X}_{s}$ is distributed with density $\nu$ , then the conditional density of $\boldsymbol{\Theta}_{t+s}\,|\,\boldsymbol{\Theta}_{s}=\boldsymbol{\theta}_{s}$ is

$$p_{t}^{\mathrm{W}}(\cdot\,|\,\boldsymbol{\theta}_{s})=\sum_{\mathbf{k},\mathbf{m}\in\mathbb{Z}^{p}}p_{t}(\cdot+2\mathbf{k}\pi\,|\,\boldsymbol{\theta}_{s}+2\mathbf{m}\pi)\frac{\nu(\boldsymbol{\theta}_{s}+2\mathbf{m}\pi)}{\nu^{\mathrm{W}}(\boldsymbol{\theta}_{s})}.\tag{4}$$

iii. If $\{\mathbf{X}_{t}\}$ is time-reversible, i.e., $p_{t}(\mathbf{x}\,|\,\mathbf{y})\nu(\mathbf{y})=p_{t}(\mathbf{y}\,|\,\mathbf{x})\nu(\mathbf{x})$ , $\forall\mathbf{x},\mathbf{y}\in\mathbb{R}^{p}$ , then $p^{\mathrm{W}}_{t}(\boldsymbol{\theta}\,|\,\boldsymbol{\varphi})\nu^{\mathrm{W}}(\boldsymbol{\varphi})=p_{t}^{\mathrm{W}}(\boldsymbol{\varphi}\,|\,\boldsymbol{\theta})\nu^{\mathrm{W}}(\boldsymbol{\theta})$ , $\forall\boldsymbol{\theta},\boldsymbol{\varphi}\in\mathbb{T}^{p}$ .

iv. The wrapped process is not Markovian.

Proof.

The statements can be easily checked, so we only illustrate the non-Markovianity. Recall that for $t_{0}<t_{1}<t_{2}$ and using the Markovianity of $\left\{\mathbf{X}_{t}\right\}$ ,

[TABLE]

This clearly depends on $\boldsymbol{\theta}_{t_{0}}$ unless $p_{t}$ is periodic in both arguments, which is impossible for a density in $\mathbb{R}^{p}$ . ∎

Thus, a wrapped ergodic diffusion is not a diffusion. In particular, the family of conditional distributions given by (4) does not define a semi-group of transition operators. The non-Markovianity arises because $\boldsymbol{\Theta}_{t_{2}}\,|\,(\boldsymbol{\Theta}_{t_{1}},\,\allowbreak\boldsymbol{\Theta}_{t_{0}})$ , with $t_{2}>t_{1}>t_{0}$ , does not depend only on $\boldsymbol{\Theta}_{t_{1}}$ but also on the winding number $\mathrm{wind}(\mathbf{X}_{t_{1}})\allowbreak:=\lfloor\frac{\mathbf{X}_{t_{1}}+\pi}{2\pi}\rfloor\in\mathbb{Z}^{p}$ of $\mathbf{X}_{t_{1}}=\boldsymbol{\Theta}_{t_{1}}+2\mathrm{wind}(\mathbf{X}_{t_{1}})\pi$ , hence the requirement for periodic $b$ and $\sigma$ in Definition 1.
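The decomposition of a Euclidean position into a principal angle and a winding number can be checked numerically in the circular case. The sketch below is a direct transcription of the definitions of $\mathrm{cmod}$ and $\mathrm{wind}$ for $p=1$; the Python function names are chosen for this example only.

```python
import math


def cmod(x):
    """Principal angle in [-pi, pi): ((x + pi) mod 2*pi) - pi."""
    return (x + math.pi) % (2.0 * math.pi) - math.pi


def wind(x):
    """Winding number: floor((x + pi) / (2*pi))."""
    return math.floor((x + math.pi) / (2.0 * math.pi))


# Each x is recovered as cmod(x) + 2 * wind(x) * pi.
for x in (0.5, 4.0, -7.0, 20.0):
    theta, k = cmod(x), wind(x)
    print(x, "->", "theta =", theta, "wind =", k,
          "reconstructed =", theta + 2.0 * k * math.pi)
```

Two Euclidean states with the same `cmod` value but different `wind` values are indistinguishable after wrapping, which is the information loss behind the non-Markovianity discussed above.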

Remark 1.

The density of $\boldsymbol{\Theta}_{t+s}\,|\,\boldsymbol{\Theta}_{s}=\boldsymbol{\theta}_{s}$ is remarkably different from the density of $\boldsymbol{\Theta}_{t+s}\,|\,\mathbf{X}_{s}=\boldsymbol{\theta}_{s}$ , given by $\sum_{\mathbf{k}\in\mathbb{Z}^{p}}p_{t}(\cdot+2\mathbf{k}\pi\,|\,\mathbf{X}_{s}=\boldsymbol{\theta}_{s})$ . To make this point clearer, let $\left\{X_{t}\right\}$ be the OU process $\mathrm{d}X_{t}=\alpha(\mu-X_{t})\mathrm{d}t+\sigma\mathrm{d}W_{t}$ , $\left\{\Theta_{t}\right\}$ its wrapped version with sdi $\sum_{k\in\mathbb{Z}}\phi_{\sigma/\sqrt{2\alpha}}(\theta\allowbreak-\mu+2k\pi)$ ( $\phi_{\sigma}$ is the pdf of a $\mathcal{N}(0,\sigma^{2})$ ), and assume $X_{s}\sim\mathcal{N}\big{(}\mu,\frac{\sigma^{2}}{2\alpha}\big{)}$ . Then:

i. The density of $X_{t+s}\,|\,X_{s}=\theta_{s}$ is $\phi_{\sigma_{t}}(\cdot-\mu_{t})$ , with $\mu_{t}:=\mu+(\theta_{s}-\mu)e^{-\alpha t}$ and $\sigma_{t}^{2}:=\frac{\sigma^{2}}{2\alpha}(1-e^{-2\alpha t})$ . This is the usual tpd of the OU process.

ii. The density of $X_{t+s}\,|\,\Theta_{s}=\theta_{s}$ is $\sum_{m\in\mathbb{Z}}\phi_{\sigma_{t}}(\cdot-\mu_{t}^{m})w_{m}(\theta_{s})$ , where $\mu^{m}_{t}:=\mu+(\theta_{s}+2\pi m-\mu)e^{-\alpha t}$ and $w_{m}(\theta_{s})=\frac{\phi_{\sigma/\sqrt{2\alpha}}(\theta_{s}-\mu+2\pi m)}{\sum_{k\in\mathbb{Z}}\phi_{\sigma/\sqrt{2\alpha}}(\theta_{s}-\mu+2k\pi)}$ .

iii. The density of $\Theta_{t+s}\,|\,\Theta_{s}=\theta_{s}$ is $p_{t}^{\mathrm{W}}(\cdot\,|\,\Theta_{s}=\theta_{s})=\sum_{k,m\in\mathbb{Z}}\phi_{\sigma_{t}}\left(\cdot-\mu^{m}_{t}+2k\pi\right)w_{m}(\theta_{s})$ . This circular density can exhibit two modes describing the drift of $\{\Theta_{t}\}$ towards $\mu$ whenever $\theta_{s}$ and $\mu$ are antipodal.

iv. The density of $\Theta_{t+s}\,|\,X_{s}=\theta_{s}$ is $\sum_{k\in\mathbb{Z}}\phi_{\sigma_{t}}(\cdot-\mu_{t}+2k\pi)$ , which is unimodal. Whenever the circular shortest distance between $\theta_{s}$ and $\mu$ happens across the boundary, this circular density pushes the probability mass in the opposite direction.
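The contrast between densities iii and iv of Remark 1 can be seen numerically. The sketch below evaluates both for the wrapped OU process, truncating the infinite sums to $|k|,|m|\leq k_{\max}$; the truncation level, the parameter values, and the function names are assumptions of this example.

```python
import math


def normal_pdf(x, sd):
    """Density of N(0, sd^2) at x."""
    return math.exp(-0.5 * (x / sd) ** 2) / (sd * math.sqrt(2.0 * math.pi))


def wou_densities(theta, theta_s, t, alpha, mu, sigma, k_max=10):
    """Return (density iii, density iv) of Remark 1 evaluated at theta."""
    sd_stat = sigma / math.sqrt(2.0 * alpha)
    sd_t = sd_stat * math.sqrt(1.0 - math.exp(-2.0 * alpha * t))
    decay = math.exp(-alpha * t)
    ks = range(-k_max, k_max + 1)
    # Weights w_m(theta_s): stationary probability of each winding number.
    w_raw = [normal_pdf(theta_s - mu + 2.0 * math.pi * m, sd_stat) for m in ks]
    w_sum = sum(w_raw)
    dens_iii = 0.0
    for m, w in zip(ks, w_raw):
        mu_tm = mu + (theta_s + 2.0 * math.pi * m - mu) * decay
        wrapped = sum(normal_pdf(theta - mu_tm + 2.0 * math.pi * k, sd_t) for k in ks)
        dens_iii += (w / w_sum) * wrapped
    mu_t = mu + (theta_s - mu) * decay
    dens_iv = sum(normal_pdf(theta - mu_t + 2.0 * math.pi * k, sd_t) for k in ks)
    return dens_iii, dens_iv


# Antipodal start (theta_s = -pi, mu = 0): iii is symmetric around mu,
# whereas iv concentrates on one side only.
for theta in (-2.0, 0.0, 2.0):
    d3, d4 = wou_densities(theta, theta_s=-math.pi, t=0.5, alpha=1.0, mu=0.0, sigma=1.0)
    print("theta =", theta, "iii =", d3, "iv =", d4)
```

With $\theta_{s}=-\pi$ and $\mu=0$, the weights $w_{0}$ and $w_{1}$ coincide, so density iii places equal mass on both routes towards $\mu$, while density iv only sees the route that does not cross the boundary.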

Remark 2.

Liu (2013) stated a density similar to iv above, with $2k\pi e^{-t}$ instead of $2\pi k$ , as the “tpd function of the OU process on the circle” and proved that it satisfies the Chapman–Kolmogorov equation¹. That density is not circular (it has a time-shrinking period $2k\pi e^{-t}$ ).

¹ Note that $-(x_{2}-x_{1}e^{-(t_{2}-t_{1})}+2k\pi e^{-t_{2}})^{2}$ should be in the exponential’s denominator of Liu (2013)’s (15) and (16).

The rest of the section is devoted to the introduction and analysis of notable toroidal diffusions.

2.1 Langevin toroidal diffusions

Let $f$ be a pdf over $\mathbb{R}^{p}$ . The so-called Langevin diffusions are a family of multivariate diffusions of the form (2), where the entries of $b$ are given by

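$$b_{i}(\mathbf{x})=\frac{1}{2}\sum_{j=1}^{p}V_{ij}(\mathbf{x})\frac{\partial}{\partial x_{j}}\log f(\mathbf{x})+\det V(\mathbf{x})^{\frac{1}{2}}\sum_{j=1}^{p}\frac{\partial}{\partial x_{j}}\big(V_{ij}(\mathbf{x})\det V(\mathbf{x})^{-\frac{1}{2}}\big),\tag{5}$$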

with $i=1,\ldots,p$ . The most important property of these diffusions is that, under mild regularity conditions on $f$ and $\sigma$ , they are ergodic with stationary density $f$ . This is particularly convenient since one of the first steps in modelling a given trajectory is to compare its empirical distribution with the sdi of the candidate diffusion model. Remarkably, the family of Langevin diffusions characterizes the family of ergodic diffusions with a given sdi that are time-reversible. The result is due to Kolmogoroff (1937) and was later extended by Kent (1978) using symmetric diffusions on manifolds (see Theorems 4.2 and 6.1 ibid). In particular, the OU process is identified as the unique time-reversible diffusion with Gaussian sdi and constant diffusion coefficient. This characterization is key for constructing analogues of the OU process in $\mathbb{T}^{p}$ by means of Langevin diffusions driven by Gaussian-like toroidal distributions.

The construction of Langevin toroidal diffusions is achieved by wrappings of Langevin diffusions, where now $f$ is a toroidal density, that is, $\int_{\mathbb{T}^{p}}f(\boldsymbol{\theta})\mathrm{d}\boldsymbol{\theta}=1$ and $f(\boldsymbol{\theta}+2\mathbf{k}\pi)=f(\boldsymbol{\theta})$ , $\forall\boldsymbol{\theta}\in\mathbb{T}^{p},\,\mathbf{k}\in\mathbb{Z}^{p}$ .

Proposition 2.

Assume $\{\boldsymbol{\Theta}_{t}\}$ is obtained from the wrapping of a Langevin diffusion $\{\mathbf{X}_{t}\}$ with drift (5), given by a strictly positive toroidal density $f$ . Assume that the second derivatives of both $f$ and the entries of $V$ are Hölder continuous, and that $V$ is $2\pi$ -periodic. Then, for the given $V$ , $\{\boldsymbol{\Theta}_{t}\}$ is the unique toroidal time-reversible diffusion that is ergodic with stationary density $f$ and squared diffusion coefficient $V$ .

Proof.

We provide a sketch. The time-reversibility with equilibrium density $f$ follows from Theorem 10.1 in Kent (1978) using the compactness (which makes $\{\boldsymbol{\Theta}_{t}\}$ conservative), flatness, and global coordinates of $\mathbb{T}^{p}$ . The equilibrium distribution is also the (unique) sdi, so $\{\boldsymbol{\Theta}_{t}\}$ is ergodic. To show the uniqueness, note that by Theorem 6.1 ibid a time-reversible diffusion must have an equilibrium density $u$ and be $u$ -symmetric, where necessarily $u=f$ . By Theorem 4.2 ibid (and its proof) the only way a diffusion with a given $V$ can be $f$ -symmetric is if its drift is (5). ∎

As a consequence, any time-reversible toroidal diffusion with stationary density $f$ and $V(\mathbf{x})=\boldsymbol{\Sigma}$ is of the form

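$$\mathrm{d}\boldsymbol{\Theta}_{t}=\frac{1}{2}\boldsymbol{\Sigma}\nabla\log f(\boldsymbol{\Theta}_{t})\mathrm{d}t+\boldsymbol{\Sigma}^{\frac{1}{2}}\mathrm{d}\mathbf{W}_{t}.$$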

The rest of the paper focuses on diffusions of this form.
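As a minimal sketch of this construction for $p=1$, assume the drift is $b(\theta)=\frac{\sigma^{2}}{2}\frac{\mathrm{d}}{\mathrm{d}\theta}\log f(\theta)$ (the scalar case of the Langevin form with constant diffusion coefficient). The function names below are illustrative and are not taken from the paper or from any package. For a von Mises density with concentration $\frac{2\alpha}{\sigma^{2}}$, the numerical drift should agree with $\alpha\sin(\mu-\theta)$.

```python
import math

def langevin_drift(log_f, theta, sigma2, h=1e-5):
    """Drift (sigma^2 / 2) * d/dtheta log f(theta), via a central difference."""
    return 0.5 * sigma2 * (log_f(theta + h) - log_f(theta - h)) / (2.0 * h)

def vm_log_density_unnormalised(theta, mu, kappa):
    """Log of the von Mises density, up to an additive constant."""
    return kappa * math.cos(theta - mu)

# Illustrative parameter values (assumptions, not from the paper)
alpha, sigma, mu = 1.5, 0.8, 0.3
kappa = 2.0 * alpha / sigma**2  # concentration of the assumed vM sdi

theta = 1.1
numeric = langevin_drift(
    lambda t: vm_log_density_unnormalised(t, mu, kappa), theta, sigma**2
)
closed_form = alpha * math.sin(mu - theta)  # drift of the circular vM process
print(numeric, closed_form)
```

The normalizing constant of $f$ never enters the drift, which is why an unnormalised log-density suffices.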

2.2 Analogues of the Ornstein–Uhlenbeck process

The vM process can be considered as the circular analogue of the OU process (Kent, 1975). Two arguments support this claim: (i) the vM process is the unique time-reversible diffusion with vM sdi and constant diffusion coefficient; (ii) the vM distribution is usually regarded as the Gaussian circular analogue due to important Gaussian-like characterizations (Jammalamadaka and SenGupta, 2001, Section 2.2.4). However, it is worth noting that an argument similar to (ii) holds for the WN: this distribution exhibits certain similarities with the Gaussian (ibid, Section 2.2.6) and, contrary to the vM distribution, it appears in Gaussian-related limit laws such as the wrapped version of the central limit theorem (Mardia, 1972, Section 4.3.2).

In this section we investigate the main properties of the Langevin diffusions driven by the multivariate versions of the vM and WN distributions. In addition, we consider two appealing extensions driven by more flexible sdis: the symmetric circular distribution of Jones and Pewsey (2005) and mixtures of (independent) vM distributions. These processes are later employed in Section 4.

2.2.1 Multivariate von Mises

The multivariate extension of the vM distribution is not immediate: several competing alternatives are described in the literature, see Mardia and Frellsen (2012) for a review focused on the bivariate case. Among the available proposals, we chose the Multivariate von Mises (MvM) with sine interaction (Mardia et al., 2008) due to its pleasant modelling properties: simple unimodal characterization, ability to capture positive/negative dependence within the same density formulation, and vM conditional distributions. The $\mathrm{MvM}(\boldsymbol{\mu},\boldsymbol{\kappa},\boldsymbol{\Lambda})$ pdf is

[TABLE]

where the trigonometric functions are understood as entry-wise operators, $\boldsymbol{\kappa}\geq 0$ , $\boldsymbol{\Lambda}$ is a symmetric matrix with zero diagonal, and $T(\boldsymbol{\kappa},\boldsymbol{\Lambda})$ is the normalizing constant. If $\boldsymbol{\Lambda}=\mathbf{0}$ , then the MvM distribution is the product of independent vM, and hence $T(\boldsymbol{\kappa},\mathbf{0})=(2\pi)^{p}\prod_{j=1}^{p}\mathcal{I}_{0}(\kappa_{j})$ . A sufficient condition for unimodality is that $\mathbf{P}:=\mathrm{diag}\left(\boldsymbol{\kappa}\right)-\boldsymbol{\Lambda}$ is positive definite (Mardia and Voss, 2014), a result related to the fact that, for large concentrations $\boldsymbol{\kappa}$ , $\mathrm{MvM}(\boldsymbol{\mu},\boldsymbol{\kappa},\boldsymbol{\Lambda})\approx\mathcal{N}_{p}(\boldsymbol{\mu},\mathbf{P}^{-1})$ . The operator $\mathrm{diag}(\cdot)$ denotes either the diagonal extraction or the diagonal matrix construction, depending on its argument.

The non-linear dependence structure of the MvM distribution forces $\boldsymbol{\Sigma}$ in the associated Langevin diffusion to be isotropic (i.e., $\boldsymbol{\Sigma}=\sigma^{2}\mathbf{I}$ ) if separability between the drift and diffusion coefficients is desired. We opted to preserve separability and to generalize (1) by having a $\mathrm{MvM}\big{(}\boldsymbol{\mu},\frac{2\boldsymbol{\alpha}}{\sigma^{2}},\frac{2\mathbf{A}^{*}}{\sigma^{2}}\big{)}$ sdi:

[TABLE]

where $\circ$ denotes the element-wise product of matrices, $\boldsymbol{\alpha}:=\mathrm{diag}\left(\mathbf{A}\right)$ , $\mathbf{A}^{*}:=\mathrm{diag}\left(\boldsymbol{\alpha}\right)-\mathbf{A}$ , and $\mathbf{A}$ is a positive definite matrix. The equilibrium points of the drift are located at $\boldsymbol{\mu}+\mathbf{k}_{0}\pi$ , with $\mathbf{k}_{0}\in\left\{-1,0,1\right\}^{p}$ (we assume implicit wrapping by $\mathrm{cmod}$ in the sums of angles in this section), and are unstable if any component is antipodal, that is, unless $\mathbf{k}_{0}=\mathbf{0}$ (see Figures 1 and 2). The drift is approximately linear in a neighbourhood of $\boldsymbol{\mu}$ , where it has Jacobian $-\mathbf{A}$ . For the unstable points, the drift has Jacobian $-\mathbf{A}\circ(\mathbf{s}\mathbf{s}^{\prime})$ , with $\mathbf{s}=\cos(\mathbf{k}_{0}\pi)$ a vector of signs. In the circular case, the maximal drifts (in absolute value) towards $\mu$ are placed at $\mu\pm\frac{\pi}{2}$ (see Figure 1). For the general case, the maximal marginal drifts for the $j$ -th component happen at $\mu_{j}-\tan^{-1}\big{(}A_{jj}\big{[}\sum_{k\neq j}A_{jk}\sin(\mu_{k}-\theta_{k})\big{]}^{-1}\big{)}+k_{0}\pi$ , $k_{0}\in\left\{-1,0,1\right\}$ .
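The local behaviour at $\boldsymbol{\mu}$ can be checked numerically. The sketch below assumes the component-wise drift $A_{jj}\sin(\mu_{j}-\theta_{j})+\cos(\mu_{j}-\theta_{j})\sum_{k\neq j}A_{jk}\sin(\mu_{k}-\theta_{k})$, obtained here by differentiating the log of the stated MvM sdi and multiplying by $\frac{\sigma^{2}}{2}$. It is a derivation under that assumption, not a transcription of the paper's display, and all function names are hypothetical.

```python
import math

def mvm_drift(theta, mu, A):
    """Assumed MvM drift: A_jj sin(mu_j - th_j) + cos(mu_j - th_j) * sum_{k != j} A_jk sin(mu_k - th_k)."""
    p = len(theta)
    d = [mu[j] - theta[j] for j in range(p)]
    drift = []
    for j in range(p):
        cross = sum(A[j][k] * math.sin(d[k]) for k in range(p) if k != j)
        drift.append(A[j][j] * math.sin(d[j]) + math.cos(d[j]) * cross)
    return drift

def numerical_jacobian(f, x, h=1e-6):
    """Central-difference Jacobian, J[j][k] = d f_j / d x_k."""
    p = len(x)
    J = [[0.0] * p for _ in range(p)]
    for k in range(p):
        xp, xm = list(x), list(x)
        xp[k] += h
        xm[k] -= h
        fp, fm = f(xp), f(xm)
        for j in range(p):
            J[j][k] = (fp[j] - fm[j]) / (2.0 * h)
    return J

# Illustrative values (assumptions): a positive definite A and a mean vector
mu = [0.5, -1.0]
A = [[1.0, 0.5], [0.5, 2.0]]

J_mu = numerical_jacobian(lambda th: mvm_drift(th, mu, A), mu)
print(J_mu)  # should be close to -A
```

The drift also vanishes at points with antipodal components, for instance at $(\mu_{1}+\pi,\mu_{2})$, up to floating-point error.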

2.2.2 Wrapped normal

The pdf of a (multivariate) wrapped normal, $\mathrm{WN}\left(\boldsymbol{\mu},\boldsymbol{\Sigma}\right)$ , is given by $f_{\mathrm{WN}}(\boldsymbol{\theta};\boldsymbol{\mu},\boldsymbol{\Sigma}):=\sum_{\mathbf{k}\in\mathbb{Z}^{p}}\phi_{\boldsymbol{\Sigma}}(\boldsymbol{\theta}-\boldsymbol{\mu}+2\mathbf{k}\pi)$ , with $\boldsymbol{\mu}\in\mathbb{T}^{p}$ , $\boldsymbol{\Sigma}$ a covariance matrix, and $\phi_{\boldsymbol{\Sigma}}$ the pdf of a $\mathcal{N}(\mathbf{0},\boldsymbol{\Sigma})$ . For the sake of clarity of exposition, we first introduce the circular case and then the multivariate extension. Using the OU parametrization, the circular WN process with $\mathrm{WN}\big{(}\mu,\frac{\sigma^{2}}{2\alpha}\big{)}$ sdi is defined as

[TABLE]

Despite the similar shape of the vM and WN densities in the main bulk of the probability, their behaviour is substantially different at antipodality, a fact strengthened in log scale. The WN process drift is a smoothed “sawtooth wave” that has negative slope at $\mu$ and crosses the $x$ -axis at $\mu+k\pi$ , $k\in\{-1,0,1\}$ . Hence, the drift behaves almost linearly in a neighbourhood of $\mu$ (equilibrium point, stable) and rapidly decays to pass across $\mu\pm\pi$ (equilibrium point, unstable). This neighbourhood is larger than for the vM process. There is no separability between $\alpha$ and $\sigma$ and both alter the drift non-trivially. For example, the drift maxima are implicitly given by $\sum_{k\in\mathbb{Z}}k^{2}w_{k}(\theta)-\big{[}\sum_{k\in\mathbb{Z}}kw_{k}(\theta)\big{]}^{2}=\frac{\sigma^{2}}{8\alpha\pi^{2}}$ , and vary from $\mu\pm\pi$ (if $\frac{\sigma^{2}}{2\alpha}\to 0$ , the sdi is degenerate at $\mu$ ) to $\mu\pm\frac{\pi}{2}$ (if $\frac{\sigma^{2}}{2\alpha}\to\infty$ , the sdi is uniform and the drift is null). Thereby, the maximum drifts always happen closer to antipodality than in the vM process (see Figure 1). The slopes of the drift at $\mu$ and $\mu\pm\pi$ are $-\alpha+a_{1}(\alpha,\sigma)$ and $a_{2}(\alpha,\sigma)$ , respectively, where

[TABLE]

The lower and upper bounds for $a_{1}(\alpha,\sigma)$ (respectively, $a_{2}(\alpha,\sigma)$ ) are attained, with $\alpha$ fixed, when $\sigma\to 0$ ( $\sigma\to\infty$ ) and $\sigma\to\infty$ ( $\sigma\to 0$ ), respectively.
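The shape of the circular WN drift can be explored with a short sketch. It assumes the drift is $\alpha\sum_{k}(\mu-\theta-2k\pi)w_{k}(\theta)$, with $w_{k}(\theta)$ the normalised terms of the $\mathrm{WN}\big(\mu,\frac{\sigma^{2}}{2\alpha}\big)$ series (this is $\frac{\sigma^{2}}{2}$ times the derivative of the log-density). The series is truncated at $|k|\leq K$, and the names are illustrative.

```python
import math

def wn_weights(theta, mu, alpha, sigma, K=5):
    """Weights w_k(theta), k = -K..K: normalised Gaussian terms of the WN(mu, sigma^2 / (2 alpha)) series."""
    s2 = sigma**2 / (2.0 * alpha)
    ks = range(-K, K + 1)
    terms = [math.exp(-0.5 * (theta - mu + 2.0 * k * math.pi) ** 2 / s2) for k in ks]
    total = sum(terms)
    return {k: t / total for k, t in zip(ks, terms)}

def wn_drift(theta, mu, alpha, sigma, K=5):
    """Assumed circular WN drift: a weighting of the linear drifts alpha * (mu - theta - 2 k pi)."""
    w = wn_weights(theta, mu, alpha, sigma, K)
    return alpha * sum((mu - theta - 2.0 * k * math.pi) * wk for k, wk in w.items())

# Illustrative values (assumptions)
mu, alpha, sigma = 0.3, 1.0, 1.0
for theta in (mu, mu + 0.5 * math.pi, mu + math.pi):
    print(theta, wn_drift(theta, mu, alpha, sigma))
```

The drift is null at $\mu$ and $\mu+\pi$ by the symmetry of the weights, and it has negative slope at $\mu$.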

The multivariate extension of (6) is the diffusion

[TABLE]

This diffusion has $\mathrm{WN}\left(\boldsymbol{\mu},\frac{1}{2}\mathbf{A}^{-1}\boldsymbol{\Sigma}\right)$ sdi, provided that $\mathbf{A}$ is invertible and such that $\mathbf{A}^{-1}\boldsymbol{\Sigma}$ is a covariance matrix. The drift is null at $\boldsymbol{\mu}+\mathbf{k}_{0}\pi$ , with $\mathbf{k}_{0}\in\left\{-1,0,1\right\}^{p}$ , since $\sum_{\mathbf{k}\in\mathbb{Z}^{p}}(2\mathbf{k}+\mathbf{k}_{0})w_{\mathbf{k}}(\boldsymbol{\mu}+\mathbf{k}_{0}\pi)=\mathbf{0}$ due to the symmetry of $w_{\mathbf{k}}(\boldsymbol{\mu})$ as a function of $\mathbf{k}\in\mathbb{Z}^{p}$ . Properties similar to the circular case can be obtained using that $\nabla w_{\mathbf{k}}(\boldsymbol{\theta})=4\pi\boldsymbol{\Sigma}^{-1}\mathbf{A}w_{\mathbf{k}}(\boldsymbol{\theta})\left[\sum_{\mathbf{m}\in\mathbb{Z}^{p}}\mathbf{m}w_{\mathbf{m}}(\boldsymbol{\theta})-\mathbf{k}\right]$ . For instance, the Jacobian of the drift at $\boldsymbol{\mu}$ is $-\mathbf{A}+8\pi^{2}\mathbf{A}\big{[}\sum_{\mathbf{k}\in\mathbb{Z}^{p}}\mathbf{k}\mathbf{k}^{\prime}w_{\mathbf{k}}(\boldsymbol{\mu})\big{]}\allowbreak\mathbf{A}^{\prime}\boldsymbol{\Sigma}^{-1}$ .

The vector field of the drift has a characteristic tessellated structure that, in the two-dimensional case, is formed by hexagonal-like tiles (see Figure 2). $\boldsymbol{\Sigma}$ alters the tessellation that binds the drifts $\mathbf{A}(\boldsymbol{\mu}-\boldsymbol{\theta}-2\mathbf{k}\pi)$ by modifying $\{w_{\mathbf{k}}(\boldsymbol{\theta}):\mathbf{k}\in\mathbb{Z}^{p}\}$ . This set is the distribution of the winding numbers of $\mathbf{X}\sim\mathcal{N}(\boldsymbol{\mu},\frac{1}{2}\mathbf{A}^{-1}\boldsymbol{\Sigma})$ , since $\mathbb{P}\{\mathrm{wind}(\mathbf{X})=\mathbf{k}\,|\,\mathrm{cmod}\left(\mathbf{X}\right)=\boldsymbol{\theta}\}=w_{\mathbf{k}}(\boldsymbol{\theta})$ and satisfies that $\arg\max_{\mathbf{k}\in\mathbb{Z}^{p}}w_{\mathbf{k}}(\boldsymbol{\theta})=\mathrm{wind}(\boldsymbol{\mu}-\boldsymbol{\theta})$ . Under isotropy (i.e. $\boldsymbol{\Sigma}=\sigma^{2}\mathbf{I}$ ), the larger (respectively, smaller) $\sigma$ , the more spread (concentrated) the distribution of winding numbers is, resulting in flat (peaked) drifts with smooth (rough) transitions in the limits defining the tessellation.
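The winding-number interpretation can be illustrated for $p=1$. The sketch assumes $\mathrm{cmod}(x)=((x+\pi)\bmod 2\pi)-\pi$ and $\mathrm{wind}(x)=\lfloor(x+\pi)/(2\pi)\rfloor$, so that $x=\mathrm{cmod}(x)+2\pi\,\mathrm{wind}(x)$. The variance argument plays the role of $\frac{\sigma^{2}}{2\alpha}$, and the function names are hypothetical.

```python
import math

def cmod(x):
    """Wrap x to [-pi, pi)."""
    return (x + math.pi) % (2.0 * math.pi) - math.pi

def wind(x):
    """Winding number of x, so that x = cmod(x) + 2 * pi * wind(x)."""
    return math.floor((x + math.pi) / (2.0 * math.pi))

def winding_distribution(theta, mu, var, K=4):
    """w_k(theta) for k = -K..K: conditional distribution of the winding number given the wrapped value."""
    ks = list(range(-K, K + 1))
    terms = [math.exp(-0.5 * (theta - mu + 2.0 * k * math.pi) ** 2 / var) for k in ks]
    total = sum(terms)
    return {k: t / total for k, t in zip(ks, terms)}

# Illustrative values (assumptions)
mu, theta, var = 2.5, -2.0, 0.5
w = winding_distribution(theta, mu, var)
most_likely = max(w, key=w.get)
print(most_likely, wind(mu - theta))
```

Increasing `var` spreads the weights over more winding numbers, which corresponds to the smoother transitions between tiles described above.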

2.2.3 Jones and Pewsey (2005)’s circular distribution

The Jones and Pewsey (2005) distribution, $\mathrm{JP}(\mu,\kappa,\psi)$ , is a tractable family of symmetric and unimodal circular distributions that contains the Wrapped Cauchy (WC, $\psi=-1$ ), Cardioid (Ca, $\psi=1$ ), and von Mises ( $\psi\to 0$ ) distributions. Its pdf is $f_{\mathrm{JP}}(\theta;\mu,\kappa,\psi):=(2\pi\allowbreak P_{1/\psi}(\cosh(\kappa\psi)))^{-1}(\cosh(\kappa\psi)\allowbreak+\allowbreak\sinh(\kappa\psi)\allowbreak\cos(\theta-\mu))^{1/\psi}$ , with $\mu\in[-\pi,\pi)$ , $\kappa\geq 0$ , $\psi\in\mathbb{R}$ , and $P_{\nu}$ the Legendre function of the first kind and order $\nu$ .

The diffusion with $\mathrm{JP}\big{(}\mu,\frac{2\alpha}{\sigma^{2}},\psi\sigma^{2}\big{)}$ sdi, parametrised to yield (1) as a particular case, is

[TABLE]

The maximal drifts, located at $\mu\pm(\frac{\pi}{2}+\allowbreak\sin^{-1}(\tanh(2\alpha\psi)))$ , are closer to the equilibrium mean $\mu$ when $\psi<0$ and to the antipodal mean when $\psi>0$ . The slope of the drift at $\mu$ is $\frac{e^{-4\alpha\psi}-1}{4\psi}$ , which is negative for any $\psi\neq 0$ and tends to $-\alpha$ (the vM case) as $\psi\to 0$ . At $\mu\pm\pi$ , it is $\frac{e^{4\alpha\psi}-1}{4\psi}$ , which is positive. This relates to the fact that the drifts with $\psi<0$ equal the ones with $\psi>0$ , once translated by $\pm\pi$ and reflected around $\mu$ . Hence, whilst the WC diffusion features a drift attracting the process towards a tight neighbourhood around $\mu$ , the Ca diffusion repulses the process from $\mu\pm\pi$ and weakly attracts it towards $\mu$ (see trajectories and drifts in Figure 1).
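These properties can be verified numerically. The sketch below assumes the drift $\frac{\sinh(2\alpha\psi)\sin(\mu-\theta)}{2\psi(\cosh(2\alpha\psi)+\sinh(2\alpha\psi)\cos(\theta-\mu))}$, obtained here as $\frac{\sigma^{2}}{2}$ times the derivative of the log of the stated JP sdi, for $\psi\neq 0$. The helper names are illustrative.

```python
import math

def jp_drift(theta, mu, alpha, psi):
    """Assumed JP drift: (sigma^2 / 2) d/dtheta log f_JP for the JP(mu, 2 alpha / sigma^2, psi sigma^2) sdi."""
    c = math.cosh(2.0 * alpha * psi)
    s = math.sinh(2.0 * alpha * psi)
    return s * math.sin(mu - theta) / (2.0 * psi * (c + s * math.cos(theta - mu)))

def numerical_slope(f, x, h=1e-6):
    """Central-difference derivative of f at x."""
    return (f(x + h) - f(x - h)) / (2.0 * h)

def jp_slopes(mu, alpha, psi):
    """Numerical slopes of the drift at mu and at the antipode mu + pi."""
    drift = lambda th: jp_drift(th, mu, alpha, psi)
    return numerical_slope(drift, mu), numerical_slope(drift, mu + math.pi)

def jp_peak(mu, alpha, psi):
    """Location of the maximal positive drift (to the left of mu)."""
    return mu - (0.5 * math.pi + math.asin(math.tanh(2.0 * alpha * psi)))

for psi in (-1.0, 1.0):  # wrapped Cauchy and cardioid cases
    print(psi, jp_slopes(0.0, 1.0, psi), jp_peak(0.0, 1.0, psi))
```

With $\alpha=1$, the case $\psi=-1$ gives a steep negative slope at $\mu$ and a mild positive slope at the antipode, while $\psi=1$ gives the reverse magnitudes, in line with the qualitative description of the WC and Ca diffusions.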

2.2.4 Mixtures of independent von Mises

The density of an $m$ -mixture of independent von Mises distributions, $\mathrm{mivM}(\mathbf{M},\mathbf{K},\mathbf{p})$ , is given by $f_{\mathrm{mivM}}(\boldsymbol{\theta};\mathbf{M},\mathbf{K},\mathbf{p}):=\sum_{j=1}^{m}p_{j}f_{\mathrm{MvM}}(\boldsymbol{\theta};\boldsymbol{\mu}_{j},\boldsymbol{\kappa}_{j},\mathbf{0})$ , with $\mathbf{M}:=(\boldsymbol{\mu}_{1},\ldots,\boldsymbol{\mu}_{m})^{\prime}$ , $\mathbf{K}:=(\boldsymbol{\kappa}_{1},\ldots,\boldsymbol{\kappa}_{m})^{\prime}$ , $\mathbf{p}:=(p_{1},\ldots,p_{m})^{\prime}$ , $p_{j}\geq 0$ for $j=1,\ldots,m$ , and $\sum_{j=1}^{m}p_{j}=1$ . The mivM distribution is a highly flexible tool for modelling multimodal and skewed circular data (Banerjee et al., 2005), and has tractability as a key advantage: the normalizing constant is known and estimation by the Expectation-Maximization (EM) algorithm is relatively easy. Setting $\mathbf{A}=(\boldsymbol{\alpha}_{1},\ldots,\boldsymbol{\alpha}_{m})^{\prime}$ and $\boldsymbol{\kappa}_{j}=\frac{2\boldsymbol{\alpha}_{j}}{\sigma^{2}}$ ,

[TABLE]

has $\mathrm{mivM}\big{(}\mathbf{M},\frac{2\mathbf{A}}{\sigma^{2}},\mathbf{p}\big{)}$ sdi. The mivM process drift is a weighted average of the corresponding component drifts, whose weights are the posterior probabilities of drawing $\boldsymbol{\Theta}_{t}$ from the mixture components of the sdi. The drift behaves locally around $\boldsymbol{\mu}_{j}$ as $\boldsymbol{\alpha}_{j}\circ\sin(\boldsymbol{\mu}_{j}-\boldsymbol{\theta})v_{j}(\boldsymbol{\mu}_{j})+\mathbf{b}_{j}$ , with $\mathbf{b}_{j}=\sum_{k\neq j}\boldsymbol{\alpha}_{k}\circ\sin(\boldsymbol{\mu}_{k}-\boldsymbol{\mu}_{j})v_{k}(\boldsymbol{\mu}_{j})$ (Figures 1 and 2). Then, $\boldsymbol{\mu}_{j}$ is only an asymptotic equilibrium point for $\sigma\to 0$ , since $\lim_{\sigma\to 0}v_{k}(\boldsymbol{\mu}_{j})=\delta_{jk}$ . The larger $\sigma$ , the smoother the binding of the component drifts is.
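The posterior-weighting structure can be sketched for $p=1$. The code assumes the drift $\sum_{j}\alpha_{j}\sin(\mu_{j}-\theta)v_{j}(\theta)$, with $v_{j}(\theta)$ the posterior probability of component $j$ under the sdi (this follows from differentiating the log of the mixture density). The Bessel function is computed with a plain power series, and all names are illustrative.

```python
import math

def bessel_i0(x, terms=80):
    """Modified Bessel function I_0(x) via its power series sum_k (x^2 / 4)^k / (k!)^2."""
    total, term = 0.0, 1.0
    for k in range(terms):
        total += term
        term *= (x * x / 4.0) / ((k + 1) ** 2)
    return total

def vm_pdf(theta, mu, kappa):
    """Circular von Mises density."""
    return math.exp(kappa * math.cos(theta - mu)) / (2.0 * math.pi * bessel_i0(kappa))

def mivm_weights(theta, mus, alphas, probs, sigma):
    """Posterior probabilities v_j(theta) of the mixture components, with kappa_j = 2 alpha_j / sigma^2."""
    kappas = [2.0 * a / sigma**2 for a in alphas]
    parts = [p * vm_pdf(theta, m, k) for p, m, k in zip(probs, mus, kappas)]
    total = sum(parts)
    return [part / total for part in parts]

def mivm_drift(theta, mus, alphas, probs, sigma):
    """Assumed circular mivM drift: posterior-weighted average of the vM component drifts."""
    v = mivm_weights(theta, mus, alphas, probs, sigma)
    return sum(a * math.sin(m - theta) * vj for a, m, vj in zip(alphas, mus, v))

# Illustrative two-component mixture (assumed values)
mus, alphas, probs = [-1.5, 1.0], [1.0, 2.0], [0.4, 0.6]
print(mivm_drift(0.0, mus, alphas, probs, sigma=1.0))
print(mivm_weights(mus[0], mus, alphas, probs, sigma=0.3))  # nearly (1, 0) for small sigma
```

For small $\sigma$ the weights evaluated at a component mean concentrate on that component, which matches the limit $v_{k}(\boldsymbol{\mu}_{j})\to\delta_{jk}$.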

3 Estimation for toroidal diffusions

We focus now on the estimation of the vector parameter $\boldsymbol{\lambda}$ of a toroidal diffusion

[TABLE]

when the data are observations at discrete time points, $\{\boldsymbol{\Theta}_{\Delta i}\}_{i=0}^{N}$ . For simplicity, we assume that the time points are equidistant in the time interval $[0,T]$ , $T=N\Delta$ . The Maximum Likelihood Estimator (MLE) for $\boldsymbol{\lambda}\in\Lambda$ is given by

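$$\hat{\boldsymbol{\lambda}}_{\mathrm{MLE}}:=\arg\max_{\boldsymbol{\lambda}\in\Lambda}l\big(\boldsymbol{\lambda};\{\boldsymbol{\Theta}_{\Delta i}\}_{i=0}^{N}\big),$$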

where, using the Markovianity of (9), the log-likelihood is given by

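$$l\big(\boldsymbol{\lambda};\{\boldsymbol{\Theta}_{\Delta i}\}_{i=0}^{N}\big)=\log p(\boldsymbol{\Theta}_{0};\boldsymbol{\lambda})+\sum_{i=1}^{N}\log p_{\Delta}(\boldsymbol{\Theta}_{\Delta i}\,|\,\boldsymbol{\Theta}_{\Delta(i-1)};\boldsymbol{\lambda}).\qquad(10)$$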

Here $p_{\Delta}(\cdot\,|\,\cdot;\boldsymbol{\lambda})$ is the tpd of (9). The first term in (10) is often disregarded or set to the sdi of (9). Maximum likelihood estimation is, under weak regularity conditions, consistent and asymptotically efficient when $N\to\infty$ with fixed $\Delta$ (Dacunha-Castelle and Florens-Zmirou, 1986), or when $\Delta\to 0$ and $T\to\infty$ (Sørensen, 2008). However, it can rarely be readily performed, as usually no explicit expression for the tpd exists and this tpd is only given implicitly as the solution to (3) on $\mathbb{T}^{p}$ .

In the following we present and analyse several estimation strategies to circumvent the unavailability of the tpd when dealing with toroidal diffusions. All these methods rely on an approximate likelihood function, where the unknown tpd is replaced by an approximation. For the sake of brevity, we suppress the implicitly assumed dependence on $\boldsymbol{\lambda}$ in the notation.

3.1 Estimation based on the stationary distribution

The simplest approximate likelihood function is obtained by replacing the tpd by the stationary density of (9). Usually, the sdi depends only on a function $\boldsymbol{\lambda}^{\nu}$ of $\boldsymbol{\lambda}$ . For instance, for the WN process, $\boldsymbol{\lambda}=(\mathbf{A},\boldsymbol{\mu},\boldsymbol{\Sigma})$ and $\boldsymbol{\lambda}^{\nu}=(\boldsymbol{\mu},\frac{1}{2}\mathbf{A}^{-1}\boldsymbol{\Sigma})$ . Therefore, we denote the stationary density by $\nu(\cdot;\boldsymbol{\lambda}^{\nu})$ and state the Stationary MLE (SMLE) of $\boldsymbol{\lambda}^{\nu}$ as

[TABLE]

For the vM process, SMLE is semi-explicit (Mardia and Jupp, 2000, page 198). The JP distribution has implicit SMLE and is discussed in Jones and Pewsey (2005, Section 3). Effective estimation in MvM distributions involves pseudo-likelihood (Mardia et al., 2008). Finally, inference for mivM distributions can be carried out by the EM algorithm (Banerjee et al., 2005). The simple SMLE is of interest for three reasons: (i) for stationary ergodic processes, it is consistent for $\boldsymbol{\lambda}^{\nu}$ as $N\to\infty$ for fixed $\Delta$ (Kessler, 2000); (ii) $\hat{\boldsymbol{\lambda}}^{\nu}_{\mathrm{SMLE}}$ can be used as a sensible starting value in the optimization routines of more sophisticated procedures; (iii) it can be supplemented by estimators of the rest of $\boldsymbol{\lambda}$ (see Bibby and Sørensen (2001), for example).

When the unidentifiability of $\boldsymbol{\lambda}$ by SMLE involves the diffusion matrix $\boldsymbol{\Sigma}$ , an estimator of $\boldsymbol{\lambda}$ can be obtained by an estimator of $\boldsymbol{\Sigma}$ that is unrelated to the SMLE. Conditionally on $\boldsymbol{\Theta}_{\Delta(i-1)}$ , $\boldsymbol{\Theta}_{\Delta i}$ is approximately distributed as $\mathrm{WN}(\boldsymbol{\Theta}_{\Delta(i-1)},\Delta\boldsymbol{\Sigma})$ when $\Delta\approx 0$ (high frequency observations). This, plus the high concentration of such WN distribution (see Remark 3 below), gives

[TABLE]

Thus, an approximate MLE of $\boldsymbol{\Sigma}$ is

[TABLE]

Under isotropy, $\hat{\sigma}_{\mathrm{HF}}^{2}:=p^{-1}\mathrm{tr}[\hat{\boldsymbol{\Sigma}}_{\mathrm{HF}}]$ . The Euclidean counterpart of (12) is well-known to be a consistent estimator of $\boldsymbol{\Sigma}$ as $\Delta\rightarrow 0$ (for fixed $T$ ) due to the convergence in probability to the quadratic variation. The consistency for $\hat{\boldsymbol{\Sigma}}_{\mathrm{HF}}$ follows easily from this result.

The estimator (12) gives a practical method to disentangle the unidentifiability inherent to SMLE. We illustrate this with the WN process. The SMLE $(\hat{\boldsymbol{\mu}},\hat{\mathbf{S}})_{\mathrm{SMLE}}$ for $(\boldsymbol{\mu},\mathbf{S})$ , where $\mathbf{S}=\frac{1}{2}\mathbf{A}^{-1}\boldsymbol{\Sigma}$ , can be found by optimizing (11). The circular means, $\hat{\boldsymbol{\mu}}_{c}:=\mathrm{atan2}\big{(}\sum_{i=1}^{N}\sin(\boldsymbol{\Theta}_{i\Delta}),\allowbreak\sum_{i=1}^{N}\cos(\boldsymbol{\Theta}_{i\Delta})\big{)}$ , and the high-concentration estimate of $\mathbf{S}$ , $\frac{1}{N}\sum_{i=1}^{N}\mathrm{cmod}\left(\boldsymbol{\Theta}_{i\Delta}-\hat{\boldsymbol{\mu}}_{c}\right)\allowbreak\mathrm{cmod}\left(\boldsymbol{\Theta}_{i\Delta}-\hat{\boldsymbol{\mu}}_{c}\right)^{\prime}$ , can be used as starting values. $(\hat{\boldsymbol{\mu}},\hat{\mathbf{S}})_{\mathrm{SMLE}}$ and (12) give $\hat{\mathbf{A}}:=\frac{1}{2}\hat{\boldsymbol{\Sigma}}_{\mathrm{HF}}\hat{\mathbf{S}}_{\mathrm{SMLE}}^{-1}$ , resulting in $\hat{\boldsymbol{\lambda}}=(\hat{\mathbf{A}},\hat{\boldsymbol{\mu}}_{\mathrm{SMLE}},\hat{\boldsymbol{\Sigma}}_{\mathrm{HF}})$ . Similar approaches can be followed for the rest of the diffusions presented in Section 2.
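The two-stage idea can be sketched for the circular WN process. The code makes several assumptions: the high-frequency estimator is taken as $\hat{\sigma}^{2}_{\mathrm{HF}}=\frac{1}{N\Delta}\sum_{i}\mathrm{cmod}(\Theta_{i\Delta}-\Theta_{(i-1)\Delta})^{2}$; the stationary variance is estimated by the high-concentration starting value rather than by a full SMLE optimisation; and the trajectory is simulated with a wrapped Euler scheme. Names and parameter values are illustrative.

```python
import math
import random

def cmod(x):
    """Wrap x to [-pi, pi)."""
    return (x + math.pi) % (2.0 * math.pi) - math.pi

def wn_drift(theta, mu, alpha, sigma, K=2):
    """Assumed circular WN drift (series truncated at |k| <= K)."""
    s2 = sigma**2 / (2.0 * alpha)
    ks = range(-K, K + 1)
    terms = [math.exp(-0.5 * (theta - mu + 2.0 * k * math.pi) ** 2 / s2) for k in ks]
    total = sum(terms)
    return alpha * sum((mu - theta - 2.0 * k * math.pi) * t for k, t in zip(ks, terms)) / total

def simulate(mu, alpha, sigma, delta, n, rng):
    """Wrapped Euler simulation with step delta, started at mu."""
    path = [mu]
    for _ in range(n):
        th = path[-1]
        step = wn_drift(th, mu, alpha, sigma) * delta + sigma * math.sqrt(delta) * rng.gauss(0.0, 1.0)
        path.append(cmod(th + step))
    return path

mu, alpha, sigma, delta, n = 0.5, 1.0, 0.5, 0.05, 50000
path = simulate(mu, alpha, sigma, delta, n, random.Random(1))

# Stage 1: diffusion coefficient from the wrapped increments
sigma2_hf = sum(cmod(b - a) ** 2 for a, b in zip(path[:-1], path[1:])) / (n * delta)

# Stage 2: stationary parameters (circular mean and high-concentration variance)
mu_c = math.atan2(sum(math.sin(t) for t in path), sum(math.cos(t) for t in path))
s2_hat = sum(cmod(t - mu_c) ** 2 for t in path) / len(path)

# Disentangle alpha from S = sigma^2 / (2 alpha)
alpha_hat = sigma2_hf / (2.0 * s2_hat)
print(sigma2_hf, mu_c, s2_hat, alpha_hat)
```

In practice the second stage would be replaced by the optimisation of (11); the closed-form moments above are only the suggested starting values.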

3.2 Adapted Euler and Shoji–Ozaki pseudo-likelihoods

The well-known Euler pseudo-likelihood can be adapted for toroidal diffusions with minor changes. The Euler scheme arises as the first-order discretization of the process, where the drift and diffusion coefficient are held constant between consecutive observation times. After wrapping, the scheme becomes

[TABLE]

where $\mathbf{Z}^{i}\sim\mathcal{N}(\mathbf{0},\mathbf{I})$ , $i=1,\ldots,N$ . The wrapping yields the Euler pseudo-tpd

[TABLE]

When $\Delta\to\infty$ , the Euler pseudo-tpd converges to the uniform distribution in $\mathbb{T}^{p}$ by spreading its probability mass whilst the mean moves along the wrapped line $\{\mathrm{cmod}\left(\boldsymbol{\varphi}+b(\boldsymbol{\varphi})\Delta\right):\Delta>0\}$ . The Euler pseudo-likelihood is obtained from (10) by replacing the tpd by the Euler pseudo-tpd.
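For $p=1$ the construction can be sketched as follows, assuming the Euler pseudo-tpd is the density of $\mathrm{WN}\big(\varphi+b(\varphi)\Delta,\sigma^{2}\Delta\big)$ evaluated at $\theta$. The truncation rule for the WN series and all function names are choices of this sketch.

```python
import math

def cmod(x):
    """Wrap x to [-pi, pi)."""
    return (x + math.pi) % (2.0 * math.pi) - math.pi

def wn_pdf(theta, mu, var):
    """Circular WN(mu, var) density, with a truncation that widens with the standard deviation."""
    K = 1 + int(8.0 * math.sqrt(var) / (2.0 * math.pi))
    x = cmod(theta - mu)
    series = sum(math.exp(-0.5 * (x + 2.0 * k * math.pi) ** 2 / var) for k in range(-K, K + 1))
    return series / math.sqrt(2.0 * math.pi * var)

def euler_tpd(theta, phi, delta, drift, sigma):
    """Assumed Euler pseudo-tpd: WN density with mean phi + b(phi) * delta and variance sigma^2 * delta."""
    return wn_pdf(theta, phi + drift(phi) * delta, sigma**2 * delta)

def euler_loglik(path, delta, drift, sigma):
    """Pseudo-log-likelihood: sum of log pseudo-tpds over consecutive observations."""
    return sum(math.log(euler_tpd(b, a, delta, drift, sigma)) for a, b in zip(path[:-1], path[1:]))

# Illustrative vM drift with mu = 0 and alpha = 1 (assumed values)
vm_drift = lambda th: math.sin(0.0 - th)
print(euler_tpd(0.2, 0.1, 0.1, vm_drift, 0.5))
print(euler_tpd(0.2, 0.1, 1000.0, vm_drift, 0.5), 1.0 / (2.0 * math.pi))  # near-uniform for large delta
print(euler_loglik([0.1, 0.2, 0.15, -0.1], 0.1, vm_drift, 0.5))
```

Maximising `euler_loglik` over the drift and diffusion parameters would give the corresponding pseudo-likelihood estimate.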

The Shoji–Ozaki (Shoji and Ozaki, 1998) scheme uses a linear approximation for the drift and assumes the diffusion coefficient constant between observation times: for $t\in[s,s+\Delta)$ , $b(\mathbf{X}_{t})\approx b(\mathbf{X}_{s})+\mathbf{J}_{s}(\mathbf{X}_{t}-\mathbf{X}_{s})$ , where $\mathbf{J}_{s}=J(\mathbf{X}_{s})$ denotes the Jacobian of $b$ at $\mathbf{X}_{s}$ . This gives the linear approximating SDE

[TABLE]

Conditionally on $\mathbf{X}_{s}$ , this is a multivariate OU process. Hence, $\mathbf{X}_{t}\,|\,\mathbf{X}_{s}\sim\mathcal{N}(\boldsymbol{\mu}_{t},\boldsymbol{\Gamma}_{t})$ , with $\boldsymbol{\mu}_{t}:=\mathbf{X}_{s}+\mathbf{J}_{s}^{-1}(\exp\{\mathbf{J}_{s}(t-s)\}-\mathbf{I})b(\mathbf{X}_{s})$ , $\boldsymbol{\Gamma}_{t}:=\int_{s}^{t}\exp\{\mathbf{J}_{s}(t-u)\}\mathbf{V}_{s}\exp\{\mathbf{J}_{s}^{\prime}(t-u)\}\mathrm{d}u$ , and $\mathbf{V}_{s}=\boldsymbol{\sigma}_{s}\boldsymbol{\sigma}_{s}^{\prime}$ . If $\mathbf{J}_{s}$ has no pair of reverse-sign eigenvalues, then

[TABLE]

If $\mathbf{V}_{s}^{-1}\mathbf{J}_{s}$ is symmetric, then $\boldsymbol{\Gamma}_{t}$ admits a more explicit form (a similar argument is given in Roberts and Stramer (2002), albeit in their equation (24) the covariance matrix is not symmetric, probably because of a typo in (25), which should have been $(J(x)a_{x,h})^{\prime}=J(x)a_{x,h}$ ):

[TABLE]

Interestingly, for the Langevin family of diffusions, $\mathbf{V}_{s}^{-1}\mathbf{J}_{s}$ is guaranteed to be symmetric as long as the diffusion coefficient is constant. This is due to the particular form of (5), which gives $\mathbf{J}_{s}=\frac{1}{2}\mathbf{V}_{s}\boldsymbol{\mathcal{H}}_{s}$ , where $\boldsymbol{\mathcal{H}}_{s}$ stands for the Hessian of $\log f$ at $\mathbf{X}_{s}$ . Therefore, (14) simplifies notably the evaluation of the Shoji–Ozaki pseudo-likelihood for all the toroidal diffusions considered in this paper.

The Shoji–Ozaki pseudo-tpd for toroidal diffusions given next relies on a local linear approximation of $b$ , as in Ozaki (1985) (for the case $p=1$ ), which yields a simpler pseudo-likelihood than the drift approximation by Itô’s formula used in Shoji and Ozaki (1998). Without this extra simplification, the expectation $E_{\Delta}(\boldsymbol{\varphi})$ defined below becomes $\tilde{E}_{\Delta}(\boldsymbol{\varphi})=E_{\Delta}(\boldsymbol{\varphi})+J(\boldsymbol{\varphi})^{-2}(\exp\{J(\boldsymbol{\varphi})\Delta\}-\mathbf{I}-J(\boldsymbol{\varphi})\Delta)M(\boldsymbol{\varphi})$ with $M(\boldsymbol{\varphi})=\frac{1}{2}\left(\mathrm{tr}\left[\mathbf{V}(\boldsymbol{\varphi})\mathbf{H}_{1}(\boldsymbol{\varphi})\right],\ldots,\mathrm{tr}\left[\mathbf{V}(\boldsymbol{\varphi})\mathbf{H}_{p}(\boldsymbol{\varphi})\right]\right)^{\prime}$ and $\mathbf{H}_{i}(\boldsymbol{\varphi})=\left(\tfrac{\partial^{2}b_{i}(\boldsymbol{\varphi})}{\partial\phi_{k}\partial\phi_{l}}\right)_{1\leq k,l\leq p}$ , $i=1,\ldots,p$ . The pseudo-tpd is

[TABLE]

where, assuming that $V(\boldsymbol{\varphi})^{-1}J(\boldsymbol{\varphi})$ is symmetric (otherwise use (13) instead of (14)),

[TABLE]

When $J(\boldsymbol{\varphi})$ shrinks to $\mathbf{0}$ , then $E_{\Delta}(\boldsymbol{\varphi})\approx\boldsymbol{\varphi}+b(\boldsymbol{\varphi})\Delta$ and $V_{\Delta}(\boldsymbol{\varphi})\approx V(\boldsymbol{\varphi})\Delta$ , so the Euler scheme follows by continuity. If all the real parts of the eigenvalues of $J(\boldsymbol{\varphi})$ are negative, then $p^{\mathrm{SO}}_{\Delta}(\boldsymbol{\theta}\,|\,\boldsymbol{\varphi})\underset{\Delta\to\infty}{\longrightarrow}f_{\mathrm{WN}}\big{(}\boldsymbol{\theta};\boldsymbol{\varphi}-J(\boldsymbol{\varphi})^{-1}b(\boldsymbol{\varphi}),-\frac{1}{2}J(\boldsymbol{\varphi})^{-1}V(\boldsymbol{\varphi})\big{)}$ and the pseudo-tpd does not degenerate into a uniform density as Euler’s does. Otherwise, the pseudo-tpd converges to the uniform distribution in $\mathbb{T}^{p}$ exponentially fast (see Figure 3), at a rate controlled by the maximum positive real part of the eigenvalues.
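Both limits can be checked in the scalar case. The sketch assumes the $p=1$ moments $E_{\Delta}(\varphi)=\varphi+b(\varphi)\frac{e^{J\Delta}-1}{J}$ and $V_{\Delta}(\varphi)=\sigma^{2}\frac{e^{2J\Delta}-1}{2J}$, which are the conditional mean and variance of the linearised (OU-type) SDE, with $J=b^{\prime}(\varphi)$. The function name and the threshold `tol` are illustrative.

```python
import math

def so_moments(phi, b, J, sigma, delta, tol=1e-10):
    """Mean and variance of the p = 1 Shoji-Ozaki pseudo-tpd before wrapping."""
    if abs(J) < tol:
        # J -> 0: the Euler moments follow by continuity
        return phi + b * delta, sigma**2 * delta
    mean = phi + b * math.expm1(J * delta) / J
    var = sigma**2 * math.expm1(2.0 * J * delta) / (2.0 * J)
    return mean, var

# Illustrative vM drift b(theta) = alpha * sin(mu - theta), so J = -alpha * cos(mu - theta)
alpha, mu, sigma, phi = 1.0, 0.0, 0.8, 0.4
b = alpha * math.sin(mu - phi)
J = -alpha * math.cos(mu - phi)

print(so_moments(phi, b, J, sigma, 0.1))   # short horizon
print(so_moments(phi, b, J, sigma, 50.0))  # long horizon with J < 0
print(phi - b / J, -sigma**2 / (2.0 * J))  # limiting mean and variance
```

Using `math.expm1` avoids the loss of precision of `exp(x) - 1` when $J\Delta$ is small.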

A disadvantage of these pseudo-likelihoods is that they are unimodal, so they cannot capture the multimodality of the tpd, a distinctive feature of toroidal diffusions.

Remark 3.

Evaluating $f_{\mathrm{WN}}(\cdot;\boldsymbol{\mu},\boldsymbol{\Sigma})$ for the above pseudo-tpds is a computationally demanding task. Several approximations are possible:

i. High-concentration. Use the closest winding number as a one-term truncation of the series, i.e., $\phi_{\boldsymbol{\Sigma}}(\mathrm{cmod}\left(\cdot-\boldsymbol{\mu}\right))$ .

ii. Fixed truncation. Mardia and Jupp (2000, page 50) suggest (for $p=1$ ) $\sum_{\mathbf{k}\in\{-1,0,1\}^{p}}\phi_{\boldsymbol{\Sigma}}(\cdot-\boldsymbol{\mu}+2\mathbf{k}\pi)$ , which is usually enough for practical purposes if the argument lies in $\mathbb{T}^{p}$ .

iii. Von Mises moment matching. Use the approximation $\mathrm{WN}(\mu,\sigma^{2})\approx\mathrm{vM}(\mu,A_{1}^{-1}(e^{-\sigma^{2}/2}))$ , with $A_{1}(\kappa)=\mathcal{I}_{1}(\kappa)/\mathcal{I}_{0}(\kappa)$ (Mardia and Jupp, 2000, page 38). This approximation generalizes easily to the multivariate case only if $\boldsymbol{\Sigma}$ is diagonal. For the bivariate case, an alternative is to use a von Mises score matching (Mardia, 2017).

iv. Adaptive truncation. The “ $3\sigma$ adaptive truncation” of Jona-Lasinio et al. (2012) can be generalized to the multivariate case by a Bonferroni argument: $\sum_{\mathbf{k}=\mathbf{k}_{L}}^{\mathbf{k}_{U}}\phi_{\boldsymbol{\Sigma}}(\cdot-\boldsymbol{\mu}+2\mathbf{k}\pi)$ with $\mathbf{k}_{U}=-\mathbf{k}_{L}=1+\lfloor z_{1-\alpha/(2p)}\sqrt{\mathrm{diag}\left(\boldsymbol{\Sigma}\right)}\allowbreak/(2\pi)\rfloor$ , where $z_{\alpha}$ is the upper $\alpha$ -quantile of a $\mathcal{N}(0,1)$ , ensures a probability mass in $\mathbb{T}^{p}$ larger than $1-\alpha$ .

For $p=1,2$ , a simple compromise between tractability and accuracy is combining i and ii into $\sum_{\mathbf{k}\in\{-1,0,1\}^{p}}\allowbreak\phi_{\boldsymbol{\Sigma}}(\mathrm{cmod}\left(\cdot-\boldsymbol{\mu}\right)+2\mathbf{k}\pi)$ , which has a probability coverage of $\mathbb{T}^{p}$ larger than $1-2\sum_{j=1}^{p}\Phi(-\tfrac{3\pi}{\sigma_{j}})$ .
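A small comparison for $p=1$ illustrates the trade-off between approximations i, ii, and their combination. The reference density is a longer truncation of the same series, and the function names are hypothetical.

```python
import math

def cmod(x):
    """Wrap x to [-pi, pi)."""
    return (x + math.pi) % (2.0 * math.pi) - math.pi

def phi(x, sigma):
    """Density of a N(0, sigma^2)."""
    return math.exp(-0.5 * (x / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))

def wn_reference(theta, mu, sigma, K=10):
    """Reference WN density with a long truncation of the series."""
    return sum(phi(theta - mu + 2.0 * k * math.pi, sigma) for k in range(-K, K + 1))

def wn_high_concentration(theta, mu, sigma):
    """Approximation i: only the closest winding number."""
    return phi(cmod(theta - mu), sigma)

def wn_fixed_truncation(theta, mu, sigma):
    """Approximation ii: winding numbers in {-1, 0, 1}."""
    return sum(phi(theta - mu + 2.0 * k * math.pi, sigma) for k in (-1, 0, 1))

def wn_combined(theta, mu, sigma):
    """Combination of i and ii: centre with cmod, then use winding numbers in {-1, 0, 1}."""
    return sum(phi(cmod(theta - mu) + 2.0 * k * math.pi, sigma) for k in (-1, 0, 1))

def max_abs_error(approx, mu, sigma, n=200):
    """Maximum absolute error against the reference on an equispaced grid of [-pi, pi)."""
    grid = [-math.pi + 2.0 * math.pi * i / n for i in range(n)]
    return max(abs(approx(t, mu, sigma) - wn_reference(t, mu, sigma)) for t in grid)

for name, f in [("i", wn_high_concentration), ("ii", wn_fixed_truncation), ("i+ii", wn_combined)]:
    print(name, max_abs_error(f, mu=2.5, sigma=1.0))
```

The one-term truncation is the cheapest, but its error is largest near antipodality, where two winding numbers contribute equally.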

3.3 Wrapped Ornstein–Uhlenbeck approximation of the WN process

We now present a specific approximation for the tpd of the WN process that is able to capture the multimodality of the tpd. Multimodality is not uncommon for toroidal diffusions since each coordinate can move towards its mean value in two directions and, contrary to what happens with the OU process, this implies that neither the WN nor the MvM processes have tpds within the parametric families of the sdis.

The approximation relies on the connection of the WN process with the tractable multivariate OU process:

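$$\mathrm{d}\mathbf{X}_{t}=\mathbf{A}(\boldsymbol{\mu}-\mathbf{X}_{t})\mathrm{d}t+\boldsymbol{\Sigma}^{\frac{1}{2}}\mathrm{d}\mathbf{W}_{t},\qquad(15)$$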

with $\boldsymbol{\mu}\in\mathbb{R}^{p}$ , $\boldsymbol{\Sigma}$ a covariance matrix, and $\mathbf{A}$ such that $\mathbf{A}^{-1}\boldsymbol{\Sigma}$ is a covariance matrix. The last assumption ensures that the OU process is ergodic and time-reversible and, as a consequence, implies a simple expression for the covariance matrix of the tpd (see below). Under this setting, the stationary density of the process is $\mathcal{N}\big{(}\boldsymbol{\mu},\frac{1}{2}\mathbf{A}^{-1}\boldsymbol{\Sigma}\big{)}$ . We denote by WOU, standing for Wrapped multivariate OU process, the wrapping of (15). Assuming that $\mathbf{X}_{s}\sim\mathcal{N}\big{(}\boldsymbol{\mu},\frac{1}{2}\mathbf{A}^{-1}\boldsymbol{\Sigma}\big{)}$ , the conditional density of WOU is given by Proposition 1 and the tpd of (15):

[TABLE]

where, by the same argument used in (14),

[TABLE]

The conditional density (16) can be seen as a wrapping of the tpd of (15) weighted by the sdi of the winding numbers, which resembles the structure of the WN process drift: a weighting of linear drifts like (15) according to the winding number sdi in order to achieve periodicity. Although (16) and the tpd of the WN process are different, they behave similarly in many situations. The next corollary from Proposition 1 formalizes these arguments.

Corollary 1.

Suppose $\{\boldsymbol{\Theta}_{t}\}$ solves (7) with $\boldsymbol{\Theta}_{0}=\boldsymbol{\theta}_{0}$ and let $\{\boldsymbol{\Theta}^{\mathrm{WOU}}_{t}\}$ be the wrapping of the solution to (15), where $\mathbf{X}_{0}\sim\mathcal{N}(\boldsymbol{\mu},\frac{1}{2}\mathbf{A}^{-1}\boldsymbol{\Sigma})$ . We condition, moreover, on $\boldsymbol{\Theta}^{\mathrm{WOU}}_{0}=\boldsymbol{\theta}_{0}$ . Then:

i. As $t\rightarrow 0$ , $\boldsymbol{\Theta}_{t}\rightarrow\boldsymbol{\theta}_{0}$ and $\boldsymbol{\Theta}^{\mathrm{WOU}}_{t}\rightarrow\boldsymbol{\theta}_{0}$ in probability.

ii. As $t\rightarrow\infty$ , both $\boldsymbol{\Theta}_{t}$ and $\boldsymbol{\Theta}^{\mathrm{WOU}}_{t}$ converge in distribution to a $\mathrm{WN}(\boldsymbol{\mu},\tfrac{1}{2}\mathbf{A}^{-1}\boldsymbol{\Sigma})$ .

iii. When $\mathbf{A}^{-1}\boldsymbol{\Sigma}\rightarrow\mathbf{0}$ with $\boldsymbol{\Sigma}$ bounded, $\boldsymbol{\Theta}_{t}-\boldsymbol{\Theta}^{\mathrm{WOU}}_{t}\rightarrow\mathbf{0}$ in probability, so the distributions of $\boldsymbol{\Theta}_{t}$ and $\boldsymbol{\Theta}^{\mathrm{WOU}}_{t}$ are similar in the limit.

iv. $p_{t}^{\mathrm{WOU}}$ satisfies $f_{\mathrm{WN}}(\boldsymbol{\theta}_{0};\boldsymbol{\mu},\tfrac{1}{2}\mathbf{A}^{-1}\boldsymbol{\Sigma})p^{\mathrm{WOU}}_{t}(\boldsymbol{\theta}\,|\,\boldsymbol{\theta}_{0})=f_{\mathrm{WN}}(\boldsymbol{\theta};\boldsymbol{\mu},\tfrac{1}{2}\mathbf{A}^{-1}\boldsymbol{\Sigma})p^{\mathrm{WOU}}_{t}(\boldsymbol{\theta}_{0}\,|\,\boldsymbol{\theta})$ , $\forall\boldsymbol{\theta},\boldsymbol{\theta}_{0}\in\mathbb{T}^{p}$ (just like $p_{t}$ ).

Proof.

The first two statements for the WN process are well-known for any diffusion, and it follows from (16) that, for $\boldsymbol{\theta}\neq\boldsymbol{\theta}_{0}$ , $\lim_{t\to 0}p^{\mathrm{WOU}}_{t}(\boldsymbol{\theta}\,|\,\boldsymbol{\theta}_{0})=0$ and that $\lim_{t\to\infty}p^{\mathrm{WOU}}_{t}(\boldsymbol{\theta}\,|\,\boldsymbol{\theta}_{0})=f_{\mathrm{WN}}(\boldsymbol{\theta};\boldsymbol{\mu},\frac{1}{2}\mathbf{A}^{-1}\boldsymbol{\Sigma})$ . The last statement follows from (16) and the fact that the OU process is time-reversible when $\mathbf{A}^{-1}\boldsymbol{\Sigma}$ is positive definite. We give a rough sketch of a proof of the third statement. The result follows because the tpd of the WN process is asymptotically equal to the tpd of the OU process in the high concentration limit. To see this, suppose that $\mathbf{X}_{t}$ solves (15) with $\mathbf{X}_{0}=\boldsymbol{\theta}_{0}$ (we can ignore the other starting points), and that $\boldsymbol{\Theta}_{t}$ solves (7) with $\boldsymbol{\Theta}_{0}=\boldsymbol{\theta}_{0}$ , both driven by the same Wiener process. Then $\mathbf{Y}_{t}:=\mathbf{X}_{t}-\boldsymbol{\Theta}_{t}$ solves $\mathrm{d}\mathbf{Y}_{t}=-\mathbf{A}\mathbf{Y}_{t}\mathrm{d}t+\mathrm{d}\mathbf{Z}_{t}$ , with $\mathbf{Y}_{0}=\mathbf{0}$ , where

[TABLE]

If $\mathbf{Y}_{t}=\int_{0}^{t}e^{-\mathbf{A}(t-s)}\mathrm{d}\mathbf{Z}_{s}\rightarrow\mathbf{0}$ in probability, then the two distributions of $\boldsymbol{\Theta}_{t}$ and $\mathbf{X}_{t}$ will be the same in the limit. This follows because $\mathbf{A}w_{\mathbf{k}}(\boldsymbol{\theta})\rightarrow\mathbf{0}$ and $\mathbf{A}(1-w_{\mathbf{0}}(\boldsymbol{\theta}))\rightarrow\mathbf{0}$ for $\mathbf{k}\in\mathbb{Z}^{p}\backslash\{\mathbf{0}\}$ and $-\pi<\boldsymbol{\theta}<\pi$ (we consider only the case $\mathrm{wind}(\boldsymbol{\mu}-\boldsymbol{\theta})=\mathbf{0}$ because $\mathbb{P}[|\boldsymbol{\Theta}_{s}-\boldsymbol{\mu}|\leq\boldsymbol{\pi}]\to 1$ ), and hence $\mathbf{Z}_{s}\rightarrow\mathbf{0}$ in probability for all $s\leq t$ . ∎
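Statements ii and iv of the corollary can be checked numerically for $p=1$. The sketch assumes the scalar form of the WOU conditional density implied by the description above: a mixture over winding numbers $m$ of the starting point, with weights $w_{m}(\theta_{0})$, of wrapped OU transition densities with mean $\mu+e^{-\alpha t}(\theta_{0}-\mu+2m\pi)$ and variance $\frac{\sigma^{2}}{2\alpha}(1-e^{-2\alpha t})$. Truncation levels and names are illustrative.

```python
import math

def cmod(x):
    """Wrap x to [-pi, pi)."""
    return (x + math.pi) % (2.0 * math.pi) - math.pi

def wn_pdf(theta, mu, var, K=3):
    """Circular WN(mu, var) density, truncated around the closest winding number."""
    x = cmod(theta - mu)
    series = sum(math.exp(-0.5 * (x + 2.0 * k * math.pi) ** 2 / var) for k in range(-K, K + 1))
    return series / math.sqrt(2.0 * math.pi * var)

def wou_tpd(theta, theta0, t, mu, alpha, sigma, K=3):
    """Assumed p = 1 WOU conditional density of theta at time t given theta0 at time 0."""
    s2 = sigma**2 / (2.0 * alpha)                       # stationary variance
    gamma_t = s2 * (1.0 - math.exp(-2.0 * alpha * t))   # OU transition variance
    ms = range(-K, K + 1)
    raw = [math.exp(-0.5 * (theta0 - mu + 2.0 * m * math.pi) ** 2 / s2) for m in ms]
    total = sum(raw)
    dens = 0.0
    for m, r in zip(ms, raw):
        mean_m = mu + math.exp(-alpha * t) * (theta0 - mu + 2.0 * m * math.pi)
        dens += (r / total) * wn_pdf(theta, mean_m, gamma_t, K)
    return dens

# Illustrative values (assumptions)
mu, alpha, sigma, t = 0.2, 1.0, 1.0, 0.5
theta0, theta = 2.5, -1.0
s2 = sigma**2 / (2.0 * alpha)

lhs = wn_pdf(theta0, mu, s2) * wou_tpd(theta, theta0, t, mu, alpha, sigma)
rhs = wn_pdf(theta, mu, s2) * wou_tpd(theta0, theta, t, mu, alpha, sigma)
print(lhs, rhs)  # detailed balance, statement iv
print(wou_tpd(theta, theta0, 40.0, mu, alpha, sigma), wn_pdf(theta, mu, s2))  # statement ii
```

For starting points near antipodality, two winding numbers receive comparable weight and the density becomes bimodal for short times, which the unimodal pseudo-tpds of Section 3.2 cannot reproduce.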

The tractability of (16) degenerates quickly with the dimension, but it can be readily computed for $p=1,2$ , two highly relevant situations in practice. We focus our attention on implementation matters for the non-trivial case $p=2$ . The first point of inquiry is what parametrization of $\mathbf{A}$ and $\boldsymbol{\Sigma}$ leads to a covariance matrix $\mathbf{A}^{-1}\boldsymbol{\Sigma}$ , which guarantees a non-degenerate WN sdi.

Lemma 1.

Let $\mathbf{A}$ and $\boldsymbol{\Sigma}$ be $2\times 2$ matrices, $\boldsymbol{\Sigma}=\big{(}\sigma_{1}^{2},\allowbreak\rho\sigma_{1}\sigma_{2};\rho\sigma_{1}\sigma_{2},\sigma_{2}^{2}\big{)}$ positive-definite. Assume $\alpha_{1},\alpha_{2}>0$ . Any matrix $\mathbf{A}$ such that $\mathbf{A}^{-1}\boldsymbol{\Sigma}$ is a covariance matrix has the form

[TABLE]

with $\alpha_{3}^{2}<\frac{\rho^{2}(\alpha_{1}-\alpha_{2})^{2}}{4}+\alpha_{1}\alpha_{2}$ .

The parametrization with $\rho=0$ provides a compromise between flexibility and simplicity, and will be employed throughout (first occurrences in Figures 3 and 5). With $\rho=0$ the dependence between components is modelled by $\alpha_{3}$ , which is clear from

[TABLE]

The second point is the efficient computation of $e^{-t\mathbf{A}}$ and $\boldsymbol{\Gamma}_{t}$ . By virtue of Corollary 2.4 of Bernstein and So (1993), $e^{t\mathbf{A}}=a(t)\mathbf{I}+b(t)\mathbf{A}$ with $b(t):=e^{s(\mathbf{A})t}\tfrac{\sinh(q(\mathbf{A})t)}{q(\mathbf{A})}$ (if $q(\mathbf{A})=0$ , then, by continuity, $b(t)=e^{s(\mathbf{A})t}t$ ), $a(t):=e^{s(\mathbf{A})t}\cosh(q(\mathbf{A})t)-s(\mathbf{A})b(t)$ , $s(\mathbf{A}):=\tfrac{\mathrm{tr}\left[\mathbf{A}\right]}{2}$ , and $q(\mathbf{A}):=\sqrt{\left|\det(\mathbf{A}-s(\mathbf{A})\mathbf{I})\right|}$ . Therefore,

[TABLE]

with $s(t):=1-a(-2t)$ and $i(t):=-\tfrac{1}{2}b(-2t)$ . Expression (17) shows neatly the interpolation between the infinitesimal and stationary covariance matrices and is especially useful if it is required to compute the tpd for several $t$ ’s.
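The closed form for $e^{t\mathbf{A}}$ quoted above can be checked numerically. The following is a minimal sketch, assuming that $\mathbf{A}$ has real eigenvalues (so that $\det(\mathbf{A}-s(\mathbf{A})\mathbf{I})\leq 0$ and the hyperbolic form applies); the function names `expm_2x2` and `expm_taylor` are hypothetical and are not taken from the sdetorus package. The comparison against a truncated Taylor series is only a sanity check.

```python
import math

def expm_2x2(A, t):
    """e^{tA} = a(t) I + b(t) A for a 2x2 matrix A, assuming real eigenvalues."""
    (a11, a12), (a21, a22) = A
    s = 0.5 * (a11 + a22)                                     # s(A) = tr[A] / 2
    q = math.sqrt(abs((a11 - s) * (a22 - s) - a12 * a21))     # q(A)
    est = math.exp(s * t)
    # b(t), with the continuity limit when q(A) = 0
    b = est * t if q == 0.0 else est * math.sinh(q * t) / q
    a = est * math.cosh(q * t) - s * b                        # a(t)
    return [[a + b * a11, b * a12], [b * a21, a + b * a22]]

def expm_taylor(A, t, terms=40):
    """Reference value: truncated power series of e^{tA} (sanity check only)."""
    result = [[1.0, 0.0], [0.0, 1.0]]
    term = [[1.0, 0.0], [0.0, 1.0]]
    for k in range(1, terms):
        term = [[sum(term[i][m] * A[m][j] for m in range(2)) * t / k
                 for j in range(2)] for i in range(2)]
        result = [[result[i][j] + term[i][j] for j in range(2)] for i in range(2)]
    return result

# Illustrative drift matrix (values chosen for the example only).
A = [[1.0, 0.5], [0.5, 2.0]]
closed_form = expm_2x2(A, -0.3)    # e^{-tA} with t = 0.3
reference = expm_taylor(A, -0.3)
```

Since $a(t)$ and $b(t)$ are scalars, evaluating $e^{-t\mathbf{A}}$ for several $t$'s only requires recomputing two numbers per time point.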

To conclude, we highlight some of the advantages of the WOU approximation over the Euler and Shoji–Ozaki pseudo-likelihoods for the WN process. Firstly, WOU is able to capture the multimodality of the tpd (see Figure 3) and has the correct sdi. Secondly, WOU is faster to compute than Shoji–Ozaki, as it does not require exponentiation and inversion of the Jacobian matrix for each observation, but only once.

3.4 Likelihood by numerical PDE solution

An alternative to approximate likelihoods is to compute the “exact” (up to a prescribed accuracy) MLE by a numerical solution of (3). This approach is computationally expensive, but remains valid for arbitrary diffusions and discretization times. Moreover, it provides insightful visualizations of the tpd. In the following, we discuss how to numerically solve (3) for dimensions $p=1,2$.

3.4.1 One-dimensional case

We consider a state grid $\mathcal{G}:=\left\{x_{1},\ldots,x_{M_{x}}\right\}$ in $[-\pi,\pi)$ constructed with step $\Delta x:=\frac{2\pi}{M_{x}}$ , and such that $x_{M_{x}+1}\allowbreak:=x_{1}=-\pi$ and $x_{0}:=x_{M_{x}}=\pi-\Delta x$ . We also consider a time grid in $[0,T]$ with $\Delta t:=\frac{T}{M_{t}}$ . For consistency with the common notation for PDEs, we refer by $u(x,t)$ to $p_{t}(x\,|\,p_{s}):=\int_{\mathbb{T}^{1}}p_{t}(x\,|\,\phi)p_{s}(\phi)\mathrm{d}\phi$ , the solution of the PDE for the initial condition (at time $s$ ) given by a circular density $p_{s}$ . The vector $\mathbf{u}^{n}$ , $n=0,\ldots,{M_{t}}$ , denotes the tpd evaluated at $\mathcal{G}$ at time $s+n\Delta t$ . We write $b_{i}:=b(x_{i})$ and $\sigma^{2}_{i}:=\sigma^{2}(x_{i})$ , $i=1,\ldots,{M_{x}}$ .

We employ the so-called Crank–Nicolson scheme for discretizing (3), which can be rewritten as

[TABLE]

Crank–Nicolson is a well-known scheme for diffusion and convection-diffusion PDEs such as (3). It is based on a trapezoidal-like approximation of the forward difference of the time derivative that is combined with centered finite differences of the state derivatives:

[TABLE]

with $r:=\frac{\Delta t}{4(\Delta x)^{2}}$ , $\gamma_{i}:=\left(-b_{i+1}\Delta x+\sigma^{2}_{i+1}\right)r$ , $\beta_{i}:=\sigma_{i}^{2}r$ , and $\alpha_{i}:=\left(b_{i-1}\Delta x+\sigma^{2}_{i-1}\right)r$ . The next step in time of the solution, $\mathbf{u}^{n+1}$ , is implicitly given by the system

[TABLE]

with subscript $\pm$ denoting the vector with entries circularly shifted $\mp 1$ position. It is well-known (Thomas, 1995, page 225) that this periodic tridiagonal system can be solved efficiently by tackling the tridiagonal systems $\mathbf{B}\mathbf{y}_{1}=\mathbf{d}_{n}$ and $\mathbf{B}\mathbf{y}_{2}=\mathbf{w}$ (where $\mathbf{F}=\mathbf{B}-\mathbf{w}\mathbf{z}^{\prime}$), and using the Sherman–Morrison formula: $\mathbf{u}^{n+1}=\mathbf{y}_{1}+\tfrac{\mathbf{z}^{\prime}\mathbf{y}_{1}}{1-\mathbf{z}^{\prime}\mathbf{y}_{2}}\mathbf{y}_{2}$. The latter tridiagonal systems can be jointly solved by a modification of the Thomas algorithm, since they share the same coefficient matrix. The cost of the solution is $\mathcal{O}\left({M_{t}}{M_{x}}\right)$. In addition, since $\mathbf{F}$ is constant with respect to time, the tridiagonal LU factorization underlying the Thomas algorithm can be reused, yielding a complexity factor reduction of $5/8$ on the tridiagonal solver.
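The Thomas plus Sherman–Morrison strategy described above can be sketched as follows. This is a minimal illustration under our own assumptions: the particular splitting $\mathbf{F}=\mathbf{B}-\mathbf{w}\mathbf{z}^{\prime}$ (with the auxiliary constant `gamma`) is one standard choice and not necessarily the one used in the paper's software, and all function names are hypothetical. For clarity, the two tridiagonal systems are solved separately instead of jointly.

```python
def thomas(lower, diag, upper, rhs):
    """Solve a tridiagonal system; lower[0] and upper[-1] are ignored."""
    n = len(diag)
    c = [0.0] * n
    d = [0.0] * n
    c[0] = upper[0] / diag[0]
    d[0] = rhs[0] / diag[0]
    for i in range(1, n):                       # forward elimination
        denom = diag[i] - lower[i] * c[i - 1]
        c[i] = upper[i] / denom
        d[i] = (rhs[i] - lower[i] * d[i - 1]) / denom
    x = [0.0] * n
    x[-1] = d[-1]
    for i in range(n - 2, -1, -1):              # back substitution
        x[i] = d[i] - c[i] * x[i + 1]
    return x

def solve_periodic_tridiagonal(lower, diag, upper, rhs):
    """Solve F u = rhs, F tridiagonal with periodic corners
    F[0][n-1] = lower[0] and F[n-1][0] = upper[n-1] (requires n >= 3)."""
    n = len(diag)
    gamma = -diag[0]                            # assumed auxiliary constant
    top_right, bottom_left = lower[0], upper[-1]
    b_diag = list(diag)                         # diagonal of B, where F = B - w z'
    b_diag[0] -= gamma
    b_diag[-1] -= bottom_left * top_right / gamma
    w = [0.0] * n
    w[0], w[-1] = -gamma, -bottom_left
    z_first, z_last = 1.0, top_right / gamma    # only nonzero entries of z
    y1 = thomas(lower, b_diag, upper, rhs)      # B y1 = d_n
    y2 = thomas(lower, b_diag, upper, w)        # B y2 = w
    zy1 = z_first * y1[0] + z_last * y1[-1]
    zy2 = z_first * y2[0] + z_last * y2[-1]
    factor = zy1 / (1.0 - zy2)                  # Sherman-Morrison correction
    return [a + factor * b for a, b in zip(y1, y2)]

def periodic_matvec(lower, diag, upper, x):
    """Product F x, used only to check the solution."""
    n = len(diag)
    return [lower[i] * x[i - 1] + diag[i] * x[i] + upper[i] * x[(i + 1) % n]
            for i in range(n)]
```

When $\mathbf{F}$ does not change across time steps, `y2` and the elimination coefficients depend only on $\mathbf{F}$ and could be cached, which is the reuse mentioned in the text.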

3.4.2 Two-dimensional case

We consider now two grids $\mathcal{G}_{x}$ and $\mathcal{G}_{y}$ analogous to $\mathcal{G}$ , but of sizes $M_{x}$ and $M_{y}$ , and steps $\Delta x$ and $\Delta y$ . We refer by $u(x,y,t)$ to $p_{t}(x,y\,|\,p_{s}):=\int_{\mathbb{T}^{2}}p_{t}(x,y\,|\,\boldsymbol{\varphi})p_{s}(\boldsymbol{\varphi})\mathrm{d}\boldsymbol{\varphi}$ , where $p_{s}$ is a toroidal density giving the initial condition (at time $s$ ). The matrix $\mathbf{U}^{n}$ , $n=0,\ldots,{M_{t}}$ , denotes the tpd evaluated at $\mathcal{G}_{x}\times\mathcal{G}_{y}$ at time $s+n\Delta t$ . We write $b_{z;i,j}:=b_{z}(x_{i},y_{j})$ , $\sigma_{z;i,j}^{2}:=\sigma_{z}^{2}(x_{i},y_{j})$ , $\sigma_{xy;i,j}^{2}:=\sigma_{xy}^{2}(x_{i},y_{j})$ , with $z$ standing for $x$ or $y$ , and $i=1,\ldots,M_{x}$ , $j=1,\ldots,M_{y}$ . With this notation, (3) becomes

[TABLE]

The Crank–Nicolson scheme proceeds as in the one-dimensional case:

[TABLE]

with finite differences that can be collected into three terms associated to the partial and mixed derivatives:

[TABLE]

We have denoted $r_{z}:=\frac{\Delta t}{4(\Delta z)^{2}}$ , $r_{xy}:=\frac{\Delta t}{8\Delta x\Delta y}$ , and

[TABLE]

Let $F=F_{x}+F_{y}+F_{xy}$ be the linear functions mapping $\mathbf{U}^{n}$ into $F(\mathbf{U}^{n})=F_{x}(\mathbf{U}^{n})+F_{y}(\mathbf{U}^{n})+F_{xy}(\mathbf{U}^{n})=(F^{n}_{x;i,j}+F^{n}_{y;i,j}+F^{n}_{xy;i,j})=(F^{n}_{i,j})$ and $I$ the identity function. Then, we can express (19) as

[TABLE]

If the left and right hand sides of (20) are stacked column-wise, (20) becomes an $M_{x}M_{y}\times M_{x}M_{y}$ periodic $9$-diagonal system. This system cannot be solved as efficiently as in the tridiagonal case, requiring a more complex algorithm or a generic sparse LU factorization.

An alternative approach that reduces drastically the computational burden of solving (20) is to adopt an Alternating Direction Implicit (ADI) scheme. ADI schemes split the multidimensional finite differences in a series of univariate discretizations with simpler associated systems. Originally developed for the diffusion equation, they were extended to the convection-diffusion equations with a mixed derivative term by McKee et al. (1996), in the so-called Douglas scheme. This scheme proceeds with an explicit multivariate step corrected by two unidimensional Crank–Nicolson steps, whose purpose is to stabilize the explicit step:

[TABLE]

Consequently, if the matrix equations in (21)–(23) are transformed into linear systems by column-wise stacking for (21) and (22), and row-wise stacking for (23), the Douglas scheme transforms the difficult task of solving (20) into solving two periodic tridiagonal systems of size $M_{x}M_{y}$ . Specifically, the steps in (21)–(23) are carried out using

[TABLE]

$\mathbf{U}^{n+1}$ is obtained by setting $\mathbf{Y}$ equal to $\mathbf{U}^{n}$ , $\mathbf{Y}_{1}$ or $\mathbf{Y}_{2}$ in the above expressions and by solving (22) and (23) as (18) was. Then, the total cost of the solution is $\mathcal{O}\left({M_{t}}M_{x}M_{y}\right)$ . Note that the row-stacking vector $\mathbf{y}_{r}^{n}$ can be directly obtained from $\mathbf{y}_{c}^{n}$ by extracting the indexes $((k_{c}-1)\mod M_{y})M_{x}+\big{\lfloor}\frac{k_{c}-1}{M_{y}}\big{\rfloor}+1$ , $k_{c}=1,\ldots,\allowbreak M_{x}M_{y}$ , of the latter (analogous for the converse). We refer to the neat expository paper of In ’t Hout and Foulon (2010) for further description of ADI schemes.
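The index formula relating the column-wise and row-wise stackings can be verified on a small example. The sketch below assumes that $\mathbf{U}^{n}$ is stored as an $M_{x}\times M_{y}$ matrix with entry $(i,j)$ corresponding to $(x_{i},y_{j})$; the helper name `row_stack_indexes` is hypothetical.

```python
def row_stack_indexes(m_x, m_y):
    """1-based positions in the column-stacked vector, listed in row-stacked order."""
    return [((k - 1) % m_y) * m_x + (k - 1) // m_y + 1
            for k in range(1, m_x * m_y + 1)]

m_x, m_y = 3, 4
# Toy matrix whose entries encode their own (i, j) position.
U = [[10 * i + j for j in range(m_y)] for i in range(m_x)]
y_c = [U[i][j] for j in range(m_y) for i in range(m_x)]   # column-wise stacking
y_r = [U[i][j] for i in range(m_x) for j in range(m_y)]   # row-wise stacking

# Reordering y_c with the index formula should reproduce y_r.
reordered = [y_c[k - 1] for k in row_stack_indexes(m_x, m_y)]
```

Because the index vector depends only on $M_{x}$ and $M_{y}$, it can be computed once and reused at every time step.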

3.4.3 Remarks on the discretization schemes

The Crank–Nicolson and Douglas schemes are tailored solutions for solving (3) that exploit the particular PDE structure. It is worth noting that, among other methods, a well-known approach to solve PDEs is the method of lines. This method is prone to create stiff systems, which need to be handled properly by a meta-solver that chooses between stiff and non-stiff solvers (e.g., the lsoda implementation in Soetaert et al. (2012)). Not surprisingly, in our application we found that the efficiency and reliability of the tailored solutions were superior to the latter, much more general, meta-solver.

Some theoretical remarks about the schemes employed are given as follows. The Crank–Nicolson scheme is conservative (hence the Douglas scheme is too), which can be easily seen from the periodic tridiagonal system. It is also second-order consistent in time and space (with the discretization used), with the appealing property of being unconditionally stable with respect to $\Delta t$ . The Douglas scheme is first-order consistent and unconditionally stable when applied to two-dimensional convection-diffusion equations with a mixed derivative term. See In ’t Hout and Foulon (2010) for the description of second-order ADI schemes of the same computational complexity (but with a factor increase of at least two). Both unconditional stabilities refer to the usual framework of constant coefficients.
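The conservativeness of the one-dimensional Crank–Nicolson scheme can be illustrated numerically. The sketch below uses the coefficients $r$, $\alpha_{i}$, $\beta_{i}$, and $\gamma_{i}$ as defined in Section 3.4.1 and assumes that the $i$-th row of the implicit periodic tridiagonal matrix has entries $(-\alpha_{i},1+2\beta_{i},-\gamma_{i})$, which is our reading of the standard Crank–Nicolson form and not a transcription of (18). Under that assumption every column of the matrix sums to one, so the total mass $\sum_{i}u_{i}^{n}$ is preserved. The drift $b(x)=\alpha\sin(\mu-x)$ with constant $\sigma$ is an illustrative choice.

```python
import math

def crank_nicolson_coefficients(b, sigma2, dx, dt):
    """alpha_i, beta_i, gamma_i on a periodic grid (indexes wrap around)."""
    m = len(b)
    r = dt / (4.0 * dx ** 2)
    gamma = [(-b[(i + 1) % m] * dx + sigma2[(i + 1) % m]) * r for i in range(m)]
    beta = [sigma2[i] * r for i in range(m)]
    alpha = [(b[i - 1] * dx + sigma2[i - 1]) * r for i in range(m)]
    return alpha, beta, gamma

def implicit_column_sums(alpha, beta, gamma):
    """Column sums of the assumed implicit matrix with rows (-alpha_i, 1 + 2 beta_i, -gamma_i).

    Column j collects -gamma_{j-1} (row j-1), 1 + 2 beta_j (row j),
    and -alpha_{j+1} (row j+1), with periodic wrapping of the indexes.
    """
    m = len(beta)
    return [1.0 + 2.0 * beta[j] - gamma[j - 1] - alpha[(j + 1) % m]
            for j in range(m)]

# Illustrative setup: grid on [-pi, pi) and a sine drift (assumed values).
m_x, dt = 50, 0.01
dx = 2.0 * math.pi / m_x
grid = [-math.pi + i * dx for i in range(m_x)]
drift = [2.0 * math.sin(0.5 - x) for x in grid]
sigma2 = [1.0] * m_x
alpha, beta, gamma = crank_nicolson_coefficients(drift, sigma2, dx, dt)
sums = implicit_column_sums(alpha, beta, gamma)   # each entry should be 1 up to rounding
```

The same cancellation holds for the explicit side of the system, with the signs of the off-diagonal terms reversed.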

3.4.4 Likelihood evaluation

The PDE numerical solutions approximate $p_{t}(\boldsymbol{\theta}\,|\,p_{s})=\int_{\mathbb{T}^{p}}p_{t}(\boldsymbol{\theta}\,|\,\boldsymbol{\varphi})\allowbreak p_{s}(\boldsymbol{\varphi})\mathrm{d}\boldsymbol{\varphi}$ , where $p_{s}$ is a density over $\mathbb{T}^{p}$ giving the initial condition. Therefore, $p_{t}(\boldsymbol{\theta}\,|\,\boldsymbol{\theta}_{0})$ can be approximated by considering a concentrated $\mathrm{WN}(\boldsymbol{\theta}_{0},\sigma^{2}_{0}\mathbf{I})$ as the initial condition. For a fixed grid, $\sigma_{0}$ must not be set to an arbitrarily small value, as it will create a sharp initial condition poorly discretized and prone to raise numerical errors. A possible rule of thumb is to choose a small $\sigma_{0}$ such that the periodic trapezoidal rule of the discretized $\mathrm{WN}(\boldsymbol{\theta}_{0},\sigma^{2}_{0}\mathbf{I})$ is close to one.
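The rule of thumb for $\sigma_{0}$ can be sketched for $p=1$ as follows. The code evaluates the periodic trapezoidal rule of a discretized $\mathrm{WN}(\theta_{0},\sigma_{0}^{2})$ on the grid $\mathcal{G}$; the truncation of the wrapping series to a few winding numbers and the function names are our assumptions.

```python
import math

def wn_density(theta, mu, sigma, k_max=3):
    """Wrapped normal density, truncating the wrapping series to |k| <= k_max."""
    total = sum(math.exp(-0.5 * ((theta - mu + 2.0 * math.pi * k) / sigma) ** 2)
                for k in range(-k_max, k_max + 1))
    return total / (sigma * math.sqrt(2.0 * math.pi))

def grid_mass(mu, sigma, m_x):
    """Periodic trapezoidal rule of the WN density on a grid of [-pi, pi)."""
    dx = 2.0 * math.pi / m_x
    return dx * sum(wn_density(-math.pi + i * dx, mu, sigma) for i in range(m_x))

# With M_x = 500 the grid step is about 0.0126.
mass_ok = grid_mass(0.0, 0.1, 500)      # sigma_0 several grid steps wide
mass_bad = grid_mass(0.0, 0.001, 500)   # sigma_0 far below the grid step
```

A $\sigma_{0}$ spanning several grid steps gives a mass very close to one, whereas a $\sigma_{0}$ much smaller than $\Delta x$ gives a mass far from one, signalling a poorly discretized initial condition.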

We illustrate the evaluation of the log-likelihood (10) from the PDE solution for $p=1$. The extension to $p=2$ is conceptually straightforward, albeit cumbersome in notation. Given the sample $\{\Theta_{\Delta i}\}_{i=1}^{N}$ and the grid $\mathcal{G}$, denote by $\mathbf{P}:=p_{t}(\mathcal{G}\,|\,\mathcal{G})$ the $M_{x}\times M_{x}$ tpd matrix of the process discretized in $\mathcal{G}$. The $j$-th column of $\mathbf{P}$ is obtained by solving the PDE with initial condition $\mathrm{WN}(x_{j},\sigma^{2}_{0})$. We can approximate $p_{\Delta}(\Theta_{i\Delta}\,|\,\Theta_{(i-1)\Delta})$ from $\mathbf{P}$ by linear interpolation:

[TABLE]

with $g_{0}(i):=\lceil\frac{\Theta_{\Delta i}+\pi}{\Delta x}\rceil$, $\omega_{0}(\theta)=\frac{x_{g_{0}(i)+1}-\theta}{\Delta x}$, and $\omega_{1}(\theta)=\frac{\theta-x_{g_{0}(i)}}{\Delta x}$. The log-likelihood is obtained by plugging (24) into (10). The advantage of doing so is that the number of PDE solutions required for a single log-likelihood evaluation remains bounded by $M_{x}$, irrespective of $N$. In addition, we only need to compute the columns of $\mathbf{P}$ corresponding to the unique set of indexes $\{g_{0}(i)+l:i=0,\ldots,N-1,l=0,1\}$. A simpler, though less precise, alternative to (24) is to use constant interpolation for $\Theta_{\Delta(i-1)}$. This results in a lower number of PDE solutions, especially in the two-dimensional case. Finally, if the drift is antisymmetric around a point $\mu$, then $p_{t}(\theta\,|\,\varphi)=p_{t}(2\mu-\theta\,|\,2\mu-\varphi)$. Hence, if $\mathcal{G}$ is circularly centered at $\mu$, half of the columns of $\mathbf{P}$ contain redundant information. The situation is analogous for $p=2$: if $b(\theta_{1}-\mu_{1},\theta_{2}-\mu_{2})=-b(\mu_{1}-\theta_{1},\mu_{2}-\theta_{2})$, $\forall\theta_{1},\theta_{2}\in[-\pi,\pi)$, and $\mathcal{G}_{x}$ and $\mathcal{G}_{y}$ are centered at $\mu_{1}$ and $\mu_{2}$, respectively, then only half of the columns of $\mathbf{P}$ are required. If the drift is isotropic, then only one fourth of the columns are needed.

4 Simulation study

We measure now the performance of the likelihood approximations given in Section 3. Two types of empirical analysis are employed. First, we compare the divergences between the true tpd of a diffusion and its approximations across time. Second, we examine the errors of the approximate likelihoods in estimating $\boldsymbol{\lambda}$ in several diffusions.

4.1 Kullback–Leibler divergences for WN and vM processes

All the estimation approaches described in Section 3 share a common root: the substitution of the true tpd $p_{t}$ by an approximation $p^{\mathrm{A}}_{t}$. The goodness-of-fit of these approximations has a direct influence on MLE since, for a general parametric setting, MLE is equivalent to minimizing the Kullback–Leibler divergence of the parametric pdf from the empirical pdf. We propose to measure the Kullback–Leibler divergence of $p^{\mathrm{A}}_{t}(\cdot\,|\,\boldsymbol{\theta}_{s})$ from $p_{t}(\cdot\,|\,\boldsymbol{\theta}_{s})$ by weighting with the stationary density the contributions of each initial point $\boldsymbol{\theta}_{s}$:

[TABLE]

The curve $\mathrm{D}^{\mathrm{A}}_{t}$ gives a succinct summary of the goodness-of-fit of any approximation to the tpd across time. Its effective computation – when no analytical expression for the tpd exists – can be done with the PDE solution to the tpd. Some care is needed though. The PDE solution involves the initial condition in the form of a concentrated $\mathrm{WN}(\boldsymbol{\theta}_{0},\sigma_{0}^{2}\mathbf{I})$ . This initial condition implies that the PDE solution is approximating $p_{t,\sigma_{0}^{2}}(\boldsymbol{\theta}\,|\,\boldsymbol{\theta}_{0}):=\int_{\mathbb{T}^{p}}p_{t}(\boldsymbol{\theta}\,|\,\boldsymbol{\varphi})f_{\mathrm{WN}}(\boldsymbol{\varphi};\boldsymbol{\theta}_{0},\sigma_{0}^{2})\mathrm{d}\boldsymbol{\varphi}$ rather than $p_{t}$ . Therefore, a more adequate approach is to smooth also the approximations in the computation of $\mathrm{D}^{\mathrm{A}}_{t}$ to perform a fair comparison:

[TABLE]

We explore the $\mathrm{D}^{\mathrm{A}}_{t,\sigma_{0}^{2}}$ curves for several variants of the approximations given in Section 3, denoted as S (Stationary density), E (Euler), SO (Shoji–Ozaki), UE (Unwrapped Euler – the usual Euler pseudo-likelihood), USO (Unwrapped Shoji–Ozaki), EvM, SOvM, and WOU. Suffix vM denotes the use of a vM distribution (moment if one-dimensional; score if two-dimensional) matching approximation to the WN distribution appearing in the pseudo-likelihoods.

Figures 4 and 5 show the Kullback–Leibler curves for the WN process with $p=1$ and $p=2$, under different drift strengths and diffusivities. We highlight their main features as follows. First, WOU outperforms the other approximations in almost all scenarios and times. The main exceptions are the lower left scenarios of both figures, representing processes with a high diffusivity (small drifts and large diffusivities), where WOU is outperformed by SO and E for a significant range of intermediate times. In addition to S, WOU is the only approximation whose accuracy improves as time increases, above a certain local maximum in the Kullback–Leibler divergence. Second, the Euler and Shoji–Ozaki pseudo-likelihoods deteriorate or stabilize as time increases, except for scenarios with low and moderate diffusivity where SO is close to WOU (and both are close to the true tpd). E is systematically behind SO in performance, usually by several orders of magnitude. S, as expected, performs poorly unless $t$ is large. Third, the wrapped versions of the pseudo-likelihoods dominate uniformly the unwrapped ones, both having similar performances if the process is highly concentrated. Indeed, the wrapping of SO is key in preventing the spread of probability mass outside $\mathbb{T}^{p}$ when the Jacobian of the drift has positive eigenvalues and $t$ grows, which raises numerical instabilities (e.g., lower right panel of Figure 5). Finally, matching the WN distribution of E and SO by a vM has different effects depending on the method. For E, the results are similar for both E and EvM, except for a bump in small times with high diffusivities. However, SOvM consistently adds a high bias to SO, resulting in significantly higher divergences. As a general advice, we recommend approximating the tpd of the WN process by WOU, SO and E, in this order.

We reproduce the same experiment on the vM process, with results collected in Figures 6 and 7. The highlights are similar except for the following differences. First, the good properties that WOU has for the WN process do not hold any more, evincing its process-specificity. S is now the only approximation whose accuracy improves over time. Second, SO is systematically above E in performance, yet this difference is reduced as SO is not the true tpd under high-concentration. Third, the vM distribution match does not provide a better approximation to the tpd, despite the sdi being vM. EvM is again close to E except for small $t$’s, where EvM adds a substantial bias for scenarios with moderate and high diffusivities. The same happens for SOvM in $p=1$, whereas for $p=2$ SOvM increases the Kullback–Leibler divergence by several orders of magnitude when compared to SO in the scenarios with high diffusivity. Our general advice is to approximate the tpd by SO and E, in this order.

4.2 Empirical performance of the approximate likelihoods

We compare now the efficiency of WOU, SO, and E – the best performing tpd approximations, according to the weighted Kullback–Leibler divergences – in estimating the unknown parameters of the diffusion (9) from a trajectory $\{\boldsymbol{\Theta}_{\Delta i}\}_{i=0}^{N}$ . In this section, we set $N=250$ and assume that $\sigma(\cdot;\boldsymbol{\lambda})=\boldsymbol{\Sigma}^{\frac{1}{2}}$ is known in order to avoid the inherent unidentifiabilities of $\boldsymbol{\lambda}$ when $\Delta$ is large and the tpd converges to the sdi. We explore the behaviour of the estimators for dimensions $p=1,2$ , time steps $\Delta=0.05,0.20,0.50,1.00$ , and for representative parameter choices of the WN process and of two challenging diffusions. For $p=1$ , we also consider the PDE-based approximation to the likelihood. The trajectories are simulated using the E method with time step $0.001$ and then subsampled for given $\Delta$ ’s.

In order to summarize the overall performance of a collection $\{\hat{\boldsymbol{\lambda}}_{j}=(\hat{\lambda}_{j,1},\ldots,\hat{\lambda}_{j,K}):j=1\ldots,J\}$ of $K$ -variate estimators of $\boldsymbol{\lambda}$ , we consider a global measure of relative performance. This measure is the componentwise average of Relative Efficiency (RE), where the relative efficiency is measured with respect to the best estimator at a given component in terms of Mean Squared Error (MSE):

[TABLE]

Hence, if $\hat{\boldsymbol{\lambda}}_{j}$ is the best estimator for all the components of $\boldsymbol{\lambda}$ , then $\mathrm{RE}(\hat{\boldsymbol{\lambda}}_{j})=1$ . We estimate $\mathrm{RE}(\hat{\boldsymbol{\lambda}}_{j})$ by Monte Carlo with $1000$ replicates, where $\hat{\boldsymbol{\lambda}}_{j}$ is obtained by maximizing the approximate likelihood with a common optimization procedure that employs (11) as starting values.
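The relative efficiency measure described above can be sketched from a table of Monte Carlo MSEs. The code assumes the form $\mathrm{RE}(\hat{\boldsymbol{\lambda}}_{j})=\frac{1}{K}\sum_{k=1}^{K}\min_{l}\mathrm{MSE}(\hat{\lambda}_{l,k})/\mathrm{MSE}(\hat{\lambda}_{j,k})$, which is our reading of the verbal definition; the function name and the MSE values are hypothetical and serve only as an example.

```python
def relative_efficiencies(mse):
    """mse[j][k]: MSE of estimator j for component k; returns one RE per estimator."""
    n_comp = len(mse[0])
    # Best (smallest) MSE attained by any estimator at each component.
    best = [min(row[k] for row in mse) for k in range(n_comp)]
    # Componentwise average of the ratios best MSE / own MSE.
    return [sum(best[k] / row[k] for k in range(n_comp)) / n_comp for row in mse]

# Made-up MSEs for two estimators (rows) and two parameters (columns).
example_mse = [[1.0, 1.0],
               [2.0, 4.0]]
example_re = relative_efficiencies(example_mse)   # [1.0, 0.375]
```

The first estimator is the best at both components and therefore attains a relative efficiency of one, in agreement with the remark above.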

4.2.1 WN process

Table 1 shows the relative efficiencies for E, SO, WOU, and PDE with $p=1$ . When averaging across scenarios and discretization times, the global ranking of performance is: WOU ( $0.9195$ ), PDE ( $0.8831$ ), SO ( $0.8766$ ), and E ( $0.7642$ ). On average, E is the best performing method for $\Delta=0.05$ , followed closely by SO. However, the relative performance of E severely decays as $\Delta$ increases. A similar pattern is present for SO, although the decay in relative efficiency is less severe, being by a narrow margin the best performing method for $\Delta=0.20$ (above E and WOU with an absolute difference lower than $0.5\%$ ). PDE is significantly underperforming for $\Delta=0.05,0.20$ , which is explained by the bias induced by the initial condition: $\sigma_{0}=0.1$ was considered as a compromise between tractability ( $M_{x}=500$ , $M_{t}=\lceil 100\Delta\rceil$ ) and accuracy. PDE becomes the best performer on average for $\Delta=0.50,1.00$ , where the effects of the initial condition become less important. WOU shows an intermediate profile with an indubitable advantage: on average, its relative efficiency has an absolute difference with respect to the best performing method of less than $2.5\%$ . This fact is what makes it the best method on the global ranking of performance.

Table 2 gives the relative efficiencies for E, SO, and WOU in $p=2$. When averaging across scenarios and discretization times, the global ranking of performance is: WOU ($0.9608$), SO ($0.8372$), and E ($0.6607$). Similarly to $p=1$, E is the best performing method for $\Delta=0.05$ and its relative efficiency quickly decays as $\Delta$ increases. SO and WOU perform similarly for low diffusive scenarios ($\sigma=1$), but for $\sigma=2$ WOU significantly outperforms SO for $\Delta=0.20,0.50,1.00$, a fact explained by the proneness of the tpd to be multimodal in those situations. The competitive performance of WOU for $p=1,2$ under all scenarios and $\Delta$’s, in addition to its affordable computational cost, places it as the preferred estimation method for the WN process.

4.2.2 Other processes

The WC diffusion has a remarkably different drift from the WN process (Figure 1). As a consequence, the tpd of the WC diffusion quickly becomes highly non-WN (multimodal, heavy-tailed, peaked), the opposite of the defining features of the pseudo-tpds. This affects the relative efficiencies for E, SO, and PDE given in Table 4, whose global performance is: PDE ($0.9727$), SO ($0.4587$), and E ($0.4131$). The supremacy of the PDE, except for small drift ($\alpha=0.5$) and $\Delta=0.05$, is evident. Thus, Table 4 is an illustration of the low efficiency of applying the Euler and Shoji–Ozaki pseudo-likelihoods for highly non-WN processes at arbitrary $\Delta$’s.

Finally, Table 5 shows the relative efficiencies of E and SO for a mivM diffusion with antipodal means. In order to avoid spurious maxima, $q$ was estimated by SMLE and then kept fixed when optimizing the approximate likelihood. The global performances are: SO ($0.9655$) and E ($0.8920$). The analysis by $\Delta$’s shows that, as in the WC diffusion, SO is performing better than E except for $\Delta=0.05$. However, inspection of the tpd shows a prevalent multimodality, which points towards a low efficiency of the pseudo-likelihoods when $\Delta$ is not small.

5 Application to molecular dynamics

Toroidal data arises from the representation of the backbone of a protein made of $n$ amino acids as a sequence of $n-2$ pairs of dihedral angles $(\phi,\psi)$ , thus as a point in $\mathbb{T}^{2(n-2)}$ . The dihedral angles capture the rotations around the N–Cα and Cα–C bonds, which are the remaining degrees of freedom of the backbone (if the bond angles and bond lengths are assumed fixed to their ideal values). Molecular dynamics simulations are widely employed to study the folding and the dynamical properties of proteins, providing ultra high frequency trajectories of protein structures. The dihedral angles of the time-varying backbone result in a trajectory $\{(\phi_{1,i\Delta},\allowbreak\psi_{1,i\Delta},\allowbreak\ldots,\phi_{n-2,i\Delta},\psi_{n-2,i\Delta})\}_{i=0}^{N}$ . Diffusive models on the torus are appropriate tools to summarize these trajectories and, once fitted, can be used as computationally affordable emulators of the physical process.

We consider data from molecular dynamics simulations of the protein G (Protein Data Bank identifier 1GB1) around its native state. This protein contains $n=56$ amino acids and, due to its relatively small size and availability of extensive experimental data, is commonly considered in the molecular dynamics literature. The molecular dynamics simulations were done using the CHARMM36 force field with the EEF1-SB solvent model (Bottaro et al., 2013) during $T=100$ nanoseconds equally discretized in $10000$ time cuts, which afterwards were subsampled to $N=1000$ . For the sake of illustration, we study two specific trajectories: $\{\psi_{\Delta i}\}_{i=0}^{N}$ of the $9$ -th amino acid (Glycine, between Asparagine and Lysine), and $\{(\phi_{\Delta i},\psi_{\Delta i})\}_{i=0}^{N}$ of the $14$ -th amino acid (Glycine, between Lysine and Glutamate). These one- and two-dimensional trajectories exhibit multi- and unimodal patterns that are representative of the general case.

The one-dimensional multimodal trajectory was modelled with a diffusion driven by a mixture of two vM distributions, as given in (8). The fitting was done with the PDE method with $M_{x}=500$ , $M_{t}=20$ , and $\sigma_{0}=0.01$ . We used SMLE and (12) as starting values, and fixed the mixture proportions to the stationary estimates to avoid spurious minima. The optimization took $115$ seconds in a $1.7$ GHz core for $566$ likelihood evaluations and gave $\hat{\boldsymbol{\alpha}}=(9.06,5.00)$ , $\hat{\boldsymbol{\mu}}=(0.23,-2.91)$ , $\hat{\sigma}=1.08$ , and $\hat{p}=0.56$ . The first row of Figure 8 presents a graphical summary of the parametric fit. The first panel shows the observed data and a simulated trajectory from the fitted model, which captures the main patterns of the observed data, except for some outliers.

In order to evaluate the goodness-of-fit of the parametric model – and due to the absence of formal tests directly applicable in this setting, to the best of the authors’ knowledge – we compared graphically the parametric fits of the drift and diffusion coefficient with their nonparametric estimations. To that aim, we considered the following Nadaraya–Watson estimator for the drift

[TABLE]

with $Y_{i}:=\mathrm{cmod}\left(\Theta_{\Delta(i+1)}-\Theta_{\Delta i}\right)/\Delta$ and $h$ as the bandwidth parameter. For the diffusion coefficient, we set $Y_{i}:=\left(\mathrm{cmod}\left(\Theta_{\Delta(i+1)}-\Theta_{\Delta i}\right)\right)^{2}/\Delta$ and then took the square root in the estimate. To remove the smoothing bias of (25), we smoothed the parametric estimate by considering $Y_{i}=b(\Theta_{i\Delta};\hat{\boldsymbol{\lambda}})$ in (25), hence equating both biases under the correct specification of the model. The second panel in the first row of Figure 8 compares the nonparametric and smoothed parametric estimates of the drift. Both drifts are shadowed according to a kernel density estimate that emphasizes the regions where the data is present. For those regions, there is a close match between both estimates. The third panel shows a similar analysis for the diffusion coefficient, whose nonparametric estimate exhibits mild departures from $\hat{\sigma}$ in the regions with high density.
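A Nadaraya–Watson drift estimate of this kind can be sketched for a one-dimensional circular trajectory as follows. The responses $Y_{i}$ follow the definition given above, whereas the von Mises-type kernel $\exp\{(\cos(\theta-\Theta_{\Delta i})-1)/h^{2}\}$ is our assumption and may differ from the kernel used in (25); `cmod` is implemented as the wrapping to $[-\pi,\pi)$, and the function names are hypothetical.

```python
import math

def cmod(x):
    """Wrap an angle to [-pi, pi)."""
    return (x + math.pi) % (2.0 * math.pi) - math.pi

def nw_drift(theta, traj, delta, h):
    """Nadaraya-Watson drift estimate at theta with an assumed von Mises-type kernel."""
    num = den = 0.0
    for cur, nxt in zip(traj[:-1], traj[1:]):
        weight = math.exp((math.cos(theta - cur) - 1.0) / h ** 2)
        num += weight * cmod(nxt - cur) / delta   # Y_i: wrapped increment over Delta
        den += weight
    return num / den

# Toy trajectory rotating at constant speed 0.1 / 0.05 = 2 (wrapping included).
toy_traj = [cmod(0.1 * i) for i in range(100)]
toy_estimate = nw_drift(0.5, toy_traj, delta=0.05, h=0.3)
```

Replacing the response by the squared wrapped increment divided by $\Delta$, and taking the square root of the result, gives the analogous estimate of the diffusion coefficient.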

For modelling the two-dimensional and unimodal trajectory we employed a bivariate WN diffusion with unconstrained $\boldsymbol{\Sigma}$ . The fitting was done with the WOU approximation using SMLE and (12) for starting values. The optimization took $14$ seconds for $2739$ approximate likelihood evaluations. The first panel in the second row of Figure 8 shows the correct match between the simulated and the observed trajectories, again except for some outliers from the latter. The next panel shows the comparison between the vector fields for the smoothed parametric and nonparametric drifts. They show a strong agreement on the drift structure at regions with presence of data, both in magnitude and direction. The parametric vector field $(\sigma_{1}(\phi,\psi;\hat{\boldsymbol{\lambda}}),\sigma_{2}(\phi,\psi;\hat{\boldsymbol{\lambda}}))$ and the nonparametric $(\hat{\sigma}_{1,h_{1}}(\phi,\psi),\hat{\sigma}_{2,h_{2}}(\phi,\psi))$ have a proper match for the regions with data, the latter being constant in most of $\mathbb{T}^{p}$ . The nonparametric estimates were constructed by considering product kernels on the covariates. All the bandwidths were automatically selected by cross-validation.

6 Conclusions

We introduced ergodic diffusions on the torus as the natural processes with stationary distributions equal to well-known toroidal distributions. The WN process, with an available analytical approximation to its tpd, is shown to be the most tractable OU-like toroidal process among the different proposals. This approximation outperforms the wrapped Euler and Shoji–Ozaki pseudo-likelihoods, and shows an affordable computational cost for one and two dimensions. In addition, we provide numerical solutions of the one- and two-dimensional Fokker–Planck PDEs for approximating the true tpd, which serve as benchmarks of the accuracy of the approximating tpds. A thorough simulation study explored the performance of the approximate likelihoods under different scenarios. Finally, a data application illustrated the usefulness of the new diffusive models for modelling molecular dynamics simulations.

We summarize some important practical conclusions. For estimating the WN process, we recommend using WOU as a first option for a fast and accurate approximation in dimensions $p=1,2$. For a general process, we advise employing PDE with $p=1$ if accuracy is a priority, and SO in case speed is. For $p=2$, SO is preferred to E, but both are prone to underperform severely for highly non-WN tpds, which can be visualized using the PDE solution.

The development of a general and computationally fast method for approximating an arbitrary tpd, that is able to cope with multimodality, remains an open challenge. A promising avenue is methods based on simulation, which have been successful for Euclidean diffusions; see e.g. Beskos et al. (2006), Papaspiliopoulos and Roberts (2012), Sermaidis et al. (2012), Bladt et al. (2016), and references in these papers. The simplest algorithm by Beskos et al. (2006a) is well suited for exact simulation of the transient diffusion (i.e., before wrapping) because of the periodicity of the coefficients, and the method in Sermaidis et al. (2012) is applicable to Langevin diffusions. It is therefore likely that the exact simulation methods can be adapted to toroidal Langevin diffusions by finding ways to deal with the wrapping when simulating diffusion bridges. It is also of interest to study whether the coupling methods underlying the diffusion bridge simulation technique in Bladt et al. (2016) can be adapted to the torus setting. Another interesting approach would be to include the winding number for each observation as a latent variable and apply methods like the EM algorithm or the Gibbs sampler for likelihood inference.

Software

The software sdetorus, available at https://github.com/egarpor/sdetorus, contains the implementations of the methods described in the paper and the files required for reproducing all the empirical analyses.

Acknowledgements

This work is part of the Dynamical Systems Interdisciplinary Network, University of Copenhagen. It was funded by the University of Copenhagen 2016 Excellence Programme for Interdisciplinary Research (UCPH2016-DSIN) and by project MTM2016-76969-P from the Spanish Ministry of Economy, Industry and Competitiveness, and European Regional Development Fund (ERDF). We acknowledge the insightful discussions with John Kent, Jotun Hein, and Michael Golden that led to the key motivation for the manuscript. We are grateful to Sandro Bottaro for providing the molecular dynamics data used in the illustration. We acknowledge the valuable comments and remarks provided by two anonymous referees and an Associate Editor, which significantly improved the manuscript.

Bibliography


Banerjee, A., Dhillon, I. S., Ghosh, J., and Sra, S. (2005). Clustering on the unit hypersphere using von Mises-Fisher distributions. J. Mach. Learn. Res., 6:1345–1382.
Bernstein, D. S. and So, W. (1993). Some explicit formulas for the matrix exponential. IEEE Trans. Automat. Control, 38(8):1228–1232.
Beskos, A., Papaspiliopoulos, O., Roberts, G. O., and Fearnhead, P. (2006). Exact and computationally efficient likelihood-based estimation for discretely observed diffusion processes. J. R. Stat. Soc. Ser. B Stat. Methodol., 68(3):333–382.
Beskos, A., Papaspiliopoulos, O., and Roberts, G. O. (2006a). Retrospective exact simulation of diffusion sample paths with applications. Bernoulli, 12(6):1077–1098.
Bibby, B. M. and Sørensen, M. (2001). Simplified estimating functions for diffusion models with a high-dimensional parameter. Scand. J. Statist., 28(1):99–112.
Bladt, M., Finch, S., and Sørensen, M. (2016). Simulation of multivariate diffusion bridges. J. R. Stat. Soc. Ser. B Stat. Methodol., 78(2):343–369.
Bottaro, S., Lindorff-Larsen, K., and Best, R. B. (2013). Variational optimization of an all-atom implicit solvent force field to match explicit solvent simulation data. J. Chem. Theory Comput., 9(12):5641–5652.
Breckling, J. (1989). The analysis of directional time series: applications to wind speed and direction, volume 61 of Lecture Notes in Statistics. Springer-Verlag, Berlin.
