Detecting British Columbia Coastal Rainfall Patterns by Clustering   Gaussian Processes

Forrest Paton; Paul D. McNicholas

arXiv:1812.09758·stat.AP·April 7, 2020

Detecting British Columbia Coastal Rainfall Patterns by Clustering Gaussian Processes

Forrest Paton, Paul D. McNicholas

PDF

TL;DR

This paper develops a clustering method based on Gaussian process covariance kernels to analyze and identify patterns in British Columbia's coastal rainfall data, revealing insights related to El Niño and La Niña phenomena.

Contribution

It introduces a novel clustering approach for functional data using Gaussian process covariance kernels, applied to coastal rainfall analysis.

Findings

01

Clustering reveals distinct rainfall patterns linked to climate phenomena.

02

The method can identify extreme weather events from rainfall data.

03

Insights are consistent with known climate patterns like El Niño and La Niña.

Abstract

Functional data analysis is a statistical framework where data are assumed to follow some functional form. This method of analysis is commonly applied to time series data, where time, measured continuously or in discrete intervals, serves as the location for a function's value. Gaussian processes are a generalization of the multivariate normal distribution to function space and, in this paper, they are used to shed light on coastal rainfall patterns in British Columbia (BC). Specifically, this work addressed the question over how one should carry out an exploratory cluster analysis for the BC, or any similar, coastal rainfall data. An approach is developed for clustering multiple processes observed on a comparable interval, based on how similar their underlying covariance kernel is. This approach provides interesting insights into the BC data, and these insights can be framed in terms…

Tables4

Table 1. Table 1: Mean values for recovered hyper-parameters, with standard errors, for Simulation I.

Parameter	Truth	Mean Estimate	Standard Error
$π_{1}$	0.33	0.33	0
$π_{2}$	0.67	0.67	0
$l_{1}$	1	1.23	0.02
$l_{2}$	3	3.08	0.03
$σ_{1}$	3	2.18	0.07
$σ_{2}$	1	1.43	0.09

Table 2. Table 2: Mean values for recovered hyper-parameters, with standard errors, for Simulation II.

Parameter	Truth	Mean Estimate	Standard Error
$π_{1}$	0.5	0.52	0.004
$π_{2}$	0.5	0.48	0.004
$l_{1}$	1	1.30	0.032
$l_{2}$	2	2.10	0.018
$σ_{1}$	1	2.01	0.081
$σ_{2}$	2	2.05	0.082

Table 3. Table 3: Years in group two, for each weather station, based on maximizing the complete-data log-likelihood, where remaining years belong to group one.

Weather Station	Years in Group Two
Vancouver	1995
Tofino	1995
Port Hardy	1992, 1994, 1995, 1998, 2000
Victoria	1999

Table 4. Table 4: Clustering results, parameters recovered. These estimates are the result of considering each possible two-group combination and selecting the one with the greatest (complete-data) likelihood. The clusters differed mainly with respect to their length-scale parameter l 1 subscript 𝑙 1 l_{1} and l 2 subscript 𝑙 2 l_{2} .

Parameter	Mean Estimate
Tofino
$π_{1}$	0.90
$π_{2}$	0.10
$l_{1}$	0.365
$l_{2}$	0.923

Equations64

\mathbf{f}(\mathbf{x})\sim\text{GP}\big{(}m(\mathbf{x}),\ k(\mathbf{x},\mathbf{x}^{\top})\big{)},

\mathbf{f}(\mathbf{x})\sim\text{GP}\big{(}m(\mathbf{x}),\ k(\mathbf{x},\mathbf{x}^{\top})\big{)},

k (x_{i}, x_{j}) = σ^{2} exp {- \frac{1}{2 l ^{2}} (x_{i} - x_{j})^{2}},

k (x_{i}, x_{j}) = σ^{2} exp {- \frac{1}{2 l ^{2}} (x_{i} - x_{j})^{2}},

K = k (x_{1}, x_{1}) ⋮ k (x_{n}, x_{1}) \dots ⋱ \dots k (x_{1}, x_{n}) ⋮ k (x_{n}, x_{n}) .

K = k (x_{1}, x_{1}) ⋮ k (x_{n}, x_{1}) \dots ⋱ \dots k (x_{1}, x_{n}) ⋮ k (x_{n}, x_{n}) .

p (f ∣ x) = function prior ϕ (f ∣ 0, K), p (y ∣ f) = likelihood i = 1 \prod n ϕ (y_{i} ∣ f_{i}) .

p (f ∣ x) = function prior ϕ (f ∣ 0, K), p (y ∣ f) = likelihood i = 1 \prod n ϕ (y_{i} ∣ f_{i}) .

p (y ∣ x)

p (y ∣ x)

lo g p (y ∣ x)

lo g p (y ∣ x)

lo g p (y ∣ x)

lo g p (y ∣ x)

f (x) \sim N (0, K (x, x)),

f (x) \sim N (0, K (x, x)),

f (x_{*}) \sim N (0, K (x_{*}, x_{*})),

f (x_{*}) \sim N (0, K (x_{*}, x_{*})),

[f (x) f (x_{*})] \sim N (0, [K (x, x) K (x_{*}, x) K (x, x_{*}) K (x_{*}, x_{*})]) .

[f (x) f (x_{*})] \sim N (0, [K (x, x) K (x_{*}, x) K (x, x_{*}) K (x_{*}, x_{*})]) .

\hat{f} (x_{*}) ≜ E [f (x_{*}) ∣ x, y, x_{*}] = K (x_{*}, x) [K (x, x)]^{- 1} y;

\hat{f} (x_{*}) ≜ E [f (x_{*}) ∣ x, y, x_{*}] = K (x_{*}, x) [K (x, x)]^{- 1} y;

f (x ∣ φ)

f (x ∣ φ)

L (φ) = i = 1 \prod n g = 1 \sum G π_{g} ϕ (x_{i} ∣ μ_{g}, Σ_{g}),

L (φ) = i = 1 \prod n g = 1 \sum G π_{g} ϕ (x_{i} ∣ μ_{g}, Σ_{g}),

L_{c} (φ) = i = 1 \prod n g = 1 \sum G [π_{g} ϕ (x_{i} ∣ μ_{g}, Σ_{g})]^{z_{i g}} .

L_{c} (φ) = i = 1 \prod n g = 1 \sum G [π_{g} ϕ (x_{i} ∣ μ_{g}, Σ_{g})]^{z_{i g}} .

\overset{z}{^}_{i g} : = E [Z_{i g} ∣ x_{i}] = \frac{π ^ _{g} ϕ ( x _{i} ∣ μ ^ _{g} , Σ ^ _{g} )}{\sum _{h = 1}^{G} π ^ _{h} ϕ ( x _{i} ∣ μ ^ _{h} , Σ ^ _{h} )}

\overset{z}{^}_{i g} : = E [Z_{i g} ∣ x_{i}] = \frac{π ^ _{g} ϕ ( x _{i} ∣ μ ^ _{g} , Σ ^ _{g} )}{\sum _{h = 1}^{G} π ^ _{h} ϕ ( x _{i} ∣ μ ^ _{h} , Σ ^ _{h} )}

\overset{π}{^}_{g}

\overset{π}{^}_{g}

\hat{μ}_{g}

\hat{Σ}_{g}

MAP (\overset{z}{^}_{i g}) = {10 if g = argmax_{h} (\overset{z}{^}_{ih}), otherwise.

MAP (\overset{z}{^}_{i g}) = {10 if g = argmax_{h} (\overset{z}{^}_{ih}), otherwise.

p (y_{i} ∣ θ_{i}, x_{i})

p (y_{i} ∣ θ_{i}, x_{i})

p (y_{i} ∣ θ, x_{i})

p (y_{i} ∣ θ, x_{i})

L_{c} (φ) = i = 1 \prod n g = 1 \sum G [π_{g} ϕ (y_{i} ∣ 0, K_{g})]^{z_{i g}},

L_{c} (φ) = i = 1 \prod n g = 1 \sum G [π_{g} ϕ (y_{i} ∣ 0, K_{g})]^{z_{i g}},

k (x_{i}, x_{j})

k (x_{i}, x_{j})

lo g p (y_{i} ∣ x_{i}, Θ_{i})

lo g p (y_{i} ∣ x_{i}, Θ_{i})

\frac{\partial}{\partial Θ _{i}} lo g p (y ∣ x, Θ_{i}) = \frac{1}{2} y^{⊤} K^{- 1} \frac{\partial K}{\partial Θ _{i}} K^{- 1} y - \frac{1}{2} tr (K^{- 1} \frac{\partial K}{\partial Θ _{i}}) = \frac{1}{2} tr {(α α^{⊤} - K^{- 1}) \frac{\partial K}{\partial Θ _{i}}},

\frac{\partial}{\partial Θ _{i}} lo g p (y ∣ x, Θ_{i}) = \frac{1}{2} y^{⊤} K^{- 1} \frac{\partial K}{\partial Θ _{i}} K^{- 1} y - \frac{1}{2} tr (K^{- 1} \frac{\partial K}{\partial Θ _{i}}) = \frac{1}{2} tr {(α α^{⊤} - K^{- 1}) \frac{\partial K}{\partial Θ _{i}}},

\frac{\partial K}{\partial l _{i}}

\frac{\partial K}{\partial l _{i}}

\frac{\partial K}{\partial σ _{i}}

l_{i}^{(t + 1)}

l_{i}^{(t + 1)}

σ_{i}^{(t + 1)}

\overset{r}{^}_{i g} = \frac{π _{g} ϕ ( y _{i} ∣ 0 , K _{g} )}{\sum _{h = 1}^{G} π _{h} ϕ ( y _{i} ∣ 0 , K _{h} )} .

\overset{r}{^}_{i g} = \frac{π _{g} ϕ ( y _{i} ∣ 0 , K _{g} )}{\sum _{h = 1}^{G} π _{h} ϕ ( y _{i} ∣ 0 , K _{h} )} .

π_{g} = \frac{m _{g}}{m},

π_{g} = \frac{m _{g}}{m},

\hat{l}_{g} = \frac{1}{m _{g}} i = 1 \sum n \overset{r}{^}_{i g} l_{i}^{max}, \overset{σ}{^}_{g} = \frac{1}{m _{g}} i = 1 \sum n \overset{r}{^}_{i g} σ_{i}^{max},

\hat{l}_{g} = \frac{1}{m _{g}} i = 1 \sum n \overset{r}{^}_{i g} l_{i}^{max}, \overset{σ}{^}_{g} = \frac{1}{m _{g}} i = 1 \sum n \overset{r}{^}_{i g} σ_{i}^{max},

Peer Reviews

No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.

Videos

No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.

Full text

Detecting British Columbia Coastal Rainfall Patterns by Clustering Gaussian Processes

Forrest Paton and Paul D. McNicholas

(Department of Mathematics and Statistics, McMaster University, Hamilton, Ontario, Canada.)

Abstract

Functional data analysis is a statistical framework where data are assumed to follow some functional form. This method of analysis is commonly applied to time series data, where time, measured continuously or in discrete intervals, serves as the location for a function’s value. Gaussian processes are a generalization of the multivariate normal distribution to function space and, in this paper, they are used to shed light on coastal rainfall patterns in British Columbia (BC). Specifically, this work addressed the question over how one should carry out an exploratory cluster analysis for the BC, or any similar, coastal rainfall data. An approach is developed for clustering multiple processes observed on a comparable interval, based on how similar their underlying covariance kernel is. This approach provides interesting insights into the BC data, and these insights can be framed in terms of El Niño and La Niña; however, the result is not simply one cluster representing El Niño years and another for La Niña years. From one perspective, the results show that clustering annual rainfall can potentially be used to identify extreme weather patterns.

Keywords: British Columbia; Clustering; Coastal Rainfall; El Niño; Extreme weather; Gaussian processes; La Niña; Mixture model.

1 Introduction

In contrast to predictable yearly seasonal changes, El Niño, a well studied teleconnection, does not occur at regular intervals. This is of particular interest for policy makers, businesses, and people who rely on calculable, foreseeable weather patterns. For example, seasonal changes can specifically alter food production plans, such as when to deploy fishing vessels or harvest crops. While El Niño is primarily categorized through warming temperatures in the eastern and equatorial Pacific Ocean, its effects can be seen around the globe through teleconnections. Teleconnections are an environmental phenomena that describe correlated large-scale atmospheric changes over non-contiguous geographic regions. While some teleconnections are well established, others rely on observing statistical irregularities (Gudmundson and Babkina, 2003; Ward et al., 2014); namely, El Niño’s effect on precipitation patterns. Sir Gilbert Walker, a 20th century English scientist, for example, first identified the link between Asian monsoons and Pacific coastal barometer readings (Gudmundson and Babkina, 2003). However, the study of teleconnections, such as precipitation patterns, is notoriously complex because of intricate spatial and temporal correlations. While spatial data is often modelled under the assumption that geographical points near one another share more information than those far apart, work has been done to model both nearby and remote geographical correlations. Hewitt et al. (2018) predict precipitation in Colorado by modelling both locally available data and remote Pacific Ocean sea surface temperatures. Understanding El Niño’s effect on distant precipitation can shed light on these patterns. Specifically, classifying irregular precipitation patterns in the Americas can help understand El Niño’s impact on local weather systems.

A question of particular interest is: which years exhibit distinct rainfall patterns? For prediction, knowing how Pacific Ocean changes affect rainfall can have significant implications (O’Gorman, 2015). A model for clustering yearly rainfall data will also give insight to research on the mechanistic properties of teleconnections. This has overlap in the statistical field of functional data clustering, where data are assumed to follow some functional form. Herein, mixture model-based clustering provides an effective approach to cluster data, and Gaussian Processes provide a model for rainfall. Much recent work has been done modelling climatological data with spatial and time dependence (Armal et al., 2018; Cabral et al., 2019; Gupta et al., 2016). Specifically, the use of Gaussian processes (GPs) gives a probabilistic starting point. A GP is a stochastic process that generalizes a finite-dimensional normal distribution to function space. GPs have been used to successfully solve complex non-linear regression and classification problems (Rasmussen, 2005; Roberts et al., 2013). When multiple functions exist on the same interval, usually compact $[0,T]$ and finite, it can be useful to classify them into a finite number of mutually exclusive groups. Here the process would be defined on the index set time, specifically the months of the year.

2 Background

2.1 Gaussian Process

Rasmussen (2005) defines a GP as: “…a collection of random variables, any finite number of which have a joint Gaussian distribution”. A GP $\mathbf{f}(\mathbf{x})$ creates a set of random variables evaluated at $\mathbf{x}$ . In essence, a GP is a distribution over functions, i.e.,

[TABLE]

where $m(\mathbf{x})=\mathbb{E}[\mathbf{f}(\mathbf{x})]$ and $k(\mathbf{x},\mathbf{x}^{\top})=\mathbb{E}[(\mathbf{f}(\mathbf{x})-m(\mathbf{x}))(\mathbf{f}(\mathbf{x}^{\top})-m(\mathbf{x}^{\top}))]$ are the mean function and covariance kernel, respectively, $\mathbf{x}=({x}_{1},\ldots,{x}_{n})^{\top}$ is the function’s index, and $\mathbf{f}(\mathbf{x})=(f_{1},\ldots,f_{n})^{\top}$ is the function’s output, i.e., $(x_{i},f_{i})$ is a point in $\mathbb{R}^{2}$ . A common GP, and the kernel considered in this paper, is defined with mean function 0 and squared exponential (SE) covariance function

[TABLE]

where $\sigma^{2}$ and $l$ are hyper-parameters that control the shape of the process; specifically, $\sigma^{2}$ controls the amount of variation in $\mathbf{f}(\mathbf{x})$ and $l$ , the length-scale parameter, controls the correlation. The SE kernel is a widely used (Rasmussen, 2005). One convenient property of the SE kernel is infinite differentiability, which is useful because the first derivative is needed for hyper-parameter estimation. Here, the use of the term hyper-parameter refers to a set of parameters that make up the “non-parametric model”. That is, the hyper-parameters only appear in the model’s prior and, as shown later in (3), are integrated out of the final model (posterior).

Although the SE is the most popular, different kernel functions can be used. Because the kernel is needed to populate a multivariate normal distribution covariance matrix, the kernel is restricted to produce positive semi-definite matrices. While the SE covariance kernel is a popular choice, modelling the shape of the covariance kernel is an open ended problem. One systematic solution can be formed by considering a Bayesian model selection framework as discussed in Rasmussen (2005). Duvenaud et al. (2013) also provide a solution by considering automatic kernel selection through searching over appropriate kernel structures. For the SE covariance kernel, $\sigma^{2}$ controls the height or the amplitude of the GP and $l$ controls the correlation between observations. This kernel is used to construct a matrix, K, which will serve as the covariance matrix in a multivariate normal distribution introduced in the next section:

[TABLE]

While a GP is defined on the entire real line, we only observe a finite number $n$ of realizations $\mathbf{x}=({x}_{1},\ldots,{x}_{n})^{\top}$ , and corresponding $\mathbf{y}=({y}_{1},\ldots,{y}_{n})^{\top}$ , where $\mathbf{y}\triangleq\mathbf{f}(\mathbf{x})$ . The vector $\mathbf{x}$ is commonly called the input and represents the location of the the process, i.e., observation $y_{i}=f(x_{i})$ . The vector $\mathbf{y}$ is referred to as the output, and is the function evaluated at location $\mathbf{x}$ . This allows for a generalization to a multivariate normal distribution via $\mathbf{f}(\mathbf{x})\sim\mathcal{N}\big{(}\textbf{0},\ \textbf{K}\big{)}$ . This is possible because marginalizing a Gaussian distribution is trivial: the resulting distribution is Gaussian and we can ignore the ( $x,y$ ) pairs that are unobserved or missing. Here changing the kernel affects the shape of the function, effectively controlling the magnitude of which observations $x_{i}$ and $x_{j}$ are correlated. Formulating the problem in this way, we can see the kernel is our prior on the function space, and the (marginal) likelihood for the GP comes after conditioning on the realized points. As shown in the next section, the hyper-parameters are often estimated to maximize the likelihood of the GP.

2.2 Likelihood

The previous section introduced the kernel function and how it relates to a prior on function space and how the hyper-parameters affect the correlation between the input $\mathbf{x}$ and outcome $\mathbf{y}$ . Now, the likelihood for a GP will be introduced and strategies for choosing the hyper-parameters will be illustrated. From the definition of a GP, $\mathbf{y}\sim\mathcal{N}(\mathbf{0},\textbf{K})$ , which will be shown formally below. First, let $\mathbf{f}\sim\mathcal{N}(\mathbf{0},\textbf{K})$ , where K is the covariance matrix constructed from the SE kernel shown in (2).

The likelihood for a GP is conditioned on the observed values to obtain a marginal likelihood, using $\phi$ to denote the normal density function:

[TABLE]

We get the marginal for $\mathbf{y}$ by using Bayes’ rule and integrating over $\mathbf{f}$ :

[TABLE]

Taking the natural logarithm of (3) gives

[TABLE]

This likelihood can be broken down into three main components, the data fit term, model complexity term, and a constant term:

[TABLE]

The data fit and complexity component share an interesting tradeoff. For small length-scale values $l$ , the model will fit the data well and the data fit component will be small. However, points will not be considered “near” each other, resulting in a high model complexity. Conversely, if $l$ is large (suggesting no correlation between points), then the the complexity will be small but the data fit term will be large (Murphy, 2012). This is because the SE kernel will converge to $\sigma^{2}$ , turning K into a diagonal matrix. Because GPs have these inherent penalty terms for over- and under-fitting, cross validation is generally not used to estimate kernel hyper-parameters.

2.3 Predictive Distribution

GPs are commonly used in supervised regression tasks for their ability to non-parametrically approximate complex functions and solve functional engineering problems (Bin and Wenlai, 2013). It is often of interest to infer the function’s value outside of the paired training data ( $\mathbf{x},\mathbf{y}$ ). To do this, a predictive distribution can be constructed. Let $\mathbf{y}_{*}=\mathbf{f}(\mathbf{x}_{*})$ be the unobserved outputs to be inferred at locations $\mathbf{x}_{*}$ . The joint distribution can be derived through probabilistic terms,

[TABLE]

Note that (8) is the joint distribution of the observed pairs ( $\mathbf{x},\mathbf{y}$ ) and unobserved pairs ( $\mathbf{x}_{*}$ , $\mathbf{f}(\mathbf{x}_{*})$ ). The expected value for $\mathbf{f}(\mathbf{x}_{*})$ can be derived using conditional properties which leads to

[TABLE]

a complete derivation is given by Rasmussen (2005). Then, (9) can then be used to compute estimates for the value of the function $\mathbf{f}(\mathbf{x}_{*})$ at location $\mathbf{x}_{*}$ .

2.4 Model-Based Clustering

Clustering, a.k.a. unsupervised classification, is an unsupervised machine learning task which attempts to classify (unlabelled) data points into distinct groups. Commonly, clustering is defined as assigning data into groups such that data in the same cluster are more similar to each other than to data in different clusters. Initially this definition seems intuitive; however, practically there are some problems. Namely, grouping each data point into its own cluster would satisfy this definition. Following several others (e.g., Tiedeman, 1955; Wolfe, 1963), McNicholas (2016a) provides a definition not based on similarity: “a cluster is a unimodal component within an appropriate finite mixture model”, where the word ‘appropriate’ requires consideration and, specifically, that the component densities have the necessary flexibility to fit the data (see McNicholas, 2016a, for further discussion). Whatever definition of a cluster one may prefer, many methods have been developed to tackle this problem of unsupervised learning. Model-based refers to using probability distributions to model the clusters (as opposed to hierarchical clustering, $k$ -means clustering, etc.).

2.5 Finite Mixture Model

The finite mixture model is a popular tool for (model-based) clustering — recent reviews are provided by Bouveyron and Brunet-Saumard (2014) and McNicholas (2016b). The density of a $G$ -component finite mixture model is

[TABLE]

where $\pi_{g}>0$ , with $\sum_{g=1}^{G}\pi_{g}=1$ , is the $g$ th mixing proportion $f_{g}(\mathbf{x}|\bm{\theta}_{g})$ is the $g$ th component density, and $\bm{\theta}=(\bm{\theta}_{1},\ldots,\bm{\theta}_{G})$ are the component density-specific parameters, with $\bm{\varphi}=(\bm{\pi},\bm{\theta})$ and $\bm{\pi}=(\pi_{1},...,\pi_{G})$ . The likelihood for $\mathbf{x}_{1},\ldots,\mathbf{x}_{n}$ in a model-based clustering paradigm, using a Gaussian mixture model, is given by

[TABLE]

where $\phi(\mathbf{x}_{i}\ |\ \bm{\mu}_{g},\bm{\Sigma}_{g})$ is the density of a multivariate Gaussian distribution with mean $\bm{\mu}_{g}$ and covariance matrix $\bm{\Sigma}_{g}$ .

2.6 Expectation-Maximization Algorithm

Model-based clustering requires estimating the unknown model parameters from the likelihood in (11). The expectation-maximization (EM) algorithm (Dempster et al., 1977) provides a good starting point for this problem. Each iteration of the EM algorithm starts by computing the expectation of the complete-data log-likelihood (E-step), then maximizes the conditional expectation of the complete-data log-likelihood (M-step). The E- and M-steps are iterated until some stopping rule is met. Consider a Gaussian model-based clustering complete-data likelihood, denoted by $\mathcal{L}_{c}(\bm{\varphi})$ , where $\bm{\varphi}$ denotes all the parameters and the complete-data comprise the observed $\mathbf{x}_{1},\ldots,\mathbf{x}_{n}$ together with the missing labels $\mathbf{z}_{1},\ldots,\mathbf{z}_{n}$ defined so that $\mathbf{z}_{i}=(z_{i1},\ldots,z_{iG})$ and $z_{ig}=1$ if observation $i$ belongs to component $g$ and $z_{ig}=0$ otherwise. Now, for the Gaussian mixture model-based clustering paradigm — corresponding to the likelihood in (11) — we have complete-data likelihood

[TABLE]

In the E-step, we compute

[TABLE]

conditional on the current parameter updates (estimates). Next, in the M-step, the parameters are updated. This amounts to estimating the covariance matrix and mean vector for a Gaussian mixture model. In the M-step, the updates are:

[TABLE]

where $n_{g}=\sum^{n}_{i=1}\hat{z}_{ig}$ . After parameter estimation is completed, the clustering results are expressed through the probabilities $\hat{z}_{ig}$ , i.e., $\hat{z}_{ig}$ is the probability that $\mathbf{x}_{i}$ belongs to component $g$ . These soft probabilities $\hat{z}_{ig}\in[0,1]$ are often converted into hard classifications via maximum a posteriori probabilities:

[TABLE]

Extensive details on model-based clustering and parameter estimation are given by McNicholas (2016a).

3 Model to Cluster Functional Data

3.1 Model Formulation

We have seen that the log-likelihood for a GP with the observed output vector $\mathbf{y}$ and corresponding input vector $\mathbf{x}$ was distributed according to a multivariate Gaussian distribution. When clustering GPs, the goal will be to find clusters that contain processes which have similar paths. The meaning of similar path refers not only to how close two processes’ values are but also to how similar their shapes are (smooth, wiggly, etc.). Now let us define the notation used for the model: the $i$ th GP will have output vector $\mathbf{y}_{i}$ , input vector $\mathbf{x}_{i}$ , and

[TABLE]

The density in (18) is the probability density function, i.e., $p_{g}(\mathbf{y}_{i}|\bm{\theta}_{g},\mathbf{x}_{i})$ , used as the component density for the finite mixture model

[TABLE]

where $\bm{\theta}_{g}=\{l_{g},\sigma_{g}\}$ denotes the hyper-parameters for the $g$ th cluster. Because the likelihood of a GP is a Gaussian distribution with covariance matrix $\mathbf{K}$ , the complete-data likelihood is given by

[TABLE]

where $\mathbf{K}_{g}$ is the covariance matrix corresponding to cluster $g$ and $\phi(\mathbf{y}_{i}\ |\ \bm{0},\mathbf{K}_{g})$ is the Gaussian density with mean $\bm{0}$ and covariance $\mathbf{K}_{g}$ . An SE covariance kernel is used as the prior on the function space, i.e.,

[TABLE]

The goal is to recover the $G$ pairs of kernel hyper-parameters $\bm{\theta_{g}}=\{l_{g},\sigma_{g}^{2}\}$ and the mixing parameters $\bm{\pi}=(\pi_{1},...,\pi_{G})$ , and thence to estimate the latent variables $\bm{z}_{1},\ldots,\bm{z}_{n}$ .

3.2 GP Parameter Estimation

The first step is to estimate each GP’s kernel hyperparameters. Herein, the kernel hyper-parameters for the $i$ th GP is denoted by $\bm{\Theta}_{i}=\{l_{i}$ , $\sigma_{i}\}$ . In this step, the maximized kernel hyper-parameters for each GP, $l_{i}^{\text{max}}$ and $\sigma_{i}^{\ \text{max}}$ , are estimated. To find these maximized hyper-parameters, an MLE solution is found using gradient ascent, starting with the log-likelihood i.e.,

[TABLE]

The derivative is then taken w.r.t. to the kernel hyper-parameters

[TABLE]

where $\bm{\alpha}=\textbf{K}^{-1}\mathbf{y}$ . The partial derivatives for $l_{i}$ and $\sigma_{i}$ are calculated from the first derivatives of the kernel function:

[TABLE]

After finding the gradient for the likelihood, a gradient ascent algorithm is used to find a sufficiently close solution. This algorithm is given by repeating the following updates until a stopping rule is statisfied:

[TABLE]

where superscript $(t)$ denotes iteration $t$ . After maximizing the kernel hyper-parameters, we have $\bm{\hat{}}{\boldsymbol{\Theta}}=\{\bm{\hat{}}{\boldsymbol{\Theta}}_{1},\bm{\hat{}}{\boldsymbol{\Theta}}_{2},\ldots,\bm{\hat{}}{\boldsymbol{\Theta}}_{n}\},$ where $\bm{\hat{}}{\boldsymbol{\Theta}}_{1}=\{l_{1}^{\ \text{max}},\sigma_{1}^{\ \text{max}}\}$ denotes the maximized kernel hyper-parameters for the first GP, $\bm{\hat{\Theta}}_{2}$ denotes the hyper-parameters for the second GP, and so on.

3.3 Cluster Parameter Estimation

The model seeks to cluster the processes and make inferences on the latent variables. A modified EM approach is used. First, the mixing proportions and cluster hyper-parameters are initialized randomly, i.e., we initialize $\boldsymbol{\pi}$ , $\boldsymbol{l}$ , and $\boldsymbol{\sigma}$ , where $\boldsymbol{l}=\{l_{1},l_{2},\ldots,l_{G}\}$ , $\boldsymbol{\sigma}=\{\sigma_{1},\sigma_{2},\ldots,\sigma_{G}\}$ , and $\boldsymbol{\pi}=\{\pi_{1},\pi_{2},\ldots,\pi_{G}\}$ randomly. Next, each GP’s ( $\text{GP}_{1},\dots,\text{GP}_{n}$ ) responsibilities are calculated for each of the $G$ clusters to get $n\times G$ responsibilities

[TABLE]

Note that $\hat{r}_{ig}$ represents the responsibility, or conditional expected value, of the $i$ th process belonging to the $g$ th cluster, and $\hat{r}_{ig}\triangleq\hat{z}_{ig}$ . After the responsibilities are calculated, the mixing proportions $\pi_{1},\ldots,\pi_{G}$ are conditionally maximized on these responsibilities. The update for the $g$ th mixing proportion is

[TABLE]

where $m_{g}=\sum^{n}_{i=1}r_{ig}$ is the responsibility for cluster $g$ and $m=\sum_{g=1}^{G}m_{g}$ . The cluster-specific kernel hyper-parameters, i.e., $l_{g}$ and $\sigma_{g}$ , are then updated, where $l_{g}$ is the length-scale parameter for cluster $g$ and $\sigma_{g}$ is the height parameter for cluster $g$ :

[TABLE]

i.e., we are weighting the maximized hyper-parameters, $\sigma_{i}^{\ \text{max}}$ and $l_{i}^{\text{max}}$ , by their respective cluster responsibility $\hat{r}_{ig}$ .

This scheme, for calculating the responsibilities then updating the cluster parameters, is repeated until some stopping rule is met. In this case, when the change in expected complete-data log likelihood at iteration $t$ becomes small, i.e., until

[TABLE]

where $\mathcal{Q}(\bm{\varphi}^{(t)},\bm{\varphi}^{(t-1)})=\mathbb{E}[\log\left(\mathcal{L}_{c}(\bm{\varphi}^{(t)})\right)|\mathbf{y},\bm{\varphi}^{(t-1)}]$ is the expectation of the complete-data log likelihood.

3.4 Numerical Issues

At each iteration of gradient ascent, the GP’s likelihood gradient needs to be computed:

[TABLE]

This operation requires inverting a $t\times t$ matrix $\mathbf{K}^{-1}$ . Inverting large matrices is notoriously computationally unstable, especially when the matrices are not full rank (or sufficiently close) and eigenvalues become very large or very small. One solution is to first decompose the matrix into lower-triangular form, i.e., $\mathbf{K}=\mathbf{L}\mathbf{L}^{\top}$ and then invert $\mathbf{K}$ via $\mathbf{K}^{-1}=(\mathbf{L}^{-1})^{\top}\mathbf{L}^{-1}$ . We use the R (R Core Team, 2018) package FastGP (Gopalan and Bornn, 2016), which implements the package RcppEigen (Bates and Eddelbuettel, 2013) to invert the lower-triangular matrix $\mathbf{L}$ .

4 Simulation Studies

This section will first look at two cases of simulated data. The hyper-parameters $\bm{\theta}=\{\bm{l},\bm{\sigma}\}$ and the mixing proportions $\bm{\pi}$ will vary based on the simulated sets. The method developed in the previous section will then be applied to recover the hyper-parameters and classify each GP into their respective groups. For the two simulation studies, noiseless squared exponential covariance functions will be used, which in effect means that a perfectly interpolated, noiseless process is observed for the simulation.

4.1 Simulation I

The first simulation starts with generating 30 GPs. The processes are generated on the interval $[0,10]$ with $T=7$ evenly spaced realizations, i.e., each process has seven values spread evenly on the interval. In all, 10 of the 30 GPs are generated from a multivariate normal distribution using the R package mvtnorm (Genz et al., 2009). Where the covariance matrix was constructed using an SE covariance kernel with hyper-parameters $l=1$ and $\sigma=3$ . The remaining 20 were generated similarly but with a covariance matrix constructed with hyper-parameters $l=3$ and $\sigma=3$ .

After running the algorithm described in Section 3, estimates for the set of hyper-parameters and mixing proportion were recorded (Table 1). The mixing proportion is easily identified and accurately estimated. Using the MAP classification, the algorithm was able to correctly classify each process. Table 1 gives the mean parameter estimates and standard errors. This was done by randomly starting the algorithm 10 times — i.e., initializing the parameters from a random uniform draw — and calculating the mean and standard error from these 10 starts.

Once the processes are coloured by their MAP classification (Figure 1), one can visually see the difference between the two process clusters. The processes ( $g=2$ , blue) with the larger length-scale $l=3$ are smoother compared to those generated from the process with length-scale $l=1$ .

The length-scale parameter $l$ was also readily recovered in this scenario, producing similar estimates to the true hyper-parameter. The hyper-parameter $\sigma_{1}$ , which, recall, controls the function’s variance (in $y$ , the function’s output), is not near the true parameter value. One reason for this could because this cluster has a comparatively small length-scale $l=1$ , which models the relative correlation between the points.

4.2 Simulation II

The second simulation was carried out by first generating 20 GPs. Ten were generated from an SE covariance kernel with hyper-parameters $l=1$ and $\sigma=1$ . The remaining 10 GPs were generated from a covariance kernel with hyper-parameters $l=2$ and $\sigma=2$ . Similarly to Simulation I, the GPs were generated first by constructing the covariance matrix, then by generating random samples using the R package mvtnorm. In all, $T=9$ equally spaced observed values were recorded for each GP (Figure 2). Based on the plot in Figure 2, there seems to be no clear distinction or natural groups of processes. After coloring the processes by their (correct) classifications (Figure 2), there is still ambiguity about the two groups separation.

Again, the method accurately recovers the mixing parameter and length-scale (Table 2). However, for cluster 1, the length-scale $l$ is slightly overestimated and the method inflates $\sigma$ to account for the variance in the function’s output. The parameter estimates were calculated by randomly starting the algorithm 10 times and using the mean. Notably, the processes look very similar between groups, so much so that this solution might seem unconvincing if true group labels were unknown.

5 Coastal Rainfall in British Columbia

This section will look at historical monthly precipitation data for coastal regions of British Columbia (BC), Canada. These data are recorded by the Government of Canada and collected from the weather stations: Tofino A, Vancouver International Airport, Port Hardy A, and Victoria International Airport (Figure 3). These data were derived from the following resources available in the public domain.

Total monthly precipitation was recorded from January 1991 to December 2000. The ten years will be treated as independent GPs and each station will be treated separately for clustering purposes. The objective is to find $G=2$ cluster solutions for each station and compare the resulting clusters — because there are only 10 processes per station, $G>2$ cluster solutions were not explored. The data were processed first by removing the seasonality, i.e., the residual precipitation after a ten year monthly average was removed. The data were then centered and scaled such that the mean is [math] and standard deviation is $1$ . These analyses will use the SE covariance function discussed earlier. Thus, $\bm{\theta}=\{\bm{l}\}$ will be estimated and modelled holding $\bm{\sigma}=1$ constant. Note that the observed outputs $\mathbf{y}$ are (artificially) connected by lines for illustrative purposes (Figure 4).

Figure 4 illustrates the scaled Tofino data plotted by year. Next, the maximized hyper-parameter $l$ is fitted and results are shown in Figure 6. Because of multi-modal likelihoods in the gradient ascent approach to hyper-parameter fitting, a grid search was used to choose the optimal value. An example likelihood from this is shown in Figure 5.

Instead of applying the EM algorithm to estimate mixture parameters, the relatively small number of processes for each station meant that an exhaustive search could be used. That is, each two-group combination of the ten years was considered. This amounted to maximizing $2^{10}=1024$ likelihoods and selecting the model with the highest (complete-data) likelihood. Using these results, there seem to be two groups emerging, one group with a smaller length-scale and one group with a larger length-scale parameter. Group two has a larger group length scale parameter ( $0.842\leq l_{2}\leq 0.985$ ) as compared with group one ( $0.365\leq l_{1}\leq 0.401$ ). Group two is labelled the “irregular group” as, with the exception of Port Hardy, the mixing proportion suggests only about $10\%$ of the data come from this group (see Table 4). In this case, the years with a smaller length-scale indicate monthly rainfall is less correlated month-to-month than those with a larger length-scale.

Vancouver, Tofino, and Port Hardy all had the year 1995 assigned to group two (Table 3). Additionally, Port Hardy had the years 1992, 1994, 1998, and 2000 assigned to group two. The years assigned to group two in at least one weather station (1992, 1994, 1995, 1998, 1999, 2000) share an interesting characteristic related to Pacific Ocean temperatures. Specifically, El Niño and La Niña are events classified using the Oceanic Niño Index (ONI), an index that measures irregular ocean temperature changes over a three month moving average. An El Niño (irregularly warm) event immediately preceded a La Niña (irregularly cold) event twice during the studied time period. Once in 1995 and again in 1998. Both times the year started with warm enough ocean temperatures to classify it as an El Niño period, and by the end of the calendar year the ocean had cooled enough to be classified as La Niña (NOAA, 2019). The other years in the irregular cluster also differed in terms of regular Pacific Ocean temperatures. Generally, years clustered into the irregular group tended to correspond with falling Equatorial Pacific Ocean temperatures as shown in Figure 7A). These irregular years also coincided with years where the middle region of the Pacific Ocean (5S-5N and 170-120W) started the calendar year warmer than it ended, as illustrated in Figure 7B).

From further consideration of the estimated cluster parameters in Table 4, Cluster 2’s years tend towards a larger length-scale compared to Cluster 1. This suggests that in years where El Niño changes to La Niña, rainfall patterns change more smoothly (i.e., are more correlated) across months as opposed to regular weather years.

6 Discussion

A method for clustering functional data has been introduced to cluster coastal rainfall data from BC. First, the hyper-parameters that make up a GP were optimized through a gradient-based maximum likelihood optimizer. Because of computational feasibility, parameter estimates were obtained by considering every two-group combination of years and choosing the maximum likelihood. For the simulation studies, the usual EM algorithm for finite Gaussian mixture models was modified. Instead of maximizing the standard covariance matrix, hyper-parameters for a kernel function that measures correlation in $\mathbf{x}$ were optimized. The covariance matrix was then constructed from the optimal kernel parameters. Herein, we fix $G=2$ as known; however, if one were to consider data with more processes per station then $G>2$ could be considered. Notably missing, or incomplete, data can easily be handled by the proposed approach, either by using the predictive distribution to impute the missing data or by ignoring the missing values. This is possible because the model makes inference on the underlying hyper-parameters of the kernel, and not the particular index set of the process.

Two simulation studies were performed. When GPs from different distributions had a large difference in their length-scale parameter $l$ (i.e., 1 versus 3), parameters were readily recovered. When GPs had similar length-scale parameters, $l$ was recovered but $\sigma$ tended to shrink towards a common estimate between both clusters. The application to the rainfall data from the coastal region of B.C discovered two groups of years, one which contained “regular” years and the other “irregular” years. The irregular years consisted of years where there was a transition from El Niño to La Niña, or more generally cooling Pacific Ocean temperatures. These results suggest El Niño events have some effect on kernel hyper-parameters. The data were standardized to have a zero mean function, implying correlation between rainfall patterns month-to-month can discriminate some El Niño events (as apposed to magnitude of rainfall).

The most obvious direction for future work is to apply the approach developed herein to other rainfall data. In terms of the BC coastal rainfall data, one could carry out a clustering of all locations combined to see whether some years stand out from others. The BC data could be treated as matrix variate data and clustered accordingly, perhaps after the fashion of Gallaugher and McNicholas (2018). In both cases, it is of interest to observe whether $G>2$ clusters emerge.

Acknowledgements

This work was supported by the Canada Research Chairs program and an E.W.R. Steacie Memorial Fellowship.

Bibliography26

The reference list from the paper itself. Each links out to its DOI / PubMed record.

1Armal et al. (2018) Armal, S., N., Devineni, R., Khanbilvardi (2018) Trends in Extreme Rainfall Frequency in the Contiguous United States: Attribution to Climate Change and Climate Variability Modes. J. Climate 31 , 369–385.
2Bates and Eddelbuettel (2013) Bates, D. and Eddelbuettel, D. (2013). Fast and Elegant Numerical Linear Algebra Using the Rcpp Eigen Package. Journal of Statistical Software , 52 (5), 1–24.
3Berrocal (2016) Berrocal, V. J. (2016). Identifying trends in the spatial errors of a regional climate model via clustering. Environmetrics , 27 , 90–102.
4Bin and Wenlai (2013) Bin, S. and Wenlai, Y. (2013). Application of gaussian process regression to prediction of thermal comfort index. In 2013 IEEE 11th International Conference on Electronic Measurement Instruments , volume 2, pp. 958–961.
5Bouveyron and Brunet-Saumard (2014) Bouveyron, C. and Brunet-Saumard, C. (2014). Model-based clustering of high-dimensional data: A review. Computational Statistics and Data Analysis 71 , 52–78.
6Cabral et al. (2019) Cabral, R, Ferreira, A, Friederichs, P. (2019). Space–time trends and dependence of precipitation extremes in North‐Western Germany. Environmetrics 2019;e 2605.
7Dempster et al. (1977) Dempster, A. P., Laird, N. M., and Rubin, D. B. (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society: Series B , 39 , 1–38.
8Duvenaud et al. (2013) Duvenaud, D., Lloyd, J., Grosse, R., Tenenbaum, J., Zoubin, G. (2013). Structure Discovery in Nonparametric Regression through Compositional Kernel Search. Proceedings of the 30th International Conference on Machine Learning , 28 , 1166–1174.