Simultaneous Variable Selection, Clustering, and Smoothing in Function on Scalar Regression

Suchit Mehrotra; Arnab Maity

arXiv:1906.10286 · stat.AP · June 26, 2019

TL;DR

This paper introduces a Bayesian approach for function-on-scalar regression that simultaneously performs variable selection, clustering of correlated effects, and smoothing, effectively addressing multicollinearity and reducing dimensionality.

Contribution

It proposes a novel prior that groups correlated predictors, enabling dimension reduction without losing relevant variables, validated through simulations and real data application.

Findings

1. Outperforms existing dimension reduction methods in simulations
2. Effectively clusters correlated predictors in real data
3. Reduces multicollinearity issues in functional regression

Abstract

We address the problem of multicollinearity in a function-on-scalar regression model by using a prior which simultaneously selects, clusters, and smooths functional effects. Our methodology groups effects of highly correlated predictors, performing dimension reduction without dropping relevant predictors from the model. We validate our approach via a simulation study, showing superior performance relative to existing dimension reduction approaches in the function-on-scalar literature. We also demonstrate the use of our model on a data set of age-specific fertility rates from the United Nations Gender Information database.

Tables

Table 1: Mean pointwise MSE over 100 datasets for each model. Standard errors (in parentheses) are estimated using the bootstrap with 100 repetitions.

Model	N = 30	N = 60	N = 120	N = 240
Design 1
FOSR	3.02 (0.16)	1.19 (0.08)	0.64 (0.04)	0.3 (0.02)
FOSR-PM	2.12 (0.16)	1.2 (0.08)	0.71 (0.05)	0.29 (0.02)
FOSR-DP	0.94 (0.05)	0.4 (0.03)	0.16 (0.01)	0.07 (0.01)
FOSR-DPPM	1.12 (0.06)	0.43 (0.03)	0.17 (0.01)	0.06 (0.01)
Design 2
FOSR	4.61 (0.22)	2.8 (0.15)	1.38 (0.08)	0.73 (0.04)
FOSR-PM	4.1 (0.22)	3.49 (0.2)	1.94 (0.11)	0.99 (0.07)
FOSR-DP	1.49 (0.09)	0.89 (0.06)	0.37 (0.03)	0.16 (0.01)
FOSR-DPPM	2.1 (0.12)	1.13 (0.07)	0.48 (0.03)	0.19 (0.02)
Design 3
FOSR	2.75 (0.1)	1.18 (0.03)	0.64 (0.02)	0.31 (0.01)
FOSR-PM	2.49 (0.08)	1.53 (0.04)	0.81 (0.04)	0.36 (0.01)
FOSR-DP	1.85 (0.05)	1.18 (0.03)	0.75 (0.03)	0.41 (0.01)
FOSR-DPPM	2.16 (0.06)	1.27 (0.04)	0.78 (0.03)	0.42 (0.02)
Design 4
FOSR	1.73 (0.07)	0.7 (0.03)	0.31 (0.01)	0.16 (0.01)
FOSR-PM	1.09 (0.03)	0.67 (0.03)	0.29 (0.01)	0.14 (0.01)
FOSR-DP	0.95 (0.03)	0.62 (0.02)	0.3 (0.01)	0.17 (0.01)
FOSR-DPPM	1.16 (0.04)	0.62 (0.02)	0.3 (0.01)	0.17 (0.01)

Table 2(a): RAND Index

Model	N = 30	N = 60	N = 120	N = 240
Design 1
FOSR-DP	0.69 (0.01)	0.79 (0.01)	0.92 (0.01)	0.96 (0.01)
FOSR-DPPM	0.65 (0.01)	0.8 (0.01)	0.92 (0.01)	0.97 (0.01)
Design 2
FOSR-DP	0.69 (0.01)	0.78 (0.01)	0.87 (0.01)	0.93 (0.01)
FOSR-DPPM	0.63 (0.01)	0.76 (0.01)	0.84 (0.01)	0.92 (0.01)

Table 2(b): Adjusted RAND Index

Model	N = 30	N = 60	N = 120	N = 240
Design 1
FOSR-DP	0.41 (0.02)	0.59 (0.02)	0.83 (0.02)	0.93 (0.01)
FOSR-DPPM	0.47 (0.03)	0.64 (0.02)	0.84 (0.02)	0.94 (0.01)
Design 2
FOSR-DP	0.43 (0.02)	0.57 (0.03)	0.78 (0.02)	0.87 (0.02)
FOSR-DPPM	0.22 (0.02)	0.45 (0.03)	0.64 (0.02)	0.87 (0.02)

Table 3: Proportion of iterations in which each listed variable was set to zero by the FOSR-DPPM model.

Variable	Percent Zero
Contraception Prevalence (15-49)	0.000
Maternity Related Deaths per 1K Women	0.000
Births Atten. by Exp. Staff (% of Total)	0.000
Age at First Marriage	0.006
Male to Female Sex Ratio (15-49)	0.007
U5 Mortality	0.012
Cervical Cancer Deaths (Per 100K)	0.148
Health Expenditures (% of GDP)	0.291
Female BMI	0.556
Female Labor Force Participation	0.579
GDP Per Capita	0.814
Mean Yrs. of School (Women % Men) (25-34)	0.861
Life Expectancy	0.901
Dollar Billionaires (per 1M)	0.948
Alcohol Consumption per Adult	0.959

Equations

$$\beta_{p}\mid c_{p},\bm{\theta}=\sum_{k=1}^{K}I(c_{p}=k)\,\theta_{k},\quad c_{p}\mid\bm{\pi}\sim\text{Discrete}(\pi_{1},\dots,\pi_{K}),\quad\forall p\in\{1,\dots,P\};\qquad\theta_{k}\mid\lambda\sim\mathcal{N}(0,\lambda^{-1}),\quad\forall k\in\{1,\dots,K\}$$

$$P(c_{i}=k\mid{\bf c}_{-i})=\frac{n_{-i,k}}{N-1+\alpha},\qquad P(c_{i}\neq c_{j}\text{ for all }i\neq j\mid{\bf c}_{-i})=\frac{\alpha}{N-1+\alpha}$$

$$\beta_{p}\mid G\sim G,\quad\forall p\in\{1,\dots,P\};\qquad G\sim DP(G_{0},\alpha);\qquad G_{0}\mid\lambda=\mathcal{N}(0,\lambda^{-1})$$

$$G=\pi_{0}\delta_{0}+(1-\pi_{0})G^{*},\qquad G^{*}\sim DP(G_{0},\alpha)$$

$$G=\pi_{0}\delta_{0}+(1-\pi_{0})\sum_{k=1}^{\infty}\pi_{k}\delta_{\theta_{k}}=\sum_{k=0}^{\infty}\tilde{\pi}_{k}\delta_{\theta_{k}}$$

$$y_{i}(t)=\mu(t)+\sum_{p=1}^{P}x_{ip}\beta_{p}(t)+e_{i}(t)$$

$${\bf Y}={\bf W}\bm{\alpha}^{\sf T}+{\bf X}\bm{\beta}^{\sf T}+{\bf E}$$

$${\bf a}_{p}\sim\mathcal{N}_{M}({\bf 0},\lambda_{a_{p}}^{-1}{\bf R}^{-1})$$

$$(\tilde{{\bf b}}_{p},\tilde{\lambda}_{b_{p}})\sim G,\quad\forall p\in\{1,\dots,P_{c}\};\qquad G\sim DP(G_{0},\alpha);\qquad G_{0}=\mathcal{N}_{M}({\bf 0},\tilde{\lambda}_{b_{p}}^{-1}{\bf R}^{-1})\,\mathcal{G}(a_{\lambda},b_{\lambda})$$

$$(\tilde{{\bf b}}_{p},\tilde{\lambda}_{b_{p}})\sim\pi_{0}\delta_{0}+(1-\pi_{0})G^{*},\quad\forall p\in\{1,\dots,P_{c}\};\qquad G^{*}\sim DP(G_{0},\alpha);\qquad G_{0}=\mathcal{N}_{M}({\bf 0},\tilde{\lambda}_{b_{p}}^{-1}{\bf R}^{-1})\,\mathcal{G}(a_{\lambda},b_{\lambda});\qquad[\pi_{0},(1-\pi_{0})]\sim\text{Dirichlet}\left(\frac{\alpha_{0}}{2},\frac{\alpha_{0}}{2}\right)$$

$${\bf y}\sim\mathcal{N}_{NT}(\tilde{{\bf W}}{\bf a}+\tilde{{\bf X}}{\bf b},\tau^{-1}{\bf I});\qquad{\bf a}\sim\mathcal{N}_{MP_{f}}({\bf 0},\bm{\Lambda}_{a}^{-1}\otimes{\bf R}^{-1});\qquad{\bf b}\sim\mathcal{N}_{MK}({\bf 0},\bm{\Lambda}_{b}^{-1}\otimes{\bf R}^{-1})$$

$$\lambda_{a_{p}}\sim\mathcal{G}(a_{\lambda},b_{\lambda}),\quad\forall p\in\{1,\dots,P_{f}\};\qquad\lambda_{b_{p}}\sim\mathcal{G}(a_{\lambda},b_{\lambda}),\quad\forall p\in\{1,\dots,K\};\qquad\tau\sim\mathcal{G}(a_{\tau},b_{\tau})$$

$$P(c_{p}=k\mid{\bf c}_{-p},\bm{\lambda}_{b},\cdot)=\int f({\bf y}\mid{\bf b},{\bf c}^{\prime},\bm{\lambda}_{b},{\bf X},{\bf W},{\bf a},\tau)\,f({\bf b}\mid\bm{\lambda}_{b},{\bf R})\,d{\bf b}\times P(c_{p}=k\mid{\bf c}_{-p},\alpha)$$

$$\int f({\bf y}\mid{\bf b},{\bf c}^{\prime},\bm{\lambda}_{b},{\bf X},{\bf W},{\bf a},\tau)\,f({\bf b}\mid\bm{\lambda}_{b},{\bf R})\,d{\bf b}=(2\pi)^{-\frac{NT}{2}}\tau^{\frac{NT-M\tilde{K}}{2}}|\bm{\Lambda}_{b}^{\prime}|^{\frac{M}{2}}|{\bf R}|^{\frac{\tilde{K}}{2}}|{\bf G}|^{-\frac{1}{2}}\exp\left\{-\frac{\tau}{2}\left(\hat{{\bf e}}_{W}^{\sf T}\hat{{\bf e}}_{W}-{\bf g}^{\sf T}{\bf G}^{-1}{\bf g}\right)\right\}$$

$$\pi_{0}^{*}=P(c_{p}=0\mid{\bf c}_{-p},\alpha_{0})=\frac{n_{-p,0}+\alpha_{0}/2}{P_{c}-1+\alpha_{0}}$$

$$\text{For }k\geq 1,\quad P(c_{p}=k\mid{\bf c}_{-p},\alpha)=(1-\pi_{0}^{*})\left(\frac{n_{-p,k}}{P_{nz}-1+\alpha}\right),\qquad P(c_{p}\neq 0\text{ and }c_{p}\neq c_{j}\text{ for all }p\neq j\mid{\bf c}_{-p},\alpha)=(1-\pi_{0}^{*})\left(\frac{\alpha}{P_{nz}-1+\alpha}\right)$$

$(\tilde{{\bf b}}_{p},\tilde{\lambda}_{b_{p}})$

$(\tilde{{\bf b}}_{p},\tilde{\lambda}_{b_{p}})$

Code & Models

Repositories

suchitm/fosr_clust

Full text

Simultaneous Variable Selection, Clustering, and Smoothing in Function on Scalar Regression

Suchit Mehrotra

Department of Statistics, North Carolina State University

Arnab Maity

Department of Statistics, North Carolina State University

1 Introduction

Recent technological advances in computing power and data storage have simplified the collection of vast quantities of data. Modern scientific questions often involve statistical analyses of datasets with a large number of predictors, many of which have no meaningful effect on the response. In many situations, the analysis is further complicated by multicollinearity among the predictors. One example is microarray studies, where interest lies in understanding the relationship between a health outcome and gene expression; gene expression levels are often highly correlated, and the number of genes is substantially larger than the number of samples. In such situations, modern statistical approaches have relied on the assumption that highly correlated predictors share the same effect on the response, especially if the correlation is due to the same underlying factor. These approaches aim to reduce the dimension of the problem by summing or averaging groups of highly correlated columns and fitting a linear model with the new set of predictors, which is, in essence, equivalent to finding clusters of regression coefficients in a supervised manner.

Many methods exist which aim to cluster coefficients in the univariate linear model. For example, Bondell and Reich (2008), She (2010), and Sharma et al. (2013) all propose methods which minimize a loss function plus a penalty term to encourage clustering in the coefficients. These methods are computationally efficient, as they are all special cases of the generalized lasso (Tibshirani and Taylor, 2011). In the Bayesian setting, the coefficient clustering problem is solved by using mixture models as priors for the coefficients. Tadesse et al. (2005) use finite mixture models to cluster the coefficients, while Nott (2008), Kim et al. (2006), Dunson et al. (2008), and Curtis and Ghosh (2011) all use Dirichlet Process priors to estimate clusters.

In this paper we add to the existing dimension reduction literature for models with a functional response and a set of scalar predictors, which until now has focused only on variable selection. At present, many group penalties exist to select variables in a function-on-scalar regression. First, the group lasso (Yuan and Lin, 2006) can be used to shrink basis function coefficients belonging to the same predictor simultaneously to zero. Wang et al. (2007) modify this approach by using a group SCAD penalty, Chen et al. (2016) use a group MCP penalty, Fan and Reimherr (2017) propose an adaptive group lasso, and Kowal and Bourgeois (2018) utilize a group horseshoe prior. However, these approaches require the number of basis functions to be fixed before fitting the model, and recent work has focused on smoothing the coefficient functions along with selecting variables. For example, Goldsmith and Kitago (2016) use a penalized normal prior to smooth coefficient functions without any variable selection, while Parodi and Reimherr (2018) propose a framework for simultaneous variable selection and smoothing.

Currently, to the best of our knowledge, no function-on-scalar approaches exist which address multicollinearity in the predictors by exploiting a cluster structure in the coefficient functions. Additionally, in the Bayesian setting, methods to simultaneously select and smooth coefficient functions have yet to be developed. Consequently, our manuscript contributes to the current function-on-scalar literature by proposing a novel, heretofore unexplored, clustering-based dimension reduction technique for functional data.

Our paper proceeds as follows. We review the relevant background literature in Section 2, present our model and the relevant computational details in Sections 3 and 4, report a simulation study and its results in Section 5, and give a real data application in Section 6.

2 Background

In a univariate linear model, an approach to finding clusters in a vector of coefficients, ${{\bm{\mathbf{{\beta}}}}}$ , is to assume the existence of $K$ clusters and use finite mixture models as a prior for ${{\bm{\mathbf{{\beta}}}}}$ (Tadesse et al., 2005). Using a latent class representation for cluster membership the prior can be written as:

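$$
\begin{aligned}
\beta_{p}\mid c_{p},\bm{\theta} &= \sum_{k=1}^{K}I(c_{p}=k)\,\theta_{k}, && \forall p\in\{1,\dots,P\},\\
c_{p}\mid\bm{\pi} &\sim \text{Discrete}(\pi_{1},\dots,\pi_{K}), && \forall p\in\{1,\dots,P\},\\
\theta_{k}\mid\lambda &\sim \mathcal{N}(0,\lambda^{-1}), && \forall k\in\{1,\dots,K\},
\end{aligned}
\tag{1}
$$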

where ${\bf c}$ is a $P\times 1$ vector which stores class labels for the elements of ${{\bm{\mathbf{{\beta}}}}}$. The hierarchy in (1) implies that $\beta_{p}\sim G=\sum_{k=1}^{K}\pi_{k}\delta_{\theta_{k}}(\beta_{p})$ where $\delta_{\theta_{k}}(x)$ is $1$ if $x=\theta_{k}$ and $0$ otherwise. Consequently, each element of ${{\bm{\mathbf{{\beta}}}}}$ is drawn from a finite mixture of $K$ delta functions, and can only take on one of $K$ values. If we know $K$ in advance, estimating ${{\bm{\mathbf{{\theta}}}}}$ and class memberships, ${\bf c}$, provides a solution to the coefficient clustering problem.

However, this approach suffers from a major drawback in that $K$ has to be fixed a priori, which is information rarely available in applied settings. To remedy this issue, one can use priors which allow for a countably infinite number of mixture components, such as the Dirichlet Process.

2.1 Dirichlet Process Priors

The Dirichlet Process (DP) is indexed by a concentration parameter $\alpha$ and a base distribution $G_{0}$ , denoted $DP(G_{0},\alpha)$ . It has a long history of being used as a clustering prior for a set of points, $\{x_{1},\dots x_{N}\}$ , and was first investigated by Ferguson (1973) and Antoniak (1974). More recently, computationally efficient methods have been developed for parameter estimation by Escobar and West (1995), MacEachern and Müller (1998), and Neal (2000). Using a stick breaking construction of the Dirichlet process, it can be shown that $G(x)=\sum_{k=1}^{\infty}\pi_{k}\delta_{\theta_{k}}(x)$ is a sample from a $DP(G_{0},\alpha)$ , where $\pi_{k}=V_{k}\prod_{l=1}^{k-1}(1-V_{l})$ , $V_{k}\overset{iid}{\sim}\text{Beta}(1,\alpha)$ , and $\theta_{k}\overset{iid}{\sim}G_{0}$ (Sethuraman, 1994). Hence, the distributions sampled from a Dirichlet Process are a countably infinite mixture of point masses.
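The stick-breaking construction can be illustrated with a short simulation. The following is a minimal sketch, not part of the original paper: it truncates the infinite sum at a fixed number of sticks, and the function names, the truncation level, and the standard normal base distribution are all assumptions made for illustration.

```python
import random


def stick_breaking_weights(alpha, num_sticks, rng):
    """Return the first `num_sticks` weights pi_k = V_k * prod_{l<k} (1 - V_l),
    together with the stick length left over after truncation."""
    weights = []
    remaining = 1.0  # length of the stick that has not been broken off yet
    for _ in range(num_sticks):
        v = rng.betavariate(1.0, alpha)  # V_k ~ Beta(1, alpha)
        weights.append(v * remaining)
        remaining *= 1.0 - v
    return weights, remaining


def sample_truncated_dp(alpha, num_sticks, base_sampler, rng):
    """Truncated draw G = sum_k pi_k * delta_{theta_k}, with theta_k ~ G_0."""
    weights, leftover = stick_breaking_weights(alpha, num_sticks, rng)
    atoms = [base_sampler(rng) for _ in range(num_sticks)]
    return weights, atoms, leftover


# Illustrative choices (assumptions): alpha = 1, G_0 = N(0, 1), 50 sticks.
rng = random.Random(0)
weights, atoms, leftover = sample_truncated_dp(
    alpha=1.0,
    num_sticks=50,
    base_sampler=lambda r: r.gauss(0.0, 1.0),
    rng=rng,
)
```

Smaller values of `alpha` tend to concentrate the mass on the first few atoms, which is the mechanism that produces a small number of clusters.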

Key insights into the behaviour of the Dirichlet Process were given by Blackwell et al. (1973) and Neal (2000). Neal (2000) derives a conditional prior for the class labels, ${\bf c}$ , by viewing the Dirichlet Process mixture model as the limit of finite mixture models as $K\to\infty$ . By integrating out the mixing parameters ${{\bm{\mathbf{{\pi}}}}}$ he gives the following conditional priors for the elements of ${\bf c}$

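$$
\begin{aligned}
P(c_{i}=k\mid{\bf c}_{-i}) &= \frac{n_{-i,k}}{N-1+\alpha},\\
P(c_{i}\neq c_{j}\text{ for all }i\neq j\mid{\bf c}_{-i}) &= \frac{\alpha}{N-1+\alpha},
\end{aligned}
\tag{2}
$$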

where $n_{-i,k}$ is the number of $c_{j}$ , $j\neq i$ , that are equal to $k$ and ${\bf c}_{-i}$ is ${\bf c}$ without the $i^{th}$ element. This representation allows us to bypass the sampling of mixing weights, ${{\bm{\mathbf{{\pi}}}}}$ , and shows that the prior conditional probability of an observation being assigned to an existing cluster is proportional to the number of elements in that cluster.
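As an illustration of this conditional prior, the sketch below computes the probability of joining each existing cluster (count divided by $N-1+\alpha$) and of opening a new one ($\alpha$ divided by $N-1+\alpha$). The function name and the `"new"` dictionary key are hypothetical and chosen only for this example.

```python
from collections import Counter


def crp_conditional_prior(labels, i, alpha):
    """Conditional prior for c_i given all other labels.

    Existing cluster k gets n_{-i,k} / (N - 1 + alpha); a brand new
    cluster gets alpha / (N - 1 + alpha).
    """
    others = labels[:i] + labels[i + 1:]
    counts = Counter(others)  # n_{-i,k} for every cluster k seen in c_{-i}
    denom = len(labels) - 1 + alpha
    probs = {k: n / denom for k, n in counts.items()}
    probs["new"] = alpha / denom
    return probs


# Five items; we ask for the conditional prior of the last label.
labels = [1, 1, 2, 1, 3]
probs = crp_conditional_prior(labels, i=4, alpha=1.0)
```

The returned probabilities always sum to one, because the counts over ${\bf c}_{-i}$ add up to $N-1$.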

For the univariate linear model, we can modify the hierarchy in (1) for the Dirichlet Process prior as follows:

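$$
\begin{aligned}
\beta_{p}\mid G &\sim G, && \forall p\in\{1,\dots,P\},\\
G &\sim DP(G_{0},\alpha),\\
G_{0}\mid\lambda &= \mathcal{N}(0,\lambda^{-1}).
\end{aligned}
\tag{3}
$$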

Nott (2008) explored the estimation and predictive capacity of the DP prior with a normal base measure along with its applications in penalized spline smoothing. He also derived a Gibbs sampler for the model in (3) by leveraging a latent class representation for group membership. His algorithm iterates between sampling the cluster indicators for the elements of ${{\bm{\mathbf{{\beta}}}}}$ , ${\bf c}$ , and all other model parameters conditioned on the number of clusters in ${\bf c}$ , $K$ . We leverage this algorithm for development of our own Gibbs samplers and will discuss its relevant details in Section 4.

2.2 Variable Selection and Clustering

Kim et al. (2006), Dunson et al. (2008), and Curtis and Ghosh (2011) propose extensions to the prior in (3) for simultaneous variable selection and clustering. They use a mixture of a point mass at zero with a Dirichlet process prior

$$
G=\pi_{0}\delta_{0}+(1-\pi_{0})G^{*},\qquad G^{*}\sim DP(G_{0},\alpha),
\tag{4}
$$

where $0\leq\pi_{0}\leq 1$ is a mixture weight, $\delta_{0}$ is a point mass at $0$, and $DP(G_{0},\alpha)$ is a Dirichlet Process prior with precision $\alpha$ and base measure $G_{0}$. Using the constructive definition of Sethuraman (1994), the prior in (4) is again an infinite mixture of point masses, with the first point mass fixed at zero:

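$$
G=\pi_{0}\delta_{0}+(1-\pi_{0})\sum_{k=1}^{\infty}\pi_{k}\delta_{\theta_{k}}=\sum_{k=0}^{\infty}\tilde{\pi}_{k}\delta_{\theta_{k}},
\tag{5}
$$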

where $\tilde{\pi}_{0}=\pi_{0}$ , $\tilde{\pi}_{k}=(1-\pi_{0})\pi_{k}$ if $k\geq 1$ , $\theta_{0}=0$ and $\theta_{k}\overset{iid}{\sim}G_{0}$ if $k\geq 1$ . Computation for this model is similar to the model discussed in (3) because the prior in (5) can be interpreted as a clustering prior with the restriction that the first cluster is zero. We can account for this restriction by making a few changes to the Gibbs sampler developed by Nott (2008).

3 Model

In this section we extend the methods discussed in Section 2 to models with a functional response and a set of scalar covariates. Let $i\in\{1,\dots,N\}$ be the subject level index, and $p\in\{1,\dots,P\}$ be the predictor index. Then the general function-on-scalar model is:

$$
y_{i}(t)=\mu(t)+\sum_{p=1}^{P}x_{ip}\beta_{p}(t)+e_{i}(t),
\tag{6}
$$

where $y_{i}(t)$ is the functional response value for subject $i$ at time $t$ , $\mu(t)$ is an intercept term, $x_{ip}$ is the $p^{th}$ predictor value for the $i^{th}$ individual, $\beta_{p}(t)$ is the functional effect for the $p^{th}$ predictor at time $t$ , and $e_{i}(t)$ is an error term. If all observations are observed on a common grid, this model can be written in matrix form as: ${\bf Y}={\bf X}{{\bm{\mathbf{{\beta}}}}}^{\sf T}+{\bf E}$ .

Since we wish to cluster only some of the predictors (at the very least, the intercept will remain unclustered), we delineate two groups of functional effects and propose the following model for a grid of equally spaced points:

$$
{\bf Y}={\bf W}\bm{\alpha}^{\sf T}+{\bf X}\bm{\beta}^{\sf T}+{\bf E},
\tag{7}
$$

where ${\bf Y}_{N\times T}$ is a matrix of responses, ${\bf W}_{N\times P_{f}}$ is a matrix of predictors with free effects, ${{\bm{\mathbf{{\alpha}}}}}_{T\times P_{f}}$, ${\bf X}_{N\times P_{c}}$ is a matrix of predictors whose effects, ${{\bm{\mathbf{{\beta}}}}}_{T\times P_{c}}$, we wish to select and cluster, and ${\bf E}_{N\times T}$ is a matrix of errors with $i$th row ${\bf e}_{i}\sim\mathcal{N}_{T}({\bm{\mathbf{{0}}}},\tau^{-1}{\bf I})$. We expand each coefficient function as a linear combination of $M$ B-spline basis functions, $\{\theta_{1},\dots,\theta_{M}\}$, such that ${{\bm{\mathbf{{\alpha}}}}}_{T\times P_{f}}={{\bm{\mathbf{{\Theta}}}}}_{T\times M}{\bf A}_{M\times P_{f}}$ and ${{\bm{\mathbf{{\beta}}}}}_{T\times P_{c}}={{\bm{\mathbf{{\Theta}}}}}_{T\times M}{\bm{\tilde{\mathbf{{B}}}}}_{M\times P_{c}}$. This reduces the functional estimation problem to estimating the elements of ${\bf A}$ and ${\bm{\tilde{\mathbf{{B}}}}}$.

In practice, care must be taken in choosing $M$, balancing the need for a function to remain flexible against the risk of overfitting. We address this issue by expanding the functional effects using more basis functions than we think are necessary and using penalization to induce smoothness in the coefficient functions. We follow Goldsmith and Kitago (2016) and use a full-rank penalty matrix ${\bf R}$ to penalize basis coefficients, setting ${\bf R}=\eta{\bf I}+(1-\eta){\bf R}_{2}$, where ${\bf I}$ is the identity matrix, ${\bf R}_{2}$ is a second degree P-spline penalty (Eilers and Marx, 1996), and $\eta=0.001$. Consequently, the prior for the columns of ${\bf A}$, ${\bf a}_{p}$, is, $\forall p\in\{1,\dots,P_{f}\}$,

$$
{\bf a}_{p}\sim\mathcal{N}_{M}({\bf 0},\lambda_{a_{p}}^{-1}{\bf R}^{-1}),
\tag{8}
$$

where $\lambda_{a_{p}}\sim\mathcal{G}(a_{\lambda},b_{\lambda})$ and $\mathcal{G}(a,b)$ is the Gamma distribution with shape $a$ and rate $b$ .
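To make the penalty concrete, the following sketch builds ${\bf R}=\eta{\bf I}+(1-\eta){\bf R}_{2}$ under the assumption that ${\bf R}_{2}={\bf D}^{\sf T}{\bf D}$, with ${\bf D}$ the second-order difference matrix (a common reading of the P-spline penalty; the paper does not spell out the construction). All function names are hypothetical.

```python
def second_difference_penalty(m):
    """R_2 = D^T D, where D is the (m - 2) x m second-difference matrix
    with rows (1, -2, 1). This construction is an assumption."""
    d = [[0.0] * m for _ in range(m - 2)]
    for r in range(m - 2):
        d[r][r], d[r][r + 1], d[r][r + 2] = 1.0, -2.0, 1.0
    return [
        [sum(d[r][j] * d[r][k] for r in range(m - 2)) for k in range(m)]
        for j in range(m)
    ]


def full_rank_penalty(m, eta=0.001):
    """R = eta * I + (1 - eta) * R_2, which is positive definite for eta > 0."""
    r2 = second_difference_penalty(m)
    return [
        [(eta if j == k else 0.0) + (1.0 - eta) * r2[j][k] for k in range(m)]
        for j in range(m)
    ]


def quadratic_form(r, b):
    """Penalty value b^T R b for a coefficient vector b."""
    n = len(b)
    return sum(b[j] * r[j][k] * b[k] for j in range(n) for k in range(n))


M = 8  # number of basis functions, matching the simulations in Section 5.2
R = full_rank_penalty(M)
linear = [float(j) for j in range(M)]   # coefficients with zero curvature
wiggly = [(-1.0) ** j for j in range(M)]  # rapidly oscillating coefficients
penalty_linear = quadratic_form(R, linear)
penalty_wiggly = quadratic_form(R, wiggly)
```

A linear coefficient sequence is penalized only through the small ridge term $\eta{\bf I}$, while an oscillating sequence incurs a large penalty, which is what drives the smoothing.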

3.1 Clustering Functional Effects

Many approaches exist for clustering functional data and we refer the reader to Jacques and Preda (2014) for a review. Because we expand each coefficient function using b-splines, our clustering approach is similar to the work initially proposed by Abraham et al. (2003); we find clusters in ${{\bm{\mathbf{{\beta}}}}}$ by clustering the coefficients of the basis expansion. If we assume the columns of ${{\bm{\mathbf{{\beta}}}}}$ come from $K$ clusters, we can expand it as ${{\bm{\mathbf{{\beta}}}}}={{\bm{\mathbf{{\Theta}}}}}{\bm{\tilde{\mathbf{{B}}}}}={{\bm{\mathbf{{\Theta}}}}}{\bf B}{\bf C}^{\sf T}$ where ${{\bm{\mathbf{{\Theta}}}}}_{T\times M}$ and ${\bf B}_{M\times K}$ are the basis function and coefficient matrices, respectively, and ${\bf C}_{P_{c}\times K}$ is a matrix whose rows are one-hot encoded with class membership information. Consequently, ${\bf X}{{\bm{\mathbf{{\beta}}}}}^{\sf T}={\bf X}{\bm{\tilde{\mathbf{{B}}}}}^{\sf T}{{\bm{\mathbf{{\Theta}}}}}^{\sf T}={\bf X}{\bf C}{\bf B}^{\sf T}{{\bm{\mathbf{{\Theta}}}}}^{\sf T}$ , and ${\bf C}$ has the effect of summing columns of ${\bf X}$ with the same effect on the response.
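The column-summing effect of ${\bf C}$ can be seen in a toy example. This is an illustrative sketch only: the data are made up, the helper names are hypothetical, and cluster labels are 0-indexed for convenience.

```python
def one_hot_membership(labels, num_clusters):
    """Build the P_c x K indicator matrix C from 0-indexed cluster labels."""
    return [[1 if c == k else 0 for k in range(num_clusters)] for c in labels]


def matmul(a, b):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


# Two subjects, four predictors; predictors 0 and 2 share one effect,
# predictors 1 and 3 share another.
X = [[1, 2, 3, 4],
     [5, 6, 7, 8]]
labels = [0, 1, 0, 1]
C = one_hot_membership(labels, num_clusters=2)

# Each column of XC is the sum of the columns of X in the same cluster.
XC = matmul(X, C)
```

Fitting the model with the $N\times K$ matrix ${\bf X}{\bf C}$ in place of the $N\times P_{c}$ matrix ${\bf X}$ is where the dimension reduction comes from.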

Because we are also interested in estimating smooth coefficient functions, we give each predictor its own smoothing parameter, $\tilde{\lambda}_{b_{p}}$ . Therefore, our task is to estimate a mean and smoothing parameter for each of our clusters. We can easily do this by modifying our base distribution to be a multivariate normal-gamma (Ray and Mallick, 2006; Zhang and Telesca, 2014). Consequently, for clustering functional effects, we propose the prior:

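$$
\begin{aligned}
(\tilde{{\bf b}}_{p},\tilde{\lambda}_{b_{p}}) &\sim G, && \forall p\in\{1,\dots,P_{c}\},\\
G &\sim DP(G_{0},\alpha),\\
G_{0} &= \mathcal{N}_{M}({\bf 0},\tilde{\lambda}_{b_{p}}^{-1}{\bf R}^{-1})\,\mathcal{G}(a_{\lambda},b_{\lambda}).
\end{aligned}
\tag{9}
$$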

Extending this prior for simultaneous variable selection and clustering requires only one modification:

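$$
\begin{aligned}
(\tilde{{\bf b}}_{p},\tilde{\lambda}_{b_{p}}) &\sim \pi_{0}\delta_{0}+(1-\pi_{0})G^{*}, && \forall p\in\{1,\dots,P_{c}\},\\
G^{*} &\sim DP(G_{0},\alpha),\\
G_{0} &= \mathcal{N}_{M}({\bf 0},\tilde{\lambda}_{b_{p}}^{-1}{\bf R}^{-1})\,\mathcal{G}(a_{\lambda},b_{\lambda}),\\
[\pi_{0},(1-\pi_{0})] &\sim \text{Dirichlet}\left(\frac{\alpha_{0}}{2},\frac{\alpha_{0}}{2}\right).
\end{aligned}
\tag{10}
$$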

In all computations in this paper we give $\alpha$ a non-informative Gamma prior and update it using the method outlined in Escobar and West (1995). We also fix $\alpha_{0}=2$ so that $\pi_{0}\sim\text{Unif}(0,1)$ .

4 Computation

In this section we give the details of a Gibbs sampler for estimating the parameters of our proposed models. The crux of the sampler is the update of the class labels for each parameter, ${\bf c}$ , and we extend the approach used by Nott (2008) to the model in (7).

We first list the model hierarchy conditioned on the cluster indicator matrix, ${\bf C}$ , and assume that it has $K$ columns. We also modify (7) and work with the transposed equation, ${\bf Y}^{\sf T}={{\bm{\mathbf{{\Theta}}}}}{\bf A}{\bf W}^{\sf T}+{{\bm{\mathbf{{\Theta}}}}}{\bf B}({\bf X}{\bf C})^{\sf T}+{\bf E}^{\sf T}$ . This, combined with the identity $\text{vec}({\bf A}{\bf B}{\bf C}^{\sf T})=({\bf C}\otimes{\bf A})\text{vec}({\bf B})$ , implies that our model can be written as ${\bf y}={\bm{\tilde{\mathbf{{W}}}}}{\bf a}+{\bm{\tilde{\mathbf{{X}}}}}{\bf b}+{\bf e}$ where ${\bf y}=\text{vec}({\bf Y}^{\sf T})$ , ${\bf e}=\text{vec}({\bf E}^{\sf T})$ , ${\bf a}=\text{vec}({\bf A})$ , ${\bf b}=\text{vec}({\bf B})$ , ${\bm{\tilde{\mathbf{{X}}}}}=({\bf X}{\bf C})\otimes{{\bm{\mathbf{{\Theta}}}}}$ and ${\bm{\tilde{\mathbf{{W}}}}}={\bf W}\otimes{{\bm{\mathbf{{\Theta}}}}}$ . Therefore, conditioned on ${\bf C}$ , our model hierarchy is given by:

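$$
\begin{aligned}
{\bf y} &\sim \mathcal{N}_{NT}(\tilde{{\bf W}}{\bf a}+\tilde{{\bf X}}{\bf b},\tau^{-1}{\bf I}),\\
{\bf a} &\sim \mathcal{N}_{MP_{f}}({\bf 0},\bm{\Lambda}_{a}^{-1}\otimes{\bf R}^{-1}),\\
{\bf b} &\sim \mathcal{N}_{MK}({\bf 0},\bm{\Lambda}_{b}^{-1}\otimes{\bf R}^{-1}),\\
\lambda_{a_{p}} &\sim \mathcal{G}(a_{\lambda},b_{\lambda}), && \forall p\in\{1,\dots,P_{f}\},\\
\lambda_{b_{p}} &\sim \mathcal{G}(a_{\lambda},b_{\lambda}), && \forall p\in\{1,\dots,K\},\\
\tau &\sim \mathcal{G}(a_{\tau},b_{\tau}),
\end{aligned}
\tag{11}
$$

where ${{\bm{\mathbf{{\Lambda}}}}}_{a}=\text{diag}(\lambda_{a_{1}},\dots,\lambda_{a_{P_{f}}})$ and ${{\bm{\mathbf{{\Lambda}}}}}_{b}=\text{diag}(\lambda_{b_{1}},\dots,\lambda_{b_{K}})$. From (11) it is easy to see that the updates of the parameters are similar to those for a normal linear model, with the posteriors of ${\bf a}$ and ${\bf b}$ being multivariate normal, and the posteriors of the $\lambda$'s and $\tau$ being Gamma. Therefore, the rest of this section will focus on estimation of the cluster indicator matrix, first for the clustering-only Dirichlet Process prior (9), followed by the variable selection and clustering prior (10).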
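The vectorization step used to reach this hierarchy rests on the identity $\text{vec}({\bf A}{\bf B}{\bf C}^{\sf T})=({\bf C}\otimes{\bf A})\text{vec}({\bf B})$, with $\text{vec}$ stacking columns. The sketch below checks it numerically on small integer matrices standing in for ${{\bm{\mathbf{{\Theta}}}}}$, ${\bf B}$, and ${\bf X}{\bf C}$; the values and helper names are made up for illustration.

```python
def matmul(a, b):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a):
    return [list(col) for col in zip(*a)]


def vec(a):
    """Stack the columns of a matrix into one long list (column-major)."""
    return [a[i][j] for j in range(len(a[0])) for i in range(len(a))]


def kron(a, b):
    """Kronecker product a (x) b: block (i, j) equals a[i][j] * b."""
    return [[x * y for x in row_a for y in row_b] for row_a in a for row_b in b]


Theta = [[1, 0], [1, 1], [1, 2]]        # T = 3 grid points, M = 2 basis functions
B = [[2, -1], [0, 3]]                   # M = 2, K = 2 clusters
XC = [[1, 2], [0, 1], [3, 1], [1, 1]]   # N = 4 subjects, K = 2 clusters

# Left-hand side: vec(Theta B (XC)^T), the clustered part of vec(Y^T).
lhs = vec(matmul(matmul(Theta, B), transpose(XC)))

# Right-hand side: ((XC) kron Theta) vec(B), i.e. X_tilde times b.
X_tilde = kron(XC, Theta)
rhs = [row[0] for row in matmul(X_tilde, [[v] for v in vec(B)])]
```

Here `X_tilde` has $NT=12$ rows and $MK=4$ columns, matching the dimensions of $\tilde{{\bf X}}$ in the hierarchy.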

4.1 Dirichlet Process

Updating ${\bf c}$ is the most computationally intensive step of the Gibbs sampler because its elements have to be updated sequentially. To update the class label for a predictor, $c_{p}$, we need to calculate the posterior probabilities of the predictor belonging to each cluster. In the linear model setting, Nott (2008) updates $c_{p}$ by integrating out the parameter vector, which is important for the mixing of the Markov chains, but requires $K+1$ matrix inversions for each predictor. Here, $K$ is the number of unique elements in ${\bf c}_{-p}$ and the additional inversion is for the proposal of a new cluster. Hence, his algorithm is a modification of the collapsed Gibbs sampler (Algorithm 3 in Neal (2000)).

For our purposes, the inclusion of a prior on the smoothing parameters in (9) complicates the integral required to collapse over all the cluster parameters. Hence, we combine Algorithms 3 and 8 from Neal (2000) and collapse only over the cluster means. Our update of the elements of ${\bf c}$ proceeds as follows. For each $c_{p}$, $p\in\{1,\dots,P_{c}\}$:

• Let $k^{-}$ be the number of distinct $c_{j}$ for $j\neq p$ and $h=k^{-}+1$.

• If $c_{p}=c_{j}$ for some $j\neq p$, draw $\lambda_{b_{h}}$ from $\mathcal{G}(a_{\lambda},b_{\lambda})$. If $c_{p}\neq c_{j}$ for all $j\neq p$, set $\lambda_{b_{h}}=\tilde{\lambda}_{b_{p}}$, where $\tilde{\lambda}_{b_{p}}$ is the smoothing parameter for the $p^{th}$ predictor.

• For each $k\in\{1,\dots,h\}$, compute

$$
P(c_{p}=k\mid{\bf c}_{-p},\bm{\lambda}_{b},\cdot)=\int f({\bf y}\mid{\bf b},{\bf c}^{\prime},\bm{\lambda}_{b},{\bf X},{\bf W},{\bf a},\tau)\,f({\bf b}\mid\bm{\lambda}_{b},{\bf R})\,d{\bf b}\times P(c_{p}=k\mid{\bf c}_{-p},\alpha),
\tag{12}
$$

where

$$
\int f({\bf y}\mid{\bf b},{\bf c}^{\prime},\bm{\lambda}_{b},{\bf X},{\bf W},{\bf a},\tau)\,f({\bf b}\mid\bm{\lambda}_{b},{\bf R})\,d{\bf b}=(2\pi)^{-\frac{NT}{2}}\tau^{\frac{NT-M\tilde{K}}{2}}|\bm{\Lambda}_{b}^{\prime}|^{\frac{M}{2}}|{\bf R}|^{\frac{\tilde{K}}{2}}|{\bf G}|^{-\frac{1}{2}}\exp\left\{-\frac{\tau}{2}\left(\hat{{\bf e}}_{W}^{\sf T}\hat{{\bf e}}_{W}-{\bf g}^{\sf T}{\bf G}^{-1}{\bf g}\right)\right\}
$$

and ${\bf G}=[({\bf X}{\bf C}^{\prime})\otimes{{\bm{\mathbf{{\Theta}}}}}]^{\sf T}[({\bf X}{\bf C}^{\prime})\otimes{{\bm{\mathbf{{\Theta}}}}}]+\frac{1}{\tau}({{\bm{\mathbf{{\Lambda}}}}}_{b}^{\prime}\otimes{\bf R})$, ${\bf g}=[({\bf X}{\bf C}^{\prime})\otimes{{\bm{\mathbf{{\Theta}}}}}]^{\sf T}{{\bm{\hat{\mathbf{{e}}}}}}_{W}$, ${{\bm{\hat{\mathbf{{e}}}}}}_{W}=\text{vec}({\bf Y}^{\sf T}-{{\bm{\mathbf{{\Theta}}}}}{\bf A}{\bf W}^{\sf T})$, ${\bf c}^{\prime}$ is the vector of class labels with $c_{p}=k$, ${\bf C}^{\prime}$ is the corresponding one-hot encoded matrix, $\tilde{K}$ is the number of clusters in ${\bf c}^{\prime}$, ${\bf b}$ is an $M\tilde{K}\times 1$ vector, ${{\bm{\mathbf{{\Lambda}}}}}_{b}^{\prime}$ is a $\tilde{K}\times\tilde{K}$ diagonal matrix of smoothing parameters for the clusters present in ${\bf c}^{\prime}$, and $P(c_{p}=k|{\bf c}_{-p},\alpha)$ can be calculated using the formulas in (2). It should be noted that while $h$ inverses have to be computed for each predictor, they can be computed in parallel to speed up the algorithm.
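For reference, the collapsed likelihood term for one candidate label vector can be evaluated on the log scale, which avoids underflow when $NT$ is large. The sketch below is a hedged illustration using NumPy rather than the authors' implementation: the function name, argument layout, and toy data are assumptions, and the penalty is built from second-order differences as an assumed form of ${\bf R}_{2}$.

```python
import numpy as np


def log_collapsed_likelihood(e_hat, X, C, Theta, R, lam, tau):
    """Log of the integral of f(y | b, ...) f(b | lambda_b, R) over b.

    e_hat : vec(Y^T - Theta A W^T), length N*T
    C     : candidate one-hot membership matrix (P_c x K)
    lam   : smoothing parameters of the K clusters present in C
    """
    T, M = Theta.shape
    N = X.shape[0]
    K = C.shape[1]
    X_tilde = np.kron(X @ C, Theta)                       # (N*T) x (M*K)
    G = X_tilde.T @ X_tilde + np.kron(np.diag(lam), R) / tau
    g = X_tilde.T @ e_hat
    _, logdet_G = np.linalg.slogdet(G)
    _, logdet_R = np.linalg.slogdet(R)
    quad = e_hat @ e_hat - g @ np.linalg.solve(G, g)
    return (
        -0.5 * N * T * np.log(2.0 * np.pi)
        + 0.5 * (N * T - M * K) * np.log(tau)
        + 0.5 * M * np.sum(np.log(lam))
        + 0.5 * K * logdet_R
        - 0.5 * logdet_G
        - 0.5 * tau * quad
    )


# Toy inputs (all values are illustrative assumptions).
rng = np.random.default_rng(0)
N, T, M, P_f = 6, 5, 3, 2
Theta = rng.normal(size=(T, M))
X = rng.normal(size=(N, 4))
W = rng.normal(size=(N, P_f))
A = rng.normal(size=(M, P_f))
Y = rng.normal(size=(N, T))
C = np.eye(2)[[0, 1, 0, 1]]                 # four predictors in two clusters
D = np.diff(np.eye(M), n=2, axis=0)         # second-difference matrix
R = 0.001 * np.eye(M) + 0.999 * D.T @ D
lam = np.array([1.5, 0.7])
tau = 2.0

# Row-major flattening of the N x T residual matrix equals vec of its transpose.
e_hat = (Y - W @ A.T @ Theta.T).ravel()
log_lik = log_collapsed_likelihood(e_hat, X, C, Theta, R, lam, tau)
```

In a sampler, one such value would be computed per candidate cluster, added to the log of the conditional prior, and the results normalized (for example with a log-sum-exp) before drawing the new label.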

4.2 Dirichlet Process + Point Mass

Computation for this model is similar to that discussed in Section 4.1 because the prior in (10) is a clustering prior with the restriction that the first cluster is zero. We can account for this by making a few changes to the Gibbs sampler outlined above. To update ${\bf c}$, conditional prior probabilities of the form in (2) need to be determined, and the integral in (12) has to be modified by dropping, from ${\bf X}$, the columns in the null (zero) cluster. The update of ${\bm{\tilde{\mathbf{{B}}}}}$ also relies on dropping the columns in the null cluster because their effects are restricted to be zero; the rest of the update proceeds as before with the smaller predictor matrix.

If we let $P_{0}$ denote the number of variables which are in the null cluster and $P_{nz}$ denote the number of variables which are non-zero $(P_{c}=P_{0}+P_{nz})$ , we can calculate conditional prior probabilities of the form in (2) by thinking of the prior in two levels: the first being a two component finite mixture model and the second a Dirichlet Process. To begin, we need to determine the prior probabilities of the predictor $p$ being equal to zero, $P(c_{p}=0|{\bf c}_{-p},\alpha_{0})$ , where $[\pi_{0},(1-\pi_{0})]$ is given a $\text{Dirichlet}\left(\frac{\alpha_{0}}{2},\frac{\alpha_{0}}{2}\right)$ prior. This is given directly by Neal (2000) to be

$$
\pi_{0}^{*}=P(c_{p}=0\mid{\bf c}_{-p},\alpha_{0})=\frac{n_{-p,0}+\alpha_{0}/2}{P_{c}-1+\alpha_{0}}.
\tag{13}
$$

Because only a subset of the predictors are non-zero, the Dirichlet Process prior can be thought of as active for only $P_{nz}$ predictors. Therefore, the probabilities from (13) can be combined with the probabilities from (2) to get the conditional prior for cluster membership:

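$$
\begin{aligned}
\pi_{0}^{*}=P(c_{p}=0\mid{\bf c}_{-p},\alpha_{0}) &= \frac{n_{-p,0}+\alpha_{0}/2}{P_{c}-1+\alpha_{0}},\\
\text{For }k\geq 1,\quad P(c_{p}=k\mid{\bf c}_{-p},\alpha) &= (1-\pi_{0}^{*})\left(\frac{n_{-p,k}}{P_{nz}-1+\alpha}\right),\\
P(c_{p}\neq 0\text{ and }c_{p}\neq c_{j}\text{ for all }p\neq j\mid{\bf c}_{-p},\alpha) &= (1-\pi_{0}^{*})\left(\frac{\alpha}{P_{nz}-1+\alpha}\right).
\end{aligned}
\tag{14}
$$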
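The two-level conditional prior can be sketched as follows. This is an illustration rather than the authors' code: the function name and the `"new"` key are hypothetical, label `0` denotes the null cluster, and $P_{nz}-1$ is taken to be the number of other predictors currently in non-zero clusters, an interpretation assumed here because it makes the probabilities sum to one.

```python
from collections import Counter


def spike_dp_conditional_prior(labels, p, alpha, alpha0=2.0):
    """Conditional prior for c_p under the point mass plus DP prior.

    Level 1: null versus non-zero, from a Dirichlet(alpha0/2, alpha0/2) prior.
    Level 2: Chinese-restaurant-type prior among the non-zero predictors.
    """
    others = labels[:p] + labels[p + 1:]
    n_zero = sum(1 for c in others if c == 0)                # n_{-p,0}
    nonzero_counts = Counter(c for c in others if c != 0)    # n_{-p,k}, k >= 1
    n_nonzero = len(others) - n_zero                         # assumed P_nz - 1

    pi0_star = (n_zero + alpha0 / 2.0) / (len(labels) - 1 + alpha0)
    denom = n_nonzero + alpha

    probs = {0: pi0_star}
    for k, n in nonzero_counts.items():
        probs[k] = (1.0 - pi0_star) * n / denom
    probs["new"] = (1.0 - pi0_star) * alpha / denom
    return probs


# Six predictors: three in the null cluster, two in cluster 1, one in cluster 2.
labels = [0, 0, 1, 1, 2, 0]
probs = spike_dp_conditional_prior(labels, p=5, alpha=1.0)
```

With $\alpha_{0}=2$, as fixed in Section 3.1, the first level reduces to add-one smoothing of the proportion of other predictors in the null cluster.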

5 Simulations

In this section we evaluate the efficacy of our models using a simulation study. Our simulations consist of four designs and compare the performance of the proposed priors with two priors from the literature: the smoothing prior proposed by Goldsmith and Kitago (2016) and a modification of the variable selection prior proposed by Goldsmith and Schwartz (2017). For clarity, the priors and the corresponding names used in our tables are listed below:

• FOSR: Goldsmith and Kitago (2016)

[TABLE]

• FOSR-PM: Modification of Goldsmith and Schwartz (2017) to include smoothing

[TABLE]

• FOSR-DP: Clustering prior given in (9)

• FOSR-DPPM: Variable selection and clustering prior given in (10)

5.1 Design

All of the methods considered are fit under the assumption of iid errors, even though our simulations use an exponential covariance matrix with a nugget term. Using the notation outlined in Equation (7), we set $T=15$ , $P_{f}=5$ , and $P_{c}=15$ for all simulations, and use three Fourier basis functions to generate the coefficient functions. Additionally, we generate correlated values for ${\bf X}$ with $Cov(x_{i,p},x_{i,p^{\prime}})=0.75^{|p-p^{\prime}|}$ , while $w_{ij}\overset{iid}{\sim}\mathcal{N}(0,1)$ . Finally, we simulate each dataset with ${{\bm{\mathbf{{\Sigma}}}}}={{\bm{\mathbf{{\Sigma}}}}}^{\prime}+\sigma^{2}{\bf I}$ where ${{\bm{\mathbf{{\Sigma}}}}}^{\prime}_{i,j}=\exp(-10*(i-j)^{2})$ and $\sigma^{2}$ was chosen to set the signal-to-noise ratio in the data to

The major variation in our simulation designs is the cluster membership pattern of ${{\bm{\mathbf{{\beta}}}}}$ ; it is varied as follows:

• Design 1: ${{\bm{\mathbf{{\beta}}}}}$ has three clusters with ${\bf c}=({\bm{\mathbf{{1}}}}_{7}^{\sf T},{\bm{\mathbf{{2}}}}_{4}^{\sf T},{\bm{\mathbf{{3}}}}_{4}^{\sf T})^{\sf T}$. The coefficients of cluster 1 are set to zero.

• Design 2: ${{\bm{\mathbf{{\beta}}}}}$ has three clusters with the same pattern as Design 1. However, all of the effects are non-zero.

• Design 3: ${{\bm{\mathbf{{\beta}}}}}$ has no clusters and none of the coefficients are zero.

• Design 4: Similar to Design 3 but the first seven elements of ${{\bm{\mathbf{{\beta}}}}}$ are set to zero.

5.2 Results

Each model was estimated using the Gibbs sampler described in Section 4, run for 5000 iterations with 2500 used as burn-in. Additionally, we used eight B-spline basis functions to estimate each coefficient function. We evaluate our methods using two criteria: parameter estimation using the pointwise mean squared error (MSE) of the functional effects, and clustering ability using the Rand and adjusted Rand indices.
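A minimal sketch of the first criterion follows. It assumes that the pointwise MSE is the squared difference between estimated and true coefficient functions, averaged over all predictors and time points, and that the estimate is the posterior mean of the post-burn-in draws. The function names and array layouts are hypothetical and chosen for illustration.

```python
import numpy as np

def posterior_mean(draws, burn_in):
    """Average sampler draws of shape (iterations, P, T) after discarding burn-in."""
    draws = np.asarray(draws, dtype=float)
    if burn_in >= draws.shape[0]:
        raise ValueError("burn_in must be smaller than the number of iterations")
    return draws[burn_in:].mean(axis=0)

def pointwise_mse(beta_hat, beta_true):
    """Mean squared difference between two (P, T) arrays of coefficient functions."""
    beta_hat = np.asarray(beta_hat, dtype=float)
    beta_true = np.asarray(beta_true, dtype=float)
    if beta_hat.shape != beta_true.shape:
        raise ValueError("estimated and true coefficients must have the same shape")
    return float(np.mean((beta_hat - beta_true) ** 2))
```

With the settings above, `burn_in` would be 2500 out of 5000 iterations.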

Table 1 shows the pointwise MSE of our proposed methods in comparison to the other priors considered. When cluster structure is present in the coefficients, as in Designs 1 and 2, the FOSR-DP and FOSR-DPPM methods outperform FOSR and FOSR-PM handily. This is to be expected since our priors are designed to exploit this structure. What is interesting, however, is that our methods also perform better in the small sample setting for Designs 3 and 4. The data in Design 3 are simulated to give FOSR an advantage; however, both FOSR-DP and FOSR-DPPM outperform FOSR when $N=30$. Since the total number of predictors in the simulations is fixed at 20, and none are set to zero in Design 3, the high dimensionality of the problem seems to lend itself well to a dimension reduction method which borrows information from nearby covariates. We see something similar in Design 4, where FOSR-DP outperforms the FOSR-PM prior.

We also evaluated the clustering capacity of our methods using the Rand and adjusted Rand indices, with results in Table 2. As can be seen, our methods are able to learn cluster structure in the coefficients despite misspecification of the covariance matrix and a low signal-to-noise ratio.
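For reference, both indices can be computed from the contingency table of two partitions. The sketch below uses the standard pair-counting definitions. The function names are illustrative, and how a single estimated partition is extracted from the sampler output is not covered here.

```python
import numpy as np
from math import comb

def pair_counts(labels_a, labels_b):
    """Count pairs grouped together in both partitions, in A, in B, and overall."""
    a = np.asarray(labels_a)
    b = np.asarray(labels_b)
    if a.ndim != 1 or a.shape != b.shape or a.size < 2:
        raise ValueError("labels must be 1-D, of equal length, with at least two items")
    _, a_idx = np.unique(a, return_inverse=True)
    _, b_idx = np.unique(b, return_inverse=True)
    table = np.zeros((a_idx.max() + 1, b_idx.max() + 1), dtype=int)
    np.add.at(table, (a_idx, b_idx), 1)          # contingency table of the two partitions
    same_both = sum(comb(int(n), 2) for n in table.ravel())
    same_a = sum(comb(int(n), 2) for n in table.sum(axis=1))
    same_b = sum(comb(int(n), 2) for n in table.sum(axis=0))
    return same_both, same_a, same_b, comb(a.size, 2)

def rand_index(labels_a, labels_b):
    """Fraction of pairs on which the two partitions agree."""
    same_both, same_a, same_b, total = pair_counts(labels_a, labels_b)
    apart_both = total - same_a - same_b + same_both
    return (same_both + apart_both) / total

def adjusted_rand_index(labels_a, labels_b):
    """Rand index corrected for the agreement expected by chance."""
    same_both, same_a, same_b, total = pair_counts(labels_a, labels_b)
    expected = same_a * same_b / total
    max_index = 0.5 * (same_a + same_b)
    if max_index == expected:                    # e.g. both partitions are one cluster
        return 1.0
    return (same_both - expected) / (max_index - expected)
```

Both indices are invariant to relabelling of the clusters, so an estimated partition only needs to match the true grouping, not the true label values.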

6 Application

We demonstrate the use of our methods by analyzing age-specific fertility rates from the United Nations Gender Information (UNGEN) database. We use data from surveys conducted between 2000 and 2005, estimates of which are plotted in Figure 1. Due to the nature of the survey, fertility curves are only observed at 7 points, where each time point corresponds to an age group for the participants in question. Each curve seems to be a relatively smooth function of time, making a function-on-scalar analysis of these data appropriate. A similar data set is explored by Kowal (2018), but our analysis considers a larger population of countries and a greater number of demographic and socioeconomic predictors.

The response information from the UNGEN database is combined with fifteen country-level covariates available on Gapminder. Each covariate included in the final model is a complete-case average of Gapminder data from 2000 to 2005. This allows us to see how average values of demographic and socioeconomic variables during the years of the UN survey affect the fertility curve. The variables we consider include some potentially causal factors such as age at first marriage, something which should increase fertility in the early parts of the curve, along with seemingly inconsequential variables such as the proportion of dollar billionaires in a country and the amount of alcohol its inhabitants consume. Unfortunately, many countries had no covariate information for the years of the survey and were dropped from the analysis for simplicity. The final data set consists of 15 covariates for 92 countries around the world.

Figure 2 shows that many of the predictors under consideration are highly correlated. For example, Under-5 Mortality is highly positively correlated with Maternal Deaths (correlation coefficient of 0.85), but is highly negatively correlated with Contraception Prevalence (correlation of -0.85). Contraception Prevalence, on the other hand, is positively correlated with many socioeconomic factors such as life expectancy and the number of births attended by trained birth staff.

We fit the FOSR-DP and FOSR-DPPM models on the data set. Each method was fit using 10,000 iterations with 5,000 used for burn-in. Figure 3 shows the cluster dendrograms for the analyses. Viewing both dendrograms, we see that covariates with high positive correlation tend to be clustered together. Additionally, the structure of the tree seems similar across the two models, with small differences in the order in which covariates are joined in a particular node. Finally, the variable that we thought a priori should most clearly impact the age-specific fertility curve, age at first marriage, is a standalone covariate in both dendrograms. Table 3 shows the percent of iterations after burn-in that each covariate was set to zero by the FOSR-DPPM model; of the fifteen covariates considered, only six are estimated to have non-zero effects if a 5% cutoff is used.
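The summary reported in Table 3 can be sketched as follows. The sketch assumes the sampler output has been reduced to a boolean array `zero_draws` of shape (iterations, covariates) marking the draws in which a covariate's effect was set to zero. It further assumes that the cutoff is applied by treating a covariate as non-zero when it is zeroed in fewer than 5% of post-burn-in draws. Both the array layout and this reading of the cutoff are assumptions made for illustration.

```python
import numpy as np

def percent_zero(zero_draws, burn_in):
    """Percent of post-burn-in iterations in which each covariate was set to zero."""
    kept = np.asarray(zero_draws, dtype=bool)[burn_in:]
    if kept.shape[0] == 0:
        raise ValueError("no iterations remain after burn-in")
    return 100.0 * kept.mean(axis=0)

def select_nonzero(pct_zero, cutoff=5.0):
    """Indices of covariates zeroed in fewer than `cutoff` percent of draws (assumed rule)."""
    return np.flatnonzero(np.asarray(pct_zero, dtype=float) < cutoff)
```

For the analysis above, `burn_in` would be 5,000 out of 10,000 iterations.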

Finally, the estimated effects, on a standardized scale, of the non-zero coefficients are shown in Figure 4. From this we can see that the age at first marriage has a time-varying effect; a decrease of one standard deviation of age increases the number of births in the 15-19 and 20-24 age groups, but the effect disappears as age increases. Additionally, U5 mortality and Maternal Deaths per 1-K are highly correlated variables with clustered coefficient curves. Their effect is positive throughout, but decreases as age increases. Finally, an interesting finding in our estimates is that contraception prevalence primarily impacts fertility in the middle of the curve; it is not impactful in the 15-19 age group.

7 Conclusion

In this paper we extend methodology for addressing multicollinearity in the predictors to function-on-scalar regression. We contribute to the literature by proposing a prior which simultaneously selects, clusters, and smooths the coefficient functions using a Bayesian approach, while the current Bayesian literature only focuses independently on selection or smoothing. Our model allows coefficient estimates to remain flexible while controlling overfitting, and performs dimension reduction by summing columns in the covariate matrix which have the same effect on the response. We also develop a Gibbs sampler which converges to the posterior distribution quickly in practice by combining multiple sampling algorithms developed in the clustering literature.
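The dimension reduction by column summation can be checked numerically. If predictors in the same cluster share a coefficient function, then ${\bf X}\bm{\beta}=\tilde{\bf X}\tilde{\bm{\beta}}$, where $\tilde{\bf X}$ sums the columns of ${\bf X}$ within each cluster and $\tilde{\bm{\beta}}$ holds one coefficient function per cluster. The sketch below is illustrative only; the function name and the symbols $\tilde{\bf X}$ and $\tilde{\bm{\beta}}$ are introduced here for the example.

```python
import numpy as np

def collapse_columns(X, c):
    """Sum the columns of X that share a cluster label in c.

    Returns the reduced design matrix (N x K) and the N-independent
    membership matrix M (P x K) with M[p, k] = 1 when predictor p is in cluster k.
    """
    X = np.asarray(X, dtype=float)
    c = np.asarray(c)
    if X.shape[1] != c.size:
        raise ValueError("c must have one label per column of X")
    labels, inverse = np.unique(c, return_inverse=True)
    M = np.zeros((c.size, labels.size))
    M[np.arange(c.size), inverse] = 1.0
    return X @ M, M
```

Because the full coefficient matrix equals `M @ B_cluster` when effects are shared within clusters, `X @ (M @ B_cluster)` and `(X @ M) @ B_cluster` agree, and the regression involves only as many columns as there are clusters.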

Our work can be extended in many ways. First, the Gibbs sampler proposed in Section 4 is not scalable to problems with large $N$ or $P$ , because collapsing the sampler over the basis function coefficients forces us to solve many linear systems in each iteration. Faster methods utilizing other algorithms for Dirichlet Process priors, with a focus on Variational Bayes approaches, have the potential to make our methodology applicable to larger data sets. Second, our methods assume that the coefficient curves are globally clustered, which can be a strong assumption in practice. Finally, if point estimation is the only goal, methods such as OSCAR, the Clustered Lasso, and PACS can be extended to the function-on-scalar regression framework, possibly providing another computationally efficient alternative to our approach.

Bibliography

1. Abraham et al. (2003) Christophe Abraham, Pierre-André Cornillon, Eric Matzner-Løber, and Nicolas Molinari. Unsupervised curve clustering using B-splines. Scandinavian Journal of Statistics, 30(3):581–595, 2003.
2. Antoniak (1974) Charles E. Antoniak. Mixtures of Dirichlet processes with applications to Bayesian nonparametric problems. The Annals of Statistics, 2(6):1152–1174, 1974.
3. Blackwell et al. (1973) David Blackwell and James B MacQueen. Ferguson distributions via Pólya urn schemes. The Annals of Statistics, 1(2):353–355, 1973.
4. Bondell and Reich (2008) Howard D Bondell and Brian J Reich. Simultaneous regression shrinkage, variable selection, and supervised clustering of predictors with OSCAR. Biometrics, 64(1):115–123, 2008.
5. Chen et al. (2016) Yakuan Chen, Jeff Goldsmith, and R Todd Ogden. Variable selection in function-on-scalar regression. Stat, 5(1):88–101, 2016.
6. Curtis and Ghosh (2011) S McKay Curtis and Sujit K Ghosh. A Bayesian approach to multicollinearity and the simultaneous selection and clustering of predictors in linear regression. Journal of Statistical Theory and Practice, 5(4):715–735, 2011.
7. Dunson et al. (2008) David B Dunson, Amy H Herring, and Stephanie M Engel. Bayesian selection and clustering of polymorphisms in functionally related genes. Journal of the American Statistical Association, 103(482):534–546, 2008.
8. Eilers and Marx (1996) Paul HC Eilers and Brian D Marx. Flexible smoothing with B-splines and penalties. Statistical Science, pages 89–102, 1996.