Quadratic regression for functional response models

Hidetoshi Matsui

arXiv:1702.02009·stat.ME·June 1, 2020

Quadratic regression for functional response models

Hidetoshi Matsui

PDF

TL;DR

This paper introduces a quadratic functional regression model that captures interactions in functional data, estimated via penalized likelihood, and demonstrates its effectiveness through simulations and meteorological data analysis.

Contribution

It extends the functional linear model to include quadratic terms with interaction effects, using basis expansions and penalized likelihood estimation.

Findings

01

The quadratic model outperforms linear models in simulations.

02

The method effectively analyzes meteorological data.

03

Penalized likelihood provides robust parameter estimation.

Abstract

We consider the problem of constructing a regression model with a functional predictor and a functional response. We extend the functional linear model to the quadratic model, where the quadratic term also takes the interaction between the argument of the functional data into consideration. We assume that the predictor and the coefficient functions are expressed by basis expansions, and then parameters included in the model are estimated by the penalized likelihood method assuming that the error function follows a Gaussian process. Monte Carlo simulations are conducted to illustrate the efficacy of the proposed method. Finally, we apply the proposed method to the analysis of meteorological data and explore the results.

Tables2

Table 1. Table 1: Averages of 100 ASEs and their standard deviations (in parentheses) for five models estimated using MLE. Values in this table are multiplied by 10.

	$n = 50$		$n = 100$		$n = 200$
	$ν_{3} = 0.3$	$ν_{3} = 0.6$	$ν_{3} = 0.3$	$ν_{3} = 0.6$	$ν_{3} = 0.3$	$ν_{3} = 0.6$
F-Inter	3.35 (0.90)	4.88 (1.41)	3.05 (1.14)	3.86 (1.21)	3.02 (0.93)	3.43 (0.95)
F-Lin	3.87 (1.85)	4.34 (2.52)	4.66 (2.73)	5.00 (2.89)	5.82 (3.08)	5.93 (3.08)
Inter	8.16 (2.54)	11.55 (3.14)	8.16 (2.09)	11.76 (2.26)	8.33 (1.63)	11.92 (1.69)
Quad	7.43 (2.32)	10.41 (2.92)	5.62 (1.72)	7.20 (1.76)	5.18 (1.38)	5.92 (1.39)
Lin	5.60 (1.89)	7.28 (2.53)	5.40 (2.41)	6.27 (2.46)	6.10 (2.88)	6.48 (2.87)

Table 2. Table 2: Averages of 100 ASEs and their standard deviations (in parentheses) for PMLE and model selection criteria. Values in this table are multiplied by 10.

	$n = 50$		$n = 100$		$n = 200$
	$ν_{3} = 0.3$	$ν_{3} = 0.6$	$ν_{3} = 0.3$	$ν_{3} = 0.6$	$ν_{3} = 0.3$	$ν_{3} = 0.6$
GCV	2.77 (1.02)	3.61 (1.51)	3.28 (1.58)	3.59 (1.58)	3.49 (1.22)	3.73 (1.23)
mAIC	2.69 (1.33)	3.32 (1.76)	2.85 (1.43)	3.10 (1.44)	2.98 (1.08)	3.12 (1.10)
GIC	2.76 (1.02)	3.61 (1.51)	2.78 (1.24)	3.21 (1.26)	2.92 (1.00)	3.13 (1.02)
GBIC	3.02 (1.58)	3.61 (1.80)	3.28 (1.58)	3.59 (1.58)	3.49 (1.22)	3.73 (1.23)

Equations73

Y = α + \int X (t) β (t) d t + \iint X (s) X (t) γ (s, t) d s d t + ε,

Y = α + \int X (t) β (t) d t + \iint X (s) X (t) γ (s, t) d s d t + ε,

y_{i} (t) = α (t) + \int_{S} x_{i} (s) β (s, t) d s + \iint_{S \times S} x_{i} (r) x_{i} (s) γ (r, s, t) d r d s + ε_{i} (t),

y_{i} (t) = α (t) + \int_{S} x_{i} (s) β (s, t) d s + \iint_{S \times S} x_{i} (r) x_{i} (s) γ (r, s, t) d r d s + ε_{i} (t),

\displaystyle x_{i}(s)=\sum_{k=1}^{M_{x}}w_{ik}\phi_{k}(s)=\mbox{\boldmath{$w$}}_{i}^{T}\mbox{\boldmath{$\phi$}}(s),

\displaystyle x_{i}(s)=\sum_{k=1}^{M_{x}}w_{ik}\phi_{k}(s)=\mbox{\boldmath{$w$}}_{i}^{T}\mbox{\boldmath{$\phi$}}(s),

\displaystyle\alpha(t)=\sum_{l=1}^{M_{y}}a_{l}\psi_{l}(t)=\mbox{\boldmath{$a$}}^{T}\mbox{\boldmath{$\psi$}}(t),

\displaystyle\alpha(t)=\sum_{l=1}^{M_{y}}a_{l}\psi_{l}(t)=\mbox{\boldmath{$a$}}^{T}\mbox{\boldmath{$\psi$}}(t),

\displaystyle\beta(s,t)=\sum_{k=1}^{M_{x}}\sum_{l=1}^{M_{y}}b_{kl}\phi_{k}(s)\psi_{l}(t)=\mbox{\boldmath{$\phi$}}(s)^{T}B\mbox{\boldmath{$\psi$}}(t),

\displaystyle\gamma(r,s,t)=\sum_{h=1}^{M_{x}}\sum_{k=1}^{M_{x}}\sum_{l=1}^{M_{y}}\gamma_{hkl}\phi_{h}(r)\phi_{k}(s)\psi_{l}(t)=\left\{\mbox{\boldmath{$\phi$}}(s)\otimes\mbox{\boldmath{$\phi$}}(r)\right\}^{T}{\Gamma}^{T}_{(3)}\mbox{\boldmath{$\psi$}}(t),

y_{i} (t)

y_{i} (t)

\displaystyle=\mbox{\boldmath{$z$}}_{i}^{T}\Theta^{T}\mbox{\boldmath{$\psi$}}(t)+\varepsilon_{i}(t),

\int \dots \int_{S^{p}} x_{i} (s_{1}) \dots x_{i} (s_{p}) γ (s_{1}, \dots, s_{p}, t) d s_{1} \dots d s_{p}

\int \dots \int_{S^{p}} x_{i} (s_{1}) \dots x_{i} (s_{p}) γ (s_{1}, \dots, s_{p}, t) d s_{1} \dots d s_{p}

\displaystyle=\int\cdots\int_{\mathcal{S}^{p}}(\mbox{\boldmath{$w$}}_{i}\otimes\cdots\otimes\mbox{\boldmath{$w$}}_{i})^{T}(\mbox{\boldmath{$\phi$}}(s_{p})\otimes\cdots\otimes\mbox{\boldmath{$\phi$}}(s_{1}))

\displaystyle\hskip 56.9055pt\times(\mbox{\boldmath{$\phi$}}(s_{p})\otimes\cdots\otimes\mbox{\boldmath{$\phi$}}(s_{1}))^{T}\Gamma_{(p+1)}^{T}\mbox{\boldmath{$\psi$}}(t)ds_{1}\cdots ds_{p}

\displaystyle=(\mbox{\boldmath{$w$}}_{i}\otimes\cdots\otimes\mbox{\boldmath{$w$}}_{i})^{T}(\Phi\otimes\cdots\otimes\Phi)\Gamma_{(p+1)}^{T}\mbox{\boldmath{$\psi$}}(t),

ε_{i} (t) = τ_{i} (t) + e_{i} (t),

ε_{i} (t) = τ_{i} (t) + e_{i} (t),

τ_{i} (t) \sim GP (0, k (\cdot, \cdot)),

\displaystyle f(\mbox{\boldmath{$y$}}_{i}|\Theta,\mbox{\boldmath{$\nu$}})=\frac{1}{(2\pi)^{n_{i}/2}\sqrt{|\Sigma_{i}|}}\exp\left\{-\frac{1}{2}\left(\mbox{\boldmath{$y$}}_{i}-\Psi_{i}\Theta\mbox{\boldmath{$z$}}_{i}\right)^{T}\Sigma_{i}^{-1}(\mbox{\boldmath{$y$}}_{i}-\Psi_{i}\Theta\mbox{\boldmath{$z$}}_{i})\right\},

\displaystyle f(\mbox{\boldmath{$y$}}_{i}|\Theta,\mbox{\boldmath{$\nu$}})=\frac{1}{(2\pi)^{n_{i}/2}\sqrt{|\Sigma_{i}|}}\exp\left\{-\frac{1}{2}\left(\mbox{\boldmath{$y$}}_{i}-\Psi_{i}\Theta\mbox{\boldmath{$z$}}_{i}\right)^{T}\Sigma_{i}^{-1}(\mbox{\boldmath{$y$}}_{i}-\Psi_{i}\Theta\mbox{\boldmath{$z$}}_{i})\right\},

\displaystyle\ell_{\lambda}(\Theta,\mbox{\boldmath{$\nu$}})=\ell(\Theta,\mbox{\boldmath{$\nu$}})-\frac{n\lambda}{2}P(\Theta),

\displaystyle\ell_{\lambda}(\Theta,\mbox{\boldmath{$\nu$}})=\ell(\Theta,\mbox{\boldmath{$\nu$}})-\frac{n\lambda}{2}P(\Theta),

P (Θ) =

P (Θ) =

tr {Γ_{(3)} (Ω_{x} \otimes I_{M_{x}}) Γ_{(3)}^{T}} + tr {Γ_{(3)} (I_{M_{x}} \otimes Ω_{x}) Γ_{(3)}^{T}} + tr {Γ_{(3)}^{T} Ω_{y} Γ_{(3)}}

=

=

vec \hat{Θ}

vec \hat{Θ}

\displaystyle=\left(X^{T}\Sigma^{-1}X+n\lambda\Omega\right)^{-1}\left(X^{T}\Sigma^{-1}\mbox{\boldmath{$y$}}\right),

\displaystyle\mbox{\boldmath{$\nu$}}^{new}=\mbox{\boldmath{$\nu$}}^{old}-\left\{\left.\frac{\partial^{2}\ell_{\lambda}(\Theta,\mbox{\boldmath{$\nu$}})}{\partial\mbox{\boldmath{$\nu$}}\partial\mbox{\boldmath{$\nu$}}^{T}}\right|_{\mbox{\boldmath{$\nu$}}=\mbox{\boldmath{$\nu$}}^{old}}\right\}^{-1}\left.\frac{\partial\ell_{\lambda}(\Theta,\mbox{\boldmath{$\nu$}})}{\partial\mbox{\boldmath{$\nu$}}}\right|_{\mbox{\boldmath{$\nu$}}=\mbox{\boldmath{$\nu$}}^{old}}.

\displaystyle\mbox{\boldmath{$\nu$}}^{new}=\mbox{\boldmath{$\nu$}}^{old}-\left\{\left.\frac{\partial^{2}\ell_{\lambda}(\Theta,\mbox{\boldmath{$\nu$}})}{\partial\mbox{\boldmath{$\nu$}}\partial\mbox{\boldmath{$\nu$}}^{T}}\right|_{\mbox{\boldmath{$\nu$}}=\mbox{\boldmath{$\nu$}}^{old}}\right\}^{-1}\left.\frac{\partial\ell_{\lambda}(\Theta,\mbox{\boldmath{$\nu$}})}{\partial\mbox{\boldmath{$\nu$}}}\right|_{\mbox{\boldmath{$\nu$}}=\mbox{\boldmath{$\nu$}}^{old}}.

\displaystyle\hat{\mbox{\boldmath{$y$}}}=X{\rm vec}{\hat{\Theta}}=X\left\{X^{T}\Sigma^{-1}X+n\lambda\Omega\right\}^{-1}X^{T}\Sigma^{-1}\mbox{\boldmath{$y$}}.

\displaystyle\hat{\mbox{\boldmath{$y$}}}=X{\rm vec}{\hat{\Theta}}=X\left\{X^{T}\Sigma^{-1}X+n\lambda\Omega\right\}^{-1}X^{T}\Sigma^{-1}\mbox{\boldmath{$y$}}.

GCV

GCV

mAIC

GIC =

GIC =

GBIC =

\displaystyle\log|R(\hat{\Theta},\hat{\mbox{\boldmath{$\nu$}}})|-\log|\Omega|_{+}-\zeta\log(2\pi),

\displaystyle R(\hat{\Theta},\hat{\mbox{\boldmath{$\nu$}}})

\displaystyle R(\hat{\Theta},\hat{\mbox{\boldmath{$\nu$}}})

\displaystyle Q(\hat{\Theta},\hat{\mbox{\boldmath{$\nu$}}})

(F - INTER)

(F - INTER)

(F - LIN)

(INTER)

(QUAD)

(LIN)

\displaystyle\frac{\partial\ell_{\lambda}(\Theta,\mbox{\boldmath{$\nu$}})}{\partial\Theta}

\displaystyle\frac{\partial\ell_{\lambda}(\Theta,\mbox{\boldmath{$\nu$}})}{\partial\Theta}

\displaystyle\frac{\partial\ell_{\lambda}(\Theta,\mbox{\boldmath{$\nu$}})}{\partial\nu_{j}}

α

\displaystyle\displaystyle{\frac{\partial^{2}\ell_{\lambda}(\Theta,\mbox{\boldmath{$\nu$}})}{\partial({\rm vec}\Theta)\partial({\rm vec}\Theta)^{T}}}

\displaystyle\displaystyle{\frac{\partial^{2}\ell_{\lambda}(\Theta,\mbox{\boldmath{$\nu$}})}{\partial({\rm vec}\Theta)\partial({\rm vec}\Theta)^{T}}}

\displaystyle\displaystyle{\frac{\partial^{2}\ell_{\lambda}(\Theta,\mbox{\boldmath{$\nu$}})}{\partial({\rm vec}\Theta)\partial\nu_{j}}}

Peer Reviews

No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.

Videos

No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.

Full text

Quadratic regression for functional response models

Hidetoshi Matsui

*Faculty of Data Science, Shiga University

1-1-1, Banba, Hikone, Shiga 522-8522, Japan.

[email protected]

**Abstract: We consider the problem of constructing a regression model with a functional predictor and a functional response. We extend the functional linear model to the quadratic model, where the quadratic term also takes the interaction between the argument of the functional data into consideration. We assume that the predictor and the coefficient functions are expressed by basis expansions, and then parameters included in the model are estimated by the penalized likelihood method assuming that the error function follows a Gaussian process. Monte Carlo simulations are conducted to illustrate the efficacy of the proposed method. Finally, we apply the proposed method to the analysis of meteorological data and explore the results. **

Key Words and Phrases: Functional data analysis; Gaussian process; Interaction

1 Introduction

Functional data analysis (FDA) has received considerable attentions in various fields of application such as bioscience and meteorology. A number of methodological, theoretical, and empirical developments have occured over time which have improved and extended this technique (Ramsay and Silverman, 2005; Horváth and Kokoszka, 2012; Kokoszka and Reimherr, 2017). The basic idea behind FDA is to express each individual in repeatedly measured data as a smooth function and then draw information from the collection of functional data. We consider the problem of constructing a regression model where both a predictor and the response are given as functional data.

Functional regression analysis has been widely studied in the literature. When the predictor is a function but the response is a scalar, James (2002) considered such models in the framework of generalized linear models. In addition, many other extensions such as adaptive models and neural networks have been reported (Müller and Stadtmüller, 2005; Rossi et al., 2005). Although flexible models capture the complex structures representing the relationships among variables, the results can be difficult to mechanistically interpret. Yao and Müller (2010) proposed a functional regression model with a quadratic term of the functional predictor. Suppose we have a functional predictor $X(t)$ and a scalar response $Y$ ; then the functional quadratic model is given by

[TABLE]

where $\alpha$ is an intercept; $\beta(t)$ and $\gamma(s,t)$ are coefficient surfaces for linear and quadratic terms, respectively; and $\varepsilon$ is an additive error term. Therefore, this model accommodate an interaction between a functional predictor $X$ at two arbitrary different time points. Fuchs et al. (2015) and Usset et al. (2016) proposed functional regression models that consider interactions between multiple functional predictors. Wei et al. (2014) constructed a functional logistic regression model that considers an interaction between genetic variants and repeatedly measured environmental exposure to classify the disease status of patients. Their results suggested that disease classification was sensitive to the modeled variables and their interactions.

Functional regression models with a functional predictor and a functional response are also considered in Ramsay and Dalzell (1991), Yao et al. (2005), and Scheipl et al. (2015). This type of model is a useful tool for simulating the relationship between predictor and response functions at arbitrary times. However, hitherto, neither quadratic nor interaction terms have been considered with this type of model. In this paper, we introduce a quadratic term into the functional regression model with a functional predictor and a functional response. The predictor and coefficient functions are expressed by basis expansions; we show that, from these assumptions the problem of estimating coefficient functions becomes that of estimating parameter vectors. Furthermore, we consider estimating these parameters by the penalized likelihood method, where we assume the error function follows a Gaussian process, based on Shi and Choi (2011). Values of tuning parameters included in the estimation process are determined by model selection criteria rooted in information theory and a Bayesian approach (Konishi and Kitagawa, 2008). We also illustrate the efficacy of our method through Monte Carlo simulations and empirical analysis of meteorological data.

The remainder of the paper is organized as follows. Section 2 specifies a functional quadratic model for a functional predictor and a functional response. In Section 3, we introduce the method for model estimation and evaluation in a maximum likelihood framework. We report the results of simulation studies in Section 4 before applying the proposed method to the analysis of empirical data in Section 5. Finally Section 6 offers conclusions.

2 Functional quadratic model

Suppose we have $n$ sets of a functional predictor and a functional response $\{(x_{i}(s),y_{i}(t));i=1,\ldots,n,s\in\mathcal{S}\subset\mathbb{R},t\in\mathcal{T}\in\mathbb{R}\}$ . We model the relationship between the predictor $x_{i}$ and the response $y_{i}$ as the following functional quadratic model with interactions as follows:

[TABLE]

where $\alpha(t)$ is a baseline function, $\beta(s,t)$ is the coefficient surface for the linear term, $\gamma(r,s,t)$ is the coefficient hypersurface for the quadratic term, and $\varepsilon_{i}(t)$ is an error term. Each coefficient function explains the weight of the predictor on the response at different time points. In particular, the quadratic term considers the interaction of $x_{i}$ at two different time points $r$ and $s$ and $\gamma(r,s,t)$ represent weights of these interactions on the response function.

To estimate coefficient functions in functional regression models, several methods have been proposed. Here we apply the basis expansion method, that is, the predictors $x_{i}(s)$ are expressed by linear combinations of basis functions:

[TABLE]

where $\mbox{\boldmath{$ \phi $}}(s)=(\phi_{1}(s),\ldots,\phi_{M_{x}}(s))^{T}$ is a vector of $M_{x}$ basis functions and $\mbox{\boldmath{$ w $}}_{i}=(w_{i1},\ldots,w_{iM_{x}})^{T}$ is a vector of weights that are obtained using smoothing techniques (see, e.g., Green and Silverman, 1994; Araki et al., 2009). In addition, we assume that the baseline and coefficient functions are also expressed by basis expansions as follows:

[TABLE]

where $\mbox{\boldmath{$ \psi $}}(t)=(\psi_{1}(t),\ldots,\psi_{M_{y}}(t))^{T}$ is a vector of $M_{y}$ basis functions, $\mbox{\boldmath{$ \alpha $}}=(\alpha_{1},\ldots,\alpha_{M_{y}})^{T}$ , $B=(b_{kl})_{kl}$ and $\Gamma_{(3)}$ is an $M_{y}\times M_{x}^{2}$ matrix obtained by matricizing a 3-dimensional $M_{x}\times M_{x}\times M_{y}$ tensor $\underline{\Gamma}=(\gamma_{hkl})_{hkl}$ with respect to the 3rd array. The strict definition of $\Gamma_{(3)}$ is used following de Lathauwer et al. (2000). In general, the functional regression model with basis expansions reduces the number of model parameters more than the traditional regression model, because the number of basis functions is smaller than the number of time points. This leads to model dimension reduction and provides more stable estimates. Using the above assumptions, the functional quadratic model (1) is written as

[TABLE]

where $\Phi=\int\mbox{\boldmath{$ \phi $}}(s)\mbox{\boldmath{$ \phi $}}(s)^{T}ds$ , $\mbox{\boldmath{$ z $}}_{i}=(1,\mbox{\boldmath{$ w $}}_{i}^{T}\Phi,(\mbox{\boldmath{$ w $}}_{i}\otimes\mbox{\boldmath{$ w $}}_{i})^{T}(\Phi\otimes\Phi))^{T}$ and $\Theta=(\mbox{\boldmath{$ \alpha $}}~{}B^{T}~{}\Gamma_{(3)})^{T}$ is a matrix of parameters. Thus, the problem of estimating the baseline and coefficient functions becomes one of estimating the parameter matrix $\Theta$ .

More generally, when we have a $p$ -th order interaction term with respect to $x_{i}$ in the functional regression model, using the same assumptions described above, this term is expressed by

[TABLE]

where $\Gamma_{(p+1)}$ is an $M_{y}\times M_{x}^{p}$ matrix obtained by matricizing a $(p+1)$ -dimensional $M_{x}\times\cdots\times M_{x}\times M_{y}$ tensor $\underline{\Gamma}=(\gamma_{k_{1},\ldots,k_{p+1}})_{k_{1},\ldots,k_{p+1}}$ with respect to the $(p+1)$ -th array. Therefore, we can easily extend the quadratic model to the $p$ -th order polynomial model and estimate it similarly using the method described in the next section, but in the following sections, we return to the functional quadratic model (1).

3 Model estimation and evaluation

We consider estimating the model in a penalized likelihood framework of the penalized likelihood method. To do this, we assume that the error function $\varepsilon_{i}(t)$ has the following structure (Fan and Zhang, 2000; Shi and Choi, 2011):

[TABLE]

where $GP(0,k(\cdot,\cdot))$ denotes a Gaussian process with mean 0 and a covariance function $k(\cdot,\cdot)$ , and $\nu_{1}>0$ , $\nu_{2}>0$ , and $\nu_{3}>0$ are additional model parameters. By considering that the response $\mbox{\boldmath{$ y $}}_{i}=(y_{i1},\ldots,y_{in_{i}})^{T}$ is observed at $n_{i}$ time points $t_{i1},\ldots,t_{in_{i}}$ for each subject, we have the following probability density function:

[TABLE]

where $\mbox{\boldmath{$ \nu $}}=(\nu_{1},\nu_{2},\nu_{3})^{T}$ , $\Sigma_{i}=K_{i}+\nu_{3}^{2}I_{n_{i}}$ , $K_{i}=(k(t_{ij},t_{ij^{\prime}}))_{jj^{\prime}}$ and $\Psi_{i}=(\mbox{\boldmath{$ \psi $}}(t_{i1}),\ldots,\mbox{\boldmath{$ \psi $}}(t_{in_{i}}))^{T}$ .

3.1 Penalized likelihood method

To estimate parameters $\Theta$ and $\nu$ , we consider maximizing the penalized log-likelihood function given by

[TABLE]

where $\ell(\Theta,\mbox{\boldmath{$ \nu $}})=\sum_{i=1}^{n}\log f(\mbox{\boldmath{$ y $}}_{i}|\Theta,\mbox{\boldmath{$ \nu $}})$ is a log-likelihood function, $\lambda>0$ is a regularization parameter, and $P(\Theta)$ is a penalty function. To penalize the coefficient functions in the model (1) for the fluctuation in the $r$ , $s$ , and $t$ directions for linear and quadratic terms, we configure the following penalty function:

[TABLE]

where $\Omega_{x}$ and $\Omega_{y}$ are respectively $M_{x}\times M_{x}$ and $M_{y}\times M_{y}$ positive semi-definite matrices. An example for $\Omega_{x}$ and $\Omega_{y}$ is to use $D_{2}^{T}D_{2}$ with a second-order differential matrix $D_{2}$ . Furthermore, $\Omega_{x}^{*}={\rm blockdiag}\{0,\Omega_{x},\Omega_{x}\otimes I_{M_{x}}+I_{M_{x}}\otimes\Omega_{x}\}$ and $\Omega=I_{M_{y}}\otimes\Omega_{x}^{*}+\Omega_{y}\otimes I_{1+M_{x}+M_{x}^{2}}$ . The 1st term of the second equation of (6) corresponds to the penalty for the roughness of $\alpha(t)$ , whilst the 2nd and 3rd terms penalize the roughness of $\beta(s,t)$ with respect to $s$ and $t$ directions, respectively. Furthermore, the 4th, 5th, and 6th terms penalize the roughness of $\gamma(r,s,t)$ with respect to the $r$ , $s$ , and $t$ directions, respectively. Then if $\nu$ were known, $\Theta$ would be estimated as

[TABLE]

where $X=(X_{1}^{T},\ldots,X_{n}^{T})^{T}$ , $X_{i}=\mbox{\boldmath{$ z $}}_{i}^{T}\otimes\Psi_{i}$ , $\Sigma={\rm blockdiag}\{\Sigma_{1},\ldots,\Sigma_{n}\}$ , and $\mbox{\boldmath{$ y $}}=(\mbox{\boldmath{$ y $}}_{1}^{T},\ldots,\mbox{\boldmath{$ y $}}_{n}^{T})^{T}$ . In practice $\nu$ is unknown and it is difficult to derive the estimator of $\nu$ analytically. Therefore, several iterative algorithms have been proposed and used in the literature. Here we apply the Newton-Raphson method, updated by

[TABLE]

The first and second derivatives of $\ell_{\lambda}(\Theta,\mbox{\boldmath{$ \nu $}})$ with respect to $\nu_{j}$ $(j=1,2,3)$ are provided in the Appendix. Parameters $\Theta$ and $\nu$ are alternately updated until convergence, and then we arrive at penalized maximum likelihood estimators $\hat{\Theta}$ and $\hat{\mbox{\boldmath{$ \nu $}}}$ , respectively. Finally, we have a statistical model $f(\mbox{\boldmath{$ y $}}_{i}|\hat{\Theta},\hat{\mbox{\boldmath{$ \nu $}}})$ .

3.2 Model selection criteria

It is important to select optimal values of tuning parameters such as the number of basis functions $M_{y}$ in (3) and the regularization parameter $\lambda$ in (5). Here we introduce some model selection criteria for evaluating the estimated model.

First, we introduce model selection criteria based on effective degrees of freedom described in Eilers and Marx (1996). To derive the effective degrees of freedom of our model, we express the predicted value of the response $\hat{\mbox{\boldmath{$ y $}}}$ as

[TABLE]

Using Eilers and Marx (1996), the effective degrees of freedom are $df={\rm tr}\{S\}$ where $S=X\left\{X^{T}\Sigma^{-1}X+n\lambda\Omega\right\}^{-1}X^{T}\Sigma^{-1}$ is known as a hat matrix or a smoother matrix. If $\lambda=0$ and $n>(1+M_{x}+M_{x}^{2})M_{y}$ then the effective degrees of freedom becomes $(1+M_{x}+M_{x}^{2})M_{y}$ , which is just the same as the number of parameters included in $\Theta$ . The GCV and the modified AIC are respectively given by

[TABLE]

Note that we added the number of parameters in $\nu$ , 3, to the degrees freedom.

The AIC was originally derived for evaluating models estimated by the maximum likelihood method, so it is not suitable for evaluating those estimated by the penalized likelihood method. To solve this problem, Konishi and Kitagawa (2008) derived information criteria for evaluating models estimated by penalized likelihood methods. Specifically, rooted in information theory and a Bayesian approach, they proposed a generalized information criterion (GIC) and a generalized Bayesian information criterion (GBIC). Based on Konishi and Kitagawa (1996) and Konishi et al. (2004), the GIC and GBIC are respectively given by

[TABLE]

where $R(\hat{\Theta},\hat{\mbox{\boldmath{$ \nu $}}})$ and $Q(\hat{\Theta},\hat{\mbox{\boldmath{$ \nu $}}})$ are respectively given by

[TABLE]

Details concerning the $R(\hat{\Theta},\hat{\mbox{\boldmath{$ \nu $}}})$ , and $Q(\hat{\Theta},\hat{\mbox{\boldmath{$ \nu $}}})$ matrices are provided in the Appendix. Furthermore, $\eta=(1+M_{x}+M_{x}^{2})M_{y}$ is the size of matrix $\Omega$ , $\zeta$ is the number of nonzero eigenvalues of $\Omega$ , and $|\cdot|_{+}$ denotes the product of positive matrix eigenvalues. We select tuning parameter values that minimize these criteria and then treat the corresponding model as the optimal candidate.

4 Simulations

We conducted Monte Carlo simulations to investigate the efficacy of the proposed method. We artificially generated datasets from a regression model, and then compared the prediction accuracy of the proposed method with ordinary models.

The data generation procedure is given as follows. First we assume that the relationship between the functional predictor and the functional response is given by the functional quadratic model (1) with basis expansions (2) and (3). We set $\alpha(t)=0$ for simplicity. We used $B$ -splines for basis functions $\mbox{\boldmath{$ \phi $}}(s)$ and $\mbox{\boldmath{$ \psi $}}(t)$ , where the numbers of these basis functions, respectively $M_{x}$ and $M_{y}$ , are fixed at seven. The elements of matrices $B$ and $\Gamma$ included in (3) are generated from a Wishart distribution with 10 degrees of freedom and a Toeplitz scale matrix. Similarly, to generate the predictor function $x_{i}(t)$ , the coefficients $\mbox{\boldmath{$ w $}}_{i}$ in (2) are generated from a multivariate normal distribution with 0 mean vector and a Toeplitz variance covariance matrix. Then, the mean structure of the right-hand side of the model (1), denoted by $g_{i}(t)$ $(i=1,\ldots,n)$ , is obtained. There are $n_{t}=21$ equally spaced time points. We also added noise generated using a Gaussian process given in (4) to $g_{i}(t)$ and then configured a longitudinal response $y_{ij}$ , where subscript $j$ indexes time. We then added Gaussian noise to the predictor $x_{i}(t)$ at discrete time points and generated longitudinal data $x_{ij}$ . We treated $x_{ij}$ and $y_{ij}$ as observations, and therefore the objective of the simulation is to predict $g_{i}(t)$ using $x_{ij}$ and $y_{ij}$ .

Next, we analyzed data by the following procedure. We fitted $x_{ij}$ for each $i$ to construct functional data $x_{i}(t)$ using a smoothing technique with Gaussian radial basis functions (Kawano and Konishi, 2007). To avoid computational burden, we fixed the number $M_{y}$ of basis functions at six. After obtaining $\mbox{\boldmath{$ w $}}_{i}$ , we estimated the model using the proposed method. Here we conduct a comparative exercise to evaluate the proposed method.

For this exercise, first, we compared the proposed functional quadratic model with an interaction (F-INTER) with a functional linear model (F-LIN), a multivariate quadratic model with interactions (INTER), a quadratic model (QUAD), and a linear model (LIN). They are respectively given by

[TABLE]

where $\beta_{j}$ , $\gamma_{jj^{\prime}}$ , and $\gamma_{j}$ are coefficient parameters. We estimated the above five models using maximum likelihood estimation (MLE), but the generalized inverse is used for the inversion of $X^{T}\Sigma^{-1}X+n\lambda\Omega$ in (7). We used the generalized inverse because we wanted to compare results for small sample sizes, and in this case, the number of parameters exceeds the sample size for the proposed model and thus ordinal matrix inversion is not possible. For each model, we obtained the predicted response $\hat{y}_{i}(t)$ and then calculated the average squared error ${\rm ASE}=1/(n\cdot n_{t})\sum_{i=1}^{n}\sum_{j=1}^{n_{t}}(g_{i}(t_{j})-\hat{y}_{i}(t_{j}))^{2}$ . We repeated this procedure 100 times for several sample sizes and noise levels, and then investigated the prediction accuracy. Table 2 presents averages and standard deviations for 100 ASEs across the five models, with boxplots of ASEs provided in 2. In most cases, the proposed model (F-INTER) minimizes ASE compared to other models, and it also provides the most stable results. F-LIN is associated with smaller ASEs than F-INTER for smaller sample sizes and larger noise levels, but it gives less stable results than F-INTER and its performance deteriorates as sample size increases. ASEs for INTER are larger than other models, and have high variances, especially where the sample size is small. ASEs for QUAD are also large for small sample sizes, but they decrease as sample size increases. LIN gives smaller ASEs than other models for multivariate data, but they are still larger than the functional regression models.

In the second analysis, penalized maximum likelihood estimation (PMLE) is used, and the optimal value of the regularization parameter $\lambda$ is decided by the four model selection criteria introduced in Section 3.2. As per the previous analysis, we calculated the ASE with 100 repetitions. Table 2 and Figure 2 show results according to the four model selection criteria. In all cases the penalized likelihood method yields better or competitive results than the maximum likelihood method, especially for smaller sample sizes. There is no remarkable difference among the results from the four model selection criteria, but the ASEs for GIC are more stable than for other criteria.

5 Empirical example

We applied the proposed method to the analysis of meteorological data available from Chronological Scientific Tables 2005. We used monthly temperature and the natural logarithm of monthly precipitation averaged from 1971 to 2000, observed at 76 cities in Japan. The data are shown in Figure 3. In many cities in Japan, it rains in summer more than in winter, especially in June. On the other hand, several cities have much snow in winter, and therefore these cities have much precipitation in winter. We treated temperature and precipitation as predictor and response, respectively, and considered data at 12 time points.

First, we smoothed the temperature data using radial basis functions and then obtained coefficients $\mbox{\boldmath{$ w $}}_{i}$ for functional data. Next, we applied a functional quadratic model (1) to the data and estimated parameters $\Theta$ and $\nu$ by the penalized likelihood method. The number of basis functions $M_{y}$ and the regularization parameter $\lambda$ are selected by model selection criterion GIC. Then we investigated the predicted precipitation and coefficient functions for linear and quadratic terms.

The selected number of basis functions is $M_{y}=10$ and the selected regularization parameter is $\lambda=1.0\times 10^{-5}$ . Figure 4 compares predicted functions $\hat{y}_{i}(t)$ and residuals $y_{ij}-\hat{y}_{i}(t_{ij})$ for precipitation obtained from linear and quadratic models. Both models broadly capture precipitation trends, but the quadratic model captures individual variation in the response better the linear model. The superior performance of the quadratic model is also supposed by the fact that it is associated with smaller residuals compared to the linear model. Further, the quadratic model captures individual variations in winter: it can explain the precipitation in cities with heavy snow fall well.

Next, we focus on the estimated functional quadratic model. Figure 5 shows the estimated baseline and coefficient functions for the linear and quadratic terms. The baseline function $\hat{\alpha}(t)$ broadly captures the overall trend of precipitation functions. The coefficient surface for the linear term indicates the contribution of temperature $X(s)$ to precipitation $Y(t)$ at arbitrary times $s$ and $t$ . The result shows that higher temperatures around February contribute to the higher precipitation between fall and spring, whilst higher temperatures at the end of the year contribute to lower precipitation between fall and spring. Figure 6 shows the estimated hypersurface $\hat{\gamma}(r,s,t)$ visualized at discrete time points of the precipitation functions. This function indicates the interaction of the predictor between at $r$ and $s$ at arbitrary time points $t$ . For example, we can find that the estimated hypersurface $\gamma(r,s,t)$ has negative weights at the end of $r$ and $s$ for $t=11,12$ , and 1. This suggests that lower temperatures around November and December lead to higher precipitation in winter. On the other hand, it gives negative weights at the beginning of $s$ and the end of $t$ , which indicates that lower temperatures around winter lead to high precipitation between spring and summer.

6 Concluding remarks

Functional regression models elucidate the complex relationship between repeatedly measured variables. In this paper, we constructed quadratic regression models for functional data where both the predictor and response are given as functions. In addition to representing more flexible relationships between variables than linear models, quadratic variants offer another insight by considering predictor interactions between two time points. We assumed Gaussian process for the error function with model parameters estimated by the penalized maximum likelihood method. Tuning parameters included in the estimation procedure are selected by model selection criteria. The efficacy of our method is evaluated through simulation studies and an empirical example using real data.

We treated the data as functions of time and assumed annual periodicities. In doing so, this generally gives a us the paradoxical insight that future predictor information affects previous responses. For this problem, Malfait and Ramsay (2003) proposed a historical functional linear model that takes predictor information only at past response times. It would be fruitful for future research to apply a historical functional linear model to our context and analyze data without periodicity. For functional linear models with functional response, several estimation methods are proposed beyond what was used in this paper. It is thus important to compare the accuracy and computational speed of our method to other methods.

Appendix

We provide details of the derivatives used in our method.

The first derivatives of $\ell_{\lambda}(\Theta,\mbox{\boldmath{$ \nu $}})$ with respect to $\Theta$ and $\nu_{j}$ are respectively given by

[TABLE]

The first and second derivatives of the penalized log-likelihood function (5), used in the Newton-Raphson update (8) and the model selection criteria GIC and GBIC (9), are given as follows:

[TABLE]

If we assume a Gaussian process with a Gaussian covariance function as in (4), the first and second derivatives of $\Sigma_{i}$ with respect to $\nu_{j}$ $(j=1,2,3)$ are given by

[TABLE]

Bibliography24

The reference list from the paper itself. Each links out to its DOI / PubMed record.

1Araki et al. (2009) Araki, Y., Konishi, S., Kawano, S., and Matsui, H. (2009), “Functional regression modeling via regularized Gaussian basis expansions,” Ann. Inst. Statist. Math. , 61, 811–833.
2de Lathauwer et al. (2000) de Lathauwer, L., de Moor, B., and Vandewalle, J. (2000), “A multilinear singular value decomposition,” SIAM J. Matrix Anal. Appl. , 21, 1253–1278.
3Eilers and Marx (1996) Eilers, P. H. C. and Marx, B. D. (1996), “Flexible smoothing with B-splines and penalties,” Statistical science , 11, 89–121.
4Fan and Zhang (2000) Fan, J. and Zhang, J. (2000), “Two-step estimation of functional linear models with applications to longitudinal data,” J. Roy. Statist. Soc. Ser. B , 62, 303–322.
5Fuchs et al. (2015) Fuchs, K., Scheipl, F., and Greven, S. (2015), “Penalized scalar-on-functions regression with interaction term,” Comput. Statist. Data Anal. , 81, 38–51.
6Green and Silverman (1994) Green, P. and Silverman, B. (1994), Nonparametric regression and generalized linear models: a roughness penalty approach , London: Chapman & Hall/CRC.
7Horváth and Kokoszka (2012) Horváth, L. and Kokoszka, P. (2012), Inference for functional data with applications , New York: Springer.
8James (2002) James, G. (2002), “Generalized linear models with functional predictors,” J. Roy. Statist. Soc. Ser. B , 64, 411–432.