On estimation of the effect lag of predictors and prediction in functional linear model

Haiyan Liu; Georgios Aivaliotis; Jeanine Houwing-Duistermaat

arXiv:1907.09808·stat.ME·July 24, 2019


TL;DR

This paper introduces a functional linear model that predicts responses using multiple functional predictors, estimates their effect lags, and evaluates the model's properties and performance through simulations.

Contribution

It presents a novel method for estimating predictor effect lags in a functional linear model using basis expansions and penalized optimization.

Findings

01. Effective estimation of predictor effect lags demonstrated

02. Model shows strong predictive performance in simulations

03. Mathematical properties of estimators are established

Abstract

We propose a functional linear model to predict a response using multiple functional and longitudinal predictors and to estimate the effect lags of predictors. The coefficient functions are written as the expansion of a basis system (e.g. functional principal components, splines), and the coefficients of the fixed basis functions are estimated via optimizing a penalization criterion. Then time lags are determined by simultaneously searching on a prior grid mesh based on minimization of prediction error criterion. Moreover, mathematical properties of the estimated parameters and predicted responses are studied and performance of the method is evaluated by extensive simulations.

Tables

Table 1: NPEs based on correct lags

| $n$ | 50 | 100 | 150 | 200 |
|---|---|---|---|---|
| NPE $\times 100$ | 2.08 | 1.95 | 1.86 | 1.79 |

Equations

$$Y(t)=\beta_{0}(t)+\int_{0}^{T}\beta_{1}(s,t)X(s)\,ds+\epsilon(t),\quad t\in[0,T]$$

$$Y(t)=\beta_{0}(t)+\int_{t-\delta_{2}}^{t-\delta_{1}}\beta_{1}(s,t)X(s)\,ds+\epsilon(t),\quad t\in[0,T]$$

$$Y_{ij}=\beta_{0}(t_{ij})+\int_{\delta_{11}}^{\delta_{12}}\beta_{1}(s,t_{ij})X_{1i}(t_{ij}-s)\,ds+\int_{\delta_{21}}^{\delta_{22}}\beta_{2}(s,t_{ij})X_{2i}(t_{ij}-s)\,ds+e_{ij}$$

$$Y_{ij}=\beta_{0}(t_{ij})+\int_{t_{ij}-\delta_{12}}^{t_{ij}-\delta_{11}}\beta_{1}(t_{ij}-s,t_{ij})X_{1i}(s)\,ds+\int_{t_{ij}-\delta_{22}}^{t_{ij}-\delta_{21}}\beta_{2}(t_{ij}-s,t_{ij})X_{2i}(s)\,ds+e_{ij}$$

$$\beta_{1}(s,t)=\sum_{k=1}^{K_{1}}B_{1k}(s)b_{1k}(t),\quad s\in\Delta_{1},\ t\in[0,1]$$

$$\beta_{2}(s,t)=\sum_{k=1}^{K_{2}}B_{2k}(s)b_{2k}(t),\quad s\in\Delta_{2},\ t\in[0,1]$$

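$$\begin{aligned}Y_{ij}&=\beta_{0}(t_{ij})+\sum_{k=1}^{K_{1}}b_{1k}(t_{ij})\int_{\delta_{11}}^{\delta_{12}}B_{1k}(s)X_{1i}(t_{ij}-s)\,ds\\&\quad+\sum_{k=1}^{K_{2}}b_{2k}(t_{ij})\int_{\delta_{21}}^{\delta_{22}}B_{2k}(s)X_{2i}(t_{ij}-s)\,ds+e_{ij}\\&=:\beta_{0}(t_{ij})+\mathbf{b}_{1}(t_{ij})^{T}\tilde{\mathbf{X}}_{1i}(t_{ij})+\mathbf{b}_{2}(t_{ij})^{T}\tilde{\mathbf{X}}_{2i}(t_{ij})+e_{ij}\end{aligned}$$

$$\begin{aligned}PSSE_{b_{1},b_{2}}&=\sum_{i=1}^{n}\left(Y_{ij}-\mathbf{b}_{1}(t_{j}^{0})^{T}\tilde{\mathbf{X}}_{1i}(t_{j}^{0})-\mathbf{b}_{2}(t_{j}^{0})^{T}\tilde{\mathbf{X}}_{2i}(t_{j}^{0})\right)^{2}+\rho_{1}\|\mathbf{b}_{1}(t_{j}^{0})\|^{2}+\rho_{2}\|\mathbf{b}_{2}(t_{j}^{0})\|^{2}\\&=\left\|\mathbf{Y}_{j}-Z_{j}\begin{bmatrix}\mathbf{b}_{1}(t_{j}^{0})\\\mathbf{b}_{2}(t_{j}^{0})\end{bmatrix}\right\|^{2}+\rho_{1}\|\mathbf{b}_{1}(t_{j}^{0})\|^{2}+\rho_{2}\|\mathbf{b}_{2}(t_{j}^{0})\|^{2}\end{aligned}$$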

$$\begin{bmatrix}\hat{\mathbf{b}}_{1}(t_{j}^{0})\\\hat{\mathbf{b}}_{2}(t_{j}^{0})\end{bmatrix}=\left(Z_{j}^{T}Z_{j}+\begin{bmatrix}\rho_{1}I_{K_{1}}&0\\0&\rho_{2}I_{K_{2}}\end{bmatrix}\right)^{-1}Z_{j}^{T}\mathbf{Y}_{j}$$

$$Z_{j}=\begin{bmatrix}\tilde{X}_{111}(t_{j}^{0})&\cdots&\tilde{X}_{11K_{1}}(t_{j}^{0})&\tilde{X}_{211}(t_{j}^{0})&\cdots&\tilde{X}_{21K_{2}}(t_{j}^{0})\\\vdots&&\vdots&\vdots&&\vdots\\\tilde{X}_{1n1}(t_{j}^{0})&\cdots&\tilde{X}_{1nK_{1}}(t_{j}^{0})&\tilde{X}_{2n1}(t_{j}^{0})&\cdots&\tilde{X}_{2nK_{2}}(t_{j}^{0})\end{bmatrix}$$

$$\begin{bmatrix}\hat{\mathbf{b}}_{1}(t)\\\hat{\mathbf{b}}_{2}(t)\end{bmatrix}=\left(\begin{bmatrix}\hat{\mathbf{C}}_{11}(t)&\hat{\mathbf{C}}_{12}(t)\\\hat{\mathbf{C}}_{21}(t)&\hat{\mathbf{C}}_{22}(t)\end{bmatrix}+\begin{bmatrix}\frac{\rho_{1}}{n}I_{K_{1}}&0\\0&\frac{\rho_{2}}{n}I_{K_{2}}\end{bmatrix}\right)^{-1}\begin{bmatrix}\hat{\mathbf{C}}_{1Y}(t)\\\hat{\mathbf{C}}_{2Y}(t)\end{bmatrix}$$

$$\begin{aligned}C_{\tilde{X}_{1k},\tilde{X}_{1l}}(t)&=\int_{\delta_{11}}^{\delta_{12}}\int_{\delta_{11}}^{\delta_{12}}B_{1k}(s)B_{1l}(u)E[X_{1}(t-s)X_{1}(t-u)]\,du\,ds\\&=\int_{\delta_{11}}^{\delta_{12}}\int_{\delta_{11}}^{\delta_{12}}B_{1k}(s)B_{1l}(u)C_{X_{1}}(t-s,t-u)\,du\,ds\end{aligned}$$

$$\hat{C}_{X_{1}}(s,u)=\frac{1}{(m_{X1}b)^{2}}\sum_{j,k=1}^{m_{X1}}K\left(\frac{s-s_{1j}}{b},\frac{u-s_{1k}}{b}\right)\frac{1}{n}\sum_{i=1}^{n}W_{1ij}W_{1ik}$$

$$\begin{aligned}C_{\tilde{X}_{2k},\tilde{X}_{2l}}(t)&=\int_{\delta_{21}}^{\delta_{22}}\int_{\delta_{21}}^{\delta_{22}}B_{2k}(s)B_{2l}(u)E[X_{2}(t-s)X_{2}(t-u)]\,du\,ds\\&=\int_{\delta_{21}}^{\delta_{22}}\int_{\delta_{21}}^{\delta_{22}}B_{2k}(s)B_{2l}(u)C_{X_{2}}(t-s,t-u)\,du\,ds\end{aligned}$$

$$\sum_{i=1}^{n}\frac{1}{(m_{X2i}b)^{2}}\sum_{1\leq j\neq k\leq m_{X2i}}K\left(\frac{s-s_{2ij}}{b},\frac{u-s_{2ik}}{b}\right)\left(W_{2ij}W_{2ik}-\alpha_{0}-\alpha_{1}(s-s_{2ij})-\alpha_{2}(u-s_{2ik})\right)^{2}$$

$$\begin{aligned}C_{\tilde{X}_{1k},\tilde{X}_{2l}}(t)&=\int_{\delta_{11}}^{\delta_{12}}\int_{\delta_{21}}^{\delta_{22}}B_{1k}(s)B_{2l}(u)E[X_{1}(t-s)X_{2}(t-u)]\,du\,ds\\&=\int_{\delta_{11}}^{\delta_{12}}\int_{\delta_{21}}^{\delta_{22}}B_{1k}(s)B_{2l}(u)C_{X_{1},X_{2}}(t-s,t-u)\,du\,ds\end{aligned}$$

$$\begin{aligned}C_{\tilde{X}_{1l},Y}(t)&=\int_{\delta_{11}}^{\delta_{12}}B_{1l}(s)E[X_{1}(t-s)Y(t)]\,ds\\&=\int_{\delta_{11}}^{\delta_{12}}B_{1l}(s)C_{X_{1},Y}(t-s,t)\,ds\end{aligned}$$

$$\hat{\beta}_{1}(s,t)=\sum_{k=1}^{K_{1}}B_{1k}(s)\hat{b}_{1k}(t),\quad s\in\Delta_{1},\ t\in[0,1]$$

$$\hat{\beta}_{2}(s,t)=\sum_{k=1}^{K_{2}}B_{2k}(s)\hat{b}_{2k}(t),\quad s\in\Delta_{2},\ t\in[0,1]$$

$$\lim_{n\to\infty}\sup_{(s,t)\in\Delta_{1}\times I_{t}}|\hat{\beta}_{1}(s,t)-\beta_{1}(s,t)|=0\quad\text{in probability}$$

$$\lim_{n\to\infty}\sup_{(s,t)\in\Delta_{2}\times I_{t}}|\hat{\beta}_{2}(s,t)-\beta_{2}(s,t)|=0\quad\text{in probability}$$

$$E[Y^{*}(t)\mid X_{1}^{*},X_{2}^{*}]=\beta_{0}(t)+\int_{\delta_{11}}^{\delta_{12}}\beta_{1}(s,t)X_{1}^{*}(t-s)\,ds+\int_{\delta_{21}}^{\delta_{22}}\beta_{2}(s,t)X_{2}^{*}(t-s)\,ds$$

$$C_{X_{2}}(s,u)=\sum_{l=1}^{\infty}\lambda_{l}\phi_{l}(s)\phi_{l}(u)$$

$$X_{2}^{*}(s)=\sum_{l=1}^{\infty}\xi_{l}^{*}\phi_{l}(s)$$


Taxonomy

Topics: Statistical Methods and Inference · Genetic and phenotypic traits in livestock · Genetic Mapping and Diversity in Plants and Animals

Full text

On estimation of the effect lag of predictors

and prediction in functional linear model

Haiyan Liu, Georgios Aivaliotis, Jeanine Houwing-Duistermaat

Department of Statistics

University of Leeds

Abstract

We propose a functional linear model to predict a response using multiple functional and longitudinal predictors and to estimate the effect lags of predictors. The coefficient functions are written as the expansion of a basis system (e.g. functional principal components, splines), and the coefficients of the fixed basis functions are estimated via optimizing a penalization criterion. Then time lags are determined by simultaneously searching on a prior grid mesh based on minimization of prediction error criterion. Moreover, mathematical properties of the estimated parameters and predicted responses are studied and performance of the method is evaluated by extensive simulations.

Keywords: lag functional linear model, functional principal component analysis, sparse and irregular functional data.

1 Introduction

Temporal (time-stamped) data are collected both routinely and ad hoc for various processes related to human activities and the natural world. In their two extreme forms, such data can be sampled densely and regularly in time (we call this dense data) or can consist of only a few measurements recorded at irregular time intervals (we call this sparse longitudinal data). Naturally, intermediate situations also occur. Examples of dense data are hourly pollution and climate measurements at a particular site, or financial time series. Sparse datasets can arise from medical records (e.g. visits to a GP) and from other ad hoc observations, for example measurements on wild species to which access is not easy.

Relationships between temporal data are often not synchronous and involve a delay in the effects. For example, it may take some time before high temperatures result in a lower growth rate of trees, and historical exposure to high temperatures might no longer affect growth after a certain period. Similarly, exposure to a risk factor might take some time to affect a person's health, and the effects might fade away some time after the exposure ceases (e.g. after stopping smoking).

In this paper, we consider estimation and prediction in a functional regression model where the dense functional predictor trajectory and the sparse longitudinal predictor trajectory over certain intervals of the past have effects on the sparse response trajectory. We estimate these intervals through the corresponding lags of the effect of the predictors on the response. In our motivating example, we estimate the influence of dense functional temperature on sparse longitudinal tree diameters. Moreover, we want to estimate the effect lags of temperature on tree diameter, i.e. from when the predictor starts to influence the response until when this influence disappears.

The classical function-on-function linear model reads as follows:

$$Y(t)=\beta_{0}(t)+\int_{0}^{T}\beta_{1}(s,t)X(s)\,ds+\epsilon(t),\quad t\in[0,T],$$

where $Y(t)$ is the response trajectory, $X(s)$ is the predictor trajectory, $\epsilon(t)$ is the error process, $\beta_{0}(t)$ is the intercept function, and $\beta_{1}(s,t)$ is the two-dimensional regression coefficient function which shows the influence of $X$ on $Y$. This model was first introduced by Ramsay and Dalzell (1991). For reviews of functional data analysis, see Ramsay and Silverman (2005), Horvath and Kokoszka (2012) and the references therein. Notice that in this model the entire predictor trajectory $X(s)$, including the future values, i.e. when $s>t$, is assumed to influence the current value of the response trajectory $Y$ at time $t$. Clearly this is not appropriate in many applications.

As a result, the historical functional linear model has been investigated by Malfait and Ramsay (2003), Harezlak et al. (2007), Kim et al. (2009, 2011) where only the past of the predictor trajectory influences the response at the current time:

$$Y(t)=\beta_{0}(t)+\int_{t-\delta_{2}}^{t-\delta_{1}}\beta_{1}(s,t)X(s)\,ds+\epsilon(t),\quad t\in[0,T],$$

where $\delta_{1}$ and $\delta_{2}$ ($0<\delta_{1}<\delta_{2}<T$) are the lags for the influence of the predictor trajectory on the response trajectory. For one dense functional predictor, Malfait and Ramsay (2003) consider the triangular basis expansion of the coefficient function, which is estimated at each observation point. A penalized approach which allows varying lags for the historical functional linear model has been developed by Harezlak et al. (2007). Kim et al. (2011) consider the situation in which both the predictor process and the response process are sparsely and irregularly observed. Pomann et al. (2016) have extended the historical functional linear model to multiple homogeneous predictors, where the response is influenced by the predictors from a fixed starting effect time up to the current time.

The contributions of this paper are threefold: multiple heterogeneous (sparse longitudinal or dense functional) predictors are included; time lags (both starting and end points) that are fixed but unknown are determined; and the asymptotic properties of the estimators are investigated. To be precise, this paper addresses the historical functional linear model with multiple heterogeneous predictors, in which the response is influenced by the predictors from a fixed starting effect time to a fixed ending effect time. We estimate the coefficient functions and the effect lags, and predict the response. Moreover, the asymptotic behavior of the estimated coefficient functions and of the predicted response curve is investigated.

The paper is organized as follows. In section 2, the historical function-on-function linear model for multiple heterogeneous predictors is introduced. In section 3, we consider the estimation of the coefficient functions and establish the uniform consistency of our estimators. In section 4, the prediction of the response trajectories is proposed and the asymptotic property of the predicted trajectories is established. The determination of the lags is proposed in section 5. Extensive numerical examples are considered in section 6 to show the finite-sample properties of our proposed estimators. In section 7, the Amazonian rainforest dataset is analysed and the lags are determined. We finish the paper with a conclusion and discussion.

2 Model

Suppose our observations are $\{Y_{ij},t_{ij}:i=1,...,n,\ j=1,...,m_{Yi}\}$ , $\{W_{1ij},s_{1ij}:i=1,...,n,\ j=1,...,m_{X1i}\}$ and $\{W_{2ij},s_{2ij}:i=1,...,n,\ j=1,...,m_{X2i}\}$ , where $t_{ij},\ s_{1ij},\ s_{2ij}\in[0,1]$ . For example, the response $Y_{ij}$ corresponds to the tree diameter for subject $i$ at time $t_{ij}$ . The predictor $W_{1ij}$ corresponds to the temperature for subject $i$ at time $s_{1ij}$ . The predictor $W_{2ij}$ corresponds to the climatic water deficit for subject $i$ at time $s_{2ij}$ .

Let $W_{1ij}=X_{1i}(s_{1ij})+\epsilon_{1ij}$ and $W_{2ij}=X_{2i}(s_{2ij})+\epsilon_{2ij}$, where $X_{1i}(t)$ and $X_{2i}(t)$ are independent copies of underlying square-integrable random functions $X_{1}(t)$ and $X_{2}(t)$ over $[0,1]$, respectively. Without loss of generality, we assume $\mu_{X_{1}}(t)=E[X_{1}(t)]=0$ and $\mu_{X_{2}}(t)=E[X_{2}(t)]=0$. We denote by $C_{X_{1}}(s,t)=cov(X_{1}(s),X_{1}(t))$ the covariance of $X_{1}$ and by $C_{X_{2}}(s,t)=cov(X_{2}(s),X_{2}(t))$ the covariance of $X_{2}$. We assume that the first predictor curves $X_{1i}$ are observed on a dense and regular grid of points $s_{1ij}=s_{1j}$. The observations $W_{1ij}$ are discrete versions of $X_{1i}$ contaminated with iid noise $\epsilon_{1ij}$ that has mean zero and finite variance and is independent of $X_{1i}$. The second predictor curves $X_{2i}$, in contrast, are observed on a sparse and irregular grid of points $s_{2ij}$. The observations $W_{2ij}$ are likewise discrete versions of $X_{2i}$ contaminated with iid noise $\epsilon_{2ij}$ that has mean zero and finite variance and is independent of $X_{2i}$. The responses $Y_{ij}$ are observed on a sparse and irregular grid of points $t_{ij}$.

We define the lag historical functional linear model with two heterogeneous covariates $X_{1}$ and $X_{2}$ for the response $Y$ as

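$$Y_{ij}=\beta_{0}(t_{ij})+\int_{\delta_{11}}^{\delta_{12}}\beta_{1}(s,t_{ij})X_{1i}(t_{ij}-s)\,ds+\int_{\delta_{21}}^{\delta_{22}}\beta_{2}(s,t_{ij})X_{2i}(t_{ij}-s)\,ds+e_{ij},\tag{1}$$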

where $i\in\{1,...,n\},\ j\in\{1,...,m_{Yi}\},\ \beta_{0}:[0,1]\to\mathbb{R}$ , $\Delta_{1}=[\delta_{11},\delta_{12}]\subset[0,1]$ , $\Delta_{2}=[\delta_{21},\delta_{22}]\subset[0,1]$ , $\beta_{1}:\Delta_{1}\times[0,1]\to\mathbb{R}$ and $\beta_{2}:\Delta_{2}\times[0,1]\to\mathbb{R}$ are continuous two-dimensional coefficient functions, and $e_{ij}$ are independent measurement errors with mean zero and finite variance $\sigma_{e}^{2}$ . Errors $e_{ij}$ are assumed to be independent of $X_{1i}$ and $X_{2i}$ .

Notice that (1) is equivalent to

$$Y_{ij}=\beta_{0}(t_{ij})+\int_{t_{ij}-\delta_{12}}^{t_{ij}-\delta_{11}}\beta_{1}(t_{ij}-s,t_{ij})X_{1i}(s)\,ds+\int_{t_{ij}-\delta_{22}}^{t_{ij}-\delta_{21}}\beta_{2}(t_{ij}-s,t_{ij})X_{2i}(s)\,ds+e_{ij},$$

so model (1) means that, given the entire predictor curves $X_{1i}$ and $X_{2i}$, the response for subject $i$ at time $t_{ij}$ is only affected by the values of $X_{1i}$ over the time window $[t_{ij}-\delta_{12},t_{ij}-\delta_{11}]$ and by the values of $X_{2i}$ over the time window $[t_{ij}-\delta_{22},t_{ij}-\delta_{21}]$. That is, $t_{ij}-\delta_{12}$ is the starting effective time and $t_{ij}-\delta_{11}$ is the ending effective time for $X_{1i}$ to have an effect on $Y_{i}$ at time $t_{ij}$; $t_{ij}-\delta_{22}$ is the starting effective time and $t_{ij}-\delta_{21}$ is the ending effective time for $X_{2i}$ to have an effect on $Y_{i}$ at time $t_{ij}$. The coefficient functions $\beta_{1}$ and $\beta_{2}$ weight the values of $X_{1i}$ and $X_{2i}$ over the time windows $[t_{ij}-\delta_{12},t_{ij}-\delta_{11}]$ and $[t_{ij}-\delta_{22},t_{ij}-\delta_{21}]$ respectively, and thereby quantify the effect of $X_{1i}$ and $X_{2i}$ on the response $Y_{ij}$.
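The equivalence between the lag form and the time-window form is a change of variable in the integral, which can be checked numerically. The following is a minimal sketch with a simple trapezoid rule; the coefficient function, the predictor curve and the values of `t`, `d1`, `d2` are illustrative assumptions, and the function names are hypothetical.

```python
import math


def trapezoid(f, a, b, n=400):
    """Composite trapezoid rule for the integral of f over [a, b]."""
    h = (b - a) / n
    total = 0.5 * (f(a) + f(b))
    for i in range(1, n):
        total += f(a + i * h)
    return total * h


def lag_form(beta, x, t, d1, d2):
    """Integral over the lag s in [d1, d2] of beta(s, t) * x(t - s)."""
    return trapezoid(lambda s: beta(s, t) * x(t - s), d1, d2)


def window_form(beta, x, t, d1, d2):
    """Same quantity written over calendar time u in [t - d2, t - d1]."""
    return trapezoid(lambda u: beta(t - u, t) * x(u), t - d2, t - d1)


# Illustrative choices (assumptions made for this sketch only).
def beta(s, t):
    return math.sin(2 * math.pi * s) * math.cos(math.pi * t)


def x(u):
    return math.sin(2 * math.pi * u) + u ** 2


t, d1, d2 = 0.8, 0.1, 0.4
lag_value = lag_form(beta, x, t, d1, d2)
window_value = window_form(beta, x, t, d1, d2)
print(lag_value, window_value)
```

Both calls use the same quadrature nodes in reverse order, so the two values agree up to floating-point rounding.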

3 Estimation

Let $\{B_{11}(s),...,B_{1K_{1}}(s)\}$ and $\{B_{21}(s),...,B_{2K_{2}}(s)\}$ be two pre-specified functional bases on $\Delta_{1}$ and $\Delta_{2}$. Then the two-dimensional coefficient functions $\beta_{1}(s,t)$ and $\beta_{2}(s,t)$ are assumed to be represented as

$$\beta_{1}(s,t)=\sum_{k=1}^{K_{1}}B_{1k}(s)b_{1k}(t),\quad s\in\Delta_{1},\ t\in[0,1]$$

and

$$\beta_{2}(s,t)=\sum_{k=1}^{K_{2}}B_{2k}(s)b_{2k}(t),\quad s\in\Delta_{2},\ t\in[0,1]$$

respectively, where $K_{1}$ and $K_{2}$ capture the resolution of the fit and should be chosen accordingly, and $b_{1k}(t)$ and $b_{2k}(t)$ are the unknown time-varying coefficient functions defined on $[0,1]$. As Kim et al. (2011) reported for the case of a single sparse predictor, “the estimation is not sensitive to the choice of $K$ provided that there are enough number of basis functions used in the estimation, since the penalized solution (defined later in this session) prevents over-fitting”. Clearly, various basis functions such as Fourier, B-spline or wavelet bases can be used, depending on the specific features of the coefficient functions. Since we cannot assume any prior knowledge of the coefficients, and B-spline bases are computationally fast and have good properties, we use B-spline functions of degree 4 with 10 equally spaced interior knots over $\Delta_{1}$ and $\Delta_{2}$ (the number of basis functions is 14). For details on B-spline bases, see for example Fan and Gijbels (1996) and Ramsay and Silverman (2005).
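As an illustration of such a basis, the sketch below evaluates B-spline basis functions with the Cox-de Boor recursion on a clamped knot vector. It assumes the convention order = degree + 1, under which the stated count of 14 basis functions with 10 interior knots corresponds to order 4; that reading, the interval $[0.1, 0.4]$ and the function names are assumptions of this sketch.

```python
def clamped_knots(a, b, n_interior, order):
    """Knot vector with `order` repeated boundary knots and equally spaced interior knots."""
    step = (b - a) / (n_interior + 1)
    interior = [a + step * (j + 1) for j in range(n_interior)]
    return [a] * order + interior + [b] * order


def bspline_basis(x, knots, order):
    """Values of all B-spline basis functions of the given order at x (Cox-de Boor)."""
    # Order-1 functions are indicators of the knot intervals.
    vals = []
    for i in range(len(knots) - 1):
        left, right = knots[i], knots[i + 1]
        inside = left <= x < right
        # Close the last non-empty interval on the right end point.
        if x == knots[-1] and left < right == knots[-1]:
            inside = True
        vals.append(1.0 if inside else 0.0)
    # Raise the order step by step; terms with zero-width support are dropped.
    for k in range(2, order + 1):
        new = []
        for i in range(len(knots) - k):
            term_a = 0.0
            width_a = knots[i + k - 1] - knots[i]
            if width_a > 0:
                term_a = (x - knots[i]) / width_a * vals[i]
            term_b = 0.0
            width_b = knots[i + k] - knots[i + 1]
            if width_b > 0:
                term_b = (knots[i + k] - x) / width_b * vals[i + 1]
            new.append(term_a + term_b)
        vals = new
    return vals  # length len(knots) - order


knots = clamped_knots(0.1, 0.4, n_interior=10, order=4)
basis_at = bspline_basis(0.25, knots, order=4)
print(len(basis_at), sum(basis_at))
```

The basis functions are non-negative and sum to one at every point of the interval, which is a convenient sanity check for any implementation.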

Plugging $\beta_{1}(s,t)$ and $\beta_{2}(s,t)$ into equation (1), we have

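$$\begin{aligned}Y_{ij}&=\beta_{0}(t_{ij})+\sum_{k=1}^{K_{1}}b_{1k}(t_{ij})\int_{\delta_{11}}^{\delta_{12}}B_{1k}(s)X_{1i}(t_{ij}-s)\,ds\\&\quad+\sum_{k=1}^{K_{2}}b_{2k}(t_{ij})\int_{\delta_{21}}^{\delta_{22}}B_{2k}(s)X_{2i}(t_{ij}-s)\,ds+e_{ij}\\&=:\beta_{0}(t_{ij})+\mathbf{b}_{1}(t_{ij})^{T}\tilde{\mathbf{X}}_{1i}(t_{ij})+\mathbf{b}_{2}(t_{ij})^{T}\tilde{\mathbf{X}}_{2i}(t_{ij})+e_{ij},\end{aligned}$$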

where $\tilde{X}_{1ik}(t_{ij})=\int_{\delta_{11}}^{\delta_{12}}B_{1k}(s)X_{1i}(t_{ij}-s)ds$ , $\tilde{X}_{2ik}(t_{ij})=\int_{\delta_{21}}^{\delta_{22}}B_{2k}(s)X_{2i}(t_{ij}-s)ds$ , $\mathbf{b}_{1}(t_{ij})=(b_{11}(t_{ij}),...,b_{1K_{1}}(t_{ij}))^{T}$ , $\mathbf{b}_{2}(t_{ij})=(b_{21}(t_{ij}),...,b_{2K_{2}}(t_{ij}))^{T}$ , $\tilde{\mathbf{X}}_{1i}(t_{ij})=(\tilde{X}_{1i1}(t_{ij}),...,\tilde{X}_{1iK_{1}}(t_{ij}))^{T}$ , and $\tilde{\mathbf{X}}_{2i}(t_{ij})=(\tilde{X}_{2i1}(t_{ij}),...,\tilde{X}_{2iK_{2}}(t_{ij}))^{T}$ . Note the observed times $t_{ij}$ depend on subject $i$ . Then model (1) reduces to a varying coefficient model with $K_{1}$ induced predictors $\tilde{X}_{1ik}(t_{ij})$ and $K_{2}$ induced predictors $\tilde{X}_{2ik}(t_{ij})$ .

First, notice that $\mu_{X_{1}}(t)=\mu_{X_{2}}(t)=0$ implies $\beta_{0}(t_{ij})=E[Y_{ij}]$, so $\beta_{0}$ can be estimated by applying a local smoothing method to the pooled $Y_{ij}$, see for example Yao et al. (2005), Beran and Liu (2014) and Liu and Houwing-Duistermaat (2018). With a slight abuse of notation, we continue to write $Y_{ij}$ for the centred response $Y_{ij}-\hat{\beta}_{0}(t_{ij})$, where $\hat{\beta}_{0}(t_{ij})$ is an estimator of $\beta_{0}(t)$ evaluated at time $t_{ij}$.

In order to derive the estimators of $\{b_{11}(t),...,b_{1K_{1}}(t)\}$ and $\{b_{21}(t),...,b_{2K_{2}}(t)\}$, we assume $t_{ij}=t_{j}^{0}$ in this paragraph only, i.e. the observation times are the same for all subjects. We then estimate $b_{1k}(t_{j}^{0})$ and $b_{2k}(t_{j}^{0})$ by minimizing:

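$$\begin{aligned}PSSE_{b_{1},b_{2}}&=\sum_{i=1}^{n}\left(Y_{ij}-\mathbf{b}_{1}(t_{j}^{0})^{T}\tilde{\mathbf{X}}_{1i}(t_{j}^{0})-\mathbf{b}_{2}(t_{j}^{0})^{T}\tilde{\mathbf{X}}_{2i}(t_{j}^{0})\right)^{2}+\rho_{1}\|\mathbf{b}_{1}(t_{j}^{0})\|^{2}+\rho_{2}\|\mathbf{b}_{2}(t_{j}^{0})\|^{2}\\&=\left\|\mathbf{Y}_{j}-Z_{j}\begin{bmatrix}\mathbf{b}_{1}(t_{j}^{0})\\\mathbf{b}_{2}(t_{j}^{0})\end{bmatrix}\right\|^{2}+\rho_{1}\|\mathbf{b}_{1}(t_{j}^{0})\|^{2}+\rho_{2}\|\mathbf{b}_{2}(t_{j}^{0})\|^{2},\end{aligned}\tag{3}$$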

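where $\|\cdot\|$ is the Euclidean norm of a vector, $\mathbf{Y}_{j}=(Y_{1j},...,Y_{nj})^{T}$, and $\rho_{1}>0$ and $\rho_{2}>0$ are the regularization parameters, which are assumed to be constant over $t\in[0,1]$ in order to avoid the high variability that would result from letting them vary with time. The penalization not only prevents over-fitting but also guarantees that the matrix to be inverted in the minimization problem is nonsingular. Then the minimizer of (3) is

$$\begin{bmatrix}\hat{\mathbf{b}}_{1}(t_{j}^{0})\\\hat{\mathbf{b}}_{2}(t_{j}^{0})\end{bmatrix}=\left(Z_{j}^{T}Z_{j}+\begin{bmatrix}\rho_{1}I_{K_{1}}&0\\0&\rho_{2}I_{K_{2}}\end{bmatrix}\right)^{-1}Z_{j}^{T}\mathbf{Y}_{j},$$

where $I_{K}$ is the $K\times K$ identity matrix and

$$Z_{j}=\begin{bmatrix}\tilde{X}_{111}(t_{j}^{0})&\cdots&\tilde{X}_{11K_{1}}(t_{j}^{0})&\tilde{X}_{211}(t_{j}^{0})&\cdots&\tilde{X}_{21K_{2}}(t_{j}^{0})\\\vdots&&\vdots&\vdots&&\vdots\\\tilde{X}_{1n1}(t_{j}^{0})&\cdots&\tilde{X}_{1nK_{1}}(t_{j}^{0})&\tilde{X}_{2n1}(t_{j}^{0})&\cdots&\tilde{X}_{2nK_{2}}(t_{j}^{0})\end{bmatrix}.$$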
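The closed-form minimizer is a ridge estimator with a separate penalty for each block of coefficients. A minimal NumPy sketch follows; the design matrix `Z`, the coefficient vector `b_true` and the function name `ridge_two_blocks` are synthetic stand-ins introduced for illustration, not quantities from the paper.

```python
import numpy as np


def ridge_two_blocks(Z, y, K1, K2, rho1, rho2):
    """Solve (Z'Z + diag(rho1 I_K1, rho2 I_K2)) b = Z'y and split b into (b1, b2)."""
    penalty = np.diag(np.r_[np.full(K1, rho1), np.full(K2, rho2)])
    b = np.linalg.solve(Z.T @ Z + penalty, Z.T @ y)
    return b[:K1], b[K1:]


# Synthetic example: n subjects, K1 + K2 induced predictors at one time point.
rng = np.random.default_rng(0)
n, K1, K2 = 200, 3, 2
Z = rng.normal(size=(n, K1 + K2))
b_true = np.array([1.0, -0.5, 0.25, 2.0, -1.0])
y = Z @ b_true + 0.01 * rng.normal(size=n)

b1_hat, b2_hat = ridge_two_blocks(Z, y, K1, K2, rho1=1e-3, rho2=1e-3)
print(b1_hat, b2_hat)
```

Solving the linear system with `np.linalg.solve` avoids forming the matrix inverse explicitly, which is usually the more stable choice.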

Therefore, by using the probability limits of the covariance structure, for arbitrary $t\in[0,1]$ , we have

$$\begin{bmatrix}\hat{\mathbf{b}}_{1}(t)\\\hat{\mathbf{b}}_{2}(t)\end{bmatrix}=\left(\begin{bmatrix}\hat{\mathbf{C}}_{11}(t)&\hat{\mathbf{C}}_{12}(t)\\\hat{\mathbf{C}}_{21}(t)&\hat{\mathbf{C}}_{22}(t)\end{bmatrix}+\begin{bmatrix}\frac{\rho_{1}}{n}I_{K_{1}}&0\\0&\frac{\rho_{2}}{n}I_{K_{2}}\end{bmatrix}\right)^{-1}\begin{bmatrix}\hat{\mathbf{C}}_{1Y}(t)\\\hat{\mathbf{C}}_{2Y}(t)\end{bmatrix},\tag{4}$$

where, for $a,b\in\{1,2\}$, $\hat{\mathbf{C}}_{ab}(t)=\left[\hat{C}_{\tilde{X}_{ak},\tilde{X}_{bl}}(t)\right]_{kl}$ is a $K_{a}\times K_{b}$ matrix whose entry $\hat{C}_{\tilde{X}_{ak},\tilde{X}_{bl}}(t)$ is an estimator of $C_{\tilde{X}_{ak},\tilde{X}_{bl}}(t)=cov\left(\tilde{X}_{ak}(t),\tilde{X}_{bl}(t)\right)$, and $\hat{\mathbf{C}}_{aY}(t)=\left[\hat{C}_{\tilde{X}_{a1},Y}(t),...,\hat{C}_{\tilde{X}_{aK_{a}},Y}(t)\right]^{T}$ is a vector whose entry $\hat{C}_{\tilde{X}_{al},Y}(t)$ is an estimator of $C_{\tilde{X}_{al},Y}(t)=cov\left(\tilde{X}_{al}(t),Y(t)\right)$.

To obtain the necessary quantities in (4), we consider the covariances:

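• For $C_{\tilde{X}_{1k},\tilde{X}_{1l}}(t)$, we have

$$\begin{aligned}C_{\tilde{X}_{1k},\tilde{X}_{1l}}(t)&=\int_{\delta_{11}}^{\delta_{12}}\int_{\delta_{11}}^{\delta_{12}}B_{1k}(s)B_{1l}(u)E[X_{1}(t-s)X_{1}(t-u)]\,du\,ds\\&=\int_{\delta_{11}}^{\delta_{12}}\int_{\delta_{11}}^{\delta_{12}}B_{1k}(s)B_{1l}(u)C_{X_{1}}(t-s,t-u)\,du\,ds,\end{aligned}$$

where $C_{X_{1}}(s,u)$ is the covariance between $X_{1}(s)$ and $X_{1}(u)$. Since predictor $X_{1}$ is densely observed, $C_{X_{1}}(s,u)$ can be estimated by bivariate kernel smoothing, see Beran and Liu (2014):

$$\hat{C}_{X_{1}}(s,u)=\frac{1}{(m_{X1}b)^{2}}\sum_{j,k=1}^{m_{X1}}K\left(\frac{s-s_{1j}}{b},\frac{u-s_{1k}}{b}\right)\frac{1}{n}\sum_{i=1}^{n}W_{1ij}W_{1ik},$$

where $b$ is a bandwidth and $K$ is a bivariate kernel function.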

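• For $C_{\tilde{X}_{2k},\tilde{X}_{2l}}(t)$, we have

$$\begin{aligned}C_{\tilde{X}_{2k},\tilde{X}_{2l}}(t)&=\int_{\delta_{21}}^{\delta_{22}}\int_{\delta_{21}}^{\delta_{22}}B_{2k}(s)B_{2l}(u)E[X_{2}(t-s)X_{2}(t-u)]\,du\,ds\\&=\int_{\delta_{21}}^{\delta_{22}}\int_{\delta_{21}}^{\delta_{22}}B_{2k}(s)B_{2l}(u)C_{X_{2}}(t-s,t-u)\,du\,ds,\end{aligned}$$

where $C_{X_{2}}(s,u)$ is the covariance between $X_{2}(s)$ and $X_{2}(u)$. Since predictor $X_{2}$ is sparsely observed, $C_{X_{2}}(s,u)$ can be estimated by the local linear surface smoother (Yao et al. 2015), which is defined through minimizing

$$\sum_{i=1}^{n}\frac{1}{(m_{X2i}b)^{2}}\sum_{1\leq j\neq k\leq m_{X2i}}K\left(\frac{s-s_{2ij}}{b},\frac{u-s_{2ik}}{b}\right)\left(W_{2ij}W_{2ik}-\alpha_{0}-\alpha_{1}(s-s_{2ij})-\alpha_{2}(u-s_{2ik})\right)^{2}$$

with respect to $\alpha_{0},\ \alpha_{1},\ \alpha_{2}$, where $b$ is a bandwidth and $K$ is a bivariate kernel function. The estimator is then $\hat{C}_{X_{2}}(s,u)=\hat{\alpha}_{0}$.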

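• For $C_{\tilde{X}_{1k},\tilde{X}_{2l}}(t)$, we have

$$\begin{aligned}C_{\tilde{X}_{1k},\tilde{X}_{2l}}(t)&=\int_{\delta_{11}}^{\delta_{12}}\int_{\delta_{21}}^{\delta_{22}}B_{1k}(s)B_{2l}(u)E[X_{1}(t-s)X_{2}(t-u)]\,du\,ds\\&=\int_{\delta_{11}}^{\delta_{12}}\int_{\delta_{21}}^{\delta_{22}}B_{1k}(s)B_{2l}(u)C_{X_{1},X_{2}}(t-s,t-u)\,du\,ds,\end{aligned}$$

where $C_{X_{1},X_{2}}(s,u)$ is the covariance between $X_{1}(s)$ and $X_{2}(u)$. Since predictor $X_{1}$ is densely observed and $X_{2}$ is sparsely observed, $C_{X_{1},X_{2}}(s,u)$ can be estimated by local linear surface smoothing.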

• For $C_{\tilde{X}_{2k},\tilde{X}_{1l}}(t)$, the derivation is similar to that for $C_{\tilde{X}_{1k},\tilde{X}_{2l}}(t)$.

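• For $C_{\tilde{X}_{1l},Y}(t)$, we have

$$\begin{aligned}C_{\tilde{X}_{1l},Y}(t)&=\int_{\delta_{11}}^{\delta_{12}}B_{1l}(s)E[X_{1}(t-s)Y(t)]\,ds\\&=\int_{\delta_{11}}^{\delta_{12}}B_{1l}(s)C_{X_{1},Y}(t-s,t)\,ds,\end{aligned}$$

where $C_{X_{1},Y}(s,u)$ is the covariance between $X_{1}(s)$ and $Y(u)$. Since $X_{1}$ is densely observed and $Y$ is sparsely observed, $C_{X_{1},Y}(s,u)$ can be estimated by local linear surface smoothing.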

• For $C_{\tilde{X}_{2l},Y}(t)$, the derivation is similar to that for $C_{\tilde{X}_{1l},Y}(t)$.
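For the densely observed predictor, the bivariate kernel estimator of the covariance reduces to a weighted average of raw cross-products on the common grid. The sketch below assumes a product Epanechnikov kernel and a synthetic process $X(t)=\xi\sin(2\pi t)$ with $\xi\sim N(0,1)$, so that the true covariance is $\sin(2\pi s)\sin(2\pi u)$; the kernel, the bandwidth, the data and the function names are all assumptions made for illustration.

```python
import numpy as np


def epanechnikov(x):
    """Univariate Epanechnikov kernel; the bivariate kernel is taken as a product."""
    return np.where(np.abs(x) <= 1.0, 0.75 * (1.0 - x ** 2), 0.0)


def dense_cov_estimate(W, grid, s, u, b):
    """Kernel estimate of C(s, u) from mean-zero curves W (n x m) on a common grid."""
    n, m = W.shape
    raw = W.T @ W / n                       # (1/n) sum_i W_ij W_ik for all (j, k)
    ks = epanechnikov((s - grid) / b)       # kernel weights in the s direction
    ku = epanechnikov((u - grid) / b)       # kernel weights in the u direction
    return float(ks @ raw @ ku) / (m * b) ** 2


rng = np.random.default_rng(1)
n, m = 2000, 100
grid = np.arange(m) / (m - 1)
xi = rng.normal(size=n)
W = xi[:, None] * np.sin(2 * np.pi * grid)[None, :] + 0.05 * rng.normal(size=(n, m))

c_peak = dense_cov_estimate(W, grid, 0.25, 0.25, b=0.1)    # true value is 1
c_cross = dense_cov_estimate(W, grid, 0.25, 0.75, b=0.1)   # true value is -1
print(c_peak, c_cross)
```

The estimate is biased towards zero near peaks of the covariance surface because of the smoothing, and the bias shrinks with the bandwidth.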

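Once $\mathbf{\hat{b}}_{1}(t)$ and $\mathbf{\hat{b}}_{2}(t)$ are obtained (for given lags $\delta$ and regularization parameters $\rho$), we can estimate the coefficient functions by

$$\hat{\beta}_{1}(s,t)=\sum_{k=1}^{K_{1}}B_{1k}(s)\hat{b}_{1k}(t),\quad s\in\Delta_{1},\ t\in[0,1]$$

and

$$\hat{\beta}_{2}(s,t)=\sum_{k=1}^{K_{2}}B_{2k}(s)\hat{b}_{2k}(t),\quad s\in\Delta_{2},\ t\in[0,1].$$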

Theorem 1

Under the assumptions in Beran and Liu (2014) and Yao et al. (2005a, 2005b), and with $I_{t}=[\max\{\delta_{12},\delta_{22}\},1]$, we have

$$\begin{aligned}\lim_{n\to\infty}\sup_{(s,t)\in\Delta_{1}\times I_{t}}|\hat{\beta}_{1}(s,t)-\beta_{1}(s,t)|&=0\quad\text{in probability},\\\lim_{n\to\infty}\sup_{(s,t)\in\Delta_{2}\times I_{t}}|\hat{\beta}_{2}(s,t)-\beta_{2}(s,t)|&=0\quad\text{in probability}.\end{aligned}$$

**Proof:** Uniform consistency of $\hat{C}_{X_{1}}(s,u)$ is given in Theorem 4 of Beran and Liu (2014), uniform consistency of $\hat{C}_{X_{1},X_{2}},\ \hat{C}_{X_{2},X_{1}},\ \hat{C}_{X_{1},Y},\ \hat{C}_{X_{2},Y}$ is given in Lemma 1 of Yao et al. (2005b), and uniform consistency of $\hat{C}_{X_{2}}(s,u)$ is given in Theorem 1 of Yao et al. (2005a). Then the uniform consistency of $\mathbf{\hat{C}}_{11}(t),\ \mathbf{\hat{C}}_{12}(t),\ \mathbf{\hat{C}}_{21}(t),\ \mathbf{\hat{C}}_{22}(t),\ \mathbf{\hat{C}}_{1Y}(t),\ \mathbf{\hat{C}}_{2Y}(t)$ can be obtained. Therefore the uniform consistency of $\hat{\mathbf{b}}_{1}(t)$ and $\hat{\mathbf{b}}_{2}(t)$ follows, and thus that of $\hat{\beta}_{1}(s,t)$ and $\hat{\beta}_{2}(s,t)$.

4 Prediction

Suppose we observe a new discrete response curve $\mathbf{Y}^{*}_{j}=(Y^{*}(t_{1}^{*}),...Y^{*}(t_{m^{*}}^{*}))$ , discrete dense predictor trajectory $\mathbf{W}_{1}^{*}=(W_{1}^{*}(s_{11}),...,W_{1}^{*}(s_{1m_{X1}}))^{T}$ and discrete sparse predictor trajectory $\mathbf{W}_{2}^{*}=(W_{2}^{*}(s_{21}^{*}),...W_{2}^{*}(s_{2m_{X2}^{*}}^{*}))^{T}$ . From the original model (1), the predicted response curve is

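$$E[Y^{*}(t)\mid X_{1}^{*},X_{2}^{*}]=\beta_{0}(t)+\int_{\delta_{11}}^{\delta_{12}}\beta_{1}(s,t)X_{1}^{*}(t-s)\,ds+\int_{\delta_{21}}^{\delta_{22}}\beta_{2}(s,t)X_{2}^{*}(t-s)\,ds.\tag{5}$$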

However, the lags $\delta_{11},\delta_{12},\delta_{21},\delta_{22}$ and the regularization parameters $\rho_{1}$ and $\rho_{2}$ have to be determined, and the functional representations of the predictor trajectories $X_{1}^{*}(s)$ and $X_{2}^{*}(s)$ have to be recovered from the data.

For $X_{1}^{*}(s)$ , it can be easily recovered by kernel smoothing, since the sampling is dense.

For $X_{2}^{*}(s)$, however, the sampling is sparse and irregular, so we use functional principal component analysis (FPCA). As discussed, we assume $X_{2}^{*}(s)\sim X_{2}(s)\in L^{2}[0,1]$ and $E[X_{2}(s)]=0$. Denote the covariance of $X_{2}(s)$ by $C_{X_{2}}(s,u)=cov(X_{2}(s),X_{2}(u))$; then Mercer's theorem gives the following spectral decomposition of the covariance

$$C_{X_{2}}(s,u)=\sum_{l=1}^{\infty}\lambda_{l}\phi_{l}(s)\phi_{l}(u),$$

where $\lambda_{1}\geq\lambda_{2}\geq...\geq 0$ are eigenvalues and $\phi_{l}$ are orthonormal eigenfunctions. By the Karhunen-Loève (KL) expansion, $X_{2}^{*}(s)$ can be represented as

$$X_{2}^{*}(s)=\sum_{l=1}^{\infty}\xi_{l}^{*}\phi_{l}(s),$$

where $\xi_{l}^{*}=\int_{0}^{1}X_{2}^{*}(s)\phi_{l}(s)ds$ are the functional principal component scores, which are uncorrelated random variables with mean 0 and variance $\lambda_{l}$. In practice, the expansion of $X_{2}^{*}(s)$ is often truncated to its first several terms, i.e.

$$X_{2}^{*}(s)\approx\sum_{l=1}^{L}\xi_{l}^{*}\phi_{l}(s).$$

The covariance $C_{X_{2}}(s,u)$ can be estimated as discussed in the previous section, and the eigenfunctions $\phi_{l}$ can be estimated from the spectral decomposition of the estimated covariance. However, the scores $\xi_{l}^{*}$ cannot be approximated by numerical integration as is usually done for dense functional data. Instead, under the Gaussian assumption, and with $\boldsymbol{\phi}_{l}=(\phi_{l}(s_{21}^{*}),...,\phi_{l}(s_{2m_{X_{2}}^{*}}^{*}))^{T}$, the best linear predictor for $\xi_{l}^{*}$ is (see Mardia et al. 1978, Yao et al. 2005, or the application in Liu et al. 2018):

[TABLE]

where $\Sigma=var(\mathbf{W}_{2}^{*})$ . Then the estimate of $\xi_{l}^{*}$ can be defined as

[TABLE]

The number of eigenfunctions $L$ can be chosen as the smallest number of eigenfunctions that explain 95% of the functional covariance. Once the estimates of the eigenfunctions $\phi_{l}$, the scores $\xi_{l}^{*}$ and $L$ have been obtained, $X_{2}^{*}(s)$ can be recovered as

[TABLE]
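The FPCA recovery of a sparsely observed curve can be sketched numerically as follows. The covariance surface here is a synthetic two-component stand-in for the estimated covariance, and the score formula $\lambda_{l}\boldsymbol{\phi}_{l}^{T}\Sigma^{-1}\mathbf{W}$ with $\Sigma=\Phi\Lambda\Phi^{T}+\sigma^{2}I$ is the standard conditional-expectation form from the FPCA literature for sparse data; whether it matches the displays elided above term by term is an assumption, as are all names and numerical values.

```python
import numpy as np

m = 200
grid = (np.arange(m) + 0.5) / m            # midpoint grid on [0, 1]
lam_true = np.array([2.0, 0.5])            # illustrative eigenvalues (assumed)


def phi_true(s):
    """Two orthonormal functions on [0, 1], used to build a synthetic covariance."""
    s = np.asarray(s, dtype=float)
    return np.column_stack([np.sqrt(2) * np.sin(2 * np.pi * s),
                            np.sqrt(2) * np.cos(2 * np.pi * s)])


# Stand-in for the estimated covariance surface evaluated on the grid.
C = phi_true(grid) @ np.diag(lam_true) @ phi_true(grid).T

# Discretised spectral decomposition: eigenpairs of C * (grid spacing).
evals, evecs = np.linalg.eigh(C / m)
idx = np.argsort(evals)[::-1]
evals = np.clip(evals[idx], 0.0, None)
efuns = evecs[:, idx] * np.sqrt(m)         # columns have unit L2 norm on [0, 1]

# Smallest L explaining at least 95% of the variance.
frac = np.cumsum(evals) / evals.sum()
L = int(np.searchsorted(frac, 0.95) + 1)


def efuns_at(times):
    """Estimated eigenfunctions interpolated at the sparse observation times."""
    return np.column_stack([np.interp(times, grid, efuns[:, l]) for l in range(L)])


def conditional_scores(w, times, sigma2):
    """Scores lambda_l * phi_l' Sigma^{-1} w with Sigma = Phi Lam Phi' + sigma2 I."""
    Phi_obs = efuns_at(times)
    Sigma = Phi_obs @ np.diag(evals[:L]) @ Phi_obs.T + sigma2 * np.eye(len(times))
    return evals[:L] * (Phi_obs.T @ np.linalg.solve(Sigma, w))


# One sparse, noisy new curve with known scores.
rng = np.random.default_rng(0)
times = np.array([0.05, 0.2, 0.33, 0.47, 0.6, 0.72, 0.85, 0.95])
xi_true = np.array([1.2, -0.7])
sigma = 0.05
w = phi_true(times) @ xi_true + sigma * rng.normal(size=times.size)

xi_hat = conditional_scores(w, times, sigma ** 2)
x_hat = efuns[:, :L] @ xi_hat              # recovered curve on the grid
x_true = phi_true(grid) @ xi_true
print(L, float(np.max(np.abs(x_hat - x_true))))
```

The comparison is made on the reconstructed curve rather than on the scores, because eigenfunctions are only determined up to sign.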

After plugging the functional representation of the predictor curves $\hat{X}_{1}^{*}(s)$ and $\hat{X}_{2}^{*}(s)$ into (5), we have

[TABLE]

Define

[TABLE]

and

[TABLE]

Theorem 2

Under assumptions in Beran and Liu (2014) and Yao et al. (2005a, 2005b), denote $I_{t}=[\max\{\delta_{12},\delta_{22}\},1]$ , for all $t\in I_{t}$ , we have

[TABLE]

**Proof:** For fixed $L$, we have

[TABLE]

For $I_{1}$ , from the uniform consistency of $\hat{\beta}_{1}(s,t)$ established in Theorem 1 and the uniform consistency of kernel smoother, we have $I_{1}\to 0$ as $n\to\infty$ .

For $I_{2}$ , from the uniform consistency of $\hat{\beta}_{2}(s,t)$ established in Theorem 1, the uniform consistency of $\hat{\xi}_{l}^{*}$ for $\tilde{\xi}_{l}^{*}$ from Theorem 3 in Yao et al. (2005a), and the uniform consistency of $\hat{\phi}_{l}$ from Theorem 2 in Yao et al. (2005a), we have $I_{2}\to 0$ as $n\to\infty$ .

For $I_{3}$ , following Lemma A.3 in Yao et al. (2005a), we have $I_{3}\to 0$ as $n\to\infty$ .

Therefore, Theorem 2 follows.

5 Implementation

The final task is to estimate the time lags $\delta$, which are of great importance in our application. For selecting the $\delta$'s and $\rho$'s, we consider the Normalized Prediction Error (NPE) criterion and the $K$-fold cross validation criterion. Specifically, NPE in this situation is defined as

[TABLE]

where $\hat{Y}_{ij}$ is the predicted value for the $j$th measurement on the $i$th response trajectory $Y(t)$ obtained using the given $\delta$'s and $\rho$'s, and $N=\sum_{i=1}^{n}m_{Yi}$. For the cross validation, divide the data into $K$ equal parts. For each $k=1,...,K$, fit the model with parameters $\delta,\ \rho$ to the other $K-1$ parts, which gives the estimated coefficient functions and hence the predictions $\hat{Y}^{-k}_{ij}$ in the $k$th part, and then compute the prediction error in the $k$th part. The $K$-fold cross validation score is defined as

[TABLE]

Similar criteria are considered in Kim et al. (2011) and Pomann et al. (2016).

Then the $\delta$'s and $\rho$'s are chosen in a hierarchical manner. Let $D_{1}$ and $D_{2}$ be the sets of potential lags for the first and second predictor, i.e. $\{(\delta_{11},\delta_{12})\}$ and $\{(\delta_{21},\delta_{22})\}$, respectively. Let $D_{\rho}$ be the set of potential regularization parameters $\{(\rho_{1},\rho_{2})\}$. First, for a fixed point $\delta^{0}=\left(\delta_{11}^{0},\delta_{12}^{0},\delta_{21}^{0},\delta_{22}^{0}\right)\in D_{1}\times D_{2}$, NPE values are calculated for all $\rho\in D_{\rho}$. The $\rho$ that achieves the smallest NPE value is chosen as the optimal $\rho$ for the given lags $\delta^{0}$. Second, the optimal $\rho$ is used for calculating the cross validation score for $\delta^{0}$. Finally, we repeat the above steps for all $\delta\in D_{1}\times D_{2}$ to obtain the cross validation score for every $\delta\in D_{1}\times D_{2}$, and the optimal $\delta$ is chosen to be the one with the smallest cross validation score. In practice, $D_{1}$ and $D_{2}$ are meshes in $[0,1]$ and are chosen empirically; $D_{\rho}$ is also chosen empirically.
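The hierarchical search can be written as two nested loops: an inner minimization of NPE over $\rho$ for each candidate lag pair, and an outer minimization of the cross validation score over the lags. The sketch below takes the two criteria as callables; `toy_npe` and `toy_cv` are hypothetical stand-ins with a known minimum, used only to exercise the search logic, and do not implement the NPE or cross validation formulas.

```python
import math
from itertools import product


def select_lags(D1, D2, D_rho, npe, cv_score):
    """Return (score, delta, rho) minimizing cv_score over D1 x D2,
    where rho is first tuned by NPE separately for every candidate delta."""
    best = None
    for delta in product(D1, D2):
        rho_opt = min(D_rho, key=lambda rho: npe(delta, rho))
        score = cv_score(delta, rho_opt)
        if best is None or score < best[0]:
            best = (score, delta, rho_opt)
    return best


# Hypothetical criteria with a known minimizer, for illustration only.
TRUE_LAGS = ((0.1, 0.4), (0.1, 0.4))


def toy_npe(delta, rho):
    lag_term = sum(abs(a - b) for d, t in zip(delta, TRUE_LAGS) for a, b in zip(d, t))
    rho_term = sum(abs(math.log10(r) + 3.0) for r in rho)
    return lag_term + rho_term


def toy_cv(delta, rho):
    return toy_npe(delta, rho)


D1 = [(a, b) for a in (0.0, 0.1, 0.2) for b in (0.3, 0.4, 0.5)]
D2 = list(D1)
D_rho = [(r1, r2) for r1 in (1e-4, 1e-3, 1e-2) for r2 in (1e-4, 1e-3, 1e-2)]

score, delta_opt, rho_opt = select_lags(D1, D2, D_rho, toy_npe, toy_cv)
print(delta_opt, rho_opt)
```

The cost is the number of lag candidates times the number of $\rho$ candidates model fits, plus one cross validation run per lag candidate, which is why the meshes are kept coarse in practice.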

6 Simulations

We study the efficiency of the NPE criterion for selecting the time lags $\delta$ and the regularization parameters $\rho$.

For $n=50,\ 100,\ 150,\ 200$ subjects, we first generate the response curve $Y(t)$ and the two predictor curves $X_{1}(t)$ and $X_{2}(t)$ on a dense grid of equally spaced time points over $[0,1]$, i.e. $\{j/99,j=0,...,99\}$. The number of measurements $m_{Yi}$ made on the $i$th response is randomly selected from 20 to 50, the number of measurements $m_{X1i}$ made on the first predictor for subject $i$ is 100, and the number of measurements $m_{X2i}$ made on the second predictor for subject $i$ is randomly selected from 30 to 50.

Define $X_{1i}(t)=\xi_{i1}\sin(2\pi t)+\xi_{i2}t^{2}$ with $\xi_{i1}\overset{iid}{\sim}N(0,1)$ and $\xi_{i2}\overset{iid}{\sim}N(0,1)$ , $X_{2i}(t)=\zeta_{i}\cos(2\pi t)$ with $\zeta_{i}\overset{iid}{\sim}N(0,1)$ . We take the same time lags for both $X_{1}$ and $X_{2}$ , i.e. $\delta_{11}=\delta_{21}=0.1$ , $\delta_{12}=\delta_{22}=0.4$ . For coefficient functions, we take $\beta_{0}(t)=t+t^{1/5}$ , $\beta_{1}(s,t)=\sin(2\pi s)\cos(\pi t),\ t\in[0,1],\ s\in[0.2,0.4]$ , $\beta_{2}(s,t)=\sin(4\pi s)\cos(2\pi t),\ t\in[0,1],\ s\in[0.2,0.4]$ . The measurement errors are taken to be independent normal with signal to noise ratio 20 for the predictors and response.
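As an illustration of the predictor design, the following sketch generates the noise-free curves $X_{1i}$ and $X_{2i}$ on the grid $\{j/99\}$ and draws the sparse observation pattern for the second predictor. Two points are assumptions made here and are not specified in the text: the number of measurements is drawn uniformly on the integers 30 to 50, and the observed time points are sampled without replacement from the dense grid. The response and the measurement noise are left out because they depend on details (boundary handling of the lag window, the exact definition of the signal to noise ratio) that are not fixed in this section.

```python
import math
import random

GRID = [j / 99 for j in range(100)]  # dense grid {j/99, j = 0, ..., 99}

def simulate_predictors(n, seed=0):
    """Return noise-free X1, X2 curves and observed grid indices for X2."""
    rng = random.Random(seed)
    X1, X2, obs2 = [], [], []
    for _ in range(n):
        xi1, xi2, zeta = rng.gauss(0, 1), rng.gauss(0, 1), rng.gauss(0, 1)
        # X_{1i}(t) = xi_{i1} sin(2 pi t) + xi_{i2} t^2
        X1.append([xi1 * math.sin(2 * math.pi * t) + xi2 * t ** 2 for t in GRID])
        # X_{2i}(t) = zeta_i cos(2 pi t)
        X2.append([zeta * math.cos(2 * math.pi * t) for t in GRID])
        # Assumption: m_{Xi2} uniform on {30, ..., 50}, points drawn without replacement.
        m = rng.randint(30, 50)
        obs2.append(sorted(rng.sample(range(100), m)))
    return X1, X2, obs2
```

The first predictor is observed on the full grid ($m_{Xi1}=100$), so no index set is needed for it.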

Figure 1 shows the simulated data with $n=100$ .

The estimation is based on a B-spline expansion of the coefficient functions, using B-splines of degree 4 with 10 equally spaced interior knots over $[0,1]$. The number of functional principal components is chosen based on the leave-one-curve cross validation criterion, and 99% of the variation is kept. The penalty parameters $\rho_{1}$ and $\rho_{2}$ are chosen on the dense grid $\rho_{1},\ \rho_{2}\in[10^{-5},10^{-2};20]$. We use the NPE criterion and the 10-fold cross validation criterion to determine the regularization parameters and the lags. Note that, in order to check the estimation performance, the estimation here is carried out under the correct lags, i.e. $\delta_{11}=\delta_{21}=0.1$, $\delta_{12}=\delta_{22}=0.4$. Figure 2 shows the result of one simulation, where $\rho_{1}$ is chosen as $4.28\times 10^{-4}$, $\rho_{2}$ is chosen as $8.86\times 10^{-4}$ and the corresponding NPE is $1.95\times 10^{-2}$. From Figure 2, we conclude that our model recovers the structure of the coefficient functions.
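For readers who want to reproduce the basis, the sketch below evaluates a univariate B-spline basis of degree 4 with 10 equally spaced interior knots on $[0,1]$ using the Cox-de Boor recursion. It assumes a clamped knot vector (each boundary knot repeated degree + 1 times), which gives $10+4+1=15$ basis functions; the paper does not state its boundary convention, and the function name is ours.

```python
def bspline_basis(x, knots, degree):
    """Evaluate all B-spline basis functions at x (Cox-de Boor recursion)."""
    n_basis = len(knots) - degree - 1
    # Degree 0: indicator functions of the half-open knot spans.
    B = [1.0 if knots[i] <= x < knots[i + 1] else 0.0
         for i in range(len(knots) - 1)]
    if x == knots[-1]:
        B[n_basis - 1] = 1.0  # close the last non-empty span on the right
    for d in range(1, degree + 1):
        nxt = []
        for i in range(len(knots) - d - 1):
            left = right = 0.0
            if knots[i + d] > knots[i]:  # skip terms with zero-width support
                left = (x - knots[i]) / (knots[i + d] - knots[i]) * B[i]
            if knots[i + d + 1] > knots[i + 1]:
                right = ((knots[i + d + 1] - x)
                         / (knots[i + d + 1] - knots[i + 1]) * B[i + 1])
            nxt.append(left + right)
        B = nxt
    return B

DEGREE = 4
INTERIOR = [k / 11 for k in range(1, 11)]  # 10 equally spaced interior knots
# Assumed clamped knot vector: boundary knots repeated DEGREE + 1 times.
KNOTS = [0.0] * (DEGREE + 1) + INTERIOR + [1.0] * (DEGREE + 1)
```

A bivariate coefficient function such as $\beta_{1}(s,t)$ could then be expanded in products of these univariate functions in $s$ and $t$; whether the paper uses exactly this tensor-product construction is not stated in this section.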

Table 1 illustrates the asymptotic behaviour of our estimation. The NPE is reported for different numbers of subjects, $n=50,\ 100,\ 150,\ 200$, with the estimation again based on the correct lags. The NPE decreases as $n$ increases, from $2.08\times 10^{-2}$ at $n=50$ to $1.79\times 10^{-2}$ at $n=200$, which is consistent with Theorem 1.

To evaluate the performance of our model in selecting the effect lags, the $\lambda$'s are determined by the NPE criterion and the $\delta$'s by the 10-fold cross-validation score. The true lags are $\delta_{11}=\delta_{21}=0.1$ and $\delta_{12}=\delta_{22}=0.4$. To save computational time, we fix the ending point, i.e. $\delta_{11}=\delta_{21}=0.1$, and search for the starting point over $\delta_{12}=\delta_{22}\in\{0.3,0.4,0.5\}$. There are thus three candidate combinations, only one of which is correct. Our model makes the correct choice in 65 out of 100 simulations.

Bibliography

[1] Beran, J. and Liu, H. (2014). On estimation of mean and covariance functions in repeated time series with long-memory errors. Lithuanian Mathematical Journal, 54(1), 8-34.
[2] Fan, J. and Gijbels, I. (1996). Local polynomial modeling and its applications. CRC Press.
[3] Harezlak, J., Coull, B. A., Laird, N. M., Magari, S. R., and Christiani, D. C. (2007). Penalized solutions to functional regression problems. Computational Statistics and Data Analysis, 51(10), 4911-4925.
[4] Horvath, L. and Kokoszka, P. (2012). Inference for functional data with applications. Springer Science and Business Media.
[5] Kim, K., Sentürk, D., and Li, R. (2011). Recent history functional linear models for sparse longitudinal data. Journal of Statistical Planning and Inference, 141(4), 1554-1566.
[6] Liu, H. and Houwing-Duistermaat, J. (2018). On trend and its derivative estimation in repeated unevenly spaced time series with long-range dependent errors. arXiv:1803.05411.
[7] Liu, H., Del Galdo, F. and Houwing-Duistermaat, J. (2018). Functional principal component analysis in predicting Scleroderma disease based on patients historical data.
[8] Lopez-Gonzalez, G., Lewis, S. L., Burkitt, M. and Phillips, O. L. (2011). ForestPlots.net: a web application and research tool to manage and analyse tropical forest plot data. Journal of Vegetation Science, 22, 610-613. doi: 10.1111/j.1654-1103.2011.01312.x