Adaptive Function-on-Scalar Regression with a Smoothing Elastic Net

Ardalan Mirshani; Matthew Reimherr

arXiv:1905.09881·stat.ME·May 27, 2019·J. Multivar. Anal.


TL;DR

This paper introduces AFSSEN, a novel adaptive Elastic Net approach for high-dimensional function-on-scalar regression that effectively selects predictors and ensures smooth estimates within a Hilbert space framework.

Contribution

The paper develops a new regularization method combining functional norms for simultaneous variable selection and smoothing in a high-dimensional setting.

Findings

01

AFSSEN outperforms existing methods in prediction accuracy.

02

It achieves better variable selection with fewer false positives.

03

Theoretical properties include a functional oracle property.

Abstract

This paper presents a new methodology, called AFSSEN, to simultaneously select significant predictors and produce smooth estimates in a high-dimensional function-on-scalar linear model with sub-Gaussian errors. Outcomes are assumed to lie in a general real separable Hilbert space, H, while parameters lie in a subspace known as a Cameron-Martin space, K, which is closely related to Reproducing Kernel Hilbert Spaces, so that parameter estimates inherit particular properties, such as smoothness or periodicity, without enforcing such properties on the data. We propose a regularization method in the style of an adaptive Elastic Net penalty that involves mixing two types of functional norms, providing fine-tuned control of both the smoothing and variable selection in the estimated model. Asymptotic theory is provided in the form of a functional oracle property, and the paper concludes…


Equations

$$\mathbb{K}=\left\{h\in\mathbb{H}:\ \sum_{i=1}^{\infty}\frac{\langle h,v_{i}\rangle_{\mathbb{H}}^{2}}{\theta_{i}}<\infty\right\}.$$

$$\operatorname{E}\left[\exp\langle x,X\rangle_{\mathbb{H}}\right]\leq\exp\left(\tfrac{1}{2}\langle x,C(x)\rangle_{\mathbb{H}}\right)\quad\forall x\in\mathbb{H},$$

$$Y_{n}=\sum_{i=1}^{I}X_{n,i}\beta_{i}^{\star}+\epsilon_{n},$$

$$L_{\lambda}(\boldsymbol{\beta})=\frac{1}{2N}\|\mathbf{Y}-\mathbf{X}\boldsymbol{\beta}\|_{\mathbb{H}^{N}}^{2}+\frac{\lambda_{K}}{2}\sum_{i=1}^{I}\|L(\beta_{i})\|_{\mathbb{K}}^{2}+\lambda_{H}\sum_{i=1}^{I}\tilde{w}_{i}\|\beta_{i}\|_{\mathbb{H}},$$

$K_{1}(t,s)$

$K_{2}(t,s)$

$K_{3}(t,s)$

$K_{4}(t,s)$

$$b_{N}^{2}\gg\frac{I_{0}^{2}\log(I)}{N}.$$

$$\frac{I_{0}\log(I)}{N}\ll\lambda_{H}\ll\frac{b_{N}^{2}}{I_{0}}.$$

$$\frac{1}{\tau}\leq\sigma_{\min}(\hat{\Sigma}_{11})\leq\sigma_{\max}(\hat{\Sigma}_{11})\leq\tau,$$

$$\|\hat{\Sigma}_{21}\hat{\Sigma}_{11}^{-1}\|_{op}\leq\phi<1,$$

$$\lambda_{K}\ll\frac{b_{N}^{2}}{I_{0}d_{N}^{2}}.$$

$$\frac{I_{0}\log(I)\,d_{N}}{N}\ll\frac{\lambda_{H}}{\lambda_{K}}.$$

$$\lambda_{H}\ll\frac{b_{N}}{NI_{0}}.$$

$$\frac{\lambda_{H}}{\lambda_{K}}\ll\frac{b_{N}}{NI_{0}}.$$

$$P\left(\hat{\boldsymbol{\beta}}\overset{S}{=}\boldsymbol{\beta}^{\star}\right)\to 1,$$

$$\|\hat{\boldsymbol{\beta}}-\tilde{\boldsymbol{\beta}}_{o}\|_{\mathbb{H}}=o_{P}(N^{-1/2}),$$

$$\|\hat{\boldsymbol{\beta}}-\tilde{\boldsymbol{\beta}}_{o}\|_{\mathbb{K}}=o_{P}(N^{-1/2}).$$

$$f(x)\geq f(x_{0})+\langle h,x-x_{0}\rangle_{\mathbb{H}}\quad\forall x\in\mathbb{H}.$$

$$\frac{\partial L_{\lambda}(\boldsymbol{\beta})}{\partial\beta_{i}}=K\left(-N^{-1}{\bf X}_{.i}^{\top}({\bf Y}-{\bf X}\boldsymbol{\beta})\right)+\lambda_{K}L^{2}(\beta_{i})+\lambda_{H}\tilde{w}_{i}\begin{cases}K(\beta_{i})\|\beta_{i}\|_{\mathbb{H}}^{-1}&\beta_{i}\neq 0,\\ \{h;\ \|K^{-1/2}(h)\|_{\mathbb{K}}\leq 1\}&\beta_{i}=0,\end{cases}$$

$$\begin{cases}\hat{\beta}_{i}=0,&\|\widecheck{\beta}_{i}\|_{\mathbb{H}}\leq\lambda_{H}\tilde{w}_{i},\\ \hat{\beta}_{i}=\left(\left(1+\dfrac{\lambda_{H}\tilde{w}_{i}}{\|\hat{\beta}_{i}\|_{\mathbb{H}}}\right)\mathds{I}+\lambda_{K}K^{-1}L^{2}\right)^{-1}\widecheck{\beta}_{i},&\|\widecheck{\beta}_{i}\|_{\mathbb{H}}>\lambda_{H}\tilde{w}_{i},\end{cases}$$

$$1=\sum_{j=1}^{\infty}\frac{\langle\widecheck{\beta}_{i},v_{j}\rangle^{2}}{\left((1+\lambda_{K}\eta_{j}^{2}\theta_{j}^{-1})\|\hat{\beta}_{i}\|_{\mathbb{H}}+\lambda_{H}\tilde{w}_{i}\right)^{2}},$$

$$K_{M}(h):=(K(h_{1}),\dots,K(h_{M}))\in\mathbb{H}^{M}\quad\text{where}\quad h=(h_{1},\dots,h_{M})\in\mathbb{H}^{M}.$$

$$\Sigma h:=\left\{\sum_{j=1}^{M}\Sigma_{1j}h_{j},\dots,\sum_{j=1}^{M}\Sigma_{Mj}h_{j}\right\}.$$

$$\partial f(x)=2K(x).$$

$$\partial f(x)=K(x)\|x\|_{\mathbb{H}}^{-1},$$

$$\partial f(0)=\{h\in\mathbb{H};\ \|K^{-1/2}(h)\|_{\mathbb{K}}\leq 1\}.$$

$$\partial f(x)=2L^{2}(x).$$

$$f(y)-f(x)\geq\langle h,y-x\rangle_{\mathbb{K}}\quad\forall y\in\mathbb{K},$$

$$\|K^{1/2}(y)\|_{\mathbb{K}}^{2}-\|K^{1/2}(x)\|_{\mathbb{K}}^{2}\geq\langle 2K(x),y-x\rangle_{\mathbb{K}}.$$

$$\langle 2K(x),y-x\rangle_{\mathbb{K}}=2\langle K^{1/2}(x),K^{1/2}(y)\rangle_{\mathbb{K}}-2\|K^{1/2}(x)\|_{\mathbb{K}}^{2},$$


Full text

Adaptive Function-on-Scalar Regression with a Smoothing Elastic Net

Ardalan Mirshani

Department of Statistics

The Pennsylvania State University

University Park, PA 16802

[email protected]

Matthew Reimherr

Department of Statistics

The Pennsylvania State University

University Park, PA 16802

[email protected] (Corresponding Author)

Abstract

This paper presents a new methodology, called AFSSEN, to simultaneously select significant predictors and produce smooth estimates in a high-dimensional function-on-scalar linear model with sub-Gaussian errors. Outcomes are assumed to lie in a general real separable Hilbert space, ${\mathbb{H}}$ , while parameters lie in a subspace known as a Cameron-Martin space, ${\mathbb{K}}$ , which is closely related to Reproducing Kernel Hilbert Spaces, so that parameter estimates inherit particular properties, such as smoothness or periodicity, without enforcing such properties on the data. We propose a regularization method in the style of an adaptive Elastic Net penalty that involves mixing two types of functional norms, providing fine-tuned control of both the smoothing and variable selection in the estimated model. Asymptotic theory is provided in the form of a functional oracle property, and the paper concludes with a simulation study demonstrating the advantage of using AFSSEN over existing methods in terms of prediction error and variable selection.

KEYWORDS: Variable Selection, Functional Data Analysis, Hilbert Space, Elastic Net, Smooth Estimate, Reproducing Kernel Hilbert Space, Oracle Property

1 Introduction

In recent years, rapid advances in data gathering technologies and new complex modern studies have presented substantial challenges for extracting information from increasingly large and sophisticated data sets. Functional data analysis, FDA, is a branch of statistics for conducting statistical inference on complex objects, especially in high-dimensional spaces (Ramsay and Silverman, 2007; Hsing and Eubank, 2015; Kokoszka and Reimherr, 2017). In addition, the emergence of inexpensive genotyping technologies has produced a substantial need for tools capable of handling large numbers of scalar predictors (Bierut et al., 2006; Scott et al., 2007; Repapi et al., 2010; Hebiri et al., 2011; Algamal and Lee, 2015; Craig et al., 2018). In this paper, we consider the function-on-scalar regression problem when the number of predictors is much larger than the number of subjects/units. We present a new approach, called AFSSEN, for Adaptive Function-on-Scalar Smoothing Elastic Net, which can separately control the smoothness of the underlying functional parameter estimates as well as select important predictors.

As in classic statistical analyses, the functional linear model, FLM, is one of the principal modeling tools when working with functional data (Morris, 2015). In these cases, at least one of the outcomes or predictors is functional (Reiss et al., 2010). When the number of predictors is fixed, techniques for fitting low-dimensional FLMs and their statistical properties are now well understood (Morris, 2015; Kokoszka and Reimherr, 2017). However, in high-dimensional cases, where the number of predictors is large relative to the number of statistical units, little work has been done, and most of it addresses Scalar-on-Function settings (Matsui and Konishi, 2011; Gertheiss et al., 2013; Lian, 2013; Fan et al., 2015).

In Function-on-Scalar regression, which is the problem we consider, Chen et al. (2016) considered a functional least squares with a Minimax Concave Penalty, MCP (Zhang et al., 2010), with a fixed number of predictors. A pre-whitening technique was used to exploit the within-function dependence of the outcomes. Barber et al. (2017) presented the Function-on-Scalar LASSO, FSL, which combines a functional least squares with an $\mathcal{L}_{1}$ penalty introduced in a separable Hilbert space ${\mathbb{H}}$ . Since FSL is a convex optimization problem, it is computationally efficient even with a large number of predictors ( $I\gg N$ ). Additionally, FSL estimates achieve optimal convergence rates, but as with the traditional LASSO, these estimates suffer from an asymptotic bias and do not achieve the functional oracle property. Fan and Reimherr (2017) suggested the Adaptive Function-on-Scalar LASSO, AFSL, which uses a functional least squares with an adaptive $\mathcal{L}_{1}$ penalty to reduce the bias problem in FSL. They showed that AFSL is computationally as efficient as FSL, but achieves a strong functional oracle property. However, AFSL provides limited control of the smoothness of the functional parameter estimates, which can affect the prediction error. Parodi and Reimherr (2017) developed the Functional Linear Adaptive Mixed Estimation, FLAME, which simultaneously selects important predictors and estimates the smooth parameters. They assume that while the data lie in a general real separable Hilbert space, ${\mathbb{H}}$ , the model parameters lie in a Reproducing Kernel Hilbert Space (RKHS), ${\mathbb{K}}$ . The RKHS is a subspace of ${\mathbb{H}}$ , which can be identified with a linear operator, $K$. They demonstrated that FLAME achieves a weak functional oracle property, meaning that it recovers the correct support with probability tending to one and that the FLAME estimator is equivalent to the oracle estimator only on certain nice projections. Showing that FLAME achieves the strong oracle property required stronger structural assumptions. In their framework, they used a coordinate descent algorithm, which made it computationally very efficient.

The main obstacle for FLAME is having to simultaneously control the smoothness and sparsity with a single penalty and tuning parameter. In particular, a tuning parameter value that works well in practice for smoothing may not work well for variable selection, and vice versa. To address this issue, we propose a method that controls smoothing and sparsity separately. We assume the data live in an arbitrary Hilbert space, ${\mathbb{H}}$ , but that some linear constraint of the parameters is enforced to lie in a Cameron-Martin space, CMS, ${\mathbb{K}}$ . CMS are closely related to Reproducing Kernel Hilbert Spaces, RKHS, when ${\mathbb{H}}=L^{2}[0,1]$ , and the two terms are often used interchangeably (Bogachev, 1998). However, since our ${\mathbb{H}}$ will be more general, we refrain from using the term RKHS to avoid confusion. In our approach, AFSSEN, we use an idea similar to the scalar adaptive elastic net penalty (Zou and Hastie, 2005; Zou, 2006), but for functional data. In particular, AFSSEN combines a penalized functional least squares criterion with an adaptive smoothing elastic net penalty containing an $\mathcal{L}_{1}$ term in ${\mathbb{H}}$ for variable selection and a separate $\mathcal{L}_{2}$ term in ${\mathbb{K}}$ for controlling the smoothness of the estimated parameters. The AFSSEN parameter estimates inherit the natural properties of the kernel function of ${\mathbb{K}}$ , such as smoothness or periodicity. We also show that AFSSEN enjoys better mathematical properties than AFSL or FLAME, even when relaxing the Gaussian error assumption to $C$-subgaussian. In particular, we show that AFSSEN achieves a strong oracle property in both the ${\mathbb{H}}$ norm and the stronger ${\mathbb{K}}$ norm. We also provide a very fast coordinate descent algorithm in the R programming language (R Core Team, 2018), whose backend is written in C++ (Eddelbuettel and François, 2011).

The remainder of the paper is organized as follows. In Section 2 we present preliminary material and the main assumptions used in later sections. Section 3 provides the framework and demonstrates the strong oracle property for AFSSEN under mild assumptions. In Section 4 we describe the implementation, including the coordinate descent algorithm and practical considerations. A simulation study comparing the performance of AFSSEN and FLAME in a smooth and a rough scenario is given in Section 5. Section 6 concludes. All mathematical proofs and derivations can be found in the supplemental material.

2 Background and Methodology

Throughout this paper we assume that ${\mathbb{H}}$ is a real separable Hilbert space with inner product $\langle.,.\rangle_{{\mathbb{H}}}$ and induced norm $\|.\|_{{\mathbb{H}}}$ . Let $K:{\mathbb{H}}\to{\mathbb{H}}$ be a compact, positive definite, self-adjoint linear operator with finite trace, meaning $\langle Kx,x\rangle_{{\mathbb{H}}}>0$ when $x\neq 0$ and $\langle Kx,y\rangle_{{\mathbb{H}}}=\langle x,Ky\rangle_{{\mathbb{H}}}$ for all $x,y\in{\mathbb{H}}$ . According to the spectral theorem (Dunford and Schwartz, 1963), we can decompose $K$ as $K(x)=\sum_{i=1}^{\infty}\theta_{i}\langle v_{i},x\rangle_{{\mathbb{H}}}v_{i}$ , where $\{v_{1},v_{2},\dots\}$ is an orthonormal basis of ${\mathbb{H}}$ and $\theta_{1}\geq\theta_{2}\geq\dots>0$ is a nonincreasing sequence of positive real numbers. The eigenvalues, $\{\theta_{i}\}$ , and eigenfunctions, $\{v_{i}\}$ , of $K$ induce a subspace of ${\mathbb{H}}$ , denoted ${\mathbb{K}}$ , called the Cameron-Martin space (Bogachev, 1998), defined as

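$$\mathbb{K}=\left\{h\in\mathbb{H}:\ \sum_{i=1}^{\infty}\frac{\langle h,v_{i}\rangle_{\mathbb{H}}^{2}}{\theta_{i}}<\infty\right\}.$$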

Equivalently, ${\mathbb{K}}$ can be viewed as the image $K^{1/2}({\mathbb{H}})$ (Bogachev, 1998). Then ${\mathbb{K}}$ is also a Hilbert space under the inner product $\langle x,y\rangle_{{\mathbb{K}}}=\sum\limits_{i=1}^{\infty}\dfrac{\langle x,v_{i}\rangle_{{\mathbb{H}}}\langle y,v_{i}\rangle_{{\mathbb{H}}}}{\theta_{i}}$ . The most commonly encountered Hilbert space in FDA is $L^{2}[0,1],$ which is the space of real valued square integrable functions over $[0,1]$ with corresponding norm $\|x\|_{{\mathbb{H}}}^{2}=\int_{0}^{1}x^{2}(t)dt$ . When ${\mathbb{H}}=L^{2}[0,1]$ and $K$ is an integral operator with kernel $K(t,s)$ , then ${\mathbb{K}}$ is isomorphic to a Reproducing Kernel Hilbert Space (Berlinet and Thomas-Agnan, 2011).
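As a small illustration of the two norms, the following Python sketch works with the coefficients $\langle x,v_{i}\rangle_{{\mathbb{H}}}$ of a function in a truncated eigenbasis of $K$. The function names, the finite truncation, and the example values are assumptions made for illustration; they are not taken from the paper.

```python
import math

def h_norm(coefs):
    """H norm from coefficients <x, v_i>_H in an orthonormal basis (Parseval)."""
    return math.sqrt(sum(c * c for c in coefs))

def k_norm(coefs, theta):
    """Cameron-Martin norm sqrt(sum_i <x, v_i>^2 / theta_i), truncated to len(theta) terms."""
    return math.sqrt(sum(c * c / t for c, t in zip(coefs, theta)))

# Hypothetical example: two eigenvalues and a function with coefficients (3, 2).
theta = [1.0, 0.25]
x = [3.0, 2.0]
print(h_norm(x), k_norm(x, theta))
```

Because the eigenvalues $\theta_{i}$ decay, the ${\mathbb{K}}$ norm weights components along later eigenfunctions more heavily than the ${\mathbb{H}}$ norm does, which is why a finite ${\mathbb{K}}$ norm is a stronger requirement than a finite ${\mathbb{H}}$ norm.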

In previous works (Fan and Reimherr, 2017; Parodi and Reimherr, 2017), the modelling noise was assumed to be a Gaussian process. In this paper, we relax this assumption by considering $C$-subgaussian noise, which, to the best of our knowledge, has not been considered before in functional data models. A mean zero random element $X$ in ${\mathbb{H}}$ is called $C$-subgaussian if

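$$\operatorname{E}\left[\exp\langle x,X\rangle_{\mathbb{H}}\right]\leq\exp\left(\tfrac{1}{2}\langle x,C(x)\rangle_{\mathbb{H}}\right)\quad\forall x\in\mathbb{H},$$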

where $C$ is a covariance operator in ${\mathbb{H}}$ (Buldygin and Kozachenko, 1980; Antonioni, 1997). In other words, the moment generating function of $X$ is dominated, uniformly across ${\mathbb{H}}$ , by the moment generating function of a Gaussian process with covariance $C$ . This is a convenient assumption as it provides the necessary tail probability inequalities (Hsu et al., 2012) without making explicit assumptions on the distribution of $X$ . In particular, a Gaussian process in ${\mathbb{H}}$ with covariance operator $C$ is a $C$-subgaussian process in ${\mathbb{H}}$ (Antonioni, 1997).
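To see why the Gaussian case meets the definition with equality, recall that for a mean zero Gaussian element $X$ with covariance operator $C$, the scalar $\langle x,X\rangle_{{\mathbb{H}}}$ is normally distributed with mean zero and variance $\langle x,C(x)\rangle_{{\mathbb{H}}}$. The worked step below uses only the moment generating function of a scalar normal variable; it is a standard fact and not a result specific to this paper.

```latex
\langle x, X\rangle_{\mathbb{H}} \sim \mathcal{N}\bigl(0,\ \langle x, C(x)\rangle_{\mathbb{H}}\bigr)
\quad\Longrightarrow\quad
\operatorname{E}\bigl[\exp\langle x, X\rangle_{\mathbb{H}}\bigr]
= \exp\Bigl(\tfrac{1}{2}\langle x, C(x)\rangle_{\mathbb{H}}\Bigr)
\qquad \forall x \in \mathbb{H}.
```

The subgaussian condition therefore holds with equality for Gaussian noise, and with a possibly strict inequality for lighter-tailed noise.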

We now introduce our primary modelling assumptions.

Assumption 1.

Let $Y_{n}\in{\mathbb{H}}$ for $n\in\{1,\dots,N\}$ satisfy

$$Y_{n}=\sum_{i=1}^{I}X_{n,i}\beta_{i}^{\star}+\epsilon_{n},$$

where ${\bf X}=\{X_{n,i}\}\in{\mathbb{R}}^{N\times I}$ is the design matrix with standardized columns and $\epsilon_{n}$ are i.i.d. $C$-subgaussian random elements of ${\mathbb{H}}$ . We assume that only the first $I_{0}$ predictors are significant, meaning their corresponding coefficient functions, $\beta^{\star}_{1},\dots,\beta^{\star}_{I_{0}}$ , are nonzero. We write ${\bf X}=({\bf X}_{1},{\bf X}_{2})$ to partition the predictors into ${\bf X}_{1}$ and ${\bf X}_{2}$ , which are called the significant and null predictors, respectively. We denote the true support by ${\mathcal{S}}=\{1,\dots,I_{0}\}$ .

The oracle estimate is defined as $\boldsymbol{\tilde{\boldsymbol{\beta}}_{o}}=\{\boldsymbol{\tilde{\boldsymbol{\beta}}_{o}}_{1},\textbf{0}\}$ where $\boldsymbol{\tilde{\boldsymbol{\beta}}_{o}}_{1}$ is the $\mathcal{L}_{2}$ penalized estimate given the true support and $\textbf{0}\in{\mathbb{H}}^{I-I_{0}}$ consists of $I-I_{0}$ zero functions. In this paper, we provide an estimator of $\boldsymbol{\beta}$ that achieves two types of strong oracle properties, namely, our estimator asymptotically has the correct support and also is equivalent to the oracle estimator in the ${\mathbb{H}}$ topology as well as the stronger ${\mathbb{K}}$ topology.

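We propose estimating $\boldsymbol{\beta}^{\star}$ by minimizing the following target function over ${\mathbb{H}}^{I}$

$$L_{\lambda}(\boldsymbol{\beta})=\frac{1}{2N}\|{\bf Y}-{\bf X}\boldsymbol{\beta}\|_{{\mathbb{H}}^{N}}^{2}+\frac{\lambda_{K}}{2}\sum_{i=1}^{I}\|L(\beta_{i})\|_{{\mathbb{K}}}^{2}+\lambda_{H}\sum_{i=1}^{I}{\tilde{w}}_{i}\|\beta_{i}\|_{{\mathbb{H}}},\qquad(1)$$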

where ${\bf Y}\in{\mathbb{H}}^{N}$ , ${\bf X}\in{\mathbb{R}}^{N\times I}$ and $\boldsymbol{\beta}\in{\mathbb{K}}^{I}$ . The operator $L:{\mathbb{K}}\to{\mathbb{K}}$ is a continuous linear operator, which is included to provide slightly more generality. In particular, if one wishes to penalize only the second derivative of $\beta_{i}$ , then that is equivalent to using a Sobolev kernel for ${\mathbb{K}}$ and choosing $L$ as a projection onto the orthogonal complement of the constant and linear functions (since they have second derivative zero) (Bawa, 2005; Yuan et al., 2010).
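The target function can be evaluated directly once every function is represented by its coefficients in the eigenbasis of $K$. The sketch below is a minimal illustration under stated assumptions: the basis is truncated to $M$ terms, $L$ shares the eigenfunctions of $K$ with eigenvalues $\eta_{j}$ (as Theorem 1 later assumes), and the function name and argument layout are hypothetical rather than the authors' implementation.

```python
import math

def afssen_objective(Y, X, beta, theta, eta, w, lam_K, lam_H):
    """Evaluate the AFSSEN target function in eigen-coordinates.

    Y     : N x M list, Y[n][j] = <Y_n, v_j>_H
    X     : N x I design matrix
    beta  : I x M list, beta[i][j] = <beta_i, v_j>_H
    theta : eigenvalues of K; eta : eigenvalues of L (assumed same eigenfunctions)
    w     : adaptive weights, one per predictor
    """
    N, I, M = len(Y), len(beta), len(theta)
    # Least-squares term: (1 / 2N) * sum_n ||Y_n - sum_i X_ni beta_i||_H^2
    rss = 0.0
    for n in range(N):
        for j in range(M):
            fit = sum(X[n][i] * beta[i][j] for i in range(I))
            rss += (Y[n][j] - fit) ** 2
    # Smoothing term: sum_i ||L(beta_i)||_K^2 = sum_i sum_j (eta_j * b_ij)^2 / theta_j
    smooth = sum((eta[j] * beta[i][j]) ** 2 / theta[j]
                 for i in range(I) for j in range(M))
    # Sparsity term: sum_i w_i * ||beta_i||_H
    sparse = sum(w[i] * math.sqrt(sum(b * b for b in beta[i])) for i in range(I))
    return rss / (2 * N) + 0.5 * lam_K * smooth + lam_H * sparse
```

Setting `lam_K=0` leaves only the least-squares and weighted group-norm terms, which is the AFSL criterion mentioned in the text.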

We call the estimator produced by minimizing (1) the Adaptive Function-on-Scalar Smoothing Elastic Net, AFSSEN, as it is an extension of the classic Elastic Net (Zou and Hastie, 2005; Zou, 2006) to functional response models. Setting $\lambda_{K}=0$ , AFSSEN reduces to the adaptive function-on-scalar lasso, AFSL (Fan and Reimherr, 2017). By comparison, FLAME (Parodi and Reimherr, 2017) utilizes a single $\mathcal{L}_{1}$ penalty with the ${\mathbb{K}}$ norm, which simultaneously selects significant predictors and produces smooth estimates of the parameters, whereas in AFSSEN this task is split between two penalties. The first is an $\mathcal{L}_{1}$ penalty using the ${{\mathbb{H}}}$ norm, which is responsible for variable selection, while smoothing is achieved by using an $\mathcal{L}_{2}$ penalty with the squared ${{\mathbb{K}}}$ norm. We show that tuning sparsity and smoothness of the estimates separately produces stronger asymptotic results and can dramatically increase statistical utility.

The AFSSEN target function (1) requires a kernel, weights ${\tilde{w}}_{i}$ , and the values of penalty parameters $\lambda_{K}$ and $\lambda_{H}$ . When ${\mathbb{H}}=L^{2}[0,1]$ , there are many options for choosing the kernel functions, each of which imparts different properties to the parameter estimates. We explore four popular kernels:

[TABLE]

Each kernel is from the Matérn family of covariances (Stein, 2012) with smoothness parameters $\nu=\infty,5/2,3/2,1/2$ respectively, though the first is also known as the Gaussian or squared exponential kernel and the last is also known as the exponential, Laplacian, or Ornstein-Uhlenbeck kernel. There are also different options for choosing the adaptive weights. One can use the data-driven weights ${\tilde{w}}_{i}=\nicefrac{{1}}{{\|\tilde{\beta}_{i}\|_{{\mathbb{H}}}}}$ where $\boldsymbol{\tilde{\beta}}=(\tilde{\beta}_{1},\dots,\tilde{\beta}_{I})^{\top}$ is the FSL parameter estimate (Barber et al., 2017), which has the added benefit of screening out small effects before fitting the more complicated AFSSEN model. Another option is to run a nonadaptive version of AFSSEN, setting ${\tilde{w}}_{i}=1$ , then compute the parameter estimate $\boldsymbol{\hat{\beta}}$ and finally define the new weights ${\tilde{w}}_{i}=\nicefrac{{1}}{{\|\hat{\beta}_{i}\|_{{\mathbb{H}}}}}$ for running the adaptive step. Huang et al. (2008) suggest using one over the norm of the parameter estimates obtained from a marginal regression. The second method, running the nonadaptive step with all weights set to one, is the approach we take in the simulation section. Finally, to determine the penalty parameters $\lambda_{K}$ and $\lambda_{H}$ , we consider a fine grid of values and choose the optimal pair by cross-validation, though there are other options, such as BIC (Barber et al., 2017).
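For readers who want to experiment with the four kernel families and the adaptive weights, here is a hedged Python sketch. The formulas are the common textbook parametrization of the Matérn family with range `rho`; the paper's own display (2) may scale the range differently, so the exact constants should be treated as an assumption. The helper names are hypothetical.

```python
import math

def matern(t, s, rho, nu):
    """Matern kernel K(t, s) for nu in {0.5, 1.5, 2.5, inf}.

    Textbook parametrization (assumption): distance d = |t - s| / rho.
    """
    d = abs(t - s) / rho
    if nu == 0.5:                      # exponential / Ornstein-Uhlenbeck
        return math.exp(-d)
    if nu == 1.5:
        a = math.sqrt(3.0) * d
        return (1.0 + a) * math.exp(-a)
    if nu == 2.5:
        a = math.sqrt(5.0) * d
        return (1.0 + a + a * a / 3.0) * math.exp(-a)
    if nu == math.inf:                 # Gaussian / squared exponential limit
        return math.exp(-0.5 * d * d)
    raise ValueError("nu must be 0.5, 1.5, 2.5 or math.inf")

def adaptive_weights(norms):
    """w_i = 1 / ||beta_i||_H from a first-stage fit; zero norms get infinite weight."""
    return [1.0 / r if r > 0 else math.inf for r in norms]
```

An infinite weight simply keeps a predictor that was dropped in the first stage out of the adaptive fit, which matches the screening effect described above.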

3 Theoretical Properties

In this section, we present our main theoretical results. We begin by explicitly introducing more technical assumptions needed on the tuning parameters. We decompose our assumptions into three sets. The first, Assumption 2, ensures that AFSSEN, asymptotically, recovers the true support of $\boldsymbol{\beta}$ . The second, Assumption 3, ensures that AFSSEN is also asymptotically equivalent to the oracle estimate in the ${\mathbb{H}}$ topology, which completes the strong oracle property. Finally, under Assumption 4, one can show that AFSSEN also achieves the strong oracle property in the stronger ${\mathbb{K}}$ topology, which we have not seen from any other estimator, but is useful for estimating quantities such as derivatives.

Assumption 2.

Suppose Assumption 1 is satisfied. Denote the true support as ${\mathcal{S}}$ . We assume the following six conditions hold.

1. Minimum Signal. Let $b_{N}=\min\limits_{i\in{\mathcal{S}}}\|\beta^{\star}_{i}\|_{{\mathbb{H}}}$ , then we assume

$$b_{N}^{2}\gg\frac{I_{0}^{2}\log(I)}{N}.$$

2. Sparsity Tuning Parameter. We assume the sparsity tuning parameter, $\lambda_{H}$ , satisfies

$$\frac{I_{0}\log(I)}{N}\ll\lambda_{H}\ll\frac{b_{N}^{2}}{I_{0}}.$$

3. Design Matrix. Let ${\hat{\small\Sigma}}_{11}=N^{-1}{\bf X}_{1}^{\top}{\bf X}_{1}$ be the covariance of the design matrix for the true predictors, then we assume the minimum and maximum eigenvalues of ${\hat{\small\Sigma}}_{11}$ are bounded by

$$\frac{1}{\tau}\leq\sigma_{\min}({\hat{\small\Sigma}}_{11})\leq\sigma_{\max}({\hat{\small\Sigma}}_{11})\leq\tau,$$

where $\tau$ is a fixed positive number.

4. Irrepresentable Condition. Let ${\hat{\small\Sigma}}_{21}=N^{-1}{\bf X}_{2}^{\top}{\bf X}_{1}$ , the cross covariance between the true and null predictors, then we assume that

$$\|{\hat{\small\Sigma}}_{21}{\hat{\small\Sigma}}_{11}^{-1}\|_{op}\leq\phi<1,$$

where $\|A\|_{op}=\sup_{\|x\|=1}\|Ax\|$ is the operator norm of an arbitrary matrix $A$.

5. Maximum Signal. Let $d_{N}=\max\limits_{i\in{\mathcal{S}}}\|\beta^{\star}_{i}\|_{{\mathbb{K}}}$ , we assume the smoothing tuning parameter, $\lambda_{K}$ , satisfies

$$\lambda_{K}\ll\frac{b_{N}^{2}}{I_{0}d_{N}^{2}}.$$

6. Smoothing Tuning Parameter. We assume

$$\frac{I_{0}\log(I)\,d_{N}}{N}\ll\frac{\lambda_{H}}{\lambda_{K}}.$$

The above assumptions are common in the high-dimensional regression literature. The first condition gives the minimum signal magnitude needed to detect the relevant predictors. It allows the smallest value of $\|\boldsymbol{\beta}^{\star}_{i}\|_{{\mathbb{H}}}$ to vary with the sample size, the number of significant predictors, $I_{0}$ , and the total number of predictors, $I$ , but it cannot be too small. The second condition, on the sparsity tuning parameter, states a familiar rate for $\lambda_{H}$ , allowing it to grow but not too fast. The third condition, on the design matrix, guarantees that the oracle estimator is well defined, which in turn ensures that the AFSSEN estimates are well behaved when restricted to the true predictors. The irrepresentable condition implies that the true and null predictors should not be too correlated. This is an essential assumption for achieving the oracle property (Zhao and Yu, 2006). The fifth condition, on the maximum signal, essentially indicates that the smoothing tuning parameter, $\lambda_{K}$ , cannot be increased too quickly. Finally, the last condition gives a trade-off between the smoothing and sparsity parameters. It indicates that the sparsity parameter cannot be too small relative to the smoothing parameter. The above assumptions imply that AFSSEN is consistent in terms of variable selection. We require slightly stronger assumptions to show that the AFSSEN estimates are asymptotically equivalent to the oracle estimates under $\|.\|_{{\mathbb{H}}}$ and $\|.\|_{{\mathbb{K}}}$ .

Assumption 3.

The sparsity tuning parameter $\lambda_{H}$ satisfies

$$\lambda_{H}\ll\frac{b_{N}}{NI_{0}}.$$

Assumption 4.

Assume that $\eta_{j}^{2}\geq M\sqrt{\theta_{j}}$ , where $\theta_{j}$ and $\eta_{j}$ are the eigenvalues of $K$ and $L$ , respectively, and $M>0$ is a constant scalar. Then the smoothing and sparsity tuning parameters $\lambda_{K},\lambda_{H}$ satisfy

$$\frac{\lambda_{H}}{\lambda_{K}}\ll\frac{b_{N}}{NI_{0}}.$$

Assumption 3 assigns a tighter upper bound than the Sparsity Tuning Parameter condition in Assumption 2. Assumption 4 gives another trade-off between $\lambda_{K}$ and $\lambda_{H}$ and does not allow their ratio to grow too fast. Using the above assumptions, we can now present our main theorem, which shows that AFSSEN chooses the true support with probability tending to one and that its estimates are asymptotically equivalent to the oracle estimates.

Theorem 1.

Suppose $\boldsymbol{\tilde{\beta}}$ and $\boldsymbol{\tilde{\boldsymbol{\beta}}_{o}}$ are the FSL (Barber et al., 2017) and oracle estimates, respectively. Assume $L$ is a self-adjoint, nonnegative definite, continuous linear operator with the same eigenfunctions as $K$ . Let $\boldsymbol{\hat{\beta}}$ be the AFSSEN estimate with the data-driven weights ${\tilde{w}}_{i}=\|\tilde{\beta}_{i}\|_{{\mathbb{H}}}^{-1}$ . If the regression model satisfies Assumptions 1 and 2, then the AFSSEN estimate $\boldsymbol{\hat{\beta}}$

1. has the correct support:

$$P\left(\boldsymbol{\hat{\beta}}\overset{S}{=}\boldsymbol{\beta}^{\star}\right)\to 1,$$

2. is equivalent to the oracle estimate under $\|.\|_{{\mathbb{H}}}$ if Assumption 3 also holds:

$$\|\boldsymbol{\hat{\beta}}-\boldsymbol{\tilde{\boldsymbol{\beta}}_{o}}\|_{{\mathbb{H}}}=o_{P}(N^{-1/2}),$$

3. and is equivalent to the oracle estimate under $\|.\|_{{\mathbb{K}}}$ if Assumption 4 also holds:

$$\|\boldsymbol{\hat{\beta}}-\boldsymbol{\tilde{\boldsymbol{\beta}}_{o}}\|_{{\mathbb{K}}}=o_{P}(N^{-1/2}).$$

4 Implementation

Here we present a coordinate descent algorithm to find the parameter estimates efficiently. We employ functional subgradients to update the individual parameter estimates in each step. Subgradients extend derivatives to convex functionals that are not necessarily differentiable. We call $h\in{\mathbb{H}}$ a subgradient of $f$ at $x_{0}\in{\mathbb{H}}$ if

$$f(x)\geq f(x_{0})+\langle h,x-x_{0}\rangle_{{\mathbb{H}}}\quad\forall x\in{\mathbb{H}}.\qquad(9)$$

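The collection of all subgradients of $f$ at $x_{0}\in{\mathbb{H}}$ is called the subdifferential of $f$ at $x_{0}$ and is denoted by $\partial f(x_{0})$ . It is clear from (9) that if $0\in\partial f(x_{0})$ , then $x_{0}$ is a minimizer of $f$ . For more details and background we refer interested readers to Boyd and Vandenberghe (2004); Bauschke and Combettes (2011); Barbu and Precupanu (2012); Shor (2012). We show in the supplemental material that the subgradient of the target function (1) is

$$\frac{\partial L_{\lambda}(\boldsymbol{\beta})}{\partial\beta_{i}}=K\left(-N^{-1}{\bf X}_{.i}^{\top}({\bf Y}-{\bf X}\boldsymbol{\beta})\right)+\lambda_{K}L^{2}(\beta_{i})+\lambda_{H}{\tilde{w}}_{i}\begin{cases}K(\beta_{i})\|\beta_{i}\|_{{\mathbb{H}}}^{-1}&\beta_{i}\neq 0,\\ \{h;\ \|K^{-1/2}(h)\|_{{\mathbb{K}}}\leq 1\}&\beta_{i}=0,\end{cases}$$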

where ${\bf X}_{.i}^{\top}=\left(X_{1i},\dots,X_{Ni}\right)\in{\mathbb{R}}^{N}$ is the $i^{th}$ column of the design matrix ${\bf X}$ . Then we can conclude the following useful lemma.

Lemma 1.

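The AFSSEN estimate satisfies the equations

$$\begin{cases}\hat{\beta}_{i}=0,&\|\widecheck{\beta}_{i}\|_{{\mathbb{H}}}\leq\lambda_{H}{\tilde{w}}_{i},\\ \hat{\beta}_{i}=\left(\left(1+\dfrac{\lambda_{H}{\tilde{w}}_{i}}{\|\hat{\beta}_{i}\|_{{\mathbb{H}}}}\right)\mathds{I}+\lambda_{K}K^{-1}L^{2}\right)^{-1}\widecheck{\beta}_{i},&\|\widecheck{\beta}_{i}\|_{{\mathbb{H}}}>\lambda_{H}{\tilde{w}}_{i},\end{cases}$$

where $\mathds{I}$ is the identity operator from ${\mathbb{H}}$ to ${\mathbb{H}}$ and $\widecheck{\beta}_{i}=N^{-1}\sum\limits_{n=1}^{N}X_{ni}\left(Y_{n}-\sum\limits_{j\neq i}X_{nj}\hat{\beta}_{j}\right)$ .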
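The step from the subgradient to Lemma 1 can be sketched as follows for an active coordinate ($\hat{\beta}_{i}\neq 0$). This is a hedged outline rather than the paper's proof: it assumes that "standardized columns" means $N^{-1}{\bf X}_{.i}^{\top}{\bf X}_{.i}=1$ and that $K^{-1}$ can be applied to both sides of the stationarity condition; the full argument is in the supplemental material.

```latex
\begin{align*}
0 &= K\!\left(-N^{-1}\mathbf{X}_{.i}^{\top}(\mathbf{Y}-\mathbf{X}\hat{\boldsymbol{\beta}})\right)
     + \lambda_{K}L^{2}(\hat{\beta}_{i})
     + \lambda_{H}\tilde{w}_{i}\,K(\hat{\beta}_{i})\,\|\hat{\beta}_{i}\|_{\mathbb{H}}^{-1}
     && \text{(stationarity)} \\
N^{-1}\mathbf{X}_{.i}^{\top}(\mathbf{Y}-\mathbf{X}\hat{\boldsymbol{\beta}})
  &= \widecheck{\beta}_{i}-\hat{\beta}_{i}
     && \text{(since } N^{-1}\mathbf{X}_{.i}^{\top}\mathbf{X}_{.i}=1\text{)} \\
\widecheck{\beta}_{i}
  &= \left(1+\frac{\lambda_{H}\tilde{w}_{i}}{\|\hat{\beta}_{i}\|_{\mathbb{H}}}\right)\hat{\beta}_{i}
     + \lambda_{K}K^{-1}L^{2}(\hat{\beta}_{i})
     && \text{(apply } K^{-1}\text{ and rearrange)}
\end{align*}
```

Inverting the operator on the right-hand side of the last line gives the second case of Lemma 1. Taking $\hat{\beta}_{i}=0$ in the set-valued branch of the subgradient gives the threshold $\|\widecheck{\beta}_{i}\|_{{\mathbb{H}}}\leq\lambda_{H}{\tilde{w}}_{i}$ of the first case.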

The only challenge in using Lemma 1 is that $\|\hat{\beta}_{i}\|_{{\mathbb{H}}}$ appears on the right-hand side. The following Lemma 2 shows how to compute it.

Lemma 2.

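The norm $\|\hat{\beta}_{i}\|_{{\mathbb{H}}}$ in Lemma 1 can be found numerically by solving

$$1=\sum_{j=1}^{\infty}\frac{\langle\widecheck{\beta}_{i},v_{j}\rangle^{2}}{\left((1+\lambda_{K}\eta_{j}^{2}\theta_{j}^{-1})\|\hat{\beta}_{i}\|_{{\mathbb{H}}}+\lambda_{H}{\tilde{w}}_{i}\right)^{2}},$$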

where $\eta_{j}$ and $\theta_{j}$ are the eigenvalues of the operators $L$ and $K$ , respectively, for the common eigenfunction $v_{j}$ . One can now run the coordinate descent algorithm iteratively and obtain a sequence of estimates $\boldsymbol{\hat{\beta}}^{(t)}$ that converges to the desired $\boldsymbol{\hat{\beta}}$ as $t$ grows. In practice, we follow an approach similar to the one outlined in FLAME (Parodi and Reimherr, 2017). We run our algorithm in a nonadaptive and an adaptive step. First, in the nonadaptive step, we set all ${\tilde{w}}_{i}=1$ and find the estimates $\hat{\beta}_{i}^{ndp}$ . Then, for the adaptive step, we set ${\tilde{w}}_{i}=\nicefrac{{1}}{{\|\hat{\beta}_{i}^{ndp}\|_{{\mathbb{K}}}}}$ .

To choose the penalty parameters, we select $\lambda_{K}$ from $\{10,1,0.01,0.0001,0\}$ and $\lambda_{H}$ from 100 points between $\lambda_{max}$ and $r_{\lambda}\lambda_{max}$ , where $\lambda_{max}$ is the smallest tuning parameter such that all parameters are set to zero and $r_{\lambda}$ is a specified ratio. To increase computational efficiency, for any fixed $\lambda_{K}$ we start with $\lambda_{H}=\lambda_{max}$ and the initial value $\boldsymbol{\beta}=0$ . We then decrease $\lambda_{H}$ and, at each step, use a warm start, meaning that the previous estimate $\boldsymbol{\hat{\beta}}$ is used as the initial value of $\boldsymbol{\beta}$ . We employ a kill switch: the iterative process is stopped once the number of active predictors exceeds a chosen threshold (since one is searching for sparse solutions). Small changes in $\lambda_{H}$ combined with a warm start imply very quick convergence of $\boldsymbol{\hat{\beta}}$ at each step. It is also more efficient to define a maximum number of iterations $T$ for $t$ and a threshold on the change in the parameter estimates, $\|\boldsymbol{\hat{\beta}}^{(t)}-\boldsymbol{\hat{\beta}}^{(t-1)}\|_{{\mathbb{H}}^{I}}$ , as stopping criteria.
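The coordinate update of Lemma 1, with the norm obtained from Lemma 2, can be sketched in Python as follows. This is a minimal illustration in eigen-coordinates and not the authors' R/C++ implementation: the bisection root-finder, the function names, and the argument layout are assumptions, and the sweep recomputes partial residuals naively for clarity rather than speed. Columns of `X` are assumed standardized so that $N^{-1}\sum_{n}X_{ni}^{2}=1$.

```python
import math

def update_block(check, theta, eta, lam_K, lam_H, w, tol=1e-12):
    """One coordinate update: Lemma 1, with ||beta_hat_i||_H solved from Lemma 2.

    check : coefficients <beta_check_i, v_j> of the partial-residual function
    """
    norm_check = math.sqrt(sum(c * c for c in check))
    if norm_check <= lam_H * w:
        return [0.0] * len(check)            # predictor is thresholded to zero
    a = [1.0 + lam_K * e * e / t for e, t in zip(eta, theta)]

    def g(r):                                # Lemma 2 reads g(r) = 1 at the solution
        return sum(c * c / (aj * r + lam_H * w) ** 2 for c, aj in zip(check, a))

    # g is decreasing, exceeds 1 near r = 0 and is at most 1 at r = norm_check.
    lo, hi = 0.0, norm_check
    while hi - lo > tol * norm_check:
        mid = 0.5 * (lo + hi)
        if g(mid) > 1.0:
            lo = mid
        else:
            hi = mid
    r = 0.5 * (lo + hi)
    # Lemma 1 in eigen-coordinates: divide by (1 + lam_H w / r) + lam_K eta_j^2 / theta_j
    return [c / (aj + lam_H * w / r) for c, aj in zip(check, a)]

def coordinate_descent(Y, X, theta, eta, lam_K, lam_H, w,
                       beta=None, max_iter=100, tol=1e-3):
    """Cyclic coordinate descent; pass a previous fit as `beta` for a warm start."""
    N, I, M = len(Y), len(X[0]), len(theta)
    beta = [[0.0] * M for _ in range(I)] if beta is None else [b[:] for b in beta]
    for _ in range(max_iter):
        change = 0.0
        for i in range(I):
            # beta_check_i = N^{-1} sum_n X_ni (Y_n - sum_{k != i} X_nk beta_k)
            check = [sum(X[n][i] * (Y[n][j] - sum(X[n][k] * beta[k][j]
                                                  for k in range(I) if k != i))
                         for n in range(N)) / N
                     for j in range(M)]
            new = update_block(check, theta, eta, lam_K, lam_H, w[i])
            change += sum((p - q) ** 2 for p, q in zip(new, beta[i]))
            beta[i] = new
        if math.sqrt(change) <= tol:         # stop when the H^I change is small
            break
    return beta
```

With `lam_K=0` the update reduces to the familiar group soft-thresholding rule, which gives a quick sanity check of the root-finder.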
We use 10-fold cross-validation to find the optimal values of $\lambda_{K}$ and $\lambda_{H}$ . Finally, we repeat each setting 100 times and report the average prediction error $\left(\sum_{n=1}^{N}\|{\bf X}_{n.}^{\top}\boldsymbol{\hat{\beta}}-{\bf X}_{n.}^{\top}\boldsymbol{{\beta}}^{\star}\|_{{\mathbb{H}}}\right)$ , the average prediction error of the derivatives $\left(\sum_{n=1}^{N}\|{\bf X}_{n.}^{\top}\boldsymbol{\hat{\beta}^{{}^{\prime}}}-{\bf X}_{n.}^{\top}{\boldsymbol{\beta}^{\star}}^{{}^{\prime}}\|_{{\mathbb{H}}}\right)$ , and the numbers of true and false positive predictors.
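The prediction error reported above is straightforward to compute when the estimated and true coefficient functions are stored as coefficients in a common orthonormal basis of ${\mathbb{H}}$. The following sketch assumes that representation; the function name and data layout are hypothetical.

```python
import math

def prediction_error(X, beta_hat, beta_star):
    """sum_n || X_n^T beta_hat - X_n^T beta_star ||_H in basis coordinates.

    X         : N x I design matrix
    beta_hat  : I x M estimated coefficients, beta_star : I x M true coefficients
    """
    M = len(beta_star[0])
    total = 0.0
    for row in X:
        # Coordinates of the difference between fitted and true mean functions.
        diff = [sum(x * (bh[j] - bs[j]) for x, bh, bs in zip(row, beta_hat, beta_star))
                for j in range(M)]
        total += math.sqrt(sum(d * d for d in diff))
    return total
```

The derivative version has the same form once `beta_hat` and `beta_star` are replaced by the basis coefficients of the derivative functions.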

5 Empirical Study

In this section, we compare the performance of AFSSEN with FLAME in two high-dimensional simulation settings, one with rougher and one with smoother $\beta^{\star}$ coefficients. Mimicking FLAME, we generate $N=500$ functional observations from ${\mathbb{H}}=L^{2}[0,1]$ and $I=1000$ scalar predictors, of which $I_{0}=10$ are significant. The design matrix ${\mathbf{X}}$ is generated using standard normal random variables. The observation errors $\varepsilon_{n}(t)$ are generated according to a mean-zero Matern process with parameters $(\nu=\nicefrac{{3}}{{2}},\textrm{range}=\nicefrac{{1}}{{4}},\sigma^{2}=1)$ . Here we consider the four different RKHS kernels, $K$ , in (2) with varying range parameters: $\{0.5,1,2,4,8,16,32\}$ . Denote by $\theta_{1}\geq\theta_{2}\geq\dots\geq 0$ and $v_{1},v_{2},\dots\in{\mathbb{H}}$ the ordered eigenvalues and corresponding eigenfunctions of $K$ ; we use the eigenfunctions, computed numerically on a grid of $m=50$ evenly spaced points between 0 and 1, as an orthonormal basis of ${\mathbb{H}}$ . They allow us to compute $\|.\|_{{\mathbb{H}}}$ and $\|.\|_{{\mathbb{K}}}$ quickly. To improve computational efficiency, we keep the number $M$ of FPCs that explains at least $99\%$ of the variability in FPCA, that is, $\sum_{i=1}^{M}\theta_{i}\geq 0.99\sum_{i=1}^{\infty}\theta_{i}$ . We also use $r_{\lambda}=10^{-6}$ , a threshold of $0.001$ as the stopping criterion for the coefficient increments ( $\|\boldsymbol{\hat{\beta}}^{(T)}-\boldsymbol{\hat{\beta}}^{(T-1)}\|_{{\mathbb{H}}^{I}}\leq 0.001$ ), and a kill switch of $2I_{0}=20$ for the maximum number of nonzero predictors before we stop decreasing $\lambda_{H}$ .
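
The numerical eigenbasis, the $99\%$ truncation rule, and the two norms can be sketched as below. This is a hedged illustration only: a Matern covariance with $\nu=\nicefrac{3}{2}$ in one common parametrization stands in for the kernel $K$ (the four kernels of (2) are not reproduced here), the infinite sum is replaced by the $m=50$ grid eigenvalues, and all function names are hypothetical.

```python
import numpy as np

def matern32(s, t, range_par, sigma2=1.0):
    """Matern covariance with nu = 3/2 (one common parametrization, assumed here)."""
    d = np.abs(s[:, None] - t[None, :])
    a = np.sqrt(3.0) * d / range_par
    return sigma2 * (1.0 + a) * np.exp(-a)

def eigen_basis(kernel_matrix):
    """Eigenvalues (descending) and grid-evaluated eigenfunctions of the discretized operator."""
    m = kernel_matrix.shape[0]
    evals, evecs = np.linalg.eigh(kernel_matrix / m)   # 1/m acts as a quadrature weight on [0, 1]
    order = np.argsort(evals)[::-1]
    theta = np.clip(evals[order], 0.0, None)
    v = evecs[:, order] * np.sqrt(m)                    # so that (1/m) * sum_k v_j(t_k)^2 = 1
    return theta, v

def n_components(theta, level=0.99):
    """Smallest M whose leading eigenvalues explain at least `level` of the total."""
    explained = np.cumsum(theta) / np.sum(theta)
    return int(np.searchsorted(explained, level) + 1)

def norms(x, theta, v, M):
    """Approximate H and K norms of a curve x sampled on the grid, using M components."""
    a = v[:, :M].T @ x / len(x)                         # coordinates <x, v_j>_H
    return np.sqrt(np.sum(a**2)), np.sqrt(np.sum(a**2 / theta[:M]))

grid = np.linspace(0.0, 1.0, 50)
theta, v = eigen_basis(matern32(grid, grid, range_par=0.25))
M = n_components(theta)
h_norm, k_norm = norms(np.sin(2 * np.pi * grid), theta, v, M)
```

Working in these coordinates reduces both norms to weighted Euclidean norms, which is what makes the repeated norm evaluations inside the algorithm cheap.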

5.1 Rough Setting

In this scenario, the true coefficients $\beta_{i}^{\star}(t)$ are sampled from a mean-zero Matern process with parameters $(\nu=\nicefrac{{5}}{{2}},\textrm{range}=\nicefrac{{1}}{{4}},\sigma^{2}=1)$ . The results are presented in Figure 1. In terms of average prediction error and prediction error of the derivatives, AFSSEN performs 5 to 10 times better than FLAME. Its errors also appear stable across range parameters for all kernels, which is a significant improvement over FLAME. The variable selection behavior of AFSSEN is less consistent: it works better for smaller range parameters, where it can beat FLAME, while FLAME appears to be the winner for larger range parameters. However, in all AFSSEN settings, the average number of false positive predictors remains below one, which indicates fairly small uncertainty. Lastly, the rougher kernel, i.e. the exponential, seems to be more efficient in AFSSEN than the others for rough $\beta_{i}^{\star}$ coefficients.

5.2 Smooth Setting

For the smooth setting, we generate the true coefficients from a mean-zero Matern process with parameters $(\nu=\nicefrac{{7}}{{2}},\textrm{range}=1,\sigma^{2}=1)$ and keep the other parameters the same as in the rough setting. Figure 2 illustrates that AFSSEN performs $50$ to $200\%$ better than FLAME in prediction error and prediction error of the derivatives. For false positives, AFSSEN beats FLAME but still cannot force their number to zero. For true positives, AFSSEN works better for smaller range parameters but not for larger ones. In the smooth setting, smoother kernels appear to yield smaller prediction errors but perform worse in variable selection. A final remark is that AFSSEN is more consistent than FLAME in prediction errors and in the number of true positives. In terms of false positives, however, FLAME is more consistent but has higher values than AFSSEN.

6 Conclusion

We have presented a method, called AFSSEN, which controls both the dimension reduction and the smoothness of the parameter estimates in a high-dimensional function-on-scalar linear model with sub-Gaussian errors. In our work, the parameters live in a Reproducing Kernel Hilbert Space (RKHS), ${\mathbb{K}}$ , and inherit its properties, such as smoothness or periodicity, while the data are not required to lie in the RKHS. We showed that, under mild assumptions, our parameter estimates and the true parameters have the same support, and that the strong functional oracle property is then achieved under both the ${\mathbb{H}}$ norm and the ${\mathbb{K}}$ norm. Using a simulation study, we demonstrated a large improvement in prediction error, prediction error of the derivatives, and consistency for AFSSEN compared with previous work. Additionally, in terms of true and false positives, we showed that AFSSEN beats FLAME for smooth coefficient functions and performs highly reliably in the rough case.

Supplementary Material

In this section, we provide the proofs of the lemmas and theorems discussed in the main text. We start by defining some notation that is needed throughout this section.

Definition 1.

Let $K:{\mathbb{H}}\to{\mathbb{H}}$ be an operator. We define the coordinate wise extension, $K_{M}$ from ${\mathbb{H}}^{M}$ to ${\mathbb{H}}^{M}$ as

[TABLE]

We define $L_{M}:{\mathbb{K}}^{M}\to{\mathbb{H}}^{M}$ analogously.

Definition 2.

Let $\Sigma\in{\mathbb{R}}^{M\times M}$ be a matrix. Then, using an abuse of notation, we define the linear operation $\Sigma:{\mathbb{H}}^{M}\to{\mathbb{H}}^{M}$ as

[TABLE]

Note that, as defined, the operators $\Sigma$ and $K_{M}$ are interchangeable in the sense that $\Sigma K_{M}{\bf h}=K_{M}\Sigma{\bf h}$ . The following lemmas are needed for demonstrating the main theoretical properties.
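
As a quick numerical illustration of the interchangeability noted above, the sketch below represents an element of ${\mathbb{H}}^{M}$ by an $M\times m$ array of basis coordinates. It assumes the usual reading of the two definitions, namely that $K_{M}$ applies $K$ to each coordinate function and that $\Sigma$ forms linear combinations $(\Sigma{\bf h})_{i}=\sum_{j}\Sigma_{ij}h_{j}$ ; the helper names and the random matrices are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
M, m = 3, 6                               # M functions, each with m basis coordinates
Sigma = rng.standard_normal((M, M))       # real M x M matrix
A = rng.standard_normal((m, m))
K = A @ A.T                               # symmetric stand-in for the operator K
h = rng.standard_normal((M, m))           # row i holds the coordinates of h_i

def apply_Sigma(S, h):
    """(Sigma h)_i = sum_j Sigma_ij h_j: mixes the M functions."""
    return S @ h

def apply_KM(K, h):
    """Coordinate-wise extension: applies K to every h_i separately."""
    return h @ K.T

lhs = apply_Sigma(Sigma, apply_KM(K, h))  # Sigma K_M h
rhs = apply_KM(K, apply_Sigma(Sigma, h))  # K_M Sigma h
```

The two operators act on different "sides" of the array (rows versus columns), which is why they commute.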

Lemma 3.

Let $\partial f(x)$ denote the subdifferential of a functional $f:{\mathbb{K}}\to{\mathbb{R}}$ at $x$ . Then we have the following.

1. Consider the functional $f(x)=\|x\|_{{\mathbb{H}}}^{2}$ . Then $f$ is convex and everywhere differentiable with respect to $\|.\|_{{\mathbb{K}}}$ with

[TABLE]

2. Consider the functional $f(x)=\|x\|_{{\mathbb{H}}}$ . Then $f$ is convex and differentiable with respect to $\|.\|_{{\mathbb{K}}}$ when $x\neq 0$ with

[TABLE]

and when $x=0$ with

[TABLE]

3. Consider the functional $f(x)=\|L(x)\|_{{\mathbb{K}}}^{2}$ where $L$ is a self-adjoint linear operator from ${\mathbb{K}}$ to ${\mathbb{K}}$ . Then $f$ is convex and everywhere differentiable with respect to $\|.\|_{{\mathbb{K}}}$ with

[TABLE]

Proof.

Recall that if $f:{\mathbb{K}}\rightarrow{\mathbb{R}}$ is a convex functional, $h\in{\mathbb{K}}$ is called a subgradient of $f$ at $x\in{\mathbb{K}}$ with respect to $\|.\|_{{\mathbb{K}}}$ when

[TABLE]

and the collection of all subgradients of $f$ at $x\in{\mathbb{K}}$ is called the subdifferential of $f$ at $x$ and denoted by $\partial f(x)$ .

Part 1: According to the fact that $\|x\|_{{\mathbb{H}}}^{2}=\|K^{\nicefrac{{1}}{{2}}}(x)\|_{{\mathbb{K}}}^{2}$ , we need to prove

[TABLE]

The right hand side can be written as

[TABLE]

and the left hand side is

[TABLE]

So the Cauchy-Schwarz inequality gives the desired result.
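
The identity used in Part 1 can be checked numerically in eigen-coordinates. The sketch below is an illustration under stated assumptions: elements are represented by their coordinates $a_{j}=\langle x,v_{j}\rangle_{{\mathbb{H}}}$ , so that $\langle x,y\rangle_{{\mathbb{K}}}=\sum_{j}a_{j}b_{j}/\theta_{j}$ ; the eigenvalues `theta` are hypothetical; and the gradient is taken to be $2K(x)$ , which is our reading of the omitted display rather than a quotation of it.

```python
import numpy as np

rng = np.random.default_rng(1)
theta = 1.0 / np.arange(1, 21) ** 2        # hypothetical eigenvalues of K

def inner_H(a, b):
    return float(np.sum(a * b))

def inner_K(a, b):
    return float(np.sum(a * b / theta))

def K_pow(a, p):
    """Coordinates of K^p applied to the element with coordinates a."""
    return theta**p * a

x = rng.standard_normal(20) * theta
y = rng.standard_normal(20) * theta

# ||x||_H^2 equals ||K^{1/2} x||_K^2
norm_H_sq = inner_H(x, x)
norm_K_half_sq = inner_K(K_pow(x, 0.5), K_pow(x, 0.5))

# Subgradient inequality f(y) - f(x) >= <2 K x, y - x>_K for f = ||.||_H^2
gap = inner_H(y, y) - inner_H(x, x) - inner_K(2.0 * K_pow(x, 1.0), y - x)
```

In these coordinates the gap equals $\|y-x\|_{{\mathbb{H}}}^{2}$ , which is nonnegative, matching the convexity claim.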

Part 2: For $x\neq 0$ , we need to show that

[TABLE]

or equivalently

[TABLE]

which holds by the Cauchy-Schwarz inequality. Now assume $x=0$ . We need to find all $h\in{\mathbb{H}}$ such that

[TABLE]

By the following application of the Cauchy-Schwarz inequality,

[TABLE]

part 2 holds trivially when $\|K^{-\nicefrac{{1}}{{2}}}(h)\|_{{\mathbb{K}}}\leq 1$ .

Part 3: It is enough to show that

[TABLE]

The right hand side can be written as

[TABLE]

As in Part 1, the inequality follows from the Cauchy-Schwarz inequality. ∎

Lemma 4.

The subgradient of the target function (1) is

[TABLE]

where ${\bf X}_{.i}^{\top}=\left(X_{1i},\dots,X_{Ni}\right)\in{\mathbb{R}}^{N}$ is the $i^{th}$ column of the design matrix ${\bf X}$ .

Proof.

[TABLE]

where $Y_{n}\in{\mathbb{H}}$ is the $n^{th}$ observation and ${\bf X}_{n.}^{\top}=\left(X_{n1},\dots,X_{nI}\right)\in{\mathbb{R}}^{I}$ is the $n^{th}$ row of the design matrix ${\bf X}$ . According to Lemma 3, we can take the subgradient of $L_{\lambda}(\beta)$ for any $\beta_{i}$ with respect to $\|.\|_{\mathbb{K}}$ as follows

[TABLE]

∎

We now introduce a lemma that plays an important role in the proof of the functional oracle property.

Lemma 5.

Assume that the AFSSEN estimate $\boldsymbol{\hat{\beta}}$ and the true parameters $\boldsymbol{\beta}^{\star}$ have the same support, ${\mathcal{S}}=\{1,\dots,I_{0}\}$ . Then the nonzero part of $\boldsymbol{\hat{\beta}}=(\boldsymbol{\hat{\beta}}_{1},\textbf{0})$ can be written concisely as

[TABLE]

where

[TABLE]

Proof.

Write $\boldsymbol{\beta}^{\star}=(\boldsymbol{\beta}^{\star}_{1},\textbf{0})$ with true support ${\mathcal{S}}=\{1,\dots,I_{0}\}$ . We assumed that the AFSSEN estimate $\boldsymbol{\hat{\beta}}=(\boldsymbol{\hat{\beta}}_{1},\textbf{0})$ has the same support as $\boldsymbol{\beta}^{\star}$ , so $\hat{\beta}_{i}\neq 0$ for all $i\in{\mathcal{S}}$ and $\hat{\beta}_{i}=0$ for $i\not\in{\mathcal{S}}$ . Since $\boldsymbol{\hat{\beta}}$ is the minimizer of the convex function (1), Lemma 4 gives, for $i\not\in{\mathcal{S}}$ ,

[TABLE]

So the above equality holds when

[TABLE]

or equivalently

[TABLE]

On the other hand, when $\|\dfrac{1}{N}{\bf X}_{.i}^{\top}({\bf Y}-{\bf X}\boldsymbol{\hat{\beta}})\|_{{\mathbb{H}}}>\lambda_{H}{\tilde{w}}_{i}$ , we have $\hat{\beta}_{i}\neq 0$ for $i\in{\mathcal{S}}$ and then

[TABLE]

According to Definition 1, ${\hat{\small\Sigma}}_{11}=\frac{1}{N}{\bf X}_{1}^{\top}{\bf X}_{1}\in{\mathbb{R}}^{I_{0}\times I_{0}}$ and $\tilde{\boldsymbol{s}}=\{\tilde{w}_{i}\hat{\beta}_{i}\|\hat{\beta}_{i}\|_{{\mathbb{H}}}^{-1};i\in{\mathcal{S}}\}$ we have

[TABLE]

According to Definition 2, we can simplify it by

[TABLE]

then

[TABLE]

Substituting ${\bf Y}={\bf X}_{1}\boldsymbol{\beta}^{\star}_{1}+\boldsymbol{\epsilon}$ , we obtain

[TABLE]

Finally, introducing $G_{I_{0}}=\left({\hat{\small\Sigma}}_{11}\mathds{I}_{I_{0}}+\lambda_{K}K_{I_{0}}^{-1}L_{I_{0}}^{2}\right)^{-1}$ as a linear operator from ${\mathbb{H}}^{I_{0}}$ to ${\mathbb{H}}^{I_{0}}$ , we obtain

[TABLE]

∎

We now introduce the following lemmas, which are useful in the proof of Theorem 1.

Lemma 6.

Let $\boldsymbol{\epsilon}=(\epsilon_{1},\dots,\epsilon_{N})$ , where the $\epsilon_{i}$ are independent mean-zero $C$ -subgaussian processes in ${\mathbb{H}}$ , and let ${\bf T}$ be an arbitrary operator from ${\mathbb{H}}^{N}$ to ${\mathbb{H}}$ . Then ${\bf T}\boldsymbol{\epsilon}$ is a $C_{T}$ -subgaussian process in ${\mathbb{H}}$ with

[TABLE]

Proof.

Since the $\epsilon_{i}$ are independent, we can write

[TABLE]

∎

The following lemma can be considered as an extension of the lemma used in (Parodi and Reimherr, 2017) for $C$ -subgaussian noise.

Lemma 7.

Let $X$ be a mean-zero $C$ -subgaussian process in a Hilbert space ${\mathbb{H}}$ . Then we have

[TABLE]

where $\|C\|_{1}$ , $\|C\|_{2}$ and $\|C\|_{\infty}$ represent $\sum\limits_{i=1}^{\infty}\gamma_{i}$ , $\sqrt{\sum\limits_{i=1}^{\infty}\gamma_{i}^{2}}$ and $\max\limits_{i}{\gamma_{i}}$ respectively, and the $\gamma_{i}$ are the eigenvalues of the covariance operator $C$ .

Proof.

The idea is the same as in (Barber et al., 2017). Let $\gamma_{i}>0$ and $\psi_{i}\in{\mathbb{H}}$ denote the eigenvalues and corresponding eigenfunctions of $C$ . According to the KL expansion theorem,

[TABLE]

where $Z_{j}$ is a subgaussian process in ${\mathbb{R}}$ (Antonioni, 1997) with parameter $\left\langle\dfrac{\psi_{j}}{\sqrt{\gamma_{j}}},\dfrac{C(\psi_{j})}{\sqrt{\gamma_{j}}}\right\rangle=1$ . Define the events

[TABLE]

Since $\|C\|_{1}\geq\sum\limits_{i=1}^{J}\gamma_{i}$ and $\|C\|_{2}^{2}\geq\sum\limits_{i=1}^{J}\gamma_{i}^{2}$ , based on (Hsu et al., 2012) we can see

[TABLE]

Since $A_{1}\subset A_{2}\subset\ldots$ and using continuity from below, we can conclude

[TABLE]

∎
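
A small Monte Carlo experiment can make a bound of this type concrete. The sketch below is illustrative only and rests on assumptions: it takes the bound to have the form of the Hsu et al. (2012) inequality, $P\left(\|X\|^{2}>\|C\|_{1}+2\|C\|_{2}\sqrt{t}+2\|C\|_{\infty}t\right)\leq e^{-t}$ , which may differ in constants from the omitted display; it uses Gaussian scores as a special subgaussian case; and the eigenvalues `gamma` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
gamma = 1.0 / np.arange(1, 51) ** 2            # hypothetical eigenvalues of C
Z = rng.standard_normal((20000, gamma.size))   # Gaussian scores Z_j
sq_norm = (Z**2) @ gamma                       # ||X||^2 = sum_j gamma_j Z_j^2

c1 = gamma.sum()                               # ||C||_1
c2 = np.sqrt((gamma**2).sum())                 # ||C||_2
cinf = gamma.max()                             # ||C||_inf

results = {}
for t in (0.5, 1.0, 2.0, 4.0):
    threshold = c1 + 2.0 * c2 * np.sqrt(t) + 2.0 * cinf * t
    empirical = float(np.mean(sq_norm > threshold))
    results[t] = (empirical, float(np.exp(-t)))   # (empirical tail, assumed bound)
```

Comparing the two entries of each pair in `results` shows how conservative the assumed bound is for this particular spectrum.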

Lemma 8.

If $Q$ is an operator in ${\mathbb{H}}$ such that $\|Q^{2}x\|\leq\|Qx\|$ for any $x\in{\mathbb{H}}$ , the eigenvalues of $Q$ will be in $[0,1]$ .

Proof.

Let $\theta_{i}$ and $v_{i}$ denote the eigenvalues and eigenfunctions of $Q$ , so that $Qv_{i}=\theta_{i}v_{i}$ and then

[TABLE]

Then

[TABLE]

or equivalently

[TABLE]

which implies $0\leq\theta_{i}\leq 1$ . ∎

Lemma 9.

Assume that $Q$ is a continuous linear operator and $C$ a covariance operator over ${\mathbb{H}}$ . For an arbitrary covariance operator, $A$ , let $\|A\|_{m}$ denote the $m$ -norm of the eigenvalues of $A$ . Then we have that

[TABLE]

where $\|Q\|_{op}$ is the operator norm of $Q$ , equivalently the largest singular value of $Q$ .

Proof.

Let $\theta^{\prime}_{i}$ denote the eigenvalues of $QCQ^{\star}$ . Then we have

[TABLE]

where ${\mathbb{H}}_{i}=\{x;\ \ \|x\|_{{\mathbb{H}}}=1\ \ \&\ \ \langle x,v_{j}\rangle_{{\mathbb{H}}}=0\ \ \ \forall j=1,\dots,i-1\}$ . Since $C$ is a covariance operator, it is self-adjoint with positive eigenvalues, then we have

[TABLE]

So based on $\theta_{i}^{\prime}\leq\|Q\|_{op}^{2}\theta_{i}$ , we can conclude

[TABLE]

then we will have

[TABLE]

∎
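
The eigenvalue comparison $\theta_{i}^{\prime}\leq\|Q\|_{op}^{2}\theta_{i}$ used in this proof can be checked on random finite-dimensional examples. The sketch below is only an illustration with hypothetical matrices standing in for $Q$ and $C$ ; the three norm comparisons it computes follow from the eigenvalue inequality and are our reading of what the lemma asserts.

```python
import numpy as np

rng = np.random.default_rng(2)
d = 8
A = rng.standard_normal((d, d))
C = A @ A.T / d                                   # symmetric PSD stand-in for a covariance operator
Q = rng.standard_normal((d, d))                   # stand-in for a continuous linear operator

theta = np.sort(np.linalg.eigvalsh(C))[::-1]              # eigenvalues of C, descending
theta_prime = np.sort(np.linalg.eigvalsh(Q @ C @ Q.T))[::-1]  # eigenvalues of Q C Q*
q_op = np.linalg.norm(Q, 2)                       # operator norm = largest singular value

def eig_norms(ev):
    """1-, 2- and infinity-norms of a vector of eigenvalues."""
    return np.array([ev.sum(), np.sqrt((ev**2).sum()), ev.max()])

lhs_norms = eig_norms(theta_prime)
rhs_norms = q_op**2 * eig_norms(theta)
```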

Finally for some technical proofs, we recall the following lemma from Barber et al. (2017).

Lemma 10.

If Assumption 2 holds, the FSL estimate $\tilde{\beta_{i}}$ and true coefficient $\beta_{i}^{\star}$ satisfy

[TABLE]

where $r_{N}=\dfrac{I_{0}\log(I)}{N}$ .

Proof of Lemma 1

Fix an $i\in\{1,\dots,I\}$ . We want to find the $\hat{\beta}_{i}$ that minimizes the target function (1). The idea is the same as in Lemma 5, for the univariate case. Let $\widecheck{\beta}_{i}=\dfrac{1}{N}{\sum\limits}_{n=1}^{N}X_{ni}(Y_{n}-{\sum\limits}_{j\neq i}X_{nj}\hat{\beta}_{j})$ . According to Lemma 4, when $\hat{\beta}_{i}=0$

[TABLE]

So the above equality holds when $\|K^{\frac{1}{2}}(\widecheck{\beta}_{i})\|_{{\mathbb{K}}}\leq\lambda_{H}{\tilde{w}}_{i}$ or equivalently $\|\widecheck{\beta}_{i}\|_{{\mathbb{H}}}\leq\lambda_{H}{\tilde{w}}_{i}$ .

On the other hand, when $\|\widecheck{\beta}_{i}\|_{{\mathbb{H}}}\geq\lambda_{H}{\tilde{w}}_{i}$ , we have

[TABLE]

Proof of Lemma 2

Taking $\|.\|_{{\mathbb{H}}}^{2}$ of both sides of equation (10),

[TABLE]

For ease of notation, we use $A=\left((1+\dfrac{\lambda_{H}{\tilde{w}}_{i}}{\|\hat{\beta}_{i}\|_{{\mathbb{H}}}})\mathds{I}+\lambda_{K}K^{-1}L^{2}\right)^{-1}$ for the following parts.

[TABLE]

or equivalently

[TABLE]
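
The two proofs above suggest a simple recipe for a single coordinate update in the common eigenbasis. The sketch below is a hedged illustration, not the authors' code: it assumes the relation $\hat{\beta}_{i}=A\,\widecheck{\beta}_{i}$ with $A$ as defined above, that $K^{-1}L^{2}$ has eigenvalues $\eta_{j}^{2}/\theta_{j}$ , and it solves the resulting scalar equation for $r=\|\hat{\beta}_{i}\|_{{\mathbb{H}}}$ , namely $\sum_{j}b_{j}^{2}/\left(r(1+\lambda_{K}\eta_{j}^{2}/\theta_{j})+\lambda_{H}{\tilde{w}}_{i}\right)^{2}=1$ , by bisection. The function name and the choice of bisection are assumptions.

```python
import numpy as np

def afssen_coordinate_update(b, theta, eta, lam_H, lam_K, w, tol=1e-12, max_iter=200):
    """One coordinate update, with b the eigen-coordinates of beta-check_i.

    Returns the eigen-coordinates of beta-hat_i under the assumptions stated above.
    """
    b = np.asarray(b, dtype=float)
    b_norm = np.linalg.norm(b)                   # H norm of beta-check_i
    if b_norm <= lam_H * w:                      # thresholding step: predictor is dropped
        return np.zeros_like(b)
    c = 1.0 + lam_K * eta**2 / theta             # eigenvalues of I + lam_K K^{-1} L^2

    def g(r):                                    # decreasing in r, root is r = ||beta-hat_i||_H
        return np.sum(b**2 / (r * c + lam_H * w) ** 2) - 1.0

    lo, hi = 0.0, b_norm                         # g > 0 near 0 and g(b_norm) <= 0 since c >= 1
    for _ in range(max_iter):
        mid = 0.5 * (lo + hi)
        if g(mid) > 0:
            lo = mid
        else:
            hi = mid
        if hi - lo < tol:
            break
    r = 0.5 * (lo + hi)
    return b * r / (r * c + lam_H * w)
```

With $\lambda_{K}=0$ this reduces to group soft-thresholding, $\hat{\beta}_{i}=(1-\lambda_{H}{\tilde{w}}_{i}/\|\widecheck{\beta}_{i}\|_{{\mathbb{H}}})\widecheck{\beta}_{i}$ , which gives a convenient sanity check.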

Proof of Theorem 1

Part 1:

One can see

[TABLE]

where $\hat{{\mathcal{S}}}$ and ${\mathcal{S}}$ are the supports of the AFSSEN estimate and the true predictors, respectively. So (15) follows from

[TABLE]

or equivalently

[TABLE]

According to Lemma 5 we have

[TABLE]

where $G_{I_{0}}=\left({\hat{\small\Sigma}}_{11}\mathds{I}_{I_{0}}+\lambda_{K}K_{I_{0}}^{-1}L_{I_{0}}^{2}\right)^{-1}$ . In some sense, $-\lambda_{K}G_{I_{0}}K_{I_{0}}^{-1}L^{2}_{I_{0}}(\boldsymbol{\beta}^{\star}_{1})$ and $N^{-1}G_{I_{0}}({\bf X}_{1}^{\top}\boldsymbol{\epsilon})-\lambda_{H}G_{I_{0}}(\tilde{\boldsymbol{s}})$ play the role of Bias and Variance respectively. Since $\hat{\beta}_{i}=0$ for all $i\not\in S$ , we can see

[TABLE]

where

[TABLE]

So, using the results above, we can see that (15) is equivalent to

[TABLE]

It is easy to see that $\{\hat{{\mathcal{S}}}\neq{\mathcal{S}}\}\subseteq\bigcup\limits_{i=1}^{6}B_{i}$ where

[TABLE]

So, to prove Theorem 1, we just need to show that $P(B_{i})$ asymptotically goes to zero for all $i=1,\dots,6$ .

Step 1: $P(B_{1})\rightarrow 0$

Recall that

[TABLE]

where $G_{I_{0}}=\left({\hat{\small\Sigma}}_{11}\mathds{I}_{I_{0}}+\lambda_{K}K_{I_{0}}^{-1}L_{I_{0}}^{2}\right)^{-1}$ . We aim to show that $\dfrac{\lambda_{K}\|e_{i}^{\top}G_{I_{0}}K_{I_{0}}^{-1}L^{2}_{I_{0}}(\boldsymbol{\beta}^{\star}_{1})\|_{{\mathbb{H}}}}{\|{\beta}_{i}^{\star}\|_{{\mathbb{H}}}}\rightarrow 0$ . We can write

[TABLE]

where $d_{N}=\max\limits_{i\in{\mathcal{S}}}\|{\beta}_{i}^{\star}\|_{{\mathbb{K}}}$ and $b_{N}=\min\limits_{i\in{\mathcal{S}}}\|{\beta}_{i}^{\star}\|_{{\mathbb{H}}}$ . So we need to find an upper bound for $\|G_{I_{0}}K_{I_{0}}^{-1/2}L^{2}_{I_{0}}\|_{op}$ , which is the maximum eigenvalue of $G_{I_{0}}K_{I_{0}}^{-1/2}L^{2}_{I_{0}}$ . According to the tensor product definition (Kokoszka and Reimherr, 2017), we have

[TABLE]

where ${\bf I}_{I_{0}}$ is the $I_{0}$ by $I_{0}$ identity matrix. To find the eigenvalues of $G_{I_{0}}K_{I_{0}}^{-1/2}L^{2}_{I_{0}}$ , let $u_{i}$ denote the eigenvectors of ${\hat{\small\Sigma}}_{11}$ and $v_{j}$ the eigenfunctions of $K$ and $L$ . Then

[TABLE]

where $\tau_{i}$ is the eigenvalue of ${\hat{\small\Sigma}}_{11}$ and then $(\tau_{i}\theta_{j}^{1/2}\eta_{j}^{-2}+\lambda_{K}\theta_{j}^{-1/2})^{-1}$ can be considered as the eigenvalues of $G_{I_{0}}K_{I_{0}}^{-1/2}L^{2}_{I_{0}}$ . Since

[TABLE]

the maximum value of (17) occurs at $\theta_{j}=\lambda_{K}\tau\eta_{1}^{2}$ . Then

[TABLE]

So we can conclude

[TABLE]

If we assume $\lambda_{K}\ll\dfrac{b_{N}^{2}}{d_{N}^{2}I_{0}}$ , we can easily see $P(B_{1})\rightarrow 0$ asymptotically.

Step 2: $P(B_{2})\rightarrow 0$

Recall that

[TABLE]

where $G_{I_{0}}=\left({\hat{\small\Sigma}}_{11}\mathds{I}_{I_{0}}+\lambda_{K}K_{I_{0}}^{-1}L_{I_{0}}^{2}\right)^{-1}$ . We notice that $B_{2}=\bigcup\limits_{i\in{\mathcal{S}}}A_{i}$ such that

[TABLE]

where

[TABLE]

is a continuous linear operator from ${\mathbb{H}}^{I_{0}}$ to ${\mathbb{H}}$ . Then we can see

[TABLE]

where $b_{N}=\min\limits_{i\in{\mathcal{S}}}\|{\beta}_{i}^{\star}\|_{{\mathbb{H}}}$ . So we just need to find an upper bound for the right hand side of (20).

Since $\boldsymbol{\epsilon}=(\epsilon_{1},\dots,\epsilon_{N})\in{\mathbb{H}}^{N}$ , where the $\epsilon_{i}$ are independent mean-zero $C$ -subgaussian processes in ${\mathbb{H}}$ , ${\bf X}_{1}^{\top}\boldsymbol{\epsilon}$ is $C_{1}$ -subgaussian in ${\mathbb{H}}^{I_{0}}$ such that

[TABLE]

where, since $C_{I_{0}}$ is applied coordinate wise, $\hat{\Sigma}_{11}$ and $C_{I_{0}}$ are interchangeable and thus this is a valid covariance matrix. Based on an extension of Lemma 6 in ${\mathbb{H}}^{I_{0}}$ , $Q_{i}({\bf X}_{1}^{\top}\boldsymbol{\epsilon})$ will be a $C_{q}$ -subgaussian process with

[TABLE]

According to Lemma 7 we have

[TABLE]

Then based on Lemma 9 and (19)

[TABLE]

So we need to find an upper bound for $\|G_{I_{0}}\|_{op}$ . Using tensor product notation as in Step 1, we have

[TABLE]

where $\mathds{I}$ is an identity operator from ${\mathbb{H}}$ to ${\mathbb{H}}$ and ${\bf I}_{I_{0}}$ is an identity $I_{0}$ by $I_{0}$ matrix. Now we can write the eigenvalues of $G_{I_{0}}$ as

[TABLE]

According to Assumption 5

[TABLE]

then we can conclude

[TABLE]

and finally we will have

[TABLE]

So we are looking to find a $\hat{t}$ such that

[TABLE]

Since $C$ is a covariance operator, it is nuclear, which implies that there exists a constant $D$ such that

[TABLE]

then

[TABLE]

We choose $\hat{t}=\dfrac{Nb_{N}^{2}}{9\tau^{3}D}$ . Therefore (20) can be written as

[TABLE]

According to Assumption 2, the right hand side of (23) goes to zero because $Nb_{N}^{2}\to\infty$ and

[TABLE]

Step 3: $P(B_{3})\rightarrow 0$

Recall that

[TABLE]

where $\tilde{\boldsymbol{s}}=\{\tilde{w}_{i}\hat{\beta}_{i}\|\hat{\beta}_{i}\|_{{\mathbb{H}}}^{-1};i\in{\mathcal{S}}\}$ . Our aim is to show that $\dfrac{\lambda_{H}\|e_{i}^{\top}G_{I_{0}}(\tilde{\boldsymbol{s}})\|_{{\mathbb{H}}}}{\|{\beta}_{i}^{\star}\|_{{\mathbb{H}}}}\rightarrow 0$ . By using the Assumption 2 we can write

[TABLE]

So we need to find the upper bounds of $\|G_{I_{0}}\|_{op}$ and $\|\tilde{\boldsymbol{s}}\|_{{\mathbb{H}}^{I_{0}}}$ .

First, as in (21), we have

[TABLE]

Second, we can write

[TABLE]

where ${\tilde{w}}_{i}=\|\tilde{\beta}_{i}\|_{{\mathbb{H}}}^{-1}$ and $w_{i}=\|{\beta}_{i}^{\star}\|_{{\mathbb{H}}}^{-1}$ .

Using a Taylor expansion for functional data, $f(x+h)-f(x)=\langle h,f^{\prime}(x)\rangle+o(h^{2})$ , where $f(x)=\dfrac{1}{\|x\|^{2}}$ and $f^{\prime}(x)=\dfrac{-2x}{\|x\|^{4}}$ , we can write

[TABLE]

According to Cauchy-Schwarz inequality and Lemma 10

[TABLE]

where $r_{N}=\dfrac{I_{0}\log{(I)}}{N}$ . By using Assumption 2 we can see $\dfrac{r_{N}^{\frac{1}{2}}}{b_{N}}\rightarrow 0$ and then

[TABLE]

According to (25), we can conclude

[TABLE]

so if $\lambda_{H}\ll\dfrac{b_{N}^{2}}{\sqrt{I_{0}}}$ , then $P(B_{3})\rightarrow 0$ .

Step 4: $P(B_{4})\rightarrow 0$

The idea is the same as in Step 1. Recall that

[TABLE]

where $G_{I_{0}}=\left({\hat{\small\Sigma}}_{11}\mathds{I}_{I_{0}}+\lambda_{K}K_{I_{0}}^{-1}L_{I_{0}}^{2}\right)^{-1}$ . We aim to show $\dfrac{\lambda_{K}\|{\bf X}_{.i}^{\top}{\bf X}_{1}G_{I_{0}}K_{I_{0}}^{-1}L^{2}_{I_{0}}(\boldsymbol{\beta}^{\star}_{1})\|_{{\mathbb{H}}}}{N\lambda_{H}{\tilde{w}}_{i}}\rightarrow 0$ . According to Lemma 10, Assumption 6 and equation (18) we have

[TABLE]

where $r_{N}=\dfrac{I_{0}\log{(I)}}{N}$ . So if $\dfrac{I_{0}\sqrt{\log(I)}d_{N}}{\sqrt{N}}\ll\dfrac{\lambda_{H}}{\sqrt{\lambda_{K}}}$ , then $P(B_{4})\rightarrow 0$ asymptotically.

Step 5: $P(B_{5})\rightarrow 0$

The idea is the same as in Step 3. Recall that

[TABLE]

where $\tilde{\boldsymbol{s}}=\{\tilde{w}_{i}\hat{\beta}_{i}\|\hat{\beta}_{i}\|_{{\mathbb{H}}}^{-1};i\in{\mathcal{S}}\}$ . We aim to show that $\dfrac{\|{\bf X}_{.i}^{\top}{\bf X}_{1}G_{I_{0}}(\tilde{\boldsymbol{s}})\|_{{\mathbb{H}}}}{N{\tilde{w}}_{i}}\rightarrow 0$ . By using Assumption 6, (26) and (27) and the fact that

[TABLE]

where $r_{N}=\dfrac{I_{0}\log(I)}{N}$ , we can write

[TABLE]

So if $\dfrac{I_{0}^{2}\log(I)}{N}\ll b_{N}^{2}$ , then $P(B_{5})\rightarrow 0$ asymptotically.

Step 6: $P(B_{6})\rightarrow 0$

Recall that

[TABLE]

where $H_{N}=\left(\mathds{I}_{N}-{\bf X}_{1}\left({\bf X}_{1}^{\top}{\bf X}_{1}\mathds{I}_{I_{0}}+\lambda_{K}NK_{I_{0}}^{-1}L^{2}_{I_{0}}\right)^{-1}{\bf X}_{1}^{\top}\right)$ . We notice that $B_{6}=\bigcup\limits_{i\not\in{\mathcal{S}}}A_{i}$ where $A_{i}=\bigg{\{}\frac{1}{N}\|{\bf X}_{.i}^{\top}H_{N}\epsilon\|_{{\mathbb{H}}}\geq\dfrac{\lambda_{H}{\tilde{w}}_{i}}{3}\bigg{\}}$ . According to Lemma 10

[TABLE]

where we denoted $T=O_{p}(1)$ which is bounded in probability with $M$ for $\dfrac{\varepsilon}{2I}$ . Using conditional probability on T, we have

[TABLE]

We just need to show that $\sum\limits_{i\not\in{\mathcal{S}}}P\left(\|{\bf X}_{.i}^{\top}H_{N}\boldsymbol{\epsilon}\|_{{\mathbb{H}}}^{2}\geq(\dfrac{N\lambda_{H}}{3Mr_{N}^{\frac{1}{2}}})^{2}\right)$ goes to zero. Since $\boldsymbol{\epsilon}=(\epsilon_{1},\dots,\epsilon_{N})$ and the $\epsilon_{i}$ are independent mean-zero $C$ -subgaussian processes in ${\mathbb{H}}$ , Lemma 6 implies that ${\bf X}_{.i}^{\top}H_{N}\boldsymbol{\epsilon}$ is a $C_{h}$ -subgaussian process where

[TABLE]

According to Lemma 7 we have

[TABLE]

and based on Lemma 9 and the fact that ${\bf X}_{.i}$ is standardized, we have

[TABLE]

Now we just need to bound $\|H_{N}\|_{op}$ . Let $P=\left({\bf X}_{1}^{\top}{\bf X}_{1}\mathds{I}_{I_{0}}+\lambda_{K}NK_{I_{0}}^{-1}L^{2}_{I_{0}}\right)^{-1}$ and write $H_{N}=\mathds{I}_{N}-{\bf X}_{1}P{\bf X}_{1}^{\top}$ . If we can prove that the eigenvalues of ${\bf X}_{1}P{\bf X}_{1}^{\top}$ are in $[0,1]$ , it follows that the eigenvalues of $H_{N}$ are in $[0,1]$ and consequently $\|H_{N}\|_{op}\leq 1$ . To do so, we use Lemma 8 and prove

[TABLE]

It is a basic linear algebra exercise to show that ${\bf X}_{1}P{\bf X}_{1}^{\top}$ and $P{\bf X}_{1}^{\top}{\bf X}_{1}$ have the same eigenvalues. Then

[TABLE]

where $\mathds{I}$ is an identity operator from ${\mathbb{H}}$ to ${\mathbb{H}}$ and ${\bf I}_{I_{0}}$ is an identity $I_{0}$ by $I_{0}$ matrix. Then we can see

[TABLE]

Since $(1+\lambda_{K}\tau_{i}\theta_{j}^{-1}\eta_{j}^{2})^{-1}\leq 1$ , the eigenvalues of $P{\bf X}_{1}^{\top}{\bf X}_{1}$ are no larger than one and we can conclude

[TABLE]

therefore $\|H_{N}\|_{op}\leq 1$ . So (31) will be simplified to

[TABLE]
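
The bound $\|H_{N}\|_{op}\leq 1$ can also be checked numerically one eigen-mode at a time. The sketch below is an illustration under assumptions: on the $j$-th common eigenfunction, $K^{-1}L^{2}$ is taken to act as multiplication by $d=\eta_{j}^{2}/\theta_{j}\geq 0$ , so that $H_{N}$ reduces to the $N\times N$ matrix $\mathds{I}_{N}-{\bf X}_{1}({\bf X}_{1}^{\top}{\bf X}_{1}+\lambda_{K}Nd\,{\bf I}_{I_{0}})^{-1}{\bf X}_{1}^{\top}$ ; the design matrix, the values of `d`, and the function name are hypothetical.

```python
import numpy as np

def residual_operator(X1, lam_K, d):
    """H_N restricted to one eigen-mode with K^{-1} L^2 eigenvalue d (assumed form)."""
    N, I0 = X1.shape
    P = np.linalg.inv(X1.T @ X1 + lam_K * N * d * np.eye(I0))
    return np.eye(N) - X1 @ P @ X1.T

rng = np.random.default_rng(3)
X1 = rng.standard_normal((40, 5))            # hypothetical design restricted to the true support

spectra = {}
for d in (0.0, 0.5, 10.0):
    H = residual_operator(X1, lam_K=0.1, d=d)
    H = 0.5 * (H + H.T)                      # symmetrize against round-off before eigvalsh
    ev = np.linalg.eigvalsh(H)
    spectra[d] = (float(ev.min()), float(ev.max()))
```

For $d=0$ the matrix is the usual residual projection, with eigenvalues equal to 0 or 1; positive $d$ shrinks the fitted part, so the spectrum stays inside $[0,1]$ .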

According to (30), we can see

[TABLE]

So we are looking for a $\hat{t}$ such that

[TABLE]

Same as (22), we have

[TABLE]

So we just need to find $\hat{t}$ such that

[TABLE]

Let’s denote $D_{2}=9DM^{2}$ , then one can see $\hat{t}=\dfrac{N\lambda_{H}^{2}}{D_{2}r_{N}}$ implies

[TABLE]

Based on (29), we can bound

[TABLE]

So if $\lambda_{H}\gg\dfrac{\sqrt{I_{0}}\log(I)}{N}$ , then we can conclude $P(B_{6})\rightarrow 0$ asymptotically.

Proof of part 2:

We need to show that

[TABLE]

or equivalently, since the support is recovered with probability tending to one,

[TABLE]

where $\boldsymbol{\hat{\beta}}=(\boldsymbol{\hat{\beta}}_{1},\textbf{0})$ and $\boldsymbol{\tilde{\boldsymbol{\beta}}_{o}}=(\boldsymbol{\tilde{\boldsymbol{\beta}}_{o}}_{1},\textbf{0})$ . So

[TABLE]

where $G_{I_{0}}=\left({\hat{\small\Sigma}}_{11}\mathds{I}_{I_{0}}+\lambda_{K}K_{I_{0}}^{-1}L_{I_{0}}^{2}\right)^{-1}$ and the oracle estimate is obtained by taking the subgradient of target function (1) with $\lambda_{H}=0$ given the true support. So the norm of the difference is given by

[TABLE]

According to (21) and (27) we will have

[TABLE]

then we can conclude

[TABLE]

So if $\lambda_{H}\ll\dfrac{b_{N}}{\sqrt{N}\sqrt{I_{0}}}$ , the probability asymptotically goes to zero.

Proof of part 3:

Here we want to show that

[TABLE]

or equivalently, since the correct support is recovered with probability tending to one,

[TABLE]

Similar to (34), we can see

[TABLE]

Since

[TABLE]

where ${\bf I}_{I_{0}}$ is an identity $I_{0}$ by $I_{0}$ matrix. Then we can see

[TABLE]

Then we can conclude

[TABLE]

If there exists a constant $M>0$ such that $\eta_{j}^{2}\geq M\sqrt{\theta_{j}}$ , we have

[TABLE]

and then based on (27), we can see

[TABLE]

So if $\dfrac{\lambda_{H}}{\lambda_{K}}\ll\dfrac{b_{N}}{\sqrt{N}\sqrt{I_{0}}}$ , the proof will be completed.

Bibliography

Algamal, Z. Y. and M. H. Lee (2015). Regularized logistic regression with adjusted adaptive elastic net for gene selection in high dimensional cancer classification. Computers in Biology and Medicine 67, 136–145.
Antonioni (1997). Subgaussian random variable in Hilbert spaces.
Barber, R. F., M. Reimherr, T. Schill, et al. (2017). The function-on-scalar lasso with applications to longitudinal GWAS. Electronic Journal of Statistics 11(1), 1351–1389.
Barbu, V. and T. Precupanu (2012). Convexity and optimization in Banach spaces. Springer Science & Business Media.
Bauschke, H. H. and P. L. Combettes (2011). Convex analysis and monotone operator theory in Hilbert spaces. Springer Science & Business Media.
Bawa, R. K. (2005). Spline based computational technique for linear singularly perturbed boundary value problems. Applied Mathematics and Computation 167(1), 225–236.
Berlinet, A. and C. Thomas-Agnan (2011). Reproducing kernel Hilbert spaces in probability and statistics. Springer Science & Business Media.
Bierut, L. J., P. A. Madden, N. Breslau, E. O. Johnson, D. Hatsukami, O. F. Pomerleau, G. E. Swan, J. Rutter, S. Bertelsen, L. Fox, et al. (2006). Novel genes identified in a high-density genome wide association study for nicotine dependence. Human Molecular Genetics 16(1), 24–35.
