On functional logistic regression: some conceptual issues

Beatriz Bueno-Larraz; José R. Berrendero; Antonio Cuevas

arXiv:1812.00721·math.ST·July 14, 2021

Open Access

TL;DR

This paper critically examines the conceptual foundations of functional logistic regression, proposing an RKHS-based model, analyzing its validity, and addressing issues related to the existence of maximum likelihood estimators in the functional setting.

Contribution

It introduces an RKHS-based approach to functional logistic regression, compares it with traditional $L_2$ models, and explores the conditions for the existence of ML estimators.

Findings

01

RKHS-based model validity conditions derived

02

ML estimators often do not exist in the functional case

03

Proposes a restricted RKHS-based ML estimator

Equations

$$\mathbb{P}(Y=1\mid X=x)=\frac{1}{1+\exp\{-\beta_{0}-\langle\beta,x\rangle_{2}\}},$$

$$\mathcal{K}(f)(\cdot)=\int_{0}^{1}K(s,\cdot)f(s)\,\mathrm{d}s=\mathbb{E}\big[\langle X-m,f\rangle_{2}\big(X(\cdot)-m(\cdot)\big)\big].$$

$$\mathcal{L}_{0}(X)=\Big\{U\in L^{2}(\Omega)\ :\ U=\sum_{i=1}^{n}a_{i}\big(X(t_{i})-m(t_{i})\big),\ a_{i}\in\mathbb{R},\ t_{i}\in[0,1],\ n\in\mathbb{N}\Big\},$$

$$\Psi_{X}(U)(s)=\mathbb{E}\big[U\big(X(s)-m(s)\big)\big]=\langle U,\,X(s)-m(s)\rangle\in\mathcal{H}(K),\quad\text{for }U\in\mathcal{L}(X)$$

$$\mathbb{P}(Y=1\mid X=x)=\frac{1}{1+\exp\{-\beta_{0}-\langle\beta,x\rangle_{K}\}},$$

$$\mathbb{P}(Y=1\mid X=x)=\frac{1}{1+\exp\{-\beta_{0}-\langle x,\beta\rangle_{K}\}}\equiv\frac{1}{1+\exp\{-\beta_{0}-\Psi_{x}^{-1}(\beta)\}},$$

$$\frac{\mathrm{d}P_{m_{0}}}{\mathrm{d}P_{m_{1}}}(X)=\frac{\mathrm{d}P_{m_{0}-m_{1}}}{\mathrm{d}P_{0}}(X-m_{1})=\exp\Big\{\langle X-m_{1},m_{0}-m_{1}\rangle_{K}-\frac{1}{2}\|m_{0}-m_{1}\|_{K}^{2}\Big\}.$$

$$\mathbb{P}(Y=1\mid X)=\frac{p\,\frac{\mathrm{d}P_{m_{1}}}{\mathrm{d}P_{m_{0}}}(X)}{p\,\frac{\mathrm{d}P_{m_{1}}}{\mathrm{d}P_{m_{0}}}(X)+(1-p)}=\Big(1+\frac{1-p}{p}\,\frac{\mathrm{d}P_{m_{0}}}{\mathrm{d}P_{m_{1}}}(X)\Big)^{-1}.$$

$$\mathbb{P}(Y=1\mid X)=\Big(1+\frac{1-p}{p}\exp\Big\{\langle X,m_{0}-m_{1}\rangle_{K}-\mathbb{E}_{m_{1}}[\langle X,m_{0}-m_{1}\rangle_{K}]-\frac{1}{2}\|m_{0}-m_{1}\|_{K}^{2}\Big\}\Big)^{-1}.$$

$$\log\Big(\frac{\mathrm{d}P_{m_{1}}}{\mathrm{d}P_{m_{0}}}(x)\Big)=\langle x-m_{0},\,\mathcal{K}^{-1}(m_{1}-m_{0})\rangle_{2}-\frac{1}{2}\,\langle m_{1}-m_{0},\,\mathcal{K}^{-1}(m_{1}-m_{0})\rangle_{2},$$

$$\frac{\mathrm{d}P_{m_{0}}}{\mathrm{d}P_{m_{1}}}(x)=C\exp(-\langle x,\beta\rangle_{2}),$$

$$\mathcal{H}(K)=\{\mathcal{K}^{1/2}(f),\ f\in L^{2}[0,1]\},$$

$$\langle f,g\rangle_{K}=\langle\mathcal{K}^{-1/2}(f),\mathcal{K}^{-1/2}(g)\rangle_{2}.$$

$$\beta(\cdot)=\sum_{j=1}^{p}\beta_{j}K(t_{j},\cdot),$$

$$\mathbb{P}(Y=1\mid X)=\bigg(1+\exp\Big\{-\beta_{0}-\sum_{j=1}^{p}\beta_{j}\big(X(t_{j})-m(t_{j})\big)\Big\}\bigg)^{-1}.$$

$$\mathbb{P}(Y=1\mid X=x)=\frac{1}{1+\exp\big\{-\beta_{0}-\sum_{j=1}^{p}\beta_{j}\big(x(t_{j})-m(t_{j})\big)\big\}}$$

$$\mathbb{P}(Y=1\mid X=x)=\frac{1}{1+\exp\big\{-\beta_{0}-\int_{0}^{1}\alpha(t)\big(x(t)-m(t)\big)\,\mathrm{d}t\big\}},$$

$$\mathbb{P}(Y=1\mid X=x)=\frac{1}{1+\exp\big\{-\beta_{0}-\sum_{j=1}^{p}\beta_{j}\langle x-m,u_{j}\rangle_{2}\big\}},$$

$$\Psi_{X}\big(X(t_{j})-m(t_{j})\big)(s)=\mathbb{E}\big[\big(X(s)-m(s)\big)\big(X(t_{j})-m(t_{j})\big)\big]=K(s,t_{j}).$$

$$\langle X,\beta\rangle_{K}=\sum_{j=1}^{p}\beta_{j}\langle X,K(\cdot,t_{j})\rangle_{K}=\sum_{j=1}^{p}\beta_{j}\big(X(t_{j})-m(t_{j})\big).$$

$$\Psi_{X}(U)(s)=\mathbb{E}\Big[\int_{0}^{1}\alpha(t)\big(X(t)-m(t)\big)\,\mathrm{d}t\cdot\big(X(s)-m(s)\big)\Big]=\int_{0}^{1}K(s,t)\alpha(t)\,\mathrm{d}t=\beta(s),$$

$$\langle X,\beta\rangle_{K}=\sum_{j=1}^{p}\beta_{j}\langle u_{j},X-m\rangle_{2}.$$

$$L_{n}(b,b_{0})=\frac{1}{n_{0}}\sum_{i=1}^{n_{0}}\log\Big(\frac{e^{-b_{0}-b^{\prime}x_{i}^{0}}}{1+e^{-b_{0}-b^{\prime}x_{i}^{0}}}\Big)+\frac{1}{n_{1}}\sum_{i=1}^{n_{1}}\log\Big(\frac{1}{1+e^{-b_{0}-b^{\prime}x_{i}^{1}}}\Big).$$

$$\beta_{0}+\Psi_{X}^{-1}(\beta)\ \equiv\ \beta_{0}+\langle X,\beta\rangle_{K}\ =\ \log\Big(\frac{p_{\beta,\beta_{0}}(X)}{1-p_{\beta,\beta_{0}}(X)}\Big),$$

$$f_{\beta,\beta_{0}}(x,y)$$

$$L_{n}(\beta,\beta_{0})\ =\ \frac{1}{n}\sum_{i=1}^{n}\log\big(p_{\beta,\beta_{0}}(x_{i})^{y_{i}}\left(1-p_{\beta,\beta_{0}}(x_{i})\right)^{1-y_{i}}\big),$$

$$L(\beta,\beta_{0})=\mathbb{E}_{Z}\big[\log f_{\beta,\beta_{0}}(X,Y)\big]=\mathbb{E}_{Z}\big[\log\big(p_{\beta,\beta_{0}}(X)^{Y}(1-p_{\beta,\beta_{0}}(X))^{1-Y}\big)\big],$$

$$L(\beta^{*},\beta_{0}^{*})\geq L(\beta,\beta_{0}),\ \mbox{for all}\ (\beta,\beta_{0})\in\Theta.$$

$$L_{n}(\beta,\beta_{0})\ =\frac{1}{n}\sum_{\{i:\,y_{i}=1\}}\log\Big(\frac{1}{1+e^{-\beta_{0}-\langle\beta,x_{i}\rangle_{K}}}\Big)+\frac{1}{n}\sum_{\{i:\,y_{i}=0\}}\log\Big(\frac{e^{-\beta_{0}-\langle\beta,x_{i}\rangle_{K}}}{1+e^{-\beta_{0}-\langle\beta,x_{i}\rangle_{K}}}\Big).$$

$$\beta_{m}(\cdot)=c_{m}K(t_{0},\cdot).$$

Taxonomy

Topics: Advanced Statistical Methods and Models · Statistical Methods and Inference · Statistical and numerical algorithms

Full text

On functional logistic regression: some conceptual issues

José R. Berrendero¹, Beatriz Bueno-Larraz² and Antonio Cuevas¹

¹ Departamento de Matemáticas, Universidad Autónoma de Madrid

² Independent Data Scientist

Abstract

The main ideas behind the classical multivariate logistic regression model make sense when translated to the functional setting, where the explanatory variable $X$ is a function and the response $Y$ is binary. However, some important technical issues appear (or are aggravated with respect to those of the multivariate case) due to the functional nature of the explanatory variable. First, the mere definition of the model can be questioned: while most approaches so far proposed rely on the $L_{2}$-based model, we suggest an alternative (in some sense, more general) approach, based on the theory of Reproducing Kernel Hilbert Spaces (RKHS). The validity conditions of such an RKHS-based model, as well as its relation to the $L_{2}$-based one, are investigated and made explicit in two formal results. Some relevant particular cases are considered as well. Second, we show that, under very general conditions, the maximum likelihood (ML) estimators of the logistic model parameters fail to exist in the functional case. Third, on a more positive side, we suggest an RKHS-based restricted version of the ML estimator. This is a methodological paper, aimed at a better understanding of the functional logistic model, rather than focussing on numerical and practical issues.

Keywords: Functional data, logistic regression, reproducing kernel Hilbert spaces, kernel methods in statistics.

1 Introduction: statement of the model

Logistic regression: some basic ideas and references

Throughout this work we study the situation in which a binary (0-1) response variable must be predicted in terms of a random explanatory variable $X$, defined on a probability space $\Omega$. The logistic regression model is an extremely popular approach to this problem. The basic ideas of this model date back to the end of the nineteenth century (a complete historical overview can be found in Cramer (2003, Ch. 9)), but the logistic methodology is still under constant attention, especially as a (supervised) classification tool. The book by Hilbe (2009) is a fairly complete reference on logistic regression.

The logistic model is a particular case of the wider family of generalized linear models (we refer to McCullagh and Nelder (1989) for details), which presents some interesting characteristics. According to Hosmer et al. (2013, p. 52), one of its most appealing features is that the coefficients of the model are easily interpretable in terms of the values of the predictors. This technique stems from the attempt to apply well-known linear regression procedures to problems with categorical responses, like binary classification. There is no point in imposing that the categorical response is linear in the predictors $x$, but we might instead assume that $\log(p(x)/(1-p(x)))$ is linear in $x$, where $p(x)={\mathbb{P}}(Y=1|X=x)$. Here the logarithm could be replaced with other link functions. However, an important aspect of the logarithm-based model is that it holds whenever the predictor variable $X$ in both classes is Gaussian with a common covariance matrix.
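The last claim can be checked with a short computation, sketched here for the finite-dimensional case. The symbols $\mu_{0},\mu_{1}$ (class means), $\Sigma$ (common nonsingular covariance matrix) and $p=\mathbb{P}(Y=1)$ are notation assumed for this illustration only. By Bayes' rule the log-odds is the log prior odds plus the log ratio of the two Gaussian densities, and the quadratic terms in $x$ cancel:

```latex
\begin{align*}
\log\frac{p(x)}{1-p(x)}
 &= \log\frac{p}{1-p}
   -\tfrac12 (x-\mu_1)^{\top}\Sigma^{-1}(x-\mu_1)
   +\tfrac12 (x-\mu_0)^{\top}\Sigma^{-1}(x-\mu_0) \\
 &= \underbrace{\log\frac{p}{1-p}
   -\tfrac12\big(\mu_1^{\top}\Sigma^{-1}\mu_1-\mu_0^{\top}\Sigma^{-1}\mu_0\big)}_{\beta_0}
   +\underbrace{(\mu_1-\mu_0)^{\top}\Sigma^{-1}}_{\beta^{\top}}\,x .
\end{align*}
```

The slope $\beta=\Sigma^{-1}(\mu_{1}-\mu_{0})$ involves the inverse covariance, which is the quantity that becomes delicate in the functional setting.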

This finite-dimensional logistic model has been widely studied. Apart from the already mentioned references, Efron (1975) provides a comparison between logistic predictors and Fisher discriminant analysis. In addition, Munsiwamy and Wakweya (2011) give a useful overview of asymptotic results for the estimators (first proved in Fahrmeir and Kaufmann (1985) and Fahrmeir and Kaufmann (1986)).

Logistic regression in the functional case: the “classical” $L^{2}$ -model

The motivations for extending logistic regression to functional data are quite obvious given the current increasing availability of functional data in experimental sciences. A historical overview of several approaches to functional logistic regression can be found in Mousavi and Sørensen (2018).

We start by establishing the framework of the problem in this functional context. The goal is to explore the relationship between a dichotomous response variable $Y$, taking values in $\{0,1\}$, and a functional predictor $X$. We will assume throughout that $X$ is an $L^{2}$-stochastic process with trajectories in $L^{2}[0,1]$. Thus, the random variable $Y$ conditional on the realizations $x$ of the process follows a Bernoulli distribution with parameter $p(x)$, and the prior probability of class 1 is denoted by $p=\mathbb{P}(Y=1)$. In this setting, the most common functional logistic regression (FLR) model is

$$\mathbb{P}(Y=1\mid X=x)=\frac{1}{1+\exp\{-\beta_{0}-\langle\beta,x\rangle_{2}\}},\qquad (1)$$

where $\beta_{0}\in{\mathbb{R}}$, $\beta\in L^{2}[0,1]$ and $\langle\cdot,\cdot\rangle_{2}$ denotes the inner product in $L^{2}[0,1]$. This model is the direct extension of the $d$-dimensional one, where the inner product in ${\mathbb{R}}^{d}$ is replaced by its functional counterpart.

The standard approach to this problem is to reduce the dimension of the curves using Principal Component Analysis (PCA). That is, the curves are projected onto the subspace spanned by the eigenfunctions corresponding to the $d$ largest eigenvalues of the covariance operator. Then, we replace every curve in the functional data sample with the $d$-dimensional vector of coordinates of its projection with respect to the basis formed by these eigenfunctions. Finally, standard logistic regression is applied to the resulting $d$-dimensional vectors. Among others, this strategy has been explored by Escabias et al. (2004) and James (2002) from an applied perspective, although the latter reference in fact deals with generalized linear models (and not only with logistic regression). These more general models are also studied by Müller and Stadtmüller (2005), but with a more mathematical focus.
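A minimal sketch of this three-step pipeline on simulated data, using only NumPy. The Brownian-motion sample, the mean shift $m_{1}(t)=t$, the grid size, the small ridge term and the helper names `fpca_scores` and `fit_logistic` are all assumptions made for illustration; they are not taken from the references above.

```python
import numpy as np

def fpca_scores(curves, d):
    """Coordinates of the centred curves on the d leading empirical eigenfunctions."""
    n, T = curves.shape
    dt = 1.0 / (T - 1)                          # spacing of an equispaced grid on [0, 1]
    centred = curves - curves.mean(axis=0)
    cov = centred.T @ centred / n               # empirical covariance K(s, t) on the grid
    eigvals, eigvecs = np.linalg.eigh(cov * dt) # discretised covariance operator
    order = np.argsort(eigvals)[::-1][:d]       # indices of the d largest eigenvalues
    eigfuns = eigvecs[:, order] / np.sqrt(dt)   # rescaled to unit L2[0, 1] norm
    return centred @ eigfuns * dt               # Riemann sums for <x - mean, e_j>_2

def fit_logistic(Z, y, n_iter=25, ridge=1e-6):
    """Newton-Raphson for logistic regression with intercept (tiny ridge for stability)."""
    A = np.column_stack([np.ones(len(Z)), Z])
    b = np.zeros(A.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-A @ b))
        W = p * (1.0 - p)
        H = A.T @ (A * W[:, None]) + ridge * np.eye(A.shape[1])
        b = b + np.linalg.solve(H, A.T @ (y - p) - ridge * b)
    return b

# Simulated sample: Brownian paths, class 1 shifted by the mean function m1(t) = t.
rng = np.random.default_rng(0)
n, T = 400, 101
grid = np.linspace(0.0, 1.0, T)
incr = rng.normal(scale=np.sqrt(1.0 / (T - 1)), size=(n, T - 1))
paths = np.hstack([np.zeros((n, 1)), np.cumsum(incr, axis=1)])
y = rng.integers(0, 2, size=n)
curves = paths + y[:, None] * grid

scores = fpca_scores(curves, d=3)               # step 1 and 2: project and keep coordinates
coef = fit_logistic(scores, y)                  # step 3: ordinary logistic regression
pred = (np.column_stack([np.ones(n), scores]) @ coef > 0).astype(int)
accuracy = (pred == y).mean()                   # in-sample accuracy, for illustration only
```

The choice of $d$ is left to the user in this sketch; in practice it acts as a tuning parameter of the method.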

Some preliminaries and notation for an alternative approach

Our functional data will be trajectories in $L^{2}[0,1]$ of an $L^{2}$ -process $X=X(t)$ with continuous covariance and mean function, denoted by $K=K(s,t)$ and $m=m(t)$ , respectively. The covariance operator $\mathcal{K}$ associated with the covariance function $K$ of the process is given by

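$$\mathcal{K}(f)(\cdot)=\int_{0}^{1}K(s,\cdot)f(s)\,\mathrm{d}s=\mathbb{E}\big[\langle X-m,f\rangle_{2}\big(X(\cdot)-m(\cdot)\big)\big].\qquad (2)$$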

Since the approach we explore here for functional logistic regression is based on the theory of Reproducing Kernel Hilbert Spaces (RKHS’s), we briefly recall, for the sake of clarity, some basic ideas and notation about RKHS’s (see Berlinet and Thomas-Agnan (2004) and Appendix F of Janson (1997) for further details and references).

Let ${\mathcal{H}}_{0}(K):=\{f\in L^{2}[0,1]\ :\ f(\cdot)=\sum_{i=1}^{n}a_{i}K(t_{i},\cdot),\ a_{i}\in{\mathbb{R}},\ t_{i}\in[0,1],\ n\in{\mathbb{N}}\}$ , be the space of all finite linear combinations of evaluations of $K$ . This space is endowed with the inner product $\langle f,g\rangle_{K}=\sum_{i,j}\alpha_{i}\beta_{j}K(t_{i},s_{j})$ , where $f(\cdot)=\sum_{i}\alpha_{i}K(t_{i},\cdot)$ and $g(\cdot)=\sum_{j}\beta_{j}K(s_{j},\cdot)$ .

Then, the RKHS associated with $K$ is defined as the completion of ${\mathcal{H}}_{0}(K)$ . In other words, ${\mathcal{H}}(K)$ is made of all functions obtained as pointwise limits of Cauchy sequences in ${\mathcal{H}}_{0}(K)$ . The inner product is extended accordingly to the whole space ${\mathcal{H}}(K)$ .

These spaces are named after the so-called reproducing property, $\langle f,K(s,\cdot)\rangle_{K}=f(s)$ , for all $f\in{\mathcal{H}}(K),s\in[0,1]$ , which is particularly important in the applications. On account of this property it is sometimes said that the RKHS are spaces of “true functions”, in the sense that the pointwise values $f(s)$ , at a given $s$ do matter, by contrast with $L^{2}[0,1]$ whose elements are in fact equivalence classes of functions.
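The reproducing property is easy to verify by hand on ${\mathcal{H}}_{0}(K)$. The sketch below does so for one element, assuming the Brownian covariance $K(s,t)=\min(s,t)$ as an illustrative kernel; the coefficients, the points and the helper names `make_f` and `inner_K` are arbitrary choices for this example.

```python
def K(s, t):
    """Illustrative kernel: covariance of standard Brownian motion."""
    return min(s, t)

def make_f(a, ts):
    """The function f(.) = sum_i a_i K(t_i, .) in H_0(K)."""
    return lambda s: sum(ai * K(ti, s) for ai, ti in zip(a, ts))

def inner_K(a, ts, b, ss):
    """<f, g>_K = sum_{i,j} a_i b_j K(t_i, s_j) for f, g in H_0(K)."""
    return sum(ai * bj * K(ti, sj) for ai, ti in zip(a, ts) for bj, sj in zip(b, ss))

a, ts = [2.0, -1.0, 0.5], [0.2, 0.5, 0.9]
f = make_f(a, ts)

s = 0.4
lhs = inner_K(a, ts, [1.0], [s])   # <f, K(s, .)>_K, since K(s, .) has coefficient 1 at s
rhs = f(s)                         # pointwise evaluation f(s)
norm_sq = inner_K(a, ts, a, ts)    # squared RKHS norm of f
```

Both `lhs` and `rhs` reduce to the same finite sum $\sum_{i}a_{i}K(t_{i},s)$, which is all the reproducing property says on ${\mathcal{H}}_{0}(K)$.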

A property of RKHS’s especially useful in statistical applications is given by the following isometry result: let $L^{2}(\Omega)$ be the Hilbert space of real random variables with finite second moment, endowed with the usual inner product and with associated norm $\|U\|^{2}={\mathbb{E}}(U^{2})$ . Define

$$\mathcal{L}_{0}(X)=\Big\{U\in L^{2}(\Omega)\ :\ U=\sum_{i=1}^{n}a_{i}\big(X(t_{i})-m(t_{i})\big),\ a_{i}\in\mathbb{R},\ t_{i}\in[0,1],\ n\in\mathbb{N}\Big\},$$

where $m(t)={\mathbb{E}}[X(t)]$, and let ${\mathcal{L}}(X)$ be the completion of ${\mathcal{L}}_{0}(X)$ in $L^{2}(\Omega)$; in other words, ${\mathcal{L}}(X)$ is the subspace of $L^{2}(\Omega)$ defined as the closure of the linear span of the centred one-dimensional marginals of the process $X$. It turns out that the transformation $\Psi_{X}$, from ${\mathcal{L}}(X)$ to ${\mathcal{H}}(K)$, defined by

$$\Psi_{X}(U)(s)=\mathbb{E}\big[U\big(X(s)-m(s)\big)\big]=\langle U,\,X(s)-m(s)\rangle\in\mathcal{H}(K),\quad\text{for }U\in\mathcal{L}(X),\qquad (3)$$

is an isometry (sometimes called Loève’s isometry) between ${\mathcal{L}}(X)$ and ${\mathcal{H}}(K)$, that is, $\Psi_{X}$ is bijective and preserves the inner product (see Lukić and Beder (2001, Lemma 1.1)). As a consequence, the Hilbert spaces ${\mathcal{L}}(X)$ and ${\mathcal{H}}(K)$ can be identified. Note that, in informal terms, $\Psi_{X}$ is the completion of the transformation from ${\mathcal{L}}_{0}(X)$ to ${\mathcal{H}}_{0}(K)$ given by $\sum_{i=1}^{n}a_{i}\big(X(t_{i})-m(t_{i})\big)\mapsto\sum_{i=1}^{n}a_{i}K(t_{i},\cdot)$.
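On ${\mathcal{L}}_{0}(X)$ the isometry says that ${\mathbb{E}}[UV]$ equals $\sum_{i,j}a_{i}b_{j}K(t_{i},s_{j})$ when $U$ and $V$ are finite combinations of centred marginals. The following Monte Carlo sketch checks this numerically, assuming standard Brownian motion (so $m\equiv 0$ and $K(s,t)=\min(s,t)$) purely for illustration; the time points, coefficients and sample size are arbitrary.

```python
import math
import random

def K(s, t):
    """Covariance of standard Brownian motion (illustrative assumption)."""
    return min(s, t)

def brownian_at(times, rng):
    """One Brownian path evaluated at increasing times, built from independent increments."""
    x, prev, out = 0.0, 0.0, []
    for t in times:
        x += rng.gauss(0.0, math.sqrt(t - prev))
        prev = t
        out.append(x)
    return out

times = [0.2, 0.5, 0.9]
a = [1.0, -2.0, 1.0]     # U = sum_i a_i X(t_i)
b = [0.5, 0.0, 1.0]      # V = sum_j b_j X(t_j)

rng = random.Random(0)
n = 20000
mc = 0.0
for _ in range(n):
    x = brownian_at(times, rng)
    U = sum(ai * xi for ai, xi in zip(a, x))
    V = sum(bj * xj for bj, xj in zip(b, x))
    mc += U * V
mc /= n                  # Monte Carlo estimate of the L2(Omega) inner product E[U V]

# RKHS inner product of the images Psi_X(U) and Psi_X(V) in H_0(K).
exact = sum(ai * bj * K(ti, tj) for ai, ti in zip(a, times) for bj, tj in zip(b, times))
```

The two quantities `mc` and `exact` should agree up to Monte Carlo error, which shrinks at the usual $n^{-1/2}$ rate.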

It is worth mentioning that while ${\mathcal{H}}(K)$ is, in several aspects, a natural Hilbert space associated with the process $X$, typically, with probability one, the trajectories of the process $X$ do not themselves belong to ${\mathcal{H}}(K)$ (see, e.g., Lukić and Beder (2001, Cor. 7.1) and Pillai et al. (2007, Th. 11)). Hence, one cannot directly write $\langle x,K(s,\cdot)\rangle_{K}$, for $x$ a realization of the process. However, following Parzen (1961a), we will use the convenient notation $\langle x,K(s,\cdot)\rangle_{K}$, interpreting this expression in terms of Loève’s isometry, $\Psi_{X}$; more precisely, we will identify $\langle x,f\rangle_{K}$ with $\Psi_{x}^{-1}(f):=(\Psi_{X}^{-1}(f))(\omega)$, for $x=X(\omega)$ and $f\in{\mathcal{H}}(K)$, which in particular means $\Psi_{X}^{-1}(\sum_{i=1}^{n}a_{i}K(t_{i},\cdot))=\sum_{i=1}^{n}a_{i}(X(t_{i})-m(t_{i}))$.

The intuition behind the definition of $\langle x,f\rangle_{K}$ is reminiscent of the definition of Itô’s isometry, which is used to define the stochastic integral with respect to the Wiener measure (Brownian motion), overcoming the fact that the Brownian trajectories are not of bounded variation. As we will see below, the transformation $x\mapsto\langle\beta,x\rangle_{K}=\Psi_{X}^{-1}(\beta)$ will play a central role in the alternative functional logistic model we are going to propose.

An RKHS-based proposal for logistic regression in the functional case

We propose a new model for functional logistic regression problems, based on ideas borrowed from the theory of RKHS’s. To be more specific, our proposal is to study the following model, instead of (1),

$$\mathbb{P}(Y=1\mid X=x)=\frac{1}{1+\exp\{-\beta_{0}-\langle\beta,x\rangle_{K}\}},\qquad (4)$$

where the inner product stands for $\Psi_{x}^{-1}(\beta)$ , the inverse of Loève’s isometry defined in Equation (3). Throughout this paper we motivate this model and study some relevant theoretical aspects about it.

Some RKHS related literature

The book by Hsing and Eubank (2015) provides an excellent background on mathematical methods, including RKHS theory, for the statistical analysis of functional data. The papers by Hsing and Ren (2009) and Kneip et al. (2020) also offer very general perspectives and results on the applicability of RKHS methods in functional regression models, though not particularly focussed on the logistic case.

Some closely related ideas, aimed at the prediction problem in functional linear models are also present in Shin and Hsing (2012), even if the RKHS methodology is not explicitly mentioned there.

Some other more specific references (a few of them especially dealing with the functional logistic model) will be cited below.

The contents of this work

In the first place (see Theorem 1, Section 2), we will analyze specific conditions under which the RKHS-based logistic model (4) holds.

In the second place (see Theorem 2, Section 3), we will show that model (4) covers some relevant cases of practical interest not included in the standard $L_{2}$ -model (1), though, in fact, it is also shown in Theorem 2 that the $L_{2}$ model can also be obtained as a particular case of the RKHS model (4) under some conditions.

In the third place, in Section 4, we will prove two results of non-existence for the maximum likelihood estimator of the slope function $\beta$ in model (4). Such negative results can be seen as an aggravated, functional counterpart of the well-known partial non-existence results arising in finite dimensional logistic models; see Candès and Sur (2020) and references therein.

This is a methodological and theoretical paper: our aim is to contribute to a better understanding of the functional logistic model, rather than focussing on numerical or practical issues. However, we provide in Section 5 a specific suggestion (based on a restricted maximum likelihood approach) to deal with the estimation of the slope function in model (4).

2 The RKHS-based functional logistic model: validity conditions in the Gaussian case

In this section we motivate the reasons why model (4) is meaningful. In Theorem 1 we show that the standard assumption that both $X|Y=0$ and $X|Y=1$ are Gaussian implies (4). We also analyze under which conditions the more standard $L^{2}$ -model (1) is implied and we clarify the difference between both approaches.

In our functional setting, for $i=0,1$, we assume that $\{X(t):\,t\in[0,1]\}$ given $Y=i$ is a Gaussian process with continuous mean function $m_{i}$ and continuous covariance function $K$ (the same for $i=0,1$). We will assume throughout that all the eigenvalues $\lambda_{i}$ of the covariance operator ${\mathcal{K}}$ associated with $K$ are strictly positive (so ${\mathcal{K}}$ is injective). Note that, as a consequence of the Spectral Theorem (see, e.g., Hsing and Eubank (2015, p. 98)), ${\mathcal{K}}x=\sum_{i}\lambda_{i}\langle x,e_{i}\rangle e_{i}$, where $e_{i}$ stands for a unit eigenvector associated with $\lambda_{i}$; thus, the inverse ${\mathcal{K}}^{-1}$ is defined on the range of ${\mathcal{K}}$, ${\mathcal{K}}(L^{2})$, as a linear (not continuous) transformation, by ${\mathcal{K}}^{-1}y=\sum_{i}\frac{\langle y,e_{i}\rangle}{\lambda_{i}}e_{i}$, for $y=\sum_{i}\langle y,e_{i}\rangle e_{i}\in{\mathcal{K}}(L^{2})$.

Let $P_{m_{0}}$ and $P_{m_{1}}$ be the probability measures (i.e., the distributions) induced by the process $X$ conditional to $Y=0$ and $Y=1$ respectively. Recall that when $m_{0}$ and $m_{1}$ both belong to ${\mathcal{H}}(K)$ , we have that $P_{m_{0}}$ and $P_{m_{1}}$ are mutually absolutely continuous; see Theorem 5A of Parzen (1961a). The following theorem provides a very natural motivation for the RKHS model (4) in this Gaussian setting.

Theorem 1.

Let $P_{m_{0}},P_{m_{1}}$ and ${\mathcal{K}}$ be as in the previous lines. Then,

(a)

if $m_{1}-m_{0}\in\mathcal{H}(K)$ , then $P_{m_{0}}$ and $P_{m_{1}}$ are mutually absolutely continuous and model (4) holds,

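$$\mathbb{P}(Y=1\mid X=x)=\frac{1}{1+\exp\{-\beta_{0}-\langle x,\beta\rangle_{K}\}}\equiv\frac{1}{1+\exp\{-\beta_{0}-\Psi_{x}^{-1}(\beta)\}},$$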

with $\beta:=m_{1}-m_{0}$ and $\beta_{0}:=-\mathbb{E}_{m_{1}}[\Psi_{x}^{-1}(\beta)]+\|m_{1}-m_{0}\|^{2}_{K}/2-\log((1-p)/p)$ (where $p=\mathbb{P}(Y=1)$ and $\mathbb{E}_{m_{1}}(\cdot)$ stands for the expectation when the process has mean function equal to $m_{1}$ ). If $m_{1}-m_{0}\notin\mathcal{H}(K)$ , then $P_{m_{0}}$ and $P_{m_{1}}$ are mutually singular.

(b)

if $m_{1}-m_{0}\in\mathcal{K}(L^{2})=\{\mathcal{K}(f)\,:\,f\in L^{2}[0,1]\}$ , then $P_{m_{0}}$ and $P_{m_{1}}$ are mutually absolutely continuous and model (1) holds.

(c)

if $m_{1}-m_{0}\not\in\mathcal{K}(L^{2})$ model (1) is never recovered, but different situations are possible, according to the condition in part (a). In particular if $m_{0}=0$ , $m_{1}\in\mathcal{H}(K)$ recovers scenario (a), but if $m_{1}\not\in\mathcal{H}(K)$ , $P_{m_{0}}$ and $P_{m_{1}}$ are mutually singular.

Proof.

(a) Let $P_{0}$ be the measure induced by a Gaussian process with covariance function $K$ but zero mean function, $m\equiv 0$ . From Theorem 7A in Parzen (1961b) $m_{0}-m_{1}\in\mathcal{H}(K)$ implies that $P_{m_{0}-m_{1}}$ and $P_{0}$ are mutually absolutely continuous, and $m_{0}-m_{1}\notin\mathcal{H}(K)$ implies that $P_{m_{0}-m_{1}}$ and $P_{0}$ are mutually singular. By Lemma 1.1 in Pitcher (1960), $P_{m_{0}-m_{1}}$ and $P_{0}$ are mutually absolutely continuous if and only if $P_{m_{0}}$ and $P_{m_{1}}$ are mutually absolutely continuous and, in this case, the corresponding Radon-Nikodym derivatives fulfill

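$$\frac{\mathrm{d}P_{m_{0}}}{\mathrm{d}P_{m_{1}}}(X)=\frac{\mathrm{d}P_{m_{0}-m_{1}}}{\mathrm{d}P_{0}}(X-m_{1})=\exp\Big\{\langle X-m_{1},m_{0}-m_{1}\rangle_{K}-\frac{1}{2}\|m_{0}-m_{1}\|_{K}^{2}\Big\}.$$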

The last equality also follows from Theorem 7A in Parzen (1961b) (or Theorem 5A of Parzen (1961a)). Notice that by the definition of Loève’s isometry we have $\langle X-m_{1},m_{0}-m_{1}\rangle_{K}=\langle X,m_{0}-m_{1}\rangle_{K}-\mathbb{E}_{m_{1}}[\langle X,m_{0}-m_{1}\rangle_{K}]$ .

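The conditional probability of $Y=1$ can be expressed in terms of the Radon-Nikodym derivative of $P_{m_{1}}$ with respect to $P_{m_{0}}$ (see Baíllo et al. (2011, Th. 1)) by

$$\mathbb{P}(Y=1\mid X)=\frac{p\,\frac{\mathrm{d}P_{m_{1}}}{\mathrm{d}P_{m_{0}}}(X)}{p\,\frac{\mathrm{d}P_{m_{1}}}{\mathrm{d}P_{m_{0}}}(X)+(1-p)}=\Big(1+\frac{1-p}{p}\,\frac{\mathrm{d}P_{m_{0}}}{\mathrm{d}P_{m_{1}}}(X)\Big)^{-1}.\qquad (5)$$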

From the last two displayed equations, one can rewrite

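$$\mathbb{P}(Y=1\mid X)=\Big(1+\frac{1-p}{p}\exp\Big\{\langle X,m_{0}-m_{1}\rangle_{K}-\mathbb{E}_{m_{1}}[\langle X,m_{0}-m_{1}\rangle_{K}]-\frac{1}{2}\|m_{0}-m_{1}\|_{K}^{2}\Big\}\Big)^{-1}.$$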

Then, reordering terms in this expression we get the logistic model in part (a).
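The reordering can be spelled out as follows. This is only an expanded restatement of the step above, writing $\beta=m_{1}-m_{0}$, so that $\langle X,m_{0}-m_{1}\rangle_{K}=-\langle X,\beta\rangle_{K}$ and $\|m_{0}-m_{1}\|_{K}=\|\beta\|_{K}$, and absorbing the factor $(1-p)/p$ into the exponent:

```latex
\begin{align*}
\mathbb{P}(Y=1\mid X)
 &=\Big(1+\tfrac{1-p}{p}\exp\Big\{\langle X,m_0-m_1\rangle_K
   -\mathbb{E}_{m_1}[\langle X,m_0-m_1\rangle_K]-\tfrac12\|m_0-m_1\|_K^2\Big\}\Big)^{-1}\\
 &=\Big(1+\exp\Big\{\log\tfrac{1-p}{p}-\langle X,\beta\rangle_K
   +\mathbb{E}_{m_1}[\langle X,\beta\rangle_K]-\tfrac12\|\beta\|_K^2\Big\}\Big)^{-1}\\
 &=\Big(1+\exp\{-\beta_0-\langle X,\beta\rangle_K\}\Big)^{-1},
 \qquad
 \beta_0=-\mathbb{E}_{m_1}[\langle X,\beta\rangle_K]+\tfrac12\|\beta\|_K^2-\log\tfrac{1-p}{p}.
\end{align*}
```

The resulting intercept agrees with the expression for $\beta_{0}$ given in the statement of part (a).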

(b) Under the assumptions, Theorem 6.1 in Rao and Varadarajan (1963) gives the following expression:

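$$\log\Big(\frac{\mathrm{d}P_{m_{1}}}{\mathrm{d}P_{m_{0}}}(x)\Big)=\langle x-m_{0},\,\mathcal{K}^{-1}(m_{1}-m_{0})\rangle_{2}-\frac{1}{2}\,\langle m_{1}-m_{0},\,\mathcal{K}^{-1}(m_{1}-m_{0})\rangle_{2},$$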

for $x\in L^{2}[0,1]$ . This entails (using the Chain Rule for Radon-Nikodym derivatives)

$$\frac{\mathrm{d}P_{m_{0}}}{\mathrm{d}P_{m_{1}}}(x)=C\exp(-\langle x,\beta\rangle_{2}),$$

where $\beta={\mathcal{K}}^{-1}(m_{1}-m_{0})$ and $C=\exp(\langle m_{0}+m_{1},\beta\rangle_{2}/2)$. Now, replacing this expression in (5) we get the $L^{2}$-model (1) with $\beta_{0}=-\log\big(\frac{1-p}{p}C\big)$.
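For completeness, the algebra behind the constant $C$ and the intercept $\beta_{0}$ can be written out. The sketch below starts from the Rao and Varadarajan expression $\log\big(\frac{\mathrm{d}P_{m_{1}}}{\mathrm{d}P_{m_{0}}}(x)\big)=\langle x-m_{0},\beta\rangle_{2}-\frac{1}{2}\langle m_{1}-m_{0},\beta\rangle_{2}$ with $\beta={\mathcal{K}}^{-1}(m_{1}-m_{0})$, and uses only bilinearity of the inner product:

```latex
\begin{align*}
\log\frac{\mathrm{d}P_{m_1}}{\mathrm{d}P_{m_0}}(x)
 &=\langle x-m_0,\beta\rangle_2-\tfrac12\langle m_1-m_0,\beta\rangle_2
  =\langle x,\beta\rangle_2-\tfrac12\langle m_0+m_1,\beta\rangle_2,\\
\frac{\mathrm{d}P_{m_0}}{\mathrm{d}P_{m_1}}(x)
 &=\exp\big\{\tfrac12\langle m_0+m_1,\beta\rangle_2\big\}\,\exp\{-\langle x,\beta\rangle_2\}
  =C\exp(-\langle x,\beta\rangle_2),\\
\mathbb{P}(Y=1\mid X=x)
 &=\Big(1+\tfrac{1-p}{p}\,C\,e^{-\langle x,\beta\rangle_2}\Big)^{-1}
  =\Big(1+\exp\{-\beta_0-\langle\beta,x\rangle_2\}\Big)^{-1},
 \qquad \beta_0=-\log\Big(\tfrac{1-p}{p}\,C\Big).
\end{align*}
```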

(c) Also as a consequence of Theorem 6.1 in Rao and Varadarajan (1963), if $m_{1}-m_{0}\notin\mathcal{K}(L^{2})$ it is not possible to express the Radon-Nikodym derivative in terms of inner products in $L_{2}$ or, equivalently, there is no continuous linear functional $L(x)$ and $c\in\mathbb{R}$ such that $\log(\frac{\mathrm{d}P_{m_{1}}}{\mathrm{d}P_{m_{0}}}(x))=L(x)+c$. Finally, the last sentence of the statement is a consequence of Theorem 5A of Parzen (1961a). ∎

*Some comments on the meaning of Theorem 1*

Similarly to the finite-dimensional case, our model holds when the conditional distributions of the process given the two possible values of $Y$ are Gaussian with the same covariance structure. Another interesting property of this new RKHS-based model is that for some particular choices of the slope function, of type $\beta(\cdot)=\sum_{i=1}^{p}a_{i}K(t_{i},\cdot)$, the model (4) amounts to a finite-dimensional logistic regression model for which the explanatory variables are a finite number of projections of the trajectories of the process. Thus, the impact-point model studied by Lindquist and McKeague (2009) appears as a particular case of the RKHS-based model and, more generally, model (4) can be seen as a true extension of the finite-dimensional logistic regression model, which is obtained when a finite-dimensional covariance matrix plays the role of the kernel. As an important by-product, this provides a mathematical ground for variable selection in logistic regression.

Part (b) of this theorem has been recently observed by Petrovich et al. (2019) (see their Theorem 1), without reference to RKHS theory. Note that, in general, $m_{1}-m_{0}\notin\mathcal{K}(L^{2})$ does not imply that $P_{m_{1}}$ and $P_{m_{0}}$ are orthogonal (that is, mutually singular). Parts (a) and (c) of the theorem above clarify this point.

On the other hand, in order to better interpret the above theorem in RKHS terms, let us recall that the space ${\mathcal{H}}(K)$ can be also defined as the image of the square root of the covariance operator defined in (2) (e.g. Definition 7.2 of Peszat and Zabczyk (2007)),

$$\mathcal{H}(K)=\{\mathcal{K}^{1/2}(f),\ f\in L^{2}[0,1]\},$$

where now the inner product is defined, for $f,g\in{\mathcal{H}}(K)$ , as

$$\langle f,g\rangle_{K}=\langle\mathcal{K}^{-1/2}(f),\mathcal{K}^{-1/2}(g)\rangle_{2}.$$

It can be seen that this definition of ${\mathcal{H}}(K)$ is equivalent to that given in Section 1. Then, from part (c) of the theorem it follows that the RKHS functional logistic regression can be seen as a generalization of the usual $L_{2}$ functional logistic regression model, in the sense that this $L_{2}$ model is recovered when a higher degree of smoothness on the mean functions is imposed (since clearly $\mathcal{K}(L^{2})\subsetneq{\mathcal{H}}(K)$ ). Indeed, the functions in $\mathcal{K}(L^{2})$ are convolutions of the functions in $L^{2}[0,1]$ with the covariance function of the process. The discussion of the next section makes clear that this difference is of key importance in practice and not merely a technicality.

As mentioned above, in the finite dimensional case the logistic model holds whenever $X|Y=i$ are Gaussian and homoscedastic, but in fact this model is more general in the sense that it also holds for other non-Gaussian assumptions on the conditional distributions $X|Y=i$ . Clearly this is also the case for the functional logistic model (4). In fact, the connection between the functional model and the finite-dimensional one is even deeper, as we will show in the following section.

3 The RKHS model: some important particular cases

Dimension reduction in the functional logistic regression model may often be appropriate in terms of interpretability of the model and classification accuracy. This reduction must be done losing as little information as possible. We propose to perform variable selection on the curves. By variable selection we mean replacing each curve $x_{i}$ by the finite-dimensional vector $(x_{i}(t_{1}),\ldots,x_{i}(t_{p}))$, for some $t_{1},\ldots,t_{p}$ chosen in an optimal way. In this section we analyze under which conditions it is possible to perform functional variable selection, which is only feasible under the RKHS model. In the following section we suggest how to do it: the idea is to incorporate the points $t_{1},\ldots,t_{p}$ into the estimation procedure as additional parameters (in particular, into the modified maximum likelihood estimator we propose).

Whenever the slope function $\beta$ has the form

$$\beta(\cdot)=\sum_{j=1}^{p}\beta_{j}K(t_{j},\cdot),$$

the model in (4) is reduced to the finite-dimensional one,

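$$\mathbb{P}(Y=1\mid X)=\bigg(1+\exp\Big\{-\beta_{0}-\sum_{j=1}^{p}\beta_{j}\big(X(t_{j})-m(t_{j})\big)\Big\}\bigg)^{-1}.\qquad (7)$$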

The main difference between the standard finite-dimensional model and this one is that now the proper choice of the points $T=(t_{1},\ldots,t_{p})\in[0,1]^{p}$ is a part of the estimation procedure. In this sense, model (7) is truly functional since we will use the whole trajectories $x_{i}(t)$ to select the points. This fact leads to a critical difference between the functional and the multivariate problems. Then, our aim is to approximate the general model described by Equation (4) with finite-dimensional models as those of Equation (7). This amounts to get an approximation of the slope function in terms of a finite linear combination of kernel evaluations $K(t_{j},\cdot)$ . This model, for $p=1$ and a particular type of Gaussian process $X$ , is analyzed in Lindquist and McKeague (2009).
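A grid analogue of this reduction can be checked numerically. The sketch below is a finite-dimensional illustration, not the functional result itself: it assumes the Brownian covariance $K(s,t)=\min(s,t)$ evaluated on a grid, a single hypothetical impact point $t_{0}=0.3$ and a mean difference $m_{1}-m_{0}=c\,K(t_{0},\cdot)$. In that case the Gaussian log-likelihood ratio depends on $x$ only through $x(t_{0})$.

```python
import numpy as np

# Grid analogue of the impact-point case (assumption: K(s, t) = min(s, t)).
grid = np.linspace(0.01, 1.0, 100)      # t = 0 is left out because X(0) = 0 is degenerate
Sigma = np.minimum.outer(grid, grid)    # covariance matrix K(t_i, t_j)

j0, c = 29, 1.5                         # hypothetical impact point t0 = grid[29] = 0.30
m0 = np.zeros_like(grid)
m1 = m0 + c * Sigma[:, j0]              # m1 - m0 = c * K(t0, .)

# Sigma^{-1}(m1 - m0) plays the role of the slope; here it equals c * e_{j0}.
w = np.linalg.solve(Sigma, m1 - m0)

def log_ratio(x):
    """log of the density ratio N(m1, Sigma) / N(m0, Sigma) evaluated at x."""
    return (x - m0) @ w - 0.5 * (m1 - m0) @ w

def posterior(x, p=0.5):
    """P(Y = 1 | X = x) with prior p = P(Y = 1)."""
    return 1.0 / (1.0 + (1.0 - p) / p * np.exp(-log_ratio(x)))

rng = np.random.default_rng(1)
x = rng.normal(size=grid.size)          # an arbitrary vector on the grid

# The same log-ratio computed from the single coordinate x(t0) only.
one_point = c * (x[j0] - m0[j0]) - 0.5 * c**2 * Sigma[j0, j0]
```

Here `log_ratio(x)` and `one_point` coincide up to rounding, so the posterior probability is a logistic function of the single value $x(t_{0})$.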

From the discussion above, it is clear that the differences between the RKHS model and the $L^{2}$ one are not minor technical questions. The functions of type $\beta(\cdot)=K(\cdot,t)$ belong to ${\mathcal{H}}(K)$ but do not belong to $\mathcal{K}(L^{2})$ . This fact implies that within the setting of the RKHS model it is possible to regress $Y$ on any finite dimensional projection of $X$ , whereas this does not make sense if we consider the $L^{2}$ model. This feature is clearly relevant if one wishes to analyze properties of variable selection methods.

Theorem 2.

Assume model (4) holds. Then,

(a) If there exist a positive integer $p$, $\beta_{1},\ldots,\beta_{p}\in\mathbb{R}$, and $t_{1},\ldots,t_{p}\in[0,1]$ such that $\beta(\cdot)=\sum_{j=1}^{p}\beta_{j}K(\cdot,t_{j})$, then

$$\mathbb{P}(Y=1\mid X)=\frac{1}{1+\exp\Big\{-\beta_{0}-\sum_{j=1}^{p}\beta_{j}\big(X(t_{j})-m(t_{j})\big)\Big\}}.$$

(b) If $\beta\in\mathcal{K}(L^{2})=\{\mathcal{K}(f):\,f\in L^{2}[0,1]\}$, then

$$\mathbb{P}(Y=1\mid X)=\frac{1}{1+\exp\big\{-\beta_{0}-\langle\alpha,X-m\rangle_{2}\big\}},$$

where $\alpha\in L^{2}[0,1]$ fulfills $\beta=\mathcal{K}(\alpha)$.

(c) Let $\{u_{j}\}$ be an orthonormal basis of $L^{2}[0,1]$. If there exist a positive integer $p$ and $\beta_{1},\ldots,\beta_{p}\in\mathbb{R}$ such that $\beta=\sum_{j=1}^{p}\beta_{j}\mathcal{K}(u_{j})$, then

$$\mathbb{P}(Y=1\mid X)=\frac{1}{1+\exp\Big\{-\beta_{0}-\sum_{j=1}^{p}\beta_{j}\langle u_{j},X-m\rangle_{2}\Big\}}.$$

Proof.

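(a) Observe that for $j=1,\ldots,p$, $X(t_{j})-m(t_{j})\in\mathcal{L}_{0}(K)$, and for all $s\in[0,1]$,

$$\Psi_{X}\big(X(t_{j})-m(t_{j})\big)(s)=\mathbb{E}\big[(X(t_{j})-m(t_{j}))(X(s)-m(s))\big]=K(s,t_{j}).$$

Therefore $\Psi_{X}^{-1}(K(\cdot,t_{j}))=X(t_{j})-m(t_{j})$, and

$$\langle X,\beta\rangle_{K}=\sum_{j=1}^{p}\beta_{j}\Psi_{X}^{-1}(K(\cdot,t_{j}))=\sum_{j=1}^{p}\beta_{j}\big(X(t_{j})-m(t_{j})\big).$$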

(b) Let $U:=\int_{0}^{1}\alpha(t)(X(t)-m(t))dt$ . It holds that $U\in\mathcal{L}(X)$ (see e.g. Ash and Gardner (2014), page 34). Moreover, by Fubini’s theorem, for all $s\in[0,1]$ we have

$$\Psi_{X}(U)(s)=\mathbb{E}\big[U\,(X(s)-m(s))\big]=\int_{0}^{1}K(s,t)\alpha(t)\,dt=\beta(s), \qquad (8)$$

because $\mathcal{K}(\alpha)=\beta$ . Therefore, $\langle X,\beta\rangle_{K}=U=\int_{0}^{1}\alpha(t)(X(t)-m(t))dt$ .

(c) Putting $\alpha(t)=u_{j}(t)$ in (8) we get $\Psi_{X}(\langle u_{j},X-m\rangle_{2})=\mathcal{K}(u_{j})$ . As a consequence,

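$$\langle X,\beta\rangle_{K}=\sum_{j=1}^{p}\beta_{j}\Psi_{X}^{-1}(\mathcal{K}(u_{j}))=\sum_{j=1}^{p}\beta_{j}\langle u_{j},X-m\rangle_{2}.$$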

∎

*Some comments on the meaning of Theorem 2*

Part (a) of the previous result means that the impact point model, as that considered in Lindquist and McKeague (2009), is a particular case of the RKHS model (4). Just take as parameter function $\beta$ a finite linear combination of evaluations of $K$ .

Part (b) implies that the usual functional model based on the $L^{2}$ inner product is also a particular case of (4). What we need is that $\beta$ belongs to the image of the covariance operator, $\mathcal{K}(L^{2})$ . Notice that this condition is stronger than $\beta\in\mathcal{H}(K)=\mathcal{K}^{1/2}(L^{2})$ . As illustrated by part (a), the difference between $\mathcal{K}^{1/2}(L^{2})$ and $\mathcal{K}(L^{2})$ may be important in practice.

A very common methodology for fitting functional regression models consists in projecting the functional regressors on a subspace spanned by a finite set of orthonormal functions $u_{1},\ldots,u_{p}$, and using the projections as regressor variables. Part (c) implies that model (4) also includes this situation for $\beta$ in the span of $\mathcal{K}(u_{1}),\ldots,\mathcal{K}(u_{p})$. Note that if $\{u_{j}\}$ is the orthonormal basis of eigenfunctions of $\mathcal{K}$, then $\mathcal{K}(u_{j})$ is proportional to $u_{j}$, and the condition on $\beta$ reduces to requiring that $\beta$ belongs to the span of $u_{1},\ldots,u_{p}$. If this is the case, there is no loss in using the first $p$ principal components of the regressors instead of the whole trajectories.
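
To make the condition $\beta\in\mathcal{K}(L^{2})$ of part (b) more concrete, the following Python sketch approximates the covariance operator $\mathcal{K}(\alpha)(s)=\int_{0}^{1}K(s,t)\alpha(t)\,dt$ by a midpoint rule. The choice of the Brownian covariance $K(s,t)=\min(s,t)$, the function names and the number of quadrature nodes are assumptions of this example, not part of the original paper.

```python
def brownian_kernel(s, t):
    """Covariance function of the standard Brownian motion."""
    return min(s, t)

def covariance_operator(alpha, s, n_nodes=2000):
    """Midpoint-rule approximation of (K alpha)(s) = int_0^1 K(s, t) alpha(t) dt."""
    h = 1.0 / n_nodes
    return h * sum(
        brownian_kernel(s, (k + 0.5) * h) * alpha((k + 0.5) * h)
        for k in range(n_nodes)
    )

# For alpha = 1 the integral has the closed form s - s^2 / 2.
beta_at_half = covariance_operator(lambda t: 1.0, 0.5)
closed_form = 0.5 - 0.5 ** 2 / 2
```

In this Brownian example, $\mathcal{K}(\alpha)$ has the continuous derivative $\int_{s}^{1}\alpha(t)\,dt$, whereas $K(\cdot,t_{0})=\min(\cdot,t_{0})$ has a kink at $t_{0}$, which is one way to see why kernel sections belong to ${\mathcal{H}}(K)$ but not to $\mathcal{K}(L^{2})$.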

4 Maximum likelihood estimation: non-existence results

In the finite-dimensional setting, it is well known that the maximum likelihood (ML) estimator does not exist when there is a hyperplane separating the observations of the two classes; see below for details. As we will show in this section, the situation worsens dramatically for functional data; more specifically, we will see that:

For a wide class of processes (including the Brownian motion), the MLE does not exist, with probability one (see Subsection 4.1).
Under some different conditions, in the Gaussian case, the probability of non-existence of the MLE tends to one as the sample size tends to infinity (see Subsection 4.2).

A brief overview of the finite dimensional case

Despite the fact that ML estimation of the slope function for multiple logistic regression is widely used, it has an important drawback that is sometimes overlooked. Given a sample $x_{i}^{0}\in{\mathbb{R}}^{d}$ for $i=1,\ldots,n_{0}$ drawn from population zero and another sample $x_{i}^{1}\in{\mathbb{R}}^{d}$ for $i=1,\ldots,n_{1}$ drawn from population one, the classical MLE in logistic regression is the vector $(b_{0},b)\in{\mathbb{R}}\times{\mathbb{R}}^{d}$ that maximizes the log-likelihood

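$$L(b_{0},b)=\sum_{i=1}^{n_{1}}\log\frac{1}{1+\exp\{-b_{0}-\langle b,x_{i}^{1}\rangle\}}+\sum_{i=1}^{n_{0}}\log\frac{1}{1+\exp\{b_{0}+\langle b,x_{i}^{0}\rangle\}}.$$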

The existence and uniqueness of such a maximum were carefully studied by Albert and Anderson (1984) (and previously by Silvapulle (1981) and Gourieroux and Monfort (1981)). As stated in Theorem 1 of Albert and Anderson (1984), the latter expression can be made arbitrarily close to zero (note that the log-likelihood is always negative) whenever the samples of the two populations are linearly separable. In that case the maximum cannot be attained and the MLE does not exist (the idea behind the proof is similar to that of Theorem 3 below). There is another scenario in which this estimator does not exist: the samples are linearly separable except for some points of both populations that fall on the separating hyperplane (so-called “quasicomplete separation”). In this case the supremum of the log-likelihood function is strictly less than zero, but it is still not attained.
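
The effect of linear separability can be checked numerically. The following Python sketch evaluates the logistic log-likelihood (written here as a plain sum over both samples, which is an assumption about the normalization) along the ray $c\cdot b$ for a direction $b$ that separates two toy samples. The data, the direction and the function names are hypothetical and only serve the illustration.

```python
import math

def log_sigmoid(z):
    """Numerically stable log(1 / (1 + exp(-z)))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))

def log_likelihood(b0, b, sample0, sample1):
    """Logistic log-likelihood for a class-0 and a class-1 sample in R^d."""
    def score(x):
        return b0 + sum(bj * xj for bj, xj in zip(b, x))
    return (sum(log_sigmoid(score(x)) for x in sample1)
            + sum(log_sigmoid(-score(x)) for x in sample0))

# Toy samples separated by the hyperplane {x : x_1 = 0}.
sample0 = [(-1.0, 0.5), (-2.0, -1.0)]
sample1 = [(1.0, 0.3), (0.5, -2.0)]
direction = (1.0, 0.0)

# Log-likelihood along the ray c * direction for increasing c.
values = [
    log_likelihood(0.0, [c * d for d in direction], sample0, sample1)
    for c in (1.0, 10.0, 100.0)
]
```

The values increase towards zero as $c$ grows but stay negative, so no finite $(b_{0},b)$ attains the supremum.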

The likelihood function in the logistic functional model

Before going on with the functional case (which is our main target here), we need to derive the likelihood function. Let us assume that $\{X(s),s\in[0,1]\}$ follows the RKHS logistic model described in Equation (4). That is,

$$p_{\beta,\beta_{0}}(X)=\frac{1}{1+\exp\big\{-\beta_{0}-\langle\beta,X\rangle_{K}\big\}},$$

where $p_{\beta,\beta_{0}}(X)=\mathbb{P}(Y=1|X,\beta,\beta_{0})$ , $\beta_{0}\in{\mathbb{R}}$ and $\beta\in\mathcal{H}(K)$ . The random element $(X(\cdot),Y)$ takes values in the space $Z=L^{2}[0,1]\times\{0,1\}$ , which is a measurable space with measure $z=P_{X}\times\mu$ , where $P_{X}$ is the distribution induced by the process $X$ and $\mu$ is the counting measure on $\{0,1\}$ . We can define in $Z$ the measure $P_{(X,Y);\beta,\beta_{0}}$ , the joint probability induced by $(X(\cdot),Y)$ for a given slope function $\beta$ and an intercept $\beta_{0}$ . Then we define,

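$$f(x,y;\beta,\beta_{0})=\frac{\mathrm{d}P_{(X,Y);\beta,\beta_{0}}}{\mathrm{d}z}(x,y)=p_{\beta,\beta_{0}}(x)^{y}\,\big(1-p_{\beta,\beta_{0}}(x)\big)^{1-y}.$$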

In view of this density function, the log-likelihood function for a given sample in $L^{2}[0,1]\times\{0,1\}$ is

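$$L_{n}(\beta,\beta_{0})=\frac{1}{n}\sum_{i=1}^{n}\Big[y_{i}\log p_{\beta,\beta_{0}}(x_{i})+(1-y_{i})\log\big(1-p_{\beta,\beta_{0}}(x_{i})\big)\Big],$$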

where $(x_{i},y_{i})\in L^{2}[0,1]\times\{0,1\}$ is a sample of the underlying random variable $(X,Y)$ .

The maximum likelihood estimator is the pair $(\widehat{\beta},\widehat{\beta}_{0})$ that maximizes this function $L_{n}$ . The population counterpart of $L_{n}$ is the expected log-likelihood function,

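$$L(\beta,\beta_{0})=\mathbb{E}_{Z}\Big[Y\log p_{\beta,\beta_{0}}(X)+(1-Y)\log\big(1-p_{\beta,\beta_{0}}(X)\big)\Big],$$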

where ${\mathbb{E}}_{Z}[\cdot]$ denotes the expectation with respect to the measure $\mathrm{d}z$ .

The main idea behind ML estimation remains valid in the infinite-dimensional situation. If our “parameter space” is $\Theta\subset{\mathcal{H}}(K)\times{\mathbb{R}}$ and the “true” value of the parameter is $(\beta^{*},\beta_{0}^{*})\in\Theta$, then a simple, standard argument based on Jensen’s inequality shows that the population log-likelihood function $L(\beta,\beta_{0})$ fulfils

$$L(\beta^{*},\beta_{0}^{*})\geq L(\beta,\beta_{0})\quad\text{for all }(\beta,\beta_{0})\in\Theta.$$

This leads to the usual, natural idea of maximizing a consistent estimator of $L(\beta,\beta_{0})$ which, in our logistic model, is the log-likelihood function $L_{n}(\beta,\beta_{0})$ defined above.

4.1 Non-existence of the MLE in functional settings

We first show that, when moving from the finite-dimensional model to the functional one, the problem of the non-existence of the MLE is drastically worsened.

This situation is quite similar to that arising, for example, in non-parametric density estimation, where non-parametric (and non-penalized) ML estimators of the density function do not exist unless some drastic restrictions, such as monotonicity (e.g., Grenander (1981)) or log-concavity (see Cule et al. (2010)), are imposed on the underlying density function.

Since the analogous non-existence result for the functional logistic regression model is perhaps not so direct, it is established in Theorem 3 below. We confine ourselves to the RKHS-based model (4), although the result can be easily extended, with a completely similar method of proof, to the standard $L^{2}$-based model of Equation (1).

We first need to establish a condition which plays, in the functional case, a role similar to that of the linear separability condition mentioned above in the setting of finite-dimensional logistic regression.

Assumption 1 (SC).

The multivariate process $Z(t)=(X_{1}(t),\ldots,X_{n}(t))$, $t\in[0,1]$, satisfies the “Sign Choice” (SC) property when, for every possible choice of signs $(s_{1},\ldots,s_{n})$, where each $s_{j}$ is either $+$ or $-$, we have that, with probability one, there exists some $t_{0}\in[0,1]$ such that $\mbox{sign}(X_{1}(t_{0}))=s_{1},\ \ldots,\ \mbox{sign}(X_{n}(t_{0}))=s_{n}$.

Now, the non-existence result is as follows. Without loss of generality we confine ourselves to the case ${\mathbb{E}}(X(t))=0$ .

Theorem 3.

Let $X(s)$, $s\in[0,1]$, be an $L^{2}$ stochastic process with ${\mathbb{E}}[X(s)]=0$. Denote by $K$ the corresponding covariance function. Consider a logistic model (4) based on $X(s)$. Let $X_{1},\ldots,X_{n}$ be independent copies of $X$. Assume that the $n$-dimensional process $Z_{n}(s)=(X_{1}(s),\ldots,X_{n}(s))$ fulfills the SC property. Then, with probability one, the MLE of $(\beta,\beta_{0})$ (i.e., the maximizer of the log-likelihood function $L_{n}(\beta,\beta_{0})$) does not exist for any sample size $n$.

Proof.

Let $x_{1}(s),\ldots,x_{n}(s)$ be a random sample drawn from $X(s)$. From the SC assumption there is (with probability 1) one point $t_{0}$ such that $x_{i}(t_{0})>0$ for all $i$ such that $y_{i}=1$ and $x_{i}(t_{0})<0$ for those indices $i$ with $y_{i}=0$. Note that the sample log-likelihood function can be split into two terms, as follows,

$$L_{n}(\beta,\beta_{0})=\frac{1}{n}\Big[\sum_{i:\,y_{i}=1}\log p_{\beta,\beta_{0}}(x_{i})+\sum_{i:\,y_{i}=0}\log\big(1-p_{\beta,\beta_{0}}(x_{i})\big)\Big].$$

Note also that $L_{n}(\beta,\beta_{0})\leq 0$ for all $\beta$. Now, take a numerical sequence $0<c_{m}\uparrow\infty$ and define

$$\beta_{m}(\cdot)=c_{m}K(\cdot,t_{0}).$$

Then, by the definition of Loève’s isometry, if $y_{i}=1$,

$$\langle\beta_{m},x_{i}\rangle_{K}=c_{m}\,x_{i}(t_{0})\to\infty\quad\text{as }m\to\infty,$$

since we have taken $t_{0}$ such that $x_{i}(t_{0})>0$ for those indices $i$ with $y_{i}=1$. Likewise, $\langle\beta_{m},x_{i}\rangle_{K}$ goes to $-\infty$ whenever $y_{i}=0$, since we have chosen $t_{0}$ such that $x_{i}(t_{0})<0$ for those indices. As a consequence, $L_{n}(\beta_{m},0)\to 0$ as $m\to\infty$. Since $L_{n}(\beta,\beta_{0})<0$ for every $(\beta,\beta_{0})$, the log-likelihood approaches its supremum, zero, without ever attaining it, so that the MLE does not exist. ∎
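
The construction used in the proof of Theorem 3 can be mimicked on discretized curves. The following Python sketch looks for a grid point $t_{0}$ at which the signs of the curves match the labels and then evaluates the log-likelihood along $\beta_{m}=c\,K(\cdot,t_{0})$, using the identity $\langle\beta_{m},x_{i}\rangle_{K}=c\,x_{i}(t_{0})$. The three toy curves, the grid, the averaging over the sample and the function names are assumptions of this example; on a finite grid such a point need not exist, in which case the search returns `None`.

```python
import math

def find_sign_point(curves, labels):
    """First grid index k with x_i(t_k) > 0 if y_i = 1 and x_i(t_k) < 0 if y_i = 0."""
    for k in range(len(curves[0])):
        if all((x[k] > 0) if y == 1 else (x[k] < 0)
               for x, y in zip(curves, labels)):
            return k
    return None

def log_lik_on_ray(curves, labels, k, c):
    """L_n(beta, 0) for beta = c * K(., t_k), using <beta, x_i>_K = c * x_i(t_k)."""
    total = 0.0
    for x, y in zip(curves, labels):
        # Signed score: positive for every i when k comes from find_sign_point.
        z = c * x[k] if y == 1 else -c * x[k]
        total += -math.log1p(math.exp(-z))  # log-probability of the observed label
    return total / len(curves)

grid = [i / 100 for i in range(101)]
curves = [
    [t - 0.3 for t in grid],
    [0.6 - t for t in grid],
    [math.sin(10 * t) for t in grid],
]
labels = [1, 0, 1]

k0 = find_sign_point(curves, labels)
ray = [log_lik_on_ray(curves, labels, k0, c) for c in (1.0, 10.0, 100.0, 1000.0)]
```

Along the ray the log-likelihood increases towards zero without reaching it, which is the mechanism behind the non-existence of the maximizer.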

Remark 1.

A non-existence result for the MLE, analogous to that of Theorem 3, can also be obtained with a very similar reasoning for the $L^{2}$-based logistic model of Equation (1). The main difference in the proof would be the construction of $\beta_{m}$ which, in the $L^{2}$ case, should be obtained as an approximation to the identity (that is, a linear “quasi Dirac delta”) centered at the point $t_{0}$.

Although the SC property could seem a somewhat restrictive assumption, the following proposition shows that it applies to some important and non-trivial situations.

Proposition 1.

(a) The $n$ -dimensional Brownian motion fulfills the SC property.

(b) The same holds for any other $n$ -dimensional process in $[0,1]$ whose independent marginals have a distribution absolutely continuous with respect to that of the Brownian motion.

Proof.

(a) Given the $n$ dimensional Brownian motion ${\mathcal{B}}_{n}=(B_{1},\ldots,B_{n})$ , where the $B_{j}$ are independent copies of the standard Brownian motion $B(t)$ , $t\in[0,1]$ , take a sequence of signs $(s_{1},...,s_{n})$ and define the event

[TABLE]

We may express this event by

[TABLE]

where, for each $t\in(0,1]\cap\mathbb{Q}$ ,

[TABLE]

Now, the result follows directly from Blumenthal’s 0-1 law for $n$-dimensional Brownian motion (see, e.g., Mörters and Peres (2010, p. 38)). This result establishes that for any event $A\in{\mathcal{F}}^{+}(0)$ we have either ${\mathbb{P}}(A)=0$ or ${\mathbb{P}}(A)=1$. Here ${\mathcal{F}}^{+}(0)$ denotes the germ $\sigma$-algebra of events depending only on the values of ${\mathcal{B}}_{n}(t)$ where $t$ lies in an arbitrarily small interval on the right of 0. More precisely,

[TABLE]

From (10) and (11) it is clear that the above defined event $A$ belongs to the germ $\sigma$ -algebra ${\mathcal{F}}^{+}(0)$ . However, we cannot have ${\mathbb{P}}(A)=0$ since (from the symmetry of the Brownian motion) for any given $t_{0}$ the probability of $\mbox{sign}(B_{j}(t_{0}))=s_{j},\ j=1,\ldots,n$ is $1/2^{n}$ . So, we conclude ${\mathbb{P}}(A)=1$ as desired.

(b) If $X(t)$ is another process whose distribution is absolutely continuous with respect to that of the n-dimensional Brownian motion ${\mathcal{B}}_{n}$ , then the set $A$ , defined by (10) and (11) in terms of ${\mathcal{B}}_{n}$ has also probability one when it is defined in terms of the process $X(t)$ : recall that, from the definition of absolute continuity, if the set $A^{c}$ has probability zero under the Brownian motion, then its probability must be zero as well when $B(t)$ is replaced with $X(t)$ . Therefore, the probability of $A$ under $X=X(t)$ must be one. ∎
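
A small simulation can give some intuition about part (a). The following Python sketch simulates independent Brownian paths on a grid and records how often a prescribed sign pattern is observed at some grid time, both on the full grid and on a coarser sub-grid. The grid sizes, the number of replications, the seed and the function names are arbitrary choices for this example; a finite grid can only detect the pattern at finitely many times, so the observed frequencies are lower bounds for the probability in the SC property.

```python
import random

def brownian_paths(n_paths, n_steps, rng):
    """Simulate n_paths independent standard Brownian motions on a grid of (0, 1]."""
    sd = (1.0 / n_steps) ** 0.5
    paths = []
    for _ in range(n_paths):
        value, path = 0.0, []
        for _ in range(n_steps):
            value += rng.gauss(0.0, sd)
            path.append(value)
        paths.append(path)
    return paths

def has_sign_pattern(paths, signs, stride=1):
    """True if, at some retained grid time, sign(B_j(t)) = signs[j] for all j."""
    return any(
        all(p[k] * s > 0 for p, s in zip(paths, signs))
        for k in range(stride - 1, len(paths[0]), stride)
    )

rng = random.Random(0)
signs = (+1, -1, +1)
n_rep = 100
fine = coarse = 0
for _ in range(n_rep):
    paths = brownian_paths(len(signs), 1000, rng)
    fine += has_sign_pattern(paths, signs)               # all 1000 grid times
    coarse += has_sign_pattern(paths, signs, stride=100)  # only 10 grid times
```

Since the coarse grid is a subset of the fine one, `coarse` can never exceed `fine`; refining the grid, especially near zero, can only increase the detection frequency.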

Remark 2.

Following the comment in Mörters and Peres (2010) about processes with the strong Markov property, this result based on RKHS theory can be extended to Lévy processes whenever the covariance function is continuous (as for the Poisson process on the real line). Note, however, that apart from the Brownian motion, processes of this type have discontinuous trajectories.

The situation considered in Theorem 3 would be the functional counterpart of having a finite-dimensional problem where the supports of both classes (0 and 1) are linearly separable. However, as we have just seen, this separability issue does not only appear in degenerate problems in the functional setting. In the next section we suggest a technique to completely avoid the problem.

From a theoretical perspective, in view of Theorem 3, it is clear that there is no hope of obtaining a general convergence result of the standard maximum likelihood estimator (MLE) defined by the maximization of the likelihood function $L_{n}(\beta,\beta_{0})$ . That is, one should define a different estimator or impose some conditions on the process $X$ to avoid the SC property. For instance, Lindquist and McKeague (2009) prove consistency results of the model with a single impact point $\theta\in[0,1]$ for processes $X(t)=Z+B_{\theta}(t)$ , where $B_{\theta}$ is a two-sided Brownian motion centered in $\theta$ (i.e. two independent Brownian motions starting at $\theta$ and running in opposite directions) and $Z$ is a real random variable independent of $B_{\theta}$ . Then, due to the independence assumption, it is clear that accumulation points (like 0 for the Brownian motion) are avoided.

4.2 Asymptotic non-existence for Gaussian processes

In the previous section we have seen that the problem of non-existence of the MLE is aggravated in the case of functional data. But this is not the only issue with the MLE in functional logistic regression. In this section we show that the probability that the MLE does not exist goes to one as the sample size increases, for any Gaussian process satisfying very mild assumptions.

We use the following notation: for $T=\{t_{1},\ldots,t_{p}\}\subset[0,1]$ and $f\in L^{2}[0,1]$ , let $f(T):=(f(t_{1}),\ldots,f(t_{p}))^{\prime}$ and let $\Sigma_{T}$ be the $p\times p$ matrix whose $(i,j)$ entry is $K(t_{i},t_{j})$ .

Theorem 4.

Let $(x_{1},y_{1}),\ldots,(x_{n},y_{n})$ be a random sample of independent observations satisfying model (4). Assume that $X$ is a Gaussian process such that $K$ is continuous and $\Sigma_{T}$ is invertible for any finite set $T\subset(0,1)$. Then

$$\lim_{n\to\infty}\mathbb{P}(\text{MLE exists})=0.$$

Proof.

Let $\beta^{*}\in\mathcal{H}_{K},\beta_{0}^{*}$ be the true values of the parameters. Since $\|\beta^{*}\|_{K}<\infty$ , we have $h(\beta_{0}^{*},\|\beta^{*}\|_{K})<\infty$ , where $h$ is the function defined in Candès and Sur (2020), Equation (2.2) (see Remark 3 below). Let $p_{n}$ be an increasing sequence of natural numbers such that $\lim_{n\to\infty}p_{n}/n=\kappa>h(\beta_{0}^{*},\|\beta^{*}\|_{K})$ . Consider the set of equispaced points $0<t_{1}<t_{2}<\cdots<t_{p_{n}}<1$ and denote $T_{n}=\{t_{1},\ldots,t_{p_{n}}\}$ . Define $\alpha_{T_{n}}=\Sigma_{T_{n}}^{-1}\beta^{*}(T_{n})$ . Now, consider the following sequence of finite-dimensional logistic regression models

[TABLE]

where $a^{\prime}b$ stands here for the inner product of two vectors $a,b$ in ${\mathbb{R}}^{p_{n}}$ , and the following sequence of events

[TABLE]

Recall that the event $E_{n}$ amounts to non-existence of MLE for finite-dimensional logistic regression models (see Albert and Anderson (1984)).

Now let us prove the validity of condition (1.3) in Candès and Sur (2020), which is required for the validity of Theorem 2.1. in that paper. In our case, such condition amounts to

[TABLE]

but this directly follows from Theorem 6E of Parzen (1959). Since $\lim_{n\to\infty}p_{n}/n=\kappa>h(\beta_{0}^{*},\|\beta^{*}\|_{K})$, we can apply Theorem 2.1 in Candès and Sur (2020) to get $\lim_{n}\mathbb{P}(E_{n})=1$.

Now we define the auxiliary sequence of events

[TABLE]

with strict inequalities. Assume that $\widetilde{E}_{n}$ happens so that there exists a separating hyperplane defined by $\alpha\in\mathbb{R}^{p_{n}}$ . Then, in the same spirit as in the proof of Theorem 3, it is possible to show that if $\hat{\beta}_{m,n}=m\sum_{j=1}^{p_{n}}\alpha_{j}K(\cdot,t_{j})\in\mathcal{H}_{K}$ , then $\lim_{m\to\infty}L_{n}(\hat{\beta}_{m,n},0)=0$ , where $L_{n}(\beta,\beta_{0})$ is the log-likelihood function. As a consequence, for all $n$ , if $\widetilde{E}_{n}$ happens, then the MLE for the RKHS functional logistic regression model does not exist. The result follows from the fact that $\mathbb{P}(E_{n})=\mathbb{P}(\widetilde{E}_{n})$ and the events $\alpha^{\prime}x_{i}(T_{n})=0$ have probability zero since we are assuming that the process does not have degenerate marginals. ∎
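
The role of the ratio $p_{n}/n$ can be seen in an extreme case that is much cruder than the argument above: when the number of evaluation points equals the sample size and the vectors $x_{i}(T_{n})$ are linearly independent, a separating $\alpha$ can be found for any labelling by solving a linear system. The following Python sketch illustrates this with a hand-written Gaussian elimination; the toy matrix and the function names are hypothetical, and this is not the route followed by Candès and Sur (2020), whose threshold for $\kappa$ is given by the function $h$.

```python
def solve_linear(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting (a square, invertible)."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x

def separating_direction(evaluations, labels):
    """alpha with alpha' x_i(T) = +1 if y_i = 1 and -1 if y_i = 0 (requires p = n)."""
    targets = [1.0 if y == 1 else -1.0 for y in labels]
    return solve_linear(evaluations, targets)

# Rows are the vectors x_i(T) of two curves evaluated at two points.
evaluations = [[2.0, 1.0], [1.0, 3.0]]
labels = [1, 0]
alpha = separating_direction(evaluations, labels)
margins = [sum(a * v for a, v in zip(alpha, row)) for row in evaluations]
```

Here the margins have strictly the right signs, so the event $\widetilde{E}_{n}$ occurs for this toy sample.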

Remark 3.

Theorem 2.1 in Candès and Sur (2020) is a remarkable result. It applies to finite-dimensional logistic regression models whose number $p$ of covariates grows to infinity with the sample size $n$ in such a way that $p/n\to\kappa$; the sample is given by data $(x_{i},y_{i})$, $i=1,\ldots,n$. Essentially, the result establishes that there is a critical value such that, if $\kappa$ is smaller than this critical value, one has $\lim_{n,p\to\infty}\mathbb{P}(\text{MLE exists})=1$, whereas if $\kappa$ is larger than it, $\lim_{n,p\to\infty}\mathbb{P}(\text{MLE exists})=0$. The critical value is given in terms of a function $h$ (mentioned in the proof of the previous result), which is defined as follows. Let us write $(\widetilde{Y},V)\sim F_{\beta_{0},\gamma_{0}}$ whenever $(\widetilde{Y},V)\overset{d}{=}(\widetilde{Y},\widetilde{Y}X)$, for $\widetilde{Y}=2Y-1$ (note that, in the notation of Candès and Sur (2020), the model is defined for a response variable coded in $\{-1,1\}$), $\beta_{0},\gamma_{0}\in\mathbb{R}$, $\gamma_{0}\geq 0$, and where $X\sim\mathcal{N}(0,1)$ and $\mathbb{P}(\widetilde{Y}=1|X)=(1+\exp\{-\beta_{0}-\gamma_{0}X\})^{-1}$. Now, define $h(\beta_{0},\gamma_{0})=\min_{t_{0},t_{1}\in\mathbb{R}}\mathbb{E}[(t_{0}\widetilde{Y}+t_{1}V-Z)_{+}^{2}]$, where $Z\sim\mathcal{N}(0,1)$ is independent of $(\widetilde{Y},V)$ and $x_{+}=\max\{x,0\}$. Then, Theorem 2.1 in Candès and Sur (2020) proves that the above-mentioned critical value for $\kappa$ is precisely $h(\beta_{0},\gamma_{0})$.
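
The function $h$ has no simple closed form in general, but it can be approximated by simulation. The following Python sketch draws a sample of $(\widetilde{Y},V,Z)$ as described in the remark and minimizes the empirical version of $\mathbb{E}[(t_{0}\widetilde{Y}+t_{1}V-Z)_{+}^{2}]$ by plain gradient descent, which is a reasonable choice because the objective is convex in $(t_{0},t_{1})$. The sample size, step size, number of iterations, seed and the name `estimate_h` are assumptions of this example, and the output is only a Monte Carlo approximation.

```python
import math
import random

def estimate_h(beta0, gamma0, n_samples=20000, n_iter=40, step=0.3, seed=0):
    """Monte Carlo estimate of h(beta0, gamma0) = min_t E[(t0*Y + t1*V - Z)_+^2]."""
    rng = random.Random(seed)
    data = []
    for _ in range(n_samples):
        x = rng.gauss(0.0, 1.0)
        p = 1.0 / (1.0 + math.exp(-beta0 - gamma0 * x))
        y = 1.0 if rng.random() < p else -1.0          # response coded in {-1, 1}
        data.append((y, y * x, rng.gauss(0.0, 1.0)))    # (Y, V, Z)
    t0 = t1 = 0.0
    for _ in range(n_iter):                             # gradient descent on the sample mean
        g0 = g1 = 0.0
        for y, v, z in data:
            r = max(t0 * y + t1 * v - z, 0.0)
            g0 += 2.0 * r * y
            g1 += 2.0 * r * v
        t0 -= step * g0 / n_samples
        t1 -= step * g1 / n_samples
    return sum(max(t0 * y + t1 * v - z, 0.0) ** 2 for y, v, z in data) / n_samples

h_null = estimate_h(0.0, 0.0)
```

In the null case $\beta_{0}=\gamma_{0}=0$ the objective is symmetric in $(t_{0},t_{1})$, so the minimum is attained at the origin and equals $\mathbb{E}[(-Z)_{+}^{2}]=1/2$, which the estimate should reproduce up to Monte Carlo error.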

5 The estimation of $\boldsymbol{\beta}$ in practice

The problem of non-existence of the MLE can be circumvented if the goal is variable selection. The main idea behind the proof of Theorem 4 is that one can approximate the functional model with finite approximations as those in (7) with $p$ increasing as fast as desired. Therefore, if we constrain $p$ to be less than a finite fixed value, Theorem 4 does not apply.

In order to sort out the non-existence problem for a given sample (due to the SC property), it would be enough to use a finite-dimensional estimator that is always defined, even for linearly separable samples. As mentioned, an extensive study of existence and uniqueness conditions of the MLE for multiple logistic regression can be found in the paper of Albert and Anderson (1984).

A simple, RKHS-motivated alternative would be as follows. In many cases one could assume that the “true parameter” $(\beta^{*},\beta_{0}^{*})$ belongs to a bounded set $B_{K}(0,R)\times I$ , $I$ being a compact interval in the real line and $B_{K}(0,R)$ the closed ball centered at zero, with radius $R$ in the RKHS associated with the covariance function $K$ . This restriction of searching for an estimator in a ball within the parameter space resembles other regularization methods in regression such as ridge or lasso.
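
For functions of the form $g=\sum_{j}\beta_{j}K(\cdot,t_{j})$ the reproducing property gives $\|g\|_{K}^{2}=\sum_{i,j}\beta_{i}\beta_{j}K(t_{i},t_{j})$, so membership in $B_{K}(0,R)$ is easy to check. The following Python sketch computes this norm and, as one simple way of enforcing the constraint, rescales the coefficients radially when the norm exceeds $R$. The Brownian kernel, the numerical values and the function names are assumptions of this example, and the rescaling step is not claimed to be the procedure used by the authors.

```python
def brownian_kernel(s, t):
    """Covariance function of the standard Brownian motion (example choice)."""
    return min(s, t)

def rkhs_norm(coefs, points, kernel):
    """||g||_K for g = sum_j coefs[j] * K(., points[j])."""
    squared = sum(
        bi * bj * kernel(ti, tj)
        for bi, ti in zip(coefs, points)
        for bj, tj in zip(coefs, points)
    )
    return max(squared, 0.0) ** 0.5  # guard against tiny negative rounding errors

def project_to_ball(coefs, points, kernel, radius):
    """Rescale the coefficients so that g lies in the closed ball B_K(0, radius)."""
    norm = rkhs_norm(coefs, points, kernel)
    if norm <= radius:
        return list(coefs)
    return [b * radius / norm for b in coefs]

coefs, points = [2.0, -1.0], [0.25, 0.75]
norm = rkhs_norm(coefs, points, brownian_kernel)   # squared norm is 1 - 1 + 0.75
shrunk = project_to_ball(coefs, points, brownian_kernel, radius=0.5)
```

Radial rescaling is the metric projection onto a centered ball in a Hilbert space, so `shrunk` is the closest element of $B_{K}(0,R)$ to the original function in the RKHS norm.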

If $K$ is continuous and bounded, all functions $f$ in the RKHS space are continuous as well and, using the reproducing property $\langle f,K(\cdot,t)\rangle_{K}=f(t)$ , we get

$$|f(t)|=|\langle f,K(\cdot,t)\rangle_{K}|\leq\|f\|_{K}\,\|K(\cdot,t)\|_{K}=\|f\|_{K}\,K(t,t)^{1/2}.$$

If, for simplicity, we assume that $\sup_{t}K(t,t)=1$ , we have (from the definition of the RKHS ${\mathcal{H}}(K)$ ) that all functions $\beta\in B_{K}(0,R)$ can be approximated by functions of type

$$g(\cdot)=\sum_{j=1}^{p}\beta_{j}K(\cdot,t_{j}),$$

where $\beta_{j}$ are real numbers with $|\beta_{j}|\leq R$ , $p\in{\mathbb{N}}$ , $t_{j}\in[0,1]$ .

Now, recall that the RKHS functional logistic model corresponding to such a function $g$ would be given by expression (7) in terms of $\beta_{j}$ and $X(t_{j})$. Then, assuming the continuity of the trajectories $X(t)$, we can ensure the existence of an approximate maximum likelihood (ML) estimator of $(\beta^{*},\beta_{0}^{*})$ expressed in terms of $(\beta_{0},\beta_{1},\ldots,\beta_{p},t_{1},\ldots,t_{p})$.

The effective calculation of such an estimator could be done by a sequential “greedy” method. The idea is to replace the direct maximization of the likelihood function with an iterative algorithm, as follows:

1. Let us fix a grid $T_{p}$ of $p$ equispaced points in $[0,1]$. For each $t$ on the grid, we fit the logistic model of Equation (7) with $p=1$ and $\hat{m}(t)=\bar{X}(t)$. The log-likelihood achieved for this $t$ at the ML estimators $\widehat{\beta}_{0}$ and $\widehat{\beta}_{1}$ is stored in $\ell_{1}(t)$. Then, the first point $\widehat{t}_{1}$ is fixed as the point at which $\ell_{1}(t)$ achieves its maximum value.

2. Once $\widehat{t}_{1}$ has been selected, for each $t$ in the grid we fit the model

$$\mathbb{P}(Y=1\mid X)=\frac{1}{1+\exp\big\{-\beta_{0}-\beta_{1}\big(X(\widehat{t}_{1})-\hat{m}(\widehat{t}_{1})\big)-\beta_{2}\big(X(t)-\hat{m}(t)\big)\big\}}.$$

As in the previous step, $\ell_{2}(t)$ would be the log-likelihood achieved at $\widehat{\beta}_{0}$, $\widehat{\beta}_{1}$ and $\widehat{\beta}_{2}$, and $\widehat{t}_{2}$ is the point at which the maximum of $\ell_{2}(t)$ is attained.

3. We proceed in the same way until a suitable number of points $p$ has been selected.
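
The steps above can be sketched in a few lines of Python. In this illustration each logistic fit is replaced by a fixed number of gradient-ascent iterations on the average log-likelihood, so that a finite value is returned even for separable samples; this inner fitting routine, its step size (suited to covariates of order one), the exclusion of already selected points from later steps and all function names are assumptions of the example, not the authors' implementation.

```python
import math

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def softplus(z):
    """Numerically stable log(1 + exp(z))."""
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))

def fitted_log_likelihood(columns, labels, n_iter=300, step=1.0):
    """Average log-likelihood reached by gradient ascent started at zero."""
    n = len(labels)
    rows = [[1.0] + [col[i] for col in columns] for i in range(n)]
    coef = [0.0] * (len(columns) + 1)              # intercept, then slopes
    for _ in range(n_iter):
        grad = [0.0] * len(coef)
        for row, y in zip(rows, labels):
            resid = y - sigmoid(sum(c * r for c, r in zip(coef, row)))
            for j, r in enumerate(row):
                grad[j] += resid * r / n
        coef = [c + step * g for c, g in zip(coef, grad)]
    etas = [sum(c * r for c, r in zip(coef, row)) for row in rows]
    return sum(y * eta - softplus(eta) for y, eta in zip(labels, etas)) / n

def greedy_points(curves, labels, n_points):
    """Sequentially pick the grid indices that maximize the fitted log-likelihood."""
    n, m = len(curves), len(curves[0])
    mean = [sum(x[k] for x in curves) / n for k in range(m)]
    centered = [[x[k] - mean[k] for x in curves] for k in range(m)]  # one column per grid point
    selected = []
    for _ in range(n_points):
        candidates = [k for k in range(m) if k not in selected]
        scores = {
            k: fitted_log_likelihood([centered[j] for j in selected + [k]], labels)
            for k in candidates
        }
        selected.append(max(candidates, key=scores.get))
    return selected

# Toy data: only grid point 3 carries information about the labels.
labels = [1, 1, 1, 1, 0, 0, 0, 0]
informative = [1.0, 1.0, 1.0, -1.0, -1.0, -1.0, -1.0, 1.0]
curves = [[0.0, 0.0, 0.0, v, 0.0] for v in informative]
selected = greedy_points(curves, labels, n_points=2)
```

In the toy data the first selected index is the only informative grid point; the remaining points are uninformative ties, so the second choice carries no meaning here.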

In practical problems, it is also important to determine how many points $p$ one should retain. The common approach is to fix this value $\widehat{p}$ by cross-validation, whenever it is possible. Another reasonable approach is to increase the initial value $p$ by repeating the whole procedure with another grid $T_{p+1}$ of $p+1$ equispaced points until the increase achieved in the likelihood function is smaller than a given threshold, in a similar way as in Berrendero et al. (2019).

Acknowledgements

This work has been partially supported by Spanish Grant PID2019-109387GB-I00.

Bibliography

Albert and Anderson (1984) A. Albert and J. A. Anderson. On the existence of maximum likelihood estimates in logistic regression models. Biometrika, 71(1):1–10, 1984.
Ash and Gardner (2014) R. B. Ash and M. F. Gardner. Topics in Stochastic Processes: Probability and Mathematical Statistics: A Series of Monographs and Textbooks. Academic Press, 2014.
Baíllo et al. (2011) A. Baíllo, A. Cuevas, and J. A. Cuesta-Albertos. Supervised classification for a family of Gaussian functional models. Scandinavian Journal of Statistics, 38(3):480–498, 2011.
Berlinet and Thomas-Agnan (2004) A. Berlinet and C. Thomas-Agnan. Reproducing Kernel Hilbert Spaces in Probability and Statistics. Kluwer Academic, Boston, 2004.
Berrendero et al. (2019) J. R. Berrendero, B. Bueno-Larraz, and A. Cuevas. An RKHS model for variable selection in functional linear regression. Journal of Multivariate Analysis, 170:22–45, 2019.
Candès and Sur (2020) E. J. Candès and P. Sur. The phase transition for the existence of the maximum likelihood estimate in high-dimensional logistic regression. The Annals of Statistics, 48(1):27–42, 2020.
Cramer (2003) J. S. Cramer. Logit Models from Economics and Other Fields. Cambridge University Press, 2003.
Cule et al. (2010) M. Cule, R. Samworth, and M. Stewart. Maximum likelihood estimation of a multi-dimensional log-concave density. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 72(5):545–607, 2010.
