Semi-Parametric Uncertainty Bounds for Binary Classification

Balázs Csanád Csáji; Ambrus Tamás

arXiv:1903.09790·stat.ML·March 26, 2019


TL;DR

This paper develops kernel-based semi-parametric methods to construct non-asymptotic confidence regions for the binary classification regression function, ensuring exact coverage and strong consistency.

Contribution

It introduces three novel resampling methods that provide guaranteed coverage probabilities for the regression function in binary classification.

Findings

01 All methods guarantee exact coverage probabilities.

02 The methods are strongly consistent.

03 They improve uncertainty quantification in binary classification.

Abstract

The paper studies binary classification and aims at estimating the underlying regression function which is the conditional expectation of the class labels given the inputs. The regression function is the key component of the Bayes optimal classifier, moreover, besides providing optimal predictions, it can also assess the risk of misclassification. We aim at building non-asymptotic confidence regions for the regression function and suggest three kernel-based semi-parametric resampling methods. We prove that all of them guarantee regions with exact coverage probabilities and they are strongly consistent.

Equations

$$\mu:M_{+}(\mathbb{X})\to\mathcal{H},\qquad P\mapsto\int k(x,\cdot)\,P(dx).$$

$$\sum_{n=1}^{\infty}\frac{\mbox{Var}(X_{n})}{n^{2}}<\infty$$

$$\frac{1}{n}\sum_{k=1}^{n}\big(X_{k}-\mathbb{E}[X_{k}]\big)\to 0\quad\mbox{as}\quad n\to\infty,$$

$$f_{*}(x)\doteq\mathbb{E}\big[\,Y\mid X=x\,\big]=\mathbb{P}(\,Y=+1\mid X=x\,)-\mathbb{P}(\,Y=-1\mid X=x\,)=2\,\mathbb{P}(\,Y=+1\mid X=x\,)-1.$$

$$g_{*}(x)\doteq\operatorname{sign}(f_{*}(x)),$$

$$f_{*}\in\mathcal{F}\doteq\big\{\,f_{\theta}:\mathbb{X}\to[\,-1,+1\,]\,\mid\,\theta\in\Theta\,\big\}.$$

$$\|f_{\theta_{1}}-f_{\theta_{2}}\|_{\scriptscriptstyle P}^{2}\doteq\int_{\mathbb{X}}\big(f_{\theta_{1}}(x)-f_{\theta_{2}}(x)\big)^{2}\,P_{\scriptscriptstyle\mathbb{X}}(dx)\neq 0,$$

$$\mathbb{E}\big[\,Y\mid X=x\,\big]=\frac{p\,\varphi_{1}(x)-(1-p)\,\varphi_{2}(x)}{p\,\varphi_{1}(x)+(1-p)\,\varphi_{2}(x)},$$

$$\mathcal{D}_{0}\doteq\big((x_{1},y_{1}),\dots,(x_{n},y_{n})\big),$$

$$\mathbb{P}_{\theta}(\,Y=+1\mid X=x\,)=\frac{1+f_{\theta}(x)}{2},\qquad\mathbb{P}_{\theta}(\,Y=-1\mid X=x\,)=\frac{1-f_{\theta}(x)}{2},$$

$$\mathcal{D}_{i}(\theta)\doteq\big((x_{1},y_{i,1}(\theta)),\dots,(x_{n},y_{i,n}(\theta))\big),$$

$$\psi\big(\,a_{1},a_{2},\dots,a_{m}\,\big)=\psi\big(\,a_{1},a_{\mu(2)},\dots,a_{\mu(m)}\,\big),$$

$$\psi\big(\,a_{i},\{a_{k}\}_{k\neq i}\,\big)\neq\psi\big(\,a_{j},\{a_{k}\}_{k\neq j}\,\big),$$

$$\mathbb{P}\big(\,\psi(A_{1},\dots,A_{m})=k\,\big)=\mathbb{P}\big(\,\psi(A_{\mu(1)},\dots,A_{\mu(m)})=k\,\big),$$

$$\psi\big(\,A_{\sigma(1)},\dots,A_{\sigma(m)}\,\big)=\psi\big(\,A_{\mu(1)},\dots,A_{\mu(m)}\,\big),$$

$$\mathbb{P}\big(\,\psi(a_{\tilde{\mu}(1)},\dots,a_{\tilde{\mu}(m)})=k\,\big)=\frac{1}{m},$$

$$\mathbb{I}_{k}(a,\mu)\doteq\begin{cases}1,&\text{if }\psi(a_{\mu})=k,\\0,&\text{otherwise},\end{cases}$$

$$i_{k}(a)\doteq\mathbb{E}\big[\,\mathbb{I}_{k}(a,\tilde{\mu})\,\big],$$

$$i_{k}(a)=\mathbb{P}\big(\,\psi(a_{\tilde{\mu}})=k\,\big)=\frac{1}{m},$$

$$\mathbb{P}\big(\,\psi(A)=k\,\big)=\mathbb{E}\big[\,\mathbb{E}\big[\,\mathbb{I}_{k}(A,\tilde{\mu})\mid A\,\big]\,\big]=\mathbb{E}\big[\,i_{k}(A)\,\big]=\mathbb{E}\big[\,\tfrac{1}{m}\,\big]=\frac{1}{m},$$

$$\mathcal{D}_{k}^{\pi}(\theta)\doteq\big(\mathcal{D}_{k}(\theta),\pi(k)\big),$$

$$\Theta_{\varrho}^{\psi}\doteq\big\{\,\theta\in\Theta:\,p\leq\psi\big(\,\mathcal{D}^{\pi}_{0},\{\mathcal{D}^{\pi}_{k}(\theta)\}_{k\neq 0}\,\big)\leq q\,\big\},$$

$$\mathbb{P}\big(\,\theta^{*}\in\Theta_{\varrho}^{\psi}\,\big)=\frac{q-p+1}{m}.$$

$$\mathbb{P}\,\bigg(\,\bigcap_{k=1}^{\infty}\bigcup_{n=k}^{\infty}\big\{\,\theta\in\Theta_{\varrho,n}^{\psi}\,\big\}\bigg)=0,$$

$$f_{\theta,n}^{(i)}(x)\doteq\frac{1}{k_{n}}\sum_{j=1}^{n}y_{i,j}(\theta)\,\mathbb{I}\big(\,x_{j}\in N(x,k_{n})\,\big),$$

$$\|f-g\|_{2}^{2}\doteq\int_{\mathbb{X}}\big(f(x)-g(x)\big)^{2}\,dx.$$


Full text


Balázs Csanád Csáji¹, Ambrus Tamás¹

This work was supported by the National Research, Dev. and Innovation Office (NKFIH), Hungary, grant numbers ED_18-2-2018-0006 and KH_17 125698. B. Cs. Csáji was supported by a János Bolyai Res. Fellowship.

¹ Balázs Csanád Csáji and Ambrus Tamás are with MTA SZTAKI: The Institute for Computer Science and Control, Hungarian Academy of Sciences, Budapest, Hungary, [email protected], [email protected]


I Introduction

Classification is one of the principal problems of statistical learning theory [1], and it is widely applied across several fields [8], for example, in quantized identification [2]. A typical aim of classification is to minimize the probability of misclassification. If the (joint) probability distribution of the input-output pairs were known, the misclassification probability could be minimized by the Bayes optimal classifier. This classifier can be written as the sign of the regression function, which is the conditional expectation of the labels given the inputs. The regression function can also help to assess the risk of misclassification. Estimating the regression function can be seen as identifying a (nonlinear) function from a sample of input and quantized (binary) output measurements.

Besides providing point-estimates of the regression function, for which there are several methods available [1, 3], it is also an important problem to bound the uncertainty of a candidate model. We will provide these bounds in the form of confidence regions. Note that such regions also induce confidence sets for the misclassification probabilities.

In this paper, inspired by recent developments in Finite-Sample System Identification (FSID) [4, 5, 6, 7], we suggest three semi-parametric kernel-based [8] resampling algorithms to build non-asymptotic confidence regions for the regression function of binary classification. We prove that each of these algorithms provides confidence sets with exact coverage probabilities, and that they are strongly consistent, that is, any false model will be (almost surely) excluded from the confidence regions as the sample size tends to infinity. As the suggested algorithms build on distribution-free results and work directly with the samples, the constructions are not restricted to models parametrized by finite dimensional vectors, but also allow infinite dimensional model classes.

II Preliminaries

II-A Binary Classification

We are given an i.i.d. sample, $\mathcal{D}=\{(x_{i},y_{i})\}_{i=1}^{n}$, from an unknown joint distribution $P$ of the random vector $(X,Y)$, where $x_{i}\in\mathbb{X}$ is the input and $y_{i}\in\{+1,-1\}$ is the label of the $i$th observation. We call any function $g:\mathbb{X}\rightarrow\{+1,-1\}$ a classifier. The Bayes optimal classifier $g_{*}$ can be defined as the one which minimizes the a priori risk functional $R(g)\doteq\mathbb{E}\big[\,L(Y,g(X))\,\big]$, where $L$ is an arbitrary loss function.

In this paper we will focus on the $0/1$ loss, which is one of the most common choices [1]. It is defined by $L(y,g(x))\doteq\mathbb{I}(g(x)\neq y)$, where $\mathbb{I}$ is the indicator function. The corresponding a priori risk is simply $R(g)=\mathbb{P}(\,g(X)\neq Y\,)$.

As distribution $P$ is unknown, we typically aim at estimating $g_{*}$. At any point $x\in\mathbb{X}$, $g_{*}(x)=\operatorname{sign}(\,\mathbb{E}\big[\,Y\,|\,X=x\,\big]\,)$ if it is feasible. Note that the conditional expectation $f_{*}(x)\doteq\mathbb{E}\big[\,Y\,|\,X=x\,\big]$ contains even more information than $g_{*}$, e.g., based on $f_{*}$ we are not only able to predict the label of a given input with minimal risk, but we can also calculate the risk itself, i.e., the probability of misclassification. Therefore, it is of high importance to study and estimate $f_{*}$.

II-B Reproducing Kernel Hilbert Spaces

Given a Hilbert space $\mathcal{H}$ of functions of type $f:\mathbb{X}\rightarrow\mathbb{R}$, with inner product $\langle\,\cdot,\cdot\,\rangle_{\mathcal{H}}$, we say that it is a Reproducing Kernel Hilbert Space (RKHS) if the point evaluation functional $\delta_{x}:f\mapsto f(x)$ is bounded (or equivalently continuous) for all $x\in\mathbb{X}$ [8]. In this case, by the Riesz representation theorem, there uniquely exists $k(\cdot,\cdot)$ such that for all $x\in\mathbb{X}$, $k(\cdot,x)\in\mathcal{H}$ and $f(x)=\langle\,f,k(\cdot,x)\,\rangle_{\mathcal{H}}$. This is called the reproducing property, and the function $k:\mathbb{X}\times\mathbb{X}\rightarrow\mathbb{R}$ is called the kernel. In particular, $\langle\,k(\cdot,x),k(\cdot,y)\,\rangle_{\mathcal{H}}=k(x,y)$, thus $k$ is symmetric and positive definite. The converse is also true by the Moore-Aronszajn theorem [9]: for each positive definite function there uniquely exists an RKHS. Typical examples of kernels are the Gaussian kernel, $k(x,y)=\exp\big(-\frac{\lVert x-y\rVert^{2}}{2\sigma^{2}}\big)$ with $\sigma>0$, and the polynomial kernel, $k(x,y)=(x^{\mathrm{T}}y+c)^{d}$ with $c\geq 0$ and $d\in\mathbb{N}$. For a given sample $\mathcal{D}$, the Gram matrix, $K\in\mathbb{R}^{n\times n}$, is defined as $K_{i,j}\doteq k(\,x_{i},x_{j}\,)$, which is a (data-dependent) symmetric, positive semidefinite matrix.
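As a small illustration of the two kernels and the Gram matrix described above, the following sketch uses only the Python standard library. The function names, the default parameter values, and the toy inputs are assumptions made for this example and do not come from the paper.

```python
import math

def gaussian_kernel(x, y, sigma=1.0):
    """k(x, y) = exp(-||x - y||^2 / (2 sigma^2)), with sigma > 0."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))

def polynomial_kernel(x, y, c=1.0, d=2):
    """k(x, y) = (x^T y + c)^d, with c >= 0 and d a positive integer."""
    return (sum(a * b for a, b in zip(x, y)) + c) ** d

def gram_matrix(xs, kernel):
    """Gram matrix K with K[i][j] = kernel(xs[i], xs[j])."""
    return [[kernel(xi, xj) for xj in xs] for xi in xs]

def quadratic_form(K, v):
    """v^T K v, which is nonnegative for every v if K is positive semidefinite."""
    n = len(v)
    return sum(v[i] * K[i][j] * v[j] for i in range(n) for j in range(n))

# Hypothetical toy inputs in R^2 (not from the paper).
xs = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0)]
K = gram_matrix(xs, gaussian_kernel)
```

Checking `quadratic_form(K, v)` for a few vectors `v` is only a spot check of positive semidefiniteness, not a proof of it.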

Let $C_{b}(\mathbb{X})$ denote the space of bounded continuous functions on a compact metric space $\mathbb{X}$ . A kernel is universal if the corresponding $\mathcal{H}$ is dense in $C_{b}(\mathbb{X})$ : for all $f\in C_{b}(\mathbb{X})$ and $\varepsilon>0$ there exists $h\in\mathcal{H}$ such that $\left\lVert\hskip 0.85358ptf-h\hskip 0.85358pt\right\rVert_{\infty}<\,\varepsilon$ .

II-C Kernel Mean Embedding

The idea of kernel mean embedding is to map distributions to elements of an RKHS with the help of the kernel [10]. Let $(\mathbb{X},\Sigma)$ be a measurable space and let $M_{+}(\mathbb{X})$ denote the space of all probability measures on it. The kernel mean embedding of these probability measures into an RKHS $\mathcal{H}$ endowed with a reproducing kernel $k:\mathbb{X}\times\mathbb{X}\rightarrow\mathbb{R}$ is

$$\mu:M_{+}(\mathbb{X})\to\mathcal{H},\qquad P\mapsto\int k(x,\cdot)\,P(dx).$$

A kernel is called characteristic if the embedding, $\mu$, is injective (e.g., the Gaussian kernel). In this case the embedded element captures all information about the distribution, e.g., for all $P,Q\in M_{+}(\mathbb{X})$, $\lVert\mu_{P}-\mu_{Q}\rVert_{\mathcal{H}}=0$ if and only if $P=Q$. Hence, the embedding induces a metric on $M_{+}(\mathbb{X})$.

Let $\mathbb{X}$ be a compact metric space and let $k$ be a universal kernel on $\mathbb{X}$ , then one can show that $k$ is also characteristic.

The kernel mean embedding has nice properties even when the kernel is not characteristic. For example, for polynomial kernels with degree $d$ it holds that $\left\lVert\hskip 0.85358pt\mu_{P}-\mu_{Q}\right\rVert_{\mathcal{H}\hskip 0.85358pt}=0$ if and only if the first $d$ moments of $P$ and $Q$ are the same.

Furthermore, many fundamental operations can be performed in $\mathcal{H}$ instead of dealing with the distributions themselves, e.g., Smola showed [10] that $\mathbb{E}_{P}[f(X)]=\langle f,\mu_{P}\rangle_{\mathcal{H}}$ .
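To make the embedding distance concrete, the sketch below estimates $\lVert\mu_{P}-\mu_{Q}\rVert_{\mathcal{H}}^{2}$ from two one-dimensional samples by replacing $P$ and $Q$ with their empirical distributions. By the reproducing property, the squared norm expands into three averages of kernel values. The function names and the choice of a Gaussian kernel with $\sigma=1$ are assumptions for illustration only.

```python
import math

def gaussian_kernel(x, y, sigma=1.0):
    """Gaussian kernel on the real line."""
    return math.exp(-((x - y) ** 2) / (2.0 * sigma ** 2))

def embedding_distance_sq(xs, ys, kernel=gaussian_kernel):
    """Squared RKHS distance between the empirical embeddings of xs and ys.

    Uses ||mu_P - mu_Q||^2 = <mu_P, mu_P> - 2 <mu_P, mu_Q> + <mu_Q, mu_Q>,
    where each inner product is an average of kernel evaluations.
    """
    def mean_k(us, vs):
        return sum(kernel(u, v) for u in us for v in vs) / (len(us) * len(vs))
    return mean_k(xs, xs) - 2.0 * mean_k(xs, ys) + mean_k(ys, ys)
```

This plug-in estimate is biased for finite samples; it is meant only to show how the distance can be computed from data.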

The underlying probability distribution of the sample is typically unknown, therefore, the kernel mean embedding should be estimated from empirical data. An important tool to prove the validity of such approaches is the Strong Law of Large Numbers (SLLN) for random elements taking values in a separable Hilbert space $\mathcal{H}$ . Let $\{X_{n}\}$ be a sequence of independent random elements taking values in $\mathcal{H}$ . If

$$\sum_{n=1}^{\infty}\frac{\mbox{Var}(X_{n})}{n^{2}}<\infty,$$

where $\mbox{Var}(X)\doteq\,\mathbb{E}\big{[}\,\|\,X-\mathbb{E}[X]\,\|_{\mathcal{H}}^{2}\,\big{]}$ , then

$$\frac{1}{n}\sum_{k=1}^{n}\big(X_{k}-\mathbb{E}[X_{k}]\big)\to 0\quad\mbox{as}\quad n\to\infty,$$

(a.s.) in the metric induced by $\|\cdot\|_{\mathcal{H}}$ [11, Theorem 3.1.4].

III Resampling Framework

In this section we develop a framework to provide non-asymptotically guaranteed uncertainty quantification resampling algorithms for the “regression function”, namely, the conditional expectation of the labels given the inputs. The regression function is a fundamental object to study, for example, its signs at various inputs define the Bayes optimal classifier which achieves minimal misclassification risk.

Assume we have a (joint) distribution on $\mathbb{S}\,\doteq\,\mathbb{X}\,\times\,\mathbb{Y}$ , where $\mathbb{X}$ and $\mathbb{Y}$ are the input and output spaces, respectively. $\mathbb{X}$ does not have to be $\mathbb{R}^{d}$ , but has to be a measurable space (with some $\sigma$ -algebra). As we consider binary classification, $\mathbb{Y}\,\doteq\,\{+1,-1\}$ . The regression function can be written as

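$$f_{*}(x)\doteq\mathbb{E}\big[\,Y\mid X=x\,\big]=\mathbb{P}(\,Y=+1\mid X=x\,)-\mathbb{P}(\,Y=-1\mid X=x\,)=2\,\mathbb{P}(\,Y=+1\mid X=x\,)-1.$$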

Given $f_{*}$ , the Bayes optimal classifier is

$$g_{*}(x)\doteq\operatorname{sign}(f_{*}(x)),\tag{6}$$

where “ $\operatorname{sign}$ ” denotes the signum function. Note that in (6), for simplicity, we assumed that $\mathbb{P}(\,f_{*}(X)\,\neq\,0\,)\,=\,1$ .

We assume that we are given an (indexed) family of possible regression functions that also contains $f_{*}$ , that is

$$f_{*}\in\mathcal{F}\doteq\big\{\,f_{\theta}:\mathbb{X}\to[\,-1,+1\,]\,\mid\,\theta\in\Theta\,\big\}.$$

For simplicity, we refer to $\theta\in\Theta$ as a parameter, but $\Theta$ can be an arbitrary set, even an infinite dimensional vector space. The true parameter is denoted by $\theta^{*}$ , that is $f_{\theta^{*}}\,=\,f_{*}$ .

We assume that $\mathcal{F}$ contains square integrable functions w.r.t. the input distribution, and that the parametrization is injective, i.e., $\theta_{1}\,\neq\,\theta_{2}$ implies $f_{\theta_{1}}\,\neq\,f_{\theta_{2}}$ on a set having nonzero measure w.r.t. the input distribution. In other words,

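$$\|f_{\theta_{1}}-f_{\theta_{2}}\|_{\scriptscriptstyle P}^{2}\doteq\int_{\mathbb{X}}\big(f_{\theta_{1}}(x)-f_{\theta_{2}}(x)\big)^{2}\,P_{\scriptscriptstyle\mathbb{X}}(dx)\neq 0,$$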

if $\theta_{1}\neq\theta_{2}$ , where $P_{\scriptscriptstyle\mathbb{X}}$ is the distribution of the inputs.

Note that $f_{*}$ in itself does not determine the joint probability distribution generating the observations, namely, it does not contain information about the (marginal) distribution of the inputs, therefore, our approach is semi-parametric.

As an example, consider the case where the “ $+1$ ” class has probability density function $\varphi_{1}$ , while the “ $-1$ ” class has density $\varphi_{2}$ . For each element of the sample, there is a $p$ probability to see an element with “ $+1$ ” label and a $1-p$ probability to see a measurement with “ $-1$ ” label. Then,

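$$\mathbb{E}\big[\,Y\mid X=x\,\big]=\frac{p\,\varphi_{1}(x)-(1-p)\,\varphi_{2}(x)}{p\,\varphi_{1}(x)+(1-p)\,\varphi_{2}(x)},$$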

thus, if we have candidate densities for inputs with various labels and we know their mixing probability, then we can compute the regression function. However, observe that the regression function does not determine $\varphi_{1},\varphi_{2}$ and $p$ .
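The mixture example can be evaluated numerically. The sketch below assumes, purely for illustration, that the two class densities are unit-variance Gaussians centered at $+1$ and $-1$ with $p=1/2$; in that special case the formula reduces to $\tanh(x)$, which gives a convenient sanity check. The names `normal_pdf`, `regression_function`, and `f_example` are hypothetical.

```python
import math

def normal_pdf(x, mean, std):
    """Density of the normal distribution N(mean, std^2) at x."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))

def regression_function(x, p, phi1, phi2):
    """E[Y | X = x] = (p phi1(x) - (1-p) phi2(x)) / (p phi1(x) + (1-p) phi2(x))."""
    num = p * phi1(x) - (1.0 - p) * phi2(x)
    den = p * phi1(x) + (1.0 - p) * phi2(x)
    return num / den  # undefined where both densities vanish numerically

# Assumed toy densities: class "+1" ~ N(+1, 1), class "-1" ~ N(-1, 1), p = 1/2.
def f_example(x):
    return regression_function(
        x, 0.5,
        lambda t: normal_pdf(t, 1.0, 1.0),
        lambda t: normal_pdf(t, -1.0, 1.0),
    )
```

The sign of `f_example(x)` gives the Bayes optimal label at `x`, and `(1 - abs(f_example(x))) / 2` is the corresponding conditional misclassification probability.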

III-A Resampling Labels

The observed i.i.d. input-output dataset is denoted by

$$\mathcal{D}_{0}\doteq\big((x_{1},y_{1}),\dots,(x_{n},y_{n})\big),$$

which can also be seen as a $\mathbb{S}^{n}$ -valued random vector.

One of our core ideas is that if we are given a candidate $\theta$ , then we can generate (resample) alternative labels for the available inputs using the distribution induced by $f_{\theta}$ , that is

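$$\mathbb{P}_{\theta}(\,Y=+1\mid X=x\,)=\frac{1+f_{\theta}(x)}{2},\qquad\mathbb{P}_{\theta}(\,Y=-1\mid X=x\,)=\frac{1-f_{\theta}(x)}{2},$$

which immediately follow from the definition of the regression function, since the labels take values in $\{+1,-1\}$.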

Given a $\theta$ , we can generate $m-1$ alternative samples by

$$\mathcal{D}_{i}(\theta)\doteq\big((x_{1},y_{i,1}(\theta)),\dots,(x_{n},y_{i,n}(\theta))\big),$$

for $i=1,\dots,m-1$ , where for all $i,j$ , label $y_{i,j}(\theta)$ is generated randomly according to the conditional distribution $\mathbb{P}_{\theta}(\,Y\,\mid\,X=x_{j}\,)$ . For notational simplicity, we extend this to $\mathcal{D}_{0}$ , that is $\forall\,\theta:\mathcal{D}_{0}(\theta)\doteq\mathcal{D}_{0}$ and $\forall\,j:y_{0,j}(\theta)\,\doteq\,y_{j}$ .

Naturally, for all $i$ , dataset $\mathcal{D}_{i}(\theta)$ can also be identified with a random vector in $\mathbb{S}^{n}$ , and $\mathcal{D}_{1}(\theta),\dots,\mathcal{D}_{m-1}(\theta)$ are always conditionally i.i.d., given the inputs, $\{x_{j}\}$ .

Observe that, in case $\theta\neq\theta^{*}$, the distribution of $\mathcal{D}_{0}$ is in general different from that of $\mathcal{D}_{i}(\theta)$, $\forall\,i\neq 0$; while $\mathcal{D}_{0}$ and $\mathcal{D}_{i}(\theta^{*})$ have the same distribution for all possible $i$.
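The label resampling step can be sketched as follows. The only fact used is that a label in $\{+1,-1\}$ with conditional mean $f_{\theta}(x)$ equals $+1$ with probability $(1+f_{\theta}(x))/2$. The function names and the use of Python's `random.Random` as the source of randomness are assumptions of this sketch.

```python
import random

def resample_labels(xs, f_theta, rng):
    """Draw one label per input with P(Y = +1 | X = x) = (1 + f_theta(x)) / 2."""
    return [+1 if rng.random() < (1.0 + f_theta(x)) / 2.0 else -1 for x in xs]

def alternative_samples(xs, f_theta, m, rng):
    """Generate the m - 1 alternative datasets D_1(theta), ..., D_{m-1}(theta).

    The inputs xs are kept fixed; only the labels are regenerated.
    """
    return [list(zip(xs, resample_labels(xs, f_theta, rng))) for _ in range(m - 1)]
```

For example, `alternative_samples(xs, lambda x: 0.5, 4, random.Random(0))` returns three datasets whose labels are $+1$ with probability $0.75$, independently across inputs and datasets.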

III-B Ranking Functions

The proposed algorithms will be defined via rank statistics based on suitably defined orderings. A key concept will be the “ranking function” which, informally, computes the rank of its first argument among all of its arguments based on some underlying ordering. Let $\mathbb{A}$ be a measurable space (with some $\sigma$-algebra). A (measurable) function $\psi:\mathbb{A}^{m}\to[\,m\,]$, where $[\,m\,]\doteq\{1,\dots,m\}$, is called a ranking function if for all $(a_{1},\dots,a_{m})\in\mathbb{A}^{m}$ it satisfies the two properties

(P1) For all permutations $\mu$ of the set $\{2,\dots,m\}$, we have

$$\psi\big(\,a_{1},a_{2},\dots,a_{m}\,\big)=\psi\big(\,a_{1},a_{\mu(2)},\dots,a_{\mu(m)}\,\big),$$

that is, the function is invariant with respect to reordering the last $m-1$ terms of its arguments.

(P2) For all $i,j\in[\,m\,]$, if $a_{i}\neq a_{j}$, then we have

$$\psi\big(\,a_{i},\{a_{k}\}_{k\neq i}\,\big)\neq\psi\big(\,a_{j},\{a_{k}\}_{k\neq j}\,\big),$$

where the simplified notation is justified by (P1).

We refer to the output of the ranking function $\psi$ as the rank. An important observation about ranking exchangeable [12] random elements is given by the following lemma. (Recall that if a sample is i.i.d., it is also exchangeable.)

Lemma 1

Let $A_{1},\dots,A_{m}$ be exchangeable, almost surely pairwise different random elements taking values in $\mathbb{A}$. Then, $\psi\big(\,A_{1},A_{2},\dots,A_{m}\,\big)$ has discrete uniform distribution: for all $k\in[\,m\,]$, the rank is $k$ with probability $1/m$.

Proof:

Since $\{A_{i}\}$ are exchangeable, we know that

$$\mathbb{P}\big(\,\psi(A_{1},\dots,A_{m})=k\,\big)=\mathbb{P}\big(\,\psi(A_{\mu(1)},\dots,A_{\mu(m)})=k\,\big),$$

for all $k\in[\,m\,]$ and all permutations $\mu$ on $[\,m\,]$. Since this is true for all permutations, it is also true if we select $\tilde{\mu}$ randomly, independently of $\{A_{i}\}$, with any distribution on the (finite) set of all possible permutations on $[\,m\,]$.

As $\{A_{i}\}$ are almost surely non-equal, and function $\psi$ has properties P1 and P2, it holds with probability one that

$$\psi\big(\,A_{\sigma(1)},\dots,A_{\sigma(m)}\,\big)=\psi\big(\,A_{\mu(1)},\dots,A_{\mu(m)}\,\big),$$

if and only if $\sigma(1)=\mu(1)$ , where $\sigma,\mu$ are permutations on $[m]$ . Hence, there are $m$ equivalence classes of permutations, denoted by $P_{1},\dots,P_{m}$ , each containing $(m-1)!$ permutations, with the (a.s.) property that permutations from the same class produce the same rank, while permutations from different classes produce different ranks. Therefore, each rank $k\in[\,m\,]$ is produced by exactly one class $P_{i}$ , but naturally, the association of ranks and classes depends on the realization of the random elements $A_{1},\dots,A_{m}$ .

Now, let us fix a realization $a_{1},\dots,a_{m}\in\mathbb{A}$ in which the elements are pairwise different. Then, let us sample a permutation $\tilde{\mu}$ randomly, with uniform distribution on the set of all permutations. Since each equivalence class has the same number of elements, the probability that $\tilde{\mu}\in P_{i}$ is exactly $1/m$ . As each $P_{i}$ yields a different rank, we have

$$\mathbb{P}\big(\,\psi(a_{\tilde{\mu}(1)},\dots,a_{\tilde{\mu}(m)})=k\,\big)=\frac{1}{m},\tag{16}$$

for every rank $k\in[\,m\,]$ and independently of the realization $a_{1},\dots,a_{m}$. Note that if we did not use a uniform distribution, then the resulting rank distribution would of course depend on the actual realization we are ranking.

Because (16) is independent of the realization, the resulting discrete uniform distribution carries over to the case when $A_{1},\dots,A_{m}$ are random, as they are (a.s.) pairwise different.

This last step can be made more precise as follows. For simplicity, let us introduce the notations $a\doteq(a_{1},\dots,a_{m})$, $a_{\mu}\doteq(a_{\mu(1)},\dots,a_{\mu(m)})$, and similarly for $A$ and $A_{\mu}$. Then, let us introduce the indicator function of the rank being $k$,

$$\mathbb{I}_{k}(a,\mu)\doteq\begin{cases}1,&\text{if }\psi(a_{\mu})=k,\\0,&\text{otherwise},\end{cases}$$

where $a$ and $\mu$ are deterministic. Then, let us define

$$i_{k}(a)\doteq\mathbb{E}\big[\,\mathbb{I}_{k}(a,\tilde{\mu})\,\big],$$

where $\tilde{\mu}$ is a random permutation selected uniformly from the set of all permutations on $[\,m\,]$ , and $a\in\mathbb{A}^{m}$ is a constant. Note that $i_{k}(\cdot)$ is a deterministic function. Then, we have

$$i_{k}(a)=\mathbb{P}\big(\,\psi(a_{\tilde{\mu}})=k\,\big)=\frac{1}{m},$$

for all $a$ whose elements are pairwise different. Then, using the properties of (conditional) expectation, we have

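$$\begin{aligned}\mathbb{P}\big(\,\psi(A)=k\,\big)&=\mathbb{P}\big(\,\psi(A_{\tilde{\mu}})=k\,\big)=\mathbb{E}\big[\,\mathbb{I}_{k}(A,\tilde{\mu})\,\big]\\&=\mathbb{E}\big[\,\mathbb{E}\big[\,\mathbb{I}_{k}(A,\tilde{\mu})\mid A\,\big]\,\big]=\mathbb{E}\big[\,i_{k}(A)\,\big]\\&=\mathbb{E}\big[\,\tfrac{1}{m}\,\big]=\frac{1}{m},\end{aligned}$$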

where we also used that the elements of $A$ are almost surely pairwise different. This concludes the proof of the claim. ∎
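Lemma 1 can be checked empirically with a short simulation. The sketch below ranks the first of $m$ i.i.d. uniform random numbers (which are exchangeable and almost surely pairwise different) under the usual ordering of real numbers, and records how often each rank occurs. The ranking function, the number of trials, and the function names are illustrative assumptions.

```python
import random
from collections import Counter

def rank_of_first(values):
    """Rank of values[0] among all values (1 = smallest); assumes no ties."""
    return 1 + sum(v < values[0] for v in values[1:])

def rank_frequencies(m, trials, rng):
    """Empirical distribution of the rank of the first of m i.i.d. uniforms."""
    counts = Counter(
        rank_of_first([rng.random() for _ in range(m)]) for _ in range(trials)
    )
    return {k: counts[k] / trials for k in range(1, m + 1)}
```

With `m = 5` and a large number of trials, each frequency should be close to $1/5$, in line with the lemma. Note that `rank_of_first` is invariant to reordering its last $m-1$ arguments, as required by (P1).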

III-C Confidence Regions

Inspired by FSID methods [4, 5, 6], the core idea of the proposed algorithms is to compare the original dataset with alternative samples which are randomly generated according to a given hypothesis. The comparison will be based on the rank of the original dataset among all the available samples, therefore, the ranking function is in the heart of all proposed algorithms. The differences between various algorithms primarily come from the various ways they rank.

Lemma 1 will be one of our main technical tools, however, it requires almost surely different elements, which is not guaranteed for $\{\mathcal{D}_{k}(\theta)\}$ . This will be resolved by random tie-breaking, similarly to the solution of [5]. To make this precise, consider a permutation $\pi$ of the set $\{0,\dots,m-1\}$ , generated randomly with uniform distribution, and independently of $\{\mathcal{D}_{k}(\theta)\}$ . Then, obviously $\pi(0),\dots,\pi(m-1)$ are almost surely different, exchangeable random variables.

We extend datasets $\{\mathcal{D}_{k}(\theta)\}$ with $\{\pi(k)\}$ . As a shorthand notation we introduce, for $k=0,\dots,m-1$ , the sample

$$\mathcal{D}_{k}^{\pi}(\theta)\doteq\big(\mathcal{D}_{k}(\theta),\pi(k)\big),$$

which now takes values in $\mathbb{A}\,\doteq\,\mathbb{S}^{n}\times\{0,\dots,m-1\}$ .

Given a ranking function $\psi$ , defined on the codomain (range) of the extended datasets, and hyper-parameters $p,q\in[\,m\,]$ with $p\,\leq\,q$ , a confidence region can be defined by

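$$\Theta_{\varrho}^{\psi}\doteq\big\{\,\theta\in\Theta:\,p\leq\psi\big(\,\mathcal{D}^{\pi}_{0},\{\mathcal{D}^{\pi}_{k}(\theta)\}_{k\neq 0}\,\big)\leq q\,\big\},\tag{22}$$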

where $\varrho\,\doteq\,(m,p,q)$ denotes the applied hyper-parameters, with $m\geq 1$ being the total number of available samples, including the original one as well as the generated ones.

Our main abstract result about the coverage probability of the true parameter of such confidence regions is

Theorem 1

For every ranking function $\psi$ and hyper-parameter $\varrho=(m,p,q)$ with integers $1\leq p\leq q\leq m$, we have

$$\mathbb{P}\big(\,\theta^{*}\in\Theta_{\varrho}^{\psi}\,\big)=\frac{q-p+1}{m}.$$

Proof:

First note that $\mathcal{D}_{0},\mathcal{D}_{1}(\theta^{*}),\dots,\mathcal{D}_{m-1}(\theta^{*})$ are conditionally i.i.d., given the inputs, $\{x_{k}\}$ , therefore they are also exchangeable. As $\pi(0),\dots,\pi(m-1)$ are exchangeable, as well, and $\pi$ is generated independently of the datasets, we have that $\mathcal{D}_{0}^{\pi},\mathcal{D}^{\pi}_{1}(\theta^{*}),\dots,\mathcal{D}^{\pi}_{m-1}(\theta^{*})$ are exchangeable, too, furthermore, they are almost surely pairwise different.

Then, the theorem follows directly from Lemma 1, as the lemma implies that the rank of $\mathcal{D}^{\pi}_{0}$ takes each value in $[\,m\,]$ with probability exactly $1/m$, therefore, the probability that its rank is between $p$ and $q$ is exactly $(\,q-p+1\,)\,/\,m$. ∎

Theorem 1 shows that the confidence regions constructed as (22) have exact coverage probabilities, independently of the underlying probability distribution generating the (i.i.d.) data and for all ranking functions (satisfying P1 and P2). Observe that it is a non-asymptotic result, the exact coverage probability is valid irrespective of the sample size, $n$ . Also note that the hyper-parameters are user-chosen, therefore, any (rational) confidence probability in $(0,1)$ can be achieved.
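The exact coverage claim can be illustrated with a toy simulation. The sketch below makes several assumptions that are not in the paper: the model class consists of constant functions $f_{\theta}(x)=\theta$, so the inputs play no role, and the ranking is based on the distance of the label mean from $\theta$, with ties broken by a random permutation $\pi$. Since this ranking satisfies (P1) and (P2), Theorem 1 predicts a coverage of $(q-p+1)/m$ regardless of the sample size.

```python
import random

def resample_labels(n, theta, rng):
    # Hypothetical toy model: f_theta(x) = theta for all x, so P(Y=+1) = (1+theta)/2.
    return [1 if rng.random() < (1.0 + theta) / 2.0 else -1 for _ in range(n)]

def rank_with_tie_breaking(z, pi):
    """Rank of z[0] under '<', with ties broken by the permutation values pi."""
    return 1 + sum((z[k], pi[k]) < (z[0], pi[0]) for k in range(1, len(z)))

def in_region(y0, theta, m, p, q, rng):
    """Check whether theta lies in the confidence region built from labels y0."""
    n = len(y0)
    samples = [y0] + [resample_labels(n, theta, rng) for _ in range(m - 1)]
    # Illustrative statistic (an assumption): distance of the label mean from theta.
    z = [abs(sum(y) / n - theta) for y in samples]
    pi = list(range(m))
    rng.shuffle(pi)  # uniformly random tie-breaking permutation
    return p <= rank_with_tie_breaking(z, pi) <= q

def empirical_coverage(theta_star, n, m, p, q, trials, rng):
    """Fraction of trials in which the true parameter is inside the region."""
    hits = 0
    for _ in range(trials):
        y0 = resample_labels(n, theta_star, rng)  # "observed" data from the true model
        hits += in_region(y0, theta_star, m, p, q, rng)
    return hits / trials
```

With `m = 10`, `p = 1`, and `q = 9`, the empirical coverage should be close to $0.9$ even for a small sample such as `n = 30`, while a parameter far from the truth is ranked last and excluded.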

This theorem is very general and hence also allows some degenerate constructions, like the ones that do not depend on the data at all, only on the tie-breaking random permutation, $\pi$ . Such regions are called purely randomized. In order to avoid such constructions, we should analyze other properties of the methods. Besides having guaranteed confidence, one of the most important properties an algorithm can have is (strong) consistency, namely, the property that, for any false parameter, as the sample size increases, eventually it will be excluded from the constructed confidence region (a.s.).

Formally, a method is strongly consistent if

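$$\mathbb{P}\,\bigg(\,\bigcap_{k=1}^{\infty}\bigcup_{n=k}^{\infty}\big\{\,\theta\in\Theta_{\varrho,n}^{\psi}\,\big\}\bigg)=0,\tag{24}$$

for every parameter $\theta\neq\theta^{*}$, $\theta\in\Theta$, where $\Theta_{\varrho,n}^{\psi}$ denotes the confidence region constructed based on a sample of size $n$. Obviously, purely randomized regions are not consistent.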

IV Kernel-Based Constructions

In this section we propose three kernel-based algorithms to construct confidence regions based on the resampling framework of Section III. We show that all of these methods have exact coverage probabilities and are strongly consistent.

IV-A Algorithm I (Neighborhood Based)

The main idea of Algorithm I is that we can estimate the regression function, $f_{*}$, based on the available (quantized) dataset, $\mathcal{D}_{0}$, by the kNN ($k$-nearest neighbors) algorithm. We can similarly do so based on the alternative datasets, $\{\mathcal{D}_{k}(\theta)\}_{k\neq 0}$. Then, we can compare the estimate based on $\mathcal{D}_{0}$ to the ones coming from the alternative samples.

For Algorithm I we assume that $\mathbb{X}\,\subseteq\,\mathbb{R}^{d}$ , $\mathbb{X}$ is compact, the support of the (marginal) distribution of the inputs, $P_{\scriptscriptstyle\mathbb{X}}$ , is the whole $\mathbb{X}$ , furthermore, $P_{\scriptscriptstyle\mathbb{X}}$ is absolutely continuous.

Let us introduce functions, for $i=0,\dots,m-1$ , as

$$f_{\theta,n}^{(i)}(x)\doteq\frac{1}{k_{n}}\sum_{j=1}^{n}y_{i,j}(\theta)\,\mathbb{I}\big(\,x_{j}\in N(x,k_{n})\,\big),\tag{25}$$

where $\mathbb{I}$ is an indicator function (its value is $1$ if its argument is true, and $0$ otherwise), $N(x,k_{n})$ denotes the $k_{n}$ closest neighbors of $x$ from $\{x_{j}\}_{j=1}^{n}$, and $k_{n}\leq n$ is a constant (window size), which can depend on $n$. We use the standard Euclidean distance as a metric on $\mathbb{X}$ (to define neighbors). Since the inputs, $\{x_{j}\}$, have a distribution that is absolutely continuous, there is zero probability of ties in $N(x,k_{n})$.
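The kNN estimate above amounts to averaging the labels of the $k_{n}$ inputs closest to the query point. A minimal sketch, assuming inputs are given as tuples of floats and that there are no distance ties (the function name `knn_estimate` is hypothetical):

```python
def knn_estimate(x, xs, ys, k):
    """Average of the labels of the k inputs nearest to x (Euclidean distance).

    xs: list of input tuples, ys: list of labels in {+1, -1}, 1 <= k <= len(xs).
    """
    dist_sq = [sum((a - b) ** 2 for a, b in zip(x, xj)) for xj in xs]
    nearest = sorted(range(len(xs)), key=lambda j: dist_sq[j])[:k]
    return sum(ys[j] for j in nearest) / k
```

Applying the same function to the labels of each alternative dataset $\mathcal{D}_{i}(\theta)$ yields the functions $f_{\theta,n}^{(i)}$; the result is piecewise constant in `x` and always lies in $[-1,+1]$.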

Given two square integrable functions, $f,g:\mathbb{X}\to\mathbb{R}$ , let

$$\|f-g\|_{2}^{2}\doteq\int_{\mathbb{X}}\big(f(x)-g(x)\big)^{2}\,dx.\tag{26}$$

We will need the total (cumulative) distance of $f_{\theta,n}^{(i)}$ from all other functions, thus we introduce, for $i=0,\dots,m-1,$

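$$Z_{n}^{(i)}(\theta)\doteq\sum_{j=0}^{m-1}\big\|\,f_{\theta,n}^{(i)}-f_{\theta,n}^{(j)}\,\big\|_{2}^{2}.\tag{27}$$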

Then, we can define the rank of $Z_{n}^{(0)}$ among $\{Z_{n}^{(i)}(\theta)\}$ as

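$$\mathcal{R}_{n}(\theta)\doteq 1+\sum_{i=1}^{m-1}\mathbb{I}\big(\,Z_{n}^{(i)}(\theta)\prec_{\pi}Z_{n}^{(0)}(\theta)\,\big),$$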

where $\mathbb{I}$ is an indicator function, and binary relation “ $\prec_{\pi}$ ” is the standard “ $<$ ” with random tie-breaking. More precisely, as before, let $\pi$ be a random (uniformly chosen) permutation of the set $\{0,\dots,m-1\}$ . Then, given $m$ arbitrary real numbers, $Z_{0},\dots,Z_{m-1}$ , we can construct a strict total order, denoted by “ $\prec_{\pi}$ ”, by defining $Z_{k}\prec_{\pi}Z_{j}$ if and only if $Z_{k}<Z_{j}$ or it both holds that $Z_{k}=Z_{j}$ and $\pi(k)<\pi(j)$ .

Therefore, in case of Algorithm I, the ranking function is

[TABLE]

As we will see (cf. the proof of Theorem 2), for any fixed false parameter, $Z_{n}^{(0)}(\theta)$ tends to have the largest rank, therefore, we fix $p=1$ and only exclude parameters which lead to high ranks. That is, using (22), the confidence set is

[TABLE]

where $\varrho\,\doteq\,(\,m,q\,)$ again denotes the user-chosen hyper-parameters with $1\,\leq\,q\,\leq\,m$ ; we assume that $3\,\leq\,m$ .

The main theoretical results can be summarized as

Theorem 2

The coverage probability of the region is

[TABLE]

for any sample size $n$. Moreover, if $\{k_{n}\}$ are chosen such that $k_{n}\to\infty$ and $k_{n}/n\to 0$, as $n\to\infty$, then the confidence regions are strongly consistent, as defined by (24).

Proof:

The exact confidence of the constructed regions immediately follows from Theorem 1, as it is straightforward to check that the applied ranking satisfies P1 and P2.

In order to prove strong consistency, let us fix a false parameter $\theta\in\Theta$ with $\theta\neq\theta^{*}$ . Since the parametrization is injective, we know that $f_{\theta}\neq f_{*}$ on a set of positive measure.

Under our assumptions we know that the kNN estimator (25) is strongly consistent [3, Theorem 23.7], that is

[TABLE]

almost surely, for $i=1,\dots,m-1$ . Since the support of $P_{\scriptscriptstyle\mathbb{X}}$ is $\mathbb{X}$ and it is absolutely continuous, we have the same (a.s.) convergence properties if we use $\|\cdot\|_{2}^{2}$ instead of $\|\cdot\|^{2}_{\scriptscriptstyle P}$ . Now, let $\kappa\,\doteq\,\|f_{*}-f_{\theta}\|^{2}_{2}>0$ , then taking (27) into account,

[TABLE]

almost surely, from which $Z_{n}^{(0)}(\theta)$ tends to take rank $m$ in the ordering (a.s.), as $n\to\infty$ . Therefore, for any fixed $\theta\neq\theta^{*}$ , asymptotically we (a.s.) have that $\mathcal{R}_{\infty}(\theta)=m$ , which means that $\theta\neq\theta^{*}$ will be (a.s.) excluded from the confidence region (since $q<m$ ), as the sample size tends to infinity. ∎

Regarding the computational aspects of Algorithm I, note that $\{f_{\theta,n}^{(i)}\}$ can be calculated exactly based on the available data, as they are piecewise constant functions. The distance $\|f_{\theta,n}^{(i)}-f_{\theta,n}^{(j)}\|^{2}_{2}$ can also be calculated from the available data. Nevertheless, one may use the Monte Carlo approximation

[TABLE]

where $\ell_{n}$ is a constant and $\{\bar{x}_{k}\}$ are i.i.d. random variables having uniform distribution on $\mathbb{X}$ . Note that we know from the strong law of large numbers (SLLN) that the sum in (36) almost surely converges to $\|\hskip 1.13809ptf_{\theta,n}^{(i)}-f_{\theta,n}^{(j)}\hskip 1.13809pt\|^{2}_{2}$ , as $\ell_{n}\to\infty$ .
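The Monte Carlo approximation can be illustrated with a short sketch. It is only a minimal example under stated assumptions: the input space is taken to be $\mathbb{X}=[0,1]$, so that uniform draws have unit density and the sample mean targets the squared $L_2$ distance directly, and the function name `mc_sq_l2_distance` as well as the two piecewise constant toy functions are hypothetical stand-ins for a pair of kNN estimates.

```python
import numpy as np


def mc_sq_l2_distance(f, g, num_points, rng):
    """Monte Carlo estimate of the squared L2 distance between f and g on [0, 1].

    Assumption: the input space is the unit interval, so uniform draws
    have density 1 and no volume correction is needed.
    """
    x_bar = rng.uniform(0.0, 1.0, size=num_points)  # i.i.d. uniform evaluation points
    diff = f(x_bar) - g(x_bar)
    return float(np.mean(diff ** 2))


rng = np.random.default_rng(0)
# Piecewise constant toy functions standing in for two kNN estimates (hypothetical).
f = lambda x: np.where(x < 0.5, 1.0, -1.0)
g = lambda x: np.where(x < 0.25, 1.0, -1.0)
estimate = mc_sq_l2_distance(f, g, num_points=200_000, rng=rng)
```

Here the two functions differ by $2$ on an interval of length $1/4$, so the exact squared distance is $1$, and the estimate approaches this value as the number of evaluation points grows.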

It is relatively easy to see that using the approximation in (36), instead of (26), does not affect the exact coverage probability of the algorithm. Moreover, if $\ell_{n}\to\infty$ as $n\to\infty$ , then one can also show the strong consistency of the Monte Carlo approximated variant. Hence, the theoretical properties of Theorem 2 remain valid even under (36), but the sizes of regions are of course affected by the approximation.

The kNN estimator, which is at the core of Algorithm I, is a simple kernel method that uses a variable bandwidth rectangular window. A natural generalization of this approach is to apply other kernels, such as Gaussian or Laplacian, for local averaging. Given any kernel $k(\cdot,\cdot)$, by interpreting it as a similarity measure, we can redefine functions $\{f_{\theta,n}^{(i)}\}$ as

[TABLE]

which leads to alternative confidence region constructions.
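One common form of such a kernel-weighted local average is the Nadaraya–Watson estimate, sketched below. This is an illustration only and is not claimed to match the paper's definition term by term; the Gaussian parametrization $\exp(-\|x-z\|^{2}/(2\sigma^{2}))$, the bandwidth default, the function names and the synthetic data are all assumptions.

```python
import numpy as np


def gaussian_kernel(x, z, sigma=0.5):
    """Gaussian kernel between row-stacked points x (a, d) and z (b, d)."""
    sq_dist = np.sum((x[:, None, :] - z[None, :, :]) ** 2, axis=-1)
    return np.exp(-sq_dist / (2.0 * sigma ** 2))


def kernel_regression(x_query, x_train, y_train, sigma=0.5):
    """Kernel-weighted local average of the labels (Nadaraya-Watson form)."""
    weights = gaussian_kernel(x_query, x_train, sigma)  # shape (queries, n)
    return weights @ y_train / weights.sum(axis=1)


rng = np.random.default_rng(1)
x_train = rng.uniform(-3.0, 3.0, size=(200, 1))
# Synthetic +1/-1 labels whose sign follows the input up to noise (hypothetical data).
y_train = np.where(x_train[:, 0] + 0.5 * rng.standard_normal(200) > 0, 1.0, -1.0)
x_query = np.array([[-2.0], [0.0], [2.0]])
f_hat = kernel_regression(x_query, x_train, y_train)
```

Because the estimate is a convex combination of labels in $\{+1,-1\}$, its values stay in $[-1,+1]$, as required of a candidate regression function.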

These variants typically also build confidence regions with exact coverage probabilities. Moreover, as a wide variety of such kernel estimates are strongly consistent, under some technical conditions [3], and the generalized Algorithm I inherits these properties, the resulting confidence sets are also strongly consistent. The corresponding coverage and consistency theorems could be proved analogously to Theorem 2.

IV-B Algorithm II (Embedding Based)

The core idea of Algorithm II is to embed the distribution of the original sample and that of the alternative ones in an RKHS using a characteristic kernel. If the underlying distributions are different, then the original dataset results in a different element than the one the alternative datasets are being mapped to, which can be detected statistically.

Assume $\mathcal{H}$ is a separable RKHS containing $\mathbb{S}\to\mathbb{R}$ type functions with a characteristic, bounded, and translation-invariant kernel $k(\cdot,\cdot)$ . If $\mathbb{X}=\mathbb{R}^{d}$ , then $\mathbb{S}=\mathbb{R}^{d}\times\{+1,-1\}$ , and we can use, for example, the Gaussian, the Laplacian, or the Poisson kernel, which are all characteristic [10].

Let us introduce the following kernel mean embeddings

[TABLE]

where $S_{*}$ and $S_{\theta}$ are random elements from $\mathbb{S}$. The variable $S_{*}$ has the “true” distribution of the observations, while $S_{\theta}$ has a distribution where the output, $Y$, is generated according to the conditional probability (III-A), parametrized by $\theta$, while the marginal distribution of the input, $X$, remains the same.

Since the kernel is bounded, $\mathbb{E}\big{[}\,\sqrt{k(S_{\theta},S_{\theta})}\,\big{]}<\infty$ , for all $\theta$ , which ensures that $\{h_{\theta}\}$ exist and belong to $\mathcal{H}$ [10].

Because the kernel is characteristic, we know that $h_{\theta}=h_{*}$ if and only if $\theta=\theta^{*}$ . Now, let us introduce the following empirical versions of the embedded distributions,

[TABLE]

for $i=0,\dots,m-1$, where $s_{i,j}(\theta)\doteq(x_{j},y_{i,j}(\theta))$; recall that for $i=0$ (original sample), we have $y_{i,j}(\theta)=y_{j}$. In other words, $s_{i,j}(\theta)$ has the same distribution as $S_{\theta}$ for $i\neq 0$, and the same distribution as $S_{*}$ for $i=0$.

Let $C_{k}$ be a constant that satisfies $|\,k(x,y)\,|\leq C_{k}$ for all $x,y$ . Then, obviously $|\,h_{\theta}(x)\,|\leq C_{k}$ for all $x$ , as well. Now, applying the reproducing property, we have the bound

[TABLE]

where $S$ is either $S_{*}$ or $S_{\theta}$ , and $h\,\doteq\,\mathbb{E}\big{[}\,k(\cdot,S)\,\big{]}$ .

Then, we know from the SLLN for Hilbert space valued elements that $\|\,h_{\theta,n}^{(i)}-h_{\theta}\,\|_{\mathcal{H}}\to 0$ (a.s.), as $n\to\infty$, for $i\neq 0$; additionally, $\|\,h_{\theta,n}^{(0)}-h_{*}\,\|_{\mathcal{H}}\to 0$ (a.s.), as $n\to\infty$.

Now, we can define the $\{Z_{n}^{(i)}(\theta)\}$ variables similarly to (27), but using the squared distances $\|\,h_{\theta,n}^{(i)}-h_{\theta,n}^{(j)}\,\|^{2}_{\mathcal{H}}$ instead of $\|\hskip 1.13809ptf_{\theta,n}^{(i)}-f_{\theta,n}^{(j)}\hskip 1.13809pt\|^{2}_{2}$ , and construct the confidence set as (30).

Theorem 3

The confidence regions of Algorithm II have

[TABLE]

for any sample size $n$; and they are strongly consistent.

Proof:

The exact confidence again follows from Theorem 1 by noting that the ranking satisfies P1 and P2.

The proof of consistency follows the ideas of the proof of Theorem 2. Namely, let us fix a false parameter $\theta\in\Theta$ with $\theta\neq\theta^{*}$ . Since the parametrization is injective, we know that $\mathcal{D}_{0}$ and $\{\mathcal{D}_{i}(\theta)\}_{i\neq 0}$ have different distributions. As the kernel is characteristic, we know that the RKHS embedded distributions $h_{*}(\cdot)$ and $h_{\theta}(\cdot)$ are different. We then apply the SLLN for Hilbert space valued elements [11] and use the construction of the $\{Z_{n}^{(i)}\}$ variables to get the limits

[TABLE]

for $i\neq 0$ , almost surely, where $\kappa\,\doteq\,\|h_{*}-h_{\theta}\|_{\mathcal{H}}>0$ . Thus, $Z_{n}^{(0)}(\theta)$ again tends to take rank $m$ (a.s.), as $n\to\infty$ , which leads to the (a.s.) asymptotic exclusion of the false parameter $\theta\neq\theta^{*}$ (for more details, see the proof of Theorem 2). ∎

The squared distance of the empirical versions of the embeddings $\|\,h_{\theta,n}^{(i)}-h_{\theta,n}^{(j)}\,\|^{2}_{\mathcal{H}}$ can be computed by applying the reproducing property of the kernel and the Gram matrix of the sample $s_{i,1}(\theta),\dots,s_{i,n}(\theta),s_{j,1}(\theta),\dots,s_{j,n}(\theta)$.
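As an illustration of this computation, expanding the inner product with the reproducing property gives $\|h^{(i)}-h^{(j)}\|^{2}_{\mathcal{H}}=\overline{K_{ii}}-2\,\overline{K_{ij}}+\overline{K_{jj}}$, where the bar denotes the average of all entries of the corresponding Gram block. The sketch below is a minimal example, assuming that the empirical embedding is the sample average of $k(\cdot,s)$ and that a Gaussian kernel is applied to the concatenated pair $s=(x,y)$; the function names and the toy data are hypothetical.

```python
import numpy as np


def gaussian_gram(a, b, sigma=1.0):
    """Gram matrix k(a_p, b_q) of a Gaussian kernel for row-stacked samples."""
    sq_dist = np.sum((a[:, None, :] - b[None, :, :]) ** 2, axis=-1)
    return np.exp(-sq_dist / (2.0 * sigma ** 2))


def sq_embedding_distance(s_i, s_j, sigma=1.0):
    """Squared RKHS distance between two empirical kernel mean embeddings.

    Expands <h_i - h_j, h_i - h_j> with the reproducing property:
    mean(K_ii) - 2 * mean(K_ij) + mean(K_jj).
    """
    k_ii = gaussian_gram(s_i, s_i, sigma)
    k_ij = gaussian_gram(s_i, s_j, sigma)
    k_jj = gaussian_gram(s_j, s_j, sigma)
    return float(k_ii.mean() - 2.0 * k_ij.mean() + k_jj.mean())


rng = np.random.default_rng(2)
n = 100
x = rng.standard_normal((n, 1))            # shared inputs x_1, ..., x_n
y_a = rng.choice([-1.0, 1.0], size=n)      # labels of one sample (pure noise)
y_b = np.where(x[:, 0] > 0, 1.0, -1.0)     # labels of another sample (input dependent)
s_a = np.column_stack([x, y_a])            # points s = (x, y)
s_b = np.column_stack([x, y_b])
dist_same = sq_embedding_distance(s_a, s_a)
dist_diff = sq_embedding_distance(s_a, s_b)
```

The Gram blocks depend on the labels, and hence on $\theta$ through the resampled outputs, so they have to be recomputed for every tested parameter.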

Algorithm II has a nice theoretical interpretation as comparing embedded distributions in an RKHS. However, as the Gram matrices required to compute the $\{Z_{n}^{(i)}(\theta)\}$ variables depend on $\theta$ , this method has a large computational burden, hence the importance of Algorithm II is mainly theoretical. Nevertheless, motivated by its ideas, in the next section we suggest a computationally much lighter algorithm.

IV-C Algorithm III (Discrepancy Based)

Algorithm III follows the intuitions behind Algorithm II, but ensures that we can work with the same Gram matrix for all $\theta$ . Moreover, it has a simpler construction for $\{Z_{n}^{(i)}(\theta)\}$ , which also makes it computationally more appealing.

For Algorithm III we assume that $\mathcal{H}$ is a separable RKHS containing $\mathbb{X}\to\mathbb{R}$ functions with a universal, bounded, and translation-invariant kernel $k(\cdot,\cdot)$ . We assume that $\mathbb{X}$ is a compact metric space, hence, $k(\cdot,\cdot)$ is also characteristic [10]. Finally, we assume that each $f\in\mathcal{F}$ is continuous.

Let us introduce the notation $\varepsilon_{i,j}(\theta)\,\doteq\,y_{i,j}(\theta)-f_{\theta}(x_{j})$ , for $i=0,\dots,m-1$ and $j=1,\dots,n$ . Note that if $i\neq 0$ , $\varepsilon_{i,j}(\theta)$ has zero mean for all $j$ , as $f_{\theta}(x_{j})=\,\mathbb{E}_{\theta}\big{[}\,y_{i,j}(\theta)\,|\,x_{j}\,\big{]}$ .

The fundamental objects of Algorithm III are

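$$Z_{n}^{(i)}(\theta)\,\doteq\,\bigg\|\,\frac{1}{n}\sum_{j=1}^{n}\varepsilon_{i,j}(\theta)\,k(\cdot,x_{j})\,\bigg\|^{2}_{\mathcal{H}},\qquad(45)$$

for $i=0,\dots,m-1$. Observe that $Z_{n}^{(i)}(\theta)$ can be easily computed using the Gram matrix $K_{i,j}\,\doteq\,k(x_{i},x_{j})$, as

$$Z_{n}^{(i)}(\theta)\,=\,\frac{1}{n^{2}}\,\varepsilon_{i}^{\mathrm{T}}(\theta)\,K\,\varepsilon_{i}(\theta),\qquad(46)$$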

using the notation $\varepsilon_{i}(\theta)\,\doteq\,(\varepsilon_{i,1}(\theta),\dots,\varepsilon_{i,n}(\theta))^{\mathrm{T}}$ .
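A minimal sketch of this Gram matrix computation follows. It assumes that $Z_{n}^{(i)}(\theta)$ is the squared RKHS norm of $\frac{1}{n}\sum_{j}\varepsilon_{i,j}(\theta)\,k(\cdot,x_{j})$, which by the reproducing property equals $\varepsilon_{i}^{\mathrm{T}}(\theta)K\varepsilon_{i}(\theta)/n^{2}$; the candidate function `np.tanh`, the Gaussian kernel with bandwidth $1/2$ and the function name `z_statistic` are hypothetical choices made for illustration. Alternative labels are drawn with $P(y=+1\,|\,x)=(1+f_{\theta}(x))/2$, which is what $\mathbb{E}_{\theta}[\,y\,|\,x\,]=f_{\theta}(x)$ implies for labels in $\{+1,-1\}$.

```python
import numpy as np


def z_statistic(residuals, gram):
    """Squared RKHS norm of (1/n) * sum_j residuals[j] * k(., x_j).

    By the reproducing property this equals residuals^T K residuals / n^2.
    """
    n = residuals.shape[0]
    return float(residuals @ gram @ residuals) / n ** 2


rng = np.random.default_rng(3)
n = 50
x = rng.uniform(-2.0, 2.0, size=n)
# The Gram matrix depends only on the inputs, so it is shared by all theta.
gram = np.exp(-(x[:, None] - x[None, :]) ** 2 / (2.0 * 0.5 ** 2))

f_theta = np.tanh(x)                 # hypothetical candidate regression function
p_plus = (1.0 + f_theta) / 2.0       # P(y = +1 | x) implied by E[y | x] = f_theta(x)
y_alt = np.where(rng.uniform(size=n) < p_plus, 1.0, -1.0)
z_alt = z_statistic(y_alt - f_theta, gram)
```

Only the residual vector changes with $\theta$ and with the sample index, which is the source of the computational saving compared with Algorithm II.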

From this point, we follow the construction of Algorithms I and II, namely, we define the ranking function as (28), and the confidence region as (30), but naturally we apply our new functions (45) as the definition of the $\{Z_{n}^{(i)}(\theta)\}$ variables.
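The rank-based membership test shared by the three algorithms can be sketched as follows. The exact forms of (28) and (30) are not reproduced in this section, so the code is an assumption-laden illustration: it takes the rank of the original sample to be one plus the number of alternative values ordered before it, breaks ties with a random permutation, and accepts the tested parameter when that rank is at most $q$. The function names are hypothetical.

```python
import numpy as np


def rank_of_original(z_values, rng):
    """Rank of z_values[0] among all m values, breaking ties at random."""
    m = len(z_values)
    tie_breaker = rng.permutation(m)  # random strict order used only for ties
    z0, pi0 = z_values[0], tie_breaker[0]
    smaller = sum(
        1
        for i in range(1, m)
        if z_values[i] < z0 or (z_values[i] == z0 and tie_breaker[i] < pi0)
    )
    return 1 + smaller


def in_confidence_region(z_values, q, rng):
    """Accept the tested parameter when the original sample's rank is at most q."""
    return rank_of_original(z_values, rng) <= q


rng = np.random.default_rng(4)
z_values = np.array([0.9, 0.1, 0.2, 0.3, 0.4])  # index 0 is the original sample
rank = rank_of_original(z_values, rng)
accepted = in_confidence_region(z_values, q=4, rng=rng)
```

In this toy input the original value is the largest of the $m=5$ values, so it takes rank $m$ and the parameter is rejected for any $q<m$, which mirrors the asymptotic exclusion argument used in the consistency proofs.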

Theorem 4

The confidence regions of Algorithm III have

[TABLE]

for any sample size $n$; and they are strongly consistent.

Proof:

The exact confidence follows from Theorem 1.

For the proof of strong consistency, let us fix $\theta\neq\theta^{*}$ and an $i\neq 0$ . To simplify the notations, introduce $e_{j}\doteq\varepsilon_{i,j}(\theta)$ and $\bar{y}_{j}\doteq y_{i,j}(\theta)$ . We first show that $e_{j}k(\cdot,x_{j})$ has zero mean

[TABLE]

About the variance of $e_{j}k(\cdot,x_{j})$ , observe that

[TABLE]

where $B_{k}\doteq k(x,x)$ does not depend on $x$, since the kernel is translation-invariant; also note that $\|k(\cdot,x)\|_{\mathcal{H}}^{2}=k(x,x)$, for any $x\in\mathbb{X}$, because of the reproducing property of the kernel.

Therefore, we can apply the Hilbert space valued SLLN to conclude that $Z_{n}^{(i)}(\theta)\to 0$ (a.s.), as $n\to\infty$ , for all $i\neq 0$ .

Now, let $e^{*}_{j}\doteq\varepsilon_{0,j}(\theta)=y_{j}-f_{\theta}(x_{j})$ . We will prove that the mean of $e^{*}_{j}k(\cdot,x_{j})$ is not zero. We can again show

[TABLE]

using similar steps as in (48), except in the last one, where in our case we have $\mathbb{E}[\,y_{j}\,|\,x_{j}\,]=f_{*}(x_{j})$ . We will argue that the term $\mathbb{E}\big{[}\,(f_{*}(x_{j})-f_{\theta}(x_{j}))k(\cdot,x_{j})\,\big{]}$ cannot be zero.

Let us introduce $f_{0}\doteq f_{*}-f_{\theta}$, and suppose, for contradiction, that $\mathbb{E}\big{[}\,f_{0}(x_{j})\,k(\cdot,x_{j})\,\big{]}$ is the zero function. Then, for all $x$, $\left<f_{0},k(x,\cdot)\right>_{\scriptscriptstyle P}\!\doteq\mathbb{E}\big{[}\,f_{0}(x_{j})k(x,x_{j})\,\big{]}=\,0$ (note that an RKHS is a space of functions and not that of equivalence classes of functions). Since the kernel is universal, $\mathbb{X}$ is compact, and $f_{0}$ is continuous, we know that for all $\varepsilon>0$, there exists an $\hat{f}\in\mathcal{H}$, such that $\|\,\hat{f}-f_{0}\,\|_{\infty}<\varepsilon$. Then, clearly

[TABLE]

since $P_{\scriptscriptstyle\mathbb{X}}$ is a probability measure on $\mathbb{X}$ . Hence, for all $\varepsilon>0$ ,

[TABLE]

Since $k(\cdot,\cdot)$ is the kernel of the RKHS, we can write $\hat{f}$ as

[TABLE]

for some points $\{\bar{x}_{k}\}$ . Since for all $x$ , $\left<f_{0},k(x,\cdot)\right>_{\scriptscriptstyle P}=0$ ,

[TABLE]

where we have applied Fubini’s theorem [12] to exchange the two integrals (one of which is a sum). Regarding the applicability of Fubini’s theorem note that both integrals are w.r.t. a finite measure, and the functions are bounded.

Then, combining (52) and (54) we get that for all $\varepsilon>0$ ,

[TABLE]

which implies that $\|\hskip 1.42262ptf_{0}\hskip 1.42262pt\|^{2}_{\scriptscriptstyle P}=0$ . On the other hand, we know from (8) that this norm cannot be zero if $\theta\neq\theta^{*}$ . Therefore, we have reached a contradiction, hence $\mathbb{E}\big{[}\,(f_{*}(x_{j})-f_{\theta}(x_{j}))k(\cdot,x_{j})\,\big{]}$ cannot be the zero element of the RKHS.

We can use a similar argument to (41) to show that $\mbox{Var}(e^{*}_{j}k(\cdot,x_{j}))$ is bounded, also using that $\{e^{*}_{j}\}$ are bounded. Then, applying the Hilbert space variant of SLLN [11],

[TABLE]

almost surely. Therefore, summarizing our results, we have

[TABLE]

for $i\neq 0$ , almost surely, where $\|h_{0}\|^{2}_{\mathcal{H}}>0$ . Thus, $Z_{n}^{(0)}(\theta)$ again tends to take rank $m$ (a.s.), as $n\to\infty$ , which leads to the (a.s.) asymptotic exclusion of the parameter $\theta\neq\theta^{*}$ . ∎

V Numerical Experiments

Numerical experiments were carried out to demonstrate the proposed algorithms. In the presented test scenario the joint probability distribution of the data was assumed to be the mixture of two Laplace distributions with different locations, $\mu_{1},\mu_{2}$ , but with the same scale $\lambda$ . It was assumed that with probability $p$ we observe the “ $+1$ ” class, and with $1-p$ we see an element of the “ $-1$ ” class. Selecting $p$ , $\mu_{1}$ , $\mu_{2}$ and $\lambda$ induces a regression function, e.g., see (9).
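The test scenario can be sketched in code as follows. The regression function is obtained from Bayes' rule as $f(x)=\big(p\,\varphi_{+}(x)-(1-p)\,\varphi_{-}(x)\big)/\big(p\,\varphi_{+}(x)+(1-p)\,\varphi_{-}(x)\big)$, where $\varphi_{\pm}$ are the class-conditional Laplace densities. This is a minimal illustration, assuming that the “$+1$” class is paired with location $\mu_{1}$ and the “$-1$” class with $\mu_{2}$; the function names are hypothetical, and the default values $\mu_{1}=1$, $\mu_{2}=-1$, $n=500$ are those reported for the experiments.

```python
import numpy as np


def laplace_pdf(x, loc, scale):
    """Density of the Laplace distribution with the given location and scale."""
    return np.exp(-np.abs(x - loc) / scale) / (2.0 * scale)


def regression_function(x, p, scale, mu_plus=1.0, mu_minus=-1.0):
    """E[Y | X = x] for labels in {+1, -1} under the two-component Laplace mixture."""
    plus = p * laplace_pdf(x, mu_plus, scale)
    minus = (1.0 - p) * laplace_pdf(x, mu_minus, scale)
    return (plus - minus) / (plus + minus)


def sample_data(n, p, scale, rng, mu_plus=1.0, mu_minus=-1.0):
    """Draw labels with P(Y = +1) = p, then inputs from the class-conditional Laplace."""
    y = np.where(rng.uniform(size=n) < p, 1.0, -1.0)
    loc = np.where(y > 0, mu_plus, mu_minus)
    x = rng.laplace(loc=loc, scale=scale)
    return x, y


rng = np.random.default_rng(5)
x, y = sample_data(n=500, p=0.5, scale=1.0, rng=rng)
f_at_zero = regression_function(0.0, p=0.5, scale=1.0)
f_far_right = regression_function(5.0, p=0.5, scale=1.0)
```

With $p=1/2$ and symmetric locations the regression function vanishes at the origin, and for $x\geq 1$ it is constant and equal to $\tanh(1/\lambda)$, because the log-ratio of the two Laplace densities no longer changes there.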

During the experiments the confidence regions were built for parameters $p$ and $\lambda$, while the location parameters were fixed, $\mu_{1}=1$ and $\mu_{2}=-1$, to allow two-dimensional figures. Figure 1 shows the obtained ranks $\{\mathcal{R}_{n}(\theta)\}$ for various $\theta=(p,\lambda)$ using Algorithm I with the kNN approach (a), Algorithm I with a Gaussian kernel (b) and Algorithm III with a Gaussian kernel (c). For the Gaussian kernel we chose $\sigma=\nicefrac{1}{2}$. In parts (a), (b) and (c) darker colors indicate smaller ranks, hence, the darker the color is, the more likely the parameter is included in a confidence region. The three corresponding $90\,\%$ (exact) confidence regions are shown in part (d). The true parameters were $p=\nicefrac{1}{2}$ ($x$-axis) and $\lambda=1$ ($y$-axis). The sample size was $n=500$ and $m=50$ (original and alternative) samples were generated. The regions were evaluated on a grid.

It can be observed that Algorithm III produced the most concentrated rank clusters and provided the smallest confidence region. The extended version (37) of Algorithm I, with a Gaussian kernel, produced comparable results, while the kNN version was the worst in this case. Nevertheless, it still has computational advantages which may make it attractive.

Note that in this special example it is possible to construct individual confidence regions for parameters $p$ and $\lambda$ based on standard results. One can use, for example, Hoeffding’s inequality [12] to get confidence intervals for probability $p$, and $\lambda$ can be estimated based on the fact that the variance of the observations, for both classes, is $2\lambda^{2}$. Nevertheless, such approaches rely on the specific interpretation of the parameters, namely, on how they influence the observations. Furthermore, even in this very special case it is not obvious how to construct a joint confidence region for the $(p,\lambda)$ pair. Simply intersecting the two confidence tubes (i.e., if we extend the confidence intervals for $p$ and $\lambda$ to $\mathbb{R}^{2}$, then they define two infinite “stripes”, a vertical and a horizontal one) produces a set whose guaranteed confidence is lower than that of the original sets, and hence it ultimately leads to conservative confidence regions.
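To make the conservativeness concrete, the sketch below computes a two-sided Hoeffding interval for $p$ and the union-bound guarantee of an intersected region. It is an illustration only: the function names are hypothetical, the counts are made-up inputs rather than results from the paper, and the joint level uses the standard bound $1-\alpha_{p}-\alpha_{\lambda}$ for the intersection of two regions with individual levels $1-\alpha_{p}$ and $1-\alpha_{\lambda}$.

```python
import math


def hoeffding_interval(num_plus, n, alpha):
    """Two-sided Hoeffding confidence interval for a Bernoulli probability.

    P(|p_hat - p| >= t) <= 2 * exp(-2 * n * t^2), so choosing
    t = sqrt(log(2 / alpha) / (2 * n)) gives confidence at least 1 - alpha.
    """
    p_hat = num_plus / n
    half_width = math.sqrt(math.log(2.0 / alpha) / (2.0 * n))
    return max(0.0, p_hat - half_width), min(1.0, p_hat + half_width)


def joint_confidence_lower_bound(alpha_p, alpha_lambda):
    """Union (Bonferroni) bound on the confidence of the intersected region."""
    return 1.0 - alpha_p - alpha_lambda


# Hypothetical counts: 250 observations of the "+1" class out of n = 500.
low, high = hoeffding_interval(num_plus=250, n=500, alpha=0.05)
joint_level = joint_confidence_lower_bound(0.05, 0.05)
```

Two individual $95\,\%$ regions thus only guarantee a joint level of $90\,\%$, whereas the true coverage of their intersection may be higher; this gap is what makes the intersected region conservative.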

On the other hand, the suggested three algorithms do not presuppose any interpretation of the tested parameters, apart from the fact that they determine a regression function. They do not need a fully parametrized joint distribution, indeed, the regression function is compatible with infinitely many joint distributions having widely different (marginal) input distributions. Furthermore, if $\theta\in\mathbb{R}^{d}$ , then the algorithms automatically build joint and non-conservative confidence sets. Hence, another advantage of the presented framework, apart from its strong theoretical guarantees, is its flexibility.

VI Conclusions

In this paper we addressed the problem of building non-asymptotic confidence regions for the regression function of binary classification, which is a key object defined as the conditional expectation of the class labels given the inputs.

The main idea was to test candidate models by generating alternative samples based on them, and then computing the performance of a kernel-based algorithm on all samples. If the candidate model is wrong, then the algorithm behaves differently on the alternatively generated samples than on the original one, which can be detected statistically by ranking.

Three constructions were proposed and it was proved that all of them build confidence regions with exact coverage probabilities, for any sample size, and are strongly consistent.

The proposed framework is semi-parametric, because the regression function does not determine the (joint) probability distribution of the data: it does not contain information about the (marginal) distribution of the inputs (and that is why only the outputs are resampled in the alternative datasets).

Moreover, the algorithms only indirectly depend on the given family of candidate functions, namely, their inputs are just the original sample and several alternative samples generated based on the tested function. Consequently, the family of regression functions can be arbitrary. It could even be the set of all possible regression functions which satisfy (8) and the theoretical results are still valid. If we work with an infinite dimensional class of functions, then the confidence regions cannot be explicitly constructed in practice. Nevertheless, it is still possible to test any candidate regression function to check whether it is included in a confidence set, or in other words, to quantify its uncertainty by computing how compatible it is with the available observations.

Bibliography


[1] V. N. Vapnik, Statistical Learning Theory. Wiley-Interscience, 1998.
[2] A. Goudjil, M. Pouliquen, E. Pigeon, O. Gehan, and M. M’Saad, “Identification of systems using binary sensors via support vector machines,” in 54th IEEE Conference on Decision and Control (CDC), Osaka, Japan, pp. 3385–3390, 2015.
[3] L. Györfi, M. Kohler, A. Krzyzak, and H. Walk, A Distribution-Free Theory of Nonparametric Regression. Springer, 2002.
[4] A. Carè, B. Cs. Csáji, M. Campi, and E. Weyer, “Finite-sample system identification: An overview and a new correlation method,” IEEE Control Systems Letters, vol. 2, no. 1, pp. 61–66, 2018.
[5] B. Cs. Csáji, M. C. Campi, and E. Weyer, “Sign-Perturbed Sums: A new system identification approach for constructing exact non-asymptotic confidence regions in linear regression models,” IEEE Transactions on Signal Processing, vol. 63, no. 1, pp. 169–181, 2015.
[6] S. Kolumbán, System Identification in Highly Non-Informative Environment. PhD thesis, Budapest University of Technology and Economics, Hungary, and Vrije Universiteit Brussel, Belgium, 2016.
[7] G. Pillonetto, A. Carè, and M. C. Campi, “Kernel-based SPS,” in Proceedings of the 18th IFAC Symposium on System Identification (SYSID 2018), Stockholm, Sweden, July 9-11, 2018, pp. 31–36, Elsevier, 2018.
[8] T. Hofmann, B. Schölkopf, and A. J. Smola, “Kernel methods in machine learning,” The Annals of Statistics, pp. 1171–1220, 2008.