Signal recovery by Stochastic Optimization

Anatoli Juditsky; Arkadi Nemirovski

arXiv:1903.07349·math.ST·March 19, 2019·Autom. Remote. Control.


TL;DR

This paper introduces a stochastic optimization approach for signal recovery in Generalized Linear Models, reducing the problem to solving a stochastic variational inequality with proven convergence rates.

Contribution

It proposes a novel method that simplifies signal estimation in GLMs by linking it to stochastic VI, with weaker assumptions than traditional convexity requirements.

Findings

1. Finite-time error bounds of $O(1/K)$ for strongly monotone cases

2. Efficient computational approach for stochastic VI solutions

3. Weaker structural assumptions than maximum likelihood convexity

Abstract

We discuss an approach to signal recovery in Generalized Linear Models (GLM) in which the signal estimation problem is reduced to the problem of solving a stochastic monotone variational inequality (VI). The solution to the stochastic VI can be found in a computationally efficient way, and in the case when the VI is strongly monotone we derive finite-time upper bounds on the expected $\|\cdot\|_{2}^{2}$ error converging to 0 at the rate $O(1/K)$ as the number $K$ of observations grows. Our structural assumptions are essentially weaker than those necessary to ensure convexity of the optimization problem resulting from Maximum Likelihood estimation. In hindsight, the approach we promote can be traced back directly to the ideas behind Rosenblatt's perceptron algorithm.


Equations

$$\min_{u\in{\cal X}}{\mathbf{E}}_{\omega\sim P_{x}}\{\ell(y,f(\eta^{T}u))\},$$

$$\frac{1}{K}\sum_{k=1}^{K}\ell(y_{k},f(\eta_{k}^{T}u))$$

$$\ell(y,\theta)=-\ln(p_{\theta}(y)).$$

$$\ell(y,f(\eta^{T}u))=\ln(1+\exp\{\eta^{T}u\})-y\eta^{T}u.$$

$$\min_{u\in{\cal X}}{\mathbf{E}}_{(\eta,y)\sim P_{x}}\{\ln(1+\exp\{\eta^{T}u\})-y\eta^{T}u\},$$

$$\min_{u\in{\cal X}}\frac{1}{K}\sum_{k=1}^{K}\left[\ln(1+\exp\{\eta_{k}^{T}u\})-y_{k}\eta_{k}^{T}u\right];$$

$$\ell(y,\eta^{T}u)=F(\eta^{T}u)-y\eta^{T}u,$$

$$\min_{u\in{\cal X}}{\mathbf{E}}_{(\eta,y)\sim P_{x}}\{F(\eta^{T}u)-y\eta^{T}u\}.$$

$$y=f(\eta^{T}x)+\xi,\;\;\xi\sim{\cal N}(0,\sigma^{2}I_{m}).$$

$$\min_{u\in{\cal X}}{\mathbf{E}}_{\eta\sim Q}\{\|f(\eta^{T}x)-f(\eta^{T}u)\|_{2}^{2}\},$$

$$\min_{u\in{\cal X}}\left\{\frac{1}{K}\sum_{k=1}^{K}\|y_{k}-f(\eta_{k}^{T}u)\|_{2}^{2}\right\},$$

$$\omega^{K}=\{\omega_{k}=(\eta_{k},y_{k}),\,1\leq k\leq K\}$$

$${\mathbf{E}}^{x}_{|\eta}\{y\}=f(\eta^{T}x),$$

$$[g(z)-g(z^{\prime})]^{T}[z-z^{\prime}]\geq 0\;\;\forall z,z^{\prime}\in{\mathbf{R}}^{m}.$$

$$[g(z)-g(z^{\prime})]^{T}[z-z^{\prime}]\geq\varkappa\|z-z^{\prime}\|_{2}^{2}\;\;\forall z,z^{\prime}\in Z,$$

$$d^{T}f^{\prime}(z)d\geq\varkappa d^{T}d\;\;\forall(d\in{\mathbf{R}}^{n},z\in Z).$$

$$g(x)=Af(A^{T}x+a)$$

$$F(x)=\int_{S}f(x,s)\mu(ds)$$

$$F(z)={\mathbf{E}}_{\eta\sim Q}\{\eta f(\eta^{T}z)\}$$

$${\mathbf{E}}_{(\eta,y)\sim P_{x}}\{\|\eta y\|_{2}^{2}\}\leq M^{2}.$$

$$G_{(\eta,y)}(z)=\eta f(\eta^{T}z)-\eta y:\;{\mathbf{R}}^{n}\to{\mathbf{R}}^{n}.$$

$$\begin{array}{rclr}{\mathbf{E}}_{(\eta,y)\sim P_{x}}\left\{G_{(\eta,y)}(z)\right\}&=&F(z)-F(x)\,\,\forall z\in{\mathbf{R}}^{n}&(a)\\ \|F(z)\|_{2}&\leq&M\,\forall z\in{\cal X}&(b)\\ {\mathbf{E}}_{(\eta,y)\sim P_{x}}\left\{\|G_{(\eta,y)}(z)\|_{2}^{2}\right\}&\leq&4M^{2}\,\,\forall z\in{\cal X}&(c)\\ \end{array}$$

$${\mathbf{E}}_{(\eta,y)\sim P_{x}}\{\eta y\}={\mathbf{E}}_{\eta\sim Q}\{{\mathbf{E}}^{x}_{|\eta}\{\eta y\}\}={\mathbf{E}}_{\eta}\{\eta f(\eta^{T}x)\}=F(x)$$

$${\mathbf{E}}_{(\eta,y)\sim P_{x}}\{G_{(\eta,y)}(z)\}$$

$${\mathbf{E}}_{(\eta,y)\sim P_{x}}\{\|\eta f(\eta^{T}z)\|_{2}^{2}\}$$

$$G(z)=F(z)-F(x),\;\;F(z)={\mathbf{E}}_{\eta\sim Q}\{\eta f(\eta^{T}z)\};$$

$$G(z)={\mathbf{E}}_{(\eta,y)\sim P_{x}}\{\eta f(\eta^{T}z)-\eta y\},$$

$$G_{\omega^{K}}(z)=\frac{1}{K}\sum_{k=1}^{K}\left[\eta_{k}f(\eta_{k}^{T}z)-\eta_{k}y_{k}\right].$$

$$H(z)^{T}(z-z_{*})\geq 0\;\;\forall z\in{\cal X}.$$

$$H^{T}(\bar{z})(z-\bar{z})\geq 0\;\;\forall z\in{\cal X}.$$

$$\forall z,z^{\prime}\in{\cal X}:\;[H(z)-H(z^{\prime})]^{T}[z-z^{\prime}]\geq\varkappa\|z-z^{\prime}\|_{2}^{2}.$$


Full text


Anatoli Juditsky, LJK, Université Grenoble Alpes, 700 Avenue Centrale, 38401 Domaine Universitaire de Saint-Martin-d’Hères, France

Arkadi Nemirovski, ISyE, Georgia Institute of Technology, Atlanta, Georgia 30332, USA

The first author was supported by the PGMO grant 2016-2032H. Research of the second author was supported by NSF grant CCF-1523768.


1 Introduction

Statistical estimation problems constitute one of the principal application domains of Stochastic Optimization. A typical setting is as follows (cf., e.g., [7] and references therein): we are given i.i.d. observations $\omega^{K}=(\omega_{1},...,\omega_{K})$, $\omega_{k}=(\eta_{k},y_{k})$, where $\eta_{k}\in{\mathbf{R}}^{n\times m}$ and $y_{k}\in{\mathbf{R}}^{m}$ are, respectively, realizations of regressors (independent variables) and responses (labels). We assume that the observations can be described by a Generalized Linear Model (GLM) [13, 12], that is, the conditional expectation of $y$ given $\eta$ is $f(\eta^{T}x)$, where $f(\cdot):{\mathbf{R}}^{m}\to{\mathbf{R}}^{m}$ is a known link function, and $x\in{\mathbf{R}}^{n}$ is the unknown “signal,” the vector of the model’s parameters. Our goal is to “fit the model,” that is, to recover $x$ from observations $\omega^{K}$. The standard approach to fitting the model is to choose a loss function $\ell(y,\theta):\,{\mathbf{R}}^{m}\times{\mathbf{R}}^{m}\to{\mathbf{R}}$ and to recover $x$ as an optimal solution to the optimization problem

$$\min_{u\in{\cal X}}{\mathbf{E}}_{\omega\sim P_{x}}\{\ell(y,f(\eta^{T}u))\},\qquad(1)$$

where $P_{x}$ is the distribution of the observation $\omega=(\eta,y)$ associated with “true signal $x$ ”, and ${\cal X}$ is an a priori known signal set. In other words, in the just presented framework, the statistical estimation problem reduces to the stochastic optimization problem (1), which is to be solved approximately via available observation $\omega^{K}$ . This can be done either “in batch,” minimizing in $u\in{\cal X}$ the Sample Average Approximation (SAA)

$$\frac{1}{K}\sum_{k=1}^{K}\ell(y_{k},f(\eta_{k}^{T}u))$$

of the expectation in (1) (see, e.g. [19]), or applying iterative stochastic optimization algorithms of Stochastic Approximation (SA) type [16, 21].

Assuming that the conditional, given $\eta$, distribution $P^{x}_{|\eta}$ of $y$ induced by $P_{x}$ belongs to a known parametric family ${\cal P}=\{P^{\theta}:\,\theta\in\Theta\subset{\mathbf{R}}^{m}\}$, specifically, $P^{x}_{|\eta}=P^{f(\eta^{T}x)}$, the standard choice of the loss function is given by Maximum Likelihood: assuming that distributions $P^{\theta}$ have densities $p_{\theta}$ w.r.t. a reference measure $\Pi$, one uses

$$\ell(y,\theta)=-\ln(p_{\theta}(y)).$$

For example, in the classical logistic regression $m=1$ , $f(s)=(1+{\rm e}^{-s})^{-1}$ , $\Theta=(0,1)$ , and $P^{\theta}$ , $\theta\in\Theta$ , is Bernoulli distribution, that is, label $y$ takes value 1 with probability $(1+\exp\{-\eta^{T}x\})^{-1}$ and value 0 with the complementary probability, resulting in

$$\ell(y,f(\eta^{T}u))=\ln(1+\exp\{\eta^{T}u\})-y\eta^{T}u.$$

In this case, problem (1) becomes the optimization problem

$$\min_{u\in{\cal X}}{\mathbf{E}}_{(\eta,y)\sim P_{x}}\{\ln(1+\exp\{\eta^{T}u\})-y\eta^{T}u\},$$

and its SAA becomes

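$$\min_{u\in{\cal X}}\frac{1}{K}\sum_{k=1}^{K}\left[\ln(1+\exp\{\eta_{k}^{T}u\})-y_{k}\eta_{k}^{T}u\right];\qquad(5)$$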

the optimal solution $\widehat{x}_{\textrm{\tiny ML}}(\omega^{K})$ to the latter problem is the Maximum Likelihood (ML) estimate of $x$. Assuming the signal set ${\cal X}$ to be convex, both these problems turn out to be convex, so that the SAA can be solved to global optimality in a computationally efficient fashion, and SA retains its usual convergence properties.
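To make the logistic SAA objective above concrete, here is a minimal NumPy sketch of the objective and its gradient. It is an illustration under our own assumptions, not the authors' code: the function names `logistic_saa` and `logistic_saa_grad` and the synthetic data are hypothetical, labels are scalar ($m=1$), and each row of `eta` stores one regressor $\eta_{k}^{T}$.

```python
import numpy as np

def logistic_saa(u, eta, y):
    """(1/K) * sum_k [ln(1 + exp(eta_k^T u)) - y_k * eta_k^T u]."""
    s = eta @ u                                   # inner products eta_k^T u, shape (K,)
    return float(np.mean(np.logaddexp(0.0, s) - y * s))   # logaddexp(0, s) = ln(1 + e^s)

def logistic_saa_grad(u, eta, y):
    """Gradient of the objective: (1/K) * sum_k [f(eta_k^T u) - y_k] * eta_k."""
    s = eta @ u
    p = 1.0 / (1.0 + np.exp(-s))                  # logistic link f(s)
    return eta.T @ (p - y) / len(y)

# Hypothetical synthetic data, for illustration only.
rng = np.random.default_rng(0)
K, n = 500, 4
eta = rng.standard_normal((K, n))
x_true = np.array([0.5, -0.3, 0.2, 0.1])
y = (rng.random(K) < 1.0 / (1.0 + np.exp(-eta @ x_true))).astype(float)

u = np.zeros(n)
print(logistic_saa(u, eta, y))    # at u = 0 every term equals ln 2, whatever the labels
print(logistic_saa_grad(u, eta, y))
```

The gradient has the form "regressor times (predicted mean minus label)", which is the shape of the vector fields used later in the paper.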

More generally, when distributions of observations form a conditional exponential family [3, 9], negative log-likelihood has the form

$$\ell(y,\eta^{T}u)=F(\eta^{T}u)-y\eta^{T}u,$$

with convex cumulant function $F$ , and corresponding risk minimization problem (1) reads

$$\min_{u\in{\cal X}}{\mathbf{E}}_{(\eta,y)\sim P_{x}}\{F(\eta^{T}u)-y\eta^{T}u\}.\qquad(6)$$

In this case, same as in the case of logistic regression, SAA or SA can be applied to compute Maximum Likelihood estimates of the model parameters.

Note, however, that exponential family assumption is quite restrictive. On the other hand, beyond exponential families, the convexity of the optimization problem resulting from Maximum Likelihood selection of $\ell(\cdot)$ appears to be an exception rather than a rule. For example, consider the “nonlinear Least Squares” setting in which the label $y$ is obtained from $f(\eta^{T}x)$ by adding independent of the regressor zero mean Gaussian noise:

$$y=f(\eta^{T}x)+\xi,\;\;\xi\sim{\cal N}(0,\sigma^{2}I_{m}).$$

In this case problem (1) and its SAA approximation for the ML selection of $\ell(\cdot)$ become

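$$\min_{u\in{\cal X}}{\mathbf{E}}_{\eta\sim Q}\{\|f(\eta^{T}x)-f(\eta^{T}u)\|_{2}^{2}\},$$

$$\min_{u\in{\cal X}}\left\{\frac{1}{K}\sum_{k=1}^{K}\|y_{k}-f(\eta_{k}^{T}u)\|_{2}^{2}\right\},\qquad(7)$$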

where $Q$ is the distribution of regressors (which we assume to be independent of the signal). When $f$ is nonlinear, both these problems usually are nonconvex and could be difficult to process numerically. Similarly, in the “non-exponential logistic regression,” where the “exponential sigmoid” $f(s)=(1+\exp\{-s\})^{-1}$ is replaced with a general nondecreasing link function $f(s):{\mathbf{R}}\to(0,1)$ (e.g., probit or complementary $\log-\log$ link) the ML selection of the loss function typically makes (1) and its SAA approximation nonconvex.

The goal of what follows is to propose an alternative to fitting the model via (1) with the ML-based choice of the loss function when estimating the signal underlying observations in a GLM. In hindsight, the approach we put forward in this paper can be traced back to the ideas behind Rosenblatt’s perceptron iterative algorithm [17, 4] and its batch version [10]. The structural assumptions to be imposed on the model are essentially weaker than those resulting in convex ML-based problems (1) and their SAA approximations. [Footnote 1: For instance, in the “nonlinear least squares” with $m=1$, same as in “non-exponential logistic regression,” all we need is for $f$ to be continuously differentiable with positive derivative, and for the signal set ${\cal X}$ to be convex.] Under these assumptions, instead of using the classical loss function approach [1, 8, 2, 5, 20], we reduce the estimation problem to another problem with convex structure: a strongly monotone variational inequality (VI) represented by a stochastic oracle. This VI may or may not be equivalent to a convex minimization problem. The first option definitely takes place when $m=1$, in which case the VI is equivalent to a convex optimization problem analogous to (6); but even in this case the resulting problem typically is different from the ML version of (1). The solution to the VI can be found in a computationally efficient way and turns out to be a “good” estimate of the signal underlying observations, for which we derive finite-time upper bounds on the expected $\|\cdot\|_{2}^{2}$ error, converging to 0 at the rate $O(1/K)$ as $K\to\infty$. [Footnote 2: We were unable to locate a reference to the proposed approach in the statistical literature, though it would be most surprising if the simple derivations which follow were not known.]

2 Problem statement

Throughout the paper we consider the GLM model as posed in Introduction:

Our observation depends on unknown signal $x$ known to belong to a given convex compact set ${\cal X}\subset{\mathbf{R}}^{n}$ and is

$$\omega^{K}=\{\omega_{k}=(\eta_{k},y_{k}),\,1\leq k\leq K\}\qquad(8)$$

with $\omega_{k}$ , $1\leq k\leq K$ which are i.i.d. realizations of a random pair $(\eta,y)$ with the distribution $P_{x}$ such that

• the regressor $\eta$ is a random $n\times m$ matrix with some independent of $x$ probability distribution $Q$;

• the label $y$ is $m$-dimensional random vector such that the conditional, given $\eta$, distribution of $y$ induced by $P_{x}$ has the expectation $f(\eta^{T}x)$:

$${\mathbf{E}}^{x}_{|\eta}\{y\}=f(\eta^{T}x),\qquad(9)$$

where ${\mathbf{E}}^{x}_{|\eta}$ is the expectation with respect to the conditional, given $\eta$, distribution of $y$ stemming from the distribution $P_{x}$ of $\omega=(\eta,y)$, and $f(\cdot):{\mathbf{R}}^{m}\to{\mathbf{R}}^{m}$ is a given mapping.

We are about to formulate assumptions on the parameters of a generalized linear model (namely, on $f(\cdot)$ , and the distributions $P_{x}$ , $x\in{\cal X}$ , of the pair $(\eta,y)$ ) required by the approach we are about to develop.

2.1 Preliminaries: monotone vector fields

A monotone vector field on ${\mathbf{R}}^{m}$ is a single-valued everywhere defined mapping $g(\cdot):{\mathbf{R}}^{m}\to{\mathbf{R}}^{m}$ which possesses the monotonicity property

$$[g(z)-g(z^{\prime})]^{T}[z-z^{\prime}]\geq 0\;\;\forall z,z^{\prime}\in{\mathbf{R}}^{m}.$$

We say that such a field is monotone with modulus $\varkappa\geq 0$ on a closed convex set $Z\subset{\mathbf{R}}^{m}$ , if

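$$[g(z)-g(z^{\prime})]^{T}[z-z^{\prime}]\geq\varkappa\|z-z^{\prime}\|_{2}^{2}\;\;\forall z,z^{\prime}\in Z,$$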

and say that $g$ is strongly monotone on $Z$ if the modulus of monotonicity of $g$ on $Z$ is positive. It is immediately seen that for a monotone vector field which is continuously differentiable on a closed convex set $Z$ with a nonempty interior, the necessary and sufficient condition for being monotone with modulus $\varkappa$ on the set is

$$d^{T}f^{\prime}(z)d\geq\varkappa d^{T}d\;\;\forall(d\in{\mathbf{R}}^{n},z\in Z).$$

Basic examples of monotone vector fields are:

• gradient fields $\nabla\phi(x)$ of continuously differentiable convex functions of $m$ variables or, more generally, the vector fields $[\nabla_{x}\phi(x,y);-\nabla_{y}\phi(x,y)]$ stemming from continuously differentiable functions $\phi(x,y)$ which are convex in $x$ and concave in $y$;

• “diagonal” vector fields $f(x)=[f_{1}(x_{1});f_{2}(x_{2});...;f_{m}(x_{m})]$ with monotonically nondecreasing univariate components $f_{i}(\cdot)$. If, in addition, $f_{i}(\cdot)$ are continuously differentiable with positive derivatives, then the associated field $f$ is strongly monotone on every compact convex subset of ${\mathbf{R}}^{m}$, the monotonicity modulus depending on the subset.

Monotone vector fields on ${\mathbf{R}}^{n}$ admit simple calculus which includes, in particular, the following two rules:

I. [affine substitution of argument]: If $f(\cdot)$ is a monotone vector field on ${\mathbf{R}}^{m}$ and $A$ is an $n\times m$ matrix, the vector field

$$g(x)=Af(A^{T}x+a)$$

is monotone on ${\mathbf{R}}^{n}$; if, in addition, $f$ is monotone with modulus $\varkappa\geq 0$ on a closed convex set $Z\subset{\mathbf{R}}^{m}$ and $X\subset{\mathbf{R}}^{n}$ is closed, convex, and such that $A^{T}x+a\in Z$ whenever $x\in X$, $g$ is monotone with modulus $\sigma^{2}\varkappa$ on $X$, where $\sigma$ is the minimal singular value of $A$.

II. [summation]: If $S$ is a Polish space, $f(x,s):{\mathbf{R}}^{m}\times S\to{\mathbf{R}}^{m}$ is a Borel vector-valued function which is monotone in $x$ for every $s\in S$ and $\mu(ds)$ is a Borel probability measure on $S$ such that the vector field

$$F(x)=\int_{S}f(x,s)\mu(ds)$$

is well defined for all $x$, then $F(\cdot)$ is monotone. If, in addition, $X$ is a closed convex set in ${\mathbf{R}}^{m}$ and $f(\cdot,s)$ is monotone on $X$ with Borel in $s$ modulus $\varkappa(s)$ for every $s\in S$, then $F$ is monotone on $X$ with modulus $\int_{S}\varkappa(s)\mu(ds)$.
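Rule I can be checked numerically on a small example. The sketch below is ours and rests on stated assumptions: the diagonal field $f_{i}(s)=s+\tanh(s)$ (derivative at least 1, so modulus $\varkappa=1$ on all of ${\mathbf{R}}^{m}$), a random $n\times m$ matrix $A$ with $n\leq m$ so that its smallest singular value is positive, and randomly sampled pairs of points. It reports the smallest observed slack in the inequality $[g(x)-g(x^{\prime})]^{T}[x-x^{\prime}]\geq\sigma^{2}\varkappa\|x-x^{\prime}\|_{2}^{2}$.

```python
import numpy as np

def f(z):
    """Diagonal field with f_i(s) = s + tanh(s); each derivative is >= 1, so kappa = 1."""
    return z + np.tanh(z)

rng = np.random.default_rng(1)
n, m = 3, 5                       # n <= m, so A generically has full row rank
A = rng.standard_normal((n, m))
a = rng.standard_normal(m)
kappa = 1.0
sigma_min = np.linalg.svd(A, compute_uv=False).min()   # smallest of the n singular values

def g(x):
    """Affine substitution of argument: g(x) = A f(A^T x + a)."""
    return A @ f(A.T @ x + a)

# Smallest slack of [g(x) - g(x')]^T (x - x') - sigma_min^2 * kappa * ||x - x'||^2
slack = np.inf
for _ in range(1000):
    x, xp = rng.standard_normal(n), rng.standard_normal(n)
    d = x - xp
    slack = min(slack, (g(x) - g(xp)) @ d - sigma_min**2 * kappa * (d @ d))

print(sigma_min, slack >= -1e-9)
```

A sampled check of this kind cannot prove monotonicity; it only fails to find a counterexample.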

2.2 Assumptions

In what follows, we make the following assumptions on the ingredients of the estimation problem set in Introduction:

• A.1. The vector field $f(\cdot)$ is continuous and monotone, and the vector field

$$F(z)={\mathbf{E}}_{\eta\sim Q}\{\eta f(\eta^{T}z)\}$$

is well defined (and therefore is monotone along with $f$ by I, II);

• A.2. The signal set ${\cal X}$ is a nonempty convex compact set, and the vector field $F$ is monotone with positive modulus $\varkappa$ on ${\cal X}$;

• A.3. For properly selected $M<\infty$ and every $x\in{\cal X}$ it holds

$${\mathbf{E}}_{(\eta,y)\sim P_{x}}\{\|\eta y\|_{2}^{2}\}\leq M^{2}.$$

A simple sufficient condition for the validity of Assumptions A.1-3 with properly selected $M<\infty$ and $\varkappa>0$ is as follows:

• The distribution $Q$ of $\eta$ has finite moments of all orders, and ${\mathbf{E}}_{\eta\sim Q}\{\eta\eta^{T}\}\succ 0$;

• $f$ is continuously differentiable, and $d^{T}f^{\prime}(z)d>0$ for all $d\neq 0$ and all $z$. Besides this, $f$ is of polynomial growth: for some constants $C\geq 0$ and $p\geq 0$ and all $z$ one has $\|f(z)\|_{2}\leq C(1+\|z\|_{2}^{p})$.

Verification of sufficiency is straightforward.

3 Construction and Main result

The principal observation underlying the construction we are about to present is as follows:

Proposition 3.1

Assuming that Assumptions A.1-3 hold, let us associate with the pair $(\eta,y)\in{\mathbf{R}}^{n\times m}\times{\mathbf{R}}^{m}$ the vector field

$$G_{(\eta,y)}(z)=\eta f(\eta^{T}z)-\eta y:\;{\mathbf{R}}^{n}\to{\mathbf{R}}^{n}.\qquad(12)$$

Then for every $x\in{\cal X}$ we have

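$$\begin{array}{rclr}{\mathbf{E}}_{(\eta,y)\sim P_{x}}\left\{G_{(\eta,y)}(z)\right\}&=&F(z)-F(x)\,\,\forall z\in{\mathbf{R}}^{n}&(a)\\ \|F(z)\|_{2}&\leq&M\,\forall z\in{\cal X}&(b)\\ {\mathbf{E}}_{(\eta,y)\sim P_{x}}\left\{\|G_{(\eta,y)}(z)\|_{2}^{2}\right\}&\leq&4M^{2}\,\,\forall z\in{\cal X}&(c)\\ \end{array}\qquad(13)$$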

Proof is immediate. Indeed, let $x\in{\cal X}$ . Then

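$${\mathbf{E}}_{(\eta,y)\sim P_{x}}\{\eta y\}={\mathbf{E}}_{\eta\sim Q}\{{\mathbf{E}}^{x}_{|\eta}\{\eta y\}\}={\mathbf{E}}_{\eta\sim Q}\{\eta f(\eta^{T}x)\}=F(x)$$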

(we have used (9) and the definition of $F$ ), whence,

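$${\mathbf{E}}_{(\eta,y)\sim P_{x}}\{G_{(\eta,y)}(z)\}={\mathbf{E}}_{(\eta,y)\sim P_{x}}\{\eta f(\eta^{T}z)-\eta y\}=F(z)-F(x),$$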

as stated in (13. $a$ ). Besides this, for $x,z\in{\cal X}$ , denoting by $P^{z}_{|\eta}$ the conditional, $\eta$ given, distribution of $y$ induced by the distribution $P_{z}$ of $(\eta,y)$ , and taking into account that the marginal distribution of $\eta$ induced by $P_{z}$ is $Q$ , we have

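$${\mathbf{E}}_{(\eta,y)\sim P_{x}}\{\|\eta f(\eta^{T}z)\|_{2}^{2}\}={\mathbf{E}}_{\eta\sim Q}\{\|{\mathbf{E}}_{y\sim P^{z}_{|\eta}}\{\eta y\}\|_{2}^{2}\}\leq{\mathbf{E}}_{\eta\sim Q}\{{\mathbf{E}}_{y\sim P^{z}_{|\eta}}\{\|\eta y\|_{2}^{2}\}\}={\mathbf{E}}_{(\eta,y)\sim P_{z}}\{\|\eta y\|_{2}^{2}\}\leq M^{2}.$$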

This combines with the relation ${\mathbf{E}}_{(\eta,y)\sim P_{x}}\{\|\eta y\|_{2}^{2}\}\leq M^{2}$ given by A.3 due to $x\in{\cal X}$ to imply (13. $b$ ) and (13. $c$ ). $\square$
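Relation (13.$a$) lends itself to a quick Monte Carlo sanity check. The sketch below is ours and assumes a deliberately simple instance: $m=1$, $\eta\sim{\cal N}(0,I_{n})$, the linear link $f(s)=s$, and unit-variance Gaussian noise, so that $F(z)={\mathbf{E}}\{\eta\eta^{T}\}z=z$ and the sample mean of $G_{(\eta,y)}(z)$ should be close to $z-x$. The signal `x`, the point `z`, and the sample size are hypothetical choices.

```python
import numpy as np

rng = np.random.default_rng(2)
n, N = 3, 200_000
x = np.array([0.6, -0.2, 0.3])         # "true" signal (hypothetical)
z = np.array([-0.1, 0.4, 0.5])         # point at which the field is evaluated

eta = rng.standard_normal((N, n))      # rows are regressors eta ~ N(0, I_n), m = 1
y = eta @ x + rng.standard_normal(N)   # linear link f(s) = s plus N(0, 1) noise

# G_{(eta, y)}(z) = eta * f(eta^T z) - eta * y, one row per observation
G_samples = eta * (eta @ z - y)[:, None]
G_mean = G_samples.mean(axis=0)

# For this instance F(z) = z, so (13.a) predicts a mean of F(z) - F(x) = z - x.
print(G_mean)
print(z - x)
```

The two printed vectors should agree up to sampling error of order $N^{-1/2}$.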

3.1 Main result

Recall that our goal is to recover the signal $x\in{\cal X}$ underlying observations (8). Under assumptions A.1-3, $x$ is a root of the monotone vector field

$$G(z)=F(z)-F(x),\;\;F(z)={\mathbf{E}}_{\eta\sim Q}\{\eta f(\eta^{T}z)\};$$

we know that this root belongs to ${\cal X}$ , and is unique because $G(\cdot)$ is strongly monotone on ${\cal X}$ along with $F(\cdot)$ . Now, finding a root, known to belong to a given convex compact set ${\cal X}$ , of a strongly monotone on this set vector field $G$ is known to be a computationally tractable problem, provided we have access to an “oracle” which, given on input a point $z\in{\cal X}$ , returns the value $G(z)$ of the field at the point. The latter is not exactly the case in the situation we are interested in: the field $G$ is the expectation of a random field:

$$G(z)={\mathbf{E}}_{(\eta,y)\sim P_{x}}\{\eta f(\eta^{T}z)-\eta y\},$$

and we do not know a priori what is the distribution over which the expectation is taken. However, we can sample from this distribution – the samples are exactly the observations (8), and we can use these samples to approximate somehow $G$ and use this approximation to approximate the signal $x$ . Two standard implementations of this idea are Sample Average Approximation (SAA) and Stochastic Approximation (SA). We are about to consider these two techniques as applied to the situation we are in.

3.1.1 Estimation by Sample Average Approximation

The idea underlying SAA is quite transparent: given observations (8), let us approximate the field of interest $G$ with its empirical counterpart

$$G_{\omega^{K}}(z)=\frac{1}{K}\sum_{k=1}^{K}\left[\eta_{k}f(\eta_{k}^{T}z)-\eta_{k}y_{k}\right].$$

By the Law of Large Numbers, as $K\to\infty$, the empirical field $G_{\omega^{K}}$ converges to the field of interest $G$, so that under mild regularity assumptions, when $K$ is large, $G_{\omega^{K}}$ will, with overwhelming probability, be uniformly close to $G$ on ${\cal X}$. Due to strong monotonicity of $G$, this would imply that a set of “near-zeros” of $G_{\omega^{K}}$ on ${\cal X}$ will be close to the zero $x$ of $G$, which is nothing but the signal we want to recover. The only question is how we can consistently define a “near-zero” of $G_{\omega^{K}}$ on ${\cal X}$. [Footnote 3: Note that we in general cannot define a “near-zero” of $G_{\omega^{K}}$ on ${\cal X}$ as a root of $G_{\omega^{K}}$ on this set: while $G$ does have a root belonging to ${\cal X}$, nobody told us that the same holds true for $G_{\omega^{K}}$.] A notion of a “near-zero” convenient in our context is provided by the concept of a weak solution to a variational inequality (VI) with monotone operator, defined as follows (we restrict the general definition to the situation of interest):

Let ${\cal X}\subset{\mathbf{R}}^{n}$ be a nonempty convex compact set, and $H(z):{\cal X}\to{\mathbf{R}}^{n}$ be a monotone (i.e., $[H(z)-H(z^{\prime})]^{T}[z-z^{\prime}]\geq 0$ for all $z,z^{\prime}\in{\cal X}$ ) vector field. A vector $z_{*}\in{\cal X}$ is called a weak solution to the variational inequality (VI) associated with $H,{\cal X}$ when

$$H(z)^{T}(z-z_{*})\geq 0\;\;\forall z\in{\cal X}.$$

Let ${\cal X}\subset{\mathbf{R}}^{n}$ be a nonempty convex compact set and $H$ be monotone on ${\cal X}$ . It is well known that

• The VI associated with $H,{\cal X}$ (let us denote it ${\textrm{VI}}(H,{\cal X})$) always has a weak solution. It is clear that if $\bar{z}\in{\cal X}$ is a root of $H$, then $\bar{z}$ is a weak solution to ${\textrm{VI}}(H,{\cal X})$. [Footnote 4: Indeed, when $\bar{z}\in{\cal X}$ and $H(\bar{z})=0$, monotonicity of $H$ implies that $H(z)^{T}[z-\bar{z}]=[H(z)-H(\bar{z})]^{T}[z-\bar{z}]\geq 0$ for all $z\in{\cal X}$, that is, $\bar{z}$ is a weak solution to the VI.]

• When $H$ is continuous on ${\cal X}$, every weak solution $\bar{z}$ to ${\textrm{VI}}(H,{\cal X})$ is also a strong solution, meaning that

$$H^{T}(\bar{z})(z-\bar{z})\geq 0\;\;\forall z\in{\cal X}.\qquad(15)$$

Indeed, (15) clearly holds true when $z=\bar{z}$. Assuming $z\neq\bar{z}$ and setting $z_{t}=\bar{z}+t(z-\bar{z})$, $0<t\leq 1$, we have $H^{T}(z_{t})(z_{t}-\bar{z})\geq 0$ (since $\bar{z}$ is a weak solution), whence $H^{T}(z_{t})(z-\bar{z})\geq 0$ (since $z-\bar{z}$ is a positive multiple of $z_{t}-\bar{z}$). Passing to limit as $t\to+0$ and invoking the continuity of $H$, we get $H^{T}(\bar{z})(z-\bar{z})\geq 0$, as claimed.

• When $H$ is the gradient field of a continuously differentiable convex function on ${\cal X}$ (such a field indeed is monotone), weak (or, which in the case of continuous $H$ is the same, strong) solutions to ${\textrm{VI}}(H,{\cal X})$ are exactly the minimizers of the function on ${\cal X}$.

Note also that a strong solution to ${\textrm{VI}}(H,{\cal X})$ with monotone $H$ always is a weak one: if $\bar{z}\in{\cal X}$ satisfies $H^{T}(\bar{z})(z-\bar{z})\geq 0$ for all $z\in{\cal X}$ , then $H(z)^{T}(z-\bar{z})\geq 0$ for all $z\in{\cal X}$ , since by monotonicity $H(z)^{T}(z-\bar{z})\geq H^{T}(\bar{z})(z-\bar{z})$ .

In the sequel, we heavily exploit the following simple and well known fact:

Lemma 3.1

Let ${\cal X}$ be a convex compact set, and $H$ be a monotone vector field on ${\cal X}$ with monotonicity modulus $\varkappa>0$ , i.e.

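$$\forall z,z^{\prime}\in{\cal X}:\;[H(z)-H(z^{\prime})]^{T}[z-z^{\prime}]\geq\varkappa\|z-z^{\prime}\|_{2}^{2}.$$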

Further, let $\bar{z}$ be a weak solution to ${\textrm{VI}}(H,{\cal X})$ . Then the weak solution to ${\textrm{VI}}(H,{\cal X})$ is unique. Besides this,

$$H^{T}(z)(z-\bar{z})\geq\varkappa\|z-\bar{z}\|_{2}^{2}\;\;\forall z\in{\cal X}.\qquad(16)$$

Proof: Under the premise of the lemma, let $z\in{\cal X}$ and let $\bar{z}$ be a weak solution to ${\textrm{VI}}(H,{\cal X})$ (recall that it does exist). Setting $z_{t}=\bar{z}+t(z-\bar{z})$ , for $t\in(0,1)$ we have

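$$H^{T}(z)[z-z_{t}]\geq\varkappa\|z-z_{t}\|_{2}^{2}+H^{T}(z_{t})[z-z_{t}]\geq\varkappa\|z-z_{t}\|_{2}^{2},$$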

where the first $\geq$ is due to strong monotonicity of $H$, and the second $\geq$ is due to the fact that $H^{T}(z_{t})[z-z_{t}]$ is proportional, with positive coefficient, to $H^{T}(z_{t})[z_{t}-\bar{z}]$, and the latter quantity is nonnegative since $\bar{z}$ is a weak solution to the VI in question. We end up with $H^{T}(z)(z-z_{t})\geq\varkappa\|z-z_{t}\|_{2}^{2}$; passing to limit as $t\to+0$, we arrive at (16). To prove uniqueness of a weak solution, assume that aside from the weak solution $\bar{z}$ there exists a weak solution $\widetilde{z}$ distinct from $\bar{z}$, and let us set $z^{\prime}={1\over 2}[\bar{z}+\widetilde{z}]$. Since both $\bar{z}$ and $\widetilde{z}$ are weak solutions, both the quantities $H^{T}(z^{\prime})[z^{\prime}-\bar{z}]$ and $H^{T}(z^{\prime})[z^{\prime}-\widetilde{z}]$ should be nonnegative, and because the sum of these quantities is 0, both of them are zero. Thus, when applying (16) to $z=z^{\prime}$, we get $z^{\prime}=\bar{z}$, whence $\widetilde{z}=\bar{z}$ as well. $\square$
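The inequality of Lemma 3.1, $H^{T}(z)(z-\bar{z})\geq\varkappa\|z-\bar{z}\|_{2}^{2}$ for all $z\in{\cal X}$, can be illustrated on a toy problem. The sketch below is ours: it assumes the affine field $H(z)=Qz-b$ with a hypothetical positive definite $Q$ (so $\varkappa=\lambda_{\min}(Q)$), takes ${\cal X}$ to be the unit disc, computes $\bar{z}$ by a plain projected iteration with step $1/\lambda_{\max}(Q)$, and then samples points of ${\cal X}$ to look for violations.

```python
import numpy as np

Q = np.array([[2.0, 0.5], [0.5, 1.0]])    # symmetric positive definite (hypothetical data)
b = np.array([3.0, 1.0])

def H(z):
    """Gradient field of 0.5 z^T Q z - b^T z; monotone with modulus lambda_min(Q)."""
    return Q @ z - b

def project_ball(z, radius=1.0):
    """Metric projection onto the Euclidean ball of the given radius."""
    nrm = np.linalg.norm(z)
    return z if nrm <= radius else z * (radius / nrm)

eigs = np.linalg.eigvalsh(Q)              # eigenvalues in ascending order
kappa, L = eigs[0], eigs[-1]

z_bar = np.zeros(2)
for _ in range(500):                      # projected iteration with step 1/L
    z_bar = project_ball(z_bar - H(z_bar) / L)

rng = np.random.default_rng(3)
worst = np.inf
for _ in range(2000):
    z = project_ball(rng.uniform(-1.0, 1.0, size=2))   # a point of the unit disc
    d = z - z_bar
    worst = min(worst, H(z) @ d - kappa * (d @ d))

print(z_bar, worst >= -1e-9)
```

Here the unconstrained root of $H$ lies outside the disc, so $\bar{z}$ sits on the boundary and is a solution of the VI without being a root of $H$.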

Now, let us return to the estimation problem under consideration. Assume that Assumptions A.1-3 hold, so vector fields $G_{(\eta_{k},y_{k})}(z)$ defined in (12), and therefore vector field $G_{\omega^{K}}(z)$ are continuous and monotone. When using the SAA, we compute a weak solution $\widehat{x}(\omega^{K})$ to ${\textrm{VI}}(G_{\omega^{K}},{\cal X})$ and treat it as the SAA estimate of signal $x$ underlying observations (8). Since the vector field $G_{\omega^{K}}(\cdot)$ is monotone with efficiently computable values, provided that so is $f$ , computing (a high accuracy approximation to) a weak solution to ${\textrm{VI}}(G_{\omega^{K}},{\cal X})$ is a computationally tractable problem (see, e.g., [14]). Moreover, utilizing the techniques from [6, 15, 20, 18], under mild additional to A.1-3 regularity assumptions one can get non-asymptotical upper bound on, say, the expected $\|\cdot\|_{2}^{2}$ -error of the SAA estimate as a function of the sample size $K$ and find out the rate at which this bound converges to 0 as $K\to\infty$ ; this analysis, however, goes beyond our scope.
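As an illustration of the SAA estimate, the following sketch approximates a weak solution to ${\textrm{VI}}(G_{\omega^{K}},{\cal X})$ by a simple projected iteration. It is not the algorithm of [14]; it is a minimal stand-in under our own assumptions: scalar labels ($m=1$), ${\cal X}$ the unit ball, and a nondecreasing link with known Lipschitz constant `lip_f`, in which case the empirical field is the gradient of a convex function and the step $1/L$ is safe. All names and the synthetic logistic data are hypothetical.

```python
import numpy as np

def project_ball(z, radius=1.0):
    """Metric projection onto the Euclidean ball of the given radius."""
    nrm = np.linalg.norm(z)
    return z if nrm <= radius else z * (radius / nrm)

def empirical_field(z, eta, y, f):
    """G_{omega^K}(z) = (1/K) sum_k eta_k [f(eta_k^T z) - y_k], scalar labels (m = 1)."""
    return eta.T @ (f(eta @ z) - y) / len(y)

def saa_estimate(eta, y, f, lip_f, iters=500):
    """Projected iterations z <- Proj_X[z - G_{omega^K}(z) / L] over the unit ball."""
    K, n = eta.shape
    L = lip_f * np.linalg.eigvalsh(eta.T @ eta / K)[-1]   # Lipschitz constant of the field
    z = np.zeros(n)
    for _ in range(iters):
        z = project_ball(z - empirical_field(z, eta, y, f) / L)
    return z

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))

# Hypothetical experiment: logistic link, Bernoulli labels.
rng = np.random.default_rng(4)
K, n = 5000, 5
x = np.array([0.4, -0.3, 0.2, 0.1, -0.2])
eta = rng.standard_normal((K, n))
y = (rng.random(K) < sigmoid(eta @ x)).astype(float)

x_hat = saa_estimate(eta, y, sigmoid, lip_f=0.25)   # |f'| <= 1/4 for the logistic link
print(np.linalg.norm(x_hat - x))
```

For links that are not gradient-friendly (general $m>1$), a method designed for monotone VIs would be the safer choice; the projected iteration above is used only because it is short.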

Let us look at the SAA estimate in the logistic regression model. In this case we have $f(u)=(1+{\rm e}^{-u})^{-1}$ , and

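$$G_{\omega^{K}}(z)=\frac{1}{K}\sum_{k=1}^{K}\left[\frac{\exp\{\eta_{k}^{T}z\}}{1+\exp\{\eta_{k}^{T}z\}}-y_{k}\right]\eta_{k}=\nabla_{z}\left[\frac{1}{K}\sum_{k=1}^{K}\left(\ln(1+\exp\{\eta_{k}^{T}z\})-y_{k}\eta_{k}^{T}z\right)\right].$$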

In other words, $G_{\omega^{K}}(z)$ is the gradient field of the minus empirical log-likelihood $\ell(z,\omega^{K})$, see (5). As a result, in the case in question weak solutions to ${\textrm{VI}}(G_{\omega^{K}},{\cal X})$ are exactly the optimal solutions to (5), that is, for the logistic regression the SAA estimate is nothing but the Maximum Likelihood estimate $\widehat{x}_{{\textrm{\tiny ML}}}(\omega^{K})$.

[Footnote 5: This phenomenon is specific to the logistic regression model. The SAA and the ML estimates coincide in this case because the logistic sigmoid $f(s)=\exp\{s\}/(1+\exp\{s\})$ “happens” to satisfy the identity $f^{\prime}(s)=f(s)(1-f(s))$. When replacing the exponential sigmoid with $f(s)=\phi(s)/(1+\phi(s))$ with differentiable monotonically nondecreasing positive $\phi(\cdot)$, the SAA estimate becomes the weak solution to ${\textrm{VI}}(\Phi,{\cal X})$ with

$$\Phi(z)=\sum_{k}\left[{\phi(\eta_{k}^{T}z)\over 1+\phi(\eta_{k}^{T}z)}-y_{k}\right]\eta_{k}.$$

On the other hand, the gradient field of the minus log-likelihood $-{1\over K}\sum_{k}\left[y_{k}\ln(f(\eta_{k}^{T}z))+(1-y_{k})\ln(1-f(\eta_{k}^{T}z))\right]$, which we should minimize when computing the ML estimate, is

$$\Psi(z)=\sum_{k}{\phi^{\prime}(\eta_{k}^{T}z)\over\phi(\eta_{k}^{T}z)}\left[{\phi(\eta_{k}^{T}z)\over 1+\phi(\eta_{k}^{T}z)}-y_{k}\right]\eta_{k}.$$

When $K>1$ and $\phi$ is not an exponent, $\Phi$ and $\Psi$ are “essentially different,” so that the SAA estimate typically will differ from the ML one.]

On the other hand, in the “nonlinear least squares” example described in the introduction with (for the sake of simplicity, scalar) monotone $f(\cdot)$, the vector field $G_{\omega^{K}}(\cdot)$ is given by

$$G_{\omega^{K}}(z)=\frac{1}{K}\sum_{k=1}^{K}\left[f(\eta_{k}^{T}z)-y_{k}\right]\eta_{k},$$

which is “essentially different” (provided that $f$ is nonlinear) from the gradient field

[TABLE]

of the negative log-likelihood appearing in (7). As a result, in this case the ML estimate (7) is, in general, different from the SAA estimate (and, in contrast to the ML, the SAA estimate is easy to compute).

3.1.2 Stochastic Approximation estimate

The Stochastic Approximation (SA) estimate stems from a simple algorithm – Subgradient Descent – for solving variational inequality ${\textrm{VI}}(G,{\cal X})$ . Were the values of the vector field $G(\cdot)$ available, one could approximate a root $x\in{\cal X}$ of this VI using the recurrence

$$z_{k}={\mathrm{Proj}}_{{\cal X}}\left[z_{k-1}-\gamma_{k}G(z_{k-1})\right],\;\;k=1,2,...,K,$$

where

• ${\mathrm{Proj}}_{{\cal X}}[z]$ is the metric projection of ${\mathbf{R}}^{n}$ onto ${\cal X}$:

$${\mathrm{Proj}}_{{\cal X}}[z]=\mathop{\rm argmin}_{u\in{\cal X}}\|z-u\|_{2};$$

• $\gamma_{k}>0$ are given stepsizes;

• the initial point $z_{0}$ is an arbitrary point of ${\cal X}$.

It is well known that under Assumptions A.1-3 this recurrence, with properly selected stepsizes and started at a point from ${\cal X}$, allows one to approximate the root of $G$ (in fact, the unique weak solution to ${\textrm{VI}}(G,{\cal X})$) to any desired accuracy, provided $K$ is large enough. However, we are in the situation when the actual values of $G$ are not available; the standard way to cope with this difficulty is to replace in the above recurrence the “unobservable” values $G(z_{k-1})$ of $G$ with their unbiased random estimates $G_{(\eta_{k},y_{k})}(z_{k-1})$. This modification gives rise to Stochastic Approximation (going back to [11]), namely the recurrence

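$$z_{k}={\mathrm{Proj}}_{{\cal X}}\left[z_{k-1}-\gamma_{k}G_{(\eta_{k},y_{k})}(z_{k-1})\right],\;\;1\leq k\leq K,\qquad(17)$$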

where $z_{0}$ is a point from ${\cal X}$ chosen once and for all, and $\gamma_{k}>0$ are deterministic.
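The SA recurrence can be sketched in a few lines. The code below is ours and makes several assumptions that are not taken from the text: ${\cal X}$ is the unit ball, labels are scalar ($m=1$), and the stepsizes are $\gamma_{k}=1/(\varkappa(k+1))$, a common choice for strongly monotone problems that stands in for the paper's stepsize rule (18), which is not reproduced in this text. The demo uses the linear link with $\eta\sim{\cal N}(0,I_{n})$, for which $F(z)=z$ and hence $\varkappa=1$.

```python
import numpy as np

def project_ball(z, radius=1.0):
    """Metric projection onto the Euclidean ball of the given radius."""
    nrm = np.linalg.norm(z)
    return z if nrm <= radius else z * (radius / nrm)

def sa_estimate(eta, y, f, kappa, z0=None):
    """z_k = Proj_X[z_{k-1} - gamma_k * G_{(eta_k, y_k)}(z_{k-1})] over the unit ball.

    Assumed stepsizes: gamma_k = 1 / (kappa * (k + 1)).
    """
    K, n = eta.shape
    z = np.zeros(n) if z0 is None else np.asarray(z0, dtype=float)
    for k in range(1, K + 1):
        e, yk = eta[k - 1], y[k - 1]
        g = e * (f(e @ z) - yk)                # G_{(eta_k, y_k)}(z_{k-1}) for m = 1
        z = project_ball(z - g / (kappa * (k + 1)))
    return z

# Hypothetical experiment: f(s) = s with N(0, 1) noise.
rng = np.random.default_rng(5)
K, n = 20000, 5
x = np.array([0.4, -0.3, 0.2, 0.1, -0.2])
eta = rng.standard_normal((K, n))
y = eta @ x + rng.standard_normal(K)

x_hat = sa_estimate(eta, y, lambda s: s, kappa=1.0)
print(np.linalg.norm(x_hat - x))
```

Each observation is touched exactly once, so the cost per step is $O(n)$ and no data need be stored beyond the current iterate.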

Convergence analysis.

The following result is perfectly well known; to make the paper self-contained, we present its (completely standard) proof in Appendix.

Proposition 3.2

Under Assumptions A.1-3 and with the stepsizes

[TABLE]

for every signal $x\in{\cal X}$ the sequence of estimates $\widehat{x}_{k}(\omega^{k})=z_{k}$ given by the SA recurrence (17) and $\omega_{k}=(\eta_{k},y_{k})$ defined in (8) for every $k$ obeys the error bound

[TABLE]

$P_{x}$ being the distribution of $(\eta,y)$ stemming from signal $x$.

3.2 Numerical illustration

To illustrate the above developments, we present here results of some numerical experiments. Our deliberately simplistic setup is as follows:

• ${\cal X}=\{x\in{\mathbf{R}}^{n}:\|x\|_{2}\leq 1\}$;

• the distribution $Q$ of $\eta$ is ${\cal N}(0,I_{n})$;

• $f$ is the monotone vector field on ${\mathbf{R}}$ given by one of the following four options:

A. $f(s)=\exp\{s\}/(1+\exp\{s\})$;

B. $f(s)=s$;

C. $f(s)=\max[s,0]$;

D. $f(s)=\min[1,\max[s,0]]$.

• conditional, given $\eta$, distribution of $y$ induced by $P_{x}$ is

– Bernoulli distribution with probability $f(\eta^{T}x)$ of outcome 1 in the case of A (i.e., A corresponds to the logistic model),

– Gaussian distribution ${\cal N}(f(\eta^{T}x),I_{n})$ in cases B – D.

Note that in the considered example one can easily compute the field $F(z)$ . Indeed, we have $\forall z\in{\mathbf{R}}^{n}$ :

[TABLE]

and due to the independence of $\eta^{T}z$ and $\eta_{\perp}$ ,

[TABLE]

and $F(z)$ is proportional to $z/\|z\|_{2}$ with proportionality coefficient

[TABLE]

In Figure 1 we present the plots of the function $h(t)$ for the situations A – D, along with the moduli of strong monotonicity of the corresponding mappings $F$ on the $\|\cdot\|_{2}$-ball of radius $R$ centered at the origin, as functions of $R$.
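The proportionality coefficient can be evaluated numerically. The sketch below assumes that the coefficient is $h(\|z\|_{2})$ with $h(t)={\mathbf{E}}_{\zeta\sim{\cal N}(0,1)}\{\zeta f(t\zeta)\}$, which is what the decomposition of $\eta$ into its component along $z$ and the independent component $\eta_{\perp}$ suggests. It uses Gauss-Hermite quadrature from NumPy; the helper name `h`, the dictionary `FIELDS` and the number of quadrature nodes are hypothetical choices.

```python
import numpy as np
from numpy.polynomial.hermite_e import hermegauss

# Nodes and weights for the weight function exp(-s^2 / 2); dividing the
# weights by sqrt(2*pi) turns the quadrature sum into an expectation
# over the standard normal distribution.
NODES, WEIGHTS = hermegauss(80)
WEIGHTS = WEIGHTS / np.sqrt(2.0 * np.pi)

def h(f, t):
    """Approximate E{zeta * f(t * zeta)} for zeta ~ N(0, 1)."""
    return float(np.sum(WEIGHTS * NODES * f(t * NODES)))

# The four monotone fields A - D of the numerical illustration.
FIELDS = {
    "A": lambda s: 1.0 / (1.0 + np.exp(-s)),
    "B": lambda s: s,
    "C": lambda s: np.maximum(s, 0.0),
    "D": lambda s: np.clip(s, 0.0, 1.0),
}

h_values = {name: [h(f, t) for t in (0.5, 1.0, 2.0)]
            for name, f in FIELDS.items()}
```

Under this assumption, case B gives $h(t)=t$ and case C gives $h(t)=t/2$, which provides a quick sanity check of the quadrature.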

The dimension $n$ in all experiments was set to 100, and the number of observations $K$ was 400, 1000, 4000, 10000, and 40000. For each combination of parameters we ran 10 simulations, with the signals $x$ underlying observations (8) drawn at random from the uniform distribution on the unit sphere (the boundary of ${\cal X}$).

In each experiment, we computed the SAA and the SA estimates (note that in the cases A and B the SAA estimate is the Maximum Likelihood estimate as well). The SA stepsizes $\gamma_{k}$ were selected according to (18) with “empirically selected” $\varkappa$. (Footnote 6: We could get (lower bounds on) the moduli of strong monotonicity of the vector fields $F(\cdot)$ we are interested in analytically, but this would be boring and conservative.) Namely, given observations $\omega_{k}=(\eta_{k},y_{k})$, $k\leq K$, see (8), we used them to build the SA estimate in two stages:

– at the tuning stage, we generate a random “training signal” $x^{\prime}\in{\cal X}$ and then generate labels $y_{k}^{\prime}$ as if $x^{\prime}$ were the actual signal. For instance, in the case of A, $y_{k}^{\prime}$ is assigned value 1 with probability $f(\eta_{k}^{T}x^{\prime})$ and value 0 with the complementary probability. Once the training signal and the associated labels are generated, we run SA on the resulting artificial observations with different values of $\varkappa$, compute the accuracy of the resulting estimates, and select the value of $\varkappa$ resulting in the best recovery;

– at the execution stage, we run SA on the actual data with stepsizes (18) specified by the $\varkappa$ found at the tuning stage.
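The two-stage procedure above can be sketched as follows for case A. This is an illustrative sketch under stated assumptions, not the code used in the experiments: the training signal is drawn uniformly from the unit sphere, the candidate values in `kappa_grid` are arbitrary, and the names `sa_unit_ball` and `tune_then_run` are hypothetical.

```python
import numpy as np

def logistic(s):
    return 1.0 / (1.0 + np.exp(-s))

def sa_unit_ball(etas, ys, f, kappa):
    """Projected SA on the unit ball with gamma_k = 1 / (kappa * (k + 1))."""
    z = np.zeros(etas.shape[1])
    for k, (eta, y) in enumerate(zip(etas, ys), start=1):
        z = z - eta * (f(eta @ z) - y) / (kappa * (k + 1))
        norm = np.linalg.norm(z)
        if norm > 1.0:
            z = z / norm                    # projection onto the unit ball
    return z

def tune_then_run(etas, ys, f, kappa_grid, rng):
    # Tuning stage: artificial Bernoulli labels generated from a random
    # training signal x' (drawn on the unit sphere; an assumption).
    x_train = rng.standard_normal(etas.shape[1])
    x_train /= np.linalg.norm(x_train)
    y_train = (rng.random(len(etas)) < f(etas @ x_train)).astype(float)
    errors = [np.linalg.norm(sa_unit_ball(etas, y_train, f, kp) - x_train)
              for kp in kappa_grid]
    best_kappa = kappa_grid[int(np.argmin(errors))]
    # Execution stage: SA on the actual labels with the selected kappa.
    return sa_unit_ball(etas, ys, f, best_kappa), best_kappa

# Hypothetical data for case A (logistic model).
rng = np.random.default_rng(3)
n, K = 10, 5000
x = rng.standard_normal(n)
x /= np.linalg.norm(x)
etas = rng.standard_normal((K, n))
ys = (rng.random(K) < logistic(etas @ x)).astype(float)

kappa_grid = [0.05, 0.1, 0.2, 0.4]
x_hat, best_kappa = tune_then_run(etas, ys, logistic, kappa_grid, rng)
```

The tuning stage reuses the actual regressors $\eta_{k}$ and only replaces the labels, so its cost is one SA run per candidate value of $\varkappa$.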

The results of some numerical experiments are presented in Figure 2.

Note that the CPU time for SA includes both the tuning and the execution stages. The conclusion from these experiments is that, as far as estimation quality is concerned, the SAA estimate marginally outperforms the SA estimate, while being significantly more time consuming. Note also that the dependence of the recovery errors on $K$ observed in our experiments is consistent with the convergence rate $O(1/\sqrt{K})$ of the $\|\cdot\|_{2}$-error, which corresponds to the $O(1/K)$ bound on the expected squared error established by Proposition 3.2.

4 “Single-observation” case

Let us look at the special case of the estimation problem where the sequence $\eta_{1},...,\eta_{K}$ of regressors in (8) is deterministic. At first glance, this situation goes beyond our setup, where the regressors should be drawn i.i.d. from some distribution $Q$. We can, however, circumvent this “contradiction” by saying that we are now in the single-observation case with the regressor being the matrix $[\eta_{1},...,\eta_{K}]$ and $Q$ being a degenerate distribution supported at a singleton. Specifically, consider the case where our observation is

[TABLE]

( $m,n,K$ are given positive integers), and the distribution $P_{x}$ of observation stemming from a signal $x\in{\mathbf{R}}^{n}$ is as follows:

• $\eta$ is a given deterministic matrix independent of $x$;

• $y$ is random, and the distribution of $y$ induced by $P_{x}$ has mean $\phi(\eta^{T}x)$, where $\phi:{\mathbf{R}}^{mK}\to{\mathbf{R}}^{mK}$ is a given mapping.

As an instructive example connecting our current setup with the previous one, consider the case where $\eta=[\eta_{1},...,\eta_{K}]$ with $n\times m$ deterministic “individual regressors” $\eta_{k}$, $y=[y_{1};...;y_{K}]$ with random “individual labels” $y_{k}\in{\mathbf{R}}^{m}$ conditionally independent, given $x$, across $k$, and such that the expectations of $y_{k}$ induced by $x$ are $f(\eta_{k}^{T}x)$ for some $f:{\mathbf{R}}^{m}\to{\mathbf{R}}^{m}$. We set $\phi([u_{1};...;u_{K}])=[f(u_{1});...;f(u_{K})]$. The resulting “single observation” model is a natural analog of the $K$-observation model considered so far, the only difference being that the individual regressors now form a fixed deterministic sequence rather than a sample drawn from the distribution of a random matrix.

As everywhere in this paper, our goal is to use observation (20) to recover the (unknown) signal $x$ underlying, as explained above, the distribution of the observation. Formally, we are now in the case $K=1$ of our previous recovery problem where $Q$ is supported on a singleton $\{\eta\}$, and we can use the constructions developed so far. Specifically,

• The vector field $F(z)$ associated with our problem (it used to be ${\mathbf{E}}_{\eta\sim Q}\{\eta f(\eta^{T}z)\}$) is

$$F(z)=\eta\phi(\eta^{T}z),$$

and the vector field $G(z)=F(z)-F(x)$, $x$ being the signal underlying observation (20), is

$$G(z)=\eta\phi(\eta^{T}z)-\eta\phi(\eta^{T}x)$$

(cf. (14)). As before, the signal to recover is a zero of the latter field. Note that now the vector field $F(z)$ is observable, and the vector field $G$ is still the expectation, over $P_{x}$, of an observable vector field:

$$G(z)={\mathbf{E}}_{y\sim P_{x}}\{G_{y}(z)\},\;\;G_{y}(z)=F(z)-\eta y,$$

cf. Lemma 3.1.

•

Assumptions A.1-2 now read

A.1′ The vector field $\phi(\cdot):{\mathbf{R}}^{mK}\to{\mathbf{R}}^{mK}$ is continuous and monotone, so that $F(\cdot)$ is continuous and monotone as well,

A.2′ ${\cal X}$ is a nonempty compact convex set, and $F$ is strongly monotone, with modulus $\varkappa>0$ , on ${\cal X}$ .

A simple sufficient condition for the validity of the above monotonicity assumptions is positive definiteness of the matrix $\eta\eta^{T}$ plus strong monotonicity of $\phi$ on every bounded set.

•

For our present purposes, it is convenient to reformulate assumption A.3 in the following equivalent form:

A.3′ For properly selected $\sigma\geq 0$ and every $x\in{\cal X}$ it holds

$${\mathbf{E}}_{y\sim P_{x}}\left\{\|\eta[y-\phi(\eta^{T}x)]\|_{2}^{2}\right\}\leq\sigma^{2}.$$

In the present setting, the SAA estimate $\widehat{x}(y)$ is the unique weak solution to ${\textrm{VI}}(G_{y},{\cal X})$ , and we can easily quantify the quality of this estimate:

Proposition 4.1

In the situation in question, let Assumptions A.1′-3′ hold. Then for every $x\in{\cal X}$ and every realization $(\eta,y)$ of observation (20) induced by $x$ one has

$$\|\widehat{x}(y)-x\|_{2}\leq\varkappa^{-1}\|\eta[y-\phi(\eta^{T}x)]\|_{2},\qquad(21)$$

whence also

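$${\mathbf{E}}_{y\sim P_{x}}\left\{\|\widehat{x}(y)-x\|_{2}^{2}\right\}\leq\sigma^{2}/\varkappa^{2}.\qquad(22)$$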

Proof. Let $x\in{\cal X}$ be the signal underlying observation (20), and $G(z)=F(z)-F(x)$ be the associated vector field $G$ . We have

[TABLE]

For $y$ fixed, $\bar{z}=\widehat{x}(y)$ is the weak, and therefore the strong (since $G_{y}(\cdot)$ is continuous) solution to ${\textrm{VI}}(G_{y},{\cal X})$ , implying, due to $x\in{\cal X}$ , that

[TABLE]

whence

[TABLE]

Besides this, $G(x)=0$ , whence $G^{T}(x)[x-\bar{z}]=0$ , and we arrive at

[TABLE]

whence also

[TABLE]

(recall that $G$ , along with $F$ , is strongly monotone with modulus $\varkappa$ on ${\cal X}$ and $x,\bar{z}\in{\cal X}$ ). Applying the Cauchy inequality, we arrive at (21). $\square$
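The bound can be checked numerically in the simplest linear case. The sketch below rests on several assumptions made for illustration only: $m=1$, $\phi(u)=\varkappa_{\phi}u$, $G_{y}(z)=\eta\phi(\eta^{T}z)-\eta y$, and ${\cal X}$ is a ball large enough to contain the root of $G_{y}$, so that the SAA estimate is this root. The inequality tested is $\|\widehat{x}(y)-x\|_{2}\leq\varkappa^{-1}\|\eta[y-\phi(\eta^{T}x)]\|_{2}$, which is what the chain of inequalities in the proof yields, with $\varkappa$ taken as $\varkappa_{\phi}$ times the smallest eigenvalue of $\eta\eta^{T}$. All variable names and numerical values are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)
n, K, kappa_phi, lam = 20, 400, 0.5, 1.0

eta = rng.standard_normal((n, K))           # n x K regressor matrix (m = 1)
x = rng.standard_normal(n)
x /= 2.0 * np.linalg.norm(x)                # signal of norm 1/2
noise = lam * rng.standard_normal(K)
y = kappa_phi * (eta.T @ x) + noise         # y = phi(eta^T x) + noise

# In the linear case G_y(z) = kappa_phi * (eta eta^T) z - eta y, and its
# root is available in closed form.
gram = eta @ eta.T
x_hat = np.linalg.solve(kappa_phi * gram, eta @ y)

# Modulus of strong monotonicity of F(z) = kappa_phi * gram @ z.
kappa = kappa_phi * np.linalg.eigvalsh(gram)[0]

lhs = np.linalg.norm(x_hat - x)             # recovery error
rhs = np.linalg.norm(eta @ noise) / kappa   # right-hand side of the bound
```

In this linear setting the bound is just the operator-norm estimate for $(\varkappa_{\phi}\eta\eta^{T})^{-1}$ applied to $\eta[y-\phi(\eta^{T}x)]$, so `lhs <= rhs` holds for every realization of the noise.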

Example. Consider the case where $m=1$ , $\phi$ is strongly monotone, with modulus $\varkappa_{\phi}>0$ , on the entire ${\mathbf{R}}^{K}$ , and $\eta$ in (20) is drawn from a “Gaussian ensemble” – the columns $\eta_{k}$ of the $n\times K$ matrix $\eta$ are independent ${\cal N}(0,I_{n})$ -random vectors. Assume also that the observation noise is Gaussian:

[TABLE]

It is well known that as $K/n\to\infty$ , the minimal singular value of the $n\times n$ matrix $\eta\eta^{T}$ is at least $O(1)K$ with overwhelming probability, implying that when $K/n\gg 1$ , the typical modulus of strong monotonicity of $F(\cdot)$ is $\varkappa\geq O(1)K\varkappa_{\phi}$ . Furthermore, in our situation, as $K/n\to\infty$ , the Frobenius norm of $\eta$ with overwhelming probability is at most $O(1)\sqrt{nK}$ . In other words, when $K/n$ is large, a “typical” recovery problem from the just described ensemble satisfies the premise of Proposition 4.1 with $\varkappa=O(1)K\varkappa_{\phi}$ and $\sigma^{2}=O(\lambda^{2}nK)$ . As a result, (22) reads

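$${\mathbf{E}}_{y\sim P_{x}}\left\{\|\widehat{x}(y)-x\|_{2}^{2}\right\}\leq O(1){\lambda^{2}n\over\varkappa_{\phi}^{2}K}.$$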

It is well known that in the standard case of linear regression, where $\phi(x)=\varkappa_{\phi}x$ , the resulting bound is near-optimal, provided ${\cal X}$ is large enough.
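The two scaling facts used in this example are easy to observe empirically. The following sketch, with hypothetical sizes $n=20$ and $K\in\{200,2000,20000\}$, draws $\eta$ from the Gaussian ensemble and records the smallest eigenvalue of $\eta\eta^{T}$ divided by $K$ together with the squared Frobenius norm of $\eta$ divided by $nK$; both ratios should be of order one when $K/n$ is large.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 20
ratios = {}
for K in (200, 2000, 20000):
    eta = rng.standard_normal((n, K))       # columns are N(0, I_n) vectors
    lam_min = np.linalg.eigvalsh(eta @ eta.T)[0]
    frob_sq = np.sum(eta ** 2)
    # (lambda_min / K, ||eta||_F^2 / (n K))
    ratios[K] = (lam_min / K, frob_sq / (n * K))
```

Multiplying the first ratio by $K\varkappa_{\phi}$ gives the typical modulus $\varkappa$ of the example, and multiplying the second by $\lambda^{2}nK$ gives the corresponding $\sigma^{2}$.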

Numerical illustration: in the situation described in the example above, we set $m=1$ , $n=100$ and use

[TABLE]

The set ${\cal X}$ is the unit ball $\{x\in{\mathbf{R}}^{n}:\|x\|_{2}\leq 1\}$. In a particular experiment, $\eta$ is chosen at random from the Gaussian ensemble as described above, and the signal $x\in{\cal X}$ underlying observation (20) is drawn at random; the observation noise $y-\phi(\eta^{T}x)$ is ${\cal N}(0,\lambda^{2}I_{K})$. We ran 10 simulations for each combination of the sample size and noise variance $\lambda^{2}$; the results are presented in Figure 3.

Appendix A Proof of Proposition 3.2

We start by observing that $z_{k}$ are deterministic functions of the initial fragments $\omega^{k}=\{\omega_{t},1\leq t\leq k\}\sim\underbrace{P_{x}\times...\times P_{x}}_{P^{k}_{x}}$ of our sequence of observations $\omega^{K}=\{\omega_{k}=(\eta_{k},y_{k}),1\leq k\leq K\}$ : $z_{k}=Z_{k}(\omega^{k})$ . Let us set

[TABLE]

where $x\in{\cal X}$ is the signal underlying observations (8). Note that, as is well known, the metric projection onto a closed convex set ${\cal X}$ is nonexpansive:

[TABLE]

Consequently, for $1\leq k\leq K$ it holds

[TABLE]

Taking expectations w.r.t. $\omega^{k}\sim{P_{x}^{k}}$ of both sides of the resulting inequality and keeping in mind relations (13) along with the fact that $z_{k-1}\in{\cal X}$ , we get

[TABLE]

Recalling that we are in the case where $G$ is strongly monotone on ${\cal X}$ with modulus $\varkappa>0$, $x$ is the weak solution to ${\textrm{VI}}(G,{\cal X})$, and $z_{k-1}$ takes values in ${\cal X}$, and invoking (16), we see that the expectation in (23) is at least $2\varkappa d_{k-1}$, and we arrive at the relation

[TABLE]

We put

[TABLE]

note that $\gamma_{k}$ are exactly the stepsizes (18). Let us verify by induction in $k$ that for $k=0,1,...,K$ it holds

[TABLE]

Base $k=0$. Let $D$ stand for the $\|\cdot\|_{2}$-diameter of ${\cal X}$, and let $z_{\pm}\in{\cal X}$ be such that $\|z_{+}-z_{-}\|_{2}=D$. By (13) we have $\|F(z)\|_{2}\leq M$ for all $z\in{\cal X}$, and by strong monotonicity of $G(\cdot)$ on ${\cal X}$ we have

[TABLE]

By the Cauchy inequality, the left hand side in the concluding $\geq$ is at most $2MD$, and we get

[TABLE]

whence $S\geq D^{2}/2$ . On the other hand, due to the origin of $d_{0}$ we have $d_{0}\leq D^{2}/2$ . Thus, $(*_{0})$ holds true.

Inductive step $(*_{k-1})\Rightarrow(*_{k})$ . Now assume that $(*_{k-1})$ holds true for some $k$ , $1\leq k\leq K$ , and let us prove that $(*_{k})$ holds true as well. Observe that $\varkappa\gamma_{k}=(k+1)^{-1}\leq 1/2$ , so that

[TABLE]

so that $(*_{k})$ holds true. Induction is complete. It remains to note that by definition of $d_{k}$ we have $d_{k}={1\over 2}{\mathbf{E}}\{\|\widehat{x}_{k}-x\|_{2}^{2}\}$. $\square$
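The induction can be sanity-checked numerically. The sketch below assumes that the recursive relation reads $d_{k}\leq(1-2\varkappa\gamma_{k})d_{k-1}+{1\over 2}\gamma_{k}^{2}M^{2}$, that $S=2M^{2}/\varkappa^{2}$, and that $(*_{k})$ is the bound $d_{k}\leq S/(k+1)$. These forms are consistent with the base and the inductive step above, but they are reconstructions and should be read as assumptions. The function name `worst_case_recursion` and the test values of $M$ and $\varkappa$ are hypothetical.

```python
def worst_case_recursion(M, kappa, K):
    """Iterate the recursion with equality, starting from d_0 = S.

    Returns the list of pairs (d_k, S / (k + 1)) for k = 0, ..., K, where
    d_k = (1 - 2*kappa*gamma_k) * d_{k-1} + 0.5 * gamma_k**2 * M**2 and
    gamma_k = 1 / (kappa * (k + 1)).  S = 2 * M**2 / kappa**2 is assumed.
    """
    S = 2.0 * M ** 2 / kappa ** 2
    d = S
    pairs = [(d, S)]
    for k in range(1, K + 1):
        gamma = 1.0 / (kappa * (k + 1))
        d = (1.0 - 2.0 * kappa * gamma) * d + 0.5 * gamma ** 2 * M ** 2
        pairs.append((d, S / (k + 1)))
    return pairs

pairs = worst_case_recursion(M=3.0, kappa=0.2, K=1000)
bound_holds = all(d <= bound + 1e-12 for d, bound in pairs)
```

Starting from the largest admissible $d_{0}$ and taking equality at every step gives the worst case allowed by the recursion, so the check covers every sequence $d_{k}$ satisfying it.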

Bibliography


[1] M. Aiserman, E. M. Braverman, and L. Rozonoer. Theoretical foundations of the potential function method in pattern recognition. Avtomat. i Telemeh, 25:917–936, 1964.
[2] M. Aizerman, E. Braverman, and L. Rozonoer. Method of potential functions in the theory of learning machines. Nauka, Moscow, 1970.
[3] O. Barndorff-Nielsen. Information and exponential families in statistical theory. 1978.
[4] H.-D. Block. The perceptron: A model for brain functioning. I. Reviews of Modern Physics, 34(1):123, 1962.
[5] O. Bousquet, S. Boucheron, and G. Lugosi. Introduction to statistical learning theory. In Advanced lectures on machine learning, pages 169–207. Springer, 2004.
[6] O. Bousquet and A. Elisseeff. Stability and generalization. Journal of machine learning research, 2(Mar):499–526, 2002.
[7] L. Devroye, L. Györfi, and G. Lugosi. A probabilistic theory of pattern recognition, volume 31. Springer Science & Business Media, 2013.
[8] I. Devyaterikov, A. Propoi, and Y. Z. Tsypkin. Iterative learning algorithms for pattern recognition. Automation and Remote Control, 28:122–132, 1967.