A New Look at an Old Problem: A Universal Learning Approach to Linear Regression

Koby Bibas; Yaniv Fogel; Meir Feder

arXiv:1905.04708·cs.LG·November 11, 2019

TL;DR

This paper applies universal learning principles to linear regression, providing analytical solutions and insights into generalization capabilities, especially in over-parameterized settings where the number of features exceeds the training samples.

Contribution

It introduces an analytical pNML solution for universal learning in linear regression, including cases with more parameters than data points, and explores conditions for successful generalization.

Findings

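01. Learnability depends on how well the test vector aligns with the subspace spanned by the large-eigenvalue eigenvectors of the empirical correlation matrix of the training data.

02. Linear regression can generalize in over-parameterized models when the test vector lies mostly in that subspace.

03. A simulation demonstrates the results on polynomial fitting with a possibly large polynomial degree.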

Equations

$$L(q;x,y) = -\log q(y|x).$$

$$P_{\Theta} = \{p_{\theta}(y|x),\;\; \theta\in\Theta\}$$

$$\hat{\theta}(z^{N},x,y) = \arg\max_{\theta}\left[p_{\theta}(y|x)\cdot\prod_{i=1}^{N}p_{\theta}(y_{i}|x_{i})\right]$$

$$R(z^{N},x,y,q) = \log\frac{p_{\hat{\theta}(z^{N},x,y)}(y|x)}{q(y|x;z^{N})}.$$

$$\min_{q}\max_{y} R(z^{N},x,y,q) = R^{*}(z^{N},x)$$

$$q_{\mathrm{pNML}}(y|x;z^{N}) = \frac{p_{\hat{\theta}(z^{N},x,y)}(y|x)}{\sum_{y\in{\cal Y}}p_{\hat{\theta}(z^{N},x,y)}(y|x)}.$$

$$R^{*}(z^{N},x) = \log\left\{\sum_{y\in{\cal Y}}p_{\hat{\theta}(z^{N},x,y)}(y|x)\right\}.$$

$$\left\{p_{\theta}(y|x)=\frac{1}{\sqrt{2\pi\sigma^{2}}}\exp\left\{-\frac{1}{2\sigma^{2}}\left(y-x^{T}\theta\right)^{2}\right\},\;\;\theta\in{\cal R}^{M}\right\}$$

$$q_{\mathrm{ERM}}(y|x) = \arg\min_{p_{\theta}}\frac{1}{N}\sum_{i=1}^{N}L(p_{\theta};x_{i},y_{i}).$$

$$h_{ii} = x_{i}^{T}(XX^{T})^{-1}x_{i}$$

$$(\hat{y}-y)\sim{\cal N}\left(0,\;\hat{\sigma}^{2}x^{T}(XX^{T})^{-1}x\right).$$

$$\begin{aligned} y_{1} &= x_{1}^{T}\theta+e_{1} \\ &\;\;\vdots \\ y_{N} &= x_{N}^{T}\theta+e_{N} \end{aligned}$$

$$p_{\theta}(y)=\frac{1}{\sqrt{2\pi\sigma^{2}}}\exp\left\{-\frac{1}{2\sigma^{2}}\left(y-x^{T}\theta\right)^{2}\right\}.$$

$$q_{\mathrm{pNML}}(y|x;z^{N}) = \frac{1}{K}\,p_{\hat{\theta}(z^{N};x,y)}(y|x).$$

$$\hat{\theta}(z^{N},x,y) = \arg\min_{\theta\in\Theta}\left[\sum_{i=1}^{N}(y_{i}-x_{i}^{T}\theta)^{2}+(y-x^{T}\theta)^{2}\right]$$

$$K = \int_{R}p_{\hat{\theta}(z^{N};x,y)}(y|x)\,dy,$$

$$X = \begin{bmatrix}x_{1} & \dots & x_{N} & x\end{bmatrix},\qquad Y = \begin{bmatrix}y_{1}\\ \vdots\\ y_{N}\\ y\end{bmatrix}$$

$$\hat{\theta}(z^{N},x,y) = \theta_{N+1}^{*} = (XX^{T})^{-1}XY$$

$$\theta_{N+1}^{*} = \theta_{N}^{*}+P_{N+1}x(y-\hat{y})$$

$$P_{N+1} = (XX^{T})^{-1}.$$

$$\begin{aligned} p_{\theta_{N+1}^{*}}(y) &= \frac{1}{\sqrt{2\pi\sigma^{2}}}\exp\left\{-\frac{1}{2\sigma^{2}}\left(y-x^{T}\theta_{N+1}^{*}\right)^{2}\right\} \\ &= \frac{1}{\sqrt{2\pi\sigma^{2}}}\exp\left\{-\frac{1}{2\sigma^{2}}\left(y-x^{T}\left(\theta^{*}_{N}+P_{N+1}x(y-\hat{y})\right)\right)^{2}\right\} \\ &= \frac{1}{\sqrt{2\pi\sigma^{2}}}\exp\left\{-\frac{(1-x^{T}P_{N+1}x)^{2}}{2\sigma^{2}}\left(y-\hat{y}\right)^{2}\right\}. \end{aligned}$$

$$K = \int_{-\infty}^{\infty}\frac{1}{\sqrt{2\pi\sigma^{2}}}\exp\left\{-\frac{(1-x^{T}P_{N+1}x)^{2}}{2\sigma^{2}}(y-\hat{y})^{2}\right\}dy = \frac{1}{1-x^{T}P_{N+1}x} = \frac{1}{1-x^{T}(XX^{T})^{-1}x}$$

$$q_{\mathrm{pNML}}(y|x;z^{N}) = \frac{1}{K}\,p_{\theta_{N+1}^{*}}(y|x) = \frac{1-x^{T}P_{N+1}x}{\sqrt{2\pi\sigma^{2}}}\exp\left\{-\frac{(1-x^{T}P_{N+1}x)^{2}}{2\sigma^{2}}(y-\hat{y})^{2}\right\}$$

$$\Gamma = \log K = \log\left(\frac{1}{1-x^{T}(XX^{T})^{-1}x}\right).$$

$$L(z^{N}) = \sum_{i=1}^{N}|y_{i}-x_{i}^{T}\theta|^{2}+\lambda|\theta|^{2}$$

$$\hat{\theta}(z^{N},x,y) = \theta_{N+1}^{*} = (XX^{T}+\lambda I)^{-1}XY$$

$$\theta_{N+1}^{*} = \theta_{N}^{*}+P_{N+1}x(y-\hat{y})$$

$$P_{N+1} = (XX^{T}+\lambda I)^{-1}.$$

$$q_{\mathrm{pNML}}(y|x;z^{N},\lambda) = \frac{1-x^{T}(XX^{T}+\lambda I)^{-1}x}{\sqrt{2\pi\sigma^{2}}}\cdot\exp\left\{-\frac{(1-x^{T}(XX^{T}+\lambda I)^{-1}x)^{2}}{2\sigma^{2}}(y-\hat{y})^{2}\right\}$$

$$\Gamma = \log K = \log\left(\frac{1}{1-x^{T}(XX^{T}+\lambda I)^{-1}x}\right)$$

Taxonomy

Methods: Linear Regression

Full text

A New Look at an Old Problem: A Universal Learning Approach to Linear Regression

Koby Bibas, Yaniv Fogel, Meir Feder

School of Electrical Engineering, Tel Aviv University

Abstract

Linear regression is a classical paradigm in statistics. A new look at it is provided via the lens of universal learning. In applying universal learning to linear regression the hypotheses class represents the label $y\in{\cal R}$ as a linear combination of the feature vector $x^{T}\theta$ where $x\in{\cal R}^{M}$ , within a Gaussian error. The Predictive Normalized Maximum Likelihood (pNML) solution for universal learning of individual data can be expressed analytically in this case, as well as its associated learnability measure. Interestingly, the situation where the number of parameters $M$ may even be larger than the number of training samples $N$ can be examined. As expected, in this case learnability cannot be attained in every situation; nevertheless, if the test vector resides mostly in a subspace spanned by the eigenvectors associated with the large eigenvalues of the empirical correlation matrix of the training data, linear regression can generalize despite the fact that it uses an “over-parametrized” model. We demonstrate the results with a simulation of fitting a polynomial to data with a possibly large polynomial degree.

I Introduction

Linear regression using least squares is probably one of the most standard techniques in statistics [1]. This work provides a new view of this problem based on recent results in universal learning. In particular, the common assumption in linear regression is that the number of training samples needs to be higher than the number of features in order to be able to generalize [2]. Recently, the success of Deep Neural Networks (DNNs), in which the number of learnable parameters may be greater by several orders of magnitude than the size of the feature space, requires rethinking that assumption. The new view we provide will show that sometimes generalization can be attained even in the “over-parameterized” regime.

Before diving into this analysis, a short introduction to universal learning is provided. In the common situation of supervised machine learning, a training set is given consisting of $N$ pairs $z^{N}=\{(x_{i},y_{i})\}_{i=1}^{N}$ , where $x\in{\cal X}$ is the data or the features and $y\in{\cal Y}$ is the label. Then, a new $x$ is given and the task is to predict its corresponding label $y$ . In the information theoretic framework considered in a variety of works, e.g., [3] and more recently [4], prediction is done by assigning a probability distribution $q(\cdot|x)$ to the unknown label, and the prediction loss is the log-loss:

$$L(q;x,y) = -\log q(y|x). \tag{1}$$

Clearly a reasonable goal is to find the predictor $q$ with the minimal loss for the test sample. However, this problem is ill-posed unless additional assumptions are made.

First, a “model” class, or “hypotheses” class, must be defined. This class is a set of conditional probability distributions

$$P_{\Theta} = \{p_{\theta}(y|x),\;\; \theta\in\Theta\} \tag{2}$$

where $\Theta$ is a general index set. This is equivalent to saying that there is a set of stochastic functions $\{y=g_{\theta}(x),\;\;\theta\in\Theta\}$ used to explain the relation between $x$ and $y$ .

Next, assumptions must be made on how the features and the labels are generated. In the stochastic setting, it is assumed that there is a true probabilistic relation between $x$ and $y$ given by an (unknown) model from the class $P_{\Theta}$ . A more general setting, used in a variety of works in machine learning, is the Probably Approximately Correct (PAC) setting established in [5]. In PAC, $x$ and $y$ are assumed to be generated by some source $P(x,y)=P(x)P(y|x)$ , but unlike the standard stochastic setting, $P(y|x)$ is not necessarily a member of the hypothesis class. In both the stochastic and PAC settings the goal is to perform as well as a learner that knows the true probability.

The most general setting, however, and the one used in this paper is the individual setting, where the features and the labels of both the training and test are specific, individual values. In this setting the goal can no longer be to perform as well as a learner that knows the true probability. Instead, following [3], the goal is to seek a learner that can compete with a “genie” or a reference learner that knows the desired label value, but is restricted to use a model from the given hypotheses class $P_{\Theta}$ . In addition, as discussed in [4], the reference does not know which of the samples is the test. Thus, the reference chooses

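$$\hat{\theta}(z^{N},x,y) = \arg\max_{\theta}\left[p_{\theta}(y|x)\cdot\prod_{i=1}^{N}p_{\theta}(y_{i}|x_{i})\right] \tag{3}$$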

The log-loss difference between a universal learner $q$ and the reference is the regret:

$$R(z^{N},x,y,q) = \log\frac{p_{\hat{\theta}(z^{N},x,y)}(y|x)}{q(y|x;z^{N})}. \tag{4}$$

As advocated in [4], the chosen universal learner solves:

$$\min_{q}\max_{y} R(z^{N},x,y,q) = R^{*}(z^{N},x) \tag{5}$$

Following [6] this learner, called the Predictive Normalized Maximum Likelihood (pNML), is given by

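$$q_{\mathrm{pNML}}(y|x;z^{N}) = \frac{p_{\hat{\theta}(z^{N},x,y)}(y|x)}{\sum_{y\in{\cal Y}}p_{\hat{\theta}(z^{N},x,y)}(y|x)}. \tag{6}$$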

Note that this pNML probability assignment was essentially proposed earlier, see [7, 8], with a different motivation as one of the possible variations of the Normalized Maximum Likelihood (NML) method of [6] for universal prediction.

In order to obtain the pNML the following procedure is executed: assuming the label of the test data is known, find the best model that fits it with the training samples, and predict the assumed label by this model. Repeat the process for all possible labels. Then, normalize to get a valid probability distribution which is the pNML learner. The regret of the pNML, $R^{*}(z^{N},x)$ is the logarithm of its normalization factor

$$R^{*}(z^{N},x) = \log\left\{\sum_{y\in{\cal Y}}p_{\hat{\theta}(z^{N},x,y)}(y|x)\right\}. \tag{7}$$

In considering linear regression, $y\in{\cal R}$ is the scalar label, $x\in{\cal R}^{M}$ is the feature vector (sometimes the first component of $x$ is set to $1$ to formulate an affine relation), and the model class is the set:

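$$\left\{p_{\theta}(y|x)=\frac{1}{\sqrt{2\pi\sigma^{2}}}\exp\left\{-\frac{1}{2\sigma^{2}}\left(y-x^{T}\theta\right)^{2}\right\},\;\;\theta\in{\cal R}^{M}\right\} \tag{8}$$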

That is, the label $y$ is a linear combination of the components of $x$ , up to additive Gaussian noise. As shown below, in this case the pNML and its learnability measure can be evaluated explicitly.
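The pNML procedure described above (assume a test label, refit on all samples, evaluate the assumed label, then normalize) can be sketched numerically for this Gaussian linear class. The snippet below is only an illustration under assumptions that are not in the paper: the data are synthetic, the sizes `M`, `N`, the seed, and the label grid `y_grid` are arbitrary, the function names are hypothetical, and the integral over labels is approximated by a Riemann sum.

```python
import numpy as np

def gaussian_pdf(y, mean, sigma):
    return np.exp(-0.5 * ((y - mean) / sigma) ** 2) / np.sqrt(2 * np.pi * sigma**2)

def pnml_on_grid(X_train, y_train, x, sigma, y_grid):
    """X_train: (M, N), columns are training features; x: (M,) test feature."""
    X = np.column_stack([X_train, x])            # (M, N+1), test sample appended
    genie = np.empty_like(y_grid)
    for k, y in enumerate(y_grid):
        Y = np.append(y_train, y)                # pretend the test label is y
        theta = np.linalg.solve(X @ X.T, X @ Y)  # least squares on all N+1 samples
        genie[k] = gaussian_pdf(y, x @ theta, sigma)
    dy = y_grid[1] - y_grid[0]
    K = genie.sum() * dy                         # Riemann-sum normalization factor
    return genie / K, np.log(K)                  # pNML density on the grid, regret

# Synthetic data (assumption, for illustration only).
rng = np.random.default_rng(0)
M, N, sigma = 3, 20, 1.0
X_train = rng.normal(size=(M, N))
y_train = X_train.T @ np.array([1.0, -2.0, 0.5]) + sigma * rng.normal(size=N)
x = rng.normal(size=M)
y_grid = np.linspace(-40.0, 40.0, 4001)

q, regret = pnml_on_grid(X_train, y_train, x, sigma, y_grid)
print("regret (log of the normalization factor):", regret)
```

The brute-force loop is deliberately naive; it exists to make the definition concrete, not to be efficient.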

The pNML approach deviates from the standard Empirical Risk Minimization (ERM) [9] approach. In ERM, given a training set and hypothesis class $\{p_{\theta}(y|x),\ \theta\in\Theta\}$ , a learner that minimizes the loss over the training set is chosen:

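$$q_{\mathrm{ERM}}(y|x) = \arg\min_{p_{\theta}}\frac{1}{N}\sum_{i=1}^{N}L(p_{\theta};x_{i},y_{i}). \tag{9}$$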

In the linear regression model (8), one chooses the least squares solution over the training set for the linear coefficients. This, however, may lead to large log-loss generalization error.
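As a point of comparison, the ERM learner for the Gaussian linear class can be sketched as follows. With a fixed noise level, minimizing the average log-loss over the training set is the same as minimizing the sum of squared errors, so the fit is ordinary least squares. The synthetic data, the seed, and the names `theta_true` and `log_loss` are assumptions made for this illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
M, N, sigma = 4, 30, 0.5
X_train = rng.normal(size=(M, N))              # columns are feature vectors
theta_true = rng.normal(size=M)                # hypothetical generating parameters
y_train = X_train.T @ theta_true + sigma * rng.normal(size=N)

# ERM under log-loss with fixed sigma reduces to least squares.
theta_erm, *_ = np.linalg.lstsq(X_train.T, y_train, rcond=None)

def log_loss(y, x, theta, sigma):
    """-log p_theta(y | x) for the Gaussian linear model."""
    return 0.5 * np.log(2 * np.pi * sigma**2) + (y - x @ theta) ** 2 / (2 * sigma**2)

x_test = rng.normal(size=M)
y_test = x_test @ theta_true + sigma * rng.normal()
print("test log-loss of the ERM learner:", log_loss(y_test, x_test, theta_erm, sigma))
```

Note that the ERM density keeps the same variance for every test point, whatever its relation to the training data.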

The paper has two main contributions. First, it provides an explicit analytical solution for the pNML learner and its “learnability” measure (which is the minmax regret (7)) for the linear regression hypothesis class. This includes also the regularized case where the norm of the coefficients vector is constrained. Second, based on the analysis of the learnability measure, it is shown that even in the over-parameterized case where the number of parameters $M$ may be larger than the training size $N$ , if the test data comes from a “learnable space” successful generalization occurs. This phenomenon may explain why other over-parameterized models such as deep neural networks are successful for “learnable” data.

The paper outline is as follows. Section II presents some related work. Section III provides the formal problem definition, while the pNML evaluation for the regression problem is given in sections IV and V. In-depth analysis of the learnable space is given in section VI. Simulation of the pNML and its regret for the problem of fitting a polynomial to data is described in section VII and the conclusions are in section VIII.

II Related Works

In this section, we briefly mention related works on model generalization, least squares regression and the confidence of the least squares predictions.

Model Generalization. Understanding the generalization capabilities of a model is considered a fundamental problem in machine learning [10]. As noted, most of the theoretical work in learning uses the PAC setting. In that setting, a common measure is the VC dimension, which can be used to upper bound the test generalization error. For DNNs, the VC dimension is linear in the number of parameters [11], yet the empirical evidence demonstrates that DNNs have state-of-the-art generalization performance. This makes the VC dimension irrelevant for assessing the generalization error of DNNs.

Least Squares. The least squares algorithm is widely used in linear regression due to its robust performance and simplicity of implementation. In addition to the explicit formula for its solution, it can be solved sequentially, via the Recursive Least Squares (RLS) algorithm, which is an efficient online method for finding the linear predictor that minimizes the squared error over the training data [12]. This paper provides a new look at the classical least squares method in the individual setting, using the pNML approach.

Outliers Detection and Confidence. In order to evaluate a pointwise confidence measure for linear regression, several methods were proposed. Leverage values are employed to identify outliers with respect to their feature values [13]. A leverage value is a measure of the distance between an observation and the center of the data (whenever a matrix is inverted it is assumed that the matrix is invertible; if needed, $\lambda I$ with small $\lambda$ is added to ensure invertibility):

$$h_{ii} = x_{i}^{T}(XX^{T})^{-1}x_{i} \tag{10}$$

where $XX^{T}$ is the (unnormalized) correlation matrix of the training set and $x_{i}$ is the feature vector being examined. If the leverage value $h_{ii}$ of an observation is large, the observation is considered an outlier. Using the pNML, in section IV we get a confidence measure for the prediction of the next label which is similar to the leverage measure.
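A minimal sketch of the leverage computation follows, assuming the features are stored as the columns of an $M\times N$ matrix and that $XX^{T}$ is invertible. The function name `leverage_values`, the synthetic data, and the planted outlier at column 7 are assumptions made for illustration.

```python
import numpy as np

def leverage_values(X):
    """X: (M, N) with feature vectors x_i as columns; returns h_ii for every i."""
    G_inv = np.linalg.inv(X @ X.T)                 # assumes X X^T is invertible
    return np.einsum("mi,mk,ki->i", X, G_inv, X)   # h_ii = x_i^T (X X^T)^{-1} x_i

rng = np.random.default_rng(0)
X = rng.normal(size=(3, 50))
X[:, 7] = [10.0, 10.0, 10.0]                       # planted outlier (illustrative)

h = leverage_values(X)
print("largest leverage at index", int(np.argmax(h)), "with value", h.max())
```

Each leverage lies between 0 and 1, and the values sum to $M$ when $XX^{T}$ has full rank, so a single value close to 1 marks a point that dominates one direction of the fit.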

Another approach for finding the reliability of the prediction is to compose confidence intervals [14]. Confidence intervals are a pointwise measure that is sensitive to the variability of the features and the sample size. Denote $\hat{y}$ as the predicted value for $x$ and $\hat{\sigma}^{2}$ as the empirical error of the prediction. Under the assumption of stochastic i.i.d. data and white noise, the prediction error converges in distribution to

$$(\hat{y}-y)\sim{\cal N}\left(0,\;\hat{\sigma}^{2}x^{T}(XX^{T})^{-1}x\right). \tag{11}$$

III Linear Regression: Formal Problem Definition

Given $N$ pairs of data and labels $\{x_{i},y_{i}\}_{i=1}^{N}$ where $x_{i}\in R^{M},y_{i}\in R$ are deterministic, the model takes the form:

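$$\begin{aligned} y_{1} &= x_{1}^{T}\theta+e_{1} \\ &\;\;\vdots \\ y_{N} &= x_{N}^{T}\theta+e_{N} \end{aligned} \tag{12}$$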

where $\theta\in R^{M}$ are the learnable parameters and the $e_{i}\in R$ are zero-mean, independent Gaussian variables with variance $\sigma^{2}$ . The goal is to predict $y$ based on a new data sample $x$ . Under these assumptions, $y$ conditioned on $x$ has a normal distribution that depends on the learnable parameters $\theta$ :

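$$p_{\theta}(y)=\frac{1}{\sqrt{2\pi\sigma^{2}}}\exp\left\{-\frac{1}{2\sigma^{2}}\left(y-x^{T}\theta\right)^{2}\right\}. \tag{13}$$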

The unknown parameter vector $\theta$ belongs to a set $\Theta$ , which in the general case is the entire $R^{M}$ . In the regularized version (leading to Ridge regression [15]), $\Theta$ is the sphere $|\theta|\leq A$ . In the next section, the pNML will be evaluated for this hypotheses class. Recall that the pNML learner of $y$ given the test sample $x$ and the training set $z^{N}=\{(x_{i},y_{i})\}_{i=1}^{N}$ is given by:

$$q_{\mathrm{pNML}}(y|x;z^{N}) = \frac{1}{K}\,p_{\hat{\theta}(z^{N};x,y)}(y|x). \tag{14}$$

where in the linear regression case

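$$\hat{\theta}(z^{N},x,y) = \arg\min_{\theta\in\Theta}\left[\sum_{i=1}^{N}(y_{i}-x_{i}^{T}\theta)^{2}+(y-x^{T}\theta)^{2}\right] \tag{15}$$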

and where $K$ is the normalization factor:

$$K = \int_{R}p_{\hat{\theta}(z^{N};x,y)}(y|x)\,dy, \tag{16}$$

The goal is to find an analytic expression for (14) and for the learnability measure $\Gamma=\log K$ , the minmax regret value.

IV pNML Evaluation

The following notation is used. $X\in R^{M\times(N+1)}$ is the matrix which contains all the training data along with the test sample and $Y\in R^{N+1}$ is the vector which contains all the labels including the test label, i.e.,

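$$X = \begin{bmatrix}x_{1} & \dots & x_{N} & x\end{bmatrix},\qquad Y = \begin{bmatrix}y_{1}\\ \vdots\\ y_{N}\\ y\end{bmatrix} \tag{17}$$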

Assuming that the test label $y$ is given, the optimal least squares solution is:

$$\hat{\theta}(z^{N},x,y) = \theta_{N+1}^{*} = (XX^{T})^{-1}XY \tag{18}$$

By the Recursive Least Squares (RLS) formulation [12]:

$$\theta_{N+1}^{*} = \theta_{N}^{*}+P_{N+1}x(y-\hat{y}) \tag{19}$$

where $\hat{y}=x^{T}\theta^{*}_{N}$ is the ERM prediction based on the training samples $\{(x_{i},y_{i})\}_{i=1}^{N}$ and (when $M>N$ , $XX^{T}$ is not invertible, so $\lambda I$ with small $\lambda$ is added)

$$P_{N+1} = (XX^{T})^{-1}. \tag{20}$$
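The RLS update can be checked numerically against a batch refit. The sketch below assumes synthetic data, an arbitrary assumed test label `y = 0.7`, and more samples than features so that no regularization is needed; the variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(2)
M, N = 3, 12
X_train = rng.normal(size=(M, N))              # columns are training features
y_train = rng.normal(size=N)
x, y = rng.normal(size=M), 0.7                 # test feature and an assumed test label

# Batch least squares on the N training samples, then on all N+1 samples.
theta_N = np.linalg.solve(X_train @ X_train.T, X_train @ y_train)
X = np.column_stack([X_train, x])
Y = np.append(y_train, y)
theta_batch = np.linalg.solve(X @ X.T, X @ Y)

# Recursive update: theta_{N+1} = theta_N + P_{N+1} x (y - y_hat).
P_next = np.linalg.inv(X @ X.T)
y_hat = x @ theta_N
theta_rls = theta_N + (P_next @ x) * (y - y_hat)

print("RLS update matches batch refit:", np.allclose(theta_rls, theta_batch))
```

The two solutions agree because $XX^{T}\theta^{*}_{N}$ equals the training cross-correlation plus $x\hat{y}$, so adding $x(y-\hat{y})$ reproduces the normal equations for all $N+1$ samples.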

Note that in RLS, $P_{N+1}$ is also calculated recursively from $P_{N}$ , but this is not needed at this point. Now,

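$$\begin{aligned} p_{\theta_{N+1}^{*}}(y) &= \frac{1}{\sqrt{2\pi\sigma^{2}}}\exp\left\{-\frac{1}{2\sigma^{2}}\left(y-x^{T}\theta_{N+1}^{*}\right)^{2}\right\} \\ &= \frac{1}{\sqrt{2\pi\sigma^{2}}}\exp\left\{-\frac{1}{2\sigma^{2}}\left(y-x^{T}\left(\theta^{*}_{N}+P_{N+1}x(y-\hat{y})\right)\right)^{2}\right\} \\ &= \frac{1}{\sqrt{2\pi\sigma^{2}}}\exp\left\{-\frac{(1-x^{T}P_{N+1}x)^{2}}{2\sigma^{2}}\left(y-\hat{y}\right)^{2}\right\}. \end{aligned} \tag{21}$$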
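The step that collapses the exponent may be worth spelling out. Substituting the RLS update and using $\hat{y}=x^{T}\theta^{*}_{N}$, the residual of the refitted model factors as follows (a worked restatement of the algebra, using only symbols already defined):

```latex
\begin{aligned}
y - x^{T}\theta^{*}_{N+1}
  &= y - x^{T}\theta^{*}_{N} - x^{T}P_{N+1}x\,(y-\hat{y}) \\
  &= (y-\hat{y}) - x^{T}P_{N+1}x\,(y-\hat{y}) \\
  &= \left(1 - x^{T}P_{N+1}x\right)(y-\hat{y}).
\end{aligned}
```

Squaring this residual gives the factor $(1-x^{T}P_{N+1}x)^{2}$ that multiplies $(y-\hat{y})^{2}$ in the exponent.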

To get the pNML normalization factor (16), we integrate over all possible labels

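$$K = \int_{-\infty}^{\infty}\frac{1}{\sqrt{2\pi\sigma^{2}}}\exp\left\{-\frac{(1-x^{T}P_{N+1}x)^{2}}{2\sigma^{2}}(y-\hat{y})^{2}\right\}dy = \frac{1}{1-x^{T}P_{N+1}x} = \frac{1}{1-x^{T}(XX^{T})^{-1}x} \tag{22}$$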

Thus, the pNML distribution of $y$ given $x$ is:

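$$q_{\mathrm{pNML}}(y|x;z^{N}) = \frac{1}{K}\,p_{\theta_{N+1}^{*}}(y|x) = \frac{1-x^{T}P_{N+1}x}{\sqrt{2\pi\sigma^{2}}}\exp\left\{-\frac{(1-x^{T}P_{N+1}x)^{2}}{2\sigma^{2}}(y-\hat{y})^{2}\right\} \tag{23}$$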

and its associated learnability measure or regret:

$$\Gamma = \log K = \log\left(\frac{1}{1-x^{T}(XX^{T})^{-1}x}\right). \tag{24}$$
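The closed-form result can be sketched in code. Since the exponent carries the factor $(1-x^{T}P_{N+1}x)^{2}=1/K^{2}$, the pNML density is a Gaussian centred at $\hat{y}$ with standard deviation $\sigma K$. The snippet assumes synthetic data with $N>M$ so that the matrices are invertible; the function names and the integration grid are hypothetical choices for this illustration.

```python
import numpy as np

def pnml_closed_form(X_train, y_train, x):
    """Return (y_hat, K, regret). X_train: (M, N) feature columns; x: (M,) test feature."""
    theta_N = np.linalg.solve(X_train @ X_train.T, X_train @ y_train)
    X = np.column_stack([X_train, x])            # training data plus the test sample
    P_next = np.linalg.inv(X @ X.T)
    K = 1.0 / (1.0 - x @ P_next @ x)
    return x @ theta_N, K, np.log(K)

def pnml_density(y, y_hat, K, sigma):
    """Gaussian centred at y_hat with standard deviation sigma * K."""
    s = sigma * K
    return np.exp(-0.5 * ((y - y_hat) / s) ** 2) / np.sqrt(2 * np.pi * s**2)

rng = np.random.default_rng(3)
M, N, sigma = 3, 15, 1.0
X_train = rng.normal(size=(M, N))
y_train = rng.normal(size=N)
x = rng.normal(size=M)

y_hat, K, regret = pnml_closed_form(X_train, y_train, x)
ys = np.linspace(y_hat - 60.0, y_hat + 60.0, 12001)
mass = pnml_density(ys, y_hat, K, sigma).sum() * (ys[1] - ys[0])
print(f"K={K:.4f}, regret={regret:.4f}, total probability={mass:.6f}")
```

The regret depends only on the features, not on the training labels, so it can be evaluated for a test point before any label is observed.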

V pNML with Regularization

Next, we shall assume that the model class $\Theta$ is constrained to the sphere $|\theta|\leq A$ , for some $A$ . Using a Lagrange multiplier $\lambda$ we get the Tikhonov regularization (or Ridge regression), where the expression to minimize is now:

$$L(z^{N}) = \sum_{i=1}^{N}|y_{i}-x_{i}^{T}\theta|^{2}+\lambda|\theta|^{2} \tag{25}$$

With the test data, the “regularized” least squares solution is:

$$\hat{\theta}(z^{N},x,y) = \theta_{N+1}^{*} = (XX^{T}+\lambda I)^{-1}XY \tag{26}$$

Here too the RLS formula holds:

$$\theta_{N+1}^{*} = \theta_{N}^{*}+P_{N+1}x(y-\hat{y}) \tag{27}$$

However, now

$$P_{N+1} = (XX^{T}+\lambda I)^{-1}. \tag{28}$$

The rest of the evaluation is similar to section IV, yielding the following pNML learner:

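$$q_{\mathrm{pNML}}(y|x;z^{N},\lambda) = \frac{1-x^{T}(XX^{T}+\lambda I)^{-1}x}{\sqrt{2\pi\sigma^{2}}}\cdot\exp\left\{-\frac{(1-x^{T}(XX^{T}+\lambda I)^{-1}x)^{2}}{2\sigma^{2}}(y-\hat{y})^{2}\right\} \tag{29}$$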

and the associated regret or the log-normalization factor:

$$\Gamma = \log K = \log\left(\frac{1}{1-x^{T}(XX^{T}+\lambda I)^{-1}x}\right) \tag{30}$$
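The regularized regret is a one-line computation once $X$ (training features plus the test column) is formed. The sketch below evaluates it for several values of $\lambda$ on a deliberately ill-conditioned synthetic design; the row scales, the seed, the list of $\lambda$ values, and the function name `pnml_regret` are all assumptions made for illustration.

```python
import numpy as np

def pnml_regret(X_train, x, lam):
    """log K with K = 1 / (1 - x^T (X X^T + lam I)^{-1} x), X including the test column."""
    X = np.column_stack([X_train, x])
    M = X.shape[0]
    quad = x @ np.linalg.solve(X @ X.T + lam * np.eye(M), x)
    return -np.log1p(-quad)                      # equals log(1 / (1 - quad))

rng = np.random.default_rng(4)
M, N = 5, 8
scales = np.array([1.0, 1.0, 1.0, 0.1, 0.01])[:, None]
X_train = scales * rng.normal(size=(M, N))       # small row scales make X X^T ill conditioned
x = rng.normal(size=M)

lambdas = [1e-6, 1e-3, 1e-1, 1.0]
regrets = [pnml_regret(X_train, x, lam) for lam in lambdas]
for lam, r in zip(lambdas, regrets):
    print(f"lambda={lam:g}: regret={r:.4f}")
```

Increasing $\lambda$ shrinks the quadratic form and therefore the regret, at the price of a more constrained hypothesis class.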

Note that regularization can help in the case where $XX^{T}$ , the unnormalized correlation matrix of the data, is ill conditioned. In the next section we find the “learnable space” for the linear regression problem and observe situations where this regularization is needed.

VI Learnable Space

In order to understand for which test samples the trained model generalizes well, we need to look at the regret expression (24). High regret means that the pNML learner is very far from the genie and therefore we may not trust its predictions. Low regret, on the other hand, means the model is nearly as good as a genie who knows the true test label, and so it is trusted.

Consider the matrix $X_{N}=[x_{1},x_{2},\ldots,x_{N}]$ , composed of the training data, and apply the singular value decomposition (SVD) on it, i.e., $X_{N}=U\Sigma V^{T}$ with $U\in R^{M\times M}$ , $\Sigma$ is a rectangular diagonal matrix of the singular values and $V\in R^{N\times N}$ . The expression $x^{T}(XX^{T})^{-1}x$ can be rewritten as:

[TABLE]

Denote by $R_{N}$ the empirical correlation matrix of the training:

[TABLE]

where $H$ is a diagonal matrix with $H_{ii}=\eta_{i}$ , the eigenvalues of $R_{N}$ . By the matrix inversion lemma, see [16], we have:

[TABLE]

Denote $\gamma=x^{T}R_{N}^{-1}x$ . We can simplify the expression:

[TABLE]

Plugging in the regret formula (24):

[TABLE]

Let $u_{i}$ be the eigenvectors of the empirical correlation matrix of the training data. Expressing $\gamma$ by $x^{T}u_{i}$ , the projections of $x$ on $u_{i}$ :

[TABLE]

The final regret expression is thus:

[TABLE]

If the test sample $x$ lies mostly in the subspace spanned by the eigenvectors with large eigenvalues, then the model can generalize well even if $M>N$ .
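This claim can be illustrated numerically in an over-parameterized setting. The sketch below uses the regularized regret expression (30) with a small $\lambda$ and compares a unit test vector along the eigenvector with the largest eigenvalue of the unnormalized training correlation matrix against a unit vector along a direction the training data never excites. The dimensions, the seed, the value of $\lambda$, and the function name are assumptions made for illustration.

```python
import numpy as np

def pnml_regret(X_train, x, lam):
    """Regularized pNML regret, with the test column appended to the training matrix."""
    X = np.column_stack([X_train, x])
    quad = x @ np.linalg.solve(X @ X.T + lam * np.eye(X.shape[0]), x)
    return -np.log1p(-quad)

rng = np.random.default_rng(5)
M, N, lam = 50, 10, 1e-3                         # over-parameterized: M > N
X_train = rng.normal(size=(M, N))

# Eigen-decomposition of the unnormalized training correlation matrix
# (np.linalg.eigh returns eigenvalues in ascending order).
eigvals, eigvecs = np.linalg.eigh(X_train @ X_train.T)
u_large = eigvecs[:, -1]                         # largest-eigenvalue direction
u_null = eigvecs[:, 0]                           # direction with (numerically) zero eigenvalue

r_large = pnml_regret(X_train, u_large, lam)
r_null = pnml_regret(X_train, u_null, lam)
print("regret along the large-eigenvalue direction:", r_large)
print("regret along the unexcited direction:       ", r_null)
```

The same model class therefore yields very different confidence depending on where the test vector lies, even though $M>N$.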

VII Simulation

In this section we present some simulations that demonstrate the results above. We chose the problem of fitting a polynomial to data, which is a special case of linear regression. The simulations show prediction and generalization capabilities in a variety of regularization factors and polynomial degrees.

In the first experiment we generated 3 random points, $t_{0},t_{1},t_{2}$ , uniformly in the interval $[-1,1]$ . These points are the training set and are shown in Figure 1 (top) as red dots. The relation between $y$ and $t$ is given by a polynomial of degree two. Thus, the X matrix of section III is given by:

[Equation (38) was not recovered: the polynomial feature matrix $X$ for the degree-two model.]

Based on the training set we predict a probability for all $t$ values in the interval $[-1,1]$ using (29) with a regularization factor $\lambda$ of [math], $0.1$ and $1.0$ . Figure 1 (top) shows that without regularization ( $\lambda=0$ ), the blue curve fits the data exactly. As $\lambda$ increases the fitted curve becomes less steep but fits the training data less closely.

Figure 1 (bottom) shows the regret, given by (24), for the polynomial model from (38) for all $t\in[-1,1]$ where the training $t_{i}$ ’s are marked in red on the x axis. We can see that around the training data the regret is very low in comparison to areas where there are no training data. In addition, models with a larger regularization term have lower regret for every point in the interval $[-1,1]$ . For all regularization terms, the regret increases with the distance from the training data.
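A regret curve of this kind can be reproduced in outline as follows. The sketch assumes monomial features $[1,t,t^{2}]^{T}$ for the degree-two model, three training locations drawn with an arbitrary seed, and only the two nonzero regularization values named in the text; it is not the authors' simulation code. Because the regret depends only on the features, no labels are needed.

```python
import numpy as np

def poly_features(t, degree):
    """Columns are [1, t, ..., t^degree]^T for each entry of t."""
    return np.vander(np.atleast_1d(t), degree + 1, increasing=True).T

def pnml_regret(X_train, x, lam):
    X = np.column_stack([X_train, x])
    quad = x @ np.linalg.solve(X @ X.T + lam * np.eye(X.shape[0]), x)
    return -np.log1p(-quad)

rng = np.random.default_rng(6)
t_train = rng.uniform(-1.0, 1.0, size=3)         # three training locations
X_train = poly_features(t_train, degree=2)       # shape (3, 3)
t_grid = np.linspace(-1.0, 1.0, 201)

regret_by_lam = {}
for lam in (0.1, 1.0):
    regret_by_lam[lam] = np.array(
        [pnml_regret(X_train, poly_features(t, 2)[:, 0], lam) for t in t_grid])
    print(f"lambda={lam}: min regret {regret_by_lam[lam].min():.3f}, "
          f"max regret {regret_by_lam[lam].max():.3f}")
```

Plotting `regret_by_lam[lam]` against `t_grid` gives one curve per regularization value.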

Next, we simulate the case of fitting polynomials with different degrees. Again, we generated 10 random points in the interval $[-1,1]$ . The matrix $X$ is now:

[TABLE]

Figure 2 (top) shows the predicted label for every $t$ value in $[-1,1]$ for the different polynomial degrees. To avoid singularities we used the regularized version with $\lambda=10^{-4}$ . The training set is shown by red dots in the figure. Note that for a polynomial of degree ten, the number of parameters is greater than the size of the training set. Nevertheless, the prediction accuracy near the training samples is similar to that of a degree three polynomial. Figure 2 (bottom) shows the regret (or learnability) of the three pNML learners corresponding to model classes of polynomials with the various degrees. All the learners have regret values that are small near the training samples and large as $t$ drifts away from these samples.
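The comparison across polynomial degrees can be sketched in the same way. The snippet assumes monomial features, ten training locations drawn with an arbitrary seed, $\lambda=10^{-4}$ as in the text, and only the two degrees named explicitly there (three and ten); the third degree used in the paper's figure is not stated in this section, so it is left out.

```python
import numpy as np

def poly_features(t, degree):
    """Columns are [1, t, ..., t^degree]^T for each entry of t."""
    return np.vander(np.atleast_1d(t), degree + 1, increasing=True).T

def pnml_regret(X_train, x, lam):
    X = np.column_stack([X_train, x])
    quad = x @ np.linalg.solve(X @ X.T + lam * np.eye(X.shape[0]), x)
    return -np.log1p(-quad)

rng = np.random.default_rng(7)
t_train = rng.uniform(-1.0, 1.0, size=10)        # ten training locations
t_grid = np.linspace(-1.0, 1.0, 201)
lam = 1e-4

train_regret, grid_regret = {}, {}
for degree in (3, 10):                           # degree 10: 11 parameters > 10 samples
    X_train = poly_features(t_train, degree)
    train_regret[degree] = np.array(
        [pnml_regret(X_train, X_train[:, j], lam) for j in range(10)])
    grid_regret[degree] = np.array(
        [pnml_regret(X_train, poly_features(t, degree)[:, 0], lam) for t in t_grid])
    print(f"degree {degree}: max regret at training points "
          f"{train_regret[degree].max():.3f}, max over grid {grid_regret[degree].max():.3f}")
```

Comparing the two printed maxima per degree shows how much of the interval each model class can be trusted on.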

VIII Conclusions

In this paper, we provided an explicit analytical solution of the pNML universal learning scheme and its learnability measure for the linear regression hypothesis class. Interestingly, the predicted universal pNML assignment is Gaussian with a mean that is equal to that of the ERM, but with a standard deviation that is scaled by a factor $K$ (so the variance grows by $K^{2}$ ), whose logarithm is the learnability measure $\Gamma$ . Analyzing $\Gamma$ we can observe the “learnability space” for this problem. Specifically, if a test sample mostly lies in the subspace spanned by the eigenvectors associated with large eigenvalues of the empirical correlation matrix then the learner can generalize well, even in an over-parameterized case where the regression dimension is larger than the number of training samples. Finally, we provided a simulation of the pNML least squares prediction for polynomial interpolation.

This work suggests a number of potential directions for future work, some of which are already explored in an accompanying paper [17]. We conjecture that, as in linear regression, other “over-parameterized” model classes are learnable at least locally, and that this can be inferred from the pNML solution. This notion is indeed corroborated by the findings in [17].

Bibliography

[1] C. L. Lawson and R. J. Hanson, Solving Least Squares Problems. SIAM, 1995, vol. 15.
[2] G. James, D. Witten, T. Hastie, and R. Tibshirani, An Introduction to Statistical Learning. Springer, 2013, vol. 112.
[3] N. Merhav and M. Feder, “Universal prediction,” IEEE Transactions on Information Theory, vol. 44, no. 6, pp. 2124–2147, 1998.
[4] Y. Fogel and M. Feder, “Universal supervised learning for individual data,” arXiv preprint arXiv:1812.09520, 2018.
[5] L. G. Valiant, “A theory of the learnable,” Communications of the ACM, vol. 27, no. 11, pp. 1134–1142, 1984.
[6] Y. M. Shtar’kov, “Universal sequential coding of single messages,” Problemy Peredachi Informatsii, vol. 23, no. 3, pp. 3–17, 1987.
[7] T. Roos and J. Rissanen, “On sequentially normalized maximum likelihood models,” 2008.
[8] T. Roos, T. Silander, P. Kontkanen, and P. Myllymaki, “Bayesian network structure learning using factorized NML universal models,” in Information Theory and Applications Workshop, 2008. IEEE, 2008, pp. 272–276.