Learning relevant features for statistical inference

Cédric Bény

arXiv:1904.10387 · cs.LG · March 25, 2020

TL;DR

This paper introduces a method to identify features most inferable from another data view using deep canonical correlation analysis, enabling joint distribution estimation and improved supervised learning.

Contribution

The paper demonstrates that features with high correlation in DCCA are optimal for inference and introduces a non-parametric joint distribution representation for various inference tasks.

Findings

1. Effective inference on occluded MNIST images.
2. Representation captures multiple modes of data.
3. Automatic regularization and faster convergence in supervised learning.

Equations

$$p(x,y)=p(x)\,p(y)\sum_{i=1}^{D}\eta_{i}\,u_{i}(x)\,v_{i}(y),$$

$$\langle\mu,\mu^{\prime}\rangle_{X}:=\sum_{x}\frac{\mu(x)\mu^{\prime}(x)}{p(x)}\quad\text{and}\quad\langle\nu,\nu^{\prime}\rangle_{Y}:=\sum_{y}\frac{\nu(y)\nu^{\prime}(y)}{p(y)}$$

$$\chi^{2}(q,p_{X})=\langle q-p_{X},q-p_{X}\rangle_{X},$$

$$\mathcal{N}(\mu)(y)=\sum_{x}p(y|x)\,\mu(x)\quad\text{and}\quad\mathcal{N}^{*}(\nu)(x)=\sum_{y}p(x|y)\,\nu(y).$$

$$\langle\nu,\mathcal{N}(\mu)\rangle_{Y}=\langle\mathcal{N}^{*}(\nu),\mu\rangle_{X}.$$

$$\eta(q)=\frac{\chi^{2}(\mathcal{N}(q),p_{Y})}{\chi^{2}(q,p_{X})}=\frac{\langle\mathcal{N}(\mu),\mathcal{N}(\mu)\rangle_{Y}}{\langle\mu,\mu\rangle_{X}},$$

$$\mu(x)=p(x)f(x)\quad\text{and}\quad\nu(y)=p(y)g(y)$$

$$\langle\mu,\mu^{\prime}\rangle_{X}=\mathbb{E}(ff^{\prime})\quad\text{and}\quad\langle\nu,\nu^{\prime}\rangle_{Y}=\mathbb{E}(gg^{\prime}).$$

$$\mathbb{E}(fg)=\langle\nu,\mathcal{N}(\mu)\rangle_{Y},$$

$$\mathcal{N}(\mu)=\sum_{i}\eta_{i}\,\nu_{i}\,\langle\mu_{i},\mu\rangle_{X},$$

$$p(y|x)=\mathcal{N}(\delta_{x})(y)=\frac{1}{p(x)}\sum_{i}\eta_{i}\,\nu_{i}(y)\,\mu_{i}(x)=p(y)\sum_{i}\eta_{i}\,v_{i}(y)\,u_{i}(x),$$

$$K_{ij}=\langle\mu_{i},\mu_{j}\rangle_{X}=\mathbb{E}(f_{i}f_{j}),\quad L_{ij}=\langle\nu_{i},\nu_{j}\rangle_{Y}=\mathbb{E}(g_{i}g_{j}),\quad A_{ij}=\langle\nu_{i},\mathcal{N}(\mu_{j})\rangle_{Y}=\mathbb{E}(g_{i}f_{j}).$$

$$\sum_{i=1}^{k_{0}}\eta_{i}^{2}={\rm Tr}\,(\mathcal{N}^{*}\mathcal{N})={\rm Tr}\,(K^{-1}A^{\top}L^{-1}A).$$

$$p(y|x)=\mathcal{N}(\delta_{x})(y)\simeq p(y)\sum_{i,j=1}^{k_{0}}(L^{-1}AK^{-1})_{ij}\,g_{i}(y)\,f_{j}(x),$$

$$\sum_{y}g_{k}(y)\,p(y|x)=\sum_{i,j=1}^{D}(L^{-1}AK^{-1})_{ij}\,\mathbb{E}(g_{k}g_{i})\,f_{j}(x)=\sum_{j=1}^{D}(AK^{-1})_{kj}\,f_{j}(x)=\sum_{j=1}^{k_{0}}(AK^{-1})_{kj}\,f_{j}(x),$$

$$C=k_{0}-{\rm Tr}\,(K^{-1}A^{\top}L^{-1}A),$$

$$K_{ij}$$

$$L_{ij}$$

$$A_{ij}$$

$$\Theta_{j}=\frac{1}{N_{\rm full}}\sum_{n=1}^{N_{\rm full}}\Theta(x_{n})\,f_{j}(x_{n}),$$

$$\overline{\Theta}=\sum_{x}p(x|y)\,\Theta(x)\approx\sum_{i,j=1}^{k_{0}}(K^{-1}A^{\top}L^{-1})_{ji}\,\Theta_{j}\,g_{i}(y).$$

$${\rm Tr}\,(K^{-1}A^{\top}L^{-1}A)={\rm Tr}\,(PQ)$$

$$\langle\mu,\mu^{\prime}\rangle_{X}:=\sum_{x}\frac{\mu(x)\mu^{\prime}(x)}{p_{X}(x)}$$

$$\chi^{2}(q,p_{X})=\langle q-p_{X},q-p_{X}\rangle_{X}.$$

$$\mathcal{N}(\mu)(y)=\sum_{x}p_{Y|X}(y|x)\,\mu(x)$$

$$\mathcal{N}^{*}(\nu)(x)=\sum_{y}p_{X|Y}(x|y)\,\nu(y)$$

$$\langle\nu,\mathcal{N}(\mu)\rangle_{Y}=\langle\mathcal{N}^{*}(\nu),\mu\rangle_{X}.$$

$$\mathcal{N}(u_{j})=\eta_{j}v_{j},$$

$$\mathcal{N}^{*}(v_{j})=\eta_{j}u_{j}.$$

$$\mathcal{N}_{0}(\mu)$$

$$\mathcal{N}_{0}^{*}(\nu)$$

$$\mathcal{N}_{0}(\mu)(y)=\sum_{x}q(y|x)\,\mu(x).$$

Code & Models

Repositories

cbeny/RFA (tf, Official)

Taxonomy

Topics: Statistical Mechanics and Entropy · Gaussian Processes and Bayesian Inference · Generative Adversarial Networks and Image Synthesis

Full text

Learning relevant features for statistical inference

Cédric Bény

(January 15, 2020)

Abstract

Given two views of data, we consider the problem of finding the features of one view which can be most faithfully inferred from the other. We find that these are also the most correlated variables in the sense of deep canonical correlation analysis (DCCA). Moreover, we show that these variables can be used to construct a non-parametric representation of the implied joint probability distribution, which can be thought of as a classical version of the Schmidt decomposition of quantum states. This representation can be used to compute the expectations of functions over one view of data conditioned on the other, such as Bayesian estimators and their standard deviations. We test the approach using inference on occluded MNIST images, and show that our representation contains multiple modes. Surprisingly, when applied to supervised learning (one dataset consists of labels), this approach automatically provides regularization and faster convergence compared to the cross-entropy objective. We also explore using this approach to discover salient independent variables of a single dataset.

1 Introduction

Given samples $(x_{1},y_{1}),\dots,(x_{n},y_{n})$ from an unknown joint probability distribution $p(x,y)$, we want to construct a useful representation of the conditional probabilities $p(x|y)$ and $p(y|x)$, so that we can infer one view from the other on new data.

For instance, $x$ and $y$ could be past and future histories of dynamical data, visual and auditory inputs, actions and their effects, etc.

Following the approach introduced in [1, 2] in the context of quantum information theory, we look at the problem as follows: the conditional distributions $p(y|x)$ can be thought of as representing a noisy communication channel (stochastic map). This channel is a linear map between spaces of typically ludicrously large dimensions (the spaces of all probability distributions over $x$ or $y$ ). We want a pair of small subspaces on which the channel is minimally noisy. Specifically, we look for those vectors representing probability distributions over $x$ which lose least distinguishability under the channel, where the distinguishability is measured by the $\chi^{2}$ divergence.

We show in Section 3 that this is equivalent to performing a singular value decomposition of the channel (seen as an operator in Hilbert space) and keeping only the components with the largest singular values. Moreover, the full singular value decomposition is equivalent to the decomposition in terms of canonical variables introduced in [3], namely,

$$p(x,y)=p(x)\,p(y)\sum_{i=1}^{D}\eta_{i}\,u_{i}(x)\,v_{i}(y), \tag{1}$$

where $u_{i}$ and $v_{j}$ are real non-linear functions such that $\mathbb{E}(u_{i}u_{j})=\delta_{ij}$, $\mathbb{E}(v_{i}v_{j})=\delta_{ij}$, and $0<\eta_{D}\leq\dots\leq\eta_{1}\leq\eta_{0}=1$ are the singular values.

This can also be interpreted as a classical version of the Schmidt decomposition for pure quantum states, where the Fisher information metric plays the role of the Hilbert space inner product (Section 3).

The practical advantage of this representation for inference (or prediction) is that it reduces the evaluation of conditional expectations to that of empirical averages over the (unconditional) marginal $p(y)$ .

As observed in [4], the span of the first $k$ canonical variables $u_{i}$, $v_{j}$ is what is learned by deep canonical correlation analysis (DCCA) [5]. Indeed, these variables are those which maximize the correlations $\mathbb{E}(u_{i}v_{i})$ subject to the same constraints as above. (This reduces to CCA [6] when $u_{i}$, $v_{j}$ are linear maps.)

In this work, besides establishing this new information-theoretical interpretation of canonical variables and DCCA, we experiment with using this representation for performing inference on new data. Moreover, we propose a strategy for extracting disentangled variables from the canonical variables, inspired by analytical solutions.

2 Related work

This general problem (of building an effective representation of the conditional probability distributions implied by joint samples) covers many existing approaches in different contexts. For instance, if the variable $y$ has few possible states, then it reduces to a classification problem, usually solved by minimizing the cross-entropy between a predicted distribution and the one-hot encoding of the classes.

When $y$ has a large number of states, or is fundamentally continuous, existing approaches usually do not model the whole conditional distribution, but either provide the average (regression), or approximately sample from it.

The main class of methods which allows for sampling from the conditional distributions is variational: a deterministic neural network produces the parameters of analytical classes of probabilities. This includes variational autoencoders [7] (e.g., [8, 9]), and approaches based on the minimum description length principle such as [10].

Alternatively, it may also be possible to use adversarial training [11], by using a conditional [12] version of an energy-based GAN [13].

By contrast, our approach doesn’t require the training of a generative model. Instead, conditional expectations are constructed as linear combinations of unconditional empirical averages over the training data.

The information-theoretical approach that we propose is very close to that introduced in [14], and further developed in [15] in the context of classical information theory. The authors consider the problem of sending information through a channel with an infinitesimal bound on the mutual information of the encoding operation. This leads to the same singular value problem.

Other approaches to equipping CCA or DCCA with an information-theoretic interpretation have explored different directions. For instance, in [9], the authors generalize a probabilistic interpretation for CCA in terms of Gaussian distributions, which leads to a variational approach. In [16], additional constraints on the mutual information between the data and the learned variables are added to the optimization.

A previous attempt at designing a numerical solution for our singular value problem can be found in [17]. In that work, the relevant variables were represented through a PCA kernel produced by Monte Carlo sampling, but this approach was not practical.

3 Theory

We formalize the problem by assuming that our data was sampled from an unknown joint distribution $p(x,y)$ over two random variables $X$ and $Y$ .

Let $V_{X}$ and $V_{Y}$ denote the linear spaces spanned by all probability distributions over $X$ and $Y$ respectively. Here we assume that $X$ and $Y$ take finitely many values for simplicity, but this formalism can be straightforwardly extended to infinite-dimensional vector spaces.

We will need inner products on $V_{X}$ and $V_{Y}$ , to make them into real Hilbert spaces. We use the Fisher information metrics evaluated at the points $p(x)$ and $p(y)$ respectively (marginals of $p(x,y)$ ), that is,

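$$\langle\mu,\mu^{\prime}\rangle_{X}:=\sum_{x}\frac{\mu(x)\mu^{\prime}(x)}{p(x)}\quad\text{and}\quad\langle\nu,\nu^{\prime}\rangle_{Y}:=\sum_{y}\frac{\nu(y)\nu^{\prime}(y)}{p(y)} \tag{2}$$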

for any vectors $\mu,\mu^{\prime}\in V_{X}$ and $\nu,\nu^{\prime}\in V_{Y}$ .

Below we also call the marginals $p_{X}$ and $p_{Y}$ respectively when omitting their arguments ( $p_{X}(x)\equiv p(x)$ and $p_{Y}(y)\equiv p(y)$ ).

These inner products allow us to define the $\chi^{2}$ divergence:

$$\chi^{2}(q,p_{X})=\langle q-p_{X},q-p_{X}\rangle_{X}, \tag{3}$$

which is a measure of statistical distinguishability between $q$ and $p_{X}$. Specifically, it quantifies how easy it is to reject the null hypothesis that the state is $p_{X}$ when it is actually $q$, based on the empirical distribution obtained from independent samples. It is also, up to a factor $1/2$, the lowest order approximation of the Kullback-Leibler divergence.
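As a quick numerical illustration of the relation to the Kullback-Leibler divergence, the following sketch checks that $D_{\rm KL}(q\|p_{X})\approx\frac{1}{2}\chi^{2}(q,p_{X})$ for a small perturbation $q=p_{X}+\epsilon d$. The function names `chi2` and `kl`, and the example distribution and perturbation, are illustrative assumptions and are not taken from the paper.

```python
import numpy as np

def chi2(q, p):
    """chi^2 divergence <q - p, q - p>, with the inner product weighted by 1/p."""
    return np.sum((q - p) ** 2 / p)

def kl(q, p):
    """Kullback-Leibler divergence D(q || p) in nats."""
    return np.sum(q * np.log(q / p))

p = np.array([0.1, 0.2, 0.3, 0.4])      # reference distribution p_X (arbitrary example)
d = np.array([1.0, -1.0, 0.5, -0.5])    # perturbation direction, sums to zero
eps = 1e-3
q = p + eps * d                          # nearby distribution, still normalized

ratio = kl(q, p) / chi2(q, p)            # approaches 1/2 as eps -> 0
print(ratio)
```

The ratio deviates from $1/2$ only at order $\epsilon$, which is why the $\chi^{2}$ divergence can stand in for the Kullback-Leibler divergence when comparing nearby distributions.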

The joint distribution $p(x,y)$ yields conditional distributions $p(y|x)$ and $p(x|y)$. These can be understood as the components (or kernels) of stochastic maps $\mathcal{N}:V_{X}\rightarrow V_{Y}$ and $\mathcal{N}^{*}:V_{Y}\rightarrow V_{X}$ respectively. Explicitly, if $\mu\in V_{X}$ and $\nu\in V_{Y}$, then the images $\mathcal{N}(\mu)\in V_{Y}$ and $\mathcal{N}^{*}(\nu)\in V_{X}$ are defined by

$$\mathcal{N}(\mu)(y)=\sum_{x}p(y|x)\,\mu(x)\quad\text{and}\quad\mathcal{N}^{*}(\nu)(x)=\sum_{y}p(x|y)\,\nu(y). \tag{4}$$

These stochastic maps $\mathcal{N}$ and $\mathcal{N}^{*}$ perform inference of one variable given some (possibly imperfect) knowledge about the other, with priors given by the marginals $p(x)$ or $p(y)$ of $p(x,y)$ depending on the direction of the inference. Importantly, $\mathcal{N}^{*}$ is the transpose of $\mathcal{N}$ in terms of the inner products defined above:

$$\langle\nu,\mathcal{N}(\mu)\rangle_{Y}=\langle\mathcal{N}^{*}(\nu),\mu\rangle_{X}. \tag{5}$$
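The transpose relation can be checked numerically on a small discrete example. The sketch below builds a random joint distribution (an arbitrary assumption made for illustration), implements the two stochastic maps and the two inner products, and compares both sides of the relation. The helper names `N`, `N_star`, `inner_X` and `inner_Y` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
p_xy = rng.random((3, 4))
p_xy /= p_xy.sum()                       # joint p(x, y): rows index x, columns index y
p_x, p_y = p_xy.sum(axis=1), p_xy.sum(axis=0)

p_y_given_x = p_xy / p_x[:, None]        # entry [x, y] = p(y | x)
p_x_given_y = p_xy / p_y[None, :]        # entry [x, y] = p(x | y)

def N(mu):
    """N(mu)(y) = sum_x p(y|x) mu(x)."""
    return p_y_given_x.T @ mu

def N_star(nu):
    """N*(nu)(x) = sum_y p(x|y) nu(y)."""
    return p_x_given_y @ nu

def inner_X(a, b):
    """Inner product on V_X, weighted by 1/p(x)."""
    return np.sum(a * b / p_x)

def inner_Y(a, b):
    """Inner product on V_Y, weighted by 1/p(y)."""
    return np.sum(a * b / p_y)

mu, nu = rng.normal(size=3), rng.normal(size=4)   # arbitrary vectors in V_X and V_Y
lhs = inner_Y(nu, N(mu))
rhs = inner_X(N_star(nu), mu)
print(lhs, rhs)
```

Both sides reduce to $\sum_{x,y}\nu(y)\mu(x)\,p(x,y)/(p(x)p(y))$, which is why they agree. The same example also confirms that $\mathcal{N}(p_{X})=p_{Y}$.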

We now have the tools to address the problem mentioned in the introduction. The distinguishability between $q\in V_{X}$ and $p_{X}$ after the action of the channel $\mathcal{N}$ is $\chi^{2}(\mathcal{N}(q),p_{Y})$, since $p_{Y}=\mathcal{N}(p_{X})$. Hence we want to find the distributions $q$ which maximize the relevance [1]

$$\eta(q)=\frac{\chi^{2}(\mathcal{N}(q),p_{Y})}{\chi^{2}(q,p_{X})}=\frac{\langle\mathcal{N}(\mu),\mathcal{N}(\mu)\rangle_{Y}}{\langle\mu,\mu\rangle_{X}}, \tag{6}$$

where $\mu=q-p_{X}$. The inner-product formulation makes it clear that this amounts to finding the eigenvector with largest eigenvalue for the symmetric map $\mathcal{N}^{*}\mathcal{N}$, which is also the singular vector with largest singular value for $\mathcal{N}$. One can then go on to find the eigenvector with the next largest eigenvalue, and so on; these eigenvectors are automatically orthogonal.

In practice, the inner products are more tractable to compute if we express elements $\mu\in V_{X}$ and $\nu\in V_{Y}$ in terms of variables $f$ and $g$ as $\mu=p_{X}f$ and $\nu=p_{Y}g$ , or

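$$\mu(x)=p(x)f(x)\quad\text{and}\quad\nu(y)=p(y)g(y) \tag{7}$$

for all $x,y$. Indeed, this simply yields

$$\langle\mu,\mu^{\prime}\rangle_{X}=\mathbb{E}(ff^{\prime})\quad\text{and}\quad\langle\nu,\nu^{\prime}\rangle_{Y}=\mathbb{E}(gg^{\prime}). \tag{8}$$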

We are now in a position to make the connection with DCCA [5]. Indeed, the aim of DCCA is to maximize the correlations $\text{corr}(f,g)=\mathbb{E}(fg)$ over functions $f(x)$ and $g(y)$ such that $\mathbb{E}(f^{2})=\mathbb{E}(g^{2})=1$. But, using $\mu=p_{X}f$ and $\nu=p_{Y}g$, we have

$$\mathbb{E}(fg)=\langle\nu,\mathcal{N}(\mu)\rangle_{Y}, \tag{9}$$

which is maximized by the left- and right-singular vectors $\mu$ and $\nu$ of $\mathcal{N}$ with largest singular value.

Given all the singular vectors $\mu_{i}=p_{X}f_{i}$ and $\nu_{i}=p_{Y}g_{i}$ with singular values $\eta_{i}$ , we obtain the representation

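$$\mathcal{N}(\mu)=\sum_{i}\eta_{i}\,\nu_{i}\,\langle\mu_{i},\mu\rangle_{X}, \tag{10}$$

which, using a more standard notation and the Kronecker delta $\delta_{x}$, yields Eq. 1:

$$p(y|x)=\mathcal{N}(\delta_{x})(y)=\frac{1}{p(x)}\sum_{i}\eta_{i}\,\nu_{i}(y)\,\mu_{i}(x)=p(y)\sum_{i}\eta_{i}\,v_{i}(y)\,u_{i}(x), \tag{11}$$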

where $\mu_{i}=p_{X}u_{i}$ and $\nu_{i}=p_{Y}v_{i}$ .
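For a small discrete joint distribution, this decomposition can be computed directly. The sketch below is an illustration under stated assumptions and not the paper's algorithm: it uses the standard fact that, in the inner products above, the channel is represented by the matrix $p(x,y)/\sqrt{p(x)p(y)}$, whose ordinary SVD yields the canonical variables. The function name `canonical_decomposition` and the random test distribution are hypothetical.

```python
import numpy as np

def canonical_decomposition(p_xy):
    """Singular values and canonical variables of a discrete joint p_xy[x, y]."""
    p_x = p_xy.sum(axis=1)
    p_y = p_xy.sum(axis=0)
    B = p_xy / np.sqrt(np.outer(p_x, p_y))       # channel in orthonormal coordinates
    U, eta, Vt = np.linalg.svd(B, full_matrices=False)
    u = U / np.sqrt(p_x)[:, None]                # u[x, i] = u_i(x)
    v = Vt.T / np.sqrt(p_y)[:, None]             # v[y, i] = v_i(y)
    return p_x, p_y, eta, u, v

rng = np.random.default_rng(0)
p_xy = rng.random((4, 5))
p_xy /= p_xy.sum()                               # arbitrary strictly positive joint

p_x, p_y, eta, u, v = canonical_decomposition(p_xy)

# Reconstruct p(x, y) = p(x) p(y) sum_i eta_i u_i(x) v_i(y); the sum here runs over
# all singular components, including the constant pair with singular value 1.
recon = np.outer(p_x, p_y) * ((u * eta) @ v.T)
print(eta)
```

In this example the leading singular value equals 1 and belongs to the constant functions, while the remaining values lie between 0 and 1 and rank the non-trivial correlated features.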

3.1 Non-diagonal form and relevant variables

For the purpose of the optimization and inference, we do not need the full diagonal decomposition, but just functions $f_{i}=\mu_{i}/p_{X}$ and $g_{j}=\nu_{j}/p_{Y}$, $i,j=1,\dots,k_{0}$, which have the same span as the canonical variables $u_{i}$ and $v_{j}$ respectively for $i,j=1,\dots,k_{0}$ (assuming that $\eta_{1},\dots,\eta_{k_{0}}$ are the largest singular values). Below we refer to $f_{i}$ and $g_{j}$ as the $k_{0}$ most relevant variables.

Because these functions may not be orthogonal, we need the covariance matrices

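$$K_{ij}=\langle\mu_{i},\mu_{j}\rangle_{X}=\mathbb{E}(f_{i}f_{j}),\quad L_{ij}=\langle\nu_{i},\nu_{j}\rangle_{Y}=\mathbb{E}(g_{i}g_{j}),\quad A_{ij}=\langle\nu_{i},\mathcal{N}(\mu_{j})\rangle_{Y}=\mathbb{E}(g_{i}f_{j}). \tag{12}$$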

If $N_{ij}$ denote the components of $\mathcal{N}$ in the sense that $\mathcal{N}(p_{X}f_{j})=\sum_{i}N_{ij}p_{Y}g_{i}$, then, using our inner products to isolate $N_{ij}$, we obtain $N=L^{-1}A$. Similarly, the matrix of components of $\mathcal{N}^{*}$ is $N^{*}=K^{-1}A^{\top}$. This implies that the sum of the squares of the singular values of $\mathcal{N}$ restricted to the spans of the vectors $p_{X}f_{i}$ and $p_{Y}g_{j}$ for all $i,j$, which is what we want to maximize, is just given by

$$\sum_{i=1}^{k_{0}}\eta_{i}^{2}={\rm Tr}\,(\mathcal{N}^{*}\mathcal{N})={\rm Tr}\,(K^{-1}A^{\top}L^{-1}A). \tag{13}$$

This is the DCCA objective. Below we use the objective function $C=k_{0}-{\rm Tr}\,(N^{*}N)$ , for the cosmetic reason that its optimal value is zero.
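The step of isolating the components with the inner products can be spelled out as follows. This is a short derivation using only the definitions of $K$, $L$, $A$ and the transpose relation between $\mathcal{N}$ and $\mathcal{N}^{*}$; it assumes that $K$ and $L$ are invertible, and writes $\mathcal{N}^{*}(p_{Y}g_{j})=\sum_{i}N^{*}_{ij}\,p_{X}f_{i}$ by analogy with the definition of $N_{ij}$.

```latex
\begin{aligned}
A_{kj} &= \langle \nu_{k}, \mathcal{N}(\mu_{j}) \rangle_{Y}
        = \sum_{i} N_{ij}\, \langle \nu_{k}, \nu_{i} \rangle_{Y}
        = (L N)_{kj}
        &&\Longrightarrow\quad N = L^{-1} A, \\
(A^{\top})_{kj} &= \langle \nu_{j}, \mathcal{N}(\mu_{k}) \rangle_{Y}
        = \langle \mathcal{N}^{*}(\nu_{j}), \mu_{k} \rangle_{X}
        = \sum_{i} N^{*}_{ij}\, \langle \mu_{i}, \mu_{k} \rangle_{X}
        = (K N^{*})_{kj}
        &&\Longrightarrow\quad N^{*} = K^{-1} A^{\top}, \\
{\rm Tr}\,(N^{*} N) &= {\rm Tr}\,(K^{-1} A^{\top} L^{-1} A).
\end{aligned}
```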

Moreover, the corresponding truncated representation of the conditional distribution is

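$$p(y|x)=\mathcal{N}(\delta_{x})(y)\simeq p(y)\sum_{i,j=1}^{k_{0}}(L^{-1}AK^{-1})_{ij}\,g_{i}(y)\,f_{j}(x), \tag{14}$$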

where we used the fact that the components of $\delta_{x}$ are $\delta_{j}=\sum_{i}K^{-1}_{ji}f_{i}(x)$ .

Of course, this approach can produce a faithful representation of the correlations only if $\mathcal{N}$ is actually close to being of rank $k_{0}$ (see Theorem 1 in Appendix B for a more precise statement). If we interpret the relevant subspace as a space of probability distributions over latent variables, this means that our latent variables have at most $k_{0}$ discrete states.

However, even if the rank-$k_{0}$ corner of $\mathcal{N}$ is not a good approximation, this strategy nevertheless allows us to do the correct inference on certain random variables, namely those which are in the span of the canonical variables!

Indeed, the exact conditional expectation of $g_{k}$ is (assuming $D$ is the actual rank of $\mathcal{N}$ ),

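$$\sum_{y}g_{k}(y)\,p(y|x)=\sum_{i,j=1}^{D}(L^{-1}AK^{-1})_{ij}\,\mathbb{E}(g_{k}g_{i})\,f_{j}(x)=\sum_{j=1}^{D}(AK^{-1})_{kj}\,f_{j}(x)=\sum_{j=1}^{k_{0}}(AK^{-1})_{kj}\,f_{j}(x), \tag{15}$$

where the last truncation is exact if $k\leq k_{0}$ due to the assumption that the bases $f_{i}$ and $g_{j}$ have the same span as the $k_{0}$ largest right and left singular vectors of $\mathcal{N}$ respectively.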
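The exactness of the truncated formula for variables in the relevant span can be verified on a discrete toy example. In the sketch below, which is an illustration under assumptions and not the paper's experiment, the relevant variables are built by mixing the top $k_{0}$ canonical variables (here including the constant pair with singular value 1) with random invertible matrices, so that they are deliberately non-orthogonal. All names and the random joint distribution are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)
p_xy = rng.random((6, 7))
p_xy /= p_xy.sum()                                   # joint p_xy[x, y]
p_x, p_y = p_xy.sum(axis=1), p_xy.sum(axis=0)

# Canonical variables from the SVD of p(x, y) / sqrt(p(x) p(y)).
U, eta, Vt = np.linalg.svd(p_xy / np.sqrt(np.outer(p_x, p_y)), full_matrices=False)
k0 = 3
u = U[:, :k0] / np.sqrt(p_x)[:, None]                # u[x, i]
v = Vt[:k0].T / np.sqrt(p_y)[:, None]                # v[y, i]

# Non-orthogonal relevant variables with the same span (random mixing).
f = u @ rng.normal(size=(k0, k0))                    # f[x, j]
g = v @ rng.normal(size=(k0, k0))                    # g[y, i]

K = f.T @ (p_x[:, None] * f)                         # K_ij = E(f_i f_j)
L = g.T @ (p_y[:, None] * g)                         # L_ij = E(g_i g_j)
A = g.T @ p_xy.T @ f                                 # A_ij = E(g_i f_j)

# Left: exact conditional expectation sum_y g_k(y) p(y|x), indexed [x, k].
exact = (p_xy / p_x[:, None]) @ g
# Right: sum_j (A K^{-1})_{kj} f_j(x), indexed [x, k].
via_features = f @ np.linalg.solve(K, A.T)

# Sum of squared singular values captured by the k0-dimensional subspaces.
captured = np.trace(np.linalg.solve(K, A.T) @ np.linalg.solve(L, A))
print(np.abs(exact - via_features).max(), captured)
```

The two expressions agree to numerical precision even though the joint distribution has rank larger than $k_{0}$, and the trace reproduces the sum of the $k_{0}$ largest squared singular values regardless of the mixing matrices.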

For instance, if $p(x,y)$ is Gaussian, the canonical variables can be computed analytically, as in [3] or [18] in the multivariate case. Solutions for other distributions were also computed in [19].

Notably, for any two-dimensional Gaussian, the space of $k$ most relevant variables is simply spanned by the moments $f_{n}(x)=x^{n}$ and $g_{n}(y)=y^{n}$ for $n=0,\dots,k-1$. Hence, in this case the first $k$ moments can be inferred exactly using only the $k+1$ most relevant variables (see Appendix C).

4 Algorithm

Let us make explicit the algorithm resulting from the above analysis.

We assume that we are given independent samples $(x_{1},y_{1}),(x_{2},y_{2}),\dots$ from the otherwise unknown joint distribution $p(x,y)$.

We first perform DCCA [5]. That is, we need two independent deterministic feed-forward neural networks. The first maps $x$ to a set of $k_{0}$ real-valued variables $f_{1}(x),\dots,f_{k_{0}}(x)$ . The second maps $y$ to a different set of $k_{0}$ variables $g_{1}(y),\dots,g_{k_{0}}(y)$ .

The parameters of the neural networks are to be set to minimize the objective function

$$C=k_{0}-{\rm Tr}\,(K^{-1}A^{\top}L^{-1}A), \tag{16}$$

where the matrices $K,L,A$ can be approximated over a mini-batch $(x_{n},y_{n})$ , $n=1,\dots,N$ via

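$$K_{ij}\approx\frac{1}{N}\sum_{n=1}^{N}f_{i}(x_{n})\,f_{j}(x_{n}), \tag{17}$$

$$L_{ij}\approx\frac{1}{N}\sum_{n=1}^{N}g_{i}(y_{n})\,g_{j}(y_{n}), \tag{18}$$

$$A_{ij}\approx\frac{1}{N}\sum_{n=1}^{N}g_{i}(y_{n})\,f_{j}(x_{n}). \tag{19}$$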

We found that, provided the batch size is sufficiently large compared to $k_{0}$ (about 10 times larger in our experience), this can be minimized using ADAM or direct gradient descent. However, to guarantee stability when using large $k_{0}$, we needed to write out the gradient of the objective function explicitly in order to force the use of the Moore-Penrose pseudo-inverses for $K^{-1}$ and $L^{-1}$ in both the forward and backward passes, in addition to using 64-bit floats in these computations.
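A minimal sketch of the forward evaluation of this objective on one mini-batch is given below, using NumPy with 64-bit floats and Moore-Penrose pseudo-inverses. It covers only the forward pass: the explicit gradient mentioned above is not reproduced, and the function name `dcca_loss`, the batch size and the random stand-in features are assumptions made for illustration.

```python
import numpy as np

def dcca_loss(F, G):
    """C = k0 - Tr(K^+ A^T L^+ A) for features F[n, j] = f_j(x_n), G[n, i] = g_i(y_n)."""
    F = np.asarray(F, dtype=np.float64)
    G = np.asarray(G, dtype=np.float64)
    n, k0 = F.shape
    K = F.T @ F / n                      # mini-batch estimate of E(f_i f_j)
    L = G.T @ G / n                      # mini-batch estimate of E(g_i g_j)
    A = G.T @ F / n                      # mini-batch estimate of E(g_i f_j)
    captured = np.trace(np.linalg.pinv(K) @ A.T @ np.linalg.pinv(L) @ A)
    return k0 - captured

rng = np.random.default_rng(0)
F = rng.normal(size=(320, 8))                    # stand-in for network outputs f_j(x_n)
G_linear = F @ rng.normal(size=(8, 8))           # g is an invertible linear function of f
G_indep = rng.normal(size=(320, 8))              # g is independent of f

loss_linear = dcca_loss(F, G_linear)             # close to 0: perfectly inferable
loss_indep = dcca_loss(F, G_indep)               # close to k0: nothing to infer
print(loss_linear, loss_indep)
```

The first case illustrates that the objective depends only on the spans of the two feature sets, not on the particular basis, and the second shows the small finite-batch bias that makes independent features appear slightly correlated.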

Once the relevant variables have been learned, we still need to use the training data in a second step. Indeed, suppose that we wish to use our model to infer the value of some function $\Theta(x)$ , i.e., to compute its approximate expectation value in terms of the conditional distribution $x\mapsto p(x|y)$ . Then we need to store, for each variable $j=1,\dots,k_{0}$ , the quantities

$$\Theta_{j}=\frac{1}{N_{\rm full}}\sum_{n=1}^{N_{\rm full}}\Theta(x_{n})\,f_{j}(x_{n}), \tag{20}$$

where the average is to be taken on the full training batch (of size $N_{\rm full}$ ). The same can be done exchanging $x$ with $y$ and $f_{j}$ with $g_{j}$ for the reverse inference.

For instance, if a data point $x$ is composed of real components $x^{a}$ (such as pixel color components for an image) and we are interested in the estimator which minimizes the expected $l^{2}$ distance to the predicted values of these components, then we need the expectation values of the components $\Theta(x)=x^{a}$ for all $a$, and possibly higher moments to gain more knowledge about the shape of the posterior distribution, such as the second moments $\Theta^{\prime}(x)=x^{2}$, etc.

Inference can then be performed with new data using

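$$\overline{\Theta}=\sum_{x}p(x|y)\,\Theta(x)\approx\sum_{i,j=1}^{k_{0}}(K^{-1}A^{\top}L^{-1})_{ji}\,\Theta_{j}\,g_{i}(y). \tag{21}$$

Moreover, the accuracy of these predictions does not depend on the rank $k_{0}$ if $\Theta$ is taken in the span of the relevant variables, i.e., $\Theta(x)=\sum_{i=1}^{k_{0}}c_{i}f_{i}(x)$, for which $\Theta_{j}=\sum_{i=1}^{k_{0}}c_{i}K_{ij}$.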

The reverse inference formulas are obtained simply by the exchanges $K\leftrightarrow L$ , $A\leftrightarrow A^{\top}$ , and $g_{i}\leftrightarrow f_{i}$ .
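The two-step procedure (store the averages $\Theta_{j}$, then apply the inference formula to new data) can be sketched as follows. This is an illustration under assumptions: the features are one-hot encodings of small discrete variables, so that the result can be compared with the empirical conditional mean, and the function names `fit_inference` and `infer` are hypothetical.

```python
import numpy as np

def fit_inference(F, G, Theta):
    """Precompute W = L^+ A K^+ Theta_j from training features and targets.

    F[n, j] = f_j(x_n), G[n, i] = g_i(y_n), Theta[n, a] = Theta_a(x_n).
    """
    n = F.shape[0]
    K, L, A = F.T @ F / n, G.T @ G / n, G.T @ F / n
    Theta_j = F.T @ Theta / n                          # stored averages, shape (k0, n_theta)
    return np.linalg.pinv(L) @ A @ np.linalg.pinv(K) @ Theta_j

def infer(W, G_new):
    """Conditional expectations of Theta given new y, using only g_i(y)."""
    return G_new @ W

rng = np.random.default_rng(0)
x = rng.integers(0, 3, size=2000)
y = (x + rng.integers(0, 2, size=2000)) // 2           # y in {0, 1}, correlated with x
F, G = np.eye(3)[x], np.eye(2)[y]                      # complete (one-hot) feature sets
Theta = np.stack([x, x ** 2], axis=1).astype(float)    # first and second moments of x

W = fit_inference(F, G, Theta)
pred = infer(W, np.eye(2))                             # rows correspond to y = 0 and y = 1
mean, second = pred[:, 0], pred[:, 1]
std = np.sqrt(second - mean ** 2)                      # posterior standard deviation
print(mean, std)
```

Because the one-hot features span all functions of $x$ and $y$ in this toy case, the predicted moments coincide with the empirical conditional moments; with learned features of lower rank the same code returns the truncated approximation.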

5 Experiments

In all our experiments, we used the ADAM optimizer with learning rate $0.001$. We used the Flux package [20] for Julia, as well as TensorFlow.

As usual the data is divided into a training set and a testing set. No aspect of the testing set is used during training. The loss function refers to Eq. (16). In order to monitor overfitting, we compute a “test loss” and a “training loss”. The test loss is computed from the trained variables using only the test data, and accordingly, the training loss is computed purely using the training data.

Moreover, when performing inference on test data using Eq. (21), we use the covariances $A,L,K$ and expectations $\Theta_{j}$ (Eq. (20)) built from the training data only.

5.1 Inference on occluded MNIST

In this experiment, we use the left and right halves of the MNIST digit images as correlated variables $X$ and $Y$ . The goal is to obtain the expected left halves given the right halves, or vice versa.

The training set was augmented by random small rotations and displacements to make the task more ambiguous, as we want to explore the uncertainty in the prediction.

The relevant variables were represented by two convolutional neural networks of identical architecture. They are composed of four convolutional layers and one fully connected layer, an architecture that performs well for supervised learning on this dataset. For ease of implementation, these CNNs take the whole image as input, but with either half zeroed (same value as black pixels). Half-width CNNs with proper padding at the cut perform similarly.

After training, we used the training dataset to also compute the expected pixel gray value as well as their covariance for each relevant variable using Eq. (20).

These were inserted into Eq. (21) to compute the mean pixel gray values and their covariances over the conditional probability of $X$ given $Y$ on test data. This mean is the Bayesian estimator for the $l^{2}$ distance between half images, i.e., it should minimize the expected squared distance $d_{l^{2}}^{2}$ over the conditional distribution, where $d_{l^{2}}^{2}(x,y)=\sum_{i}(x_{i}-y_{i})^{2}$ and $x_{i}\in[0,1]$ is the value of the $i$th pixel. (This is equivalent to minimizing the mean square error.)

The results on a randomly selected subset of test digits are shown in Fig. 1. For each example, we also computed the images obtained by adding plus or minus one standard deviation along the direction of greatest variance in the space of relevant variables. This reveals the main ambiguities (such as between $8$ and $3$, or $7$ and $9$, which share a similar right half).

The graph of the singular values shows that the rank cutoff of $200$ is too low to capture all of the relevant variables (the sudden drop at the end is not robust to an increase in the cutoff), but the results are reasonable nevertheless. This shows that our representation of the conditional distributions contains valuable information besides the simple mean.

5.2 Supervised learning

In the context of a supervised classification task, one of the datasets (the labels) is of sufficiently low dimensionality that we can use a complete basis over its probability space as our relevant variables, such as the standard one-hot encoding of labels. This serves as a good first sanity test for our approach. Surprisingly, we find that it converges faster than standard approaches, and without the need for regularization.

Let the variable $Y$ stand for the labels, with values in $\{1,\dots,k\}$. The probability space consists of vectors with $k$ real components. The canonical basis corresponds to the one-hot encoding $g_{i}(j)=\delta_{ij}$ (Kronecker delta). All we need is a neural network to encode $k$ variables $f_{1},\dots,f_{k}$ on $X$. After learning the most relevant variables $f_{i}$, we apply the reverse of Eq. (21) for the function $\Theta(y)=y$, and use the maximum component of the expected value $\overline{y}$ to infer the labels from the data.
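The label-inference step can be sketched in a few lines, reading $\Theta(y)=y$ as the one-hot vector of the label (our interpretation of the text). Under that reading the stored averages equal the rows of $L$, so the factor $L^{-1}$ cancels and the expected one-hot vector reduces to $A K^{-1} f(x)$. The function names `dcci_fit` and `dcci_predict` and the noisy one-hot features standing in for a trained network are assumptions made for illustration.

```python
import numpy as np

def dcci_fit(F, labels, k):
    """Return M = A K^+, so that the expected one-hot label given x is M f(x)."""
    n = F.shape[0]
    G = np.eye(k)[labels]                 # one-hot encoding g_i(y) = delta_{i y}
    K = F.T @ F / n                       # estimate of E(f_i f_j)
    A = G.T @ F / n                       # estimate of E(g_i f_j)
    return A @ np.linalg.pinv(K)

def dcci_predict(M, F_new):
    """Predicted label = index of the largest component of the expected one-hot vector."""
    return np.argmax(F_new @ M.T, axis=1)

rng = np.random.default_rng(0)
k = 5
labels = rng.integers(0, k, size=1000)
# Stand-in for the network outputs f_i(x): noisy copies of the one-hot labels.
F = np.eye(k)[labels] + 0.1 * rng.normal(size=(1000, k))

M = dcci_fit(F[:800], labels[:800], k)                   # "training" split
accuracy = np.mean(dcci_predict(M, F[800:]) == labels[800:])
print(accuracy)
```

Labels here run from $0$ to $k-1$ rather than $1$ to $k$, following the usual Python indexing convention.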

Let us refer to this procedure as DCCI (Deep Canonical Correlations based Inference).

We tested this approach on the MNIST and CIFAR10 datasets, and compared the results to the standard cross-entropy objective (Fig. 2).

We plotted the accuracy as a function of the epoch rather than clock time, which would depend on many factors. However, the time per epoch is roughly the same for both approaches in the above experiments. Indeed, the training time is dominated by the forward and backward evaluations of the neural networks, which are identical. (However, the time it takes to evaluate our objective can become significant for a much larger number of labels $k$, since it involves the inversion of matrices of dimension $k$. This is in addition to the fact that a greater dimension would also require larger batches.)

We found that, without regularization, simply changing the objective from cross-entropy to DCCI provided a large improvement both of convergence speed and final accuracy for both models.

On MNIST, DCCI alone also outperformed cross-entropy with dropout. (Dropout did not yield any improvement in conjunction with DCCI.) However, adding batch-normalization layers in the CIFAR example erased any distinction between DCCI and cross-entropy.

5.3 Structure of the relevant variables

We mentioned in Section 3 that if $p(x,y)$ is a two-dimensional Gaussian distribution with zero mean, then the $n$ most relevant variables of $X$ are the first $n$ powers of $X$ itself, independently of the covariance matrix. This implies that the canonical variables are the Hermite polynomials in $X$ (which results from applying the Gram-Schmidt procedure to the basis $\{1,x,x^{2},\dots\}$ ).
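The Hermite structure can be checked numerically. The sketch below relies on a standard identity that is not stated in the text above: for a standard bivariate Gaussian with correlation $\rho$, the probabilists' Hermite polynomials satisfy $\mathbb{E}(He_{n}(Y)\,|\,X=x)=\rho^{n}He_{n}(x)$, so that each polynomial is mapped to itself with factor $\rho^{n}$. The function name and the chosen values of $\rho$ and $x$ are illustrative assumptions.

```python
import numpy as np
from numpy.polynomial import hermite_e as He

def conditional_expectation_hermite(n, x, rho, order=40):
    """E[He_n(Y) | X = x] for a standard bivariate Gaussian with correlation rho."""
    z, w = He.hermegauss(order)                    # Gauss quadrature for weight exp(-z^2 / 2)
    y = rho * x + np.sqrt(1.0 - rho ** 2) * z      # Y | X = x is N(rho x, 1 - rho^2)
    coeffs = np.zeros(n + 1)
    coeffs[n] = 1.0                                # coefficient vector selecting He_n
    return np.sum(w * He.hermeval(y, coeffs)) / np.sqrt(2.0 * np.pi)

rho, x = 0.6, 0.7
lhs = [conditional_expectation_hermite(n, x, rho) for n in range(5)]
rhs = [rho ** n * He.hermeval(x, np.eye(5)[n]) for n in range(5)]
print(lhs)
print(rhs)
```

The agreement of the two lists illustrates why, in the Gaussian case, the first $n$ powers of $X$ span the $n$ most relevant variables for any nonzero correlation.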

A similar property holds for multivariate Gaussians, namely, the less relevant canonical variables are polynomials in the more relevant ones. If this is true more generally, it should be possible to further compress and organize the latent space extracted with DCCA by finding a minimal set of generators, which ought to also be in the span of the most relevant variables.

We applied DCCA to a synthetic dataset to explore this idea, shown in Fig. 3. In this case, we actually performed a final SVD to obtain the unique uncorrelated canonical variables, and ordered them by decreasing relevance (their respective singular values).

Here, $X$ consists of two real numbers, distributed uniformly within a ring and a disk. The variable $Y$ is obtained by adding a random Gaussian shift to $X$ with a small standard deviation. The more relevant variables ought to be those which are more robust to such small random displacement. This formalizes the idea that we are interested in extracting “large-scale” variables [17].
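A dataset of this kind can be generated as sketched below. The radii, the mixture weight between disk and ring, and the noise level are not specified in the text, so the default values used here are assumptions chosen only for illustration, as is the function name `sample_ring_and_disk`.

```python
import numpy as np

def sample_ring_and_disk(n, rng, r_disk=0.3, r_in=0.8, r_out=1.0, p_disk=0.5, noise=0.05):
    """Sample X uniformly (in area) from a disk or a concentric ring, and Y = X + noise."""
    in_disk = rng.random(n) < p_disk
    # Uniform density in area: the squared radius is uniform between the squared bounds.
    r_min = np.where(in_disk, 0.0, r_in)
    r_max = np.where(in_disk, r_disk, r_out)
    r = np.sqrt(rng.uniform(r_min ** 2, r_max ** 2))
    theta = rng.uniform(0.0, 2.0 * np.pi, size=n)
    X = np.stack([r * np.cos(theta), r * np.sin(theta)], axis=1)
    Y = X + noise * rng.normal(size=X.shape)       # small random Gaussian shift
    return X, Y, in_disk

rng = np.random.default_rng(0)
X, Y, in_disk = sample_ring_and_disk(1000, rng)
radius = np.hypot(X[:, 0], X[:, 1])
print(X.shape, Y.shape, in_disk.mean())
```

The pairs `(X, Y)` would then play the role of the two views fed to the two networks, and `in_disk` is the binary variable that the text expects to appear among the relevant variables.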

We would expect the relevant independent variables to be: the binary variable indicating whether the point is in the disk or the ring and the angle around the ring, followed by the radial component in the ring, and finally the Cartesian coordinates inside the disk. This is precisely what we see in Fig. 3.

Indeed—if we put aside for now the fact that the angle itself is not directly represented—besides the constant function, the two most relevant variables are the sine and cosine of the angle, followed by the binary variable separating the disk from the ring.

But these variables ought to span the space of probabilities over the relevant variables, not just the variables themselves. Hence the next six variables are sines and cosines of smaller wavelength, which can encode probability distributions which are increasingly more precisely localized, down to a precision (wavelength) comparable with the diameter of the inner disk. Accordingly, the next two most relevant variables are the Cartesian coordinates inside the disk. This is followed by additional moments of the angle, down to a wavelength equal to the ring’s thickness, at which point we see the radius in the ring appear.

As mentioned, we see that the angle itself is not represented, likely because it is discontinuous. However, as also shown in Fig. 3, creating a gap in the ring allows the angle to emerge as the most relevant variable. This suggests that this approach may be able to automatically learn intrinsic coordinates of the latent variable manifold.

5.4 Independent variables and generative model

If we postulate that the independent (or disentangled) relevant latent variables can be found in the linear span of the relevant variables, we can attempt to extract them by optimizing a neural network composed of two parts. Firstly, a linear layer maps the relevant variables to a small number of outputs (equal to the latent dimension). The purpose of this linear layer is to find the independent variables. These latent variables are then processed by an arbitrarily complex generative network to produce a possible value of the variable $X$. As objective function, we may use an appropriate measure of similarity between the output and the data element from which the variables were obtained.

We tested this idea as follows. We took $X$ to consist of the MNIST digits, and produced $Y$ by randomly permuting neighboring pixels in the image, until the mean displacement per pixel is of order $1$ . In addition, we added independent Gaussian noise to the pixel values. (Hence the noise map $\mathcal{N}$ simulates the coarse-graining channel introduced in [21]).
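One possible way to produce such a coarse-grained view is sketched below. The text does not specify the swap schedule, the neighbourhood, or the noise level, so the choices made here (uniformly random swaps between 4-neighbours, rounds of one attempted swap per pixel, a target mean displacement of 1 pixel, and the default `noise_std`) are assumptions, as is the function name `coarse_grain`.

```python
import numpy as np

def coarse_grain(image, rng, target_displacement=1.0, noise_std=0.1):
    """Swap random neighbouring pixels until the mean displacement reaches the target,
    then add independent Gaussian noise. Returns (noisy image, permuted image)."""
    h, w = image.shape
    rows, cols = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    home = np.stack([rows, cols], axis=-1)         # current position of every grid site
    origin = home.copy()                           # origin[i, j] = original position of the pixel now at (i, j)
    out = image.astype(np.float64).copy()
    steps = np.array([(0, 1), (1, 0), (0, -1), (-1, 0)])
    while np.linalg.norm(origin - home, axis=-1).mean() < target_displacement:
        for _ in range(h * w):                     # one round: one attempted swap per pixel
            i, j = rng.integers(0, h), rng.integers(0, w)
            di, dj = steps[rng.integers(0, 4)]
            a, b = i + di, j + dj
            if 0 <= a < h and 0 <= b < w:          # ignore swaps that leave the image
                out[i, j], out[a, b] = out[a, b], out[i, j]
                origin[[i, a], [j, b]] = origin[[a, i], [b, j]]
    permuted = out.copy()
    noisy = permuted + noise_std * rng.normal(size=out.shape)
    return noisy, permuted

rng = np.random.default_rng(0)
image = rng.random((12, 12))                       # stand-in for an MNIST digit
noisy, permuted = coarse_grain(image, rng)
print(noisy.shape)
```

Before the noise is added, the output contains exactly the same pixel values as the input, only locally rearranged, which is what makes small-scale features hard to infer while leaving large-scale structure intact.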

As in the previous experiment, we do so to implement our intuition that the more relevant variables ought to be the ones which are of larger scale, or more robust to local perturbations.

The relevant variables of the clean images were produced by the same convolutional neural network as in Section 5.2, while the variables of the coarse-grained images were extracted by a network of the same geometry, but with half the number of filters and neurons.

We extracted the $1000$ most relevant out of $1200$ learned variables in this way. (The least relevant variables in this system happen to be highly dependent on the total number of variables and hence cannot be trusted to be correct). As a second step, we trained a linear layer coupled to a network composed of $5$ fully-connected layers of $800$ hidden neurons each. We refer to the number of output neurons in the first linear layer as the latent dimension.

As input, this network received the variables extracted from MNIST images using the above convolutional neural net (after it was fully trained using DCCA), and was trained to minimize the mean square error between its output and the original MNIST digit.
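A minimal sketch of the forward pass of this two-part network is given below, in plain NumPy and without the training loop. The ReLU nonlinearity, the random initialization, the output size of $784$ pixels and the function names `init_params`, `forward` and `mse` are assumptions made for illustration; the text only fixes the linear first layer, the $5$ hidden layers of $800$ neurons and the mean square error.

```python
import numpy as np

def init_params(n_vars, latent_dim, hidden=800, n_hidden=5, n_pixels=784, seed=0):
    """Random weights for: linear layer -> n_hidden ReLU layers -> pixel output."""
    rng = np.random.default_rng(seed)
    sizes = [latent_dim] + [hidden] * n_hidden + [n_pixels]
    linear = rng.normal(0.0, n_vars ** -0.5, size=(n_vars, latent_dim))
    layers = [(rng.normal(0.0, m ** -0.5, size=(m, n)), np.zeros(n))
              for m, n in zip(sizes[:-1], sizes[1:])]
    return linear, layers

def forward(params, variables):
    """Map relevant variables to (latent variables, reconstructed image)."""
    linear, layers = params
    latent = variables @ linear            # first part: purely linear bottleneck
    h = latent
    for W, b in layers[:-1]:
        h = np.maximum(h @ W + b, 0.0)     # ReLU is an assumption
    W, b = layers[-1]
    return latent, h @ W + b

def mse(output, target):
    """Mean square error used as the similarity measure."""
    return float(np.mean((output - target) ** 2))
```

Keeping the first layer free of any nonlinearity is what restricts the latent variables to the linear span of the relevant variables.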

The resulting best mean square errors are shown in Fig. 4, as a function of the latent dimension. Here we see a distinct change of polynomial scaling law at dimension $19$. Increasing the dimension further provides no improvement. This behaviour is compatible with our hypothesis that the extra variables are just functions of those first twenty variables (functions which are effectively re-implemented by the generative network).

Images generated by sampling from a Gaussian approximation of the latent distribution for different latent dimensions are shown in Fig. 4. Below dimension $20$, most generated images can be recognized as a specific digit.

6 Outlook

We studied the classical (non-quantum) form of the theory introduced in [1], and found that the relevant observables of that theory are just the most correlated canonical variables in the sense of DCCA [5], and can be learned effectively using standard machine learning methods.

This point of view on DCCA provided us with several new insights. The first is that the learned relevant variables provide a useful representation of a joint probability distribution. We showed that performing inference using this representation can outperform cross-entropy in predicting classes. Our experiments on halves of MNIST also show that the conditional distribution we obtain can effectively represent the uncertainty in the prediction of high-dimensional data.

A second insight relates to the interpretation of the canonical variables as spanning directions in the space of probability distributions. As suggested by the Gaussian solutions and our experiment on synthetic data, we postulate that the canonical variables are functions of a small number of independent generators contained in their span. This hypothesis is supported by our experiment on MNIST, but further work is required to find a way to cleanly extract these variables.

We have yet to explore the potential applications of one of the salient aspects of this approach to inference: the fact that the canonical variables learned using DCCA are also those which can be most reliably predicted, irrespective of the value of the cutoff. To see why this is potentially significant, we observe that a central feature of scientific exploration is that we are not so concerned with making predictions about some given variables, as much as we are with discovering variables which can be predicted.

Another important feature of this approach is the fact that the resulting model allows for the direct evaluation of the expectation values in the posterior distribution without sampling. In particular this allows for the evaluation of credible intervals. Hence it should be especially suited to scientific applications where the ability to quantify uncertainty is essential.

Finally, the relationship that we established with a theory of quantum origin points towards a potential quantum generalization of DCCA that would apply to quantum data, or classical measurements of quantum systems.

Acknowledgments

We would like to thank Joël Bény and Raban Iten for helpful suggestions. We are also indebted to an anonymous ICLR2020 referee for pointing out the connection between our approach and DCCA. This work was supported by the National Research Foundation of Korea (NRF-2018R1D1A1A02048436).

Appendix A Extra information about the algorithm

A.1 Alternative interpretation of the objective

If we write $F_{ij}:=f_{j}(x_{i})$ and $G_{ij}:=g_{j}(y_{i})$ for the value of our variables on the dataset, then $K=\frac{1}{N}F^{\top}F$ , $L=\frac{1}{N}G^{\top}G$ and $A=\frac{1}{N}G^{\top}F$ . The DCCA objective can then be written as

[TABLE]

where $P=F(F^{\top}F)^{-1}F^{\top}$ and $Q=G(G^{\top}G)^{-1}G^{\top}$ are the projectors on the ranges of $F$ and $G$ respectively. Hence, we are maximizing the overlap between those ranges (which represent possible linear combinations of datapoints, respectively determined from variables of one or the other correlated views.)
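The equivalence between the correlation form and the projector form can be checked numerically. The sketch below assumes that the objective is built from the trace ${\rm Tr}\,(K^{-1}A^{\top}L^{-1}A)$ of Appendix B.5 (up to sign and an additive constant), and uses random matrices in place of learned variables.

```python
import numpy as np

rng = np.random.default_rng(0)
n, k0 = 500, 4
F = rng.normal(size=(n, k0))                                  # F[i, j] = f_j(x_i)
G = F @ rng.normal(size=(k0, k0)) + rng.normal(size=(n, k0))  # G[i, j] = g_j(y_i)

K = F.T @ F / n
L = G.T @ G / n
A = G.T @ F / n
# Tr(K^{-1} A^T L^{-1} A), using linear solves instead of explicit inverses.
trace_corr = np.trace(np.linalg.solve(K, A.T) @ np.linalg.solve(L, A))

P = F @ np.linalg.solve(F.T @ F, F.T)   # projector onto range(F)
Q = G @ np.linalg.solve(G.T @ G, G.T)   # projector onto range(G)
trace_proj = np.trace(P @ Q)            # overlap between the two ranges
```

Both traces agree up to rounding, and they lie between $0$ and the number of variables.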

A.2 Heuristic

Batch size—In our experiments, we observed that the batch size during training needs to be an order of magnitude larger than the number of variables (rank cutoff). When the batch size was too small, learning seemed to converge normally in terms of training and test loss, but resulted in variables which yield dramatically different losses when evaluated on larger batches, and yield spurious predictions.

Constant variables—The loss function $C$ takes values between [math] and $k_{0}-1$ because the constant variable always has relevance $1$. The constant variable could be enforced a priori rather than learned, which, due to the objective, automatically forces the learned variables to have zero expectation values (be orthogonal to the constant variable). This might have advantages in certain circumstances, but in our experiments we found that this sometimes hindered convergence.

Invertibility issues—The covariance matrices $K$ and $L$ can be ill-conditioned, potentially causing the gradient to “explode” because of the inverses $K^{-1}$ and $L^{-1}$ involved in the loss function. This can be avoided either by using the Moore-Penrose pseudo-inverse, or by replacing $K^{-1}$ by $(K+\epsilon{\bf 1})^{-1}$ in the loss for some small positive number $\epsilon$, and likewise for $L^{-1}$.

Symmetries in the loss function—The loss $C$ only depends on the span of the variables $f_{i}$ and $g_{j}$ , hence it has a very large group of symmetries. In particular, it is invariant under a change of the norm of each variable independently from each other. Because of that, it is preferable not to have a linear last layer. Using a hyperbolic tangent as last nonlinearity worked in our experiments.

Regularization—In all our tests, dropout had no beneficial effect. In fact, our objective seems to already provide a form of regularization, as shown in Section 5.2.
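The two remedies for invertibility issues mentioned above can be compared on a deliberately singular example. This is a sketch only: the helper `regularized_solve`, the value of `eps`, the cutoff passed to `numpy.linalg.pinv` and the duplicated variable are illustrative choices, not settings taken from our experiments.

```python
import numpy as np

def regularized_solve(K, B, eps=1e-6):
    """Return (K + eps * I)^{-1} B without forming the inverse explicitly."""
    return np.linalg.solve(K + eps * np.eye(K.shape[0]), B)

rng = np.random.default_rng(0)
F = rng.normal(size=(1000, 3))
F = np.hstack([F, F[:, :1]])             # duplicated variable, so K is singular
G = F[:, :2] + rng.normal(size=(1000, 2))
K = F.T @ F / len(F)
At = F.T @ G / len(F)                    # plays the role of A^T

ridge = regularized_solve(K, At)                 # (K + eps 1)^{-1} A^T
pinv = np.linalg.pinv(K, rcond=1e-10) @ At       # Moore-Penrose alternative
```

Here both options give finite and nearly identical results, whereas a plain inverse of `K` would be undefined.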

Appendix B Theory in more details

We consider two correlated random variables $X$ and $Y$ with a joint probability distribution $p(x,y)$. We assume that we are able to numerically evaluate expectations with respect to this distribution, for instance because we can sample from it. We want to use this ability in order to compute expectations with respect to the conditional distributions $p_{X|Y}(x|y)=p(x,y)/p_{Y}(y)$ and $p_{Y|X}(y|x)=p(x,y)/p_{X}(x)$, where $p_{X}(x)=\sum_{y}p(x,y)$ and $p_{Y}(y)=\sum_{x}p(x,y)$ are the marginals of $p$. Below we sometimes remove the subscripts $X$, $X|Y$ or $Y|X$ if there is no ambiguity.

For instance, suppose we generated samples of $y$ given $x$, through explicit knowledge of $p_{Y|X}$. Then the evaluation of expectations with respect to $p_{X|Y}$ is the subject of Bayesian inference. This is generally done in a context where the variable $X$ has low dimensionality and parameterizes a hand-crafted model. Our approach, however, is free of such a model and the variable $X$ can be of very high dimensionality.

B.1 Inner product on probability vectors

In order to define our strategy, we need to equip the spaces of probability distributions for $X$ and $Y$ with an inner product structure. Let us focus on $X$ , and assume that it takes discrete values to avoid unnecessary technicalities. The set of probability vectors is a convex subset of the real linear space $V_{X}=\mathbb{R}^{n}$ . Let us equip this space with the product

$$\langle\mu,\mu^{\prime}\rangle_{X}:=\sum_{x}\frac{\mu(x)\,\mu^{\prime}(x)}{p_{X}(x)}$$

for any $\mu,\mu^{\prime}\in V_{X}$ . We also write $\|\mu\|_{X}^{2}=\langle\mu,\mu\rangle_{X}$ . Importantly, this depends explicitly on the fixed probability vector $p_{X}(x)$ , which we took to be the marginal of $p(x,y)$ . If $p_{X}$ has full support, this makes $V_{X}$ into a real inner product space. The same can be done for the variable $Y$ , yielding the inner product $\langle\nu,\nu^{\prime}\rangle_{Y}$ for $\nu,\nu^{\prime}\in V_{Y}$ .

Had we interpreted $\mu$ and $\mu^{\prime}$ as tangent vectors to $V_{X}$, considered as a manifold, this would be the Fisher information (Riemannian) metric, as in [2]. But this quantity is also meaningful for finite vectors: the squared distance, in the induced norm, between $p_{X}$ and any probability vector $q$ is the $\chi^{2}$-divergence:

$$\|q-p_{X}\|_{X}^{2}=\sum_{x}\frac{(q(x)-p_{X}(x))^{2}}{p_{X}(x)}.$$

The set of conditional probability distributions $p_{Y|X}$ forms a stochastic map, i.e., a linear map $\mathcal{N}:V_{X}\rightarrow V_{Y}$, $\mu\mapsto\mathcal{N}(\mu)$, where

$$\mathcal{N}(\mu)(y)=\sum_{x}p_{Y|X}(y|x)\,\mu(x)$$

for any $\mu\in V_{X}$ .

It is straightforward to check that the stochastic map $\mathcal{N}^{*}$ defined by

$$\mathcal{N}^{*}(\nu)(x)=\sum_{y}p_{X|Y}(x|y)\,\nu(y)$$

is the transpose of $\mathcal{N}$ with respect to the inner products we defined [22], i.e., for all $\nu\in V_{Y}$ and $\mu\in V_{X}$,

$$\langle\nu,\mathcal{N}(\mu)\rangle_{Y}=\langle\mathcal{N}^{*}(\nu),\mu\rangle_{X}.$$

Also, we observe that $\mathcal{N}(p_{X})=p_{Y}$ and $\mathcal{N}^{*}(p_{Y})=p_{X}$ .
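These properties are easy to verify on a small discrete example. The sketch below uses a random $3\times 4$ joint distribution; the array `p` and the helper names `inner`, `N` and `N_star` are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
p = rng.random((3, 4))
p /= p.sum()                                # joint p[x, y] with full support
pX, pY = p.sum(axis=1), p.sum(axis=0)       # marginals

def inner(a, b, marginal):
    """Inner product <a, b> weighted by the inverse of the marginal."""
    return float(np.sum(a * b / marginal))

def N(mu):
    """N(mu)(y) = sum_x p(y|x) mu(x)."""
    return (p / pX[:, None]).T @ mu

def N_star(nu):
    """N*(nu)(x) = sum_y p(x|y) nu(y)."""
    return (p / pY[None, :]) @ nu

mu, nu = rng.normal(size=3), rng.normal(size=4)
lhs = inner(nu, N(mu), pY)                  # <nu, N(mu)>_Y
rhs = inner(N_star(nu), mu, pX)             # <N*(nu), mu>_X
```

The two sides agree for arbitrary vectors `mu` and `nu`, and `N(pX)` and `N_star(pY)` reproduce the marginals.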

B.2 Eigen-relevance decomposition

We can use the inner products on $V_{X}$ and $V_{Y}$ to define a singular value decomposition of the stochastic map $\mathcal{N}$ . That is, there is an orthonormal family $u_{1},\dots,u_{k}$ of $V_{X}$ and an orthonormal family $v_{1},\dots,v_{k}$ of $V_{Y}$ , such that

$$\mathcal{N}(u_{j})=\eta_{j}v_{j}$$

for $j=1,\dots,k$. For each $j$, $\eta_{j}$ is a singular value of $\mathcal{N}$, whose square we call the relevance of the vector $v_{j}$. Moreover $\eta_{j}\in[0,1]$ since the $\chi^{2}$ divergence is contractive under any stochastic map. Given that $\mathcal{N}^{*}$ is the transpose of $\mathcal{N}$:

$$\mathcal{N}^{*}(v_{j})=\eta_{j}u_{j}.$$

Equivalently, $u_{j}$ is an eigenvector of $\mathcal{N}^{*}\circ\mathcal{N}$ and $v_{j}$ is an eigenvector of $\mathcal{N}\circ\mathcal{N}^{*}$, both with eigenvalue $\eta_{j}^{2}$.

Because $\mathcal{N}$ maps $p_{X}$ to $p_{Y}$ , we always have the dual eigenvectors $u_{0}=p_{X}$ and $v_{0}=p_{Y}$ with eigenvalue $1$ .
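For a small discrete distribution the decomposition can be computed with an ordinary matrix SVD. The sketch below relies on the observation that, in the orthonormal bases $\sqrt{p_{X}(x)}\,\delta_{x}$ and $\sqrt{p_{Y}(y)}\,\delta_{y}$, the map $\mathcal{N}$ has matrix elements $p(x,y)/\sqrt{p_{X}(x)p_{Y}(y)}$; the variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
p = rng.random((3, 4))
p /= p.sum()                                # joint p[x, y]
pX, pY = p.sum(axis=1), p.sum(axis=0)

# Matrix of N in the orthonormal bases described above.
B = p.T / np.sqrt(np.outer(pY, pX))         # B[y, x] = p(x, y) / sqrt(pX(x) pY(y))
V, eta, Ut = np.linalg.svd(B, full_matrices=False)

u = Ut.T * np.sqrt(pX)[:, None]             # column j is u_j as a vector of V_X
v = V * np.sqrt(pY)[:, None]                # column j is v_j as a vector of V_Y

N_matrix = (p / pX[:, None]).T              # N_matrix[y, x] = p(y|x)
```

The largest singular value is $1$, with singular vectors equal to the marginals up to an overall sign, and `N_matrix @ u` equals `v * eta` column by column.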

B.3 Low-rank approximation

Typically, the dimension $k$ of the space of probabilities is more than astronomically large. For instance, if the values of $X$ consist of small images of $28\times 28$ pixels with $256$ gray levels, then $k=256^{28^{2}}\simeq 10^{1888}$. However, in many cases, only very few of these dimensions may be relevant for the purpose of inferring other variables.
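The order of magnitude quoted above follows from $\log_{10}k=28^{2}\log_{10}256$, which can be checked in a couple of lines (the variable names are illustrative):

```python
import math

n_pixels = 28 * 28                        # 784 pixels per image
log10_k = n_pixels * math.log10(256)      # log10 of 256 ** 784
print(f"k = 256^{n_pixels} is about 10^{log10_k:.2f}")
# prints: k = 256^784 is about 10^1888.06
```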

The core of our approach is to approximate $\mathcal{N}$ and $\mathcal{N}^{*}$ by restricting them to the span of the first $k_{0}$ eigenvectors $u_{j}$ and $v_{j}$ with largest singular values $\eta_{j}$ . That is, if we order the singular values $\eta_{j}$ , $j=1,\dots,k$ in decreasing order, we propose to use the approximations

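$$\mathcal{N}_{0}(\mu)=\sum_{j=1}^{k_{0}}\eta_{j}\,v_{j}\,\langle u_{j},\mu\rangle_{X}\quad\text{and}\quad\mathcal{N}_{0}^{*}(\nu)=\sum_{j=1}^{k_{0}}\eta_{j}\,u_{j}\,\langle v_{j},\nu\rangle_{Y}$$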

to $\mathcal{N}$ and $\mathcal{N}^{*}$ respectively, for some $k_{0}$ typically much smaller than $k$ , and any $\mu\in V_{X}$ , $\nu\in V_{Y}$ .

We denote the components of $\mathcal{N}_{0}$ and $\mathcal{N}_{0}^{*}$ by $q(y|x)$ and $q(x|y)$ , e.g.,

[TABLE]

Since $\mathcal{N}_{0}$ and $\mathcal{N}_{0}^{*}$ are adjoint, we can define $q(x,y)=q(x|y)p_{Y}(y)=q(y|x)p_{X}(x)$ . Although the marginals of $q(x,y)$ are the probability distributions $p_{X}$ and $p_{Y}$ , the numbers $q(x,y)$ are not necessarily positive.

The quality of this approximation for a given $k_{0}$ does not directly depend on the dimensionality of $X$ and $Y$ , but only on the amount of correlations between the two variables. Our aim is to use a $k_{0}$ small enough that the components of $\mathcal{N}_{0}$ and $\mathcal{N}_{0}^{*}$ can be computed explicitly.
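The truncation can be illustrated on a small discrete distribution, where the singular value decomposition is available exactly. In this sketch the function `truncated_joint` and the sizes are illustrative; it builds $q(x,y)$ from the $k_{0}$ largest singular values of the matrix $p(x,y)/\sqrt{p_{X}(x)p_{Y}(y)}$.

```python
import numpy as np

rng = np.random.default_rng(0)
p = rng.random((6, 5))
p /= p.sum()                                # joint p[x, y]
pX, pY = p.sum(axis=1), p.sum(axis=0)

scale = np.sqrt(np.outer(pX, pY))
U, eta, Vt = np.linalg.svd(p / scale, full_matrices=False)

def truncated_joint(k0):
    """q(x, y) built from the k0 largest singular values only."""
    return scale * ((U[:, :k0] * eta[:k0]) @ Vt[:k0])

q = truncated_joint(2)
```

The marginals of `q` coincide with `pX` and `pY` for any cutoff, `truncated_joint(1)` is the product of the marginals, and keeping all singular values recovers `p`. Individual entries of `q` may be negative.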

Theorem 1.

$\mathcal{N}_{0}$ is the map of rank $k_{0}$ which minimizes the average distance

[TABLE]

Proof.

The low rank approximation $\mathcal{N}_{0}$ minimizes the distance $\|\mathcal{N}_{0}-\mathcal{N}\|_{\rm F}$ where

[TABLE]

is the Hilbert-Schmidt (or Frobenius) norm [23]. This follows from the fact that this is also the $l^{2}$-norm of the vector of singular values of $\mathcal{M}$. Let us find the explicit form of the trace. Each possible value $x$ of the variable $X$ is associated with a probability distribution $\delta_{x}$, with $\delta_{x}(x^{\prime})=1$ when $x^{\prime}=x$ and zero otherwise. These distributions form an orthogonal basis of $V_{X}$, and have squared norms $\langle\delta_{x},\delta_{x}\rangle_{X}=1/p_{X}(x)$. Therefore,

[TABLE]

∎

B.4 Relevant variables

We express the elements $\mu\in V_{X}$ and $\nu\in V_{Y}$ in terms of the marginals $p_{X}$ and $p_{Y}$ as simple products:

$$\mu(x)=p_{X}(x)\,f(x)\quad\text{and}\quad\nu(y)=p_{Y}(y)\,g(y)$$

for all $x,y$ , where $f$ and $g$ are real functions of $x$ and $y$ .

The inner products then simply become correlations among variables. Using also $\mu^{\prime}=p_{X}f^{\prime}$ and $\nu^{\prime}=p_{Y}g^{\prime}$ , we obtain

[TABLE]

These are simple expectation values with respect to $p$ , which we assumed is the type of quantity we can evaluate for arbitrary functions $f,f^{\prime},g,g^{\prime}$ .

Since $\mathcal{N}^{*}\mathcal{N}$ is self-adjoint in terms of this inner product, its eigenvectors $u_{i}$ are orthogonal, and hence the corresponding variables $a_{i}$ defined by $u_{i}(x)=p_{X}(x)a_{i}(x)$ are uncorrelated. Indeed,

[TABLE]

for all $i,j$ . Moreover, accounting for the eigenvector $u_{0}=p_{X}$ (corresponding to the constant feature $a_{0}(x)=1$ for all $x$ ),

[TABLE]

for all $i\neq 0$ . Hence we trivially have

[TABLE]

for all $i,j\neq 0$ .

Likewise for the eigenvectors of $\mathcal{N}\mathcal{N}^{*}$ . If $v_{i}(y)=p_{Y}(y)b_{i}(y)$ :

[TABLE]

for all $i,j\neq 0$ .

Importantly, this does not mean that the variables $u_{1},u_{2},\dots$ or $v_{1},v_{2},\dots$ are “disentangled”: being uncorrelated does not make them statistically independent. These variables represent components in the space of probability vectors, rather than the “sample” space. They should be understood as spanning a subspace of the space of functions over the relevant independent variables. We discuss this in more detail in Section 5.3.

B.5 Corners of $\mathcal{N}$ and loss function

The final piece of the puzzle we need is the ability to express the components (corners) of $\mathcal{N}$ and $\mathcal{N}^{*}$ in the span of possibly non-orthogonal families of variables.

Let us therefore consider two arbitrary families $f_{1},\dots,f_{k_{0}}$ and $g_{1},\dots,g_{k_{0}}$ of variables, which respectively represent the vectors $p_{X}f_{j}\in V_{X}$ and $p_{Y}g_{j}\in V_{Y}$ .

Firstly, we need matrices representing the components of the inner products on $V_{X}$ and $V_{Y}$ . Those are the symmetric matrices

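$$K_{ij}=\langle p_{X}f_{i},p_{X}f_{j}\rangle_{X}=\sum_{x}p_{X}(x)\,f_{i}(x)\,f_{j}(x)\quad\text{and}\quad L_{ij}=\langle p_{Y}g_{i},p_{Y}g_{j}\rangle_{Y}=\sum_{y}p_{Y}(y)\,g_{i}(y)\,g_{j}(y).$$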

The components $N_{ij}$ of $\mathcal{N}$ are defined by

[TABLE]

Taking the inner product with $p_{Y}g_{k}$ , we obtain

[TABLE]

The left-hand side can be computed using Equ. 25. It is the matrix

$$A_{kj}=\sum_{x,y}p(x,y)\,g_{k}(y)\,f_{j}(x).$$

Therefore, in matrix notation, Equ. (47) is $A=LN$ , or

$$N=L^{-1}A.$$

The components $N^{*}_{ij}$ of $\mathcal{N}^{*}$ are obtained by just swapping $X$ and $Y$ , yielding

$$N^{*}=K^{-1}A^{\top}.$$

Hence the singular values of the corner of $\mathcal{N}$ defined by the variables $f_{j}$ and $g_{j}$ are just the square roots of the eigenvalues of the matrix $N^{*}N=K^{-1}A^{\top}L^{-1}A$. In order to find the variables $f_{j}$ and $g_{j}$ with the same span as the first $k_{0}$ eigenvectors $u_{j}$, $v_{j}$, we just need to maximize all the eigenvalues of $N^{*}N$. A simple way to do this is to use (minus) the trace of $N^{*}N$ as loss function, since it is the sum of the squares of the singular values. We call ${\rm Tr}\,(N^{*}N)$ the relevance of the subspaces defined by the variables $f_{j}$ and $g_{i}$. This yields the loss/cost function:

[TABLE]

Once optimal variables have been found, one can obtain the components of the eigenvectors in the span of $f_{1},\dots,f_{k_{0}}$ through standard numerical diagonalization of $N^{*}N$ .
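The relation between $N^{*}N$ and the singular values can be checked on a discrete example where expectations are computed exactly. The sketch below makes the simplifying assumption that the families $f_{j}$ and $g_{i}$ are random bases of all functions of $x$ and $y$, so that the corner is the whole map; the variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
p = rng.random((3, 4))
p /= p.sum()                    # joint p[x, y]
pX, pY = p.sum(axis=1), p.sum(axis=0)

F = rng.normal(size=(3, 3))     # F[x, j] = f_j(x): a random basis of functions of x
G = rng.normal(size=(4, 4))     # G[y, i] = g_i(y): a random basis of functions of y

K = F.T @ (pX[:, None] * F)     # K_ij = E[f_i f_j]
L = G.T @ (pY[:, None] * G)     # L_ij = E[g_i g_j]
A = G.T @ p.T @ F               # A_ij = E[g_i(Y) f_j(X)]

NstarN = np.linalg.solve(K, A.T) @ np.linalg.solve(L, A)   # K^{-1} A^T L^{-1} A
eig = np.sort(np.linalg.eigvals(NstarN).real)[::-1]
relevance = np.trace(NstarN)

# Reference: singular values of N computed directly from the joint distribution.
eta = np.linalg.svd(p / np.sqrt(np.outer(pX, pY)), compute_uv=False)
```

The eigenvalues `eig` coincide with `eta ** 2` although the families are neither orthogonal nor normalized, which reflects the fact that the relevance depends only on their span.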

B.6 Inference

The variables minimizing $C$ can be used to infer one variable from the other. For instance, given $y$ , the inferred probability distribution over $x$ is given by $p_{X|Y}(x|y)=\mathcal{N}^{*}(\delta_{y})(x),$ where $\delta_{y}(y^{\prime})$ is $1$ when $y=y^{\prime}$ and zero otherwise. In order to compute this, we first need the components of the distribution $\delta_{y}$ in terms of the family $p_{Y}g_{1},\dots,p_{Y}g_{k_{0}}$ , i.e., the real numbers $(\delta_{y})_{j}$ such that

[TABLE]

where $\langle r,p_{Y}g_{i}\rangle_{Y}=0$ for all $i$. Taking the inner product with $p_{Y}g_{j}$, we obtain

[TABLE]

where the left-hand side is also just

[TABLE]

Therefore the components of $\delta_{y}$ are explicitly

[TABLE]

It follows that

[TABLE]

Then, for instance, the expected inferred value of $X$ is

[TABLE]

For the inference of $Y$ from $x$ , we have

[TABLE]
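The inference procedure can be sketched on a discrete example. The code assumes the form $q(x|y)=p_{X}(x)\,f(x)^{\top}K^{-1}A^{\top}L^{-1}g(y)$, which is what the steps above give when the components of $\delta_{y}$ are $L^{-1}g(y)$ and $N^{*}=K^{-1}A^{\top}$. The families are random bases of all functions, so that the result can be compared with the exact conditional distribution; `infer_x_given_y` and `x_values` are illustrative names.

```python
import numpy as np

rng = np.random.default_rng(0)
p = rng.random((3, 4))
p /= p.sum()                    # joint p[x, y]
pX, pY = p.sum(axis=1), p.sum(axis=0)
F = rng.normal(size=(3, 3))     # F[x, j] = f_j(x)
G = rng.normal(size=(4, 4))     # G[y, i] = g_i(y)
K = F.T @ (pX[:, None] * F)
L = G.T @ (pY[:, None] * G)
A = G.T @ p.T @ F

def infer_x_given_y(y):
    """Approximate p(x|y) as p_X(x) * f(x)^T K^{-1} A^T L^{-1} g(y)."""
    coeffs_y = np.linalg.solve(L, G[y])             # components of delta_y
    coeffs_x = np.linalg.solve(K, A.T @ coeffs_y)   # apply N* = K^{-1} A^T
    return pX * (F @ coeffs_x)

x_values = np.array([0.0, 1.0, 2.0])                # illustrative values of X
q = infer_x_given_y(y=1)
expected_x = float(q @ x_values)                    # inferred expectation of X
```

With fewer variables than the full dimension, the same function returns the low-rank approximation instead of the exact conditional.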

Appendix C Analytical example

When $p(x,y)$ is any multivariate Gaussian distribution, everything can be computed analytically. Let us consider here the one-dimensional case. We use $p(x)\;\propto\;\exp\left({-{x^{2}}/{2\tau^{2}}}\right)$ , and the conditional $p(y|x)\;\propto\;\exp\left({-{(y-x)^{2}}/{2\sigma^{2}}}\right)$ . That is, $y$ is equal to $x$ but with some added Gaussian noise. This gives

[TABLE]

It was shown in [3] that the most relevant subspace of dimension $k_{0}$ on the variable $X$ is simply spanned by the variables

[TABLE]

for $n=0,\dots,k_{0}-1$. Similarly for $Y$:

[TABLE]

This independence of the relevant variables on the detailed parameters of $p$ is a general property of Gaussian joint distributions.

This means, for instance, that the most relevant feature ( $n=1$ ) for predicting the value of $X$ given $Y=y$ is simply $Y$ itself. The higher order variables have to do with inferring extra aspects of the probability distribution over $X$ .

A set of orthogonal variables can be obtained from the Gram-Schmidt procedure, which, if done from small to large $n$, must necessarily yield the eigenvectors $u_{n}$ and $v_{n}$. For illustration purposes, let us work with the non-orthogonal vectors $f_{n}$ and $g_{n}$, keeping only the first $k_{0}=3$ vectors.

The three matrices (correlators) we need can be easily computed:

[TABLE]

We obtain

[TABLE]

The eigenvalues of $M$ can be read off the diagonal, and the corresponding eigenvectors are $(1,0,0)$, $(0,1,0)$ and $(-\tau^{2},0,1)$, which means that the eigenfunctions are, in order, $u_{0}(x)=1$, $u_{1}(x)=x$ and $u_{2}(x)=x^{2}-\tau^{2}.$

Because we are working with continuous variables, the true rank of $\mathcal{N}$ is infinite, even for any finite cutoff on the singular values. Nevertheless, it is instructive to see how the approximate inference fares for rank $k_{0}=3$ . Given the value $y$ for $Y$ , the inferred distribution over $X$ is

[TABLE]

The approximately inferred first and second moments of $X$ are given by integrating the above times $x$ (resp. $x^{2}$) over $x$. We obtain

[TABLE]

which are actually exact: they are equal to the first two moments of $X$ over $p_{X|Y}$ as given in Eq. (59).

In fact, it is easy to see that this would be true for the first $k_{0}-1$ moments had we kept the $k_{0}$ most relevant variables.
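This exactness can be checked numerically with sampled data. The sketch below assumes the monomial variables $f_{n}(x)=x^{n}$ and $g_{n}(y)=y^{n}$ for $n=0,1,2$, and compares the inferred moments with the standard Gaussian posterior moments $\mathbb{E}[X|y]=\alpha y$ and $\mathbb{E}[X^{2}|y]=\alpha^{2}y^{2}+\alpha\sigma^{2}$, where $\alpha=\tau^{2}/(\tau^{2}+\sigma^{2})$. The values of `tau`, `sigma` and the sample size are arbitrary, and `inferred_moment` is an illustrative helper.

```python
import numpy as np

rng = np.random.default_rng(0)
tau, sigma, n = 1.0, 0.5, 200_000
x = rng.normal(0.0, tau, size=n)
y = x + rng.normal(0.0, sigma, size=n)         # y is x plus Gaussian noise

F = np.stack([np.ones(n), x, x**2], axis=1)    # f_n(x) = x^n, n = 0, 1, 2
G = np.stack([np.ones(n), y, y**2], axis=1)    # g_n(y) = y^n, n = 0, 1, 2
K, L, A = F.T @ F / n, G.T @ G / n, G.T @ F / n

def inferred_moment(power, y0):
    """Estimate E[X^power | Y = y0] within the rank-3 approximation."""
    m = F.T @ x**power / n                     # m_j = E[X^power f_j(X)]
    g0 = np.array([1.0, y0, y0**2])
    return float(m @ np.linalg.solve(K, A.T @ np.linalg.solve(L, g0)))

alpha = tau**2 / (tau**2 + sigma**2)
mean_est = inferred_moment(1, y0=1.0)          # compare with alpha * y0
second_est = inferred_moment(2, y0=1.0)        # compare with alpha^2 y0^2 + alpha sigma^2
```

Since the correlators are estimated from samples, the agreement holds only up to statistical fluctuations.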

Bibliography

[1] Cédric Bény and Tobias J Osborne. Renormalisation as an inference problem. (arXiv:1310.3188), 2013.
[2] Cédric Bény and Tobias J Osborne. The renormalisation group via statistical inference. New J. Phys., 17:083005, 2015. (arXiv:1402.4949).
[3] HO Lancaster. The structure of bivariate distributions. The Annals of Mathematical Statistics, 29(3):719–736, 1958.
[4] Tomer Michaeli, Weiran Wang, and Karen Livescu. Nonparametric canonical correlation analysis. In International Conference on Machine Learning, pages 1967–1976, 2016.
[5] Galen Andrew, Raman Arora, Jeff Bilmes, and Karen Livescu. Deep canonical correlation analysis. In International Conference on Machine Learning, pages 1247–1255, 2013.
[6] Harold Hotelling. Relations between two sets of variates. Biometrika, 28(3/4):321–377, 1936.
[7] Diederik P Kingma and Max Welling. Auto-encoding variational Bayes. (arXiv:1312.6114), 2013.
[8] Raban Iten, Tony Metger, Henrik Wilming, Lídia Del Rio, and Renato Renner. Discovering physical concepts with neural networks. (arXiv:1807.10300), 2018.
