A Neural Network model with Bidirectional Whitening

Yuki Fujimoto; Toru Ohira

arXiv:1704.07147·stat.ML·July 11, 2018

TL;DR

This paper introduces a bidirectional whitening neural network model that enhances natural gradient descent in multilayer perceptrons by applying whitening during both feed-forward and back-propagation phases, improving learning efficiency.

Contribution

It extends whitened neural networks by incorporating whitening in both directions, offering a novel approach to natural gradient descent in neural networks.

Findings

01 Effective on MNIST handwritten character recognition

02 Improves training efficiency of multilayer perceptrons

03 Demonstrates the benefits of bidirectional whitening

Abstract

We present a new model and algorithm that performs efficient natural gradient descent for multilayer perceptrons. Natural gradient descent was originally proposed from the point of view of information geometry, and it performs steepest-descent updates on manifolds in a Riemannian space. In particular, we extend the approach taken by the "Whitened neural networks" model: we apply the whitening process not only in the feed-forward direction, as in the original model, but also in the back-propagation phase. The efficacy of this "Bidirectional whitened neural networks" model is shown by applying it to handwritten character recognition data (MNIST).

Equations

$\bm{w}\equiv(\mathrm{vec}(\bar{W}^{(1)})^{T},\dots,\mathrm{vec}(\bar{W}^{(N)})^{T})^{T}$

$G(\bm{w})\equiv E\left[\nabla l(X;\bm{w})\nabla l(X;\bm{w})^{T}\right]$

$g_{ij}(\bm{w})=E\left[\frac{\partial l}{\partial w_{i}}(X;\bm{w})\frac{\partial l}{\partial w_{j}}(X;\bm{w})\right]=\int\frac{\partial l}{\partial w_{i}}(\bm{x};\bm{w})\frac{\partial l}{\partial w_{j}}(\bm{x};\bm{w})\,p(\bm{x};\bm{w})\,d\bm{x}$

$\bm{w}(t+1)=\bm{w}(t)-\eta(t)G^{-1}(\bm{w}(t))\nabla M(\bm{w}(t))$

$\frac{\partial l}{\partial\bm{w}}\equiv\left(\mathrm{vec}\left(\frac{\partial l}{\partial\bar{W}^{(1)}}\right)^{T},\dots,\mathrm{vec}\left(\frac{\partial l}{\partial\bar{W}^{(N)}}\right)^{T}\right)^{T}$

$\frac{\partial l}{\partial\bar{W}^{(i)}}=\bm{\delta}^{(i)}\bar{\bm{z}}^{(i-1)^{T}}$

$\tilde{G}_{i,j}=O\quad(i\neq j)$

$\tilde{G}_{i,i}=E\left[\bar{\bm{z}}^{\dagger^{(i-1)}}\bar{\bm{z}}^{\dagger^{(i-1)^{T}}}\right]\otimes E\left[\bm{\delta}^{(i)}\bm{\delta}^{(i)^{T}}\right]$

$E\left[\bar{\bm{z}}^{\dagger^{(i-1)}}\bar{\bm{z}}^{\dagger^{(i-1)^{T}}}\right]=I$

$\begin{pmatrix}1&E\left[\bm{z}^{\dagger^{(i-1)^{T}}}\right]\\E\left[\bm{z}^{\dagger^{(i-1)}}\right]&E\left[\bm{z}^{\dagger^{(i-1)}}\bm{z}^{\dagger^{(i-1)^{T}}}\right]\end{pmatrix}=I$

$\check{Z}_{i-1,i-1}\equiv E\left[(\bm{z}^{(i-1)}-\bm{c}^{(i-1)})(\bm{z}^{(i-1)}-\bm{c}^{(i-1)})^{T}\right]$

$E\left[\bm{z}^{\dagger^{(i-1)}}\bm{z}^{\dagger^{(i-1)^{T}}}\right]=U^{(i-1)}\check{Z}_{i-1,i-1}U^{(i-1)^{T}}=I$

$\check{Z}_{i-1,i-1}=P\Lambda P^{T}$

$W^{\dagger^{(i)}}U^{(i-1)}(\bm{z}^{(i-1)}-\bm{c}^{(i-1)})+\bm{b}^{\dagger^{(i)}}=W^{\dagger^{(i)}}_{new}U^{(i-1)}_{new}(\bm{z}^{(i-1)}-\bm{c}^{(i-1)}_{new})+\bm{b}^{\dagger^{(i)}}_{new}$

Full text

A Neural Network model with Bidirectional Whitening

Yuki Fujimoto∗ and Toru Ohira∗∗

Graduate School of Mathematics, Nagoya University, Nagoya, Japan

∗E-mail: [email protected] ∗∗E-mail: [email protected]

1 Introduction

Interest in developing efficient learning algorithms for multilayer neural networks has grown rapidly with the recent surge of deep learning and other machine learning methods. Natural gradient descent (NGD) is considered one of the strong methods. It was proposed from the point of view of information geometry[1], where neural networks are considered as manifolds in a Riemannian space with a metric given by the Fisher information matrix (FIM). The learning process can then be interpreted as an optimization problem for a function in a Riemannian space. The idea of applying the NGD to multilayer neural networks was initiated by Amari. Recently, it has regained interest from machine learning researchers[6, 9].

However, using the NGD is difficult: the computational cost of estimating the FIM and obtaining its inverse is high. Much research effort has gone into overcoming this difficulty[4, 7, 8, 5, 10].

In this paper, we focus on one such approach and extend the work of [4], which proposed the “Whitened neural networks” model. That work explores a neural network architecture whose FIM is closer to the identity matrix, at lower computational cost. Extra neurons and connections are added to achieve this whitening approximation. In particular, the scheme is applied in the forward direction, to the inputs of the neurons.

Our main proposal is to push the FIM still closer to the identity by also implementing the whitening process in the back-propagation phase. This model, which we term the “bidirectional whitened neural networks” model, is described in the following sections. Its efficacy is shown through an application to handwritten character recognition data (MNIST).

2 Multilayer Perceptron and Natural Gradient Descent

We present here a brief review of the multilayer Perceptron and the natural gradient descent, on which we focus in this paper. The first level of approximation of the FIM is also discussed.

2.1 Multilayer Perceptron

The multilayer Perceptron is a neural network model with a feed-forward structure and no recurrent loops. It has multiple layers, called the input, hidden, and output layers, and the neurons of successive layers are connected all-to-all. Let us consider an $N$-layer Perceptron, and set the values of the input as $\bm{z}^{(0)}=\bm{x}$ , the hidden layer values as $\bm{z}^{(i)}=\bm{h}^{(i)}$ , ( $1\leq i\leq N-1$ ), and the output of the entire network as $\bm{z}^{(N)}=f(\bm{x};\bm{w})$ .

This $f(\bm{x};\bm{w})$ can be viewed as a function of $\bm{x}$ by fixing the parameters $\bm{w}$ , and is thus called a “multilayer Perceptron function”. The rule for computing the values of layer $i$ from those of layer $i-1$ is given as follows ( $1\leq i\leq N$ ).

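$$\bm{a}^{(i)}=W^{(i)}\bm{z}^{(i-1)}+\bm{b}^{(i)} \tag{1}$$
$$\bm{a}^{(i)}=\bar{W}^{(i)}\bar{\bm{z}}^{(i-1)} \tag{2}$$
$$\bm{z}^{(i)}=\bm{\phi}^{(i)}(\bm{a}^{(i)}) \tag{3}$$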

Here, $\bm{\phi}^{(i)}(\cdot)$ is an activation function applied to each element of $\bm{a}^{(i)}$ . Typically, the sigmoid function or the ReLU function is used as the activation function. Also, (2) is a shortened notation obtained by setting $\bar{W}^{(i)}\equiv(\bm{b}^{(i)},W^{(i)}),\bar{\bm{z}}^{(i)}\equiv(1,\bm{z}^{(i)^{T}})^{T}$ .
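The layer rule and the shortened notation of (2) can be illustrated with a minimal NumPy sketch. The function names, the layer sizes, the random seed, and the choice of the sigmoid for every layer are assumptions made for illustration only; they are not taken from the paper.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def mlp_forward(x, weights, biases, activation=sigmoid):
    """Return [z^(0), ..., z^(N)] for one input vector x."""
    zs = [x]
    for W, b in zip(weights, biases):
        W_bar = np.column_stack([b, W])           # \bar W = (b, W)
        z_bar = np.concatenate([[1.0], zs[-1]])   # \bar z = (1, z^T)^T
        a = W_bar @ z_bar                         # equals W @ z + b
        zs.append(activation(a))
    return zs

# Hypothetical small network with layer sizes 4-3-2.
rng = np.random.default_rng(0)
sizes = [4, 3, 2]
weights = [rng.normal(size=(m, n)) for n, m in zip(sizes[:-1], sizes[1:])]
biases = [rng.normal(size=m) for m in sizes[1:]]
x = rng.normal(size=sizes[0])
zs = mlp_forward(x, weights, biases)
```

The augmented form $\bar{W}\bar{\bm{z}}$ is convenient later because the gradient with respect to $\bar{W}^{(i)}$ then covers the weights and the bias at once.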

Hence, the multilayer Perceptron function (MPF) is defined by setting $\{(W^{(i)},\bm{b}^{(i)})\}$ . It is often convenient to denote these parameters by $\bm{w}$ , defined by

$$\bm{w}\equiv(\mathrm{vec}(\bar{W}^{(1)})^{T},\dots,\mathrm{vec}(\bar{W}^{(N)})^{T})^{T} \tag{4}$$

where $\mathrm{vec}(A)$ denotes the vector obtained by stacking the column vectors of a matrix $A$ .

The learning process of multilayer Perceptrons is an optimization problem set by the following statistical inference. The training data of input and output pairs is given as $D\equiv\{(\bm{x}_{k},\bm{y}_{k})\}_{k=1}^{K}$ . We assume that these samples are generated independently from the same joint distribution $Q(X,Y)$ . In order to estimate this probabilistic input-output relation, a statistical model $\{p(\bm{x},\bm{y};\bm{w})\}_{\bm{w}\in\Theta}$ is considered using the MPF. Here $p(\bm{x},\bm{y};\bm{w})$ is a joint probability density function and $\Theta\subset\mathbb{R}^{M}$ is the set of parameters. The problem is to find the parameter $\bm{w}$ which makes $p(\bm{x},\bm{y};\bm{w})$ the best estimate of $Q(X,Y)$ . The maximum likelihood method is employed to obtain such a parameter $\bm{w^{*}}$ .

[TABLE]

It is known that this estimation is the same as the following minimization problem.

[TABLE]

Here, we have set the target function to be minimized as $M(\bm{w})$ . Efficient algorithms for this optimization problem are the central issue in the following.

2.2 Natural Gradient Method

The natural gradient method is a steepest descent method in a Riemannian space. It originates from information geometry, where statistical models are manifolds in a Riemannian space whose metric is the Fisher information matrix[2]. Thus, we can view learning in multilayer Perceptrons, as formulated in 2.1, as an optimization problem in a Riemannian space.

Let us start by defining the Fisher information matrix and the Natural Gradient Descent.

Definition: Fisher Information Matrix

We set $l(\bm{x};\bm{w})\equiv\log p(\bm{x};\bm{w})$ . For $\bm{w}\in\Theta$ , a square matrix $G(\bm{w})=(g_{ij}(\bm{w}))$ is defined as follows.

$$G(\bm{w})\equiv E\left[\nabla l(X;\bm{w})\nabla l(X;\bm{w})^{T}\right] \tag{8}$$

Element by element, (8) can be expressed as

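$$g_{ij}(\bm{w})=E\left[\frac{\partial l}{\partial w_{i}}(X;\bm{w})\frac{\partial l}{\partial w_{j}}(X;\bm{w})\right]=\int\frac{\partial l}{\partial w_{i}}(\bm{x};\bm{w})\frac{\partial l}{\partial w_{j}}(\bm{x};\bm{w})\,p(\bm{x};\bm{w})\,d\bm{x} \tag{9}$$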

We call this matrix $G$ the Fisher information matrix (FIM).

Definition: Natural Gradient Descent

We call the following gradient method the Natural Gradient Descent (NGD).

$$\bm{w}(t+1)=\bm{w}(t)-\eta(t)G^{-1}(\bm{w}(t))\nabla M(\bm{w}(t)) \tag{10}$$

Here $\eta(t)$ is the learning rate.

Then, $-G^{-1}(\bm{w}(t))\nabla M(\bm{w}(t))$ is the direction of maximal decrease of the target function $M$ for a fixed step size. We note that the NGD reduces to the ordinary gradient descent when $G$ is the identity matrix.
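As a rough illustration of the NGD update, the sketch below estimates the FIM by averaging outer products of score vectors and then takes one step. The toy Gaussian model, the sample sizes, the `damping` term, and all function names are assumptions introduced here for illustration; they do not come from the paper.

```python
import numpy as np

def natural_gradient_step(w, grad_M, scores, lr, damping=1e-8):
    """One update w <- w - lr * G^{-1} grad_M.

    scores: (K, M) array whose rows are grad_w l(x_k; w) for samples x_k
            drawn from the model p(x; w); their averaged outer product is a
            Monte Carlo estimate of the FIM.
    damping: small ridge term, an assumption added so that the linear solve
            is well-posed; it is not part of the definition.
    """
    G = scores.T @ scores / scores.shape[0]
    G = G + damping * np.eye(G.shape[0])
    return w - lr * np.linalg.solve(G, grad_M)

# Hypothetical toy model: x ~ N(w, diag(s**2)) with known scales s.
rng = np.random.default_rng(0)
s = np.array([0.5, 2.0])
data = rng.normal(loc=[1.0, -2.0], scale=s, size=(50_000, 2))
w = np.zeros(2)

model_samples = rng.normal(loc=w, scale=s, size=(50_000, 2))
scores = (model_samples - w) / s**2          # grad_w log p(x; w)
grad_M = -((data - w) / s**2).mean(axis=0)   # M(w) = -mean log-likelihood

w_ngd = natural_gradient_step(w, grad_M, scores, lr=1.0)
w_gd = w - 1.0 * grad_M                      # ordinary gradient step, same rate
```

In this toy case the FIM is approximately $\mathrm{diag}(1/s^{2})$, so the natural gradient step is insensitive to the badly scaled coordinates and lands close to the sample mean, while the ordinary gradient step with the same rate does not.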

2.3 Approximation of the Fisher Information Matrix

As discussed in the previous section, the FIM and its inverse play important roles in the NGD. We thus present a preliminary approximation of the FIM that lessens the computational burden[7].

Let us first compute the FIM for the multilayer Perceptron (MLP). The probability density function associated with the multilayer Perceptron function (MPF) is given as follows.

[TABLE]

Also, the gradient vector is written concisely, as in (4), as

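$$\frac{\partial l}{\partial\bm{w}}\equiv\left(\mathrm{vec}\left(\frac{\partial l}{\partial\bar{W}^{(1)}}\right)^{T},\dots,\mathrm{vec}\left(\frac{\partial l}{\partial\bar{W}^{(N)}}\right)^{T}\right)^{T} \tag{12}$$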

Then, the FIM for the MLP is given as follows.

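$$G(\bm{w})=E\left[\frac{\partial l}{\partial\bm{w}}\frac{\partial l}{\partial\bm{w}}^{T}\right]=\begin{pmatrix}G_{1,1}&\cdots&G_{1,N}\\\vdots&\ddots&\vdots\\G_{N,1}&\cdots&G_{N,N}\end{pmatrix} \tag{13}$$
$$G_{i,j}\equiv E\left[\mathrm{vec}\left(\frac{\partial l}{\partial\bar{W}^{(i)}}\right)\mathrm{vec}\left(\frac{\partial l}{\partial\bar{W}^{(j)}}\right)^{T}\right] \tag{14}$$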

Hence, the FIM for the MLP is composed of the block matrices $G_{i,j}$ .

If we further set $\delta^{(i)}_{j}=\dfrac{\partial l}{\partial a^{(i)}_{j}}$ , the following is obtained.

$$\frac{\partial l}{\partial\bar{W}^{(i)}}=\bm{\delta}^{(i)}\bar{\bm{z}}^{(i-1)^{T}} \tag{15}$$

By putting (15) into (14), the $G_{i,j}$ can now be expressed as

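$$G_{i,j}=E\left[\bar{\bm{z}}^{(i-1)}\bar{\bm{z}}^{(j-1)^{T}}\otimes\bm{\delta}^{(i)}\bm{\delta}^{(j)^{T}}\right] \tag{16}$$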

(Here, $\otimes$ is the Kronecker product.)

For efficient computation, it is essential to approximate this FIM. The preliminary approximation consists of two steps.
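The step from (15) to the Kronecker form rests on the identity $\mathrm{vec}(\bm{\delta}\bar{\bm{z}}^{T})=\bar{\bm{z}}\otimes\bm{\delta}$ for column-stacking $\mathrm{vec}$. A small numerical check for a single sample, before taking the expectation, can be sketched as follows; the dimensions and the seed are arbitrary choices made for illustration.

```python
import numpy as np

def vec(A):
    """Stack the columns of A into one vector (column-major order)."""
    return A.reshape(-1, order="F")

rng = np.random.default_rng(0)
delta_i, zbar_i = rng.normal(size=3), rng.normal(size=5)
delta_j, zbar_j = rng.normal(size=2), rng.normal(size=4)

grad_i = np.outer(delta_i, zbar_i)   # dl/d(W_bar^(i)) = delta^(i) zbar^(i-1)^T
grad_j = np.outer(delta_j, zbar_j)

# Outer product of the vectorised gradients for one sample ...
lhs = np.outer(vec(grad_i), vec(grad_j))
# ... equals the Kronecker product of the two smaller outer products.
rhs = np.kron(np.outer(zbar_i, zbar_j), np.outer(delta_i, delta_j))
```

Both arrays have shape 15 x 8 here and agree entry by entry, which is the per-sample statement behind the block $G_{i,j}$.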

The first step approximation of $G_{i,j}$ is given as $\tilde{G}_{i,j}$ which is defined as follows.

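$$\tilde{G}_{i,j}\equiv E\left[\bar{\bm{z}}^{(i-1)}\bar{\bm{z}}^{(j-1)^{T}}\right]\otimes E\left[\bm{\delta}^{(i)}\bm{\delta}^{(j)^{T}}\right] \tag{17}$$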

This approximation interchanges the expectation of the Kronecker product with the Kronecker product of the expectations. The matrix $\tilde{G}$ , whose blocks are given by replacing $G_{i,j}$ of (13) with $\tilde{G}_{i,j}$ , is the first-step approximation of the FIM. We note that this approximation decomposes the FIM into two parts: $\bar{\bm{z}}^{(i-1)}$ (the feed-forward phase part) and $\bm{\delta}^{(i)}$ (the back-propagating phase part).

We perform the second step approximation on $\tilde{G}$ to obtain $\breve{G}$ .

$$\breve{G}\equiv\mathop{\mathrm{diag}}\nolimits(\tilde{G}_{1,1},\dots,\tilde{G}_{N,N}) \tag{18}$$

Here, $\mathop{\mathrm{diag}}\nolimits(\cdots)$ denotes a block diagonal matrix whose diagonal blocks are given by its arguments. In other words, $\breve{G}$ is obtained from $\tilde{G}$ by setting the off-diagonal blocks to the zero matrix,

$$\tilde{G}_{i,j}=O\quad(i\neq j) \tag{19}$$

This approximation allows us to compute the FIM layer by layer independently.
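One reason the layer-by-layer Kronecker form is cheap to use is the standard identity $(A\otimes B)^{-1}\mathrm{vec}(V)=\mathrm{vec}(B^{-1}VA^{-T})$, which avoids forming the large Kronecker matrix. The following is a minimal sketch of that identity, not the paper's algorithm; the function name and the random test matrices are assumptions, with `A` standing in for $E[\bar{\bm{z}}\bar{\bm{z}}^{T}]$ and `B` for $E[\bm{\delta}\bm{\delta}^{T}]$.

```python
import numpy as np

def vec(A):
    """Stack the columns of A into one vector (column-major order)."""
    return A.reshape(-1, order="F")

def layer_natural_gradient(grad_W, A, B):
    """Return B^{-1} grad_W A^{-T}, i.e. (A kron B)^{-1} applied to vec(grad_W).

    A: (n, n) factor from the layer inputs, B: (m, m) factor from the deltas,
    grad_W: (m, n) gradient with respect to the layer's weight matrix.
    """
    Y = np.linalg.solve(B, grad_W)       # B^{-1} grad_W
    return np.linalg.solve(A, Y.T).T     # Y A^{-T}

# Hypothetical symmetric positive definite factors for a 3 x 4 weight matrix.
rng = np.random.default_rng(0)
X, Y0 = rng.normal(size=(4, 4)), rng.normal(size=(3, 3))
A = X @ X.T + np.eye(4)
B = Y0 @ Y0.T + np.eye(3)
grad_W = rng.normal(size=(3, 4))

nat = layer_natural_gradient(grad_W, A, B)
direct = np.linalg.solve(np.kron(A, B), vec(grad_W))   # explicit 12 x 12 solve
```

The two results coincide, but the factored version only solves systems of size $n$ and $m$ rather than $nm$. When both factors are the identity, the natural gradient of the layer is simply the ordinary gradient.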

3 Whitened Neural Networks

In this section, we present algorithms which aim to perform Natural Gradient Descent efficiently with the approximated FIM, $\breve{G}$ .

3.1 Natural Gradient Descent by Whitening

Let us first describe Whitened Neural Networks[4]. The main idea of this method is to perform the NGD by reconfiguring the network and its parameters so that the FIM becomes closer to the identity matrix. When the FIM is the identity matrix, the NGD is the same as the ordinary gradient descent, and thus can be implemented simply and at lower computational cost.

3.1.1 Whitened Neural Network

The architecture of the Whitened Neural Networks (WNN) is obtained by changing (1) through (3) into the following form.

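$$\bm{z}^{\dagger^{(i-1)}}=U^{(i-1)}(\bm{z}^{(i-1)}-\bm{c}^{(i-1)}) \tag{20}$$
$$\bm{a}^{(i)}=W^{\dagger^{(i)}}\bm{z}^{\dagger^{(i-1)}}+\bm{b}^{\dagger^{(i)}} \tag{21}$$
$$\bm{z}^{(i)}=\bm{\phi}^{(i)}(\bm{a}^{(i)}) \tag{22}$$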

Here $\{(U^{(i-1)},\bm{c}^{(i-1)})\}$ are the new parameters introduced as “Whitening” parameters. $\{(W^{\dagger^{(i)}},\bm{b}^{\dagger^{(i)}})\}$ are the new model parameters associated with this new architecture. These are the ones which we want to estimate and update using gradient descent methods as in the normal multilayer Perceptrons.

Figure 1 shows the new architecture defined by (20), (21), (22), from the $(i-1)$ th layer to the $i$ th layer. The gray layer in Figure 1 is the newly inserted layer for the purpose of “whitening”. This change of network configuration is the essence of the WNN.

From Section 2.3, the approximated FIM block $\tilde{G}_{i,i}$ in the WNN is then expressed as follows.

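$$\tilde{G}_{i,i}=E\left[\bar{\bm{z}}^{\dagger^{(i-1)}}\bar{\bm{z}}^{\dagger^{(i-1)^{T}}}\right]\otimes E\left[\bm{\delta}^{(i)}\bm{\delta}^{(i)^{T}}\right] \tag{23}$$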

The essential idea of the whitening is to make $\breve{G}$ closer to the identity by defining the whitening parameters $\{(U^{(i-1)},\bm{c}^{(i-1)})\}$ as

$$E\left[\bar{\bm{z}}^{\dagger^{(i-1)}}\bar{\bm{z}}^{\dagger^{(i-1)^{T}}}\right]=I \tag{24}$$

for each $i$ , and then performing the gradient descent. (Our idea, which will be described later in 3.2, is to further extend the whitening to the latter factor $E\left[\bm{\delta}^{(i)}\bm{\delta}^{(i)^{T}}\right]$ in (23).)

3.1.2 Updating of the Whitening Parameters

We calculate here explicitly $\{(U^{(i-1)},\bm{c}^{(i-1)})\}$ , which satisfies the condition (24).

As $\bar{\bm{z}}^{\dagger^{(i-1)}}=(1,\bm{z}^{\dagger^{(i-1)^{T}}})^{T}$ , (24) can be decomposed into

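$$E\left[\bar{\bm{z}}^{\dagger^{(i-1)}}\bar{\bm{z}}^{\dagger^{(i-1)^{T}}}\right]=\begin{pmatrix}1&E\left[\bm{z}^{\dagger^{(i-1)^{T}}}\right]\\E\left[\bm{z}^{\dagger^{(i-1)}}\right]&E\left[\bm{z}^{\dagger^{(i-1)}}\bm{z}^{\dagger^{(i-1)^{T}}}\right]\end{pmatrix}=I \tag{25}$$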

Thus,

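$$E\left[\bm{z}^{\dagger^{(i-1)}}\right]=\bm{0} \tag{26}$$
$$E\left[\bm{z}^{\dagger^{(i-1)}}\bm{z}^{\dagger^{(i-1)^{T}}}\right]=I \tag{27}$$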

are required to satisfy this condition.

Let us look at these conditions. (26) can be satisfied by

$$\bm{c}^{(i-1)}=E\left[\bm{z}^{(i-1)}\right] \tag{28}$$

Also, for (27), we first define the matrix $\check{Z}_{i-1,i-1}$ as follows.

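$$\check{Z}_{i-1,i-1}\equiv E\left[(\bm{z}^{(i-1)}-\bm{c}^{(i-1)})(\bm{z}^{(i-1)}-\bm{c}^{(i-1)})^{T}\right] \tag{29}$$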

Then, (27) becomes

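$$E\left[\bm{z}^{\dagger^{(i-1)}}\bm{z}^{\dagger^{(i-1)^{T}}}\right]=U^{(i-1)}\check{Z}_{i-1,i-1}U^{(i-1)^{T}}=I \tag{30}$$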

Because $\check{Z}_{i-1,i-1}$ is a symmetric matrix, there exists an orthogonal matrix $P$ which diagonalizes it.

$$\check{Z}_{i-1,i-1}=P\Lambda P^{T} \tag{31}$$

Here $\Lambda$ is the diagonal matrix of eigenvalues. Then, if we set

$$U^{(i-1)}=(\Lambda+\varepsilon I)^{-\frac{1}{2}}P^{T} \tag{32}$$

the condition (30) is approximately satisfied. (Here, $\varepsilon$ is a small positive constant to avoid division by zero.)

Through this process, called the whitening process, we update the whitening parameters according to (28) and (32) so that (24) is satisfied. We note that this update requires the values of $\bm{z}^{(i-1)}$ computed in the feed-forward phase.
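The whitening update can be sketched in NumPy as follows. This is a sketch under stated assumptions rather than the authors' implementation: the expectations are replaced by averages over a batch of sampled activations, $U$ is taken as $(\Lambda+\varepsilon I)^{-1/2}P^{T}$ from the eigendecomposition of the centred covariance, and the function name, the value of `eps`, and the synthetic data are hypothetical.

```python
import numpy as np

def whitening_params(Z, eps=1e-5):
    """Estimate (U, c) from samples Z of shape (K, M) of z^(i-1)."""
    c = Z.mean(axis=0)                      # c = E[z]
    Zc = Z - c
    cov = Zc.T @ Zc / Z.shape[0]            # sample estimate of Z_check
    lam, P = np.linalg.eigh(cov)            # cov = P diag(lam) P^T
    U = (P / np.sqrt(lam + eps)).T          # (Lambda + eps I)^{-1/2} P^T
    return U, c

# Hypothetical correlated, shifted activations.
rng = np.random.default_rng(0)
mix = np.array([[2.0, 0.0, 0.0], [1.0, 1.0, 0.0], [0.5, -1.0, 3.0]])
Z = rng.normal(size=(5000, 3)) @ mix + np.array([1.0, -2.0, 0.5])

U, c = whitening_params(Z)
Z_dag = (Z - c) @ U.T                       # rows are z_dagger = U (z - c)
second_moment = Z_dag.T @ Z_dag / Z.shape[0]
```

On the batch used for the estimate, `Z_dag` has zero mean and a second moment that differs from the identity only through `eps`. The eigendecomposition is the step whose cost grows cubically with the layer width.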

3.1.3 Updating of the model parameters

We now turn to the updating of the model parameters $\{(W^{\dagger^{(i)}},\bm{b}^{\dagger^{(i)}})\}$ . We must take care that including the whitening process and its associated layer does not change the value of the multilayer Perceptron function (MPF) itself. Concretely, assume that the whitening parameters $\{(U^{(i-1)},\bm{c}^{(i-1)})\}$ are updated to $\{(U^{(i-1)}_{new},\bm{c}^{(i-1)}_{new})\}$ . We want to keep the value of (21) unchanged by this update. This places a constraint on the updated model parameters $\{(W^{\dagger^{(i)}}_{new},\bm{b}^{\dagger^{(i)}}_{new})\}$ . Namely, for any value of $\bm{z}^{(i-1)}$ , the following must be satisfied.

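$$W^{\dagger^{(i)}}U^{(i-1)}(\bm{z}^{(i-1)}-\bm{c}^{(i-1)})+\bm{b}^{\dagger^{(i)}}=W^{\dagger^{(i)}}_{new}U^{(i-1)}_{new}(\bm{z}^{(i-1)}-\bm{c}^{(i-1)}_{new})+\bm{b}^{\dagger^{(i)}}_{new} \tag{33}$$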

We can obtain the following by solving these equations.

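$$W^{\dagger^{(i)}}_{new}=W^{\dagger^{(i)}}U^{(i-1)}\left(U^{(i-1)}_{new}\right)^{-1} \tag{34}$$
$$\bm{b}^{\dagger^{(i)}}_{new}=\bm{b}^{\dagger^{(i)}}-W^{\dagger^{(i)}}U^{(i-1)}(\bm{c}^{(i-1)}-\bm{c}^{(i-1)}_{new}) \tag{35}$$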

By putting together (20) and (21), we can set $\{(W^{(i)},\bm{b}^{(i)})\}$ as

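$$W^{(i)}=W^{\dagger^{(i)}}U^{(i-1)} \tag{36}$$
$$\bm{b}^{(i)}=\bm{b}^{\dagger^{(i)}}-W^{\dagger^{(i)}}U^{(i-1)}\bm{c}^{(i-1)} \tag{37}$$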

Using these $\{(W^{(i)},\bm{b}^{(i)})\}$ , we can re-write (34) and (35) as

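$$W^{\dagger^{(i)}}_{new}=W^{(i)}\left(U^{(i-1)}_{new}\right)^{-1} \tag{38}$$
$$\bm{b}^{\dagger^{(i)}}_{new}=\bm{b}^{(i)}+W^{(i)}\bm{c}^{(i-1)}_{new} \tag{39}$$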

Thus, we can keep the MPF the same by first updating the whitening parameters as in (28) and (32) and then updating the model parameters with (38) and (39).

As the model parameters change, the values of $E[\bm{z}^{(i-1)}],\check{Z}_{i-1,i-1}$ change, which in turn requires updating the whitening parameters to keep the FIM close to the identity matrix. However, it is computationally expensive to update both sets of parameters at every iteration. In particular, updating the whitening parameters for a layer of $M$ neurons takes $O(M^{3})$ computation. Thus, in actual implementations, the whitening parameters are updated at fixed intervals[4], although this causes a gradual deviation from the NGD between successive updates of the whitening parameters.

The method described above is called “Projected Natural Gradient Descent” (PRONG)[4], and is outlined in Algorithm 1.
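The invariance requirement can be checked numerically. The sketch below derives the new model parameters directly from the condition that the pre-activation of (21) is unchanged for every $\bm{z}^{(i-1)}$: it first forms the effective weight $W=W^{\dagger}U$ and bias $\bm{b}=\bm{b}^{\dagger}-W\bm{c}$ of the equivalent ordinary layer, then re-expresses them under the new whitening parameters. The function name, the dimensions, and the random values are assumptions for illustration, and $U_{new}$ is assumed to be invertible.

```python
import numpy as np

def reproject(W_dag, b_dag, U, c, U_new, c_new):
    """Return (W_dag_new, b_dag_new) that keep W_dag U (z - c) + b_dag unchanged."""
    W = W_dag @ U                                  # effective weight of the plain layer
    b = b_dag - W @ c                              # effective bias of the plain layer
    W_dag_new = np.linalg.solve(U_new.T, W.T).T    # W U_new^{-1}
    b_dag_new = b + W @ c_new
    return W_dag_new, b_dag_new

# Hypothetical layer with 4 inputs and 3 outputs.
rng = np.random.default_rng(0)
m, n = 3, 4
W_dag, b_dag = rng.normal(size=(m, n)), rng.normal(size=m)
U = np.eye(n) + 0.1 * rng.normal(size=(n, n))
U_new = np.eye(n) + 0.1 * rng.normal(size=(n, n))
c, c_new = rng.normal(size=n), rng.normal(size=n)

W_dag_new, b_dag_new = reproject(W_dag, b_dag, U, c, U_new, c_new)

z = rng.normal(size=n)
a_old = W_dag @ (U @ (z - c)) + b_dag
a_new = W_dag_new @ (U_new @ (z - c_new)) + b_dag_new
```

Because `a_old` and `a_new` agree for any `z`, changing the whitening parameters in this way leaves the network function untouched and only changes the coordinates in which gradient descent is performed.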

3.2 Extension of Whitening

Here, we describe our proposal, an extended whitening algorithm based on 3.1.

In the whitening method described above, the whitening parameters $\{(U^{(i)},\bm{c}^{(i)})\}$ are updated in order to keep the approximated FIM, $\breve{G}$ , close to the identity matrix. This makes the first factor $E\left[\bar{\bm{z}}^{\dagger^{(i-1)}}\bar{\bm{z}}^{\dagger^{(i-1)^{T}}}\right]$ in

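$$\tilde{G}_{i,i}=E\left[\bar{\bm{z}}^{\dagger^{(i-1)}}\bar{\bm{z}}^{\dagger^{(i-1)^{T}}}\right]\otimes E\left[\bm{\delta}^{(i)}\bm{\delta}^{(i)^{T}}\right] \tag{40}$$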

closer to the identity matrix.

The main idea of our method is to bring the second factor $E\left[\bm{\delta}^{(i)}\bm{\delta}^{(i)^{T}}\right]$ toward the identity as well, so that $\tilde{G}_{i,i}$ is even better approximated by the identity matrix. This means that we implement the whitening process not only in the feed-forward phase but also in the back-propagating phase.

3.2.1 Bidirectional Whitened Neural Networks

In order to perform the back-whitening, we modify the forward-whitening process described by (20), (21) and (22) into the following.

[TABLE]

Here, $\{R^{(i)^{T}}\}$ is a newly introduced parameter, called the back-whitening parameter.

As in Figure 1, Figure 2 shows the architecture of this extended method defined by (41), (42), (43), (44). The dark gray part in Figure 2 is the newly introduced layer to accommodate the back-whitening parameter $\{R^{(i)^{T}}\}$ .

As mentioned above, the proposed method performs the whitening process in both the feed-forward and back-propagating phases. Thus, we call this new architecture the Bidirectional Whitened Neural Networks (BWNN).

We introduce a new parameter $\bm{\delta}^{\dagger^{(i)}}$ in place of $\bm{\delta}^{(i)}$ as follows.

[TABLE]

Then, the approximation $\tilde{G}_{i,i}$ is expressed as

[TABLE]

In analogy with Section 3.1, we will fix the back-whitening parameter $\{R^{(i)^{T}}\}$ so that

[TABLE]

3.2.2 Updating of the back-whitening parameter

Let us explicitly find $\{R^{(i)^{T}}\}$ to satisfy (47). From (45), we have

[TABLE]

Thus, $\bm{\delta}^{\dagger^{(i)}}$ is a linear transformation of $\bm{\delta}^{(i)}$ , which can be written as

[TABLE]

By inserting (49) into (47), we obtain

[TABLE]

Hence, in analogy with (32), $R^{(i)}$ which satisfies (47) is given by the following

[TABLE]

Here, $\Lambda,P$ are the diagonal and the orthogonal matrices associated with $D_{i,i}$ , and $\varepsilon$ is a small positive parameter to avoid division by zero.

Altogether, as in the case of the forward-whitening parameters, (47) is satisfied by updating the back-whitening parameters according to (51), which in turn depends on the values of $\bm{\delta}^{(i)}$ computed in the back-propagating phase.
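By analogy with the forward case, the back-whitening matrix can be sketched as below. Several points are assumptions made for this illustration and should be checked against the paper's equations: that $D_{i,i}$ is the second moment $E[\bm{\delta}^{(i)}\bm{\delta}^{(i)^{T}}]$ without mean subtraction, that $R^{(i)}=(\Lambda+\varepsilon I)^{-1/2}P^{T}$, and that the whitened error signal is the linear map $\bm{\delta}^{\dagger}=R\bm{\delta}$. The function name and the synthetic samples are hypothetical.

```python
import numpy as np

def back_whitening_matrix(Delta, eps=1e-5):
    """Estimate R from samples Delta of shape (K, m) of delta^(i).

    Assumes D = E[delta delta^T] (no centring) and R = (Lambda + eps I)^{-1/2} P^T.
    """
    D = Delta.T @ Delta / Delta.shape[0]
    lam, P = np.linalg.eigh(D)               # D = P diag(lam) P^T
    return (P / np.sqrt(lam + eps)).T

# Hypothetical correlated back-propagated signals.
rng = np.random.default_rng(0)
mix = np.array([[2.0, 0.0, 0.0], [1.0, 1.0, 0.0], [0.5, -1.0, 3.0]])
Delta = rng.normal(size=(4000, 3)) @ mix

R = back_whitening_matrix(Delta)
Delta_dag = Delta @ R.T                      # rows are delta_dagger = R delta (assumed form)
second_moment = Delta_dag.T @ Delta_dag / Delta.shape[0]
```

Under these assumptions the second moment of the transformed signals is close to the identity on the batch used for the estimate, which is the property that condition (47) asks for.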

3.2.3 Updating of the model parameters

As in the feed-forward phase, we update the model parameters $\{(W^{\dagger^{(i)}},\bm{b}^{\dagger^{(i)}})\}$ so that the values of the multilayer Perceptron function are kept the same when the back-whitening parameters are updated.

In order to achieve this, the model parameters $\{(W^{\dagger^{(i)}},\bm{b}^{\dagger^{(i)}})\}$ need to be updated as follows, given the back-whitening parameters are updated from $R^{(i)}$ to $R_{new}^{(i)}$ .

[TABLE]

We call the above algorithm “Bidirectional Projected Natural Gradient Descent” (BPRONG) because it performs whitening in both the feed-forward and back-propagating phases. Its outline is shown in Algorithm 2. Also, as with the forward-whitening, we can perform the back-whitening update at fixed intervals. The two updates can be done at the same time or independently. In the following section, we employ the latter for a numerical application.

4 Numerical Experiment

In order to see the efficacy of our proposed method BPRONG in 3.2, we applied it to hand-written character (digit) recognition using the MNIST data set (http://yann.lecun.com/exdb/mnist/) and compared it against three other methods: ordinary Stochastic Gradient Descent (SGD), Batch Normalization (BN)[5], and PRONG. The network architecture is common to all the compared methods, with 5 layers of 784-100-100-100-10 neurons from input to output. A common learning rate of $0.01$ and a mini-batch size of $100$ are used. The training data contains $60000$ samples and the test data $10000$ . We call $600$ updates $1$ epoch, and plot, at each epoch, the training loss on the training set and the validation loss on the test set.

We observe the advantage of BPRONG with respect to the number of iterations in both the training and the validation losses, as shown in Figures 3 and 4. With respect to actual computation time, BPRONG is faster than PRONG and about as fast as BN (Figures 5 and 6). This is because the eigenvalue decomposition associated with the whitening is computationally costly, which offsets the advantage over BN in the number of iterations.

Altogether, our proposed method, BPRONG, has shown its potential. If we can find methods to speed up the whitening process, BPRONG can show its effectiveness further.

5 Discussion

We presented here an extended model of the previously proposed Whitened Neural Networks[4] as a method to realize the Natural Gradient Descent. Our extension, which we call Bidirectional Whitened Neural Networks, aims to make the Fisher Information Matrix closer to the identity matrix. It has shown its potential as an efficient method through a numerical application to a hand-written digit recognition problem.

We note two topics to be investigated further. First, the proposed model should be tested on larger and deeper network architectures to check its efficacy and stability. This may require further modifications, particularly by exploring matrix decomposition methods. Second, we want to find a more dynamical way to perform the whitening process. In other words, we would like to keep the Fisher Information Matrix constantly close to the identity by continuous whitening. Though this is computationally more expensive, we may build on previous studies, such as adaptive calculations of transforming matrices[3].

Bibliography

[1] Shun-ichi Amari. Natural gradient works efficiently in learning. Neural Computation, 10(2):251–276, 1998.
[2] Shun-ichi Amari and Hiroshi Nagaoka. Methods of Information Geometry (Translations of Mathematical Monographs). American Mathematical Society, 2007.
[3] J-F Cardoso and Beate H Laheld. Equivariant adaptive source separation. IEEE Transactions on Signal Processing, 44(12):3017–3030, 1996.
[4] Guillaume Desjardins, Karen Simonyan, Razvan Pascanu, and Koray Kavukcuoglu. Natural neural networks. In C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett, editors, Advances in Neural Information Processing Systems 28, pages 2071–2079. Curran Associates, Inc., 2015.
[5] Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In Proceedings of the 32nd International Conference on Machine Learning (ICML-15), pages 448–456, 2015.
[6] James Martens. New insights and perspectives on the natural gradient method. arXiv preprint arXiv:1412.1193, 2014.
[7] James Martens and Roger Grosse. Optimizing neural networks with Kronecker-factored approximate curvature. arXiv preprint arXiv:1503.05671, 2015.
[8] Hyeyoung Park, Shun-ichi Amari, and Kenji Fukumizu. Adaptive natural gradient learning algorithms for various stochastic models. Neural Networks, 13(7):755–764, 2000.