Training products of expert capsules with mixing by dynamic routing

Michael Hauser

arXiv:1907.11643·cs.LG·July 29, 2019

TL;DR

This paper introduces an unsupervised learning algorithm for capsule networks that uses an energy-based model with dynamic routing, enabling realistic image generation from learned distributions.

Contribution

It proposes a novel energy function for capsule networks aligned with dynamic routing, facilitating unsupervised training and image generation.

Findings

1. Successfully trained capsule networks on standard vision datasets.

2. Able to generate realistic images from the learned distribution.

3. Demonstrated the effectiveness of energy-based models with dynamic routing.

Equations

$$-E\left(v,h\right)=\sum_{i,j}w_{ij}v_{i}h_{j}+\sum_{i}b_{i}v_{i}+\sum_{j}c_{j}h_{j} \tag{1}$$

$$\frac{\partial\log p\left(v\right)}{\partial w_{ij}}=p\left(h_{j}=1|v\right)v_{i}-\sum_{v}p\left(v\right)p\left(h_{j}=1|v\right)v_{i} \tag{2}$$

$$z^{(l+1)}_{j}=\sum_{i}c^{(l)}_{ij}W^{(l)}_{ij}\cdot x^{(l)}_{i} \tag{3}$$

$$x^{(l+1)}_{j}=\textnormal{squash}\left(z^{(l+1)}_{j}\right)=\frac{\|z^{(l+1)}_{j}\|^{2}}{1+\|z^{(l+1)}_{j}\|^{2}}\frac{z^{(l+1)}_{j}}{\|z^{(l+1)}_{j}\|} \tag{4}$$

$$P\left(j=\textnormal{on}|x^{(l)}_{1},\dots,x^{(l)}_{I}\right)=\|\textnormal{squash}\left(z^{(l+1)}_{j}\right)\| \tag{5}$$

$$z^{(l+1)}_{j}=\textnormal{unsquash}\left(x^{(l+1)}_{j}\right)=\sqrt{\frac{\|x^{(l+1)}_{j}\|}{1-\|x^{(l+1)}_{j}\|}}\,\frac{x^{(l+1)}_{j}}{\|x^{(l+1)}_{j}\|} \tag{6}$$

$$E\left(x^{(l)},\|x^{(l+1)}\|\right)=-\sum_{j}\log\left(\|\sum_{i}c^{(l)}_{ij}W^{(l)}_{ij}\cdot x^{(l)}_{i}\|^{2}\right)\|x^{(l+1)}_{j}\| \tag{7}$$

$$P\left(x^{(l)},\|x^{(l+1)}\|\right)=\frac{1}{Z}\exp\left(-E\left(x^{(l)},\|x^{(l+1)}\|\right)\right) \tag{8}$$

$$P\left(x^{(l)},\|x^{(l+1)}\|\right)=\frac{1}{Z}\prod_{j}\left[\exp\left(\log\left(\|\sum_{i}c^{(l)}_{ij}W^{(l)}_{ij}\cdot x^{(l)}_{i}\|^{2}\right)\|x^{(l+1)}_{j}\|\right)\right] \tag{9}$$

$$P\left(x^{(l)}\right)=\frac{1}{Z}\prod_{j}\left[1+\exp\left(\log\left(\|\sum_{i}c^{(l)}_{ij}W^{(l)}_{ij}\cdot x^{(l)}_{i}\|^{2}\right)\right)\right] \tag{10}$$

$$P\left(\|x^{(l+1)}\||x^{(l)}\right)=\prod_{j}P\left(\|x^{(l+1)}_{j}\||x^{(l)}\right) \tag{11}$$

$$P\left(\|x^{(l+1)}_{j}\|=1|x^{(l)}\right)=\frac{\|\sum_{i}c^{(l)}_{ij}W^{(l)}_{ij}\cdot x^{(l)}_{i}\|^{2}}{1+\|\sum_{i}c^{(l)}_{ij}W^{(l)}_{ij}\cdot x^{(l)}_{i}\|^{2}} \tag{12}$$

$$P\left(j=\textnormal{on}|x^{(l)}\right)=\|x^{(l+1)}_{j}\|=P\left(\|x^{(l+1)}_{j}\|=1|x^{(l)}\right) \tag{13}$$

$$\log P\left(x^{(l)}\right)=\log\sum^{1}_{\|x^{(l+1)}\|=0}e^{-E\left(x^{(l)},\|x^{(l+1)}\|\right)}-\log\sum_{x^{(l)}}\sum^{1}_{\|x^{(l+1)}\|=0}e^{-E\left(x^{(l)},\|x^{(l+1)}\|\right)} \tag{14}$$

$$\frac{\partial}{\partial W^{(l)}_{ij}}\log P\left(x^{(l)}\right)=\sum^{1}_{\|x^{(l+1)}\|=0}P\left(\|x^{(l+1)}\||x^{(l)}\right)\frac{\partial\left(-E\right)}{\partial W^{(l)}_{ij}}\left(x^{(l)},\|x^{(l+1)}\|\right)-\sum_{x^{(l)}}P\left(x^{(l)}\right)\sum^{1}_{\|x^{(l+1)}\|=0}P\left(\|x^{(l+1)}\||x^{(l)}\right)\frac{\partial\left(-E\right)}{\partial W^{(l)}_{ij}}\left(x^{(l)},\|x^{(l+1)}\|\right) \tag{15}$$

$$\frac{\partial\left(-E\right)}{\partial W^{(l)}_{ij}}\left(x^{(l)},\|x^{(l+1)}\|\right)=\frac{\partial}{\partial W^{(l)}_{ij}}\sum_{j^{\prime}}\log\left(\|\sum_{i^{\prime}}c^{(l)}_{i^{\prime}j^{\prime}}W^{(l)}_{i^{\prime}j^{\prime}}\cdot x^{(l)}_{i^{\prime}}\|^{2}\right)\|x^{(l+1)}_{j^{\prime}}\|=\frac{\partial}{\partial W^{(l)}_{ij}}\log\left(\|\sum_{i^{\prime}}c^{(l)}_{i^{\prime}j}W^{(l)}_{i^{\prime}j}\cdot x^{(l)}_{i^{\prime}}\|^{2}\right)\|x^{(l+1)}_{j}\| \tag{16}$$

$$\sum^{1}_{\|x^{(l+1)}\|=0}P\left(\|x^{(l+1)}\||x^{(l)}\right)\frac{\partial\left(-E\right)}{\partial W^{(l)}_{ij}}\left(x^{(l)},\|x^{(l+1)}\|\right)=\sum^{1}_{\|x^{(l+1)}_{j}\|=0}P\left(\|x^{(l+1)}_{j}\||x^{(l)}\right)\frac{\partial}{\partial W^{(l)}_{ij}}\log\left(\|z^{(l+1)}_{j}\|^{2}\right)\|x^{(l+1)}_{j}\|\sum^{1}_{\|x^{(l+1)}_{-j}\|=0}P\left(\|x^{(l+1)}_{-j}\||x^{(l)}\right)=P\left(\|x^{(l+1)}_{j}\|=1|x^{(l)}\right)\frac{\partial}{\partial W^{(l)}_{ij}}\log\left(\|z^{(l+1)}_{j}\|^{2}\right) \tag{17}$$

$$\frac{\partial}{\partial W^{(l)}_{ij}}\log P\left(x^{(l)}\right)=2c^{(l)}_{ij}\left(\frac{z^{(l+1)}_{j}x^{(l)}_{i}}{1+\|z^{(l+1)}_{j}\|^{2}}-\sum_{x^{(l)}}P\left(x^{(l)}\right)\frac{z^{(l+1)}_{j}x^{(l)}_{i}}{1+\|z^{(l+1)}_{j}\|^{2}}\right) \tag{18}$$

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Advanced Neural Network Applications · Medical Image Segmentation Techniques

Methods: Capsule Network

Full text

Training products of expert capsules with mixing by dynamic routing

Michael Hauser

Children’s Hospital of Philadelphia

Michael Hauser has been supported by a postdoctoral fellowship at the Center for Autism Research in the Children’s Hospital of Philadelphia. Any opinions, findings and conclusions or recommendations expressed in this publication are those of the author and do not necessarily reflect the views of the sponsoring agencies.

Abstract

This study develops an unsupervised learning algorithm for products of expert capsules with dynamic routing. Analogous to binary-valued neurons in Restricted Boltzmann Machines, the magnitude of a squashed capsule firing takes values between zero and one, representing the probability of the capsule being on. This analogy motivates the design of an energy function for capsule networks. In order to have an efficient sampling procedure where hidden layer nodes are not connected, the energy function is made consistent with dynamic routing in the sense of the probability of a capsule firing, and inference on the capsule network is computed with the dynamic routing between capsules procedure. In order to optimize the log-likelihood of the visible layer capsules, the gradient is found in terms of this energy function. The developed unsupervised learning algorithm is used to train a capsule network on standard vision datasets, and is able to generate realistic looking images from its learned distribution.

1 Introduction

Products of experts models [1] were designed as an alternative to mixture models, where the sum over distributions is replaced by a product. Because there is a product, as opposed to a sum, if an individual expert votes a low probability, then the entire product will have low probability. This is in contrast to a standard mixture model, where it only takes a single distribution to vote high probability for the data point to have high probability of existing, effectively ignoring the votes of all of the other distributions. A product of experts allows each expert to specialize on a single aspect of the data, and to disqualify a data point from existing only on this single aspect, whereas a sum of experts requires each distribution to model all aspects of the data, since within the sum of experts each distribution is essentially acting independently of the others.

Capsule networks [2] with dynamic routing [3] depart in many significant ways from neural networks. Artificial neural networks originate from the McCulloch-Pitts model [4], where a scalar neuron fires if the weighted sum of the neuron’s scalar inputs reaches a threshold. There are exceptions: the outputs of the softmax and max-pool functions, for example, depend not only on the previous layer inputs but also on the outputs of the other nodes at the current layer. Besides these exceptions, most standard activation functions, such as the sigmoid and ReLU, follow the McCulloch-Pitts design.

As an alternative to scalar-valued neurons, capsules are vector valued, which has at least two advantages. First, because they are vector valued, much more complex routing mechanisms, such as routing by agreement [3], can be used to pass data forward through the network, as opposed to a simple weighted sum of inputs. Second, vectors have a magnitude and an orientation, which allows capsules to fire if their magnitude reaches a threshold, while their orientation can be used to represent the instantiation parameters of the data.

A major difficulty with capsule networks is that, because they are still in their infancy, far fewer algorithmic tools exist to train and develop them. For example, capsule networks have been trained via backpropagation [5, 2, 3] as well as expectation-maximization [6]. For unsupervised learning they have been trained as an autoencoder [2] as well as in a generative-adversarial setting [7, 8], but these are built on backpropagation, which itself was designed for neural networks without dynamic routing. In their supervised settings they have been paired with a fully-connected decoder [3, 6], but this is to compel the network to learn invariant vector representations and help defend against adversarial attacks [9].

For these reasons we design a product of expert capsules in an analogous way to a product of expert neurons. From this, an efficient unsupervised learning procedure is developed to train this product of expert capsules in a bottom up fashion, using the dynamic routing procedure to mix the model distribution. With a given energy function compatible with dynamic routing, the contrastive divergence is minimized to learn the underlying density.

2 Algorithm review

This section will briefly review product of experts learning as well as dynamic routing between capsules.

2.1 Product of experts learning

As mentioned in the Introduction, product of experts models combine densities as products as opposed to sums [1, 10]. If we divide the experts into a visible layer and hidden layer of binary valued experts, the energy of this configuration is given as

$$-E\left(v,h\right)=\sum_{i,j}w_{ij}v_{i}h_{j}+\sum_{i}b_{i}v_{i}+\sum_{j}c_{j}h_{j} \tag{1}$$

From this, the probability of a visible-hidden configuration is $p\left(v,h\right)=\frac{1}{Z}e^{-E\left(v,h\right)}$ , and marginalizing over the hidden layer gives $p\left(v\right)=\sum_{h}p\left(v,h\right)=\frac{1}{Z}\sum_{h}e^{-E\left(v,h\right)}$ . Because there are no intra-layer connections, we can efficiently sample the hidden layer nodes as $p\left(h_{j}=1|v\right)=\frac{1}{1+\exp\left(-\sum_{i}w_{ij}v_{i}-c_{j}\right)}$ and the visible layer nodes as $p\left(v_{i}=1|h\right)=\frac{1}{1+\exp\left(-\sum_{j}w_{ij}h_{j}-b_{i}\right)}$ . The gradient of the log-likelihood is given by

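$$\frac{\partial\log p\left(v\right)}{\partial w_{ij}}=p\left(h_{j}=1|v\right)v_{i}-\sum_{v}p\left(v\right)p\left(h_{j}=1|v\right)v_{i} \tag{2}$$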

The first term on the right-hand side of this equation is generated by the data, while the second term is the expectation over the product of experts model itself. Truly estimating the second term with MCMC requires mixing the density to infinity. Instead one usually minimizes the contrastive divergence [1], which requires running the Markov chain once.
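As an illustration of the single-step contrastive divergence update described above, here is a minimal NumPy sketch for a small binary RBM. The function name `cd1_step`, the learning rate, the batch averaging, and the use of hidden probabilities rather than samples in the statistics are assumptions made for this example, not details taken from [1].

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def cd1_step(v0, w, b, c, rng, lr=0.1):
    """One CD-1 update for a binary RBM with -E = v.w.h + b.v + c.h.

    v0: (batch, n_visible) binary data; w: (n_visible, n_hidden).
    Returns updated copies of (w, b, c).
    """
    # Positive phase: hidden probabilities driven by the data.
    ph0 = sigmoid(v0 @ w + c)
    h0 = (rng.random(ph0.shape) < ph0).astype(float)
    # One step of the Markov chain: reconstruct visibles, then hiddens.
    pv1 = sigmoid(h0 @ w.T + b)
    v1 = (rng.random(pv1.shape) < pv1).astype(float)
    ph1 = sigmoid(v1 @ w + c)
    n = v0.shape[0]
    # <v h>_data - <v h>_model, with the model term taken after one step.
    dw = (v0.T @ ph0 - v1.T @ ph1) / n
    db = (v0 - v1).mean(axis=0)
    dc = (ph0 - ph1).mean(axis=0)
    return w + lr * dw, b + lr * db, c + lr * dc

# Toy usage on random binary data (illustrative sizes only).
rng = np.random.default_rng(0)
v = (rng.random((8, 6)) < 0.5).astype(float)
w, b, c = 0.01 * rng.standard_normal((6, 4)), np.zeros(6), np.zeros(4)
w, b, c = cd1_step(v, w, b, c, rng)
```

The model term is estimated from a single reconstruction step instead of a fully mixed chain, which is what makes the update cheap.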

2.2 Routing by agreement in capsule networks

This section will briefly review capsule networks and dynamic routing [3].

At layer $l$ , given a collection $i=1,2,\dots,I$ of vector-valued capsules $x^{(l)}_{i}$ and matrix-valued prediction maps $W^{(l)}_{ij}$ , the pre-activation, vector-valued predicted capsule $j$ from capsule $i$ is $z^{(l+1)}_{j|i}=W^{(l)}_{ij}\cdot x^{(l)}_{i}$ , where at layer $l+1$ we have the collection $j=1,2,\dots,J$ of capsules. We then take a weighted average of all of the predictions to yield the final, pre-activation, vector-valued capsule $j$ at layer $l+1$ , i.e.

$$z^{(l+1)}_{j}=\sum_{i}c^{(l)}_{ij}W^{(l)}_{ij}\cdot x^{(l)}_{i} \tag{3}$$

where the scalar-valued $c^{(l)}_{ij}$ ’s are determined by routing by agreement and each lower layer capsule can only make a finite amount of predictions, i.e. $\sum_{i}c^{(l)}_{ij}=1$ .

Routing by agreement is a procedure that iteratively re-weights each of the individual predictions to yield a final, collective prediction. If an individual prediction $z^{(l+1)}_{j|i}$ agrees well with the collective prediction $z^{(l+1)}_{j}$ , then we would like to increase the routing weight $c^{(l)}_{ij}$ connecting these capsules. Agreement has been measured as both the inner product [3], as well as the cosine distance [6], between the individual predictions and the squashed collective prediction. In this paper we use the cosine distance. The updated weights are then renormalized so that $\sum_{i}c^{(l)}_{ij}=1$ , and the process is repeated. (If the cosine distance is used, then squashing can happen outside this loop since the cosine distance takes the inner product of unit vectors, so the magnitude isn’t needed.)
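The iteration described above can be sketched as follows. The text does not spell out the exact update, so several choices here are assumptions: routing logits start at zero, are incremented by the cosine agreement at each iteration, and are normalized with a softmax over the lower-layer index $i$ so that $\sum_{i}c^{(l)}_{ij}=1$. The function name `route` and the iteration count are likewise hypothetical.

```python
import numpy as np

def route(x, W, n_iter=3, eps=1e-9):
    """x: (I, d_in) capsules; W: (I, J, d_out, d_in) prediction maps.

    Returns (c, z): routing weights of shape (I, J) with c.sum(axis=0) == 1,
    and pre-activation capsules z of shape (J, d_out).
    """
    # Individual predictions z_{j|i} = W_ij . x_i, shape (I, J, d_out).
    z_pred = np.einsum('ijab,ib->ija', W, x)
    logits = np.zeros(W.shape[:2])
    for _ in range(n_iter):
        # Normalize over the lower-layer index i (assumed softmax form).
        e = np.exp(logits - logits.max(axis=0, keepdims=True))
        c = e / e.sum(axis=0, keepdims=True)
        # Collective prediction z_j = sum_i c_ij z_{j|i}.
        z = np.einsum('ij,ija->ja', c, z_pred)
        # Cosine agreement between each prediction and the collective one.
        num = np.einsum('ija,ja->ij', z_pred, z)
        den = np.linalg.norm(z_pred, axis=-1) * np.linalg.norm(z, axis=-1) + eps
        logits = logits + num / den
    return c, z

# Toy usage: 5 lower capsules of dimension 3 routed to 2 capsules of dimension 4.
rng = np.random.default_rng(0)
x = rng.standard_normal((5, 3))
W = rng.standard_normal((5, 2, 4, 3))
c, z = route(x, W)
```

Because the agreement uses unit vectors only, no squashing is needed inside the loop, which matches the parenthetical remark above.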

The squashing function [3] we use is defined as follows:

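$$x^{(l+1)}_{j}=\textnormal{squash}\left(z^{(l+1)}_{j}\right)=\frac{\|z^{(l+1)}_{j}\|^{2}}{1+\|z^{(l+1)}_{j}\|^{2}}\frac{z^{(l+1)}_{j}}{\|z^{(l+1)}_{j}\|} \tag{4}$$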

The intention of the squashing function is to scale the magnitude of $z^{(l+1)}_{j}$ to lie between $0$ and $1$ while keeping the orientation the same. We can then interpret capsule $j$ as being on if the magnitude of the squashed vector is close to $1$ , and as being off if the magnitude is close to $0$:

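$$P\left(j=\textnormal{on}|x^{(l)}_{1},\dots,x^{(l)}_{I}\right)=\|\textnormal{squash}\left(z^{(l+1)}_{j}\right)\| \tag{5}$$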

Later in the paper we need the inverse map of the squash map, which we call unsquash:

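$$z^{(l+1)}_{j}=\textnormal{unsquash}\left(x^{(l+1)}_{j}\right)=\sqrt{\frac{\|x^{(l+1)}_{j}\|}{1-\|x^{(l+1)}_{j}\|}}\,\frac{x^{(l+1)}_{j}}{\|x^{(l+1)}_{j}\|} \tag{6}$$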

This is just the pre-image of the squashed vector so $\left(\textnormal{unsquash}\circ\textnormal{squash}\right)\left(z\right)=\textnormal{identity}_{\mathbb{R}^{d}}\left(z\right)=z$ , in a similar way as the logit map is the inverse map of the sigmoid.
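A small numerical sketch of the round trip may help. Writing $s=\|x\|$ and $n=\|z\|$, the squashing function gives $s=n^{2}/(1+n^{2})$, so inverting the magnitude yields $n=\sqrt{s/(1-s)}$; that is the form used in `unsquash` below. The `eps` guard against division by zero is an assumption added for the example.

```python
import numpy as np

def squash(z, eps=1e-12):
    # Scale the norm to n^2 / (1 + n^2) while keeping the orientation.
    n = np.linalg.norm(z, axis=-1, keepdims=True)
    return (n**2 / (1.0 + n**2)) * z / (n + eps)

def unsquash(x, eps=1e-12):
    # s = ||x|| lies in [0, 1); recover n = ||z|| from s = n^2 / (1 + n^2).
    s = np.linalg.norm(x, axis=-1, keepdims=True)
    n = np.sqrt(s / (1.0 - s))
    return n * x / (s + eps)

rng = np.random.default_rng(0)
z = rng.standard_normal((4, 16))
x = squash(z)
z_back = unsquash(x)
```

As with the logit and the sigmoid, `unsquash` becomes numerically delicate as $\|x\|$ approaches $1$.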

3 Capsule networks as product of experts

This section develops the product of expert capsules model, where a capsule being either on or off, measured by the magnitude of the capsule vector after squashing, is analogous to the binary action of a neuron firing in a Restricted Boltzmann Machine.

We denote an individual capsule by $x^{(l)}_{i}$ for $i=1,2,\dots,I$ . We also write $x^{(l)}=\left(x^{(l)}_{1},x^{(l)}_{2},\dots,x^{(l)}_{I}\right)$ to be the collection of all capsules on that layer. Similarly $\|x^{(l+1)}_{j}\|$ is the norm of capsule $j$ for $j=1,2,\dots,J$ , and $\|x^{(l+1)}\|=\left(\|x^{(l+1)}_{1}\|,\|x^{(l+1)}_{2}\|,\dots,\|x^{(l+1)}_{J}\|\right)$ .

3.1 Energy function and conditional probabilities

Before beginning we would like to make a subtle, yet important point. We note that if we want the magnitude of the squashed capsule to play an analogous role to the probability of a binary neuron firing, $P\left(\|x^{(l+1)}_{j}\|=1|x^{(l)}\right)=\|\textnormal{squash}\left(z^{(l+1)}_{j}\right)\|$ , for $z^{(l+1)}_{j}=\sum_{i}c^{(l)}_{ij}W^{(l)}_{ij}\cdot x^{(l)}_{i}$ , the squashing part of the squashing function can be rewritten as a sigmoid activation, $\frac{\|z^{(l+1)}_{j}\|^{2}}{1+\|z^{(l+1)}_{j}\|^{2}}=\sigma\left(\log\|z^{(l+1)}_{j}\|^{2}\right)$ . Intuitively, $\|z^{(l+1)}_{j}\|\geq 0$ , whereas the sigmoid function should take arguments over all of $\mathbb{R}$ , so taking the logarithm of the norm maps positive numbers to all of $\mathbb{R}$ .
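The identity used above follows in one line from the definition $\sigma\left(a\right)=1/\left(1+e^{-a}\right)$ ; writing $z$ for $z^{(l+1)}_{j}$ :

```latex
\sigma\left(\log\|z\|^{2}\right)
  =\frac{1}{1+e^{-\log\|z\|^{2}}}
  =\frac{1}{1+\|z\|^{-2}}
  =\frac{\|z\|^{2}}{1+\|z\|^{2}},
  \qquad \|z\|>0 .
```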

Define the energy across layers as follows:

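$$E\left(x^{(l)},\|x^{(l+1)}\|\right)=-\sum_{j}\log\left(\|\sum_{i}c^{(l)}_{ij}W^{(l)}_{ij}\cdot x^{(l)}_{i}\|^{2}\right)\|x^{(l+1)}_{j}\| \tag{7}$$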

The probability of a certain configuration $\left(x^{(l)},\|x^{(l+1)}\|\right)$ is then defined

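$$P\left(x^{(l)},\|x^{(l+1)}\|\right)=\frac{1}{Z}\exp\left(-E\left(x^{(l)},\|x^{(l+1)}\|\right)\right) \tag{8}$$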

where $Z$ is the normalizing partition function. This is a product of expert capsules:

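$$P\left(x^{(l)},\|x^{(l+1)}\|\right)=\frac{1}{Z}\prod_{j}\left[\exp\left(\log\left(\|\sum_{i}c^{(l)}_{ij}W^{(l)}_{ij}\cdot x^{(l)}_{i}\|^{2}\right)\|x^{(l+1)}_{j}\|\right)\right] \tag{9}$$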

We marginalize this distribution over the layer $l+1$ capsules firing:

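$$P\left(x^{(l)}\right)=\frac{1}{Z}\prod_{j}\left[1+\exp\left(\log\left(\|\sum_{i}c^{(l)}_{ij}W^{(l)}_{ij}\cdot x^{(l)}_{i}\|^{2}\right)\right)\right] \tag{10}$$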

In order to have an efficient means of sampling the hidden layer, it is necessary that at layer $l+1$ the nodes representing the capsules firing are not connected, so that the total probability is equal to the product of individual capsule probabilities, and so using equations 9 and 10 we have:

$$P\left(\|x^{(l+1)}\||x^{(l)}\right)=\prod_{j}P\left(\|x^{(l+1)}_{j}\||x^{(l)}\right) \tag{11}$$

From here it is straightforward to show that the conditional probability of capsule $j$ firing is equal to the magnitude of the squashed capsule:

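$$P\left(\|x^{(l+1)}_{j}\|=1|x^{(l)}\right)=\frac{\|\sum_{i}c^{(l)}_{ij}W^{(l)}_{ij}\cdot x^{(l)}_{i}\|^{2}}{1+\|\sum_{i}c^{(l)}_{ij}W^{(l)}_{ij}\cdot x^{(l)}_{i}\|^{2}} \tag{12}$$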

In this way, the energy function defined in Equation 7 is consistent with routing by agreement, in the sense that the magnitude of the squashed capsule represents the probability that the individual capsule is on.

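$$P\left(j=\textnormal{on}|x^{(l)}\right)=\|x^{(l+1)}_{j}\|=P\left(\|x^{(l+1)}_{j}\|=1|x^{(l)}\right) \tag{13}$$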

where again $x^{(l+1)}_{j}=\textnormal{squash}\left(z^{(l+1)}_{j}\right)$ for $z^{(l+1)}_{j}=\sum_{i}c^{(l)}_{ij}W^{(l)}_{ij}\cdot x^{(l)}_{i}$ .
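The consistency claim can be checked numerically on a toy case by enumerating every binary on/off configuration of a few capsules and comparing the resulting conditional probabilities with the squashed magnitudes. The vectors `z` below are random stand-ins for $z^{(l+1)}_{j}$ (an assumption for the example; in the model they come from routing), and $J=3$ is chosen only to keep the enumeration small.

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)
z = rng.standard_normal((3, 4))          # hypothetical z_j for J = 3 capsules
n2 = np.sum(z**2, axis=1)                # ||z_j||^2
log_n2 = np.log(n2)

def neg_energy(h):
    # -E(x, h) = sum_j log(||z_j||^2) * h_j for binary h_j in {0, 1}
    return float(np.dot(log_n2, h))

# All 2^J on/off configurations of the layer l+1 magnitudes.
states = [np.array(h) for h in itertools.product([0, 1], repeat=3)]
weights = np.array([np.exp(neg_energy(h)) for h in states])
cond = weights / weights.sum()           # P(h | x); the partition function cancels

# Probability that capsule j is on, by explicit enumeration.
p_on = np.array([sum(p for p, h in zip(cond, states) if h[j] == 1)
                 for j in range(3)])

squashed_norm = n2 / (1.0 + n2)          # ||squash(z_j)||
```

Here `p_on` agrees with `squashed_norm`, and `weights.sum()` equals the product $\prod_{j}\left(1+\|z_{j}\|^{2}\right)$ that appears when marginalizing over the layer $l+1$ capsules.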

Because of this consistency, we use dynamic routing between capsules for mixing in the MCMC estimation of the distribution. Note that we do not want to directly sample $x^{(l+1)}_{j}$ from a graphical model $P\left(x^{(l+1)}_{j}|x^{(l)}_{i}\right)$ . This is because $x^{(l+1)}_{j}$ is a squashed vector, implying that the components of the vector are dependent on each other from the squashing, so graphically the nodes at that given layer are connected, meaning that we cannot efficiently sample these nodes with MCMC. By explicitly decoupling $x^{(l+1)}_{j}$ into its magnitude and orientation, we are able to decouple the parts of the dynamic routing that we want to use from the parts we do not want to use, allowing us to efficiently sample the magnitude $\|x^{(l+1)}_{j}\|\sim P\left(\|x^{(l+1)}_{j}\||x^{(l)}_{i}\right)$ from this distribution, since graphically these nodes are not connected, from Equation 11.

For inference, the energy model only requires $P\left(\|x^{(l+1)}_{j}\|=1|x^{(l)}\right)=\|x^{(l+1)}_{j}\|$ , which holds by construction. We then sample $x^{(l+1)}_{j}\sim P\left(x^{(l+1)}_{j}|x^{(l)}\right)$ as $x^{(l+1)}_{j}=\textnormal{squash}\left(\sum_{i}c^{(l)}_{ij}W^{(l)}_{ij}\cdot x^{(l)}_{i}\right)$ , from the dynamic routing, and again the norm of this is consistent with the energy model. Taking the norm of $x^{(l+1)}_{j}$ introduces a rotational invariance to the vector, which at first may seem like a problem since the instantiation parameters are stored in the rotational angles of the capsule. In fact this is not an issue, as we will see in Section 3.2, because we are taking the gradient of the log likelihood, and the gradient of the norm of a vector is dependent on the orientation of the vector itself, not just its magnitude. In this way, during the gradient descent, information from the orientation of the capsule is used to update the model parameters.

3.2 Gradient of the log-likelihood

In order to optimize our parameter weights we need the gradient of the log-likelihood to be tractably computable; the log-likelihood is given by:

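$$\log P\left(x^{(l)}\right)=\log\sum^{1}_{\|x^{(l+1)}\|=0}e^{-E\left(x^{(l)},\|x^{(l+1)}\|\right)}-\log\sum_{x^{(l)}}\sum^{1}_{\|x^{(l+1)}\|=0}e^{-E\left(x^{(l)},\|x^{(l+1)}\|\right)} \tag{14}$$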

We can put the gradient of the log-likelihood in a compact form using the fact that $P\left(\|x^{(l+1)}\||x^{(l)}\right)=\frac{P\left(x^{(l)},\|x^{(l+1)}\|\right)}{P\left(x^{(l)}\right)}=\frac{\frac{1}{Z}\exp\left(-E\left(x^{(l)},\|x^{(l+1)}\|\right)\right)}{\frac{1}{Z}\sum^{1}_{\|x^{(l+1)}\|=0}\exp\left(-E\left(x^{(l)},\|x^{(l+1)}\|\right)\right)}$ .

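$$\frac{\partial}{\partial W^{(l)}_{ij}}\log P\left(x^{(l)}\right)=\sum^{1}_{\|x^{(l+1)}\|=0}P\left(\|x^{(l+1)}\||x^{(l)}\right)\frac{\partial\left(-E\right)}{\partial W^{(l)}_{ij}}\left(x^{(l)},\|x^{(l+1)}\|\right)-\sum_{x^{(l)}}P\left(x^{(l)}\right)\sum^{1}_{\|x^{(l+1)}\|=0}P\left(\|x^{(l+1)}\||x^{(l)}\right)\frac{\partial\left(-E\right)}{\partial W^{(l)}_{ij}}\left(x^{(l)},\|x^{(l+1)}\|\right) \tag{15}$$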

Computing this is still not tractable as it involves an exponential number of sums. However, since the layer $l+1$ capsule activations are not connected, from Equation 11, we can apply an analogous factorization trick to that which is used in learning Restricted Boltzmann Machines. First we reduce the energy function:

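$$\frac{\partial\left(-E\right)}{\partial W^{(l)}_{ij}}\left(x^{(l)},\|x^{(l+1)}\|\right)=\frac{\partial}{\partial W^{(l)}_{ij}}\sum_{j^{\prime}}\log\left(\|\sum_{i^{\prime}}c^{(l)}_{i^{\prime}j^{\prime}}W^{(l)}_{i^{\prime}j^{\prime}}\cdot x^{(l)}_{i^{\prime}}\|^{2}\right)\|x^{(l+1)}_{j^{\prime}}\|=\frac{\partial}{\partial W^{(l)}_{ij}}\log\left(\|\sum_{i^{\prime}}c^{(l)}_{i^{\prime}j}W^{(l)}_{i^{\prime}j}\cdot x^{(l)}_{i^{\prime}}\|^{2}\right)\|x^{(l+1)}_{j}\| \tag{16}$$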

Using this, we are in a position to find the tractable gradient for the product of expert capsules:

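$$\sum^{1}_{\|x^{(l+1)}\|=0}P\left(\|x^{(l+1)}\||x^{(l)}\right)\frac{\partial\left(-E\right)}{\partial W^{(l)}_{ij}}\left(x^{(l)},\|x^{(l+1)}\|\right)=\sum^{1}_{\|x^{(l+1)}_{j}\|=0}P\left(\|x^{(l+1)}_{j}\||x^{(l)}\right)\frac{\partial}{\partial W^{(l)}_{ij}}\log\left(\|z^{(l+1)}_{j}\|^{2}\right)\|x^{(l+1)}_{j}\|\sum^{1}_{\|x^{(l+1)}_{-j}\|=0}P\left(\|x^{(l+1)}_{-j}\||x^{(l)}\right)=P\left(\|x^{(l+1)}_{j}\|=1|x^{(l)}\right)\frac{\partial}{\partial W^{(l)}_{ij}}\log\left(\|z^{(l+1)}_{j}\|^{2}\right) \tag{17}$$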

where $x^{(l+1)}_{-j}$ is analogous notation to the RBM case [1], and refers to all capsules at layer $l+1$ other than the $j^{th}$ capsule.

To further reduce Equation 17, where we write $z^{(l+1)}_{j}=\sum_{i^{\prime}}c^{(l)}_{i^{\prime}j}W^{(l)}_{i^{\prime}j}\cdot x^{(l)}_{i^{\prime}}$ , one has $\frac{\partial}{\partial W^{(l)}_{ij}}\log\left(\|z^{(l+1)}_{j}\|^{2}\right)=\frac{2c^{(l)}_{ij}}{\|z^{(l+1)}_{j}\|^{2}}z^{(l+1)}_{j}x^{(l)}_{i}$ . Similarly, $P\left(\|x^{(l+1)}_{j}\|=1|x^{(l)}\right)=\frac{\|z^{(l+1)}_{j}\|^{2}}{1+\|z^{(l+1)}_{j}\|^{2}}$ , and we have the final form of the gradient:

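$$\frac{\partial}{\partial W^{(l)}_{ij}}\log P\left(x^{(l)}\right)=2c^{(l)}_{ij}\left(\frac{z^{(l+1)}_{j}x^{(l)}_{i}}{1+\|z^{(l+1)}_{j}\|^{2}}-\sum_{x^{(l)}}P\left(x^{(l)}\right)\frac{z^{(l+1)}_{j}x^{(l)}_{i}}{1+\|z^{(l+1)}_{j}\|^{2}}\right) \tag{18}$$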

For RBM learning, this is analogous to the update rule $\Delta w_{ij}=\left\langle v_{i}h_{j}\right\rangle_{\textnormal{data}}-\left\langle v_{i}h_{j}\right\rangle_{\textnormal{model}}$ .
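The derivative $\frac{\partial}{\partial W^{(l)}_{ij}}\log\left(\|z^{(l+1)}_{j}\|^{2}\right)$ used in this reduction can be sanity-checked against finite differences. The sketch below makes two assumptions: the routing weights are held constant while differentiating, and the product $z^{(l+1)}_{j}x^{(l)}_{i}$ is read as the outer product $z_{j}x_{i}^{T}$ so that the result has the shape of $W^{(l)}_{ij}$. All sizes and variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
I, d_in, d_out = 4, 3, 5
x = rng.standard_normal((I, d_in))
W = rng.standard_normal((I, d_out, d_in))   # maps W_ij for one fixed j
c = rng.random(I)
c /= c.sum()                                # routing weights, held fixed

def log_norm2(W):
    # log ||z_j||^2 with z_j = sum_i c_ij W_ij x_i
    z = np.einsum('i,iab,ib->a', c, W, x)
    return np.log(z @ z)

i = 2
z = np.einsum('i,iab,ib->a', c, W, x)
# Closed form: (2 c_ij / ||z_j||^2) z_j x_i^T
analytic = 2.0 * c[i] / (z @ z) * np.outer(z, x[i])

# Central finite differences over every entry of W_ij.
numeric = np.zeros_like(analytic)
h = 1e-6
for a in range(d_out):
    for b in range(d_in):
        Wp, Wm = W.copy(), W.copy()
        Wp[i, a, b] += h
        Wm[i, a, b] -= h
        numeric[a, b] = (log_norm2(Wp) - log_norm2(Wm)) / (2 * h)
```

Multiplying `analytic` by $\|z_{j}\|^{2}/\left(1+\|z_{j}\|^{2}\right)$ gives the per-sample data term $2c_{ij}z_{j}x_{i}^{T}/\left(1+\|z_{j}\|^{2}\right)$ of the final gradient.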

Thus, beginning with the data $x^{(l)}_{i}$ , we use dynamic routing to sample $z^{(l+1)}_{j}=\sum_{i}c^{(l)}_{ij}W^{(l)}_{ij}\cdot x^{(l)}_{i}$ where the $c^{(l)}_{ij}$ ’s are determined by routing by agreement and $\sum_{i}c^{(l)}_{ij}=1$ .

We then squash this $x^{(l+1)}_{j}=\textnormal{squash}\left(z^{(l+1)}_{j}\right)$ and run dynamic routing in reverse to produce $\tilde{z}^{(l)}_{i}=\sum_{j}c^{(l)}_{ij}W^{(l)T}_{ij}\cdot x^{(l+1)}_{j}$ , using the same $c^{(l)}_{ij}$ ’s as before, with $\sum_{i}c^{(l)}_{ij}=1$ . This is squashed producing $\tilde{x}^{(l)}_{i}$ . Finally, we repeat the first step, again with the same $c^{(l)}_{ij}$ ’s such that $\sum_{i}c^{(l)}_{ij}=1$ to produce $\tilde{z}^{(l+1)}_{j}=\sum_{i}c^{(l)}_{ij}W^{(l)}_{ij}\cdot\tilde{x}^{(l)}_{i}$ and its squashed counterparts $\tilde{x}^{(l+1)}_{j}$ . This procedure is summarized in Algorithm 1.

We then use these values, obtained by what we refer to as mixing by dynamic routing, for the gradient in Equation 18. As is usual, instead of maximizing the log-likelihood directly we minimize the contrastive divergence, and so we only run this mixing once. The complete process is summarized in Algorithm 1.
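The up, down, up pass and the resulting weight update can be sketched in NumPy as follows. This is an illustrative reading of the procedure, not the author's implementation: the function name `cd1_capsule_update`, the learning rate, the use of a single data point, and passing the routing weights `c` in as a fixed array (uniform in the usage example) are all assumptions.

```python
import numpy as np

def squash(z, eps=1e-12):
    n = np.linalg.norm(z, axis=-1, keepdims=True)
    return (n**2 / (1.0 + n**2)) * z / (n + eps)

def cd1_capsule_update(x, W, c, lr=0.01):
    """x: (I, d_in); W: (I, J, d_out, d_in); c: (I, J) with c.sum(axis=0) == 1.

    Returns (W_new, grad) for one contrastive-divergence step.
    """
    # Up: z_j = sum_i c_ij W_ij x_i, then squash.
    z = np.einsum('ij,ijab,ib->ja', c, W, x)
    x_up = squash(z)
    # Down with the transposed maps and the same routing weights.
    z_down = np.einsum('ij,ijab,ja->ib', c, W, x_up)
    x_tilde = squash(z_down)
    # Up again, starting from the reconstruction.
    z_tilde = np.einsum('ij,ijab,ib->ja', c, W, x_tilde)

    def stat(zz, xx):
        # 2 c_ij z_j x_i^T / (1 + ||z_j||^2), shape (I, J, d_out, d_in)
        scale = 1.0 / (1.0 + np.sum(zz**2, axis=-1))
        return 2.0 * np.einsum('ij,j,ja,ib->ijab', c, scale, zz, xx)

    # Data term minus the one-step model term.
    grad = stat(z, x) - stat(z_tilde, x_tilde)
    return W + lr * grad, grad

# Toy usage with placeholder uniform routing weights.
rng = np.random.default_rng(0)
I, J, d_in, d_out = 6, 2, 3, 4
x = squash(rng.standard_normal((I, d_in)))
W = 0.1 * rng.standard_normal((I, J, d_out, d_in))
c = np.full((I, J), 1.0 / I)
W_new, grad = cd1_capsule_update(x, W, c)
```

In practice the two statistics would be averaged over a batch, and `c` would come from routing by agreement on the data rather than being fixed in advance.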

4 Experiments

4.1 Network architecture

The outline of the experimental architecture can be seen in Figure 1. Other than the input/output channels being greyscale or RGB, the same architecture was used for both experiments. First, an autoencoding convolutional network [11] with dropout [12] was used to learn the convolutional filter weights in an unsupervised way. The first filter bank is of size $\left[9,9,1,128\right]$ (for height $\times$ width $\times$ channels in $\times$ channels out), followed by a leaky ReLU activation, while the second filter bank is $\left[9,9,128,128\right]$ followed by a leaky ReLU activation. The transposed filters are used in the decoder, with first a leaky ReLU and then a sigmoid at the output, to scale the pixel intensities between $0$ and $1$ .

Once the convolutional autoencoder is sufficiently trained, these weights are held fixed and the $128$ -dimensional hidden layer is reshaped to $6\times 6\times 128/8=576$ capsules, each being $8$ -dimensional, and then squashed along these $8$ dimensions. Using the unsupervised product of expert capsules training algorithm described above, these capsules are mapped to $20$ capsules, each of $16$ -dimensions, to learn the $W^{(l)}_{ij}$ ’s in $z^{(l+1)}_{j}=\sum_{i}c^{(l)}_{ij}W^{(l)}_{ij}\cdot x^{(l)}_{i}$ , where $i=1,2,\dots,576$ and $j=1,2,\dots,20$ .
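The shapes involved can be made concrete with a short NumPy sketch. The random array stands in for the encoder output, and grouping eight contiguous channels into each capsule is an assumption; the text does not say how the $128$ channels are split. Uniform routing weights are used only to show the shape of the result.

```python
import numpy as np

def squash(z, eps=1e-12):
    n = np.linalg.norm(z, axis=-1, keepdims=True)
    return (n**2 / (1.0 + n**2)) * z / (n + eps)

rng = np.random.default_rng(0)
hidden = rng.standard_normal((6, 6, 128))        # stand-in for the encoder output
capsules = squash(hidden.reshape(-1, 8))         # 6*6*128/8 = 576 capsules, 8-dim each

# Prediction maps from 576 capsules (8-dim) to 20 capsules (16-dim).
W = 0.01 * rng.standard_normal((576, 20, 16, 8))
c = np.full((576, 20), 1.0 / 576)                # placeholder uniform routing weights
z = np.einsum('ij,ijab,ib->ja', c, W, capsules)  # pre-activation capsules, (20, 16)
```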

After the $W^{(l)}_{ij}$’s are trained, these weights are held fixed and we learn a decoder $z^{(l)}_{i}=\sum_{j}e^{(l+1)}_{ji}U^{(l+1)}_{ji}\cdot x^{(l+1)}_{j}$ so that we can sample from the $16$-dimensional capsule space to generate the images. The update rule is $\Delta U^{(l+1)}_{ji}=2e^{(l+1)}_{ji}\left(\frac{z^{(l)}_{i}x^{(l+1)}_{j}}{1+\|z^{(l)}_{i}\|^{2}}-\frac{\tilde{z}^{(l)}_{i}\tilde{x}^{(l+1)}_{j}}{1+\|\tilde{z}^{(l)}_{i}\|^{2}}\right)$ , where (the data) $z^{(l)}_{i}$ and $x^{(l+1)}_{j}$ are fixed by the $W^{(l)}_{ij}$’s, while (the model) $\tilde{z}^{(l)}_{i}=\sum_{j}e^{(l+1)}_{ji}U^{(l+1)}_{ji}\cdot x^{(l+1)}_{j}$ and $\tilde{x}^{(l+1)}_{j}=\textnormal{squash}\left(\sum_{i}e^{(l+1)}_{ji}U^{(l+1)T}_{ji}\cdot\tilde{x}^{(l)}_{i}\right)$ are generated with the $U^{(l+1)}_{ji}$’s. We implemented this network in TensorFlow [13], and training on a single GPU took about fifteen minutes.

4.2 Routing-Weighted Product of Expert Neurons

In Figure 2 there are $20$ columns, one for each of the $20$ capsules, and the $4$ rows are random samples from a $16$-dimensional Gaussian distribution, with the other $19$ capsules having $0$ as their input. It is seen that each of the individual capsules learns specific objects, with different random samples generating images with different instantiation parameters of similar objects, such as stroke thicknesses and angles, or sleeves coming out of pants to create shirts and jackets. Interestingly, some of the images are negatives of what is expected; if the input $x\sim N\left(0,1\right)$ is replaced with $-x$ , the images that were negatives are no longer negatives, but the ones that were normal become negatives, as is seen in Figures 2(a) and 2(c).

We believe this stems from the fact that the $16$-dimensional point, as understood by the capsule, lives on the manifold $S^{15}\times\left(0,1\right)$ , where $S^{15}$ is the $15$-dimensional sphere of angular orientations, and the $\left(0,1\right)$ is the capsule magnitude representing off or on. Not all object instantiation parameters should live on the sphere. For example, rotating an object in a circle should live on $S^{1}$ , but increasing the size of an object should live on $\mathbb{R}$ , since an object’s size shouldn’t return to where it started if the size is monotonically increasing, as would happen on $S^{1}$ . In this way the pixel intensities can be reversed $180^{\circ}$ since they are coming from $S^{15}$ as opposed to $\mathbb{R}^{15}$ . When we restrict our sampling domain to the half of $S^{15}$ that is visited during training the problem is alleviated, as seen in Figures 2(b) and 2(d), suggesting that this idea is in fact correct.

A more elegant solution, although outside the scope of this paper, would be to design a capsule on, say, $S^{m}\times\mathbb{R}^{n}\times\left(0,1\right)$ . Nevertheless, the unsupervised learning procedure developed here for the capsules is distinctly learning recognizable objects.

5 Conclusions

This work developed capsule networks with routing by agreement within a product of experts formulation. Observing that the magnitudes of the hidden layer capsules are not connected, we design an energy function that is consistent with dynamic routing, in the sense that the binary action of a hidden layer capsule firing is equal to the probability of a capsule being on, when calculated with dynamic routing by agreement. We then use dynamic routing to mix the distribution. The gradient of the log likelihood is found and used to minimize the contrastive divergence. A simple network architecture is set up to test this unsupervised learning algorithm, and is able to generate images similar to those of the datasets it was trained on.

Bibliography

[1] Geoffrey E Hinton. Training products of experts by minimizing contrastive divergence. Neural Computation, 14(8):1771–1800, 2002.
[2] Geoffrey E Hinton, Alex Krizhevsky, and Sida D Wang. Transforming auto-encoders. In International Conference on Artificial Neural Networks, pages 44–51. Springer, 2011.
[3] Sara Sabour, Nicholas Frosst, and Geoffrey E Hinton. Dynamic routing between capsules. In Advances in Neural Information Processing Systems, pages 3856–3866, 2017.
[4] Warren S McCulloch and Walter Pitts. A logical calculus of the ideas immanent in nervous activity. The Bulletin of Mathematical Biophysics, 5(4):115–133, 1943.
[5] David E Rumelhart, Geoffrey E Hinton, and Ronald J Williams. Learning internal representations by error propagation. Technical report, California Univ San Diego La Jolla Inst for Cognitive Science, 1985.
[6] Geoffrey E Hinton, Sara Sabour, and Nicholas Frosst. Matrix capsules with EM routing. 2018.
[7] Ayush Jaiswal, Wael AbdAlmageed, Yue Wu, and Premkumar Natarajan. CapsuleGAN: Generative adversarial capsule network. In Proceedings of the European Conference on Computer Vision (ECCV), pages 0–0, 2018.
[8] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems, pages 2672–2680, 2014.