Training products of expert capsules with mixing by dynamic routing
Michael Hauser

Children’s Hospital of Philadelphia
Michael Hauser has been supported by a postdoctoral fellowship at the Center for Autism Research in the Children’s Hospital of Philadelphia. Any opinions, findings and conclusions or recommendations expressed in this publication are those of the author and do not necessarily reflect the views of the sponsoring agencies.
Abstract
This study develops an unsupervised learning algorithm for products of expert capsules with dynamic routing. Analogous to binary-valued neurons in Restricted Boltzmann Machines, the magnitude of a squashed capsule firing takes values between zero and one, representing the probability of the capsule being on. This analogy motivates the design of an energy function for capsule networks. In order to have an efficient sampling procedure where hidden layer nodes are not connected, the energy function is made consistent with dynamic routing in the sense of the probability of a capsule firing, and inference on the capsule network is computed with the dynamic routing between capsules procedure. In order to optimize the log-likelihood of the visible layer capsules, the gradient is found in terms of this energy function. The developed unsupervised learning algorithm is used to train a capsule network on standard vision datasets, and is able to generate realistic looking images from its learned distribution.
1 Introduction
Products of experts models [1] were designed as an alternative to mixture models, where the sum over distributions is replaced by a product. Because there is a product, as opposed to a sum, if an individual expert votes a low probability, then the entire product will have low probability. This is in contrast to a standard mixture model, where it only takes a single distribution to vote high probability for the data point to have high probability of existing, effectively ignoring the votes of all of the other distributions. A product of experts allows each expert to specialize on a single aspect of the data, and to disqualify a data point from existing only on this single aspect, whereas a sum of experts requires each distribution to model all aspects of the data, since within the sum of experts each distribution is essentially acting independently of the others.
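The contrast between a product and a mixture can be made concrete with a small numerical sketch. The two one-dimensional Gaussian experts, their parameters, and the function names below are illustrative assumptions, not taken from the paper, and the product is left unnormalized.

```python
import math

def gaussian_pdf(x, mean, std):
    """Density of a 1-D Gaussian expert."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))

def mixture_density(x, experts):
    """Equal-weight mixture: the average of the expert densities."""
    return sum(gaussian_pdf(x, m, s) for m, s in experts) / len(experts)

def product_score(x, experts):
    """Unnormalized product of experts: the product of the expert densities."""
    score = 1.0
    for m, s in experts:
        score *= gaussian_pdf(x, m, s)
    return score

# Two hypothetical experts that disagree about where the data lives.
experts = [(0.0, 1.0), (4.0, 1.0)]

# At x = 0 the first expert votes high and the second votes very low.
mix_at_0 = mixture_density(0.0, experts)
prod_at_0 = product_score(0.0, experts)

# At x = 2 both experts give a moderate vote.
mix_at_2 = mixture_density(2.0, experts)
prod_at_2 = product_score(2.0, experts)
```

Under these assumed parameters the mixture assigns more density to x = 0, where a single expert is confident, while the product favors x = 2, where neither expert objects.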
Capsule networks [2] with dynamic routing [3] depart in many significant ways from neural networks. Artificial neural networks originate from the McCulloch-Pitts model [4], where a scalar neuron fires if the weighted sum of the neuron’s scalar inputs reaches a threshold. There are exceptions: the outputs of the softmax and max-pool functions, for example, depend not only on the previous layer inputs but also on the outputs of the other nodes at the current layer. Besides these exceptions, most standard activation functions, such as the sigmoid and ReLU, follow the McCulloch-Pitts design.
As an alternative to scalar-valued neurons, capsules are vector valued, which has at least two advantages. First, because they are vector valued, much more complex routing mechanisms, such as routing by agreement [3], can be used to pass data forward through the network, as opposed to a simple weighted sum of inputs. Second, vectors have a magnitude and an orientation, which allows capsules to fire if their magnitude reaches a threshold, while their orientation can be used to represent the instantiation parameters of the data.
A major difficulty with capsule networks is that, because they are still in their infancy, far fewer algorithmic tools exist to train and develop them. Capsule networks have been trained via backpropagation [5, 2, 3] as well as expectation-maximization [6]. For unsupervised learning they have been trained as an autoencoder [2] as well as in a generative-adversarial setting [7, 8], but these approaches are built on backpropagation, which itself was designed for neural networks without dynamic routing. In supervised settings they have been paired with a fully-connected decoder [3, 6], but this is to compel the network to learn invariant vector representations and to help defend against adversarial attacks [9].
For these reasons we design a product of expert capsules in an analogous way to a product of expert neurons. From this, an efficient unsupervised learning procedure is developed to train this product of expert capsules in a bottom up fashion, using the dynamic routing procedure to mix the model distribution. With a given energy function compatible with dynamic routing, the contrastive divergence is minimized to learn the underlying density.
2 Algorithm review
This section briefly reviews products of experts as well as dynamic routing between capsules.
2.1 Product of experts learning
As mentioned in the Introduction, product of experts models combine densities as products as opposed to sums [1, 10]. If we divide the experts into a visible layer $v$ and a hidden layer $h$ of binary-valued experts connected by weights $w_{ij}$, the energy of this configuration is given as
$$E(v, h) = -\sum_{i,j} v_i w_{ij} h_j \qquad (1)$$
From this, the probability of a visible-hidden configuration is $p(v, h) = e^{-E(v, h)} / Z$, and marginalized over the hidden layer gives $p(v) = \frac{1}{Z} \sum_h e^{-E(v, h)}$. Because there are no intra-layer connections, we can efficiently sample the hidden layer nodes as $p(h_j = 1 \mid v) = \sigma\big(\sum_i v_i w_{ij}\big)$ and the visible layer nodes as $p(v_i = 1 \mid h) = \sigma\big(\sum_j w_{ij} h_j\big)$. The gradient of the log-likelihood is given by
$$\frac{\partial \log p(v)}{\partial w_{ij}} = \langle v_i h_j \rangle_{\mathrm{data}} - \langle v_i h_j \rangle_{\mathrm{model}} \qquad (2)$$
The first term on the right-hand side of this equation is generated by the data, while the second term is the expectation over the product of experts model itself. Estimating the second term exactly with MCMC requires mixing the density to infinity. Instead one usually minimizes the contrastive divergence [1], which requires running the Markov chain only once.
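A single step of contrastive divergence for a small binary model can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: the model is bias-free, and the weight values, the data vector, the learning rate, and all function names are hypothetical.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def hidden_probs(v, W):
    """p(h_j = 1 | v) for a bias-free binary model with weights W[i][j]."""
    return [sigmoid(sum(v[i] * W[i][j] for i in range(len(v))))
            for j in range(len(W[0]))]

def visible_probs(h, W):
    """p(v_i = 1 | h) for the same model."""
    return [sigmoid(sum(W[i][j] * h[j] for j in range(len(h))))
            for i in range(len(W))]

def bernoulli(probs, rng):
    """Draw independent binary states, one per probability."""
    return [1.0 if rng.random() < p else 0.0 for p in probs]

def cd1_gradient(v0, W, rng):
    """Estimate <v_i h_j>_data - <v_i h_j>_model with a single Gibbs step."""
    ph0 = hidden_probs(v0, W)                   # data phase
    h0 = bernoulli(ph0, rng)
    v1 = bernoulli(visible_probs(h0, W), rng)   # one-step reconstruction
    ph1 = hidden_probs(v1, W)                   # model phase
    return [[v0[i] * ph0[j] - v1[i] * ph1[j] for j in range(len(ph0))]
            for i in range(len(v0))]

rng = random.Random(0)
W = [[0.1, -0.2], [0.0, 0.3], [-0.1, 0.2]]  # hypothetical 3 visible x 2 hidden weights
v0 = [1.0, 0.0, 1.0]                        # one hypothetical binary data vector
grad = cd1_gradient(v0, W, rng)

learning_rate = 0.1                         # hypothetical step size
W = [[W[i][j] + learning_rate * grad[i][j] for j in range(2)] for i in range(3)]
```

Using the hidden probabilities rather than sampled states in the two correlation terms is a common variance-reduction choice and is assumed here for simplicity.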
2.2 Routing by agreement in capsule networks
This section will briefly review capsule networks and dynamic routing [3].
At layer $l$, given a collection of vector-valued capsules $\{v_i^{l}\}$ and matrix-valued prediction maps $\{W_{ij}^{l}\}$, the pre-activation, vector-valued predicted capsule $j$ from capsule $i$ is $\hat{s}_{j|i}^{l+1} = W_{ij}^{l} v_i^{l}$, where at layer $l+1$ we have the collection of capsules $\{v_j^{l+1}\}$. We then take a weighted average of all of the predictions to yield the final, pre-activation, vector-valued capsule at layer $l+1$, i.e.
$$s_j^{l+1} = \sum_i c_{ij}\, \hat{s}_{j|i}^{l+1} \qquad (3)$$
where the scalar-valued $c_{ij}$’s are determined by routing by agreement and each lower layer capsule can only make a finite amount of predictions, i.e. $\sum_j c_{ij} = 1$.
Routing by agreement is a procedure that iteratively reweights each of the individual predictions to yield a final, collective prediction. If an individual prediction agrees well with the collective prediction, then we would like to increase the routing weight connecting these capsules. Agreement has been measured as both the inner product [3] and the cosine distance [6] between the individual predictions and the squashed collective prediction. In this paper we use the cosine distance. The updated weights are then renormalized, and the process is repeated. (If the cosine distance is used, then squashing can happen outside this loop, since the cosine distance takes the inner product of unit vectors and so the magnitude is not needed.)
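The routing loop described above can be sketched in a few lines. This is a hedged illustration rather than the paper's procedure: it assumes the logit-plus-softmax renormalization of [3], measures agreement as the inner product of unit vectors, and uses hypothetical names (`route_by_agreement`, `predictions`) and toy prediction vectors.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cosine(a, b, eps=1e-12):
    """Inner product of the two unit vectors; eps guards against zero norms."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)) + eps)

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def collective(predictions, c):
    """s_j = sum_i c[i][j] * predictions[i][j] for every upper capsule j."""
    n_in, n_out = len(predictions), len(predictions[0])
    dim = len(predictions[0][0])
    return [[sum(c[i][j] * predictions[i][j][d] for i in range(n_in))
             for d in range(dim)] for j in range(n_out)]

def route_by_agreement(predictions, num_iters=3):
    """predictions[i][j] is the vector lower capsule i predicts for upper capsule j."""
    n_in, n_out = len(predictions), len(predictions[0])
    logits = [[0.0] * n_out for _ in range(n_in)]
    c = [softmax(row) for row in logits]      # each row sums to one
    s = collective(predictions, c)
    for _ in range(num_iters):
        for i in range(n_in):
            for j in range(n_out):
                logits[i][j] += cosine(predictions[i][j], s[j])
        c = [softmax(row) for row in logits]  # renormalize over upper capsules
        s = collective(predictions, c)
    return c, s

# Toy predictions: 3 lower capsules, 2 upper capsules, 2-D vectors.
predictions = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[1.0, 0.1], [0.0, -1.0]],
    [[-1.0, 0.0], [0.0, 1.0]],
]
c, s = route_by_agreement(predictions)
```

In this toy input, lower capsule 1 agrees with the consensus for upper capsule 0 and disagrees for upper capsule 1, so its routing weight shifts toward capsule 0; lower capsule 2 behaves the opposite way.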
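The squashing function [3] we use is defined as follows:
$$\operatorname{squash}(s) = \frac{\lVert s \rVert^2}{1 + \lVert s \rVert^2}\, \frac{s}{\lVert s \rVert} \qquad (4)$$
The intention of the squashing function is to scale the magnitude of $s$ between $0$ and $1$ while keeping the orientation the same. We can then interpret a capsule as on if the magnitude of the squashed capsule is close to $1$, and as off if the magnitude is close to $0$:
$$\lVert \operatorname{squash}(s) \rVert = \frac{\lVert s \rVert^2}{1 + \lVert s \rVert^2} \in [0, 1) \qquad (5)$$
Later in the paper we need the inverse map of the squash map, which we call unsquash:
$$\operatorname{unsquash}(v) = \sqrt{\frac{\lVert v \rVert}{1 - \lVert v \rVert}}\, \frac{v}{\lVert v \rVert} \qquad (6)$$
This is just the pre-image of the squashed vector, so $\operatorname{unsquash}(\operatorname{squash}(s)) = s$, in a similar way as the logit map is the inverse map of the sigmoid.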
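The squash map of [3] and its inverse can be checked numerically. The sketch below is an illustration only: the function names, the small `eps` guard, and the test vector are assumptions, and the inverse is derived by solving the squashed magnitude for the original norm.

```python
import math

def squash(s, eps=1e-12):
    """Scale the norm of s to ||s||^2 / (1 + ||s||^2), keeping its direction."""
    sq = sum(x * x for x in s)
    scale = sq / (1.0 + sq) / (math.sqrt(sq) + eps)
    return [scale * x for x in s]

def unsquash(v, eps=1e-12):
    """Inverse of squash; requires ||v|| < 1.

    If ||v|| = n^2 / (1 + n^2), then n^2 = ||v|| / (1 - ||v||).
    """
    norm = math.sqrt(sum(x * x for x in v))
    original_norm = math.sqrt(norm / (1.0 - norm))
    return [original_norm * x / (norm + eps) for x in v]

s = [3.0, 4.0]                 # hypothetical pre-activation capsule, norm 5
v = squash(s)                  # squashed capsule, norm 25/26
v_norm = math.sqrt(sum(x * x for x in v))
recovered = unsquash(v)        # should return to [3, 4] up to rounding
```

The squashed norm plays the role of a sigmoid output and the unsquash map plays the role of the logit.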
3 Capsule networks as product of experts
This section develops the product of expert capsules model, where a capsule being either on or off, measured by the magnitude of the capsule vector after squashing, is analogous to the binary action of a neuron firing in a Restricted Boltzmann Machine.
We denote an individual capsule by for . We also write to be the collection of all capsules on that layer. Similarly is the norm of capsule for , and .
3.1 Energy function and conditional probabilities
Before beginning we would like to make a subtle, yet important point. We note that if we want the magnitude of the squashed capsule to play an analogous role to the probability of a binary neuron firing, $p = \sigma(x)$ for $x \in \mathbb{R}$, the squashing part of the squashing function can be rewritten as a sigmoid activation, $\frac{\lVert s \rVert^2}{1 + \lVert s \rVert^2} = \sigma(\log \lVert s \rVert^2)$. Intuitively, $\lVert s \rVert^2 \in \mathbb{R}_{+}$, whereas the sigmoid function should take arguments over all of $\mathbb{R}$, so taking the logarithm of the norm maps positive numbers to all of $\mathbb{R}$.
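The rewriting of the squashed magnitude as a sigmoid of a log-norm follows from a short calculation. In the sketch below, $x$ is a stand-in for the squared norm of the pre-squash capsule; the symbol is an assumption made for illustration.

```latex
\sigma(\log x)
  = \frac{1}{1 + e^{-\log x}}
  = \frac{1}{1 + 1/x}
  = \frac{x}{1 + x},
  \qquad x > 0 .
```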
Define the energy across layers as follows:
[TABLE]
The probability of a certain configuration is then defined
[TABLE]
where is the normalizing partition function. This is a product of expert capsules:
[TABLE]
We marginalize this distribution over the layer capsules firing:
[TABLE]
In order to have an efficient means of sampling the hidden layer, it is necessary that at layer the nodes representing the capsules firing are not connected, so that the total probability is equal to the product of individual capsule probabilities, and so using equations 9 and 10 we have:
[TABLE]
From here it is straightforward to show that the conditional probability of capsule firing is equal to the magnitude of the squashed capsule:
[TABLE]
In this way, the energy function defined in Equation 7 is consistent with routing by agreement, in the sense that the magnitude of the squashed capsule represents the probability that the individual capsule is on.
[TABLE]
where again for .
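To see how an energy can make the firing probability equal to the squashed magnitude, consider a single binary indicator $h_j \in \{0, 1\}$ attached to a capsule with pre-activation $s_j$, and assume, purely for illustration, that it contributes the term $-h_j \log \lVert s_j \rVert^2$ to the energy. This assumed form and the symbols $h_j$ and $s_j$ are not quoted from the paper; the sketch only shows that such a term is consistent with the statement above.

```latex
\begin{aligned}
p(h_j = 1 \mid s_j)
  &= \frac{e^{\log \lVert s_j \rVert^2}}{e^{0} + e^{\log \lVert s_j \rVert^2}} \\
  &= \frac{\lVert s_j \rVert^2}{1 + \lVert s_j \rVert^2} \\
  &= \lVert \operatorname{squash}(s_j) \rVert .
\end{aligned}
```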
Because of this consistency, we use dynamic routing between capsules for mixing in the MCMC estimation of the distribution. Note that we do not want to directly sample from a graphical model . This is because is a squashed vector, implying that the components of the vector are dependent on each other from the squashing, so graphically the nodes at that given layer are connected, meaning that we cannot efficiently sample these nodes with MCMC. By explicitly decoupling into its magnitude and orientation, we are able to decouple the parts of the dynamic routing that we want to use from the parts we do not want to use, allowing us to efficiently sample the magnitude from this distribution, since graphically these nodes are not connected, from Equation 11.
For inference, the energy model only requires , which holds by construction. We then sample as , from the dynamic routing, and again the norm of this is consistent with the energy model. Taking the norm of introduces a rotational invariance to the vector, which at first may seem like a problem since the instantiation parameters are stored in the rotational angles of the capsule. In fact this is not an issue, as we will see in Section 3.2, because we are taking the gradient of the log likelihood, and the gradient of the norm of a vector is dependent on the orientation of the vector itself, not just its magnitude. In this way, during the gradient descent, information from the orientation of the capsule is used to update the model parameters.
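The claim that the gradient of a norm carries orientation information can be checked directly: the gradient of the Euclidean norm is the unit vector in the capsule's direction. The following sketch, with hypothetical function names and vectors, compares two capsules of equal magnitude and different orientation.

```python
import math

def norm(s):
    return math.sqrt(sum(x * x for x in s))

def grad_norm(s):
    """Analytic gradient of ||s|| with respect to s: the unit vector s / ||s||."""
    n = norm(s)
    return [x / n for x in s]

def numerical_grad(f, s, h=1e-6):
    """Central finite differences, used only to check the analytic form."""
    grad = []
    for k in range(len(s)):
        up = list(s)
        down = list(s)
        up[k] += h
        down[k] -= h
        grad.append((f(up) - f(down)) / (2.0 * h))
    return grad

a = [3.0, 4.0]   # two hypothetical capsules with the same norm (5) ...
b = [5.0, 0.0]   # ... and different orientations
grad_a = grad_norm(a)
grad_b = grad_norm(b)
check_a = numerical_grad(norm, a)
```

The two norms are equal, yet the gradients differ, so a parameter update driven by these gradients depends on each capsule's orientation.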
3.2 Gradient of the log-likelihood
In order to optimize our parameter weights we need the gradient of the log-likelihood to be tractably computable; the log-likelihood is given by:
[TABLE]
We can put the gradient of the log-likelihood in a compact form using the fact that .
[TABLE]
Computing this is still not tractable as it involves an exponential number of sums. However, since the layer capsule activations are not connected, from Equation 11, we can apply an analogous factorization trick to that which is used in learning Restricted Boltzmann Machines. First we reduce the energy function:
[TABLE]
Using this, we are in a position to find the tractable gradient for the product of expert capsules:
[TABLE]
where is analogous notation to the RBM case [1], and refers to all capsules at layer other than the capsule.
To further reduce Equation 17, where we write , one has . Similarly, , and we have the final form of the gradient:
[TABLE]
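For RBM learning, this is analogous to the update rule $\Delta w_{ij} \propto \langle v_i h_j \rangle_{\mathrm{data}} - \langle v_i h_j \rangle_{\mathrm{model}}$.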
Thus, beginning with the data , we use dynamic routing to sample where the ’s are determined by routing by agreement and .
We then squash this and run dynamic routing in reverse to produce , using the same ’s as before, with . This is squashed producing . Finally, we repeat the first step, again with the same ’s such that to produce and its squashed counterparts . This procedure is summarized in Algorithm 1.
We then use these values, obtained by what we refer to as mixing by dynamic routing, for the gradient in Equation 18. As is usual, instead of maximizing the log-likelihood directly we minimize the contrastive divergence, and so we only run this mixing once.
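The up, down, up pass with fixed routing weights can be sketched as below. This is an illustration under loud assumptions, not the paper's Algorithm 1: the reverse pass is assumed to reuse the same routing weights with the transposed prediction maps, the routing weights are taken as given, and all names, sizes, and numerical values are hypothetical. The sketch stops at the sampled quantities and does not compute the gradient.

```python
import math

def matvec(M, x):
    return [sum(m * xi for m, xi in zip(row, x)) for row in M]

def transpose(M):
    return [list(col) for col in zip(*M)]

def squash(s, eps=1e-12):
    sq = sum(x * x for x in s)
    scale = sq / (1.0 + sq) / (math.sqrt(sq) + eps)
    return [scale * x for x in s]

def route_up(lower, W, c):
    """Upper pre-activations: s_j = sum_i c[i][j] * W[i][j] @ lower[i]."""
    n_in, n_out = len(W), len(W[0])
    out = []
    for j in range(n_out):
        preds = [matvec(W[i][j], lower[i]) for i in range(n_in)]
        out.append([sum(c[i][j] * preds[i][d] for i in range(n_in))
                    for d in range(len(preds[0]))])
    return out

def route_down(upper, W, c):
    """ASSUMPTION: the reverse pass reuses c with the transposed maps W[i][j]^T."""
    n_in, n_out = len(W), len(W[0])
    out = []
    for i in range(n_in):
        preds = [matvec(transpose(W[i][j]), upper[j]) for j in range(n_out)]
        out.append([sum(c[i][j] * preds[j][d] for j in range(n_out))
                    for d in range(len(preds[0]))])
    return out

def mix_once(v0, W, c):
    h0 = [squash(s) for s in route_up(v0, W, c)]      # data phase, upward
    v1 = [squash(u) for u in route_down(h0, W, c)]    # reconstruction, downward
    h1 = [squash(s) for s in route_up(v1, W, c)]      # model phase, upward
    return h0, v1, h1

# Hypothetical toy sizes: 2 lower capsules (dim 2), 2 upper capsules (dim 3).
W = [[[[0.5, 0.1], [0.0, 0.4], [0.2, -0.3]] for _ in range(2)] for _ in range(2)]
c = [[0.5, 0.5], [0.5, 0.5]]   # routing weights, held fixed for all three passes
v0 = [[0.6, 0.0], [0.0, 0.6]]
h0, v1, h1 = mix_once(v0, W, c)
```

The pairs (v0, h0) and (v1, h1) would then stand in for the data and model statistics, in the same way as the two phases of contrastive divergence in a Restricted Boltzmann Machine.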
4 Experiments
4.1 Network architecture
The outline of the experimental architecture can be seen in Figure 1. Other than the input/output channels being greyscale and RGB, the same architecture was used for both experiments. First, an autoencoding convolutional network [11] with dropout [12] was used to learn the convolutional filter weights in an unsupervised way. The first filter bank is of size (for height × width × channels in × channels out), followed by a leaky ReLU activation, while the second filter bank is followed by a leaky ReLU activation. The transposed filters are used in the decoder, with first a leaky ReLU and then a sigmoid at the output, to scale the pixel intensities between 0 and 1.
Once the convolutional autoencoder is sufficiently trained, these weights are held fixed and the -dimensional hidden layer is reshaped to capsules, each being -dimensional, and then squashed along these dimensions. Using the unsupervised product of expert capsules training algorithm described above, these capsules are mapped to capsules, each of -dimensions, to learn the ’s in , where and .
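The reshape-then-squash step can be sketched as follows. The sizes used here (a 12-dimensional hidden layer split into 3 capsules of 4 dimensions) are hypothetical placeholders chosen only to keep the example small, and the function names are assumptions.

```python
import math

def to_capsules(flat, num_capsules, capsule_dim):
    """Split a flat hidden vector into num_capsules vectors of capsule_dim each."""
    if len(flat) != num_capsules * capsule_dim:
        raise ValueError("hidden layer size must equal num_capsules * capsule_dim")
    return [flat[k * capsule_dim:(k + 1) * capsule_dim]
            for k in range(num_capsules)]

def squash(s, eps=1e-12):
    """Scale the norm of s into [0, 1) while keeping its direction."""
    sq = sum(x * x for x in s)
    scale = sq / (1.0 + sq) / (math.sqrt(sq) + eps)
    return [scale * x for x in s]

hidden = [0.1 * k for k in range(12)]   # hypothetical flattened hidden layer
capsules = [squash(cap)
            for cap in to_capsules(hidden, num_capsules=3, capsule_dim=4)]
capsule_norms = [math.sqrt(sum(x * x for x in cap)) for cap in capsules]
```

Squashing is applied per capsule, along the capsule dimension, so each capsule's norm can be read as an on/off probability independently of the others.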
After the ’s are trained, these weights are held fixed and we learn a decoder so that we can sample from the -dimensional capsule space to generate the images. The update rule is , where (the data) and are fixed by the ’s, while the (the model) and are generated with the ’s. We implemented this network in TensorFlow [13], and training on a single GPU took about fifteen minutes.
4.2 Routing-Weighted Product of Expert Neurons
In Figure 2 there are columns for each of the capsules, and the rows are random samples from a -dimensional Gaussian distribution, with the other capsules having 0 as their input. It is seen that each of the individual capsules learns specific objects, with different random samples generating images with different instantiation parameters of similar objects, such as stroke thicknesses and angles, or sleeves coming out of pants to create shirts and jackets. Interestingly some of the images are negatives of what is expected, where if the input is replaced with the image generated are no longer negatives, but the ones that were normal become negatives, as is seen in Figures 2(a) and 2(c).
We believe this stems from the fact that the dimensional point, as understood by the capsule, lives on the manifold , where is the -dimensional sphere of angular orientations, and the is the capsule magnitude representing off or on. Not all object instantiation parameters should live on the sphere. For example rotating an object in a circle should live on , but increasing the size of an object should live on , since an object’s size shouldn’t return to where it started if the size is monotonically increasing, as would happen on . In this way the pixel intensities can be reversed since they are coming from as opposed to . When we restrict our sampling domain to the half of that is visited during training the problem is alleviated, as seen in Figures 2(b) and 2(d), suggesting that this idea is in fact correct.
A more elegant solution, although outside the scope of this paper, would be to design a capsule on, say, . Nevertheless, the unsupervised learning procedure developed here for the capsules is distinctly learning recognizable objects.
5 Conclusions
This work developed capsule networks with routing by agreement within a product of experts formulation. Observing that the magnitudes of the hidden layer capsules are not connected, we design an energy function that is consistent with dynamic routing, in the sense that the binary action of a hidden layer capsule firing is equal to the probability of a capsule being on, when calculated with dynamic routing by agreement. We then use dynamic routing to mix the distribution. The gradient of the log likelihood is found and used to minimize the contrastive divergence. A simple network architecture is set up to test this unsupervised learning algorithm, and is able to generate images similar to those of the datasets it was trained on.
References
[1] Geoffrey E Hinton. Training products of experts by minimizing contrastive divergence. Neural Computation, 14(8):1771–1800, 2002.
[2] Geoffrey E Hinton, Alex Krizhevsky, and Sida D Wang. Transforming auto-encoders. In International Conference on Artificial Neural Networks, pages 44–51. Springer, 2011.
[3] Sara Sabour, Nicholas Frosst, and Geoffrey E Hinton. Dynamic routing between capsules. In Advances in Neural Information Processing Systems, pages 3856–3866, 2017.
[4] Warren S McCulloch and Walter Pitts. A logical calculus of the ideas immanent in nervous activity. The Bulletin of Mathematical Biophysics, 5(4):115–133, 1943.
[5] David E Rumelhart, Geoffrey E Hinton, and Ronald J Williams. Learning internal representations by error propagation. Technical report, California Univ San Diego La Jolla Inst for Cognitive Science, 1985.
[6] Geoffrey E Hinton, Sara Sabour, and Nicholas Frosst. Matrix capsules with EM routing. 2018.
[7] Ayush Jaiswal, Wael AbdAlmageed, Yue Wu, and Premkumar Natarajan. CapsuleGAN: Generative adversarial capsule network. In Proceedings of the European Conference on Computer Vision (ECCV), 2018.
[8] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems, pages 2672–2680, 2014.
