Masked Capsule Autoencoders
Miles Everett, Mingjun Zhong, and Georgios Leontidis

TL;DR
This paper introduces Masked Capsule Autoencoders, a novel self-supervised pretraining method for Capsule Networks using masked image modelling, significantly improving their performance on complex, realistic datasets.
Contribution
It reformulates Capsule Networks to incorporate masked image modelling pretraining, achieving state-of-the-art results and demonstrating the benefits of self-supervised learning for capsules.
Findings
Capsule Networks benefit from self-supervised pretraining.
Masked Capsule Autoencoders achieve a 9% improvement on the Imagenette dataset.
They reach state-of-the-art results for Capsule Networks on complex images.
Abstract
We propose Masked Capsule Autoencoders (MCAE), the first Capsule Network that utilises pretraining in a modern self-supervised paradigm, specifically the masked image modelling framework. Capsule Networks have emerged as a powerful alternative to Convolutional Neural Networks (CNNs). They have shown favourable properties when compared to Vision Transformers (ViT), but have struggled to effectively learn when presented with more complex data. This has led to Capsule Network models that do not scale to modern tasks. Our proposed MCAE model alleviates this issue by reformulating the Capsule Network to use masked image modelling as a pretraining stage before finetuning in a supervised manner. Across several experiments and ablation studies we demonstrate that, similarly to CNNs and ViTs, Capsule Networks can also benefit from self-supervised pretraining, paving the way for further…
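To make the masked image modelling idea concrete, the following is a minimal NumPy sketch of the generic pretraining setup: split an image into patches, hide a random subset, and score a reconstruction only on the hidden patches. It is not the authors' implementation. The patch size, the mask ratio, the mean-squared-error loss, and every function name here are assumptions chosen for illustration, and the capsule encoder itself is left out.

```python
import numpy as np


def patchify(image, patch_size):
    """Split an (H, W, C) image into non-overlapping, flattened patches."""
    h, w, c = image.shape
    assert h % patch_size == 0 and w % patch_size == 0
    gh, gw = h // patch_size, w // patch_size
    patches = image.reshape(gh, patch_size, gw, patch_size, c)
    patches = patches.transpose(0, 2, 1, 3, 4)  # (gh, gw, p, p, c)
    return patches.reshape(gh * gw, patch_size * patch_size * c)


def random_mask(num_patches, mask_ratio, rng):
    """Return a boolean array where True marks a masked (hidden) patch."""
    num_masked = int(round(num_patches * mask_ratio))
    mask = np.zeros(num_patches, dtype=bool)
    mask[rng.permutation(num_patches)[:num_masked]] = True
    return mask


def masked_reconstruction_loss(pred, target, mask):
    """Mean squared error computed over masked patches only (assumed loss)."""
    per_patch = ((pred - target) ** 2).mean(axis=1)
    return per_patch[mask].mean()


# Illustrative values only: a 32x32 RGB image, 8x8 patches, half of them masked.
rng = np.random.default_rng(0)
image = rng.random((32, 32, 3))
patches = patchify(image, patch_size=8)
mask = random_mask(len(patches), mask_ratio=0.5, rng=rng)
visible = patches[~mask]  # what an encoder would see during pretraining
```

In a full pipeline, an encoder would process `visible`, a decoder would predict all patches, and `masked_reconstruction_loss` would drive pretraining before supervised finetuning.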
Taxonomy
Topics: Handwritten Text Recognition Techniques
Methods: Capsule Network
