TL;DR
GMML is a novel self-supervised learning method for vision transformers that effectively captures contextual information by manipulating groups of tokens, without needing complex training tricks or large batch sizes.
Contribution
The paper introduces GMML, a self-supervised pretraining approach that improves context extraction in vision transformers without requiring momentum encoders or large batches.
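As a rough illustration of what "manipulating groups of tokens" could look like, here is a minimal NumPy sketch. It is not the authors' implementation. This summary does not specify the block-sampling scheme, the type of corruption, or the loss, so the rectangular blocks, the Gaussian-noise replacement, the L1 objective, and the names group_mask, corrupt_tokens, and masked_l1_loss are all assumptions made for illustration.

```python
import numpy as np


def group_mask(grid_h, grid_w, mask_ratio=0.5, max_block=4, rng=None):
    """Return a boolean (grid_h, grid_w) mask made of blocks of adjacent patches.

    Assumption: groups are axis-aligned rectangles sampled uniformly at random
    until at least `mask_ratio` of the patch grid is covered.
    """
    rng = np.random.default_rng() if rng is None else rng
    mask = np.zeros((grid_h, grid_w), dtype=bool)
    target = int(round(mask_ratio * grid_h * grid_w))
    while mask.sum() < target:
        # Sample a block size, clamped so the block always fits in the grid.
        bh = int(rng.integers(1, min(max_block, grid_h) + 1))
        bw = int(rng.integers(1, min(max_block, grid_w) + 1))
        top = int(rng.integers(0, grid_h - bh + 1))
        left = int(rng.integers(0, grid_w - bw + 1))
        mask[top:top + bh, left:left + bw] = True
    return mask


def corrupt_tokens(tokens, mask, rng=None):
    """Replace masked patch tokens with Gaussian noise (an assumed corruption).

    tokens: array of shape (grid_h * grid_w, dim), in row-major patch order.
    """
    rng = np.random.default_rng() if rng is None else rng
    flat = mask.reshape(-1)
    corrupted = tokens.copy()
    corrupted[flat] = rng.standard_normal((int(flat.sum()), tokens.shape[1]))
    return corrupted


def masked_l1_loss(pred, target, mask):
    """Mean absolute error computed over the masked tokens only."""
    flat = mask.reshape(-1)
    return float(np.abs(pred[flat] - target[flat]).mean())


# Usage: a 14x14 patch grid with 8-dimensional toy tokens.
rng = np.random.default_rng(0)
tokens = rng.standard_normal((14 * 14, 8))
mask = group_mask(14, 14, mask_ratio=0.5, rng=rng)
corrupted = corrupt_tokens(tokens, mask, rng=rng)
```

In a real pipeline, a vision transformer would take the corrupted tokens as input and be trained to recover the original content at the masked positions. Masking contiguous blocks rather than isolated patches is meant to hide larger, semantically meaningful regions, so the model has to rely on surrounding context rather than nearby pixels.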
Findings
GMML outperforms existing SSL methods on vision tasks.
It simplifies training by removing the need for momentum encoders.
GMML effectively captures semantic context in images.
Abstract
Vision transformers have generated significant interest in the computer vision community because of their flexibility in exploiting contextual information, whether sharply confined and local or long-range and global. However, they are known to be data-hungry. This has motivated research into self-supervised transformer pretraining, which does not need to decode the semantic information conveyed by labels and link it to image properties, but instead focuses directly on extracting a concise representation of the image data that reflects the notion of similarity and is invariant to nuisance factors. The key vehicle for the self-learning process used by the majority of self-learning methods is the generation of multiple views of the training data and the creation of pretext tasks that use these views to define the notions of image similarity and data integrity. However, this approach…
Taxonomy
Methods: Self-Learning
