MST: Masked Self-Supervised Transformer for Visual Representation
Zhaowen Li, Zhiyang Chen, Fan Yang, Wei Li, Yousong Zhu, Chaoyang, Zhao, Rui Deng, Liwei Wu, Rui Zhao, Ming Tang, Jinqiao Wang

TL;DR
MST introduces a masked self-supervised transformer that captures both local and global image features, improving performance on dense prediction tasks with minimal pre-training epochs.
Contribution
The paper proposes a novel masked token strategy based on self-attention maps and a global decoder, enhancing local feature learning in visual self-supervised pre-training.
Findings
Achieves 76.9% Top-1 accuracy with 300 epochs on ImageNet.
Outperforms supervised and DINO methods at same epoch.
Improves dense prediction results on COCO and Cityscapes.
Abstract
Transformer has been widely used for self-supervised pre-training in Natural Language Processing (NLP) and achieved great success. However, it has not been fully explored in visual self-supervised learning. Meanwhile, previous methods only consider the high-level feature and learning representation from a global perspective, which may fail to transfer to the downstream dense prediction tasks focusing on local features. In this paper, we present a novel Masked Self-supervised Transformer approach named MST, which can explicitly capture the local context of an image while preserving the global semantic information. Specifically, inspired by the Masked Language Modeling (MLM) in NLP, we propose a masked token strategy based on the multi-head self-attention map, which dynamically masks some tokens of local patches without damaging the crucial structure for self-supervised learning. More…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
Taxonomy
TopicsMultimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Advanced Image and Video Retrieval Techniques
MethodsAttention Is All You Need · Vision Transformer · Linear Layer · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Adam · Label Smoothing · Residual Connection · Dense Connections · Softmax
