SERE: Exploring Feature Self-relation for Self-supervised Transformer
Zhong-Yu Li, Shanghua Gao, Ming-Ming Cheng

TL;DR
This paper introduces SERE, a self-supervised learning method for vision transformers that leverages feature self-relations across spatial and channel dimensions to improve representation quality for various vision tasks.
Contribution
The paper proposes a self-supervised learning approach that trains ViT on feature self-relations along the spatial and channel dimensions, instead of relying only on strategies designed for CNNs (e.g., instance-level discrimination) that ignore ViT's relation-modeling properties.
Findings
Improved downstream task performance with SERE
Enhanced relation modeling in ViT
Stable and stronger representations
Abstract
Learning representations with self-supervision for convolutional networks (CNN) has been validated to be effective for vision tasks. As an alternative to CNN, vision transformers (ViT) have strong representation ability with spatial self-attention and channel-level feedforward networks. Recent works reveal that self-supervised learning helps unleash the great potential of ViT. Still, most works follow self-supervised strategies designed for CNN, e.g., instance-level discrimination of samples, but they ignore the properties of ViT. We observe that relational modeling on spatial and channel dimensions distinguishes ViT from other networks. To enforce this property, we explore the feature SElf-RElation (SERE) for training self-supervised ViT. Specifically, instead of conducting self-supervised learning solely on feature embeddings from multiple views, we utilize the feature self-relations,…
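To make the idea of "feature self-relation" concrete, here is a minimal sketch in plain Python. It is not the authors' implementation: the abstract above is truncated, so the specific choices here (dot-product similarity, a row-wise softmax with a temperature, a cross-entropy loss between the relations of two views, and tokens that are already aligned across views) are all assumptions made for illustration. The function names are hypothetical.

```python
import math


def softmax(row, temperature=1.0):
    """Numerically stable softmax over a list of floats."""
    scaled = [x / temperature for x in row]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def transpose(matrix):
    return [list(col) for col in zip(*matrix)]


def self_relation(vectors, temperature=1.0):
    """Row-wise softmax over pairwise dot products of `vectors` (assumed form)."""
    gram = [[sum(a * b for a, b in zip(u, v)) for v in vectors] for u in vectors]
    return [softmax(row, temperature) for row in gram]


def spatial_self_relation(feats, temperature=1.0):
    # feats is N tokens x C channels; the result relates tokens to tokens (N x N).
    return self_relation(feats, temperature)


def channel_self_relation(feats, temperature=1.0):
    # Transposing first relates channels to channels (C x C).
    return self_relation(transpose(feats), temperature)


def relation_cross_entropy(target, pred, eps=1e-12):
    """Mean over rows of the cross-entropy between two relation matrices."""
    row_losses = [
        -sum(t * math.log(p + eps) for t, p in zip(t_row, p_row))
        for t_row, p_row in zip(target, pred)
    ]
    return sum(row_losses) / len(row_losses)


# Toy features for two augmented views of one image: 2 tokens, 3 channels.
# Assumption: token i in view_a corresponds to token i in view_b.
view_a = [[1.0, 0.0, 0.5], [0.0, 1.0, 0.5]]
view_b = [[0.9, 0.1, 0.4], [0.1, 0.8, 0.6]]

spatial_a = spatial_self_relation(view_a)
spatial_b = spatial_self_relation(view_b)
channel_a = channel_self_relation(view_a)
channel_b = channel_self_relation(view_b)

# Assumed objective: the relations of one view supervise those of the other.
loss = relation_cross_entropy(spatial_a, spatial_b) + relation_cross_entropy(
    channel_a, channel_b
)
print(f"self-relation loss: {loss:.4f}")
```

The point of the sketch is only the shape of the computation: the supervision signal is a pair of relation matrices (token-to-token and channel-to-channel) rather than the raw feature embeddings themselves. In a real training setup the features would come from a ViT backbone and the loss would be computed with a tensor library.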
Taxonomy
Topics: Domain Adaptation and Few-Shot Learning · Advanced Neural Network Applications · Visual Attention and Saliency Detection
