Where are my Neighbors? Exploiting Patches Relations in Self-Supervised Vision Transformer

Guglielmo Camporese; Elena Izzo; Lamberto Ballan

arXiv:2206.00481·cs.CV·October 14, 2022

TL;DR

This paper introduces RelViT, a self-supervised learning strategy for Vision Transformers that leverages patch relations to improve performance, especially on small datasets, without external annotations.

Contribution

RelViT is a novel SSL approach that optimizes all patch-related tokens in ViTs, significantly enhancing accuracy on small datasets compared to existing methods.
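
The idea of optimizing every patch-related token, rather than a single summary token, can be sketched as a loss averaged over all patch tokens. This is only an illustration under stated assumptions: the cross-entropy form, the function names `cross_entropy` and `all_token_loss`, and the toy logits are hypothetical and are not taken from the paper.

```python
import math


def cross_entropy(logits, target):
    """Cross-entropy of one token's logits against an integer class target."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_sum_exp - logits[target]


def all_token_loss(token_logits, token_targets):
    """Average the loss over every patch token (assumed loss form).

    token_logits:  one list of class logits per patch token
    token_targets: one integer target per patch token
    """
    losses = [cross_entropy(l, t) for l, t in zip(token_logits, token_targets)]
    return sum(losses) / len(losses)


# Toy example: 4 patch tokens, 3 hypothetical classes each.
logits = [[2.0, 0.1, -1.0], [0.0, 0.0, 0.0], [-0.5, 1.5, 0.3], [0.2, 0.2, 3.0]]
targets = [0, 1, 1, 2]
loss = all_token_loss(logits, targets)
```

Because each patch token contributes its own term, a single image yields as many training signals as it has patches, which is the property the contribution statement emphasizes.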

Findings

01. RelViT outperforms state-of-the-art SSL methods on multiple benchmarks.

02. Significant accuracy improvements on small datasets.

03. Effective exploitation of patch relations in ViTs.

Abstract

Vision Transformers (ViTs) enabled the use of the transformer architecture on vision tasks, showing impressive performance when trained on large datasets. However, on relatively small datasets, ViTs are less accurate given their lack of inductive bias. To this end, we propose a simple yet effective Self-Supervised Learning (SSL) strategy to train ViTs that, without any external annotation or external data, can significantly improve the results. Specifically, we define a set of SSL tasks based on relations between image patches that the model has to solve before, or jointly with, the supervised task. Unlike ViT, our RelViT model optimizes all the output tokens of the transformer encoder that are related to the image patches, thus exploiting more training signal at each training step. We investigated our methods on several image benchmarks, finding that RelViT improves the SSL…
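
To make "SSL tasks based on relations of image patches" concrete, here is a minimal sketch of how relation labels can be derived from patch positions alone, with no human annotation. The abstract does not list the exact relation classes, so the nine-class scheme below (eight neighbor directions plus "not adjacent") and the helper names `patch_coords`, `relation_label`, and `relation_targets` are assumptions made for illustration.

```python
# Hypothetical label set: the 8 neighbor directions plus "not adjacent".
# The abstract does not specify the relation classes; these are illustrative.
OFFSET_TO_LABEL = {
    (-1, -1): 0, (-1, 0): 1, (-1, 1): 2,
    (0, -1): 3,              (0, 1): 4,
    (1, -1): 5,  (1, 0): 6,  (1, 1): 7,
}
NOT_ADJACENT = 8


def patch_coords(grid_size):
    """Return (row, col) for every patch in a grid_size x grid_size grid."""
    return [(r, c) for r in range(grid_size) for c in range(grid_size)]


def relation_label(a, b):
    """Label where patch b sits relative to patch a."""
    offset = (b[0] - a[0], b[1] - a[1])
    return OFFSET_TO_LABEL.get(offset, NOT_ADJACENT)


def relation_targets(grid_size):
    """Build an N x N table of pairwise labels, with N = grid_size ** 2.

    A patch paired with itself has offset (0, 0) and is labeled NOT_ADJACENT.
    """
    coords = patch_coords(grid_size)
    return [[relation_label(a, b) for b in coords] for a in coords]


# A 3 x 3 grid gives 9 patches and a 9 x 9 table of free training targets.
targets = relation_targets(3)
```

Every entry of `targets` is computed from the image grid itself, so the labels cost nothing to produce; a model would then be trained to predict them from its patch tokens.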

Taxonomy

Topics: Domain Adaptation and Few-Shot Learning · Advanced Vision and Imaging · Advanced Neural Network Applications