Efficient Training of Visual Transformers with Small Datasets
Yahui Liu, Enver Sangineto, Wei Bi, Nicu Sebe, Bruno Lepri, Marco De Nadai

TL;DR
This paper introduces a self-supervised training task for Visual Transformers (VTs) that encourages the model to learn spatial relations within an image, making training more robust and improving accuracy when training data is scarce.
Contribution
The paper proposes a novel self-supervised task that can be added to existing Visual Transformers to improve their data efficiency and robustness in small-dataset regimes.
Findings
Self-supervised task improves VT accuracy on small datasets
Method is architecture-agnostic and easy to implement
Significant accuracy gains demonstrated across multiple datasets
Abstract
Visual Transformers (VTs) are emerging as an architectural paradigm alternative to Convolutional networks (CNNs). Differently from CNNs, VTs can capture global relations between image elements and they potentially have a larger representation capacity. However, the lack of the typical convolutional inductive bias makes these models more data-hungry than common CNNs. In fact, some local properties of the visual domain which are embedded in the CNN architectural design, in VTs should be learned from samples. In this paper, we empirically analyse different VTs, comparing their robustness in a small training-set regime, and we show that, despite having a comparable accuracy when trained on ImageNet, their performance on smaller datasets can be largely different. Moreover, we propose a self-supervised task which can extract additional information from images with only a negligible…
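The summary above describes the self-supervised task only as one that encourages learning spatial relations, so the following is a minimal sketch of one way such a task could look, not the authors' implementation. It assumes the transformer outputs a k x k grid of token embeddings, samples random token pairs, and asks a small predictor to regress their normalized grid offset under an L1 loss. The names `sample_relative_pairs`, `l1_localization_loss`, `toy_grid`, and `oracle` are hypothetical and chosen for illustration.

```python
import random


def sample_relative_pairs(grid, num_pairs, rng):
    """Sample token pairs from a k x k grid of embeddings (assumed layout).

    Returns (inputs, targets): each input is the concatenation of the two
    token embeddings, each target is their (row, col) offset divided by k.
    """
    k = len(grid)
    inputs, targets = [], []
    for _ in range(num_pairs):
        i1, j1 = rng.randrange(k), rng.randrange(k)
        i2, j2 = rng.randrange(k), rng.randrange(k)
        inputs.append(grid[i1][j1] + grid[i2][j2])  # list concatenation
        targets.append(((i1 - i2) / k, (j1 - j2) / k))
    return inputs, targets


def l1_localization_loss(predictor, inputs, targets):
    """Mean L1 error between predicted and true normalized offsets."""
    total = 0.0
    for x, (t_row, t_col) in zip(inputs, targets):
        p_row, p_col = predictor(x)
        total += abs(p_row - t_row) + abs(p_col - t_col)
    return total / len(inputs)


# Toy check: each embedding simply stores its own normalized position,
# so an oracle predictor can recover the offset by subtraction.
k = 4
toy_grid = [[[i / k, j / k] for j in range(k)] for i in range(k)]
inputs, targets = sample_relative_pairs(toy_grid, num_pairs=8, rng=random.Random(0))
oracle = lambda x: (x[0] - x[2], x[1] - x[3])
loss = l1_localization_loss(oracle, inputs, targets)  # near zero for the oracle
```

In an actual training loop, the predictor would be a small learned network, and a loss of this kind would presumably be added to the supervised classification loss with a weighting factor. Both of those choices are assumptions here rather than details given in this summary.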
Taxonomy
Topics: Advanced Neural Network Applications · Domain Adaptation and Few-Shot Learning · Advanced Image and Video Retrieval Techniques
