Hybrid BYOL-ViT: Efficient approach to deal with small datasets
Safwen Naimi, Rien van Leeuwen, Wided Souidene, Slim Ben Saoud

TL;DR
This paper presents a hybrid approach that combines self-supervised learning with Vision Transformers (ViT). Low-level features learned without labels improve performance on small datasets, with the reported STL-10 result rising from 41.66% to 83.25%.
Contribution
It introduces a method that uses self-supervised low-level features to enhance ViT performance on small datasets, reducing reliance on large labeled datasets.
Findings
Performance improves from 41.66% to 83.25% on the STL-10 dataset.
Self-supervised low-level features improve ViT robustness.
Effective training of early network layers with unlabeled data.
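The title names BYOL as the self-supervised component. As a rough illustration of how a BYOL-style objective learns from unlabeled images, the sketch below shows the two pieces that BYOL is generally described as using: a normalized regression loss between an online network's prediction for one augmented view and a target network's projection of another view, and an exponential-moving-average (EMA) update of the target weights. The function names, the plain-list vector representation, and the decay value `tau` are assumptions made for illustration; this is not the authors' code or their exact configuration.

```python
import math


def l2_normalize(v):
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def byol_loss(online_prediction, target_projection):
    """BYOL-style loss for one pair of views: 2 - 2 * cosine similarity.

    Equals 0 when the two vectors point in the same direction and 4 when
    they point in opposite directions. No labels are involved.
    """
    p = l2_normalize(online_prediction)
    z = l2_normalize(target_projection)
    return 2.0 - 2.0 * sum(a * b for a, b in zip(p, z))


def ema_update(target_weights, online_weights, tau=0.99):
    """Move target weights slowly toward online weights (tau is hypothetical)."""
    return [tau * t + (1.0 - tau) * o
            for t, o in zip(target_weights, online_weights)]


if __name__ == "__main__":
    # Two toy embeddings standing in for two augmented views of one image.
    print(byol_loss([1.0, 2.0], [2.0, 4.0]))   # same direction: loss near 0
    print(byol_loss([1.0, 0.0], [0.0, 1.0]))   # orthogonal: loss near 2
    print(ema_update([1.0, 0.0], [0.0, 1.0], tau=0.9))
```

In a full training loop the loss is usually symmetrized by swapping the two views, and only the online network receives gradients while the target network is updated through the EMA step.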
Abstract
Supervised learning can learn large representational spaces, which are crucial for handling difficult learning tasks. However, due to the design of the model, classical image classification approaches struggle to generalize to new problems and new situations when dealing with small datasets. In fact, supervised learning can lose the location of image features, which leads to supervision collapse in very deep architectures. In this paper, we investigate how self-supervision with strong and sufficient augmentation of unlabeled data can effectively train the first layers of a neural network, even better than supervised learning, with no need for millions of labeled examples. The main goal is to disconnect pixel data from annotation by getting generic task-agnostic low-level features. Furthermore, we look into Vision Transformers (ViT) and show that the low-level features derived from a…
Taxonomy
Topics: Domain Adaptation and Few-Shot Learning · Advanced Neural Network Applications · Advanced Image and Video Retrieval Techniques
