A Survey of the Self Supervised Learning Mechanisms for Vision Transformers

Asifullah Khan; Anabia Sohail; Mustansar Fiaz; Mehdi Hassan; Tariq Habib Afridi; Sibghat Ullah Marwat; Farzeen Munir; Safdar Ali; Hannan Naseem; Muhammad Zaigham Zaheer; Kamran Ali; Tangina Sultana; Ziaurrehman Tanoli; Naeem Akhter

arXiv:2408.17059·cs.CV·August 26, 2025·6 cites

Open Access

TL;DR

This survey reviews self-supervised learning (SSL) methods for Vision Transformers, covering their design, applications, and challenges, and emphasizes the role of SSL in reducing reliance on labeled data for vision tasks.

Contribution

It provides a comprehensive taxonomy, comparative analysis, and insights into SSL techniques specifically tailored for Vision Transformers, addressing current challenges and future directions.

Findings

01. SSL methods improve ViT performance with limited labeled data

02. A taxonomy categorizes SSL techniques based on representations and tasks

03. Comparative analysis highlights strengths and limitations of existing SSL approaches

Abstract

Advances in deep learning are redefining how visual data is processed and understood by machines. Vision Transformers (ViTs) have recently demonstrated prominent performance in computer vision tasks. However, their performance improves with increasing amounts of labeled data, indicating a reliance on labeled data. Human-annotated data are difficult to acquire, which has shifted the focus from traditional annotations to unsupervised learning strategies that learn structures inside the data. In response to this challenge, self-supervised learning (SSL) has emerged as a promising technique. SSL utilizes inherent relationships within the data as a form of supervision. This technique can reduce the dependence on manual annotations and offers a more scalable and resource-effective approach to training models. Taking these strengths into account, it is necessary to assess the…

Taxonomy

Topics: Industrial Vision Systems and Defect Detection

Methods: Focus