Teaching Matters: Investigating the Role of Supervision in Vision Transformers
Matthew Walmer, Saksham Suri, Kamal Gupta, Abhinav Shrivastava

TL;DR
This paper investigates how different supervision methods influence Vision Transformers' behaviors, revealing diverse learning patterns, the emergence of Offset Local Attention Heads, and the competitive performance of self-supervised approaches.
Contribution
The paper provides a comprehensive comparison of ViTs trained under different supervision paradigms and documents a previously unreported behavior, Offset Local Attention Heads.
Findings
ViTs learn diverse behaviors depending on their training method.
Self-supervised methods can match or outperform supervised ones.
Offset Local Attention Heads are a consistent phenomenon across models.
Abstract
Vision Transformers (ViTs) have gained significant popularity in recent years and have proliferated into many applications. However, their behavior under different learning paradigms is not well explored. We compare ViTs trained through different methods of supervision, and show that they learn a diverse range of behaviors in terms of their attention, representations, and downstream performance. We also discover ViT behaviors that are consistent across supervision, including the emergence of Offset Local Attention Heads. These are self-attention heads that attend to a token adjacent to the current token with a fixed directional offset, a phenomenon that to the best of our knowledge has not been highlighted in any prior work. Our analysis shows that ViTs are highly flexible and learn to process local and global information in different orders depending on their training method. We find…
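To make the idea of an Offset Local Attention Head concrete, the sketch below builds a toy attention matrix in which every token attends to the neighbor at a fixed offset, then recovers that offset as the attention-weighted mean displacement. This is only an illustration and not the authors' analysis code: the function names `synthetic_offset_head` and `mean_attention_offset`, the 6x6 grid, and the choice to skip border tokens are all assumptions made for this example.

```python
def synthetic_offset_head(grid_h, grid_w, d_row, d_col):
    """Toy (N x N) attention matrix, N = grid_h * grid_w (hypothetical helper).

    Each query token places all of its attention on the token located at a
    fixed (d_row, d_col) offset; targets are clamped at the grid border.
    """
    n = grid_h * grid_w
    attn = [[0.0] * n for _ in range(n)]
    for q in range(n):
        r, c = divmod(q, grid_w)
        target_r = min(max(r + d_row, 0), grid_h - 1)
        target_c = min(max(c + d_col, 0), grid_w - 1)
        attn[q][target_r * grid_w + target_c] = 1.0
    return attn


def mean_attention_offset(attn, grid_h, grid_w):
    """Attention-weighted mean (row, col) displacement over interior queries.

    Assumes attn holds spatial tokens only (no CLS token) and that each row
    sums to 1. Border queries are skipped because their neighbors may fall
    off the grid, which would bias the estimate.
    """
    total_dr = total_dc = 0.0
    count = 0
    for q, weights in enumerate(attn):
        r, c = divmod(q, grid_w)
        if not (0 < r < grid_h - 1 and 0 < c < grid_w - 1):
            continue
        # Expected key position under this query's attention distribution.
        exp_r = sum(w * (k // grid_w) for k, w in enumerate(weights))
        exp_c = sum(w * (k % grid_w) for k, w in enumerate(weights))
        total_dr += exp_r - r
        total_dc += exp_c - c
        count += 1
    return total_dr / count, total_dc / count


# A head that always looks one token to the right.
right_head = synthetic_offset_head(6, 6, 0, 1)
d_row, d_col = mean_attention_offset(right_head, 6, 6)
print(d_row, d_col)
```

Under this measure, a head with a fixed directional offset yields a mean displacement close to one grid step in that direction, while a head that attends uniformly over the whole image yields a displacement near zero.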
Taxonomy
Topics: Advanced Neural Network Applications · Visual Attention and Saliency Detection · Domain Adaptation and Few-Shot Learning
