Emergence of Human-Like Attention in Self-Supervised Vision Transformers: an eye-tracking study
Takuto Yamamoto, Hirosato Akahoshi, Shigeru Kitazawa

TL;DR
This study tests whether Vision Transformers (ViTs) trained with the self-supervised DINO method develop human-like visual attention. It finds that DINO-trained ViTs closely mimic human gaze behavior and exhibit biologically plausible attention mechanisms.
Contribution
The paper demonstrates that self-supervised DINO-trained ViTs develop attention patterns similar to humans, unlike supervised models, providing insights into biological visual perception.
Findings
DINO-trained ViTs closely mimic human gaze patterns
Attention clusters correspond to foreground, objects, and background
Self-supervised training leads to more human-like attention mechanisms
Abstract
Many models of visual attention have been proposed so far. Traditional bottom-up models, like saliency models, fail to replicate human gaze patterns, and deep gaze prediction models lack biological plausibility due to their reliance on supervised learning. Vision Transformers (ViTs), with their self-attention mechanisms, offer a new approach but often produce dispersed attention patterns if trained with supervised learning. This study explores whether self-supervised DINO (self-DIstillation with NO labels) training enables ViTs to develop attention mechanisms resembling human visual attention. Using video stimuli to capture human gaze dynamics, we found that DINO-trained ViTs closely mimic human attention patterns, while those trained with supervised learning deviate significantly. An analysis of self-attention heads revealed three distinct clusters: one focusing on foreground objects,…
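One way to picture the comparison between model attention and human gaze described in the abstract is the following minimal sketch. It is not the authors' method: the distance metric, the 16-pixel patch size, the 14 × 14 attention grid, and the function names `attention_peak` and `mean_gaze_distance` are all assumptions made for illustration, and the data are synthetic.

```python
import numpy as np

def attention_peak(attn_map, patch_size=16):
    """Return the (x, y) pixel centre of the patch with the highest attention.

    attn_map is assumed to be a 2-D grid of attention weights, one per image
    patch (e.g. the class token's attention over patches for a single head).
    """
    attn_map = np.asarray(attn_map, dtype=float)
    row, col = np.unravel_index(np.argmax(attn_map), attn_map.shape)
    return ((col + 0.5) * patch_size, (row + 0.5) * patch_size)

def mean_gaze_distance(attn_maps, gaze_xy, patch_size=16):
    """Mean Euclidean distance in pixels between attention peaks and gaze points.

    attn_maps: one 2-D attention grid per video frame.
    gaze_xy:   one (x, y) human gaze position per frame, in pixels.
    """
    peaks = np.array([attention_peak(m, patch_size) for m in attn_maps])
    gaze = np.asarray(gaze_xy, dtype=float)
    return float(np.mean(np.linalg.norm(peaks - gaze, axis=1)))

# Synthetic example: three frames whose attention peaks at patch (row 5, col 8).
rng = np.random.default_rng(0)
frames = rng.random((3, 14, 14)) * 0.1   # low background attention
frames[:, 5, 8] = 1.0                    # a single strong peak per frame
gaze = [(136.0, 88.0)] * 3               # hypothetical gaze at that patch centre
dist = mean_gaze_distance(frames, gaze)
```

Under these assumptions, a smaller distance means the model's attention peak lies closer to where viewers looked. Computing it separately for each attention head would give one score per head, which could then be compared across training regimes.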
Taxonomy
Topics: Visual Attention and Saliency Detection · Gaze Tracking and Assistive Technology · Visual perception and processing mechanisms
Methods: Attention Is All You Need · Linear Layer · Softmax · Dense Connections · Layer Normalization · Residual Connection · Multi-Head Attention · Vision Transformer · self-DIstillation with NO labels
