Emergence of Human-Like Attention in Self-Supervised Vision Transformers: an eye-tracking study

Takuto Yamamoto; Hirosato Akahoshi; Shigeru Kitazawa

arXiv:2410.22768 · q-bio.NC · May 28, 2025 · 2 cites

TL;DR

This study investigates whether self-supervised Vision Transformers trained with DINO can develop human-like visual attention patterns, revealing that such models closely mimic human gaze behavior and exhibit biologically plausible attention mechanisms.

Contribution

The paper demonstrates that self-supervised, DINO-trained ViTs develop attention patterns similar to those of humans, whereas ViTs trained with supervised learning do not, providing insights into biological visual perception.

Findings

01 DINO-trained ViTs closely mimic human gaze patterns

02 Attention clusters correspond to foreground, objects, and background

03 Self-supervised training leads to more human-like attention mechanisms

Abstract

Many models of visual attention have been proposed so far. Traditional bottom-up models, like saliency models, fail to replicate human gaze patterns, and deep gaze prediction models lack biological plausibility due to their reliance on supervised learning. Vision Transformers (ViTs), with their self-attention mechanisms, offer a new approach but often produce dispersed attention patterns if trained with supervised learning. This study explores whether self-supervised DINO (self-DIstillation with NO labels) training enables ViTs to develop attention mechanisms resembling human visual attention. Using video stimuli to capture human gaze dynamics, we found that DINO-trained ViTs closely mimic human attention patterns, while those trained with supervised learning deviate significantly. An analysis of self-attention heads revealed three distinct clusters: one focusing on foreground objects,…
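To make the comparison in the abstract concrete, the following is a minimal sketch of one way a single attention head's map could be compared with a gaze position. The truncated abstract does not state the metric the authors used, so everything here is an assumption for illustration: the helper names (`cls_attention_map`, `peak_patch`, `gaze_distance`), the toy 2x2 patch grid, the 2-dimensional features, and the choice of peak-to-gaze distance as the comparison measure. Only the scaled dot-product softmax attention is standard.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cls_attention_map(cls_query, patch_keys, grid_w):
    """Attention of one head's [CLS] query over patch keys, as rows of a grid."""
    d = len(cls_query)
    # Scaled dot-product score between the query and each patch key.
    scores = [sum(q * k for q, k in zip(cls_query, key)) / math.sqrt(d)
              for key in patch_keys]
    weights = softmax(scores)
    return [weights[i:i + grid_w] for i in range(0, len(weights), grid_w)]

def peak_patch(attn_map):
    """(row, col) of the patch with the largest attention weight."""
    best = max((w, r, c)
               for r, row in enumerate(attn_map)
               for c, w in enumerate(row))
    return best[1], best[2]

def gaze_distance(attn_map, gaze_rc):
    """Distance, in patch units, between the attention peak and a gaze position.

    Hypothetical comparison measure; not taken from the paper.
    """
    pr, pc = peak_patch(attn_map)
    return math.hypot(pr - gaze_rc[0], pc - gaze_rc[1])

# Toy example: a 2x2 patch grid with made-up 2-d key vectors.
query = [1.0, 0.0]
keys = [[0.0, 1.0], [3.0, 0.0], [0.5, 0.5], [0.0, 0.0]]
attn = cls_attention_map(query, keys, grid_w=2)
print(peak_patch(attn))
print(gaze_distance(attn, (1, 1)))
```

In a real analysis, the query and key vectors would come from a trained ViT and the gaze position from eye-tracking data mapped onto the same patch grid. A smaller distance would indicate that the head attends closer to where a viewer looked.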

Code & Models

Repositories

KitazawaLab/vit-human-attention-comparison (Official)

Taxonomy

Topics: Visual Attention and Saliency Detection · Gaze Tracking and Assistive Technology · Visual perception and processing mechanisms

Methods: Attention Is All You Need · Linear Layer · Softmax · Dense Connections · Layer Normalization · Residual Connection · Multi-Head Attention · Vision Transformer · self-DIstillation with NO labels