Vision Transformer attention alignment with human visual perception in aesthetic object evaluation
Miguel Carrasco, César González-Martín, José Aranda, Luis Oliveros

TL;DR
This study compares human visual attention with Vision Transformer attention mechanisms during aesthetic object evaluation, revealing that some ViT heads can approximate human focus patterns, especially for specific features.
Contribution
It provides an empirical analysis of how ViT attention maps align with human gaze patterns in aesthetic evaluation, highlighting specific attention heads that mimic human visual focus.
Findings
Attention head #12 aligns best with human gaze patterns.
Optimal correlation at Gaussian sigma=2.4.
Certain ViT heads differ significantly from human attention.
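As a rough illustration of the sigma finding above, the sketch below smooths per-head attention maps with a Gaussian kernel and scores each one against a gaze heat map. It is not the authors' pipeline. This summary does not say which correlation measure was used or which map was smoothed, so the Pearson correlation, the smoothing of the ViT maps, the border handling, and every function name (`gaussian_blur`, `pearson`, `score_heads`, `impulse`) are assumptions made for illustration only.

```python
import math


def gaussian_kernel(sigma):
    """Normalized 1-D Gaussian kernel truncated at about 3 sigma (assumed cutoff)."""
    radius = max(1, int(3 * sigma + 0.5))
    weights = [math.exp(-(i * i) / (2 * sigma * sigma))
               for i in range(-radius, radius + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def blur_rows(grid, kernel):
    """Convolve every row with the kernel, clamping indices at the borders."""
    radius = len(kernel) // 2
    width = len(grid[0])
    return [
        [sum(k * row[min(max(x + j - radius, 0), width - 1)]
             for j, k in enumerate(kernel))
         for x in range(width)]
        for row in grid
    ]


def transpose(grid):
    return [list(col) for col in zip(*grid)]


def gaussian_blur(grid, sigma):
    """Separable 2-D Gaussian blur: filter rows, then columns."""
    kernel = gaussian_kernel(sigma)
    return transpose(blur_rows(transpose(blur_rows(grid, kernel)), kernel))


def pearson(a, b):
    """Pearson correlation between two equally sized 2-D maps."""
    xs = [v for row in a for v in row]
    ys = [v for row in b for v in row]
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def score_heads(head_maps, gaze_map, sigma):
    """Correlation of each smoothed head map with the gaze heat map."""
    return [pearson(gaussian_blur(h, sigma), gaze_map) for h in head_maps]


def impulse(size, y, x):
    """Toy map: a single active cell on a size x size grid."""
    grid = [[0.0] * size for _ in range(size)]
    grid[y][x] = 1.0
    return grid


if __name__ == "__main__":
    # Synthetic data only: gaze concentrated near (2, 2); head 0 attends
    # there, head 1 attends to the opposite corner.
    gaze = gaussian_blur(impulse(8, 2, 2), 1.0)
    heads = [impulse(8, 2, 2), impulse(8, 6, 6)]
    scores = score_heads(heads, gaze, sigma=2.4)  # sigma value from the findings
    best = max(range(len(scores)), key=scores.__getitem__)
    print("scores per head:", scores)
    print("best-matching head index:", best)
```

Sweeping `sigma` over a range and keeping the value with the highest score would be one way to look for an optimum like the one reported, again under the assumptions stated above.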
Abstract
Visual attention mechanisms play a crucial role in human perception and aesthetic evaluation. Recent advances in Vision Transformers (ViTs) have demonstrated remarkable capabilities in computer vision tasks, yet their alignment with human visual attention patterns remains underexplored, particularly in aesthetic contexts. This study investigates the correlation between human visual attention and ViT attention mechanisms when evaluating handcrafted objects. We conducted an eye-tracking experiment with 30 participants (9 female, 21 male, mean age 24.6 years) who viewed 20 artisanal objects comprising basketry bags and ginger jars. Using a Pupil Labs eye-tracker, we recorded gaze patterns and generated heat maps representing human visual attention. Simultaneously, we analyzed the same objects using a pre-trained ViT model with DINO (Self-DIstillation with NO Labels), extracting attention…
Taxonomy
Topics: Visual Attention and Saliency Detection · Aesthetic Perception and Analysis · Face Recognition and Perception
