Image Memorability Prediction with Vision Transformers
Thomas Hagen, Thomas Espeseth

TL;DR
This paper introduces ViTMem, a vision transformer-based model for predicting image memorability. ViTMem performs as well as or better than state-of-the-art CNN models and is sensitive to the semantic content of images.
Contribution
The paper presents ViTMem, a novel vision transformer-based model that matches or surpasses state-of-the-art CNN models in predicting image memorability and captures semantic influences on memorability.
Findings
ViTMem performs as well as or better than state-of-the-art CNN models.
ViTMem is particularly sensitive to semantic content affecting memorability.
ViT-derived models can replace CNNs for memorability prediction.
Abstract
Behavioral studies have shown that the memorability of images is similar across groups of people, suggesting that memorability is a function of the intrinsic properties of images and is unrelated to people's individual experiences and traits. Deep learning networks can be trained on such properties and used to predict memorability in new data sets. Convolutional neural networks (CNNs) have pioneered image memorability prediction, but more recently developed vision transformer (ViT) models may have the potential to yield even better predictions. In this paper, we present ViTMem, a new memorability model based on ViT, and compare its memorability predictions with those of state-of-the-art CNN-derived models. Results showed that ViTMem performed equal to or better than state-of-the-art models on all data sets. Additional semantic level analyses revealed that ViTMem is…
Taxonomy
Topics: Visual Attention and Saliency Detection · Aesthetic Perception and Analysis · Image and Video Quality Assessment
Methods: Multi-Head Attention · Attention Is All You Need · Softmax · Linear Layer · Residual Connection · Dense Connections · Layer Normalization · Vision Transformer
