Image Memorability Prediction with Vision Transformers

Thomas Hagen; Thomas Espeseth

arXiv:2301.08647 · cs.CV · January 23, 2023 · 1 citation

Open Access

TL;DR

This paper introduces ViTMem, a vision transformer-based model for predicting image memorability. ViTMem performs equal to or better than state-of-the-art CNN models and is sensitive to the semantic content of images.

Contribution

The paper presents ViTMem, a vision transformer-based model that matches or surpasses state-of-the-art CNN models in predicting image memorability and captures semantic influences on memorability.

Findings

01 ViTMem performs equal to or better than state-of-the-art CNN models.

02 ViTMem is particularly sensitive to semantic content that affects memorability.

03 ViT-derived models can replace CNNs for memorability prediction.

Abstract

Behavioral studies have shown that the memorability of images is similar across groups of people, suggesting that memorability is a function of the intrinsic properties of images and is unrelated to people's individual experiences and traits. Deep learning networks can be trained on such properties and used to predict memorability in new data sets. Convolutional neural networks (CNNs) have pioneered image memorability prediction, but more recently developed vision transformer (ViT) models may have the potential to yield even better predictions. In this paper, we present ViTMem, a new memorability model based on ViT, and compare its memorability predictions with those of state-of-the-art CNN-derived models. Results showed that ViTMem performed equal to or better than state-of-the-art models on all data sets. Additional semantic level analyses revealed that ViTMem is…
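One way to check a claim like "memorability is similar across groups of people" is a split-half analysis: split the participants at random into two groups, compute a per-image hit rate in each group, and rank-correlate the two score lists. The sketch below is illustrative only and is not taken from the paper. The function names, the simulated data, and the choice of Spearman correlation are all assumptions made for this example.

```python
import random


def ranks(values):
    """Return 1-based ranks; tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


def split_half_consistency(hits, rng):
    """hits[p][i] is 1 if participant p recognized image i, else 0.

    Splits participants into two random halves and returns the Spearman
    correlation between the per-image hit rates of the two halves.
    """
    people = list(range(len(hits)))
    rng.shuffle(people)
    half = len(people) // 2
    group_a, group_b = people[:half], people[half:]
    n_images = len(hits[0])
    score_a = [sum(hits[p][i] for p in group_a) / len(group_a)
               for i in range(n_images)]
    score_b = [sum(hits[p][i] for p in group_b) / len(group_b)
               for i in range(n_images)]
    return spearman(score_a, score_b)


def simulate_hits(n_people, n_images, rng):
    """Hypothetical data: each image has its own fixed hit probability."""
    probs = [0.1 + 0.8 * i / (n_images - 1) for i in range(n_images)]
    return [[1 if rng.random() < p else 0 for p in probs]
            for _ in range(n_people)]


# Simulated example: 200 participants, 40 images.
hits = simulate_hits(200, 40, random.Random(0))
print(split_half_consistency(hits, random.Random(1)))
```

In this simulation the hit probability is a property of the image alone, so the two halves agree closely. The same rank correlation can also be used to compare a model's predicted scores with human-derived scores.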

Taxonomy

Topics: Visual Attention and Saliency Detection · Aesthetic Perception and Analysis · Image and Video Quality Assessment

Methods: Multi-Head Attention · Attention Is All You Need · Softmax · Linear Layer · Residual Connection · Dense Connections · Layer Normalization · Vision Transformer