Unsupervised Memorability Modeling from Tip-of-the-Tongue Retrieval Queries

Sree Bhattacharyya; Yaman Kumar Singla; Sudhir Yarram; Somesh Kumar Singh; Harini S I; James Z. Wang

arXiv:2511.20854 · cs.CV · November 27, 2025

TL;DR

This paper introduces a large-scale unsupervised dataset with over 82,000 videos and recall data, enabling improved modeling of visual memorability and retrieval tasks without relying on expensive human annotations.

Contribution

The paper presents the first large-scale unsupervised dataset for visual memorability modeling, along with models that outperform existing methods in recall generation and tip-of-the-tongue retrieval.

Findings

01. Unsupervised dataset effectively models memorability signals.

02. Fine-tuned vision-language models outperform GPT-4o in description generation.

03. Contrastive training enables multimodal tip-of-the-tongue retrieval.
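
To make the third finding concrete, here is a minimal sketch of what contrastive training for query-to-video retrieval can look like. It is not the paper's implementation: the symmetric InfoNCE loss, the temperature of 0.07, and the function names `symmetric_info_nce` and `retrieve` are all assumptions chosen for illustration, and the embeddings are treated as already computed by some text encoder and video encoder.

```python
import numpy as np

def log_softmax(x, axis):
    # Numerically stable log-softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))

def l2_normalize(x):
    # Scale each row to unit length so dot products are cosine similarities.
    return x / np.linalg.norm(x, axis=1, keepdims=True)

def symmetric_info_nce(query_emb, video_emb, temperature=0.07):
    """Hypothetical contrastive loss: row i of query_emb is assumed to be
    the recall query written about the video in row i of video_emb."""
    q = l2_normalize(query_emb)
    v = l2_normalize(video_emb)
    logits = q @ v.T / temperature  # (batch, batch) similarity matrix
    idx = np.arange(len(q))
    # Each query should pick its own video among all videos in the batch...
    loss_q2v = -log_softmax(logits, axis=1)[idx, idx].mean()
    # ...and each video should pick its own query among all queries.
    loss_v2q = -log_softmax(logits, axis=0)[idx, idx].mean()
    return 0.5 * (loss_q2v + loss_v2q)

def retrieve(query_emb, video_emb, k=1):
    # Rank videos for each query by cosine similarity, best first.
    scores = l2_normalize(query_emb) @ l2_normalize(video_emb).T
    return np.argsort(-scores, axis=1)[:, :k]
```

Under these assumptions, training lowers the loss by pulling each tip-of-the-tongue query toward its matching video and pushing it away from the other videos in the batch; at query time, `retrieve` simply returns the nearest videos in the shared embedding space.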

Abstract

Visual content memorability has intrigued the scientific community for decades, with applications ranging widely, from understanding nuanced aspects of human memory to enhancing content design. A significant challenge in progressing the field lies in the expensive process of collecting memorability annotations from humans. This limits the diversity and scalability of datasets for modeling visual content memorability. Most existing datasets are limited to collecting aggregate memorability scores for visual content, not capturing the nuanced memorability signals present in natural, open-ended recall descriptions. In this work, we introduce the first large-scale unsupervised dataset designed explicitly for modeling visual memorability signals, containing over 82,000 videos, accompanied by descriptive recall data. We leverage tip-of-the-tongue (ToT) retrieval queries from online platforms…

Taxonomy

Topics: Multimodal Machine Learning Applications · Visual Attention and Saliency Detection · Advanced Image and Video Retrieval Techniques