Enhancing Video Memorability Prediction with Text-Motion Cross-modal Contrastive Loss and Its Application in Video Summarization

Zhiyi Zhu; Xiaoyu Wu; Youwei Lu

arXiv:2506.08649·cs.CV·June 11, 2025

TL;DR

This paper introduces a cross-modal contrastive loss that improves motion feature representation for video memorability prediction, achieves state-of-the-art results, and applies the resulting memorability predictions to video summarization.

Contribution

The paper proposes the Text-Motion Cross-modal Contrastive Loss (TMCCL) to better utilize motion cues and introduces MWCVS to improve video summarization using memorability prediction.

Findings

1. Achieved state-of-the-art performance on two datasets.
2. Demonstrated effectiveness of memorability in video summarization.
3. Enhanced motion feature representation through TMCCL.

Abstract

Video memorability refers to the ability of videos to be recalled after viewing, and it plays a crucial role in creating content that remains memorable. Existing models typically focus on extracting multimodal features to predict video memorability scores but often fail to fully utilize motion cues. The representation of motion features is compromised during the fine-tuning phase of the motion feature extractor due to a lack of labeled data. In this paper, we introduce a multimodal video memorability prediction model trained with the Text-Motion Cross-modal Contrastive Loss (TMCCL), which is designed to enhance the representation of motion features. We tackle the challenge of improving motion feature representation by leveraging text description similarities across videos to establish positive and negative motion sample sets for a given target. This enhancement allows the model to learn similar feature…
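The idea of building positive and negative motion sets from text similarity can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `text_guided_contrastive_loss`, the cosine-similarity threshold `pos_threshold`, the `temperature` value, and the multi-positive InfoNCE form of the loss are all assumptions chosen for illustration. The sketch assumes one motion embedding and one text embedding per video in a batch.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length so dot products are cosine similarities."""
    return x / np.maximum(np.linalg.norm(x, axis=1, keepdims=True), eps)


def text_guided_contrastive_loss(motion, text, pos_threshold=0.8, temperature=0.1):
    """Illustrative loss: text similarity picks positives among motion features.

    motion: (n, d_m) array of motion embeddings, one row per video.
    text:   (n, d_t) array of text-description embeddings for the same videos.
    pos_threshold and temperature are hypothetical hyperparameters.
    """
    motion = l2_normalize(np.asarray(motion, dtype=np.float64))
    text = l2_normalize(np.asarray(text, dtype=np.float64))
    n = motion.shape[0]
    off_diag = ~np.eye(n, dtype=bool)

    # Videos whose descriptions are similar to the anchor's form its positive
    # set; every other video in the batch acts as a negative.
    pos_mask = ((text @ text.T) >= pos_threshold) & off_diag
    pos_count = pos_mask.sum(axis=1)
    valid = pos_count > 0
    if not valid.any():
        return 0.0  # no anchor has a positive, so nothing to contrast

    # Temperature-scaled motion similarities, with self-pairs excluded.
    logits = np.where(off_diag, (motion @ motion.T) / temperature, -np.inf)

    # Numerically stable log-softmax over each anchor's candidates.
    row_max = logits.max(axis=1, keepdims=True)
    log_denom = row_max + np.log(np.exp(logits - row_max).sum(axis=1, keepdims=True))
    log_prob = logits - log_denom

    # Average log-probability of the positives, over anchors that have any.
    pos_log_prob = np.where(pos_mask, log_prob, 0.0).sum(axis=1)
    return float(-(pos_log_prob[valid] / pos_count[valid]).mean())
```

Under these assumptions, the loss is small when videos with similar descriptions also have similar motion embeddings, and large when motion embeddings of text-similar videos are far apart. In a training setup it would be added to the memorability regression objective with some weighting, which is likewise an assumption here.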

Taxonomy

Topics: Video Analysis and Summarization · Multimodal Machine Learning Applications · Visual Attention and Saliency Detection

Methods: Focus