Cross-modal Contrastive Learning with Asymmetric Co-attention Network for Video Moment Retrieval

Love Panta; Prashant Shrestha; Brabeem Sapkota; Amrita Bhattarai; Suresh Manandhar; Anand Kumar Sah

arXiv:2312.07435 · cs.CV · December 13, 2023 · 1 citation

TL;DR

This paper introduces a cross-modal contrastive learning approach with an asymmetric co-attention network for video moment retrieval, addressing information asymmetry and improving performance with fewer parameters.

Contribution

It proposes an asymmetric co-attention network combined with momentum contrastive loss to enhance video-text interaction and retrieval accuracy, with efficient parameter usage.

Findings

01. Outperforms state-of-the-art models on the TACoS dataset
02. Achieves comparable results on ActivityNet Captions
03. Uses fewer parameters than baseline models

Abstract

Video moment retrieval is a challenging task requiring fine-grained interactions between video and text modalities. Recent work in image-text pretraining has demonstrated that most existing pretrained models suffer from information asymmetry due to the difference in length between visual and textual sequences. We question whether the same problem also exists in the video-text domain with an auxiliary need to preserve both spatial and temporal information. Thus, we evaluate a recently proposed solution involving the addition of an asymmetric co-attention network for video grounding tasks. Additionally, we incorporate momentum contrastive loss for robust, discriminative representation learning in both modalities. We note that the integration of these supplementary modules yields better performance compared to state-of-the-art models on the TACoS dataset and comparable results on…
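The abstract names a momentum contrastive loss but does not spell out its form. As a rough illustration only, the sketch below shows one common way such an objective is built: a symmetric InfoNCE loss over paired video and text embeddings, plus an exponential-moving-average update for a momentum encoder. The function names, the temperature of 0.07, and the momentum of 0.995 are illustrative assumptions, not values taken from the paper or its repository, and NumPy is used here in place of the repository's PyTorch.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length so dot products become cosine similarities."""
    return x / np.maximum(np.linalg.norm(x, axis=axis, keepdims=True), eps)


def log_softmax(logits, axis=-1):
    """Numerically stable log-softmax along the given axis."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))


def symmetric_info_nce(video_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE loss (hypothetical helper, not the authors' code).

    Row i of video_emb and row i of text_emb form a matched pair;
    every other row in the batch serves as a negative.
    """
    v = l2_normalize(video_emb)
    t = l2_normalize(text_emb)
    logits = v @ t.T / temperature          # (batch, batch) similarity matrix
    idx = np.arange(len(v))
    loss_v2t = -log_softmax(logits, axis=1)[idx, idx].mean()  # video -> text
    loss_t2v = -log_softmax(logits, axis=0)[idx, idx].mean()  # text -> video
    return 0.5 * (loss_v2t + loss_t2v)


def momentum_update(key_params, query_params, momentum=0.995):
    """EMA update: the momentum encoder slowly tracks the trained encoder."""
    return [momentum * k + (1.0 - momentum) * q
            for k, q in zip(key_params, query_params)]
```

In a setup like this, the loss is lowest when each video embedding is closer to its own caption than to any other caption in the batch, and the slowly updated momentum encoder supplies more stable targets than the rapidly changing trained encoder would.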

Code & Models

Repositories

love481/Cross-modal-Contrastive-Learning-with-Asymmetric-Co-attention-Network-for-Video-Moment-Retrieval (PyTorch, official)

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Domain Adaptation and Few-Shot Learning