Unified Coarse-to-Fine Alignment for Video-Text Retrieval

Ziyang Wang; Yi-Lin Sung; Feng Cheng; Gedas Bertasius; Mohit Bansal

arXiv:2309.10091 · cs.CV · September 20, 2023 · 2 cites

Open Access 1 Repo

TL;DR

UCoFiA introduces a unified coarse-to-fine alignment model for video-text retrieval, effectively capturing multi-granularity similarities and improving retrieval accuracy over previous methods.

Contribution

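The paper proposes a unified model that computes cross-modal similarity at multiple granularity levels, aggregates each level with an Interactive Similarity Aggregation (ISA) module, and normalizes the resulting similarities with the Sinkhorn-Knopp algorithm, improving video-text retrieval performance.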

Findings

1. Outperforms previous state-of-the-art CLIP-based methods on multiple benchmarks.

2. Achieves 2.4%, 1.4%, and 1.3% improvements in text-to-video retrieval R@1 on MSR-VTT, Activity-Net, and DiDeMo.

3. Demonstrates the effectiveness of multi-granular alignment and similarity normalization.
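
For readers unfamiliar with the R@1 metric cited in the findings, the following is a minimal sketch of how text-to-video Recall@K is commonly computed from a similarity matrix. It assumes one ground-truth video per text query, stored on the diagonal; the function name `recall_at_k` and the toy matrix are illustrative and are not taken from the paper or its repository.

```python
import numpy as np


def recall_at_k(sim: np.ndarray, k: int = 1) -> float:
    """Text-to-video Recall@K for a square similarity matrix.

    Assumption: sim[i, j] is the similarity between text query i and
    video j, and the correct video for query i is video i.
    """
    # Score of the correct video for each query, as a column vector.
    correct = np.diag(sim)[:, None]
    # Rank = number of videos that score strictly higher than the correct one.
    ranks = (sim > correct).sum(axis=1)
    # A query is a hit when its correct video is among the top k.
    return float((ranks < k).mean())


# Toy example with three queries and three videos.
toy_sim = np.array([
    [0.9, 0.1, 0.2],
    [0.3, 0.2, 0.8],
    [0.1, 0.4, 0.7],
])
r1 = recall_at_k(toy_sim, k=1)
r3 = recall_at_k(toy_sim, k=3)
```

In this toy matrix the first and third queries rank their correct video first and the second does not, so `r1` is 2/3 and `r3` is 1.0.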

Abstract

The canonical approach to video-text retrieval leverages a coarse-grained or fine-grained alignment between visual and textual information. However, retrieving the correct video according to the text query is often challenging as it requires the ability to reason about both high-level (scene) and low-level (object) visual clues and how they relate to the text query. To this end, we propose a Unified Coarse-to-fine Alignment model, dubbed UCoFiA. Specifically, our model captures the cross-modal similarity information at different granularity levels. To alleviate the effect of irrelevant visual clues, we also apply an Interactive Similarity Aggregation module (ISA) to consider the importance of different visual features while aggregating the cross-modal similarity to obtain a similarity score for each granularity. Finally, we apply the Sinkhorn-Knopp algorithm to normalize the…
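
The abstract is cut off before it states what the Sinkhorn-Knopp step normalizes, so the following is only a generic sketch of the algorithm applied to a query-by-video similarity matrix, not the authors' implementation. The exponential mapping, the `temperature` parameter, the iteration count, and the function name `sinkhorn_knopp` are all assumptions made for illustration.

```python
import numpy as np


def sinkhorn_knopp(sim: np.ndarray, n_iters: int = 50,
                   temperature: float = 1.0) -> np.ndarray:
    """Alternately normalize rows and columns of a positive matrix.

    Assumption: similarities are mapped to strictly positive values with
    an exponential before normalization; the paper may do this differently.
    """
    # Subtract the max for numerical stability before exponentiating.
    k = np.exp((sim - sim.max()) / temperature)
    for _ in range(n_iters):
        k = k / k.sum(axis=1, keepdims=True)  # make each row sum to 1
        k = k / k.sum(axis=0, keepdims=True)  # make each column sum to 1
    return k


# Toy example: the second video scores highly for every query.
toy_sim = np.array([
    [0.9, 0.8, 0.2],
    [0.3, 0.7, 0.1],
    [0.1, 0.9, 0.6],
])
normalized = sinkhorn_knopp(toy_sim)
```

For a square matrix with positive entries, the iterations converge toward a matrix whose rows and columns each sum to 1, which keeps any single video from dominating the scores across all queries.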

Code & Models

Repositories

ziyang412/ucofia
PyTorch · Official

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Video Analysis and Summarization