Fine-grained Text-Video Retrieval with Frozen Image Encoders

Zuozhuo Dai; Fangtao Shao; Qingkun Su; Zilong Dong; Siyu Zhu

arXiv:2307.09972 · cs.CV · July 20, 2023 · 2 cites

TL;DR

This paper introduces CrossTVR, a two-stage text-video retrieval system that enhances fine-grained multimodal interaction using a decoupled cross attention module and leverages frozen CLIP models for improved scalability and performance.

Contribution

The paper proposes a two-stage architecture with a novel decoupled cross-attention module and uses frozen CLIP models to improve fine-grained text-video retrieval.

Findings

1. Outperforms state-of-the-art methods on text-video retrieval datasets.

2. Demonstrates scalability to larger pre-trained vision models like ViT-G.

3. Effective in capturing fine-grained spatial and temporal multimodal information.

Abstract

State-of-the-art text-video retrieval (TVR) methods typically utilize CLIP and cosine similarity for efficient retrieval. Meanwhile, cross-attention methods, which employ a transformer decoder to compute attention between each text query and all frames in a video, offer a more comprehensive interaction between text and videos. However, these methods lack important fine-grained spatial information, as they directly compute attention between text and video-level tokens. To address this issue, we propose CrossTVR, a two-stage text-video retrieval architecture. In the first stage, we leverage existing TVR methods with a cosine similarity network for efficient text/video candidate selection. In the second stage, we propose a novel decoupled video-text cross-attention module to capture fine-grained multimodal information in spatial and temporal dimensions. Additionally, we employ the frozen CLIP…
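
The two-stage idea in the abstract (cheap cosine-similarity candidate selection, then a more expensive cross-attention pass over the shortlist) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names (`select_candidates`, `cross_attention_score`, `two_stage_retrieve`), the single-head attention scorer, and the assumption that embeddings and token features are already precomputed arrays are all hypothetical, chosen only to show the control flow. In particular, the scorer below is a generic stand-in and does not reproduce the paper's decoupled spatial and temporal attention.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length along `axis`."""
    return x / np.maximum(np.linalg.norm(x, axis=axis, keepdims=True), eps)


def select_candidates(text_emb, video_embs, k):
    """Stage 1 (sketch): keep the k videos with highest cosine similarity.

    text_emb:   (d,)   global text embedding (assumed precomputed)
    video_embs: (N, d) global video embeddings (assumed precomputed)
    """
    sims = l2_normalize(video_embs) @ l2_normalize(text_emb)
    k = min(k, sims.shape[0])
    top = np.argsort(-sims)[:k]
    return top, sims[top]


def cross_attention_score(text_tokens, video_tokens):
    """Stage 2 stand-in (hypothetical): single-head scaled dot-product
    attention with text tokens as queries over one video's tokens.

    text_tokens:  (T, d) per-word features
    video_tokens: (M, d) per-patch or per-frame features of one video
    Returns the mean cosine similarity between each text token and the
    video summary it attends to.
    """
    d = text_tokens.shape[-1]
    logits = text_tokens @ video_tokens.T / np.sqrt(d)
    logits = logits - logits.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(logits)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    attended = weights @ video_tokens  # (T, d)
    cos = np.sum(l2_normalize(text_tokens) * l2_normalize(attended), axis=-1)
    return float(cos.mean())


def two_stage_retrieve(text_emb, text_tokens, video_embs, video_tokens, k=5):
    """Shortlist with cosine similarity, then re-rank only the shortlist."""
    cand, _ = select_candidates(text_emb, video_embs, k)
    scores = np.array(
        [cross_attention_score(text_tokens, video_tokens[i]) for i in cand]
    )
    order = np.argsort(-scores)
    return cand[order], scores[order]
```

The point of the split is cost: the token-level scorer runs on only `k` candidates per query rather than on every video in the collection, while the first stage remains a single matrix-vector product.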

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Video Analysis and Summarization

Methods: Contrastive Language-Image Pre-training