Fine-grained Text-Video Retrieval with Frozen Image Encoders
Zuozhuo Dai, Fangtao Shao, Qingkun Su, Zilong Dong, Siyu Zhu

TL;DR
This paper introduces CrossTVR, a two-stage text-video retrieval system that enhances fine-grained multimodal interaction using a decoupled cross attention module and leverages frozen CLIP models for improved scalability and performance.
Contribution
It proposes a novel two-stage architecture with a decoupled cross attention module and employs frozen CLIP models to improve fine-grained retrieval in text-video tasks.
Findings
Outperforms state-of-the-art methods on text-video retrieval datasets.
Demonstrates scalability to larger pre-trained vision models like ViT-G.
Effective in capturing fine-grained spatial and temporal multimodal information.
Abstract
State-of-the-art text-video retrieval (TVR) methods typically use CLIP and cosine similarity for efficient retrieval. Meanwhile, cross attention methods, which employ a transformer decoder to compute attention between each text query and all frames in a video, offer a more comprehensive interaction between text and videos. However, these methods lack important fine-grained spatial information because they compute attention directly between text and video-level tokens. To address this issue, we propose CrossTVR, a two-stage text-video retrieval architecture. In the first stage, we leverage existing TVR methods with a cosine similarity network for efficient text/video candidate selection. In the second stage, we propose a novel decoupled video-text cross attention module to capture fine-grained multimodal information in the spatial and temporal dimensions. Additionally, we employ the frozen CLIP…
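The two-stage flow described in the abstract (cheap cosine-similarity candidate selection, then a finer-grained rerank) can be illustrated with a small, dependency-free sketch. This is not the authors' implementation: the function names, the toy 2-D embeddings, and the second-stage scorer are all assumptions made for illustration. In particular, `rerank_score` is a hypothetical stand-in that lets the text vector attend first over patch tokens within each frame (spatial) and then over the resulting per-frame summaries (temporal); the paper's actual decoupled cross attention module is a learned transformer component whose details are not given in this summary.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def select_candidates(text_emb, video_embs, k):
    """Stage 1: rank every video by cosine similarity and keep the top-k indices."""
    order = sorted(range(len(video_embs)),
                   key=lambda i: cosine(text_emb, video_embs[i]),
                   reverse=True)
    return order[:k]

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, tokens):
    """Single-head scaled dot-product attention with the tokens as keys and values."""
    scale = math.sqrt(len(query))
    weights = softmax([sum(q * t for q, t in zip(query, tok)) / scale
                       for tok in tokens])
    return [sum(w * tok[d] for w, tok in zip(weights, tokens))
            for d in range(len(query))]

def rerank_score(text_emb, frames):
    """Stage 2 stand-in (assumption): frames[f][p] is patch token p of frame f.
    Spatial step: the text attends over the patches of each frame.
    Temporal step: the text attends over the per-frame summaries."""
    frame_summaries = [attend(text_emb, patches) for patches in frames]
    video_summary = attend(text_emb, frame_summaries)
    return cosine(text_emb, video_summary)

def retrieve(text_emb, video_embs, video_patches, k):
    """Select k candidates with stage 1, then reorder only those with stage 2."""
    candidates = select_candidates(text_emb, video_embs, k)
    return sorted(candidates,
                  key=lambda i: rerank_score(text_emb, video_patches[i]),
                  reverse=True)

# Toy data (hypothetical): three videos with one video-level embedding each,
# plus per-frame patch tokens that only the second stage looks at.
text = [1.0, 0.0]
video_embs = [[1.0, 0.1], [0.9, 0.5], [0.0, 1.0]]
video_patches = [
    [[[0.5, 0.5], [0.5, 0.5]], [[0.5, 0.5], [0.5, 0.5]]],  # video 0: uniform patches
    [[[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]]],  # video 1: one patch matches the text
    [[[0.0, 1.0]]],                                        # video 2: unrelated
]

print(select_candidates(text, video_embs, k=2))
print(retrieve(text, video_embs, video_patches, k=2))
```

In this toy data, stage 1 ranks video 0 ahead of video 1 from the video-level embeddings alone, while the patch-level rerank promotes video 1 because one of its patches matches the query direction exactly. Only the k selected candidates pay the cost of the second stage, which is the efficiency argument behind a two-stage design.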
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Video Analysis and Summarization
Methods: Contrastive Language-Image Pre-training
