Repeating Words for Video-Language Retrieval with Coarse-to-Fine Objectives
Haoyu Zhao, Jiaxi Gu, Shicong Wang, Xing Zhang, Hang Xu, Zuxuan Wu, Yu-Gang Jiang

TL;DR
This paper introduces a coarse-to-fine learning framework and an inference pipeline that repeats keywords to improve video-language retrieval accuracy without extra training. It targets two gaps in prior work: the computational cost of large-scale pre-training and the underexplored fine-grained alignment between videos and texts.
Contribution
The paper proposes coarse-to-fine training objectives and an inference pipeline that uses keyword repetition and voting to improve retrieval performance without additional training.
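The summary names repetition and voting but does not spell out how they are combined, so the following is only a minimal sketch under stated assumptions, not the authors' pipeline. It assumes that each query variant (the original text plus keyword-repeated versions) has already been scored against every candidate video by some retrieval model, and that the final answer is a majority vote over each variant's top-1 candidate. The function names `repeat_keywords` and `vote`, the `times` parameter, and the tie-breaking rule are all hypothetical.

```python
from collections import Counter


def repeat_keywords(query, keywords, times=2):
    """Build a query variant by appending each keyword `times` extra times.

    Hypothetical helper: the repetition scheme is an assumption.
    """
    extra = " ".join(k for k in keywords for _ in range(times))
    return f"{query} {extra}".strip()


def vote(score_lists):
    """Pick one candidate index by majority vote across query variants.

    score_lists[v][i] is the similarity of query variant v to candidate i.
    Each variant votes for its top-1 candidate; ties between candidates
    are broken by the highest score summed over all variants (assumed rule).
    """
    top1 = [max(range(len(s)), key=s.__getitem__) for s in score_lists]
    counts = Counter(top1)
    best = max(counts.values())
    tied = [i for i, c in counts.items() if c == best]
    return max(tied, key=lambda i: sum(s[i] for s in score_lists))


# Three query variants scored against three candidate videos (made-up numbers).
scores = [
    [0.2, 0.9, 0.1],
    [0.3, 0.8, 0.5],
    [0.9, 0.1, 0.2],
]
winner = vote(scores)
```

Because the vote operates only on similarity scores, a scheme like this needs no gradient updates, which is consistent with the claim that the pipeline adds no training cost.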
Findings
Achieves 2.1% higher Recall@1 on MSR-VTT
Improves performance on DiDeMo by 1.6%
Outperforms previous methods on four benchmarks
Abstract
The explosive growth of video streaming presents challenges in achieving high accuracy and low training costs for video-language retrieval. However, existing methods rely on large-scale pre-training to improve video retrieval performance, resulting in significant computational demands. Additionally, the fine-grained information in videos and texts remains underexplored. To alleviate these problems, we propose a novel framework to learn fine-grained features for better alignment and introduce an inference pipeline to improve performance without additional training. Specifically, we employ coarse-to-fine objectives to understand the semantic information of video-text pairs, including contrastive and matching learning. The fine-grained data used for training is obtained through the Granularity-Aware Representation module, which is designed based on similarity analysis between video frames…
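The abstract mentions contrastive learning as one of the objectives but the visible text does not give its form. As a rough illustration only, the sketch below shows a symmetric InfoNCE-style loss of the kind commonly used for video-text alignment. It is an assumption that the paper uses this formulation; the function name `info_nce` and the `temperature` value are hypothetical, and the matching objective and the Granularity-Aware Representation module are not covered.

```python
import math


def info_nce(sim, temperature=0.05):
    """Symmetric contrastive loss over a square similarity matrix.

    sim[i][j] is the similarity between video i and text j; matched
    pairs are assumed to sit on the diagonal.
    """
    n = len(sim)

    def one_direction(rows):
        total = 0.0
        for i, row in enumerate(rows):
            logits = [x / temperature for x in row]
            m = max(logits)  # subtract the max for numerical stability
            log_z = m + math.log(sum(math.exp(l - m) for l in logits))
            total += log_z - logits[i]  # negative log-softmax of the match
        return total / n

    cols = [list(c) for c in zip(*sim)]  # transpose for text-to-video
    return 0.5 * (one_direction(sim) + one_direction(cols))


aligned = [[1.0, 0.0], [0.0, 1.0]]    # matched pairs score highest
shuffled = [[0.0, 1.0], [1.0, 0.0]]   # mismatched pairs score highest
uniform = [[0.5, 0.5], [0.5, 0.5]]    # no preference at all
```

The loss is low when each video is most similar to its own text and high when it is most similar to another text, which is the behavior a contrastive objective is meant to reward.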
Taxonomy
Topics: Multimodal Machine Learning Applications · Video Analysis and Summarization · Advanced Image and Video Retrieval Techniques
