Repeating Words for Video-Language Retrieval with Coarse-to-Fine Objectives

Haoyu Zhao; Jiaxi Gu; Shicong Wang; Xing Zhang; Hang Xu; Zuxuan Wu; Yu-Gang Jiang

arXiv:2508.14812·cs.CV·August 21, 2025

Open Access

TL;DR

This paper introduces a coarse-to-fine learning framework and an inference pipeline that repeats keywords to improve video-language retrieval accuracy without extra training. It targets two problems: the computational cost of large-scale pre-training and the underexplored fine-grained alignment between videos and texts.

Contribution

The paper proposes a coarse-to-fine objective that combines contrastive and matching learning, together with an inference pipeline that uses keyword repetition and voting to improve retrieval performance without additional pre-training.
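To make the repetition-and-voting idea concrete, here is a minimal sketch. It is not the authors' pipeline: the helper names `repeat_keywords` and `vote_top1`, the way keywords are appended to the caption, and the majority-vote rule with a summed-score tie-break are all assumptions chosen for illustration. It assumes each query variant has already been scored against every candidate video by some retrieval model.

```python
from collections import Counter


def repeat_keywords(caption, keywords, times=2):
    """Hypothetical helper: append each keyword `times` times to the caption."""
    return caption + " " + " ".join(keywords * times)


def vote_top1(score_lists):
    """Majority vote over the top-1 video of each query variant.

    score_lists[q][v] is the similarity between query variant q and video v.
    Ties in vote count are broken by the highest summed similarity
    (an assumption, not taken from the paper).
    """
    n_videos = len(score_lists[0])
    votes = Counter()
    for scores in score_lists:
        best = max(range(n_videos), key=lambda v: scores[v])
        votes[best] += 1
    totals = [sum(scores[v] for scores in score_lists) for v in range(n_videos)]
    return max(range(n_videos), key=lambda v: (votes[v], totals[v]))


# Three query variants scored against three candidate videos (made-up numbers).
scores = [
    [0.2, 0.7, 0.1],    # original caption
    [0.3, 0.6, 0.1],    # caption with repeated keywords
    [0.9, 0.05, 0.05],  # another variant that disagrees
]
winner = vote_top1(scores)
```

In this toy case two of the three variants rank video 1 first, so the vote selects it even though one variant strongly prefers video 0. Because the aggregation happens only at inference time, a scheme like this needs no extra training, which matches the claim above.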

Findings

01 Achieves 2.1% higher Recall@1 on MSR-VTT

02 Improves performance on DiDeMo by 1.6%

03 Outperforms previous methods on four benchmarks

Abstract

The explosive growth of video streaming presents challenges in achieving high accuracy and low training costs for video-language retrieval. However, existing methods rely on large-scale pre-training to improve video retrieval performance, resulting in significant computational demands. Additionally, the fine-grained information in videos and texts remains underexplored. To alleviate these problems, we propose a novel framework to learn fine-grained features for better alignment and introduce an inference pipeline to improve performance without additional training. Specifically, we employ coarse-to-fine objectives to understand the semantic information of video-text pairs, including contrastive and matching learning. The fine-grained data used for training is obtained through the Granularity-Aware Representation module, which is designed based on similarity analysis between video frames…
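The contrastive part of such an objective is commonly written as a symmetric cross-entropy over a batch similarity matrix. The sketch below shows that standard formulation only. The function name `symmetric_contrastive_loss` and the temperature value are assumptions, and the paper's exact losses, including its matching objective, may differ.

```python
import math


def symmetric_contrastive_loss(sim, temperature=0.05):
    """Symmetric InfoNCE-style loss over an n x n similarity matrix.

    sim[i][j] is the similarity between video i and text j; matched pairs
    lie on the diagonal. The temperature is an assumed hyperparameter.
    """
    n = len(sim)

    def cross_entropy(rows):
        total = 0.0
        for i, row in enumerate(rows):
            logits = [s / temperature for s in row]
            m = max(logits)  # subtract the max for numerical stability
            log_z = m + math.log(sum(math.exp(l - m) for l in logits))
            total += log_z - logits[i]  # negative log-probability of the match
        return total / n

    # Transpose to get the text-to-video direction.
    cols = [[sim[i][j] for i in range(n)] for j in range(n)]
    return 0.5 * (cross_entropy(sim) + cross_entropy(cols))


aligned = symmetric_contrastive_loss([[1.0, 0.0], [0.0, 1.0]])
uninformative = symmetric_contrastive_loss([[0.0, 0.0], [0.0, 0.0]])
```

A similarity matrix that is uniform across pairs gives a loss of log n, while one that scores matched pairs highest drives the loss toward zero.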

Taxonomy

Topics: Multimodal Machine Learning Applications · Video Analysis and Summarization · Advanced Image and Video Retrieval Techniques