Bridging Information Asymmetry in Text-video Retrieval: A Data-centric Approach
Zechen Bai, Tianjun Xiao, Tong He, Pichao Wang, Zheng Zhang, Thomas Brox, Mike Zheng Shou

TL;DR
This paper presents a data-centric approach to improve text-video retrieval by enriching textual representations and using large language models to generate diverse queries, achieving state-of-the-art results.
Contribution
It introduces a novel framework that segments videos, generates diverse queries with LLMs, and employs query selection to enhance retrieval accuracy and efficiency.
Findings
Achieves state-of-the-art results on multiple benchmarks.
Enriches textual representations to better match video content.
Uses LLM-generated diverse queries to improve retrieval robustness.
Abstract
As online video content rapidly grows, the task of text-video retrieval (TVR) becomes increasingly important. A key challenge in TVR is the information asymmetry between video and text: videos are inherently richer in information, while their textual descriptions often capture only fragments of this complexity. This paper introduces a novel, data-centric framework to bridge this gap by enriching textual representations to better match the richness of video content. During training, videos are segmented into event-level clips and captioned to ensure comprehensive coverage. During retrieval, a large language model (LLM) generates semantically diverse queries to capture a broader range of possible matches. To enhance retrieval efficiency, we propose a query selection mechanism that identifies the most relevant and diverse queries, reducing computational cost while improving accuracy. Our…
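The abstract describes selecting queries that are both relevant and diverse, but it does not state the selection criterion. The following is a minimal sketch of one common way to do this, assuming a greedy maximal-marginal-relevance (MMR) style rule over precomputed query embeddings. It is a stand-in for illustration, not the authors' mechanism. The names `select_queries`, `video_score`, and `trade_off` are hypothetical, and the max-over-queries scoring is likewise an assumption.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def select_queries(original, candidates, k, trade_off=0.5):
    """Greedily pick up to k candidate indices (hypothetical MMR-style rule).

    original   -- embedding of the user's original query
    candidates -- embeddings of the LLM-generated queries
    trade_off  -- weight on relevance; (1 - trade_off) penalizes redundancy
    """
    selected = []
    remaining = list(range(len(candidates)))

    def score(i):
        # Relevance: similarity to the original query.
        relevance = cosine(candidates[i], original)
        # Redundancy: highest similarity to any query already selected.
        redundancy = max(
            (cosine(candidates[i], candidates[j]) for j in selected),
            default=0.0,
        )
        return trade_off * relevance - (1.0 - trade_off) * redundancy

    while remaining and len(selected) < k:
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected


def video_score(query_embeddings, video_embedding):
    """Assumed aggregation: a video's score is its best match over the selected queries."""
    return max(cosine(q, video_embedding) for q in query_embeddings)


original = [1.0, 0.0]
candidates = [[0.9, 0.1], [0.9, 0.12], [0.5, 0.5]]
# A low trade_off favors diversity, so the near-duplicate candidate 1 is skipped.
chosen = select_queries(original, candidates, k=2, trade_off=0.3)
```

Scoring only the selected subset against the video index, rather than every generated query, is where a cost reduction of the kind the abstract mentions would come from in this sketch.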
Taxonomy
Topics: Advanced Image and Video Retrieval Techniques · Video Analysis and Summarization · Multimodal Machine Learning Applications
Methods: Sparse Evolutionary Training · Focus
