Bridging Information Asymmetry in Text-video Retrieval: A Data-centric Approach

Zechen Bai; Tianjun Xiao; Tong He; Pichao Wang; Zheng Zhang; Thomas Brox; Mike Zheng Shou

arXiv:2408.07249 · cs.CV · March 11, 2025

TL;DR

This paper presents a data-centric approach to improve text-video retrieval by enriching textual representations and using large language models to generate diverse queries, achieving state-of-the-art results.

Contribution

It introduces a novel framework that segments videos, generates diverse queries with LLMs, and employs query selection to enhance retrieval accuracy and efficiency.

Findings

01. Achieves state-of-the-art results on multiple benchmarks.

02. Enriches textual representations to better match video content.

03. Uses LLM-generated diverse queries to improve retrieval robustness.

Abstract

As online video content rapidly grows, the task of text-video retrieval (TVR) becomes increasingly important. A key challenge in TVR is the information asymmetry between video and text: videos are inherently richer in information, while their textual descriptions often capture only fragments of this complexity. This paper introduces a novel, data-centric framework to bridge this gap by enriching textual representations to better match the richness of video content. During training, videos are segmented into event-level clips and captioned to ensure comprehensive coverage. During retrieval, a large language model (LLM) generates semantically diverse queries to capture a broader range of possible matches. To enhance retrieval efficiency, we propose a query selection mechanism that identifies the most relevant and diverse queries, reducing computational cost while improving accuracy. Our…
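The abstract does not say how the query selection mechanism works, so the following is only a minimal sketch of one plausible approach. It assumes the original query, the LLM-generated queries, and the videos have already been embedded as vectors by some text-video model. Selection uses a greedy maximal-marginal-relevance (MMR) rule, which is a stand-in technique and not necessarily the authors' method. The names `select_queries`, `rank_videos`, and `trade_off` are hypothetical, and scoring each video by its best match over the selected queries is likewise an assumption.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def select_queries(original, candidates, k, trade_off=0.5):
    """Greedy MMR-style selection (an assumed stand-in, not the paper's method).

    Returns the indices of up to k candidate queries that are relevant to the
    original query while being dissimilar to the ones already chosen.
    """
    selected = []
    remaining = list(range(len(candidates)))
    while remaining and len(selected) < k:
        def mmr(i):
            relevance = cosine(candidates[i], original)
            redundancy = max(
                (cosine(candidates[i], candidates[j]) for j in selected),
                default=0.0,
            )
            return trade_off * relevance - (1 - trade_off) * redundancy

        best = max(remaining, key=mmr)
        selected.append(best)
        remaining.remove(best)
    return selected


def rank_videos(query_vecs, video_vecs):
    """Score each video by its best match over all queries; return indices, best first."""
    scores = [max(cosine(q, v) for q in query_vecs) for v in video_vecs]
    return sorted(range(len(video_vecs)), key=lambda i: scores[i], reverse=True)
```

Because only the `k` selected queries are compared against the video collection, the retrieval cost grows with `k` and not with the number of queries the LLM generated. This is the kind of saving the abstract attributes to query selection.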

Taxonomy

Topics: Advanced Image and Video Retrieval Techniques · Video Analysis and Summarization · Multimodal Machine Learning Applications

Methods: Sparse Evolutionary Training · Focus