Modality-Balanced Embedding for Video Retrieval
Xun Wang, Bingqing Ke, Xuanping Li, Fangyu Liu, Mingyu Zhang, Xiao Liang, Qiushi Xiao, Cheng Luo, Yue Yu

TL;DR
This paper introduces MBVR, a method to address modality bias in video retrieval models, improving the balanced use of text, vision, and audio modalities for more accurate video search results.
Contribution
We propose MBVR, a novel approach that uses modality-shuffled samples and a dynamic margin to balance modality attention in video retrieval models, enhancing retrieval accuracy.
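To make the idea of a modality-shuffled sample concrete, here is a minimal sketch, not the authors' implementation. It assumes each video is represented by separate text and vision feature vectors, and that a shuffled sample keeps a video's text features while borrowing the vision features of another video in the batch. The function names `modality_shuffle` and `margin_loss`, the cyclic-shift strategy, and the hinge form of the loss are all illustrative assumptions; the margin is passed in as a plain value or per-sample array rather than computed dynamically.

```python
import numpy as np


def modality_shuffle(text_feats, vision_feats, rng):
    """Pair each video's text features with another video's vision features.

    Hypothetical helper: a cyclic shift by a random nonzero offset ensures
    that no row is paired with the vision features at its own index.
    """
    n = text_feats.shape[0]
    if n < 2:
        raise ValueError("need at least two videos in the batch to shuffle")
    offset = int(rng.integers(1, n))  # offset in [1, n-1]
    shuffled_vision = np.roll(vision_feats, offset, axis=0)
    return text_feats, shuffled_vision


def margin_loss(pos_scores, neg_scores, margin):
    """Hinge loss asking each original video to outscore its shuffled copy.

    `margin` may be a scalar or a per-sample array (assumption: how the
    margin is chosen is left to the caller).
    """
    return np.maximum(0.0, margin - (pos_scores - neg_scores)).mean()


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    text = rng.normal(size=(4, 8))    # toy text features for 4 videos
    vision = rng.normal(size=(4, 8))  # toy vision features for 4 videos
    _, shuffled = modality_shuffle(text, vision, rng)
    print(shuffled.shape)
```

Because the shuffled sample has the same text as the original video, a model that scores only by text matching cannot tell the two apart, so a loss of this shape penalizes it unless it also attends to the vision features.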
Findings
Empirical improvement in retrieval accuracy on real-world datasets.
Statistically significant boost observed in large-scale platform deployment.
Effective and efficient solution to modality bias in video retrieval.
Abstract
Video search has become the main routine for users to discover videos relevant to a text query on large short-video sharing platforms. While training a query-video bi-encoder model using online search logs, we identify a modality bias phenomenon in which the video encoder relies almost entirely on text matching, neglecting other modalities of the videos such as vision and audio. This modality imbalance results from a) modality gap: the relevance between a query and a video text is much easier to learn as the query is also a piece of text, with the same modality as the video text; b) data bias: most training samples can be solved solely by text matching. Here we share our practices to improve the first retrieval stage including our solution for the modality imbalance issue. We propose MBVR (short for Modality Balanced Video Retrieval) with two key components: manually generated…
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Domain Adaptation and Few-Shot Learning
