Bidirectional Likelihood Estimation with Multi-Modal Large Language Models for Text-Video Retrieval

Dohwan Ko; Ji Soo Lee; Minhyuk Choi; Zihang Meng; Hyunwoo J. Kim

arXiv:2507.23284·cs.CV·September 30, 2025

TL;DR

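This paper introduces BLiM, a bidirectional likelihood estimation framework for text-video retrieval with multi-modal large language models. It is paired with Candidate Prior Normalization (CPN), a training-free score calibration module that reduces candidate prior bias, so that retrieval favors candidates relevant to the query rather than candidates with inherently high priors.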

Contribution

The paper proposes a bidirectional likelihood estimation method, combined with a training-free Candidate Prior Normalization step, to mitigate candidate prior bias in retrieval based on multi-modal large language models.

Findings

01  BLiM outperforms previous models by 6.4 R@1 on average across benchmarks.

02  Candidate Prior Normalization effectively reduces candidate prior bias.

03  The approach enhances relevance detection in various multi-modal tasks.

Abstract

Text-Video Retrieval aims to find the most relevant text (or video) candidate given a video (or text) query from large-scale online databases. Recent work leverages multi-modal large language models (MLLMs) to improve retrieval, especially for long or complex query-candidate pairs. However, we observe that the naive application of MLLMs, i.e., retrieval based on candidate likelihood, introduces candidate prior bias, favoring candidates with inherently higher priors over those more relevant to the query. To this end, we propose a novel retrieval framework, Bidirectional Likelihood Estimation with MLLM (BLiM), which leverages both query and candidate likelihoods by training the model to generate text from a given video as well as video features from a given text. Furthermore, we introduce Candidate Prior Normalization (CPN), a simple yet effective training-free score calibration module…
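
The candidate prior bias described in the abstract can be illustrated with a toy scoring sketch. The abstract is truncated before it defines CPN, so everything below is an assumption made for illustration and not the authors' formula: the subtraction of a weighted log prior (a PMI-style calibration), the weight `alpha`, the simple sum with the query likelihood, and all function names and numbers are hypothetical.

```python
from typing import List


def naive_best(cand_loglik: List[float]) -> int:
    """Index of the candidate with the highest log P(candidate | query)."""
    return max(range(len(cand_loglik)), key=lambda i: cand_loglik[i])


def calibrated_scores(
    cand_loglik: List[float],    # log P(candidate_i | query)
    query_loglik: List[float],   # log P(query | candidate_i)
    cand_logprior: List[float],  # log P(candidate_i), no query given
    alpha: float = 1.0,          # hypothetical calibration weight
) -> List[float]:
    """Assumed scoring rule: candidate likelihood minus a weighted
    candidate prior, plus the query likelihood (the other direction)."""
    return [
        cl - alpha * cp + ql
        for cl, ql, cp in zip(cand_loglik, query_loglik, cand_logprior)
    ]


# Made-up log-probabilities for one query and two candidates.
# Candidate 0 is a generic caption with a high prior; candidate 1 is
# the relevant one but is unlikely a priori.
cand_loglik = [-1.5, -4.0]
query_loglik = [-8.0, -3.0]
cand_logprior = [-2.0, -10.0]

naive_idx = naive_best(cand_loglik)
scores = calibrated_scores(cand_loglik, query_loglik, cand_logprior)
calibrated_idx = max(range(len(scores)), key=lambda i: scores[i])

print("naive choice:", naive_idx)
print("calibrated scores:", scores)
print("calibrated choice:", calibrated_idx)
```

In this toy setting the naive rule picks the generic candidate because its likelihood is dominated by its prior, while the calibrated, bidirectional score picks the relevant one. How BLiM actually combines the two likelihoods and how CPN weights the prior should be taken from the paper itself.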

Code & Models

Datasets

ikodoh/BLiM-Data (dataset, 115 downloads)

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Text and Document Classification Technologies