Quantifying and Narrowing the Unknown: Interactive Text-to-Video Retrieval via Uncertainty Minimization

Bingqing Zhang; Zhuo Cao; Heming Du; Yang Li; Xue Li; Jiajun Liu; Sen Wang

arXiv:2507.15504 · cs.CV · July 25, 2025

TL;DR

This paper introduces UMIVR, an interactive text-to-video retrieval system that explicitly quantifies uncertainties to refine user queries and improve retrieval accuracy through targeted clarifications.

Contribution

It proposes a novel uncertainty quantification framework for interactive TVR, enabling explicit measurement and reduction of ambiguities without additional training.

Findings

01. Achieves 69.2% Recall@1 after 10 rounds on MSR-VTT-1k

02. Effectively reduces retrieval ambiguity through uncertainty-guided questioning

03. Demonstrates significant improvements over baseline methods
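
For readers unfamiliar with the metric in the first finding, Recall@K is the fraction of queries whose ground-truth video appears among the top K retrieved results. The sketch below is a generic illustration only; the function name `recall_at_k` and the video IDs are hypothetical and are not taken from the paper or its code.

```python
def recall_at_k(ranked_ids, target_ids, k=1):
    """Fraction of queries whose target video is within the top-k results.

    ranked_ids: one ranked list of video IDs per query (best match first).
    target_ids: the ground-truth video ID for each query.
    """
    if len(ranked_ids) != len(target_ids):
        raise ValueError("need exactly one ranking per target")
    hits = sum(
        1
        for ranking, target in zip(ranked_ids, target_ids)
        if target in ranking[:k]
    )
    return hits / len(target_ids)


# Hypothetical toy data: two queries, two candidate videos each.
rankings = [["v1", "v2"], ["v3", "v4"]]
targets = ["v1", "v4"]
r_at_1 = recall_at_k(rankings, targets, k=1)  # only the first query hits
r_at_2 = recall_at_k(rankings, targets, k=2)  # both queries hit
```

In an interactive setting, the same metric would be recomputed after each round of clarification, which is how a figure such as "Recall@1 after 10 rounds" is read.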

Abstract

Despite recent advances, text-to-video retrieval (TVR) is still hindered by multiple inherent uncertainties, such as ambiguous textual queries, indistinct text-video mappings, and low-quality video frames. Although interactive systems have emerged to address these challenges by refining user intent through clarifying questions, current methods typically rely on heuristic or ad hoc strategies without explicitly quantifying these uncertainties, which limits their effectiveness. Motivated by this gap, we propose UMIVR, an Uncertainty-Minimizing Interactive Text-to-Video Retrieval framework that explicitly quantifies three critical uncertainties (text ambiguity, mapping uncertainty, and frame uncertainty) via principled, training-free metrics: a semantic entropy-based Text Ambiguity Score (TAS), a Jensen-Shannon divergence-based Mapping Uncertainty Score (MUS), and a Temporal Quality-based Frame…
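
The abstract names two information-theoretic quantities, entropy and Jensen-Shannon divergence, as the basis of TAS and MUS. The paper's exact definitions are not part of this excerpt, so the following is only a minimal sketch of the generic quantities themselves. The helper names, the softmax step, and the example similarity scores are all assumptions made for illustration, not the authors' implementation.

```python
import math


def softmax(scores, temperature=1.0):
    """Turn raw similarity scores into a probability distribution."""
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp((s - top) / temperature) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(p):
    """Shannon entropy in nats; zero-probability terms contribute nothing."""
    return -sum(x * math.log(x) for x in p if x > 0)


def kl_divergence(p, q):
    """KL(p || q); assumes q is positive wherever p is positive."""
    return sum(a * math.log(a / b) for a, b in zip(p, q) if a > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence: symmetric and bounded by log(2) in nats."""
    mix = [(a + b) / 2 for a, b in zip(p, q)]
    return 0.5 * kl_divergence(p, mix) + 0.5 * kl_divergence(q, mix)


# Hypothetical query-to-video similarity scores for two queries.
peaked = softmax([5.0, 0.5, 0.2, 0.1])  # one clear best match
flat = softmax([1.0, 1.0, 1.0, 1.0])    # no candidate stands out

peaked_entropy = entropy(peaked)  # low: the ranking is confident
flat_entropy = entropy(flat)      # high: equals log(4) for four candidates
gap = js_divergence(peaked, flat)  # how far apart the two distributions are
```

Under these assumptions, a flatter distribution over candidates has higher entropy, and the Jensen-Shannon divergence gives a symmetric, bounded measure of how much two such distributions differ. How UMIVR combines these quantities into its scores is specified in the paper itself.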

Taxonomy

Topics: Multimodal Machine Learning Applications · Video Analysis and Summarization · Advanced Image and Video Retrieval Techniques