HUD: Hierarchical Uncertainty-Aware Disambiguation Network for Composed Video Retrieval

Zhiwei Chen; Yupeng Hu; Zixu Li; Zhiheng Fu; Haokun Wen; Weili Guan

arXiv:2512.02792·cs.CV·December 16, 2025

TL;DR

The paper introduces HUD, a hierarchical uncertainty-aware network that improves composed video and image retrieval by addressing multi-modal query understanding and semantic disambiguation, achieving state-of-the-art results.

Contribution

HUD is the first framework to leverage the disparity in information density between video and text for enhanced multi-modal query disambiguation.

Findings

01. Achieves state-of-the-art performance on three benchmark datasets.

02. Effectively disambiguates modification subjects in multi-modal queries.

03. Enhances semantic focus for more accurate retrieval.

Abstract

Composed Video Retrieval (CVR) is a challenging video retrieval task that utilizes multi-modal queries, consisting of a reference video and modification text, to retrieve the desired target video. The core of this task lies in understanding the multi-modal composed query and achieving accurate composed feature learning. Within multi-modal queries, the video modality typically carries richer semantic content compared to the textual modality. However, previous works have largely overlooked the disparity in information density between these two modalities. This limitation can lead to two critical issues: 1) modification subject referring ambiguity and 2) limited detailed semantic focus, both of which degrade the performance of CVR models. To address the aforementioned issues, we propose a novel CVR framework, namely the Hierarchical Uncertainty-aware Disambiguation network (HUD). HUD is…

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Video Analysis and Summarization