TL;DR
This paper introduces Holmes, a hierarchical evidential learning framework that explicitly models uncertainty in partially relevant video retrieval, improving performance by aggregating multi-granular cross-modal evidence.
Contribution
Holmes is a novel framework that combines hierarchical evidential learning with soft query-clip alignment to better handle uncertainty and sparse supervision in video retrieval.
Findings
Holmes outperforms state-of-the-art methods on benchmark datasets.
The framework effectively models uncertainty using Dirichlet distributions.
Adaptive query-clip alignment improves dense evidence accumulation.
Abstract
Partially relevant video retrieval aims to retrieve untrimmed videos using text queries that describe only partial content. However, the inherent asymmetry between brief queries and rich video content inevitably introduces uncertainty into the retrieval process. In this setting, vague queries often induce semantic ambiguity across videos, a challenge that is further exacerbated by the sparse temporal supervision within videos, which fails to provide sufficient matching evidence. To address this, we propose Holmes, a hierarchical evidential learning framework that aggregates multi-granular cross-modal evidence to quantify and model uncertainty explicitly. At the inter-video level, similarity scores are interpreted as evidential support and modeled via a Dirichlet distribution. Based on the proposed three-fold principle, we perform fine-grained query identification, which then guides…
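To make the inter-video step concrete, here is a minimal sketch of how similarity scores can be read as evidence for a Dirichlet distribution. It uses the common subjective-logic formulation from evidential deep learning, not necessarily the paper's exact one. The softplus evidence function, the `scale` parameter, and the function names are assumptions for illustration; the abstract does not specify them, and the hierarchical, clip-level part of Holmes is not covered.

```python
import math


def softplus(x: float) -> float:
    # Numerically stable log(1 + exp(x)); keeps evidence non-negative.
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def dirichlet_opinion(similarities, scale=1.0):
    """Turn one query's similarity scores over K candidate videos into a
    Dirichlet opinion (hypothetical helper, not the authors' code)."""
    # Assumption: evidence is a softplus of the (scaled) similarity score.
    evidence = [softplus(scale * s) for s in similarities]
    # Dirichlet concentration parameters: alpha_k = e_k + 1.
    alpha = [e + 1.0 for e in evidence]
    strength = sum(alpha)  # Dirichlet strength S
    k = len(alpha)
    # Belief mass per video and the leftover uncertainty mass u = K / S.
    belief = [e / strength for e in evidence]
    uncertainty = k / strength
    # Expected matching probability under the Dirichlet distribution.
    expected_prob = [a / strength for a in alpha]
    return belief, uncertainty, expected_prob


# A specific query (one video clearly matches) versus a vague one.
sharp_belief, sharp_u, sharp_p = dirichlet_opinion([8.0, -4.0, -4.0])
vague_belief, vague_u, vague_p = dirichlet_opinion([0.1, 0.0, -0.1])
print(f"sharp query uncertainty: {sharp_u:.3f}")
print(f"vague query uncertainty: {vague_u:.3f}")
```

In this formulation the belief masses and the uncertainty mass sum to one, so a query that gathers little evidence for any video ends up with a large uncertainty mass. That is the sense in which a vague query is flagged as ambiguous across videos.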
