Ambiguity-Restrained Text-Video Representation Learning for Partially Relevant Video Retrieval

CH Cho; WJ Moon; W Jun; MS Jung; JP Heo

arXiv:2506.07471 · cs.CV · June 10, 2025

TL;DR

This paper introduces ARL, a framework for partially relevant video retrieval that explicitly models the ambiguity in text-video pairs and uses it during training, improving retrieval accuracy through hierarchical and fine-grained learning.

Contribution

The paper proposes Ambiguity-Restrained representation Learning (ARL), which combines ambiguity detection with multi-level semantic modeling to improve Partially Relevant Video Retrieval (PRVR) performance.

Findings

01. ARL effectively detects ambiguous pairs using uncertainty and similarity criteria.

02. Hierarchical learning improves semantic understanding of ambiguous text-video pairs.

03. Fine-grained frame-level modeling enhances retrieval accuracy in untrimmed videos.

Abstract

Partially Relevant Video Retrieval (PRVR) aims to retrieve a video where a specific segment is relevant to a given text query. Typical training processes of PRVR assume a one-to-one relationship where each text query is relevant to only one video. However, we point out the inherent ambiguity between text and video content based on their conceptual scope and propose a framework that incorporates this ambiguity into the model learning process. Specifically, we propose Ambiguity-Restrained representation Learning (ARL) to address ambiguous text-video pairs. Initially, ARL detects ambiguous pairs based on two criteria: uncertainty and similarity. Uncertainty represents whether instances include commonly shared context across the dataset, while similarity indicates pair-wise semantic overlap. Then, with the detected ambiguous pairs, our ARL hierarchically learns the semantic relationship via…
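
The detection step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not give formulas, so everything below is an assumption made for illustration. Similarity is taken to be cosine similarity between embeddings, uncertainty is taken to be the normalized entropy of a softmax over one text's similarities to all videos in the batch, and the function names, thresholds, temperature, and the rule that either criterion flags an unpaired text-video pair as ambiguous are all hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def normalized_entropy(scores, temperature=0.1):
    """Entropy of softmax(scores / temperature), scaled to [0, 1].

    Assumption: a text whose similarity mass is spread over many videos
    (high entropy) is treated as carrying commonly shared context.
    """
    if len(scores) < 2:
        return 0.0
    top = max(scores)
    exps = [math.exp((s - top) / temperature) for s in scores]
    total = sum(exps)
    probs = [e / total for e in exps]
    entropy = -sum(p * math.log(p) for p in probs if p > 0)
    return entropy / math.log(len(scores))


def label_pairs(text_embs, video_embs, sim_thresh=0.8, unc_thresh=0.9):
    """Label every (text i, video j) pair in a batch.

    Pair (i, i) is the annotated match and is always "positive".
    An unpaired (i, j) is "ambiguous" if it is highly similar or if
    text i is uncertain; otherwise it is "negative".
    The thresholds are hypothetical values, not taken from the paper.
    """
    sims = [[cosine(t, v) for v in video_embs] for t in text_embs]
    labels = []
    for i, row in enumerate(sims):
        uncertain = normalized_entropy(row) >= unc_thresh
        labels.append([
            "positive" if i == j
            else "ambiguous" if (s >= sim_thresh or uncertain)
            else "negative"
            for j, s in enumerate(row)
        ])
    return labels
```

Under these assumptions, pairs labeled "ambiguous" would be excluded from the negative set during contrastive training rather than pushed apart, which is one plausible reading of restraining ambiguity in the one-to-one training setup the abstract criticizes.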

Videos

Ambiguity-Restrained Text-Video Representation Learning for Partially Relevant Video Retrieval · underline

Taxonomy

Topics: Multimodal Machine Learning Applications · Video Analysis and Summarization · Advanced Image and Video Retrieval Techniques