Learning video retrieval models with relevance-aware online mining
Alex Falcon, Giuseppe Serra, Oswald Lanz

TL;DR
This paper introduces Relevance-Aware Negatives and Positives mining (RANP), a method that improves cross-modal video retrieval by selecting training negatives and positives according to their semantic relevance, reaching state-of-the-art results on EPIC-Kitchens-100 (+5.3% nDCG, +3.0% mAP).
Contribution
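The paper proposes RANP, a technique for selecting negatives and positives when training video-text retrieval models. It targets a flaw in the usual setup: captions that validly describe a video but are not paired with it in the dataset are treated as negatives and wrongly penalized.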
Findings
Achieves +5.3% nDCG on EPIC-Kitchens-100
Achieves +3.0% mAP on EPIC-Kitchens-100
Improves retrieval performance by mining negatives and positives according to their semantic relevance
Abstract
Due to the amount of videos and related captions uploaded every hour, deep learning-based solutions for cross-modal video retrieval are attracting more and more attention. A typical approach consists in learning a joint text-video embedding space, where the similarity of a video and its associated caption is maximized, whereas a lower similarity is enforced with all the other captions, called negatives. This approach assumes that only the video and caption pairs in the dataset are valid, but different captions - positives - may also describe its visual contents, hence some of them may be wrongly penalized. To address this shortcoming, we propose the Relevance-Aware Negatives and Positives mining (RANP) which, based on the semantics of the negatives, improves their selection while also increasing the similarity of other valid positives. We explore the influence of these techniques on two…
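The idea in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not specify the relevance function or the loss, so the version below assumes a precomputed relevance matrix `rel`, a fixed threshold `tau`, and a hinge (triplet) loss with hardest-sample mining in the video-to-text direction only. The names `mined_triplet_loss`, `sim`, `rel`, `tau`, and `margin` are all hypothetical.

```python
def mined_triplet_loss(sim, rel, margin=0.2, tau=0.5):
    """Video-to-text hinge loss with relevance-thresholded mining (illustrative).

    sim[i][j]: similarity between video i and caption j.
    rel[i][j]: assumed semantic relevance in [0, 1], with rel[i][i] == 1.
    tau: assumed threshold; captions with rel >= tau count as positives,
         captions with rel < tau count as negatives.
    """
    n = len(sim)
    total = 0.0
    for i in range(n):
        negatives = [j for j in range(n) if rel[i][j] < tau]
        positives = [j for j in range(n) if rel[i][j] >= tau]
        if not negatives or not positives:
            continue  # nothing to contrast for this video
        hard_neg = max(sim[i][j] for j in negatives)  # most similar negative
        hard_pos = min(sim[i][j] for j in positives)  # least similar positive
        total += max(0.0, margin + hard_neg - hard_pos)
    return total / n


# Toy batch: captions 0 and 1 both describe videos 0 and 1; caption 2 is unrelated.
sim = [[0.9, 0.8, 0.1],
       [0.7, 0.9, 0.2],
       [0.1, 0.2, 0.9]]
rel = [[1.0, 0.8, 0.0],
       [0.8, 1.0, 0.0],
       [0.0, 0.0, 1.0]]
# Baseline assumption: only the paired caption is relevant (identity matrix).
identity = [[1.0 if i == j else 0.0 for j in range(3)] for i in range(3)]

naive_loss = mined_triplet_loss(sim, identity)  # penalizes caption 1 for video 0
aware_loss = mined_triplet_loss(sim, rel)       # caption 1 is treated as a positive
print(naive_loss, aware_loss)
```

On this toy batch the baseline yields a small positive loss because the highly similar caption 1 is mined as a negative for video 0, whereas the relevance-aware variant yields zero loss because that caption is counted among the positives instead.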
Taxonomy
Topics: Multimodal Machine Learning Applications · Video Analysis and Summarization · Advanced Image and Video Retrieval Techniques
