Domain Adaptation in Multi-View Embedding for Cross-Modal Video Retrieval
Jonathan Munro, Michael Wray, Diane Larlus, Gabriela Csurka, Dima Damen

TL;DR
This paper introduces an unsupervised domain adaptation method for cross-modal video retrieval, aligning video embeddings across different domains to improve retrieval accuracy without requiring annotations in the target domain.
Contribution
It proposes a novel iterative domain alignment approach using pseudo-labeling and cross-domain ranking, specifically addressing the domain gap in uncaptioned video retrieval tasks.
Findings
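Consistently outperforms source-only training as well as marginal and conditional alignment methods
Adapts the embedding space to a target gallery of fine-grained action videos
Introduces a new benchmark for unsupervised domain adaptation in cross-modal video retrieval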
Abstract
Given a gallery of uncaptioned video sequences, this paper considers the task of retrieving videos based on their relevance to an unseen text query. To compensate for the lack of annotations, we rely instead on a related video gallery composed of video-caption pairs, termed the source gallery, albeit with a domain gap between its videos and those in the target gallery. We thus introduce the problem of Unsupervised Domain Adaptation for Cross-modal Video Retrieval, along with a new benchmark on fine-grained actions. We propose a novel iterative domain alignment method by means of pseudo-labelling target videos and cross-domain (i.e. source-target) ranking. Our approach adapts the embedding space to the target gallery, consistently outperforming source-only as well as marginal and conditional alignment methods.
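The abstract names two ingredients, pseudo-labelling target videos and cross-domain (source-target) ranking, without giving details here. The sketch below shows one generic way such pieces could fit together; it is not the authors' implementation. The function names, the use of cosine similarity, the nearest-caption labelling rule, the margin value, and the toy embeddings are all assumptions made for illustration.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def pseudo_label(target_videos, source_captions):
    """Assign each target video the index of its most similar source caption.

    Returns (caption_index, similarity) pairs; using the similarity as a
    confidence score is an assumption of this sketch.
    """
    labels = []
    for video in target_videos:
        sims = [cosine(video, caption) for caption in source_captions]
        best = max(range(len(sims)), key=sims.__getitem__)
        labels.append((best, sims[best]))
    return labels

def triplet_ranking_loss(anchor, positive, negative, margin=0.2):
    """Hinge loss: the positive should beat the negative by at least `margin`."""
    return max(0.0, margin - cosine(anchor, positive) + cosine(anchor, negative))

# Toy 2-D embeddings, made up for illustration.
source_captions = [[1.0, 0.0], [0.0, 1.0]]  # embeddings of two source captions
target_videos = [[0.9, 0.1], [0.2, 0.8]]    # embeddings of uncaptioned target videos

labels = pseudo_label(target_videos, source_captions)

# Cross-domain triplet: a source caption as anchor, the target video that was
# pseudo-labelled with it as positive, a differently labelled target video as negative.
loss = triplet_ranking_loss(source_captions[0], target_videos[0], target_videos[1])
```

In an iterative scheme of the kind the abstract describes, one would presumably update the embedding with such a loss and then recompute the pseudo-labels; that training loop is omitted here.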
Taxonomy
Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Advanced Image and Video Retrieval Techniques
