SEA: Sentence Encoder Assembly for Video Retrieval by Textual Queries
Xirong Li, Fangming Zhou, Chaoxi Xu, Jiaqi Ji, Gang Yang

TL;DR
This paper introduces SEA, a novel method for video retrieval using textual queries that leverages multiple sentence encoders across diverse common spaces, improving accuracy and robustness in cross-modal matching.
Contribution
SEA is the first to support multi-space matching with multi-loss learning, effectively exploiting diverse sentence encoders for better video retrieval performance.
Findings
SEA outperforms state-of-the-art methods on four benchmarks.
Multi-space multi-loss learning enhances matching accuracy.
SEA is simple to implement and adaptable to new encoders.
Abstract
Retrieving unlabeled videos by textual queries, known as Ad-hoc Video Search (AVS), is a core theme in multimedia data management and retrieval. The success of AVS depends on cross-modal representation learning that encodes both query sentences and videos into common spaces for semantic similarity computation. Inspired by the initial success of a few previous works in combining multiple sentence encoders, this paper takes a step forward by developing a new and general method for effectively exploiting diverse sentence encoders. The novelty of the proposed method, which we term Sentence Encoder Assembly (SEA), is two-fold. First, different from prior art that uses only a single common space, SEA supports text-video matching in multiple encoder-specific common spaces. Such a property prevents the matching from being dominated by a specific encoder that produces an encoding vector much…
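The multi-space matching idea described in the abstract can be illustrated with a minimal sketch. It assumes that each sentence encoder gets its own pair of linear projections into an encoder-specific common space, that similarity within each space is cosine similarity, and that the final score is the plain sum over spaces. The function names, the linear projections, and the toy dimensions are all hypothetical choices for illustration and are not taken from the authors' implementation.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale a vector to (approximately) unit length."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)


def multi_space_similarity(text_feats, video_feat, text_projs, video_projs):
    """Sum of per-space cosine similarities (illustrative assumption).

    text_feats  : list of 1-D arrays, one per sentence encoder
    video_feat  : 1-D array, a single video feature
    text_projs  : list of (d_common, d_text_i) matrices, one per encoder
    video_projs : list of (d_common, d_video) matrices, one per encoder
    """
    total = 0.0
    for t, w_text, w_video in zip(text_feats, text_projs, video_projs):
        # Project both modalities into this encoder's own common space.
        t_common = l2_normalize(w_text @ t)
        v_common = l2_normalize(w_video @ video_feat)
        # Cosine similarity is bounded in [-1, 1], so no single encoder
        # can dominate the sum just because its raw vector is longer.
        total += float(t_common @ v_common)
    return total


# Toy usage with random features; the dimensions are hypothetical.
rng = np.random.default_rng(0)
d_video, d_common = 16, 8
text_dims = [5, 300, 1024]  # three encoders with very different output sizes
text_feats = [rng.normal(size=d) for d in text_dims]
video_feat = rng.normal(size=d_video)
text_projs = [rng.normal(size=(d_common, d)) for d in text_dims]
video_projs = [rng.normal(size=(d_common, d_video)) for _ in text_dims]

score = multi_space_similarity(text_feats, video_feat, text_projs, video_projs)
```

Because every space contributes a value between -1 and 1, the combined score for three encoders stays within [-3, 3] regardless of how large each encoder's raw output is. In a trained model the projections would be learned, presumably with one loss term per space as the Contribution section suggests.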
Taxonomy
Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Domain Adaptation and Few-Shot Learning
