Are All Combinations Equal? Combining Textual and Visual Features with Multiple Space Learning for Text-Based Video Retrieval
Damianos Galanopoulos, Vasileios Mezaris

TL;DR
This paper presents a novel multiple space learning approach for text-based video retrieval, effectively combining diverse textual and visual features to improve cross-modal similarity estimation.
Contribution
It introduces a new network architecture that learns multiple joint feature spaces and employs softmax-based similarity revision for enhanced retrieval accuracy.
Findings
Effective combination of textual and visual features improves retrieval performance.
Multiple space learning outperforms single space approaches in experiments.
The method achieves state-of-the-art results on large-scale datasets.
Abstract
In this paper we tackle the cross-modal video retrieval problem and, more specifically, we focus on text-to-video retrieval. We investigate how to optimally combine multiple diverse textual and visual features into feature pairs that lead to generating multiple joint feature spaces, which encode text-video pairs into comparable representations. To learn these representations our proposed network architecture is trained by following a multiple space learning procedure. Moreover, at the retrieval stage, we introduce additional softmax operations for revising the inferred query-video similarities. Extensive experiments in several setups based on three large-scale datasets (IACC.3, V3C1, and MSR-VTT) lead to conclusions on how to best combine text-visual features and document the performance of the proposed network. Source code is made publicly available at:…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Code & Models
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsMultimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Video Analysis and Summarization
MethodsSoftmax
