Are All Combinations Equal? Combining Textual and Visual Features with Multiple Space Learning for Text-Based Video Retrieval

Damianos Galanopoulos; Vasileios Mezaris

arXiv:2211.11351·cs.CV·November 22, 2022

TL;DR

This paper presents a novel multiple space learning approach for text-based video retrieval, effectively combining diverse textual and visual features to improve cross-modal similarity estimation.

Contribution

It introduces a new network architecture that learns multiple joint feature spaces and employs softmax-based similarity revision for enhanced retrieval accuracy.

Findings

1. Effective combination of textual and visual features improves retrieval performance.
2. Multiple space learning outperforms single space approaches in experiments.
3. The method achieves state-of-the-art results on large-scale datasets.

Abstract

In this paper we tackle the cross-modal video retrieval problem and, more specifically, we focus on text-to-video retrieval. We investigate how to optimally combine multiple diverse textual and visual features into feature pairs that lead to generating multiple joint feature spaces, which encode text-video pairs into comparable representations. To learn these representations our proposed network architecture is trained by following a multiple space learning procedure. Moreover, at the retrieval stage, we introduce additional softmax operations for revising the inferred query-video similarities. Extensive experiments in several setups based on three large-scale datasets (IACC.3, V3C1, and MSR-VTT) lead to conclusions on how to best combine text-visual features and document the performance of the proposed network. Source code is made publicly available at:…
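The two ideas in the abstract, scoring a query against a video in several joint spaces and then revising the scores with a softmax, can be illustrated with a minimal NumPy sketch. Everything below is an assumption made for illustration and is not taken from the authors' code: the linear projections stand in for trained encoders, the per-space similarities are simply summed, and the revision step uses one common variant (weighting each score by a softmax taken over the queries). The function names, the `projections` dictionary, and the `temperature` parameter are hypothetical.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to (approximately) unit length along `axis`."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)


def multi_space_similarity(text_feats, video_feats, projections):
    """Sum cosine similarities over every (text feature, video feature) pair.

    text_feats:  list of arrays, each of shape (n_queries, d_text_i)
    video_feats: list of arrays, each of shape (n_videos, d_video_j)
    projections: dict mapping (i, j) to a pair of matrices (w_text, w_video)
                 that project both features into a shared joint space.
                 These are hypothetical stand-ins for learned encoders.
    """
    n_queries = text_feats[0].shape[0]
    n_videos = video_feats[0].shape[0]
    total = np.zeros((n_queries, n_videos))
    for i, t in enumerate(text_feats):
        for j, v in enumerate(video_feats):
            w_text, w_video = projections[(i, j)]
            t_emb = l2_normalize(t @ w_text)   # (n_queries, d_joint)
            v_emb = l2_normalize(v @ w_video)  # (n_videos, d_joint)
            total += t_emb @ v_emb.T           # cosine similarity in space (i, j)
    return total


def softmax(x, axis):
    """Numerically stable softmax along `axis`."""
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)


def revise_with_softmax(sim, temperature=0.1):
    """Reweight each query-video score by a softmax computed over the queries.

    Assumed variant: the paper's exact revision formula is not given in the
    abstract, so this only shows the general shape of such a step.
    """
    prior = softmax(sim / temperature, axis=0)  # each column sums to 1
    return sim * prior


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    text_feats = [rng.normal(size=(3, 8)), rng.normal(size=(3, 6))]
    video_feats = [rng.normal(size=(5, 10)), rng.normal(size=(5, 12))]
    projections = {
        (i, j): (rng.normal(size=(t.shape[1], 4)), rng.normal(size=(v.shape[1], 4)))
        for i, t in enumerate(text_feats)
        for j, v in enumerate(video_feats)
    }
    sim = multi_space_similarity(text_feats, video_feats, projections)
    revised = revise_with_softmax(sim)
    print(revised.argsort(axis=1)[:, ::-1])  # ranked video indices per query
```

With two textual and two visual features, the sketch builds four joint spaces, so each summed score lies in the range [-4, 4] before revision. In a trained system the random matrices would be replaced by learned encoders, one pair per joint space.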

Code & Models

Repositories

bmezaris/texttovideoretrieval-ttimesv (PyTorch, official)

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Video Analysis and Summarization

Methods: Softmax