Tagging before Alignment: Integrating Multi-Modal Tags for Video-Text Retrieval

Yizhen Chen; Jie Wang; Lijian Lin; Zhongang Qi; Jin Ma; Ying Shan

arXiv:2301.12644·cs.CV·January 31, 2023

TL;DR

This paper introduces the TABLE network, which explicitly integrates multi-modal tags for improved video-text retrieval, achieving state-of-the-art results across multiple benchmarks.

Contribution

The paper proposes a tagging-based approach: multi-modal tags are encoded alongside visual and text features in a tag-guided cross-modal encoder, and additional supervised tasks are used to strengthen video-text alignment.

Findings

01 Achieves state-of-the-art (SOTA) performance on MSR-VTT, MSVD, LSMDC, and DiDeMo.

02 Effectively leverages multi-modal tags for better retrieval accuracy.

03 Demonstrates the benefit of explicit multi-modal integration in video-text tasks.

Abstract

Vision-language alignment learning for video-text retrieval has attracted considerable attention in recent years. Most existing methods either transfer the knowledge of an image-text pretraining model to the video-text retrieval task without fully exploring the multi-modal information in videos, or simply fuse multi-modal features in a brute-force manner without explicit guidance. In this paper, we integrate multi-modal information explicitly by tagging, and use the tags as anchors for better video-text alignment. Various pretrained experts are used to extract information from multiple modalities, including object, person, motion, audio, etc. To take full advantage of this information, we propose the TABLE (TAgging Before aLignmEnt) network, which consists of a visual encoder, a tag encoder, a text encoder, and a tag-guiding cross-modal encoder for jointly encoding…
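To make the idea of tags acting as anchors concrete, the following is a minimal sketch, not the TABLE network itself. It assumes toy 3-dimensional embeddings and a hypothetical `fuse` step that simply averages a visual embedding with the mean of its tag embeddings; the paper instead uses learned encoders and a cross-modal encoder. The function names (`cosine`, `fuse`, `rank_videos`) and all values are illustrative assumptions.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def fuse(visual, tag_embeddings):
    """Hypothetical fusion (an assumption, not the paper's encoder):
    average the visual embedding with the mean tag embedding."""
    if not tag_embeddings:
        return list(visual)
    count = len(tag_embeddings)
    mean_tag = [sum(t[i] for t in tag_embeddings) / count
                for i in range(len(visual))]
    return [(v + m) / 2 for v, m in zip(visual, mean_tag)]


def rank_videos(text_embedding, videos):
    """Rank videos by similarity between the text query and the
    tag-fused video embedding, best match first."""
    scored = [(name, cosine(text_embedding, fuse(visual, tags)))
              for name, (visual, tags) in videos.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Two clips with identical visual embeddings but different tags (toy values).
videos = {
    "clip_a": ([1.0, 0.0, 0.0], [[0.0, 1.0, 0.0]]),
    "clip_b": ([1.0, 0.0, 0.0], [[0.0, 0.0, 1.0]]),
}
query = [1.0, 1.0, 0.0]
ranking = rank_videos(query, videos)
```

In this toy setup the two clips are indistinguishable from visual features alone; only the tag embeddings separate them, which is the intuition behind using tags as alignment anchors.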

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Video Analysis and Summarization