Lightweight Attentional Feature Fusion: A New Baseline for Text-to-Video Retrieval

Fan Hu; Aozhu Chen; Ziyue Wang; Fangming Zhou; Jianfeng Dong; Xirong Li

arXiv:2112.01832·cs.MM·July 28, 2022·1 cites

TL;DR

This paper introduces LAFF, a lightweight, interpretable feature fusion method for text-to-video retrieval. LAFF fuses features at both early and late stages and at both the video and text ends, and it outperforms previous methods on five benchmark datasets.

Contribution

LAFF is a computationally efficient feature fusion framework that improves text-to-video retrieval by fusing features at both early and late stages and at both the video and text ends within a unified model.

Findings

01. LAFF outperforms previous methods on five benchmark datasets.

02. LAFF is interpretable, which can be used for feature selection.

03. LAFF establishes a new baseline for text-to-video retrieval.

Abstract

In this paper we revisit feature fusion, an old-fashioned topic, in the new context of text-to-video retrieval. Unlike previous research, which considers feature fusion at only one end, be it video or text, we aim for feature fusion at both ends within a unified framework. We hypothesize that optimizing the convex combination of the features is preferable to modeling their correlations with computationally heavy multi-head self-attention. We propose Lightweight Attentional Feature Fusion (LAFF). LAFF performs feature fusion at both early and late stages and at both video and text ends, making it a powerful method for exploiting diverse (off-the-shelf) features. The interpretability of LAFF can be used for feature selection. Extensive experiments on five public benchmark sets (MSR-VTT, MSVD, TGIF, VATEX and TRECVID AVS 2016-2020) justify LAFF as a new baseline for text-to-video…
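
The idea of fusing features through a convex combination can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes the features have already been projected to a common dimension, and the scoring vector `query` and the function name `convex_fuse` are hypothetical stand-ins for whatever learned parameters the model uses. The sketch only shows that softmax weights are non-negative and sum to one, so the fused vector is a convex combination of the inputs, and that the weights themselves can be read off per feature.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def convex_fuse(features, query):
    """Fuse k equal-length feature vectors by a convex combination.

    features: list of k vectors, each of length d (assumed already
              projected to a common space).
    query:    hypothetical d-dimensional scoring vector (an assumption
              of this sketch, standing in for learned parameters).
    Returns (fused_vector, weights).
    """
    # One scalar score per feature: dot product with the scoring vector.
    scores = [sum(q * x for q, x in zip(query, f)) for f in features]
    # Softmax gives non-negative weights that sum to one.
    weights = softmax(scores)
    dim = len(query)
    # Weighted sum of the features, computed per dimension.
    fused = [sum(w * f[j] for w, f in zip(weights, features)) for j in range(dim)]
    return fused, weights


# Three toy 2-d features, e.g. from different off-the-shelf extractors.
features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
query = [0.5, -0.5]
fused, weights = convex_fuse(features, query)
```

Because each weight is tied to exactly one input feature, inspecting `weights` indicates how much each feature contributes to the fused vector, which is the kind of reading that feature selection could build on.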

Code & Models

Repositories

ruc-aimc-lab/laff
PyTorch · Official

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Video Analysis and Summarization