GMMFormer: Gaussian-Mixture-Model Based Transformer for Efficient Partially Relevant Video Retrieval

Yuting Wang; Jinpeng Wang; Bin Chen; Ziyun Zeng; Shu-Tao Xia

arXiv:2310.05195 · cs.CV · January 4, 2024 · 1 citation

TL;DR

GMMFormer introduces an efficient, implicit clip modeling approach using Gaussian-Mixture-Models within a Transformer architecture for partially relevant video retrieval, improving semantic discrimination and reducing redundancy.

Contribution

The paper proposes GMMFormer, a Transformer-based model that represents video clips implicitly through Gaussian-Mixture-Model constraints and sharpens semantic differentiation between text queries with a query diverse loss.

Findings

01 Outperforms existing methods on three large-scale datasets.
02 Reduces storage overhead compared to scanning-based clip construction.
03 Achieves higher retrieval accuracy and efficiency.
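To make the storage argument concrete, the arithmetic below assumes, for illustration only, that a scanning-based method stores one vector for every contiguous span of N sampled frames, while an implicit method stores one vector per frame position. The frame count, the embedding dimension of 384, and float32 storage are hypothetical values, not figures reported by the paper, and the function names are illustrative.

```python
def scanning_clip_count(n_frames: int) -> int:
    """Number of contiguous spans [i, j] with 0 <= i <= j < n_frames."""
    return n_frames * (n_frames + 1) // 2


def implicit_clip_count(n_frames: int) -> int:
    """Hypothetical implicit scheme: one stored vector per frame position."""
    return n_frames


def storage_bytes(n_vectors: int, dim: int = 384, bytes_per_value: int = 4) -> int:
    """Bytes for n_vectors float32 embeddings of size dim (assumed values)."""
    return n_vectors * dim * bytes_per_value


n = 32  # hypothetical number of sampled frames per video
print(scanning_clip_count(n), implicit_clip_count(n))  # 528 32
print(scanning_clip_count(n) / implicit_clip_count(n))  # 16.5
```

Under these assumptions the stored clip vectors grow quadratically with N for exhaustive scanning but only linearly for the implicit scheme; the real gap depends on how many clip scales a given method actually keeps.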

Abstract

Given a text query, partially relevant video retrieval (PRVR) seeks to find untrimmed videos in a database that contain pertinent moments. For PRVR, clip modeling is essential to capture the partial relationship between texts and videos. Current PRVR methods adopt scanning-based clip construction to achieve explicit clip modeling, which is information-redundant and requires a large storage overhead. To solve the efficiency problem of PRVR methods, this paper proposes GMMFormer, a Gaussian-Mixture-Model based Transformer which models clip representations implicitly. During frame interactions, we incorporate Gaussian-Mixture-Model constraints to focus each frame on its adjacent frames instead of the whole video. The generated representations then contain multi-scale clip information, achieving implicit clip modeling. In addition, PRVR methods ignore semantic differences between text…
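The abstract describes Gaussian constraints that make each frame attend mainly to its neighbors, with the mixture supplying multi-scale clip information. The NumPy sketch below illustrates that idea under stated assumptions: the constraint is modeled as a log-Gaussian bias on attention logits over frame distance, the mixture is a plain average of attention outputs at several hypothetical widths `sigmas`, and one set of projection weights is shared across widths. The function names and sigma values are illustrative and are not taken from the official repository.

```python
import numpy as np


def gaussian_bias(n: int, sigma: float) -> np.ndarray:
    """Log of an unnormalized Gaussian kernel over frame distance |i - j|."""
    idx = np.arange(n)
    dist = idx[:, None] - idx[None, :]
    return -(dist ** 2) / (2.0 * sigma ** 2)


def softmax(x: np.ndarray, axis: int = -1) -> np.ndarray:
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def gaussian_attention(x, wq, wk, wv, sigma):
    """Single-head self-attention over frames, biased toward nearby frames."""
    q, k, v = x @ wq, x @ wk, x @ wv
    logits = q @ k.T / np.sqrt(q.shape[-1])
    # Adding the log-Gaussian bias before softmax is equivalent to multiplying
    # the exponentiated scores by the Gaussian kernel and renormalizing.
    attn = softmax(logits + gaussian_bias(len(x), sigma))
    return attn @ v, attn


def gmm_block(x, wq, wk, wv, sigmas=(1.0, 4.0, 16.0)):
    """Hypothetical mixture: average the outputs obtained at several widths."""
    outs = [gaussian_attention(x, wq, wk, wv, s)[0] for s in sigmas]
    return np.mean(outs, axis=0)
```

Small widths behave like short clips and large widths approach ordinary global attention, so averaging across widths mixes several clip scales into each frame's representation without enumerating clips explicitly.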

Code & Models

Repositories

huangmozhi9527/GMMFormer (PyTorch, official)

Videos

GMMFormer: Gaussian-Mixture-Model Based Transformer for Efficient Partially Relevant Video Retrieval (Underline)

Taxonomy

Topics: Multimodal Machine Learning Applications · Video Analysis and Summarization

Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Byte Pair Encoding · Softmax · Label Smoothing · Adam · Dropout · Absolute Position Encodings · Layer Normalization